Friday, January 16, 2015

A tale of two forensic scientists

Curley and Arrow are forensic scientists who serve as expert witnesses in very similar court cases. Each receives a sum of money from a lawyer for one side in the case. Arrow will spend the money doing repeated tests until the money (which of course also pays for his time) runs out, and then he will present to the court the full test data that he found. Curley, on the other hand, will do tests until either the money runs out or he has reached an evidential level sufficient for the court to come to the decision that the lawyer who hired him wants. Of course, Curley isn't planning to commit perjury: he will truthfully report all the tests he actually did, and he hopes that the court won't ask him why he stopped when he did. Curley reasons that his method of proceeding has two advantages over Arrow's:

  1. if he stops the experiments early, his profits are higher and he has more time for waterskiing; and
  2. he is more likely to confirm the hypothesis that his lawyer wants confirmed, and hence he is more likely to get repeat business from this lawyer.

Now here is a surprising fact which is a consequence of the martingale property of Bayesian investigations (or of a generalization of van Fraassen's Reflection Principle). When Curley and Arrow reflect on what their final credences will be with respect to what they are each testing, if they are perfect Bayesian agents, their expectation for their future credence equals their current credence. This thought may lead Curley to think himself quite innocent in his procedures. After all, on average, he expects to end up with the same credence as he would if he followed Arrow's more onerous procedures.
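This martingale property can be checked numerically. Here is a minimal sketch (the two-hypothesis coin model, the bias values 0.7 and 0.5, and the 0.9 stopping threshold are my own illustrative assumptions, not anything from the post): whether an agent uses Arrow's fixed-length rule or Curley's stop-at-the-threshold rule, the average final credence comes out equal to the prior.

```python
import random

# Toy model: a coin is either biased (H1: P(heads) = 0.7) or fair
# (H0: P(heads) = 0.5), with prior credence 0.5 in each hypothesis.
def final_credence(rng, threshold=None, max_flips=100):
    p_true = 0.7 if rng.random() < 0.5 else 0.5  # truth drawn from the prior
    cred = 0.5                                   # current credence in H1
    for _ in range(max_flips):
        heads = rng.random() < p_true
        like1 = 0.7 if heads else 0.3            # likelihood of the flip under H1
        cred = cred * like1 / (cred * like1 + (1 - cred) * 0.5)
        if threshold is not None and cred >= threshold:
            break  # Curley quits as soon as the desired level is reached
    return cred

rng = random.Random(0)
n = 5000
arrow = sum(final_credence(rng) for _ in range(n)) / n                  # fixed length
curley = sum(final_credence(rng, threshold=0.9) for _ in range(n)) / n  # optional stopping

# Both averages land near the prior credence 0.5: the expectation of the
# future credence equals the current credence, whatever the stopping rule.
print(arrow, curley)
```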

So why do we think Curley crooked? It's because we do not just care about the expected values of credences. We care about whether credences reach particular thresholds. In the case at hand, we care about whether the credence reaches the threshold that correlates with a particular court decision. And Curley's method does increase the probability that that credence level will be reached.

What happens is that Curley, while favoring a particular conclusion, sacrifices the possibility of reaching evidence that confirms that conclusion to a degree significantly higher than his desired threshold, for the sake of increasing the probability of reaching the threshold. For when he stops his experiments once the level of confirmation has reached the desired threshold, he is giving up on the possibility—useless to him or to the side that hired him—that the level of confirmation will go up even higher.
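The threshold effect itself is easy to simulate. In this sketch (a toy setup of my own: a coin biased 0.7 under H1 and fair under H0, prior 1/2 each, and an assumed court threshold of 0.9 credence), Curley stops the moment his credence crosses the threshold, while Arrow simply reports wherever his credence ends up after a fixed run; Curley crosses the threshold in a larger fraction of cases.

```python
import random

# Toy model: H1 says P(heads) = 0.7, H0 says 0.5, prior 0.5 each,
# and the court decides for Curley's side once credence in H1 hits 0.9.
def hits_threshold(rng, stop_early, threshold=0.9, max_flips=50):
    p_true = 0.7 if rng.random() < 0.5 else 0.5
    cred = 0.5
    for _ in range(max_flips):
        like1 = 0.7 if rng.random() < p_true else 0.3
        cred = cred * like1 / (cred * like1 + (1 - cred) * 0.5)
        if stop_early and cred >= threshold:
            return True       # Curley: stop the moment the threshold is reached
    return cred >= threshold  # Arrow: only where the credence happens to end up

rng = random.Random(1)
n = 10000
curley = sum(hits_threshold(rng, True) for _ in range(n)) / n
arrow = sum(hits_threshold(rng, False) for _ in range(n)) / n
print(curley > arrow)  # the early stopper reaches the threshold more often
```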

I think it helps that in real life we don't know what the thresholds are. Real-life experts don't know just how much evidence is needed, and so there is some incentive to try to get a higher level of confirmation rather than to stop once one has reached a threshold. But of course in the above I stipulated that there was a clear and known threshold.


William said...

This effect is quite real.

In the biomedical sciences, we often indeed do have a well-defined threshold p-value (in a statistical analysis such as t-testing or ANOVA) of p < 0.05 or, occasionally, p < 0.01, to prove or disprove our chosen hypothesis, with a bias toward proving the hypothesis (since it means we were right in our theory).

So this scenario, unfortunately, happens really often, and thanks to the well-documented journal publication bias toward "significant" results, it leads to type 1 errors (false positives) in lots of studies whose results then fail to be subsequently confirmed.

Alexander R Pruss said...

Very interesting.

On reflection, the problem will arise whenever the reward structure makes the reward, as a function of the achieved evidential level, fail to be a submartingale.

When there is a sharp threshold of significance, the problem of researchers stopping too soon will be there.

I think (I am just using intuition--I haven't checked that the theorems say what I need them to say) one could avoid the stopping problem by having a reward structure that is a strictly convex function of the achieved evidential level. For instance, suppose that researchers were paid 1/(1-r) for a result at credence level r (roughly equivalent to 1/p for significance level p), or, more symmetrically, (r-1/2)^2, and were forced to make all their experimental data public.

Then I think it would always pay to do more experiments.
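A quick numerical check of this idea (my own sketch, using a toy two-hypothesis coin with biases 0.7 and 0.5, which are not from the comment): with the strictly convex reward f(r) = (r - 1/2)^2, a researcher whose current credence is 0.9 has a higher expected reward from running further experiments than from stopping now, just as Jensen's inequality applied to the credence martingale predicts.

```python
import random

def f(r):
    # The strictly convex reward suggested in the comment.
    return (r - 0.5) ** 2

# Toy model: H1 says P(heads) = 0.7, H0 says 0.5. A Bayesian whose current
# credence in H1 is cred0 treats the truth as H1 with probability cred0.
def expected_reward_after(cred0, extra_flips, rng, trials=20000):
    total = 0.0
    for _ in range(trials):
        p_true = 0.7 if rng.random() < cred0 else 0.5
        cred = cred0
        for _ in range(extra_flips):
            like1 = 0.7 if rng.random() < p_true else 0.3
            cred = cred * like1 / (cred * like1 + (1 - cred) * 0.5)
        total += f(cred)
    return total / trials

rng = random.Random(2)
stop_now = f(0.9)                                 # reward for quitting at credence 0.9
keep_going = expected_reward_after(0.9, 20, rng)  # expected reward after 20 more flips
print(keep_going > stop_now)  # convexity makes further experiments pay
```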

In fact, I suspect (this can't be hard to prove, but I'm tired and need to run to a meeting soon) that there will be the possibility of perverse rewards for stopping early if and only if the reward fails to be a convex function of the credence value.

And our usual reward functions for scientists are non-convex. They tend to have a bunch of thresholds (publication, tenure, promotion, etc.), and to be fairly constant between the thresholds.

Alexander R Pruss said...

Seems like you could test experimentally whether the scenario happens. If people are stopping experiments when they achieve "significance", then we would expect the significance values in papers to cluster just past the significance threshold. But if people are using a less biased stopping rule, say because they decide ahead of time which and how many tests they will do, or because they do experiments until they run out of funds, then we would expect a broader distribution, with a number of results published at higher significance levels.
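For what it's worth, the predicted clustering shows up in a simple simulation (a toy z-test setup of my own, not real publication data): under a true null, a researcher who adds one observation at a time and stops at the first p < 0.05 produces "significant" p-values that pile up just under 0.05, compared with a researcher who fixes the sample size in advance.

```python
import math, random

def p_value(mean, n):
    # Two-sided p-value for H0: mu = 0 with known sd = 1 (z-test).
    z = abs(mean) * math.sqrt(n)
    return math.erfc(z / math.sqrt(2))

def reported_p(rng, optional_stop, max_n=100, min_n=5):
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0, 1)  # the null hypothesis is actually true
        if optional_stop and n >= min_n and p_value(total / n, n) < 0.05:
            return p_value(total / n, n)  # stop at the first "significant" p
    p = p_value(total / max_n, max_n)
    return p if p < 0.05 else None  # only significant results get published

def frac_just_under_threshold(rng, optional_stop, trials=5000):
    sig = [p for p in (reported_p(rng, optional_stop) for _ in range(trials))
           if p is not None]
    return sum(0.04 < p < 0.05 for p in sig) / len(sig)

rng = random.Random(3)
hacked = frac_just_under_threshold(rng, True)   # optional stopping
honest = frac_just_under_threshold(rng, False)  # fixed sample size
print(hacked > honest)  # published p-values cluster just past the threshold
```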