Tuesday, January 31, 2023

Scoring rules and publication thresholds

One of the most problematic aspects of some science practice is a cut-off, say at 95%, for the evidence-based confidence needed for publication.

I just realized, with the help of a mention of p-based biases and improper scoring rules somewhere on the web, that what is going on here is precisely a problem of a reward structure that does not result in a proper scoring rule, where a proper scoring rule is one where your current probability assignment is guaranteed to have an optimal expected score according to that very probability assignment. Given an improper scoring rule, one has a perverse incentive to change one’s probabilities without evidence.
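To make the definition concrete, here is a minimal Python sketch comparing a standard proper rule (the Brier, or quadratic, score) with a threshold-style reward. The payoff numbers (`REWARD`, `PENALTY`), the 0.95 cut-off and the helper names are illustrative assumptions, not anything measured. The function `best_report` searches a grid for the reported probability that maximizes expected score by the lights of one's actual probability `q`.

```python
def expected_score(q, r, score_true, score_false):
    """Expected score of reporting r when one's actual probability is q."""
    return q * score_true(r) + (1 - q) * score_false(r)

# Brier (quadratic) score written as a reward: higher is better.
def brier_true(r):
    return -(1 - r) ** 2

def brier_false(r):
    return -(r ** 2)

# Threshold reward with hypothetical payoffs: nothing below the cut-off,
# a big reward if right and a small penalty if wrong at or above it.
THRESHOLD = 0.95
REWARD, PENALTY = 10.0, 1.0

def thresh_true(r):
    return REWARD if r >= THRESHOLD else 0.0

def thresh_false(r):
    return -PENALTY if r >= THRESHOLD else 0.0

def best_report(q, score_true, score_false, steps=100):
    """Grid-search the report r in [0, 1] that maximizes expected score."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda r: expected_score(q, r, score_true, score_false))

if __name__ == "__main__":
    q = 0.7  # actual evidence-based probability
    print("Brier best report:    ", best_report(q, brier_true, brier_false))
    print("Threshold best report:", best_report(q, thresh_true, thresh_false))
```

Under the Brier score the best report coincides with the actual probability. Under the threshold reward with these assumed payoffs, someone whose actual probability is 0.7 does best in expectation by reporting a probability at or above the cut-off, which is the perverse incentive in question.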

To a first approximation, the problem is really, really bad. Insofar as publication is the relevant reward, it is a reward independent of the truth of the matter! In other words, the scoring rule has a reward for gaining probability 0.95 (say) in the hypothesis, regardless of whether the hypothesis is true or false.

Fortunately, it’s not quite so bad. Publication is the short-term reward. But there are long-term rewards and punishments. If one publishes, and later it turns out that one was right, one may get significant social recognition as the discoverer of the truth of the hypothesis. And if one publishes, and later it turns out one is wrong, one gets some negative reputation.

However, notice this. Fame for having been right is basically independent of the exact probability of the hypothesis one established in the original paper. As long as the probability was sufficient for publication, one is rewarded with fame. Thus one’s long-term reward is fame if and only if one’s probability met the threshold for publication and one was right. And one’s penalty is some negative reputation if and only if one’s probability met the threshold for publication and yet one was wrong. But note that scientists are actually extremely forgiving of people putting forward evidenced hypotheses that turn out to be false. Unlike in history, where some people live on in infamy, scientists who turn out to be wrong do not suffer infamy. At worst, they suffer some condescension. And it barely varies with one’s level of confidence.

The long-term reward structure is approximately this:

  • If your probability is insufficient for publication, nothing.

  • If your probability meets the threshold for publication and you’re right, big positive.

  • If your probability meets the threshold for publication and you’re wrong, at worst small negative.

This is not a proper scoring rule. It’s not even close. To make it into a proper scoring rule, the penalty for being wrong at the threshold would need to be way higher than the reward for being right. Specifically, if the threshold is p (say 0.95), then the ratio of reward to penalty needs to be (1−p) : p. If p = 0.95, the reward to penalty ratio would need to be 1:19. If p = 0.99, it would need to be a staggering 1:99, and if p = 0.9, it would need to be a still large 1:9. We are very, very far from that. And when we add the truth-independent reward for publication, things become even worse.
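Here is one way to see where the (1−p) : p ratio comes from. The symbols q, R and P are introduced here for illustration: q is your actual probability, R > 0 is the reward for publishing and being right, and P > 0 is the penalty for publishing and being wrong. Not publishing pays zero.

```latex
\begin{align*}
  \mathbb{E}[\text{publish}] &= qR - (1-q)P, \\
  \mathbb{E}[\text{publish}] \ge 0
    &\iff q \ge \frac{P}{R+P}.
\end{align*}
For the incentive to publish to switch on exactly at the threshold $p$, we need
\begin{align*}
  \frac{P}{R+P} = p
    \iff \frac{R}{P} = \frac{1-p}{p},
\end{align*}
so that $p = 0.95$ gives $R : P = 0.05 : 0.95 = 1 : 19$.
```

If R/P is larger than (1−p)/p, then publishing has positive expected payoff even for some q below p, so one has an incentive to get one's probability over the line without evidence.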

We can see that something is problematic if we think about cases like this. Suppose your current level of confidence is just slightly above the threshold, and a graduate student in your lab proposes to do one last experiment in her spare time, using equipment and supplies that would otherwise go to waste. Given the reward structure, it will likely make sense for you to refuse this free offer of additional information. If the experiment favors your hypothesis, you get nothing out of it: you could have published without it, and you’d still have the same longer term rewards available. But if the experiment disfavors your hypothesis, it will likely make your paper unpublishable (since you were at the threshold), and since it’s just one experiment, it is unlikely to put you in a position to publish a paper against the hypothesis. At best it spares you the risk of the small negative reputation for having been wrong, and since that’s a small penalty, and an unlikely one (since most likely your hypothesis is true by your data), that’s not worth it. In other words, the structure rewards you for ignoring free information.
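The graduate-student case can be put in numbers. The following Python sketch uses made-up values throughout: a prior of 0.96, a 0.95 threshold, a reward of 10 and penalty of 1, and an experiment that comes out positive with probability 0.8 if the hypothesis is true and 0.2 if it is false. It ignores the truth-independent reward for publication. It compares the expected change in payoff from running the free experiment under the threshold reward and under the Brier score (a proper rule).

```python
def posterior(prior, like_h, like_not_h):
    """Bayes' rule for a binary hypothesis, given the outcome's likelihoods."""
    num = prior * like_h
    return num / (num + (1 - prior) * like_not_h)

def threshold_payoff(q, threshold=0.95, reward=10.0, penalty=1.0):
    """Expected long-term payoff at probability q: publish iff q >= threshold."""
    if q < threshold:
        return 0.0
    return q * reward - (1 - q) * penalty

def brier_payoff(q):
    """Expected Brier reward when honestly reporting one's probability q."""
    return -q * (1 - q)

def value_of_experiment(prior, payoff, hit_h=0.8, hit_not_h=0.2):
    """Expected payoff with the experiment minus expected payoff without it."""
    p_pos = prior * hit_h + (1 - prior) * hit_not_h
    post_pos = posterior(prior, hit_h, hit_not_h)
    post_neg = posterior(prior, 1 - hit_h, 1 - hit_not_h)
    with_exp = p_pos * payoff(post_pos) + (1 - p_pos) * payoff(post_neg)
    return with_exp - payoff(prior)

if __name__ == "__main__":
    print("Threshold reward:", value_of_experiment(0.96, threshold_payoff))
    print("Brier score:     ", value_of_experiment(0.96, brier_payoff))
```

With these assumed numbers the free experiment has negative expected value under the threshold reward, because a disfavoring result drops the posterior below 0.95 and forfeits publication, while under the Brier score it has positive expected value.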

How can we fix this? We simply cannot realistically fix it if we have a high probability threshold for publication. The only way to fix it while keeping a high probability threshold would be by having a ridiculously high penalty for being wrong. But we shouldn’t do stuff like sentencing scientists to jail for being wrong (which has happened). Increasing the probability threshold for publication would only require an even larger penalty for being wrong. Decreasing probability thresholds for publication helps a little. But as long as there is a larger reputational benefit from getting things right than the reputational harm from getting things wrong, we are going to have perverse incentives from a probability threshold for publication bigger than 1/2, no matter where that threshold lies. (This follows from Fact 2 in my recent post, together with the observation that Schervish’s characterization of scoring rules implies that any reward function corresponds to a penalty function that is unique up to an additive constant.)

What’s the solution? Maybe it’s this: reward people for publishing lots of data, rather than for the data showing anything interesting, and do so sufficiently that it’s always worth publishing more data?


Andrew Dabrowski said...

Interesting. But it should be noted the situation is more complicated. For example, there are professional penalties for being caught p-hacking.

Alexander R Pruss said...

Yeah, but note that the situation I discuss -- refusing more experiments once you reach the threshold -- is not exactly p-hacking. First, I don't know that actual p-hacking really yields an evidence-based probability of 0.95. What I am worried about happens even if you get a genuine bona fide evidence-based probability of 0.95.

Second, the particular abuse that I mention probably doesn't fall under p-hacking, as one might think it's sound experimental design to set up a stopping point in advance, and then stop. But given a strictly proper scoring rule, it's always worth getting more data.

IanS said...

The 95% threshold doesn’t usually relate to a Bayesian posterior probability.

In the usual (‘classical’, ‘hypothesis testing’, ‘significance test’) approach, to reject the null hypothesis at the 95% confidence level means this: if the null hypothesis (of zero effect) were true, the chance of a result as extreme as that observed would be less than 0.05. This, as Bayesian statisticians routinely complain, says nothing directly about any posterior probability. I think significance tests are best seen as a sort of noise filter: ‘The effect we saw was above the typical noise level, so it’s worth looking at’.

Andrew Dabrowski said...

I didn't realize "p-hacking" is defined so precisely. Anyway it was just an example I gave; researchers are, at least in principle, held to high standards.

Alexander R Pruss said...


Right. So one way of thinking about my post is to note that simply switching to Bayesianism, and tweaking the threshold, will not solve the problems.