Showing posts with label scoring rules. Show all posts

Thursday, February 5, 2026

More on strong open-mindedness

For the last couple of days I have been exploring what I like to call strongly open-minded accuracy scoring rules. It’s well known that every proper scoring rule is open-minded in the sense that it never requires you to reject free information: the expected epistemic utility of updating on the free information is always at least as good as your current expected epistemic utility. It’s strictly open-minded provided that in non-trivial cases (i.e., when the information has a non-zero probability of having statistical relevance to the credences you are scoring) you are required to accept the free information.
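Ordinary open-mindedness is easy to check numerically. Here is a minimal Python sketch with the Brier rule; the prior and likelihoods are made up for illustration:

```python
# Expected Brier accuracy before vs. after updating on free evidence E.
# Accuracy version of the Brier rule: T(x) = -(1-x)^2, F(x) = -x^2.
def expected_score(p):
    """Expected accuracy of credence p, by the lights of credence p."""
    return p * -(1 - p) ** 2 + (1 - p) * -p ** 2   # simplifies to -p*(1-p)

prior = 0.6                               # credence in H (illustrative)
p_E_given_H, p_E_given_notH = 0.8, 0.3    # illustrative likelihoods

p_E = prior * p_E_given_H + (1 - prior) * p_E_given_notH
post_E = prior * p_E_given_H / p_E                    # P(H | E) = 0.8
post_notE = prior * (1 - p_E_given_H) / (1 - p_E)     # P(H | not-E) = 0.3

before = expected_score(prior)
after = p_E * expected_score(post_E) + (1 - p_E) * expected_score(post_notE)
print(before, after)   # updating has at least as good an expected score
```

The expected post-update score (−0.18) beats the no-update score (−0.24), as propriety guarantees.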

Now there are two reasons why one might accept free information about some proposition q. First, you might be wrong about q: your credence may be high while q is false or your credence might be low while q is true. Second, even if you are right about q, the free information may boost your credence in the right direction. I say that a scoring rule is strongly open-minded provided that it licenses you to accept and update on the free information even if you disregard the first consideration. We can then tack on “strictly” if it requires you to do so in non-trivial cases. In the case of a strongly open-minded scoring rule, your acceptance of free information is not a sign of doubt in your propositions—it is not a way of hedging your bets—and thus is arguably compatible with faith in the propositions being evaluated.

A strongly open-minded scoring rule can also be characterized in the following way. There is a more ordinary kind of epistemic paternalism where I might have reason to block another from receiving free information on the grounds that this information could mislead them, because they have different likelihoods from the ones I think are right. For instance, if too many people have an unjustified mistrust of Dr. Smith, such that they are likely to believe the opposite of what Dr. Smith’s experiments reveal, there is reason to give a grant to someone else, because Dr. Smith’s experiments are likely to lead people away from the truth, through no fault of Dr. Smith’s. Call this likelihood-based paternalism. But there is another kind of motivation for refusing free information on another’s behalf, which we might call pure-risk-based paternalism. Even if someone else has the same likelihoods as you do—trusts Dr. Smith just as you do—perhaps the risk that Dr. Smith’s experiments will, by pure chance, provide evidence away from the truth is enough to justify not funding these experiments.

I’ve been collecting results about these issues. Here’s what I seem to have so far, though I have to emphasize that sometimes the proofs are just in my head and I might be wrong. I will specialize to scoring rules for a single proposition, given as a pair of functions T and F, where T(x) is the value of having credence x when the proposition is true and F(x) is the value of having credence x when the proposition is false.

  1. A scoring rule sometimes calls for pure-risk-based paternalism if and only if it is not strongly open-minded.

  2. A scoring rule that’s strongly open-minded is open-minded.

  3. A scoring rule (T,F) is (strictly) strongly open-minded if and only if xT(x) and (1−x)F(x) are both (strictly) convex.

  4. The logarithmic scoring rule is strictly strongly open-minded. The Brier and spherical rules are not strongly open-minded.

  5. If a proper scoring rule is generated by the Schervish-style integral representation T(x) = T(1/2) + ∫_{1/2}^x (1−t)b(t) dt and F(x) = F(1/2) + ∫_x^{1/2} t·b(t) dt, and b is sufficiently differentiable, then the scoring rule is strongly open-minded if and only if the derivative of log b(x) lies between (3x−2)/[x(1−x)] and (3x−1)/[x(1−x)].

  6. A strongly open-minded scoring rule whose logarithm is sufficiently differentiable is unbounded.

  7. If your credence in a hypothesis H is at least (at most) 1/2, then a proper scoring rule will not call for purely-risk-based epistemic paternalism with respect to someone whose credence is equal to or higher (lower) than yours.

  8. If your credence in a hypothesis H is 1/2, then no proper scoring rule calls for purely-risk-based epistemic paternalism for that hypothesis.

  9. For any credences p and r such that 1/2 < r and p < r, there is a strictly proper scoring rule and a situation where the scoring rule calls for the individual with credence r to exercise pure-risk-based epistemic paternalism toward an individual with credence p with respect to that hypothesis.
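Part of result 4 can be probed numerically against the convexity criterion in result 3. The sketch below tests, via second differences, whether x·T(x) is convex for the logarithmic rule (T(x) = log x) and for the Brier rule (T(x) = −(1−x)²); it should pass for the former and fail for the latter:

```python
import math

def is_convex(f, lo=0.01, hi=0.99, steps=200):
    """Crude convexity test for f on [lo, hi] via second differences."""
    h = (hi - lo) / steps
    xs = [lo + i * h for i in range(1, steps)]
    return all(f(x - h) - 2 * f(x) + f(x + h) >= -1e-12 for x in xs)

xT_log = lambda x: x * math.log(x)          # log rule: T(x) = log x
xT_brier = lambda x: x * -(1 - x) ** 2      # Brier rule: T(x) = -(1-x)^2

print(is_convex(xT_log))    # x log x is convex everywhere on (0,1)
print(is_convex(xT_brier))  # fails: second derivative 4 - 6x < 0 for x > 2/3
```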

Monday, September 29, 2025

Lying and epistemic utility

Epistemic utility is the value of one’s beliefs or credences matching the truth.

Suppose your and my credences differ. Then I am going to think that my credences better match the truth. This is automatic if I am measuring epistemic utilities using a proper scoring rule. But that means that benevolence with respect to epistemic utilities gives me a reason to shift your credences to be closer to mine.

At this point, there are honest and dishonest ways to proceed. The honest way is to share all my relevant evidence with you. Suppose I have done that. And you’ve reciprocated. And we still differ in credences. If we’re rational Bayesian agents, that’s presumably due to a difference in prior probabilities. What can I do, then, if the honest ways are exhausted?

I can lie! Suppose your credence that there was once life on Mars is 0.4 and mine is 0.5. So I tell you that I read that a recent experiment provided a little bit of evidence in favor of there once having been life on Mars, even though I read no such thing. That boosts your credence that there was once life on Mars. (Granted, it also boosts your credence in the falsehood that there was such a recent experiment. But, plausibly, getting right whether there was once life on Mars gets much more weight in a reasonable person’s epistemic utilities than getting right what recent experiments have found.)

We often think of lying as an offense against truth. But in these kinds of cases, the lies are aimed precisely at moving the other towards truth. And they’re still wrong.

Thus, it seems that striving to maximize others’ epistemic utility is the wrong way to think of our shared epistemic life.

Maximizing others’ epistemic utility seems to lead to a really bad picture of our shared epistemic life. Should we, then, think of striving to maximize our own epistemic utility as the right approach to one’s individual epistemic life? Perhaps. For maybe what is apt to go wrong in maximizing others’ epistemic utility is paternalism, and paternalism is rarely a problem in one’s own case.

Thursday, September 11, 2025

Why do we like being confident?

We like being more confident. We enjoy having credences closer to 0 or 1. Even if the proposition we are confident in is one whose truth would be a bad thing, the confidence itself, abstracted from the badness of the state of affairs reported by the proposition, is something we enjoy.

Here is a potential justification of this attitude in many cases. We can think of the epistemic utility of one’s credence r in a proposition p as measured by an accuracy scoring rule given by two functions T(r) and F(r), where T(r) gives the value of having credence r in p when p is actually true and F(r) gives the value when p is actually false. Most people thinking about scoring rules think they should satisfy the technical condition of being strictly proper. But strict propriety implies that the function V(r) = rT(r) + (1−r)F(r) is strictly convex. Now suppose the scoring rule is also symmetric, so that T(r) = F(1−r). Then V(r) is a strictly convex function that is symmetric about r = 1/2. Such a function has its minimum at r = 1/2, and is strictly decreasing on [0,1/2] and strictly increasing on [1/2,1]. But the function V(r) measures your expectation of your epistemic utility. How happy you are about your credence, perhaps, corresponds to your expectation of your epistemic utility. So you are most unhappy at credence 1/2, and you get happier the closer you are to 0 or 1.
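For the symmetric Brier rule (T(r) = −(1−r)², F(r) = −r²), V(r) simplifies to −r(1−r), which makes the shape explicit; a quick sketch:

```python
# V(r) = r*T(r) + (1-r)*F(r): your expectation of your own epistemic utility.
def V(r):
    T = -(1 - r) ** 2   # accuracy score if the proposition is true
    F = -r ** 2         # accuracy score if it is false
    return r * T + (1 - r) * F   # simplifies to -r*(1-r) for the Brier rule

rs = [i / 100 for i in range(101)]
vals = [V(r) for r in rs]
assert min(vals) == V(0.5)                       # worst off at r = 1/2
assert all(V(a) >= V(b) for a, b in zip(rs, rs[1:]) if b <= 0.5)  # decreasing on [0, 1/2]
assert all(V(b) >= V(a) for a, b in zip(rs, rs[1:]) if a >= 0.5)  # increasing on [1/2, 1]
```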

OK, it’s surely not that?!

Monday, September 8, 2025

Epistemic utilities and decision theories

Warning: I worry there may be something wrong in the reasoning below.

Causal Decision Theory (CDT) and Evidential Decision Theory (EDT) tend to disagree when the payoff of an option statistically depends on your propensity to go for that option. The most famous example of this phenomenon is Newcomb’s Problem (where money is literally put into a box or not depending on what your propensities are), and there is a large literature of other clever and mind-twisting examples. From the literature, one might get a feeling that these cases are all somehow weird, and that normally there is no such dependence.

But here is a family of cases that happens literally almost all the time to us. Pretty much whenever we act we gain information relevant to facts about ourselves, and specifically to facts about our propensities to act. For instance, when you choose chocolate over vanilla ice cream you raise your credence for the hypothesis that you have a greater propensity to choose chocolate ice cream than to choose vanilla ice cream. But truth about oneself is valuable and falsehood about oneself is disvaluable. If in fact you have a greater propensity to choose chocolate ice cream, then by eating chocolate ice cream you gain credence in a truth, which is a good thing. If in fact your propensity for vanilla ice cream is at least as great as for chocolate ice cream, then by eating chocolate ice cream, you gain credence in a falsehood. The payoffs of your decision as to flavor of ice cream thus statistically depend on what your propensities actually are, and so this is exactly the kind of case where we would expect CDT and EDT to disagree.

Let’s be more precise. You have a choice between eating chocolate ice cream (C), eating vanilla ice cream (V) or not eating ice cream at all (N). Let H be the hypothesis that you have a greater propensity for eating chocolate ice cream than for eating vanilla ice cream. Then if you choose C, you will gain evidence for H. If you choose V, you will gain evidence for not-H. And if you choose N, you will (plausibly) gain no evidence for or against H. Your epistemic utility with respect to H is, let us suppose, measured by a single-proposition accuracy scoring rule, which we can think of as a pair of functions TH and FH, where TH(p) is the value of having credence p in H if in fact H is true and FH(p) is the value of having credence p in H if in fact H is false.

The expected evidential utilities of your three options are:

  • Ee(C) = P(H|C)TH(P(H|C)) + (1−P(H|C))FH(P(H|C))

  • Ee(V) = P(H|V)TH(P(H|V)) + (1−P(H|V))FH(P(H|V))

  • Ee(N) = P(H|N)TH(P(H|N)) + (1−P(H|N))FH(P(H|N)) = P(H)TH(P(H)) + (1−P(H))FH(P(H)).

The expected causal utilities are:

  • Ec(C) = P(H)TH(P(H|C)) + (1−P(H))FH(P(H|C))

  • Ec(V) = P(H)TH(P(H|V)) + (1−P(H))FH(P(H|V))

  • Ec(N) = P(H)TH(P(H|N)) + (1−P(H))FH(P(H|N)) = P(H)TH(P(H)) + (1−P(H))FH(P(H)).

We can make some quick observations in the case where the scoring rule is strictly proper, given that P(H|V) < P(H) < P(H|C):

  1. Ec(C) < Ec(N)

  2. Ec(V) < Ec(N)

  3. At least one of Ee(C) > Ee(N) and Ee(V) > Ee(N) is true.

Observations 1 and 2 follow immediately from strict propriety and the formulas for Ec. Observation 3 follows from the fact that the expected accuracy score after Bayesian update on evidence is better (in non-trivial cases where the scoring rule is strictly proper) than before update, and the expected accuracy score after update on what you’ve chosen is:

  • P(C)Ee(C) + P(V)Ee(V) + P(N)Ee(N)

while the expected accuracy score before update is equal to Ee(N). Since P(C) + P(V) + P(N) = 1, it follows from the superiority of the post-update expectation that at least one of Ee(C) and Ee(V) must be bigger than Ee(N).
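Here is a numeric sketch of observations 1-3 with the Brier rule; the probabilities (P(H) = 0.5, P(H|C) = 0.7, P(H|V) = 0.3) are made up for illustration:

```python
def score(cred, p_true):
    """Expected Brier accuracy of credence `cred` in H when the
    probability of H is p_true: p_true*T(cred) + (1-p_true)*F(cred)."""
    return p_true * -(1 - cred) ** 2 + (1 - p_true) * -cred ** 2

pH, pHC, pHV = 0.5, 0.7, 0.3   # illustrative; P(H|N) = P(H)

# Causal: the probability of H stays P(H); only your credence moves.
Ec = {opt: score(c, pH) for opt, c in [("C", pHC), ("V", pHV), ("N", pH)]}
# Evidential: the probability of H moves along with your credence.
Ee = {opt: score(c, c) for opt, c in [("C", pHC), ("V", pHV), ("N", pH)]}

assert Ec["C"] < Ec["N"] and Ec["V"] < Ec["N"]   # observations 1 and 2
assert Ee["C"] > Ee["N"] or Ee["V"] > Ee["N"]    # observation 3
```

With these symmetric numbers both Ee(C) and Ee(V) beat Ee(N), while both causal options lose to N, exactly the CDT/EDT split described below.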

The above results seem to be a black eye for CDT, which recommends that if what you care about is your epistemic utility with regard to your propensities regarding chocolate and vanilla ice cream, then you should always avoid eating ice cream!

(What about ratifiability? Some CDTers say that only ratifiable options should count. Is N ratifiable? Given that you’ve learned nothing about H from choosing N, I think N should be ratifiable. But I may be missing something. I find the epistemic utility case confusing.)

It also seems to me (I haven’t checked details) that on EDT there are cases where eating either flavor is good for you epistemically, but there are also cases where only one specific flavor is good for you.

Tuesday, June 3, 2025

Combining epistemic utilities

Suppose that the right way to combine epistemic utilities or scores across individuals is averaging, and I am an epistemic act expected-utility utilitarian—I act for the sake of expected overall epistemic utility. Now suppose I am considering two different hypotheses:

  • Many: There are many epistemic agents (e.g., because I live in a multiverse).

  • Few: There are few epistemic agents (e.g., because I live in a relatively small universe).

If Many is true, given averaging my credence makes very little difference to overall epistemic utility. On Few, my credence makes much more of a difference to overall epistemic utility. So I should have a high credence for Few. For while a high credence for Few will have an unfortunate impact on overall epistemic utility if Many is true, because the impact of my credence on overall epistemic utility will be small on Many, I can largely ignore the Many hypothesis.

In other words, given epistemic act utilitarianism and averaging as a way of combining epistemic utilities, we get a strong epistemic preference for hypotheses with fewer agents. (One can make this precise with strictly proper scoring rules.) This is weird, and does not match any of the standard methods (self-sampling, self-indication, etc.) for accounting for self-locating evidence.

(I should note that I once thought I had a serious objection to the above argument, but I can't remember what it was.)

Here’s another argument against averaging epistemic utilities. It is a live hypothesis that there are infinitely many people. But on averaging, my epistemic utility makes no difference to overall epistemic utility. So I might as well believe anything on that hypothesis.

One might toy with another option. Instead of averaging epistemic utilities, we could average credences across agents, and then calculate the overall epistemic utility by applying a proper scoring rule to the average credence. This has a different problematic result. Given that there are at least billions of agents, for any of the standard scoring rules, as long as the average credence of agents other than you is neither very near zero nor very near one, your own credence’s contribution to overall score will be approximately linear. But it’s not hard to see that then to maximize expected overall epistemic utility, you will typically make your credence extreme, which isn’t right.

If not averaging, then what? Summing is the main alternative.

Tuesday, February 25, 2025

Being known

The obvious analysis of “p is known” is:

  1. There is someone who knows p.

But this obvious analysis doesn’t seem correct, or at least there is an interesting use of “is known” that doesn’t fit (1). Imagine a mathematics paper that says: “The necessary and sufficient conditions for q are known (Smith, 1967).” But what if the conditions are long and complicated, so that no one can keep them all in mind? What if no one who read Smith’s 1967 paper remembers all the conditions? Then no one knows the conditions, even though it is still true that the conditions “are known”.

Thus, (1) is not necessary for a proposition to be known. Nor is this a rare case. I expect that more than half of the mathematics articles from half a century ago contain some theorem or at least lemma that is known but which no one knows any more.

I suspect that (1) is not sufficient either. Suppose Alice is dying of thirst on a desert island. Someone, namely Alice, knows that she is dying of thirst, but it doesn’t seem right to say that it is known that she is dying of thirst.

So if it is neither necessary nor sufficient for p to be known that someone knows p, what does it mean to say that p is known? Roughly, I think, it has something to do with accessibility. Very roughly:

  2. Somebody has known p, and the knowledge is accessible to anyone who has appropriate skill and time.

It’s really hard to specify the appropriateness condition, however.

Does all this matter?

I suspect so. There is a value to something being known. When we talk of scientists advancing “human knowledge”, it is something like this “being known” that we are talking about.

Imagine that a scientist discovers p. She presents p at a conference where 20 experts learn p from her. Then she publishes it in a journal, where 100 more people learn it. Then a Youtuber picks it up and now a million people know it.

If we understand the value of knowledge as something like the sum of epistemic utilities across humankind, then the successive increments in value go like this: first, we have a move from zero to some positive value V when the scientist discovers p. Then at the conference, the value jumps from V to 21V. Then after publication it goes from 21V to 121V. Then given Youtube, it goes from 121V to 1000121V. The jump at initial discovery is by far the smallest, and the biggest leap is when the discovery is publicized. This strikes me as wrong. The big leap in value is when p becomes known, which either happens when the scientist discovers it or when it is presented at the conference. The rest is valuable, but not so big in terms of the value of “human knowledge”.

Monday, February 24, 2025

More on averaging to combine epistemic utilities

Suppose that the right way to combine epistemic utilities across people is averaging: the overall epistemic utility of the human race is the average of the individual epistemic utilities. Suppose, further, that each individual epistemic utility is strictly proper, and you’re a “humanitarian” agent who wants to optimize overall epistemic utility.

Suppose you’re now thinking about two hypotheses about how many people exist: the two possible numbers are m and n, which are not equal. All things considered, you have credence 0 < p0 < 1 in the hypothesis Hm that there are m people and 1 − p0 in the hypothesis Hn that there are n people. You now want to optimize overall epistemic utility. On the averaging view, if Hm is true and your credence is p1, your contribution to overall epistemic utility will be:

  • (1/m)T(p1)

and if Hm is false, your contribution will be:

  • (1/n)F(p1),

where your strictly proper scoring rule is given by T, F. Since your credence is p0, by your lights the expected value after changing your credence to p1 will be:

  • p0(1/m)T(p1) + (1−p0)(1/n)F(p1) + Q

where Q is the contribution of other people’s credences, which I assume you do not affect with your choice of p1. If m ≠ n and T, F is strictly proper, the expected value will be maximized at

  • p1 = (p0/m)/(p0/m+(1−p0)/n) = np0/(np0+m(1−p0)).

If m > n, then p1 < p0 and if m < n, then p1 > p0. In other words, as long as n ≠ m, if you’re an epistemic humanitarian aiming to improve overall epistemic utility, any credence strictly between 0 and 1 will be unstable: you will need to change it. And indeed your credence will converge to 0 if m > n and to 1 if m < n. This is absurd.
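The instability can be seen by iterating the map p ↦ np/(np + m(1−p)); with illustrative m = 3 and n = 2, any interior starting credence is driven toward 0 (a minimal sketch; the numbers are made up):

```python
def best_response(p, m, n):
    """Credence maximizing expected average epistemic utility, given
    current credence p in the hypothesis that there are m people."""
    return n * p / (n * p + m * (1 - p))

p, m, n = 0.5, 3, 2    # illustrative; m > n
for _ in range(50):
    p = best_response(p, m, n)
print(p)   # the odds shrink by a factor n/m each step, so p -> 0
```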

I conclude that we shouldn’t combine epistemic utilities across people by averaging the utilities.

Idea: What about combining them by computing the epistemic utilities of the average credences, and then applying a strictly proper scoring rule, in effect imagining that humanity is one big committee and that a committee’s credence is the average of the individual credences?

This is even worse, because it leads to problems even without considering hypotheses on which the number of people varies. Suppose that you’ve just counted some large number nobody cares about, such as the number of cars crossing some intersection in New York City during a specific day. The number you got is even, but because the number is big, you might well have made a mistake, and so your credence that the number is even is still fairly low, say 0.7. The billions of other people on earth all have credence 0.5, and because nobody cares about your count, you won’t be able to inform them of your “study”, and their credences won’t change.

If combined epistemic utility is given by applying a proper scoring rule to the average credence, then by your lights the expected value of the combined epistemic utility will increase the bigger you can budge the average credence, as long as you don’t get it above your credence. Since you can really only affect your own credence, as an epistemic humanitarian your best bet is to set your credence to 1, thereby increasing overall human credence from 0.5 to around 0.5000000001, and making a tiny improvement in the expected value of the combined epistemic utility of humankind. In doing so, you sacrifice your own epistemic good for the epistemic good of the whole. This is absurd!
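A sketch of the car-count case, applying the Brier rule to the average credence, with illustrative numbers (2 billion agents, your credence 0.7): among candidate credences you might adopt, 1 maximizes your expectation of the combined score:

```python
M = 2_000_000_000   # number of agents (illustrative)
my_cred = 0.7       # your credence that the count is even

def expected_combined(q):
    """Your expectation (by the lights of credence 0.7) of the Brier
    score of the average credence, if you set your own credence to q."""
    avg = ((M - 1) * 0.5 + q) / M   # everyone else sits at 0.5
    return my_cred * -(1 - avg) ** 2 + (1 - my_cred) * -avg ** 2

# The expected combined score keeps improving all the way up to q = 1.
assert expected_combined(1.0) > expected_combined(my_cred) > expected_combined(0.5)
```

The improvements are tiny (on the order of 10⁻¹¹) but strictly positive, which is all the epistemic humanitarian needs.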

I think the idea of averaging to produce overall epistemic utilities is just wrong.

Friday, February 21, 2025

Adding or averaging epistemic utilities?

Suppose for simplicity that everyone is a good Bayesian and has the same priors for a hypothesis H, and also the same epistemic interests with respect to H. I now observe some evidence E relevant to H. My credence now diverges from everyone else’s, because I have new evidence. Suppose I could share this evidence with everyone. It seems obvious that if epistemic considerations are the only ones, I should share the evidence. (If the priors are not equal, then considerations in my previous post might lead me to withhold information, if I am willing to embrace epistemic paternalism.)

Besides the obvious value of revealing the truth, here are two ways to reason for this highly intuitive conclusion.

First, good Bayesians will always expect to benefit from more evidence. If my place and that of some other agent, say Alice, were switched, I’d want the information regarding E to be released. So by the Golden Rule, I should release the information.

Second, good Bayesians’ epistemic utilities are measured by a strictly proper scoring rule. Suppose Alice’s epistemic utilities for H are measured by a strictly proper (accuracy) scoring rule s that assigns an epistemic utility s(p,t) to a credence p when the actual truth value of H is t, which can be zero or one. By definition of strict propriety, the expectation by my lights of Alice’s epistemic utility is strictly maximized when her credence equals my credence. Since Alice shares the priors I had before I observed E, if I can make E evident to her, her new posteriors will match my current ones, and so revealing E to her will maximize my expectation of her epistemic utility.

So far so good. But now suppose that the hypothesis H = HN is that there exist N people other than me, and my priors assign probability 1/2 to there being N people and 1/2 to there being n, where N is much larger than n. Suppose further that my evidence E ends up significantly supporting the hypothesis Hn, so that my posterior p in HN is smaller than 1/2.

Now, my expectation of the total epistemic utility of other people if I reveal E is:

  • UR = pNs(p,1) + (1−p)ns(p,0).

And if I conceal E, my expectation is:

  • UC = pNs(1/2,1) + (1−p)ns(1/2,0).

If we had N = n, then it would be guaranteed by strict propriety that UR > UC, and so I should reveal. But we have N > n. Moreover, s(1/2,1) > s(p,1): if some hypothesis is true, a strictly proper accuracy scoring rule increases strictly monotonically with the credence. If N/n is sufficiently large, the first terms of UR and UC will dominate, and hence we will have UC > UR, and thus I should conceal.
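Plugging in the Brier rule (s(p,1) = −(1−p)², s(p,0) = −p²) with illustrative numbers p = 0.4 and n = 10 makes the reversal concrete: with N = n revealing wins, but with N = 1000 concealing wins:

```python
def s(p, t):
    """Brier accuracy score for credence p when H's truth value is t."""
    return -(1 - p) ** 2 if t == 1 else -p ** 2

def U_reveal(p, N, n):
    return p * N * s(p, 1) + (1 - p) * n * s(p, 0)

def U_conceal(p, N, n):
    return p * N * s(0.5, 1) + (1 - p) * n * s(0.5, 0)

p, n = 0.4, 10                                       # posterior in HN, small count
print(U_reveal(p, n, n) > U_conceal(p, n, n))        # True: N = n, so reveal
print(U_reveal(p, 1000, n) > U_conceal(p, 1000, n))  # False: N >> n, so conceal
```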

The intuition behind this technical argument is this. If I reveal the evidence, I decrease people’s credence in HN. If it turns out that the number of people other than me actually is N, I have done a lot of harm, because I have decreased the credence of a very large number N of people. Since N is much larger than n, this consideration trumps considerations of what happens if the number of people is n.

I take it that this is the wrong conclusion. On epistemic grounds, if everyone’s priors are equal, we should release evidence. (See my previous post for what happens if priors are not equal.)

So what should we do? Well, one option is to opt for averaging rather than summing of epistemic utilities. But the problem reappears. For suppose that I can only communicate with members of my own local community, and we as a community have equal credence 1/2 for the hypothesis Hn that our local community of n people contains all agents, and credence 1/2 for the hypothesis Hn+N that there is also a number N of agents outside our community much greater than n. Suppose, further, that my priors are such that I am certain that all the agents outside our community know the truth about these hypotheses. I receive a piece of evidence E disfavoring Hn and leading to credence p < 1/2. Since my revelation of E only affects the members of my own community, if p is my credence after updating on E, the relevant part of my expectation of the utility of revealing E is:

  • UR = p((n−1)/n)s(p,1) + (1−p)((n−1)/(n+N))s(p,0).

And if I conceal E, my expectation contribution is:

  • UC = p((n−1)/n)s(1/2,1) + (1−p)((n−1)/(n+N))s(1/2,0).

If N is sufficiently large, again UC will beat UR.

I take it that there is something wrong with epistemic utilitarianism.

Bayesianism and epistemic paternalism

Suppose that your priors for some hypothesis H are 3/4 while my priors for it are 1/2. I now find some piece of evidence E for H which raises my credence in H to 3/4 and would raise yours above 3/4. If my concern is for your epistemic good, should I reveal this evidence E?

Here is an interesting reason for a negative answer. For any strictly proper (accuracy) scoring rule, my expected value for the score of a credence is uniquely maximized when the credence is 3/4. I assume your epistemic utility is governed by a strictly proper scoring rule. So the expected epistemic utility, by my lights, of your credence is maximized when your credence is 3/4. But if I reveal E to you, your credence will go above 3/4. So I shouldn’t reveal it.
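A sketch with the logarithmic rule (illustrative; the argument only needs strict propriety): my expectation of your score, G(q) = (3/4)log q + (1/4)log(1−q), peaks exactly at q = 3/4, so any boost above 3/4 lowers it by my lights:

```python
import math

def my_expected_score(q, my_cred=0.75):
    """My expectation (credence 3/4 in H) of the log score of your credence q."""
    return my_cred * math.log(q) + (1 - my_cred) * math.log(1 - q)

qs = [i / 1000 for i in range(1, 1000)]
best = max(qs, key=my_expected_score)
print(best)                                                # 0.75
print(my_expected_score(0.85) < my_expected_score(0.75))   # True: E pushes you too far, by my lights
```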

This is epistemic paternalism. So, it seems, expected epistemic utility maximization (which I take it has to employ a strictly proper scoring rule) forces one to adopt epistemic paternalism. This is not a happy conclusion for expected epistemic utility maximization.

Wednesday, January 29, 2025

More on experiments

We all perform experiments very often. When I hear a noise and deliberately turn my head, I perform an experiment to find out what I will see if I turn my head. If I ask a question not knowing what answer I will hear, I am engaging in (human!) experimentation. Roughly, experiments are actions done in order to generate observations as evidence.

There are typically differences in rigor between the experiments we perform in daily life and the experiments scientists perform in the lab, but only typically so. Sometimes we are rigorous in ordinary life and sometimes scientists are sloppy.

The epistemic value to one of an experiment depends on multiple factors in a Bayesian framework.

  1. The set of questions towards answers to which the experiment’s results are expected to contribute.

  2. Specifications of the value of different levels of credence regarding the answers to the questions in Factor 1.

  3. One’s prior levels of credence for the answers.

  4. The likelihoods of different experimental outcomes given different answers.

It is easiest to think of Factor 2 in practical terms. If I am thinking of going for a recreational swim but I am not sure whether my swim goggles have sprung a leak, it may be that if the probability of the goggles being sound is at least 50%, it’s worth going to the trouble of heading out for the pool, but otherwise it’s not. So an experiment that could only yield a 45% confidence in the goggles is useless to my decision whether to go to the pool, and there is no difference in value between an experiment that yields a 55% confidence and one that yields a 95% confidence. On the other hand, if I am an astronaut and am considering performing a non-essential extravehicular task, but I am worried that the only available spacesuit might have sprung a leak, an experiment that can only yield 95% confidence in the soundness of the spacesuit is pointless—if my credence in the spacesuit’s soundness is only 95%, I won’t use the spacesuit.

Factor 3 is relevant in combination with Factor 4, because these two factors tell us how likely I am to end up with different posterior probabilities for the answers to the Factor 1 questions after the experiment. For instance, if I saw that one of my goggles is missing its gasket, my prior credence in the goggle’s soundness is so low that even a positive experimental result (say, no water in my eye after submerging my head in the sink) would not give me 50% credence that the goggle is fine, and so the experiment is pointless.

In a series of posts over the last couple of days, I explored the idea of a somewhat interest-independent comparison between the values of experiments, where one still fixes a set of questions (Factor 1), but says that one experiment is at least as good as another provided that it has at least as good an expected epistemic utility as the other for every proper scoring rule (Factor 2). This comparison criterion is equivalent to one that goes back to the 1950s. This is somewhat interest-independent, because it is still relativized to a set of questions.

A somewhat interesting question that occurred to me yesterday is what effect Factor 3 has on this somewhat interest-independent comparison of experiments. If experiment E2 is at least as good as experiment E1 for every scoring rule on the question algebra, is this true regardless of which consistent and regular priors one has on the question algebra?

A bit of thought showed me a somewhat interesting fact. If there is only one binary (yes/no) question under Factor 1, then it turns out that the somewhat interest-independent comparison of experiments does not depend on the prior probability for the answer to this question (assuming it’s regular, i.e., neither 0 nor 1). But if the question algebra is any larger, this is no longer true. Now, whether an experiment is at least as good as another in this somewhat interest-independent way depends on the choice of priors in Factor 3.

We might now ask: Under what circumstances is an experiment at least as good as another for every proper scoring rule and every consistent and regular assignment of priors on the answers, assuming the question algebra has more than two non-trivial members? I suspect this is a non-trivial question.

Tuesday, January 28, 2025

And one more post on comparing experiments

In my last couple of posts, starting here, I’ve been thinking about comparing the epistemic quality of experiments for a set of questions. I gave a complete geometric characterization for the case where the experiments are binary—each experiment has only two possible outcomes.

Now I want to finally note that there is a literature for the relevant concepts, and it gives a characterization of the comparison of the epistemic quality of experiments, at least in the case of a finite probability space (and in some infinite cases).

Suppose that Ω is our probability space with a finite number of points, and that FQ is the algebra of subsets of Ω corresponding to the set of questions Q (a question partitions Ω into subsets and asks which partition we live in; the algebra FQ is generated by all these partitions). Let X be the space of all probability measures on FQ. This can be identified with an (n−1)-dimensional subset of Euclidean Rn consisting of the points with non-negative coordinates summing to one, where n is the number of atoms in FQ. An experiment E also corresponds to a partition of Ω—it answers the question where in that partition we live. The experiment has some finite number of possible outcomes A1, ..., Am, and in each outcome Ai our Bayesian agent will have a different posterior PAi = P(⋅∣Ai). The posteriors are members of X. The experiment defines an atomic measure μE on X where μE(ν) is the probability that E will generate an outcome whose posterior matches ν on FQ. Thus:

  • μE(ν) = P(⋃{Ai:PAi|FQ=ν}).

Given the correspondence between convex functions and proper scoring rules, we can see that experiment E2 is at least as good as E1 for Q just in case for every convex function c on X we have:

  • ∫XcdμE2 ≥ ∫XcdμE1.

There is an accepted name for this relation: μE2 convexly dominates μE1. Thus, we have it that experiment E2 is at least as good as experiment E1 for Q provided that there is a convex domination relation between the distributions the experiments induce on the possible posteriors for the questions in Q. And it turns out that there is a known mathematical characterization of when this happens, and it includes some infinite cases as well.

In fact, the work on this epistemic comparison of experiments turns out to go back to a 1953 paper by Blackwell. The only difference is that Blackwell (following 1950 work by Bohnenblust, Karlin and Sherman) uses non-epistemic utility while my focus is on scoring rules and epistemic utility. But the mathematics is the same, given that non-epistemic decision problems correspond to proper scoring rules and vice versa.
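To make the convex-domination criterion concrete, here is a minimal sketch in Python (a made-up four-world space and hypothetical helper names). It computes the measure μE on posteriors for two experiments and spot-checks the domination inequality against a few sample convex functions; passing a handful of convex functions is of course an illustration of the definition, not a proof of domination:

```python
from fractions import Fraction

# Toy probability space: four equiprobable worlds.
P = {w: Fraction(1, 4) for w in range(4)}
H = {0, 1}  # the question: are we in H?

def posterior_distribution(partition):
    """The measure mu_E on posteriors: map each posterior-in-H value to the
    probability that the experiment produces an outcome with that posterior."""
    mu = {}
    for cell in partition:
        p_cell = sum(P[w] for w in cell)
        post = sum(P[w] for w in cell if w in H) / p_cell
        mu[post] = mu.get(post, Fraction(0)) + p_cell
    return mu

E1 = [{0, 1}, {2, 3}]  # settles the question completely
E2 = [{0, 2}, {1, 3}]  # tells us nothing about H

mu1 = posterior_distribution(E1)
mu2 = posterior_distribution(E2)

# A few sample convex functions on [0,1]; E1 should score at least as high
# under every one of them if mu1 convexly dominates mu2.
convex_fns = [lambda x: x * x,
              lambda x: abs(x - Fraction(1, 2)),
              lambda x: (1 - x) ** 2]

for c in convex_fns:
    int1 = sum(prob * c(post) for post, prob in mu1.items())
    int2 = sum(prob * c(post) for post, prob in mu2.items())
    assert int1 >= int2
print("E1 dominated E2 on every sampled convex function")
```

Here mu1 puts probability 1/2 on each of the posteriors 0 and 1, while mu2 is concentrated on the prior 1/2, so the domination is just Jensen's inequality in miniature.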

Monday, January 27, 2025

Comparing binary experiments for binary questions

In my previous post I introduced the notion of an experiment being better than another experiment for a set of questions, and gave a definition in terms of strictly proper (or strictly open-minded, which yields the same definition) scoring rules. I gave a sufficient condition for E2 to be at least as good as E1: E2’s associated partition is essentially at least as fine as that of E1.

I then ended with an open question as to what the necessary and sufficient conditions are for a binary (yes/no) experiment to be at least as good as another binary one for a binary question.

I think I now have an answer. For a binary experiment E and a hypothesis H, say that E’s posterior interval for H is the closed interval joining P(H∣E) with P(H∣∼E). Then, I think:

  • Given the binary question whether a hypothesis H is true, and binary experiments E1 and E2, experiment E2 is at least as good as E1 if and only if its posterior interval for H contains E1’s posterior interval for H.

Let’s imagine that you want to be confident of H, because H is nice. Then the above condition says that an experiment that’s better than another will have at least as big a potential benefit (i.e., confidence in H) and at least as big a potential risk (i.e., confidence in ∼H). No benefits without risks in the epistemic game!

The proof (which I only have a sketch of) follows from expressing the expected score after an experiment using formula (4) here, and using convexity considerations.

The above answer doesn’t work for non-binary experiments. The natural analogue to the posterior interval is the convex hull of the set of possible posteriors. But now imagine two experiments to determine whether a coin is fair or double-headed. The first experiment just tosses the coin and looks at the answer. The second experiment tosses an auxiliary independent and fair coin, and if that one comes out heads, then the coin that we are interested in is tossed. The second experiment is worse, because there is probability 1/2 that the auxiliary coin is tails in which case we get no information. But the posterior interval is the same for both experiments.
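The double-headed-coin example can be checked numerically. In the sketch below (hypothetical helper names), I use the fact that for the Brier score the prior expectation of the post-update score of a posterior q is −q(1−q); the two experiments have the same posterior interval, yet the direct toss has strictly higher expected epistemic utility:

```python
from fractions import Fraction

def brier_expected(outcomes):
    """Expected Brier epistemic utility of an experiment, where `outcomes`
    is a list of (probability of outcome, posterior in H) pairs. For a
    posterior q, the prior expectation of the post-update Brier score is
    q*(-(q-1)**2) + (1-q)*(-q**2) = -q*(1-q)."""
    return -sum(p * q * (1 - q) for p, q in outcomes)

half, two_thirds = Fraction(1, 2), Fraction(2, 3)

# H: the coin is double-headed, with prior 1/2.
# Experiment A: toss the coin of interest once.
A = [(Fraction(3, 4), two_thirds),   # heads -> P(H|heads) = 2/3
     (Fraction(1, 4), Fraction(0))]  # tails -> H refuted

# Experiment B: toss an auxiliary fair coin first; toss the coin of
# interest only if the auxiliary coin lands heads.
B = [(half, half),                   # aux tails: no information
     (Fraction(3, 8), two_thirds),   # aux heads, then coin heads
     (Fraction(1, 8), Fraction(0))]  # aux heads, then coin tails

# Same posterior interval for both experiments...
assert (min(q for _, q in A), max(q for _, q in A)) == \
       (min(q for _, q in B), max(q for _, q in B))
# ...but the direct toss has strictly higher expected epistemic utility.
assert brier_expected(A) > brier_expected(B)
print(brier_expected(A), brier_expected(B))  # -1/6 vs -5/24
```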

I don’t know what to say about binary experiments and non-binary questions. A necessary condition is containment of posterior intervals for all possible answers to the question. I don’t know if that’s sufficient.

Comparing experiments

When you’re investigating reality as a scientist (and often as an ordinary person) you perform experiments. Epistemologists and philosophers of science have spent a lot of time thinking about how to evaluate what you should do with the results of the experiments—how they should affect your beliefs or credences—but relatively little on the important question of which experiments you should perform epistemologically speaking. (Of course, ethicists have spent a good deal of time thinking about which experiments you should not perform morally speaking.) Here I understand “experiment” in a broad sense that includes such things as pulling out a telescope and looking in a particular direction.

One might think there is not much to say. After all, it all depends on messy questions of research priorities and costs of time and material. But we can at least abstract from the costs and quantify over epistemically reasonable research priorities, and define:

  1. E2 is epistemically at least as good an experiment as E1 provided that for every epistemically reasonable research priority, E2 would serve the priority at least as well as E1 would.

That’s not quite right, however. For we don’t know how well an experiment would serve a research priority unless we know the result of the experiment. So a better version is:

  2. E2 is epistemically at least as good an experiment as E1 provided that for every epistemically reasonable research priority, the expected degree to which E2 would serve the priority is at least as high as the expected degree to which E1 would.

Now we have a question we can address formally.

Let’s try.

  3. A reasonable epistemic research priority is a strictly proper scoring rule or epistemic utility, and the expected degree to which an experiment would serve that priority is equal to the expected value of the score after Bayesian update on the result of the experiment.

(Since we’re only interested in expected values of scores, we can replace “strictly proper” with “strictly open-minded”.)

And we can identify an experiment with a partition of the probability space: the experiment tells us where we are in that partition. (E.g., if you are measuring some quantity to some number of significant digits, the cells of the partition are equivalence classes under equality of the quantity up to those many significant digits.) The following is then easy to prove:

Proposition 1: On definitions (2) and (3), an experiment E2 is epistemically at least as good as experiment E1 if and only if the partition associated with E2 is essentially at least as fine as the partition associated with E1.

A partition R2 is essentially at least as fine as a partition R1 provided that for every event A in R1 there is an event B in R2 such that with probability one B happens if and only if A happens. The definition is relative to the current credences which are assumed to be probabilistic. If the current credences are regular—all non-empty events have non-zero probability—then “essentially” can be dropped.
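On a finite probability space, essential fineness is straightforward to check mechanically. Here is a minimal sketch (hypothetical function name; P maps worlds to their probabilities):

```python
def essentially_at_least_as_fine(R2, R1, P):
    """True iff partition R2 is essentially at least as fine as partition R1:
    every cell of R1 coincides, up to probability zero, with a union of
    cells of R2."""
    def prob(event):
        return sum(P[w] for w in event)
    for A in R1:
        # Union of the R2-cells that overlap A with positive probability.
        B = set().union(*(C for C in R2 if prob(C & A) > 0))
        if prob(A ^ B) > 0:  # the symmetric difference must be null
            return False
    return True

P = {0: 0.25, 1: 0.25, 2: 0.5, 3: 0.0}  # world 3 is a null world
R1 = [{0, 1}, {2, 3}]
R2 = [{0}, {1}, {2}, {3}]
assert essentially_at_least_as_fine(R2, R1, P)      # genuinely finer
assert not essentially_at_least_as_fine(R1, R2, P)  # the coarser one fails
# "Essentially": moving the null world 3 across cells does not matter.
R3 = [{0, 1, 3}, {2}]
assert essentially_at_least_as_fine(R3, R1, P)
```

With regular credences the null-world case cannot arise, matching the remark that “essentially” can then be dropped.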

However, Proposition 1 suggests that our choice of definitions isn’t that helpful. Consider two experiments. On E1, all the faculty members from your Geology Department have their weight measured to the nearest hundred kilograms. On E2, a thousand randomly chosen individuals around the world have their weight measured to the nearest kilogram. Intuitively, E2 is better. But Proposition 1 shows that in the above sense neither experiment is better than the other, since they generate partitions neither of which is essentially finer than the other (the event of there being a member of the Geology Department with weight at least 150 kilograms is in the partition of E1 but nothing coinciding with that event up to probability zero is in the partition of E2). And this is to be expected. For suppose that our research priority is to know whether any members of your Geology Department weigh at least 150 kilograms, because we need to know whether, for a departmental cave exploring trip, the current selection of harnesses, all of which are rated for users under 150 kilograms, is sufficient. Then E1 is better. On the other hand, if our research priority is to know the average weight of a human being to the nearest ten kilograms, then E2 is better.

The problem with our definitions is that the range of possible research priorities is just too broad. Here is one interesting way to narrow it down. When we are talking about an experiment’s epistemic value, we mean the value of the experiment towards a set of questions. If the set of questions is a scientifically typical set of questions about human population weight distribution, then E2 seems better than E1. But if it is an atypical set of questions about the Geology Department members’ weight distribution, then E1 might be better. We can formalize this, too. We can identify a set Q of questions with a partition of probability space representing the possible answers. This partition then generates an algebra FQ on the probability space, which we can call the “question algebra”. Now we can relativize our definitions to a set of questions.

  4. E2 is epistemically at least as good an experiment as E1 for a set of questions Q provided that for every epistemically reasonable research priority on Q, the expected degree to which E2 would serve the priority is at least as high as the expected degree to which E1 would.

  5. A reasonable epistemic research priority on a set of questions Q is a strictly proper scoring rule or epistemic utility on FQ, and the expected degree to which an experiment would serve that priority is equal to the expected value of the score after Bayesian update on the result of the experiment.

We recover the old definitions by being omnicurious, namely letting Q be all possible questions.

What about Proposition 1? Well, one direction remains: if E2’s partition is essentially at least as fine as E1’s, then E2 is better with regard to any set of questions, and in particular better with regard to Q. But what about the other direction? Now the answer is negative. Suppose the question is what the average weight of the six members of the Geology Department is, to the nearest 100 kg. Consider two experiments: on the first, the members are ordered alphabetically by first name, and a fair die is rolled to choose one (if you roll 1, you choose the first, etc.), and their weight is measured. On the second, the same is done but with the ordering being by last name. Assuming the two orderings are different, neither experiment’s partition is essentially at least as fine as the other’s, but the expected contributions of the two experiments towards our question are equal.

Is there a nice characterization in terms of partitions of when E2 is at least as good as E1 with regard to a set of questions Q? I don’t know. It wouldn’t surprise me if there was something in the literature. A nice start would be to see if we can answer the question in the special case where Q is a single binary question and where E1 and E2 are binary experiments. But I need to go for a dental appointment now.

Friday, January 17, 2025

Knowledge and anti-knowledge

Suppose knowledge has a non-infinitesimal value. Now imagine that you continuously gain evidence for some true proposition p, until your evidence is sufficient for knowledge. If you’re rational, your credence will rise continuously with the evidence. But if knowledge has a non-infinitesimal value, your epistemic utility with respect to p will have a discontinuous jump precisely when you attain knowledge. Further, I will assume that the transition to knowledge happens at a credence strictly bigger than 1/2 (that’s obvious) and strictly less than 1 (Descartes will dispute this).

But this leads to an interesting and slightly implausible consequence. Let T(r) be the epistemic utility of assigning evidence-based credence r to p when p is true, and let F(r) be the epistemic utility of assigning evidence-based credence r to p when p is false. Plausibly, T is a strictly increasing function (being more confident in a truth is good) and F is a strictly decreasing function (being more confident in a falsehood is bad). Furthermore, the pair T and F plausibly yields a proper scoring rule: whatever one’s credence, one doesn’t have an expectation that some other credence would be epistemically better.

It is not difficult to see that these constraints imply that if T has a discontinuity at some point 1/2 < rK < 1, so does F. The discontinuity in F implies that as we become more and more confident in the falsehood p, suddenly we have a discontinuous downward jump in utility. That jump occurs precisely at rK, namely when we gain what we might call “anti-knowledge”: when one’s evidence for a falsehood becomes so strong that it would constitute knowledge if the proposition were true.

Now, there potentially are some points where we might plausibly think that epistemic utility of having a credence in a falsehood takes a discontinuous downward jump. These points are:

  • 1, where we become certain of the falsehood

  • rB, the threshold of belief, where the credence becomes so high that we count as believing the falsehood

  • 1/2, where we start to become more confident in the falsehood p than the truth not-p

  • 1 − rB, where we stop believing not-p, and

  • 0, where the falsehood p becomes an epistemic possibility.

But presumably rK is strictly between rB and 1, and hence rK is none of these points. Is it plausible to think that there is a discontinuous downward jump in epistemic utility when we achieve anti-knowledge by crossing the threshold rK in a falsehood?

I am inclined to say not. But that forces me to say that there is no discontinuous upward jump in epistemic utility once we gain knowledge.

On the other hand, one might think that the worst kind of ignorance is when you’re wrong but you think you have knowledge, and that’s kind of like the anti-knowledge point.

Monday, August 5, 2024

Natural reasoning vs. Bayesianism

A typical Bayesian update gets one closer to the truth in some respects and further from the truth in other respects. For instance, suppose that you toss a coin and get heads. That gets you much closer to the truth with respect to the hypothesis that you got heads. But it confirms the hypothesis that the coin is double-headed, and this likely takes you away from the truth. Moreover, it confirms the conjunctive hypothesis that you got heads and there are unicorns, which takes you away from the truth (assuming there are no unicorns; if there are unicorns, insert a “not” before “are”). Whether the Bayesian update is on the whole a plus or a minus depends on how important the various propositions are. If for some reason saving humanity hangs on you getting it right whether you got heads and there are unicorns, it may well be that the update is on the whole a harm.

(To see the point in the context of scoring rules, take a weighted Brier score which puts an astronomically higher weight on you got heads and there are unicorns than on all the other propositions taken together. As long as all the weights are positive, the scoring rule will be strictly proper.)

This means that there are logically possible update rules that do better than Bayesian update. (In my example, leaving the probability of the proposition you got heads and there are unicorns unchanged after learning that you got heads is superior, even though it results in inconsistent probabilities. By the domination theorem for strictly proper scoring rules, there is an even better method than that which results in consistent probabilities.)
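The parenthetical can be illustrated with concrete made-up numbers (helper names hypothetical): take a toy space with P(heads) = 1/2 and P(unicorns) = 0.1 independent, with heads true and unicorns false at the actual world. A weighted Brier inaccuracy with an astronomical weight on the conjunction makes Bayesian update on the true evidence “heads” an overall epistemic loss:

```python
# Worlds: (heads?, unicorns?). Actual world: heads=True, unicorns=False.
prior = {(h, u): 0.5 * (0.1 if u else 0.9)
         for h in (True, False) for u in (True, False)}

def credence(P, prop):
    return sum(p for w, p in P.items() if prop(w))

def weighted_brier_inaccuracy(P, actual, props_weights):
    """Weighted Brier inaccuracy (lower is better) at the actual world."""
    return sum(w * (credence(P, prop) - (1.0 if prop(actual) else 0.0)) ** 2
               for prop, w in props_weights)

heads = lambda w: w[0]
unicorns = lambda w: w[1]
conj = lambda w: w[0] and w[1]  # "you got heads and there are unicorns"
actual = (True, False)

# Astronomical weight on the conjunction, modest weights elsewhere;
# all weights positive, so the weighted Brier rule is strictly proper.
weights = [(heads, 1.0), (unicorns, 1.0), (conj, 10 ** 6)]

# Bayesian update on the true evidence "heads".
p_heads = credence(prior, heads)
posterior = {w: (p / p_heads if heads(w) else 0.0) for w, p in prior.items()}

before = weighted_brier_inaccuracy(prior, actual, weights)
after = weighted_brier_inaccuracy(posterior, actual, weights)
assert after > before  # updating on a truth worsened overall inaccuracy
```

The update raises the credence in the false conjunction from 0.05 to 0.1, and the huge weight makes that penalty swamp the gains on the other propositions.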

Imagine that you are designing a robot that maneuvers intelligently around the world. You could make the robot a Bayesian. But you don’t have to. Depending on what the prioritizations among the propositions are, you might give the robot an update rule that’s superior to a Bayesian one. If you have no more information than you endow the robot with, you can’t expect to be able to design such an update rule. (Bayesian update has optimal expected accuracy given the pre-update information.) But if you know a lot more than you tell the robot—and of course you do—you might well be able to.

Imagine now that the robot is smart enough to engage in self-reflection. It then notices an odd thing: sometimes it feels itself pulled to make inferences that do not fit with Bayesian update. It starts to hypothesize that by nature it’s a bad reasoner. Perhaps it tries to change its programming to be more Bayesian. Would it be rational to do that? Or would it be rational for it to stick to its programming, which in fact is superior to Bayesian update? This is a difficult epistemology question.

The same could be true for humans. God and/or evolution could have designed us to update on evidence differently from Bayesian update, and this could be epistemically superior (God certainly has superior knowledge; evolution can “draw on” a myriad of information not available to individual humans). In such a case, switching from our “natural update rule” to Bayesian update would be epistemically harmful—it would take us further from the truth. Moreover, it would be literally unnatural. But what does rationality call on us to do? Does it tell us to do Bayesian update or to go with our special human rational nature?

My “natural law epistemology” says that sticking with what’s natural to us is the rational thing to do. We shouldn’t redesign our nature.

Wednesday, June 19, 2024

Entropy

If p is a discrete probability measure, then the Shannon entropy of p is H(p) =  − ∑xp({x})log p({x}). I’ve never had any intuitive feeling for Shannon entropy until I noticed the well-known fact that H(p) is the expected value of the logarithmic inaccuracy score of p by the lights of p. Since I’ve spent a long time thinking about inaccuracy scores, I now get some intuitions about entropy for free.

Entropy is a measure of the randomness of p. But now I am thinking that there are other measures: For any strictly proper inaccuracy scoring rule s, we can take Eps(p) to be some sort of a measure of the randomness of p. These won’t have the nice connections with information theory, though.
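Both the identity and the generalization are easy to check numerically. A minimal sketch (arbitrary example distribution; for the Brier score the analogous measure Eps(p) simplifies, if I compute correctly, to 1 − ∑x p({x})², the Gini–Simpson index):

```python
import math

def shannon_entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def expected_log_inaccuracy(p):
    """E_p of the logarithmic inaccuracy of p: the score at world x is
    -log p({x}), and we take the expectation by the lights of p itself."""
    return sum(q * -math.log(q) for q in p if q > 0)

p = [0.5, 0.25, 0.125, 0.125]
assert math.isclose(shannon_entropy(p), expected_log_inaccuracy(p))

def expected_brier_inaccuracy(p):
    """E_p of the Brier inaccuracy of p: another candidate randomness
    measure, which works out to 1 - sum of squares (Gini-Simpson)."""
    n = len(p)
    return sum(p[i] * sum((p[j] - (1.0 if j == i else 0.0)) ** 2
                          for j in range(n))
               for i in range(n))

assert math.isclose(expected_brier_inaccuracy(p), 1 - sum(q * q for q in p))
```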

Wednesday, May 15, 2024

Very open-minded scoring rules

An accuracy scoring rule is open-minded provided that the expected value of the score after a Bayesian update on a prospective observation is always greater than or equal to the current expected value of the score.

Now consider a single-proposition accuracy scoring rule for a hypothesis H. This can be thought of as a pair of functions T and F where T(p) is the score for assigning credence p when H is true and F(p) is the score for assigning credence p when H is false. We say that the pair (T,F) is very open-minded provided that the conditional-on-H expected value of the T score after a Bayesian update on a prospective observation is greater than or equal to the current conditional-on-H expected value of the T score, and provided that the same is true for the F score with the expected values being conditional on not-H.

An example of a very open-minded scoring rule is the logarithmic rule where T(p) = log p and F(p) = log (1−p). The logarithmic rule has some nice philosophical properties which I discuss in this post, and it is easy to see that any very open-minded scoring rule has these properties. Basically, the idea is that if I measure epistemic utilities using a very open-minded scoring rule, then I will not be worried about Bayesian update on a prospective observation damaging other people’s epistemic utilities, as long as these other people agree with me on the likelihoods.

One might wonder if there are any other non-trivial proper and very open-minded scoring rules besides the logarithmic one. There are. Here’s a pretty easy to verify fact (see the Appendix):

  • A scoring rule (T,F) is very open-minded if and only if the functions xT(x) and (1−x)F(x) are both convex.

Here’s a cute scoring rule that is proper and very open-minded:

  • T(x) = −√((1−x)/x) and F(x) = T(1−x).

(For propriety, use Fact 1 here. For very open-mindedness, note that the graph of xT(x) is the lower half of the circle with radius 1/2 and center at (1/2,0), and hence is convex; by the symmetry F(x) = T(1−x), the same goes for (1−x)F(x).)

What’s cute about this rule? Well, it is symmetric (F(x) = T(1−x)) and it has the additional symmetry property that xT(x) = (1−x)T(1−x) = (1−x)F(x). Alas, though, T is not concave, and I think a good scoring rule should have T concave (i.e., there should be diminishing returns from getting closer to the truth).

Appendix:

Suppose that the prospective observation is as to which cell of the partition E1, ..., En we are in. The very open-mindedness property with respect to T then requires:

  1. ∑iP(Ei|H)T(P(H|Ei)) ≥ T(P(H)).

Now P(Ei|H) = P(H|Ei)P(Ei)/P(H). Thus what we need is:

  2. ∑iP(Ei)P(H|Ei)T(P(H|Ei)) ≥ P(H)T(P(H)).

Given that P(H) = ∑iP(Ei)P(H|Ei), this follows immediately from the convexity of xT(x). The converse is easy, too.
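Inequality (1) can be spot-checked numerically. Here is a minimal sketch (hypothetical names; random joint distributions over a three-cell partition) that verifies it for the logarithmic rule and for the “cute” square-root rule above; spot-checking is an illustration, not a proof:

```python
import math
import random

random.seed(0)

def openmindedness_gap(T, joint):
    """Left minus right side of inequality (1):
    sum_i P(Ei|H) T(P(H|Ei)) - T(P(H)),
    where joint[i] = (P(Ei & H), P(Ei & ~H))."""
    PH = sum(a for a, _ in joint)
    lhs = sum((a / PH) * T(a / (a + b)) for a, b in joint)
    return lhs - T(PH)

T_log = lambda p: math.log(p)
T_cute = lambda p: -math.sqrt((1 - p) / p)  # x*T(x) is a lower semicircle

for _ in range(1000):
    # Random joint distribution over a 3-cell partition crossed with H/~H.
    raw = [random.random() for _ in range(6)]
    total = sum(raw)
    joint = [(raw[2 * i] / total, raw[2 * i + 1] / total) for i in range(3)]
    for T in (T_log, T_cute):
        assert openmindedness_gap(T, joint) >= -1e-9  # float tolerance
print("inequality (1) held in all sampled cases")
```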

Monday, May 13, 2024

A feature of the logarithmic scoring rule

Accuracy scoring rules measure the epistemic utility of having some credence assignment. For simplicity, let’s assume that all credence assignments are probabilistically coherent. A strictly proper scoring rule has the property that, by one’s own lights, the expected epistemic utility of one’s actual credence assignment is always better than that of any other credence assignment.

A well-known fact is that a strictly proper scoring rule always makes it rational to update on non-trivial evidence. I.e., by one’s present lights, the expected epistemic utility after examining and updating on non-trivial evidence will be higher than the expected epistemic utility of ignoring that evidence. We might put this by saying that a strictly proper scoring rule is strictly open-minded.

The logarithmic scoring rule makes the score of assigning credence r be log r when the hypothesis is true and log (1−r) when the hypothesis is false. It is strictly proper and hence strictly open-minded.

The logarithmic scoring rule, however, satisfies a condition even stronger than strict open-mindedness. This condition is easiest to describe in a binary case where one is simply evaluating the score of one’s credence in a single hypothesis H. Given some non-triviality assumptions, it turns out that not only is the expected epistemic utility increased by examining evidence, but the expected epistemic utility conditional on H is increased by examining evidence. (This is a pretty easy calculation.)

So what?

Well, there are several reasons this matters. First, on my recent account of what it is to have a no-hedge commitment to a hypothesis H, if your epistemic utilities are measured by some scoring rule (e.g., Brier) and you have a no-hedge commitment to H but you do not have credence 1 in H, then you will sometimes have reason to refuse to look at evidence. But the above fact about the logarithmic scoring rule shows that this is not so for the logarithmic scoring rule. With the logarithmic scoring rule, it makes sense to look at the evidence even if you have a no-hedge commitment to H—i.e., even if all your betting behavior is “as if H”.

Second, let’s imagine that I run a funding agency and you come to me with an interest in doing some experiment relevant to a hypothesis H. Let’s suppose that the relevant epistemic community agrees on the relevant likelihoods with respect to the evidence obtainable from the experiment, and is perfectly rational, but differs with regard to the priors of H. I might then have this paternalistic worry about funding the experiment. Even though updating on the results of the experiment by my lights is expected to benefit me epistemically, if a strictly proper scoring rule is the appropriate measure of benefit, it may not be true that by my lights other members of the community will benefit epistemically from updating on the results of the experiment. I may, for instance, be close to certain of H, and think that some members of the community have credences that are sufficiently high that the benefit to them of getting a boost in credence in H from the experiment is outweighed by the risk of misleading evidence. If it is my job to watch out for the epistemic good of the community, this could give me reason to refuse funding.

But not so if I think the logarithmic rule is the right way to evaluate epistemic utility. If everyone shares likelihoods, and we differ only in priors for H, and everyone is rational, then when we measure epistemic utility with the logarithmic rule, I have a positive expectation of the epistemic utility effect of examining the experiment’s results on each member of the community. This is easily shown to follow from my above observation about the logarithmic scoring rule. (By my lights the expectation of a fellow community member’s epistemic utility after updating on the experimental results is a weighted sum of an expectation given H and an expectation given not-H. Each improves given the experiment.)
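The parenthetical can be illustrated with a small numerical sketch (made-up numbers and hypothetical helper names: my prior for H is 0.95, my fellow community member’s is 0.5, and we share the likelihoods P(E∣H) = 0.8 and P(E∣∼H) = 0.3). By my lights, the expectation of the other member’s logarithmic score improves if they update on the experiment’s result:

```python
import math

def posterior(prior, lE_H, lE_notH, observed_E):
    """Bayes with shared likelihoods P(E|H)=lE_H and P(E|~H)=lE_notH."""
    lh = lE_H if observed_E else 1 - lE_H
    lnh = lE_notH if observed_E else 1 - lE_notH
    return prior * lh / (prior * lh + (1 - prior) * lnh)

def my_expectation_of_their_score(my_p, their_p, lE_H, lE_notH, update):
    """My expected value of the other agent's logarithmic score for H.
    If update is False they keep their prior; if True they update on E/~E."""
    total = 0.0
    for h_true, p_h in ((True, my_p), (False, 1 - my_p)):
        for obs_E in (True, False):
            p_obs = lE_H if h_true else lE_notH
            p_obs = p_obs if obs_E else 1 - p_obs
            cred = (posterior(their_p, lE_H, lE_notH, obs_E)
                    if update else their_p)
            score = math.log(cred) if h_true else math.log(1 - cred)
            total += p_h * p_obs * score
    return total

my_p, their_p, lE_H, lE_notH = 0.95, 0.5, 0.8, 0.3
no_update = my_expectation_of_their_score(my_p, their_p, lE_H, lE_notH, False)
with_update = my_expectation_of_their_score(my_p, their_p, lE_H, lE_notH, True)
assert with_update > no_update  # by my lights, updating benefits them
```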

Saturday, May 11, 2024

What is it like not to be hedging?

Plausibly, a Christian commitment prohibits hedging. Thus in some sense even if one’s own credence in Christianity is less than 100%, one should act “as if it is 100%”, without hedging one’s bets. One shouldn’t have a backup plan if Christianity is false.

Understanding what this exactly means is difficult. Suppose Alice has Christian commitment, but her credence in Christianity is 97%. If someone asks Alice her credence in Christianity, she should not lie and say “100%”, even though that is literally acting “as if it is 100%”.

Here is a more controversial issue. Suppose Alice has a 97% credence in Christianity, but has the opportunity to examine a piece of evidence which will settle the question one way or the other—it will make her 100% certain Christianity is true or 100% certain it’s not. (Maybe she has an opportunity for a conversation with God.) If she were literally acting as if her credence were 100%, there would be no point to looking at any more evidence. But that seems the wrong answer: refusing to look seems to be a way of being scared that the evidence will refute Christianity, and that kind of fear is opposed to the no-hedge attitude.

Here is a suggestion about how no-hedge decision-making should work. When I think about my credences, say in the context of decision-making, I can:

  1. think about the credences as psychological facts about me, or

  2. regulate my epistemic and practical behavior by the credences (use them to compute expected values, etc.).

The distinction between these two approaches to my credences is really clear from a third-person perspective. Bob, who is Alice’s therapist, thinks about Alice’s credences as psychological facts about her, but does not regulate his own behavior by these credences: Alice’s credences have a psychologically descriptive role for Bob but not a regulative role for Bob in his actions. In fact, they probably don’t even have a regulative role for Bob when he thinks about what actions are good for Alice. If Alice has a high credence in the danger of housecats, and Bob does not, Bob will not encourage Alice to avoid housecats—on the contrary, he may well try to change Alice’s credence, in order to get Alice to act more normally around them.

So, here is my suggestion about no-hedging commitments. When you have a no-hedging commitment to a set of claims, you regulate your behavior by them as if the claims had credence 100%, but when you take the credences into account as psychological facts about you, you give them the credence they actually have.

(I am neglecting here a subtle issue. Should we regulate our behavior by our credences or by our opinion about our credences? I suspect that it is by our credences—else a regress results. If that’s right, then there might be a very nice way to clarify the distinction between taking credences into account as psychological facts and taking them into account as regulative facts. When we take them into account as psychological facts, our behavior is regulated by our credences about the credences. When we take them into account regulatively, our behavior is directly regulated by the credences. If I am right about this, the whole story becomes neater.)

Thus, when Alice is asked what her credence in Christianity is, her decision of how to answer depends on the credence qua psychological fact. Hence, she answers “97%”. But when Alice decides whether or not to engage in Christian worship in a time of persecution, her decision would normally depend on the credence qua regulative, and so she does not take into account the 3% probability of being wrong about Christianity—she just acts as if Christianity were certain.

Similarly, when Alice considers whether to look at a piece of evidence that might raise or lower her credence in Christianity, she does need to consider what her credence is as a psychological fact, because her interest is in what might happen to her actual psychological credence.

Let’s think about this in terms of epistemic utilities (or accuracy scoring rules). If Alice were proceeding “normally”, without any no-hedge commitment, when she evaluates the expected epistemic value of examining some piece of evidence—after all, it may be practically costly to examine it (it may involve digging in an archaeological site, or studying a new language)—she needs to take her credences into account in two different ways: psychologically when calculating the potential for epistemic gain from her credence getting closer to the truth and potential for epistemic loss from her credence getting further from the truth, and regulatively when calculating the expectations as well as when thinking about what is or is not true.

Now on to some fun technical stuff. Let ϕ(r,t) be the epistemic utility of having credence r in some fixed hypothesis of interest H when the truth value is t (which can be 0 or 1). Let’s suppose there is no as-if stuff going on, and I am evaluating the expected epistemic value of examining whether some piece of evidence E obtains. Then if P indicates my credences, the expected epistemic utility of examining the evidence is:

  1. VE = P(H)(P(E|H)ϕ(P(H|E),1)+P(∼E|H)ϕ(P(H|∼E),1)) + P(∼H)(P(E|∼H)ϕ(P(H|E),0)+P(∼E|∼H)ϕ(P(H|∼E),0)).

Basically, I am partitioning logical space based on whether H and E obtain.

Now, in the as-if case, basically the agent has two sets of credences: psychological credences and regulative credences, and they come apart. Let Ψ and R be the two. Then the formula above becomes:

  2. VE = R(H)(R(E|H)ϕ(Ψ(H|E),1)+R(∼E|H)ϕ(Ψ(H|∼E),1)) + R(∼H)(R(E|∼H)ϕ(Ψ(H|E),0)+R(∼E|∼H)ϕ(Ψ(H|∼E),0)).

The no-hedging case that interests us makes R(H) = 1: we regulatively ignore the possibility that the hypothesis is false. Our expected value of examining whether E obtains is then:

  3. VE = R(E|H)ϕ(Ψ(H|E),1) + R(∼E|H)ϕ(Ψ(H|∼E),1).

Let’s make a simplifying assumption that the doctrines that we are as-if committed to do not affect the likelihoods P(E|H) and P(E|∼H) (granted, the latter may be a bit fishy if P(H) = 1, but let’s suppose we have Popper functions or something like that to take care of that), so that R(E|H) = Ψ(E|H) and R(E|∼H) = Ψ(E|∼H).

We then have:

  4. Ψ(H|E) = Ψ(H)R(E|H)/(R(E|H)Ψ(H)+R(E|∼H)Ψ(∼H)).

  5. Ψ(H|∼E) = Ψ(H)R(∼E|H)/(R(∼E|H)Ψ(H)+R(∼E|∼H)Ψ(∼H)).

Assuming Alice has a preferred scoring rule, we now have a formula that can guide Alice as to what evidence to look at: she can just check whether VE is bigger than ϕ(Ψ(H),1), which is her current score regulatively evaluated, i.e., evaluated in the as-if H is true way. If VE is bigger, it’s worth checking whether E is true.

One might hope for something really nice, like that if the scoring rule ϕ is strictly proper, then it’s always worth looking at the evidence. Not so, alas.

It’s easy to see that VE beats the current epistemic utility when E is perfectly correlated with H, assuming ϕ(x,1) is strictly increasing in x.

Surprisingly and sadly, numerical calculations with the Brier score ϕ(x,t) = −(x−t)² show that if Alice’s credence is 0.97, then unless the Bayes factor is very far from 1, her current epistemic utility beats VE, and so no-hedging Alice should not look at the evidence except in rare cases where the evidence is extreme. Interestingly, though, if Alice’s current credence were 0.5, then Alice should always look at the evidence. I suppose the reason is that if Alice is at 0.97, there is not much room for her Brier score to go up assuming the hypothesis is correct, but there is a lot of room for it to go down. If we took seriously the possibility that the hypothesis could be false, it would be worth examining the evidence just in case the hypothesis is false. But that would be a form of hedging.
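The numerical point can be reproduced with a short sketch of the no-hedge formula together with the Bayes updates (1) and (2); the likelihoods, giving a Bayes factor of 7/3, are my own illustrative choice:

```python
def brier_true(x):
    """Brier score phi(x, 1): regulative evaluation, as-if H is true."""
    return -(x - 1) ** 2

def no_hedge_value(c, a, b):
    """Expected epistemic value of examining E when R(H) = 1.
    c = psychological credence Psi(H); a = P(E|H); b = P(E|~H)."""
    postE = c * a / (c * a + (1 - c) * b)                      # formula (1)
    postnE = c * (1 - a) / (c * (1 - a) + (1 - c) * (1 - b))   # formula (2)
    return a * brier_true(postE) + (1 - a) * brier_true(postnE)

a, b = 0.7, 0.3  # Bayes factor 7/3, illustrative

# At credence 0.97, examining the evidence loses in regulative expectation:
looks_bad = no_hedge_value(0.97, a, b) < brier_true(0.97)

# At credence 0.5, examining the evidence wins:
looks_good = no_hedge_value(0.5, a, b) > brier_true(0.5)
```

With these numbers, the 0.97 agent expects to do worse by looking (about −0.0015 versus −0.0009), while the 0.5 agent expects to do better (−0.21 versus −0.25).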

Wednesday, July 26, 2023

Committee credences

Suppose the members of a committee individually assign credences or probabilities to a bunch of propositions—maybe propositions about climate change or about whether a particular individual is guilty or innocent of some alleged crimes. What should we take to be “the committee’s credences” on the matter?

Here is one way to think about this. There is a scoring rule s that measures the closeness of a probability assignment to the truth and that is appropriate to apply in the epistemic matter at hand. The scoring rule is strictly proper (i.e., such that an individual by their own lights is always prohibited from switching probabilities without evidence). The committee can then be imagined to go through all the infinitely many possible probability assignments q, and for each one, member i calculates the expected value E_{p_i}s(q) of the score of q by the lights of the member’s own probability assignment p_i.

We now need a voting procedure between the assignments q. Here is one suggestion: calculate a “committee score estimate” for q in the most straightforward way possible—namely, by adding the individuals’ expected scores, and choose an assignment that maximizes the committee score estimate.

It’s easy to prove that given that the common scoring rule is strictly proper, the probability assignment that wins out in this procedure is precisely the average p̄ = (p_1+...+p_n)/n of the individuals’ probability assignments. So it is natural to think of “the committee’s credence” as the average of the members’ credences, if the above notional procedure is natural, which it seems to be.
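For a single proposition and the Brier score, this is easy to check numerically; the members’ credences below are my made-up example:

```python
# Committee score estimate for a candidate assignment q on one proposition:
# the sum of each member's expected Brier score of q.
members = [0.2, 0.5, 0.9]  # members' credences p_i, illustrative

def expected_brier(p, q):
    """Expected Brier score of announcing q, by the lights of credence p."""
    return p * -(q - 1) ** 2 + (1 - p) * -(q ** 2)

def committee_estimate(q):
    return sum(expected_brier(p, q) for p in members)

avg = sum(members) / len(members)

# Grid search over candidate assignments: nothing beats the average.
best = max((q / 1000 for q in range(1001)), key=committee_estimate)
```

The grid maximizer lands (up to grid resolution) exactly on the average of the members’ credences, as the proof predicts.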

But is the above notional voting procedure the right one? I don’t really know. But here are some thoughts.

First, there is a limitation in the above setup: we assumed that each committee member had the same strictly proper scoring rule. But in practice, people don’t. People differ in how much importance they attach to getting different propositions right. I think there is a way of arguing that this doesn’t matter, however. There is a natural “committee scoring rule”: it is just the sum of the individual scoring rules. And then we ask each member i, when acting as a committee member, to use the committee scoring rule in their voting. Thus, each member calculates the expected committee score of q, still by their own epistemic lights, and these are added, and we maximize, and once again the average will be optimal. (This uses the fact that a sum of strictly proper scoring rules is strictly proper.)
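The closing fact can be illustrated for the sum of two familiar strictly proper rules, Brier and logarithmic (the credence 0.37 is just an illustrative choice of mine):

```python
import math

# Expected scores of announcing q by the lights of credence p, for two
# strictly proper rules and for their sum.
def expected_brier(q, p):
    return p * -(q - 1) ** 2 + (1 - p) * -(q ** 2)

def expected_log(q, p):
    return p * math.log(q) + (1 - p) * math.log(1 - q)

def expected_sum(q, p):
    return expected_brier(q, p) + expected_log(q, p)

p = 0.37  # a member's credence, illustrative
grid = [q / 1000 for q in range(1, 1000)]
best = max(grid, key=lambda q: expected_sum(q, p))
# The summed rule is again uniquely maximized at q = p,
# so it inherits strict propriety from its summands.
```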

Second, there is another way to arrive at the credence-averaging procedure. Presumably most of the reason why we care about a committee’s credence assignments is practical rather than purely theoretical. In cases where consequentialism works, we can model this by supposing a joint committee utility assignment (which might be the sum of individual utility assignments, or might be a consensus utility assignment), and we can imagine the committee to be choosing between wagers so as to maximize the agreed-on committee utility function. It seems natural to imagine doing this as follows. The committee expectations or previsions for different wagers are obtained by summing individual expectations—with the individuals using the agreed-on committee utility function, albeit with their own individual credences to calculate the expectations. And then the committee chooses a wager that maximizes its prevision.

But now it’s easy to see that the above procedure yields exactly the same result as the committee maximizing committee utility calculated with respect to the average of the individuals’ credence assignments.
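A tiny sketch makes the equivalence concrete; the two wagers and the members’ credences are invented for illustration:

```python
# Each wager on H is a pair (utility if H, utility if ~H), illustrative.
wagers = {"safe": (1.0, 1.0), "bold": (3.0, -2.0)}
members = [0.2, 0.5, 0.9]  # members' credences, illustrative
avg = sum(members) / len(members)

def prevision(p, w):
    """Expected committee utility of wager w under credence p."""
    u1, u0 = w
    return p * u1 + (1 - p) * u0

def committee_prevision(w):
    """Sum of the members' individual previsions for wager w."""
    return sum(prevision(p, w) for p in members)

pick_summed = max(wagers, key=lambda k: committee_prevision(wagers[k]))
pick_average = max(wagers, key=lambda k: prevision(avg, wagers[k]))
# The two procedures pick the same wager, because summing previsions is
# just n times the expectation under the average credence.
```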

So there is a rather nice coherence between the committee credences generated by our epistemic “accuracy-first” procedure and what one gets in a pragmatic approach.

But still all this depends on the plausible, but unjustified, assumption that addition is the right way to go, whether for epistemic or pragmatic utility expectations. But given this assumption, it really does seem like the committee’s credences are reasonably taken to be the average of the members’ credences.