Monday, January 27, 2025

Comparing experiments

When you’re investigating reality as a scientist (and often as an ordinary person) you perform experiments. Epistemologists and philosophers of science have spent a lot of time thinking about how to evaluate what you should do with the results of the experiments—how they should affect your beliefs or credences—but relatively little on the important question of which experiments you should perform epistemologically speaking. (Of course, ethicists have spent a good deal of time thinking about which experiments you should not perform morally speaking.) Here I understand “experiment” in a broad sense that includes such things as pulling out a telescope and looking in a particular direction.

One might think there is not much to say. After all, it all depends on messy questions of research priorities and costs of time and material. But we can at least abstract from the costs and quantify over epistemically reasonable research priorities, and define:

  1. E2 is epistemically at least as good an experiment as E1 provided that for every epistemically reasonable research priority, E2 would serve the priority at least as well as E1 would.

That’s not quite right, however. For we don’t know how well an experiment would serve a research priority unless we know the result of the experiment. So a better version is:

  2. E2 is epistemically at least as good an experiment as E1 provided that for every epistemically reasonable research priority, the expected degree to which E2 would serve the priority is at least as high as the expected degree to which E1 would.

Now we have a question we can address formally.

Let’s try.

  3. A reasonable epistemic research priority is a strictly proper scoring rule or epistemic utility, and the expected degree to which an experiment would serve that priority is equal to the expected value of the score after Bayesian update on the result of the experiment.

(Since we’re only interested in expected values of scores, we can replace “strictly proper” with “strictly open-minded”.)
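To make (3) concrete, here is a minimal sketch on a finite probability space. It assumes one particular strictly proper scoring rule, the negative Brier score over the individual worlds; the function name `expected_score` and the four-world example are illustrative choices, not part of the definitions above.

```python
from fractions import Fraction


def expected_score(prior, experiment):
    """Expected negative Brier score after Bayesian update on the cell
    of `experiment` (a list of disjoint sets of worlds) that obtains."""
    total = 0
    for cell in experiment:
        p_cell = sum(prior[w] for w in cell)
        if p_cell == 0:
            continue  # a probability-zero cell contributes nothing
        # Bayesian update: condition the prior on the observed cell.
        posterior = {w: (prior[w] / p_cell if w in cell else 0) for w in prior}
        for w in cell:
            # Score of the posterior if w is the actual world (0 is best).
            score = -sum((posterior[v] - (1 if v == w else 0)) ** 2
                         for v in prior)
            total += prior[w] * score
    return total


# Illustrative example: four equiprobable worlds and three experiments.
prior = {w: Fraction(1, 4) for w in range(4)}
trivial = [{0, 1, 2, 3}]      # learn nothing
coarse = [{0, 1}, {2, 3}]     # learn which half we are in
fine = [{0}, {1}, {2, 3}]     # refines `coarse`

for name, experiment in [("trivial", trivial), ("coarse", coarse), ("fine", fine)]:
    print(name, expected_score(prior, experiment))
```

For this one scoring rule the finer experiment scores at least as well as the coarser one. Definition (2) asks for this to hold for every reasonable priority, which a single scoring rule cannot establish.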

And we can identify an experiment with a partition of the probability space: the experiment tells us where we are in that partition. (E.g., if you are measuring some quantity to some number of significant digits, the cells of the partition are equivalence classes under equality of the quantity up to that many significant digits.) The following is then easy to prove:

Proposition 1: On definitions (2) and (3), an experiment E2 is epistemically at least as good as experiment E1 if and only if the partition associated with E2 is essentially at least as fine as the partition associated with E1.

A partition R2 is essentially at least as fine as a partition R1 provided that for every event A in R1 there is an event B in R2 such that with probability one B happens if and only if A happens. The definition is relative to the current credences which are assumed to be probabilistic. If the current credences are regular—all non-empty events have non-zero probability—then “essentially” can be dropped.
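On a finite space the definition can be checked mechanically. The sketch below assumes that an "event in" a partition means a union of its cells; on that reading, R2 is essentially at least as fine as R1 exactly when the positive-probability part of each cell of R2 lies inside a single cell of R1. The function name and the three-world example are hypothetical.

```python
def essentially_at_least_as_fine(prior, r2, r1):
    """True if partition r2 is essentially at least as fine as r1,
    relative to the credences in `prior` (finite space only)."""
    def cell_index(world, partition):
        return next(i for i, cell in enumerate(partition) if world in cell)

    for cell in r2:
        # Ignore worlds of probability zero: they cannot matter
        # "with probability one".
        support = [w for w in cell if prior[w] > 0]
        if len({cell_index(w, r1) for w in support}) > 1:
            return False  # this cell of r2 straddles two cells of r1
    return True


r1 = [{0}, {1, 2}]
r2 = [{0, 2}, {1}]   # not literally finer than r1: {0, 2} straddles

irregular = {0: 0.5, 1: 0.5, 2: 0.0}   # world 2 has probability zero
regular = {0: 0.4, 1: 0.4, 2: 0.2}     # every world has positive probability

print(essentially_at_least_as_fine(irregular, r2, r1))
print(essentially_at_least_as_fine(regular, r2, r1))
```

With the irregular credences the straddling world has probability zero, so R2 counts as essentially at least as fine as R1. With the regular credences "essentially" makes no difference and the check fails.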

However, Proposition 1 suggests that our choice of definitions isn’t that helpful. Consider two experiments. On E1, all the faculty members from your Geology Department have their weight measured to the nearest hundred kilograms. On E2, a thousand randomly chosen individuals around the world have their weight measured to the nearest kilogram. Intuitively, E2 is better. But Proposition 1 shows that in the above sense neither experiment is better than the other, since they generate partitions neither of which is essentially finer than the other (the event of there being a member of the Geology Department with weight at least 150 kilograms is in the partition of E1 but nothing coinciding with that event up to probability zero is in the partition of E2). And this is to be expected. For suppose that our research priority is to know whether any members of your Geology Department are at least 150 kilograms in weight, because we need to know whether, for a departmental cave exploring trip, the current selection of harnesses, all of which are rated for users under 150 kilograms, is sufficient. Then E1 is better. On the other hand, if our research priority is to know the average weight of a human being to the nearest ten kilograms, then E2 is better.

The problem with our definitions is that the range of possible research priorities is just too broad. Here is one interesting way to narrow it down. When we are talking about an experiment’s epistemic value, we mean the value of the experiment towards a set of questions. If the set of questions is a scientifically typical set of questions about human population weight distribution, then E2 seems better than E1. But if it is an atypical set of questions about the Geology Department members’ weight distribution, then E1 might be better. We can formalize this, too. We can identify a set Q of questions with a partition of probability space representing the possible answers. This partition then generates an algebra FQ on the probability space, which we can call the “question algebra”. Now we can relativize our definitions to a set of questions.

  4. E2 is epistemically at least as good an experiment as E1 for a set of questions Q provided that for every epistemically reasonable research priority on Q, the expected degree to which E2 would serve the priority is at least as high as the expected degree to which E1 would.

  5. A reasonable epistemic research priority on a set of questions Q is a strictly proper scoring rule or epistemic utility on FQ, and the expected degree to which an experiment would serve that priority is equal to the expected value of the score after Bayesian update on the result of the experiment.

We recover the old definitions by being omnicurious, namely letting Q be all possible questions.

What about Proposition 1? Well, one direction remains: if E2’s partition is essentially at least as fine as E1’s, then E2 is at least as good with regard to any set of questions, and in particular with regard to Q. But what about the other direction? Now the answer is negative. Suppose the question is what the average weight of the six members of the Geology Department is to the nearest 100 kg. Consider two experiments: on the first, the members are ordered alphabetically by first name, a fair die is rolled to choose one (if you roll 1, you choose the first, etc.), and that member’s weight is measured. On the second, the same is done but with the ordering being by last name. Assuming the two orderings are different, neither experiment’s partition is essentially at least as fine as the other’s, but the expected contributions of the two experiments towards our question are equal.
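The counterexample can be checked numerically on a scaled-down toy version. The sketch below makes several assumptions that are not in the post: three department members and a fair three-sided die instead of six, each member independently weighing either 40 kg or 160 kg with the made-up probabilities in `P_HEAVY`, and the negative Brier score on the question partition as the single research priority tested. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

LIGHT, HEAVY = 40, 160
# Hypothetical, deliberately unequal chances that each member is heavy.
P_HEAVY = [Fraction(1, 10), Fraction(1, 2), Fraction(4, 5)]


def build_prior():
    """Worlds are (weights, die) pairs; the die is independent of the weights."""
    prior = {}
    for weights in product((LIGHT, HEAVY), repeat=3):
        p = Fraction(1)
        for w, ph in zip(weights, P_HEAVY):
            p *= ph if w == HEAVY else 1 - ph
        for die in range(3):
            prior[(weights, die)] = p / 3
    return prior


def question(world):
    """Average weight of the three members, to the nearest 100 kg."""
    weights, _ = world
    return round(sum(weights) / 300) * 100


def experiment(order):
    """Roll the die, pick the member at that position in `order`, weigh them."""
    def outcome(world):
        weights, die = world
        return (die, weights[order[die]])
    return outcome


def expected_score(prior, outcome, answer):
    """Expected negative Brier score on the answers after updating on the outcome."""
    cells = {}
    for world, p in prior.items():
        cells.setdefault(outcome(world), []).append((world, p))
    answers = {answer(w) for w in prior}
    total = Fraction(0)
    for members in cells.values():
        p_cell = sum(p for _, p in members)
        posterior = {a: sum(p for w, p in members if answer(w) == a) / p_cell
                     for a in answers}
        for w, p in members:
            truth = answer(w)
            total += p * -sum((posterior[a] - (1 if a == truth else 0)) ** 2
                              for a in answers)
    return total


def refines(prior, fine, coarse):
    """True if the outcome of `fine` always determines the outcome of `coarse`."""
    seen = {}
    for world in prior:
        if seen.setdefault(fine(world), coarse(world)) != coarse(world):
            return False
    return True


prior = build_prior()
by_first_name = experiment((0, 1, 2))
by_last_name = experiment((2, 0, 1))   # a different ordering of the same members

score_first = expected_score(prior, by_first_name, question)
score_last = expected_score(prior, by_last_name, question)
print(score_first == score_last)
print(refines(prior, by_first_name, by_last_name),
      refines(prior, by_last_name, by_first_name))
```

The expected scores agree because the die is independent of the weights and each ordering is a permutation, so either experiment weighs each member with probability 1/3. Since every world here has positive probability, the failure of `refines` in both directions also means neither partition is essentially at least as fine as the other.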

Is there a nice characterization in terms of partitions of when E2 is at least as good as E1 with regard to a set of questions Q? I don’t know. It wouldn’t surprise me if there was something in the literature. A nice start would be to see if we can answer the question in the special case where Q is a single binary question and where E1 and E2 are binary experiments. But I need to go for a dental appointment now.

1 comment:

Alexander R Pruss said...

The notion of comparing experiments relative to a question explored in this post goes back to the 1950s. https://www.jstor.org/stable/2236332