Wednesday, January 29, 2025

More on experiments

We all perform experiments very often. When I hear a noise and deliberately turn my head, I perform an experiment to find out what I will see if I turn my head. If I ask a question not knowing what answer I will hear, I am engaging in (human!) experimentation. Roughly, experiments are actions done in order to generate observations as evidence.

There are typically differences in rigor between the experiments we perform in daily life and the experiments scientists perform in the lab, but only typically so. Sometimes we are rigorous in ordinary life and sometimes scientists are sloppy.

In a Bayesian framework, the epistemic value of an experiment to an agent depends on several factors.

  1. The set of questions towards answers to which the experiment’s results are expected to contribute.

  2. Specifications of the value of different levels of credence regarding the answers to the questions in Factor 1.

  3. One’s prior levels of credence for the answers.

  4. The likelihoods of different experimental outcomes given different answers.

It is easiest to think of Factor 2 in practical terms. If I am thinking of going for a recreational swim but I am not sure whether my swim goggles have sprung a leak, it may be that if the probability of the goggles being sound is at least 50%, it’s worth going to the trouble of heading out for the pool, but otherwise it’s not. So an experiment that could only yield a 45% confidence in the goggles is useless to my decision whether to go to the pool, and there is no difference in value between an experiment that yields a 55% confidence and one that yields a 95% confidence. On the other hand, if I am an astronaut and am considering performing a non-essential extravehicular task, but I am worried that the only available spacesuit might have sprung a leak, an experiment that can only yield 95% confidence in the soundness of the spacesuit is pointless—if my credence in the spacesuit’s soundness is only 95%, I won’t use the spacesuit.
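The goggles case can be sketched as a step-function utility on credences. This is only a toy illustration; the 0.5 threshold and the 0/1 values are invented for the example, not part of any general theory:

```python
# Toy illustration of Factor 2: a step-function utility on credences.
# The 0.5 threshold and the 0/1 values are made up for the goggles case.
def decision_value(credence, threshold=0.5):
    """Value of a credence when all that matters is whether it licenses
    acting (going to the pool): any credence at or above the threshold
    is equally good, and any credence below it is equally useless."""
    return 1.0 if credence >= threshold else 0.0

# An experiment that can at best yield 45% confidence never crosses the
# threshold, so it contributes nothing to the decision:
print(decision_value(0.45))                          # 0.0
# And 55% vs. 95% confidence are decision-theoretically on a par:
print(decision_value(0.55) == decision_value(0.95))  # True
```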

Factor 3 is relevant in combination with Factor 4, because these two factors tell us how likely I am to end up with different posterior probabilities for the answers to the Factor 1 questions after the experiment. For instance, if I saw that one of my goggles is missing its gasket, my prior credence in the goggle’s soundness is so low that even a positive experimental result (say, no water in my eye after submerging my head in the sink) would not give me 50% credence that the goggle is fine, and so the experiment is pointless.
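A minimal Bayes-update sketch shows how Factors 3 and 4 interact in the missing-gasket case; the prior and likelihoods below are invented purely for illustration:

```python
# Toy Bayes update for the goggle with a visibly missing gasket.
# The prior and the two likelihoods are illustrative numbers only.
def posterior_sound(prior, p_pass_if_sound, p_pass_if_leaky):
    """Posterior that the goggle is sound after passing the sink test
    (no water in the eye), computed by Bayes' theorem."""
    num = prior * p_pass_if_sound
    return num / (num + (1 - prior) * p_pass_if_leaky)

# With a tiny prior, even a passed test cannot reach the 50% credence
# needed to make the pool trip worthwhile, so the test is pointless:
p = posterior_sound(prior=0.02, p_pass_if_sound=0.95, p_pass_if_leaky=0.30)
print(p < 0.5)  # True: the posterior stays far below 1/2
```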

In a series of posts over the last couple of days, I explored the idea of a somewhat interest-independent comparison between the values of experiments, where one still fixes a set of questions (Factor 1), but says that one experiment is at least as good as another provided that it has at least as good an expected epistemic utility as the other for every proper scoring rule (Factor 2). This comparison criterion is equivalent to one that goes back to the 1950s. This is somewhat interest-independent, because it is still relativized to a set of questions.

A somewhat interesting question that occurred to me yesterday is what effect Factor 3 has on this somewhat interest-independent comparison of experiments. If experiment E2 is at least as good as experiment E1 for every scoring rule on the question algebra, is this true regardless of which consistent and regular priors one has on the question algebra?

A bit of thought showed me a somewhat interesting fact. If there is only one binary (yes/no) question under Factor 1, then it turns out that the somewhat interest-independent comparison of experiments does not depend on the prior probability for the answer to this question (assuming it’s regular, i.e., neither 0 nor 1). But if the question algebra is any larger, this is no longer true. Now, whether an experiment is at least as good as another in this somewhat interest-independent way depends on the choice of priors in Factor 3.

We might now ask: Under what circumstances is an experiment at least as good as another for every proper scoring rule and every consistent and regular assignment of priors on the answers, assuming the question algebra has more than two non-trivial members? I suspect this is a non-trivial question.

Tuesday, January 28, 2025

And one more post on comparing experiments

In my last couple of posts, starting here, I’ve been thinking about comparing the epistemic quality of experiments for a set of questions. I gave a complete geometric characterization for the case where the experiments are binary—each experiment has only two possible outcomes.

Now I want to finally note that there is a literature on the relevant concepts, and it gives a characterization of the comparison of the epistemic quality of experiments, at least in the case of a finite probability space (and in some infinite cases).

Suppose that Ω is our probability space with a finite number of points, and that FQ is the algebra of subsets of Ω corresponding to the set of questions Q (a question partitions Ω into subsets and asks which partition we live in; the algebra FQ is generated by all these partitions). Let X be the space of all probability measures on FQ. This can be identified with an (n−1)-dimensional subset of Euclidean Rn consisting of the points with non-negative coordinates summing to one, where n is the number of atoms in FQ. An experiment E also corresponds to a partition of Ω—it answers the question where in that partition we live. The experiment has some finite number of possible outcomes A1, ..., Am, and in each outcome Ai our Bayesian agent will have a different posterior PAi = P(⋅∣Ai). The posteriors are members of X. The experiment defines an atomic measure μE on X where μE(ν) is the probability that E will generate an outcome whose posterior matches ν on FQ. Thus:

  • μE(ν) = P(⋃{Ai:PAi|FQ=ν}).
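As a concrete sketch of this construction, here is μE computed on a four-point space; the uniform prior, question partition, and experiment partition are all made up for illustration:

```python
from fractions import Fraction
from collections import defaultdict

# Toy construction of mu_E: Omega = {0,1,2,3} with a uniform prior, the
# question partition {{0,1},{2,3}}, and the experiment partition
# {{0},{1,2,3}}. All of these choices are illustrative only.
P = {w: Fraction(1, 4) for w in range(4)}
Q_cells = [frozenset({0, 1}), frozenset({2, 3})]   # question partition
E_cells = [frozenset({0}), frozenset({1, 2, 3})]   # experiment partition

mu_E = defaultdict(Fraction)
for A in E_cells:
    pA = sum(P[w] for w in A)
    # The posterior restricted to F_Q: P(C|A) for each question cell C.
    post = tuple(sum(P[w] for w in A & C) / pA for C in Q_cells)
    mu_E[post] += pA   # outcomes with the same posterior get lumped

# mu_E assigns weight 1/4 to the posterior (1, 0), which settles the
# question, and weight 3/4 to the posterior (1/3, 2/3).
print(dict(mu_E))
```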

Given the correspondence between convex functions and proper scoring rules, we can see that experiment E2 is at least as good as E1 for Q just in case for every convex function c on X we have:

  • ∫XcdμE2 ≥ ∫XcdμE1.

There is an accepted name for this relation: μE2 convexly dominates μE1. Thus, we have it that experiment E2 is at least as good as experiment E1 for Q provided that there is a convex domination relation between the distributions the experiments induce on the possible posteriors for the questions in Q. And it turns out that there is a known mathematical characterization of when this happens, and it includes some infinite cases as well.
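Quantifying over every convex function is of course infeasible computationally, but spot-checking a handful of convex functions gives a necessary-condition test. A hedged sketch, with made-up finitely supported distributions over posteriors for a binary question:

```python
# Spot-check for convex domination: a *necessary* condition only, since
# we test finitely many convex functions. Measures are toy data, given
# as {posterior_pair: weight}.
def integral(mu, c):
    """Integral of c against a finitely supported measure mu."""
    return sum(w * c(x) for x, w in mu.items())

convex_samples = [
    lambda x: x[0] ** 2,          # square of first coordinate
    lambda x: max(x),             # max of coordinates
    lambda x: abs(x[0] - 0.5),    # distance of first coordinate from 1/2
]

def passes_domination_test(mu2, mu1):
    """True if mu2's integral beats mu1's on every sampled convex c."""
    return all(integral(mu2, c) >= integral(mu1, c) for c in convex_samples)

# A maximally informative experiment vs. a totally uninformative one:
mu_perfect = {(1.0, 0.0): 0.5, (0.0, 1.0): 0.5}
mu_null = {(0.5, 0.5): 1.0}
print(passes_domination_test(mu_perfect, mu_null))  # True
print(passes_domination_test(mu_null, mu_perfect))  # False
```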

In fact, the work on this epistemic comparison of experiments turns out to go back to a 1953 paper by Blackwell. The only difference is that Blackwell (following 1950 work by Bohnenblust, Karlin and Sherman) uses non-epistemic utility while my focus is on scoring rules and epistemic utility. But the mathematics is the same, given that non-epistemic decision problems correspond to proper scoring rules and vice versa.

Comparing binary experiments for non-binary questions

In my last two posts (here and here), I introduced the notion of an experiment being epistemically at least as good as another for a set of questions. I then announced a characterization of when this happens in the special case where the set of questions consists of a single binary (yes/no) question and the experiments are themselves binary.

The characterization was as follows. A binary experiment will result in one of two posterior probabilities for the hypothesis that our yes/no question concerns, and we can form the “posterior interval” between them. It turns out that one experiment is at least as good as another provided that the first one’s posterior interval contains the second one’s.

I then noted that I didn’t know what to say for non-binary questions (e.g., “How many mountains are there on Mars?”) but still binary experiments. Well, with a bit of thought, I think I now have it, and it’s almost exactly the same. A binary experiment now defines a “posterior line segment” in the space of probabilities, joining the two possible credence outcomes. (In the case of a probability space with a finite number n of points, the space of probabilities can be identified as the set of points in n-dimensional Euclidean space all of whose coordinates are non-negative and add up to 1.) A bit of thought about convex functions makes it pretty obvious that E2 is at least as good as E1 if and only if E2’s posterior line segment contains E1’s posterior line segment. (The necessity of this geometric condition is easy to see: consider a convex function that is zero everywhere on E2’s posterior line segment but non-zero on one of E1’s two possible posteriors, and use that convex function to generate the scoring rule.)
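The containment of line segments can be checked coordinate-wise: a point lies on a segment just in case the convex-combination coefficient t works out the same in every coordinate and falls in [0,1]. A sketch with illustrative points in the two-point probability simplex:

```python
# Sketch: does one posterior line segment contain another in the
# probability simplex? Endpoints below are made-up illustrative data.
def on_segment(x, a, b, tol=1e-9):
    """Is point x on the segment from a to b (all tuples of floats)?
    We look for t with x = (1-t)*a + t*b, consistent across coordinates."""
    ts = []
    for xi, ai, bi in zip(x, a, b):
        if abs(bi - ai) > tol:
            ts.append((xi - ai) / (bi - ai))
        elif abs(xi - ai) > tol:
            return False   # coordinate constant on segment but x differs
    if not ts:
        return True        # degenerate case: a == b == x
    t = ts[0]
    return -tol <= t <= 1 + tol and all(abs(s - t) <= tol for s in ts)

def segment_contains(seg2, seg1):
    """seg2 contains seg1 iff both endpoints of seg1 lie on seg2."""
    return all(on_segment(p, *seg2) for p in seg1)

seg2 = ((0.1, 0.9), (0.9, 0.1))   # E2's posterior line segment
seg1 = ((0.3, 0.7), (0.6, 0.4))   # E1's posterior line segment
print(segment_contains(seg2, seg1))  # True
print(segment_contains(seg1, seg2))  # False
```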

This is a hard condition to satisfy. The two experiments have to be carefully gerrymandered even to make their posterior line segments collinear, much less to make one a subset of the other. I conclude that when one’s interest is in more than just one binary question, one binary experiment will not be overall better than another except in very special cases.

Recall that my notion of “better” quantified over all proper scoring rules. I guess the upshot of this is that interesting comparisons of experiments are relative not only to a set of questions but also to a specific proper scoring rule.

Monday, January 27, 2025

Comparing binary experiments for binary questions

In my previous post I introduced the notion of an experiment being better than another experiment for a set of questions, and gave a definition in terms of strictly proper (or strictly open-minded, which yields the same definition) scoring rules. I gave a sufficient condition for E2 to be at least as good as E1: E2’s associated partition is essentially at least as fine as that of E1.

I then ended with an open question as to what the necessary and sufficient conditions are for a binary (yes/no) experiment to be at least as good as another binary one for a binary question.

I think I now have an answer. For a binary experiment E and a hypothesis H, say that E’s posterior interval for H is the closed interval joining P(H∣E) with P(H∣∼E). Then, I think:

  • Given the binary question whether a hypothesis H is true, and binary experiments E1 and E2, experiment E2 is at least as good as E1 if and only if its posterior interval for H contains E1’s posterior interval for H.

Let’s imagine that you want to be confident of H, because H is nice. Then the above condition says that an experiment that’s better than another will have at least as big a potential benefit (i.e., confidence in H) and at least as big a potential risk (i.e., confidence in ∼H). No benefits without risks in the epistemic game!
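The criterion is easy to compute in examples. Here is a sketch with invented priors and likelihoods (a discriminating test vs. a barely informative one):

```python
# Toy illustration of the posterior-interval criterion; all the priors
# and likelihoods below are made-up numbers.
def posterior_interval(prior_H, lik_E_given_H, lik_E_given_notH):
    """Closed interval joining P(H|E) and P(H|~E) for a binary
    experiment whose positive result is E."""
    pE = prior_H * lik_E_given_H + (1 - prior_H) * lik_E_given_notH
    p_H_given_E = prior_H * lik_E_given_H / pE
    p_H_given_notE = prior_H * (1 - lik_E_given_H) / (1 - pE)
    return tuple(sorted((p_H_given_notE, p_H_given_E)))

def at_least_as_good(iv2, iv1):
    """E2 is at least as good as E1 iff E2's interval contains E1's."""
    return iv2[0] <= iv1[0] and iv1[1] <= iv2[1]

iv_sensitive = posterior_interval(0.5, 0.9, 0.1)  # a discriminating test
iv_weak = posterior_interval(0.5, 0.6, 0.4)       # a barely informative test
print(at_least_as_good(iv_sensitive, iv_weak))    # True
```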

The proof (which I only have a sketch of) follows from expressing the expected score after an experiment using formula (4) here, and using convexity considerations.

The above answer doesn’t work for non-binary experiments. The natural analogue to the posterior interval is the convex hull of the set of possible posteriors. But now imagine two experiments to determine whether a coin is fair or double-headed. The first experiment just tosses the coin and looks at the answer. The second experiment tosses an auxiliary independent and fair coin, and if that one comes out heads, then the coin that we are interested in is tossed. The second experiment is worse, because there is probability 1/2 that the auxiliary coin is tails in which case we get no information. But the posterior interval is the same for both experiments.
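The coin example can be worked out exactly. The Brier-style convex function below is my own illustrative stand-in for a proper-scoring-rule expected utility; the posteriors follow from Bayes' theorem with prior 1/2 on "fair":

```python
from fractions import Fraction as F

# The two coin experiments (fair vs. double-headed coin, prior 1/2 on
# "fair"). Each outcome is recorded as (probability, posterior of "fair").

# E1: toss the coin of interest itself.
# P(heads) = 1/2*1/2 + 1/2*1 = 3/4, and P(fair | heads) = (1/4)/(3/4) = 1/3.
E1 = [(F(3, 4), F(1, 3)),   # heads
      (F(1, 4), F(1))]      # tails: only a fair coin can land tails

# E2: first toss an auxiliary fair coin; only on heads toss the coin.
E2 = [(F(3, 8), F(1, 3)),   # aux heads, coin heads
      (F(1, 8), F(1)),      # aux heads, coin tails
      (F(1, 2), F(1, 2))]   # aux tails: no information, posterior = prior

def interval(exp):
    """Convex hull (here: an interval) of the possible posteriors."""
    return min(p for _, p in exp), max(p for _, p in exp)

def brier_style_score(exp):
    """Expected value of the convex function c(p) = (p - 1/2)^2, an
    illustrative stand-in for an expected epistemic utility."""
    return sum(w * (p - F(1, 2)) ** 2 for w, p in exp)

print(interval(E1) == interval(E2))                   # True: same interval
print(brier_style_score(E1) > brier_style_score(E2))  # True: E2 is worse
```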

I don’t know what to say about binary experiments and non-binary questions. A necessary condition is containment of posterior intervals for all possible answers to the question. I don’t know if that’s sufficient.

Comparing experiments

When you’re investigating reality as a scientist (and often as an ordinary person) you perform experiments. Epistemologists and philosophers of science have spent a lot of time thinking about how to evaluate what you should do with the results of the experiments—how they should affect your beliefs or credences—but relatively little on the important question of which experiments you should perform epistemologically speaking. (Of course, ethicists have spent a good deal of time thinking about which experiments you should not perform morally speaking.) Here I understand “experiment” in a broad sense that includes such things as pulling out a telescope and looking in a particular direction.

One might think there is not much to say. After all, it all depends on messy questions of research priorities and costs of time and material. But we can at least abstract from the costs and quantify over epistemically reasonable research priorities, and define:

  1. E2 is epistemically at least as good an experiment as E1 provided that for every epistemically reasonable research priority, E2 would serve the priority at least as well as E1 would.

That’s not quite right, however. For we don’t know how well an experiment would serve a research priority unless we know the result of the experiment. So a better version is:

  2. E2 is epistemically at least as good an experiment as E1 provided that for every epistemically reasonable research priority, the expected degree to which E2 would serve the priority is at least as high as the expected degree to which E1 would.

Now we have a question we can address formally.

Let’s try.

  3. A reasonable epistemic research priority is a strictly proper scoring rule or epistemic utility, and the expected degree to which an experiment would serve that priority is equal to the expected value of the score after Bayesian update on the result of the experiment.

(Since we’re only interested in expected values of scores, we can replace “strictly proper” with “strictly open-minded”.)

And we can identify an experiment with a partition of the probability space: the experiment tells us where we are in that partition. (E.g., if you are measuring some quantity to some number of significant digits, the cells of the partition are equivalence classes under equality of the quantity up to that many significant digits.) The following is then easy to prove:

Proposition 1: On definitions (2) and (3), an experiment E2 is epistemically at least as good as experiment E1 if and only if the partition associated with E2 is essentially at least as fine as the partition associated with E1.

A partition R2 is essentially at least as fine as a partition R1 provided that for every event A in R1 there is an event B in the algebra generated by R2 such that with probability one B happens if and only if A happens. The definition is relative to the current credences, which are assumed to be probabilistic. If the current credences are regular—all non-empty events have non-zero probability—then “essentially” can be dropped.
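With regular credences the qualifier drops out and the relation becomes purely set-theoretic: R2 is at least as fine as R1 when every cell of R2 sits inside some cell of R1. A toy sketch of that check:

```python
# Toy sketch of the fineness relation for partitions of a finite space
# (regular credences assumed, so the "essentially" qualifier drops out).
def at_least_as_fine(R2, R1):
    """R2 is at least as fine as R1: each R2-cell fits in an R1-cell."""
    return all(any(B <= A for A in R1) for B in R2)

R1 = [frozenset({1, 2}), frozenset({3, 4})]
R2 = [frozenset({1}), frozenset({2}), frozenset({3, 4})]
print(at_least_as_fine(R2, R1))  # True: R2 refines R1
print(at_least_as_fine(R1, R2))  # False: {1,2} fits in no cell of R2
```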

However, Proposition 1 suggests that our choice of definitions isn’t that helpful. Consider two experiments. On E1, all the faculty members from your Geology Department have their weight measured to the nearest hundred kilograms. On E2, a thousand randomly chosen individuals around the world have their weight measured to the nearest kilogram. Intuitively, E2 is better. But Proposition 1 shows that in the above sense neither experiment is better than the other, since they generate partitions neither of which is essentially finer than the other (the event of there being a member of the Geology Department with weight at least 150 kilograms is in the algebra generated by E1’s partition, but nothing coinciding with that event up to probability zero is in the algebra generated by E2’s). And this is to be expected. For suppose that our research priority is to know whether any members of your Geology Department weigh at least 150 kilograms, because we need to know whether, for a departmental cave-exploring trip, the current selection of harnesses, all of which are rated for users under 150 kilograms, is sufficient. Then E1 is better. On the other hand, if our research priority is to know the average weight of a human being to the nearest ten kilograms, then E2 is better.

The problem with our definitions is that the range of possible research priorities is just too broad. Here is one interesting way to narrow it down. When we are talking about an experiment’s epistemic value, we mean the value of the experiment towards a set of questions. If the set of questions is a scientifically typical set of questions about human population weight distribution, then E2 seems better than E1. But if it is an atypical set of questions about the Geology Department members’ weight distribution, then E1 might be better. We can formalize this, too. We can identify a set Q of questions with a partition of probability space representing the possible answers. This partition then generates an algebra FQ on the probability space, which we can call the “question algebra”. Now we can relativize our definitions to a set of questions.

  4. E2 is epistemically at least as good an experiment as E1 for a set of questions Q provided that for every epistemically reasonable research priority on Q, the expected degree to which E2 would serve the priority is at least as high as the expected degree to which E1 would.

  5. A reasonable epistemic research priority on a set of questions Q is a strictly proper scoring rule or epistemic utility on FQ, and the expected degree to which an experiment would serve that priority is equal to the expected value of the score after Bayesian update on the result of the experiment.

We recover the old definitions by being omnicurious, namely letting Q be all possible questions.

What about Proposition 1? Well, one direction remains: if E2’s partition is essentially at least as fine as E1’s, then E2 is at least as good with regard to any set of questions, and in particular with regard to Q. But what about the other direction? Now the answer is negative. Suppose the question is what the average weight of the six members of the Geology Department is, to the nearest 100 kg. Consider two experiments: on the first, the members are ordered alphabetically by first name, and a fair die is rolled to choose one (if you roll 1, you choose the first, etc.), and their weight is measured. On the second, the same is done but with the ordering being by last name. Assuming the two orderings are different, neither experiment’s partition is essentially at least as fine as the other’s, but the expected contributions of both experiments towards our question are equal.
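The symmetry behind the die-roll example can be sketched directly: with a fair die, the ordering doesn't affect which member gets measured, so both experiments induce the same distribution over posteriors for the average-weight question. The member names below are made up:

```python
from collections import Counter

# Toy sketch of the die-roll example. Member names are invented, and
# the reversed list is just a stand-in for a different ordering.
members = ["Ann", "Bob", "Cy", "Dee", "Ed", "Fay"]
by_first_name = sorted(members)
by_last_name = list(reversed(by_first_name))  # a different ordering

def chosen(ordering, die_roll):
    """Member selected when the die shows die_roll (1 through 6)."""
    return ordering[die_roll - 1]

# Each member is chosen with probability 1/6 under either ordering:
dist1 = Counter(chosen(by_first_name, r) for r in range(1, 7))
dist2 = Counter(chosen(by_last_name, r) for r in range(1, 7))
print(dist1 == dist2)  # True: the two experiments are equally informative
```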

Is there a nice characterization in terms of partitions of when E2 is at least as good as E1 with regard to a set of questions Q? I don’t know. It wouldn’t surprise me if there was something in the literature. A nice start would be to see if we can answer the question in the special case where Q is a single binary question and where E1 and E2 are binary experiments. But I need to go for a dental appointment now.