Suppose a number *Z* is
chosen uniformly randomly in (0, 1]
(i.e., 0 is not allowed but 1 is) and an independent fair coin is
flipped. Then the number *X* is
defined as follows. If the coin is heads, then *X* = *Z*; otherwise, *X* = 2*Z*.

- At *t*_{0}, you have no information about what *Z*, *X* and the coin toss result are, but you know the above setup.
- At *t*_{2}, you learn the exact value of *X*.

Here’s the puzzling thing. At *t*_{2}, when you are informed
that *X* = *x* (for some
specific value of *x*) your
total evidence since *t*_{0} is:

*E*_{x}: Either the coin landed heads and *Z* = *x*, or the coin landed tails and *Z* = *x*/2.

Now, if *x* > 1, then when
you learn *E*_{x}, you know for
sure that the coin was tails.

On the other hand, if *x* ≤ 1, then *E*_{x} gives you no
information about whether the coin landed heads or tails. For *Z* is chosen uniformly and
independently of the coin toss, and so as long as both *x* and *x*/2 are within the range of
possibilities for *Z*, learning
*E*_{x} seems
to tell you nothing about the coin toss. For instance, if you learn:

*E*_{1/4}: Either the coin landed heads and *Z* = 1/4, or the coin landed tails and *Z* = 1/8,

that seems to give you no information about whether the coin landed heads or tails.

Now add one more stage:

- At *t*_{1}, you are informed whether *x* ≤ 1 or *x* > 1.

Suppose that at *t*_{1} what you learn is that
*x* ≤ 1. That is clearly
evidence for the heads hypothesis (since *x* > 1 would conclusively prove
the tails hypothesis). In fact, standard Bayesian reasoning implies you
will assign probability 2/3 to heads
and 1/3 to tails at this point.
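The 2/3 figure can be checked directly. The following is a minimal Python sketch, not part of the original argument: it applies Bayes' rule with the likelihoods implied by the setup and then compares the result with a seeded simulation. The function names are made up for illustration.

```python
from fractions import Fraction
import random

def posterior_heads_given_x_at_most_1():
    """Exact Bayes: P(H | X <= 1) from the likelihoods in the setup."""
    prior_h = Fraction(1, 2)   # fair coin
    prior_t = Fraction(1, 2)
    like_h = Fraction(1)       # heads: X = Z always lies in (0, 1]
    like_t = Fraction(1, 2)    # tails: X = 2Z <= 1 iff Z <= 1/2
    return prior_h * like_h / (prior_h * like_h + prior_t * like_t)

def simulate(n=100_000, seed=0):
    """Monte Carlo estimate of the frequency of heads among runs with X <= 1."""
    rng = random.Random(seed)
    heads_and_small = small = 0
    for _ in range(n):
        z = 1.0 - rng.random()          # uniform on (0, 1]
        heads = rng.random() < 0.5
        x = z if heads else 2 * z
        if x <= 1:
            small += 1
            heads_and_small += heads
    return heads_and_small / small

print(posterior_heads_given_x_at_most_1())   # 2/3
print(round(simulate(), 3))                  # should be close to 0.667
```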

But now we have a puzzle. For at *t*_{1}, you assign credence
2/3 to heads, but the above reasoning
shows you that at *t*_{2}, you will assign
credence 1/2 to heads. For at *t*_{2} your total evidence
since *t*_{0} will be
summed up by *E*_{x} for some
specific *x* ≤ 1 (*E*_{x} already
includes the information given to you at *t*_{1}). And we saw that if
*x* ≤ 1, then *E*_{x} conveys no
evidence about whether the coin was heads or tails, so your credence in
heads at *t*_{2} must be
the same as at *t*_{0}, namely 1/2.

So at *t*_{1} you
assign 2/3 to heads, but you know that
when you receive further more specific evidence, you will move to assign
1/2 to heads. This is counterintuitive,
violates van Fraassen’s reflection principle, and lays you open to a
Dutch Book.
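One way to picture the Dutch Book, as a rough sketch only: suppose that at *t*_{1} a bookie sells you a ticket paying 1 on heads at your fair price of 2/3, and at *t*_{2} buys it back at your new fair price of 1/2. The ticket, the prices, and the function name below are illustrative assumptions, and the sketch covers only the branch where *x* ≤ 1.

```python
from fractions import Fraction

CREDENCE_T1 = Fraction(2, 3)   # credence in heads after learning x <= 1
CREDENCE_T2 = Fraction(1, 2)   # credence in heads after learning X = x

def net_gain(heads: bool) -> Fraction:
    """Agent's net gain from buying a heads ticket at t1 and selling it at t2.

    The ticket would pay 1 if the coin landed heads. The agent regards each
    trade as fair at the time it is made.
    """
    paid_at_t1 = CREDENCE_T1       # fair price by the agent's t1 credence
    received_at_t2 = CREDENCE_T2   # fair price by the agent's t2 credence
    # Once the ticket is sold back, the agent holds no stake in the coin,
    # so `heads` does not affect the result.
    ticket_payout = Fraction(0)
    return received_at_t2 - paid_at_t1 + ticket_payout

print(net_gain(True), net_gain(False))   # -1/6 -1/6
```

Whichever way the coin landed, the agent ends up 1/6 poorer.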

What went wrong? I don’t really know! This has been really puzzling me. I have four solutions, but none makes me very happy.

The first is to insist that *E*_{x} has zero
probability and hence we simply cannot probabilistically update on it.
(At most we can take *P*(*H*|*E*_{x})
to be an almost-everywhere defined function of *x*, but that does not provide a
meaningful result for any particular value of *x*.)

The second is to say that true uniformity of distribution is
impossible. One can have the kind of uniformity that measure theorists
talk about (basically, translation invariance), but that’s not enough to
yield non-trivial comparisons of the probabilities of individual values
of *Z* (we assumed that *x* and *x*/2 were equally likely options for
*Z* if *X* ≤ 1).

The third is some sort of finitist thesis that rules out
probabilistic scenarios with infinitely many possible outcomes, like the
choice of *Z*.

The fourth is to bite the bullet, deny the reflection principle, and accept the Dutch Book.

## 24 comments:

I don't see why we're obligated to allow the reasoning at t_2 to completely supersede that at t_1. If you must frame it in terms of E_X, there's only a 3/4 chance that two Z values are possible, so one should use the information that provides.

Well, the order in which you receive evidence should be irrelevant. Suppose you first get the evidence from t_2: you learn that X=1/4 (say). You learn nothing relevant to the heads/tails question. Next you get the evidence from t_1: you learn that X is at most 1. Does that change anything? No! You already knew that X=1/4. Adding that X is at most 1 doesn't give you anything more. (I am assuming, as is usual, logical omniscience.)

My point is that we shouldn't be restricted to just one way of looking at the data we have. The calculation at t_1 is still valid, while the calculation at t_2 fails to take into account that two distinct Z values are possible, which is only true 3/4 of the time. So the latter seems faulty.

Look at it as a revision of probabilities using Bayes' rule. At t_1, the probability of H is 2/3, so that is the prior prob at t_2. Let f(z)=1 be the density of Z.

P(H|X=1/4) = P(H & Z=1/4)/(P(H & Z=1/4) + P(T & Z=1/8))

=2/3 / (2/3 + 1/3) = 2/3.

So the added information at t_2 does not give us any more useful info. If a calculation at t_2 gives P(H|X=1/4)=1/2, it has actually *discarded* info.

It's interesting that doing a similar ratio of densities calculation at t_2, but in terms of X rather than of Z, gives the correct answer. It seems that the difficulty is tied to the change of variables which occurs only on T. Is it in some sense slightly "harder" to get X=1/4 with T as opposed to heads, because multiplying by 2 spreads out the inevitable error? That at least seems to be the philosophy behind densities.

Btw, do you know any statisticians or probabilists? The ones I know have stopped answering my email...

Alex: Yes, I was wondering about that…

Your next post, The Joy of Error Bars, gives one response. Conditionalizing on an interval, however narrow, straightforwardly gives the right conditional probabilities. You could argue that you won’t in practice know x exactly, or perhaps that you can’t know it even in principle, because causal finitism or your own finiteness or some such would rule out learning an infinite number of decimal places. Note that this isn’t quite your third option – it doesn’t necessarily deny that x could be generated, just that you could come to know its precise value.

Another response: When you are told the setup (which includes the fact that you will be told the precise value of x), you adopt a policy: if the precise value of x is greater than 1, take P(H) as zero; if it’s less than or equal to 1, take P(H) as 2/3. (Take the probabilities for T as 1 minus these probabilities). Evaluate bets using these probabilities.

This may look ad hoc but isn’t. The values in the policy are those of the conditional distribution function for H given x. Operationally, they are used to decide a betting policy – for which values of x, if any, should you accept a bet? To evaluate such a policy, you calculate your expected gain by integrating p(x) * ((p(H|x) * payoff on H + p(T|x) * payoff on T)) over the values of x for which you accept the bet. [p(x), p(H|x) and p(T|x) are the distribution functions and conditional distribution functions.] Clearly, the best betting policy is to accept for those values of x for which the conditional expectation (as calculated above from the conditional distribution functions) is positive.

Two policies and an integral may look like a rigmarole, but the result is to justify using the conditional distribution functions in the obvious way, and without trying to conditionalize on zero probability events.
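To make the policy comparison concrete, here is a small Python sketch under the setup in the post. Since p(x) and p(H|x) are constant on each of the two regions, the integral reduces to a two-term sum. The payoffs (+1 on heads, -3/2 on tails) and all names are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# Piecewise-constant quantities for the setup in the post.
# Each region maps to (probability that X lands in it, P(H | x) on it).
REGIONS = {
    "x<=1": (Fraction(3, 4), Fraction(2, 3)),
    "x>1": (Fraction(1, 4), Fraction(0)),
}

def conditional_expectation(p_heads, payoff_h, payoff_t):
    """Expected payoff of the bet given the conditional probability of heads."""
    return p_heads * payoff_h + (1 - p_heads) * payoff_t

def expected_gain(accept, payoff_h, payoff_t):
    """Expected gain of the policy that accepts the bet on the regions in `accept`."""
    return sum(
        mass * conditional_expectation(p_heads, payoff_h, payoff_t)
        for name, (mass, p_heads) in REGIONS.items()
        if name in accept
    )

payoff_h, payoff_t = Fraction(1), Fraction(-3, 2)   # hypothetical bet
print(expected_gain({"x<=1"}, payoff_h, payoff_t))          # 1/8
print(expected_gain({"x<=1", "x>1"}, payoff_h, payoff_t))   # -1/4
```

With these payoffs, accepting only when x is at most 1 has positive expected gain, while accepting always does not.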

Andrew:

"two distinct Z values are possible".

We want to be careful with that line of reasoning.

Here's a variant case. Let Z be the same as before. On heads, still let X=Z, but now on tails do the following. If Z≤1/2, let X=2Z; otherwise, let X=2Z-1. Let's say you observe X=1/4. Now, you say: on heads there was only one way of getting 1/4: Z=1/4, but on tails there were two ways of getting 1/4: Z=1/8 and Z=5/8. So we probably had tails.

Density based reasoning is also potentially problematic. Consider the following variant situation. On heads, you let X=Z. On tails, you let X=Z unless Z=1/4, in which case you let X=0. The conditional density on tails and the conditional density on heads are exactly the same, because densities don't care about a change at one point (or on any set of measure zero). Suppose you learn that X=1/4. Density-based reasoning would tell you that you should have equal credence in heads as in tails. But obviously you know that it was heads. This is a case where the exact information rightly overrides the inexact.

Ian:

You can also run the same argument using a strictly proper scoring rule and get a non-pragmatic argument.

That said, there are infinitely many policies that have the same expected gain as your recommended one. Any policy that differs from your recommended policy on a set of measure zero has exactly the same expected gain as your policy.

--

The problem in all of this stuff is that conditional probabilities are only defined modulo sets of measure zero. As a result, in cases where measure zero stuff is relevant (as when you update on zero probability observations) it's really up for grabs what is recommended to be done. We might opt for the expected-value maximizing policy that has the nicest continuity properties--that is basically what Andrew and you are recommending--but there does not seem to be a justification for that within decision theory.

And there are cases where opting for the policy that has the nicest continuity properties is wrong. Take my last example in my response to Andrew. The expected-value maximizing policy with the nicest continuity properties is to bet as if the probability of heads were 1/2. But it would be really stupid to do that if you learned that X=1/4.

We might try an ad hoc modification. Opt for the expected-value maximizing policy with the nicest continuity properties and that gets right cases where we would be _certain_ of the relevant information. That takes care of the example in my response to Andrew. But I can modify my example. Suppose that on tails, if you have Z=1/4, an independent fair d1000 is rolled, and on 1 we let X=Z, and on everything else we let X=1/8. Then the policy to bet as if the probability of heads were 1/2 maximizes expected value, is maximally continuous, and gets right all the cases of certainty, because X=1/4 and X=1/8 are not cases of certainty. But it is clear that on X=1/4, you shouldn't bet as if the probability of heads were 1/2.

Of course, one can make further ad hoc modifications.

Here's another way to see the issue at t2 in my original case. When you learn that X=1/4, it is tempting to say: OK, I can't conditionalize on that, but I can conditionalize on X being in (1/4-u,1/4+u) for a small positive u, and then I can take the limit as u goes to zero. And then indeed you get the intuitive result. That's basically what the density based solution does.

But why conditionalize on X being in (1/4-u,1/4+u) instead of conditionalizing on some other event that contains 1/4? We could conditionalize on X being in (1/4-u,1/4+u) union (5/4,(5/4)+2u), and then take the limit as u goes to zero. In both cases, we are taking a limit of conditionalizations on events that contain 1/4, and whose intersection is 1/4. But we get different answers in the two cases. One is simpler, admittedly, but it is not clear why this should matter here.
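The two limits can be computed exactly. The sketch below is only an illustration with made-up helper names: it conditionalizes on X lying in a union of intervals, using the fact that X is uniform on (0,1] on heads and uniform on (0,2] on tails.

```python
from fractions import Fraction

def overlap(a, b, lo, hi):
    """Length of the intersection of (a, b) with (lo, hi]."""
    return max(Fraction(0), Fraction(min(b, hi)) - Fraction(max(a, lo)))

def prob_heads_given(intervals):
    """P(H | X in the union of the given disjoint intervals), computed exactly."""
    half = Fraction(1, 2)
    # Heads: X = Z is uniform on (0, 1], density 1.
    p_h = half * sum(overlap(a, b, 0, 1) for a, b in intervals)
    # Tails: X = 2Z is uniform on (0, 2], density 1/2.
    p_t = half * sum(overlap(a, b, 0, 2) / 2 for a, b in intervals)
    return p_h / (p_h + p_t)

q = Fraction(1, 4)
for u in (Fraction(1, 10), Fraction(1, 1000)):
    narrow = [(q - u, q + u)]
    with_far_piece = [(q - u, q + u), (Fraction(5, 4), Fraction(5, 4) + 2 * u)]
    # Prints 2/3 for the first family and 1/2 for the second, for every u.
    print(u, prob_heads_given(narrow), prob_heads_given(with_far_piece))
```

Both families of events shrink to the single point 1/4, yet the conditional probability of heads stays at 2/3 along one and at 1/2 along the other.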

Alex: I think you missed my point about there being two possible Z values. I meant that this happens iff X<=1, and P(H|X<=1) = 2/3. There's no reason that we have to forget that upon learning additional information. Bad calculations based on incomplete analysis of a situation do not have to be respected by decision theory.

Seriously, do you know any PhDs in probability or statistics? We're the blind leading the blind, and it seems like a real expert could clear all this up in seconds.

Btw, this is like the Monty Hall problem in reverse. There, additional useful information is learned but most people miss it; here, no additional useful info is learned at t_2, but the specificity of the useless info makes people fixate on it.

Alex: I agree that it seems strange that densities make everything come out right, even when we're dealing with singleton events. But one aspect of densities is that making a change of variables Y = aX means that f_y = f_x / a, so density encodes the inherently lower chance of an RV with greater range taking a specific value.

And when you calculate the conditional probability using Z, without the factor of 1/2 from the change of variable, you are forgetting a key feature of the experiment: that on T, Z is doubled. That calculation represents an experiment in which a coin is flipped and a Z value is generated, and then the event

H&Z=1/2 or T&Z=1/4

is reported. But that's not the same experiment. Information has been made to disappear under our very eyes.

Andrew:

"Seriously, do you know any PhDs in probability or statistics?"

To some degree myself? Sorry for having to flex, but my math dissertation has some stuff on Brownian motion and on random walks, and more than half of the publications arising from my math grad student research are in probability (mostly on sums of random variables). :-)

"density encodes the inherently lower chance of an RV with greater range taking a specific value": But it doesn't always have an inherently lower chance. Take the version from my earlier post, where Z ranges over [0,1) instead of (0,1], and suppose we learn that X=0. Then on tails we have a greater range, namely [0,2), while on heads we have the smaller range [0,1). But X=0 if and only if Z=0 if and only if 2Z=0, and Z=0 and 2Z=0 had better have the same probability.

In my opinion, the problem is not mathematical but philosophical. The mathematical stuff seems all straightforward---though with the conditional probabilities only defined up to measure zero sets. The problem is normative: What *should* an agent do when faced with the information at t1 and t2?

"Sorry for having to flex"

No worries! I know you were trained as a mathematician, as was I, but having written papers that touch on a subject is not quite the same as devoting oneself to it for years. I'm really curious what a working statistician would say about these questions. I suspect they already have answers worked out - these questions are not deep.

"...Z=0 and 2Z=0 had better have the same probability."

Yes, but when you change variables from X to Z you must replace f_X by f_Z/2 in the case of T. If you don't, you are computing the probability for a *different* experiment.

"What *should* an agent do when faced with the information at t1 and t2?"

That's obvious, isn't it? Bet on H. We know that P(H|X<1)=2/3 from multiple other arguments, including the density ratio calculation using X. Clearly the calculation using Z is incorrect. I don't think an incorrect calculation is a normative "problem".

Alex: Another response: where you have non-conglomerability, you have to take care to specify the setup precisely.

Suppose the setup was as in the post, except that instead of being told the precise value of X, you would be told whether X was (for example) precisely 0.4 or not. Then, if you were told that it was (which would happen with at most infinitesimal probability), and you thought that each precise value of Z was equally (infinitesimally) likely, you would be right to conditionalize as in the post and say P(H)=0.5. There would be no Dutch book – you might have a sure loss on X=0.4, but that would have only infinitesimal probability.

But the setup in the post is different: you will be told the precise value of X, *whatever it turns out to be*. It’s true that, conditional on each possible value of X (in the sense of the previous paragraph), P(H)=0.5. But you can’t assume conglomerability, so you can’t deduce that P(H)=0.5. You would have no reason to change from your previous P(H)=2/3.

I've changed my mind - I'm not sure a working statistician would have anything useful to say. Because they work with quantities derived from the real world, which always have error bars.

Another way to look at this experiment is that, apart from whether X<=1 or not, the precise value of X, and therefore of Z, gives no useful information about H v. T, because the Z value has no causal connection to the coin flip. And by focusing on the Z values, we're distracting ourselves from the information we do have. I'd suggest that a sharper example of the problem would be an experiment where the *distributions* of X under H and T are different but have the same *range*.

This is also relevant to the lottery examples in Alex's book on infinity. I've been working on a paper in response to that.

To help see what I'm getting at in my thinking here, I see a great disconnect between the pointwise and intervalwise behavior of probabilities. Distributions and densities have to do with intervalwise behavior. But the two are quite different.

Suppose X is uniformly distributed over [0,1) in the classic mathematical sense--this is "intervalwise" behavior, namely that the probability of X being in any interval of the same length is the same, at least if we neglect infinitesimal differences (which classical probability does).

But this classical uniform distribution is quite compatible with all sorts of non-uniform hypotheses about X's pointwise behavior. For instance, it could be that X simply *cannot* have the value 1/2 (maybe X is generated with a spinner, but when the spinner lands on 1/2, X is defined to be 1/4; otherwise, X just gets the spinner value).

However, let's now add that X is uniform in the pointwise sense: namely, all single points are exactly on par probabilistically. I mean this latter statement in a way that's stronger than just the classical observation that every point has zero probability. Maybe every point has the same non-zero infinitesimal probability. Or maybe we have a primitive conditional probability function P according to which P({x}|{x,y}) = 1/2 whenever x and y are distinct points. Or maybe we have a qualitative probability comparison on which all singletons are on par. Or maybe it's just an intuitive statement. Whatever it is, the pointwise uniformity is supposed to rule out cases such as the one where X cannot have the value 1/2.

Now, we can tweak the intervalwise behavior of X and the pointwise behavior of X independently. Let Y = X^2. Then intervalwise, Y is non-uniform (it's more likely to be in [0,1/2) than in [1/2,1)). But every point of [0,1) is on par for Y just as every point was on par for X. Indeed, for any point y of [0,1), the event Y=y is equivalent to the event X=y^(1/2), and these events are all on par no matter what the value of y is.
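The intervalwise non-uniformity of Y is easy to quantify. This is a minimal Python sketch with a made-up function name: it uses only the fact that Y < y exactly when X < sqrt(y), and that X is uniform on [0,1) in the classical sense.

```python
import math

def prob_y_in(a, b):
    """P(a <= Y < b) for Y = X**2 with X uniform on [0, 1)."""
    # Y < y iff X < sqrt(y), and P(X < s) = s for s in [0, 1].
    return math.sqrt(b) - math.sqrt(a)

lower_half = prob_y_in(0.0, 0.5)
upper_half = prob_y_in(0.5, 1.0)
print(round(lower_half, 4), round(upper_half, 4))   # 0.7071 0.2929
```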

Conversely, we can change the pointwise behavior of X without changing the intervalwise behavior of X. One example would be to say that if X=1/2, then a fair coin is tossed, and if it's heads, then Z=1/2, and if it's tails, then Z=1/4, and if X is other than 1/2, then Z=X. Then Z is intervalwise uniform, but not pointwise (1/2 is half as likely as any other point), even though every point of [0,1) is possible.

Or consider this fun construction. If X is less than 1/2, let W=2X; otherwise, let W=2X-1. Then W is uniformly distributed both pointwise AND intervalwise. But W's pointwise distribution differs from X's, because each point in [0,1) is twice as likely to be hit by W as by X. (You get W=1/4 when X=1/8 or X=5/8, etc.)
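The "two ways to hit each point" claim can be spelled out in a few lines. This is just an illustrative sketch with hypothetical function names; it lists the two values of X that produce a given W and checks that they map back to it.

```python
from fractions import Fraction

def w_from_x(x):
    """The map from the construction: W = 2X if X < 1/2, else W = 2X - 1."""
    return 2 * x if x < Fraction(1, 2) else 2 * x - 1

def preimages(w):
    """Both x in [0, 1) with w_from_x(x) == w, for w in [0, 1)."""
    return [w / 2, (w + 1) / 2]

w = Fraction(1, 4)
print(preimages(w))                          # [Fraction(1, 8), Fraction(5, 8)]
print([w_from_x(x) for x in preimages(w)])   # [Fraction(1, 4), Fraction(1, 4)]
```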

Mathematical and statistical probability theorists ignore what I am calling the "pointwise distribution". I am inclined to think that in the final analysis they may be right to do so.

'Mathematical and statistical probability theorists ignore what I am calling the "pointwise distribution". I am inclined to think that in the final analysis they may be right to do so.'

Agreed. For example, a permutation of the points in [0,1] will not in general preserve the original distribution; but under the pointwise point of view, it should. It's interesting that the uniform distribution on the interval is dependent on the geometry.

One could imagine permutation invariant distributions, but they would be very strange. Actually your r.n.g. for a uniform distribution on N seems to be permutation invariant.

The geometry point is important. Pointwise uniformity doesn't care about geometry.

BTW, no finitely additive probability measure on N and defined on all subsets is permutation invariant.

Proof: Let E, O, Q0 and Q2 be the even numbers, the odd numbers, the numbers divisible by four, and the numbers whose remainder is two when divided by four. For any pair of these sets, there is a permutation of N that maps each member of the pair onto the other. Thus a finitely additive permutation invariant probability measure P on N must assign equal probabilities to all four sets. Since P(E)+P(O)=1, we must have P(E)=1/2, and hence P(Q0)=1/2=P(Q2). But this is impossible since P(E)=P(Q0)+P(Q2).

All we can have is invariance under some more restricted collection of symmetries.

See also: https://arxiv.org/abs/2010.07366

Thanks for the link, I'll check out that paper.

Yes, a probability distribution or even charge on N which is permutation invariant would have to have a very restricted sigma-algebra. However that seems to be what you produced in your book.

Alex,

Are you thinking what I'm thinking? That if your construction really gives a permutation invariant r.n.g., then that could be a knock down argument against infinitary causality?

I'm worried though that the case for _full_ permutation invariance isn't firm. That is, it's clear that a permutation would not affect the behavior of individual n's; but it could still conceivably change the behavior of _subsets_ of N. It's hard to pin down.

Which random number generator is it, anyway? I give several in the book. And all I argue is for pointwise invariance, and in some cases not even that (because to get the paradoxes, I don't need pointwise invariance, but only that each point have infinitesimal probability).

I'm thinking of the one for selecting a natural number with uniform probability, where you start by generating a random real and then find its equivalence class according to the tail of the binary expansion (similar to Vitali sets). Superficially it seemed like it might be permutation invariant, but upon further reflection it seems unlikely.
