This builds on, but also goes back over the ground of, my previous post.
I've been playing with voting methods, or as I might prefer to call them "utility estimate aggregation methods." My basic model is that there are n options (say, candidates) to choose between and m evaluators ("voters"). The evaluators would like to choose the option that has the highest utility. Unfortunately, the actual utilities of the options are not known, and all we have are estimates of the utilities by all the evaluators.
A standard method for this is the Condorcet method. An option is a Condorcet winner provided that it "beats" every other option, where an option x "beats" an option y just in case a majority of the evaluators estimate x more highly than y. If there is no Condorcet winner, there are further resolution methods, but I will only be looking at cases where there is a Condorcet winner.
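For concreteness, here is a minimal Python sketch of the Condorcet-winner test, assuming the evaluators' estimates are arranged as a score matrix (the matrix layout is just an illustrative convention; only the rank order within each evaluator's row matters):

```python
import numpy as np

def condorcet_winner(scores):
    """Return the index of the Condorcet winner, or None if there is none.

    scores[i, j] is evaluator i's estimate of option j; only the rank
    order within each evaluator's row matters.
    """
    m, n = scores.shape
    for x in range(n):
        # x wins if a majority of evaluators rank x above y, for every other option y
        if all(np.sum(scores[:, x] > scores[:, y]) > m / 2
               for y in range(n) if y != x):
            return x
    return None
```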
My first method is
- Method A: Estimate each option's utility with the arithmetical average of the reported utilities assigned to it by all the evaluators, and choose the option with the highest estimated utility.
(I will be ignoring tie-resolution in this post, because all the utilities I will work with are continuously distributed real numbers, so the probability of a tie will be zero.) This method can be proved to maximize epistemically expected utility under the
- Basic Setup: Each evaluator's reported estimate of each option's utility is equal to the actual utility plus an error term. The error terms are (a) independent of the actual utilities and (b) normally distributed with mean zero. Moreover, (c) our information as to the variances of the error terms is symmetric between the evaluators, but need not be symmetric between the options (thus, we may know that option 3 has a higher variance in its error terms than option 7; we may also know that some evaluators have a greater variance in their error terms; but we do not know which evaluators have a greater variance than which).
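In code, Method A is just an average over evaluators. A minimal sketch, assuming the reports are arranged in the same evaluator-by-option layout as the score matrix above:

```python
import numpy as np

def method_a(reports):
    """Method A: estimate each option's utility as the arithmetical average of
    the evaluators' reported utilities and choose the option with the highest
    estimate. reports[i, j] is evaluator i's reported utility for option j."""
    return int(np.argmax(reports.mean(axis=0)))
```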
Unfortunately, it is really hard to estimate absolute utility numbers. It is a lot easier to rank order utilities. And that's all Condorcet needs. So in that way at least, Condorcet is superior to Method A. To fix this, modify the Basic Setup to:
- Modified Setup: Just like the Basic Setup, except that what is reported by each evaluator is not the actual utility plus error term, but the rank order of the actual utility plus error term.
In particular, we still assume that beneath the surface—perhaps implicitly—there is a utility estimate subject to the same conditions. Our method now is
- Method B: Replace each evaluator's rank ordering with roughly estimated Z-scores by using the following algorithm: a rank of k (between 1 and n, with 1 being the evaluator's top choice) is transformed to f((n+1/2−k)/n), where f is the inverse of the standard normal cumulative distribution function. Each option's utility is then estimated as the arithmetical average of the roughly estimated Z-scores across the evaluators, and the option with the highest estimated utility is chosen.
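A minimal sketch of Method B in the same layout, with SciPy's norm.ppf playing the role of f:

```python
import numpy as np
from scipy.stats import norm, rankdata

def method_b(reports):
    """Method B: convert each evaluator's ranking to rough Z-scores and average.

    Only the rank order within each evaluator's row is used, so this works in
    the Modified Setup, where absolute utilities are not reported.
    """
    m, n = reports.shape
    # rank 1 = the evaluator's highest report, rank n = the lowest
    ranks = n + 1 - rankdata(reports, axis=1)
    # rough Z-score for rank k: inverse standard normal CDF of (n + 1/2 - k)/n
    z = norm.ppf((n + 0.5 - ranks) / n)
    return int(np.argmax(z.mean(axis=0)))
```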
Now time for some experiments. Add to the Basic Setup the assumptions that (d) the actual utilities in the option pool are normally distributed with mean zero and variance one, and (e) the variances of all the evaluators' error terms are equal to 1/4 (i.e., standard deviation 1/2). All the experiments use 2000 runs. Because I developed this when thinking about grad admissions, the cases that interest me most are ones with a small number of evaluators and a large number of options, which is the opposite of how political cases work (though unlike in admissions, I am simplifying by looking for just the best option).
Moreover, it doesn't really matter whether we choose the optimal option. What matters is how close the actual utility of the chosen option is to the actual utility of the optimal option. The difference in these utilities will be called the "error". If the error is small enough, there is no practically significant difference. Given the normal distribution of option utilities, about 95% of actual utilities are between -2 and 2, so if we have about 20 options, we can expect the best option to have a utility on the order of 2, and choosing at random would then give us an average error on the order of 2. The tables below give the average errors over the 2000 runs of the experiments. Finally, so as to avoid having to choose between different Condorcet resolution methods, I am discarding data from runs in which there was no Condorcet winner, and hence comparing Method A and Method B to Condorcet at its best (interestingly, Method A and Method B also work less well when there is no Condorcet winner). Discarded runs were approximately 2% of runs. Source code is available on request.
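For what it's worth, here is a simplified sketch of how a single experiment can be coded, reusing the condorcet_winner, method_a and method_b sketches above and assumptions (d) and (e); it is a reconstruction of the setup just described, not the exact code behind the numbers below:

```python
import numpy as np

def run_experiment(n_evaluators, n_options, runs=2000, error_sd=0.5, seed=0):
    """Average error of Condorcet, Method A and Method B, discarding runs
    with no Condorcet winner, under assumptions (a)-(e) above."""
    rng = np.random.default_rng(seed)
    errors = {"Condorcet": [], "Method A": [], "Method B": []}
    for _ in range(runs):
        utilities = rng.normal(0.0, 1.0, n_options)                     # (d)
        noise = rng.normal(0.0, error_sd, (n_evaluators, n_options))    # (a)-(c), (e)
        reports = utilities + noise
        winner = condorcet_winner(reports)
        if winner is None:   # discard runs with no Condorcet winner
            continue
        best = utilities.max()
        errors["Condorcet"].append(best - utilities[winner])
        errors["Method A"].append(best - utilities[method_a(reports)])
        errors["Method B"].append(best - utilities[method_b(reports)])
    return {k: float(np.mean(v)) for k, v in errors.items()}

# e.g. run_experiment(3, 50) corresponds to Experiment 1 below
```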
Experiment 1: 3 evaluators, 50 options.
| Method | Average error |
| --- | --- |
| Condorcet | 0.030 |
| Method A | 0.023 |
| Method B | 0.029 |
So, with a small number of evaluators and a large number of options, Method A significantly beats Condorcet. Method B slightly beats Condorcet.
Experiment 2: 50 evaluators, 50 options.
| Method | Average error |
| --- | --- |
| Condorcet | 0.0017 |
| Method A | 0.0011 |
| Method B | 0.0015 |
So we have a similar distribution of values, but of course with a larger number of evaluators, the error is smaller. It is interesting, however, that even with only three evaluators, the error was already pretty small, about 0.03 sigma for all the methods.
Experiment 3: 3 evaluators, 3 options.
| Method | Average error |
| --- | --- |
| Condorcet | 0.010 |
| Method A | 0.007 |
| Method B | 0.029 |
Method B is much worse than Condorcet and Method A in this case. That's because with three options, the naive Z-score estimation method in Method B fails miserably. With 3 options, Method B is equivalent to a very simple method we might call Method C, where we simply average the rank order numbers of the options across the evaluators. At least with 3 options, that is a bad way to go. Condorcet is much better, and Method A is even better if it is workable.
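One can check the equivalence numerically: with n = 3, the rough Z-scores assigned to ranks 1, 2 and 3 are a decreasing linear function of the rank, so averaging the Z-scores across evaluators orders the options exactly as averaging the raw ranks does:

```python
from scipy.stats import norm

# Rough Z-scores for ranks 1, 2, 3 when n = 3: a linear function of the rank,
# so Method B reduces to Method C (averaging raw ranks) in this case.
print([round(norm.ppf((3 + 0.5 - k) / 3), 2) for k in (1, 2, 3)])
# [0.97, 0.0, -0.97]
```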
Experiment 4: 50 evaluators, 3 options.
| Method | Average error |
| --- | --- |
| Condorcet | 0.0003 |
| Method A | 0.0002 |
| Method B | 0.0159 |
The badness of Method B for a small number of options really comes across here. Condorcet and Method A really benefit from boosting the number of evaluators, but with only 3 options, Method B works miserably.
So, one of the interesting consequences is that Method B is strongly outperformed by Condorcet when the number of options is small. How small? A bunch of experiments suggests that it's kind of complicated. For three evaluators, Method B catches up with Condorcet at around 12 options. Somewhat surprisingly, for a greater number of evaluators, it needs more options for Method B to catch up with Condorcet. I conjecture that Method B works better than Condorcet when the number of options is significantly greater than the number of evaluators. In particular, in political cases where the opposite inequality holds, Condorcet far outperforms Method B.
One could improve on Method B, whose Achilles' heel is the Z-score estimation, by having the evaluators include in their rankings options that are not presently available. One way to do that would be to increase the size of the option pool by including fake options. (In the case of graduate admissions, one could include a body of fake applications generated by a service.) Another way would be to include options from past evaluations (e.g., applicants from previous years). These would then enter into the Z-score estimation, thereby improving Method B significantly. The downside is that it would be a lot more work for the evaluators, making this unworkable.
Method A is subject to extreme evaluator manipulation, i.e., "strategic voting". Any evaluator can produce any result she desires simply by reporting utilities extreme enough to swamp the utilities reported by the others. (The Basic Setup's description of the errors rules this out.) Method B is subject to more moderate evaluator manipulation. Condorcet, I am told, does fairly well. If anything like Method A is used, what is absolutely required is a community of justified mutual trust and reasonableness. Such mutual trust does, however, make possible noticeably better joint choices, which is an interesting result of the above.
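To illustrate the sort of swamping Method A allows, here is a toy example with made-up numbers: two honest evaluators prefer option 0, but a single manipulator reporting extreme utilities determines the outcome.

```python
import numpy as np

# Two honest evaluators mildly prefer option 0.
honest = np.array([[1.0, 0.2, 0.1],
                   [0.9, 0.3, 0.0]])
# One manipulator reports extreme utilities for and against.
manipulator = np.array([[-1000.0, 1000.0, 0.0]])
reports = np.vstack([honest, manipulator])
print(np.argmax(reports.mean(axis=0)))  # 1: the manipulator's favourite wins
```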
So, yes, in situations of great trust where all evaluators can accurately report their utility estimates, we can beat Condorcet by adopting Method A. But that's a rare circumstance. In situations of moderate trust and where the number of candidates exceeds the number of evaluators, Method B might be satisfactory, but its benefits over Condorcet are small.
One interesting method that I haven't explored numerically would be this:
- Method D: Have each evaluator assign a numerical evaluation to each option on a fixed scale (say, integers from 1 to 50). Convert the numerical evaluations to Z-scores, using data from the evaluator's present and past evaluations and some good statistical method. Average these estimated Z-scores across evaluators and choose the option with the highest average.
Under appropriate conditions, this method should converge to Method A over time in the Modified Setup. There would be possibilities for manipulation, but they would require planning ahead, beyond the particular evaluation (e.g., one could keep all one's evaluations in a small subset of the scale, and then when one really wants to make a difference, one jumps outside of that small subset).
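Here is a minimal sketch of what Method D might look like, with a simple mean-and-standard-deviation standardization standing in for "some good statistical method" (the data layout and parameter names are just illustrative):

```python
import numpy as np

def method_d(current, history):
    """Method D sketch: standardize each evaluator's fixed-scale scores
    against that evaluator's own present and past scores, then average the
    resulting Z-scores across evaluators.

    current[i, j]: evaluator i's score (e.g. an integer from 1 to 50) for
    option j in the present evaluation.
    history[i]: array of evaluator i's scores from past evaluations.
    """
    z = np.empty_like(current, dtype=float)
    for i, past in enumerate(history):
        # pool the evaluator's present and past scores to estimate their
        # personal scale (a crude stand-in for a better statistical model)
        pooled = np.concatenate([past, current[i]])
        z[i] = (current[i] - pooled.mean()) / pooled.std()
    return int(np.argmax(z.mean(axis=0)))
```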