
Exploring Simpson's Paradox

Consider the case where Jane is better at task 1 than Kevin, and Jane is also better at
task 2 than Kevin. It would seem impossible for Kevin to be better than Jane when
the results of the two tasks are combined, and yet it is possible!
Let us explore this paradoxical situation mathematically. We will count the number of
successes and failures for each person on each task. We will use the obvious
notation:
S = Successes; F = Failures
J = Jane; K = Kevin
1 = task 1; 2 = task 2
Thus SK2 is the number of Successes of Kevin in task 2.
To set up the first condition of Jane being better at task 1 than Kevin we could write
$$\frac{S_{J1}}{S_{J1} + F_{J1}} > \frac{S_{K1}}{S_{K1} + F_{K1}}$$

Effectively we are comparing the per-unit success rate of each person. But we could invert
the formula to
$$\frac{S_{J1} + F_{J1}}{S_{J1}} < \frac{S_{K1} + F_{K1}}{S_{K1}}$$

Then simplify by division, followed by subtraction of 1 from each side to give


$$\frac{F_{J1}}{S_{J1}} < \frac{F_{K1}}{S_{K1}}$$

And finally invert again.


$$\frac{S_{J1}}{F_{J1}} > \frac{S_{K1}}{F_{K1}}$$

We could have set up our initial condition this way in the first place, but we have seen that
the two forms are equivalent in any case. Clearly a similar condition applies to the second
task as well.
$$\frac{S_{J2}}{F_{J2}} > \frac{S_{K2}}{F_{K2}}$$

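The paradox says that given the two initial conditions

$$\frac{S_{J1}}{F_{J1}} > \frac{S_{K1}}{F_{K1}} \quad\text{and}\quad \frac{S_{J2}}{F_{J2}} > \frac{S_{K2}}{F_{K2}}$$

it is still possible to find values such that

$$\frac{S_{J1} + S_{J2}}{F_{J1} + F_{J2}} < \frac{S_{K1} + S_{K2}}{F_{K1} + F_{K2}}$$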

which we will refer to as a Simpson Event. At this point a computer simulation can give a
quick answer as to whether or not this is possible.
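
A concrete set of counts also makes the claim easy to check by hand. The following is a minimal C++ sketch, not the author's code: the `Tally` type, the function names and the example counts in the note below are assumptions introduced here for illustration. Ratios are compared by cross-multiplication so that no division is needed.

```cpp
#include <cassert>

// Successes and failures of one person on one task (hypothetical helper type).
struct Tally {
    long long successes;
    long long failures;
};

// True when a's success-to-failure ratio strictly exceeds b's,
// i.e. a.S / a.F > b.S / b.F, written without division.
bool betterThan(const Tally& a, const Tally& b) {
    assert(a.successes >= 0 && a.failures >= 0);
    assert(b.successes >= 0 && b.failures >= 0);
    return a.successes * b.failures > b.successes * a.failures;
}

// Pool the results of two tasks by adding the counts.
Tally combine(const Tally& x, const Tally& y) {
    return {x.successes + y.successes, x.failures + y.failures};
}

// Jane wins task 1 and task 2, yet Kevin wins on the pooled counts.
bool isSimpsonEvent(const Tally& j1, const Tally& j2,
                    const Tally& k1, const Tally& k2) {
    return betterThan(j1, k1) && betterThan(j2, k2) &&
           betterThan(combine(k1, k2), combine(j1, j2));
}
```

For example, take Jane with 9 successes and 1 failure on task 1 and 30 successes and 70 failures on task 2, and Kevin with 80 and 20, then 2 and 8. Jane wins each task (9/1 > 80/20 and 30/70 > 2/8) yet loses on the combined counts (39/71 < 82/28), so `isSimpsonEvent(Tally{9, 1}, Tally{30, 70}, Tally{80, 20}, Tally{2, 8})` returns true.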

Leslie Green CENG MIEE

1 of 11

Nov 2014

Because we have 8 variables the overall event space is unpleasantly large. In order to get
an idea of what is going on we can take SJ1, SJ2, SK1, SK2, FJ1, FJ2, FK1 and FK2 as
uniformly distributed random variables to sample the event space. It should be evident by
symmetry that the test

$$\frac{S_{J1}}{F_{J1}} > \frac{S_{K1}}{F_{K1}}$$

will reject half of all sets of values. Likewise the test

$$\frac{S_{J2}}{F_{J2}} > \frac{S_{K2}}{F_{K2}}$$

will independently reject another half of all sets of values.

Therefore, on average, only one quarter (1/4) of the random tests will fulfil our initial conditions. However,
what we are actually interested in is the proportion of results which, having fulfilled our
initial conditions, go on to create Simpson Events. The answer, after hundreds of millions
of samples, is around 1.9% (see an example C++ program in Appendix 1).
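
The sampling procedure described above might look like the following sketch. It is not the Appendix 1 program: the function name, the `SimResult` type, the sampling range [0, 1) and the choice of the Mersenne Twister generator are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Counters returned by the simulation (hypothetical helper type).
struct SimResult {
    std::uint64_t conditionsMet;   // samples where Jane wins both tasks
    std::uint64_t simpsonEvents;   // of those, samples where Kevin wins overall
};

// Draw all eight success/failure values uniformly and count Simpson Events.
SimResult sampleCounts(std::uint64_t samples, std::uint32_t seed) {
    assert(samples > 0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);  // assumed range
    SimResult r{0, 0};
    for (std::uint64_t i = 0; i < samples; ++i) {
        const double sj1 = uni(rng), fj1 = uni(rng);
        const double sk1 = uni(rng), fk1 = uni(rng);
        const double sj2 = uni(rng), fj2 = uni(rng);
        const double sk2 = uni(rng), fk2 = uni(rng);
        // Initial conditions, cross-multiplied: SJ/FJ > SK/FK on each task.
        if (sj1 * fk1 <= sk1 * fj1) continue;
        if (sj2 * fk2 <= sk2 * fj2) continue;
        ++r.conditionsMet;
        // Simpson Event: Kevin's pooled ratio exceeds Jane's pooled ratio.
        if ((sk1 + sk2) * (fj1 + fj2) > (sj1 + sj2) * (fk1 + fk2)) {
            ++r.simpsonEvents;
        }
    }
    return r;
}
```

Dividing `simpsonEvents` by `conditionsMet` gives the conditional proportion discussed in the text, while `conditionsMet` itself should come out close to a quarter of the samples.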
The computer simulation is not very satisfying in terms of explaining what is going on, but
a graphical explanation is helpful at this point. We can consider the result of the first task
for Jane as the directed line segment from (0,0) to (FJ1, SJ1). We do the same for the
second task and for Kevin's results as well. The slope of the line segments measures the
successfulness on the task, a steeper slope being more successful.

[Figure: directed line segments J1, K1, J2 and K2 drawn from the origin; horizontal axis: failures, vertical axis: successes.]


The directed line segments can be summed as vectors. The dotted line resultants clearly
show that the blue slope is steeper than the red slope, meaning that Kevin is more
successful than Jane overall.

[Figure: vector sums of J1 and J2 and of K1 and K2, with dotted resultants; horizontal axis: failures, vertical axis: successes.]

We can now do a second computer simulation (see Appendix 2), based along the lines of
the graphical representation above. We can define J1, J2, K1 and K2 by uniformly
distributed random angles (measured anti-clockwise from the failures axis) and uniformly
distributed random lengths.
The initial conditions are then

$$\angle J_1 > \angle K_1 \quad\text{and}\quad \angle J_2 > \angle K_2$$

where $\angle J_1$ is read as the angle of J1.

A Simpson Event occurs when

$$\angle (K_1 + K_2) > \angle (J_1 + J_2)$$

The successes and failures are now encoded within the vectors so we have to extract their
values using elementary trigonometry. The successes are the magnitude of the vector
times the sine of the angle; the failures are the magnitude times the cosine of the angle.
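
The vector formulation can be sketched in C++ as follows. This is an illustrative sketch rather than the Appendix 2 program: the `Vec` type, the function names, and the choice of lengths uniform in [0, 1) and angles uniform in [0, 90 degrees) are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// A result vector: x component is failures, y component is successes.
struct Vec {
    double failures;
    double successes;
};

// Angle is in radians, measured anti-clockwise from the failures axis.
Vec fromPolar(double magnitude, double angle) {
    assert(magnitude >= 0.0);
    return {magnitude * std::cos(angle), magnitude * std::sin(angle)};
}

double angleOf(const Vec& v) { return std::atan2(v.successes, v.failures); }

Vec add(const Vec& a, const Vec& b) {
    return {a.failures + b.failures, a.successes + b.successes};
}

// Jane's vectors are steeper on each task, Kevin's resultant is steeper overall.
bool isSimpsonEventVec(const Vec& j1, const Vec& j2, const Vec& k1, const Vec& k2) {
    return angleOf(j1) > angleOf(k1) && angleOf(j2) > angleOf(k2) &&
           angleOf(add(k1, k2)) > angleOf(add(j1, j2));
}

struct VecSimResult {
    std::uint64_t conditionsMet;
    std::uint64_t simpsonEvents;
};

VecSimResult sampleVectors(std::uint64_t samples, std::uint32_t seed) {
    std::mt19937 rng(seed);
    const double halfPi = std::acos(0.0);
    std::uniform_real_distribution<double> angle(0.0, halfPi);
    std::uniform_real_distribution<double> length(0.0, 1.0);  // assumed range
    // Draw the length first, then the angle, so the order is well defined.
    auto randomVec = [&]() {
        const double m = length(rng);
        const double a = angle(rng);
        return fromPolar(m, a);
    };
    VecSimResult r{0, 0};
    for (std::uint64_t i = 0; i < samples; ++i) {
        const Vec j1 = randomVec();
        const Vec j2 = randomVec();
        const Vec k1 = randomVec();
        const Vec k2 = randomVec();
        if (!(angleOf(j1) > angleOf(k1) && angleOf(j2) > angleOf(k2))) continue;
        ++r.conditionsMet;
        if (angleOf(add(k1, k2)) > angleOf(add(j1, j2))) ++r.simpsonEvents;
    }
    return r;
}
```

As a check on the trigonometry, a vector of magnitude 10 at the angle whose tangent is 6/8 decodes to 6 successes and 8 failures.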


This time, after hundreds of millions of tests, we find that the Simpson Events now occur
3.8% of the time after the initial conditions are satisfied. This is almost double our previous
result of 1.9%. What happened?
In the original testing the length of the vector was determined by the square root of the
sum of the squares of two uniformly distributed random variables. The resulting distribution
is fairly peaky.
[Figure: Distribution of Pythagorean Length of two uniformly distributed Random Variables.]

In the original testing the vector angle was determined from the arctangent of the ratio of
two uniformly distributed random variables. Again the distribution is quite peaky.
[Figure: Distribution of ArcTan of ratio of two uniformly distributed Random Variables; horizontal axis: angle (degrees), 0 to 90.]
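
The peaked shapes of both distributions can be reproduced with a pair of simple histograms. The sketch below is an assumption-laden illustration, not the author's plotting code: the function names, the bin widths and the [0, 1) sampling range are chosen here for convenience.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>

// Histogram of atan2(successes, failures) in degrees: 9 bins of 10 degrees.
std::array<std::size_t, 9> angleHistogram(std::size_t samples, unsigned seed) {
    assert(samples > 0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const double pi = std::acos(-1.0);
    std::array<std::size_t, 9> bins{};
    for (std::size_t i = 0; i < samples; ++i) {
        const double f = uni(rng);
        const double s = uni(rng);
        const double deg = std::atan2(s, f) * 180.0 / pi;
        std::size_t b = static_cast<std::size_t>(deg / 10.0);
        if (b > 8) b = 8;  // guard the upper edge
        ++bins[b];
    }
    return bins;
}

// Histogram of sqrt(s^2 + f^2): 15 bins of width 0.1 covering [0, 1.5).
std::array<std::size_t, 15> lengthHistogram(std::size_t samples, unsigned seed) {
    assert(samples > 0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::array<std::size_t, 15> bins{};
    for (std::size_t i = 0; i < samples; ++i) {
        const double f = uni(rng);
        const double s = uni(rng);
        std::size_t b = static_cast<std::size_t>(std::hypot(s, f) / 0.1);
        if (b > 14) b = 14;  // guard the upper edge
        ++bins[b];
    }
    return bins;
}
```

With these assumptions the angle histogram is fullest around 45 degrees and the length histogram is fullest just below a length of 1, which is the non-uniformity that separates the two simulations.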


But we still have no intuition about what is needed to create a Simpson Event. The next
simulation uses a brute-force search over the angles of J1, J2, K1 and K2, then samples
random magnitudes to see how the angles of J1 and J2 affect the results. The C++ source code is
shown in Appendix 3 and creates the data file simpson.txt used by gnuplot.
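
One way such a search could be organised is sketched below. It is not the Appendix 3 code: the function name, the whole-degree angle grid, the [0, 1) magnitude range and the `samplesPerPair` parameter are assumptions made for illustration. For one fixed pair of Jane's angles it steps Kevin's angles through every whole degree below them, so the initial conditions always hold, and samples random magnitudes at each step.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

// Percentage of sampled cases that are Simpson Events for fixed angles of
// J1 and J2, given in whole degrees from 0 to 90.
double simpsonPercent(int angleJ1Deg, int angleJ2Deg,
                      int samplesPerPair, std::uint32_t seed) {
    assert(0 <= angleJ1Deg && angleJ1Deg <= 90);
    assert(0 <= angleJ2Deg && angleJ2Deg <= 90);
    assert(samplesPerPair > 0);
    const double degToRad = std::acos(-1.0) / 180.0;
    const double j1 = angleJ1Deg * degToRad;
    const double j2 = angleJ2Deg * degToRad;
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> length(0.0, 1.0);
    std::uint64_t tests = 0;
    std::uint64_t events = 0;
    for (int k1Deg = 0; k1Deg < angleJ1Deg; ++k1Deg) {
        for (int k2Deg = 0; k2Deg < angleJ2Deg; ++k2Deg) {
            const double k1 = k1Deg * degToRad;
            const double k2 = k2Deg * degToRad;
            for (int n = 0; n < samplesPerPair; ++n) {
                const double mJ1 = length(rng);
                const double mJ2 = length(rng);
                const double mK1 = length(rng);
                const double mK2 = length(rng);
                // Resultant vectors: failures on x, successes on y.
                const double jF = mJ1 * std::cos(j1) + mJ2 * std::cos(j2);
                const double jS = mJ1 * std::sin(j1) + mJ2 * std::sin(j2);
                const double kF = mK1 * std::cos(k1) + mK2 * std::cos(k2);
                const double kS = mK1 * std::sin(k1) + mK2 * std::sin(k2);
                // Kevin's resultant is steeper: kS/kF > jS/jF, cross-multiplied.
                if (kS * jF > jS * kF) ++events;
                ++tests;
            }
        }
    }
    // No admissible Kevin angle exists when either of Jane's angles is 0.
    return tests == 0 ? 0.0 : 100.0 * events / tests;
}
```

Calling this for every pair of angles from 0 to 90 degrees yields one percentage per grid point. When Jane's two angles are equal her resultant keeps that same angle, so Kevin's resultant can never be steeper and the function returns 0.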
[Figure: 3D surface plot titled "Simpson's Paradox" showing % Simpson Events (0 to 25) against angle J1 and angle J2 (each 0 to 90).]


The above plot was created with gnuplot 4.6.6 using the commands:
reset
cd 'c:\'
set xyplane 0
set view equal xy
set view 44,345
set xlabel "angle J2"
set ylabel "angle J1"
set zlabel "% Simpson Events" offset -3
set grid xtics ytics
set style data lines
set hidden3d
set lmargin at screen 0.12
splot [0:90][0:90][0:25] 'simpson.txt' title "Simpson's Paradox"

Now we can really see how the angles of J1 and J2 affect the probabilities. The chance of seeing a
Simpson Event is maximised when one of the trials is very successful and the other is very
unsuccessful.
We have gone from a Simpson Event being very unlikely (1.9%) to actually being quite
likely (up to 20%) depending on the success rates of the trials.
At the moment the data is still encoded by vector angles, a somewhat unusual method of
rating trial success rates. For completeness the data has been recalculated for Jane's win
(success) rate on both trials (see Appendix 4).
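
The link between a vector angle and a win rate is simple: the magnitude cancels, leaving win rate = sin(angle) / (sin(angle) + cos(angle)). The sketch below shows the conversion in both directions. The function names are hypothetical, and Appendix 4 may well compute the win rates differently, for example directly from trial totals.

```cpp
#include <cassert>
#include <cmath>

// Convert a vector angle (degrees from the failures axis) to a win percentage.
double winPercentFromAngle(double angleDeg) {
    assert(angleDeg >= 0.0 && angleDeg <= 90.0);
    const double a = angleDeg * std::acos(-1.0) / 180.0;
    const double s = std::sin(a);  // proportional to successes
    const double f = std::cos(a);  // proportional to failures
    return 100.0 * s / (s + f);
}

// Convert a win percentage back to a vector angle in degrees.
double angleFromWinPercent(double percent) {
    assert(percent >= 0.0 && percent <= 100.0);
    return std::atan2(percent, 100.0 - percent) * 180.0 / std::acos(-1.0);
}
```

An angle of 45 degrees corresponds to a 50% win rate, and the mapping is monotonic, which is why the two surface plots look broadly alike.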

The resulting graph is not a great deal different to the vector success angle plot, but we
have come full circle to the original percentage success relation (see footnote), the most natural way to
present the data.
[Figure: 3D surface plot titled "Simpson's Paradox" showing % Simpson Events (0 to 25) against Jane %win 1 and Jane %win 2 (each 0 to 100).]


At the start of this brief exploration it may have seemed unlikely, if not impossible, for
Simpson Events to occur. Now, having reduced the 8-variable problem down to a 3D
graph, it is hopefully clear that Simpson Events are not really that rare after all, depending
on the initial conditions. Any time results are combined from an easy task (high success
rate) and a difficult task (low success rate) Simpson Events are not at all unlikely.
Any results-combination methodology which can result in a Simpson Event is
fundamentally flawed as it is introducing an unfairness in the comparison. In the source
code for Appendix 4, if JT1 and KT1 are made equal, and JT2 and KT2 are also made
equal, Simpson Events no longer occur. To keep a level playing field for Jane and Kevin
the results for task 1 need to be scaled to ensure the tested populations are the same.
Likewise for task 2.
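
The effect of equal populations can be checked exhaustively for small totals. The sketch below is illustrative only: the function name is hypothetical, and the parameters `jt1`, `kt1`, `jt2` and `kt2` are assumed to play the role of the trial totals JT1, KT1, JT2 and KT2 mentioned above. It enumerates every possible number of successes and counts the Simpson Events, comparing success rates by cross-multiplication.

```cpp
#include <cassert>

// Count Simpson Events over all outcomes for the given numbers of attempts:
// Jane makes jt1 and jt2 attempts at tasks 1 and 2, Kevin makes kt1 and kt2.
long long countSimpsonEvents(int jt1, int kt1, int jt2, int kt2) {
    assert(jt1 > 0 && kt1 > 0 && jt2 > 0 && kt2 > 0);
    long long events = 0;
    for (long long sj1 = 0; sj1 <= jt1; ++sj1) {
        for (long long sk1 = 0; sk1 <= kt1; ++sk1) {
            // Jane's rate on task 1 must exceed Kevin's: sj1/jt1 > sk1/kt1.
            if (sj1 * kt1 <= sk1 * jt1) continue;
            for (long long sj2 = 0; sj2 <= jt2; ++sj2) {
                for (long long sk2 = 0; sk2 <= kt2; ++sk2) {
                    // Jane's rate on task 2 must exceed Kevin's.
                    if (sj2 * kt2 <= sk2 * jt2) continue;
                    // Simpson Event: Kevin's pooled rate exceeds Jane's.
                    const long long jTotal = jt1 + jt2;
                    const long long kTotal = kt1 + kt2;
                    if ((sk1 + sk2) * jTotal > (sj1 + sj2) * kTotal) ++events;
                }
            }
        }
    }
    return events;
}
```

With equal totals, for instance `countSimpsonEvents(10, 10, 7, 7)`, the count is zero: winning each task then means having strictly more successes on each task, so Jane's pooled successes exceed Kevin's over the same pooled total. With unequal totals such as `countSimpsonEvents(10, 100, 100, 10)` the count is positive.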
Whilst we have been comparing abstract successes for Jane and Kevin, this trial
methodology actually has serious real world implications when we are talking about
employee evaluations, comparisons of munitions, and effectiveness of medical treatment
programs. Inappropriate aggregation techniques can inadvertently (or deliberately!) skew
the findings in an unscientific and unhelpful (for the user) manner.

Footnote: Actually it was a per-unit success relation presented on the first page.


Whilst it is true that scaling the populations for task 1 and task 2 fixes the Simpson Event
issue, that does not automatically make a trial comparison fair. Consider this data:

Drug   Trial 1 success          Trial 2 success       Mean
       (on 1,000,000 people)    (on 10,000 people)
J      36%                      30%                   33%
K      30%                      46%                   38%

Although this is not a Simpson setup, it would be absurd to report as a scientific finding that
drug K has a better mean success rate than drug J. The success rates could be weighted
by population size to get a better statistical measure, but a better approach in this case would be to figure out
why drug K is so much better for the trial 2 population.
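
The population-weighted mean mentioned above can be sketched as follows. The function name is hypothetical and the code is only an illustration of the arithmetic, applied to the figures in the table.

```cpp
#include <cassert>

// Mean of two success percentages, weighted by the size of each trial.
double weightedMeanPercent(double percent1, double people1,
                           double percent2, double people2) {
    assert(people1 >= 0.0 && people2 >= 0.0 && people1 + people2 > 0.0);
    return (percent1 * people1 + percent2 * people2) / (people1 + people2);
}
```

For the table above, `weightedMeanPercent(36, 1000000, 30, 10000)` gives about 35.94% for drug J and `weightedMeanPercent(30, 1000000, 46, 10000)` gives about 30.16% for drug K, reversing the order suggested by the unweighted means of 33% and 38%.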


Appendix 1: C++ source code example 1: sampling successes and failures


Appendix 2: C++ source code example 2: sampling success and failure vectors


Appendix 3: C++ source code example 3: brute force vector success condition, followed by sampling


Appendix 4: C++ source code example 4: brute force percentage successes, followed by sampling

