# Simpson's Paradox

*First published Mon Feb 2, 2004; substantive revision Fri Apr 1, 2016*

Consider the following story:

The label of each packet of FIXIT-Y capsules from the (imaginary) Globalfixit Pharmaceuticals Ltd. carries the following recommended usage:

Recommended for males and females with Condition Y, but not recommended for people with Condition Y.

As the fine print on the label goes on to explain:

Clinical trials using FIXIT-Y showed a higher percentage of recoveries from Y among men who took it than among men who took placebos, and similarly for women. But in the total population the group taking placebos had the higher recovery rate overall. You can trust FIXIT to deliver evidence-based pharmacology.

The company also markets FIXIT-Z capsules. The label on these carries the advice that Z-capsules are recommended for people who suffer from Z, but not for males and not for females. As the fine print on the label goes on to explain:

Clinical trials using FIXIT-Z showed that people taking it had higher recovery rates compared with those who took placebos. But both males and females who took placebos had higher recovery rates compared with the males and females who took FIXIT-Z. You can trust FIXIT to deliver evidence-based pharmacology.

While no capsule can be good for men and for women yet bad for people, or good for people while being bad for men and for women, the imagined data (see below) on which FIXIT based its recommendations exhibit patterns that are both arithmetically possible and turn up in actual data sets. While there is nothing paradoxical about the existence of such data from the standpoint of arithmetic, they do pose problems for practical decision making (e.g., would you want to be treated with FIXIT’s capsules in light of the reported clinical trials?), for heuristics used in intuitive reasoning about probabilities, for inferences from data to causal relations, and, more generally, for philosophical programs that aim to eliminate or reduce causation to regularities and relations between probabilities.

The arithmetic on which examples like FIXIT’s ill-judged recommendations are based is unproblematic. In summary, it rests on the fact that

An association between a pair of variables can consistently be inverted in each subpopulation of a population when the population is partitioned, and conversely, associations in each subpopulation can be inverted when data are aggregated.

Call this principle Simpson’s Reversal of Inequalities. Failure to recognize such reversals can lead to the above-mentioned pitfalls about what to do, what to believe, what to infer, and what causes what. Even when actual and possible reversals are recognized, pitfalls remain. On a positive note, once the possibilities of Simpson’s Reversals are recognized, they provide a rich resource for constructing causal models that help to explain many facts that appear at first to be anomalous. Moreover, there is a test called the “back-door criterion” (Pearl 1993) which can be used to help resolve the question of whether one should base a decision on the statistics from the aggregate population or from the partitioned subpopulations.

Section 1 provides a brief history of Simpson’s Paradox, a statement and diagnosis of the arithmetical structures that give rise to it, and the boundary conditions for its occurrence. Section 2 examines patterns of invalid reasoning that have their sources in Simpson’s Paradox and possible ways of countering its effects. A particularly important case where Simpson’s Paradox has been invalidly employed is discussed in Section 3. It has been mooted that paradoxical data provide counter-examples to the Sure Thing Principle in theories of rational choice. Why such data appear to provide counter-examples to the Sure Thing Principle is explained, and the appearance that they do so is dispelled. Section 4 discusses the roles and implications of paradoxical data for theories of causal inference and for analyses of causal relations in terms of probabilities. While the conclusions of this section are largely negative, Section 5 illustrates how apparently paradoxical data can support causal models for the evolution of traits that at first appear to be incompatible with a setting in which natural selection disadvantages individuals that exhibit the traits.

- 1. Simpson’s Paradox: Its History, Diagnosis, and Boundary Conditions
- 2. Simpson’s Reversals of Inequalities as Sources of Invalid Reasoning
- 3. Do Paradoxical Data Provide Counter-examples to the Sure Thing Principle?
- 4. Simpson’s Reversals of Inequalities, Correlations, and Causation
- 5. Simpson’s Reversal of Inequalities in Evolutionary Settings
- Bibliography
- Academic Tools
- Other Internet Resources
- Related Entries

## 1. Simpson’s Paradox: Its History, Diagnosis, and Boundary Conditions

### 1.1 History

In a seminal paper published in 1951, E. H. Simpson drew attention to
a simple fact about fractions that has a wide variety of surprising
applications (Simpson 1951). The applications arise from the close
connections between proportions, percentages, probabilities, and their
representations as fractions. While statisticians in the early 20th
Century had known of the problems for statistics to which Simpson drew
attention, it was his witty and surprising illustrations of them that
earned them the title of being paradoxical (cf. Yule 1903). In 1934,
Morris Cohen and Ernst Nagel introduced philosophers to one aspect of
the problems posed by paradoxical data. They cited actual death rates
in 1910 from tuberculosis in Richmond, Virginia and New York, New
York that verified the following propositions (Cohen &
Nagel
1934):^{[1]}

1. The death rate for African Americans was *lower* in Richmond than in New York.
2. The death rate for Caucasians was *lower* in Richmond than in New York.
3. The death rate for the total combined population of African Americans and Caucasians was *higher* in Richmond than in New York.

They next posed two questions about the data concerning mortality
rates: “Does it follow that tuberculosis *caused* [italics
added] a greater mortality in Richmond than in New York…” and
“…are the two populations that are compared really
*comparable*, that is, homogeneous?” (Cohen & Nagel 1934).
After posing the questions, they left it as an exercise for the reader
to answer them. Following the publication of Simpson’s paper,
statisticians initiated a lively debate about the significance of facts
like those that are verified by the tables Cohen and Nagel cited. The
debate sought constraints on statistical practice that would avoid
conundrums arising from actual and possible paradoxical data. However,
this debate did not address the first question posed by Cohen and Nagel
concerning causal inference. As Judea Pearl notes in his survey of the
statistical literature on Simpson’s paradox, statisticians had an
aversion to talk of causal relations and causal inference that was
based on the belief that the concept of causation was unsuited to and
unnecessary for scientific methods of inquiry and theory construction
(Pearl 2000, 173–181).

Philosophical interest in Simpson’s paradox was rekindled by Nancy Cartwright’s use of it in support of her claims that appeals to causal laws and causal capacities are required by scientific inquiry and by theories of rational choice (Cartwright 1979). She aimed to show that reliance on regularities and frequencies on which probability judgments can be based is not sufficient for representing causal relations. In particular, tests of scientific theories and philosophical analyses of causation and causal inference need to provide answers to questions like those posed by Cohen and Nagel: e.g., is it possible that tuberculosis caused greater mortality in Richmond than New York even if the mortality rates for each sub-population classified by race appear to suggest otherwise? If causal relations track regularities, what system of causal relations can achieve such effects? Once representations of causal relations that provide answers to questions like those posed by Cohen and Nagel are at hand, the representations turn out to have interpretations that provide causal models for a range of interesting and puzzling phenomena. These include causal models for the evolution of altruism as a stable trait in a population even though altruistic acts disadvantage those who perform them and advantage their competitors. (See Sober 1993, and Sober & Wilson 1998, which develop these themes in detail in the areas of population biology and sociobiology.) Examples of such models are formulated and discussed in Section 5.

### 1.2 What is Simpson’s Paradox: A Diagnosis

For some whole numbers we may have:

\[\begin{align} a/b &\lt A/B, \\ c/d &\lt C/D, \text{ and}\\ (a + c)/(b + d) &\gt (A + C)/(B + D). \end{align}\]
Call this a *Simpson’s Reversal of Inequalities*.
Below is an instructive illustration; the table following the example displays
the arithmetical inequalities on which it is based.

The following interpretation of the structure illustrates why it can
give rise to perplexity. The example is loosely based on a
discrimination suit that was brought against the University of
California, Berkeley (see Bickel *et al*. 1975).

Suppose that a University is trying to discriminate in favour of women when hiring staff. It advertises positions in the Department of History and in the Department of Geography, and only those departments. Five men apply for the positions in History and one is hired, and eight women apply and two are hired. The success rate for men is twenty percent, and the success rate for women is twenty-five percent. The History Department has favoured women over men. In the Geography Department eight men apply and six are hired, and five women apply and four are hired. The success rate for men is seventy-five percent and for women it is eighty percent. The Geography Department has favoured women over men. Yet across the University as a whole 13 men and 13 women applied for jobs, and 7 men and 6 women were hired. The success rate for male applicants is greater than the success rate for female applicants.

|  | Men |  | Women |
|---|---|---|---|
| History | 1/5 | \(\lt\) | 2/8 |
| Geography | 6/8 | \(\lt\) | 4/5 |
| University | 7/13 | \(\gt\) | 6/13 |

How can it be that each Department favours women applicants, and yet overall men fare better than women? There is a ‘bias in the sampling’, but it is not easy to see exactly where this bias arises. There were 13 male and 13 female applicants: equal sample sizes for both groups. Geography and History had 13 applicants each: equal sample sizes again. Nor does the trouble lie in the fact that the samples are small: multiply all the numbers by 1000 and the puzzle remains. Then the reversal of inequalities becomes fairly robust: you can add or subtract quite a few from each of those thousands without disturbing the Simpson’s Reversal.
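The reversal in this example is easy to verify mechanically. The following minimal sketch checks the department-level and pooled success rates from the story, using exact rational arithmetic:

```python
from fractions import Fraction

# (hired, applicants) per department, from the example.
men   = {"History": (1, 5), "Geography": (6, 8)}
women = {"History": (2, 8), "Geography": (4, 5)}

def rate(hired, applied):
    return Fraction(hired, applied)

# Women fare better in each department ...
for dept in ("History", "Geography"):
    assert rate(*men[dept]) < rate(*women[dept])

# ... yet men fare better in the pooled, university-wide figures.
men_total   = rate(sum(h for h, _ in men.values()), sum(a for _, a in men.values()))
women_total = rate(sum(h for h, _ in women.values()), sum(a for _, a in women.values()))
assert men_total > women_total
print(men_total, women_total)  # 7/13 6/13
```

Using `Fraction` rather than floating point keeps the comparisons exact, which matters when the fractions involved are close together.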

The key to this puzzling example lies in the fact that *more
women are applying for jobs that are harder to get*. It is harder
to make your way into History than into Geography. (To get into
Geography you just have to be born; to get into History you have to do
something memorable.) Of the women applying for jobs, more are applying
for jobs in History than in Geography, and the reverse is true for men.
History hired only 3 out of 13 applicants, whereas Geography hired 10
out of 13 applicants. Hence the success rate was much higher in
Geography, where there were more male applicants.

### 1.3 Boundary Conditions for Simpson’s Reversals

Simpson’s Reversal of Inequalities occurs for a wide range of values that can be substituted for \(a\), \(b\), \(c\), \(d\), \(A\), \(B\), \(C\), \(D\) in the above schema. The values fall within a broad band that lies between two extremes:

On one extreme, *slightly* more women are applying for jobs
that are *much* harder to get.

|  | Men |  | Women |
|---|---|---|---|
| History | 1/45 | \(\lt\) | 5/55 |
| Geography | 50/55 | \(\lt\) | 45/45 |
| University | 51/100 | \(\gt\) | 50/100 |

On the other extreme, *many* more women are applying for jobs
that are *slightly* harder to get.

|  | Men |  | Women |
|---|---|---|---|
| History | 4/5 | \(\lt\) | 90/95 |
| Geography | 94/95 | \(\lt\) | 5/5 |
| University | 98/100 | \(\gt\) | 95/100 |
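Both extremes can be checked with the same few lines of arithmetic, and a uniform rescaling of every count (a fact noted in the text) leaves the reversal intact. A minimal sketch:

```python
from fractions import Fraction

def reversal(mh, wh, mg, wg):
    """True iff women lead in each department while men lead overall.
    Arguments are (hired, applicants) pairs: men/women x History/Geography."""
    per_dept = Fraction(*mh) < Fraction(*wh) and Fraction(*mg) < Fraction(*wg)
    men   = Fraction(mh[0] + mg[0], mh[1] + mg[1])
    women = Fraction(wh[0] + wg[0], wh[1] + wg[1])
    return per_dept and men > women

# Extreme 1: slightly more women apply for much harder jobs.
assert reversal((1, 45), (5, 55), (50, 55), (45, 45))
# Extreme 2: many more women apply for slightly harder jobs.
assert reversal((4, 5), (90, 95), (94, 95), (5, 5))
# Uniformly rescaling every count (here by 1000) never disturbs a reversal.
assert reversal((1000, 45000), (5000, 55000), (50000, 55000), (45000, 45000))
```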

Further, the numerators and denominators of fractions that instantiate the schematic pattern can be uniformly multiplied by any positive number without perturbing the relations between the fractions. Fractions that exhibit these patterns correspond to percentages and probabilities. In their probabilistic form, Colin Blyth provides the following boundary conditions for Simpson’s Reversals (Blyth 1972). Let ‘\(P\)’ represent a probability function, and take conditional probabilities to be ratios of unconditional probabilities in accordance with their orthodox definition; i.e., reading the ‘\(\mid\)’ in the context \(P(-\mid\ldots)\) as ‘given that’,

\[ P(A\mid B) = P(A \amp B)/P(B), \text{ provided that } P(B) \text{ is positive.} \]

Blyth notes that from a mathematical standpoint, subject to the conditions

\[\begin{align} P(A\mid B\amp C) &\ge \delta \cdot P(A\mid {\sim}B\amp C) \\ P(A\mid B\amp {\sim}C) &\ge \delta \cdot P(A\mid {\sim}B\amp {\sim}C) \end{align}\]

with \(\delta \ge 1\), it is possible to have

\[ P(A\mid B) \approx 0 \text{ and } P(A\mid {\sim}B) \approx 1/\delta . \]

On the assumption that the propositions of arithmetic are necessary, these possibilities are tantamount to existence conditions in arithmetic. The schema:

[If it is possible that \(A\) is necessary, then \(A\)]

is valid in a large family of modal logics. The boundary conditions for Simpson’s Reversals allow that any probabilistic association between \(A\) and \(B\) can be inverted in some further partition of \(B\). From the standpoint of arithmetic there is a partition \(\{C, {\sim}C\}\) within which associations between \(A\) and \(B\) are inverted. An important related consequence is that it is always mathematically possible to provide some condition or factor \(C\) that renders \(A\) probabilistically independent of \(B\) when \(C\) is conjoined with \(B\) as a condition on \(A\) and with \({\sim}B\) as a condition on \(A\). These facts of arithmetic carry no empirical significance by themselves. However, they do have methodological significance insofar as substantive empirical assumptions are required to identify salient partitions for making inferences from statistical and probability relationships.
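Blyth’s boundary conditions can be illustrated numerically. All of the probabilities below are assumed for illustration (with \(\delta = 1\) and deliberately skewed distributions of \(C\) given \(B\) and given \({\sim}B\)): inside each cell of \(\{C, {\sim}C\}\), \(B\) is weakly favourable for \(A\), yet \(P(A\mid B)\) is close to 0 while \(P(A\mid {\sim}B)\) is close to 1.

```python
# Illustrative numbers (assumed, with delta = 1): the skewed distribution
# of C given B and given ~B drives the aggregate probabilities apart.
p_C_given_B, p_C_given_notB = 0.999, 0.001

p_A = {  # P(A | B?, C?) for each of the four cells
    (True, True): 0.01,  (False, True): 0.009,   # within C:  B favourable
    (True, False): 1.0,  (False, False): 0.99,   # within ~C: B favourable
}

def p_A_given(b):
    """P(A | B=b) by the law of total probability over {C, ~C}."""
    pc = p_C_given_B if b else p_C_given_notB
    return p_A[(b, True)] * pc + p_A[(b, False)] * (1 - pc)

# Blyth's cell-wise conditions hold with delta = 1 ...
assert p_A[(True, True)] >= p_A[(False, True)]
assert p_A[(True, False)] >= p_A[(False, False)]
# ... yet the aggregate association is dramatically reversed.
assert p_A_given(True) < 0.02 and p_A_given(False) > 0.98
```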

The need for substantive empirical assumptions arises in settings where there are instances of arithmetical possibilities that are marked out by Simpson’s Reversals in urn models and in possible and actual empirical settings. For example, consider an urn model for our story about the success rates for job applicants. The model consists of twenty-six balls. Each ball is labeled with one element from each of the sets \(\{M, {\sim}M\}, \{H, {\sim}H\}\), and \(\{S, {\sim}S\}\); e.g., a given ball might be labeled \([{\sim}M, H, {\sim}S]\). Assume that the labels are distributed to correspond to the distributions of job applicants. In trials of drawing balls from the urn with replacement, the associations between the \(M\)’s, \(H\)’s, and \(S\)’s in the sub-populations, and the reverse association between \(M\)’s and \(S\)’s in the overall population, are resilient. The resilient associations are due only to the structure of the model and do not have any causal significance. By way of contrast, substantive assumptions are required to draw inferences in other cases.
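A simulation sketches how resilient the urn model’s associations are under repeated draws with replacement. The 26 labelled balls below follow the distribution of applicants in the hiring example; the seed and the number of draws are arbitrary choices:

```python
import random
random.seed(0)  # arbitrary seed; any seed exhibits the same pattern

# One ball per applicant, labelled (gender, department, success),
# matching the distributions in the hiring example.
balls = ([("M", "H", "S")] * 1 + [("M", "H", "~S")] * 4 +     # men, History: 1/5
         [("~M", "H", "S")] * 2 + [("~M", "H", "~S")] * 6 +   # women, History: 2/8
         [("M", "~H", "S")] * 6 + [("M", "~H", "~S")] * 2 +   # men, Geography: 6/8
         [("~M", "~H", "S")] * 4 + [("~M", "~H", "~S")] * 1)  # women, Geography: 4/5
assert len(balls) == 26

draws = [random.choice(balls) for _ in range(100_000)]  # draws with replacement

def success_rate(gender, dept=None):
    pool = [b for b in draws if b[0] == gender and dept in (None, b[1])]
    return sum(b[2] == "S" for b in pool) / len(pool)

# The within-department and aggregate associations are resilient:
assert success_rate("M", "H") < success_rate("~M", "H")
assert success_rate("M", "~H") < success_rate("~M", "~H")
assert success_rate("M") > success_rate("~M")
```

With this many draws the estimated rates sit close to the true proportions (e.g., the aggregate male rate near 7/13), so the reversal survives sampling noise.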

Patterns in data that fall within the boundary conditions for
Simpson’s Reversals of Inequalities can raise problems for testing and
evaluating empirical hypotheses, e.g., testing the effectiveness and
safety of medical procedures. A course of treatment for a malady that
affects the staff of History and Geography can be correlated with a
*lower* death rate for treated compared with untreated patients
in History, and a *lower* death rate for treated compared with
untreated patients in Geography; yet, the course of treatment may
nevertheless correlate with a *higher* death rate when treated
patients are compared with untreated patients overall. Conversely, a
treatment can be correlated with higher mortality rates in each
sub-population, while it is correlated with a lower mortality rate in
the total population. In such cases it is far from clear what, if
anything, to conclude from the correlations about the effectiveness and
safety of the
treatment.^{[2]}
Moreover, with patterns like those surmised
for this example, different ways of partitioning the *same* data
can produce different correlations that appear to be incompatible with
the correlations under the initial way of partitioning the data. E.g.,
under a partition by academic discipline, patients appear to fare worse
when treated, even though there can be a positive correlation in the
total population between treatments and recoveries. This is consistent
with a positive correlation between treatments and recoveries when the
population is partitioned by gender. While Historians and Geographers
each fare worse given the treatment, males and females from the two
Departments can each fare better given the treatment, and these facts
are consistent with the combined population faring better, or with the
combined population faring
worse.^{[3]}

The aforementioned possibilities are due to the fact that the following formulae are collectively consistent. Take ‘\(P\)’ to be a probability function. Probability models can be provided that verify the consistency of the set consisting of the following formulae:

\[\begin{align} P(A\mid B) &\gt P(A\mid {\sim}B) \\ P(A\mid B \amp C) &\lt P(A\mid {\sim}B \amp C) \\ P(A\mid B \amp {\sim}C) &\lt P(A\mid {\sim}B \amp {\sim}C) \\ P(A\mid B \amp D) &\gt P(A\mid {\sim}B \amp D) \\ P(A\mid B \amp {\sim}D) &\gt P(A\mid {\sim}B \amp {\sim}D) \\ \end{align}\]

Similar inequalities are possible with signs reversed, and equalities that represent probabilistic independence are consistent with positive and/or negative associations in partitions of the populations. These facts are not paradoxical from an arithmetical point of view. However, regularities that can be represented by them cannot all be assigned causal significance, and probabilistic equalities that are sufficient for probabilistic independence cannot all be taken to represent causal independence.

Standard statistical methods for significance testing offer no
insurance against conflicting results when data are partitioned or
consolidated. In a setting where the effectiveness of a new medical
treatment is under test, the following data support rejecting the null
hypothesis, at the .05 level, that treatment \(T\) makes no difference
to recovery \(R\), where the alternative to the null hypothesis is that
treatment is *favorable* for
recovery.^{[4]}

|  | \(R\) | \({\sim}R\) |
|---|---|---|
| \(T\) | 369 | 340 |
| \({\sim}T\) | 152 | 176 |

However, in this model, when the population is further partitioned by gender, the opposite recommendation for males and for females is supported at the .05 level of significance.

|  | \(RM\) | \({\sim}RM\) | \(R{\sim}M\) | \({\sim}R{\sim}M\) |
|---|---|---|---|---|
| \(T\) | 48 | 152 | 321 | 188 |
| \({\sim}T\) | 73 | 145 | 79 | 31 |
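The two tables can be checked against each other with a few lines of arithmetic: the gender-partitioned counts sum exactly to the aggregate table, the aggregate comparison favours treatment, and both within-gender comparisons favour no treatment. A sketch:

```python
# Counts (R, ~R) from the two tables above.
agg     = {"T": (369, 340), "~T": (152, 176)}
males   = {"T": (48, 152),  "~T": (73, 145)}
females = {"T": (321, 188), "~T": (79, 31)}

def rate(recovered, not_recovered):
    return recovered / (recovered + not_recovered)

# Aggregated: treatment looks favourable ...
assert rate(*agg["T"]) > rate(*agg["~T"])          # 0.520... vs 0.463...
# ... partitioned by gender: no-treatment looks favourable for both.
assert rate(*males["T"]) < rate(*males["~T"])      # 0.240 vs 0.334...
assert rate(*females["T"]) < rate(*females["~T"])  # 0.630... vs 0.718...

# Sanity check: the partitioned counts sum exactly to the aggregate table.
for arm in ("T", "~T"):
    assert males[arm][0] + females[arm][0] == agg[arm][0]
    assert males[arm][1] + females[arm][1] == agg[arm][1]
```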

Take the null hypothesis to be that there is no association between
treatments and recoveries, and the alternative to the null hypothesis
that treatment is *less favorable* for recovery than no
treatment. Rejecting the null hypothesis falls within the .05 level of
significance for both the \(M\)-tables and the \({\sim}M\)-tables. So,
when the consolidated data are considered, treatment is favored, but
when the population is partitioned by gender, no treatment is favored
for both males and females. A further partition, e.g., a partition by
age groups, can reverse the associations within partitions by
gender. So treatments can be positively correlated with recoveries in
the total population, negatively correlated with recoveries when the
population is partitioned by gender, and positively correlated with
recoveries when the population is partitioned by age. The generality
of the boundary conditions for Simpson’s reversals of
inequalities guarantees that there always are models in arithmetic
that accommodate data and support conflicting
recommendations. Arithmetic is silent on which partitions to take as
the basis for evaluating conflicts between hypotheses given data and
the ways data can be partitioned.

## 2. Simpson’s Reversals of Inequalities as Sources of Invalid Reasoning

Intuitive reasoning about percentages and probability relations is notoriously accident prone. The example based on the suit brought against Berkeley illustrated how a bias in hiring practices in each department of a university can be inverted when the data are pooled. But many people, at least initially, would deem it impossible that a higher percentage of males were successful in a setting where females had higher success rates in each department in which appointments were made. One way to view the flaw in intuitive reasoning that arises from Simpson’s Reversals is to note that representing data from partitions of a population as fractions, and then pooling those fractions to get statistics on total populations, is not guaranteed to maintain the relations between the fractions within the partitions. Proper fractions have infinitely many equivalent representations. For example, \(1/2 = 2/4 = 4/8 = \ldots\). Now recall the form of relations between fractions in terms of which Simpson’s Reversals were illustrated, i.e.,

\[\begin{align} a/b &\lt A/B, \\ c/d &\lt C/D, \text{ and}\\ (a + c)/(b + d) &\gt (A + C)/(B + D). \end{align}\]

Now, treating terms as proper fractions, we can have \(a/b = 2a/2b\), and \(A/B = 5A/5B\); \(c/d = 3c/3d\), and \(C/D = 4C/4D\). However, when these equivalent representations are pooled, the resulting relations between fractions will often differ from the original relations. E.g., \((2a + 3c)/(2b + 3d)\) can be more or less than \((a + c)/(b + d)\). Hence, it is invalid to conclude that relations between percentages or ratios when data are pooled will conform to the regularities that are exhibited by the sets that comprise partitions of the data. Equivalent representations of ratios make different contributions when data are pooled.
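The point about equivalent representations can be made concrete: pooling operates on numerator-denominator pairs (the so-called mediant), not on the fractions they reduce to, so equal fractions can pool to unequal results. A sketch with illustrative numbers:

```python
from fractions import Fraction

def pool(p, q):
    """Pool two (numerator, denominator) pairs the way raw counts pool."""
    return Fraction(p[0] + q[0], p[1] + q[1])

# 1/5 and 2/10 are the same proper fraction, as are 6/8 and 18/24 ...
assert Fraction(1, 5) == Fraction(2, 10) and Fraction(6, 8) == Fraction(18, 24)

# ... but the equivalent representations pool to different results.
assert pool((1, 5), (6, 8)) == Fraction(7, 13)
assert pool((2, 10), (18, 24)) == Fraction(10, 17)   # 20/34
assert pool((1, 5), (6, 8)) != pool((2, 10), (18, 24))
```

Note that the pairs must be kept as raw counts: constructing `Fraction(2, 10)` first would silently reduce it to `1/5`, hiding exactly the phenomenon at issue.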

One way to arithmetically counter this difficulty is by ‘normalizing’ the representations of data from sub-populations and only pooling the normalized representations of the data. Normalizing data counters the effects of skewing by providing constant denominators for the fractions that represent the data, and by representing the sub-populations that are compared as if they were of equal sizes in the relevant respects in terms of which they are compared. However, Simpson’s Reversals show that there are numerous ways of partitioning a population that are consistent with associations in the total population. A partition by gender might indicate that both males and females fared worse when provided with a new treatment, while a partition of the same population by age indicated that patients under fifty, and patients fifty and older both fared better given the new treatment. Normalizing data from different ways of partitioning the same population will provide incompatible conclusions about the associations that hold in the total population.

A related point comes out even more vividly when fractions are interpreted as probabilities. It was noted above that a Simpson’s Reversal can take the following probabilistic form: It is possible to have

\[\begin{align} P(A\mid B) &\gt P(A\mid {\sim}B), \text{ where} \\ P(A\mid B \amp C) &\lt P(A\mid {\sim}B \amp C) \text{ and}\\ P(A\mid B \amp {\sim}C) &\lt P(A\mid {\sim}B \amp {\sim}C). \end{align}\]

One way for intuitive reasoning to overlook this possibility is by overlooking the so-called law of total probability and its relevance to this setting. From the probability calculus we have the following equivalences that represent probabilities as weighted averages.

\[\begin{align} P(A\mid B) &= P(A\mid B \amp C)P(C\mid B) + P(A\mid B \amp {\sim}C)P({\sim}C\mid B) \\ P(A\mid {\sim}B) &= P(A\mid {\sim}B \amp C)P(C\mid {\sim}B) + P(A\mid {\sim}B \amp {\sim}C)P({\sim}C \mid {\sim}B) \end{align}\]

Skewed weights for \(P(C\mid B)\), \(P({\sim}C\mid B)\), \(P(C\mid {\sim}B)\), and \(P({\sim}C\mid {\sim}B)\) create the range of possibilities that are marked out by the boundary conditions for Simpson’s Reversals. E.g., let

\[\begin{align} P(A\mid B) &= .54 \text{ and} \\ P(A\mid {\sim}B) &= .44 \end{align}\]

So, \(B\) is positively relevant to \(A\). Let the weights that feature in the representation of these probabilities in terms of a factor \(C\) be as follows:

\[\begin{align} P(C\mid B) &= .28, \\ P({\sim}C\mid B) &= .72, \\ P(C\mid {\sim}B) &= .66, \text{ and} \\ P({\sim}C\mid {\sim}B) &= .34 \end{align}\]

Given these weightings, \(B\) will be positively relevant to
\(A\), but it can be negatively relevant to \(A\) in each of the
cells provided by the partition \(\{C, {\sim}C\}\).
I.e.,^{[5]}

If intuitive reasoners generally ignore the roles that weights play or fail to play in their reasoning about probability, they are apt to be taken aback when Simpson’s Reversals turn up in actual or possible data. A disposition to ignore weightings in intuitive reasoning could arise from ignorance, habit, or as a defeasible heuristic when reasoning about probability relations. Of course it is an empirical question whether such oversight is the source of invalid reasoning, or whether another hypothesis better explains why many people find Simpson’s Reversals to be impossible at first, and why the reversals continue to be surprising even after their source has been explained to them.
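The weighted-average mechanism can be checked numerically. The cell probabilities below are assumed for illustration (the entry’s own figures appear in a footnote not reproduced here); with the stated weights of .28/.72 on the \(B\) side and .66/.34 on the \({\sim}B\) side, they recover \(P(A\mid B) = .54\) and \(P(A\mid {\sim}B) = .44\), even though \(B\) is negatively relevant to \(A\) inside each cell of \(\{C, {\sim}C\}\):

```python
from fractions import Fraction as F

# Assumed cell probabilities P(A | B-or-not, C-or-not); chosen for
# illustration to fit the weighted averages in the text exactly.
cells = {("B", "C"): F(1, 10),   ("~B", "C"): F(67, 330),
         ("B", "~C"): F(32, 45), ("~B", "~C"): F(9, 10)}

# B is negatively relevant to A inside each cell ...
assert cells[("B", "C")] < cells[("~B", "C")]
assert cells[("B", "~C")] < cells[("~B", "~C")]

# ... yet the skewed weights make B positively relevant overall.
pA_B  = F(28, 100) * cells[("B", "C")]  + F(72, 100) * cells[("B", "~C")]
pA_nB = F(66, 100) * cells[("~B", "C")] + F(34, 100) * cells[("~B", "~C")]
assert pA_B == F(54, 100) and pA_nB == F(44, 100)
```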

## 3. Do Paradoxical Data Provide Counter-examples to the Sure Thing Principle?

The so-called Sure Thing Principle (hereafter STP) is fundamental for theories of rational decision. L. J. Savage provides the following formulation of it:

If you would definitely prefer \(g\) to \(f\), either knowing that the event \(C\) obtained, or knowing that the event \(C\) did not obtain, then you definitely prefer \(g\) to \(f\) (Savage 1954, 21–2).

In theories of rational choice in which preferences are ordered by the rule of maximizing expected utility, STP is a consequence of the fact that the expected utility of an option can be represented as a probabilistically weighted average of the expected utilities of mutually exclusive and collectively exhaustive ways the world could be on the assumption that the option is chosen. E.g., with ‘EU’ representing a function that assigns expected utilities and ‘P’ a probability function,

\[ EU(A) = EU(A\amp B)P(B) + EU(A\amp {\sim}B) P({\sim}B). \]

When you know that \(B\) holds, it becomes a parameter for the expected utility of \(A\), and similarly when you know that \({\sim}B\) holds. So if the expected value assigned to an option \(C\) is less than that of \(A\) on the assumption that you know \(B\) obtains, and similarly on the assumption that \(B\) does not obtain, then the expected value of \(C\) is unconditionally less than the expected value of \(A\).

Now suppose that you are offered bets on applicants gaining jobs in the example concerning the two departments. Your options are to bet on a randomly drawn successful applicant being male, or to bet on a randomly drawn successful applicant being female. Let \(C\) be the event of applying for a job in History, and \({\sim}C\) be the event of applying for a job in Geography. (Every person in the relevant domain applies for exactly one position.) Given that the success rates for females were greater than that for males in both departments, does the STP recommend that you should back females as the bettor’s choice? One might (invalidly) reason as follows: given that females have a greater chance of success in their applications given \(C\) and given \({\sim}C\), STP recommends a preference for bets on females in a lottery in which you are betting on the gender of successful applicants. Of course, this would be bad advice in the setting of the example, as the success rate for males was greater overall. Given a suitably large number of bets, a clever bookie could be assured of a handsome profit if bettors backed females in the competitions for jobs. Their success rate was lower than their male competitors’ success rate overall despite being higher in each department.
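A quick computation from the example’s counts shows why backing females would be bad advice for a draw from the pooled successful applicants:

```python
from fractions import Fraction

# Successful applicants and total applicants, by gender and department.
succ = {("M", "Hist"): 1, ("W", "Hist"): 2, ("M", "Geo"): 6, ("W", "Geo"): 4}
apps = {("M", "Hist"): 5, ("W", "Hist"): 8, ("M", "Geo"): 8, ("W", "Geo"): 5}

# Women have the higher success rate in each department ...
for dept in ("Hist", "Geo"):
    assert Fraction(succ[("M", dept)], apps[("M", dept)]) \
         < Fraction(succ[("W", dept)], apps[("W", dept)])

# ... but a random draw from the pooled successful applicants favours men.
total = sum(succ.values())                       # 13 successful applicants
p_male   = Fraction(succ[("M", "Hist")] + succ[("M", "Geo")], total)
p_female = 1 - p_male
assert p_male == Fraction(7, 13) and p_female == Fraction(6, 13)
```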

To see what has gone awry in the attempt to apply STP in this
setting it suffices to note that a random draw from successful
applicants is made from the *mixture* that contains males and
females, and there are more males in the mixture. (Recall that females
were applying in greater numbers for jobs that were harder to get.) It
is insufficient for the applicability of the Principle that
probabilities line up with females having a greater chance of success
in each department. The Principle applies to *preferences*,
taken as weighted averages of utilities with probabilities supplying
the weights. The presented options are

- (1) A randomly drawn successful applicant is female.
- (2) A randomly drawn successful applicant is male.

To be told that a selected applicant applied for a position in History \((C)\) or in Geography \(({\sim}C)\) does not affect the probabilities of success in the mixture. This is evident when the expected utilities of the options are explicitly represented as weighted averages. Using ‘\(M\)’ for male, ‘\({\sim}M\)’ for female, ‘\(S\)’ for successful, and ‘\(C\)’ and ‘\({\sim}C\)’ as above, the expected utilities for the options are as follows.

\[\begin{align*}\tag{1} EU({\sim}M\amp S) &= EU({\sim}M\amp S\amp C)P(C\mid S\amp {\sim}M) \\ &\quad + EU({\sim}M\amp S\amp {\sim}C) P({\sim}C \mid S\amp {\sim}M) \\ \tag{2} EU(M\amp S) &= EU(M\amp S\amp C)P(C\mid S\amp M) \\ & \quad + EU(M\amp S\amp {\sim}C) P({\sim}C\mid S\amp M) \end{align*}\]

Given the figures that were used in the example, the probability relations between the weightings are as follows:

\[\begin{align} P(C\mid S\amp {\sim}M) &\gt P(C\mid S\amp M) \text{ and }\\ P({\sim}C\mid S\amp M) &\gt P({\sim}C\mid S\amp {\sim}M). \end{align}\]

It is these relations that are the source of the illusion that STP
selects Option 1. A successful female applicant is more likely than a
successful male applicant to have applied for a position in History,
and, conversely, a successful male applicant is more likely to have
applied for a position in Geography. If the candidates had been sorted by their
applications to the respective departments, where females had higher
success rates, and the drawing was done from a randomly chosen
department (with repeated draws and replacement until a successful
applicant is drawn) rather than from the mixture of successful
applicants, then the best choice would be for the gender with the
higher success rates in the respective departments, i.e.,
females. Such an arrangement would not be affected by the fact that
more women applied for jobs that were harder to get. But that is not
the arrangement that has been stipulated for the bets where selection
is made from the pooled successful applicants. The chances of
selecting a male (or a female) from *that* mixture are
independent of the department to which the successful applicants had
applied. Accordingly, rational bettors will find STP to be
inapplicable in the setting, because they will not have the
preferences that its application requires, i.e., a preference for
females, given that they applied for a job in History (C), and a
preference for females, given that they applied for a job in Geography
\(({\sim}C)\). For rational bettors,

and similarly for \(M\)’s, while, on the figures provided in the example,

\[ EU({\sim}M\amp S) \lt EU(M\amp S). \]

While Simpson’s Reversals do not support decisions that conflict with the Sure Thing Principle, they do pose problems of practical significance when decisions have to be taken about what to do. Should the associations in the total population of people guide decision making in a trial like that conducted by FIXIT? Or should the associations in the subpopulations of males and females guide decisions about whether to take the medication? Recall that a different partition of the total population, e.g., by age, can exhibit associations like those in the total population, and the reverse of those in the partition based on gender. There are no a priori methods that answer questions about whether associations in aggregated data, or associations in partitions of aggregated data, are good bases for inference from causes to effects or for making decisions about what to do. Contingent hypotheses about the logical and causal structure of particular practical problems best serve as the decision maker’s guide. Given appropriate background information, the relations between, e.g., treatments and recoveries in the total population might be the indicated basis for making treatment decisions. Given different background information, the relations between treatments and recoveries in a salient partition of the population may be indicated, contra the associations in the total population. In the absence of some contingent assumptions about logical and causal structures in particular cases, mere associations are not helpful in deciding what to do. So, while Simpson’s Reversals are not paradoxical from a logical point of view, they do point to conflicting associations that become genuinely paradoxical if they are all given causal significance.

## 4. Simpson’s Reversals of Inequalities, Correlations, and Causation

It is a commonplace that correlations between variables do not entail that they stand in causal relations. While some correlations are purely accidental, others can be lawful even when no causal connection obtains between the correlated variables—e.g., the correlation between falling barometers and rain is lawful because they are joint effects of a common cause, i.e., falling air pressure. Controlled experiments seek to expose correlations that are merely accidental. What then of robust correlations between variables that do not causally interact? Hans Reichenbach proposed that a robust correlation between variables is spurious [acausal] when there is a factor that ‘screens off’ the correlation and serves as a common cause of the associated variables (Reichenbach 1971, Ch. 4). Say that \(A\) is associated with \(B\) if and only if they are not probabilistically independent, i.e., \(P(A\mid B) \ne P(A)\). Reichenbach proposed that such an association is spurious provided that there is a factor \(C\) such that \(P(A\mid B\amp C) = P(A\mid C)\).
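Reichenbach’s screening-off condition is easy to check numerically. The following sketch builds a small population (the cell counts are hypothetical, chosen purely for illustration) in which \(A\) and \(B\) are associated overall, yet a factor \(C\) screens the association off:

```python
from fractions import Fraction as F

# Hypothetical cell counts over three binary factors A, B, C.
# Within C and within ~C, A is independent of B; marginally it is not.
# counts[(a, b, c)] = number of individuals with those truth values
counts = {
    (1, 1, 1): 40, (0, 1, 1): 40,   # C,  B:  P(A) = 40/80  = 0.5
    (1, 0, 1): 10, (0, 0, 1): 10,   # C, ~B:  P(A) = 10/20  = 0.5
    (1, 1, 0): 2,  (0, 1, 0): 18,   # ~C, B:  P(A) = 2/20   = 0.1
    (1, 0, 0): 8,  (0, 0, 0): 72,   # ~C, ~B: P(A) = 8/80   = 0.1
}

def p(event, given=lambda _: True):
    """Exact conditional probability of `event` given `given`."""
    num = sum(n for k, n in counts.items() if event(k) and given(k))
    den = sum(n for k, n in counts.items() if given(k))
    return F(num, den)

A = lambda k: k[0] == 1
B = lambda k: k[1] == 1
C = lambda k: k[2] == 1

# A and B are associated in the whole population ...
assert p(A, B) != p(A)          # 21/50 vs 3/10
# ... yet C 'screens off' the association: P(A | B & C) = P(A | C)
assert p(A, lambda k: B(k) and C(k)) == p(A, C)                        # both 1/2
assert p(A, lambda k: B(k) and not C(k)) == p(A, lambda k: not C(k))   # both 1/10
```

The arbitrariness of \(C\) here is the point of the passage above: nothing in these counts makes \(C\) a common cause of \(A\) and \(B\); the screening-off relation is a purely arithmetical artifact of the partition.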

Simpson’s Reversal of Inequalities illustrates that from an arithmetical point of view, there always is a factor or proposition \(C\) that ‘screens off’ any correlation. The existence of such a factor cannot be sufficient for a correlation to be spurious. For example, suppose that the probability of \(A\) given \(B\) is greater than without \(B\). The following diagram illustrates this possibility with probabilities corresponding to the proportional sizes of enclosed spaces with all of \(A\) represented by the enclosed rectangle that is intersected by the line dividing \(B\) from \({\sim}B\).

Figure 1. \(P(A\mid B) \gt P(A\mid {\sim}B)\)

The boundary conditions for Simpson’s Reversals guarantee that there is a \(C\) that intersects equal parts of \(A\amp B\) and \(A\amp {\sim}B\). In Section 1 it was noted that arithmetical possibilities are tantamount to existence conditions for arithmetical facts. Provided that a sample space can be partitioned sufficiently finely, the probabilistic relevance between \(A\) and \(B\) can be “washed out” by some arbitrary factor \(C\) within which the probabilities of \(A\amp B\) and \(A\amp {\sim}B\) are equal. The following diagram illustrates this arithmetical possibility:

Figure 2. \(P(A\mid B\amp C) = P(A\mid {\sim}B\amp C)\)

where \(C\) is represented by the parallelogram that is bisected by the boundary between \(B\) and \({\sim}B\) and comprises equal parts of \(A\amp B\) and \(A\amp {\sim}B\). \(C\) is an arbitrary proposition or factor. As enclosed spaces correspond to probabilities, \(P(A\mid B\amp C) = P(A\mid {\sim}B\amp C)\). So, \(C\) ‘screens off’ \(A\) from \(B\); however, its existence is clearly insufficient to show that the correlation between \(A\) and \(B\) is spurious. While ‘screening off’ may provide a necessary condition for showing that a correlation between variables is due to a common cause, this necessary condition is guaranteed to be fulfilled by the underlying arithmetic of the probability calculus. Further substantive conditions have to be provided over and above the probability relations between \(A\), \(B\), and \(C\) in order to identify \(C\) as a common cause of \(A\) and \(B\).

The inference that lawfully correlated variables are causally
independent of each other if the correlation is due to a common cause
is a special case of a more general view that causes increase the
chances of their
effects.^{[6]}
When there is a common cause \(C\) of a correlation between variables
\(B\) and \(A\), \(B\) does not cause \(A\); the raising of
\(A\)’s chances is due to \(C\), and while \(B\) might be a
symptom of \(A\), it is so by virtue of being a separate effect of
\(C\) that precedes \(A\). The following diagram illustrates these
relationships. (Arrows represent the directions of causal
connections.)

Figure 3. \(B\) precedes \(A\) and \(C\) is a common cause of \(B\) and \(A\)

Given \(C\), \(B\) does not raise \(A\)’s chances. The underlying idea behind analyses of causation in terms of chance raising is that causes promote their effects. In deterministic settings, chances take only extreme values, and causes do not ‘raise’ an effect’s chances of occurring except in the degenerate sense that they raise the chances of their effects from zero without them to one with them (excluding cases of deterministic overdetermination). However, it is a contingent matter whether the world we inhabit is deterministic or indeterministic, and concepts of causation need to accommodate the latter possibility as well as the former. Then, representations of deterministic causation can be viewed as a special case of probabilistic causation in which causes are sufficient and necessary for their effects.

In view of Simpson’s Reversals of Inequalities, probability relations between variables will vary widely under different partitions of populations or state spaces. This fact about probability relations provides an invaluable resource for the representation in probabilistic terms of the complex relations that hold between networks of causes and their effects. Causes not only can promote effects, but they can promote the absence of or inhibit effects that might occur in their absence. E.g., regular exercise inhibits or reduces the chances of cardiovascular disorders. Accordingly, whatever promotes regular exercise also promotes cardiovascular health even if it also promotes cardiovascular disease. Cartwright gives the following example. Smoking causes heart disease, but it also could cause smokers to take up exercise in greater numbers than non-smokers. In that case smoking could indirectly cause cardiovascular health while directly causing disease. With plus and minus signs indicating whether a cause promotes or inhibits an effect, the following diagram represents a causal set-up in which smoking could promote cardiovascular health while directly promoting disease.

Figure 4.

E.g., if smoking increases the chances of heart disease by 25%, but
also increases the chances of regular exercise by 40% while exercise
decreases the chances of disease by 70%, smokers will on balance
benefit from their habit with respect to cardiovascular health. In this
set-up, there could be a Simpson’s Reversal where smokers who exercise
fare worse than non-smokers who exercise, and similarly for smokers who
do not exercise compared with non-smokers, while the smokers’
rates of disease are lower overall. The net causal effect of smoking on
health is positive in the example due to the contribution of a third
variable, exercising, that is an effect of smoking. It is the causal
contributions of further variables that are the sources of Simpson’s
Reversals in other causal set-ups where the effects of direct causal
links are modified by the additional variables’ contributions.
These include cases where direct effects are nullified by inhibitory
effects of an accompanying factor, e.g., substances that are separately
poisonous, acid and alkali, can interact to have no deleterious effect
when they are taken together. Each acts as an antidote for the
other.^{[7]}
Further entanglements include cases where a
cause that promotes an effect is accompanied by an inhibitory cause of
the effect and they are both effects of a common cause. E.g.,

Figure 5. \(E\)’s chance is unperturbed by \(CC\), a common cause.

An interpretation of this diagram: thrombosis can be an effect of
pregnancy and it can also be an effect of some of the ingredients of
birth control pills. Both pregnancy and the pills increase the chances
of thrombosis. However, the pills decrease the chances of pregnancy,
and the net effect on populations of women who take the pills could
show no change in the frequency of thrombosis. Examples such as those
that have been canvassed show that it is neither necessary nor
sufficient for a causal relation between two variables that one raise
the chances of the other. Cartwright (2001, 271) puts the matter in the
following terms: ‘Causes *can* increase the probability of
their effects; but they need not. And for the other way around: an
increase in probability *can* be due to a causal connection; but
lots of other things can be responsible as well.’
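The smoking/exercise figures given above (25%, 40%, and 70%) can be plugged into a toy model. Only those three percentages come from the text; the baseline disease risk `r0` and baseline exercise rate `e0` are hypothetical free parameters, and whether smokers fare better on balance depends on them:

```python
# A hypothetical model of the smoking/exercise example. The 25%, 40%,
# and 70% figures come from the text; the baselines r0 (disease risk
# for a sedentary non-smoker) and e0 (exercise rate among non-smokers)
# are free parameters chosen here for illustration.

def disease_risk(smoker: bool, r0: float = 0.2, e0: float = 0.5) -> float:
    e = min(1.0, e0 * 1.4) if smoker else e0   # smoking raises exercise rates by 40%
    direct = 1.25 if smoker else 1.0           # smoking directly raises disease risk by 25%
    # Exercisers' risk is cut by 70% (i.e., multiplied by 0.3).
    return direct * (e * 0.3 * r0 + (1 - e) * r0)

# With a 50% baseline exercise rate, smokers fare better on balance:
assert disease_risk(smoker=True) < disease_risk(smoker=False)

# With a low baseline exercise rate (e0 = 0.2), the direct harm dominates:
assert disease_risk(True, e0=0.2) > disease_risk(False, e0=0.2)
```

That the sign of the net effect flips with the assumed baselines illustrates why the text says smoking *could* indirectly cause cardiovascular health: the probabilistic relevance of a cause depends on the causal contributions of the accompanying variables.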

Is Cartwright’s observation cause for pessimism about the program of
analyzing causation and causal relevance in probabilistic terms? Not
necessarily. It sets a problem about causal entanglements that are not
tracked by probability relations and probabilistic entanglements that
are not due to causal relations. The program of providing probabilistic
representations of causal relations needs to provide conditions that
disentangle causal networks. What is required is a way of locating the
*right* partitions of populations, where the right ones are the
ones whose probability relations do track causal connections while
holding relevant background factors fixed. A number of different
proposals have been put forward in the literature on probabilistic
causation that aim to provide criteria for locating the right
partitions of data for the purpose of identifying causal
connections.

The proposals fall into two broad categories: (1) Reductive
proposals: these do not appeal to causal concepts and they aim to
provide a filter on correlations that identifies which correlations are
spurious. Correlations that are not spurious are meant to conform to
intuitions about causal relations and to implement the roles that are
intuitively assigned to causal relations.^{[8]}
(2) Non-reductive
proposals: these are unabashed about using causal concepts to
distinguish between spurious and causal correlations. Proposals from
this second group are generally skeptical about the Humean program that
motivates reductive proposals, and set-ups that are instances of
Simpson’s Reversals are one of their main critical scalpels (Cartwright
1979, and especially Dupre & Cartwright 1988). Nevertheless, they
too face the problem of providing a filter on correlations that marks
out which of them are spurious, but they do not feel constrained to
avoid reference to causal relations in providing criteria for selecting
partitions that provide reliable data for causal inferences. In sum,
both reductionists and anti-reductionists who endorse the program of
representing causal relations in terms of probability relations propose
that

\(C\) causes \(E\) if and only if the probability of \(E\) is greater given \(C\) than given not \(C\), provided that \(\ldots X\ldots\).

The proviso is needed to filter out cases where probability relations between \(C\)-type events and \(E\)-type events do not track causal relations. Their opinions divide on whether causal concepts need to or can be used without vicious circularity in spelling out the content of the proviso \(\ldots X\ldots\). Reductionists seek ways of spelling out the proviso in terms of homogeneous reference classes, where homogeneity is spelled out in terms of robust correlations conditional on a set of factors that are held fixed. Anti-reductionists are quick to ask: which factors? To take all possible factors to be relevant is not only epistemologically intractable, but it can lead to silly conclusions insofar as all but absolutely fundamental causal processes can be manipulated by introducing some intervening factors. E.g., the probability of death given a heart attack is greater than without the heart attack, but the contribution of the heart attack is ‘screened off’ in cases where the heart attack coincides with being run down by a truck. In this example, the chances of death are overdetermined. Cases of causal overdetermination are extreme examples of causal networks in which probabilistic relevance is washed out or inverted by the causal contributions of an exogenous variable. In the experimental sciences, attempts at isolating interactions between factors from intervening variables are standard procedure. However, what is achievable even in the best laboratory conditions will fall short of the ideal of showing that there are no intervening factors on which a correlation is dependent. To show the latter would require showing that a negative existential proposition is true.

Anti-reductionists have a ready answer to the question of which
factors have to be held fixed when evaluating probabilistic
dependencies and probabilistic independence. They want all potentially
*causally relevant factors that are of interest* to be held
fixed for the purposes of identifying the probability relations between
C and E that are due to and are apt for representing causal
connections. According to this approach, reference classes that are
causally homogeneous provide the proper basis for evaluating probability
relations. One then looks to background scientific theories and other
knowledge of causal relations to determine whether reference classes
are causally homogeneous.^{[9]}
In many cases, however, our curiosity about
causal relations outstrips our current knowledge of causally relevant
variables that need to be held fixed. Then, inferences to causal
relations from statistical data that can always be counter-posed with
reversed regularities in different partitions of the data can lead to
inconsistent claims concerning causal relations.

When reversals occur in data, researchers face the question of whether the associations in the aggregated data are spurious, or whether the associations in the partitioned data are spurious. Different causal models (represented by different directed acyclic graphs) will be apt to represent different answers in different cases (see the entry on probabilistic causation). These models can be tested by interventions that isolate and control the values taken by variables that are ostensible causes of effects that are of interest to the researcher. Properly conducted experiments will isolate variables to be manipulated and then read off the effects of the manipulations (see the entry on causation and manipulability). The so-called “back-door criterion” (Pearl 1993) states precisely what is required for some variable to be suitably isolated for manipulation. So, the problems posed by Simpson’s Reversals can be solved by testing different causal hypotheses that are consistent with the observed data, where the tests by interventions provide a basis, over and above mere observations, for accepting some causal models as correct representations of causal connections and for rejecting others that have merely spurious associations. Simpson’s “paradox” is thus resolved in the sense that it is possible to test different causal hypotheses that reveal which associations are spurious. (For more on this method see Pearl 2014.)
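When a set of variables \(Z\) satisfies Pearl’s back-door criterion relative to a cause \(C\) and effect \(E\), the interventional probability is obtained by the adjustment formula \(P(E \mid do(C)) = \sum_z P(E \mid C, z)\,P(z)\). The following sketch applies that formula to a hypothetical joint distribution (the numbers are invented for illustration) that exhibits a Simpson-style reversal between the aggregated and partitioned data:

```python
# A sketch of back-door adjustment. The joint distribution is
# hypothetical, chosen so that treatment looks beneficial in the
# aggregate but harmful within each stratum of the confounder Z.
# joint[(e, c, z)] = P(E=e, C=c, Z=z)
joint = {
    (1, 1, 1): 0.340, (0, 1, 1): 0.060,   # z=1, treated: P(E=1 | C=1, z=1) = 0.85
    (1, 0, 1): 0.090, (0, 0, 1): 0.010,   # z=1, control: P(E=1 | C=0, z=1) = 0.90
    (1, 1, 0): 0.025, (0, 1, 0): 0.075,   # z=0, treated: P(E=1 | C=1, z=0) = 0.25
    (1, 0, 0): 0.120, (0, 0, 0): 0.280,   # z=0, control: P(E=1 | C=0, z=0) = 0.30
}

def p(pred):
    """Probability of the event picked out by pred(e, c, z)."""
    return sum(v for k, v in joint.items() if pred(*k))

def p_do(c):
    """P(E=1 | do(C=c)) by adjusting over the confounder Z."""
    total = 0.0
    for z in (0, 1):
        pz = p(lambda e, cc, zz: zz == z)
        pe = p(lambda e, cc, zz: e == 1 and cc == c and zz == z)
        pcz = p(lambda e, cc, zz: cc == c and zz == z)
        total += (pe / pcz) * pz
    return total

obs_treated = p(lambda e, c, z: e == 1 and c == 1) / p(lambda e, c, z: c == 1)
obs_control = p(lambda e, c, z: e == 1 and c == 0) / p(lambda e, c, z: c == 0)

# Aggregated data favor treatment (0.73 vs 0.42), each stratum favors
# control (0.85 < 0.90 and 0.25 < 0.30) ...
assert obs_treated > obs_control
# ... and back-door adjustment sides with the strata in this model:
assert p_do(1) < p_do(0)   # 0.55 vs 0.60
```

The adjustment resolves the conflict only relative to a causal model: it presupposes that \(Z\) is a confounder of \(C\) and \(E\) rather than, say, an effect of \(C\), which is exactly the kind of contingent causal hypothesis the passage above says must be tested by intervention.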

## 5. Simpson’s Reversal of Inequalities in Evolutionary Settings

Simpson’s Reversals of Inequalities have applications in economic theory and population genetics, especially in cases involving competition among businesses or organisms. In the above example of differential hiring of men and women, imagine that we were to map the women onto, say, ‘lemmings’ and the men onto, say, ‘rats’. Imagine the lemmings to be altruistic and self-sacrificing, or alternatively imagine them to be irrational, inefficient or lazy—either way, by one means or another, imagine that they behave in ways that benefit their neighbours at their own expense. Imagine the rats to be selfish, rational and efficient, and regularly to gain benefits at the expense of their neighbours.

Next, map the History Department onto Norway during a very severe winter in Norway, and suppose there are more rats than lemmings in Norway. Then life is tough for everyone in Norway, and it is even tougher for lemmings than for rats. Map the Geography Department onto Sweden which is in the midst of a very mild winter, and suppose there to be more lemmings than rats in Sweden. Then life is easier for everyone in Sweden, though it is even easier for free-riding and opportunistic rats than it is for lemmings. Finally, consider the reproductive rates for rats and lemmings in the total land mass of the two countries. (Or, if these ‘rats’ and ‘lemmings’ were businesses, consider their relative bankruptcy rates.) The numbers might then display the same pattern that we described for hiring rates of men and women at the University of California:

| | Lemmings | | Rats |
| :--- | :--- | :---: | :--- |
| Norway | \((1\times 10^9)/(5\times 10^9)\) | \(\lt\) | \((2\times 10^9)/(8\times 10^9)\) |
| Sweden | \((6\times 10^9)/(8\times 10^9)\) | \(\lt\) | \((4\times 10^9)/(5\times 10^9)\) |
| Scandinavia | \((7\times 10^9)/(13\times 10^9)\) | \(\gt\) | \((6\times 10^9)/(13\times 10^9)\) |

Lemmings are losing ground in Norway, and they are losing ground in Sweden; yet they are gaining ground in the combined area of the two countries.
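The table’s arithmetic can be checked directly with exact fractions:

```python
from fractions import Fraction as F

# Reproductive rates from the table (offspring surviving / population).
lemmings = {"Norway": F(1, 5), "Sweden": F(6, 8)}
rats     = {"Norway": F(2, 8), "Sweden": F(4, 5)}

# Lemmings do worse than rats within each country ...
assert lemmings["Norway"] < rats["Norway"]   # 0.2  < 0.25
assert lemmings["Sweden"] < rats["Sweden"]   # 0.75 < 0.8

# ... yet better in Scandinavia as a whole: (1+6)/(5+8) > (2+4)/(8+5).
assert F(1 + 6, 5 + 8) > F(2 + 4, 8 + 5)     # 7/13 > 6/13
```

The reversal arises because rates do not add: the combined rate is a population-weighted mixture of the country rates, and most lemmings live where rates are high.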

The reason that lemmings are gaining ground in the combined area of the two countries is that
*more of the lemmings are living where the survival rate is
higher*. Note that the survival rate is higher there precisely
because that is where more of the lemmings are living. Thus, if rats
congregate together, the selfish efficiency of each rat will be bad not
only for the poor lemmings in the neighborhood but also for other rats.
Even if only *slightly* more of the rats are living in one
region rather than another, if the benefits they gain at their
neighbors’ expense become *too extreme* then this will
reduce the survival rate of everyone in that neighborhood, rats
included; this will precipitate a Simpson’s Reversal, and the number of
rats will begin to go down globally when compared with lemmings.

In both Darwinian evolutionary theory and much of economic theory, it
is hard to see how ‘altruism’ (or, for that matter,
systematic inefficiency) could evolve, or be sustained over the long
term. That is, it is hard to see how a population could sustain
heritable patterns of behaviour that *benefit the competitors*
of an individual business or organism at the expense of the long-term
chances of survival or reproductive success for those individuals and
others with the same dispositions. For this reason it is of
considerable theoretical significance to explore the applications of
Simpson’s Paradox, to see whether this might help to explain not only
the altruism but also the irrationality, inefficiency, laziness and
other vices that may prevail in populations, and that can cause a
population to fall short of the economic rationalist’s or Darwinian’s
ideal of the ruthlessly efficient pursuit by each individual of its
own profits or long-term reproductive success. On balance, this is
probably cheerful news.

## Bibliography

- Axelrod, R., 1984, *The Evolution of Cooperation*, New York: Basic Books.
- Bickel, P. J., Hammel, E. A., and O’Connell, J. W., 1975, “Sex Bias in Graduate Admissions: Data From Berkeley”, *Science*, 187: 398–404.
- Blyth, C. R., 1972, “On Simpson’s Paradox and the Sure Thing Principle”, *Journal of the American Statistical Association*, 67: 364–366.
- Cartwright, N., 1979, “Causal laws and effective strategies”, *Noûs*, 13(4): 419–437.
- –––, 2001, “What is wrong with Bayes Nets?”, *The Monist*, 84(2): 242–265. Reprinted in *Probability is the Very Guide of Life*, H. E. Kyburg, Jr. and M. Thalos (eds.), Chicago and La Salle, IL: Open Court, 2003, 253–275.
- Cohen, M. R., and Nagel, E., 1934, *An Introduction to Logic and Scientific Method*, New York: Harcourt, Brace and Co.
- Dawid, A. P., 1979, “Conditional independence in statistical theory”, *Journal of the Royal Statistical Society* (Series B), 41: 1–15.
- Dupre, J., and Cartwright, N., 1988, “Probability and causality: Why Hume and indeterminism don’t mix”, *Noûs*, 22: 521–536.
- Eells, E., 1987, “Cartwright and Otte on Simpson’s Paradox”, *Philosophy of Science*, 54: 233–243.
- Glymour, C., and Meek, C., 1994, “Conditioning and Intervening”, *British Journal for the Philosophy of Science*, 45: 1001–1021.
- Hardcastle, V. G., 1991, “Partitions, probabilistic causal laws, and Simpson’s Paradox”, *Synthese*, 86: 209–228.
- Hesslow, G., 1976, “Discussion: Two notes on the probabilistic approach to causality”, *Philosophy of Science*, 43: 290–292.
- Lindley, D. V., and Novick, M. R., 1981, “The role of exchangeability in inference”, *Annals of Statistics*, 9: 45–58.
- Malinas, G., 1997, “Simpson’s Paradox and the wayward researcher”, *Australasian Journal of Philosophy*, 75: 343–359.
- –––, 2001, “Simpson’s Paradox: A logically benign, empirically treacherous hydra”, *The Monist*, 84(2): 265–284. Reprinted in *Probability Is the Very Guide of Life*, Henry E. Kyburg, Jr. and Mariam Thalos (eds.), Chicago and La Salle, IL: Open Court, 2003, 165–182.
- Mittal, Y., 1991, “Homogeneity of subpopulations and Simpson’s Paradox”, *Journal of the American Statistical Association*, 86: 167–172.
- Otte, R., 1985, “Probabilistic causality and Simpson’s Paradox”, *Philosophy of Science*, 52: 110–125.
- Pearl, J., 1988, *Probabilistic Reasoning in Intelligent Systems*, San Mateo, CA: Morgan Kaufmann.
- –––, 1993, “Comment: Graphical Models, Causality, and Intervention”, *Statistical Science*, 8: 266–269.
- –––, 2000, *Causality: Models, Reasoning, and Inference*, New York: Cambridge University Press. [Second edition, 2009.]
- –––, 2014, “Comment: Understanding Simpson’s Paradox”, *The American Statistician*, 68: 8–13.
- Reichenbach, H., 1971, *The Direction of Time*, Berkeley: University of California Press.
- Savage, L. J., 1954, *The Foundations of Statistics*, New York: John Wiley and Sons.
- Simpson, E. H., 1951, “The interpretation of interaction in contingency tables”, *Journal of the Royal Statistical Society* (Series B), 13: 238–241.
- Skyrms, B., 1980, *Causal Necessity*, New Haven: Yale University Press.
- Sober, E., 1993, *The Nature of Selection*, Chicago: University of Chicago Press.
- –––, 1993, *Philosophy of Biology*, Oxford: Oxford University Press.
- Sober, E., and Wilson, D. S., 1998, *Unto Others: The Evolution and Psychology of Unselfish Behaviour*, Cambridge, MA: Harvard University Press.
- Spohn, W., 2001, “Bayesian nets are all there is to causality”, in *Stochastic Dependence and Causality*, D. Costantini, M. C. Galavotti, and P. Suppes (eds.), Stanford: CSLI Publications.
- Sunder, S., 1983, “Simpson’s reversal paradox and cost allocations”, *Journal of Accounting Research*, 21: 222–233.
- Suppes, P., 1970, *A Probabilistic Theory of Causality*, Amsterdam: North-Holland Publishing Co.
- Thalos, M., 2003, “The Reduction of Causation”, in *Probability is the Very Guide of Life: The Philosophical Uses of Chance*, H. Kyburg and M. Thalos (eds.), Chicago: Open Court.
- Thornton, R. J., and Innes, J. T., 1985, “On Simpson’s Paradox in economic statistics”, *Oxford Bulletin of Economics and Statistics*, 47: 387–394.
- van Fraassen, B. C., 1989, *Laws and Symmetry*, Oxford: Clarendon Press.
- Yule, G. U., 1903, “Notes on the theory of association of attributes in Statistics”, *Biometrika*, 2: 121–134.


## Other Internet Resources

- Simpson’s Paradox, by Alan Crowe
- Simpson’s Paradox—When Big Data Sets Go Bad, in Amazing Applications of Probability and Statistics at www.intuitor.com.
- Online paper by Nick Chater, Ivo Vlaev and Maurice Grinberg, “A new consequence of Simpson’s Paradox: Stable co-operation in one-shot Prisoner’s Dilemma from populations of individualistic learning agents,” University College London/New Bulgarian University.

### Acknowledgments

The authors would like to thank Paul Oppenheimer for spotting an incorrectly specified statistic and probability in Section 1.3 and Section 2, respectively.