Suppose we have a sample of subjects in two treatment groups. To study the difference in treatment effects, we can analyse the data from all subjects (overall analysis). We may also divide the subjects into several subgroups based on covariates of interest (eg, gender) and study the treatment effects within each subgroup (subgroup analysis). The results of these two analyses may differ, or even point in opposite directions. In this paper, we give a general sufficient condition of consistency between the overall and subgroup analyses.
- public health administration
- statistics as topic
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Consider the following hypothetical example. Suppose the fourth grade students of two schools (1 and 2) in a school district took a state maths exam. The principals of these two schools wanted to know whether there was a difference in the performances between the two schools. They calculated the overall average score, the average score of girls and the average score of boys in each school. The result is presented as scenario B in table 1. After looking at the average scores of girls and boys, respectively, the principal of school 1 was very happy as they were both one point higher than those in school 2. However, after looking at the overall average score of these two schools, the principal was very confused as the overall average score of school 2 was higher than that of school 1. Is there anything wrong in calculating the overall average score? What is the reason for the inconsistency between the overall average scores and the average scores stratified by gender?
Before figuring out the reason for the inconsistency, let us take a look at scenario A in table 1. In this scenario, the overall average score, as well as the average scores of girls and boys, are all higher in school 1 than in school 2. A closer examination shows that the proportions of girls are different in scenarios A and B. In scenario A, 48% of students are girls in both schools. In scenario B, 40% and 60% of students are girls in the two schools, respectively. Is the difference in the proportions of girls sufficient to create this inconsistency? The answer is no. In scenario C in table 1, although the proportions of girls are different in the two schools, the overall average score and the average scores by gender are all higher in school 1.
Examples in table 1 indicate that the results of the overall analysis and the subgroup analysis may be very different. Now we show what overall analysis and subgroup analysis actually mean.1 Suppose we are interested in the treatment effect of a new drug D. We recruit some subjects and randomise them to two treatment groups (T and C). Subjects in groups T and C are administered drug D and placebo, respectively. After collecting the data, we calculate the average response of each group and use an appropriate statistical method (eg, two-sample t-test or Pearson’s χ² test) to compare them. This is called the overall analysis. However, we may suspect that the response of a subject depends on his/her age. We then divide the subjects in the study population into several subgroups based on their ages and study the treatment effect within each age group. This is called the subgroup analysis. It may offer us more information on the treatment effect of the new drug within each specific age group.
Results in table 1 indicate that even if the new drug turns out to be better than the placebo in each subgroup, the overall response in the placebo group may be better than that in the new drug group. In this paper, we study the reason for this counterintuitive phenomenon. The paper is organised as follows. Section 2 defines the notation. Section 3 gives a very general sufficient condition of consistency. Section 4 gives some practical guidance on dealing with inconsistency in real studies.
We used the example in table 1 to develop our notation. However, our results apply to both continuous and categorical outcomes. Let $Y_i$ denote the score of a randomly selected student from school $i$, and $p_i$ denote the proportion of girls in school $i$, $i=1, 2$. We define the following quantities: $a_{10}$=average score of girls in school 1, $a_{11}$=average score of boys in school 1, $a_{20}$=average score of girls in school 2, $a_{21}$=average score of boys in school 2.
Then the overall average scores of these two schools are $a_1=p_1 a_{10}+(1-p_1)a_{11}$ and $a_2=p_2 a_{20}+(1-p_2)a_{21}$, respectively. They are the weighted averages of the subgroup averages.
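As a minimal sketch of these weighted averages, the Python snippet below computes the overall averages for two mixes of girls and boys. The subgroup means (85, 75, 84, 74) are hypothetical values chosen for illustration and are not the entries of table 1; only the proportions of girls (48% in both schools, then 40% and 60%) come from the text. The function name `overall_average` is likewise an assumption.

```python
def overall_average(p_girls, avg_girls, avg_boys):
    """Overall mean as the weighted average p*a_i0 + (1 - p)*a_i1."""
    return p_girls * avg_girls + (1 - p_girls) * avg_boys

# Hypothetical subgroup means (NOT taken from table 1): school 1 is one
# point ahead of school 2 among girls and among boys.
a10, a11 = 85, 75  # school 1: girls, boys
a20, a21 = 84, 74  # school 2: girls, boys

# Same proportion of girls (48%) in both schools: school 1 stays ahead.
same_mix = overall_average(0.48, a10, a11) - overall_average(0.48, a20, a21)

# 40% girls in school 1 and 60% in school 2: the overall difference flips.
diff_mix = overall_average(0.40, a10, a11) - overall_average(0.60, a20, a21)

print(f"equal proportions:   school 1 - school 2 = {same_mix:+.2f}")
print(f"unequal proportions: school 1 - school 2 = {diff_mix:+.2f}")
```

With these assumed numbers the overall difference is about +1 when the proportions are equal and about −1 when they are 40% and 60%, even though the subgroup averages are unchanged.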
We also define some differences in the score:
(1) The differences between girls and boys within each school: $d_1=a_{10}-a_{11}$, $d_2=a_{20}-a_{21}$.
(2) The difference in girls (boys) between the two schools (subgroup differences):
$\Delta_0=a_{10}-a_{20}$,
$\Delta_1=a_{11}-a_{21}$.
(3) The overall difference between the two schools:
$\Delta=a_1-a_2$.
It is easy to prove that
$\Delta_0-\Delta_1=d_1-d_2$.
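One way to check this identity is to expand both subgroup differences from their definitions and regroup the terms by school:

```latex
\begin{align*}
\Delta_0-\Delta_1 &= (a_{10}-a_{20})-(a_{11}-a_{21}) \\
                  &= (a_{10}-a_{11})-(a_{20}-a_{21}) \\
                  &= d_1-d_2 .
\end{align*}
```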
Sufficient condition of consistency
The inconsistency between the overall difference and the subgroup differences happens when $\Delta_0$ and $\Delta_1$ have the same sign but $\Delta$ has the opposite sign. In scenario B of table 1, the average scores of girls and boys in school 1 are both higher than those in school 2. However, the overall average score of school 2 is higher, so the overall analysis and the subgroup analysis are inconsistent in this case. Since each of $\Delta_0$, $\Delta_1$ and $\Delta$ can be >0, =0 or <0, there are $3^3=27$ possible combinations of their signs. For the sake of completeness, we list all 27 combinations in table 2.
There are many redundancies in table 2. If we exchange the labels of those two schools, combinations 1–13 become combinations 15–27. Therefore, we do not need to consider combinations 15–27. There are still some other redundancies in combinations 1–14. For example, if we relabel girls and boys, combinations 4–6 become combinations 10–12. Combination 14 is of no interest in practice and will not be discussed further. Hence, we only consider combinations 1–9 and 13 in our discussion of (in)consistency.
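The inconsistency between $\Delta_0$, $\Delta_1$ and $\Delta$ occurs if and only if one of the following occurs:
$\Delta_0>0$ and $\Delta_1>0$ but $\Delta=0$,
$\Delta_0>0$ and $\Delta_1>0$ but $\Delta<0$,
$\Delta_0>0$ and $\Delta_1=0$ but $\Delta=0$,
$\Delta_0>0$ and $\Delta_1=0$ but $\Delta<0$,
$\Delta_0=0$ and $\Delta_1=0$ but $\Delta>0$.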
Combinations 2, 3, 5, 6 and 13 in table 2 satisfy the conditions above.
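The sign conditions above can be sketched as a small Python check. The function `is_inconsistent` is a hypothetical helper, not part of the paper. It encodes the five listed conditions together with their mirror images under relabelling of the schools (all signs flipped) and of girls and boys (the two subgroup differences swapped). As a further assumption, subgroup differences of opposite sign are treated as compatible with any overall sign. The enumeration does not reproduce the numbering used in table 2.

```python
from itertools import product

def sign(x):
    """Return +1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def is_inconsistent(delta0, delta1, delta):
    """True if the overall sign contradicts the two subgroup signs."""
    s0, s1, s = sign(delta0), sign(delta1), sign(delta)
    if s0 * s1 < 0:
        # Subgroup differences point in opposite directions:
        # treated here as compatible with any overall sign (assumption).
        return False
    # Otherwise the subgroup signs share a common direction (or are both 0),
    # and the overall difference is expected to have that same sign.
    return s != sign(s0 + s1)

combos = list(product((1, 0, -1), repeat=3))
inconsistent = [c for c in combos if is_inconsistent(*c)]
print(len(combos), "sign combinations,", len(inconsistent), "inconsistent")
```

Under this reading, 14 of the 27 sign combinations are inconsistent: the five listed above plus the ones obtained from them by relabelling.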
From the previous section, we can see that
$\Delta=d_1(p_1-p_2)+\Delta_1(1-p_2)+\Delta_0 p_2=d_2(p_1-p_2)+\Delta_1(1-p_1)+\Delta_0 p_1$.
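For readers who want the intermediate steps, one possible derivation from the definitions in the previous section is sketched below. It uses only $a_i=p_i a_{i0}+(1-p_i)a_{i1}$ and the identity $\Delta_0-\Delta_1=d_1-d_2$.

```latex
\begin{align*}
\Delta &= a_1-a_2
        = (a_{11}+p_1 d_1)-(a_{21}+p_2 d_2)
        = \Delta_1+p_1 d_1-p_2 d_2 \\
       &= \Delta_1+d_1(p_1-p_2)+p_2(d_1-d_2)
        = d_1(p_1-p_2)+\Delta_1(1-p_2)+\Delta_0 p_2 , \\
\Delta &= \Delta_1+d_2(p_1-p_2)+p_1(d_1-d_2)
        = d_2(p_1-p_2)+\Delta_1(1-p_1)+\Delta_0 p_1 .
\end{align*}
```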
Consider the following four cases:
(1) The two schools have the same proportion of girls, that is, $p_1=p_2$. The marginal difference is $\Delta=\Delta_0 p_1+\Delta_1(1-p_1)=\Delta_0 p_2+\Delta_1(1-p_2)$, which is a weighted average of the subgroup differences. There is no inconsistency between the subgroup and overall analyses. Scenario A in table 1 is an example of this case.
(2) There is no difference between the average scores of girls and boys in school 1, that is, $d_1=0$. The marginal difference is $\Delta=\Delta_0 p_2+\Delta_1(1-p_2)$, which is a weighted average of the subgroup differences. There is no inconsistency between the subgroup and overall analyses.
(3) There is no difference between the average scores of girls and boys in school 2, that is, $d_2=0$. The marginal difference is $\Delta=\Delta_0 p_1+\Delta_1(1-p_1)$, which is a weighted average of the subgroup differences. There is no inconsistency between the subgroup and overall analyses.
(4) The two schools have different proportions of girls, and the average scores of girls and boys are different within each school, that is, $p_1\ne p_2$, $d_1\ne 0$, $d_2\ne 0$. This is the most general case in practice.
The first three cases indicate that $p_1=p_2$ or $d_1 d_2=0$ is a sufficient condition of consistency between subgroup and overall analyses, as the overall difference is then a convex combination of the subgroup differences.
The following theorem gives a more general sufficient condition of consistency than the first three cases discussed above.
Theorem: given $\Delta_0$ and $\Delta_1$, for any $p_1$ and $p_2$ between 0 and 1, there always exists a $p$ between 0 and 1 such that $\Delta=\Delta_0 p+\Delta_1(1-p)$ if and only if $p_1=p_2$ or $d_1 d_2\le 0$.
The proof of this theorem is available on request. Note that $d_1 d_2=0$ implies $d_1 d_2\le 0$.
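The following is only a brief sketch of why $d_1 d_2\le 0$ is sufficient. It is not the authors' proof and does not cover the converse direction. It uses the relations $\Delta-\Delta_1=p_1 d_1-p_2 d_2$ and $\Delta_0-\Delta_1=d_1-d_2$, both of which follow from the definitions in section 2.

```latex
\begin{align*}
p &= \frac{\Delta-\Delta_1}{\Delta_0-\Delta_1}
   = \frac{p_1 d_1-p_2 d_2}{d_1-d_2}
   = \frac{p_1 |d_1|+p_2 |d_2|}{|d_1|+|d_2|} \in [0,1]
   \quad\text{when } d_1 d_2\le 0 \text{ and } d_1\ne d_2 , \\
\Delta &= \Delta_1+p(\Delta_0-\Delta_1) = \Delta_0 p+\Delta_1(1-p) .
\end{align*}
```

In the remaining case $d_1=d_2=0$, we have $\Delta_0=\Delta_1=\Delta$, so any $p$ between 0 and 1 works.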
Unfortunately, if we are only given the information that $p_1\ne p_2$ and $d_1 d_2>0$, we cannot determine whether the inconsistency will happen. For example, combinations 1 and 3 both satisfy the condition $p_1\ne p_2$ and $d_1 d_2>0$. In combination 1, the overall difference is consistent with the subgroup differences, while in combination 3 it is not.
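A small numerical sketch of both situations is given below. The helper `implied_weight` is a hypothetical function that solves $\Delta=\Delta_0 p+\Delta_1(1-p)$ for $p$, and all scores and proportions are made-up values rather than entries of the tables. Exact fractions are used to avoid rounding issues.

```python
import random
from fractions import Fraction

def implied_weight(a10, a11, a20, a21, p1, p2):
    """Solve Delta = Delta0*p + Delta1*(1-p) for p (None if Delta0 == Delta1)."""
    delta0, delta1 = a10 - a20, a11 - a21
    delta = (p1 * a10 + (1 - p1) * a11) - (p2 * a20 + (1 - p2) * a21)
    if delta0 == delta1:
        return None
    return (delta - delta1) / (delta0 - delta1)

# Hypothetical scores with d1 = 12 and d2 = 10, so d1*d2 > 0.
scores = (87, 75, 84, 74)
w_ok = implied_weight(*scores, Fraction(1, 2), Fraction(11, 20))   # inside [0, 1]
w_bad = implied_weight(*scores, Fraction(2, 5), Fraction(3, 5))    # outside [0, 1]
print("d1*d2 > 0:", w_ok, "and", w_bad)

# Random check of the sufficient condition: d1 >= 0 >= d2 (so d1*d2 <= 0).
random.seed(0)
all_inside = True
for _ in range(1000):
    a11, a21 = random.randint(50, 90), random.randint(50, 90)
    d1, d2 = random.randint(0, 10), -random.randint(0, 10)
    p1 = Fraction(random.randint(0, 100), 100)
    p2 = Fraction(random.randint(0, 100), 100)
    w = implied_weight(a11 + d1, a11, a21 + d2, a21, p1, p2)
    if w is not None and not (0 <= w <= 1):
        all_inside = False
print("weight always in [0, 1] when d1*d2 <= 0:", all_inside)
```

In the first pair of examples the same subgroup averages give a weight inside the unit interval for one choice of proportions and outside it for another, which is why the condition $d_1 d_2>0$ alone is not informative.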
Conclusion and discussion
Many publications of medical studies report the results of primary analysis based on all data and of subgroup analyses (with the same outcome in the primary analysis) based on partial data in the same study.2 Sometimes, the result from the primary analysis may be inconsistent with the subgroup analysis. In this paper, we give a general sufficient condition of consistency between the overall and subgroup analyses. However, examples in table 3 indicate that it is impossible to give a general necessary condition of consistency. We need to check the consistency case by case.
Like the well-known Simpson’s paradox,3–6 the inconsistency between the overall and subgroup analyses seems to be counterintuitive for many people at first sight. Statistically speaking, the overall analysis and subgroup analysis use different parts of the data in the sample. The subgroup analysis uses only a partial sample of the study population, like the subgroup of girls in section 1. If the subgroup is not representative of the whole sample, inconsistency may occur. Both overall and subgroup analyses are valid methods to analyse data. They reveal different aspects of the data. The inconsistency is natural. We should interpret the results separately. It does not make sense to compare the results of the overall and subgroup analyses as they use different data. We can always write the overall analysis and subgroup analysis in the form of conditional expectations.7 However, their conditional parts are different, and the results may be different.
From the example in section 1, we know that if the two schools have the same proportion of girls, inconsistency will not happen. As in a randomised clinical trial, if the students were randomised to the two schools, the proportions of girls would be very similar and inconsistency would not happen in most cases. However, randomisation does not guarantee exact balance in any particular trial, so inconsistency may still happen. On the other hand, if the data are from an observational study,8 inconsistency may happen with high probability.
Another related topic is covariate adjustment in data analysis. For example, suppose the subjects were randomised into two treatment groups (active treatment and control). The primary outcome is binary (success or failure). We can use Pearson’s χ² test to check the difference in success rates between the two groups. The odds ratio (OR) can be used to characterise the association between the treatment and the outcome. If we also have some other covariates, such as the age or gender of the subjects, we may also run a logistic regression using the treatment indicator and the other variables as covariates. Both approaches are valid methods to analyse the data.9 However, the adjusted OR from the logistic regression may be different from the unadjusted one.
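As a rough illustration, the sketch below compares the unadjusted OR with stratum-specific ORs for two age groups. The counts are entirely hypothetical, and the stratum-specific ORs are used here only as a simple stand-in for the covariate-adjusted OR that a logistic regression would estimate. No regression is actually fitted. Both treatment groups are assumed to have the same age mix (100 subjects per stratum).

```python
def odds_ratio(success_trt, failure_trt, success_ctl, failure_ctl):
    """Odds ratio of success for treatment versus control from a 2x2 table."""
    return (success_trt * failure_ctl) / (failure_trt * success_ctl)

# Hypothetical counts: (treatment successes, treatment failures,
#                       control successes,   control failures)
strata = {
    "younger": (80, 20, 50, 50),
    "older": (50, 50, 20, 80),
}

stratum_ors = {name: odds_ratio(*counts) for name, counts in strata.items()}

# Pool the two strata cell by cell to obtain the unadjusted 2x2 table.
pooled = tuple(sum(cell) for cell in zip(*strata.values()))
crude_or = odds_ratio(*pooled)

print("stratum-specific ORs:", stratum_ors)
print("unadjusted OR:", round(crude_or, 2))
```

With these assumed counts, the OR within each age group is 4, while the unadjusted OR from the pooled table is smaller, although the age mix is identical in the two treatment groups.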
Hongyue Wang obtained her BS in Scientific English from the University of Science and Technology of China (USTC) in 1995, and a PhD in Statistics from the University of Rochester, USA, in 2007. She is currently a Research Associate Professor in the Department of Biostatistics and Computational Biology at the University of Rochester Medical Center in USA. Her main research interests include longitudinal data analysis, missing data, survival data analysis, and design and analysis of clinical trials. She has extensive and successful collaboration with investigators from various areas, including Infectious Disease, Nephrology, Neonatology, Cardiology, Neurodevelopmental and Behavioral Science, Radiation Oncology, Pediatric Surgery, and Dentistry. She has published more than 90 statistical methodology and collaborative research papers in peer-reviewed journals.
Contributors HW, CF and XMT: theoretical derivation; BW: manuscript drafting.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Provenance and peer review Commissioned; externally peer reviewed.