Introduction
Consider the following hypothetical example. Suppose the fourth grade students of two schools (1 and 2) in a school district took a state maths exam. The principals of these two schools wanted to know whether the two schools performed differently. They calculated the overall average score, the average score of girls and the average score of boys in each school. The results are presented as scenario B in table 1. After looking at the average scores of girls and of boys, the principal of school 1 was very happy, as both were one point higher than those in school 2. However, after looking at the overall average scores of the two schools, the principal was confused, as the overall average score of school 2 was higher than that of school 1. Was there an error in calculating the overall average scores? What is the reason for the inconsistency between the overall average scores and the average scores stratified by gender?
Before figuring out the reason for the inconsistency, let us take a look at scenario A in table 1. In this scenario, the overall average score, as well as the average scores of girls and of boys, are all higher in school 1 than in school 2. A closer examination shows that the proportions of girls are different in scenarios A and B. In scenario A, 48% of students are girls in both schools. In scenario B, 40% and 60% of students are girls in the two schools, respectively. Is a difference in the proportions of girls sufficient to create this inconsistency? The answer is no. In scenario C in table 1, although the proportions of girls differ between the two schools, the overall average score and the average scores by gender are all higher in school 1.
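The arithmetic behind these scenarios can be sketched in a few lines of Python. The overall average of a school is a weighted average of the girls' and boys' averages, with the proportions of girls and boys as weights. The scores below are hypothetical and are not the values in table 1; only the proportions of girls for the first two cases (48% in both schools, and 40% versus 60%) follow the text, and the proportions used for the third case are also assumed.

```python
def overall_average(girls_avg, boys_avg, prop_girls):
    """Overall average as a weighted average of the two subgroup averages."""
    return prop_girls * girls_avg + (1 - prop_girls) * boys_avg


# Hypothetical subgroup averages (NOT taken from table 1): school 1 is one
# point higher than school 2 for both girls and boys.
school1 = {"girls": 85.0, "boys": 75.0}
school2 = {"girls": 84.0, "boys": 74.0}

# Same proportion of girls (48%) in both schools, as in scenario A.
a1 = overall_average(school1["girls"], school1["boys"], 0.48)
a2 = overall_average(school2["girls"], school2["boys"], 0.48)

# 40% girls in school 1 and 60% in school 2, as in scenario B.
b1 = overall_average(school1["girls"], school1["boys"], 0.40)
b2 = overall_average(school2["girls"], school2["boys"], 0.60)

# Different but closer proportions (assumed 45% and 50%), in the spirit of
# scenario C.
c1 = overall_average(school1["girls"], school1["boys"], 0.45)
c2 = overall_average(school2["girls"], school2["boys"], 0.50)

for label, s1, s2 in [("A", a1, a2), ("B", b1, b2), ("C", c1, c2)]:
    relation = "higher" if s1 > s2 else "lower"
    print(f"Scenario {label}: school 1 overall average is {relation} "
          f"({s1:.2f} vs {s2:.2f})")
```

With these assumed numbers, girls score 10 points higher than boys in both schools, so the overall difference between the schools equals 1 + 10 × (p1 − p2), where p1 and p2 are the proportions of girls. The one-point advantage of school 1 is reversed only when school 2 has a sufficiently larger share of the higher-scoring subgroup, which is why unequal proportions alone do not guarantee an inconsistency.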
The examples in table 1 indicate that the results of an overall analysis and a subgroup analysis may be very different. Now we show what overall analysis and subgroup analysis actually mean.1 Suppose we are interested in the treatment effect of a new drug D. We recruit some subjects and randomise them to two treatment groups (T and C). Subjects in groups T and C are administered drug D and placebo, respectively. After collecting the data, we calculate the average response of these two groups and use appropriate statistical methods (eg, two-sample t-test or Pearson’s χ² test) to compare them. This is called the overall analysis. However, we suspect that the response of a subject may depend on his/her age. We therefore divide the subjects in the study population into several subgroups based on their ages and study the treatment effect within each age group. This kind of analysis is called the subgroup analysis. The subgroup analysis may offer us more information on the treatment effect of the new drug within each specific age group.
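The distinction between the two analyses can be illustrated with a minimal Python sketch. The records, the age-group labels and the helper `mean_difference` are all hypothetical and serve only as an illustration. The sketch computes point estimates (the difference in mean response between groups T and C); in a real study, a formal test such as the two-sample t-test would be applied to the same groupings.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (treatment group, age group, response).
records = [
    ("T", "young", 12.0), ("T", "young", 14.0),
    ("T", "old", 8.0), ("T", "old", 9.0),
    ("C", "young", 11.0), ("C", "young", 12.0),
    ("C", "old", 7.0), ("C", "old", 8.0),
]


def mean_difference(rows):
    """Mean response in group T minus mean response in group C."""
    treated = [resp for group, _, resp in rows if group == "T"]
    control = [resp for group, _, resp in rows if group == "C"]
    return mean(treated) - mean(control)


# Overall analysis: compare T and C using all subjects.
overall_effect = mean_difference(records)

# Subgroup analysis: compare T and C separately within each age group.
by_age = defaultdict(list)
for row in records:
    by_age[row[1]].append(row)
subgroup_effects = {age: mean_difference(rows) for age, rows in by_age.items()}

print("overall:", overall_effect)
print("by age group:", subgroup_effects)
```

In this balanced toy data set, each age group contributes equally to both treatment groups, so the overall and subgroup estimates point in the same direction.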
The results in table 1 suggest, by analogy, that even if the new drug turns out to be better than the placebo in every subgroup, the overall response in the placebo group may still be better than that in the new drug group. In this paper, we study the reason for this counterintuitive phenomenon. The paper is organised as follows. Section 2 introduces the notation. Section 3 gives a very general sufficient condition for consistency. Section 4 offers some practical guidance on dealing with inconsistency in real studies.