James Proudfoot obtained a master’s degree in statistics from the University of British Columbia. He is currently in the Biostatistical and Epidemiology Research Division at UC San Diego, CA. His work involves preliminary assessment of manuscripts, biostatistical consulting, and research on the application of statistical methods in psychiatric studies.

For moderate to large sample sizes, all tests yielded p values close to the nominal level, except when models were misspecified. The signed-rank test generally had the lowest power. Within the current context of count outcomes, the signed-rank test shows subpar power compared with tests based on the full data, such as the GEE. Parametric models for count outcomes, such as the GLMM with a Poisson model for the marginal counts, are quite sensitive to departures from the assumed parametric model. There is some small bias for all the asymptotic tests, that is, the signed-rank test, GLMM and GEE, especially for small sample sizes. Resampling methods such as permutation can help alleviate this.

Although not as popular as continuous and binary variables, count outcomes arise quite often in clinical research. For example, number of hospitalisations, number of suicide attempts, number of heavy drinking days and number of packs of cigarettes smoked per day are all popular count outcomes in mental health research. Studies yielding paired outcomes are also popular. For example, to evaluate new eye-drops, we can treat one eye of a subject with the new eye-drops and the other eye with a placebo drop. To evaluate skin cancer for truck drivers, we can compare skin cancer on the left arm with the right arm, since the left arm is more exposed to sunlight. To evaluate the stress of combat on Veterans’ health, we may use twins in which one is exposed to combat and the other is not, as differences observed with respect to health are likely attributable to combat experience. In a pre-post study, the effect of an intervention is evaluated by comparing a subject’s outcomes before (pre) and after (post) receiving the intervention. In all these studies, each unit of analysis has two outcomes arising from two different conditions. Interest is centred on the difference between the means of the two outcomes.

For continuous outcomes, the paired t-test is the standard statistical method for evaluating differences between the means. However, the paired t-test does not apply to non-continuous variables such as binary and count (frequency) outcomes. For binary outcomes, McNemar’s test is the standard. For count or frequency outcomes, there is not much discussion in the literature. Many use Wilcoxon’s signed-rank test because this method is applicable to paired non-continuous outcomes such as count responses. One major weakness of the signed-rank test is its limited power. As observations are converted to ranks and only ranks are used in the test statistic, the signed-rank test does not use all available information in the original data, leading to lower power when compared with tests that use all data. This is why t-tests are preferred and widely used to compare two independent groups for continuous outcomes.
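As a quick illustration of the two approaches on paired counts, the sketch below applies both the signed-rank test and the paired t-test to made-up data (a hypothetical pre-post study of the number of heavy drinking days); the data and sample size are illustrative assumptions, not from any study discussed here.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical paired counts: number of heavy drinking days
# before (pre) and after (post) an intervention for 12 subjects.
pre  = np.array([5, 8, 3, 6, 9, 4, 7, 2, 6, 5, 8, 3])
post = np.array([3, 6, 4, 4, 7, 3, 5, 2, 4, 4, 6, 2])

# Signed-rank test: uses only the ranks of the paired differences,
# discarding their magnitudes.
w_stat, w_p = wilcoxon(pre, post)

# Paired t-test: uses the full magnitudes of the differences (shown
# for comparison; its normality assumption is questionable for counts,
# as discussed in the text).
t_stat, t_p = ttest_rel(pre, post)

print(f"signed-rank: W={w_stat:.1f}, p={w_p:.4f}")
print(f"paired t:    t={t_stat:.2f}, p={t_p:.4f}")
```

Because the signed-rank statistic is built only from ranks, two datasets with very different difference magnitudes can yield the same test statistic, which is the intuition behind its lower power.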

With recent advances in statistical methodology, there are more options for comparing paired count responses. In this paper, we discuss some alternative procedures that use all information in the original data and thus generally provide more power than the signed-rank test. In the second section, we first provide a brief review of paired outcomes and methods for comparing continuous and binary paired outcomes. We then discuss the classic signed-rank test and modern alternatives for comparing paired count outcomes. In the third section, we compare different methods for comparing paired count outcomes using simulation studies. In the fourth section, we present our concluding remarks.

Consider a sample of

For a continuous outcome, the paired t-test is generally applied to evaluating differences in the means of the paired outcomes. If we assume that

where

In practice, the bivariate normal distribution assumption for the paired outcomes

In what follows, we assume large samples, since all the tests to be discussed are asymptotic tests; that is, they only approximately follow a mathematical distribution, such as the normal distribution, for large samples. For small to moderate samples, these tests have unknown distributions, and their asymptotic distributions, such as the standard normal for the paired t-test, may not work well. We discuss alternatives for small to moderate samples in the discussion section.

If the paired outcomes are binary, the above hypothesis becomes the comparison of the proportions of

Then the hypothesis to be tested is given by:

McNemar’s test is premised on the idea of comparing concordant and discordant pairs in the sample.

Shown in

A 2×2 contingency table displaying the joint distribution of paired binary outcomes, with a, b, c and d denoting cell counts

  | 0 | 1
0 | a | b
1 | c | d

Then,

Similarly,

Thus,

McNemar’s test evaluates the difference between the concordant and discordant pairs,

A large difference leads to rejection of the null. By normalising this difference, the statistic
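As a concrete illustration with made-up cell counts, McNemar’s test can be computed from the 2×2 table using statsmodels; only the discordant cells b and c enter the statistic.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical 2x2 table of paired binary outcomes, laid out as in
# the table above: rows = first outcome (0/1), columns = second (0/1).
table = np.array([[30, 15],   # a, b
                  [5,  50]])  # c, d

# Asymptotic (chi-square) version of McNemar's test; the concordant
# cells a=30 and d=50 do not enter the statistic at all.
result = mcnemar(table, exact=False, correction=False)
print(f"statistic={result.statistic:.2f}, p={result.pvalue:.4f}")
# Hand computation: (b - c)^2 / (b + c) = (15 - 5)^2 / 20 = 5.0
```

For small numbers of discordant pairs, `exact=True` replaces the chi-square approximation with an exact binomial calculation.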

For count outcomes, McNemar’s test clearly does not apply. The paired t-test is also inappropriate for such outcomes. First, the difference

One approach that has been used to compare paired count outcomes is the Wilcoxon signed-rank test. Within our context, let

where

The statistic

where

has approximately a normal distribution with mean

Since paired outcomes are a special case of general longitudinal outcomes, longitudinal methods can be applied to test the null. For example, both the generalised linear mixed-effects model (GLMM) and generalised estimating equations (GEE), the two most popular longitudinal models, can be specialised to the current setting. When applying the GLMM, we specify the following model:

where

Note that since the random effect

For applying GEE, we only need to specify the mean of each paired outcome. This is because unlike GLMM, GEE is a ‘semi-parametric’ model and imposes no mathematical distribution on the outcomes. Thus, under GEE, both the Poisson distribution for each outcome and the random effect

Since there is no random effect in Equation (5), the log transformation is also not necessary and thus the GEE can be specified simply as:

Compared with the GLMM in Equation (4), the GEE above imposes no mathematical distribution either jointly or marginally, allowing for valid inference for a broad class of data distributions. The GLMM in Equation (4) may yield biased inference if: (1) at least one of the outcomes does not follow the Poisson; (2) the random effect

In this section, we evaluate and compare the performances of the different methods discussed above by simulation. All simulations are performed with a Monte Carlo (MC) sample of

If a test performs correctly, it should yield type I error rates at the specified nominal level
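The MC evaluation of type I error rates can be sketched as follows: repeatedly simulate paired counts under the null and record how often a test rejects at the 5% level. The shared gamma subject effect, sample size, mean and MC size below are illustrative assumptions, not the paper’s settings.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(2023)

# Monte Carlo estimate of the type I error rate of the paired t-test
# applied to paired counts under the null (equal means). The shared
# gamma subject effect induces within-pair correlation.
n_mc, n, mu, alpha = 2000, 50, 2.0, 0.05
rejections = 0
for _ in range(n_mc):
    u = rng.gamma(shape=2.0, scale=0.5, size=n)   # subject effect, mean 1
    y1 = rng.poisson(u * mu)                      # both conditions share
    y2 = rng.poisson(u * mu)                      # the same mean: null true
    if ttest_rel(y1, y2).pvalue < alpha:
        rejections += 1

print(f"empirical type I error: {rejections / n_mc:.3f}")
```

Swapping `ttest_rel` for any other test of interest gives the corresponding empirical type I error rate under the same data-generating mechanism.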

To evaluate the effects of model assumptions on test performance, we simulate correlated count responses

The above model deviates from the GLMM in Equation (4) in two ways. First,

For the simulation study, we set

Shown in

Averaged p values from testing the null of no difference between paired outcomes by different methods over

Sample size | Paired t-test | Signed-rank test | GLMM (Poisson) | GLMM (NB) | GEE | GEE (log-link)
Dispersion parameter | | | | | |
n=10 | 0.042 | 0.042 | 0.380 | 0.154 | 0.089 | 0.136
n=25 | 0.043 | 0.050 | 0.371 | 0.092 | 0.064 | 0.076
n=50 | 0.045 | 0.050 | 0.295 | 0.069 | 0.056 | 0.065
n=100 | 0.049 | 0.050 | 0.268 | 0.058 | 0.052 | 0.056
n=200 | 0.052 | 0.060 | 0.284 | 0.059 | 0.054 | 0.057
Dispersion parameter | | | | | |
n=10 | 0.046 | 0.035 | 0.068 | 0.054 | 0.094 | 0.101
n=25 | 0.051 | 0.050 | 0.054 | 0.051 | 0.065 | 0.070
n=50 | 0.051 | 0.046 | 0.059 | 0.058 | 0.059 | 0.062
n=100 | 0.046 | 0.040 | 0.054 | 0.050 | 0.051 | 0.052
n=200 | 0.046 | 0.049 | 0.050 | 0.049 | 0.047 | 0.049

GEE, generalised estimating equation; GLMM, generalised linear mixed-effects model; MC, Monte Carlo; NB, negative binomial.

Although the paired t-test is not, strictly speaking, a valid test for count outcomes, it performed well for all sample sizes considered, though it showed a small downward bias, especially for small sample sizes. For extremely small sample sizes such as n=10, all three asymptotically valid methods, the signed-rank test, GLMM (NB) and GEE, showed small upward bias, especially when

If a group of tests all provide good type I error rates, we can further compare them for power. It is common that two unbiased tests may provide different power, because they may use a different amount of information from study data or use the same information differently. For example, within the current study, the signed-rank test may provide less power than the GEE, because the former only uses the ranks of the original count outcomes, completely ignoring magnitudes of

We again use the MC approach to compare power across the different methods. However, unlike the evaluation of bias, we must also be specific about the difference in the means of paired outcomes so that we can simulate the outcomes under the alternative hypothesis. For this study, we specify the null and alternative as follows:

We simulate correlated outcomes

For each simulated outcome

Shown in

Power estimates from testing the null of no difference between paired outcomes by different methods over

Sample size | Paired t-test | Signed-rank test | GLMM (Poisson) | GLMM (NB) | GEE | GEE (log-link)
Dispersion parameter | | | | | |
n=10 | 0.057 | 0.060 | 0.406 | 0.194 | 0.120 | 0.178
n=25 | 0.102 | 0.100 | 0.495 | 0.151 | 0.132 | 0.159
n=50 | 0.190 | 0.188 | 0.555 | 0.214 | 0.209 | 0.227
n=100 | 0.344 | 0.310 | 0.718 | 0.344 | 0.360 | 0.373
n=200 | 0.599 | 0.555 | 0.897 | 0.583 | 0.607 | 0.611
Dispersion parameter | | | | | |
n=10 | 0.119 | 0.104 | 0.172 | 0.161 | 0.205 | 0.222
n=25 | 0.266 | 0.260 | 0.333 | 0.320 | 0.321 | 0.331
n=50 | 0.506 | 0.490 | 0.559 | 0.546 | 0.535 | 0.539
n=100 | 0.834 | 0.818 | 0.861 | 0.858 | 0.842 | 0.842
n=200 | 0.981 | 0.980 | 0.988 | 0.987 | 0.983 | 0.983

GEE, generalised estimating equation; GLMM, generalised linear mixed-effects model; MC, Monte Carlo; NB, negative binomial.

Power for each method under different alternative hypotheses. Data are generated with larger dispersion (ie,

Power for each method under different alternative hypotheses. Data are generated with smaller dispersion (ie,

In this report, we discussed several methods for testing differences in paired count outcomes. Unlike paired continuous and binary outcomes, the analysis of paired count outcomes has received less attention in the literature. Although the signed-rank test is often used, it is not an optimal test: because it uses ranks rather than the original count outcomes (differences between paired counts), information is lost, leading to reduced power. Thus, unless study data depart severely from the normal distribution, the signed-rank test is generally not used for comparing paired continuous outcomes, as the paired t-test is more powerful. Within the current context of count outcomes, the signed-rank test again shows subpar power compared with tests based on the full data, such as the GEE.

The simulation study in this report also shows that parametric models for count outcomes such as the GLMM with a Poisson for marginal count outcomes are quite sensitive to departures from assumed parametric models. As expected, semiparametric models like the GEE provide better performance. Also, the paired t-test seems to perform quite well. This is not really surprising, since within the current context the GEE and paired t-test are essentially the same, except that the former relies on the asymptotic normal distribution for inference, while the latter uses the t distribution for inference. As the sample size grows, the t becomes closer to the standard normal distribution. Thus, p values and power estimates are only slightly different between the two for small to moderate samples.

The simulation results also show some small bias for all the asymptotic tests, that is, the signed-rank test, GLMM and GEE, especially for small sample sizes. In most clinical studies, sample sizes are relatively large and this limitation has no significant impact. For studies with small samples, such as those in bench sciences, bias in type I error rates may be high and require attention. One popular statistical approach is to use resampling methods such as permutation.
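A paired (sign-flip) permutation test is straightforward to implement: under the null of no difference, each within-pair difference is equally likely to be positive or negative, so the observed mean difference can be referred to its sign-flip distribution. A minimal sketch with made-up paired counts follows; the data and number of permutations are illustrative assumptions.

```python
import numpy as np

def paired_permutation_pvalue(y1, y2, n_perm=10000, seed=0):
    """Sign-flip permutation test for paired outcomes.

    Compares the observed absolute mean difference with its
    distribution under random sign flips of the paired differences.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float)
    observed = abs(d.mean())
    # Each row of `signs` is one random assignment of +/-1 per pair.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    perm_means = np.abs((signs * d).mean(axis=1))
    # Add 1 to numerator and denominator so the p value is never zero.
    return (1 + np.sum(perm_means >= observed)) / (n_perm + 1)

# Hypothetical paired counts for 8 subjects:
y1 = [5, 8, 3, 6, 9, 4, 7, 2]
y2 = [3, 6, 4, 4, 7, 3, 5, 2]
print(paired_permutation_pvalue(y1, y2))
```

Because the reference distribution is generated from the data themselves, this approach does not rely on large-sample approximations and is attractive for the small-sample settings mentioned above.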

JAP directed all simulation studies, ran some of the simulation examples and helped edit and finalise the manuscript. TL helped run some of the simulation examples and drafted some parts of the manuscript. BW helped check some of the simulation study results and draft part of the simulation results. XMT helped draft and finalise the manuscript.

The report was partially supported by the National Institutes of Health, Grant UL1TR001442 of CTSA funding.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

None declared.

Not required.

Not commissioned; externally peer reviewed.

No additional data are available.