Introduction
In 2016, Ronald Wasserstein and Nicole Lazar1 jointly released a statement on behalf of the American Statistical Association warning against the misuse and misinterpretation of statistical significance and p values in scientific research. The Statement offers six principles for using p values. Controversies around p values have in fact surfaced from time to time in the statistical community and in other fields. For example, in 2015, the editors of Basic and Applied Social Psychology decided to ban p values from papers published in that journal. A recent paper in Nature again raised the issue of the use of p values in scientific discoveries.2 Although the p value has a history almost as long as that of modern statistics and has been used in millions of scientific publications, ironically it has never been rigorously defined in the statistical literature. This means that all reported p values in publications are based on rather intuitive interpretations. This is one possible reason why the p value has given rise to so many misuses and misconceptions. A formal discussion of the rigorous definition of a p value is beyond the scope of this paper. Our discussion follows the current tradition of interpreting p values.
Suppose X1, X2, …, Xn is a random sample from some probability distribution and T=T(X1, X2, …, Xn) is a statistic used to test some null hypothesis. Suppose the observed data are x1, x2, …, xn. Then the calculated value of the test statistic is T(x1, x2, …, xn). Informally speaking, the p value is the probability, under the null hypothesis, that T(X1, X2, …, Xn), as a random variable, takes values at least as ‘extreme’ as the currently observed value T(x1, x2, …, xn). Suppose a larger value means more ‘extreme’. Then the p value (given the data) is the probability that T(X1, X2, …, Xn)≥T(x1, x2, …, xn), that is,
p(x1, x2, …, xn) = P{T(X1, X2, …, Xn) ≥ T(x1, x2, …, xn)}.   (1)
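As a small illustration of equation (1), the sketch below computes an upper-tail p value exactly in a case where the null distribution of T is known. The setting is an assumption made only for this example and is not taken from the paper: T is the number of successes in n independent trials, and under the null hypothesis T follows a binomial distribution with success probability p0. The function name `upper_tail_p_value` is hypothetical.

```python
from math import comb


def upper_tail_p_value(t_obs: int, n: int, p0: float = 0.5) -> float:
    """Return P{T >= t_obs} when T ~ Binomial(n, p0) under the null hypothesis.

    Illustrative assumption: larger values of T count as more 'extreme',
    as in equation (1).
    """
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(t_obs, n + 1)
    )


# Example: 8 successes observed in 10 trials, null success probability 0.5.
p = upper_tail_p_value(8, 10)
print(p)  # (45 + 10 + 1) / 1024 = 0.0546875
```

Changing the observed value changes the result, which is the sense in which the p value is a function of the data.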
To calculate the p value, we need to know the distribution of T under the null hypothesis. For example, the two-sample t-test is widely used to test the hypothesis that two independent samples have the same mean. If the two samples are normally distributed with the same variance, the test statistic has a central t-distribution under the null hypothesis. If the variances are not the same, the distribution of the test statistic does not have a closed form. This is the so-called Fisher’s two-sample t-test problem.3 However, if the sample sizes in the two groups are large enough, the distribution of the test statistic can be safely approximated by a normal distribution (due to the central limit theorem).
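The large-sample case can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a two-sided alternative, uses the unpooled standard error, and replaces the exact null distribution with the standard normal, which is only reasonable when both samples are large. The function name `large_sample_two_sample_test` and the simulated data are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def large_sample_two_sample_test(x, y):
    """Return (z, p) for H0: equal means, using the normal approximation.

    Assumptions for this sketch: independent samples, both large enough
    for the central limit theorem, two-sided alternative.
    """
    # Unpooled standard error, so equal variances are not required.
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    z = (mean(x) - mean(y)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Simulated example data with unequal variances (illustrative only).
rng = random.Random(0)
group_a = [rng.gauss(0.0, 1.0) for _ in range(200)]
group_b = [rng.gauss(0.3, 2.0) for _ in range(200)]

z, p = large_sample_two_sample_test(group_a, group_b)
print(z, p)
```

For small samples this approximation should not be relied on; the t-distribution (equal variances) or an approximate method for unequal variances would be needed instead.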
In the case of discrete outcome variables, for example the treatment outcome (success or failure) in a randomised clinical trial, the result is often presented as a contingency table (see table 1).
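where $S_i$ in group $i$ ($i = 1, 2$) follows the binomial distribution with sample size $n_i$ and probability of success $p_i$. The hypothesis that is usually of interest is
$$H_0: p_1 = p_2.$$
Pearson’s χ2 test is one way of measuring departure from $H_0$. It is conducted by calculating an expected frequency table (see table 2). This table is constructed by conditioning on the marginal totals and filling in the table according to $H_0$, that is, the expected numbers of successes and failures in group $i$ are
$$E_{i1} = \frac{n_i (S_1 + S_2)}{n_1 + n_2}, \qquad E_{i2} = \frac{n_i (n_1 + n_2 - S_1 - S_2)}{n_1 + n_2}, \qquad i = 1, 2.$$
Using this expected frequency table, a statistic is calculated by going through the cells of the tables and computing
$$Q = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},$$
where $O_{i1} = S_i$ and $O_{i2} = n_i - S_i$ are the observed counts in table 1.
Note that $Q$ follows the χ2 distribution asymptotically under $H_0$. Thus, the p value yielded by this test is
$$\text{p value} = P(X \geq Q),$$
where $X$ follows the χ2 distribution with 1 df.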
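The Pearson calculation for a 2×2 table can be sketched in a few lines. The code below is an illustration under stated assumptions, not the paper's implementation: the table is given as successes and sample sizes in two groups, no continuity correction is applied, and the tail probability of the χ2 distribution with 1 df is obtained from the identity P(X ≥ q) = erfc(√(q/2)), which holds because such a variable is the square of a standard normal. The function name `pearson_chi2_2x2` and the counts are hypothetical.

```python
from math import erfc, sqrt


def pearson_chi2_2x2(s1: int, n1: int, s2: int, n2: int):
    """Return (statistic, p value) of Pearson's chi-square test, 2x2 table.

    Rows are the two groups, columns are (success, failure).
    Assumes every marginal total is positive, so no expected count is zero.
    """
    observed = [[s1, n1 - s1], [s2, n2 - s2]]
    row_totals = [n1, n2]
    col_totals = [s1 + s2, (n1 - s1) + (n2 - s2)]
    total = n1 + n2

    q = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count given the margins, under equal success rates.
            expected = row_totals[i] * col_totals[j] / total
            q += (observed[i][j] - expected) ** 2 / expected

    # Upper tail of the chi-square distribution with 1 df.
    p = erfc(sqrt(q / 2))
    return q, p


# Hypothetical counts: 15/20 successes in group 1, 8/20 in group 2.
q, p = pearson_chi2_2x2(15, 20, 8, 20)
print(q, p)
```

Because the χ2 distribution is only an asymptotic approximation here, the resulting p value is itself approximate, and more so when expected counts are small.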
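Fisher’s exact test is preferred over Pearson’s χ2 test when a cell count in table 1 is very small. Consider the testing of the hypothesis about the success probabilities $p_1$ and $p_2$:
$$H_0: p_1 = p_2 \quad \text{versus} \quad H_1: p_1 > p_2.$$
One can show that $S = S_1 + S_2$ is a sufficient statistic under $H_0$. Given the value $S = s$, it is reasonable to use $S_1$ as a test statistic and reject $H_0$ in favour of $H_1$ for large values of $S_1$, because large values of $S_1$ correspond to small values of $S_2 = s - S_1$. The conditional distribution of $S_1$ given $S = s$ is hypergeometric $(n_1 + n_2, n_1, s)$, where $n_i$ is the sample size in group $i$. Thus, the conditional p value is
$$p(s_1, s_2) = \sum_{j = s_1}^{\min(n_1, s)} f(j \mid s),$$
where $f(j \mid s) = \binom{n_1}{j} \binom{n_2}{s - j} \big/ \binom{n_1 + n_2}{s}$ is the probability mass function of the hypergeometric $(n_1 + n_2, n_1, s)$ distribution.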
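The conditional p value of Fisher’s exact test can be computed directly from hypergeometric probabilities. The sketch below assumes the one-sided alternative in which group 1 has the larger success probability, and takes as input the successes and sample sizes of the two groups. The function name `fisher_one_sided_p` and the example counts are hypothetical.

```python
from math import comb


def fisher_one_sided_p(s1: int, n1: int, s2: int, n2: int) -> float:
    """Conditional p value P(S1 >= s1 | S1 + S2 = s) under equal success rates.

    Given the total number of successes s, S1 is hypergeometric:
    P(S1 = j | s) = C(n1, j) * C(n2, s - j) / C(n1 + n2, s).
    """
    s = s1 + s2
    denom = comb(n1 + n2, s)
    upper = min(n1, s)
    # comb(n2, s - j) is 0 whenever s - j > n2, so impossible tables drop out.
    numer = sum(comb(n1, j) * comb(n2, s - j) for j in range(s1, upper + 1))
    return numer / denom


# Hypothetical counts: 3/4 successes in group 1, 1/4 in group 2.
p = fisher_one_sided_p(3, 4, 1, 4)
print(p)  # (16 + 1) / 70
```

Since S1 can take only finitely many values once the total is fixed, this p value can take only finitely many values as well.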
It is well known that, because of randomness, repeating the same experiment with the same design may give different results. From equation (1) we can see that the p value depends explicitly on the observed data x1, x2, …, xn. Hence, the p value is itself a random variable taking values in [0,1]. In this paper we study the behaviour of the p value. Our results show that, under some conditions, the distribution of the p value may be irregular, which makes results based on the p value uninterpretable.
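The randomness of the p value can be seen in a small simulation. The sketch below is illustrative only and does not reproduce the paper's results: it repeats the same two-group experiment many times with both groups drawn from the same normal distribution (so the null hypothesis holds), and computes a two-sided large-sample p value each time. The sample sizes, the seed and the function names are assumptions made for this example.

```python
import random
from statistics import NormalDist, mean, variance


def z_test_p_value(x, y):
    """Two-sided large-sample p value for equal means (normal approximation)."""
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    z = (mean(x) - mean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_p_values(n_experiments=2000, n_per_group=50, seed=1):
    """Repeat the same experiment under the null and collect the p values."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_experiments):
        x = [rng.gauss(0, 1) for _ in range(n_per_group)]
        y = [rng.gauss(0, 1) for _ in range(n_per_group)]
        p_values.append(z_test_p_value(x, y))
    return p_values


p_values = simulate_p_values()
print(min(p_values), max(p_values))

# Share of experiments with p < 0.05 although the null hypothesis is true.
fraction_below = sum(p < 0.05 for p in p_values) / len(p_values)
print(fraction_below)
```

In this continuous, well-behaved setting the p values are spread over the whole interval and roughly 5% of them fall below 0.05; the identical design produces a different p value on every repetition.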