Biostatistical methods in psychiatry

Homoscedasticity: an overlooked critical assumption for linear regression

Abstract

Linear regression is widely used in biomedical and psychosocial research. A critical assumption that is often overlooked is homoscedasticity. Unlike normality, the other distributional assumption, homoscedasticity is often taken for granted when fitting linear regression models. However, contrary to popular belief, this assumption actually has a bigger impact on the validity of linear regression results than normality. In this report, we use Monte Carlo simulation studies to investigate and compare the effects of violating each of the two assumptions on the validity of inference.

Introduction

Linear regression (LR) is arguably the most popular statistical model used to facilitate biomedical and psychosocial research. LR can be used to examine relationships between continuous variables, as well as associations between a continuous and a categorical variable. For example, with one binary independent variable, LR can be used to compare the means of two groups, akin to the two independent samples t-test. With a multilevel categorical independent variable, LR yields the analysis of variance (ANOVA) model. Although the t-test for unequal group variances is often used as an alternative for comparing two group means when large differences in group variances emerge, the same homoscedasticity assumption underlying ANOVA is often taken for granted when this classic model is applied to compare more than two groups. For ANOVA, much of the focus is centred on normality, with little attention paid to homoscedasticity.

Contrary to popular belief, the homoscedasticity assumption actually plays a more critical role than normality in the validity of ANOVA. This is because the F-test, which tests for overall differences in group means across all groups (the omnibus test), is more sensitive to heteroscedasticity than to non-normality. Thus, even when data are perfectly normal, the F-test will generally yield incorrect results if large differences in group variances exist. Although the Kruskal-Wallis (KW) test is applied when homoscedasticity is deemed suspicious,1 this test is less powerful than the F-test, since it discretises the original data using ranks, ie, a sequence of natural numbers such as 1, 2 and 3 that represents only ordinal differences in the original continuous outcomes. An even more serious problem with the KW test is the extremely complex distribution of its test statistic and, consequently, its limited applications in practice.2

Over the past 30 years, many new statistical methods have been developed to address the aforementioned limitations of the classic LR and its associated alternatives. Such new models apply to both cross-sectional and longitudinal data, the latter being the hallmark of modern clinical research. Semiparametric statistical models are the most popular, since they require neither of the two distributional assumptions and apply to continuous outcomes without changing the continuous scale.3 In this report, we use a Monte Carlo simulation study to investigate and compare results when each of the two assumptions is violated, and to show the importance of homoscedasticity for valid inference with LR. We will discuss and perform a head-to-head comparison of power between the classic KW test and modern semiparametric models in a future article.

LR model

We start with a brief overview of the classic LR. Consider a continuous outcome of interest, Y, and a set of p independent variables, $x_1, x_2, \ldots, x_p$. We are interested in modelling the relationship of Y with the independent variables. Given a sample of n subjects, the classic LR models this relationship as:

$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad 1 \le i \le n \qquad (1) $$

where $i$ indexes the subjects, $\beta_0, \beta_1, \ldots, \beta_p$ are the regression coefficients (parameters), $\epsilon_i$ is the error term, and $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. The LR in equation (1) posits a linear association between the outcome (dependent variable) Y and each of the independent variables. The latter have been called different names, such as predictors, covariates and explanatory variables.
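As a concrete illustration, below is a minimal simulation sketch of the model in equation (1), assuming Python with numpy; the coefficient values, sample size and number of covariates are hypothetical choices for illustration only. An ordinary least squares fit recovers the regression coefficients.

```python
# A minimal sketch of the LR model in equation (1), with assumed
# (hypothetical) coefficients: beta_0 = 1, beta_1 = 0.5, beta_2 = -0.3.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))                  # independent variables x_i1, x_i2
beta = np.array([1.0, 0.5, -0.3])            # (beta_0, beta_1, beta_2)
eps = rng.normal(0.0, 1.0, size=n)           # error term: N(0, sigma^2), sigma = 1
y = beta[0] + X @ beta[1:] + eps             # outcome from equation (1)

# Ordinary least squares: prepend a column of ones for the intercept beta_0
X1 = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.lstsq(X1, y, rcond=None)[0]
print(beta_hat)                              # close to (1.0, 0.5, -0.3)
```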

The first part of LR,

$$ E(y_i \mid x_{i1}, \ldots, x_{ip}) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \qquad (2) $$

is called the conditional (population) mean of $y_i$ given the independent variables $x_{i1}, \ldots, x_{ip}$. On estimating the regression coefficients, this conditional mean can be calculated to provide an estimate of $y_i$ for each subject. In addition to the assumed linear relationship, there are two additional assumptions in equation (1): (A) normal distribution of the error term and (B) homoscedasticity, that is, constant variance $\sigma^2$ across all subjects.

All three assumptions play an important role in obtaining valid inference for the regression coefficients. For example, if the association of Y with a particular independent variable $x_{ij}$ is quadratic, the linear model in equation (1) must also include $x_{ij}^2$, since otherwise estimates of the regression coefficients will generally be biased. Likewise, if the error term $\epsilon_i$ is not normally distributed, SEs of estimated coefficients may be incorrect. Both linearity and normality have received extensive coverage in the literature.

In contrast, the impact of homoscedasticity on statistical inference about regression coefficients has received much less attention. Most publications in the biomedical and psychosocial literature do not even acknowledge this assumption in their applications of LR. Contrary to popular belief, inference about regression coefficients is actually more sensitive to departures from homoscedasticity than from normality. In fact, normality does not matter at all when the sample size is relatively large, whereas homoscedasticity remains an issue regardless of how large the sample size becomes. Below we illustrate these facts using Monte Carlo (MC) simulated data. For ease of exposition, we focus on the one-way ANOVA, but the same conclusions apply to the general LR as well.

ANOVA model

One particularly popular special case of LR is the ANOVA model. This occurs when the independent variables $x_{i1}, \ldots, x_{ip}$ are binary indicators representing the different levels of a categorical or ordinal variable for multiple groups. The conditional mean of Y in equation (2) then becomes the group mean. For example, if there are three groups, we may use group 1 as the referent and two independent variables, $x_{i1}$ and $x_{i2}$, to represent groups 2 and 3:

$$ x_{i1} = \begin{cases} 1 & \text{if subject } i \text{ is in group 2} \\ 0 & \text{otherwise} \end{cases} \qquad x_{i2} = \begin{cases} 1 & \text{if subject } i \text{ is in group 3} \\ 0 & \text{otherwise} \end{cases} $$

In this case, the LR in equation (1) becomes:

$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad 1 \le i \le n $$

Thus, the regression coefficients determine the group means of Y:

$$ \mu_1 = \beta_0, \qquad \mu_2 = \beta_0 + \beta_1, \qquad \mu_3 = \beta_0 + \beta_2 $$

where $\mu_k$ denotes the mean of Y for group k ($1 \le k \le 3$).
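These identities are easy to verify numerically. The following sketch, assuming numpy and made-up group means and group size, fits the dummy-coded LR by least squares and recovers the three sample group means:

```python
# Dummy-coded LR for three groups; group 1 is the referent.
# Group means (0, 1, 2) and group size 50 are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 50
y = np.concatenate([rng.normal(m, 1.0, n) for m in (0.0, 1.0, 2.0)])
group = np.repeat([1, 2, 3], n)

# Indicators x_i1 (group 2) and x_i2 (group 3), plus a column of ones
X = np.column_stack([np.ones(3 * n), group == 2, group == 3]).astype(float)
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

print(b0, b0 + b1, b0 + b2)                       # estimates of mu_1, mu_2, mu_3
print([y[group == k].mean() for k in (1, 2, 3)])  # matching sample group means
```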

Because of this relationship between the coefficients and the group means, the ANOVA model is often simply expressed as:

$$ y_{ki} = \mu_k + \epsilon_{ki}, \quad \epsilon_{ki} \sim N(0, \sigma^2), \quad 1 \le i \le n_k, \quad 1 \le k \le K \qquad (3) $$

where $y_{ki}$ denotes the outcome from the ith subject within the kth group, $\mu_k$ is the (population) mean of the kth group, and K is the total number of groups. For the one-way ANOVA in equation (3), the linearity assumption does not apply, as $\mu_k$ represents the group mean and no linear (or any other) relationship is assumed between the group means. The normality and homoscedasticity assumptions also become easier to interpret and check, as they apply to the distributions of $y_{ki}$ within each group.

Under ANOVA, comparisons of group means across all groups are readily expressed by a null, $H_0$, and an alternative, $H_a$, hypothesis as:

$$ H_0: \mu_1 = \mu_2 = \cdots = \mu_K \quad \text{vs} \quad H_a: \mu_j \ne \mu_k \ \text{for some}\ 1 \le j < k \le K \qquad (4) $$

Under the null hypothesis $H_0$, all groups have the same mean. If $H_0$ is rejected, post hoc analyses are performed to determine which groups have different means. We focus on the hypothesis in equation (4) for the overall group difference below, but the same conclusions apply to post hoc pairwise group comparisons as well.

ANOVA uses the F-test for testing the null hypothesis of no group difference in equation (4). This omnibus test is defined by elements of the so-called ANOVA table:

| Source | df | Sum of squares | Mean square | F statistic |
| --- | --- | --- | --- | --- |
| Regression (between groups) | K−1 | SS(R) | MS(R) = SS(R)/(K−1) | MS(R)/MS(E) |
| Error (within groups) | N−K | SS(E) | MS(E) = SS(E)/(N−K) | |
| Total | N−1 | SS(Total) | | |

In the above table,

$$ SS(R) = \sum_{k=1}^{K} n_k \left(\bar{y}_{k\cdot} - \bar{y}_{\cdot\cdot}\right)^2, \quad SS(E) = \sum_{k=1}^{K}\sum_{i=1}^{n_k} \left(y_{ki} - \bar{y}_{k\cdot}\right)^2, \quad SS(\mathrm{Total}) = \sum_{k=1}^{K}\sum_{i=1}^{n_k} \left(y_{ki} - \bar{y}_{\cdot\cdot}\right)^2 $$

where $\bar{y}_{k\cdot}$ is the sample mean of group k, $\bar{y}_{\cdot\cdot}$ is the sample mean of the entire sample (grand mean), N is the total sample size across all groups, $\sum_{i=1}^{n_k}$ denotes the sum over all subjects within group k, and $\sum_{k=1}^{K}\sum_{i=1}^{n_k}$ denotes the sum over both indices k and i. The terms SS(R), SS(E) and SS(Total) are called the regression, error and total sum of squares, respectively, and MS(R) and MS(E), obtained by dividing SS(R) and SS(E) by their respective df, are called the mean regression and mean error sum of squares. The three sums of squares are related as:

$$ SS(\mathrm{Total}) = SS(R) + SS(E) $$

where SS(R) and SS(E) form a partition of SS(Total).

Under the null of no group difference, the sample group means $\bar{y}_{k\cdot}$ will be close to the sample grand mean $\bar{y}_{\cdot\cdot}$, in which case SS(R) will be close to 0 and SS(E) will be close to SS(Total). Otherwise, SS(R) will be different from 0, with its magnitude reflecting the differences between the group means, and SS(E) will account for a smaller portion of SS(Total). Thus, the relationship between SS(R) and SS(E) indicates whether there is evidence of differences in group means. After normalising the two by the number of groups and the sample size, the F-test is defined by their normalised counterparts, MS(R) and MS(E).
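The sums of squares and their partition can be checked directly. Below is a small sketch, assuming numpy and simulated three-group data with made-up means:

```python
# Computing SS(R), SS(E) and SS(Total) and verifying the partition.
import numpy as np

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(m, 1.0, 50) for m in (0.0, 1.0, 2.0)])
group = np.repeat([1, 2, 3], 50)

grand = y.mean()                                              # grand mean
groups = [y[group == k] for k in (1, 2, 3)]

ss_r = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # SS(R)
ss_e = sum(((g - g.mean()) ** 2).sum() for g in groups)       # SS(E)
ss_total = ((y - grand) ** 2).sum()                           # SS(Total)

print(np.isclose(ss_total, ss_r + ss_e))                      # True: SS(Total) = SS(R) + SS(E)
```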

Under the null hypothesis $H_0$, the F statistic, ie, the ratio $MS(R)/MS(E)$, follows the F distribution:

$$ F = \frac{MS(R)}{MS(E)} \sim F_{K-1,\,N-K} \qquad (5) $$

where $F_{K-1,\,N-K}$ denotes the F distribution with K−1 (numerator) df and N−K (denominator) df. A smaller MS(R) relative to MS(E) indicates evidence supporting the null, which is consistent with the fact that a larger value of the F statistic leads to rejection of the null.

As the F-test is derived under the ANOVA in equation (3), the validity of the F distribution for the F statistic in equation (5) depends on (A) normality and (B) homoscedasticity: under these assumptions, the outcomes $y_{ki}$ follow a normal distribution within each group, and all groups have the same variance. As noted earlier, the extant literature has largely focused on normality, with little discussion of homoscedasticity. Next, we use MC simulation to examine the performance of the F-test under departures from each assumption.
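To make the construction concrete, the sketch below, assuming scipy and simulated data under the null, computes the F statistic in equation (5) from the ANOVA table quantities and checks it against scipy.stats.f_oneway:

```python
# F statistic and p value computed by hand, compared against scipy.stats.f_oneway.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = [rng.normal(0.0, 1.0, 50) for _ in range(3)]   # K = 3 groups under H0
y = np.concatenate(groups)
K, N = len(groups), y.size

ss_r = sum(len(g) * (g.mean() - y.mean()) ** 2 for g in groups)
ss_e = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_stat = (ss_r / (K - 1)) / (ss_e / (N - K))            # MS(R) / MS(E)
p_value = stats.f.sf(f_stat, K - 1, N - K)              # upper tail of F_{K-1, N-K}

print(f_stat, p_value)
print(stats.f_oneway(*groups))                          # same statistic and p value
```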

Simulation study

To evaluate the performance of the F-test when either normality or homoscedasticity is violated, we use MC simulation to create multiple samples from an ANOVA. By repeatedly testing the null of no group mean difference in each simulated sample and comparing the per cent of times the null is rejected with a prespecified type I error, we can see how each assumption impacts the performance of the F-test. We start by specifying an ANOVA from which samples are drawn using MC simulation.

For brevity and without loss of generality, we consider three groups with a common mean μ and group size n, in which case the ANOVA in equation (3) simplifies to:

$$ y_{ki} = \mu + \epsilon_{ki}, \quad \epsilon_{ki} \sim N(0, \sigma^2), \quad 1 \le i \le n, \quad 1 \le k \le 3 \qquad (6) $$

We set $\mu = 0$ and $\sigma^2 = 1$ and consider testing the null of no group difference:

$$ H_0: \mu_1 = \mu_2 = \mu_3 \qquad (7) $$

Thus, the null hypothesis $H_0$ is true for the ANOVA in equation (6). If we set the type I error level to, say, α=0.05 and repeatedly test the null hypothesis using data $y_{ki}$ sampled from the ANOVA in equation (6) using MC simulation, we expect to reject the null $H_0$ 5% of the time. We refer to the per cent of times the null is rejected through repeated testing of multiple samples (MC simulated samples in the current setting) as the empirical type I error.
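As a sanity check of this baseline, a minimal sketch (assuming scipy, with normal, homoscedastic errors and an illustrative group size) shows the F-test rejecting $H_0$ close to the nominal 5%:

```python
# Empirical type I error of the F-test under equation (6) with normal errors.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
M, n, alpha = 1000, 100, 0.05          # M Monte Carlo samples, n per group (assumed)
rejections = sum(
    stats.f_oneway(*(rng.normal(0.0, 1.0, n) for _ in range(3))).pvalue < alpha
    for _ in range(M)
)
print(rejections / M)                  # close to 0.05
```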

For evaluating the performance of the F-test under non-normal distributions, we simulate the error terms $\epsilon_{ki}$ in equation (6) from (A) a centred and rescaled χ2 distribution with 1 df and (B) a centred and rescaled Weibull distribution with both shape and scale equal to 1, so that the resulting distributions in both cases have mean 0 and variance 1. In this case, the simulated $y_{ki}$ have the same mean (μ=0) and variance (σ²=1) across the three groups. However, as the $y_{ki}$ no longer follow the normal distribution, the F statistic in equation (5) may not follow the F distribution, in which case we may not reject the null $H_0$ 5% of the time if we set the nominal level at α=0.05.
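A sketch of these two error distributions, assuming numpy: the χ2(1) has mean 1 and variance 2, so it is centred and rescaled by √2; the Weibull with shape and scale 1 is the Exponential(1), with mean 1 and variance 1, so centring alone standardises it.

```python
# Centred and rescaled non-normal errors with mean 0 and variance 1.
import numpy as np

rng = np.random.default_rng(3)
m = 100_000

# (A) chi-square, 1 df: mean 1, variance 2 -> subtract 1, divide by sqrt(2)
eps_chi2 = (rng.chisquare(df=1, size=m) - 1.0) / np.sqrt(2.0)

# (B) Weibull, shape = scale = 1 (ie, Exponential(1)): mean 1, variance 1
# -> centring alone yields mean 0 and variance 1
eps_weibull = rng.weibull(1.0, size=m) - 1.0

print(eps_chi2.mean(), eps_chi2.var())        # approximately 0 and 1
print(eps_weibull.mean(), eps_weibull.var())  # approximately 0 and 1
```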

To evaluate the impact of violating homoscedasticity on type I errors, we consider two scenarios: (a) all three groups have different variances and (b) one group has a different variance from the other two groups:

$$ \text{(a) } \sigma_1^2 \ne \sigma_2^2 \ne \sigma_3^2, \qquad \text{(b) } \sigma_1^2 = \sigma_2^2 \ne \sigma_3^2 $$

In each scenario above, we consider three sets of variances whose spread is indexed by k (=1, 10, 100). As the differences between the variances grow as k increases from 1 to 10 to 100, this setting shows whether an increased degree of heteroscedasticity has a larger effect on type I errors.

To reduce the sampling variability, we set the MC sample size to M=1000. We consider three group sizes, (A) n=10, (B) n=100 and (C) n=1000, to see if and how sample size affects the performance of the F-test.
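Putting the pieces together, the sketch below (assuming scipy) estimates empirical type I errors across the three group sizes; the heteroscedastic standard deviations used here are illustrative assumptions, not the exact variance sets behind table 1.

```python
# Empirical type I error of the F-test with and without heteroscedasticity.
import numpy as np
from scipy import stats

def empirical_type1(n, sds, M=1000, alpha=0.05, seed=0):
    """Rejection rate of the F-test when all group means are 0 (H0 true)."""
    rng = np.random.default_rng(seed)
    rejected = sum(
        stats.f_oneway(*(rng.normal(0.0, s, n) for s in sds)).pvalue < alpha
        for _ in range(M)
    )
    return rejected / M

for n in (10, 100, 1000):
    print(n,
          empirical_type1(n, sds=(1.0, 1.0, 1.0)),    # homoscedastic: ~0.05
          empirical_type1(n, sds=(1.0, 3.0, 10.0)))   # heteroscedastic (illustrative)
```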

Shown in table 1 are the empirical type I errors for testing the null of no group mean difference with the F-test, based on 1000 MC simulated samples from the ANOVA in equation (6), with and without violation of each of the two assumptions. When neither assumption is violated, the empirical type I errors are quite close to their nominal counterparts. When normality is not met, there is a downward bias in the empirical type I errors for sample size n=10, but the bias essentially disappears for n=100 and n=1000. When homoscedasticity is violated, however, there is a clear upward bias in both scenarios, and the bias becomes larger as the degree of heteroscedasticity increases. The first scenario shows a larger bias than the second, which is expected since it has a higher degree of heteroscedasticity. Moreover, unlike the case of non-normality, the upward bias grows as the sample size increases from 10 to 100 to 1000. An inflated empirical type I error means the likelihood of rejecting the null hypothesis is higher than the specified nominal level, yielding a higher rate of false positives.

Table 1
Empirical type I errors for testing the null of no difference in group means across the three groups in equation (7), based on the F-test applied to 1000 Monte Carlo simulated samples from the ANOVA in equation (6), under (A) no violation of normality or homoscedasticity, (B) violation of normality and (C) violation of homoscedasticity, for sample sizes n=10, 100 and 1000 and nominal type I errors α=0.05 and α=0.01.

Discussion

In this report, we examined the performance of the F-test in one-way ANOVA using MC simulation when data violate (1) the normality or (2) the homoscedasticity assumption. Our simulation results show that although there is a downward bias in type I errors under non-normality for an extremely small sample size (n=10), it virtually disappears for moderate (n=100) and large (n=1000) sample sizes. The diminishing bias as the sample size increases is not a coincidence; it reflects the working of a mechanism known in statistics as the central limit theorem.3 Thus, unless the sample size is small (eg, less than n=50), normality can be ignored when applying ANOVA.

In contrast, the F-test is more sensitive to departures from homoscedasticity. Our simulation results show that type I errors are upwardly biased when groups have different variances. Moreover, unlike violation of normality, the bias persists and actually increases as the sample size grows. Thus, one needs to pay close attention to this assumption when applying ANOVA. If violation of this assumption is suspected, either by formal testing4 or by rule of thumb (eg, the ratio of the largest to the smallest group variance exceeds 2), one may need to apply other tests for comparing group means. For example, the KW test may be used for inference. However, as this test generally has lower power, one may opt for modern alternatives such as semiparametric models, which provide inference for linear and more general regression models without imposing any distributional assumptions.3 Unlike the KW test, semiparametric models use the original continuous outcomes and thus provide more power. We will compare the KW test and applications of semiparametric models to ANOVA in a future article.
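As a brief illustration of the diagnostics and the rank-based alternative mentioned above, the sketch below (assuming scipy, with made-up data in which one group has a much larger variance) applies Levene's test for equal variances and the KW test:

```python
# Levene's test for homoscedasticity and the Kruskal-Wallis test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
g1 = rng.normal(0.0, 1.0, 100)
g2 = rng.normal(0.0, 1.0, 100)
g3 = rng.normal(0.0, 10.0, 100)      # one group with a much larger variance

print(stats.levene(g1, g2, g3))      # formal test of equal group variances
print(stats.kruskal(g1, g2, g3))     # rank-based comparison of the three groups
```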

Kun Yang graduated from Huazhong Agricultural University in 2016 and has since been reading for a master's degree in bioinformatics at Huazhong Agricultural University. He is currently studying and working at the University of California San Diego as a visiting student. His research interests include bioinformatics, biostatistics and machine learning.
