Introduction
Linear regression (LR) is arguably the most popular statistical model used to facilitate biomedical and psychosocial research. LR can be used to examine relationships between continuous variables, and associations between a continuous and a categorical variable. For example, by using one binary independent variable, LR can be used to compare the means between two groups, akin to the two independent samples t-test. If we have a multilevel categorical independent variable, LR yields the analysis of variance (ANOVA) model. Although the t-test for unequal group variance is often used as an alternative for comparing group means when large differences in group variances emerge, the same homoscedasticity assumption underlying ANOVA is often taken for granted when this classic model is applied for comparing more than two groups. For ANOVA, much of the focus is centred on normality, with little attention paid to homoscedasticity.
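The equivalence noted above between LR with one binary independent variable and the two independent samples t-test can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names `pooled_t` and `ols_slope_t`, the sample sizes, and the simulated group means are all illustrative assumptions. It computes the equal-variance (pooled) t statistic and the t statistic for the slope of a simple LR on a 0/1 group indicator, which should agree up to floating-point error.

```python
import math
import random


def pooled_t(group0, group1):
    """Two independent samples t statistic assuming equal group variances."""
    n0, n1 = len(group0), len(group1)
    m0, m1 = sum(group0) / n0, sum(group1) / n1
    ss0 = sum((y - m0) ** 2 for y in group0)
    ss1 = sum((y - m1) ** 2 for y in group1)
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n0 + 1 / n1))


def ols_slope_t(x, y):
    """t statistic for the slope in the simple linear regression of y on x."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    residual_var = rss / (n - 2)
    return slope / math.sqrt(residual_var / sxx)


# Illustrative data (assumed values): two groups of 30 with a mean shift of 0.5.
rng = random.Random(2024)
group0 = [rng.gauss(0.0, 1.0) for _ in range(30)]
group1 = [rng.gauss(0.5, 1.0) for _ in range(30)]

# Binary independent variable: 0 for the first group, 1 for the second.
x = [0] * len(group0) + [1] * len(group1)
y = group0 + group1

t_from_ttest = pooled_t(group0, group1)
t_from_lr = ols_slope_t(x, y)
print(t_from_ttest, t_from_lr)
```

With a 0/1 indicator, the fitted slope equals the difference in group means and the residual variance equals the pooled variance, which is why the two statistics coincide. It also makes explicit that this LR formulation inherits the equal-variance assumption.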
Contrary to popular belief, the homoscedasticity assumption actually plays a more critical role than normality for the validity of ANOVA. This is because the F-test, which tests for overall differences in group means across all the groups (omnibus test), is more sensitive to heteroscedasticity than to non-normality. Thus, even when data are perfectly normal, the F-test will generally yield incorrect results if group variances differ substantially. Although the Kruskal-Wallis (KW) test is often applied when homoscedasticity is in doubt,1 this test is less powerful than the F-test, since it discretises the original data using ranks, a sequence of natural numbers such as 1, 2 and 3 that represent ordinal differences in the original continuous outcomes. An even more serious problem with the KW test is the extremely complex distribution of its test statistic, which limits its application in practice.2
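The sensitivity of the F-test to heteroscedasticity described above can be illustrated with a small Monte Carlo sketch. This is not the simulation design used in this report: the group sizes, standard deviations, number of replications, seed, and the function names `f_test_pvalue` and `rejection_rate` are assumptions chosen for illustration. All data are generated as normal with equal group means, so every rejection is a type I error. The p-value uses a closed form that holds only when there are exactly three groups (numerator degrees of freedom equal to 2).

```python
import random


def f_test_pvalue(groups):
    """One-way ANOVA F-test p-value, restricted to exactly three groups.

    For numerator df = 2, the F survival function has the closed form
    P(F > f) = (1 + 2 f / df_within) ** (-df_within / 2).
    """
    k = len(groups)
    if k != 3:
        raise ValueError("closed-form p-value requires exactly three groups")
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df_within = n_total - k
    f_stat = (ss_between / (k - 1)) / (ss_within / df_within)
    return (1 + 2 * f_stat / df_within) ** (-df_within / 2)


def rejection_rate(sizes, sds, n_reps=2000, alpha=0.05, seed=1):
    """Share of simulated data sets, all with equal group means, rejected at alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_reps):
        groups = [[rng.gauss(0.0, sd) for _ in range(n)] for n, sd in zip(sizes, sds)]
        if f_test_pvalue(groups) < alpha:
            rejections += 1
    return rejections / n_reps


# Assumed design: the smallest group has the largest variance in the second scenario.
sizes = (10, 25, 25)
rate_equal_var = rejection_rate(sizes, sds=(1.0, 1.0, 1.0))
rate_unequal_var = rejection_rate(sizes, sds=(4.0, 1.0, 1.0))
print("equal variances:  ", rate_equal_var)
print("unequal variances:", rate_unequal_var)
```

Under equal variances the empirical rejection rate should be close to the nominal 0.05, whereas in the unequal-variance scenario it should be noticeably higher, even though the data are exactly normal in both cases. Pairing the largest variance with the smallest group is a deliberate choice here, since that configuration is known to make the F-test liberal.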
Over the past 30 years, many new statistical methods have been developed to address the aforementioned limitations of the classic LR and associated alternatives. Such new models apply to cross-sectional and longitudinal data, the latter being the hallmark of modern clinical research. Semiparametric statistical models are the most popular, since they require none of the distribution assumptions and apply to continuous outcomes without changing the continuous scale.3 In this report, we use a Monte Carlo simulation study to investigate and compare results when one of the two assumptions is violated, and to show the importance of homoscedasticity for valid inference for LR. We will discuss and perform a head-to-head comparison of power between the classic KW test and modern semiparametric models in a future article.