Biostatistical Methods In Psychiatry

On modelling relative risks for longitudinal binomial responses: implications from two dueling paradigms

Abstract

Although logistic regression is the most popular for modelling regression relationships with binary responses, many find relative risk (RR), or risk ratio, easier to interpret and prefer to use this measure of risk in regression analysis. Indeed, since Zou published his modified Poisson regression approach for modelling RR for cross-sectional data, his paper has been cited over 7 000 times, demonstrating the popularity of this alternative measure of risk in regression analysis involving binary responses. As longitudinal studies have become increasingly popular in clinical trials and observational studies, it is imperative to extend Zou’s approach for longitudinal data.

The two most popular approaches for longitudinal data analysis are the generalised linear mixed-effects model (GLMM) and generalised estimating equations (GEE). However, the parametric GLMM cannot be used for the extension within the current context, because Zou’s approach treats the binary response as a Poisson variable, which is at odds with the Bernoulli distribution for the binary response. On the other hand, as it imposes no mathematical model on data distributions, the semiparametric GEE is coherent with Zou’s modified Poisson regression. In this paper, we develop a GEE-based longitudinal model for binary responses to provide inference about RR.

Introduction

Logistic regression is widely used to model binary responses. However, many find relative risk (RR), or risk ratio easier to interpret and prefer to model regression relationships with inference about RR, rather than odds ratio (OR) as in logistic regression. Indeed, since Zou1 published his modified Poisson regression approach for inference about RR, his paper has been cited 7 128 times, demonstrating the popularity of using RR in modelling binary responses. However, his approach isn’t applied to longitudinal data. Moreover, there is no one-to-one relationship between RR and OR for regression analysis.2 As longitudinal studies have become increasingly the standard in clinical trials and observational studies, it is imperative to develop statistical models for longitudinal binary responses with inference based on RR to fill the critical gap.

The two most popular paradigms to extend models for cross-sectional data to longitudinal data are the generalised linear mixed-effects model (GLMM) and generalised estimating equations (GEE). The parametric GLMM explicitly models the within-subject correlation using random effects, while the semiparametric, or distribution-free GEE implicitly accounts for such correlations using sandwich variance estimates.3 Since Zou’s approach treats binary responses as count variables and derives estimators of RR under the Poisson distribution, GLMM cannot be used to extend his approach to longitudinal data within the current context. As his approach is essentially a semiparametric log-linear model, a simplified version of GEE for cross-sectional data, GEE provides a coherent paradigm to develop to extend his approach to longitudinal data.

In the Models for Relative Risks for Longitudinal Binary Responses section, we first review semiparametric regression models for cross-sectional and longitudinal binary responses under the logit and log link for inference about the respective log of OR and log of RR. We then discuss a GEE-based approach for longitudinal binary responses for inference about RR by leveraging semiparametric log-linear models. In the Application section, we use real and simulated data to illustrate the proposed approach. In the Discussion section, we give our concluding remarks.

Models for relative risks for longitudinal binary responses

We start with a brief review of Zou’s approach for inference about RR when modelling binary responses in cross-sectional data.

Cross-sectional data

Consider a study with n subjects indexed by  Inline Formula  Let  Inline Formula  denote a binary response of interest and let  Inline Formula  with  Inline Formula  denote a  Inline Formula  vector of explanatory, or independent, variables from the i th subject  Inline Formula . The popular logistic regression model is defined by a generalised linear model (GLM) with the logit link as Tang et al 3:

Display Formula

where  Inline Formula  denotes independently distributed, Bernoulli  Inline Formula  denotes the Bernoulli distribution with mean  Inline Formula , logit denotes the logit link function and γ is the vector of model parameters or coefficients. Under logistic regression, each regression coefficient  Inline Formula  has the log OR interpretation per unit change in  Inline Formula  for  Inline Formula  3 Inference about γ is generalised based on maximum likelihood.3

For  Inline Formula  to have the RR interpretation, we need to change the logit link to the log link function to express (1) as:

Display Formula

For differentiating log OR from log RR, we use a different symbol β in (2) to denote the model coefficients. Under (2), each coefficient  Inline Formula  has the log RR interpretation. For example, consider one unit increase in  Inline Formula  from  Inline Formula  to  Inline Formula . Denote the change in the mean of  Inline Formula  in response to the change in  Inline Formula  by:

Display Formula

Then, it follows from (2) that the log of RR,  Inline Formula , for the unit change in  Inline Formula  from  Inline Formula  to  Inline Formula  is:

Display Formula

The two GLMs in (1) and (2) are quite similar except for the different link functions. Under logit link in (1), the conditional mean  Inline Formula  is constrained between 0 and 1, while under the log link in (2),  Inline Formula  is confined only to positive values. Since  Inline Formula  may exceed 1, the upper bound for a probability quantity, estimates based on maximising the Bernoulli likelihood may not converge under the log link.4 5 To alleviate this problem, we may switch the Bernoulli distribution in (2) to the Poisson, that is,

Display Formula

Since the logic restriction of positive values on  Inline Formula  is consistent with the mean of the Poisson, fitting the model (3) to observed data will not be an issue. For rare diseases,  Inline Formula  will be close to 0 and  Inline Formula  may be viewed as a count, frequency, or response with mean  Inline Formula , in which case the Poisson-based (3) is a reasonable approximation. In general, with increased  Inline Formula , (3) may not provide valid inference, since the binary  Inline Formula  will not have a Poisson distribution in this case. Zou discussed the use of the sandwich variance estimator as an alternative to estimate the variance of the estimator of β . This approach is essentially a semiparametric regression, or restricted moment model, in which only the model for the conditional mean of  Inline Formula  given  Inline Formula  in (3) is assumed:

Display Formula

Thus, unlike (3), the semiparametric log-linear model above does not assume Poisson or any other parametric distribution for  Inline Formula . Different from a parametric model, a semiparametric model leverages estimating equations to play the role of the likelihood to provide inference.3 Unlike maximum likelihood estimation, inference based on estimating equations is consistent regardless of the distribution of  Inline Formula , so long as the assumed conditional mean in (4) is correct.3 Thus, even if  Inline Formula  does not have a Poisson distribution, inference about β in (4) is still correct when based on the estimating equations.

Within the current context, the estimating equations for inference about β have the form:

Display Formula

where  Inline Formula  is the conditional variance of  Inline Formula  given  Inline Formula . Under (4),  Inline Formula  and  Inline Formula  are readily evaluated. However,  Inline Formula  is not determined by the semiparametric log-linear model in (4), since it only specifies the conditional mean  Inline Formula . Within the current context,  Inline Formula  follows the Bernoulli  Inline Formula , in which case we have  Inline Formula . We obtain the estimate  Inline Formula  of β by solving (5) for β . Unlike linear regression,  Inline Formula  cannot be evaluated in closed form but is readily computed numerically.3

The estimator  Inline Formula  has an asymptotically normal distribution with mean β and variance  Inline Formula :

Display Formula

where  Inline Formula  denotes the inverse of B . We can estimate  Inline Formula  by the following sandwich variance estimator  Inline Formula :

Display Formula

Note that unlike likelihood-based inference for parametric models, inference based on the estimating equations in (5) for semiparametric models is always valid, regardless of the distribution of  Inline Formula . In particular, instead of  Inline Formula , we may also set  Inline Formula  to any function of  Inline Formula  such as  Inline Formula  (by treating  Inline Formula  as a Poisson with mean  Inline Formula ) for valid inference about β . This is why we can model a binary  Inline Formula  using a semiparametric log-linear model for a count response.

Longitudinal data

We now consider extending the semiparametric log-linear model above to longitudinal data.

Suppose that the subjects are assessed repeatedly over T time points  Inline Formula . Let  Inline Formula  and  Inline Formula  denote the same response and explanatory variables as in the cross-sectional data setting, but with t indicating their dependence on the time of assessment ( Inline Formula ,  Inline Formula ). By applying the semiparametric log-linear model in (4) to each assessment t , we obtain an extension of the semiparametric log-linear model for the association of longitudinal  Inline Formula  and  Inline Formula :

Display Formula

Thus, we do not explicitly model correlations among the repeated  Inline Formula  ’s. Inference about β is based on extending the estimating equations in (5) to the correlated  Inline Formula  ’s.

Let

Display Formula

The estimating equations, which are often called the generalised estimating equations (GEE) in the literature, for inference about β have the form:

Display Formula

where  Inline Formula  is the conditional variance of  Inline Formula  given  Inline Formula . As in the cross-sectional case, we can readily evaluate  Inline Formula  and  Inline Formula  under (8) and set  Inline Formula  for each  Inline Formula . However, the conditional covariance between  Inline Formula  given  Inline Formula  is quite complex. In almost all applications of GEE, we use a working correlation  Inline Formula  to approximate the true correlation  Inline Formula , where  Inline Formula  is a  Inline Formula  correlation matrix with its entries defined by a parameter vector α . 3 Popular choices of  Inline Formula  are the working independence, with  Inline Formula , and working exchangeable, with  Inline Formula , model, where  Inline Formula  denotes the  Inline Formula  identity matrix and ρ is a parameter.

Under a specific  Inline Formula , we have  Inline Formula , where  Inline Formula  denotes a diagonal matrix with  Inline Formula  on its t th diagonal. As in the case of cross-sectional data, inference is always valid even if  Inline Formula   Inline Formula  is not the true correlation (variance) of  Inline Formula  given  Inline Formula . In (9),  Inline Formula  also depends on α , though we have suppressed this dependence to highlight the fact that (9) is the equation for estimating β . Thus, α must be estimated (except for the working independence model) to solve (9) for  Inline Formula  We can either assign a value to or estimate α together with β . For example, under  Inline Formula , we may set ρ to any value between 0 and 1 or estimate ρ using correlated residuals  Inline Formula , with  Inline Formula . Inference about β is based on the asymptotic normal distribution of the GEE estimator  Inline Formula , which has mean β and variance  Inline Formula :

Display Formula

where  Inline Formula  denotes the transpose of B . We can estimate  Inline Formula  by the sandwich variance estimator  Inline Formula , which is obtained by:

Display Formula

where  Inline Formula ,  Inline Formula  and  Inline Formula  denote substituting  Inline Formula  in place of β for the respective quantity  Inline Formula ,  Inline Formula  and  Inline Formula .

Popular software packages all support semiparametric regression models for both cross-sectional and longitudinal data. For example, PROC GEE in SAS and geeglm() in the geepack package in R6 can be used to fit the semiparametric log-linear models in (4) for cross-sectional and (8) for longitudinal data.

Application

We illustrate our considerations with both real and simulated data. In all the examples, we set the statistical significance at  Inline Formula . All analyses are carried out using the geeglm() function in the geepack package in R.6

Simulation study

We consider modelling regression associations of a single time-invariant binary explanatory variable  Inline Formula  with a binary response  Inline Formula  in a longitudinal study with three assessments. To simulate the correlated  Inline Formula , we use a Gaussian copula with the marginal  Inline Formula  given  Inline Formula  following a Bernoulli7:

Display Formula

For our simulation, we set  Inline Formula  and  Inline Formula  and an exchangeable correlation  Inline Formula  in the trivariate normal with  Inline Formula .

We fit the semiparametric (8) to the data simulated, that is,

Display Formula

using the GEE in (9) under the working independent correlation model. Shown in table 1 are the estimates of β along with their standard errors (SEs) (both asymptotic and empirical), over 1 000 Monte Carlo (MC) replications under a sample size  Inline Formula . The estimates  Inline Formula  were quite close to their true values, and the asymptotic SEs were quite close to their empirical counterparts. Also, shown in table 1 are type I error rates from testing the null hypothesis  Inline Formula  and  Inline Formula . We estimate the type I errors using MC iterations. Let  Inline Formula  denotes the Wald statistic at the m th MC replication, the type I error rate for testing  Inline Formula  is estimated by:  Inline Formula , where  Inline Formula  is the 95th percentile of a  Inline Formula  distribution, a χ2 distribution with 1 df. As seen, the type I error rates were close the normal values  Inline Formula .

Table 1
|
Parameter estimates, SEs (asymptotic and empirical) and type I errors from GEE model with 1 000 MC replications

Real study

Smoking is the chief avoidable cause of morbidity and mortality in the USA, exacting a substantive financial burden as well.8 Smoking rates among persons with serious mental illness are exceptionally high, contributing to significant medical morbidity and mortality in this population, with many unlikely to live beyond their 50th birthday. Persons with mental illness spend nearly one-third of their monthly public assistance income on cigarettes instead of buying needed food, clothing and shelter.9 A study was conducted to evaluate the effect of a multicomponent smoking cessation programme adapted to patients with serious psychiatric disorders within an outpatient psychiatric clinic at the University of Rochester Medical Center. This study, sponsored by the New York State Department of Health Tobacco Control Program, capitalises on packaging multiple evidence-based components to achieve a better outcome than when each practice is individually implemented in a number of clinical venues, for example, central line–associated bloodstream infections and ventilator-associated pneumonia.10 Among the 276 participating subjects, 99 also participated in a formal evaluation, in which interviews were conducted at the point of enrolment (baseline), prior to intervention and again at 3, 6 and 12 months.

For illustrative purposes, we model the binary abstinence outcome, defined as the 7-day point prevalence (ie, abstinent from smoking for 7 days in a row), from preintervention at baseline,  Inline Formula , to each of the three postintervention assessments,  Inline Formula , at 3, 6 and 12 months, using data from 99 subjects. We create three time-varying dummy variables  Inline Formula ,  Inline Formula  and  Inline Formula  to indicate intervention effects at  Inline Formula :

Display Formula

Let  Inline Formula  if the i th subject is abstinent for 7 days consecutively and  Inline Formula  otherwise. The semiparametric GEE for change of abstinence rates over time is given by:

Display Formula

We fit (8) to the 7-day point prevalence data using the GEE in (9) under the working independent correlation model.

Shown in table 2 are the estimates  Inline Formula  of  Inline Formula  and associated SEs, p values for testing the null  Inline Formula  and RRs (exponentiated  Inline Formula ) at each assessment  Inline Formula . The results show a RR greater than 1 for all three postintervention assessments, though only statistically significant at months 3 and 6. The intervention did have a significant effect on reducing smoking in this study sample, though the effect diminished 12 months after the intervention.

Table 2
|
Estimates of parameters, SEs, p values and relative risks over time from GEE model to the Smoking Cessation Study data

Discussion

We extended the popular approach for modelling RRs for binary responses to longitudinal data by leveraging the semiparametric GEE. Like the original approach in Zou,1 the parameters of the proposed log-linear model have the log of RR interpretation and, thus, with appropriately defined explanatory variables, can be used for inference about RRs when modelling longitudinal regression relationships with binary responses. We also illustrated the proposed approach using both real and simulated longitudinal data.

The proposed GEE-based approach provides valid inference under the missing completely at random (MCAR) mechanism.3 11 In many real studies, missing data follow the missing at random (MAR) mechanism,3 11 in which case the lowest patterns done by the proposed approach generally yield biased estimates of RR. We can readily extend the approach to provide valid inference under MAR by employing the weighted generalised estimating equations (WGEEs).11 Under WGEE, we also model the missingness of the binary response over time using GLMs for binary responses such as logistic regression and estimate its parameters and the parameters of the log-linear model in (8) together using a set of estimating equations that extend (9) to include the additional parameters.3

Tuo Lin is a fifth-year PhD student in Biostatistics at the University of California, San Diego (UCSD) in the USA. He obtained his master’s degree in Statistics at UCSD in 2018. He is currently working as a graduate student researcher in the division of Biostatistics and Bioinformatics of Herbert Wertheim School of Public Health and Human Longevity Science at UCSD. He has also been working at Altman Clinical and Translational Research Institute (ACTRI) in the USA for many years, helping with study designs and data analyses. His main research interests include survey sampling and methods, causal inference and longitudinal data analysis in psychiatry studies.

author bio image