
On modelling relative risks for longitudinal binomial responses: implications from two dueling paradigms
Tuo Lin1, Rongzhe Zhao1, Shengjia Tu2, Hao Wu3, Hui Zhang4 and Xin M Tu1

1 Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health and Human Longevity Science, UC San Diego, La Jolla, California, USA
2 College of Environmental Science and Engineering, Tongji University, Shanghai, China
3 Department of Mathematics and Statistics, Georgia State University, Atlanta, Georgia, USA
4 Division of Biostatistics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA

Correspondence to Mr Tuo Lin; tulin{at}health.ucsd.edu

Abstract

Although logistic regression is the most popular approach for modelling regression relationships with binary responses, many find relative risk (RR), or risk ratio, easier to interpret and prefer to use this measure of risk in regression analysis. Indeed, since Zou published his modified Poisson regression approach for modelling RR for cross-sectional data, his paper has been cited over 7 000 times, demonstrating the popularity of this alternative measure of risk in regression analysis involving binary responses. As longitudinal studies have become increasingly popular in clinical trials and observational studies, it is imperative to extend Zou’s approach to longitudinal data.

The two most popular approaches for longitudinal data analysis are the generalised linear mixed-effects model (GLMM) and generalised estimating equations (GEE). However, the parametric GLMM cannot be used for the extension within the current context, because Zou’s approach treats the binary response as a Poisson variable, which is at odds with the Bernoulli distribution for the binary response. On the other hand, as it imposes no mathematical model on data distributions, the semiparametric GEE is coherent with Zou’s modified Poisson regression. In this paper, we develop a GEE-based longitudinal model for binary responses to provide inference about RR.

  • OR
  • generalized estimating equations
  • relative risk
  • sandwich variance estimator
  • semiparametric generalized linear models

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.


Introduction

Logistic regression is widely used to model binary responses. However, many find relative risk (RR), or risk ratio, easier to interpret and prefer to model regression relationships with inference about RR rather than the odds ratio (OR) as in logistic regression. Indeed, since Zou1 published his modified Poisson regression approach for inference about RR, his paper has been cited 7 128 times, demonstrating the popularity of using RR in modelling binary responses. However, his approach does not apply to longitudinal data. Moreover, there is no one-to-one relationship between RR and OR for regression analysis.2 As longitudinal studies have increasingly become the standard in clinical trials and observational studies, it is imperative to develop statistical models for longitudinal binary responses with inference based on RR to fill this critical gap.

The two most popular paradigms to extend models for cross-sectional data to longitudinal data are the generalised linear mixed-effects model (GLMM) and generalised estimating equations (GEE). The parametric GLMM explicitly models the within-subject correlation using random effects, while the semiparametric, or distribution-free, GEE implicitly accounts for such correlations using sandwich variance estimates.3 Since Zou’s approach treats binary responses as count variables and derives estimators of RR under the Poisson distribution, which is at odds with the Bernoulli distribution of a binary response, GLMM cannot be used to extend his approach to longitudinal data within the current context. As his approach is essentially a semiparametric log-linear model, a simplified version of GEE for cross-sectional data, GEE provides a coherent paradigm for extending his approach to longitudinal data.

In the Models for Relative Risks for Longitudinal Binary Responses section, we first review semiparametric regression models for cross-sectional and longitudinal binary responses under the logit and log link for inference about the respective log of OR and log of RR. We then discuss a GEE-based approach for longitudinal binary responses for inference about RR by leveraging semiparametric log-linear models. In the Application section, we use real and simulated data to illustrate the proposed approach. In the Discussion section, we give our concluding remarks.

Models for relative risks for longitudinal binary responses

We start with a brief review of Zou’s approach for inference about RR when modelling binary responses in cross-sectional data.

Cross-sectional data

Consider a study with n subjects indexed by $i = 1, \ldots, n$. Let $y_i$ denote a binary response of interest and let $x_i = (x_{i0}, x_{i1}, \ldots, x_{ip})^{\top}$ with $x_{i0} = 1$ denote a $(p+1) \times 1$ vector of explanatory, or independent, variables from the i th subject ($1 \le i \le n$). The popular logistic regression model is defined by a generalised linear model (GLM) with the logit link as in Tang et al 3:

$$y_i \mid x_i \sim \text{i.d. Bernoulli}(\mu_i), \quad \text{logit}(\mu_i) = x_i^{\top}\gamma = \gamma_0 + \gamma_1 x_{i1} + \cdots + \gamma_p x_{ip}, \quad 1 \le i \le n$$ (1)

where i.d. denotes independently distributed, Bernoulli$(\mu_i)$ denotes the Bernoulli distribution with mean $\mu_i$, logit denotes the logit link function and $\gamma = (\gamma_0, \gamma_1, \ldots, \gamma_p)^{\top}$ is the vector of model parameters or coefficients. Under logistic regression, each regression coefficient $\gamma_j$ has the log OR interpretation per unit change in $x_{ij}$ for $1 \le j \le p$.3 Inference about γ is generally based on maximum likelihood.3

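For the regression coefficients to have the RR interpretation, we need to change the logit link to the log link function to express (1) as:

$$y_i \mid x_i \sim \text{i.d. Bernoulli}(\mu_i), \quad \log(\mu_i) = x_i^{\top}\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, \quad 1 \le i \le n$$ (2)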

To differentiate log RR from log OR, we use a different symbol β in (2) to denote the model coefficients. Under (2), each coefficient $\beta_j$ ($1 \le j \le p$) has the log RR interpretation. For example, consider a one-unit increase in $x_{ij}$ from $a$ to $a + 1$, with the other components of $x_i$ held fixed. Denote the mean of $y_i$ at each of the two values of $x_{ij}$ by:

$$\mu_i(a) = E(y_i \mid x_{ij} = a), \quad \mu_i(a+1) = E(y_i \mid x_{ij} = a + 1)$$

Then, it follows from (2) that the log of RR, $\log\{\mu_i(a+1)/\mu_i(a)\}$, for the unit change in $x_{ij}$ from $a$ to $a + 1$ is:

$$\log\frac{\mu_i(a+1)}{\mu_i(a)} = \log\mu_i(a+1) - \log\mu_i(a) = \beta_j(a+1) - \beta_j a = \beta_j$$
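As a small numerical check of the log RR interpretation, the sketch below evaluates the mean under the log link at two values of a single explanatory variable one unit apart and confirms that the log of the ratio of the two means equals the slope. The coefficient values are hypothetical and chosen only so that both risks stay below 1; the OR for the same two risks is printed alongside to show that it differs from the RR.

```python
import math

def risk(beta0, beta1, x):
    """Mean response under the log link: mu = exp(beta0 + beta1 * x)."""
    return math.exp(beta0 + beta1 * x)

# Hypothetical coefficients (not taken from the paper).
beta0, beta1 = math.log(0.2), math.log(1.5)

mu_a = risk(beta0, beta1, 0)     # risk at x = a = 0
mu_a1 = risk(beta0, beta1, 1)    # risk at x = a + 1 = 1

log_rr = math.log(mu_a1 / mu_a)  # equals beta1 under the log link
odds_ratio = (mu_a1 / (1 - mu_a1)) / (mu_a / (1 - mu_a))

print("RR:", round(math.exp(log_rr), 4))
print("OR:", round(odds_ratio, 4))
```

The RR depends only on the slope, whereas the OR for the same pair of risks also depends on the baseline risk.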

The two GLMs in (1) and (2) are quite similar except for the different link functions. Under the logit link in (1), the conditional mean $\mu_i$ is constrained between 0 and 1, while under the log link in (2), $\mu_i = \exp(x_i^{\top}\beta)$ is confined only to positive values. Since $\exp(x_i^{\top}\beta)$ may exceed 1, the upper bound for a probability quantity, estimates based on maximising the Bernoulli likelihood may not converge under the log link.4 5 To alleviate this problem, we may switch the Bernoulli distribution in (2) to the Poisson, that is,

$$y_i \mid x_i \sim \text{i.d. Poisson}(\mu_i), \quad \log(\mu_i) = x_i^{\top}\beta, \quad 1 \le i \le n$$ (3)

Since the restriction of $\mu_i$ to positive values is consistent with the mean of the Poisson, fitting the model in (3) to observed data will not be an issue. For rare diseases, $\mu_i$ will be close to 0 and $y_i$ may be viewed as a count, or frequency, response with mean $\mu_i$, in which case the Poisson-based (3) is a reasonable approximation. In general, with increased $\mu_i$, (3) may not provide valid inference, since the binary $y_i$ will not have a Poisson distribution in this case. Zou discussed the use of the sandwich variance estimator as an alternative to estimate the variance of the estimator of β. This approach is essentially a semiparametric regression, or restricted moment model, in which only the model for the conditional mean of $y_i$ given $x_i$ in (3) is assumed:

$$E(y_i \mid x_i) = \mu_i, \quad \log(\mu_i) = x_i^{\top}\beta, \quad 1 \le i \le n$$ (4)

Thus, unlike (3), the semiparametric log-linear model above does not assume Poisson or any other parametric distribution for $y_i$. Unlike a parametric model, a semiparametric model leverages estimating equations to play the role of the likelihood to provide inference.3 Unlike maximum likelihood estimation, inference based on estimating equations is consistent regardless of the distribution of $y_i$, so long as the assumed conditional mean in (4) is correct.3 Thus, even if $y_i$ does not have a Poisson distribution, inference about β in (4) is still correct when based on the estimating equations.

Within the current context, the estimating equations for inference about β have the form:

$$\sum_{i=1}^{n} D_i V_i^{-1} S_i = 0, \quad D_i = \frac{\partial \mu_i}{\partial \beta}, \quad S_i = y_i - \mu_i$$ (5)

where $V_i$ is the conditional variance of $y_i$ given $x_i$. Under (4), $D_i$ and $S_i$ are readily evaluated, with $D_i = \mu_i x_i$ under the log link. However, $V_i$ is not determined by the semiparametric log-linear model in (4), since it only specifies the conditional mean $\mu_i$. Within the current context, $y_i$ follows the Bernoulli$(\mu_i)$, in which case we have $V_i = \mu_i(1 - \mu_i)$. We obtain the estimate $\hat\beta$ of β by solving (5) for β. Unlike linear regression, $\hat\beta$ cannot be evaluated in closed form but is readily computed numerically.3

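The estimator $\hat\beta$ has an asymptotically normal distribution with mean β and variance $\frac{1}{n}\Sigma_{\beta}$:

$$\hat\beta \sim_a N\left(\beta, \tfrac{1}{n}\Sigma_{\beta}\right), \quad \Sigma_{\beta} = B^{-1} E\left(D_i V_i^{-2} S_i^{2} D_i^{\top}\right) B^{-1}, \quad B = E\left(D_i V_i^{-1} D_i^{\top}\right)$$ (6)

where $B^{-1}$ denotes the inverse of B. We can estimate $\Sigma_{\beta}$ by the following sandwich variance estimator $\hat\Sigma_{\beta}$, in which a hat denotes evaluation at $\hat\beta$:

$$\hat\Sigma_{\beta} = \hat B^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-2} \hat S_i^{2} \hat D_i^{\top}\right)\hat B^{-1}, \quad \hat B = \frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1} \hat D_i^{\top}$$ (7)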

Note that unlike likelihood-based inference for parametric models, inference based on the estimating equations in (5) for semiparametric models remains valid regardless of the distribution of $y_i$, provided the conditional mean in (4) is correctly specified. In particular, instead of $V_i = \mu_i(1 - \mu_i)$, we may also set $V_i$ to any function of $\mu_i$ such as $V_i = \mu_i$ (by treating $y_i$ as a Poisson with mean $\mu_i$) for valid inference about β. This is why we can model a binary $y_i$ using a semiparametric log-linear model for a count response.
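To make the estimating-equation approach concrete, here is a minimal sketch in plain Python for an intercept plus one explanatory variable. It assumes the Poisson-type working variance (variance equal to the mean), so the estimating equations reduce to the sum over subjects of (1, x)' times the residual; it solves them by Newton iterations and then forms the sandwich variance. The function name `modified_poisson` and the toy data are hypothetical, and this is an illustration rather than the implementation used by any package.

```python
import math

def modified_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(mu_i) = b0 + b1 * x_i by solving sum_i (1, x_i)^T (y_i - mu_i) = 0
    (working variance V_i = mu_i). Returns (b0, b1) and the 2 x 2 sandwich
    covariance estimate of (b0, b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start from the overall risk
    for _ in range(max_iter):
        u0 = u1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = math.exp(b0 + b1 * xi)
            u0 += yi - mu                     # estimating function U
            u1 += xi * (yi - mu)
            h00 += mu                         # H = sum_i mu_i x_i x_i^T
            h01 += xi * mu
            h11 += xi * xi * mu
        det = h00 * h11 - h01 * h01
        d0 = (h11 * u0 - h01 * u1) / det      # Newton step H^{-1} U
        d1 = (h00 * u1 - h01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Recompute H and the "meat" M = sum_i S_i^2 x_i x_i^T at the solution.
    h00 = h01 = h11 = m00 = m01 = m11 = 0.0
    for xi, yi in zip(x, y):
        mu = math.exp(b0 + b1 * xi)
        s2 = (yi - mu) ** 2
        h00 += mu
        h01 += xi * mu
        h11 += xi * xi * mu
        m00 += s2
        m01 += xi * s2
        m11 += xi * xi * s2
    det = h00 * h11 - h01 * h01
    a00, a01, a11 = h11 / det, -h01 / det, h00 / det   # entries of H^{-1}
    # Sandwich H^{-1} M H^{-1}, written out for the symmetric 2 x 2 case.
    c00 = a00 * (m00 * a00 + m01 * a01) + a01 * (m01 * a00 + m11 * a01)
    c01 = a00 * (m00 * a01 + m01 * a11) + a01 * (m01 * a01 + m11 * a11)
    c11 = a01 * (m00 * a01 + m01 * a11) + a11 * (m01 * a01 + m11 * a11)
    return (b0, b1), [[c00, c01], [c01, c11]]

# Toy data (hypothetical): 2/10 events when x = 0 and 5/10 events when x = 1.
x = [0] * 10 + [1] * 10
y = [1, 1] + [0] * 8 + [1] * 5 + [0] * 5
(b0, b1), cov = modified_poisson(x, y)
rr = math.exp(b1)
se_b1 = math.sqrt(cov[1][1])
print("RR:", round(rr, 4), "robust SE of log RR:", round(se_b1, 4))
```

With a single binary explanatory variable the fitted RR is simply the ratio of the two observed event proportions, which gives an easy sanity check on the solver.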

Longitudinal data

We now consider extending the semiparametric log-linear model above to longitudinal data.

Suppose that the subjects are assessed repeatedly over T time points $t = 1, \ldots, T$. Let $y_{it}$ and $x_{it}$ denote the same response and explanatory variables as in the cross-sectional data setting, but with t indicating their dependence on the time of assessment ($1 \le i \le n$, $1 \le t \le T$). By applying the semiparametric log-linear model in (4) to each assessment t, we obtain an extension of the semiparametric log-linear model for the association of longitudinal $y_{it}$ and $x_{it}$:

$$E(y_{it} \mid x_{it}) = \mu_{it}, \quad \log(\mu_{it}) = x_{it}^{\top}\beta, \quad 1 \le i \le n, \quad 1 \le t \le T$$ (8)

Thus, we do not explicitly model correlations among the repeated $y_{it}$’s. Inference about β is based on extending the estimating equations in (5) to the correlated $y_{it}$’s.

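Let

$$y_i = (y_{i1}, \ldots, y_{iT})^{\top}, \quad \mu_i = (\mu_{i1}, \ldots, \mu_{iT})^{\top}, \quad x_i = (x_{i1}^{\top}, \ldots, x_{iT}^{\top})^{\top}$$

The estimating equations, which are often called the generalised estimating equations (GEE) in the literature, for inference about β have the form:

$$\sum_{i=1}^{n} D_i V_i^{-1} S_i = 0, \quad D_i = \frac{\partial \mu_i^{\top}}{\partial \beta}, \quad S_i = y_i - \mu_i$$ (9)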

where $V_i$ is the conditional variance of $y_i$ given $x_i$. As in the cross-sectional case, we can readily evaluate $D_i$ and $S_i$ under (8) and set $Var(y_{it} \mid x_{it}) = \mu_{it}(1 - \mu_{it})$ for each $1 \le t \le T$. However, the conditional covariance between $y_{is}$ and $y_{it}$ ($s \ne t$) given $x_i$ is quite complex. In almost all applications of GEE, we use a working correlation $R(\alpha)$ to approximate the true correlation $Corr(y_i \mid x_i)$, where $R(\alpha)$ is a $T \times T$ correlation matrix with its entries defined by a parameter vector α.3 Popular choices of $R(\alpha)$ are the working independence model, with $R(\alpha) = I_T$, and the working exchangeable model, with all off-diagonal entries of $R(\alpha)$ equal to $\rho$, where $I_T$ denotes the $T \times T$ identity matrix and ρ is a parameter.
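The two working correlation structures just described can be written out explicitly. The sketch below builds them as nested lists and, following the standard GEE construction, scales a working correlation by the Bernoulli marginal standard deviations to obtain a working covariance matrix for one subject. The function names and the values of the means and of ρ are hypothetical.

```python
import math

def working_independence(T):
    """T x T identity matrix: repeated responses treated as uncorrelated."""
    return [[1.0 if s == t else 0.0 for t in range(T)] for s in range(T)]

def working_exchangeable(T, rho):
    """T x T matrix with 1 on the diagonal and a common rho elsewhere."""
    return [[1.0 if s == t else rho for t in range(T)] for s in range(T)]

def working_covariance(mu, R):
    """A^{1/2} R A^{1/2}, with A diagonal holding mu_t * (1 - mu_t)."""
    sd = [math.sqrt(m * (1 - m)) for m in mu]
    T = len(mu)
    return [[sd[s] * R[s][t] * sd[t] for t in range(T)] for s in range(T)]

mu_i = [0.2, 0.3, 0.4]                 # hypothetical means at T = 3 assessments
R = working_exchangeable(3, 0.5)       # hypothetical rho
V_i = working_covariance(mu_i, R)
for row in V_i:
    print([round(v, 4) for v in row])
```

Under working independence the same function returns a diagonal matrix, so the estimating equations separate into a sum over individual assessments.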

Under a specific $R(\alpha)$, we have $V_i = A_i^{1/2} R(\alpha) A_i^{1/2}$, where $A_i$ denotes a diagonal matrix with $\mu_{it}(1 - \mu_{it})$ on its t th diagonal. As in the case of cross-sectional data, inference is always valid even if $R(\alpha)$ ($A_i$) is not the true correlation (variance) of $y_i$ given $x_i$. In (9), $V_i$ also depends on α, though we have suppressed this dependence to highlight the fact that (9) is the equation for estimating β. Thus, α must be specified or estimated (except for the working independence model) to solve (9) for $\hat\beta$. We can either assign a value to α or estimate it together with β. For example, under the working exchangeable model, we may set ρ to any value between 0 and 1 or estimate ρ using the correlated residuals $y_{it} - \hat\mu_{it}$, with $\hat\mu_{it} = \exp(x_{it}^{\top}\hat\beta)$. Inference about β is based on the asymptotic normal distribution of the GEE estimator $\hat\beta$, which has mean β and variance $\frac{1}{n}\Sigma_{\beta}$:

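$$\hat\beta \sim_a N\left(\beta, \tfrac{1}{n}\Sigma_{\beta}\right), \quad \Sigma_{\beta} = B^{-1} E\left(D_i V_i^{-1} S_i S_i^{\top} V_i^{-1} D_i^{\top}\right) B^{-\top}, \quad B = E\left(D_i V_i^{-1} D_i^{\top}\right)$$ (10)

where $B^{\top}$ denotes the transpose of B and $B^{-\top} = (B^{\top})^{-1}$. We can estimate $\Sigma_{\beta}$ by the sandwich variance estimator $\hat\Sigma_{\beta}$, which is obtained by:

$$\hat\Sigma_{\beta} = \hat B^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1} \hat S_i \hat S_i^{\top} \hat V_i^{-1} \hat D_i^{\top}\right)\hat B^{-\top}, \quad \hat B = \frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1} \hat D_i^{\top}$$ (11)

where $\hat D_i$, $\hat V_i$ and $\hat S_i$ denote substituting $\hat\beta$ in place of β for the respective quantity $D_i$, $V_i$ and $S_i$.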
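The following sketch illustrates how the GEE and its sandwich variance differ from the cross-sectional case when responses are clustered within subjects. It assumes working independence and, as a simplification, the Poisson-type working variance (variance equal to the mean), so the point estimates come from the pooled estimating equations while the middle of the sandwich sums the estimating function within each subject before squaring. The function name `gee_independence` and the toy data are hypothetical; this is not the algorithm of geepack or PROC GEE.

```python
import math

def gee_independence(clusters, max_iter=50, tol=1e-10):
    """clusters: one list per subject of (x_it, y_it) pairs.
    Fits log(mu_it) = b0 + b1 * x_it and returns (b0, b1) together with
    the cluster-robust sandwich covariance estimate of (b0, b1)."""
    rows = [row for cluster in clusters for row in cluster]

    def score_and_bread(b0, b1):
        u0 = u1 = h00 = h01 = h11 = 0.0
        for x, y in rows:
            mu = math.exp(b0 + b1 * x)
            u0 += y - mu
            u1 += x * (y - mu)
            h00 += mu
            h01 += x * mu
            h11 += x * x * mu
        return u0, u1, h00, h01, h11

    b0, b1 = math.log(sum(y for _, y in rows) / len(rows)), 0.0
    for _ in range(max_iter):
        u0, u1, h00, h01, h11 = score_and_bread(b0, b1)
        det = h00 * h11 - h01 * h01
        d0 = (h11 * u0 - h01 * u1) / det
        d1 = (h00 * u1 - h01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break

    _, _, h00, h01, h11 = score_and_bread(b0, b1)
    det = h00 * h11 - h01 * h01
    a00, a01, a11 = h11 / det, -h01 / det, h00 / det   # inverse of the bread
    # Meat: sum over subjects of g_i g_i^T, g_i = sum_t (1, x_it)^T (y_it - mu_it).
    m00 = m01 = m11 = 0.0
    for cluster in clusters:
        g0 = g1 = 0.0
        for x, y in cluster:
            mu = math.exp(b0 + b1 * x)
            g0 += y - mu
            g1 += x * (y - mu)
        m00 += g0 * g0
        m01 += g0 * g1
        m11 += g1 * g1
    c00 = a00 * (m00 * a00 + m01 * a01) + a01 * (m01 * a00 + m11 * a01)
    c01 = a00 * (m00 * a01 + m01 * a11) + a01 * (m01 * a01 + m11 * a11)
    c11 = a01 * (m00 * a01 + m01 * a11) + a11 * (m01 * a01 + m11 * a11)
    return (b0, b1), [[c00, c01], [c01, c11]]

# Toy data (hypothetical): 8 subjects, 3 assessments, time-invariant binary x.
responses = {0: [[1, 1, 0], [0, 0, 0], [0, 0, 0], [1, 0, 0]],
             1: [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0]]}
clusters = [[(x, y) for y in ys] for x, yss in responses.items() for ys in yss]
(b0, b1), cov = gee_independence(clusters)
print("RR:", round(math.exp(b1), 4), "robust SE:", round(math.sqrt(cov[1][1]), 4))
```

Summing within subjects before squaring is what allows the variance estimate to reflect within-subject correlation even though the working correlation ignores it.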

Popular software packages all support semiparametric regression models for both cross-sectional and longitudinal data. For example, PROC GEE in SAS and geeglm() in the geepack package in R6 can be used to fit the semiparametric log-linear models in (4) for cross-sectional and (8) for longitudinal data.

Application

We illustrate our considerations with both real and simulated data. In all the examples, we set the statistical significance at $\alpha = 0.05$. All analyses are carried out using the geeglm() function in the geepack package in R.6

Simulation study

We consider modelling regression associations of a single time-invariant binary explanatory variable $x_i$ with a binary response $y_{it}$ in a longitudinal study with three assessments. To simulate the correlated $y_{it}$, we use a Gaussian copula with the marginal $y_{it}$ given $x_i$ following a Bernoulli7:

$$y_{it} \mid x_i \sim \text{Bernoulli}(\mu_{it}), \quad \log(\mu_{it}) = \beta_0 + \beta_1 x_i, \quad 1 \le t \le 3$$ (12)

For our simulation, we set Embedded Image and Embedded Image and an exchangeable correlation Embedded Image in the trivariate normal with Embedded Image .
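One way to generate such correlated binary responses is sketched below: draw an exchangeable trivariate normal vector through a shared component, map each coordinate to a uniform through the standard normal distribution function, and set the response to 1 when the uniform falls below the marginal mean from the log-linear model. The parameter values, seed and sample size are hypothetical stand-ins, not the settings used in the paper, and the shared-component construction assumes a non-negative latent correlation.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def simulate_subject(x, beta0, beta1, rho, T=3, rng=random):
    """Return T correlated Bernoulli responses with mean exp(beta0 + beta1 * x).
    rho is the exchangeable correlation of the latent normal vector (rho >= 0)."""
    mu = math.exp(beta0 + beta1 * x)
    if not 0 < mu < 1:
        raise ValueError("marginal mean must lie in (0, 1)")
    shared = rng.gauss(0, 1)                   # component common to all T draws
    ys = []
    for _ in range(T):
        z = math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1)
        u = STD_NORMAL.cdf(z)                  # uniform(0, 1) margin
        ys.append(1 if u < mu else 0)          # Bernoulli(mu) margin
    return ys

# Hypothetical settings for illustration only.
rng = random.Random(1)
beta0, beta1, rho = math.log(0.2), math.log(1.5), 0.5
data = [(x, simulate_subject(x, beta0, beta1, rho, rng=rng))
        for x in [0, 1] * 10000]

def mean_response(group):
    ys = [y for x, subject in data if x == group for y in subject]
    return sum(ys) / len(ys)

print("mean response, x = 0:", round(mean_response(0), 3))
print("mean response, x = 1:", round(mean_response(1), 3))
```

The correlation among the binary responses induced this way is smaller than the latent ρ, which is a known feature of Gaussian copula constructions.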

We fit the semiparametric model in (8) to the simulated data, that is,

$$E(y_{it} \mid x_i) = \mu_{it}, \quad \log(\mu_{it}) = \beta_0 + \beta_1 x_i, \quad 1 \le t \le 3$$ (13)

using the GEE in (9) under the working independent correlation model. Shown in table 1 are the estimates of β along with their standard errors (SEs) (both asymptotic and empirical), over 1 000 Monte Carlo (MC) replications under a sample size Embedded Image . The estimates $\hat\beta$ were quite close to their true values, and the asymptotic SEs were quite close to their empirical counterparts. Also shown in table 1 are type I error rates from testing the null hypotheses Embedded Image and Embedded Image . We estimate the type I errors using the MC replications. Let $W_m$ denote the Wald statistic at the m th MC replication ($1 \le m \le M$, with $M = 1\,000$). The type I error rate for testing each null hypothesis is estimated by $\hat\alpha = \frac{1}{M}\sum_{m=1}^{M} I(W_m \ge q_{1,0.95})$, where $I(\cdot)$ is the indicator function and $q_{1,0.95}$ is the 95th percentile of a $\chi^2_1$ distribution, a χ2 distribution with 1 df. As seen, the type I error rates were close to the nominal value $\alpha = 0.05$.
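The Monte Carlo estimate of the type I error rate is the proportion of replications whose Wald statistic reaches the 95th percentile of the χ2 distribution with 1 df. In the sketch below, that percentile is obtained as the square of the 97.5th percentile of the standard normal, and the Wald statistics are stand-ins drawn as squared standard normal variates (their large-sample distribution under the null) rather than statistics from fitted models. The seed and the number of replications are hypothetical.

```python
import random
from statistics import NormalDist

def chi2_1_critical(alpha=0.05):
    """Upper-alpha critical value of a chi-square with 1 df,
    i.e. the square of the standard normal (1 - alpha/2) quantile."""
    return NormalDist().inv_cdf(1 - alpha / 2) ** 2

def type_one_error(wald_stats, alpha=0.05):
    """Proportion of Wald statistics at or above the critical value."""
    crit = chi2_1_critical(alpha)
    return sum(w >= crit for w in wald_stats) / len(wald_stats)

# Stand-in Wald statistics simulated under the null (hypothetical setup).
rng = random.Random(2024)
wald = [rng.gauss(0, 1) ** 2 for _ in range(20000)]
rate = type_one_error(wald)
print("critical value:", round(chi2_1_critical(), 4))
print("estimated type I error rate:", rate)
```

In a full simulation each Wald statistic would be the squared ratio of the difference between the estimate and its null value to the sandwich SE from that replication.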

Table 1

Parameter estimates, SEs (asymptotic and empirical) and type I errors from GEE model with 1 000 MC replications

Real study

Smoking is the chief avoidable cause of morbidity and mortality in the USA, exacting a substantial financial burden as well.8 Smoking rates among persons with serious mental illness are exceptionally high, contributing to significant medical morbidity and mortality in this population, with many unlikely to live beyond their 50th birthday. Persons with mental illness spend nearly one-third of their monthly public assistance income on cigarettes instead of buying needed food, clothing and shelter.9 A study was conducted to evaluate the effect of a multicomponent smoking cessation programme adapted to patients with serious psychiatric disorders within an outpatient psychiatric clinic at the University of Rochester Medical Center. This study, sponsored by the New York State Department of Health Tobacco Control Program, capitalises on packaging multiple evidence-based components to achieve a better outcome than when each practice is implemented individually, a strategy used in a number of clinical venues, for example, for central line–associated bloodstream infections and ventilator-associated pneumonia.10 Among the 276 participating subjects, 99 also participated in a formal evaluation, in which interviews were conducted at the point of enrolment (baseline), prior to intervention, and again at 3, 6 and 12 months.

For illustrative purposes, we model the binary abstinence outcome, defined as the 7-day point prevalence (ie, abstinent from smoking for 7 days in a row), from preintervention at baseline, $t = 0$, to each of the three postintervention assessments, $t = 1, 2, 3$, at 3, 6 and 12 months, using data from 99 subjects. We create three time-varying dummy variables $x_{it1}$, $x_{it2}$ and $x_{it3}$ to indicate intervention effects at $t = 1, 2, 3$:

$$x_{itk} = \begin{cases} 1 & \text{if } t = k \\ 0 & \text{otherwise} \end{cases}, \quad k = 1, 2, 3$$

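Let $y_{it} = 1$ if the i th subject is abstinent for 7 days consecutively at assessment t and $y_{it} = 0$ otherwise. The semiparametric GEE model for change of abstinence rates over time is given by:

$$E(y_{it} \mid x_{it}) = \mu_{it}, \quad \log(\mu_{it}) = \beta_0 + \beta_1 x_{it1} + \beta_2 x_{it2} + \beta_3 x_{it3}, \quad 0 \le t \le 3$$ (14)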

We fit (14), a special case of (8), to the 7-day point prevalence data using the GEE in (9) under the working independent correlation model.

Shown in table 2 are the estimates $\hat\beta_k$ of $\beta_k$ and associated SEs, p values for testing the null $H_0: \beta_k = 0$ and RRs (exponentiated $\hat\beta_k$) at each assessment $k = 1, 2, 3$. The results show an RR greater than 1 for all three postintervention assessments, though it was statistically significant only at months 3 and 6. The intervention did have a significant effect on reducing smoking in this study sample, though the effect diminished 12 months after the intervention.
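The quantities reported for each assessment follow from a coefficient estimate and its sandwich SE in a few lines: the RR is the exponentiated estimate, a Wald confidence interval is obtained by exponentiating the interval for the log RR, and the two-sided p value comes from the standard normal. The sketch below uses a hypothetical estimate and SE, not the values from the study, and the function name `summarise_log_rr` is likewise an assumption.

```python
import math
from statistics import NormalDist

def summarise_log_rr(beta_hat, se, level=0.95):
    """Turn a log-RR estimate and its SE into the RR, a Wald confidence
    interval on the RR scale and a two-sided p value for H0: beta = 0."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    z = beta_hat / se
    return {
        "rr": math.exp(beta_hat),
        "ci": (math.exp(beta_hat - z_crit * se), math.exp(beta_hat + z_crit * se)),
        "p": 2 * (1 - NormalDist().cdf(abs(z))),
    }

# Hypothetical estimate and SE for illustration only.
result = summarise_log_rr(beta_hat=0.8, se=0.3)
print({k: v for k, v in result.items()})
```

Building the interval on the log scale and then exponentiating keeps the lower limit positive, which matches the range of an RR.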

Table 2

Estimates of parameters, SEs, p values and relative risks over time from the GEE model fitted to the Smoking Cessation Study data

Discussion

We extended the popular approach for modelling RRs for binary responses to longitudinal data by leveraging the semiparametric GEE. Like the original approach in Zou,1 the parameters of the proposed log-linear model have the log of RR interpretation and, thus, with appropriately defined explanatory variables, can be used for inference about RRs when modelling longitudinal regression relationships with binary responses. We also illustrated the proposed approach using both real and simulated longitudinal data.

The proposed GEE-based approach provides valid inference under the missing completely at random (MCAR) mechanism.3 11 In many real studies, missing data follow the missing at random (MAR) mechanism,3 11 in which case the proposed approach generally yields biased estimates of RR. We can readily extend the approach to provide valid inference under MAR by employing the weighted generalised estimating equations (WGEEs).11 Under WGEE, we also model the missingness of the binary response over time using GLMs for binary responses such as logistic regression, and estimate its parameters together with the parameters of the log-linear model in (8) using a set of estimating equations that extend (9) to include the additional parameters.3


Tuo Lin is a fifth-year PhD student in Biostatistics at the University of California, San Diego (UCSD) in the USA. He obtained his master’s degree in Statistics at UCSD in 2018. He is currently working as a graduate student researcher in the division of Biostatistics and Bioinformatics of Herbert Wertheim School of Public Health and Human Longevity Science at UCSD. He has also been working at Altman Clinical and Translational Research Institute (ACTRI) in the USA for many years, helping with study designs and data analyses. His main research interests include survey sampling and methods, causal inference and longitudinal data analysis in psychiatry studies.



Footnotes

  • Contributors All authors participated in the discussion of the statistical issues and worked together to develop this paper. HZ and XMT suggested the topic, and TL, RZ, ST and HW reviewed the literature. All authors discussed the conceptual and analytical issues with modelling RRs for longitudinal data using the parametric and semiparametric models. RZ, ST and HW developed the simulation settings, algorithms and associated R codes and performed the simulation study under the direction of TL. TL, HZ and XMT drafted the manuscript, while TL, RZ, ST and HW provided all the technical details and derivations, along with completing the application section. All authors worked together to finalise the manuscript.

  • Funding The project described was partially supported by the National Institutes of Health (grant UL1TR001442) of Georgia Clinical and Translational Science Alliance funding.

  • Competing interests None declared.

  • Provenance and peer review Commissioned; externally peer-reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.