Article Text

## Abstract

Although logistic regression is the most popular for modelling regression relationships with binary responses, many find relative risk (RR), or risk ratio, easier to interpret and prefer to use this measure of risk in regression analysis. Indeed, since Zou published his modified Poisson regression approach for modelling RR for cross-sectional data, his paper has been cited over 7 000 times, demonstrating the popularity of this alternative measure of risk in regression analysis involving binary responses. As longitudinal studies have become increasingly popular in clinical trials and observational studies, it is imperative to extend Zou’s approach for longitudinal data.

The two most popular approaches for longitudinal data analysis are the generalised linear mixed-effects model (GLMM) and generalised estimating equations (GEE). However, the parametric GLMM cannot be used for the extension within the current context, because Zou’s approach treats the binary response as a Poisson variable, which is at odds with the Bernoulli distribution for the binary response. On the other hand, as it imposes no mathematical model on data distributions, the semiparametric GEE is coherent with Zou’s modified Poisson regression. In this paper, we develop a GEE-based longitudinal model for binary responses to provide inference about RR.

- OR
- generalized estimating equations
- relative risk
- sandwich variance estimator
- semiparametric generalized linear models

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

## Statistics from Altmetric.com

- OR
- generalized estimating equations
- relative risk
- sandwich variance estimator
- semiparametric generalized linear models

## Introduction

Logistic regression is widely used to model binary responses. However, many find relative risk (RR), or risk ratio easier to interpret and prefer to model regression relationships with inference about RR, rather than odds ratio (OR) as in logistic regression. Indeed, since Zou1 published his modified Poisson regression approach for inference about RR, his paper has been cited 7 128 times, demonstrating the popularity of using RR in modelling binary responses. However, his approach isn’t applied to longitudinal data. Moreover, there is no one-to-one relationship between RR and OR for regression analysis.2 As longitudinal studies have become increasingly the standard in clinical trials and observational studies, it is imperative to develop statistical models for longitudinal binary responses with inference based on RR to fill the critical gap.

The two most popular paradigms to extend models for cross-sectional data to longitudinal data are the generalised linear mixed-effects model (GLMM) and generalised estimating equations (GEE). The parametric GLMM explicitly models the within-subject correlation using random effects, while the semiparametric, or distribution-free GEE implicitly accounts for such correlations using sandwich variance estimates.3 Since Zou’s approach treats binary responses as count variables and derives estimators of RR under the Poisson distribution, GLMM cannot be used to extend his approach to longitudinal data within the current context. As his approach is essentially a semiparametric log-linear model, a simplified version of GEE for cross-sectional data, GEE provides a coherent paradigm to develop to extend his approach to longitudinal data.

In the Models for Relative Risks for Longitudinal Binary Responses section, we first review semiparametric regression models for cross-sectional and longitudinal binary responses under the logit and log link for inference about the respective log of OR and log of RR. We then discuss a GEE-based approach for longitudinal binary responses for inference about RR by leveraging semiparametric log-linear models. In the Application section, we use real and simulated data to illustrate the proposed approach. In the Discussion section, we give our concluding remarks.

## Models for relative risks for longitudinal binary responses

We start with a brief review of Zou’s approach for inference about RR when modelling binary responses in cross-sectional data.

### Cross-sectional data

Consider a study with
n
subjects indexed by
Let
denote a binary response of interest and let
with
denote a
vector of explanatory, or independent, variables from the
i
th subject
. The popular logistic regression model is defined by a generalised linear model (GLM) with the logit link as Tang *et al*
3:

(1)

where
denotes independently distributed, Bernoulli
denotes the *Bernoulli* distribution with mean
, logit denotes the logit link function and
γ
is the vector of model parameters or coefficients. Under logistic regression, each regression coefficient
has the log OR interpretation per unit change in
for
3 Inference about
γ
is generalised based on maximum likelihood.3

For to have the RR interpretation, we need to change the logit link to the log link function to express (1) as:

(2)

For differentiating log OR from log RR, we use a different symbol β in (2) to denote the model coefficients. Under (2), each coefficient has the log RR interpretation. For example, consider one unit increase in from to . Denote the change in the mean of in response to the change in by:

Then, it follows from (2) that the log of RR, , for the unit change in from to is:

The two GLMs in (1) and (2) are quite similar except for the different link functions. Under logit link in (1), the conditional mean is constrained between 0 and 1, while under the log link in (2), is confined only to positive values. Since may exceed 1, the upper bound for a probability quantity, estimates based on maximising the Bernoulli likelihood may not converge under the log link.4 5 To alleviate this problem, we may switch the Bernoulli distribution in (2) to the Poisson, that is,

(3)

Since the logic restriction of positive values on is consistent with the mean of the Poisson, fitting the model (3) to observed data will not be an issue. For rare diseases, will be close to 0 and may be viewed as a count, frequency, or response with mean , in which case the Poisson-based (3) is a reasonable approximation. In general, with increased , (3) may not provide valid inference, since the binary will not have a Poisson distribution in this case. Zou discussed the use of the sandwich variance estimator as an alternative to estimate the variance of the estimator of β . This approach is essentially a semiparametric regression, or restricted moment model, in which only the model for the conditional mean of given in (3) is assumed:

(4)

Thus, unlike (3), the semiparametric log-linear model above does not assume Poisson or any other parametric distribution for . Different from a parametric model, a semiparametric model leverages estimating equations to play the role of the likelihood to provide inference.3 Unlike maximum likelihood estimation, inference based on estimating equations is consistent regardless of the distribution of , so long as the assumed conditional mean in (4) is correct.3 Thus, even if does not have a Poisson distribution, inference about β in (4) is still correct when based on the estimating equations.

Within the current context, the estimating equations for inference about β have the form:

(5)

where is the conditional variance of given . Under (4), and are readily evaluated. However, is not determined by the semiparametric log-linear model in (4), since it only specifies the conditional mean . Within the current context, follows the Bernoulli , in which case we have . We obtain the estimate of β by solving (5) for β . Unlike linear regression, cannot be evaluated in closed form but is readily computed numerically.3

The estimator has an asymptotically normal distribution with mean β and variance :

(6)

where denotes the inverse of B . We can estimate by the following sandwich variance estimator :

(7)

Note that unlike likelihood-based inference for parametric models, inference based on the estimating equations in (5) for semiparametric models is always valid, regardless of the distribution of . In particular, instead of , we may also set to any function of such as (by treating as a Poisson with mean ) for valid inference about β . This is why we can model a binary using a semiparametric log-linear model for a count response.

### Longitudinal data

We now consider extending the semiparametric log-linear model above to longitudinal data.

Suppose that the subjects are assessed repeatedly over T time points . Let and denote the same response and explanatory variables as in the cross-sectional data setting, but with t indicating their dependence on the time of assessment ( , ). By applying the semiparametric log-linear model in (4) to each assessment t , we obtain an extension of the semiparametric log-linear model for the association of longitudinal and :

(8)

Thus, we do not explicitly model correlations among the repeated ’s. Inference about β is based on extending the estimating equations in (5) to the correlated ’s.

Let

The estimating equations, which are often called the generalised estimating equations (GEE) in the literature, for inference about β have the form:

(9)

where
is the conditional variance of
given
. As in the cross-sectional case, we can readily evaluate
and
under (8) and set
for each
. However, the conditional covariance between
given
is quite complex. In almost all applications of GEE, we use a working correlation
to approximate the true correlation
, where
is a
correlation matrix with its entries defined by a parameter vector
α
*.*
3 Popular choices of
are the working independence, with
, and working exchangeable, with
, model, where
denotes the
identity matrix and
ρ
is a parameter.

Under a specific , we have , where denotes a diagonal matrix with on its t th diagonal. As in the case of cross-sectional data, inference is always valid even if is not the true correlation (variance) of given . In (9), also depends on α , though we have suppressed this dependence to highlight the fact that (9) is the equation for estimating β . Thus, α must be estimated (except for the working independence model) to solve (9) for We can either assign a value to or estimate α together with β . For example, under , we may set ρ to any value between 0 and 1 or estimate ρ using correlated residuals , with . Inference about β is based on the asymptotic normal distribution of the GEE estimator , which has mean β and variance :

(10)

where denotes the transpose of B . We can estimate by the sandwich variance estimator , which is obtained by:

(11)

where , and denote substituting in place of β for the respective quantity , and .

Popular software packages all support semiparametric regression models for both cross-sectional and longitudinal data. For example, PROC GEE in SAS and geeglm() in the geepack package in R6 can be used to fit the semiparametric log-linear models in (4) for cross-sectional and (8) for longitudinal data.

## Application

We illustrate our considerations with both real and simulated data. In all the examples, we set the statistical significance at . All analyses are carried out using the geeglm() function in the geepack package in R.6

### Simulation study

We consider modelling regression associations of a single time-invariant binary explanatory variable with a binary response in a longitudinal study with three assessments. To simulate the correlated , we use a Gaussian copula with the marginal given following a Bernoulli7:

(12)

For our simulation, we set and and an exchangeable correlation in the trivariate normal with .

We fit the semiparametric (8) to the data simulated, that is,

(13)

using the GEE in (9) under the working independent correlation model. Shown in table 1 are the estimates of
β
along with their standard errors (SEs) (both asymptotic and empirical), over 1 000 Monte Carlo (MC) replications under a sample size
. The estimates
were quite close to their true values, and the asymptotic SEs were quite close to their empirical counterparts. Also, shown in table 1 are type I error rates from testing the null hypothesis
and
. We estimate the type I errors using MC iterations. Let
denotes the Wald statistic at the
m
th MC replication, the type I error rate for testing
is estimated by:
, where
is the 95th percentile of a
distribution, a χ^{2} distribution with 1 df. As seen, the type I error rates were close the normal values
.

### Real study

Smoking is the chief avoidable cause of morbidity and mortality in the USA, exacting a substantive financial burden as well.8 Smoking rates among persons with serious mental illness are exceptionally high, contributing to significant medical morbidity and mortality in this population, with many unlikely to live beyond their 50th birthday. Persons with mental illness spend nearly one-third of their monthly public assistance income on cigarettes instead of buying needed food, clothing and shelter.9 A study was conducted to evaluate the effect of a multicomponent smoking cessation programme adapted to patients with serious psychiatric disorders within an outpatient psychiatric clinic at the University of Rochester Medical Center. This study, sponsored by the New York State Department of Health Tobacco Control Program, capitalises on packaging multiple evidence-based components to achieve a better outcome than when each practice is individually implemented in a number of clinical venues, for example, central line–associated bloodstream infections and ventilator-associated pneumonia.10 Among the 276 participating subjects, 99 also participated in a formal evaluation, in which interviews were conducted at the point of enrolment (baseline), prior to intervention and again at 3, 6 and 12 months.

For illustrative purposes, we model the binary abstinence outcome, defined as the 7-day point prevalence (ie, abstinent from smoking for 7 days in a row), from preintervention at baseline, , to each of the three postintervention assessments, , at 3, 6 and 12 months, using data from 99 subjects. We create three time-varying dummy variables , and to indicate intervention effects at :

Let if the i th subject is abstinent for 7 days consecutively and otherwise. The semiparametric GEE for change of abstinence rates over time is given by:

(14)

We fit (8) to the 7-day point prevalence data using the GEE in (9) under the working independent correlation model.

Shown in table 2 are the estimates of and associated SEs, p values for testing the null and RRs (exponentiated ) at each assessment . The results show a RR greater than 1 for all three postintervention assessments, though only statistically significant at months 3 and 6. The intervention did have a significant effect on reducing smoking in this study sample, though the effect diminished 12 months after the intervention.

## Discussion

We extended the popular approach for modelling RRs for binary responses to longitudinal data by leveraging the semiparametric GEE. Like the original approach in Zou,1 the parameters of the proposed log-linear model have the log of RR interpretation and, thus, with appropriately defined explanatory variables, can be used for inference about RRs when modelling longitudinal regression relationships with binary responses. We also illustrated the proposed approach using both real and simulated longitudinal data.

The proposed GEE-based approach provides valid inference under the missing completely at random (MCAR) mechanism.3 11 In many real studies, missing data follow the missing at random (MAR) mechanism,3 11 in which case the lowest patterns done by the proposed approach generally yield biased estimates of RR. We can readily extend the approach to provide valid inference under MAR by employing the weighted generalised estimating equations (WGEEs).11 Under WGEE, we also model the missingness of the binary response over time using GLMs for binary responses such as logistic regression and estimate its parameters and the parameters of the log-linear model in (8) together using a set of estimating equations that extend (9) to include the additional parameters.3

### Supplemental material

## References

Tuo Lin is a fifth-year PhD student in Biostatistics at the University of California, San Diego (UCSD) in the USA. He obtained his master’s degree in Statistics at UCSD in 2018. He is currently working as a graduate student researcher in the division of Biostatistics and Bioinformatics of Herbert Wertheim School of Public Health and Human Longevity Science at UCSD. He has also been working at Altman Clinical and Translational Research Institute (ACTRI) in the USA for many years, helping with study designs and data analyses. His main research interests include survey sampling and methods, causal inference and longitudinal data analysis in psychiatry studies.

## Supplementary materials

## Supplementary Data

This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

## Footnotes

Contributors All authors participated in the discussion of the statistical issues and worked together to develop this paper. HZ and XMT suggested the topic, and TL, RZ, ST and HW reviewed the literature. All authors discussed the conceptual and analytical issues with modelling RRs for longitudinal data using the parametric and semiparametric models. RZ, ST and HW developed the simulation settings, algorithms and associated R codes and performed the simulation study under the direction of TL. TL, HZ and XMT drafted the manuscript, while TL, RZ, ST and HW provided all the technical details and derivations, along with completing the application section. All authors worked together to finalise the manuscript.

Funding The project described was partially supported by the National Institutes of Health (grant UL1TR001442) of Georgia Clinical and Translational Science Alliance funding.

Competing interests None declared.

Provenance and peer review Commissioned; externally peer-reviewed.

Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.