We start with a brief review of Zou’s approach for inference about RR when modelling binary responses in cross-sectional data.
Cross-sectional data
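Consider a study with $n$ subjects indexed by $i$ ($1 \le i \le n$). Let $y_i$ denote a binary response of interest and let $x_i = (x_{i0}, x_{i1}, \ldots, x_{ip})^\top$ with $x_{i0} \equiv 1$ denote a vector of explanatory, or independent, variables from the $i$th subject. The popular logistic regression model is defined by a generalised linear model (GLM) with the logit link as in Tang et al3:

$$y_i \mid x_i \sim \text{i.d. Bernoulli}(\mu_i), \qquad \operatorname{logit}(\mu_i) = x_i^\top \gamma = \gamma_0 + \gamma_1 x_{i1} + \cdots + \gamma_p x_{ip}, \tag{1}$$

where i.d. denotes independently distributed, $\text{Bernoulli}(\mu)$ denotes the Bernoulli distribution with mean $\mu$, $\operatorname{logit}(\mu) = \log\{\mu/(1-\mu)\}$ denotes the logit link function and $\gamma = (\gamma_0, \gamma_1, \ldots, \gamma_p)^\top$ is the vector of model parameters or coefficients. Under logistic regression, each regression coefficient $\gamma_j$ has the log OR interpretation per unit change in $x_{ij}$ for $1 \le j \le p$.3 Inference about $\gamma$ is generally based on maximum likelihood.3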
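For the regression coefficients to have the RR interpretation, we need to change the logit link to the log link function to express (1) as:

$$y_i \mid x_i \sim \text{i.d. Bernoulli}(\mu_i), \qquad \log(\mu_i) = x_i^\top \beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}. \tag{2}$$

To differentiate log OR from log RR, we use a different symbol $\beta$ in (2) to denote the model coefficients. Under (2), each coefficient $\beta_j$ has the log RR interpretation. For example, consider a one-unit increase in $x_{i1}$ from $a$ to $a+1$, with the other explanatory variables held fixed. Denote the change in the mean of $y_i$ in response to the change in $x_{i1}$ by:

$$\mu_i(a) = E(y_i \mid x_{i1} = a, x_{i2}, \ldots, x_{ip}), \qquad \mu_i(a+1) = E(y_i \mid x_{i1} = a+1, x_{i2}, \ldots, x_{ip}).$$

Then, it follows from (2) that the log of RR, $\log\{\mu_i(a+1)/\mu_i(a)\}$, for the unit change in $x_{i1}$ from $a$ to $a+1$ is:

$$\log\frac{\mu_i(a+1)}{\mu_i(a)} = \log \mu_i(a+1) - \log \mu_i(a) = \beta_1 (a+1) - \beta_1 a = \beta_1.$$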
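The contrast between the two links can be checked numerically. The short sketch below uses hypothetical coefficient values (not taken from any data) and two explanatory variables; it shows that the ratio of means is constant under the log link, whereas under the logit link only the OR is constant and the RR changes with the level of the second variable.

```python
import numpy as np

# Hypothetical coefficient values, chosen only for illustration.
beta = np.array([np.log(0.10), np.log(1.5), 0.2])   # log link, as in model (2)
gamma = np.array([-2.0, np.log(1.5), 0.8])          # logit link, as in model (1)

def mean_log(x1, x2):
    """Conditional mean under the log link."""
    return np.exp(beta[0] + beta[1] * x1 + beta[2] * x2)

def mean_logit(x1, x2):
    """Conditional mean under the logit link."""
    eta = gamma[0] + gamma[1] * x1 + gamma[2] * x2
    return 1.0 / (1.0 + np.exp(-eta))

def odds(m):
    return m / (1.0 - m)

x2_values = np.array([0.0, 1.0, 2.0])

# RR for a unit increase in x1 (0 -> 1) at each level of x2.
rr_log = mean_log(1.0, x2_values) / mean_log(0.0, x2_values)
rr_logit = mean_logit(1.0, x2_values) / mean_logit(0.0, x2_values)
or_logit = odds(mean_logit(1.0, x2_values)) / odds(mean_logit(0.0, x2_values))

print("RR under log link:  ", rr_log)     # constant, equal to exp(beta_1)
print("OR under logit link:", or_logit)   # constant, equal to exp(gamma_1)
print("RR under logit link:", rr_logit)   # varies with x2
```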
The two GLMs in (1) and (2) are quite similar except for the different link functions. Under the logit link in (1), the conditional mean $\mu_i = E(y_i \mid x_i)$ is constrained between 0 and 1, while under the log link in (2), $\mu_i = \exp(x_i^\top \beta)$ is confined only to positive values. Since $\exp(x_i^\top \beta)$ may exceed 1, the upper bound for a probability, estimates based on maximising the Bernoulli likelihood may not converge under the log link.4 5 To alleviate this problem, we may switch the Bernoulli distribution in (2) to the Poisson, that is,

$$y_i \mid x_i \sim \text{i.d. Poisson}(\mu_i), \qquad \log(\mu_i) = x_i^\top \beta. \tag{3}$$

Since the restriction of $\mu_i$ to positive values is consistent with the mean of the Poisson, fitting the model (3) to observed data will not be an issue. For rare diseases, $\mu_i$ will be close to 0 and $y_i$ may be viewed as a count, or frequency, response with mean $\mu_i$, in which case the Poisson-based (3) is a reasonable approximation. In general, with increased $\mu_i$, (3) may not provide valid inference, since the binary $y_i$ will not have a Poisson distribution in this case. Zou discussed the use of the sandwich variance estimator as an alternative to estimate the variance of the estimator of $\beta$. This approach is essentially a semiparametric regression, or restricted moment model, in which only the model for the conditional mean of $y_i$ given $x_i$ in (3) is assumed:

$$E(y_i \mid x_i) = \mu_i, \qquad \log(\mu_i) = x_i^\top \beta. \tag{4}$$

Thus, unlike (3), the semiparametric log-linear model above does not assume Poisson or any other parametric distribution for $y_i$. Different from a parametric model, a semiparametric model leverages estimating equations to play the role of the likelihood to provide inference.3 Unlike maximum likelihood estimation, inference based on estimating equations is consistent regardless of the distribution of $y_i$, so long as the assumed conditional mean in (4) is correct.3 Thus, even if $y_i$ does not have a Poisson distribution, inference about $\beta$ in (4) is still correct when based on the estimating equations.
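Within the current context, the estimating equations for inference about $\beta$ have the form:

$$\sum_{i=1}^{n} D_i V_i^{-1} S_i = 0, \qquad D_i = \frac{\partial \mu_i}{\partial \beta}, \qquad S_i = y_i - \mu_i, \tag{5}$$

where $V_i$ is the conditional variance of $y_i$ given $x_i$. Under (4), $D_i = \mu_i x_i$ and $S_i$ are readily evaluated. However, $V_i$ is not determined by the semiparametric log-linear model in (4), since it only specifies the conditional mean $\mu_i$. Within the current context, $y_i$ follows the $\text{Bernoulli}(\mu_i)$, in which case we have $V_i = \mu_i(1-\mu_i)$. We obtain the estimate $\hat\beta$ of $\beta$ by solving (5) for $\beta$. Unlike linear regression, $\hat\beta$ cannot be evaluated in closed form but is readily computed numerically.3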
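The estimator $\hat\beta$ has an asymptotically normal distribution with mean $\beta$ and variance $\Sigma_\beta$:

$$\hat\beta \sim_a N(\beta, \Sigma_\beta), \qquad \Sigma_\beta = \frac{1}{n} B^{-1} E\left(D_i V_i^{-1} S_i^2 V_i^{-1} D_i^\top\right) B^{-1}, \qquad B = E\left(D_i V_i^{-1} D_i^\top\right), \tag{6}$$

where $B^{-1}$ denotes the inverse of $B$. We can estimate $\Sigma_\beta$ by the following sandwich variance estimator $\hat\Sigma_\beta$:

$$\hat\Sigma_\beta = \frac{1}{n} \hat B^{-1} \left(\frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1} \hat S_i^2 \hat V_i^{-1} \hat D_i^\top\right) \hat B^{-1}, \qquad \hat B = \frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1} \hat D_i^\top, \tag{7}$$

where $\hat D_i$, $\hat V_i$ and $\hat S_i$ denote $D_i$, $V_i$ and $S_i$ evaluated at $\hat\beta$.

Note that unlike likelihood-based inference for parametric models, inference based on the estimating equations in (5) for semiparametric models remains valid regardless of the distribution of $y_i$, provided the conditional mean in (4) is correctly specified. In particular, instead of $V_i = \mu_i(1-\mu_i)$, we may also set $V_i$ to any positive function of $\mu_i$ such as $V_i = \mu_i$ (by treating $y_i$ as a Poisson with mean $\mu_i$) for valid inference about $\beta$. This is why we can model a binary $y_i$ using a semiparametric log-linear model for a count response.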
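As an illustration of the cross-sectional approach, the following is a minimal NumPy sketch, not the authors' implementation. It solves the estimating equations in (5) by Newton iterations under the Poisson-type working variance (each variance set equal to its mean) and then forms the sandwich variance estimate. The simulated data, the function names `fit_log_linear` and `sandwich_cov`, and the true risks of 0.2 and 0.3 are all assumptions made for the example.

```python
import numpy as np

def fit_log_linear(X, y, n_iter=50, tol=1e-10):
    """Solve sum_i D_i V_i^{-1} S_i = 0 under log(mu_i) = x_i' beta with V_i = mu_i.

    With this working variance, D_i V_i^{-1} S_i reduces to x_i (y_i - mu_i).
    Assumes the first column of X is the intercept.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())                  # start at the overall risk
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)                  # sum_i x_i (y_i - mu_i)
        bread = X.T @ (X * mu[:, None])         # sum_i mu_i x_i x_i'
        step = np.linalg.solve(bread, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def sandwich_cov(X, y, beta):
    """Sandwich variance estimate for beta_hat; the 1/n factors cancel."""
    mu = np.exp(X @ beta)
    bread = X.T @ (X * mu[:, None])             # sum_i D_i V_i^{-1} D_i'
    meat = X.T @ (X * ((y - mu) ** 2)[:, None]) # sum_i x_i x_i' (y_i - mu_i)^2
    bread_inv = np.linalg.inv(bread)
    return bread_inv @ meat @ bread_inv

# Simulated example (hypothetical): binary exposure, risks 0.2 vs 0.3, true RR = 1.5.
rng = np.random.default_rng(0)
n = 5000
x = rng.binomial(1, 0.5, n)
y = rng.binomial(1, np.where(x == 1, 0.3, 0.2)).astype(float)
X = np.column_stack([np.ones(n), x])

beta_hat = fit_log_linear(X, y)
cov_hat = sandwich_cov(X, y, beta_hat)
se_log_rr = np.sqrt(cov_hat[1, 1])

rr_hat = np.exp(beta_hat[1])
ci = np.exp(beta_hat[1] + np.array([-1.96, 1.96]) * se_log_rr)
print("estimated RR:", rr_hat, "95% CI:", ci)
```

With a single binary explanatory variable the model is saturated, so the estimated RR coincides with the ratio of the two observed proportions, which gives a quick sanity check on the solver.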
Longitudinal data
We now consider extending the semiparametric log-linear model above to longitudinal data.
Suppose that the $n$ subjects are assessed repeatedly over $T$ time points $t = 1, \ldots, T$. Let $y_{it}$ and $x_{it}$ denote the same response and explanatory variables as in the cross-sectional data setting, but with $t$ indicating their dependence on the time of assessment ($1 \le i \le n$, $1 \le t \le T$). By applying the semiparametric log-linear model in (4) to each assessment $t$, we obtain an extension of the semiparametric log-linear model for the association of longitudinal $y_{it}$ and $x_{it}$:

$$E(y_{it} \mid x_{it}) = \mu_{it}, \qquad \log(\mu_{it}) = x_{it}^\top \beta, \qquad 1 \le t \le T. \tag{8}$$

Thus, we do not explicitly model correlations among the repeated $y_{it}$'s. Inference about $\beta$ is based on extending the estimating equations in (5) to the correlated $y_{it}$'s.
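Let

$$y_i = (y_{i1}, \ldots, y_{iT})^\top, \qquad \mu_i = (\mu_{i1}, \ldots, \mu_{iT})^\top, \qquad x_i = (x_{i1}^\top, \ldots, x_{iT}^\top)^\top, \qquad D_i = \frac{\partial \mu_i^\top}{\partial \beta}, \qquad S_i = y_i - \mu_i.$$

The estimating equations, which are often called the generalised estimating equations (GEE) in the literature, for inference about $\beta$ have the form:

$$\sum_{i=1}^{n} D_i V_i^{-1} S_i = 0, \tag{9}$$

where $V_i$ is the conditional variance of $y_i$ given $x_i$. As in the cross-sectional case, we can readily evaluate $D_i$ and $S_i$ under (8) and set $\operatorname{Var}(y_{it} \mid x_{it}) = \mu_{it}(1-\mu_{it})$ for each $t$. However, the conditional covariance between $y_{is}$ and $y_{it}$ ($s \ne t$) given $x_i$ is quite complex. In almost all applications of GEE, we use a working correlation $R(\alpha)$ to approximate the true correlation $\operatorname{Corr}(y_i \mid x_i)$, where $R(\alpha)$ is a $T \times T$ correlation matrix with its entries defined by a parameter vector $\alpha$.3 Popular choices of $R(\alpha)$ are the working independence model, with $R(\alpha) = I_T$, and the working exchangeable model, with $R(\alpha) = C_T(\rho)$ (ones on the diagonal and $\rho$ in every off-diagonal entry), where $I_T$ denotes the $T \times T$ identity matrix and $\rho$ is a parameter.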
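The two working correlation structures named above can be written down directly. The sketch below is illustrative only: the function names and the values of the number of time points and of the exchangeable parameter are assumptions. It also checks the eigenvalues of the exchangeable matrix, which must all be positive for it to be a valid correlation matrix; this holds when the parameter lies between $-1/(T-1)$ and 1.

```python
import numpy as np

def working_independence(T):
    """Working independence: the T x T identity matrix."""
    return np.eye(T)

def working_exchangeable(T, rho):
    """Working exchangeable: ones on the diagonal, rho in every off-diagonal entry."""
    return (1.0 - rho) * np.eye(T) + rho * np.ones((T, T))

# Illustrative values (assumptions, not taken from the text).
T, rho = 4, 0.3
R_ind = working_independence(T)
R_exch = working_exchangeable(T, rho)

# Eigenvalues of the exchangeable matrix: 1 - rho (T - 1 times) and 1 + (T - 1) * rho.
eigvals = np.linalg.eigvalsh(R_exch)
print(R_exch)
print("eigenvalues:", eigvals)
```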
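Under a specific $R(\alpha)$, we have $V_i = A_i^{1/2} R(\alpha) A_i^{1/2}$, where $A_i$ denotes a diagonal matrix with $\mu_{it}(1-\mu_{it})$ on its $t$th diagonal. As in the case of cross-sectional data, inference remains valid even if $R(\alpha)$ ($V_i$) is not the true correlation (variance) of $y_i$ given $x_i$, provided the mean model in (8) is correct. In (9), $V_i$ also depends on $\alpha$, though we have suppressed this dependence to highlight the fact that (9) is the equation for estimating $\beta$. Thus, $\alpha$ must be specified or estimated (except for the working independence model) to solve (9) for $\beta$. We can either assign a value to $\alpha$ or estimate $\alpha$ together with $\beta$. For example, under the working exchangeable model, we may set $\rho$ to any value between 0 and 1 or estimate $\rho$ using correlated residuals $\hat r_{it} = (y_{it} - \hat\mu_{it}) / \sqrt{\hat\mu_{it}(1-\hat\mu_{it})}$, with $\hat\mu_{it} = \exp(x_{it}^\top \hat\beta)$. Inference about $\beta$ is based on the asymptotic normal distribution of the GEE estimator $\hat\beta$, which has mean $\beta$ and variance $\Sigma_\beta$:

$$\Sigma_\beta = \frac{1}{n} B^{-1} E\left(D_i V_i^{-1} S_i S_i^\top V_i^{-1} D_i^\top\right) B^{-\top}, \qquad B = E\left(D_i V_i^{-1} D_i^\top\right), \tag{10}$$

where $B^{\top}$ denotes the transpose of $B$ and $B^{-\top} = (B^{\top})^{-1}$. We can estimate $\Sigma_\beta$ by the sandwich variance estimator $\hat\Sigma_\beta$, which is obtained by:

$$\hat\Sigma_\beta = \frac{1}{n} \hat B^{-1} \left(\frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1} \hat S_i \hat S_i^\top \hat V_i^{-1} \hat D_i^\top\right) \hat B^{-\top}, \qquad \hat B = \frac{1}{n}\sum_{i=1}^{n} \hat D_i \hat V_i^{-1} \hat D_i^\top, \tag{11}$$

where $\hat D_i$, $\hat V_i$ and $\hat S_i$ denote substituting $\hat\beta$ in place of $\beta$ (and, where applicable, the assigned or estimated $\alpha$) for the respective quantity $D_i$, $V_i$ and $S_i$.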
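The longitudinal procedure can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: it assumes balanced data (every subject observed at all time points), uses the Poisson-type working variance (each variance set equal to its mean, so that square roots stay defined even if a fitted mean exceeds 1 during iteration), uses an exchangeable working correlation, and estimates the correlation parameter by a simple moment estimator from Pearson residuals, clipped to [0, 0.95]. The function names and the simulated data are hypothetical.

```python
import numpy as np

def exchangeable(T, rho):
    return (1.0 - rho) * np.eye(T) + rho * np.ones((T, T))

def gee_pieces(X, y, beta, rho):
    """Return the GEE score, sum_i D_i V_i^{-1} D_i', and sum_i U_i U_i'."""
    n, T, p = X.shape
    mu = np.exp(X @ beta)                               # (n, T) fitted means
    R_inv = np.linalg.inv(exchangeable(T, rho))
    score, bread, meat = np.zeros(p), np.zeros((p, p)), np.zeros((p, p))
    for i in range(n):
        Dt = mu[i][:, None] * X[i]                      # D_i' with rows mu_it * x_it'
        a = 1.0 / np.sqrt(mu[i])                        # diagonal of A_i^{-1/2}
        V_inv = a[:, None] * R_inv * a[None, :]         # A^{-1/2} R^{-1} A^{-1/2}
        U = Dt.T @ V_inv @ (y[i] - mu[i])               # D_i V_i^{-1} S_i
        score += U
        bread += Dt.T @ V_inv @ Dt
        meat += np.outer(U, U)
    return score, bread, meat

def moment_rho(X, y, beta):
    """Simple moment estimate of the exchangeable correlation (an assumption)."""
    n, T, _ = X.shape
    mu = np.exp(X @ beta)
    r = (y - mu) / np.sqrt(mu)                          # Pearson residuals
    phi = np.mean(r ** 2)
    cross = r.sum(axis=1) ** 2 - (r ** 2).sum(axis=1)   # sum over s != t of r_is * r_it
    return float(np.clip(cross.sum() / (n * T * (T - 1) * phi), 0.0, 0.95))

def fit_gee(X, y, rho=None, n_iter=100, tol=1e-10):
    """Fit log(mu_it) = x_it' beta; rho=None estimates rho, otherwise rho is fixed."""
    beta = np.zeros(X.shape[2])
    beta[0] = np.log(y.mean())                          # assumes X[..., 0] is the intercept
    rho_hat = 0.0 if rho is None else rho
    for _ in range(n_iter):
        if rho is None:
            rho_hat = moment_rho(X, y, beta)
        score, bread, _ = gee_pieces(X, y, beta, rho_hat)
        step = np.linalg.solve(bread, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    _, bread, meat = gee_pieces(X, y, beta, rho_hat)    # re-evaluate at the final beta
    bread_inv = np.linalg.inv(bread)
    return beta, bread_inv @ meat @ bread_inv, rho_hat

# Simulated example (hypothetical): subject-level risks drawn from a beta
# distribution induce within-subject correlation; marginal risks are 0.2 and 0.3.
rng = np.random.default_rng(1)
n, T = 400, 3
g = rng.binomial(1, 0.5, n)
mu_g = np.where(g == 1, 0.3, 0.2)
p_i = rng.beta(4 * mu_g, 4 * (1 - mu_g))
y = rng.binomial(1, p_i[:, None], size=(n, T)).astype(float)
X = np.ones((n, T, 2))
X[:, :, 1] = g[:, None]

beta_ind, cov_ind, _ = fit_gee(X, y, rho=0.0)           # working independence
beta_exch, cov_exch, rho_hat = fit_gee(X, y)            # working exchangeable
print("RR (independence):", np.exp(beta_ind[1]), "SE log RR:", np.sqrt(cov_ind[1, 1]))
print("RR (exchangeable):", np.exp(beta_exch[1]), "rho_hat:", rho_hat)
```

In this particular set-up, with balanced data and a single time-invariant explanatory variable, the two working correlation structures lead to the same estimating equations for the coefficients, so the two fits should agree; with time-varying explanatory variables they generally differ.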
Popular software packages all support semiparametric regression models for both cross-sectional and longitudinal data. For example, PROC GEE in SAS and geeglm() in the geepack package in R6 can be used to fit the semiparametric log-linear models in (4) for cross-sectional and (8) for longitudinal data.