Biostatistical methods in psychiatry

Guidance for use of weights: an analysis of different types of weights and their implications when using SAS PROCs

Abstract

SAS and other popular statistical packages provide support for survey data with sampling weights. For example, PROC MEANS and PROC LOGISTIC in SAS have their counterparts PROC SURVEYMEANS and PROC SURVEYLOGISTIC to facilitate analysis of data from complex survey studies. On the other hand, PROC MEANS and many other classic SAS procedures also provide an option for including weights; they yield the same point estimates as their corresponding survey procedures, but different standard errors (SEs). This paper takes an in-depth look at different types of weights and provides guidance on the use of different SAS procedures.

Introduction

All popular SAS procedures support use of weights, such as the classic PROC MEANS, PROC GLM and PROC LOGISTIC. To facilitate analysis of survey study data, SAS also provides an array of procedures with a ‘SURVEY’ prefix, such as PROC SURVEYMEANS, PROC SURVEYREG and PROC SURVEYLOGISTIC. As the latter are designed and developed specifically for survey study data, weights reflecting sampling and non-response are an integral part of these procedures. A common question is which SAS PROC to use when analysing survey study data involving sampling weights. Although they produce identical point estimates, the classic SAS procedures and their SURVEY counterparts generally yield different SEs, which in some cases lead to quite different p-values and conclusions. In this document, we discuss conceptual differences underlying the different types of weights and their implications for the statistical methods implemented in the different SAS procedures. Although we focus on SAS in this paper, the discussions also apply to other statistical packages such as R and SPSS. For ease of exposition, we focus on PROC MEANS and PROC SURVEYMEANS and illustrate our considerations with real and Monte Carlo simulated data. The same considerations also apply to other SAS procedures for multiple variable analyses, such as PROC LOGISTIC versus PROC SURVEYLOGISTIC; however, here we focus on univariate analyses.

Types of Weights and Methods Underlying SAS PROC MEANS and PROC SURVEYMEANS

Within these two procedures, weights have different uses and meanings. Weights used in PROC MEANS are designed to address violations of ‘homoscedasticity’, a key assumption underlying many statistical methods, such as inference for population means within the current context and, more generally, regression analysis. In linear regression models, ‘the best linear unbiased estimate’ (BLUE) is the most popular estimate. The BLUE has the smallest variance among all competing unbiased estimates that are linear combinations of the observations. However, if the assumption of homoscedasticity is not met, the ordinary least squares estimate remains unbiased but is generally no longer the BLUE, and its usual variance estimate is no longer valid. In some cases, weights, a series of known constants, one for each observation, can be used to address such violations, that is, heteroscedasticity.

Note that regularised estimates, such as the popular ‘least absolute shrinkage and selection operator’ (LASSO), have become increasingly popular in recent years due to the surge of high-dimensional data arising in biomedical and online social media research. Although these estimates are generally biased, the bias is typically small. Moreover, when the number of independent variables exceeds the sample size in regression models, it is no longer possible to obtain unbiased estimates. Thus, the class of BLUE estimates becomes irrelevant in such applications.

Within the context of inference for the population mean, the sample mean is a BLUE estimate, if all observations have the same population mean and variance in addition to being independent, representing the so-called ‘independently and identically distributed’ (i.i.d.) sample. If the observations do not have the same variance, that is, a violation of the homoscedasticity assumption, the sample mean is no longer a BLUE estimate, even though it is still an unbiased estimate of the population mean. As the usual sample variance formula and t-statistic are both based on the i.i.d. assumption, the sample variance no longer describes the variability of the sample mean and t-statistic cannot be used to provide valid inference about the population mean. For example, a 95% CI based on the sample variance and t-statistic no longer covers the population mean 95% of the time. In most cases, causes of such heteroscedasticity are unknown and other statistical methods must be used to provide valid inference about the population mean, such as by using estimating equations (see below for details and examples). In some studies, heteroscedasticity is due to aggregating data, in which case weights can be used to ‘correct’ the type of heteroscedasticity so that the usual sample variance and t-statistic can continue to provide valid inference.

Example 1

Consider taking a random sample of 100 subjects from a population of interest and let $y_i$ denote an outcome of interest from the ith subject. Such a sampling scheme is often called ‘simple random sampling’. Let $\mu = E(y_i)$ denote the population mean and $\sigma^2 = \mathrm{Var}(y_i)$ denote the population variance, where $E(y_i)$ denotes the mathematical expectation of $y_i$ and $\mathrm{Var}(y_i) = E(y_i - \mu)^2$. We are interested in estimating the population mean μ.

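Under simple random sampling, the 100 observations $y_i$ form an i.i.d. sample, from which the sample mean $\bar{y}$, sample variance $s^2$, variance $s_{\bar{y}}^2$ (SE $s_{\bar{y}}$) of the sample mean and t-statistic can be computed:

$$\bar{y} = \frac{1}{100}\sum_{i=1}^{100} y_i, \quad s^2 = \frac{1}{99}\sum_{i=1}^{100}\left(y_i - \bar{y}\right)^2, \quad s_{\bar{y}}^2 = \frac{s^2}{100}, \quad t = \frac{\bar{y} - \mu}{s_{\bar{y}}} \qquad (1)$$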

The above statistics can be used to calculate CIs or test hypotheses of interest. For example, to test if the population mean is 0, we specify the hypothesis as:

$$H_0: \mu = 0 \quad \text{versus} \quad H_a: \mu \neq 0.$$

We can readily use the value of the t-statistic $t$, evaluated at $\mu = 0$, to compute the p-value or construct a 95% CI.

Now suppose that we average the first 20 and the last 10 observations and replace them by their averages:

$$\tilde{y}_1 = \frac{1}{20}\sum_{i=1}^{20} y_i, \qquad \tilde{y}_2 = \frac{1}{10}\sum_{i=91}^{100} y_i,$$

where $\tilde{y}_1$ and $\tilde{y}_2$ denote the averaged first 20 and last 10 observations. We can recalculate all the statistics in Equation (1) using the two averaged outcomes plus the remaining 70 observations:

$$\bar{y}^{*} = \frac{1}{72}\Big(\tilde{y}_1 + \tilde{y}_2 + \sum_{i=21}^{90} y_i\Big), \quad s^{*2} = \frac{1}{71}\Big[(\tilde{y}_1 - \bar{y}^{*})^2 + (\tilde{y}_2 - \bar{y}^{*})^2 + \sum_{i=21}^{90}(y_i - \bar{y}^{*})^2\Big], \quad s_{\bar{y}}^{*2} = \frac{s^{*2}}{72}, \quad t^{*} = \frac{\bar{y}^{*} - \mu}{s_{\bar{y}}^{*}} \qquad (2)$$

Although the sample mean $\bar{y}^{*}$ in Equation (2) is still an unbiased estimate of μ, the other statistics no longer have the same interpretations as their counterparts in Equation (1): $s^{*2}$ is no longer an estimate of $\sigma^2$, $s_{\bar{y}}^{*2}$ (or $s_{\bar{y}}^{*}$) is no longer an estimate of the variability of the sample mean $\bar{y}^{*}$, and $t^{*}$ no longer follows the t-distribution. This is because, although $\tilde{y}_1$ and $\tilde{y}_2$ still have the same population mean, they no longer have the same variance as the remaining 70 observations:

$$\mathrm{Var}(\tilde{y}_1) = \frac{\sigma^2}{20}, \qquad \mathrm{Var}(\tilde{y}_2) = \frac{\sigma^2}{10}, \qquad \mathrm{Var}(y_i) = \sigma^2, \quad 21 \le i \le 90 \qquad (3)$$

Thus, the reduced 72 observations do not meet the homoscedasticity assumption and all the statistics, except for the sample mean $\bar{y}^{*}$, do not have the same meaningful interpretations as their counterparts in Equation (1).

Several methods are available to address heteroscedasticity and its impact on variance estimation and associated p-values. For example, bootstrap and Jackknife resampling methods are commonly implemented in various statistical packages and can be used to provide valid inference in this case. A modern alternative is estimating equations (EE), which do not involve resampling of the observations and provide a more efficient approach. In this example, where the source of heteroscedasticity is known to be caused by averaging some of the observations, weights can be used to ‘correct’ this special type of heteroscedasticity.

As seen in Equation (3), the variance of the first averaged observation (the average of the first 20 observations) differs from that of the other 70 observations by a factor of $1/20$, and the variance of the last averaged observation (the average of the last 10 observations) differs from that of the other 70 observations by a factor of $1/10$. By taking the inverse of these numbers as weights for the two respective averaged observations,

$$w = 20 \ \text{(first averaged observation)}, \qquad w = 10 \ \text{(last averaged observation)}, \qquad w = 1 \ \text{(each of the remaining 70 observations)},$$

and applying such weights to the sample mean and variance in Equation (2), we obtain a weighted sample mean and sample variance, along with the variance (SE) of the weighted sample mean and t-statistic:

Display Formula

By comparing Equations (2) and (4), we see that each averaged observation receives more weight than an original observation, with the weight equal to the number of subjects within the averaged outcome. Also, the mean and variance estimates depend on the sum of the weights, $20 + 10 + 70 = 100$, which is the same as the original sample size. Thus, with the weights, the 72 observations in Equation (4) carry the same ‘weight’ as the original 100 observations. For example, in the sample mean (variance), the average of the first 20 observations is weighted 20 times more than each original observation, allowing it to have the same effect as the first 20 observations on the estimated mean (variance). The weighted t-statistic in Equation (4) follows the same t-distribution for inference about the population mean.
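The role of the weights in Example 1 can be checked numerically. The Python sketch below is illustrative only: the simulated outcomes, the helper name `weighted_mean_se` and the variance divisor (number of observations minus one, which we assume mirrors the default VARDEF=DF convention of PROC MEANS) are assumptions and not part of the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=100)   # hypothetical i.i.d. sample

# Replace the first 20 and last 10 observations by their averages (72 values).
reduced = np.concatenate(([y[:20].mean()], y[20:90], [y[90:].mean()]))
# Weight = number of subjects behind each value: 20, then 70 ones, then 10.
weights = np.concatenate(([20.0], np.ones(70), [10.0]))


def weighted_mean_se(values, w):
    """Weighted mean and its SE (assumed divisor: number of values minus 1)."""
    n = len(values)
    mean = np.sum(w * values) / np.sum(w)
    var = np.sum(w * (values - mean) ** 2) / (n - 1)   # weighted variance
    se = np.sqrt(var / np.sum(w))                      # SE of the weighted mean
    return mean, se


mean_w, se_w = weighted_mean_se(reduced, weights)
mean_full = y.mean()                                   # mean of all 100 values
print(weights.sum(), mean_w, mean_full, se_w)
```

In this sketch the weights sum to 100 and the weighted mean of the 72 values coincides with the mean of the original 100 observations, which is the sense in which the weighted data carry the same ‘weight’ as the full sample.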

In Example 1, the heteroscedasticity has a particular form:

Display Formula

The approach of using weights to correct for heteroscedasticity and construct weighted estimates works not only for inference about the population mean in this example but also for more complex regression models. For example, weighted ordinary least squares (WOLS) uses the same approach to address this type of heteroscedasticity for linear regression. Since the weighted approach here and WOLS in the regression setting yield the same point and variance estimates as the maximum likelihood (ML) method (when the outcome is assumed to follow a normal distribution), we will refer to the weighted approach here as the WOLS/ML method throughout the rest of the discussion.
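The link between the weighted mean and WOLS can be illustrated with an intercept-only regression. The sketch below is a minimal example under our own assumptions: the function name `wols_fit` and the three data points are hypothetical, and the fit is obtained by rescaling each row by the square root of its weight before ordinary least squares.

```python
import numpy as np


def wols_fit(X, y, w):
    """Weighted least squares: minimise sum_i w_i * (y_i - x_i' beta)^2."""
    sw = np.sqrt(w)
    # Rescaling rows by sqrt(w) turns the weighted problem into ordinary LS.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


y = np.array([2.0, 4.0, 9.0])        # hypothetical outcomes
w = np.array([20.0, 1.0, 10.0])      # hypothetical known weights
X = np.ones((3, 1))                  # intercept-only design matrix

beta = wols_fit(X, y, w)
weighted_mean = np.sum(w * y) / np.sum(w)
print(beta[0], weighted_mean)
```

With only an intercept in the model, the WOLS coefficient equals the weighted mean, so the weighted mean can be viewed as the simplest special case of WOLS.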

In SAS, the second procedure under consideration, PROC SURVEYMEANS, addresses conceptually different issues arising from complex survey sampling. For a simple random sample as in Example 1, the usual sample mean, sample variance, variance (SE) of the estimate and t-statistic provide valid inference about the population mean. In this case, PROC MEANS and PROC SURVEYMEANS yield identical point estimates and SEs. In most survey studies, more complex sampling strategies are used to obtain more reliable estimates more efficiently. Stratified random sampling is a popular alternative to simple random sampling when sampling heterogeneous populations. Although still yielding the same point estimate, the two SAS PROCs will generally produce different SEs with this type of sampling. We illustrate this difference using data from the Ice Cream Example in the SAS SURVEYMEANS Procedure document, SAS/STAT V.9.2 [1, 2].

Example 2

In the Ice Cream study, researchers are interested in how much students in a junior high school spend weekly on ice cream. The junior high school has a total of 4000 students distributed in grades 7, 8 and 9 as follows:

$$N_1 = 1824 \ (\text{grade 7}), \quad N_2 = 1025 \ (\text{grade 8}), \quad N_3 = 1151 \ (\text{grade 9}), \quad N = N_1 + N_2 + N_3 = 4000,$$

where the three different grades represent three strata indexed by h, $N_h$ denotes the number of students in the hth stratum and N denotes the population size. To address this question, 40 students are selected from the study population using stratified random sampling; a random sample of 20, 9 and 11 students is taken from the three strata:

$$n_1 = 20, \quad n_2 = 9, \quad n_3 = 11, \quad n = n_1 + n_2 + n_3 = 40.$$

The distribution of the three grades in the sample is

$$\frac{n_1}{n} = 0.5, \quad \frac{n_2}{n} = 0.225, \quad \frac{n_3}{n} = 0.275 \qquad (5)$$

If the 40 students were sampled under simple random sampling, the expected distribution of the grades would be:

$$\frac{N_1}{N} = 0.456, \quad \frac{N_2}{N} \approx 0.256, \quad \frac{N_3}{N} \approx 0.288 \qquad (6)$$

By comparing Equations (5) and (6), we see that grade 7 is over-represented while the other two grades are under-represented in the study sample. Thus, the usual sample mean of the whole sample will be biased towards grade 7. To obtain an unbiased estimate of mean weekly spending on ice cream for this junior high school, we must use sampling weights to down-weight the over-represented grade 7 and up-weight the two under-represented grades.

For a sampling weight to be correct, it must be the inverse of the probability that a subject is selected from the population. For simple random sampling, this probability is approximated by the sampling fraction, $n/N$, which is constant. Similarly, the sampling weight for each randomly sampled subject, $N/n$, is also constant, regardless of grade. Under stratified sampling, the sampling fraction is no longer constant and both this fraction and the sampling weight depend on the stratum:

$$f_h = \frac{n_h}{N_h}, \qquad w_h = \frac{1}{f_h} = \frac{N_h}{n_h}, \qquad h = 1, 2, 3 \qquad (7)$$

where $n_h$ and $N_h$ are the sample size and population size of the hth stratum. The sampling weight $w_h$ in this case counterbalances the effect of oversampling or undersampling of a stratum, allowing for unbiased estimation of the population mean.

Let $y_{hi}$ denote the spending on ice cream by the ith sampled student in stratum h. We can apply the weighted mean in Equation (4) to estimate the mean spending on ice cream by the students in the junior high school, μ (the formula looks slightly different because of the added notation for stratification):

$$\hat{\mu} = \frac{\sum_{h=1}^{3}\sum_{i=1}^{n_h} w_h\, y_{hi}}{\sum_{h=1}^{3}\sum_{i=1}^{n_h} w_h} \qquad (8)$$

However, the variance, or SE, calculated according to Equation (4) no longer estimates the variability of the estimate $\hat{\mu}$ of the mean. Unlike the weights in Example 1, sampling weights in this example are used to address different sampling probabilities between the strata. Even under homoscedasticity, that is, a common variance $\sigma^2$ across all three strata, we still need to use the weighted mean in Equation (8) to estimate the correct population mean.

Note that if a sample of n is taken from a population of size N, the probability for the first subject sampled is $1/N$, the probability for the second sampled is $1/(N-1)$ and so on. So, the sampling probabilities for the sampled subjects are not constant and are given by:

$$\frac{1}{N}, \quad \frac{1}{N-1}, \quad \ldots, \quad \frac{1}{N-n+1}.$$

In most survey studies, n is much smaller than N, so $N-n+1$ is well approximated by N. Thus, for all practical purposes, the sampling probability for a random sample of n subjects is the sampling fraction, $n/N$. Also, unlike the weights applied in Example 1, sampling weights in Example 2 sum to the total population size. However, a sampling weight is a unit-free quantity and can be multiplied by any positive constant without affecting the statistics. For example, dividing the weights in Equation (8) by N=4000 makes them sum to one and gives the estimate:

$$\hat{\mu} = \sum_{h=1}^{3}\sum_{i=1}^{n_h} \frac{w_h}{N}\, y_{hi},$$

which is the same as the one in Equation (8).
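The construction of the stratum-specific sampling weights and their invariance to rescaling can be sketched in Python. This is an illustration under stated assumptions: the stratum sizes 1824, 1025 and 1151 are taken as the grade totals of the Ice Cream Example, the spending values are simulated (hypothetical) rather than the real data, and `weighted_mean` is our own helper name.

```python
import numpy as np

N_h = np.array([1824, 1025, 1151])   # assumed stratum sizes (grades 7, 8, 9)
n_h = np.array([20, 9, 11])          # stratum sample sizes from the text
w_h = N_h / n_h                      # sampling weight = inverse sampling fraction

stratum = np.repeat(np.arange(3), n_h)   # stratum label of each sampled student
w = w_h[stratum]                         # one weight per sampled student

# Hypothetical spending data; NOT the real Ice Cream observations.
rng = np.random.default_rng(1)
y = rng.normal(loc=np.array([5.0, 15.4, 10.1])[stratum], scale=3.0)


def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return np.sum(weights * values) / np.sum(weights)


est_raw = weighted_mean(y, w)                   # weights sum to N = 4000
est_scaled = weighted_mean(y, w / N_h.sum())    # weights rescaled to sum to 1
print(w.sum(), est_raw, est_scaled)
```

The weights of the 40 sampled students sum to the population size of 4000, and dividing every weight by 4000 leaves the weighted mean unchanged, since the common factor cancels between numerator and denominator.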

Shown in table 1 are the weighted means and SEs from the PROC MEANS and PROC SURVEYMEANS for the Ice Cream data. The weighted means are the same, but the SEs are quite different from the two PROCs.

Table 1
Comparison of PROC MEANS and PROC SURVEYMEANS for Ice Cream data in Example 2

The difference in SEs between the two SAS PROCs above results from the different statistical approach used to compute the SE in PROC SURVEYMEANS. Unlike the WOLS/ML variance estimate in PROC MEANS, which is specific to weights selected to address heteroscedasticity, the variance estimate from PROC SURVEYMEANS is derived from estimating equations, or Taylor series expansion, and is valid for any type of weight, including the heteroscedasticity weights used in PROC MEANS, sampling weights for survey studies, non-response weights and combinations of such weights.

For the stratified sampling in Example 2, the estimating equation variance (SE) of the weighted mean is given by:

Display Formula

If the WOLS/ML variance estimate in Equation (4) is applied instead, the variance of the weighted mean is:

Display Formula

The different variance estimates in Equations (9) and (10) can yield quite different SEs as shown by the Ice Cream data in Example 2.
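To make the contrast between the two kinds of variance estimate concrete, the following Python sketch computes both SEs for a weighted mean. It is a sketch under assumptions, not a reproduction of the procedures' exact output: `wols_ml_se` uses a weighted variance divided by the sum of the weights, `ee_se_stratified` uses a Taylor linearisation within strata without a finite population correction, and the small data set is hypothetical.

```python
import numpy as np


def wols_ml_se(y, w):
    """SE of the weighted mean from a weighted variance (assumed divisor n - 1)."""
    n = len(y)
    mean = np.sum(w * y) / np.sum(w)
    s2 = np.sum(w * (y - mean) ** 2) / (n - 1)
    return np.sqrt(s2 / np.sum(w))


def ee_se_stratified(y, w, stratum):
    """Linearisation (estimating equation) SE of the weighted mean by stratum."""
    mean = np.sum(w * y) / np.sum(w)
    e = w * (y - mean) / np.sum(w)            # linearised contributions
    var = 0.0
    for h in np.unique(stratum):
        e_h = e[stratum == h]
        n_h = len(e_h)
        # Between-unit variability of the contributions within stratum h.
        var += n_h / (n_h - 1) * np.sum((e_h - e_h.mean()) ** 2)
    return np.sqrt(var)


# Hypothetical stratified sample with unequal sampling weights.
y = np.array([4.0, 6.0, 5.0, 14.0, 16.0, 9.0, 11.0, 10.0])
stratum = np.array([0, 0, 0, 1, 1, 2, 2, 2])
w = np.array([90.0] * 3 + [110.0] * 2 + [100.0] * 3)

se_wols = wols_ml_se(y, w)
se_ee = ee_se_stratified(y, w, stratum)
print(se_wols, se_ee)
```

As a sanity check on the sketch, with equal weights and a single stratum both functions reduce to the familiar SE of the sample mean, the sample standard deviation divided by the square root of the sample size; they diverge once the weights reflect a stratified design.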

Thus, when analysing survey data involving sampling weights, PROC SURVEYMEANS must be used to provide valid variance (SE) estimates. As noted above, the estimating equation approach also provides valid inference for all other types of weights. The next example shows that this variance estimate also applies when weights are used to address heteroscedasticity.

Implications of using the wrong PROC with a weight versus a sampling weight: a Monte Carlo simulation

When the weights are homoscedasticity weights, the choice between PROC MEANS and PROC SURVEYMEANS, and their associated variance estimates, is straightforward, as either will give valid and consistent results. We must underscore that this is true only when the weight used is in fact a homoscedasticity (heteroscedasticity-correction) weight and not a sampling weight.

Example 3

In this example, we perform Monte Carlo simulations to show that the estimating equation variance estimate is also valid when used to deal with heteroscedasticity in the data.

To simulate a sample with heteroscedasticity, we consider a normal mixture distribution consisting of five subpopulations, all with the same (population) mean μ=1 but different variances:

Display Formula

When sampled from this five-component mixture,  Inline Formula , the variance of the observation varies depending on the subpopulation sampled:

Display Formula

By using the weights,  Inline Formula , we can use the weighted mean, WOLS/ML variance (SE) of the weighted mean and t-statistic in Equation (2) and (4) used in PROC MEANS to make inference about the mean μ. Alternatively, we can also apply the estimating equation (EE) variance estimate in Equation (9) to compute variance (SE) of the weighted mean using PROC SURVEYMEANS. Although the two variance estimates look quite different, they are both consistent estimates of the variance of the weighted mean  Inline Formula .

To demonstrate this using Monte Carlo simulations, we perform the following steps:

(a) Simulate a sample of 25 000 $y_i$’s from Equation (11), with 5000 $y_i$’s from each subpopulation;

(b) Compute the estimate $\hat{\mu}$ and two variance estimates of $\hat{\mu}$ according to Equation (4) and (9):

Display Formula

Display Formula

Display Formula

(c) Repeat Step (a) and (b) M=2000 times;

(d) For each Monte Carlo iteration, let  Inline Formula  denote the estimate and  Inline Formula  and  Inline Formula  denote the two variance estimates;

(e) Compute the Monte Carlo mean  Inline Formula  and empirical variance  Inline Formula  of the estimate  Inline Formula :

Display Formula

(f) Compute the two variance estimates averaged over the 2000 Monte Carlo simulations:

Display Formula


The Monte Carlo mean and variance above provide a benchmark to assess and compare performances of estimates. The Monte Carlo sample variance is also known as the empirical variance of the estimate, since it measures the variability of the estimate and is a consistent estimate of the variance of the asymptotic distribution of the estimate.

If the method for estimating the mean is correct, the Monte Carlo mean  Inline Formula  should be close to the population mean μ=1. Likewise, if a variance estimate is consistent, its Monte Carlo average in Equation (13) will be close to the empirical version  Inline Formula  and vice versa.
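Steps (a) to (f) can be sketched in Python at a reduced scale. The sketch rests on assumptions that should be read as such: the five variances (1 to 5) are hypothetical stand-ins for the values in Equation (11), the sample and replication sizes are smaller than in the paper, the WOLS/ML variance uses divisor n − 1, and the EE variance is the usual sandwich form for independent observations.

```python
import numpy as np


def simulate_example3(n_per_group=500, n_rep=400, seed=0):
    """Monte Carlo comparison of WOLS/ML and EE variance estimates."""
    rng = np.random.default_rng(seed)
    variances = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical values
    sd = np.repeat(np.sqrt(variances), n_per_group)
    w = np.repeat(1.0 / variances, n_per_group)        # inverse-variance weights
    n = w.size
    est = np.empty(n_rep)
    v_wols = np.empty(n_rep)
    v_ee = np.empty(n_rep)
    for m in range(n_rep):
        y = rng.normal(loc=1.0, scale=sd)              # common mean mu = 1
        mu_hat = np.sum(w * y) / np.sum(w)             # weighted mean
        resid = y - mu_hat
        v_wols[m] = np.sum(w * resid ** 2) / (n - 1) / np.sum(w)
        v_ee[m] = np.sum((w * resid) ** 2) / np.sum(w) ** 2
        est[m] = mu_hat
    return {
        "mc_mean": est.mean(),                 # Monte Carlo mean
        "empirical_se": est.std(ddof=1),       # empirical SE of the estimate
        "wols_se": np.sqrt(v_wols.mean()),     # averaged WOLS/ML variance
        "ee_se": np.sqrt(v_ee.mean()),         # averaged EE variance
    }


result = simulate_example3()
print(result)
```

Under these assumptions the two averaged SEs should both track the empirical SE, which is the pattern the benchmark comparison is designed to reveal.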

Shown in table 2 are the weighted mean and SEs from the WOLS/ML, EE and empirical variance estimates. The WOLS/ML and EE SEs are virtually identical (difference is $2.67\times10^{-8}$) and both are extremely close to the empirical SE.

Table 2
Comparison of PROC MEANS and PROC SURVEYMEANS for simulation study

Therefore, for non-sampling weights such as weights selected to address heteroscedasticity, the EE variance estimate still describes the sampling variability of the estimate, as illustrated by the simulation study above. The EE variance estimate is more general, as it also provides valid inference for more complex weights such as those used for sampling and non-response bias, while WOLS/ML based variance formulas cannot be applied to all types of weights.

We would like to point out that the EE can also be used to address heteroscedasticity when a correction weight is not available. In many studies, the cause of heteroscedasticity is unknown and weights cannot be computed. In this case, the WOLS/ML approach no longer applies. But, even without a known heteroscedasticity weight, the EE still provides valid variance estimates. For example, when applied to the simulated data in Example 3 without weights, the estimated population mean is 0.9996, which is quite close to 1. The Monte Carlo average of the EE SE, 0.0002, is also quite close to the empirical SE, 0.000185. Both the EE and empirical SEs in this case are a bit larger than their weighted counterparts, which is consistent with the property that the weighted mean is the BLUE, that is, the estimate with the smallest SE among all unbiased estimates that are linear combinations of the observations.
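The ordering of the unweighted and weighted SEs can be seen from a short derivation. As an assumption for this sketch, let $\sigma_i^2$ denote the variance of the ith of n independent observations with common mean μ, let $\bar{y}$ be the unweighted mean and let $\bar{y}_w$ be the weighted mean with weights $w_i = \sigma_i^{-2}$. Then

$$\mathrm{Var}(\bar{y}) = \frac{1}{n^2}\sum_{i=1}^{n}\sigma_i^2, \qquad \mathrm{Var}(\bar{y}_w) = \frac{\sum_{i=1}^{n} w_i^2\,\sigma_i^2}{\big(\sum_{i=1}^{n} w_i\big)^2} = \frac{1}{\sum_{i=1}^{n}\sigma_i^{-2}},$$

and by the Cauchy–Schwarz inequality,

$$n^2 = \Big(\sum_{i=1}^{n}\sigma_i\cdot\sigma_i^{-1}\Big)^2 \le \Big(\sum_{i=1}^{n}\sigma_i^2\Big)\Big(\sum_{i=1}^{n}\sigma_i^{-2}\Big) \quad\Longrightarrow\quad \mathrm{Var}(\bar{y}_w) \le \mathrm{Var}(\bar{y}),$$

with equality only when all the $\sigma_i^2$ are equal. Both means are unbiased for μ, so the unweighted analysis remains valid but is less efficient than the weighted one.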

In the next example, we use simulated data to show that, when using survey weights, the WOLS/ML variance estimate can have severe bias and the EE variance estimate must be used to provide valid inference about the population mean.

Example 4

We use the Ice Cream Example as the setting to simulate the outcome (spending) for each student. First, we compute sample means and sample variances for each of the three strata (grade). Next, we construct the population distribution as a three-strata normal mixture using these sample means as the population means for the three strata and an averaged sample variance as the common population variance for all the strata. We then estimate the population mean using a weighted mean and compare the two variance estimates.

Specifically, let h index the strata and $y_{hi}$ denote the spending of the ith student sampled from the hth stratum. Let $\mu_h$ denote the population mean of the hth stratum:

$$\mu_1 = 5, \quad \mu_2 = 15.4, \quad \mu_3 = 10.1,$$

where 5, 15.4 and 10.1 are the sample means of the corresponding strata in the Ice Cream Example. Let $\sigma^2$ denote the common variance of $y_{hi}$ across all strata (average of three strata variances). Let $N_h$ denote the population size and $n_h$ denote the sample size of the hth stratum. We assume that $y_{hi}$ follows a three-component mixture with the mean $\mu_h$ and variance $\sigma^2$:

$$y_{hi} \sim N(\mu_h, \sigma^2), \quad 1 \le i \le n_h, \quad 1 \le h \le 3.$$

As in the Ice Cream Example, a sample of n=40 is taken from the population with the following strata sample sizes:

$$n_1 = 20, \quad n_2 = 9, \quad n_3 = 11.$$

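The total population size N and overall population mean μ are given by:

$$N = \sum_{h=1}^{3} N_h = 4000, \qquad \mu = \sum_{h=1}^{3} p_h\,\mu_h = 9.1325,$$

where $p_h = N_h/N$ is the proportion of the hth stratum size to the total population size. Under stratified random sampling, the sampling weights are used to estimate the overall population mean μ:

$$\hat{\mu} = \frac{\sum_{h=1}^{3}\sum_{i=1}^{n_h} w_h\, y_{hi}}{\sum_{h=1}^{3}\sum_{i=1}^{n_h} w_h}, \qquad w_h = \frac{N_h}{n_h}.$$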

To reduce sampling variability in Monte Carlo estimates, we set Monte Carlo replication size to M=10 000. For each Monte Carlo iteration, let  Inline Formula  denote the estimate and  Inline Formula  and  Inline Formula  denote the WOLS/ML and EE variance estimates from the mth simulated data. We compute the Monte Carlo sample mean  Inline Formula , empirical variance  Inline Formula  and averaged variance estimates  Inline Formula  and  Inline Formula  from the two methods the same way as in (13) and (14).

Shown in table 3 are the Monte Carlo mean, the empirical SE and the averaged SEs from the two variance estimates. As expected, the Monte Carlo mean is nearly identical to the population mean μ=9.1325 and the averaged EE SE is identical to the empirical version.

Table 3
Monte Carlo mean, empirical, WOLS/ML and EE SE
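The simulation in Example 4 can be sketched in Python as follows. Several ingredients are assumptions made for illustration: the stratum sizes 1824, 1025 and 1151 are taken as the grade totals of the Ice Cream Example, the common standard deviation of 3 is hypothetical (the value used in the paper is not recoverable here), the replication size is reduced to 2000, and the EE variance is a within-stratum linearisation without a finite population correction.

```python
import numpy as np


def simulate_example4(n_rep=2000, sigma=3.0, seed=0):
    """Monte Carlo check of WOLS/ML and EE SEs under sampling weights."""
    rng = np.random.default_rng(seed)
    N_h = np.array([1824, 1025, 1151])     # assumed stratum sizes
    n_h = np.array([20, 9, 11])            # stratum sample sizes
    mu_h = np.array([5.0, 15.4, 10.1])     # stratum means from the text
    stratum = np.repeat(np.arange(3), n_h)
    w = (N_h / n_h)[stratum]               # sampling weights
    n = w.size
    est = np.empty(n_rep)
    v_wols = np.empty(n_rep)
    v_ee = np.empty(n_rep)
    for m in range(n_rep):
        y = rng.normal(loc=mu_h[stratum], scale=sigma)
        mu_hat = np.sum(w * y) / np.sum(w)
        resid = y - mu_hat
        # WOLS/ML-type variance: weighted variance over the sum of weights.
        v_wols[m] = np.sum(w * resid ** 2) / (n - 1) / np.sum(w)
        # EE-type variance: linearised contributions, centred within strata.
        e = w * resid / np.sum(w)
        v = 0.0
        for h in range(3):
            e_h = e[stratum == h]
            v += len(e_h) / (len(e_h) - 1) * np.sum((e_h - e_h.mean()) ** 2)
        v_ee[m] = v
        est[m] = mu_hat
    return {
        "mc_mean": est.mean(),
        "empirical_se": est.std(ddof=1),
        "wols_se": np.sqrt(v_wols.mean()),
        "ee_se": np.sqrt(v_ee.mean()),
    }


result4 = simulate_example4()
print(result4)
```

In this hypothetical setting the averaged EE SE tracks the empirical SE, whereas the WOLS/ML SE is noticeably larger because it absorbs the differences between the stratum means; the size and direction of the discrepancy depend on the assumed variance and design.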

Discussion

In this paper, we focused on sampling and homoscedasticity weights, discussed the conceptual difference between the two and illustrated the implications of this difference for the SEs of estimated population means through analytic expressions and Monte Carlo simulations. We have demonstrated that homoscedasticity weights have very specific applications. Our experiences with SAS and other popular packages indicate that if weights are available as an option in a procedure such as SAS PROC MEANS, they are typically of the homoscedasticity type. Such procedures should not be used for any other types of weights such as sampling weights. Sampling weights must only be used in survey-specific procedures such as SAS PROC SURVEYMEANS, as PROC MEANS will not compute the correct SE and will show substantial bias even in large samples. In contrast, procedures such as SAS PROC SURVEYMEANS are more general and will compute the correct SE when using either sampling or homoscedasticity weights. In the case of heteroscedasticity, estimating equation methods can even provide a correct variance estimate when a researcher does not have access to a known homoscedasticity weight, correcting for potential distortions in the SE that can result from this violation. We recommend that researchers identify the type of weight they are using and understand the implications of using the weight within common analytic programs such as SAS, as incorrect application of weights can have important consequences for research analyses.

Sabrina Richardson received her PhD in developmental psychology from the University of California, Riverside, and is currently a Leidos research psychologist at the Naval Health Research Center working on the Millennium Cohort Family Study, a longitudinal study investigating the well-being of service members and their families. Her research broadly seeks to understand the multifaceted processes of adaptation supporting better than expected outcomes when encountering risk (i.e., resilience), particularly focused on child and family development. She has authored works on children’s resilience, sibling relationships, child maltreatment, transition aged foster youth, and military family readiness. She is also interested in quantitative methods such as structural equation modeling, moderation/mediation analyses, and survey methodology.
