SAS and other popular statistical packages provide support for survey data with sampling weights. For example, PROC MEANS and PROC LOGISTIC in SAS have their counterparts PROC SURVEYMEANS and PROC SURVEYLOGISTIC to facilitate analysis of data from complex survey studies. On the other hand, PROC MEANS and many other classic SAS procedures also provide an option for including weights; they yield the same point estimates as their corresponding survey procedures, but different standard errors (SEs). This paper takes an in-depth look at different types of weights and provides guidance on the use of different SAS procedures.
- linear regression
- statistical method
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0
All popular SAS procedures support use of weights, such as the classic PROC MEANS, PROC GLM and PROC LOGISTIC. To facilitate analysis of survey study data, SAS also provides an array of procedures with a ‘SURVEY’ prefix, such as PROC SURVEYMEANS, PROC SURVEYREG and PROC SURVEYLOGISTIC. As the latter are designed and developed specifically for survey study data, weights reflecting sampling and non-response are an integral part of these procedures. A common question is which SAS PROC to use when analysing survey study data involving sampling weights. Although they produce identical point estimates, the classic SAS procedures and their SURVEY counterparts generally yield different SEs, which in some cases lead to quite different p-values and conclusions. In this document, we discuss conceptual differences underlying the different types of weights and their implications for the statistical methods implemented in the different SAS procedures. Although we focus on SAS in this paper, the discussions also apply to other statistical packages such as R and SPSS. For ease of exposition, we focus on PROC MEANS and PROC SURVEYMEANS and illustrate our considerations with real and Monte Carlo simulated data. The same considerations also apply to other SAS procedures for multiple variable analyses such as PROC LOGISTIC versus PROC SURVEYLOGISTIC; however, here we focus on univariate analyses.
Types of Weights and Methods Underlying SAS PROC MEANS and PROC SURVEYMEANS
Within these two procedures, the weights have different uses and meanings. Weights used in PROC MEANS are designed to address violations of ‘homoscedasticity’, a key assumption underlying many statistical methods such as inference for population means within the current context and, more generally, regression analysis. In linear regression models, ‘the best linear unbiased estimate’ (BLUE) is the most popular estimate. The BLUE has the smallest variance among all unbiased estimates that are a linear combination of the observations. However, if the assumption of homoscedasticity is not met, the usual least squares estimate remains unbiased but is generally no longer the BLUE, and its usual variance estimate is no longer valid. In some cases, weights, a series of known constants, one for each observation, can be used to address such violations, or heteroscedasticity.
Note that regularised estimates, such as the popular ‘least absolute shrinkage and selection operator’ (LASSO), have become increasingly popular in recent years due to the surge of high-dimensional data arising in biomedical and online social media research. Although these estimates are generally biased, the bias is typically small. Moreover, when the number of independent variables exceeds the sample size in regression models, it is no longer possible to obtain unbiased estimates. Thus, the class of BLUE estimates becomes irrelevant in such applications.
Within the context of inference for the population mean, the sample mean is a BLUE estimate, if all observations have the same population mean and variance in addition to being independent, representing the so-called ‘independently and identically distributed’ (i.i.d.) sample. If the observations do not have the same variance, that is, a violation of the homoscedasticity assumption, the sample mean is no longer a BLUE estimate, even though it is still an unbiased estimate of the population mean. As the usual sample variance formula and t-statistic are both based on the i.i.d. assumption, the sample variance no longer describes the variability of the sample mean and t-statistic cannot be used to provide valid inference about the population mean. For example, a 95% CI based on the sample variance and t-statistic no longer covers the population mean 95% of the time. In most cases, causes of such heteroscedasticity are unknown and other statistical methods must be used to provide valid inference about the population mean, such as by using estimating equations (see below for details and examples). In some studies, heteroscedasticity is due to aggregating data, in which case weights can be used to ‘correct’ the type of heteroscedasticity so that the usual sample variance and t-statistic can continue to provide valid inference.
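Consider taking a random sample of 100 subjects from a population of interest and let $y_i$ denote an outcome of interest from the $i$th subject ($i = 1, \ldots, 100$). Such a sampling scheme is often called ‘simple random sampling’. Let $\mu = E(y_i)$ denote the population mean and $\sigma^2 = Var(y_i)$ denote the population variance, where $E(y_i)$ denotes the mathematical expectation of $y_i$ and $Var(y_i) = E(y_i - \mu)^2$. We are interested in estimating the population mean $\mu$.
Under simple random sampling, the 100 observations form an i.i.d. sample, from which the sample mean $\bar{y}$, sample variance $s^2$, variance $\hat{\sigma}^2_{\bar{y}}$ (SE $\hat{\sigma}_{\bar{y}}$) of the sample mean and t-statistic $t$ can be computed:
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2, \quad \hat{\sigma}^2_{\bar{y}} = \frac{s^2}{n}, \quad t = \frac{\bar{y}}{\hat{\sigma}_{\bar{y}}}, \quad n = 100. \tag{1}$$
The above statistics can be used to calculate CIs or test hypotheses of interest. For example, to test if the population mean is 0, we specify the hypothesis as:
$$H_0: \mu = 0 \quad \text{vs} \quad H_a: \mu \neq 0.$$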
We can readily use the value of the t-statistic to compute the p-value or construct a 95% CI.
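As a minimal sketch of the statistics described above, the following Python snippet computes the sample mean, sample variance, SE of the mean, t-statistic for testing whether the population mean is 0, and a 95% CI. The data are simulated with a hypothetical mean and SD chosen only for illustration, and the critical value 1.984 is the approximate 97.5th percentile of the t-distribution with 99 degrees of freedom.

```python
import math
import random
import statistics

random.seed(2019)                       # fixed seed so the sketch is reproducible
n = 100
# Hypothetical i.i.d. outcomes (mean 1, SD 2 are illustrative assumptions)
y = [random.gauss(1.0, 2.0) for _ in range(n)]

y_bar = statistics.fmean(y)             # sample mean
s2 = statistics.variance(y)             # sample variance, divisor n - 1
se = math.sqrt(s2 / n)                  # SE of the sample mean
t_stat = y_bar / se                     # t-statistic for H0: mu = 0

t_crit = 1.984                          # approx. 97.5th percentile of t with 99 df
ci = (y_bar - t_crit * se, y_bar + t_crit * se)
print(f"mean={y_bar:.3f}  SE={se:.3f}  t={t_stat:.2f}  "
      f"95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```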
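Now suppose that we average the first 20 and last 10 observations and replace the first 20 and last 10 observations by their averaged counterparts:
$$\bar{y}_{(1)} = \frac{1}{20}\sum_{i=1}^{20} y_i, \quad \bar{y}_{(2)} = \frac{1}{10}\sum_{i=91}^{100} y_i,$$
where $\bar{y}_{(1)}$ and $\bar{y}_{(2)}$ denote the averaged first 20 and last 10 observations. We can recalculate all the statistics in Equation (1) using the two averaged outcomes plus the remaining 70 observations:
$$\tilde{y} = \frac{1}{72}\Big(\bar{y}_{(1)} + \sum_{i=21}^{90} y_i + \bar{y}_{(2)}\Big), \quad \tilde{s}^2 = \frac{1}{71}\Big[(\bar{y}_{(1)} - \tilde{y})^2 + \sum_{i=21}^{90} (y_i - \tilde{y})^2 + (\bar{y}_{(2)} - \tilde{y})^2\Big], \quad \tilde{\sigma}^2_{\tilde{y}} = \frac{\tilde{s}^2}{72}, \quad \tilde{t} = \frac{\tilde{y}}{\tilde{\sigma}_{\tilde{y}}}. \tag{2}$$
Although the sample mean $\tilde{y}$ in Equation (2) is still an unbiased estimate of $\mu$, the other statistics no longer have the same interpretations as their counterparts in Equation (1); $\tilde{s}^2$ is no longer an estimate of $\sigma^2$, $\tilde{\sigma}^2_{\tilde{y}}$ (or $\tilde{\sigma}_{\tilde{y}}$) is no longer an estimate of the variability of the sample mean $\tilde{y}$, and $\tilde{t}$ no longer follows the t-distribution. This is because although $\bar{y}_{(1)}$ and $\bar{y}_{(2)}$ still have the same population mean, they no longer have the same variance as the remaining 70 observations:
$$Var(\bar{y}_{(1)}) = \frac{\sigma^2}{20}, \quad Var(\bar{y}_{(2)}) = \frac{\sigma^2}{10}, \quad Var(y_i) = \sigma^2 \ (21 \le i \le 90). \tag{3}$$
Thus, the reduced 72 observations do not meet the homoscedasticity assumption and all the statistics, except for the sample mean $\tilde{y}$, do not have the same meaningful interpretations as their counterparts in Equation (1).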
Several methods are available to address heteroscedasticity and its impact on variance estimation and associated p-values. For example, bootstrap and Jackknife resampling methods are commonly implemented in various statistical packages and can be used to provide valid inference in this case. A modern alternative is estimating equations (EE), which do not involve resampling of the observations and provide a more efficient approach. In this example, where the source of heteroscedasticity is known to be caused by averaging some of the observations, weights can be used to ‘correct’ this special type of heteroscedasticity.
As seen in Equation (3), the variance of the averaged first 20 observations differs from that of the other 70 observations by a factor of $1/20$ and the variance of the averaged last 10 observations differs from that of the other 70 observations by a factor of $1/10$. By taking the inverse of these numbers as weights for the two respective averaged observations,
$$w_{(1)} = 20, \quad w_{(2)} = 10, \quad w_i = 1 \ (21 \le i \le 90),$$
and applying such weights to the sample mean and variance in Equation (2), we obtain a weighted sample mean and sample variance, along with the variance (SE) of the weighted sample mean and t-statistic. Writing $z_j$ ($j = 1, \ldots, 72$) for the 72 reduced observations and $w_j$ for their weights:
$$\bar{y}_w = \frac{\sum_{j=1}^{72} w_j z_j}{\sum_{j=1}^{72} w_j}, \quad s_w^2 = \frac{1}{72 - 1}\sum_{j=1}^{72} w_j (z_j - \bar{y}_w)^2, \quad \hat{\sigma}^2_{\bar{y}_w} = \frac{s_w^2}{\sum_{j=1}^{72} w_j}, \quad t_w = \frac{\bar{y}_w}{\hat{\sigma}_{\bar{y}_w}}. \tag{4}$$
By comparing Equation (2) and (4), we see that each averaged observation receives more weight than an original observation and that its weight is equal to the number of subjects within the averaged outcome. Also, the variance of the weighted mean is scaled by the sum of the weights, $\sum_j w_j = 100$, which is the same as the original sample size. Thus, with the weights, the 72 observations in Equation (4) carry the same ‘weight’ as the original 100 observations. For example, in the sample mean (variance), $\bar{y}_{(1)}$ is weighted 20 times more than each original observation, allowing it to have the same effect as the first 20 observations on the estimated mean (variance). The t-statistic in Equation (4) follows the same t-distribution for inference about the population mean.
In Example 1, the heteroscedasticity has a particular form:
$$Var(z_j) = \frac{\sigma^2}{w_j}, \quad \text{with known constants } w_j.$$
The approach to use weights to correct for heteroscedasticity and construct weighted estimates not only works for inference about the population mean in this example but also for more complex regression models. For example, weighted ordinary least squares (WOLS) uses the same approach to address this type of heteroscedasticity for linear regression. Since the weighted approach here and WOLS in regression setting yield the same point and variance estimate as the maximum likelihood (ML) method (when the outcome is assumed to follow a normal distribution), we will refer to the weighted approach here as the WOLS/ML method throughout the rest of the discussion.
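The weighting idea in Example 1 can be sketched in Python as follows. The data are simulated and hypothetical. The weighted variance uses a divisor of n − 1 (with n the number of reduced observations) and the SE divides the weighted variance by the sum of the weights; we take this to correspond to the default weighted formulas in PROC MEANS, which should be treated as an assumption of this sketch rather than a statement about the SAS implementation.

```python
import math
import random
import statistics

random.seed(1)
# Hypothetical original sample of 100 i.i.d. observations
y = [random.gauss(1.0, 2.0) for _ in range(100)]

# Reduced data: mean of the first 20, the 70 untouched values, mean of the last 10
z = [statistics.fmean(y[:20])] + y[20:90] + [statistics.fmean(y[90:])]
w = [20.0] + [1.0] * 70 + [10.0]        # weight = number of subjects averaged

def weighted_stats(values, weights):
    """Weighted mean, weighted variance (divisor n - 1) and SE of the weighted mean."""
    n, sum_w = len(values), sum(weights)
    mean_w = sum(wi * vi for wi, vi in zip(weights, values)) / sum_w
    var_w = sum(wi * (vi - mean_w) ** 2
                for wi, vi in zip(weights, values)) / (n - 1)
    return mean_w, var_w, math.sqrt(var_w / sum_w)

mean_w, var_w, se_w = weighted_stats(z, w)
t_w = mean_w / se_w                     # t-statistic for H0: mu = 0

print(f"original mean of 100 values   = {statistics.fmean(y):.4f}")
print(f"unweighted mean of 72 values  = {statistics.fmean(z):.4f}")
print(f"weighted mean of 72 values    = {mean_w:.4f}")
print(f"weighted variance={var_w:.3f}  SE={se_w:.4f}  t={t_w:.2f}")
```

The weighted mean of the 72 reduced observations reproduces the mean of the original 100 observations exactly, whereas the unweighted mean of the 72 values generally does not.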
In SAS, the second procedure under consideration, PROC SURVEYMEANS, addresses conceptually different issues arising from complex survey sampling. For a simple random sample as in Example 1, the usual sample mean, sample variance, variance (SE) of the estimate and t-statistic provide valid inference about the population mean. In this case, PROC MEANS and PROC SURVEYMEANS yield identical point estimates and SEs. In most survey studies, more complex sampling strategies are used to obtain more reliable estimates more efficiently. Stratified random sampling is a popular alternative to simple random sampling when sampling heterogeneous populations. Although still yielding the same point estimate, the two SAS PROCs will generally produce different SEs with this type of sampling. We illustrate this difference using data from the Ice Cream Example in the SAS SURVEYMEANS Procedure document, SAS/STAT V.9.2 [1, 2].
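In the Ice Cream study, researchers are interested in how much students in a junior high school spend weekly on ice cream. The junior high school has a total of 4000 students distributed in grades 7, 8, and 9 as follows:
$$N_1 = 1824 \ (\text{grade 7}), \quad N_2 = 1025 \ (\text{grade 8}), \quad N_3 = 1151 \ (\text{grade 9}), \quad N = \sum_{h=1}^{3} N_h = 4000,$$
where the three different grades represent three strata indexed by $h$, $N_h$ denotes the number of students in the $h$th stratum and $N$ denotes the population size. To address this question, 40 students are selected from the study population using stratified random sampling; a random sample of 20, 9 and 11 students is taken from the three strata:
$$n_1 = 20, \quad n_2 = 9, \quad n_3 = 11, \quad n = \sum_{h=1}^{3} n_h = 40.$$
The distribution of the three grades in the sample is
$$\frac{n_1}{n} = 50\%, \quad \frac{n_2}{n} = 22.5\%, \quad \frac{n_3}{n} = 27.5\%. \tag{5}$$
If the 40 students were sampled under simple random sampling, the expected distribution of the grades would be:
$$\frac{N_1}{N} = 45.6\%, \quad \frac{N_2}{N} \approx 25.6\%, \quad \frac{N_3}{N} \approx 28.8\%. \tag{6}$$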
By comparing Equation (5) and (6), we see that grade 7 is over-represented while the other two grades are under-represented in the study sample. Thus, the usual sample mean of the whole sample will be biased towards grade 7. To obtain an estimate of mean weekly spending on ice cream for this junior high school, we must use sampling weights to offset the over-representation of grade 7 and the under-representation of the other two grades.
A sampling weight must be the inverse of the probability that a subject is selected from the population. For simple random sampling, this probability is approximated by the sampling fraction, $n/N = 40/4000 = 0.01$, which is constant. Similarly, the sampling weight for each randomly sampled subject, $N/n = 100$, is also constant, regardless of grade. Under stratified sampling, the sampling fraction is no longer constant and both this fraction and the sampling weight depend on the strata:
$$f_h = \frac{n_h}{N_h}, \quad w_h = \frac{1}{f_h} = \frac{N_h}{n_h}, \quad h = 1, 2, 3. \tag{7}$$
The sampling weight in this case counterbalances the effect of oversampling or undersampling of a stratum, allowing for unbiased estimation of the population mean.
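As a small worked illustration of the stratum-specific sampling fractions and weights, the sketch below uses stratum sizes of 1824, 1025 and 1151 students for grades 7, 8 and 9 (taken here as the grade totals of the Ice Cream Example; they add up to the stated 4000) together with the stratum sample sizes 20, 9 and 11.

```python
# Stratum population sizes for grades 7, 8, 9 (assumed values; they total 4000)
N_h = {7: 1824, 8: 1025, 9: 1151}
# Stratum sample sizes under the stratified design
n_h = {7: 20, 8: 9, 9: 11}

N = sum(N_h.values())                   # population size
n = sum(n_h.values())                   # total sample size

weights = {}
for grade in N_h:
    f_h = n_h[grade] / N_h[grade]       # stratum sampling fraction
    weights[grade] = 1.0 / f_h          # sampling weight = N_h / n_h
    print(f"grade {grade}: sample share={n_h[grade] / n:.3f}  "
          f"population share={N_h[grade] / N:.3f}  "
          f"f_h={f_h:.5f}  w_h={weights[grade]:.2f}")

# Each sampled student 'represents' w_h students, so the weights add up to N
total_weight = sum(n_h[g] * weights[g] for g in N_h)
print(f"sum of weights = {total_weight:.0f} (population size N = {N})")
```

Grade 7, which is over-represented in the sample, receives the smallest weight, and the weights summed over the 40 sampled students recover the population size.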
Let $y_{hi}$ denote the spending on ice cream by the $i$th sampled student in stratum $h$. We can apply the weighted mean in Equation (4) to estimate the mean spending on ice cream by the students in the junior high school $\mu$ (the formula looks slightly different because of the added notation for stratification):
$$\bar{y}_w = \frac{\sum_{h=1}^{3}\sum_{i=1}^{n_h} w_h y_{hi}}{\sum_{h=1}^{3}\sum_{i=1}^{n_h} w_h}, \quad w_h = \frac{N_h}{n_h}. \tag{8}$$
However, the variance $\hat{\sigma}^2_{\bar{y}_w}$, or SE $\hat{\sigma}_{\bar{y}_w}$, calculated according to Equation (4), no longer estimates the variability of the estimate of the mean. Unlike the weights in Example 1, sampling weights in this example are used to address different sampling probabilities between the strata. Even under homoscedasticity, that is, a common variance of $\sigma^2$ across all three strata, we still need to use the weighted mean in Equation (8) to estimate the correct population mean.
Note that if a sample of $n$ is taken from a population of size $N$, the probability for the first subject sampled is $1/N$, the probability for the second sampled is $1/(N-1)$ and so on. So, the sampling probabilities for the sampled subjects are not constant and given by:
$$\frac{1}{N}, \ \frac{1}{N-1}, \ \ldots, \ \frac{1}{N-n+1}.$$
In most survey studies, $n$ is much smaller than $N$, so $N-n+1$ is well approximated by $N$. Thus, for all practical purposes, the sampling probability for a random sample of $n$ subjects is the sampling fraction, $n/N$. Also, unlike the weights applied in Example 1, sampling weights in Example 2 sum to the total population size. However, a sampling weight is a unit-free quantity and can be multiplied by any positive constant without affecting the statistics. For example, by dividing the weights in Equation (8) by N=4000, they sum to one and the estimate becomes:
$$\bar{y}_w = \sum_{h=1}^{3}\sum_{i=1}^{n_h} \frac{w_h}{N}\, y_{hi},$$
which is the same as the one in Equation (8).
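The invariance of the weighted mean to rescaling of the weights can be checked numerically. The sketch below uses a handful of made-up spending values per grade (hypothetical, not the SAS data) and assumed stratum sizes of 1824, 1025 and 1151; it also shows that the weighted mean equals the population-share average of the stratum sample means.

```python
import statistics

N_h = {7: 1824, 8: 1025, 9: 1151}       # assumed stratum population sizes
# Hypothetical weekly spending for a few sampled students per grade
sample = {7: [4.0, 6.5, 5.0, 3.5],
          8: [14.0, 17.5, 15.0],
          9: [9.0, 11.5, 10.0]}

def weighted_mean(sample, weights):
    """Weighted mean with one weight per stratum applied to each of its students."""
    num = sum(weights[h] * y for h, ys in sample.items() for y in ys)
    den = sum(weights[h] * len(ys) for h, ys in sample.items())
    return num / den

N = sum(N_h.values())
w = {h: N_h[h] / len(sample[h]) for h in sample}    # weights sum to N
w_scaled = {h: w[h] / N for h in w}                 # weights sum to 1

mean_w = weighted_mean(sample, w)
mean_scaled = weighted_mean(sample, w_scaled)
# Same quantity written as a population-share average of the stratum means
mean_strata = sum(N_h[h] / N * statistics.fmean(sample[h]) for h in sample)

print(f"weights summing to N: {mean_w:.4f}")
print(f"weights summing to 1: {mean_scaled:.4f}")
print(f"sum of p_h * stratum mean: {mean_strata:.4f}")
```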
Shown in table 1 are the weighted means and SEs from the PROC MEANS and PROC SURVEYMEANS for the Ice Cream data. The weighted means are the same, but the SEs are quite different from the two PROCs.
The difference in SEs between the two SAS PROCs above results from the different statistical approach used to compute the SE in PROC SURVEYMEANS. Unlike the WOLS/ML variance estimate in PROC MEANS, which is specific to weights selected to address heteroscedasticity, the variance estimate from PROC SURVEYMEANS is derived by estimating equations, or Taylor series expansion, and is valid for any type of weight, including heteroscedasticity weights of the kind used in PROC MEANS, sampling weights for survey studies, non-response weights and combinations of such weights.
For the stratified sampling in Example 2, the estimating equation variance (SE) of the weighted mean is given by:
If applying the WOLS (ML) variance estimate in Equation (4), the variance of the weighted mean is:
The different variance estimates in Equations (9) and (10) can yield quite different SEs as shown by the Ice Cream data in Example 2.
Thus, when analysing survey data involving sampling weights, PROC SURVEYMEANS must be used to provide valid variance (SE) estimates. As noted above, the estimating equation approach also provides valid inference for all other types of weights. The next example shows that this variance estimate also applies when weights are used to address heteroscedasticity.
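To make the contrast between the two SEs concrete, here is a minimal Python sketch under stated assumptions. The WOLS/ML-type SE uses a weighted variance with divisor n − 1 divided by the sum of the weights. The EE SE uses a Taylor-series (linearisation) form with strata and without a finite population correction. Both are our reading of the general approach, not a reproduction of the exact formulas implemented in SAS. The spending values are hypothetical and the stratum sizes 1824, 1025 and 1151 are assumed.

```python
import math
import statistics

def weighted_mean(y, w):
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def se_wols(y, w):
    """WOLS/ML-type SE: weighted variance (divisor n - 1) over the sum of weights."""
    m = weighted_mean(y, w)
    s2_w = sum(wi * (yi - m) ** 2 for wi, yi in zip(w, y)) / (len(y) - 1)
    return math.sqrt(s2_w / sum(w))

def se_taylor(y, w, strata):
    """Taylor-series (EE) SE with strata, no finite population correction."""
    m, sum_w = weighted_mean(y, w), sum(w)
    e = [wi * (yi - m) / sum_w for wi, yi in zip(w, y)]   # linearised values
    var = 0.0
    for h in set(strata):
        e_h = [ei for ei, si in zip(e, strata) if si == h]
        n_h, e_bar = len(e_h), statistics.fmean(e_h)
        var += n_h / (n_h - 1) * sum((ei - e_bar) ** 2 for ei in e_h)
    return math.sqrt(var)

# Hypothetical stratified sample (not the SAS data)
N_h = {7: 1824, 8: 1025, 9: 1151}                         # assumed stratum sizes
data = {7: [4.0, 6.5, 5.0, 3.5, 6.0],
        8: [14.0, 17.5, 15.0, 16.0],
        9: [9.0, 11.5, 10.0, 8.5]}
y = [v for h in data for v in data[h]]
strata = [h for h in data for _ in data[h]]
w = [N_h[h] / len(data[h]) for h in data for _ in data[h]]

est = weighted_mean(y, w)
se_a = se_wols(y, w)
se_b = se_taylor(y, w, strata)
print(f"weighted mean={est:.3f}  WOLS/ML SE={se_a:.3f}  EE (Taylor) SE={se_b:.3f}")
```

With a constant weight within each stratum, this Taylor-series SE reduces to the familiar stratified form, the square root of the sum over strata of (N_h/N)² s_h²/n_h, whereas the WOLS/ML-type SE also absorbs the between-strata variation and is therefore larger in this example.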
Implications of using the wrong PROC with a weight versus a sampling weight: a Monte Carlo simulation
In the case of homoscedasticity weights, the choice between PROC MEANS and PROC SURVEYMEANS and their associated variance estimates is straightforward, as either will give valid and consistent results. We must underscore that this is true only when the weight used is in fact a homoscedasticity (heteroscedasticity-correction) weight and not a sampling weight.
In this example, we perform Monte Carlo simulations to show that the estimating equation variance estimate is also valid when used to deal with heteroscedasticity in the data.
To simulate a sample with heteroscedasticity, we consider a normal mixture distribution consisting of five subpopulations, all with the same (population) mean μ=1 but different variances:
When an observation $y_i$ is sampled from this five-component mixture, the variance of the observation varies depending on the subpopulation sampled:
By using the weights, $w_i$, defined as the inverse of the variance of the subpopulation from which $y_i$ is sampled, we can use the weighted mean, WOLS/ML variance (SE) of the weighted mean and t-statistic in Equation (2) and (4) used in PROC MEANS to make inference about the mean μ. Alternatively, we can also apply the estimating equation (EE) variance estimate in Equation (9) to compute the variance (SE) of the weighted mean using PROC SURVEYMEANS. Although the two variance estimates look quite different, they are both consistent estimates of the variance of the weighted mean $\bar{y}_w$.
To demonstrate this using Monte Carlo simulations, we perform the following steps:
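(a) Simulate a sample of 25 000 $y_i$’s from Equation (11), with 5000 $y_i$’s from each subpopulation;
(b) Compute the estimate $\bar{y}_w$ and two variance estimates of $\bar{y}_w$ according to Equation (4) and (9);
(c) Repeat Step (a) and (b) M=2000 times;
(d) For the $m$th Monte Carlo iteration, let $\bar{y}_w^{(m)}$ denote the estimate and $\hat{\sigma}^{2(m)}_{WOLS}$ and $\hat{\sigma}^{2(m)}_{EE}$ denote the two variance estimates;
(e) Compute the Monte Carlo mean and empirical variance of the estimate $\bar{y}_w$:
$$\bar{y}_{MC} = \frac{1}{M}\sum_{m=1}^{M} \bar{y}_w^{(m)}, \quad \hat{\sigma}^2_{emp} = \frac{1}{M-1}\sum_{m=1}^{M} \big(\bar{y}_w^{(m)} - \bar{y}_{MC}\big)^2; \tag{12}$$
(f) Compute the two variance estimates averaged over the 2000 Monte Carlo simulations:
$$\bar{\sigma}^2_{WOLS} = \frac{1}{M}\sum_{m=1}^{M} \hat{\sigma}^{2(m)}_{WOLS}, \quad \bar{\sigma}^2_{EE} = \frac{1}{M}\sum_{m=1}^{M} \hat{\sigma}^{2(m)}_{EE}. \tag{13}$$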
The Monte Carlo mean and variance above provide a benchmark to assess and compare performances of estimates. The Monte Carlo sample variance is also known as the empirical variance of the estimate, since it measures the variability of the estimate and is a consistent estimate of the variance of the asymptotic distribution of the estimate.
If the method for estimating the mean is correct, the Monte Carlo mean should be close to the population mean μ=1. Likewise, if a variance estimate is consistent, its Monte Carlo average in Equation (13) will be close to the empirical version and vice versa.
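The Monte Carlo procedure in steps (a) to (f) can be sketched in Python as below. This is a scaled-down illustration, not the simulation reported here: the five subpopulation variances, the 100 observations per subpopulation and the 500 replications are all hypothetical choices made to keep the run short. Weights are taken as the inverse of the subpopulation variance, the WOLS/ML variance uses the weighted variance with divisor n − 1 over the sum of the weights, and the EE variance uses a single-stratum Taylor-series form; these formulas are assumptions of the sketch.

```python
import math
import random
import statistics

random.seed(11)
MU = 1.0
VARIANCES = [0.5, 1.0, 2.0, 4.0, 8.0]   # hypothetical subpopulation variances
N_PER_GROUP = 100                        # hypothetical, far below 5000
M = 500                                  # hypothetical, far below 2000

def one_replicate():
    """Steps (a)-(b): simulate one sample, return estimate and two variances."""
    y, w = [], []
    for var_k in VARIANCES:
        sd_k = math.sqrt(var_k)
        for _ in range(N_PER_GROUP):
            y.append(random.gauss(MU, sd_k))
            w.append(1.0 / var_k)        # inverse-variance weight
    n, sum_w = len(y), sum(w)
    mean_w = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    # WOLS/ML-type variance of the weighted mean
    s2_w = sum(wi * (yi - mean_w) ** 2 for wi, yi in zip(w, y)) / (n - 1)
    var_wols = s2_w / sum_w
    # EE (Taylor-series) variance, single stratum; linearised values sum to 0
    var_ee = n / (n - 1) * sum((wi * (yi - mean_w) / sum_w) ** 2
                               for wi, yi in zip(w, y))
    return mean_w, var_wols, var_ee

results = [one_replicate() for _ in range(M)]          # steps (c)-(d)
means = [r[0] for r in results]
mc_mean = statistics.fmean(means)                      # step (e)
emp_se = statistics.stdev(means)
avg_se_wols = math.sqrt(statistics.fmean([r[1] for r in results]))   # step (f)
avg_se_ee = math.sqrt(statistics.fmean([r[2] for r in results]))

print(f"Monte Carlo mean={mc_mean:.4f}  empirical SE={emp_se:.4f}")
print(f"averaged WOLS/ML SE={avg_se_wols:.4f}  averaged EE SE={avg_se_ee:.4f}")
```

With correctly specified inverse-variance weights, both averaged SEs should land close to the empirical SE, which is the pattern the simulation study is designed to check.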
Shown in table 2 are the weighted mean and SEs from the WOLS/ML, EE and empirical variance estimates. The WOLS/ML and EE SEs are virtually identical (difference is 2.67 × 10⁻⁸) and both are extremely close to the empirical SE.
Therefore, for non-sampling weights such as weights selected to address heteroscedasticity, the EE variance estimate still describes the sampling variability of the estimate, as illustrated by the simulation study above. The EE variance estimate is more general, as it also provides valid inference for more complex weights such as those used for sampling and non-response bias, while WOLS/ML based variance formulas cannot be applied to all types of weights.
We would like to point out that the EE can also be used to address heteroscedasticity when a correction weight is not available. In many studies, the cause of heteroscedasticity is unknown and weights cannot be computed. In this case, the WOLS/ML approach no longer applies. But, even without a known heteroscedasticity weight, the EE still provides valid variance estimates. For example, when applied to the simulated data in Example 3 without weights, the estimated population mean is 0.9996, which is quite close to 1. The Monte Carlo average of the EE SE, 0.0002, is also quite close to the empirical error, 0.000185. Both the EE and empirical SEs in this case are a bit larger than their weighted counterparts, which is consistent with the property that the weighted mean is the BLUE, that is, the estimate with the smallest standard error among all estimates that are a linear combination of the observations.
In the next example, we use simulated data to show that, when using survey weights, the WOLS/ML variance estimate can have severe bias and the EE variance estimate must be used to provide valid inference about the population mean.
We use the Ice Cream Example as the setting to simulate the outcome (spending) for each student. First, we compute sample means and sample variances for each of the three strata (grade). Next, we construct the population distribution as a three-strata normal mixture using these sample means as the population means for the three strata and an averaged sample variance as the common population variance for all the strata. We then estimate the population mean using a weighted mean and compare the two variance estimates.
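Specifically, let $h$ index the strata and $y_{hi}$ denote the spending of the $i$th student sampled from the $h$th stratum. Let $\mu_h$ denote the population mean of the $h$th stratum:
$$\mu_1 = 5, \quad \mu_2 = 15.4, \quad \mu_3 = 10.1,$$
where 5, 15.4 and 10.1 are the sample means of the corresponding strata in the Ice Cream Example. Let $\sigma^2$ denote the common variance of $y_{hi}$ across all strata (average of three strata variances). Let $N_h$ denote the population size and $n_h$ denote the sample size of the $h$th stratum. We assume that $y_{hi}$ follows a three-component mixture with the mean $\mu_h$ and variance $\sigma^2$:
$$y_{hi} \sim N(\mu_h, \sigma^2), \quad 1 \le i \le n_h, \quad h = 1, 2, 3.$$
As in the Ice Cream Example, a sample of n=40 is taken from the population, with the strata sample sizes distributed as follows:
$$n_1 = 20, \quad n_2 = 9, \quad n_3 = 11.$$
The total population size N and overall population mean μ are given by:
$$N = \sum_{h=1}^{3} N_h = 4000, \quad \mu = \sum_{h=1}^{3} p_h \mu_h \approx 9.1325,$$
where $p_h = N_h/N$ is the proportion of the $h$th stratum size to the total population size. Under stratified random sampling, the sampling weights $w_h = N_h/n_h$ are used to estimate the overall population mean μ:
$$\bar{y}_w = \frac{\sum_{h=1}^{3}\sum_{i=1}^{n_h} w_h y_{hi}}{\sum_{h=1}^{3}\sum_{i=1}^{n_h} w_h}.$$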
To reduce sampling variability in Monte Carlo estimates, we set the Monte Carlo replication size to M=10 000. For the $m$th Monte Carlo iteration, let $\bar{y}_w^{(m)}$ denote the estimate and $\hat{\sigma}^{2(m)}_{WOLS}$ and $\hat{\sigma}^{2(m)}_{EE}$ denote the WOLS/ML and EE variance estimates from the $m$th simulated data. We compute the Monte Carlo sample mean $\bar{y}_{MC}$, empirical variance $\hat{\sigma}^2_{emp}$ and averaged variance estimates $\bar{\sigma}^2_{WOLS}$ and $\bar{\sigma}^2_{EE}$ from the two methods the same way as in (13) and (14).
Shown in table 3 are the Monte Carlo mean, the empirical SE and the averaged SEs from the two variance estimates. As expected, the Monte Carlo mean is nearly identical to the population mean μ=9.1325 and the averaged EE SE is identical to the empirical version.
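A scaled-down Python sketch of this stratified simulation is given below. The stratum means and sample sizes follow the setting described above; the stratum population sizes 1824, 1025 and 1151, the common SD of 3 and the 2000 replications are assumptions made for illustration (the common variance used in the actual simulation is not restated here). The WOLS/ML-type variance (weighted variance with divisor n − 1 over the sum of weights) and the stratified Taylor-series EE variance without finite population correction are likewise assumed forms.

```python
import math
import random
import statistics

random.seed(4)
N_h = [1824, 1025, 1151]        # assumed stratum population sizes (total 4000)
n_h = [20, 9, 11]               # stratum sample sizes
mu_h = [5.0, 15.4, 10.1]        # stratum population means
SIGMA = 3.0                     # hypothetical common SD across strata
M = 2000                        # hypothetical number of replications

N = sum(N_h)
mu = sum(Nh * m for Nh, m in zip(N_h, mu_h)) / N      # overall population mean
w_h = [Nh / k for Nh, k in zip(N_h, n_h)]             # sampling weights

def one_replicate():
    strata_y = [[random.gauss(m, SIGMA) for _ in range(k)]
                for m, k in zip(mu_h, n_h)]
    n = sum(n_h)
    sum_w = sum(w * k for w, k in zip(w_h, n_h))      # equals N
    mean_w = sum(w * sum(ys) for w, ys in zip(w_h, strata_y)) / sum_w
    # WOLS/ML-type variance: ignores the stratified design
    s2_w = sum(w * (y - mean_w) ** 2
               for w, ys in zip(w_h, strata_y) for y in ys) / (n - 1)
    var_wols = s2_w / sum_w
    # EE (Taylor-series) variance computed within strata
    var_ee = 0.0
    for w, ys in zip(w_h, strata_y):
        e = [w * (y - mean_w) / sum_w for y in ys]    # linearised values
        e_bar = statistics.fmean(e)
        var_ee += len(e) / (len(e) - 1) * sum((ei - e_bar) ** 2 for ei in e)
    return mean_w, var_wols, var_ee

results = [one_replicate() for _ in range(M)]
means = [r[0] for r in results]
mc_mean = statistics.fmean(means)
emp_se = statistics.stdev(means)
avg_se_wols = math.sqrt(statistics.fmean([r[1] for r in results]))
avg_se_ee = math.sqrt(statistics.fmean([r[2] for r in results]))

print(f"population mean={mu:.4f}  Monte Carlo mean={mc_mean:.4f}")
print(f"empirical SE={emp_se:.4f}  averaged EE SE={avg_se_ee:.4f}  "
      f"averaged WOLS/ML SE={avg_se_wols:.4f}")
```

Under these assumptions the averaged EE SE tracks the empirical SE, while the WOLS/ML-type SE is noticeably larger because it treats the between-strata differences in means as if they were sampling noise.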
In this paper, we focused on sampling and homoscedasticity weights, discussed the conceptual difference between the two and illustrated the implications of this difference for the SEs of estimated population means through analytic expressions and Monte Carlo simulations. We have demonstrated that homoscedasticity weights have very specific applications. Our experiences with SAS and other popular packages indicate that if weights are available as an option in a procedure such as SAS PROC MEANS, they are typically of the homoscedasticity type. Such procedures should not be used for any other types of weights such as sampling weights. Sampling weights must be used only in survey-specific procedures such as SAS PROC SURVEYMEANS, as PROC MEANS will not compute the correct SE and will show substantial bias even in large samples. In contrast, procedures such as SAS PROC SURVEYMEANS are more general and will compute the correct SE when using either sampling or homoscedasticity weights. In the case of heteroscedasticity, estimating equation methods for calculating the SE can compute the correct variance estimate even if a researcher does not have access to a known homoscedasticity weight, correcting for potential distortions in the SE that can result from this violation. We recommend that researchers identify the type of weight they are using and understand the implications of using the weight within common analytic programs such as SAS, as incorrect application of weights can have important consequences for research analyses.
Sabrina Richardson received her PhD in developmental psychology from the University of California, Riverside, and is currently a Leidos research psychologist at the Naval Health Research Center working on the Millennium Cohort Family Study, a longitudinal study investigating the well-being of service members and their families. Her research broadly seeks to understand the multifaceted processes of adaptation supporting better than expected outcomes when encountering risk (i.e., resilience), particularly focused on child and family development. She has authored works on children’s resilience, sibling relationships, child maltreatment, transition aged foster youth, and military family readiness. She is also interested in quantitative methods such as structural equation modeling, moderation/mediation analyses, and survey methodology.
Contributors SR, XN, VS, MX and XMT had extensive discussions of the statistical issues with different types of weights and their implementations in some popular statistical packages and worked together to structure this report. TL, YL and XN worked together to find the formulas for the estimate and SEs of the estimate of the population mean as implemented in SAS and developed the computer codes to perform data simulations and inference for the examples.
Funding The project described was partially supported by the National Institutes of Health, Grant UL1TR001442 of CTSA funding beginning 13 August 2015 and beyond and by the Navy Bureau of Medicine and Surgery under work unit no. N1240. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Disclaimer I am an employee of the US Government. This work was prepared as part of my official duties. Title 17, U.S.C. §105 provides that copyright protection under this title is not available for any work of the US Government. Title 17, U.S.C. §101 defines a US Government work as work prepared by a military service member or employee of the US Government as part of that person’s official duties. This work was supported by the Navy Bureau of Medicine and Surgery under work unit no. N1240. The views expressed in this article are those of the authors and do not necessarily reflect the official policy or position of the Department of the Navy, Department of Defense, nor the US Government.
Competing interests None declared.
Patient consent for publication Not required.
Provenance and peer review Commissioned; internally peer reviewed.
Data sharing statement No additional data are available.