
## Abstract

In many statistical applications, composite variables are constructed to reduce the number of variables and improve the performances of statistical analyses of these variables, especially when some of the variables are highly correlated. Principal component analysis (PCA) and factor analysis (FA) are generally used for such purposes. If the variables are used as explanatory or independent variables in linear regression analysis, partial least squares (PLS) regression is a better alternative. Unlike PCA and FA, PLS creates composite variables by also taking into account the response, or dependent variable, so that they have higher correlations with the response than composites from their PCA and FA counterparts. In this report, we provide an introduction to this useful approach and illustrate it with data from a real study.

- biostatistics
- statistics as topic
- linear models

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

## Introduction

Composite variables are widely used to summarise information from a set of outcomes in statistical analysis. In some studies, composite variables are used to create domain scales or subscales, as in the SF-36 (the MOS 36-item short-form health survey) instrument, while in other studies they are used to deal with limitations of the data. For example, in regression analysis, we may need to create composite variables if the number of explanatory or independent variables is larger than the sample size. This issue arises when modelling high-throughput data, for example when fitting regression models to determine associations of brain functions with behavioural and health outcomes of interest, because most such studies combine large numbers of brain imaging variables with limited sample sizes. Principal component analysis (PCA) and factor analysis (FA) are generally used for creating composite variables. In this report, we describe another, less well-known approach, partial least squares (PLS) regression, for creating composite variables and discuss scenarios where this approach is more effective than PCA and FA. We illustrate this approach with a real-life application to research data.

## Partial least squares regression

As noted earlier, PCA and FA are two popular approaches for creating composite variables. Under PCA, a set of ordered composite variables is created to represent the original set of outcomes. Each composite variable is a linear combination, or a weighted sum, of the original outcomes. The coefficients, or weights, of the linear combination in each composite variable are called loadings, and their signs and magnitudes indicate the directions and contributions of the corresponding variables. Unlike the original outcomes, the composite variables are orthogonal to each other. Moreover, the first composite variable has the largest variance, followed by the second and so on. FA also creates a set of composite variables. However, unlike PCA, FA composite variables are not ordered in the sense of PCA composites and are not orthogonal to each other. Instead, loadings of the FA composites can be used to group the original variables to create subscales, or domain scales, for different constructs such as the domains of Physical Functioning and Emotional Well-being in the SF-36.1

PCA and FA create composite variables for general purposes. When composite variables are used as explanatory or independent variables in regression analysis involving a response or dependent variable, a more effective approach is PLS. Like PCA, PLS composite variables are also ordered. However, unlike PCA, PLS composite variables are ordered by their correlations with the response in the regression model; the first composite variable has the maximum correlation with the response, followed by the second and so on. If interest lies in finding a subset of the original explanatory variables in the linear model that explains the most variability in the response, PLS composite variables are more effective than PCA.

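To describe in detail how to compute PLS composite variables, consider a linear regression with a continuous response of interest, $Y$, and a set of $p$ explanatory variables, $X_1, X_2, \ldots, X_p$. We are interested in modelling the relationship of $Y$ with $X_1, X_2, \ldots, X_p$. Given a sample of $n$ subjects, the classic linear regression relating $Y$ to $X_1, X_2, \ldots, X_p$ is given by:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad 1 \le i \le n \tag{1}$$

where $i$ indexes the subjects, $\beta_0, \beta_1, \ldots, \beta_p$ denote the regression parameters, $\epsilon_i$ is the error term, and $N(\mu, \sigma^2)$ denotes a normal distribution with mean $\mu$ and variance $\sigma^2$. The first part of the linear regression,

$$E(Y_i \mid X_{i1}, X_{i2}, \ldots, X_{ip}) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_p X_{ip} \tag{2}$$

is called the conditional (population) mean of $Y_i$ given the explanatory variables $X_{i1}, X_{i2}, \ldots, X_{ip}$. On estimating the regression parameters $\beta_0, \beta_1, \ldots, \beta_p$, this conditional mean describes the association of $Y$ with each of the explanatory variables.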
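As a minimal sketch of the linear regression and its estimated conditional mean, the following Python snippet fits the model by least squares on simulated data. The sample size, the parameter values in `beta_true` and the use of NumPy are illustrative assumptions, not part of the study described later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): n subjects, p explanatory variables.
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # (beta_0, beta_1, ..., beta_p), made up

# Design matrix with a leading column of 1s for the intercept beta_0.
design = np.column_stack([np.ones(n), X])
y = design @ beta_true + rng.normal(scale=0.5, size=n)

# Least squares estimates of the regression parameters.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)

# Estimated conditional mean of Y given the explanatory variables.
cond_mean = design @ beta_hat
residuals = y - cond_mean
```

With well-behaved, uncorrelated explanatory variables such as these, the estimates in `beta_hat` should fall close to the values used to simulate the data.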

When the explanatory variables are highly correlated, estimates of $\beta_1, \ldots, \beta_p$ obtained with the standard least squares (LS) or maximum likelihood (ML) method may not be reliable because of multicollinearity. In studies of high-throughput data, the number of explanatory variables exceeds the sample size, in which case the LS method does not apply. In both cases, we need to reduce the number of the variables, $X_1, \ldots, X_p$, or the dimension $p$. There are different approaches to address this issue. One may use the least absolute shrinkage and selection operator (LASSO) to determine a subset of $X_1, \ldots, X_p$ that provides reliable associations with $Y$. Alternatively, one may create composite variables from $X_1, \ldots, X_p$ and use a subset of the composite variables to predict $Y$. The latter composite variable approach is preferred if some or all of $X_1, \ldots, X_p$ need to work together to explain the variability in $Y$. For example, if one wants to predict areas of rectangles, lengths or widths alone will not provide reliable predictions, since a rectangle with a very large length can still have a small area if it has a small width. LASSO is most effective for high-throughput data, as dimension is the primary problem in this case. In the presence of multicollinearity, it is likely more meaningful to aggregate information in correlated variables using a subset of composite variables than to select a subset of the original variables. In this case, correlated variables may all contribute to explaining the variability in the response, and composite variables will account for all such contributions.
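The two problems described above can be seen numerically. The sketch below, on simulated data with arbitrary sizes, checks the rank of a data matrix with more variables than subjects and compares the condition number of the cross-product matrix for nearly collinear versus independent variables. All names and settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Case 1: more explanatory variables than subjects (p > n).
n, p = 20, 50
X_wide = rng.normal(size=(n, p))
# The rank of X (and hence of the p x p matrix X'X) cannot exceed n,
# so X'X is singular and the LS normal equations have no unique solution.
rank_wide = np.linalg.matrix_rank(X_wide)

# Case 2: two nearly identical explanatory variables (multicollinearity).
n2 = 100
x1 = rng.normal(size=n2)
x2 = x1 + rng.normal(scale=0.01, size=n2)
X_collinear = np.column_stack([x1, x2])
cond_collinear = np.linalg.cond(X_collinear.T @ X_collinear)

# Reference: two independent explanatory variables.
X_indep = rng.normal(size=(n2, 2))
cond_indep = np.linalg.cond(X_indep.T @ X_indep)
```

A very large condition number means that small changes in the data can produce large changes in the LS estimates, which is the practical meaning of unreliable estimates under multicollinearity.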

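The composite variables for PLS are obtained by solving an optimisation problem.2 Unlike PCA, PLS finds directions for the composite variables that have both high variance and high correlation with the response $Y$. Specifically, let $Z_l$ denote the $l$th composite variable:

$$Z_{il} = \phi_{l1} X_{i1} + \phi_{l2} X_{i2} + \cdots + \phi_{lp} X_{ip}, \quad 1 \le i \le n \tag{3}$$

where $\phi_{l1}, \phi_{l2}, \ldots, \phi_{lp}$ denote weights, or loadings, of the composite $Z_l$. We can also express (3) equivalently in a vector form

$$Z_{il} = \mathbf{x}_i^\top \phi_l \tag{4}$$

or in a matrix form:

$$Z_l = X \phi_l$$

where $Z_l = (Z_{1l}, \ldots, Z_{nl})^\top$, $\mathbf{x}_i = (X_{i1}, \ldots, X_{ip})^\top$ and $\phi_l = (\phi_{l1}, \ldots, \phi_{lp})^\top$ denote column vectors, and $X$ denotes an $n \times p$ matrix. The loadings are determined by the following optimisation procedure:

$$\hat{\phi}_l = \arg\max_{\phi} \; \mathrm{Corr}^2(Y, X\phi)\,\mathrm{Var}(X\phi) \quad \text{subject to} \quad \lVert \phi \rVert = 1, \quad \phi^\top S \hat{\phi}_j = 0, \quad 1 \le j \le l-1$$

where $\phi$ is a $p \times 1$ column vector, $S$ is the sample covariance matrix of $X$, $\mathrm{Corr}^2(Y, X\phi)$ denotes the squared (Pearson) correlation between $Y$ and $X\phi$, and $\mathrm{Var}(X\phi)$ denotes the sample variance of $X\phi$. The condition $\phi^\top S \hat{\phi}_j = 0$ ensures that the $l$th composite $Z_l$ is uncorrelated with all previous composite variables $Z_j$ ($1 \le j \le l-1$).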
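A short derivation, added here as a sketch, shows why this criterion favours directions that covary with the response. Write $\phi$ for a $p \times 1$ weight vector, $X$ for the $n \times p$ data matrix, and $\mathrm{Cov}$, $\mathrm{Var}$ and $\mathrm{Corr}$ for the sample covariance, variance and Pearson correlation, and assume $\mathrm{Var}(Y) > 0$ and $\mathrm{Var}(X\phi) > 0$. Then

$$\mathrm{Corr}^2(Y, X\phi)\,\mathrm{Var}(X\phi) = \frac{\mathrm{Cov}^2(Y, X\phi)}{\mathrm{Var}(Y)\,\mathrm{Var}(X\phi)}\,\mathrm{Var}(X\phi) = \frac{\mathrm{Cov}^2(Y, X\phi)}{\mathrm{Var}(Y)}$$

Because $\mathrm{Var}(Y)$ does not depend on $\phi$, maximising the criterion is equivalent to maximising the squared covariance between $Y$ and $X\phi$. For the first composite, with the columns of $X$ centred, $\mathrm{Cov}(Y, X\phi) = \phi^\top X^\top (Y - \bar{Y}\mathbf{1})/(n-1)$, so by the Cauchy-Schwarz inequality the maximiser under $\lVert \phi \rVert = 1$ is proportional to $X^\top (Y - \bar{Y}\mathbf{1})$, that is, to the vector of covariances between each explanatory variable and the response.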

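In practice, we can use the following procedure to find the PLS composite variables.3 We start by standardising each of the original explanatory variables to have mean 0 and variance 1. Set $\hat{Y}^{(0)} = \bar{Y}\mathbf{1}$ and $X_j^{(0)} = X_j$ ($1 \le j \le p$), where $\bar{Y}$ denotes the sample mean of $Y_i$ ($1 \le i \le n$) and $\mathbf{1}$ denotes a column vector of 1s. For $l = 1, 2, \ldots, p$, we perform the following steps:

(a) $Z_l = \sum_{j=1}^{p} \hat{\phi}_{lj} X_j^{(l-1)}$, where $\hat{\phi}_{lj} = \langle X_j^{(l-1)}, Y \rangle$ and $\langle a, b \rangle$ denotes the inner product between two vectors $a$ and $b$;

(b) $\hat{\theta}_l = \langle Z_l, Y \rangle / \langle Z_l, Z_l \rangle$;

(c) $\hat{Y}^{(l)} = \hat{Y}^{(l-1)} + \hat{\theta}_l Z_l$;

(d) Orthogonalise each $X_j^{(l-1)}$ with respect to $Z_l$: $X_j^{(l)} = X_j^{(l-1)} - \dfrac{\langle Z_l, X_j^{(l-1)} \rangle}{\langle Z_l, Z_l \rangle} Z_l$ ($1 \le j \le p$).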
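The iterative procedure in steps (a) to (d) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name `pls_components`, the simulated data and the use of NumPy are all hypothetical choices made here.

```python
import numpy as np

def pls_components(X, y, n_components):
    """Compute PLS composites by repeated weighting and orthogonalisation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Standardise each explanatory variable to mean 0 and variance 1.
    Xl = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    y_hat = np.full(n, y.mean())        # initial fit: the sample mean of y
    Z = np.empty((n, n_components))
    theta = np.empty(n_components)
    for l in range(n_components):
        phi = Xl.T @ y                  # (a) weights: inner products with y
        z = Xl @ phi                    # (a) composite variable
        zz = z @ z
        theta[l] = (z @ y) / zz         # (b) regress y on the composite
        y_hat = y_hat + theta[l] * z    # (c) update the fitted values
        # (d) remove from every column its projection on the composite
        Xl = Xl - np.outer(z, z @ Xl) / zz
        Z[:, l] = z
    return Z, theta, y_hat

# Usage on simulated data (all values are made up for illustration).
rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
X[:, 1] += 0.8 * X[:, 0]                # induce correlation between columns
y = X @ np.array([1.0, -2.0, 0.5, 1.5]) + rng.normal(scale=0.5, size=n)

Z, theta, y_hat = pls_components(X, y, n_components=p)

# Ordinary least squares fit on the standardised variables, for comparison.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
design = np.column_stack([np.ones(n), Xs])
ols_fit = design @ np.linalg.lstsq(design, y, rcond=None)[0]
```

When all $p$ composites are retained and the data matrix has full column rank, the fitted values agree with the ordinary least squares fit. The dimension reduction comes from keeping only the first few columns of `Z`.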

To illustrate the difference between PLS and PCA, here is the procedure to compute composite variables under PCA:

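We start with the data matrix $X$, which is formed by the column vectors $X_1$, $X_2$,…, $X_p$, that is, $X = (X_1, X_2, \ldots, X_p)$. Then we perform the following steps:

Average each column of $X$: $\bar{X}_j = \frac{1}{n}\sum_{i=1}^{n} X_{ij}$ ($1 \le j \le p$);

Centre the matrix $X$ at these averages by subtracting $\bar{X}_j$ from the $j$th column vector of $X$, denote as: $X_c = (X_1 - \bar{X}_1\mathbf{1}, \ldots, X_p - \bar{X}_p\mathbf{1})$;

Compute the sample variance–covariance matrix: $\Sigma = \frac{1}{n-1} X_c^\top X_c$;

Compute the eigenvalues $\lambda_1, \ldots, \lambda_p$ and corresponding eigenvectors $v_1, \ldots, v_p$ of $\Sigma$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$;

The top $m$ ($m < p$) principal components, or composite variables, $X_c v_1, \ldots, X_c v_m$, are then used as independent variables in linear regression models with $Y$ as the dependent variable, where $m$ is generally determined by the magnitude of the sum of the top $m$ eigenvalues relative to the sum of all $p$ eigenvalues, $\sum_{k=1}^{m} \lambda_k / \sum_{k=1}^{p} \lambda_k$, which has the interpretation of being the percent of the variability of $X$ explained by the top $m$ eigenvectors $v_1, \ldots, v_m$.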
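The PCA steps above can be sketched in Python as follows, assuming the rows of the data matrix are subjects and the columns are variables. The function name `pca_components` and the simulated data are hypothetical, and NumPy's `eigh` is used here because the covariance matrix is symmetric.

```python
import numpy as np

def pca_components(X, m):
    """Return the top-m principal component scores, all eigenvalues
    (largest first) and the share of variability explained by the top m."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # centre each column
    cov = Xc.T @ Xc / (X.shape[0] - 1)         # sample variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder so the largest comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :m]               # composite variables
    explained = eigvals[:m].sum() / eigvals.sum()
    return scores, eigvals, explained

# Usage on simulated data (values made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 1] += 0.8 * X[:, 0]
scores, eigvals, explained = pca_components(X, m=2)
```

Note that the response $Y$ never enters this computation: the composites depend only on the explanatory variables.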

By comparing the two procedures, we can see that PCA creates composite variables without using any information in the dependent variable $Y$, whereas PLS uses $Y$ in creating its composite variables. If the goal is to find composite variables of $X$ that are most predictive of $Y$, PLS is preferable to PCA. On the other hand, if the goal is to find composite variables that maximally explain the variability of the data matrix $X$, then PCA is preferable.
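A small simulated example, constructed here purely for illustration, makes this contrast concrete. Two highly correlated explanatory variables share most of their variance along their sum, while the response is driven by their difference. The first PCA composite follows the high-variance direction and is nearly unrelated to the response, whereas the first PLS composite tracks the response. The data-generating settings are assumptions chosen to make the contrast visible.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Two highly correlated explanatory variables (hypothetical data).
common = rng.normal(size=n)
x1 = common + 0.3 * rng.normal(size=n)
x2 = common + 0.3 * rng.normal(size=n)
X = np.column_stack([x1, x2])
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The response depends on the low-variance direction x1 - x2.
y = Xs[:, 0] - Xs[:, 1] + 0.1 * rng.normal(size=n)

# First PCA composite: eigenvector with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
pca_z = Xs @ eigvecs[:, -1]          # eigh sorts eigenvalues in ascending order

# First PLS composite: weights proportional to the covariances with y.
phi = Xs.T @ (y - y.mean())
pls_z = Xs @ (phi / np.linalg.norm(phi))

corr_pca = np.corrcoef(pca_z, y)[0, 1]
corr_pls = np.corrcoef(pls_z, y)[0, 1]
```

The sign of an eigenvector is arbitrary, so the comparison should be made on absolute correlations. In less extreme settings the two composites can be much closer, as in the real study example below.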

## Real study example

We illustrate PLS with data from a recent study on the association of loneliness and wisdom with gut microbial diversity and composition.4 Loneliness and wisdom have opposite effects on health and well-being. Loneliness is a serious public health problem associated with increased morbidity and mortality. Wisdom is associated with better health and well-being. Nguyen *et al*4 successfully applied PLS to demonstrate the association of loneliness and wisdom with alpha-diversity. We use this study to illustrate the advantages of PLS over standard linear regression. More details about the study population, measures of loneliness, wisdom, gut microbial diversity and other outcomes, and additional findings can be found in the paper.

The study included 184 community-dwelling adults (28–97 years). Participants completed validated scales of loneliness (UCLA Loneliness Scale),5 wisdom (including cognitive, affective and reflective dimensions; Three-Dimensional Wisdom Scale),6 compassion (Santa Clara Brief Compassion Scale),7 social support (Emotional Support Scale)8 and social engagement (Cognitively Stimulating Questionnaire).9 These variables are interrelated: loneliness and wisdom have strong inverse correlations, and social support, social engagement and loneliness are often inversely correlated, but they are distinct concepts. Faecal samples were obtained from participants using at-home self-collection kits and returned via mail. Alpha-diversity is the ecological diversity (ie, richness, evenness, compositional complexity) of a single sample and was quantified using Faith’s Phylogenetic Diversity (PD) based on the DNA extracted from the faecal samples. It measures the total length of branches in a reference phylogenetic tree for all species in a given sample.10

We first fitted a standard linear regression to model the association of alpha-diversity with the individual loneliness, wisdom, compassion, social support and social engagement outcomes as predictors, controlling for age and body mass index (BMI). Table 1 shows the estimated regression coefficients (*β*) for the predictors and covariates, along with the associated t-statistics (*t*) and p values. As seen, none of the predictors was significant.

We then applied PLS to construct composite variables from all the predictors and regressed alpha-diversity on the extracted composite variables and the covariates, adding one composite at a time and examining how much each added component contributed to the explained variability in alpha-diversity.11 We settled on the first two composite variables because adding component 3 led to a decreased adjusted R squared. Table 2 shows the estimated regression coefficients (*β*) for the predictors and covariates, along with t-statistics (*t*) and p values. Component 1 was significantly and positively associated with alpha-diversity (p=0.008), whereas component 2 was not (p=0.217).
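The selection rule described above, adding composites one at a time and stopping when the adjusted R squared no longer increases, can be sketched as follows. This is an illustration on simulated data, not the study's analysis: the function names, the two stand-in covariates and the artificial third composite (built to carry no information about the response) are all assumptions.

```python
import numpy as np

def adjusted_r2(y, fitted, n_predictors):
    """Adjusted R squared for a model with an intercept and n_predictors terms."""
    y = np.asarray(y, dtype=float)
    rss = np.sum((y - fitted) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    n = y.size
    return 1.0 - (rss / (n - n_predictors - 1)) / (tss / (n - 1))

def ols_fitted(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ beta

def choose_n_components(components, covariates, y):
    """Add ordered composites until adjusted R squared stops increasing."""
    n = len(y)
    base = np.column_stack([np.ones(n), covariates])
    best_k = 0
    best = adjusted_r2(y, ols_fitted(base, y), covariates.shape[1])
    for k in range(1, components.shape[1] + 1):
        design = np.column_stack([base, components[:, :k]])
        score = adjusted_r2(y, ols_fitted(design, y), design.shape[1] - 1)
        if score <= best:
            break
        best_k, best = k, score
    return best_k, best

# Simulated illustration (hypothetical values throughout).
rng = np.random.default_rng(0)
n = 100
covariates = rng.normal(size=(n, 2))          # stand-ins for two covariates
c1, c2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 * c1 + 0.8 * c2 + 0.5 * covariates[:, 0] + rng.normal(size=n)

# Third composite made orthogonal to y and to all other terms,
# so that adding it cannot reduce the residual sum of squares.
B = np.column_stack([np.ones(n), covariates, c1, c2, y])
raw = rng.normal(size=n)
c3 = raw - B @ np.linalg.lstsq(B, raw, rcond=None)[0]

components = np.column_stack([c1, c2, c3])
best_k, best_score = choose_n_components(components, covariates, y)
```

Because adjusted R squared penalises each added term, a composite that does not reduce the residual sum of squares lowers the score, which is what stops the loop at two composites in this constructed example.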

When applying PLS, it is important to determine the directions of effects for the original predictors of interest (loneliness, wisdom, compassion, social support and social engagement) when a composite variable is used as a predictor in the final model. Table 3 shows the loadings of the individual predictors on the first two composite variables. The sign of the loading of a predictor on a composite variable indicates the direction of association of the predictor with that composite variable. Except for loneliness, all the predictors had positive loadings on the first composite variable. Because the first composite variable was positively associated with alpha-diversity, this indicates that wisdom, compassion, social support and social engagement had positive associations with alpha-diversity, whereas loneliness, with its negative loading, had a negative association. The first composite variable also accounted for 40% of the total variability of the psychosocial variables.

To illustrate the differences between PLS and PCA, we also applied PCA to construct composite variables and used them as explanatory variables in modelling the association of alpha-diversity with the psychosocial variables. To be consistent with the PLS analysis, we used the first two principal components as the composite variables and controlled for age and BMI. Table 4 shows the estimated regression coefficients (*β*) for the predictors and covariates, along with t-statistics (*t*) and p values (component 1: p=0.015; component 2: p=0.190). The results were similar to their PLS counterparts. A notable difference is the slightly weaker association between the first composite variable and alpha-diversity. Both PCA and PLS yielded the same conclusion regarding the association of composite variables with alpha-diversity.

Table 5 shows the loadings of the individual predictors on the first two PCA composite variables. The signs of the loadings are consistent with their PLS counterparts. The wisdom-cognitive subscore had a smaller loading under PLS than under PCA, while compassion, social support and social engagement had higher loadings under PLS than under PCA.

## Discussion

In this report, we described the partial least squares (PLS) regression, discussed its relationship with a closely related alternative, the principal component analysis (PCA), and illustrated the PLS with a real study example. Although both aim to reduce explanatory variables (predictors), PLS and PCA work quite differently in developing composite variables. While PCA constructs the composite variables to explain the maximum variability in all the original predictors, or the explanatory variables of interest, PLS creates its composite variables to explain the maximum variability in the response within the context of linear regression.

In practice, if the goal is to develop a set of composite variables for use as explanatory variables in regression models for multiple responses, PCA may be preferred since, unlike PLS, it will create a common set of composite variables for regression across all the responses. On the other hand, if the objective is to develop a set of composite variables to explain the maximum variability for a given response, then PLS should be used. When applying PLS to develop composite variables for regression analysis for multiple responses, multiple sets of composite variables will be created with one set for each response and consequently, regression results from composite variables must be interpreted with respect to factor loadings within each set of composite variables.

In the illustrative example, the two approaches yield similar results. In general, results from the two approaches may differ and yield different conclusions. For example, PLS may yield significant associations of its composite variables while PCA does not. If interest lies in finding associations of a response with a set of explanatory variables, PLS should be used.

## Ethics statements

### Patient consent for publication

### Ethics approval

The studies involving human participants were reviewed and approved by UCSD Human Research Protections Program. The participants provided written informed consent before taking part in the study.

## References

Chenyu Liu is a PhD student in Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health and Human Longevity Science, UC San Diego in USA. She is currently working as a Graduate Student Researcher at UC San Diego. She got her master’s degree in statistics from University of Minnesota, Twin Cities in USA. Her main research interests include statistical learning, statistical methods, clinical trial and causal inference.

## Footnotes

Contributors CL conceived the initial idea, searched the literature on related topics, performed analyses and assisted in manuscript preparation. XZ participated in the discussion of the statistical problems and helped with technical details of PLS, and helped finalise the manuscript. TTN brought the statistical problem in the real study example, participated in the discussion of the statistical problems, helped in the interpretation of estimates in the real study example and helped with technical details of least squares, and helped finalise the manuscript. JL, TW, XMT researched the statistical issues, directed simulation studies, drafted parts of the manuscript and finalised the manuscript. All authors provided critical feedback and helped shape the research, analysis and manuscript.

Funding This study was funded by National Institutes of Health (UL1TR001442).

Competing interests None declared.

Provenance and peer review Commissioned; externally peer reviewed.