Partial least squares regression and principal component analysis: similarity and differences between two popular variable reduction approaches
  1. Chenyu Liu1,
  2. Xinlian Zhang1,
  3. Tanya T Nguyen2,
  4. Jinyuan Liu1,
  5. Tsungchin Wu1,
  6. Ellen Lee2 and
  7. Xin M Tu1
  1. 1 Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health and Human Longevity Science, UC San Diego, La Jolla, California, USA
  2. 2 Department of Psychiatry, Stein Institute for Research on Aging, UC San Diego, La Jolla, California, USA
  1. Correspondence to Dr Chenyu Liu; chl056{at}health.ucsd.edu

Abstract

In many statistical applications, composite variables are constructed to reduce the number of variables and improve the performance of statistical analyses of these variables, especially when some of the variables are highly correlated. Principal component analysis (PCA) and factor analysis (FA) are generally used for such purposes. If the variables are used as explanatory or independent variables in linear regression analysis, partial least squares (PLS) regression is a better alternative. Unlike PCA and FA, PLS creates composite variables by also taking into account the response, or dependent variable, so that they have higher correlations with the response than their PCA and FA counterparts. In this report, we provide an introduction to this useful approach and illustrate it with data from a real study.

  • biostatistics
  • statistics as topic
  • linear models
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Introduction

Composite variables are widely used to summarise information from a set of outcomes in statistical analysis. In some studies, composite variables are used to create domain scales or subscales, as in the SF-36 (the MOS 36-item short-form health survey) instrument, while in others they are used to deal with limitations of the data. For example, in regression analysis, we may need to create composite variables if the number of explanatory or independent variables is larger than the sample size. This issue arises when modelling high-throughput data, for example, when fitting regression models to determine associations of brain functions with behavioural and health outcomes of interest, because most studies have large numbers of brain imaging variables and limited sample sizes. Principal component analysis (PCA) and factor analysis (FA) are generally used for creating composite variables. In this report, we describe a less well-known approach, partial least squares (PLS) regression, for creating composite variables and discuss scenarios where this approach is more effective than PCA and FA. We illustrate this approach with a real-life application to research data.

Partial least squares regression

As noted earlier, PCA and FA are two popular approaches for creating composite variables. Under PCA, a set of ordered composite variables is created to represent the original set of outcomes. Each composite variable is a linear combination, or a weighted sum, of the original outcomes. The coefficients, or weights, of the linear combination in each composite variable are called loadings, and their signs and magnitudes indicate the directions and contributions of the corresponding variables. Unlike the original outcomes, the composite variables are orthogonal to each other. Moreover, the first composite variable has the largest variance, followed by the second and so on. FA also creates a set of composite variables. However, unlike PCA, FA composite variables are not ordered in the sense of PCA composites and are not orthogonal to each other. Instead, loadings of the FA composites can be used to group the original variables to create subscales, or domain scales, for different constructs such as the domains of Physical Functioning and Emotional Well-being in the SF-36.1

PCA and FA create composite variables for general purposes. When composite variables are used as explanatory or independent variables in regression analysis involving a response or dependent variable, a more effective approach is PLS. Like PCA, PLS composite variables are also ordered. However, unlike PCA, PLS composite variables are ordered by their correlations with the response in the regression model; the first composite variable has the maximum correlation with the response, followed by the second and so on. If interest lies in finding a subset of the original explanatory variables in the linear model that explains the most variability in the response, PLS composite variables are more effective than PCA.

To describe in detail how to compute PLS composite variables, consider a linear regression with a continuous response of interest, $Y$, and a set of $p$ explanatory variables, $X_1, X_2, \ldots, X_p$. We are interested in modelling the relationship of $Y$ with $X_1, X_2, \ldots, X_p$. Given a sample of $n$ subjects, the classic linear regression relating $Y$ to $X_1, X_2, \ldots, X_p$ is given by:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad 1 \le i \le n,$$ (1)

where $i$ indexes the subjects, $\beta_0, \beta_1, \ldots, \beta_p$ denote the regression parameters, $\epsilon_i$ is the error term, and $N(\mu, \sigma^2)$ denotes a normal distribution with mean $\mu$ and variance $\sigma^2$. The first part of the linear regression,

$$E(Y_i \mid X_{i1}, \ldots, X_{ip}) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip},$$ (2)

is called the conditional (population) mean of $Y_i$ given the explanatory variables $X_{i1}, \ldots, X_{ip}$. On estimating the regression parameters $\beta_0, \beta_1, \ldots, \beta_p$, this conditional mean describes the association of $Y_i$ with each of the explanatory variables.

When the explanatory variables $X_1, \ldots, X_p$ are highly correlated, estimates of $\beta_1, \ldots, \beta_p$ obtained with the standard least squares (LS) or maximum likelihood (ML) method may not be reliable because of multicollinearity. In studies of high-throughput data, the number of explanatory variables exceeds the sample size, in which case the LS method does not apply. In both cases, we need to reduce the number of the variables, $X_1, \ldots, X_p$, or the dimension $p$. There are different approaches to address this issue. One may use the least absolute shrinkage and selection operator (LASSO) to determine a subset of $X_1, \ldots, X_p$ that provides reliable associations with $Y$. Alternatively, one may create composite variables $Z_1, \ldots, Z_m$ from $X_1, \ldots, X_p$ and use a subset of the composite variables to predict $Y$. The latter composite variable approach is preferred if some or all $X_j$ need to work together to explain the variability in $Y$. For example, if one wants to predict areas of rectangles, lengths or widths alone will not provide reliable predictions since a rectangle with a very large length can still have a small area if it has a small width. LASSO is most effective for high-throughput data, where dimension is the primary problem. In the presence of multicollinearity, it is likely more meaningful to aggregate information in correlated variables using a subset of composite variables, rather than to select a subset of the original variables. In this case, correlated variables may all contribute to explaining the variability in the response $Y$ and composite variables will account for all such contributions.
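
The two situations described above can be illustrated numerically. The following is a minimal sketch on simulated data (the sample sizes, seeds and variable names are hypothetical and not taken from the study): when the number of explanatory variables exceeds the number of subjects, the design matrix cannot have full column rank, so the LS estimate is not unique; when two explanatory variables are nearly identical, the design matrix is badly conditioned, so LS estimates become unstable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: more explanatory variables (p) than subjects (n).
n, p = 10, 25
X_wide = rng.normal(size=(n, p))
# The rank of X (and hence of X'X) cannot exceed n, so X'X is not invertible.
rank_wide = np.linalg.matrix_rank(X_wide)
print(f"p = {p}, rank of design matrix = {rank_wide}")

# Case 2: n > p, but two explanatory variables are nearly identical.
n2 = 200
x1 = rng.normal(size=n2)
x2 = x1 + 1e-4 * rng.normal(size=n2)      # almost a copy of x1
x3 = rng.normal(size=n2)
X_collinear = np.column_stack([x1, x2, x3])
X_independent = np.column_stack([x1, rng.normal(size=n2), x3])

# A large condition number signals that LS estimates are numerically unstable.
cond_collinear = np.linalg.cond(X_collinear)
cond_independent = np.linalg.cond(X_independent)
print(f"condition number (collinear):   {cond_collinear:.1f}")
print(f"condition number (independent): {cond_independent:.1f}")
```

In both cases, replacing the original variables with a small number of composite variables removes the rank deficiency or near-singularity from the regression step.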

The composite variables $Z_1, \ldots, Z_m$ for PLS are obtained by solving an optimisation problem.2 Unlike PCA composite variables, PLS finds directions of the composite variables $Z_l$ that have both high variance and high correlation with the response $Y$. Specifically, let $Z_l$ denote the $l$th composite variable:

$$Z_{il} = \varphi_{l1} X_{i1} + \varphi_{l2} X_{i2} + \cdots + \varphi_{lp} X_{ip}, \quad 1 \le i \le n,$$ (3)

where $\varphi_{l1}, \ldots, \varphi_{lp}$ denote weights, or loadings, of the composite $Z_l$. We can also express (3) equivalently in a vector form

$$Z_l = \varphi_{l1} X_1 + \varphi_{l2} X_2 + \cdots + \varphi_{lp} X_p$$ (4)

or in a matrix form:

$$Z_l = X \varphi_l,$$

where $Z_l = (Z_{1l}, \ldots, Z_{nl})^{\top}$, $X_j = (X_{1j}, \ldots, X_{nj})^{\top}$ and $Y = (Y_1, \ldots, Y_n)^{\top}$ denote $n \times 1$ column vectors, $\varphi_l = (\varphi_{l1}, \ldots, \varphi_{lp})^{\top}$ denotes the $p \times 1$ column vector of loadings, and $X = (X_1, \ldots, X_p)$ denotes an $n \times p$ matrix. The loadings are determined by the following optimisation procedure:

$$\hat{\varphi}_l = \arg\max_{\alpha} \ \mathrm{Corr}^2(Y, X\alpha)\, \mathrm{Var}(X\alpha)$$

$$\text{subject to } \lVert \alpha \rVert = 1, \quad \alpha^{\top} S \hat{\varphi}_k = 0, \quad k = 1, \ldots, l-1,$$

where $\alpha$ is a $p \times 1$ column vector, $S$ is the sample covariance matrix of $X_1, \ldots, X_p$, $\mathrm{Corr}^2(Y, X\alpha)$ denotes the squared (Pearson) correlation between $Y$ and $X\alpha$, and $\mathrm{Var}(X\alpha)$ denotes the sample variance of $X\alpha$. The condition $\alpha^{\top} S \hat{\varphi}_k = 0$ ensures that the $l$th composite $X\alpha$ is uncorrelated with all previous composite variables $Z_k = X\hat{\varphi}_k$ ($k = 1, \ldots, l-1$).
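
A short derivation may help connect this optimisation problem to the computational procedure. It is a sketch under stated assumptions: the criterion is taken to be the product of the squared correlation with the response and the variance of the composite, the columns of $X$ and the response $Y$ are centred, and $\alpha$ has unit length. Under these assumptions, the first set of loadings is proportional to the inner products of the explanatory variables with the response.

```latex
% The variance of the composite cancels, leaving a squared covariance:
\mathrm{Corr}^2(Y, X\alpha)\,\mathrm{Var}(X\alpha)
  = \frac{\mathrm{Cov}^2(Y, X\alpha)}{\mathrm{Var}(Y)\,\mathrm{Var}(X\alpha)}\,\mathrm{Var}(X\alpha)
  = \frac{\mathrm{Cov}^2(Y, X\alpha)}{\mathrm{Var}(Y)} .

% With centred data, the covariance is an inner product:
\mathrm{Cov}(Y, X\alpha) = \frac{1}{n-1}\,\alpha^{\top} X^{\top} Y .

% By the Cauchy--Schwarz inequality, over all \alpha with \|\alpha\| = 1,
(\alpha^{\top} X^{\top} Y)^2 \le \|X^{\top} Y\|^2 ,
\quad \text{with equality when} \quad
\alpha = \frac{X^{\top} Y}{\|X^{\top} Y\|} ,
\quad \text{i.e.} \quad
\alpha_j \propto \langle X_j, Y \rangle .
```

Since $\mathrm{Var}(Y)$ does not depend on $\alpha$, maximising the criterion is equivalent to maximising the squared covariance between the composite and the response, which is why PLS favours directions that are both variable and related to $Y$.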

In practice, we can use the following procedure to find the PLS composite variables.3 We start by standardising each of the original explanatory variables $X_j$ to have mean 0 and variance 1. Set $\hat{Y}^{(0)} = \bar{Y} 1_n$ and $X_j^{(0)} = X_j$ ($j = 1, \ldots, p$), where $\bar{Y}$ denotes the sample mean of $Y_i$ ($i = 1, \ldots, n$) and $1_n$ denotes an $n \times 1$ column vector of 1. For $l = 1, \ldots, p$, we perform the following steps:

(a) $Z_l = \sum_{j=1}^{p} \varphi_{lj} X_j^{(l-1)}$, with $\varphi_{lj} = \langle X_j^{(l-1)}, Y \rangle$, where $\langle a, b \rangle$ denotes the inner product between two vectors $a$ and $b$;

(b) $\theta_l = \langle Z_l, Y \rangle / \langle Z_l, Z_l \rangle$;

(c) $\hat{Y}^{(l)} = \hat{Y}^{(l-1)} + \theta_l Z_l$;

(d) Orthogonalise each $X_j^{(l-1)}$ with respect to $Z_l$:

$$X_j^{(l)} = X_j^{(l-1)} - \frac{\langle Z_l, X_j^{(l-1)} \rangle}{\langle Z_l, Z_l \rangle} Z_l, \quad j = 1, \ldots, p.$$
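
Steps (a) to (d) can be sketched in Python with NumPy. This is a minimal illustration for a single continuous response, not the code used in the study; the function name `pls_components`, the simulated data and the seed are hypothetical. Note that, for components after the first, the weights computed in step (a) apply to the orthogonalised explanatory variables rather than to the original ones.

```python
import numpy as np

def pls_components(X, y, n_components):
    """Sketch of steps (a)-(d) for one continuous response."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Standardise each explanatory variable to mean 0 and variance 1.
    Xl = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    y_hat = np.full(n, y.mean())             # starting fit: sample mean of y
    Z = np.empty((n, n_components))
    weights = np.empty((n_components, p))
    thetas = np.empty(n_components)
    for l in range(n_components):
        phi = Xl.T @ y                        # (a) inner products <x_j, y>
        z = Xl @ phi                          # (a) l-th composite variable
        zz = z @ z
        theta = (z @ y) / zz                  # (b) regress y on the composite
        y_hat = y_hat + theta * z             # (c) update the fitted values
        Xl = Xl - np.outer(z, z @ Xl) / zz    # (d) orthogonalise each column w.r.t. z
        Z[:, l], weights[l], thetas[l] = z, phi, theta
    return Z, weights, thetas, y_hat

# Hypothetical simulated data with correlated explanatory variables.
rng = np.random.default_rng(1)
n, p = 60, 4
X = rng.normal(size=(n, p))
X[:, 1] += 0.8 * X[:, 0]
y = X @ np.array([1.0, 0.5, 0.0, -0.5]) + rng.normal(scale=0.5, size=n)

Z, weights, thetas, y_hat = pls_components(X, y, n_components=p)

# Sanity check 1: the composite variables are mutually orthogonal.
Zn = Z / np.linalg.norm(Z, axis=0)
print(np.round(Zn.T @ Zn, 6))

# Sanity check 2: with all p components, the fit matches ordinary least squares.
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
ols_fit = design @ beta
print(np.max(np.abs(y_hat - ols_fit)))
```

In practice only the first few components would be retained; keeping all $p$ of them, as in the second check, simply reproduces the LS fit and gives no reduction in dimension.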

To illustrate the difference between PLS and PCA, here is the procedure to compute composite variables under PCA:

We start with the $n \times p$ data matrix $X$, which is formed by the column vectors $X_1$, $X_2$,…, $X_p$, that is, $X = (X_1, X_2, \ldots, X_p)$. Then we perform the following steps:

  1. Average over each column of $X$: $\bar{X}_j = \frac{1}{n} \sum_{i=1}^{n} X_{ij}$ ($j = 1, \ldots, p$);

  2. Centre the matrix $X$ at these averages by subtracting $\bar{X}_j$ from each element of the column vector $X_j$ of $X$, and denote the result as $\tilde{X}$, with elements $\tilde{X}_{ij} = X_{ij} - \bar{X}_j$;

  3. Compute the sample variance–covariance matrix: $\Sigma = \frac{1}{n-1} \tilde{X}^{\top} \tilde{X}$;

  4. Compute the eigenvalues $\lambda_1, \ldots, \lambda_p$ and corresponding eigenvectors $v_1, \ldots, v_p$ of $\Sigma$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$;

  5. The top $m$ ($m \le p$) principal components, or composite variables, $Z_l = \tilde{X} v_l$ ($l = 1, \ldots, m$), are then used as independent variables in linear regression models with $Y$ as the dependent variable, where $m$ is generally determined by the magnitude of the sum of the top $m$ eigenvalues relative to the sum of all $p$ eigenvalues, $\sum_{l=1}^{m} \lambda_l / \sum_{l=1}^{p} \lambda_l$, which has the interpretation of being the percent of the variability of $X$ explained by the top $m$ eigenvectors $v_1, \ldots, v_m$.
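
The PCA steps above can be sketched in the same way. This is a minimal illustration that assumes column-wise centring and the usual $n-1$ divisor for the sample covariance matrix; the function name `pca_components` and the simulated data are hypothetical.

```python
import numpy as np

def pca_components(X, m):
    """Sketch of PCA by eigendecomposition of the sample covariance matrix."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                   # steps 1-2: centre each column
    cov = Xc.T @ Xc / (n - 1)                 # step 3: sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # step 4: eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # reorder so the largest comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :m]              # step 5: top m composite variables
    explained = eigvals[:m].sum() / eigvals.sum()
    return scores, eigvecs[:, :m], eigvals, explained

# Hypothetical simulated data with two correlated columns.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
X[:, 1] += X[:, 0]

m = 2
scores, directions, eigvals, explained = pca_components(X, m)
print(f"share of variability explained by the top {m} components: {explained:.3f}")

# The variance of each composite variable equals its eigenvalue,
# and the composite variables are uncorrelated with each other.
score_cov = np.cov(scores, rowvar=False)
print(np.round(score_cov, 6))
```

Nothing in this computation involves the response $Y$, which is the key contrast with the PLS procedure.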

By comparing the two procedures, we can see that PCA creates composite variables without using any information in the dependent variable Y, whereas PLS uses this information in creating its composite variables. If the goal is to find composite variables of X that are most predictive of Y, PLS is preferable to PCA. On the other hand, if the goal is to find composite variables that maximally explain the variability of the data matrix X, then PCA is preferable.
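
The contrast can be made concrete with a small simulation. In this hypothetical data set (not from the study), two explanatory variables share a strong common factor that is unrelated to the response, while a third, independent variable drives the response. The first PCA direction follows the shared factor, whereas the first PLS direction, taken here as the unit-length vector of inner products between the standardised explanatory variables and the response, follows the variable related to the response. By the Cauchy-Schwarz inequality, this PLS direction has covariance with the response at least as large in absolute value as any other unit-length direction, including the first PCA eigenvector.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
f = rng.normal(size=n)                        # shared factor, unrelated to y
x1 = f + 0.3 * rng.normal(size=n)
x2 = f + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)                       # the only variable driving y
y = x3 + 0.5 * rng.normal(size=n)

X = np.column_stack([x1, x2, x3])
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardised variables
yc = y - y.mean()

# First PCA direction: leading eigenvector of the covariance matrix of Xs.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
v_pca = eigvecs[:, -1]                        # eigh sorts eigenvalues ascending

# First PLS direction: inner products <x_j, y>, rescaled to unit length.
w_pls = Xs.T @ yc
w_pls = w_pls / np.linalg.norm(w_pls)

z_pca, z_pls = Xs @ v_pca, Xs @ w_pls
cov_pca = abs(np.cov(z_pca, y)[0, 1])
cov_pls = abs(np.cov(z_pls, y)[0, 1])
corr_pca = abs(np.corrcoef(z_pca, y)[0, 1])
corr_pls = abs(np.corrcoef(z_pls, y)[0, 1])

print("first PCA direction:", np.round(v_pca, 2))
print("first PLS direction:", np.round(w_pls, 2))
print(f"|cov with y|  PCA: {cov_pca:.3f}  PLS: {cov_pls:.3f}")
print(f"|corr with y| PCA: {corr_pca:.3f}  PLS: {corr_pls:.3f}")
```

The covariance ordering holds for any data set; the ordering of the correlations is a feature of this particular simulated example.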

Real study example

We illustrate PLS with data from a recent study on the association of loneliness and wisdom with gut microbial diversity and composition.4 Loneliness and wisdom have opposite effects on health and well-being. Loneliness is a serious public health problem associated with increased morbidity and mortality. Wisdom is associated with better health and well-being. Nguyen et al 4 successfully applied PLS to demonstrate associations of loneliness and wisdom with alpha-diversity. We use this study to illustrate the advantages of PLS over standard linear regression. More details about the study population, measures of loneliness, wisdom, gut microbial diversity and other outcomes, and additional findings can be found in the paper.

The study included 184 community-dwelling adults (28–97 years). Participants completed validated scales of loneliness (UCLA Loneliness Scale),5 wisdom (including cognitive, affective and reflective dimensions; Three-Dimensional Wisdom Scale),6 compassion (Santa Clara Brief Compassion Scale),7 social support (Emotional Support Scale)8 and social engagement (Cognitively Stimulating Questionnaire).9 These variables are interrelated: loneliness and wisdom are strongly inversely correlated, and social support and social engagement are often inversely correlated with loneliness, although they are distinct concepts. Faecal samples were obtained from participants using at-home self-collection kits and returned via mail. Alpha-diversity is the ecological diversity (ie, richness, evenness, compositional complexity) of a single sample and was quantified using Faith’s Phylogenetic Diversity (PD) based on the DNA extracted from the faecal samples. It measures the total length of branches in a reference phylogenetic tree for all species in a given sample.10

We first fit a standard linear regression to model the association of alpha-diversity with individual loneliness, wisdom, compassion, social support and social engagement outcomes as predictors, controlling for age and body mass index (BMI). Shown in table 1 are the estimated regression coefficients (β) for the predictors and covariates, along with the associated t-statistics (t) and p values. As seen, none of the predictors was significant.

Table 1

Results from linear regression for association of alpha-diversity (Faith’s Phylogenetic Diversity) with loneliness and wisdom outcomes, controlling for covariates

We then applied PLS to construct composite variables from all the predictors and regressed alpha-diversity on the extracted composite variables and the covariates. To choose the number of composite variables, we examined the contribution of each added component to the explained variability in alpha-diversity.11 We settled on the first two composite variables because adding component 3 decreased the adjusted R squared. Shown in table 2 are the estimated regression coefficients (β) for the predictors and covariates, along with t-statistics (t) and p values. Component 1 was significantly positively associated with alpha-diversity (p=0.008), whereas component 2 was not (p=0.217).
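
The component selection described above can be sketched as follows. This is not the authors' code: the function names, the stopping rule (stop at the first component that fails to increase the adjusted R squared) and the toy data are assumptions made for illustration. The adjusted R squared is computed as $1 - (1 - R^2)(n - 1)/(n - k - 1)$, where $k$ is the number of predictors excluding the intercept.

```python
import numpy as np

def adjusted_r2(y, y_hat, n_predictors):
    """Adjusted R squared for a fit with n_predictors (intercept excluded)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

def choose_n_components(Z, covariates, y):
    """Add composites one at a time; stop when adjusted R squared stops rising."""
    y = np.asarray(y, dtype=float)
    n = y.size
    best_k, best_adj = 0, -np.inf
    for k in range(Z.shape[1] + 1):
        # Intercept, covariates (e.g. age and BMI), and the first k composites.
        design = np.column_stack([np.ones(n), covariates, Z[:, :k]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        adj = adjusted_r2(y, design @ beta, design.shape[1] - 1)
        if adj <= best_adj:
            break
        best_k, best_adj = k, adj
    return best_k, best_adj

# Hypothetical toy data: the response depends on the first composite only.
z1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
z2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
noise = 0.5 * np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
y = 2.0 * z1 + noise
Z = np.column_stack([z1, z2])
covariates = np.empty((y.size, 0))            # no covariates in this toy example

best_k, best_adj = choose_n_components(Z, covariates, y)
print(f"components retained: {best_k}, adjusted R squared: {best_adj:.3f}")
```

Because the adjusted R squared penalises each added predictor, a component that explains little additional variability lowers it, which is the behaviour that led to retaining two components in the study.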

Table 2

Coefficients from linear regression model of partial least squares (PLS) composite variables predicting alpha-diversity (Faith’s Phylogenetic Diversity), controlling for age and BMI

When applying PLS, it is important to determine directions of effects for the original predictors of interest (loneliness, wisdom, compassion, social support and social engagement) when a composite variable is used as a predictor in the final model. Shown in table 3 are the loadings of individual predictors on the first two composite variables. The sign of the loading of a predictor on the composite variable indicates the direction of association of the predictor with the composite variable. Except for loneliness, all the predictors had positive loadings on the first composite variable. Because this composite variable was itself positively associated with alpha-diversity (table 2), wisdom, compassion, social support and social engagement had positive associations with alpha-diversity, whereas loneliness, with its negative loading, had a negative association. The first composite variable also accounted for 40% of the total variability of the psychosocial variables.

Table 3

Loadings for PLS composite variables

To illustrate the differences between PLS and PCA, we also applied PCA to construct composite variables and used them as explanatory variables in modelling the association of alpha-diversity with the psychosocial variables. To be consistent with the PLS analysis, we used the composite variables from the first two eigenvectors and controlled for age and BMI. Shown in table 4 are the estimated regression coefficients (β) for the predictors and covariates, along with t-statistics (t) and p values (component 1: p=0.015; component 2: p=0.190). The results were similar to their PLS counterparts. A notable difference is the slightly weaker association between the first composite variable and alpha-diversity. Both PCA and PLS yielded the same conclusion regarding the association of composite variables with alpha-diversity.

Table 4

Coefficients from linear regression model of principal component analysis (PCA) composite variables predicting alpha-diversity (Faith’s Phylogenetic Diversity), controlling for age and BMI

Shown in table 5 are the loadings of individual predictors on the first two PCA composite variables. The signs of the loadings are consistent with their PLS counterparts. The wisdom-cognitive subscore had a smaller loading under PLS than under PCA, while compassion, social support and social engagement had higher loadings under PLS than under PCA.

Table 5

Loadings for the first two PCA composite variables

Discussion

In this report, we described the partial least squares (PLS) regression, discussed its relationship with a closely related alternative, the principal component analysis (PCA), and illustrated the PLS with a real study example. Although both aim to reduce explanatory variables (predictors), PLS and PCA work quite differently in developing composite variables. While PCA constructs the composite variables to explain the maximum variability in all the original predictors, or the explanatory variables of interest, PLS creates its composite variables to explain the maximum variability in the response within the context of linear regression.

In practice, if the goal is to develop a set of composite variables for use as explanatory variables in regression models for multiple responses, PCA may be preferred since, unlike PLS, it will create a common set of composite variables for regression across all the responses. On the other hand, if the objective is to develop a set of composite variables to explain the maximum variability for a given response, then PLS should be used. When applying PLS to develop composite variables for regression analysis for multiple responses, multiple sets of composite variables will be created with one set for each response and consequently, regression results from composite variables must be interpreted with respect to factor loadings within each set of composite variables.

In the illustrative example, the two approaches yield similar results. In general, results from the two approaches may differ and yield different conclusions. For example, PLS may yield significant associations of its composite variables while PCA does not. If interest lies in finding associations of a response with a set of explanatory variables, PLS should be used.

Ethics statements

Patient consent for publication

Ethics approval

The studies involving human participants were reviewed and approved by the UCSD Human Research Protections Program. The participants provided their written informed consent to participate in this study before taking part.

References

Chenyu Liu is a PhD student in the Division of Biostatistics and Bioinformatics, Herbert Wertheim School of Public Health and Human Longevity Science, UC San Diego, USA. She is currently working as a Graduate Student Researcher at UC San Diego. She received her master’s degree in statistics from the University of Minnesota, Twin Cities, USA. Her main research interests include statistical learning, statistical methods, clinical trials and causal inference.

Footnotes

  • Contributors CL conceived the initial idea, searched the literature on related topics, performed analyses and assisted in manuscript preparation. XZ participated in the discussion of the statistical problems and helped with technical details of PLS, and helped finalise the manuscript. TTN brought the statistical problem in the real study example, participated in the discussion of the statistical problems, helped in the interpretation of estimates in the real study example and helped with technical details of least squares, and helped finalise the manuscript. JL, TW, XMT researched the statistical issues, directed simulation studies, drafted parts of the manuscript and finalised the manuscript. All authors provided critical feedback and helped shape the research, analysis and manuscript.

  • Funding This study was funded by National Institutes of Health (UL1TR001442).

  • Competing interests None declared.

  • Provenance and peer review Commissioned; externally peer reviewed.