## Partial least squares regression

As noted earlier, PCA and FA are two popular approaches for creating composite variables. Under PCA, a set of ordered composite variables is created to represent the original set of outcomes. Each composite variable is a linear combination, or a weighted sum, of the original outcomes. The coefficients, or weights, of the linear combination in each composite variable are called loadings, and their signs and magnitudes indicate the directions and contributions of the corresponding variables. Unlike the original outcomes, the composite variables are orthogonal to each other. Moreover, the first composite variable has the largest variance, followed by the second and so on. FA also creates a set of composite variables. However, unlike PCA, FA composite variables are not ordered in the sense of PCA composites and are not orthogonal to each other. Instead, loadings of the FA composites can be used to group the original variables to create subscales, or domain scales, for different constructs such as the domains of Physical Functioning and Emotional Well-being in the SF-36.1

PCA and FA create composite variables for general purposes. When composite variables are used as explanatory or independent variables in regression analysis involving a response or dependent variable, a more effective approach is partial least squares (PLS). Like PCA composites, PLS composite variables are also ordered. However, unlike PCA, PLS composite variables are ordered by their correlations with the response in the regression model; the first composite variable has the maximum correlation with the response, followed by the second and so on. If interest lies in finding a small set of composite variables that explains the most variability in the response in the linear model, PLS composite variables are more effective than those of PCA.

To describe in detail how to compute PLS composite variables, consider a linear regression with a continuous response of interest, $Y$, and a set of $p$ explanatory variables, $X_1, \ldots, X_p$. We are interested in modelling the relationship of $Y$ with $X_1, \ldots, X_p$. Given a sample of $n$ subjects, the classic linear regression relating $Y$ to $X_1, \ldots, X_p$ is given by:

$$
Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2), \quad 1 \le i \le n, \tag{1}
$$

where $i$ indexes the subjects, $\beta_0, \beta_1, \ldots, \beta_p$ denote the regression parameters, $\varepsilon_i$ is the error term, and $N(\mu, \sigma^2)$ denotes a normal distribution with mean $\mu$ and variance $\sigma^2$. The first part of the linear regression,

$$
E\!\left(Y_i \mid X_{i1}, \ldots, X_{ip}\right) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}, \tag{2}
$$

is called the conditional (population) mean of $Y_i$ given the explanatory variables $X_{i1}, \ldots, X_{ip}$. On estimating the regression parameters $\beta_0, \beta_1, \ldots, \beta_p$, this conditional mean describes the association of $Y$ with each of the explanatory variables.
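As a concrete illustration, the brief simulation below fits the linear regression above by least squares and recovers the regression parameters; all numerical values are illustrative choices, not from the source.

```python
import numpy as np

# Simulate data from the classic linear regression and estimate the
# parameters by least squares (illustrative values; sigma = 1).
rng = np.random.default_rng(0)
n, p = 500, 3
beta = np.array([2.0, 1.0, -0.5, 3.0])      # beta_0, beta_1, ..., beta_p
X = rng.normal(size=(n, p))
eps = rng.normal(size=n)                     # error term, N(0, 1)
y = beta[0] + X @ beta[1:] + eps

A = np.column_stack([np.ones(n), X])         # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

y_mean_hat = A @ beta_hat                    # estimated conditional mean
```

With $n = 500$ observations, the estimates fall close to the true parameter values, as the theory for least squares predicts.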

When the explanatory variables are highly correlated, estimates of the regression parameters $\beta_j$ from the standard least squares (LS) or maximum likelihood (ML) method may not be reliable due to multicollinearity. In studies of high-throughput data, the number of explanatory variables often exceeds the sample size, in which case the LS method does not apply at all. In both cases, we need to reduce the number of variables $X_1, \ldots, X_p$, or the dimension $p$. There are different approaches to address this issue. One may use the least absolute shrinkage and selection operator (LASSO) to determine a subset of $X_1, \ldots, X_p$ that provides reliable associations with $Y$. Alternatively, one may create composite variables from $X_1, \ldots, X_p$ and use a subset of the composite variables to predict $Y$. The latter composite variable approach is preferred if some or all of $X_1, \ldots, X_p$ need to work together to explain the variability in $Y$. For example, if one wants to predict the areas of rectangles, lengths or widths alone will not provide reliable predictions, since a rectangle with a very large length can still have a small area if its width is small. LASSO is most effective for high-throughput data, where dimension is the primary problem. In the presence of multicollinearity, it is likely more meaningful to aggregate the information in correlated variables into a subset of composite variables than to select a subset of the original variables; in this case, the correlated variables may all contribute to explaining the variability in the response, and composite variables will account for all such contributions.
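A brief simulation can illustrate why multicollinearity makes individual coefficient estimates unreliable while an aggregate of the correlated variables remains stable; the setup (two nearly identical predictors) is an illustrative assumption of ours.

```python
import numpy as np

# Two nearly identical predictors: the LS coefficient of each one alone
# varies wildly across replications, but the total weight on their
# aggregate (the composite x1 + x2) is estimated stably.
rng = np.random.default_rng(0)
n = 100
b1_hats, comp_hats = [], []
for _ in range(200):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)          # almost collinear with x1
    y = x1 + x2 + rng.normal(size=n)             # true model: y = x1 + x2 + eps
    A = np.column_stack([np.ones(n), x1, x2])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    b1_hats.append(beta_hat[1])                  # coefficient of x1 alone
    comp_hats.append(beta_hat[1] + beta_hat[2])  # weight on the composite x1 + x2

sd_single = np.std(b1_hats)        # large: individual coefficient unreliable
sd_composite = np.std(comp_hats)   # small: composite weight stable
```

Across replications, the spread of the individual coefficient is orders of magnitude larger than that of the composite weight, which is the instability that composite-variable methods such as PLS are designed to avoid.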

The composite variables for PLS are obtained by solving an optimisation problem.2 Unlike PCA, which chooses directions of maximum variance only, PLS finds directions for the composite variables that have both high variance and high correlation with the response $Y$. Specifically, let $Z_l$ denote the $l$th composite variable:

$$
Z_l = w_{l1} X_1 + w_{l2} X_2 + \cdots + w_{lp} X_p, \tag{3}
$$

where $w_{l1}, \ldots, w_{lp}$ denote the weights, or loadings, of the composite $Z_l$. We can also express (3) equivalently in a vector form,

$$
Z_l = \mathbf{w}_l^\top \mathbf{X}, \quad \mathbf{w}_l = (w_{l1}, \ldots, w_{lp})^\top, \quad \mathbf{X} = (X_1, \ldots, X_p)^\top, \tag{4}
$$

or in a matrix form:

$$
\mathbf{Z} = W^\top \mathbf{X}, \quad \mathbf{Z} = (Z_1, \ldots, Z_p)^\top, \quad W = (\mathbf{w}_1, \ldots, \mathbf{w}_p), \tag{5}
$$

where $\mathbf{w}_l$, $\mathbf{X}$ and $\mathbf{Z}$ denote column vectors, and $W$ denotes a $p \times p$ matrix. The loadings are determined by the following optimisation procedure:

$$
\widehat{\mathbf{w}}_l = \arg\max_{\substack{\|\mathbf{w}\| = 1, \; \mathbf{w}^\top S \widehat{\mathbf{w}}_m = 0 \\ 1 \le m \le l-1}} \operatorname{Corr}^2\!\left(Y, \mathbf{w}^\top \mathbf{X}\right) \operatorname{Var}\!\left(\mathbf{w}^\top \mathbf{X}\right), \tag{6}
$$

where $\mathbf{w} = (w_1, \ldots, w_p)^\top$ is a column vector, $S$ is the sample covariance matrix of $\mathbf{X}$, $\operatorname{Corr}^2(Y, \mathbf{w}^\top \mathbf{X})$ denotes the squared (Pearson) correlation between $Y$ and $\mathbf{w}^\top \mathbf{X}$, and $\operatorname{Var}(\mathbf{w}^\top \mathbf{X})$ denotes the sample variance of $\mathbf{w}^\top \mathbf{X}$. The condition $\mathbf{w}^\top S \widehat{\mathbf{w}}_m = 0$ ensures that the $l$th composite is uncorrelated with all previous composite variables ($m = 1, \ldots, l-1$).
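The optimisation criterion above can be checked numerically for the first composite: with no orthogonality constraints yet, the objective is proportional to the squared sample covariance between $Y$ and $\mathbf{w}^\top \mathbf{X}$, which, by the Cauchy–Schwarz inequality, is maximised over unit vectors by a weight vector proportional to the covariances of $Y$ with the centred explanatory variables. The simulated data below are an illustrative assumption.

```python
import numpy as np

# For l = 1, Corr^2(Y, w'X) * Var(w'X) = Cov^2(Y, w'X) / Var(Y), so the
# maximiser over unit vectors w is w* proportional to Xc' yc.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
Xc = X - X.mean(axis=0)          # centred explanatory variables
yc = y - y.mean()                # centred response

def criterion(w):
    """Corr^2(Y, w'X) * Var(w'X), the PLS objective for l = 1."""
    z = Xc @ w
    return (z @ yc) ** 2 / ((n - 1) * (yc @ yc))

w_star = Xc.T @ yc
w_star /= np.linalg.norm(w_star)

# No random unit direction should beat w*.
for _ in range(1000):
    w = rng.normal(size=p)
    w /= np.linalg.norm(w)
    assert criterion(w) <= criterion(w_star) + 1e-12
```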

In practice, we can use the following procedure to find the PLS composite variables.3 We start by standardising each of the original explanatory variables to have mean 0 and variance 1, and denote the resulting data vectors by $\mathbf{x}_j = (x_{1j}, \ldots, x_{nj})^\top$ ($j = 1, \ldots, p$) and the response vector by $\mathbf{y} = (y_1, \ldots, y_n)^\top$. Set $\widehat{\mathbf{y}}^{(0)} = \bar{y}\mathbf{1}$ and $\mathbf{x}_j^{(0)} = \mathbf{x}_j$ ($j = 1, \ldots, p$), where $\bar{y}$ denotes the sample mean of the $y_i$ ($1 \le i \le n$) and $\mathbf{1}$ denotes a column vector of 1s. For $l = 1, 2, \ldots, p$, we perform the following steps:

(a) $\mathbf{z}_l = \sum_{j=1}^{p} \widehat{w}_{lj} \mathbf{x}_j^{(l-1)}$, where $\widehat{w}_{lj} = \left\langle \mathbf{x}_j^{(l-1)}, \mathbf{y} \right\rangle$ and $\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^\top \mathbf{b}$ denotes the inner product between two vectors $\mathbf{a}$ and $\mathbf{b}$;

(b) $\widehat{\theta}_l = \langle \mathbf{z}_l, \mathbf{y} \rangle / \langle \mathbf{z}_l, \mathbf{z}_l \rangle$;

(c) $\widehat{\mathbf{y}}^{(l)} = \widehat{\mathbf{y}}^{(l-1)} + \widehat{\theta}_l \mathbf{z}_l$;

(d) Orthogonalise each $\mathbf{x}_j^{(l-1)}$ with respect to $\mathbf{z}_l$:

$$
\mathbf{x}_j^{(l)} = \mathbf{x}_j^{(l-1)} - \frac{\left\langle \mathbf{z}_l, \mathbf{x}_j^{(l-1)} \right\rangle}{\langle \mathbf{z}_l, \mathbf{z}_l \rangle} \mathbf{z}_l, \quad j = 1, \ldots, p.
$$
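Steps (a)–(d) can be sketched in NumPy as follows; the function name and the guard against a degenerate direction are our own additions, not part of the source procedure.

```python
import numpy as np

def pls_composites(X, y):
    """Compute PLS composite variables z_1, z_2, ... and the fitted
    response values, following steps (a)-(d); X is n x p, y length n."""
    n, p = X.shape
    # Standardise each explanatory variable to mean 0 and variance 1.
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    y_hat = np.full(n, y.mean())                  # y-hat^(0) = ybar * 1
    Xl = X.copy()                                 # x_j^(0) = x_j
    Z = []
    for l in range(p):
        w = Xl.T @ y                              # (a) loadings <x_j^(l-1), y>
        z = Xl @ w                                #     composite z_l
        if z @ z < 1e-12:                         # remaining x_j carry no signal
            break
        theta = (z @ y) / (z @ z)                 # (b) regression coefficient
        y_hat = y_hat + theta * z                 # (c) update fitted values
        Xl = Xl - np.outer(z, z @ Xl) / (z @ z)   # (d) orthogonalise each x_j
        Z.append(z)
    return np.column_stack(Z), y_hat
```

The composites produced this way are mutually orthogonal, and when all $p$ steps are run with $X$ of full column rank, the fitted values coincide with the ordinary least squares fit.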
To illustrate the difference between PLS and PCA, here is the procedure for computing composite variables under PCA:

We start with the $n \times p$ data matrix $X$, which is formed by the column vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$, that is, $X = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p)$. Then we perform the following steps:

(a) Average each column of $X$: $\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}$, $j = 1, \ldots, p$;

(b) Centre the matrix $X$ by subtracting the corresponding average from each column vector of $X$, and denote the centred matrix as $\widetilde{X} = (\mathbf{x}_1 - \bar{x}_1 \mathbf{1}, \ldots, \mathbf{x}_p - \bar{x}_p \mathbf{1})$;

(c) Compute the sample variance–covariance matrix: $\Sigma = \frac{1}{n-1} \widetilde{X}^\top \widetilde{X}$;

(d) Compute the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ of $\Sigma$ and the corresponding eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_p$, with $\Sigma \mathbf{v}_k = \lambda_k \mathbf{v}_k$.

The top $m$ ($m \le p$) principal components, or composite variables, $\mathbf{z}_k = \widetilde{X} \mathbf{v}_k$ ($k = 1, \ldots, m$), are then used as independent variables in linear regression models with $Y$ as the dependent variable. Here $m$ is generally determined by the magnitude of the sum of the top $m$ eigenvalues relative to the sum of all $p$ eigenvalues, $\sum_{k=1}^{m} \lambda_k \big/ \sum_{k=1}^{p} \lambda_k$, which has the interpretation of being the percent of the variability of $X$ explained by the top $m$ eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_m$.
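The PCA steps above can be sketched as follows; this is a minimal sketch with an illustrative function name, using the eigendecomposition of the sample covariance matrix.

```python
import numpy as np

def pca_composites(X, m):
    """Compute the top m principal components of X and the proportion
    of variability they explain, via the sample covariance matrix."""
    n, p = X.shape
    Xt = X - X.mean(axis=0)                 # centre each column
    Sigma = Xt.T @ Xt / (n - 1)             # sample variance-covariance matrix
    lam, V = np.linalg.eigh(Sigma)          # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]           # re-sort in descending order
    lam, V = lam[order], V[:, order]
    explained = lam[:m].sum() / lam.sum()   # percent of variability explained
    Z = Xt @ V[:, :m]                       # top m composite variables
    return Z, explained
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, so the explicit re-sorting is needed to place the largest-variance component first.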

By comparing the two procedures, we can see that PCA creates composite variables without using any information in the dependent variable $Y$, whereas PLS uses this information in creating its composite variables. If the goal is to find composite variables of $X$ that are most predictive of $Y$, PLS is preferable to PCA. On the other hand, if the goal is to find composite variables that maximally explain the variability of the data matrix $X$, then PCA is preferable.
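This contrast can be made concrete with a small simulation; the data-generating choices (a high-variance first column unrelated to the response, a low-variance last column driving it) are our own illustrative assumptions. The first PLS direction maximises the covariance with the response over unit weight vectors, so its composite can never have a smaller covariance with the response than the first principal component.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 300, 6
X = rng.normal(size=(n, p))
X[:, 0] *= 5.0                     # high-variance column, unrelated to y
y = X[:, -1] + 0.1 * rng.normal(size=n)  # y driven by a low-variance column

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# First PLS direction: maximises |Cov(Y, w'X)| over unit vectors w.
w_pls = Xc.T @ yc
w_pls /= np.linalg.norm(w_pls)

# First PCA direction: maximises Var(w'X) over unit vectors w.
Sigma = Xc.T @ Xc / (n - 1)
lam, V = np.linalg.eigh(Sigma)
w_pca = V[:, np.argmax(lam)]

def abs_cov(w):
    """Absolute sample covariance between Y and the composite w'X."""
    return abs((Xc @ w) @ yc / (n - 1))
```

Here PCA locks onto the high-variance but irrelevant first column, while PLS, guided by the response, weights the relevant one.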