In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). Typically, it considers regressing the outcome (also known as the response or, the dependent variable) on a set of covariates (also known as predictors or, explanatory variables or, independent variables) based on a standard linear regression model, but uses PCA for estimating the unknown regression coefficients in the model.
In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors. One typically uses only a subset of all the principal components for regression, making PCR a kind of regularized procedure. Often the principal components with higher variances (the ones based on eigenvectors corresponding to the larger eigenvalues of the sample variance-covariance matrix of the explanatory variables) are selected as regressors. However, for the purpose of predicting the outcome, the principal components with low variances may also be important, and in some cases even more so.
One major use of PCR lies in overcoming the multicollinearity problem which arises when two or more of the explanatory variables are close to being collinear. PCR can aptly deal with such situations by excluding some of the low-variance principal components in the regression step. In addition, by usually regressing on only a subset of all the principal components, PCR can result in dimension reduction through substantially lowering the effective number of parameters characterizing the underlying model. This can be particularly useful in settings with high-dimensional covariates. Also, through appropriate selection of the principal components to be used for regression, PCR can lead to efficient prediction of the outcome based on the assumed model.
The principle
The PCR method may be broadly divided into three major steps:
1. Perform PCA on the observed data matrix of the explanatory variables to obtain the principal components, and then (usually) select a subset of the principal components so obtained for further use.
2. Regress the observed vector of outcomes on the selected principal components as covariates, using ordinary least squares (OLS), to get a vector of estimated regression coefficients (with dimension equal to the number of selected principal components).
3. Transform this vector back to the scale of the actual covariates, using the selected principal component directions, to obtain the final PCR estimator (with dimension equal to the total number of covariates) for estimating the regression coefficients characterizing the original model.
Details of the method
Data Representation: Let $Y_{n \times 1}$ denote the vector of observed outcomes and $X_{n \times p}$ denote the corresponding data matrix of observed covariates, where $n$ and $p$ denote the size of the observed sample and the number of covariates respectively, with $n \geq p$. Each of the $n$ rows of $X$ denotes one set of observations for the $p$ covariates.

Data Pre-processing: Assume that $Y$ and each of the $p$ columns of $X$ have already been centered so that all of them have zero empirical means. This centering step is crucial (at least for the columns of $X$), since PCR involves the use of PCA on $X$, and PCA is sensitive to the centering of the data.

Underlying Model: Following centering, the standard Gauss–Markov linear regression model for $Y$ on $X$ can be represented as $Y = X\beta + \varepsilon$, where $\beta \in \mathbb{R}^p$ denotes the unknown vector of regression coefficients and $\varepsilon$ denotes the vector of random errors with $\mathbb{E}(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_{n \times n}$ for some unknown variance parameter $\sigma^2 > 0$.

Objective: The primary goal is to obtain an efficient estimator $\hat{\beta}$ for the parameter $\beta$, based on the data. One frequently used approach is ordinary least squares regression, which (assuming $X$ has full column rank) gives the unbiased estimator $\hat{\beta}_{\mathrm{ols}} = (X^{\top}X)^{-1}X^{\top}Y$ of $\beta$. PCR is another technique that may be used for the same purpose.

PCA Step: PCR starts by performing a PCA on the centered data matrix $X$. For this, let $X = U \Delta V^{\top}$ denote the singular value decomposition of $X$, where $\Delta = \mathrm{diag}(\delta_1, \ldots, \delta_p)$ with $\delta_1 \geq \cdots \geq \delta_p \geq 0$ denoting the non-negative singular values of $X$, while the columns of $U$ and $V$ are orthonormal sets of vectors denoting the left and right singular vectors of $X$ respectively.

The Principal Components: $V \Lambda V^{\top}$ then gives the spectral decomposition of $X^{\top}X$, where $\Lambda = \Delta^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ with $\lambda_j = \delta_j^2$ denoting the eigenvalues of $X^{\top}X$ in decreasing order. For each $j \in \{1, \ldots, p\}$, the $j$-th column $v_j$ of $V$ defines the $j$-th principal component direction, and the corresponding vector $X v_j$ is the $j$-th principal component of $X$.

Derived covariates: For any $k \in \{1, \ldots, p\}$, let $V_k$ denote the $p \times k$ matrix whose columns are the first $k$ principal component directions. Then $W_k = X V_k$ is the $n \times k$ matrix whose columns are the first $k$ principal components of $X$; these serve as the derived covariates in PCR.

The PCR Estimator: Let $\hat{\gamma}_k = (W_k^{\top} W_k)^{-1} W_k^{\top} Y$ denote the vector of estimated regression coefficients obtained by ordinary least squares regression of the response vector $Y$ on the derived data matrix $W_k$. Then, for any $k \in \{1, \ldots, p\}$, the PCR estimator of $\beta$ based on the first $k$ principal components is given by $\hat{\beta}_k = V_k \hat{\gamma}_k$.
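The steps above can be sketched in a few lines of numpy. This is a minimal illustration; the function name `pcr_fit` and the toy data are invented here, not taken from the text:

```python
import numpy as np

def pcr_fit(X, Y, k):
    """PCR estimate of beta using the first k principal components.

    Assumes X (n x p) and Y (n,) are already column-centered,
    as in the pre-processing step above.
    """
    # PCA step via the SVD: X = U @ diag(d) @ Vt
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T                 # first k principal component directions (p x k)
    W_k = X @ V_k                  # derived covariates: first k principal components
    # OLS regression of Y on the derived covariates W_k
    gamma_k, *_ = np.linalg.lstsq(W_k, Y, rcond=None)
    return V_k @ gamma_k           # transform back to the original covariate scale

# toy example (illustrative data only)
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
X -= X.mean(axis=0)                # center the columns
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
Y = X @ beta_true + 0.1 * rng.standard_normal(50)
Y -= Y.mean()
beta_hat = pcr_fit(X, Y, k=3)
```

Note that with `k` equal to the number of columns of `X`, the function reproduces the ordinary least squares fit, as discussed in the next section.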
Two basic properties
The fitting process for obtaining the PCR estimator involves regressing the response vector on the derived data matrix $W_k$, whose columns (the selected principal components) are mutually orthogonal. Consequently, in the regression step, performing a multiple linear regression jointly on the $k$ selected principal components yields the same coefficients as performing $k$ independent simple linear regressions (univariate regressions) separately on each of the $k$ selected components.

When all the principal components are selected for regression, so that $k = p$, the PCR estimator is equivalent to the ordinary least squares estimator: $\hat{\beta}_p = \hat{\beta}_{\mathrm{ols}}$. This follows easily from $W_p = X V$ and the orthogonality of $V$.
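The second property is easy to verify numerically. The sketch below (toy data invented here) computes the full-rank PCR estimator and the OLS estimator side by side:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)   # centered covariates
Y = X @ np.array([2.0, -1.0, 0.3]) + rng.standard_normal(n); Y -= Y.mean()

# PCR using all p principal components
U, d, Vt = np.linalg.svd(X, full_matrices=False)
W = X @ Vt.T                                   # all p principal components
gamma = np.linalg.lstsq(W, Y, rcond=None)[0]
beta_pcr = Vt.T @ gamma                        # transform back to covariate scale

# ordinary least squares on X directly
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]

print(np.allclose(beta_pcr, beta_ols))         # the two estimators coincide
```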
Variance reduction
For any $k \in \{1, \ldots, p\}$, the variance of $\hat{\beta}_k$ is given by
$\mathrm{Var}(\hat{\beta}_k) = \sigma^2 \, V_k \Lambda_k^{-1} V_k^{\top} = \sigma^2 \sum_{j=1}^{k} \lambda_j^{-1} v_j v_j^{\top},$
where $\Lambda_k = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$.
In particular:
$\mathrm{Var}(\hat{\beta}_p) = \mathrm{Var}(\hat{\beta}_{\mathrm{ols}}) = \sigma^2 \sum_{j=1}^{p} \lambda_j^{-1} v_j v_j^{\top}.$
Hence for all $k \in \{1, \ldots, p-1\}$:
$\mathrm{Var}(\hat{\beta}_{\mathrm{ols}}) - \mathrm{Var}(\hat{\beta}_k) = \sigma^2 \sum_{j=k+1}^{p} \lambda_j^{-1} v_j v_j^{\top} \succeq 0.$
Thus, for all $k \in \{1, \ldots, p-1\}$:
$\mathrm{Var}(\hat{\beta}_{\mathrm{ols}}) \succeq \mathrm{Var}(\hat{\beta}_k),$
where $A \succeq B$ indicates that $A - B$ is non-negative definite. Consequently, any given linear form of the PCR estimator has a lower variance than that of the same linear form of the ordinary least squares estimator.
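The variance identity and the non-negative definite ordering can both be checked numerically. In the sketch below (data invented here), the error variance $\sigma^2$ is assumed to be 1 for simplicity:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 4)); X -= X.mean(axis=0)
sigma2 = 1.0                                   # assume unit error variance

lam, V = np.linalg.eigh(X.T @ X)               # eigen-decomposition of X'X
order = np.argsort(lam)[::-1]                  # sort eigenvalues in decreasing order
lam, V = lam[order], V[:, order]

# Var(beta_ols) = sigma^2 (X'X)^{-1} = sigma^2 * sum_j lam_j^{-1} v_j v_j'
var_ols = sigma2 * np.linalg.inv(X.T @ X)
assert np.allclose(var_ols, sigma2 * (V / lam) @ V.T)

# Var(beta_k) keeps only the first k terms of the sum
k = 2
var_k = sigma2 * (V[:, :k] / lam[:k]) @ V[:, :k].T

# their difference is non-negative definite: every eigenvalue >= 0
diff_eigs = np.linalg.eigvalsh(var_ols - var_k)
print(diff_eigs.min() >= -1e-10)
```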
Addressing multicollinearity
Under multicollinearity, two or more of the covariates are highly correlated, so that one can be linearly predicted from the others with a non-trivial degree of accuracy. Consequently, the columns of the data matrix $X$ that correspond to these covariates tend to be nearly linearly dependent, and $X^{\top}X$ becomes nearly singular: some of its eigenvalues $\lambda_j$ are close to zero. Since the variance of the ordinary least squares estimator involves the terms $\lambda_j^{-1} v_j v_j^{\top}$, these near-zero eigenvalues inflate its variance dramatically, making it highly unstable. PCR deals with this problem by simply excluding from the regression step the low-variance principal components corresponding to these near-zero eigenvalues, thereby avoiding the variance inflation at the cost of some bias.
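A small numerical illustration of this effect (toy data invented here): two nearly collinear columns produce a tiny trailing eigenvalue, and dropping the associated component removes almost all of the variance inflation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 1e-3 * rng.standard_normal(n)        # nearly collinear with x1
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3]); X -= X.mean(axis=0)

lam = np.linalg.eigvalsh(X.T @ X)[::-1]        # eigenvalues in decreasing order

# trace of the variance (up to sigma^2): sum of reciprocal eigenvalues
var_ols_trace = np.sum(1.0 / lam)              # OLS uses all eigenvalues
var_pcr_trace = np.sum(1.0 / lam[:2])          # PCR drops the low-variance component

print(lam[-1] < 1e-2 * lam[0])                 # the smallest eigenvalue is tiny
print(var_pcr_trace < 1e-3 * var_ols_trace)    # PCR variance is far smaller
```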
Dimension reduction
PCR may also be used for performing dimension reduction. To see this, let $L_k$ denote any $p \times k$ matrix with orthonormal columns, for some $k \in \{1, \ldots, p\}$, and suppose we wish to approximate the observations (the rows of $X$) through their projections onto a $k$-dimensional subspace, i.e., approximate $X$ by $X L_k L_k^{\top}$.

Then, it can be shown that the total squared reconstruction error
$\| X - X L_k L_k^{\top} \|_F^2$
is minimized at $L_k = V_k$, the matrix of the first $k$ principal component directions.

The corresponding minimal reconstruction error is given by:
$\| X - X V_k V_k^{\top} \|_F^2 = \sum_{j=k+1}^{p} \lambda_j.$

Thus any potential dimension reduction may be achieved by choosing $k$, the number of principal components to be used, through appropriate thresholding on the cumulative sum of the eigenvalues of $X^{\top}X$: since the smaller eigenvalues contribute little to this sum, the corresponding principal components may continue to be dropped as long as the desired threshold is not exceeded. The same criterion may also be used for addressing the multicollinearity issue, whereby the principal components corresponding to the smaller eigenvalues are ignored as long as the threshold limit is maintained.
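The reconstruction-error identity is easy to verify directly (toy data invented here):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 5)); X -= X.mean(axis=0)

lam, V = np.linalg.eigh(X.T @ X)               # eigen-decomposition of X'X
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

k = 2
V_k = V[:, :k]
X_k = X @ V_k @ V_k.T                          # projection onto the top-k directions
err = np.linalg.norm(X - X_k, 'fro') ** 2      # squared reconstruction error

print(np.isclose(err, lam[k:].sum()))          # equals the sum of trailing eigenvalues
```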
Regularization effect
Since the PCR estimator typically uses only a subset of all the principal components for regression, it can be viewed as a kind of regularized procedure. More specifically, for any $k \in \{1, \ldots, p\}$, the PCR estimator $\hat{\beta}_k$ is the solution to the constrained minimization problem:
$\min_{\beta \in \mathbb{R}^p} \| Y - X\beta \|^2 \quad \text{subject to} \quad \beta \in \mathrm{span}(V_k).$
The constraint may be equivalently written as:
$V_{(p-k)}^{\top} \beta = 0,$
where $V_{(p-k)}$ denotes the $p \times (p-k)$ matrix whose columns are the excluded principal component directions $v_{k+1}, \ldots, v_p$.

Thus, when only a proper subset of all the principal components is selected for regression, the PCR estimator so obtained is based on a hard form of regularization that constrains the resulting solution to the column space of the selected principal component directions, and consequently restricts it to be orthogonal to the excluded directions.
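This orthogonality to the excluded directions can be checked directly (toy data invented here):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((40, 4)); X -= X.mean(axis=0)
Y = X @ np.array([1.0, 0.5, -1.5, 2.0]) + rng.standard_normal(40); Y -= Y.mean()

U, d, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
V_k = Vt[:k].T                                 # retained directions
gamma = np.linalg.lstsq(X @ V_k, Y, rcond=None)[0]
beta_pcr = V_k @ gamma                         # PCR estimate, lies in span(V_k)

# the estimate is orthogonal to every excluded principal component direction
excluded = Vt[k:].T
print(np.allclose(excluded.T @ beta_pcr, 0.0))
```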
Optimality of PCR among a class of regularized estimators
Given the constrained minimization problem as defined above, consider the following generalized version of it:
$\min_{\beta \in \mathbb{R}^p} \| Y - X\beta \|^2 \quad \text{subject to} \quad \beta \in \mathrm{span}(L),$
where $L$ denotes any arbitrary $p \times k$ matrix of rank $k$, for some $k \in \{1, \ldots, p\}$. The PCR constraint corresponds to the particular choice $L = V_k$.

Let $\hat{\beta}_L$ denote the solution of this generalized problem for a given restriction matrix $L$. Then, it can be shown that the optimal choice of the restriction matrix $L$, for which the corresponding constrained estimator achieves the minimum prediction error, is any $p \times k$ matrix whose column space coincides with that of $V_k$, i.e., with the span of the first $k$ principal component directions. Quite clearly, the resulting optimal estimator is then simply given by the PCR estimator $\hat{\beta}_k$ based on the first $k$ principal components. Hence, in this sense, PCR is optimal within this class of hard-constrained regularized estimators.
Efficiency
Since the ordinary least squares estimator is unbiased for $\beta$, we have
$\mathrm{MSE}(\hat{\beta}_{\mathrm{ols}}) = \mathrm{tr}\big(\mathrm{Var}(\hat{\beta}_{\mathrm{ols}})\big),$
where MSE denotes the mean squared error $\mathrm{MSE}(\hat{\beta}) = \mathbb{E}\,\| \hat{\beta} - \beta \|^2$. Now, if for some $k \in \{1, \ldots, p-1\}$ we have $V_{(p-k)}^{\top}\beta = 0$, where $V_{(p-k)}$ denotes the matrix of the excluded principal component directions $v_{k+1}, \ldots, v_p$ (so that $\beta$ lies in the span of the first $k$ principal component directions), then the corresponding PCR estimator $\hat{\beta}_k$ is also unbiased for $\beta$, and hence
$\mathrm{MSE}(\hat{\beta}_k) = \mathrm{tr}\big(\mathrm{Var}(\hat{\beta}_k)\big).$
We have already seen that
$\mathrm{Var}(\hat{\beta}_{\mathrm{ols}}) \succeq \mathrm{Var}(\hat{\beta}_k),$
which then implies:
$\mathrm{MSE}(\hat{\beta}_k) \leq \mathrm{MSE}(\hat{\beta}_{\mathrm{ols}})$
for that particular $k$.

Suppose now that for a given $k$, $V_{(p-k)}^{\top}\beta \neq 0$, so that $\hat{\beta}_k$ is biased for $\beta$. Even then,
it is still possible that $\mathrm{MSE}(\hat{\beta}_k) \leq \mathrm{MSE}(\hat{\beta}_{\mathrm{ols}})$, provided the reduction in variance achieved by excluding the principal components with small eigenvalues outweighs the squared bias so incurred, as is typically the case under severe multicollinearity.

In order to ensure efficient estimation and prediction performance of PCR as an estimator of $\beta$, the principal components to be used for regression should therefore be selected based on their association with the outcome, and not merely on their variances; in practice, cross-validation is a commonly used, data-dependent method for making this selection.

Unlike the criteria based on the cumulative sum of the eigenvalues of $X^{\top}X$, which are probably more suited for addressing the multicollinearity problem and for performing dimension reduction, outcome-based selection criteria actually attempt to improve the prediction and estimation efficiency of the PCR estimator by involving both the outcome as well as the covariates in the process of selecting the principal components to be used in the regression step.
Shrinkage effect of PCR
In general, PCR is essentially a shrinkage estimator that usually retains the high-variance principal components (corresponding to the higher eigenvalues of $X^{\top}X$) as covariates in the model and discards the remaining low-variance components entirely. Thus it exerts a discrete, all-or-nothing shrinkage effect on the low-variance components, nullifying their contribution completely. In contrast, the ridge regression estimator exerts a smooth shrinkage effect through its regularization (tuning) parameter: it does not completely discard any of the components, but shrinks all of them in a continuous manner, with the extent of shrinkage being greater for the low-variance components and smaller for the high-variance components.

In addition, the principal components are obtained from the eigen-decomposition of $X^{\top}X$, which involves the observed values of the covariates only and not the outcome. Thus the selection of the principal components to be used for regression does not, by itself, depend on the outcome of interest, and there is no guarantee that the retained high-variance components will be the ones most relevant for predicting it; components with low variances may in fact be among the most predictive. Partial least squares (PLS) regression addresses this issue by deriving components that involve the outcome as well as the covariates.

Recently, a variant of the classical PCR known as supervised PCR was proposed by Bair, Hastie, Paul and Tibshirani (2006). In a spirit similar to that of PLS, it attempts to obtain derived covariates of lower dimension based on a criterion that involves both the outcome as well as the covariates. The method starts by performing a set of $p$ simple linear regressions (univariate regressions), regressing the outcome separately on each of the $p$ covariates. Only those covariates whose estimated univariate regression coefficients exceed a chosen threshold in absolute value are retained, and classical PCR is then performed on this reduced set of covariates, with the threshold and the number of components typically chosen by cross-validation.
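The screening idea behind supervised PCR can be sketched as follows. This is a minimal illustration only: the function name, the screening statistic (absolute univariate OLS slope), and the fixed threshold are illustrative choices made here, not prescriptions from Bair et al. (2006), who tune such quantities by cross-validation.

```python
import numpy as np

def supervised_pcr(X, Y, threshold, k):
    """Sketch of supervised PCR: univariate screening, then ordinary PCR.

    X (n x p) and Y (n,) are assumed centered. Returns the boolean mask of
    retained covariates and the PCR coefficient vector on those covariates.
    """
    # univariate OLS slope of Y on each (centered) column of X
    scores = np.abs(X.T @ Y) / np.sum(X ** 2, axis=0)
    keep = scores > threshold                  # retain strongly associated covariates
    X_s = X[:, keep]
    # ordinary PCR on the retained covariates
    U, d, Vt = np.linalg.svd(X_s, full_matrices=False)
    V_k = Vt[:k].T
    gamma = np.linalg.lstsq(X_s @ V_k, Y, rcond=None)[0]
    return keep, V_k @ gamma

# toy example: only the first covariate drives the outcome
rng = np.random.default_rng(6)
X = rng.standard_normal((80, 6)); X -= X.mean(axis=0)
Y = X[:, 0] * 3.0 + 0.5 * rng.standard_normal(80); Y -= Y.mean()
keep, beta_s = supervised_pcr(X, Y, threshold=1.0, k=1)
```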
Generalization to kernel settings
The classical PCR method as described above is based on classical PCA and considers a linear regression model for predicting the outcome based on the covariates. However, it can be easily generalized to a kernel machine setting whereby the regression function need not necessarily be linear in the covariates, but instead it can belong to the Reproducing Kernel Hilbert Space associated with any arbitrary (possibly non-linear), symmetric positive-definite kernel. The linear regression model turns out to be a special case of this setting when the kernel function is chosen to be the linear kernel.
In general, under the kernel machine setting, the vector of covariates is first mapped into a high-dimensional (potentially infinite-dimensional) feature space characterized by the kernel function chosen. The mapping so obtained is known as the feature map and each of its coordinates, also known as the feature elements, corresponds to one feature (may be linear or, non-linear) of the covariates. The regression function is then assumed to be a linear combination of these feature elements. Thus, the underlying regression model in the kernel machine setting is essentially a linear regression model with the understanding that instead of the original set of covariates, the predictors are now given by the vector (potentially infinite-dimensional) of feature elements obtained by transforming the actual covariates using the feature map.
However, the kernel trick actually enables us to operate in the feature space without ever explicitly computing the feature map. It turns out that it is sufficient to compute the pairwise inner products among the feature maps for the observed covariate vectors, and these inner products are simply given by the values of the kernel function evaluated at the corresponding pairs of covariate vectors. The pairwise inner products so obtained may therefore be represented in the form of an $n \times n$ symmetric, non-negative definite matrix known as the kernel matrix.
PCR in the kernel machine setting can now be implemented by first appropriately centering this kernel matrix (K, say) with respect to the feature space and then performing a kernel PCA on the centered kernel matrix (K', say) whereby an eigendecomposition of K' is obtained. Kernel PCR then proceeds by (usually) selecting a subset of all the eigenvectors so obtained and then performing a standard linear regression of the outcome vector on these selected eigenvectors. The eigenvectors to be used for regression are usually selected using cross-validation. The estimated regression coefficients (having the same dimension as the number of selected eigenvectors) along with the corresponding selected eigenvectors are then used for predicting the outcome for a future observation. In machine learning, this technique is also known as spectral regression.
Clearly, kernel PCR has a discrete shrinkage effect on the eigenvectors of K', quite similar to the discrete shrinkage effect of classical PCR on the principal components, as discussed earlier. However, it should be noted that the feature map associated with the chosen kernel could potentially be infinite-dimensional, and hence the corresponding principal components and principal component directions could be infinite-dimensional as well. Therefore, these quantities are often practically intractable under the kernel machine setting. Kernel PCR essentially works around this problem by considering an equivalent dual formulation based on the spectral decomposition of the associated kernel matrix. Under the linear regression model (which corresponds to choosing the kernel function as the linear kernel), this amounts to considering a spectral decomposition of the corresponding $n \times n$ Gram matrix $X X^{\top}$ instead of $X^{\top} X$, the two sharing the same non-zero eigenvalues.
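The equivalence in the linear-kernel case can be verified numerically: regressing on the top eigenvectors of the Gram matrix $XX^{\top}$ gives the same fitted values as classical PCR with the same number of components. A minimal sketch (toy data invented here; since $X$ is centered, the linear-kernel matrix needs no further feature-space centering):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((30, 3)); X -= X.mean(axis=0)
Y = X @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(30); Y -= Y.mean()

# linear-kernel (Gram) matrix; X is centered, so K is already centered here
K = X @ X.T
eigval, eigvec = np.linalg.eigh(K)
order = np.argsort(eigval)[::-1]
E_k = eigvec[:, order[:2]]                     # top-2 eigenvectors of K

# kernel PCR: regress Y on the selected eigenvectors
fit_kernel = E_k @ np.linalg.lstsq(E_k, Y, rcond=None)[0]

# classical PCR with the same number of components
U, d, Vt = np.linalg.svd(X, full_matrices=False)
W_k = X @ Vt[:2].T                             # first two principal components
fit_classical = W_k @ np.linalg.lstsq(W_k, Y, rcond=None)[0]

print(np.allclose(fit_kernel, fit_classical))  # the two fits coincide
```

The fits agree because the top eigenvectors of $XX^{\top}$ span the same subspace as the corresponding principal components of $X$.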