Low-rank matrix approximations are essential tools in the application of kernel methods to large-scale learning problems.
Contents
- Nyström approximation
- Theorem for kernel approximation
- Proof
- General theorem for kernel approximation for a feature map
- Application for Regularized Least Squares
- Randomized feature maps approximation
- Random Fourier Features
- Random Binning Features
- Comparison of approximation methods
- References
Kernel methods (for instance, support vector machines or Gaussian processes) project data points into a high-dimensional or infinite-dimensional feature space and find the optimal splitting hyperplane. In kernel methods the data is represented in a kernel matrix (or Gram matrix), and many learning algorithms can be formulated entirely in terms of this matrix. The main drawback of kernel methods is the high computational cost associated with kernel matrices: the cost is at least quadratic in the number of training data points, and since most kernel methods involve matrix inversion or eigenvalue decomposition, it becomes cubic. Large training sets therefore cause large storage and computational costs. Although low-rank decomposition methods (e.g., Cholesky decomposition) reduce this cost, they still require computing the full kernel matrix. One of the approaches to deal with this problem is low-rank matrix approximation. The most popular examples are the Nyström method and random features; both have been successfully applied to efficient kernel learning.
Nyström approximation
Kernel methods become unfeasible when the number of points $n$ is so large that the kernel matrix $\hat{K}$ cannot be stored in memory.

If $n$ is the number of training examples, the storage and computational cost required to find the solution of the problem using a general kernel method is $O(n^2)$ and $O(n^3)$ respectively. The Nyström approximation can allow a significant speed-up of the computations. This speed-up is achieved by using, instead of the kernel matrix, its approximation $\tilde{K}$ of rank $q$. An advantage of the method is that it is not necessary to compute or store the whole kernel matrix, but only its submatrix of size $n \times q$.

It reduces the storage and complexity requirements to $O(nq)$ and $O(nq^2)$ respectively.
Theorem for kernel approximation
$\hat{K}$ is a kernel matrix for some kernel method. Consider the first $q < n$ points of the training set. Then there exists a matrix $\tilde{K}$ of rank $q$:

$$\tilde{K} = \hat{K}_{n,q} \hat{K}_q^{-1} \hat{K}_{n,q}^{\mathrm{T}},$$

where

$$(\hat{K}_q)_{ij} = K(x_i, x_j), \quad i, j = 1, \dots, q,$$

$\hat{K}_q$ is invertible, and

$$(\hat{K}_{n,q})_{ij} = K(x_i, x_j), \quad i = 1, \dots, n, \; j = 1, \dots, q.$$
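A minimal NumPy sketch of the construction in the theorem (the function name and the linear-kernel test case are illustrative, not part of the original method description); note that only the $n \times q$ and $q \times q$ blocks of the kernel matrix are ever formed:

```python
import numpy as np

def nystrom(K_nq, K_q):
    """Rank-q Nystrom approximation K_tilde = K_nq @ inv(K_q) @ K_nq.T.

    K_nq : (n, q) kernel values between all n points and the first q points
    K_q  : (q, q) kernel matrix of the first q points (assumed invertible)
    """
    # Solving a linear system is cheaper and more stable than forming inv(K_q).
    return K_nq @ np.linalg.solve(K_q, K_nq.T)

# Toy check with a linear kernel whose rank is at most q = 3,
# so the rank-q approximation is exact:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # n = 100 points in R^3
K = X @ X.T                              # full kernel matrix, rank <= 3
K_tilde = nystrom(K[:, :3], K[:3, :3])
print(np.allclose(K, K_tilde))           # True
```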
Proof
Singular value decomposition application
Applying singular value decomposition (SVD) to a matrix $A$ with dimensions $p \times m$ produces a singular system consisting of singular values $\{\sigma_j\}_{j=1}^{k}$ ($\sigma_j > 0$) and vectors $\{v_j\}_{j=1}^{m}$ and $\{u_j\}_{j=1}^{p}$ that form orthonormal bases of $\mathbb{R}^m$ and $\mathbb{R}^p$ respectively:

$$A^{\mathrm{T}} A v_j = \sigma_j v_j, \quad A A^{\mathrm{T}} u_j = \sigma_j u_j, \quad j = 1, \dots, k,$$
$$A^{\mathrm{T}} A v_j = 0, \quad j = k+1, \dots, m, \qquad A A^{\mathrm{T}} u_j = 0, \quad j = k+1, \dots, p.$$

If $U$ and $V$ are the matrices with the $u$'s and $v$'s in their columns, and $\Sigma$ is the diagonal $p \times m$ matrix with the singular values $\sigma_i^{1/2}$, $i = 1, \dots, k$, on the first $k$ entries of the diagonal (all the other elements of the matrix are zeros), then the matrix $A$ can be rewritten as:

$$A = U \Sigma V^{\mathrm{T}}.$$
Further proof
Let $\hat{X}$ denote the $n \times d$ data matrix, and let $\hat{X}_q$ be the $q \times d$ matrix formed by its first $q$ rows, so that (for the linear kernel $K(x, x') = x^{\mathrm{T}} x'$) $\hat{K} = \hat{X} \hat{X}^{\mathrm{T}}$, $\hat{K}_q = \hat{X}_q \hat{X}_q^{\mathrm{T}}$ and $\hat{K}_{n,q} = \hat{X} \hat{X}_q^{\mathrm{T}}$.

Applying singular value decomposition (SVD) to these matrices:

$$\hat{X} = U \Sigma V^{\mathrm{T}}, \qquad \hat{X}_q = U_q \Sigma_q V_q^{\mathrm{T}}.$$

Since $\hat{K}_q$ is invertible, $\hat{X}_q$ has full row rank $q$, so $U_q$ is a $q \times q$ orthogonal matrix, $\Sigma_q$ is a $q \times q$ invertible diagonal matrix, and $V_q$ is a $d \times q$ matrix with orthonormal columns, $V_q^{\mathrm{T}} V_q = I_q$.

Replacing $\hat{X}_q$ by its SVD in the definitions of the blocks gives:

$$\hat{K}_q = U_q \Sigma_q V_q^{\mathrm{T}} V_q \Sigma_q U_q^{\mathrm{T}} = U_q \Sigma_q^2 U_q^{\mathrm{T}}, \qquad \hat{K}_{n,q} = \hat{X} V_q \Sigma_q U_q^{\mathrm{T}}.$$

By the characterization of an orthogonal matrix, $U_q U_q^{\mathrm{T}} = U_q^{\mathrm{T}} U_q = I$, and therefore

$$\hat{K}_q^{-1} = U_q \Sigma_q^{-2} U_q^{\mathrm{T}}.$$

Then the expression for $\tilde{K}$ becomes:

$$\tilde{K} = \hat{K}_{n,q} \hat{K}_q^{-1} \hat{K}_{n,q}^{\mathrm{T}} = \hat{X} V_q \Sigma_q U_q^{\mathrm{T}} \, U_q \Sigma_q^{-2} U_q^{\mathrm{T}} \, U_q \Sigma_q V_q^{\mathrm{T}} \hat{X}^{\mathrm{T}} = \hat{X} V_q V_q^{\mathrm{T}} \hat{X}^{\mathrm{T}}.$$

Defining $\tilde{X} = \hat{X} V_q V_q^{\mathrm{T}}$, the orthogonal projection of the data onto the span of the first $q$ points, this reads $\tilde{K} = \tilde{X} \tilde{X}^{\mathrm{T}}$, a matrix of rank at most $q$. In particular, whenever the rows of $\hat{X}$ lie in that span, $\tilde{X} = \hat{X}$ and the approximation is exact: $\tilde{K} = \hat{K}$.
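The final identity of the proof can be checked numerically; the following sketch (variable names are illustrative) confirms that $\hat{K}_{n,q} \hat{K}_q^{-1} \hat{K}_{n,q}^{\mathrm{T}}$ and $\hat{X} V_q V_q^{\mathrm{T}} \hat{X}^{\mathrm{T}}$ coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, q = 8, 5, 3
X = rng.normal(size=(n, d))      # data matrix X-hat
Xq = X[:q]                       # its first q rows

K_q = Xq @ Xq.T                  # q x q block (invertible for generic data)
K_nq = X @ Xq.T                  # n x q block
K_tilde = K_nq @ np.linalg.solve(K_q, K_nq.T)

# The SVD route from the proof: K_tilde = X Vq Vq^T X^T
_, _, Vt_q = np.linalg.svd(Xq, full_matrices=False)
Vq = Vt_q.T                      # d x q matrix with orthonormal columns
print(np.allclose(K_tilde, X @ Vq @ Vq.T @ X.T))   # True
```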
General theorem for kernel approximation for a feature map
For a feature map $\Phi : X \to F$ with associated kernel $K(x, x') = \langle \Phi(x), \Phi(x') \rangle_F$, the equality $\tilde{K} = \hat{K}_{n,q} \hat{K}_q^{-1} \hat{K}_{n,q}^{\mathrm{T}}$ also follows by replacing the data matrix $\hat{X}$ with the operator $\hat{\Phi} : F \to \mathbb{R}^n$ such that $(\hat{\Phi} w)_i = \langle \Phi(x_i), w \rangle_F$, $i = 1, \dots, n$, and $\hat{X}_q$ with the analogous operator $\hat{\Phi}_q$ built from the first $q$ points. The proof above carries over, with the matrix SVD replaced by the singular system of these operators.
Application for Regularized Least Squares
In vector and kernel notation, the problem of regularized least squares can be rewritten as:

$$\min_{c \in \mathbb{R}^n} \frac{1}{n} \| \hat{Y} - \hat{K} c \|_{\mathbb{R}^n}^2 + \lambda \langle c, \hat{K} c \rangle_{\mathbb{R}^n}.$$

Computing the gradient and setting it to 0, the minimum can be obtained:

$$-\frac{1}{n} \hat{K} (\hat{Y} - \hat{K} c) + \lambda \hat{K} c = 0$$
$$\Rightarrow \hat{K} (\hat{K} + \lambda n I) c = \hat{K} \hat{Y}$$
$$\Rightarrow c = (\hat{K} + \lambda n I)^{-1} \hat{Y},$$

where $I$ is the $n \times n$ identity matrix. The inverse matrix $(\hat{K} + \lambda n I)^{-1}$ can be computed cheaply if $\hat{K}$ is replaced by its Nyström approximation $\tilde{K} = \hat{K}_{n,q} \hat{K}_q^{-1} \hat{K}_{n,q}^{\mathrm{T}}$, using the Woodbury matrix identity:

$$(\tilde{K} + \lambda n I)^{-1} = \frac{1}{\lambda n} \left( I - \hat{K}_{n,q} \left( \lambda n \hat{K}_q + \hat{K}_{n,q}^{\mathrm{T}} \hat{K}_{n,q} \right)^{-1} \hat{K}_{n,q}^{\mathrm{T}} \right).$$

It has the desired storage and complexity requirements: $O(nq)$ memory and $O(nq^2)$ time.
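A sketch of the resulting solver (the function name is illustrative): only the $n \times q$ and $q \times q$ kernel blocks are touched, and the dense $n \times n$ inverse is never formed, giving the complexity stated above:

```python
import numpy as np

def nystrom_rls(K_nq, K_q, Y, lam):
    """Coefficients c = (K_tilde + lam*n*I)^{-1} Y via the Woodbury identity,
    where K_tilde = K_nq @ inv(K_q) @ K_nq.T is never formed explicitly."""
    n = K_nq.shape[0]
    ln = lam * n
    inner = ln * K_q + K_nq.T @ K_nq       # q x q system instead of n x n
    return (Y - K_nq @ np.linalg.solve(inner, K_nq.T @ Y)) / ln
```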
Randomized feature maps approximation
Let $x, y \in \mathbb{R}^d$ be samples of data and $z : \mathbb{R}^d \to \mathbb{R}^D$ a randomized feature map (mapping a single vector to a vector of higher dimensionality) such that the inner product between a pair of transformed points approximates their kernel evaluation:

$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle \approx z(x)^{\mathrm{T}} z(y),$$

where $\Phi$ is the mapping embedded in the RBF kernel.

Since $z$ is low-dimensional, the input can be easily transformed with $z$; after that, fast linear learning methods can be applied to approximate the answer of the corresponding nonlinear kernel method.
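As an illustration of this transform-then-linear-model pattern, here is a sketch using scikit-learn, whose RBFSampler implements a randomized feature map for the RBF kernel (random Fourier features, described below); the data, labels and parameter values are made up for the example:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                       # toy data
y = (np.linalg.norm(X, axis=1) > 3.1).astype(int)    # toy nonlinear labels

z = RBFSampler(gamma=0.5, n_components=200, random_state=0)
Z = z.fit_transform(X)                     # z: R^10 -> R^200, applied once
clf = LinearSVC(max_iter=5000).fit(Z, y)   # fast linear method on z(x)
print(clf.score(Z, y))
```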
Random Fourier Features
The random Fourier features map produces a Monte Carlo approximation to the feature map. These random features consist of sinusoids $\cos(w^{\mathrm{T}} x + b)$, with $w \in \mathbb{R}^d$ drawn from the Fourier transform of the kernel to be approximated and $b \in \mathbb{R}$ drawn uniformly from $[0, 2\pi]$: by Bochner's theorem, a continuous shift-invariant kernel $K(x, y) = k(x - y)$ is the Fourier transform of a probability distribution $p(w)$, so with $w \sim p$ the product of two such transformed points is an unbiased estimate of $K(x, y)$. Geometrically, a random direction $w$ is chosen, the data points are projected onto it by $x \mapsto w^{\mathrm{T}} x$, and the resulting scalar is passed through a sinusoid.
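A minimal NumPy sketch for the Gaussian kernel $K(x, y) = \exp(-\gamma \|x - y\|^2)$, whose spectral distribution is $w \sim N(0, 2\gamma I)$ (the function name and the numerical check are illustrative):

```python
import numpy as np

def random_fourier_features(X, D=2000, gamma=1.0, seed=0):
    """Map rows of X to z(X) so that z(x) . z(y) ~ exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, D))  # w ~ N(0, 2*gamma*I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)               # b ~ Uniform[0, 2*pi)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Monte-Carlo check on a single pair of points:
x, y = np.array([0.3, -1.0]), np.array([1.1, 0.2])
Z = random_fourier_features(np.vstack([x, y]), D=20000, gamma=0.5)
print(Z[0] @ Z[1], np.exp(-0.5 * np.sum((x - y) ** 2)))  # close to each other
```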
Random Binning Features
The random binning features map partitions the input space using randomly shifted grids at randomly chosen resolutions and assigns to an input point a binary bit string that corresponds to the bins in which it falls. The grids are constructed so that the probability that two points $x, y \in \mathbb{R}^d$ are assigned to the same bin is proportional to $K(x, y)$. The inner product between a pair of transformed points is then proportional to the number of grids in which the two points are binned together, and is therefore an unbiased estimate of $K(x, y)$; the construction applies to kernels that depend only on the $L_1$ distance between data points, such as the Laplacian kernel.
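For the Laplacian kernel $K(x, y) = \exp(-\|x - y\|_1 / \sigma)$ the construction is explicit: each grid's pitch is drawn per dimension from a Gamma(2, $\sigma$) distribution, which mixes the triangular "hat" kernels of the individual grids into the Laplacian kernel. A sketch under these assumptions (names are illustrative; the binary bit string is represented implicitly by integer bin indices):

```python
import numpy as np

def random_binning(X, P=2000, sigma=1.0, seed=0):
    """Assign each point a bin-index tuple in each of P random grids; the
    fraction of grids in which two points share a bin estimates the
    Laplacian kernel exp(-||x - y||_1 / sigma)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    bins = np.empty((P, n, d), dtype=np.int64)
    for p in range(P):
        delta = rng.gamma(shape=2.0, scale=sigma, size=d)  # grid pitch per dim
        u = rng.uniform(0.0, delta)                        # random shift per dim
        bins[p] = np.floor((X - u) / delta)
    return bins

# z(x)^T z(y) reduces to the fraction of grids where x and y share a bin:
X = np.array([[0.0, 0.0], [0.4, -0.3]])
bins = random_binning(X, P=20000, sigma=1.0)
same = np.all(bins[:, 0, :] == bins[:, 1, :], axis=1).mean()
print(same, np.exp(-0.7))   # estimate vs exact kernel value, ||x-y||_1 = 0.7
```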
Comparison of approximation methods
The two approaches for large-scale kernel learning, the Nyström method and random features, differ in that the Nyström method uses data-dependent basis functions, while in the random-features approach the basis functions are sampled from a distribution independent of the training data. This difference leads to an improved analysis of kernel learning approaches based on the Nyström method: when there is a large gap in the eigenspectrum of the kernel matrix, approaches based on the Nyström method can achieve better results than the random-features-based approach.