In operator theory, a branch of mathematics, a positive definite kernel is a generalization of a positive definite function or a positive-definite matrix. It was first introduced by James Mercer in the early 20th century, in the context of solving integral operator equations. Since then positive definite functions and their various analogues and generalizations have arisen in diverse parts of mathematics. They occur naturally in Fourier analysis, probability theory, operator theory, complex function theory, moment problems, integral equations, boundary-value problems for partial differential equations, machine learning, embedding problems, information theory, and other areas.
This article will discuss some of the historical and current developments of the theory of positive definite kernels, starting with the general idea and properties before considering practical applications.
Definition
Let $\mathcal{X}$ be a nonempty set, sometimes referred to as the index set. A symmetric function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a positive definite (p.d.) kernel on $\mathcal{X}$ if

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) \ge 0 \tag{1.1}$$

holds for any $n \in \mathbb{N}$, $x_1, \dots, x_n \in \mathcal{X}$, and $c_1, \dots, c_n \in \mathbb{R}$.
In the mathematical literature, kernels are usually complex-valued functions. In this article we assume real-valued functions, which is the common practice in machine learning and other applications of p.d. kernels.
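Condition (1.1) can be checked numerically for a candidate kernel by sampling points and coefficients and evaluating the quadratic form. The following sketch (illustrative only; the Gaussian kernel and all parameter values are arbitrary choices) does exactly this on random inputs.

```python
import numpy as np

# A minimal numerical check of condition (1.1), using the Gaussian
# kernel as a test case. For random points x_1, ..., x_n and random
# coefficients c_1, ..., c_n, the quadratic form
#   sum_{i,j} c_i c_j K(x_i, x_j)
# should never be negative (up to floating-point round-off).

rng = np.random.default_rng(0)

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.abs(x - y) ** 2 / (2 * sigma ** 2))

for trial in range(1000):
    n = rng.integers(2, 10)
    x = rng.normal(size=n)
    c = rng.normal(size=n)
    gram = gaussian_kernel(x[:, None], x[None, :])
    q = c @ gram @ c
    assert q >= -1e-12, f"condition (1.1) violated: {q}"

print("condition (1.1) held in all trials")
```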
Some general properties

- For a family of p.d. kernels $(K_i)_{i \in \mathbb{N}}$, $K_i : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:
  - the conical sum $\sum_{i=1}^{n} \lambda_i K_i$ is p.d., given $\lambda_1, \dots, \lambda_n \ge 0$;
  - the product $K_1^{a_1} \cdots K_n^{a_n}$ is p.d., given $a_1, \dots, a_n \in \mathbb{N}$;
  - the limit $K = \lim_{n \to \infty} K_n$ is p.d. if the limit exists.
- If $(\mathcal{X}_i)_{i=1}^{n}$ is a sequence of sets and $(K_i)_{i=1}^{n}$ a sequence of p.d. kernels defined on them, then both $K((x_1, \dots, x_n), (y_1, \dots, y_n)) = \prod_{i=1}^{n} K_i(x_i, y_i)$ and $K((x_1, \dots, x_n), (y_1, \dots, y_n)) = \sum_{i=1}^{n} K_i(x_i, y_i)$ are p.d. kernels on $\mathcal{X}_1 \times \dots \times \mathcal{X}_n$.
- Let $\mathcal{X}_0 \subset \mathcal{X}$. Then the restriction $K_0$ of $K$ to $\mathcal{X}_0 \times \mathcal{X}_0$ is also a p.d. kernel.

Two of these closure properties, sums and pointwise products, are illustrated numerically in the sketch below.
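The following sketch (with arbitrary kernels and data) confirms that the sum and the pointwise product of two p.d. kernels remain p.d.; the product case is the Schur product theorem.

```python
import numpy as np

# Closure properties: if K1 and K2 are p.d., then K1 + K2 and the
# pointwise product K1 * K2 are p.d. as well (the latter is the
# Schur product theorem). Kernels and data are arbitrary choices.

rng = np.random.default_rng(1)
x = rng.normal(size=30)

K1 = np.exp(-np.abs(x[:, None] - x[None, :]) ** 2)   # Gaussian kernel
K2 = 1.0 + x[:, None] * x[None, :]                   # constant + linear kernel

for name, G in [("sum", K1 + K2), ("product", K1 * K2)]:
    eigs = np.linalg.eigvalsh(G)
    print(f"{name}: smallest eigenvalue = {eigs.min():.2e}")  # >= 0 up to round-off
```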
Examples of p.d. kernels

Common examples of p.d. kernels defined on Euclidean space $\mathbb{R}^d$ include:

- the linear kernel: $K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^{\mathsf{T}} \mathbf{y}$, for $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$;
- the polynomial kernel: $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x}^{\mathsf{T}} \mathbf{y} + r)^n$, for $r \ge 0$ and $n \in \mathbb{N}$;
- the Gaussian kernel (RBF kernel): $K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x} - \mathbf{y}\|^2 / (2\sigma^2)}$, for $\sigma > 0$;
- the Laplacian kernel: $K(\mathbf{x}, \mathbf{y}) = e^{-\alpha \|\mathbf{x} - \mathbf{y}\|}$, for $\alpha > 0$.

More generally, if $H$ is any Hilbert space with inner product $\langle \cdot, \cdot \rangle_H$ and $\Phi$ maps a set $\mathcal{X}$ into $H$, then $K(x, y) = \langle \Phi(x), \Phi(y) \rangle_H$ defines a p.d. kernel on $\mathcal{X}$. In addition, a negative definite kernel $\psi$ (see the section on kernels and distances below), such as a suitable squared metric between histograms of frequencies, can be used to define p.d. kernels using the following formula: $K(x, y) = e^{-\alpha \psi(x, y)}$, for $\alpha > 0$.
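As a quick illustration (arbitrary data and parameter values), the Gram matrices of these standard kernels can be formed and their eigenvalues inspected; positive definiteness manifests as a nonnegative spectrum.

```python
import numpy as np

# Gram matrices of a few standard p.d. kernels on random data in R^d.
# Each matrix should be positive semidefinite, i.e. all eigenvalues
# nonnegative up to round-off. Parameter values are arbitrary.

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))                       # 40 points in R^3
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

kernels = {
    "linear":     X @ X.T,
    "polynomial": (X @ X.T + 1.0) ** 3,
    "gaussian":   np.exp(-sq_dists / 2.0),
    "laplacian":  np.exp(-np.sqrt(sq_dists)),
}

for name, G in kernels.items():
    print(f"{name:>10}: min eigenvalue = {np.linalg.eigvalsh(G).min():.2e}")
```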
History
P.d. kernels, as defined in (1.1), first arose in 1909 in a paper on integral equations by James Mercer. Several other authors made use of this concept in the following two decades, but none of them explicitly used kernels satisfying condition (1.1). Mercer's work was motivated by Hilbert's 1904 paper on Fredholm integral equations of the second kind,

$$f(s) = \varphi(s) - \lambda \int_a^b K(s, t) \varphi(t) \, dt.$$
In particular, Hilbert had shown that

$$\int_a^b \int_a^b K(s, t) x(s) x(t) \, ds \, dt = \sum_{n} \frac{1}{\lambda_n} \left[ \int_a^b x(s) \psi_n(s) \, ds \right]^2,$$

where $\{\psi_n\}$ is an orthonormal system of eigenfunctions of the integral operator with kernel $K$ and $\{\lambda_n\}$ are the corresponding eigenvalues. Hilbert called a kernel definite if this double integral is positive for all $x \ne 0$. Mercer showed that a continuous, symmetric kernel is definite in this sense if and only if it satisfies (1.1), and that for such kernels the expansion

$$K(s, t) = \sum_{n} \frac{\psi_n(s) \psi_n(t)}{\lambda_n}$$

holds absolutely and uniformly.
At about the same time W. H. Young, motivated by a different question in the theory of integral equations, showed that for continuous kernels condition (1.1) is equivalent to

$$\int_a^b \int_a^b K(s, t) x(s) x(t) \, ds \, dt \ge 0 \quad \text{for all } x \in L^1[a, b].$$
E. H. Moore initiated the study of a very general kind of p.d. kernel. If $E$ is an abstract set, he called functions $K(x, y)$ defined on $E \times E$ "positive Hermitian matrices" if they satisfy (1.1) for all $x_i \in E$. Moore was interested in the generalization of integral equations and showed that to each such $K$ there corresponds a Hilbert space $H$ of functions such that, for each $f \in H$, $f(y) = \langle f, K(\cdot, y) \rangle_H$. This property is called the reproducing property of the kernel and turns out to be significant in the solution of boundary-value problems for elliptic partial differential equations.
Another line of development in which p.d. kernels played a large role was the theory of harmonics on homogeneous spaces, begun by E. Cartan in 1929 and continued by H. Weyl and S. Ito. The most comprehensive theory of p.d. kernels in homogeneous spaces is that of M. Krein, which includes as special cases the work on p.d. functions and on irreducible unitary representations of locally compact groups.
In probability theory p.d. kernels arise as covariance kernels of stochastic processes.
Connection with reproducing kernel Hilbert spaces and feature maps
Positive definite kernels provide a framework that encompasses some basic Hilbert space constructions. In the following we present a tight relationship between positive definite kernels and two mathematical objects, namely reproducing kernel Hilbert spaces and feature maps.
Let $X$ be a nonempty set and $H$ a Hilbert space of functions $f : X \to \mathbb{R}$, with inner product $\langle \cdot, \cdot \rangle_H$. For every $x \in X$, the evaluation functional $e_x : H \to \mathbb{R}$ is defined by $e_x(f) = f(x)$.

Definition: The space $H$ is called a reproducing kernel Hilbert space (RKHS) if the evaluation functionals $e_x$ are continuous for every $x \in X$.
Every RKHS has a special function associated to it, namely the reproducing kernel:
Definition: A reproducing kernel of $H$ is a function $K : X \times X \to \mathbb{R}$ such that

1. $K_x := K(\cdot, x) \in H$ for all $x \in X$, and
2. $\langle f, K_x \rangle_H = f(x)$ for all $f \in H$ and all $x \in X$.

The latter property is called the reproducing property.
The following result shows equivalence between RKHS and reproducing kernels:
Theorem: Every reproducing kernel $K$ induces a unique RKHS, and every RKHS has a unique reproducing kernel.
Now the connection between p.d. kernels and RKHS is given by the following theorem:
Theorem: Every reproducing kernel is positive definite, and every p.d. kernel defines a unique RKHS, of which it is the unique reproducing kernel.
Thus, given a positive definite kernel $K$, it is possible to build an associated RKHS with $K$ as its reproducing kernel.
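Concretely, functions of the form $f = \sum_i a_i K(\cdot, x_i)$ (and their limits) make up this RKHS. The sketch below, with an arbitrary Gaussian kernel and hypothetical data, builds the minimum-norm interpolant of given values by solving a linear system in the Gram matrix.

```python
import numpy as np

# Sketch: functions in the RKHS of a p.d. kernel K take the form
# f = sum_i a_i K(., x_i) (and limits thereof). Given sample data
# (x_i, y_i), solving G a = y with Gram matrix G_ij = K(x_i, x_j)
# yields the interpolant of minimal RKHS norm. Kernel choice and
# data here are arbitrary.

def K(x, y):                                  # Gaussian kernel on R
    return np.exp(-(x - y) ** 2 / 2)

x = np.array([-1.0, -0.3, 0.4, 1.2])          # sample locations
y = np.sin(3 * x)                             # sample values

G = K(x[:, None], x[None, :])                 # Gram matrix (p.d.)
a = np.linalg.solve(G, y)                     # coefficients of f

f = lambda t: K(np.atleast_1d(t)[:, None], x[None, :]) @ a
print(f(x))                                   # reproduces y at the samples
print(y)
```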
As stated earlier, p.d. kernels can be constructed from inner products. This fact can be used to connect p.d. kernels with another interesting object that arises in machine learning applications, namely the feature map. Let $F$ be a Hilbert space; a function $\Phi : X \to F$ is called a feature map, and $F$ the feature space. Every feature map defines a kernel via

$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle_F.$$

Indeed, positive definiteness of $K$ follows from that of the inner product, since

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) = \left\| \sum_{i=1}^{n} c_i \Phi(x_i) \right\|_F^2 \ge 0.$$

Conversely, every p.d. kernel admits at least one feature map, for example the canonical map $\Phi(x) = K_x = K(\cdot, x)$ into its RKHS.
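For instance, the degree-2 polynomial kernel on $\mathbb{R}^2$ coincides with the ordinary dot product of an explicit six-dimensional feature map. This is a sketch with arbitrary test points; the map `phi` below is one of several equivalent choices.

```python
import numpy as np

# Sketch: a feature map realizing the degree-2 polynomial kernel
# K(x, y) = (x.y + 1)^2 on R^2. The explicit 6-dimensional map below
# satisfies K(x, y) = <phi(x), phi(y)> exactly.

def phi(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(4)
x, y = rng.normal(size=2), rng.normal(size=2)

k_direct = (x @ y + 1.0) ** 2          # kernel evaluated directly
k_feature = phi(x) @ phi(y)            # inner product in feature space
print(np.isclose(k_direct, k_feature)) # True
```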
Kernels and distances
Kernel methods, which are very popular machine learning applications of p.d. kernels, are often compared to distance-based methods such as nearest neighbors. In this section we discuss parallels between their two respective ingredients, namely kernels $K$ and distances $d$.
Here by a distance function between each pair of elements of some set $X$ we mean a metric defined on that set, i.e. any nonnegative-valued function $d$ on $X \times X$ which satisfies

- $d(x, y) \ge 0$, with $d(x, y) = 0$ if and only if $x = y$;
- $d(x, y) = d(y, x)$ for all $x, y \in X$;
- $d(x, z) \le d(x, y) + d(y, z)$ for all $x, y, z \in X$.
The link between distances and p.d. kernels is given by a particular kind of kernel, called a negative definite kernel, defined as follows:
Definition: A symmetric function $\psi : X \times X \to \mathbb{R}$ is called a negative definite (n.d.) kernel on $X$ if

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \psi(x_i, x_j) \le 0 \tag{1.2}$$

holds for any $n \in \mathbb{N}$, $x_1, \dots, x_n \in X$, and $c_1, \dots, c_n \in \mathbb{R}$ with $\sum_{i=1}^{n} c_i = 0$.
The parallel between n.d. kernels and distances is in the following: whenever a n.d. kernel vanishes on the set $\{(x, x) : x \in X\}$ and is zero only on this set, then its square root is a distance for $X$. At the same time, not every distance corresponds to a n.d. kernel. This is only true for Hilbertian distances, where a distance $d$ is called Hilbertian if the metric space $(X, d)$ can be embedded isometrically into some Hilbert space.
On the other hand, n.d. kernels can be identified with a subfamily of p.d. kernels known as infinitely divisible kernels. A nonnegative-valued kernel $K$ is said to be infinitely divisible if for every $n \in \mathbb{N}$ there exists a positive definite kernel $K_n$ such that $K = (K_n)^n$. Indeed, by a classical result of Schoenberg, a kernel $\psi$ is negative definite if and only if $e^{-t\psi}$ is positive definite for every $t > 0$.
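This connection can be observed numerically: the squared Euclidean distance is a n.d. kernel vanishing exactly on the diagonal, and exponentiating its negative yields the (p.d.) Gaussian kernel. A small sketch, with arbitrary data and scale parameters:

```python
import numpy as np

# Sketch of the Schoenberg connection: psi(x, y) = |x - y|^2 is a
# negative definite kernel, so exp(-t * psi) is positive definite for
# every t > 0 (it is exactly the Gaussian kernel). Data are arbitrary.

rng = np.random.default_rng(5)
x = rng.normal(size=25)
psi = (x[:, None] - x[None, :]) ** 2          # n.d. kernel: squared distance

# Check (1.2): quadratic form <= 0 for coefficients summing to zero.
c = rng.normal(size=25)
c -= c.mean()                                  # enforce sum(c) = 0
print("n.d. form:", c @ psi @ c)               # <= 0

# Check that exp(-t * psi) is p.d. for a few values of t.
for t in (0.1, 1.0, 10.0):
    eigs = np.linalg.eigvalsh(np.exp(-t * psi))
    print(f"t = {t}: min eigenvalue = {eigs.min():.2e}")  # >= 0 up to round-off
```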
Kernels in machine learning
Positive definite kernels, through their equivalence with reproducing kernel Hilbert spaces, are particularly important in the field of statistical learning theory because of the celebrated representer theorem, which states that every minimizer of a regularized empirical risk functional over an RKHS can be written as a linear combination of the kernel function evaluated at the training points, i.e. $f(\cdot) = \sum_{i=1}^{n} c_i K(\cdot, x_i)$. This is a practically useful result as it effectively simplifies the empirical risk minimization problem from an infinite-dimensional to a finite-dimensional optimization problem.
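A brief sketch of how this is used in practice: for kernel ridge regression (squared loss with an RKHS-norm penalty), the representer theorem reduces the problem to an $n \times n$ linear system for the coefficients $c$. The kernel, regularization strength, and data below are arbitrary choices for illustration.

```python
import numpy as np

# Kernel ridge regression via the representer theorem: the minimizer of
#   (1/n) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2
# over the RKHS has the form f = sum_i c_i K(., x_i), with coefficients
# solving (G + lam * n * I) c = y. Kernel and data are arbitrary.

rng = np.random.default_rng(6)
n, lam = 50, 1e-3

x = np.sort(rng.uniform(-3, 3, size=n))
y = np.sin(x) + 0.1 * rng.normal(size=n)         # noisy samples of sin

G = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)  # Gaussian Gram matrix
c = np.linalg.solve(G + lam * n * np.eye(n), y)  # representer coefficients

def f(t):
    return np.exp(-(np.atleast_1d(t)[:, None] - x[None, :]) ** 2 / 2) @ c

print("training RMSE:", np.sqrt(np.mean((f(x) - y) ** 2)))
```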
Kernels in probabilistic models
There are several different ways in which kernels arise in probability theory.
One of them is nondeterministic recovery problems: here one wants to estimate an unknown function $f$ from its values at points $x_1, \dots, x_n$, modeling $f$ as a realization of a stochastic process whose covariance function is a p.d. kernel $K(x, y) = \mathbb{E}[f(x) f(y)]$. Assume now that a noise variable $\epsilon(x)$, with zero mean and variance $\sigma^2$, is added to the observations, such that the noise is independent at different points and independent of $f$; then the recovery problem is the same as before, but with a modified kernel $\tilde{K}(x, y) = K(x, y) + \sigma^2 \delta_{xy}$.

Another is density estimation by kernels: given independent samples $x_1, \dots, x_n$ drawn from an unknown density $p$ on $\mathbb{R}^d$, one uses

$$\hat{p}(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

as a smooth estimate of $p$, where $K$ is a suitable kernel and $h > 0$ a bandwidth parameter.
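The following sketch (with a hypothetical covariance kernel, noise level, and data) shows the recovery setting above: the standard posterior-mean estimate of $f$ given noisy observations uses the modified Gram matrix $G + \sigma^2 I$.

```python
import numpy as np

# Sketch of recovery under noise (Gaussian process regression): with
# covariance kernel K and i.i.d. observation noise of variance sigma^2,
# the posterior-mean estimate at a new point t is
#   fhat(t) = k(t)^T (G + sigma^2 I)^{-1} y,
# where G_ij = K(x_i, x_j) and k(t)_i = K(t, x_i). All choices arbitrary.

rng = np.random.default_rng(7)
sigma = 0.1

def K(a, b):                                     # Gaussian covariance kernel
    return np.exp(-(a - b) ** 2 / 2)

x = np.linspace(-2, 2, 15)                       # observation points
y = np.cos(2 * x) + sigma * rng.normal(size=15)  # noisy observations

G = K(x[:, None], x[None, :])
alpha = np.linalg.solve(G + sigma**2 * np.eye(15), y)

def fhat(t):
    return K(np.atleast_1d(t)[:, None], x[None, :]) @ alpha

print("estimate at 0.5:", fhat(0.5)[0], " true value:", np.cos(1.0))
```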
Numerical solution of partial differential equations
One of the greatest application areas of so-called meshfree methods is in the numerical solution of PDEs. Some of the popular meshfree methods are closely related to positive definite kernels, such as the meshless local Petrov-Galerkin method (MLPG), the reproducing kernel particle method (RKPM), and smoothed-particle hydrodynamics (SPH). These methods use radial basis kernels for collocation.
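A minimal sketch of unsymmetric (Kansa-type) radial basis collocation for a 1D boundary-value problem, assuming a Gaussian radial kernel; the shape parameter, test problem, and discretization are hypothetical choices.

```python
import numpy as np

# Unsymmetric (Kansa) RBF collocation for u''(x) = f(x) on [0, 1],
# u(0) = u(1) = 0, using a Gaussian radial basis kernel.
# Hypothetical test problem: f(x) = -pi^2 sin(pi x), exact u = sin(pi x).

eps = 3.0                        # shape parameter (assumed value)
centers = np.linspace(0, 1, 21)  # RBF centers = collocation points

def phi(x, c):
    return np.exp(-(eps * (x - c)) ** 2)

def phi_xx(x, c):                # second derivative of the Gaussian RBF
    r = x - c
    return (4 * eps**4 * r**2 - 2 * eps**2) * np.exp(-(eps * r) ** 2)

f = lambda x: -np.pi**2 * np.sin(np.pi * x)

X, C = centers[:, None], centers[None, :]
A = phi_xx(X, C)                      # PDE rows at interior points
A[0, :] = phi(centers[0], centers)    # boundary row at x = 0
A[-1, :] = phi(centers[-1], centers)  # boundary row at x = 1

rhs = f(centers)
rhs[0] = rhs[-1] = 0.0                # boundary values

coef = np.linalg.solve(A, rhs)

# Evaluate the approximation and compare with the exact solution.
xe = np.linspace(0, 1, 200)
u = phi(xe[:, None], C) @ coef
print("max error:", np.abs(u - np.sin(np.pi * xe)).max())
```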
Other applications
In the literature on computer experiments and other engineering experiments one increasingly encounters models based on p.d. kernels, RBFs or kriging. One such topic is response surface modeling. Other types of applications that boil down to data fitting are rapid prototyping and computer graphics. Here one often uses implicit surface models to approximate or interpolate point cloud data.
Applications of p.d. kernels in various other branches of mathematics include multivariate integration, multivariate optimization, and numerical analysis and scientific computing, where one studies fast, accurate, and adaptive algorithms, ideally implemented in high-performance computing environments.