In machine learning, the kernel embedding of distributions (also called the kernel mean or mean map) comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function (measuring similarity between elements of the space) may be defined.
The analysis of distributions is fundamental in machine learning and statistics, and many algorithms in these fields rely on information theoretic approaches such as entropy, mutual information, or Kullback–Leibler divergence. However, to estimate these quantities, one must first either perform density estimation, or employ sophisticated space-partitioning/bias-correction strategies which are typically infeasible for high-dimensional data. Commonly, methods for modeling complex distributions rely on parametric assumptions that may be unfounded or computationally challenging (e.g. Gaussian mixture models), while nonparametric methods like kernel density estimation (Note: the smoothing kernels in this context have a different interpretation than the kernels discussed here) or characteristic function representation (via the Fourier transform of the distribution) break down in high-dimensional settings.
Methods based on the kernel embedding of distributions sidestep these problems and also possess the following advantages:
- Data may be modeled without restrictive assumptions about the form of the distributions and relationships between variables
- Intermediate density estimation is not needed
- Practitioners may specify the properties of a distribution most relevant for their problem (incorporating prior knowledge via choice of the kernel)
- If a characteristic kernel is used, then the embedding can uniquely preserve all information about a distribution, while thanks to the kernel trick, computations on the potentially infinite-dimensional RKHS can be implemented in practice as simple Gram matrix operations
- Dimensionality-independent rates of convergence of the empirical kernel mean (estimated from samples of the distribution) to the kernel embedding of the true underlying distribution can be proven.
- Learning algorithms based on this framework exhibit good generalization ability and finite sample convergence, while often being simpler and more effective than information theoretic methods
Thus, learning via the kernel embedding of distributions offers a principled drop-in replacement for information theoretic approaches and is a framework which not only subsumes many popular methods in machine learning and statistics as special cases, but also can lead to entirely new learning algorithms.
Definitions
Let $X$ denote a random variable with domain $\Omega$ and distribution $P(X)$, and let $k \colon \Omega \times \Omega \to \mathbb{R}$ be a symmetric, positive-definite kernel. The Moore–Aronszajn theorem guarantees the existence of a unique RKHS $\mathcal{H}$ of functions on $\Omega$ for which $k$ is the reproducing kernel, i.e. in which the element $k(x, \cdot)$ satisfies the reproducing property $\langle f, k(x, \cdot) \rangle_{\mathcal{H}} = f(x)$ for all $f \in \mathcal{H}$ and $x \in \Omega$. The element $k(x, \cdot)$ may be viewed as an implicit (possibly infinite-dimensional) feature map $\varphi(x)$, so that $k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}$ serves as a measure of similarity between $x$ and $x'$. Analogous objects (a kernel, feature map, and RKHS) are assumed for any further random variables introduced below.
Kernel embedding
The kernel embedding of the distribution $P(X)$ in $\mathcal{H}$ is defined as $\mu_X := \mathbb{E}\big[\varphi(X)\big] = \int_\Omega \varphi(x)\, \mathrm{d}P(x)$, which is well defined whenever $\mathbb{E}\big[\sqrt{k(X,X)}\big] < \infty$. The embedding represents the entire distribution as a single point in the RKHS, and expectations of RKHS functions reduce to inner products: $\mathbb{E}\big[f(X)\big] = \langle f, \mu_X \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$.
A kernel is characteristic if the mean embedding $P \mapsto \mu_P$ is injective, i.e. if every distribution is mapped to a distinct element of the RKHS. Characteristic kernels (such as the Gaussian and Laplace kernels on $\mathbb{R}^d$) therefore guarantee that no information about the distribution is lost in the embedding.
Empirical kernel embedding
Given $n$ i.i.d. training examples $\{x_1, \ldots, x_n\}$ drawn from $P(X)$, the kernel embedding can be empirically estimated as $\widehat{\mu}_X = \frac{1}{n} \sum_{i=1}^n \varphi(x_i)$.
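As an illustration, the sketch below (an example constructed here, not taken from the source; the RBF kernel, the bandwidth, and all names are choices made for the illustration) evaluates the empirical mean embedding at arbitrary points using only kernel evaluations, without ever forming the feature map explicitly.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def empirical_mean_embedding(X):
    """Return a function T -> mu_hat_X(T) = (1/n) sum_i k(x_i, t),
    i.e. the empirical kernel mean evaluated pointwise via the kernel trick."""
    def mu_hat(T):
        return rbf_kernel(T, X).mean(axis=1)
    return mu_hat

# Usage: embed a sample from N(0, 1) and evaluate the embedding on a grid.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
mu_hat = empirical_mean_embedding(X)
grid = np.linspace(-3, 3, 7).reshape(-1, 1)
print(mu_hat(grid))   # values of the empirical mean embedding at the grid points
```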
Joint distribution embedding
If $Y$ denotes another random variable (for simplicity, assumed here to share the domain and kernel of $X$), then the joint distribution $P(X, Y)$ can be mapped into the tensor product feature space $\mathcal{H} \otimes \mathcal{H}$ via $\mathcal{C}_{XY} = \mathbb{E}\big[\varphi(X) \otimes \varphi(Y)\big] = \int_{\Omega \times \Omega} \varphi(x) \otimes \varphi(y)\, \mathrm{d}P(x, y)$.
By the equivalence between a tensor and a linear map, this joint embedding may be interpreted as an uncentered cross-covariance operator $\mathcal{C}_{XY} \colon \mathcal{H} \to \mathcal{H}$, from which the cross-covariance of functions $f, g \in \mathcal{H}$ can be computed as $\mathbb{E}\big[f(X)\, g(Y)\big] = \langle f, \mathcal{C}_{XY}\, g \rangle_{\mathcal{H}}$.
Given $n$ pairs of training examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn i.i.d. from $P(X, Y)$, the joint embedding can be empirically estimated as $\widehat{\mathcal{C}}_{XY} = \frac{1}{n} \sum_{i=1}^n \varphi(x_i) \otimes \varphi(y_i)$.
Conditional distribution embedding
Given a conditional distribution $P(Y \mid X)$, the corresponding RKHS embedding is defined for each conditioning value $x$ as $\mu_{Y \mid x} = \mathbb{E}\big[\varphi(Y) \mid X = x\big] = \int_\Omega \varphi(y)\, \mathrm{d}P(y \mid x)$.
Note that the embedding of $P(Y \mid X)$ thus does not correspond to a single point in the RKHS, but rather defines a family of points indexed by the conditioning value $x$. By fixing $X$ to a particular value, a single RKHS element is obtained, and it is therefore natural to define the conditional embedding operator $\mathcal{C}_{Y \mid X} = \mathcal{C}_{YX}\, \mathcal{C}_{XX}^{-1}$, which given the feature mapping of a point $x$ outputs the embedding of the conditional distribution of $Y$ given $X = x$: under the assumption that $\mathbb{E}\big[g(Y) \mid X\big] \in \mathcal{H}$ for all $g \in \mathcal{H}$, it can be shown that $\mu_{Y \mid x} = \mathcal{C}_{Y \mid X}\, \varphi(x)$.
This assumption is always true for finite domains with characteristic kernels, but may not necessarily hold for continuous domains. Nevertheless, even in cases where the assumption fails, $\mathcal{C}_{Y \mid X}\, \varphi(x)$ may still be used as an approximation of the true conditional embedding $\mu_{Y \mid x}$, and in practice the inversion is replaced by a regularized inverse (analogously to ridge regression).
Given training examples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, the conditional embedding operator may be estimated empirically as $\widehat{\mathcal{C}}_{Y \mid X} = \boldsymbol{\Phi}\, (\mathbf{K} + \lambda \mathbf{I})^{-1}\, \boldsymbol{\Upsilon}^\top$,
where $\boldsymbol{\Phi} = \big(\varphi(y_1), \ldots, \varphi(y_n)\big)$ and $\boldsymbol{\Upsilon} = \big(\varphi(x_1), \ldots, \varphi(x_n)\big)$ are implicitly formed feature matrices, $\mathbf{K} = \boldsymbol{\Upsilon}^\top \boldsymbol{\Upsilon}$ is the Gram matrix of the samples of $X$, and $\lambda$ is a regularization parameter needed to avoid overfitting.
Thus, the empirical estimate of the kernel conditional embedding is given by a weighted sum of samples of $Y$ in the feature space: $\widehat{\mu}_{Y \mid x} = \sum_{i=1}^n \beta_i(x)\, \varphi(y_i)$ with $\boldsymbol{\beta}(x) = (\mathbf{K} + \lambda \mathbf{I})^{-1}\, \mathbf{K}_x$ and $\mathbf{K}_x = \big(k(x_1, x), \ldots, k(x_n, x)\big)^\top$.
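This weighted-sum form translates directly into Gram-matrix code. The sketch below (illustrative only; the RBF kernel, the regularization value, and all names are assumptions made here) computes the weights $\boldsymbol{\beta}(x)$ and uses them to evaluate the empirical conditional embedding at test points of $Y$.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def conditional_embedding_weights(X, x, lam=0.1):
    """beta(x) = (K + lam*I)^{-1} K_x, the sample weights defining mu_hat_{Y|x}."""
    n = X.shape[0]
    K = rbf_kernel(X, X)                       # Gram matrix of the conditioning samples
    K_x = rbf_kernel(X, x.reshape(1, -1))      # kernel evaluations k(x_i, x)
    return np.linalg.solve(K + lam * np.eye(n), K_x).ravel()

# mu_hat_{Y|x}(y*) = sum_i beta_i(x) * k(y_i, y*): evaluate the conditional embedding.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=X.shape)   # Y depends nonlinearly on X
beta = conditional_embedding_weights(X, np.array([1.0]))
y_grid = np.linspace(-1.5, 1.5, 5).reshape(-1, 1)
print(rbf_kernel(y_grid, Y) @ beta)              # embedding of P(Y | X = 1) at y_grid
```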
Properties
Convergence of empirical kernel mean to the true distribution embedding
With high probability, the empirical kernel mean $\widehat{\mu}_X$ (computed from $n$ samples) converges to the true distribution embedding $\mu_X$ in RKHS norm at rate $O_p(n^{-1/2})$, where the constants depend on the kernel but not on the dimension of the underlying space $\Omega$. This dimension-independence contrasts with nonparametric density estimation, whose convergence rates degrade as the dimension grows, and is one of the main statistical motivations for working with embeddings directly.
Universal kernels
A kernel $k$ is called universal if it is continuous and the RKHS it induces is dense in the space of continuous bounded functions (with respect to the uniform norm) on compact subsets of the input domain; the widely used Gaussian RBF kernel has this property on compact subsets of $\mathbb{R}^d$. Universal kernels on compact metric spaces are characteristic, so whether the mean embedding is injective depends only on the choice of kernel, the domain, and support of the distributions being embedded.
Parameter selection for conditional distribution kernel embeddings
The regularization parameter $\lambda$ used in estimating the conditional embedding operator can be selected by cross-validation: estimating $\mathcal{C}_{Y \mid X}$ is equivalent to a (function-valued) kernel ridge regression with the feature-mapped samples of $Y$ as targets, so the corresponding squared-loss validation criterion can be minimized over $\lambda$ (and over any kernel bandwidth parameters).
Rules of probability as operations in the RKHS
This section illustrates how basic probabilistic rules may be reformulated as (multi)linear algebraic operations in the kernel embedding framework and is primarily based on the work of Song et al. The following notation is adopted:
- $P(X, Y)$ denotes the joint distribution over random variables $X$ and $Y$
- $P(X)$ denotes the marginal distribution of $X$, and $P(Y)$ the marginal distribution of $Y$
- $P(Y \mid X)$ denotes the conditional distribution of $Y$ given $X$, with corresponding conditional embedding operator $\mathcal{C}_{Y \mid X}$
- $\pi(Y)$ denotes a prior distribution over $Y$
- $Q$ is used to distinguish distributions which incorporate the prior from distributions $P$ which do not
In practice, all embeddings are empirically estimated from data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, and it is assumed that a set of samples $\{\tilde y_1, \ldots, \tilde y_{\tilde n}\}$ may be used to estimate the kernel embedding of the prior distribution $\pi(Y)$.
Kernel sum rule
In probability theory, the marginal distribution of $X$ can be computed by integrating $Y$ out of the joint density (here incorporating the prior distribution on $Y$): $Q(X) = \int_\Omega P(X \mid Y)\, \mathrm{d}\pi(Y)$.
The analog of this rule in the kernel embedding framework states that $\mu_X^\pi$, the embedding of $Q(X)$, can be computed as $\mu_X^\pi = \mathbb{E}\big[\mathcal{C}_{X \mid Y}\, \varphi(Y)\big] = \mathcal{C}_{X \mid Y}\, \mathbb{E}\big[\varphi(Y)\big] = \mathcal{C}_{X \mid Y}\, \mu_Y^\pi$, where $\mu_Y^\pi$ is the kernel embedding of $\pi(Y)$.
In practical implementations, the kernel sum rule takes the following form: $\widehat{\mu}_X^\pi = \widehat{\mathcal{C}}_{X \mid Y}\, \widehat{\mu}_Y^\pi = \boldsymbol{\Upsilon}\, (\mathbf{G} + \lambda \mathbf{I})^{-1}\, \widetilde{\mathbf{G}}\, \boldsymbol{\alpha}$,
where $\widehat{\mu}_Y^\pi = \sum_{i=1}^{\tilde n} \alpha_i\, \varphi(\tilde y_i)$ is the empirical kernel embedding of the prior distribution, $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_{\tilde n})^\top$, $\boldsymbol{\Upsilon} = \big(\varphi(x_1), \ldots, \varphi(x_n)\big)$, and the Gram matrices have entries $\mathbf{G}_{ij} = k(y_i, y_j)$ and $\widetilde{\mathbf{G}}_{ij} = k(y_i, \tilde y_j)$.
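Under this empirical form, the sum rule reduces to a single regularized linear solve. The sketch below (illustrative only; the RBF kernel, the regularization value, and all names are choices made here) computes the weights $w$ such that the output embedding is $\widehat{\mu}_X^\pi = \sum_i w_i\, \varphi(x_i)$.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_sum_rule_weights(Y_train, Y_prior, alpha, lam=0.1):
    """Weights w with mu_hat_X^pi = sum_i w_i phi(x_i):
    w = (G + lam*I)^{-1} G_tilde alpha, G_ij = k(y_i, y_j), G_tilde_ij = k(y_i, y~_j)."""
    n = Y_train.shape[0]
    G = rbf_kernel(Y_train, Y_train)
    G_tilde = rbf_kernel(Y_train, Y_prior)
    return np.linalg.solve(G + lam * np.eye(n), G_tilde @ alpha)

# Usage: (x_i, y_i) pairs define C_{X|Y}; (y~_j, alpha_j) define the prior embedding.
rng = np.random.default_rng(0)
Y_train = rng.normal(size=(200, 1))
Y_prior = rng.normal(loc=0.5, size=(50, 1))
alpha = np.full(50, 1.0 / 50)           # uniform weights on the prior samples
w = kernel_sum_rule_weights(Y_train, Y_prior, alpha)
print(w.shape)                           # one weight per training sample of X
```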
Kernel chain rule
In probability theory, a joint distribution can be factorized into a product of conditional and marginal distributions: $Q(X, Y) = P(X \mid Y)\, \pi(Y)$.
The analog of this rule in the kernel embedding framework states that $\mathcal{C}_{XY}^\pi$, the joint embedding of $Q(X, Y)$, can be factorized as the composition of the conditional embedding operator with the (uncentered) auto-covariance operator associated with $\pi(Y)$: $\mathcal{C}_{XY}^\pi = \mathcal{C}_{X \mid Y}\, \mathcal{C}_{YY}^\pi$, where $\mathcal{C}_{YY}^\pi = \mathbb{E}_\pi\big[\varphi(Y) \otimes \varphi(Y)\big]$.
In practical implementations, the kernel chain rule takes the following form: $\widehat{\mathcal{C}}_{XY}^\pi = \widehat{\mathcal{C}}_{X \mid Y}\, \widehat{\mathcal{C}}_{YY}^\pi = \boldsymbol{\Upsilon}\, (\mathbf{G} + \lambda \mathbf{I})^{-1}\, \widetilde{\mathbf{G}}\, \operatorname{diag}(\boldsymbol{\alpha})\, \widetilde{\boldsymbol{\Phi}}^\top$, where $\widetilde{\boldsymbol{\Phi}} = \big(\varphi(\tilde y_1), \ldots, \varphi(\tilde y_{\tilde n})\big)$ and the remaining symbols are as in the kernel sum rule.
Kernel Bayes' rule
In probability theory, a posterior distribution can be expressed in terms of a prior distribution and a likelihood function as $Q(Y \mid x) = \frac{P(x \mid Y)\, \pi(Y)}{Q(x)}$, where $Q(x) = \int_\Omega P(x \mid y)\, \mathrm{d}\pi(y)$.
The analog of this rule in the kernel embedding framework expresses the kernel embedding of the conditional distribution in terms of conditional embedding operators which are modified by the prior distribution: $\mu_{Y \mid x}^\pi = \mathcal{C}_{Y \mid X}^\pi\, \varphi(x) = \mathcal{C}_{YX}^\pi\, \big(\mathcal{C}_{XX}^\pi\big)^{-1} \varphi(x)$, where, from the kernel chain rule, $\mathcal{C}_{YX}^\pi = \big(\mathcal{C}_{X \mid Y}\, \mathcal{C}_{YY}^\pi\big)^\top$, and $\mathcal{C}_{XX}^\pi$ is the prior-modified covariance operator of $X$ (itself computable via the sum and chain rules).
In practical implementations, the kernel Bayes' rule takes the form of a weighted sum over feature-mapped samples of $Y$, $\widehat{\mu}_{Y \mid x}^\pi = \sum_i \beta_i(x)\, \varphi(y_i)$,
where the weight vector $\boldsymbol{\beta}(x)$ is computed from Gram matrices of the training and prior samples using the empirical versions of the operators above. Two regularization parameters are needed: one for estimating the empirical conditional embedding operators, and a second for inverting the prior-modified covariance operator $\widehat{\mathcal{C}}_{XX}^\pi$; the latter regularization is performed on the square of the operator because the empirical estimate is not guaranteed to be positive definite.
Measuring distance between distributions
The maximum mean discrepancy (MMD) is a distance measure between distributions $P$ and $Q$, defined as the distance between their embeddings in the RKHS: $\operatorname{MMD}(P, Q) = \| \mu_P - \mu_Q \|_{\mathcal{H}}$.
While most distance measures between distributions, such as the widely used Kullback–Leibler divergence, either require density estimation (either parametrically or nonparametrically) or space partitioning/bias correction strategies, the MMD is easily estimated as an empirical mean which is concentrated around the true value of the MMD. The characterization of this distance as the maximum mean discrepancy refers to the fact that computing the MMD is equivalent to finding the RKHS function that maximizes the difference in expectations between the two probability distributions: $\operatorname{MMD}(P, Q) = \sup_{\|f\|_{\mathcal{H}} \le 1} \big( \mathbb{E}_{X \sim P}[f(X)] - \mathbb{E}_{Y \sim Q}[f(Y)] \big)$.
Kernel two sample test
Given $n$ training examples drawn from $P$ and $m$ samples drawn from $Q$, one can formulate a test statistic based on the empirical estimate of the MMD, $\widehat{\operatorname{MMD}}^2 = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n k(x_i, x_j) + \frac{1}{m^2} \sum_{i=1}^m \sum_{j=1}^m k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^n \sum_{j=1}^m k(x_i, y_j)$,
to obtain a two-sample test of the null hypothesis that both samples stem from the same distribution (i.e. $P = Q$) against the broad alternative $P \neq Q$.
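The biased empirical statistic above is just three Gram-matrix averages. The sketch below (illustrative only; the RBF kernel and all names are assumptions made here) computes it for two synthetic samples.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased empirical MMD^2 = mean(K_XX) + mean(K_YY) - 2 mean(K_XY)."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2 * rbf_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
Y_same = rng.normal(size=(400, 2))
Y_diff = rng.normal(loc=0.7, size=(400, 2))
print(mmd2_biased(X, Y_same))   # close to zero: samples from the same distribution
print(mmd2_biased(X, Y_diff))   # clearly positive: the distributions differ
```

In a full two-sample test, the null distribution of this statistic is typically approximated by recomputing it over permutations of the pooled sample.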
Density estimation via kernel embeddings
Although learning algorithms in the kernel embedding framework circumvent the need for intermediate density estimation, one may nonetheless use the empirical embedding to perform density estimation based on $n$ samples drawn from an underlying distribution $P_X^*$. This can be done by solving the optimization problem $\max_{P_X} H(P_X)$ subject to $\| \mu_{P_X} - \widehat{\mu}_X \|_{\mathcal{H}} \le \varepsilon$,
where $H$ is an entropy-like quantity and the maximization is done over the entire space of distributions on $\Omega$; the solution may be interpreted as a distribution which matches the empirical kernel mean of the samples while spreading the remaining probability mass as evenly as possible.
Measuring dependence of random variables
A measure of the statistical dependence between random variables $X$ and $Y$ (from any domains on which sensible kernels can be defined) can be formulated based on the Hilbert–Schmidt Independence Criterion, $\operatorname{HSIC}(X, Y) = \big\| \mathcal{C}_{XY} - \mu_X \otimes \mu_Y \big\|_{\mathcal{H} \otimes \mathcal{H}}^2$,
and can be used as a principled replacement for mutual information, Pearson correlation, or any other dependence measure used in learning algorithms. Most notably, HSIC can detect arbitrary dependencies (when a characteristic kernel is used in the embeddings, HSIC is zero if and only if the variables are independent), and can be used to measure dependence between different types of data (e.g. images and text captions). Given $n$ i.i.d. samples of each random variable, a simple parameter-free unbiased estimator of HSIC which exhibits concentration about the true value can be computed in time quadratic in the number of samples from the Gram matrices of the two samples.
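A widely used plug-in estimator (the biased one, up to the choice of normalization, rather than the unbiased estimator mentioned above) is $\operatorname{tr}(\mathbf{K}\mathbf{H}\mathbf{L}\mathbf{H})/n^2$, with $\mathbf{K}$, $\mathbf{L}$ the Gram matrices of the two samples and $\mathbf{H}$ the centering matrix. The sketch below (illustrative; kernel choices and names are assumptions made here) computes it.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_biased(X, Y, sigma=1.0):
    """Biased empirical HSIC = (1/n^2) * trace(K H L H), H = I - (1/n) 11^T."""
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    L = rbf_kernel(Y, Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
Y_dep = X**2 + 0.1 * rng.normal(size=(300, 1))   # nonlinear dependence, zero correlation
Y_ind = rng.normal(size=(300, 1))                 # independent of X
print(hsic_biased(X, Y_dep))   # noticeably larger
print(hsic_biased(X, Y_ind))   # close to zero
```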
Kernel belief propagation
Belief propagation is a fundamental algorithm for inference in graphical models in which nodes repeatedly pass and receive messages corresponding to the evaluation of conditional expectations. In the kernel embedding framework, the messages may be represented as RKHS functions and the conditional distribution embeddings can be applied to efficiently compute message updates. Given $n$ samples of the random variables represented by nodes in a Markov random field, the incoming message to node $t$ from node $u$ can be expressed as $m_{ut}(\cdot) = \sum_{i=1}^n \beta_{ut}^{i}\, \varphi(x_t^{i})$,
where $x_t^{1}, \ldots, x_t^{n}$ are the samples of the variable at node $t$ and $\beta_{ut} \in \mathbb{R}^n$ is a vector of weights, i.e. the message is assumed to lie in the RKHS associated with node $t$.
Thus, if the incoming messages to node $t$ are linear combinations of feature-mapped samples from $X_t$, then the outgoing message from $t$ to another node is again a linear combination of feature-mapped samples, and its weight vector can be computed by applying the empirical conditional embedding operator of the corresponding edge, requiring only Gram-matrix operations on the training samples.
Nonparametric filtering in hidden Markov models
In the hidden Markov model (HMM), two key quantities of interest are the transition probabilities between hidden states $P(S^t \mid S^{t-1})$ and the emission probabilities $P(O^t \mid S^t)$ for observations; in the kernel embedding framework, both can be represented nonparametrically by conditional embedding operators estimated from training trajectories in which the hidden states are observed.
One common use of HMMs is filtering, in which the goal is to estimate the posterior distribution over the hidden state $s^t$ at time step $t$ given a history of previous observations $h^t = (o^1, \ldots, o^t)$ from the system, i.e. $P(S^t \mid h^t)$. Filtering alternates a prediction step, which marginalizes out the previous hidden state through the transition model, and a conditioning step, which incorporates the newest observation via Bayes' rule. In the embedding framework, the belief state can be maintained recursively
by computing the embeddings of the prediction step via the kernel sum rule and the embedding of the conditioning step via kernel Bayes' rule. Assuming a training sample of state-observation pairs $(\tilde s^1, \tilde o^1), \ldots, (\tilde s^T, \tilde o^T)$ is available, the transition and emission conditional embedding operators can be estimated empirically,
and the embedding of the belief state is then maintained as a weighted sum of feature-mapped training states, where the weight vector at each time step is updated from the previous one purely through Gram-matrix operations.
Support measure machines
The support measure machine (SMM) is a generalization of the support vector machine (SVM) in which the training examples are probability distributions paired with labels, $\{(P_i, y_i)\}_{i=1}^{N}$. SMMs solve the standard SVM dual optimization problem using the expected kernel $K(P, Q) = \langle \mu_P, \mu_Q \rangle_{\mathcal{H}} = \mathbb{E}\big[k(x, z)\big]$ (with $x \sim P$ and $z \sim Q$ independently),
which is computable in closed form for many common specific distributions $P_i$ (such as Gaussians) combined with popular embedding kernels $k$ (e.g. the Gaussian kernel).
Under certain choices of the embedding kernel $k$, the SMM applied to training examples $\{(P_i, y_i)\}$ is equivalent to an SVM trained on samples $\{(x_i, y_i)\}$, and thus the SMM can be viewed as a flexible SVM in which a different data-dependent kernel (specified by the assumed form of the distribution $P_i$) may be placed on each training point.
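For instance, when the embedding kernel is Gaussian, $k(x, z) = \exp(-\|x - z\|^2 / (2\gamma^2))$, and the inputs are Gaussian distributions $P_i = \mathcal{N}(m_i, \Sigma_i)$, the expected kernel has the closed form $\gamma^{d}\, |\Sigma_i + \Sigma_j + \gamma^2 \mathbf{I}|^{-1/2} \exp\!\big(-\tfrac{1}{2}(m_i - m_j)^\top (\Sigma_i + \Sigma_j + \gamma^2 \mathbf{I})^{-1}(m_i - m_j)\big)$. The sketch below is an illustration worked out here (not taken from the source) and checks the formula against a Monte Carlo estimate.

```python
import numpy as np

def expected_gaussian_kernel(m1, S1, m2, S2, gamma=1.0):
    """Closed-form E_{x~N(m1,S1), z~N(m2,S2)}[exp(-||x - z||^2 / (2 gamma^2))]."""
    d = m1.shape[0]
    S = S1 + S2 + gamma**2 * np.eye(d)
    diff = m1 - m2
    quad = diff @ np.linalg.solve(S, diff)
    return gamma**d / np.sqrt(np.linalg.det(S)) * np.exp(-0.5 * quad)

# Sanity check against a Monte Carlo estimate of the expected kernel.
rng = np.random.default_rng(0)
m1, S1 = np.array([0.0, 0.0]), np.eye(2)
m2, S2 = np.array([1.0, -0.5]), 0.5 * np.eye(2)
x = rng.multivariate_normal(m1, S1, size=20000)
z = rng.multivariate_normal(m2, S2, size=20000)
mc = np.exp(-np.sum((x - z)**2, axis=1) / 2).mean()
print(expected_gaussian_kernel(m1, S1, m2, S2), mc)   # the two values should agree closely
```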
Domain adaptation under covariate, target, and conditional shift
The goal of domain adaptation is the formulation of learning algorithms which generalize well when the training and test data have different distributions. Given labeled training examples $\{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n}$ drawn from a distribution $P^{tr}(X, Y)$ and unlabeled test examples drawn from a possibly different distribution $P^{te}(X, Y)$, three types of differences between the two domains are commonly considered:
- Covariate shift, in which the marginal distribution of the covariates changes across domains: $P^{tr}(X) \neq P^{te}(X)$
- Target shift, in which the marginal distribution of the outputs changes across domains: $P^{tr}(Y) \neq P^{te}(Y)$
- Conditional shift, in which $P(Y)$ remains the same across domains, but the conditional distributions differ: $P^{tr}(X \mid Y) \neq P^{te}(X \mid Y)$. In general, the presence of conditional shift leads to an ill-posed problem, and the additional assumption that $P(X \mid Y)$ changes only under location-scale (LS) transformations on $X$ is commonly imposed to make the problem tractable.
By utilizing the kernel embedding of marginal and conditional distributions, practical approaches to deal with the presence of these types of differences between training and test domains can be formulated. Covariate shift may be accounted for by reweighting training examples via estimates of the ratio $P^{te}(x)/P^{tr}(x)$, which can be obtained directly from the kernel embeddings of the marginal distributions of $X$ in the two domains without explicitly estimating either density (see the sketch below).
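One simple instance of this reweighting idea is kernel mean matching, in which the weights are chosen so that the weighted empirical embedding of the training covariates matches the empirical embedding of the test covariates. The sketch below is a minimal, unconstrained least-squares version of that idea (the standard formulation additionally imposes nonnegativity and boundedness constraints via a quadratic program); the kernel, regularization value, and all names are assumptions made here.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_mean_matching_weights(X_tr, X_te, lam=1e-3):
    """Weights w making (1/n) sum_i w_i phi(x_i^tr) match the test mean embedding:
    ridge-regularized solve of K_tr w = (n/m) K_{tr,te} 1."""
    n, m = X_tr.shape[0], X_te.shape[0]
    K_tr = rbf_kernel(X_tr, X_tr)
    rhs = (n / m) * rbf_kernel(X_tr, X_te).sum(axis=1)
    return np.linalg.solve(K_tr + lam * n * np.eye(n), rhs)

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=0.0, size=(300, 1))     # training covariates
X_te = rng.normal(loc=1.0, size=(200, 1))     # shifted test covariates
w = kernel_mean_matching_weights(X_tr, X_te)
print(w[:5])   # weights tend to be larger where the test data are dense
```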
To deal with location-scale conditional shift, one can perform an LS transformation of the training points to obtain new transformed training data $\mathbf{X}^{new} = \mathbf{X}^{tr} \odot \mathbf{W} + \mathbf{B}$ (where $\odot$ denotes elementwise multiplication and $\mathbf{W}, \mathbf{B}$ collect sample-wise scale and offset parameters); the parameters are estimated by minimizing the empirical MMD between the embedding of the transformed training data and the embedding of the test data.
In general, the kernel embedding methods for dealing with LS conditional shift and target shift may be combined to find a reweighted transformation of the training data which mimics the test distribution, and these methods may perform well even in the presence of conditional shifts other than location-scale changes.
Domain generalization via invariant feature representation
Given $N$ sets of training examples sampled i.i.d. from distributions $P^{(1)}(X, Y), \ldots, P^{(N)}(X, Y)$, the goal of domain generalization is to formulate learning algorithms which generalize well on test examples drawn from a previously unseen domain $P^{*}(X, Y)$ from which no samples are available at training time. If the conditional distribution $P(Y \mid X)$ is assumed to be similar across domains, then a learner capable of domain generalization must estimate a functional relationship between the variables which is stable across domains. Based on kernel embeddings of these distributions, Domain Invariant Component Analysis (DICA) is a method which determines the transformation of the training data that minimizes the difference between marginal distributions $P(X)$ across domains while preserving the common conditional distribution shared between all training domains; DICA thus extracts invariant features that transfer across domains.
Defining a probability distribution over the mean embeddings $\mu_{X^{(1)}}, \ldots, \mu_{X^{(N)}}$ of the training domains, DICA measures dissimilarity between domains via their distributional variance, a quantity that can be computed from the $N \times N$ Gram matrix formed from the samples of the different domains,
so the invariant transformation is obtained by solving a generalized eigenvalue problem involving this Gram matrix, in close analogy to kernel principal component analysis.
Distribution regression
In distribution regression, the goal is to regress from probability distributions to reals (or vectors). Many important machine learning and statistical tasks fit into this framework, including multi-instance learning and point estimation problems without analytical solution (such as hyperparameter or entropy estimation). In practice only finite samples from the underlying distributions are observable, so the estimates have to rely on similarities computed between sets of points. Distribution regression has been successfully applied, for example, to supervised entropy learning and to aerosol prediction using multispectral satellite images.
Given training data $\{(\hat{X}_i, y_i)\}_{i=1}^{\ell}$, where the bag $\hat{X}_i = \{x_{i,1}, \ldots, x_{i,N_i}\}$ contains samples from a probability distribution $X_i$ and $y_i$ is the associated label, one can tackle the distribution regression task by taking the embeddings of the bags and learning a kernel ridge regressor from the embeddings to the labels, i.e. by minimizing $J(f) = \frac{1}{\ell} \sum_{i=1}^{\ell} \big[f(\mu_{\hat{X}_i}) - y_i\big]^2 + \lambda \|f\|_{\mathcal{H}(K)}^2$ over $f \in \mathcal{H}(K)$,
where $\mu_{\hat{X}_i} = \frac{1}{N_i} \sum_{n=1}^{N_i} k(\cdot, x_{i,n})$ is the empirical embedding of bag $i$ computed with a kernel $k$ on the sample domain, $K$ is a kernel on the embedded distributions, $\mathcal{H}(K)$ is the RKHS determined by $K$, and $\lambda > 0$ is a regularization parameter.
The prediction on a new distribution $\hat{X}$ takes the simple form $f(\mu_{\hat{X}}) = \mathbf{k}^\top (\mathbf{G} + \lambda \ell \mathbf{I})^{-1} \mathbf{y}$,
where $\mathbf{k} = \big[K(\mu_{\hat{X}}, \mu_{\hat{X}_i})\big]_{i=1}^{\ell}$, $\mathbf{G} = \big[K(\mu_{\hat{X}_i}, \mu_{\hat{X}_j})\big]_{i,j=1}^{\ell}$, and $\mathbf{y} = (y_1, \ldots, y_\ell)^\top$. Despite the two-stage sampling (only $N_i$ samples are observed from each $X_i$), this estimator can be shown to be consistent under mild regularity conditions.
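A minimal sketch of this two-stage procedure (mean embeddings followed by kernel ridge regression; the toy task, kernel choices, and all names are assumptions made here) follows: each bag is embedded with an RBF kernel, a linear kernel $K$ on the embeddings is computed via double sums of the base kernel, and the ridge solution above is applied.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def set_kernel(bag_a, bag_b, sigma=1.0):
    """Linear kernel on mean embeddings: <mu_a, mu_b> = mean over all pairs of k(x, x')."""
    return rbf_kernel(bag_a, bag_b, sigma).mean()

def fit_distribution_regression(bags, y, lam=1e-3):
    """Kernel ridge regression on mean embeddings: alpha = (G + lam*l*I)^{-1} y."""
    l = len(bags)
    G = np.array([[set_kernel(bi, bj) for bj in bags] for bi in bags])
    return np.linalg.solve(G + lam * l * np.eye(l), y)

def predict(bags, alpha, new_bag):
    k_new = np.array([set_kernel(new_bag, b) for b in bags])
    return k_new @ alpha

# Toy task: predict the mean of the distribution each bag was sampled from.
rng = np.random.default_rng(0)
means = rng.uniform(-2, 2, size=30)
bags = [rng.normal(loc=m, size=(50, 1)) for m in means]
alpha = fit_distribution_regression(bags, means)
test_bag = rng.normal(loc=1.3, size=(50, 1))
print(predict(bags, alpha, test_bag))   # should be roughly 1.3
```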
Example
In this simple example, which is taken from Song et al., $X$ and $Y$ are assumed to be discrete random variables which take values in the set $\{1, \ldots, K\}$, and the kernel is chosen to be the Kronecker delta function, $k(x, x') = \delta(x, x')$. The feature map corresponding to this kernel is the standard basis vector $\varphi(x) = \mathbf{e}_x$. The kernel embeddings of such distributions are thus vectors of marginal probabilities, while the embeddings of joint distributions are $K \times K$ matrices specifying joint probability tables; explicitly, $\mu_X = \big(P(X=1), \ldots, P(X=K)\big)^\top$ and $\mathcal{C}_{XY} = \big[P(X=s, Y=t)\big]_{s,t=1}^{K}$.
The conditional distribution embedding operator $\mathcal{C}_{Y \mid X} = \mathcal{C}_{YX}\, \mathcal{C}_{XX}^{-1}$ is, in this setting, a conditional probability table, $\mathcal{C}_{Y \mid X} = \big[P(Y=s \mid X=t)\big]_{s,t=1}^{K}$, with $\mathcal{C}_{XX} = \operatorname{diag}\big(P(X=1), \ldots, P(X=K)\big)$.
Thus, the embedding of the conditional distribution under a fixed value of $X$ is simply the corresponding column of the conditional probability table: $\mu_{Y \mid x} = \mathcal{C}_{Y \mid X}\, \varphi(x) = \big(P(Y=1 \mid X=x), \ldots, P(Y=K \mid X=x)\big)^\top$.
In this discrete-valued setting with the Kronecker delta kernel, the kernel sum rule becomes ordinary marginalization written in matrix-vector form: $\big(Q(X=1), \ldots, Q(X=K)\big)^\top = \mathcal{C}_{X \mid Y}\, \boldsymbol{\pi}$, where $\mathcal{C}_{X \mid Y} = \big[P(X=s \mid Y=t)\big]_{s,t=1}^{K}$ and $\boldsymbol{\pi} = \big(\pi(Y=1), \ldots, \pi(Y=K)\big)^\top$.
The kernel chain rule in this case is given by $\big[Q(X=s, Y=t)\big]_{s,t=1}^{K} = \mathcal{C}_{X \mid Y}\, \operatorname{diag}\big(\pi(Y=1), \ldots, \pi(Y=K)\big)$.
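Because every object in this discrete setting is an ordinary probability vector or table, the identities above can be checked directly with matrix arithmetic. The following sketch (an illustration constructed here, not taken from the source) builds a random joint probability table, forms the conditional embedding operator, and verifies the sum and chain rules.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4

# Joint probability table P(X = s, Y = t): the joint embedding under the delta kernel.
P_xy = rng.random((K, K))
P_xy /= P_xy.sum()

C_xx = np.diag(P_xy.sum(axis=1))              # diag of the marginal P(X)
C_yx = P_xy.T                                 # C_{YX} = [P(Y = s, X = t)]
C_y_given_x = C_yx @ np.linalg.inv(C_xx)      # conditional probability table P(Y = s | X = t)

# Embedding of P(Y | X = x): the x-th column of the conditional table.
x = 2
mu_y_given_x = C_y_given_x @ np.eye(K)[x]     # phi(x) is the standard basis vector e_x
print(np.allclose(mu_y_given_x, P_xy[x] / P_xy[x].sum()))   # True

# Kernel chain rule: Q(X, Y) = P(X | Y) pi(Y), i.e. C_{XY}^pi = C_{X|Y} C_{YY}^pi.
pi_y = rng.random(K); pi_y /= pi_y.sum()      # prior over Y
C_yy = np.diag(P_xy.sum(axis=0))              # diag of the marginal P(Y)
C_x_given_y = P_xy @ np.linalg.inv(C_yy)      # P(X = s | Y = t)
Q_xy = C_x_given_y @ np.diag(pi_y)            # prior-modified joint table Q(X = s, Y = t)

# Kernel sum rule: mu_X^pi = C_{X|Y} mu_Y^pi equals marginalizing Y out of Q(X, Y).
print(np.allclose(C_x_given_y @ pi_y, Q_xy.sum(axis=1)))   # True
```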