In statistics, explained variation measures the proportion to which a mathematical model accounts for the variation (dispersion) of a given data set. Often, variation is quantified as variance; then, the more specific term explained variance can be used.
The complementary part of the total variation is called unexplained or residual.
Information gain by better modelling
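Following Kent (1983), we use the Fraser information (Fraser 1965)
$$F(\theta) = \int g(r)\,\ln f(r;\theta)\,\mathrm{d}r,$$
where $g(r)$ is the probability density of a random variable $R$, and $f(r;\theta)$ with $\theta \in \Theta_i$ ($i = 0, 1$) are two families of parametric models. Model family 0 is the simpler one, with a restricted parameter space $\Theta_0 \subset \Theta_1$.
Parameters are determined by maximum likelihood estimation,
$$\theta_i = \arg\max_{\theta \in \Theta_i} F(\theta).$$
The information gain of model 1 over model 0 is written as
$$\Gamma(\theta_1 : \theta_0) = 2\,[F(\theta_1) - F(\theta_0)],$$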
where a factor of 2 is included for convenience. Γ is always nonnegative; it measures the extent to which the best model of family 1 is better than the best model of family 0 in explaining g(r).
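As a rough numerical illustration, the sketch below estimates the information gain from a sample, taking the Fraser information of a model to be the expected log-density of that model under the data distribution and approximating the expectation by a sample mean. The two nested normal families (mean fixed at zero versus free mean) and all function names are assumptions chosen for this example, not part of Kent's treatment.

```python
import numpy as np

def normal_logpdf(r, mean, var):
    """Log-density of a normal distribution with the given mean and variance."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (r - mean) ** 2 / var

def fraser_information(sample, log_density):
    """Sample estimate of F(theta) = E_g[ln f(R; theta)]."""
    return float(np.mean(log_density(sample)))

def information_gain(sample):
    # Family 0 (restricted): mean fixed at 0, variance free.
    var0 = np.mean(sample ** 2)            # ML variance when the mean is 0
    # Family 1 (larger): mean and variance both free.
    mean1 = np.mean(sample)
    var1 = np.mean((sample - mean1) ** 2)  # ML variance (divides by n)
    f0 = fraser_information(sample, lambda r: normal_logpdf(r, 0.0, var0))
    f1 = fraser_information(sample, lambda r: normal_logpdf(r, mean1, var1))
    return 2.0 * (f1 - f0), var0, var1

rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=1.0, size=10_000)
gamma, var0, var1 = information_gain(sample)
print(f"information gain: {gamma:.3f}")
```

For these two families the gain reduces to ln(var0 / var1), which cannot be negative because the restricted family can never fit better than the family that contains it.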
Information gain by a conditional model
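Assume a two-dimensional random variable $R = (X, Y)$, where $X$ is treated as an explanatory variable and $Y$ as a dependent variable. Models of family 1 "explain" $Y$ in terms of $X$,
$$f(y \mid x; \theta),$$
whereas in family 0, X and Y are assumed to be independent. We define the randomness of Y by $D(Y) = \exp[-2F(\theta_0)]$, and the randomness of Y, given X, by $D(Y \mid X) = \exp[-2F(\theta_1)]$. Then,
$$\rho_C^2 = 1 - \frac{D(Y \mid X)}{D(Y)}$$
can be interpreted as the proportion of the data dispersion which is "explained" by X.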
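The following sketch computes such a proportion for simulated data. It assumes the randomness of Y is defined as exp(-2F) evaluated at the best model of each family, with F the Fraser information estimated by a sample mean of log-densities. The normal model families, the simulated coefficients, and the helper name `mean_normal_loglik` are illustrative assumptions.

```python
import numpy as np

def mean_normal_loglik(resid):
    """Fraser information estimate for a normal model at its ML variance."""
    var = np.mean(resid ** 2)
    logpdf = -0.5 * np.log(2.0 * np.pi * var) - 0.5 * resid ** 2 / var
    return float(np.mean(logpdf))

rng = np.random.default_rng(1)
x = rng.normal(size=5_000)
y = 2.0 + 0.8 * x + rng.normal(scale=0.6, size=x.size)

# Family 0: Y independent of X, modelled as normal around its own mean.
f0 = mean_normal_loglik(y - y.mean())

# Family 1: Y normal around a straight line in x (least squares = ML here).
slope, intercept = np.polyfit(x, y, deg=1)
f1 = mean_normal_loglik(y - (intercept + slope * x))

d_y = np.exp(-2.0 * f0)          # randomness of Y
d_y_given_x = np.exp(-2.0 * f1)  # randomness of Y given X
rho_c_squared = 1.0 - d_y_given_x / d_y
print(f"explained proportion: {rho_c_squared:.3f}")
```

Under these assumptions the result coincides with the squared sample correlation between x and y, which is the special case discussed for the linear normal model.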
Special cases and generalized usage
For special models, the above definition yields particularly appealing results. Regrettably, these simplified definitions of explained variance are used even in situations where the underlying assumptions do not hold.
Linear regression
The fraction of variance unexplained is an established concept in the context of linear regression. The usual definition of the coefficient of determination is based on the fundamental concept of explained variance.
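A minimal sketch of the usual sums-of-squares computation is given below: the fraction of variance unexplained (FVU) is the residual sum of squares divided by the total sum of squares, and the coefficient of determination is its complement. The simulated data, coefficients, and function name are hypothetical and serve only to make the example runnable.

```python
import numpy as np

def fraction_unexplained(y, y_hat):
    """FVU = residual sum of squares / total sum of squares."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(ss_res / ss_tot)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([1.5, -0.5]) + rng.normal(scale=1.0, size=200)

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ coef

fvu = fraction_unexplained(y, y_hat)
r_squared = 1.0 - fvu
print(f"FVU = {fvu:.3f}, R^2 = {r_squared:.3f}")
```

For a least-squares fit with an intercept, this R^2 also equals the squared correlation between the observed and fitted values.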
Correlation coefficient as measure of explained variance
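Let X be a random vector, and Y a random variable that is modeled by a normal distribution with centre $\mu + \Psi^{\mathsf{T}} X$. In this case, the above-derived proportion of explained variation equals the squared correlation coefficient $R^2$.
Note the strong model assumptions: the centre of the Y distribution must be a linear function of X, and for any given x, the Y distribution must be normal. In other situations, it is generally not justified to interpret $R^2$ as the proportion of explained variance.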
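The link between the normal linear model and the squared correlation coefficient can be sketched as a short derivation. It assumes the randomness measures are defined as $D(Y) = \exp[-2F(\theta_0)]$ and $D(Y \mid X) = \exp[-2F(\theta_1)]$, with $F$ the Fraser information at the maximum-likelihood parameters. The symbols $\sigma_Y^2$ (variance of Y) and $\sigma_{Y \mid X}^2$ (residual variance around the linear centre) are notation introduced here for illustration.

```latex
\begin{align}
F(\theta_0) &= -\tfrac{1}{2}\ln\!\left(2\pi\sigma_Y^2\right) - \tfrac{1}{2},
  & D(Y) &= \exp[-2F(\theta_0)] = 2\pi e\,\sigma_Y^2, \\
F(\theta_1) &= -\tfrac{1}{2}\ln\!\left(2\pi\sigma_{Y\mid X}^2\right) - \tfrac{1}{2},
  & D(Y\mid X) &= \exp[-2F(\theta_1)] = 2\pi e\,\sigma_{Y\mid X}^2, \\
1 - \frac{D(Y\mid X)}{D(Y)} &= 1 - \frac{\sigma_{Y\mid X}^2}{\sigma_Y^2} = R^2 .
\end{align}
```

The common factor $2\pi e$ cancels in the ratio, so only the two variances matter. That cancellation is specific to the normal family, which is why the interpretation does not carry over to other models.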
In principal component analysis
Explained variance is routinely used in principal component analysis. The relation to the Fraser–Kent information gain remains to be clarified.
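In practice, the explained variance of a principal component is commonly reported as the corresponding eigenvalue of the sample covariance matrix divided by the sum of all eigenvalues. The sketch below follows that convention; the data and the function name are hypothetical.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of the total variance carried by each principal component."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order; reverse to descending.
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]
    return eigenvalues / eigenvalues.sum()

rng = np.random.default_rng(3)
# Hypothetical data: three independent features with very different spreads.
data = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.2])
ratios = explained_variance_ratio(data)
print(np.round(ratios, 3))
```

The ratios sum to one by construction, so each entry can be read as the share of total variance attributed to that component.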
Criticism
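As the fraction of "explained variance" equals the squared correlation coefficient $R^2$, it shares all the disadvantages of the latter: it reflects not only the quality of the regression, but also the distribution of the independent (conditioning) variables.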
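One way to see the dependence on the distribution of the explanatory variable is a small simulation in which the regression line and the noise level are held fixed while only the spread of x changes. The sample size, slope, and standard deviations below are arbitrary assumptions made for illustration.

```python
import numpy as np

def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)

rng = np.random.default_rng(4)
n = 20_000
noise = rng.normal(scale=1.0, size=n)

# Same regression line (slope 1) and same noise; only the spread of x differs.
x_narrow = rng.normal(scale=1.0, size=n)
x_wide = rng.normal(scale=5.0, size=n)

r2_narrow = r_squared(x_narrow, x_narrow + noise)
r2_wide = r_squared(x_wide, x_wide + noise)
print(f"narrow x: {r2_narrow:.3f}, wide x: {r2_wide:.3f}")
```

With unit slope and unit noise variance, the population value is var(x) / (var(x) + 1), that is 0.5 for the narrow design and 25/26 for the wide one, even though the relationship between x and y is identical in both cases.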
In the words of one critic: "Thus