The sample mean or empirical mean and the sample covariance are statistics computed from a collection (the sample) of data on one or more random variables. The sample mean and sample covariance are estimators of the population mean and population covariance, where the term population refers to the set from which the sample was taken.
The sample mean is a vector each of whose elements is the sample mean of one of the random variables, that is, the arithmetic average of the observed values of that variable. The sample covariance matrix is a square matrix whose (i, j) element is the sample covariance (an estimate of the population covariance) between the sets of observed values of the i-th and j-th variables and whose (i, i) element is the sample variance of the observed values of the i-th variable. If only one variable has had values observed, then the sample mean is a single number (the arithmetic average of the observed values of that variable) and the sample covariance matrix is also a single value (a 1×1 matrix containing the sample variance of the observed values of that variable).
Due to their ease of calculation and other desirable characteristics, the sample mean and sample covariance are widely used in statistics and applications to numerically represent the location and dispersion, respectively, of a distribution.
Sample mean
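Let $x_{ij}$ be the $i$-th independently drawn observation ($i = 1, \ldots, N$) on the $j$-th random variable ($j = 1, \ldots, K$). These observations can be arranged into $N$ column vectors, each with $K$ entries, with the $K \times 1$ column vector giving the $i$-th observations of all variables denoted $\mathbf{x}_i$ ($i = 1, \ldots, N$).
The sample mean vector $\bar{\mathbf{x}}$ is a column vector whose $j$-th element $\bar{x}_j$ is the average value of the $N$ observations of the $j$-th variable:
$$\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}, \quad j = 1, \ldots, K.$$
Thus, the sample mean vector contains the average of the observations for each variable, and is written
$$\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i = \begin{bmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_K \end{bmatrix}.$$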
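As an illustration, the sample mean vector can be computed variable by variable. This is a minimal sketch in plain Python; the function name `sample_mean` and the data values are assumptions made for the example, not part of the source.

```python
def sample_mean(observations):
    """Sample mean vector of N observation vectors, each with K entries."""
    n = len(observations)
    k = len(observations[0])
    # Average the j-th entry over all N observations, for each variable j.
    return [sum(x[j] for x in observations) / n for j in range(k)]


# Hypothetical data: N = 4 observations on K = 2 variables.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean = sample_mean(data)
print(mean)  # [2.5, 5.0]
```

Each observation is stored as one inner list, so the outer list plays the role of the N observation vectors.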
Sample covariance
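The sample covariance matrix is a K-by-K matrix $\mathbf{Q} = [q_{jk}]$ with entries
$$q_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k),$$
where $q_{jk}$ is an estimate of the covariance between the $j$-th variable and the $k$-th variable of the population underlying the data. In terms of the observation vectors, the sample covariance matrix is
$$\mathbf{Q} = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{\mathsf{T}}.$$
Alternatively, the observation vectors can be arranged as the columns of a matrix,
$$\mathbf{F} = \begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_N \end{bmatrix},$$
which is a matrix of K rows and N columns. The sample covariance matrix can then be computed as
$$\mathbf{Q} = \frac{1}{N-1} \left(\mathbf{F} - \bar{\mathbf{x}}\,\mathbf{1}_N^{\mathsf{T}}\right)\left(\mathbf{F} - \bar{\mathbf{x}}\,\mathbf{1}_N^{\mathsf{T}}\right)^{\mathsf{T}},$$
where $\mathbf{1}_N$ is an N-by-1 vector of ones.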
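Like covariance matrices of random vectors, sample covariance matrices are positive semi-definite. To prove it, one may notice the similarity with the covariance matrix of a random vector chosen uniformly among the N observation vectors: that covariance matrix equals the sample covariance matrix multiplied by the positive factor (N − 1)/N, and every covariance matrix is positive semi-definite.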
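The sample covariance matrix and its positive semi-definiteness can be sketched in plain Python. The helper names (`sample_mean`, `sample_covariance`, `quadratic_form`) and the data are hypothetical, and the 1/(N − 1) normalization is assumed. Evaluating the quadratic form for a few vectors is only a spot check, not a proof.

```python
def sample_mean(observations):
    """Sample mean vector of N observation vectors, each with K entries."""
    n = len(observations)
    k = len(observations[0])
    return [sum(x[j] for x in observations) / n for j in range(k)]


def sample_covariance(observations):
    """K-by-K sample covariance matrix with the 1/(N - 1) normalization."""
    n = len(observations)
    k = len(observations[0])
    mean = sample_mean(observations)
    # Deviation of every observation vector from the sample mean vector.
    dev = [[x[j] - mean[j] for j in range(k)] for x in observations]
    return [
        [sum(d[j] * d[l] for d in dev) / (n - 1) for l in range(k)]
        for j in range(k)
    ]


def quadratic_form(matrix, v):
    """Compute v^T Q v, which is non-negative when Q is positive semi-definite."""
    k = len(v)
    return sum(v[j] * matrix[j][l] * v[l] for j in range(k) for l in range(k))


# Hypothetical data: N = 4 observations on K = 2 variables.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 7.0], [4.0, 7.0]]
Q = sample_covariance(data)
print(Q)  # diagonal entries are the sample variances of the two variables

# Spot check of positive semi-definiteness on a few test vectors.
checks = [quadratic_form(Q, v) for v in ([1.0, 0.0], [1.0, -1.0], [2.0, -1.0])]
print(checks)
```

The quadratic form equals the sum of squared projections of the deviations onto `v`, divided by N − 1, which is why it cannot be negative in exact arithmetic.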
Discussion
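The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random vector $\mathbf{X}$, whose $j$-th element $X_j$ ($j = 1, \ldots, K$) is one of the random variables. The sample covariance matrix has N − 1 rather than N in the denominator because of Bessel's correction: the deviations are measured from the sample mean, which is itself computed from the same observations. If the population mean $\operatorname{E}(\mathbf{X})$ is known, the analogous unbiased estimate
$$q_{jk} = \frac{1}{N} \sum_{i=1}^{N} \left(x_{ij} - \operatorname{E}(X_j)\right)\left(x_{ik} - \operatorname{E}(X_k)\right),$$
using the population mean, has N in the denominator.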
The maximum likelihood estimate of the covariance
$$q_{jk} = \frac{1}{N} \sum_{i=1}^{N} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$$
for the Gaussian distribution case has N in the denominator as well. The ratio of 1/N to 1/(N − 1) approaches 1 for large N, so the maximum likelihood estimate approximately equals the unbiased estimate when the sample is large.
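The effect of the two denominators can be illustrated for a single variable with Python's standard `statistics` module, where `pvariance` divides by N and `variance` divides by N − 1. The function name `variance_ratio` and the data are assumptions made for this sketch.

```python
import statistics


def variance_ratio(values):
    """Ratio of the 1/N estimate to the 1/(N - 1) estimate, i.e. (N - 1)/N."""
    return statistics.pvariance(values) / statistics.variance(values)


small = [2.0, 4.0, 4.0, 5.0]                 # N = 4, ratio is about 3/4
large = [float(i % 7) for i in range(7000)]  # N = 7000, ratio is close to 1

ratio_small = variance_ratio(small)
ratio_large = variance_ratio(large)
print(ratio_small, ratio_large)
```

For the small sample the two estimates differ by a quarter, while for the large sample the difference is below 0.02 percent.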
Variance of the sample mean
For each random variable, the sample mean is a good estimator of the population mean, where a "good" estimator is defined as being efficient and unbiased. The estimate will generally not equal the true value of the population mean, since different samples drawn from the same distribution give different sample means and hence different estimates of the true mean. Thus the sample mean is a random variable, not a constant, and consequently has its own distribution. For a random sample of N observations on the $j$-th random variable, the sample mean's distribution itself has mean equal to the population mean $\operatorname{E}(X_j)$ and variance equal to $\sigma_j^2 / N$, where $\sigma_j^2$ is the population variance of the $j$-th variable.
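A small simulation can make the sampling distribution of the sample mean concrete. The sketch below assumes a Uniform(0, 1) population (mean 1/2, variance 1/12) and relies on the standard result that the variance of the sample mean is the population variance divided by N. The function name, sample size, number of trials, and seed are arbitrary choices for illustration.

```python
import random
import statistics


def simulate_sample_means(n, trials, seed=0):
    """Draw `trials` samples of size n from Uniform(0, 1); return their sample means."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]


n = 25
means = simulate_sample_means(n, trials=4000)
population_variance = 1.0 / 12.0  # variance of Uniform(0, 1)

mean_of_means = statistics.fmean(means)         # should be close to 0.5
variance_of_means = statistics.variance(means)  # should be close to (1/12) / 25
print(mean_of_means, variance_of_means, population_variance / n)
```

The agreement is only approximate because the simulation uses a finite number of trials.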
Weighted samples
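In a weighted sample, each vector $\mathbf{x}_i$ (each set of single observations on each of the K random variables) is assigned a weight $w_i \geq 0$. Without loss of generality, assume that the weights are normalized:
$$\sum_{i=1}^{N} w_i = 1.$$
(If they are not, divide the weights by their sum.) Then the weighted mean vector $\bar{\mathbf{x}}$ is given by
$$\bar{\mathbf{x}} = \sum_{i=1}^{N} w_i \mathbf{x}_i,$$
and the elements $q_{jk}$ of the weighted covariance matrix $\mathbf{Q}$ are
$$q_{jk} = \frac{1}{1 - \sum_{i=1}^{N} w_i^2} \sum_{i=1}^{N} w_i (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k).$$
If all weights are the same, $w_i = 1/N$, the weighted mean and covariance reduce to the sample mean and sample covariance defined above.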
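The weighted estimates can be sketched as follows. The covariance is assumed to be scaled by 1/(1 − Σ wᵢ²), the normalization under which equal weights reproduce the 1/(N − 1) sample covariance. The function name and data are hypothetical.

```python
def weighted_mean_and_covariance(observations, weights):
    """Weighted mean vector and weighted covariance matrix of N observation vectors."""
    total = sum(weights)
    w = [wi / total for wi in weights]  # normalize so that the weights sum to 1
    k = len(observations[0])
    mean = [sum(wi * x[j] for wi, x in zip(w, observations)) for j in range(k)]
    # Assumed scale factor; it is undefined if a single weight equals 1.
    scale = 1.0 / (1.0 - sum(wi * wi for wi in w))
    cov = [
        [
            scale * sum(wi * (x[j] - mean[j]) * (x[l] - mean[l])
                        for wi, x in zip(w, observations))
            for l in range(k)
        ]
        for j in range(k)
    ]
    return mean, cov


# Hypothetical data: N = 4 observations on K = 2 variables.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 7.0], [4.0, 7.0]]

# Equal weights: the result matches the unweighted sample mean and covariance.
mean_eq, cov_eq = weighted_mean_and_covariance(data, [1, 1, 1, 1])
print(mean_eq, cov_eq)

# Unequal weights pull the mean toward the heavily weighted observations.
mean_w, cov_w = weighted_mean_and_covariance(data, [3, 1, 1, 1])
print(mean_w, cov_w)
```

Normalizing inside the function lets callers pass raw weights such as counts.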
Criticism
The sample mean and sample covariance are not robust statistics, meaning that they are sensitive to outliers. As robustness is often a desired trait, particularly in real-world applications, robust alternatives may prove desirable, notably quantile-based statistics such as the sample median for location and the interquartile range (IQR) for dispersion. Other alternatives include trimming and Winsorizing, as in the trimmed mean and the Winsorized mean.
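The sensitivity to outliers can be seen by replacing a single value in a small data set. This sketch uses Python's standard `statistics` module; the helpers `iqr` and `trimmed_mean` and the data are assumptions for illustration, and the quartiles follow the default "exclusive" method of `statistics.quantiles`.

```python
import statistics


def iqr(values):
    """Interquartile range from the quartiles returned by statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    return statistics.fmean(ordered[k:len(ordered) - k])


clean = [10, 11, 12, 13, 14, 15, 16, 17, 18]
contaminated = clean[:-1] + [1000]  # replace the largest value with an outlier

for name, values in (("clean", clean), ("contaminated", contaminated)):
    print(
        name,
        statistics.fmean(values),   # location: mean, moved strongly by the outlier
        statistics.median(values),  # location: median, unchanged
        statistics.stdev(values),   # dispersion: standard deviation, inflated
        iqr(values),                # dispersion: IQR, unchanged
        trimmed_mean(values, 1),    # location: trimmed mean, unchanged
    )
```

One outlier out of nine values moves the mean from 14 to above 120, while the median, IQR, and trimmed mean are unaffected in this example.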