The purpose of this page is to provide supplementary materials for the ordinary least squares article, reducing the load of the main article with mathematics and improving its accessibility, while at the same time retaining the completeness of exposition.
Contents
Least squares estimator for β = b
Using matrix notation, the sum of squared residuals is given by
Where
Since this is a quadratic expression, the vector which gives the global minimum may be found via matrix calculus by differentiating with respect to the vector b (using denominator layout) and setting equal to zero:
By assumption matrix X has full column rank, and therefore X'X is invertible and the least squares estimator for β is given by
Unbiasedness and variance of β ^ {displaystyle {hat {eta }}}
Plug y = Xβ + ε into the formula for
where E[ε|X] = 0 by assumptions of the model.
For the variance, let
where we used the fact that
For a simple linear regression model, where
Expected value of σ ^ 2 {displaystyle {hat {sigma }}^{2}}
First we will plug in the expression for y into the estimator, and use the fact that X'M = MX = 0 (matrix M projects onto the space orthogonal to X):
Now we can recognize ε'Mε as a 1×1 matrix, such matrix is equal to its own trace. This is useful because by properties of trace operator, tr(AB)=tr(BA), and we can use this to separate disturbance ε from matrix M which is a function of regressors X:
Using the Law of iterated expectation this can be written as
Recall that M = I − P where P is the projection onto linear space spanned by columns of matrix X. By properties of a projection matrix, it has p = rank(X) eigenvalues equal to 1, and all other eigenvalues are equal to 0. Trace of a matrix is equal to the sum of its characteristic values, thus tr(P)=p, and tr(M) = n − p. Therefore,
Note: in the later section “Maximum likelihood” we show that under the additional assumption that errors are distributed normally, the estimator
Consistency and asymptotic normality of β ^ {displaystyle {hat {eta }}}
Estimator
We can use the law of large numbers to establish that
By Slutsky's theorem and continuous mapping theorem these results can be combined to establish consistency of estimator
The central limit theorem tells us that
Applying Slutsky's theorem again we'll have
Maximum likelihood approach
Maximum likelihood estimation is a generic technique for estimating the unknown parameters in a statistical model by constructing a log-likelihood function corresponding to the joint distribution of the data, then maximizing this function over all possible parameter values. In order to apply this method, we have to make an assumption about the distribution of y given X so that the log-likelihood function can be constructed. The connection of maximum likelihood estimation to OLS arises when this distribution is modeled as a multivariate normal.
Specifically, assume that the errors ε have multivariate normal distribution with mean 0 and variance matrix σ2I. Then the distribution of y conditionally on X is
and the log-likelihood function of the data will be
Differentiating this expression with respect to β and σ2 we'll find the ML estimates of these parameters:
We can check that this is indeed a maximum by looking at the Hessian matrix of the log-likelihood function.
Finite-sample distribution
Since we have assumed in this section that the distribution of error terms is known to be normal, it becomes possible to derive the explicit expressions for the distributions of estimators
so that by the affine transformation properties of multivariate normal distribution
Similarly the distribution of
where
Moreover, the estimators