![]() | ||
In statistics and mathematics, linear least squares is an approach fitting a mathematical or statistical model to data in cases where the idealized value provided by the model for any data point is expressed linearly in terms of the unknown parameters of the model. The resulting fitted model can be used to summarize the data, to predict unobserved values from the same system, and to understand the mechanisms that may underlie the system.
Contents
- Example
- Using a quadratic model
- The general problem
- MATLAB
- Python
- Derivation of the normal equations
- Derivation directly in terms of matrices
- Derivation without calculus
- Generalization for complex equations
- Computation
- Inverting the matrix of the normal equations
- Orthogonal decomposition methods
- Properties of the least squares estimators
- Limitations
- Weighted linear least squares
- Parameter errors and correlation
- Parameter confidence limits
- Residual values and correlation
- Objective function
- Constrained linear least squares
- Typical uses and applications
- Uses in data fitting
- Further discussion
- Rounding errors
- References
Mathematically, linear least squares is the problem of approximately solving an overdetermined system of linear equations, where the best approximation is defined as that which minimizes the sum of squared differences between the data values and their corresponding modeled values. The approach is called linear least squares since the assumed function is linear in the parameters to be estimated. Linear least squares problems are convex and have a closed-form solution that is unique, provided that the number of data points used for fitting equals or exceeds the number of unknown parameters, except in special degenerate situations. In contrast, non-linear least squares problems generally must be solved by an iterative procedure, and the problems can be non-convex with multiple optima for the objective function. If prior distributions are available, then even an underdetermined system can be solved using the Bayesian MMSE estimator.
In statistics, linear least squares problems correspond to a particularly important type of statistical model called linear regression which arises as a particular form of regression analysis. One basic form of such a model is an ordinary least squares model. The present article concentrates on the mathematical aspects of linear least squares problems, with discussion of the formulation and interpretation of statistical regression models and statistical inferences related to these being dealt with in the articles just mentioned. See outline of regression analysis for an outline of the topic.
Example
As a result of an experiment, four
of four equations in two unknowns in some "best" sense.
The "error", at each point, between the curve fit and the data is the difference between the right- and left-hand sides of the equations above. The least squares approach to solving this problem is to try to make the sum of the squares of these errors as small as possible; that is, to find the minimum of the function
The minimum is determined by calculating the partial derivatives of
This results in a system of two equations in two unknowns, called the normal equations, which when solved give
and the equation
More generally, one can have
Using a quadratic model
Importantly, in "linear least squares", we are not restricted to using a line as the model as in the above example. For instance, we could have chosen the restricted quadratic model
The partial derivatives with respect to the parameters (this time there is only one) are again computed and set to 0:
and solved
leading to the resulting best fit model
The general problem
Consider an overdetermined system
of m linear equations in n unknown coefficients, β1,β2,…,βn, with m > n. (Note: for a linear model as above, not all of
where
Such a system usually has no solution, so the goal is instead to find the coefficients
where the objective function S is given by
A justification for choosing this criterion is given in properties below. This minimization problem has a unique solution, provided that the n columns of the matrix
The matrix
MATLAB
The following MATLAB code shows implementation of this approach on the data used in the first example above.
Python
Python code using the same variable naming as the MATLAB code above:
Derivation of the normal equations
Define the
Then
Given that S is convex, it is minimized when its gradient vector is zero (This follows by definition: if the gradient vector is not zero, there is a direction in which we can move to minimize it further - see maxima and minima.) The elements of the gradient vector are the partial derivatives of S with respect to the parameters:
The derivatives are
Substitution of the expressions for the residuals and the derivatives into the gradient equations gives
Thus if
Upon rearrangement, we obtain the normal equations:
The normal equations are written in matrix notation as
The solution of the normal equations yields the vector
Derivation directly in terms of matrices
The normal equations can be derived directly from a matrix representation of the problem as follows. The objective is to minimize
Note that :
Differentiating this with respect to
which is equivalent to the above-given normal equations. A sufficient condition for satisfaction of the second-order conditions for a minimum is that
Derivation without calculus
When
can be written as
where
It follows that
and therefore minimized exactly when
Generalization for complex equations
In general, the coefficients of the matrices
where
We should now take derivatives of
and the derivatives changes into
After rewriting
which, after adding it together and comparing to zero ( minimalization condition for
In matrix form:
Computation
A general approach to the least squares problem
simply because
is exactly a sought for orthogonal projection of
Inverting the matrix of the normal equations
The algebraic solution of the normal equations can be written as
where X+ is the Moore–Penrose pseudoinverse of X. Although this equation is correct and can work in many applications, it is not computationally efficient to invert the normal-equations matrix (the Gramian matrix). An exception occurs in numerical smoothing and differentiation where an analytical expression is required.
If the matrix XTX is well-conditioned and positive definite, implying that it has full rank, the normal equations can be solved directly by using the Cholesky decomposition RTR, where R is an upper triangular matrix, giving:
The solution is obtained in two stages, a forward substitution step, solving for z:
followed by a backward substitution, solving for
Both substitutions are facilitated by the triangular nature of R.
See example of linear regression for a worked-out numerical example with three parameters.
Orthogonal decomposition methods
Orthogonal decomposition methods of solving the least squares problem are slower than the normal equations method but are more numerically stable because they avoid forming the product XTX.
The residuals are written in matrix notation as
The matrix X is subjected to an orthogonal decomposition, e.g., the QR decomposition as follows.
where Q is an m×m orthogonal matrix (QTQ=I) and R is an n×n upper triangular matrix with
The residual vector is left-multiplied by QT.
Because Q is orthogonal, the sum of squares of the residuals, s, may be written as:
Since v doesn't depend on β, the minimum value of s is attained when the upper block, u, is zero. Therefore the parameters are found by solving:
These equations are easily solved as R is upper triangular.
An alternative decomposition of X is the singular value decomposition (SVD)
where U is m by m orthogonal matrix, V is n by n orthogonal matrix and
where P is obtained from
and thus,
is a solution of a least squares problem. This method is the most computationally intensive, but is particularly useful if the normal equations matrix, XTX, is very ill-conditioned (i.e. if its condition number multiplied by the machine's relative round-off error is appreciably large). In that case, including the smallest singular values in the inversion merely adds numerical noise to the solution. This can be cured with the truncated SVD approach, giving a more stable and exact answer, by explicitly setting to zero all singular values below a certain threshold and so ignoring them, a process closely related to factor analysis.
Properties of the least-squares estimators
The gradient equations at the minimum can be written as
A geometrical interpretation of these equations is that the vector of residuals,
Introducing
The equation and solution of linear least squares are thus described as follows:
If the experimental errors,
For example, it is easy to show that the arithmetic mean of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss-Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be.
However, in the case that the experimental errors do belong to a normal distribution, the least-squares estimator is also a maximum likelihood estimator.
These properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid.
Limitations
An assumption underlying the treatment given above is that the independent variable, x, is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case, total least squares or more generally errors-in-variables models, or rigorous least squares, should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependent and independent variables and then following the standard procedure.
In some cases the (weighted) normal equations matrix XTX is ill-conditioned. When fitting polynomials the normal equations matrix is a Vandermonde matrix. Vandermonde matrices become increasingly ill-conditioned as the order of the matrix increases. In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate. Various regularization techniques can be applied in such cases, the most common of which is called ridge regression. If further information about the parameters is known, for example, a range of possible values of
Another drawback of the least squares estimator is the fact that the norm of the residuals,
Weighted linear least squares
In some cases the observations may be weighted—for example, they may not be equally reliable. In this case, one can minimize the weighted sum of squares:
where wi > 0 is the weight of the ith observation, and W is the diagonal matrix of such weights.
The weights should, ideally, be equal to the reciprocal of the variance of the measurement. The normal equations are then:
This method is used in iteratively reweighted least squares.
Parameter errors and correlation
The estimated parameter values are linear combinations of the observed values
Therefore an expression for the residuals (i.e., the estimated errors in the parameters) can be obtained by error propagation from the errors in the observations. Let the variance-covariance matrix for the observations be denoted by M and that of the parameters by Mβ. Then
When W = M−1, this simplifies to
When unit weights are used (W = I, the identity matrix), it is implied that the experimental errors are uncorrelated and all equal: M = σ2I, where σ2 is the a priori variance of an observation. In any case, σ2 is approximated by the reduced chi-squared
where S is the minimum value of the (weighted) objective function:
The denominator,
In all cases, the variance of the parameter
Parameter confidence limits
It is often assumed, for want of any concrete evidence but often appealing to the central limit theorem—see Normal distribution#Occurrence—that the error on each observation belongs to a normal distribution with a mean of zero and standard deviation
The assumption is not unreasonable when m >> n. If the experimental errors are normally distributed the parameters will belong to a Student's t-distribution with m − n degrees of freedom. When m >> n Student's t-distribution approximates a normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error.
When the number of observations is relatively small, Chebychev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2 or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.
Residual values and correlation
The residuals are related to the observations by
where H is the idempotent matrix known as the hat matrix:
and I is the identity matrix. The variance-covariance matrix of the residuals, Mr is given by
Thus the residuals are correlated, even if the observations are not.
When
The sum of residual values is equal to zero whenever the model function contains a constant term. Left-multiply the expression for the residuals by XT:
Say, for example, that the first term of the model is a constant, so that
Thus, in the motivational example, above, the fact that the sum of residual values is equal to zero it is not accidental but is a consequence of the presence of the constant term, α, in the model.
If experimental error follows a normal distribution, then, because of the linear relationship between residuals and observations, so should residuals, but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large.
Objective function
The optimal value of the objective function, found by substituting in the optimal expression for the coefficient vector, can be written as (assuming unweighted observations)
the latter equality holding, since (I − H) is symmetric and idempotent. It can be shown from this that under an appropriate assignment of weights the expected value of S is m − n. If instead unit weights are assumed, the expected value of S is
If it is assumed that the residuals belong to a normal distribution, the objective function, being a sum of weighted squared residuals, will belong to a chi-squared (
These values can be used for a statistical criterion as to the goodness of fit. When unit weights are used, the numbers should be divided by the variance of an observation.
Constrained linear least squares
Often it is of interest to solve a linear least squares problem with an additional constraint on the solution. With constrained linear least squares, the original equation
must be fit as closely as possible (in the least squares sense) while ensuring that some other property of
When the constraint only applies to some of the variables, the mixed problem may be solved using separable least squares by letting
back into the original expression gives (following some rearrangement) an equation that can be solved as a purely constrained problem in
where
Typical uses and applications
Uses in data fitting
The primary application of linear least squares is in data fitting. Given a set of m data points
Here, the functions
Ideally, the model function fits the data exactly, so
for all
so to minimize the function
After substituting for
and the best fit can be found by solving the normal equations.
Further discussion
The numerical methods for linear least squares are important because linear regression models are among the most important types of model, both as formal statistical models and for exploration of data-sets. The majority of statistical computer packages contain facilities for regression analysis that make use of linear least squares computations. Hence it is appropriate that considerable effort has been devoted to the task of ensuring that these computations are undertaken efficiently and with due regard to round-off error.
Individual statistical analyses are seldom undertaken in isolation, but rather are part of a sequence of investigatory steps. Some of the topics involved in considering numerical methods for linear least squares relate to this point. Thus important topics can be
Fitting of linear models by least squares often, but not always, arise in the context of statistical analysis. It can therefore be important that considerations of computation efficiency for such problems extend to all of the auxiliary quantities required for such analyses, and are not restricted to the formal solution of the linear least squares problem.
Rounding errors
Matrix calculations, like any other, are affected by rounding errors. An early summary of these effects, regarding the choice of computation methods for matrix inversion, was provided by Wilkinson.