In statistics, the Gauss–Markov theorem (or simply Gauss theorem for some authors)[1] states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero.[2] The errors do not need to be normal for the theorem to apply, nor do they need to be independent and identically distributed (only uncorrelated with mean zero and homoscedastic with finite variance).
The requirement for unbiasedness cannot be dropped, since biased estimators exist with lower variance and lower mean squared error. For example, the James–Stein estimator (which also drops linearity) and ridge regression typically outperform ordinary least squares. In fact, ordinary least squares is rarely even an admissible estimator, as Stein's phenomenon shows: when estimating more than two unknown parameters, ordinary least squares always performs worse (in total mean squared error) than the James–Stein estimator.
Moreover, the Gauss–Markov theorem does not apply when considering more principled loss functions, such as the likelihood or the Kullback–Leibler divergence, except in the limited case of normally distributed errors.
As a result of these discoveries, statisticians typically motivate ordinary least squares by the principle of maximum likelihood instead, or by considering it as a kind of approximate Bayesian inference.
The theorem is named after Carl Friedrich Gauss and Andrey Markov. Gauss provided the original proof,[3] which was later substantially generalized by Markov.[4]
Suppose we are given two random variable vectors, $X, Y \in \mathbb{R}^k$.
Suppose we want to find the best linear estimator of $Y$ given $X$, such that
$$\hat{Y} = \alpha X + \mu,$$
- where $\hat{Y}$ would be the estimator,
- and $\alpha, \mu \in \mathbb{R}$ parameters such that it would be the best linear estimator of $Y$.
Such an estimator would have the same first two moments as $Y$: $\mu_{\hat{Y}} = \mu_Y$ and $\sigma_{\hat{Y}} = \sigma_Y$.
Therefore, if the vector $X$ has mean $\mu_X$ and standard deviation $\sigma_X$, the best linear estimator would be
$$\hat{Y} = \sigma_Y \frac{X - \mu_X}{\sigma_X} + \mu_Y,$$
since it has the same mean and variance as $Y$.
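As a numerical check of this formula, the following minimal sketch (Python with NumPy; the distributions and moment values are hypothetical) standardizes a sample of $X$ and rescales it to the moments of $Y$:

```python
# Minimal sketch: verify that the scalar best linear estimator reproduces
# the mean and standard deviation of Y (hypothetical example data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=100_000)
Y = rng.normal(loc=-1.0, scale=0.5, size=100_000)

mu_x, sigma_x = X.mean(), X.std()
mu_y, sigma_y = Y.mean(), Y.std()

# Best linear estimator from the formula above.
Y_hat = sigma_y * (X - mu_x) / sigma_x + mu_y

print(Y_hat.mean(), mu_y)    # both ~ -1.0
print(Y_hat.std(), sigma_y)  # both ~ 0.5
```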
Suppose we have, in matrix notation, the linear relationship
$$y = X \beta + \varepsilon, \qquad y, \varepsilon \in \mathbb{R}^n, \quad \beta \in \mathbb{R}^K, \quad X \in \mathbb{R}^{n \times K},$$
expanding to,
$$y_i = \sum_{j=1}^{K} \beta_j X_{ij} + \varepsilon_i \qquad \forall\, i = 1, 2, \ldots, n,$$
where $\beta_j$ are non-random but unobservable parameters, $X_{ij}$ are non-random and observable (called the "explanatory variables"), $\varepsilon_i$ are random, and so $y_i$ are random. The random variables $\varepsilon_i$ are called the "disturbance", "noise" or simply "error" (to be contrasted with "residual" later in the article; see errors and residuals in statistics). Note that to include a constant in the model above, one can choose to introduce the constant as a variable $\beta_{K+1}$, with a newly introduced last column of $X$ being unity, i.e., $X_{i(K+1)} = 1$ for all $i$. Note that although the sample responses $y_i$ are observable, the statements and arguments below, including assumptions and proofs, assume only knowledge of the $X_{ij}$, but not of the $y_i$.
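As a small illustration of the constant-term convention just described (a sketch in Python with NumPy; the data values are hypothetical), the design matrix can be built with a final column of ones:

```python
# Sketch: design matrix with the constant introduced as a last column of ones.
import numpy as np

x1 = np.array([0.5, 1.2, 2.3, 3.1])  # first explanatory variable
x2 = np.array([1.0, 0.7, 0.2, 0.9])  # second explanatory variable

# The column of ones is the regressor for the constant term.
X = np.column_stack([x1, x2, np.ones_like(x1)])
print(X.shape)  # (4, 3): n = 4 observations, K = 3 columns including the constant
```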
The Gauss–Markov assumptions concern the set of error random variables, $\varepsilon_i$:
- They have mean zero: $\operatorname{E}[\varepsilon_i] = 0.$
- They are homoscedastic, that is, all have the same finite variance: $\operatorname{Var}(\varepsilon_i) = \sigma^2 < \infty$ for all $i$; and
- Distinct error terms are uncorrelated: $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j.$
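These assumptions can be illustrated with a simulation (a minimal sketch in Python with NumPy; the centered uniform distribution is just one convenient non-normal choice satisfying all three conditions):

```python
# Sketch: i.i.d. centered uniform draws are mean-zero, homoscedastic, and
# uncorrelated without being normal.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
eps = rng.uniform(-1.0, 1.0, size=n)  # mean 0, variance 1/3, non-normal

print(eps.mean())                            # ~ 0 (mean zero)
print(eps.var())                             # ~ 1/3 for all i (homoscedastic)
print(np.corrcoef(eps[:-1], eps[1:])[0, 1])  # ~ 0 (uncorrelated draws)
```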
A linear estimator of $\beta_j$ is a linear combination
$$\widehat{\beta}_j = c_{1j} y_1 + \cdots + c_{nj} y_n$$
in which the coefficients $c_{ij}$ are not allowed to depend on the underlying coefficients $\beta_j$, since those are not observable, but are allowed to depend on the values $X_{ij}$, since these data are observable. (The dependence of the coefficients on each $X_{ij}$ is typically nonlinear; the estimator is linear in each $y_i$ and hence in each random $\varepsilon_i$, which is why this is "linear" regression.) The estimator is said to be unbiased if and only if
$$\operatorname{E}\left[\widehat{\beta}_j\right] = \beta_j$$
regardless of the values of $X_{ij}$. Now, let $\sum_{j=1}^{K} \lambda_j \beta_j$ be some linear combination of the coefficients. Then the mean squared error of the corresponding estimation is
$$\operatorname{E}\left[\left(\sum_{j=1}^{K} \lambda_j \left(\widehat{\beta}_j - \beta_j\right)\right)^2\right];$$
in other words, it is the expectation of the square of the weighted sum (across parameters) of the differences between the estimators and the corresponding parameters to be estimated. (Since we are considering the case in which all the parameter estimates are unbiased, this mean squared error is the same as the variance of the linear combination.) The best linear unbiased estimator (BLUE) of the vector $\beta$ of parameters $\beta_j$ is one with the smallest mean squared error for every vector $\lambda$ of linear combination parameters. This is equivalent to the condition that
$$\operatorname{Var}\left(\widetilde{\beta}\right) - \operatorname{Var}\left(\widehat{\beta}\right)$$
is a positive semi-definite matrix for every other linear unbiased estimator $\widetilde{\beta}$.
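The positive semi-definiteness condition can be checked numerically, as in the following minimal sketch (Python with NumPy, assuming $\operatorname{Var}(\varepsilon) = \sigma^2 I$ with $\sigma^2 = 1$; the matrix $D$ with $DX = 0$ is an illustrative way to construct an arbitrary competing linear unbiased estimator, anticipating the proof idea below):

```python
# Sketch: Var(beta_tilde) - Var(beta_hat) should be positive semi-definite
# for any linear unbiased estimator beta_tilde = C y with C = C0 + D, D X = 0.
import numpy as np

rng = np.random.default_rng(2)
n, K = 50, 3
X = rng.normal(size=(n, K))

C0 = np.linalg.solve(X.T @ X, X.T)             # OLS coefficients (X'X)^{-1} X'
P = X @ C0                                     # projection onto col(X)
D = rng.normal(size=(K, n)) @ (np.eye(n) - P)  # guarantees D @ X == 0
C = C0 + D                                     # another linear unbiased estimator

# With sigma^2 = 1, Var(C y) = C C'; the difference of covariances is D D'.
diff = C @ C.T - C0 @ C0.T
print(np.linalg.eigvalsh(diff).min())          # >= 0 up to rounding: PSD
```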
The ordinary least squares estimator (OLS) is the function
$$\widehat{\beta} = \left(X^\mathsf{T} X\right)^{-1} X^\mathsf{T} y$$
of $y$ and $X$ (where $X^\mathsf{T}$ denotes the transpose of $X$) that minimizes the sum of squares of residuals (misprediction amounts):
$$\sum_{i=1}^{n} \left(y_i - \widehat{y}_i\right)^2 = \sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{K} \widehat{\beta}_j X_{ij}\right)^2.$$
The theorem now states that the OLS estimator is a best linear unbiased estimator (BLUE).
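As an illustration (a sketch in Python with NumPy; the coefficients and data are hypothetical), the closed-form estimator can be computed directly and checked against a generic least-squares solver:

```python
# Sketch: the closed form (X'X)^{-1} X' y agrees with numpy's least-squares
# solver, which minimizes the same sum of squared residuals.
import numpy as np

rng = np.random.default_rng(3)
n, K = 100, 3
X = rng.normal(size=(n, K))
beta = np.array([1.5, -2.0, 0.5])
y = X @ beta + rng.uniform(-1, 1, size=n)  # non-normal, mean-zero errors

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # (X'X)^{-1} X' y
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # same minimizer
print(np.allclose(beta_hat, beta_lstsq))           # True
```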
The main idea of the proof is that the least-squares estimator is uncorrelated with every linear unbiased estimator of zero, i.e., with every linear combination $a_1 y_1 + \cdots + a_n y_n$ whose coefficients do not depend upon the unobservable $\beta$ but whose expected value is always zero.
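This covariance fact can be seen concretely: if $D$ is a matrix with $DX = 0$, then $Dy$ is a linear unbiased estimator of zero, and $\operatorname{Cov}(\widehat{\beta}, Dy) = \sigma^2 (X^\mathsf{T} X)^{-1} (DX)^\mathsf{T} = 0$. A minimal numerical sketch (Python with NumPy; the construction of $D$ is illustrative):

```python
# Sketch: the OLS estimator is uncorrelated with D y whenever D @ X == 0,
# since Cov(beta_hat, D y) = sigma^2 (X'X)^{-1} (D X)' = 0 (here sigma^2 = 1).
import numpy as np

rng = np.random.default_rng(4)
n, K = 40, 3
X = rng.normal(size=(n, K))
C0 = np.linalg.solve(X.T @ X, X.T)                  # OLS coefficient matrix
D = rng.normal(size=(K, n)) @ (np.eye(n) - X @ C0)  # ensures D @ X == 0

# Cross-covariance C0 @ D' should vanish identically.
print(np.abs(C0 @ D.T).max())  # ~ 0 up to floating-point rounding
```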
A proof that the OLS estimator indeed minimizes the sum of squares of residuals may proceed as follows, by calculating the Hessian matrix and showing that it is positive definite.
The MSE function we want to minimize is
$$f(\beta_0, \beta_1, \dots, \beta_p) = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip}\right)^2$$
for a multiple regression model with $p$ variables. The first derivative is
$$\frac{d}{d\boldsymbol{\beta}} f = -2 X^\mathsf{T} \left(\mathbf{y} - X \boldsymbol{\beta}\right) = \mathbf{0}_{p+1},$$
where $X$ is the design matrix
$$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \in \mathbb{R}^{n \times (p+1)}, \qquad n \geq p+1.$$
The Hessian matrix of second derivatives is
$$\mathcal{H} = 2\, X^\mathsf{T} X.$$
Assuming the columns of $X$ are linearly independent so that $X^\mathsf{T} X$ is invertible, let $X = \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \cdots & \mathbf{v}_{p+1} \end{bmatrix}$; then linear independence means
$$k_1 \mathbf{v}_1 + \dots + k_{p+1} \mathbf{v}_{p+1} = \mathbf{0} \iff k_1 = \dots = k_{p+1} = 0.$$
Now let $\mathbf{k} = (k_1, \dots, k_{p+1})^\mathsf{T}$ be an eigenvector of $\mathcal{H}$.
In terms of vector multiplication, this means
$$\mathbf{k}^\mathsf{T} \mathcal{H}\, \mathbf{k} = \lambda\, \mathbf{k}^\mathsf{T} \mathbf{k},$$
where $\lambda$ is the eigenvalue corresponding to $\mathbf{k}$. Moreover,
$$\mathbf{k}^\mathsf{T} \mathcal{H}\, \mathbf{k} = 2\, \mathbf{k}^\mathsf{T} X^\mathsf{T} X \mathbf{k} = 2\, \lVert X \mathbf{k} \rVert^2 > 0,$$
since $\mathbf{k} \neq \mathbf{0}$ and the columns of $X$ are linearly independent, so that $X \mathbf{k} \neq \mathbf{0}$; hence $\lambda > 0$.
Finally, as the eigenvector $\mathbf{k}$ was arbitrary, all eigenvalues of $\mathcal{H}$ are positive, therefore $\mathcal{H}$ is positive definite. Thus,
$$\boldsymbol{\beta} = \left(X^\mathsf{T} X\right)^{-1} X^\mathsf{T} \mathbf{y}$$
is indeed a global minimum.
Alternatively, simply observe that $\mathbf{v}^\mathsf{T} \mathcal{H}\, \mathbf{v} = 2\, \lVert X \mathbf{v} \rVert^2 \geq 0$ for all vectors $\mathbf{v}$, with equality only when $X \mathbf{v} = \mathbf{0}$. So the Hessian is positive definite if $X$ has full column rank.
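A quick numerical confirmation of this positive definiteness (a sketch in Python with NumPy; the design matrix is randomly generated and assumed to have full column rank):

```python
# Sketch: the Hessian 2 X'X of the sum-of-squares objective has strictly
# positive eigenvalues when X has full column rank.
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # full column rank

H = 2 * X.T @ X                     # Hessian of the objective
print(np.linalg.eigvalsh(H).min())  # > 0: all eigenvalues positive
```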