Christophe Hurlin
University of Orléans
November 23, 2013
Introduction
The objectives of this chapter are the following:
1 Define the multiple linear regression model.
2 Introduce the ordinary least squares (OLS) estimator.
The outline of this chapter is the following:
Section 2: The multiple linear regression model
Section 3: The ordinary least squares estimator
Section 4: Statistical properties of the OLS estimator
Subsection 4.1: Finite sample properties
Subsection 4.2: Asymptotic properties
References
Amemiya T. (1985), Advanced Econometrics, Harvard University Press.
Greene W. (2007), Econometric Analysis, sixth edition, Pearson Prentice Hall (recommended).
Pelgrin F. (2010), Lecture Notes on Advanced Econometrics, HEC Lausanne (a special thanks).
Ruud P. (2000), An Introduction to Classical Econometric Theory, Oxford University Press.
Notation
fY(y): probability density or mass function
FY(y): cumulative distribution function
Pr(): probability
y: vector
Y: matrix
Be careful: in this chapter, I do not distinguish between a random vector (matrix) and a vector (matrix) of deterministic elements. For more appropriate notations, see:
The Multiple Linear Regression Model
Objectives
1 Define the concept of multiple linear regression model.
2 Semi-parametric and parametric multiple linear regression models.
3 The multiple linear Gaussian model.
Definition (Multiple linear regression model)
The multiple linear regression model is used to study the relationship between a dependent variable and one or more independent variables. The generic form of the linear regression model is
y = x1 β1 + x2 β2 + ... + xK βK + ε
where y is the dependent or explained variable and x1, ..., xK are the independent or explanatory variables.
Notations
1 y is the dependent variable, the regressand or the explained variable.
2 xj is an explanatory variable, a regressor or a covariate.
3 ε is the error term or disturbance.
IMPORTANT: do not use the term "residual" for the disturbance ε.
Notations (cont’d)
The term ε is a random disturbance, so named because it "disturbs" an otherwise stable relationship. The disturbance arises for several reasons:
1 Primarily because we cannot hope to capture every influence on an economic variable in a model, no matter how elaborate. The net effect, which can be positive or negative, of these omitted factors is captured in the disturbance.
2 There are many other contributors to the disturbance in an empirical model. Probably the most significant is errors of measurement. It is easy to theorize about the relationships among precisely defined variables; it is quite another matter to obtain accurate measures of these variables.
Notations (cont’d)
We assume that each observation in a sample {yi, xi1, xi2, ..., xiK} for i = 1, ..., N is generated by an underlying process described by
yi = xi1 β1 + xi2 β2 + ... + xiK βK + εi
Remark:
xik = value of the kth explanatory variable for the ith unit of the sample; the index order is x(unit, variable).
Notations (cont’d)
Let the N × 1 column vector xk collect the N observations on variable k, for k = 1, ..., K.
Let us assemble these data in an N × K data matrix, X.
Let y be the N × 1 column vector of the N observations y1, y2, ..., yN. Let ε be the N × 1 column vector containing the N disturbances.
Notations (cont’d)
y (N × 1) = (y1, y2, ..., yi, ..., yN)'
xk (N × 1) = (x1k, x2k, ..., xik, ..., xNk)'
ε (N × 1) = (ε1, ε2, ..., εi, ..., εN)'
β (K × 1) = (β1, β2, ..., βK)'
Notations (cont’d)
X (N × K) = (x1 : x2 : ... : xK) or equivalently

X (N × K) =
[ x11  x12  ..  x1k  ..  x1K ]
[ x21  x22  ..  x2k  ..  x2K ]
[ ..   ..   ..  ..   ..  ..  ]
[ xi1  xi2  ..  xik  ..  xiK ]
[ ..   ..   ..  ..   ..  ..  ]
[ xN1  xN2  ..  xNk  ..  xNK ]
Fact
In most cases, the first column of X is assumed to be a column of 1s, so that β1 is the constant term in the model.
x1 (N × 1) = 1 (N × 1) and

X (N × K) =
[ 1  x12  ..  x1k  ..  x1K ]
[ 1  x22  ..  x2k  ..  x2K ]
[ .. ..   ..  ..   ..  ..  ]
[ 1  xi2  ..  xik  ..  xiK ]
[ .. ..   ..  ..   ..  ..  ]
[ 1  xN2  ..  xNk  ..  xNK ]
Remark
More generally, the matrix X may contain both stochastic and non-stochastic elements, such as:
Constant;
Time trend;
Dummy variables (for specific episodes in time);
Etc.
Therefore, X is generally a mixture of fixed and random variables.
Definition (Simple linear regression model)
The simple linear regression model is a model with only one stochastic regressor: K = 1 if there is no constant,
yi = β1 xi + εi
or K = 2 if there is a constant:
yi = β1 + β2 xi2 + εi
for i = 1, ..., N, or in vector form
y = β1 + β2 x2 + ε
Definition (Multiple linear regression model)
The multiple linear regression model can be written as
y (N × 1) = X (N × K) β (K × 1) + ε (N × 1)
One key difference for the specification of the MLRM: parametric versus semi-parametric specification.
Parametric model: the distribution of the error terms is fully characterized, e.g. ε ∼ N(0, Ω).
Semi-parametric specification: only a few moments of the error terms are specified, e.g. E(ε) = 0 and V(ε) = E(εε') = Ω.
This difference does not matter for the derivation of the ordinary least squares estimator.
But this difference matters for (among others):
1 The characterization of the statistical properties of the OLS estimator (e.g., efficiency);
2 The choice of alternative estimators (e.g., the maximum likelihood estimator);
3 Etc.
Definition (Semi-parametric multiple linear regression model)
The semi-parametric multiple linear regression model is defined by
y = Xβ + ε
where the error term ε satisfies
E(ε|X) = 0 (N × 1)
V(ε|X) = σ² IN (N × N)
and IN is the identity matrix of order N.
Remarks
1 If the matrix X is non-stochastic (fixed), i.e. there are only fixed regressors, then the conditions on the error term ε read:
E(ε) = 0    V(ε) = σ² IN
2 If the (conditional) variance-covariance matrix of ε is not diagonal, i.e. if
V(ε|X) = Ω
the model is called the multiple generalized linear regression model.
Remarks (cont’d)
The two conditions on the error term ε,
E(ε|X) = 0 (N × 1)    V(ε|X) = σ² IN
are equivalent to
E(y|X) = Xβ    V(y|X) = σ² IN
Definition (The multiple linear Gaussian model)
The (parametric) multiple linear Gaussian model is defined by
y = Xβ + ε
where the error term ε is normally distributed:
ε ∼ N(0, σ² IN)
As a consequence, the vector y has a conditional normal distribution:
y|X ∼ N(Xβ, σ² IN)
Remarks
1 The multiple linear Gaussian model is (by definition) a parametric model.
2 If the matrix X is non-stochastic (fixed), i.e. there are only fixed regressors, then the vector y has a marginal normal distribution:
y ∼ N(Xβ, σ² IN)
The classical linear regression model consists of a set of assumptions that describes how the data set is produced by a data generating process (DGP):
Assumption 1: Linearity
Assumption 2: Full rank condition or identification
Assumption 3: Exogeneity
Assumption 4: Spherical error terms
Assumption 5: Data generation
Assumption 6: Normal distribution
Definition (Assumption 1: Linearity)
The model is linear with respect to the parameters β1, ..., βK.
Linearity restricts the way the parameters enter the model, not the relationship between the dependent variable and the regressors. For instance, the models
y = β0 + β1 x + u
y = β0 + β1 cos(x) + v
y = β0 + β1 (1/x) + w
are all linear (with respect to β).
In contrast, the model y = β0 + β1 x^β2 + ε is nonlinear.
Remark
The model can be linear after some transformations. Starting from y = A x^β exp(ε), one obtains a log-linear specification:
ln(y) = ln(A) + β ln(x) + ε
Definition (Log-linear model)
The log-linear model is
ln(yi) = β1 ln(xi1) + β2 ln(xi2) + ... + βK ln(xiK) + εi
This equation is also known as the constant elasticity form: in this equation, the elasticity of y with respect to changes in x does not vary with xik:
βk = ∂ln(yi) / ∂ln(xik) = (∂yi / ∂xik) (xik / yi)
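The constant-elasticity property can be checked numerically. The sketch below uses hypothetical values of A and β (chosen only for illustration) and verifies that (dy/dx)(x/y) equals β at every point:

```python
import numpy as np

# Hypothetical constant-elasticity relation y = A * x^beta
# (A and beta are illustrative values, not taken from the text).
A, beta = 2.0, 0.7
x = np.linspace(1.0, 10.0, 50)
y = A * x**beta

# Elasticity = (dy/dx) * (x/y), computed with a central finite difference.
h = 1e-6
dydx = (A * (x + h)**beta - A * (x - h)**beta) / (2 * h)
elasticity = dydx * x / y

# The elasticity is constant and equal to beta, whatever the value of x.
```

A log-linear regression of ln(y) on ln(x) would recover exactly this slope.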
Definition (Assumption 2: Full column rank)
X is an N × K matrix with rank K.
Interpretation
1 There is no exact relationship among any of the independent variables in the model.
2 The columns of X are linearly independent.
Example
Suppose that a cross-section model satisfies:
yi = β0 + β1 non labor incomei + β2 salaryi + β3 total incomei + εi
The identification condition does not hold, since total income is exactly equal to salary plus non-labor income (exact linear dependency in the model).
Remarks
1 Perfect multicollinearity is generally not difficult to spot and is signalled by most statistical software.
2 Imperfect multicollinearity is a more serious issue (see further).
Definition (Identification)
The multiple linear regression model is identifiable if and only if one of the following equivalent assertions holds:
(i) rank(X) = K
(ii) The matrix X'X is invertible
(iii) The columns of X form a basis of L(X)
(iv) Xβ1 = Xβ2 ⟹ β1 = β2, for all (β1, β2) ∈ R^K × R^K
(v) Xβ = 0 ⟹ β = 0, for all β ∈ R^K
(vi) ker(X) = {0}
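The failure of assumption A2 in the income example can be checked numerically. The data below are simulated for illustration (only the exact dependency total = salary + non-labor income is taken from the example):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

# Simulated (hypothetical) data reproducing the exact linear dependency
# of the example: total income = salary + non-labor income.
nonlabor = rng.uniform(0, 10, N)
salary = rng.uniform(20, 50, N)
total = nonlabor + salary

# X has K = 4 columns: constant, non-labor income, salary, total income.
X = np.column_stack([np.ones(N), nonlabor, salary, total])

rank = np.linalg.matrix_rank(X)                        # 3 < K = 4: A2 fails
smallest_sv = np.linalg.svd(X, compute_uv=False)[-1]   # numerically zero
```

Since rank(X) < K, X'X is singular and the inverse (X'X)⁻¹ appearing in the OLS formula does not exist: β is not identified.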
Definition (Assumption 3: Strict exogeneity of the regressors)
The regressors are exogenous in the sense that
E(ε|X) = 0 (N × 1)
or equivalently, for all units i ∈ {1, ..., N},
E(εi|X) = 0
or equivalently
E(εi|xjk) = 0
for any explanatory variable k ∈ {1, ..., K} and any unit j ∈ {1, ..., N}.
Comments
1 The expected value of the error term at observation i (in the sample) is not a function of the independent variables observed at any observation (including the ith observation). The independent variables are not predictors of the error terms.
2 The strict exogeneity condition can be rewritten as:
E(y|X) = Xβ
3 If the regressors are fixed, this condition can be rewritten as:
E(ε) = 0 (N × 1)
Implications
The (strict) exogeneity condition E(ε|X) = 0 (N × 1) has two implications:
1 The zero conditional mean of ε implies that the unconditional mean of ε is also zero (the reverse is not true):
E(ε) = E_X(E(ε|X)) = E_X(0) = 0
2 The zero conditional mean of ε implies that (the reverse is not true):
E(εi xjk) = 0 for all i, j, k
or
Cov(εi, X) = 0 for all i
Definition (Assumption 4: Spherical disturbances)
The error terms are such that
V(εi|X) = E(εi²|X) = σ² for all i ∈ {1, ..., N}
and
Cov(εi, εj|X) = E(εi εj|X) = 0 for all i ≠ j
The condition of constant variances is called homoscedasticity. The uncorrelatedness across observations is called nonautocorrelation.
Comments
1 Spherical disturbances = homoscedasticity + nonautocorrelation.
2 If the errors are not spherical, we call them nonspherical disturbances.
3 The assumption of homoscedasticity is a strong one: it is the exception rather than the rule!
Comments
Let us consider the (conditional) N × N variance-covariance matrix of the error terms:

V(ε|X) = E(εε'|X) =
[ E(ε1²|X)    E(ε1ε2|X)  ..  E(ε1εj|X)  ..  E(ε1εN|X) ]
[ E(ε2ε1|X)   E(ε2²|X)   ..  E(ε2εj|X)  ..  E(ε2εN|X) ]
[ ..          ..         ..  ..         ..  ..        ]
[ E(εiε1|X)   ..         ..  E(εiεj|X)  ..  E(εiεN|X) ]
[ ..          ..         ..  ..         ..  ..        ]
[ E(εNε1|X)   ..         ..  E(εNεj|X)  ..  E(εN²|X)  ]
Comments
The two assumptions (homoscedasticity and nonautocorrelation) imply that:

V(ε|X) = E(εε'|X) = σ² IN =
[ σ²  0   ..  0   ..  0  ]
[ 0   σ²  ..  0   ..  0  ]
[ ..  ..  ..  ..  ..  .. ]
[ 0   ..  ..  σ²  ..  0  ]
[ ..  ..  ..  ..  ..  .. ]
[ 0   ..  0   ..  ..  σ² ]
Definition (Assumption 5: Data generation)
The data in (xi1, xi2, ..., xiK) may be any mixture of constants and random variables.
Comments
1 The analysis will be done conditionally on the observed X, so whether the elements in X are fixed constants or random draws from a stochastic process will not influence the results.
2 In the case of stochastic regressors, the unconditional statistical properties of the estimators are obtained in two steps: (1) using the result conditioned on X, and (2) finding the unconditional result by "averaging" (i.e., integrating over) the conditional distributions.
Comments
An assumption regarding (xi1, xi2, ..., xiK, yi) for i = 1, ..., N is also required. This is a statement about how the sample is drawn.
In the sequel, we assume that (xi1, xi2, ..., xiK, yi) for i = 1, ..., N are independently and identically distributed (i.i.d.).
The observations are drawn by simple random sampling from a large population.
Definition (Assumption 6: Normal distribution)
The disturbances are normally distributed:
εi|X ∼ N(0, σ²)
or equivalently
ε|X ∼ N(0 (N × 1), σ² IN)
Comments
1 Once again, this is a convenience that we will dispense with after some analysis of its implications.
2 Normality is not necessary to obtain many of the results presented below.
3 Assumption 6 implies assumptions 3 (exogeneity) and 4 (spherical disturbances).
Summary
The main assumptions of the multiple linear regression model:
A1 (linearity): the model is linear in β
A2 (identification): X is an N × K matrix with rank K
A3 (exogeneity): E(ε|X) = 0 (N × 1)
A4 (spherical error terms): V(ε|X) = σ² IN
A5 (data generation): X may be fixed or random
A6 (normal distribution): ε|X ∼ N(0 (N × 1), σ² IN)
Key Concepts
1 Simple linear regression model
2 Multiple linear regression model
3 Semi-parametric multiple linear regression model
4 Multiple linear Gaussian model
5 Assumptions of the multiple linear regression model
6 Linearity (A1), Identi…cation (A2), Exogeneity (A3), Spherical error terms (A4), Data generation (A5) and Normal distribution (A6)
The ordinary least squares estimator
Introduction
1 The multiple linear regression model assumes that the following specification is true in the population:
y = Xβ + ε
where other unobserved factors determining y are captured by the error term ε.
2 Consider a sample {xi1, xi2, ..., xiK, yi}, i = 1, ..., N, of i.i.d. random variables (be careful with the change of notation here) and only one realization of this sample (your data set).
3 How to estimate the vector of parameters β?
Introduction (cont’d)
1 If we assume that assumptions A1-A6 hold, we have a multiple linear Gaussian model (parametric model), and a solution is to use the maximum likelihood estimator (MLE). The MLE of β coincides with the ordinary least squares (OLS) estimator (cf. chapter 2).
2 If we assume that only assumptions A1-A5 hold, we have a semi-parametric multiple linear regression model, and the MLE is infeasible.
3 In this case, the only solution is to use the ordinary least squares (OLS) estimator.
Intuition
Let us consider the simple linear regression model and, for simplicity, denote xi = xi2:
yi = β1 + β2 xi + εi
The general idea of OLS consists in minimizing the "distance" between the points (xi, yi) and the regression line ŷi = β̂1 + β̂2 xi, i.e. the points (xi, ŷi), for all i = 1, ..., N.
Estimates of β1 and β2 are chosen by minimizing the sum of squared residuals (SSR):
SSR = Σ_{i=1}^N ε̂i²
This SSR can be written as:
Σ_{i=1}^N ε̂i² = Σ_{i=1}^N (yi − β̂1 − β̂2 xi)²
Therefore, β̂1 and β̂2 are the solutions of the minimization problem
(β̂1, β̂2) = arg min_{(β1, β2)} Σ_{i=1}^N (yi − β1 − β2 xi)²
Definition (OLS - simple linear regression model)
In the simple linear regression model yi = β1 + β2 xi + εi, the OLS estimators β̂1 and β̂2 are the solutions of the minimization problem
(β̂1, β̂2) = arg min_{(β1, β2)} Σ_{i=1}^N (yi − β1 − β2 xi)²
The solutions are:
β̂1 = ȳN − β̂2 x̄N
β̂2 = Σ_{i=1}^N (xi − x̄N)(yi − ȳN) / Σ_{i=1}^N (xi − x̄N)²
where ȳN = N⁻¹ Σ_{i=1}^N yi and x̄N = N⁻¹ Σ_{i=1}^N xi respectively denote the sample means of y and x.
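The closed-form solutions above can be computed directly; the data-generating values in this sketch are assumptions made up for the example:

```python
import numpy as np

# Simulate a sample from y_i = 1 + 2*x_i + eps_i (illustrative values).
rng = np.random.default_rng(42)
N = 200
x = rng.uniform(0, 5, N)
y = 1.0 + 2.0 * x + rng.normal(0, 1, N)

# Closed-form OLS solutions for the simple linear regression model.
xbar, ybar = x.mean(), y.mean()
b2_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b1_hat = ybar - b2_hat * xbar
```

With N = 200 observations, the estimates land close to the true values (1, 2) used in the simulation.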
Remark
The OLS estimator is a linear estimator (cf. chapter 1), since it can be expressed as a linear function of the observations yi:
β̂2 = Σ_{i=1}^N ωi yi
with
ωi = (xi − x̄N) / Σ_{i=1}^N (xi − x̄N)²
in the case where ȳN = 0.
Definition (Fitted value)
The predicted or fitted value for observation i is:
ŷi = β̂1 + β̂2 xi
with a sample mean equal to the sample average of the observations:
(1/N) Σ_{i=1}^N ŷi = ȳN = (1/N) Σ_{i=1}^N yi
Definition (Fitted residual)
The residual for observation i is:
ε̂i = yi − β̂1 − β̂2 xi
with a sample mean equal to zero by definition:
(1/N) Σ_{i=1}^N ε̂i = 0
Remarks
1 The fit of the regression is "good" if the sum Σ_{i=1}^N ε̂i² (or SSR) is "small", i.e., the unexplained part of the variance of y is "small".
2 The coefficient of determination, or R², is given by:
R² = Σ_{i=1}^N (ŷi − ȳN)² / Σ_{i=1}^N (yi − ȳN)² = 1 − Σ_{i=1}^N ε̂i² / Σ_{i=1}^N (yi − ȳN)²
Orthogonality conditions
Under assumption A3 (strict exogeneity), we have E(εi|xi) = 0. This condition implies that:
E(εi) = 0    E(εi xi) = 0
Using the sample analogs of these moment conditions (cf. chapter 6, GMM), one has:
(1/N) Σ_{i=1}^N (yi − β̂1 − β̂2 xi) = 0
(1/N) Σ_{i=1}^N (yi − β̂1 − β̂2 xi) xi = 0
Definition (Orthogonality conditions)
The ordinary least squares estimator can be defined from the sample analogs of the two following moment conditions:
E(εi) = 0    E(εi xi) = 0
The corresponding system of equations is just-identified.
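The two sample moment conditions form a just-identified 2 × 2 linear system in (β̂1, β̂2), and solving it reproduces the closed-form OLS estimates. A sketch on simulated data (the parameter values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.normal(2, 1, N)
y = 0.5 + 1.5 * x + rng.normal(0, 1, N)

# Sample analogs of E(eps_i) = 0 and E(eps_i * x_i) = 0 rearrange to:
#   b1 + b2*mean(x)           = mean(y)
#   b1*mean(x) + b2*mean(x^2) = mean(x*y)
A = np.array([[1.0, x.mean()],
              [x.mean(), np.mean(x * x)]])
c = np.array([y.mean(), np.mean(x * y)])
b1_hat, b2_hat = np.linalg.solve(A, c)
```

The solution of the moment system is identical to the minimizer of the SSR.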
OLS and the multiple linear regression model
Now consider the multiple linear regression model
y = Xβ + ε
or
yi = Σ_{k=1}^K βk xik + εi
Objective: find an estimator (estimate) of β1, β2, ..., βK and σ² under assumptions A1-A5.
OLS and the multiple linear regression model
Different methods:
1 Minimize the sum of squared residuals (SSR).
2 Solve the same minimization problem with matrix notation.
3 Use moment conditions.
4 Geometric interpretation.
1. Minimize the sum of squared residuals (SSR):
As in the simple linear regression,
β̂ = arg min_β Σ_{i=1}^N εi² = arg min_β Σ_{i=1}^N (yi − Σ_{k=1}^K βk xik)²
One can derive the first order conditions with respect to βk for k = 1, ..., K and solve a system of K equations with K unknowns.
2. Using matrix notations:
Definition (OLS and multiple linear regression model)
In the multiple linear regression model yi = xi'β + εi, with xi = (xi1, ..., xiK)', the OLS estimator β̂ is the solution of
β̂ = arg min_β Σ_{i=1}^N (yi − xi'β)²
The OLS estimator of β is:
β̂ = (Σ_{i=1}^N xi xi')⁻¹ (Σ_{i=1}^N xi yi)
2. Using matrix notations:
Definition (Normal equations)
Under suitable regularity conditions, in the multiple linear regression model yi = xi'β + εi, with xi = (xi1 : ... : xiK)', the normal equations are
Σ_{i=1}^N xi (yi − xi'β̂) = 0 (K × 1)
2. Using matrix notations:
Definition (OLS and multiple linear regression model)
In the multiple linear regression model y = Xβ + ε, the OLS estimator β̂ is the solution of the minimization problem
β̂ = arg min_β ε'ε = arg min_β (y − Xβ)'(y − Xβ)
The OLS estimator of β is:
β̂ = (X'X)⁻¹ X'y
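A minimal numerical sketch of the matrix formula, on simulated data with assumed true parameters. Note that solving the normal equations with np.linalg.solve is numerically preferable to forming the inverse (X'X)⁻¹ explicitly:

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, -2.0, 0.5])   # assumed values for illustration
y = X @ beta_true + rng.normal(0, 1, N)

# OLS: beta_hat = (X'X)^(-1) X'y, computed by solving the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

The same estimate is returned by any least-squares routine, e.g. np.linalg.lstsq.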
2. Using matrix notations:
Definition
The ordinary least squares estimator β̂ of β minimizes the following criterion:
s(β) = ||y − Xβ||² = (y − Xβ)'(y − Xβ)
2. Using matrix notations:
The FOC (normal equations) are defined by:
∂s(β)/∂β evaluated at β̂:  −2 X'(y − Xβ̂) = 0 (K × 1)
where X' is K × N and (y − Xβ̂) is N × 1.
The second-order conditions hold:
∂²s(β)/∂β∂β' evaluated at β̂:  2 X'X (K × K) is positive definite
since, under assumption A2 (full column rank), X'X is a positive definite matrix. We have a minimum.
2. Using matrix notations:
Definition (Normal equations)
Under suitable regularity conditions, in the multiple linear regression model y = Xβ + ε, the normal equations are given by:
X'(y − Xβ̂) = 0 (K × 1)
where X' is K × N and (y − Xβ̂) is N × 1.
Definition (Unbiased variance estimator)
In the multiple linear regression model y = Xβ + ε, the unbiased estimator of σ² is given by:
σ̂² = (1/(N − K)) Σ_{i=1}^N ε̂i² = SSR / (N − K)
2. Using matrix notations:
The estimator σ̂² can also be written as:
σ̂² = (1/(N − K)) Σ_{i=1}^N (yi − xi'β̂)²
= (y − Xβ̂)'(y − Xβ̂) / (N − K)
= ||y − Xβ̂||² / (N − K)
3. Using moment conditions:
Under assumption A3 (strict exogeneity), we have E(ε|X) = 0. This condition implies:
E(εi xi) = 0 (K × 1)
with xi = (xi1 : ... : xiK)'. Using the sample analogs, one has:
(1/N) Σ_{i=1}^N xi (yi − xi'β̂) = 0 (K × 1)
We have K (normal) equations with K unknown parameters β̂1, ..., β̂K. The system is just-identified.
4. Geometric interpretation:
1 The ordinary least squares estimation method consists in determining the adjusted vector ŷ which is the closest to y (in a certain space...), such that the squared norm between y and ŷ is minimized.
2 Finding ŷ is equivalent to finding an estimator of β.
4. Geometric interpretation:
Definition (Geometric interpretation)
The adjusted vector ŷ is the (orthogonal) projection of y onto the column space of X. The fitted error term ε̂ is the projection of y onto the space orthogonal to the column space of X. The vectors ŷ and ε̂ are orthogonal.
Source: F. Pelgrin (2010), Lecture notes, Advanced Econometrics
4. Geometric interpretation:
Definition (Projection matrices)
The vectors ŷ and ε̂ are defined to be:
ŷ = P y    ε̂ = M y
where P and M denote the two following projection matrices:
P = X (X'X)⁻¹ X'
M = IN − P = IN − X (X'X)⁻¹ X'
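The algebraic properties of P and M (symmetry, idempotency, MX = 0, orthogonality of ŷ and ε̂) can be checked numerically. Everything below is an illustrative sketch on simulated data with assumed parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)  # assumed values

P = X @ np.linalg.solve(X.T @ X, X.T)  # projection onto the column space of X
M = np.eye(N) - P                      # "residual maker" projection

y_hat = P @ y                          # fitted values
e_hat = M @ y                          # residuals; note y = y_hat + e_hat
```

Since P and M project onto orthogonal complementary spaces, ŷ'ε̂ = 0 and y decomposes exactly as ŷ + ε̂.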
Other geometric interpretations:
Suppose that there is a constant term in the model.
1 The least squares residuals sum to zero:
Σ_{i=1}^N ε̂i = 0
2 The regression hyperplane passes through the point of means of the data (x̄N, ȳN).
3 The mean of the fitted (adjusted) values of y equals the mean of the actual values of y:
(1/N) Σ_{i=1}^N ŷi = ȳN
Definition (Coefficient of determination)
The coefficient of determination of the multiple linear regression model (with a constant term) is the ratio of the total (empirical) variance explained by the model to the total (empirical) variance of y:
R² = Σ_{i=1}^N (ŷi − ȳN)² / Σ_{i=1}^N (yi − ȳN)² = 1 − Σ_{i=1}^N ε̂i² / Σ_{i=1}^N (yi − ȳN)²
Remark
1 The coefficient of determination measures the proportion of the total variance (or variability) in y that is accounted for by variation in the regressors (or the model).
2 Problem: the R² automatically and spuriously increases when extra explanatory variables are added to the model.
Definition (Adjusted R-squared)
The adjusted R-squared coefficient is defined to be:
R̄² = 1 − ((N − 1) / (N − p − 1)) (1 − R²)
where p denotes the number of regressors (not counting the constant term, i.e., p = K − 1 if there is a constant, p = K otherwise).
Remark
One can show that:
1 R̄² < R²
2 If N is large, R̄² ≈ R²
3 The adjusted R-squared R̄² can be negative.
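Both coefficients can be computed in a few lines; the sketch below uses simulated data with assumed parameter values, a constant term, and p = K − 1 non-constant regressors:

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 60, 4                       # K columns of X, including the constant
p = K - 1                          # regressors not counting the constant
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=N)  # assumed values

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat

R2 = 1 - np.sum(e_hat**2) / np.sum((y - y.mean())**2)
R2_adj = 1 - (N - 1) / (N - p - 1) * (1 - R2)   # below R2 whenever p >= 1
```

Since (N − 1)/(N − p − 1) > 1 for p ≥ 1, the adjusted coefficient is always below R² unless R² = 1.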
Key Concepts
1 OLS estimator and estimate
2 Fitted or predicted value
3 Residual or fitted residual
4 Orthogonality conditions
5 Normal equations
6 Geometric interpretations of the OLS
7 Coefficient of determination and adjusted R-squared
Statistical properties of the OLS estimator
In order to study the statistical properties of the OLS estimator, we have to distinguish (cf. chapter 1):
1 The finite sample properties;
2 The large sample or asymptotic properties.
But we also have to distinguish the properties according to the assumptions made on the linear regression model:
1 Semi-parametric linear regression model (the exact distribution of ε is unknown) versus parametric linear regression model (and especially the Gaussian linear regression model, assumption A6).
2 X is a matrix of random regressors versus X is a matrix of fixed regressors.
Fact (Assumptions)
In the rest of this section, we assume that assumptions A1-A5 hold:
A1 (linearity): the model is linear in β
A2 (identification): X is an N × K matrix with rank K
A3 (exogeneity): E(ε|X) = 0 (N × 1)
A4 (spherical error terms): V(ε|X) = σ² IN
A5 (data generation): X may be fixed or random
Finite sample properties of the OLS estimator
Objectives
The objectives of this subsection are the following:
1 Compute the first two moments of the (unknown) finite sample distribution of the OLS estimators β̂ and σ̂².
2 Determine the finite sample distribution of the OLS estimators β̂ and σ̂² under particular assumptions (A6).
3 Determine whether the OLS estimators are "good": efficient estimator versus BLUE.
4 Introduce the Gauss-Markov theorem.
First moments of the OLS estimators
Moments
In a first step, we will derive the first moments of the OLS estimators:
1 Step 1: compute E(β̂) and V(β̂)
2 Step 2: compute E(σ̂²) and V(σ̂²)
Definition (Unbiased estimator)
In the multiple linear regression model y = Xβ0 + ε, under assumption A3 (strict exogeneity), the OLS estimator β̂ is unbiased:
E(β̂) = β0
where β0 denotes the true value of the vector of parameters. This result holds whether or not the matrix X is considered as random.
Proof
Case 1: fixed regressors (cf. chapter 1)
β̂ = (X'X)⁻¹ X'y = β0 + (X'X)⁻¹ X'ε
So, if X is a matrix of fixed regressors:
E(β̂) = β0 + (X'X)⁻¹ X' E(ε)
Under assumption A3 (exogeneity), E(ε|X) = E(ε) = 0. Then, we get:
E(β̂) = β0
Proof (cont'd)
Case 2: random regressors
β̂ = (X'X)⁻¹ X'y = β0 + (X'X)⁻¹ X'ε
If X includes some random elements:
E(β̂|X) = β0 + (X'X)⁻¹ X' E(ε|X)
Under assumption A3 (exogeneity), E(ε|X) = 0. Then, we get:
E(β̂|X) = β0
The OLS estimator β̂ is thus conditionally unbiased. Besides, by the law of iterated expectations, we have:
E(β̂) = E_X(E(β̂|X)) = E_X(β0) = β0
where E_X denotes the expectation with respect to the distribution of X.
So, the OLS estimator β̂ is unbiased: E(β̂) = β0
Definition (Variance of the OLS estimator, non-stochastic regressors)
In the multiple linear regression model y = Xβ0 + ε, if the matrix X is non-stochastic, the (unconditional) variance-covariance matrix of the OLS estimator β̂ is:
V(β̂) = σ² (X'X)⁻¹
Proof
β̂ = (X'X)⁻¹ X'y = β0 + (X'X)⁻¹ X'ε
So, if X is a matrix of fixed regressors:
V(β̂) = E((β̂ − β0)(β̂ − β0)')
= E((X'X)⁻¹ X' εε' X (X'X)⁻¹)
= (X'X)⁻¹ X' E(εε') X (X'X)⁻¹
Under assumption A4 (spherical disturbances), we have:
V(ε) = E(εε') = σ² IN
The variance-covariance matrix of the OLS estimator is then:
V(β̂) = (X'X)⁻¹ X' E(εε') X (X'X)⁻¹
= (X'X)⁻¹ X' σ² IN X (X'X)⁻¹
= σ² (X'X)⁻¹ X'X (X'X)⁻¹
= σ² (X'X)⁻¹
Definition (Variance of the OLS estimator, stochastic regressors)
In the multiple linear regression model y = Xβ0 + ε, if the matrix X is stochastic, the conditional variance-covariance matrix of the OLS estimator β̂ is:
V(β̂|X) = σ² (X'X)⁻¹
The unconditional variance-covariance matrix is equal to:
V(β̂) = σ² E_X((X'X)⁻¹)
where E_X denotes the expectation with respect to the distribution of X.
Proof
β̂ = (X'X)⁻¹ X'y = β0 + (X'X)⁻¹ X'ε
So, if X is a stochastic matrix:
V(β̂|X) = E((β̂ − β0)(β̂ − β0)' | X)
= E((X'X)⁻¹ X' εε' X (X'X)⁻¹ | X)
= (X'X)⁻¹ X' E(εε'|X) X (X'X)⁻¹
Proof (cont'd)
Under assumption A4 (spherical disturbances), we have:
V(ε|X) = E(εε'|X) = σ² IN
The conditional variance-covariance matrix of the OLS estimator is then:
V(β̂|X) = (X'X)⁻¹ X' E(εε'|X) X (X'X)⁻¹
= (X'X)⁻¹ X' σ² IN X (X'X)⁻¹
= σ² (X'X)⁻¹
Proof (cont'd)
We have:
V(β̂|X) = σ² (X'X)⁻¹
Since E(β̂|X) = β0 does not depend on X, the variance decomposition V(β̂) = E_X(V(β̂|X)) + V_X(E(β̂|X)) has a null second term, so the (unconditional) variance-covariance matrix of the OLS estimator is:
V(β̂) = E_X(V(β̂|X)) = σ² E_X((X'X)⁻¹)
where E_X denotes the expectation with respect to the distribution of X.
Summary
             Case 1: X stochastic             Case 2: X non-stochastic
Mean         E(β̂) = β0                       E(β̂) = β0
Variance     V(β̂) = σ² E_X((X'X)⁻¹)         V(β̂) = σ² (X'X)⁻¹
Cond. mean   E(β̂|X) = β0                     —
Cond. var    V(β̂|X) = σ² (X'X)⁻¹             —
Question
How can we estimate the variance-covariance matrix of the OLS estimator?
V(β̂OLS) = σ² (X'X)⁻¹ if X is non-stochastic
V(β̂OLS) = σ² E_X((X'X)⁻¹) if X is stochastic
Question (cont'd)
Definition (Variance estimator)
An unbiased estimator of the variance-covariance matrix of the OLS estimator is given by:
V̂(β̂OLS) = σ̂² (X'X)⁻¹
where σ̂² = (N − K)⁻¹ ε̂'ε̂ is an unbiased estimator of σ². This result holds whether X is stochastic or non-stochastic.
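The feasible variance estimator can be sketched numerically; the design and true parameter values below (σ = 1.5) are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(11)
N, K = 120, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([2.0, 1.0, -0.5]) + rng.normal(0, 1.5, N)  # sigma = 1.5

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat

sigma2_hat = e_hat @ e_hat / (N - K)            # unbiased estimator of sigma^2
V_hat = sigma2_hat * np.linalg.inv(X.T @ X)     # estimated var-cov of beta_hat
std_err = np.sqrt(np.diag(V_hat))               # standard errors of beta_hat
```

The diagonal of V̂(β̂OLS) gives the squared standard errors that regression software reports next to each coefficient.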
Summary
            Case 1: X stochastic             Case 2: X non-stochastic
Variance    V(β̂) = σ² E_X((X'X)⁻¹)         V(β̂) = σ² (X'X)⁻¹
Estimator   V̂(β̂OLS) = σ̂² (X'X)⁻¹         V̂(β̂OLS) = σ̂² (X'X)⁻¹
Definition (Estimator of the variance of the disturbances)
Under assumptions A1-A5, in the multiple linear regression model y = Xβ + ε, the estimator σ̂² is unbiased:
E(σ̂²) = σ²
where
σ̂² = (1/(N − K)) Σ_{i=1}^N ε̂i² = ε̂'ε̂ / (N − K)
This result holds whether or not the matrix X is considered as random.
We assume that X is stochastic. Let M denote the projection matrix ("residual maker") defined by:
M = IN − X (X'X)⁻¹ X'
with
ε̂ (N × 1) = M (N × N) y (N × 1)
The N × N matrix M satisfies the following properties:
1 If X is regressed on X, a perfect fit will result and the residuals will be zero, so MX = 0.
2 The matrix M is symmetric (M' = M) and idempotent (MM = M).
Proof (cont'd)
The residuals are defined to be:
ε̂ = M y
Since y = Xβ + ε, we have
ε̂ = M(Xβ + ε) = MXβ + Mε
Since MX = 0, we have
ε̂ = Mε
Proof (cont'd)
The estimator σ̂² is based on the sum of squared residuals (SSR):
σ̂² = ε̂'ε̂ / (N − K) = ε'Mε / (N − K)
The expected value of the SSR is
E(ε̂'ε̂|X) = E(ε'Mε|X)
The quantity ε'Mε is a 1 × 1 scalar, so it is equal to its trace:
E(ε'Mε|X) = E(tr(ε'Mε)|X) = E(tr(Mεε')|X)
since tr(AB) = tr(BA).
Proof (cont'd)
Since M = IN − X(X'X)⁻¹X' depends only on X, we have:
E(ε̂'ε̂|X) = tr(E(Mεε'|X)) = tr(M E(εε'|X))
Under assumptions A3 and A4, we have
E(εε'|X) = σ² IN
As a consequence,
E(ε̂'ε̂|X) = tr(σ² M IN) = σ² tr(M)
Proof (cont'd)
E(ε̂'ε̂|X) = σ² tr(M)
= σ² tr(IN − X(X'X)⁻¹X')
= σ² tr(IN) − σ² tr(X(X'X)⁻¹X')
= σ² tr(IN) − σ² tr(X'X(X'X)⁻¹)
= σ² tr(IN) − σ² tr(IK)
= σ² (N − K)
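The key step tr(M) = N − K is easy to verify numerically; X below is simulated only for illustration (any full-column-rank N × K matrix would do):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 40, 5
X = rng.normal(size=(N, K))            # any full-rank N x K matrix

M = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)
trace_M = np.trace(M)                  # equals N - K = 35 up to rounding error
```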
Proof (cont'd)
By definition of σ̂², we have:
E(σ̂²|X) = E(ε̂'ε̂|X) / (N − K) = σ²(N − K) / (N − K) = σ²
So, the estimator σ̂² is conditionally unbiased. By the law of iterated expectations,
E(σ̂²) = E_X(E(σ̂²|X)) = E_X(σ²) = σ²
The estimator σ̂² is unbiased:
E(σ̂²) = σ²
Remark
Following the same principle, we can compute the variance of the estimator σ̂²:
σ̂² = ε̂'ε̂ / (N − K) = ε'Mε / (N − K)
As a consequence, we have:
V(σ̂²|X) = (1/(N − K)²) V(ε'Mε|X)
and V(σ̂²) = E_X(V(σ̂²|X)), since E(σ̂²|X) = σ² does not depend on X.
But the computation takes... at least ten slides...
Definition (Variance of the estimator σ̂²)
In the multiple linear regression model y = Xβ0 + ε with normally distributed errors (assumption A6), the variance of the estimator σ̂² is
V(σ̂²) = 2σ⁴ / (N − K)
where σ² denotes the true value of the variance of the error terms. This result holds whether or not the matrix X is considered as random.
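This result can be checked by Monte Carlo under normality; the design below is an assumed illustration (N = 50, K = 2, σ² = 1), and σ̂² is simulated through the ε'Mε/(N − K) representation used in the proof:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, sigma2 = 50, 2, 1.0
X = np.column_stack([np.ones(N), rng.normal(size=N)])
M = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)

reps = 20000
sigma2_hats = np.empty(reps)
for r in range(reps):
    eps = rng.normal(0.0, np.sqrt(sigma2), N)   # Gaussian errors (A6)
    sigma2_hats[r] = eps @ M @ eps / (N - K)    # sigma^2-hat, one replication

mc_mean = sigma2_hats.mean()        # close to sigma2 (unbiasedness)
mc_var = sigma2_hats.var()          # close to 2*sigma2^2 / (N - K)
theory_var = 2 * sigma2**2 / (N - K)
```

With 20,000 replications, the simulated mean and variance of σ̂² match the theoretical values to within a few percent.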