Linear Regression Models
2.1 Relationship between Two Variables
In this section, we describe the basic method of explicating the relationship between a variable that represents the outcome of a phenomenon and a variable suspected of affecting this outcome, based on observed data.
The relationship used in our example is actually Hooke’s well-known law of elasticity, which states, essentially, that a spring changes shape under an applied force and that within the spring’s limit of elasticity the change is proportional to the force.

Table 2.1 The length of a spring under different weights.
x (g)     5     10    15    20    25    30    35    40    45    50
y (cm)    5.4   5.7   6.9   6.4   8.2   7.7   8.4   10.1  9.9   10.5
2.1.1 Data and Modeling
Table 2.1 shows ten observations obtained by measuring the length of a spring (y cm) under different weights (x g). The data are plotted in Figure 2.1. The plot suggests a straight-line relationship between the two variables, spring length and suspended weight. If the measurements were completely free from error, all of the data points might actually lie on a straight line. As Figure 2.1 shows, however, measurement data generally include errors, commonly referred to as noise, and modeling is therefore required to explicate the relationship between the variables. To find the relationship between spring length (y) and weight (x) from data that include measurement errors, let us attempt the modeling based on an initially unknown function y = u(x).
We first consider a more specific expression for the unknown function u(x) that represents the true structure of the spring phenomenon.
The data plot, as well as our a priori knowledge that the function should be linear, suggests that the function should describe a straight line. We therefore adopt a linear model as our specified model, so that
$$ y = u(x) = \beta_0 + \beta_1 x. \tag{2.1} $$
We then attempt to apply this linear model in order to explicate the relationship between the spring length (y) and the weight (x) as a physical phenomenon.
If there were no errors in the data shown in Table 2.1, then all 10 data points would lie on a straight line with an appropriately selected intercept (β0) and slope (β1). Because of measurement errors, however, many of the actual data points will depart from any straight line. To include consideration for this departure (ε) from a straight line by data points obtained with different weights, we therefore assume that they satisfy the relation

Spring length = β0 + β1 × Weight + Error. (2.2)

For the individual data points, we then have 5.4 = β0 + β1 × 5 + ε1, · · ·, 8.2 = β0 + β1 × 25 + ε5, · · ·. Figure 2.2 illustrates the relationship considering the fifth data point (25, 8.2).

Figure 2.1 Data obtained by measuring the length of a spring (y cm) under different weights (x g). [Plot; horizontal axis: Weight x (g); vertical axis: Spring length y (cm).]

Table 2.2 The n observed data.
No.                      1     2     · · ·   i     · · ·   n
Experiment points (x)    x1    x2    · · ·   xi    · · ·   xn
Observed data (y)        y1    y2    · · ·   yi    · · ·   yn
In general, let us assume that measurements are performed for n experiment points, as in Table 2.2, and that the measurement at a given experiment point xi is yi. The general model corresponding to (2.2) is then

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, \cdots, n, \tag{2.3} $$

where β0 and β1 are regression coefficients, εi is the error term, and the equation in (2.3) is called the linear regression model. The variable y, which represents the length of the spring in the above experiment, is the response variable, and the variable x, which represents the weight in that experiment, is the predictor variable. The variables y and x are also often referred to as the dependent variable and the independent variable (or explanatory variable), respectively.

Figure 2.2 The relationship between the spring length (y) and the weight (x). [Plot; horizontal axis: Weight x (g); vertical axis: Spring length y (cm).]
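To make model (2.3) concrete, the following minimal sketch computes the error terms εi = yi − (β0 + β1xi) for the Table 2.1 data under a trial line. The trial coefficients (5.0 and 0.1) and the function name `errors` are illustrative assumptions, not values or names taken from the text.

```python
# Table 2.1: weights x (g) and measured spring lengths y (cm).
x = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
y = [5.4, 5.7, 6.9, 6.4, 8.2, 7.7, 8.4, 10.1, 9.9, 10.5]


def errors(beta0, beta1, xs, ys):
    """Return eps_i = y_i - (beta0 + beta1 * x_i) for each data point, as in (2.3)."""
    return [yi - (beta0 + beta1 * xi) for xi, yi in zip(xs, ys)]


# Hypothetical trial coefficients, chosen only for illustration.
trial_beta0, trial_beta1 = 5.0, 0.1
eps = errors(trial_beta0, trial_beta1, x, y)

for xi, yi, ei in zip(x, y, eps):
    print(f"x = {xi:2d} g, y = {yi:4.1f} cm, error = {ei:+.2f} cm")
```

A different trial line gives a different set of errors, which is why a criterion for choosing among candidate lines is needed.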
This brings us to the question of how to fit a straight line to observed data in order to obtain a model that appropriately expresses the data. It is essentially a question of how to determine the regression coefficients β0
and β1. Various model estimation procedures can be used to determine the appropriate parameter values. One of these is the method of least squares.
2.1.2 Model Estimation by Least Squares
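The underlying concept of the linear regression model (2.3) is that the true value of the response variable at the i-th point xi is β0 + β1xi and that the observed value yi includes the error εi. The method of least squares consists essentially of finding the values of the regression coefficients β0 and β1 that minimize the sum of squared errors

$$ S(\beta_0, \beta_1) \equiv \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \{ y_i - (\beta_0 + \beta_1 x_i) \}^2. \tag{2.4} $$

Differentiating (2.4) with respect to the regression coefficients β0 and β1, and setting the resulting derivatives equal to zero, we have

$$ \frac{\partial S(\beta_0, \beta_1)}{\partial \beta_0} = -2 \sum_{i=1}^{n} \{ y_i - (\beta_0 + \beta_1 x_i) \} = 0, \qquad \frac{\partial S(\beta_0, \beta_1)}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i \{ y_i - (\beta_0 + \beta_1 x_i) \} = 0. \tag{2.5} $$

The regression coefficients that minimize the sum of squared errors can be obtained by solving the above simultaneous equations. The solutions are called the least squares estimates and are denoted by β̂0 and β̂1. The equation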
$$ y = \hat{\beta}_0 + \hat{\beta}_1 x, \tag{2.6} $$
having its coefficients determined by the least squares estimates, is the estimated linear regression model. We can thus find the model that best fits the data by minimizing the sum of squared errors.
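As a sketch of how the simultaneous equations can be solved in closed form: setting the derivatives of the sum of squared errors (2.4) to zero gives two conditions, and eliminating β0 between them yields the estimates below. The sample means x̄ and ȳ are notation introduced here for convenience, and the derivation assumes that the xi are not all equal.

```latex
\begin{align*}
\sum_{i=1}^{n} \{ y_i - (\beta_0 + \beta_1 x_i) \} = 0
  &\;\Longrightarrow\;
  \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \\
\sum_{i=1}^{n} x_i \{ y_i - (\beta_0 + \beta_1 x_i) \} = 0
  &\;\Longrightarrow\;
  \hat{\beta}_1 =
  \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
       {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\end{align*}
\qquad \text{where } \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,
\quad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i .
```

The first condition says that the fitted line passes through the point (x̄, ȳ); substituting it into the second condition leaves a single equation in β1.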
The value of ŷi = β̂0 + β̂1xi at each xi (i = 1, 2, · · · , n) is called the predicted value. The difference between this value and the observed value yi at xi, ei = yi − ŷi, is called the residual, and the sum of the squares of the residuals is given by $\sum_{i=1}^{n} e_i^2$ (Figure 2.3).
Example 2.1 (Hooke’s law of elasticity) For the data shown in Table 2.1, the sum of squared errors in the linear regression model is S(β0, β1) = {5.4 − (β0 + β1 × 5)}² + {5.7 − (β0 + β1 × 10)}² + · · · + {10.5 − (β0 + β1 × 50)}², in which S(β0, β1) is a function of the regression coefficients β0, β1. The least squares estimates that minimize this function are β̂0 = 4.65 and β̂1 = 0.12, and the estimated linear regression model is therefore y = 4.65 + 0.12x. In this way, by modeling from a set of observed data, we have derived in approximation a physical law representing the relationship between the weight and the spring length.
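The computation in this example can be sketched in a few lines of Python. The sketch below assumes the standard closed-form solution of the least squares conditions (slope equal to the ratio of the centered cross-product sum to the centered sum of squares); the function and variable names are illustrative and do not come from the text.

```python
# Table 2.1: weights x (g) and measured spring lengths y (cm).
x = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
y = [5.4, 5.7, 6.9, 6.4, 8.2, 7.7, 8.4, 10.1, 9.9, 10.5]


def least_squares(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    s_xx = sum((xi - x_bar) ** 2 for xi in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


beta0_hat, beta1_hat = least_squares(x, y)

# Predicted values and residuals at the observed weights.
predicted = [beta0_hat + beta1_hat * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]
residual_sum_of_squares = sum(e ** 2 for e in residuals)

print(f"intercept = {beta0_hat:.3f}, slope = {beta1_hat:.4f}")
print(f"residual sum of squares = {residual_sum_of_squares:.4f}")
```

For the values as tabulated in Table 2.1, this computation gives a slope of about 0.117 and an intercept of about 4.69, which is close to, though not identical with, the rounded figures quoted in the example. The residuals sum to zero up to floating-point error, as the first least squares condition requires.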
Figure 2.3 Linear regression and the predicted values and residuals. [Plot labels: linear regression model estimated by the least squares method; predicted value; residual.]

2.1.3 Model Estimation by Maximum Likelihood
In the least squares method, the regression coefficients are estimated by minimizing the sum of squared errors. Maximum likelihood estimation is an alternative method for the same purpose in which the regression coefficients are determined so as to maximize the probability of getting the observed data, for which it is assumed that yi observed at xi emerges in accordance with some type of probability distribution.
Figure 2.4 (a) shows a histogram of 80 measured values obtained while repeatedly suspending a load of 25 g from one end of a spring.
Figure 2.4 (b) represents the errors (i.e., noise) contained in these measurements in the form of a histogram having its origin at the mean value of the measurements. This histogram clearly shows a region containing a high proportion of the obtained measured values. A mathematical model that approximates a histogram showing the probabilistic distribution of a phenomenon is called a probability distribution model.
Of the various distributions that may be adopted in probability distribution models, the most representative is the normal distribution (Gaussian distribution), which is expressed in terms of mean μ and variance σ² and denoted by N(μ, σ²). In the normal distribution model, the observed value yi at xi is regarded as a realization of the random variable Yi, and Yi is normally distributed with mean μi and variance σ², with probability density function

$$ f(y_i \mid x_i; \mu_i, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_i - \mu_i)^2}{2\sigma^2} \right\}, \tag{2.7} $$

where μi for a given xi is the conditional mean value (true value) E[Yi|xi] = u(xi) = μi of the random variable Yi. In the normal distribution, as may be clearly seen in Figure 2.4 (a), the proportion of measured values may be expected to decline sharply with increasing distance from the true value.

Figure 2.4 (a) Histogram of 80 measured values obtained while repeatedly suspending a load of 25 g and its approximated probability model. (b) The errors (i.e., noise) contained in these measurements in the form of a histogram having its origin at the mean value of the measurements and its approximated error distribution.
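The behavior of the normal density (2.7) can be checked with a short sketch. The mean of 8.0 cm and variance of 0.25 cm² used below are hypothetical values chosen only for illustration; they are not estimates from the spring data.

```python
import math


def normal_density(y, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at y, as in (2.7)."""
    return math.exp(-((y - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


# Hypothetical true value and error variance, for illustration only.
mu, sigma2 = 8.0, 0.25

# The density falls off quickly as y moves away from the mean.
for y_value in (8.0, 8.5, 9.0, 9.5):
    print(f"y = {y_value:.1f}: density = {normal_density(y_value, mu, sigma2):.4f}")
```

The density is largest at the mean, is symmetric about it, and shrinks rapidly with distance from it, which is the pattern described for the histogram of repeated measurements.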
In the linear regression model, it is assumed that the true values μ1, μ2, · · ·, μn at the various data points lie on a straight line, and it follows that μi = β0 + β1xi, so that the density (2.7) may be expressed as

$$ f(y_i \mid x_i; \beta_0, \beta_1, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{\{ y_i - (\beta_0 + \beta_1 x_i) \}^2}{2\sigma^2} \right]. \tag{2.8} $$

This function decreases with increasing deviation of the observed value yi from the true value β0 + β1xi. Assuming that the observed data yi around the true value β0 + β1xi at xi thus follow the probability distribution f(yi|xi; β0, β1, σ²), it is then an expression of the plausibility or certainty of the occurrence of a given value of yi, called the likelihood of yi.
Assuming that the observed data y1, y2, · · ·, yn are mutually independent and identically distributed (i.i.d.), the likelihood with n data and thus the plausibility with n specific data is given by the product of the likelihoods of all observed data

$$ \prod_{i=1}^{n} f(y_i \mid x_i; \beta_0, \beta_1, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \{ y_i - (\beta_0 + \beta_1 x_i) \}^2 \right] \equiv L(\beta_0, \beta_1, \sigma^2). \tag{2.9} $$

Given the data {(xi, yi); i = 1, 2, · · · , n} in (2.9), the function L(β0, β1, σ²) of the parameters β0, β1, σ² is then the likelihood function. Maximum likelihood is a method of finding the parameter values that maximize this likelihood function, and the resulting estimates are called the maximum likelihood estimates. For ease of calculation, the maximum likelihood estimates are usually obtained by maximizing the log-likelihood function
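$$ \ell(\beta_0, \beta_1, \sigma^2) \equiv \log L(\beta_0, \beta_1, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \{ y_i - (\beta_0 + \beta_1 x_i) \}^2. \tag{2.10} $$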
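The parameter values β̂0, β̂1, σ̂² that maximize the log-likelihood function are thus obtained by solving the equations

$$ \frac{\partial \ell(\beta_0, \beta_1, \sigma^2)}{\partial \beta_0} = 0, \qquad \frac{\partial \ell(\beta_0, \beta_1, \sigma^2)}{\partial \beta_1} = 0, \qquad \frac{\partial \ell(\beta_0, \beta_1, \sigma^2)}{\partial \sigma^2} = 0. \tag{2.11} $$

Specific solutions will be given in Section 2.2.2.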
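As a minimal sketch, the log-likelihood (2.10) can be evaluated directly and compared with the sum of the logarithms of the individual normal densities, which is what the product form of the likelihood implies. The coefficients 4.65 and 0.12 are the estimates quoted in Example 2.1; the variance 0.25 and all function names are assumptions made for illustration.

```python
import math

# Table 2.1: weights x (g) and measured spring lengths y (cm).
x = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
y = [5.4, 5.7, 6.9, 6.4, 8.2, 7.7, 8.4, 10.1, 9.9, 10.5]


def log_likelihood(beta0, beta1, sigma2, xs, ys):
    """Log-likelihood (2.10) of the linear regression model with normal errors."""
    n = len(xs)
    sse = sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(xs, ys))
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sse / (2.0 * sigma2)


def log_density(yi, mu, sigma2):
    """Logarithm of the N(mu, sigma2) density at yi."""
    return -0.5 * math.log(2.0 * math.pi * sigma2) - (yi - mu) ** 2 / (2.0 * sigma2)


# Coefficients quoted in Example 2.1; the variance is a hypothetical value.
beta0, beta1, sigma2 = 4.65, 0.12, 0.25

total = log_likelihood(beta0, beta1, sigma2, x, y)
summed = sum(log_density(yi, beta0 + beta1 * xi, sigma2) for xi, yi in zip(x, y))

print(f"log-likelihood from (2.10):   {total:.6f}")
print(f"sum of log densities:         {summed:.6f}")
```

The two numbers agree up to floating-point error, since taking the logarithm turns the product of densities into a sum.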
The first term of the log-likelihood function defined in (2.10) does not depend on β0, β1, and the second term is the sum of squared errors multiplied by the negative factor −1/(2σ²), since σ² > 0. Accordingly, the values of the regression coefficients β0 and β1 that maximize the log-likelihood function are those that minimize

$$ \sum_{i=1}^{n} \{ y_i - (\beta_0 + \beta_1 x_i) \}^2. \tag{2.12} $$
With the assumption of a normal distribution model for the data, the maximum likelihood estimates of the regression coefficients are thus equivalent to the least squares estimates of the regression coefficients, that is, the minimizer of (2.4).
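This equivalence can be illustrated numerically. The sketch below computes the least squares estimates for the Table 2.1 data with the standard closed-form solution, then evaluates the log-likelihood on a small grid of perturbed coefficients for several fixed variances. The grid offsets, the variance values, and the function names are arbitrary choices made for illustration; a grid check of this kind is only a sanity check, not a proof.

```python
import math

# Table 2.1: weights x (g) and measured spring lengths y (cm).
x = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
y = [5.4, 5.7, 6.9, 6.4, 8.2, 7.7, 8.4, 10.1, 9.9, 10.5]
n = len(x)


def sum_squared_errors(beta0, beta1):
    """Sum of squared errors for the line beta0 + beta1 * x."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


def log_likelihood(beta0, beta1, sigma2):
    """Log-likelihood of the normal linear regression model."""
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sum_squared_errors(beta0, beta1) / (2.0 * sigma2)


# Least squares estimates from the closed-form solution.
x_bar = sum(x) / n
y_bar = sum(y) / n
b1_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0_hat = y_bar - b1_hat * x_bar

# Hypothetical perturbations of the intercept and slope around the estimates.
INTERCEPT_OFFSETS = (-0.2, -0.1, 0.0, 0.1, 0.2)
SLOPE_OFFSETS = (-0.01, -0.005, 0.0, 0.005, 0.01)


def best_offsets(sigma2):
    """Return the (intercept, slope) offsets with the largest log-likelihood."""
    candidates = [
        (log_likelihood(b0_hat + d0, b1_hat + d1, sigma2), d0, d1)
        for d0 in INTERCEPT_OFFSETS
        for d1 in SLOPE_OFFSETS
    ]
    _, d0, d1 = max(candidates)
    return d0, d1


for sigma2 in (0.1, 0.5, 2.0):
    print(f"sigma^2 = {sigma2}: best offsets = {best_offsets(sigma2)}")
```

For every variance tried, the largest log-likelihood on the grid occurs at zero offset, that is, at the least squares estimates themselves. The choice of σ² changes the value of the log-likelihood but not the location of its maximum in β0 and β1.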