Business Analytics
Linear Regression
4.b.Introduction to Regression Analysis
• Regression analysis is used to:
– Predict the value of a dependent variable based on the value of at least one independent variable
– Explain the impact of changes in an independent variable on the dependent variable
• Dependent variable: the variable we wish to explain, usually denoted by Y
• Independent variable: the variable used to explain the dependent variable, denoted by X. It is sometimes also referred to as the predictor variable
4.b.Simple Linear Regression Model
(SLRM)
– Only one independent variable, x (when there is only one predictor variable, the prediction method is called simple regression)
– The relationship between x and y is described by a linear function (in other words, a simple linear regression is one where the predictions of Y, when plotted as a function of X, form a straight line).
– Changes in y are generally assumed to be caused by changes in x
4.b.SLRM Example
• I have a data set which contains 25,000 records of human heights and weights (source: http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html)
• You can download a CSV file of the data from: https://app.box.com/s/10z9keaeyucqc0nks8uq
4.b.Assumptions
1. A linear relationship exists between the dependent and the independent variable.
2. The independent variable is uncorrelated with the residuals.
3. The expected value of the residual term is zero.
4. The variance of the residual term is constant for all observations (homoskedasticity).
5. The residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation.
6. The residual term is normally distributed.
$E(\varepsilon) = 0$
$E(\varepsilon_i^2) = \sigma^2$
$E[\varepsilon_i \varepsilon_j] = 0, \quad i \neq j$
4.b.Types of Regression Models
[Figure: four scatter plots illustrating a Positive Linear Relationship, a Negative Linear Relationship, No Relationship, and a Relationship NOT Linear]
4.b.Population Linear Regression
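$Y = \beta_0 + \beta_1 X + u$
[Figure: population regression line, Y plotted against X, marking the intercept β0, the slope β1, the predicted value of Y for Xi, the individual person's marks (the observed value), and the random error ui for this x value]
4.b.Population Regression Function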
$Y = \beta_0 + \beta_1 X + u$
– Y: dependent variable
– β0: population y intercept
– β1: population slope coefficient
– X: independent variable
– u: random error term, or residual
– β0 + β1X is the linear component; u is the random error component
But can we actually get this equation? If yes, what information would we need?
4.b.Sample Regression Function
$y_i = b_0 + b_1 x_i + e_i$
[Figure: sample regression line, Y plotted against X, marking the intercept b0, the slope b1, the predicted value of Y for xi, the observed value of y for xi, and the random error ei for this x value]
4.b.Sample Regression Function
$y_i = b_0 + b_1 x_i + e_i$
– b0: estimate of the regression intercept
– b1: estimate of the regression slope
– x: independent variable
– e: error term
Notice the similarity with the Population Regression Function. Can we do something about the error term?
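To make the population/sample distinction concrete, here is a minimal sketch. It simulates a hypothetical population with known β0 and β1 (the values are arbitrary assumptions, not taken from the heights and weights data), draws a sample, and computes b0 and b1 from that sample. The estimates come from the standard least-squares formulas, which correspond to the OLS method covered later in these slides; the helper names `draw_sample` and `fit_line` are illustrative only.

```python
import random

# Hypothetical population parameters (assumed for illustration only).
BETA0, BETA1 = 2.0, 0.5


def draw_sample(n, seed):
    """Draw n (x, y) pairs from Y = BETA0 + BETA1 * X + u, with u ~ N(0, 1)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [BETA0 + BETA1 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Return (b0, b1), the least-squares estimates for a simple regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


xs, ys = draw_sample(n=2000, seed=0)
b0, b1 = fit_line(xs, ys)

print(f"population: beta0={BETA0}, beta1={BETA1}")
print(f"sample:     b0={b0:.3f}, b1={b1:.3f}")
```

In practice the population parameters are unknown; the simulation fixes them only so that the sample estimates can be compared against them.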
4.b.The error term (residual)
– Represents the influence of all the variables which we have not accounted for in the equation
– It represents the difference between the actual y values and the y values predicted from the Sample Regression Line
– Wouldn't it be good if we were able to reduce this error term?
– What are we trying to achieve by Sample Regression?
$\hat{y}_i = b_0 + b_1 x_i$
$Y = \beta_0 + \beta_1 X + u$
To predict the PRL (Population Regression Line) from the SRL (Sample Regression Line)
4.b.Our Objective
4.b.One method to find b0 and b1
– Method of Ordinary Least Squares (OLS)
– b0 and b1 are obtained by finding the values that minimize the sum of the squared residuals
– Are there any advantages of minimizing the squared errors?
– Why don't we simply take the sum of the residuals?
– Why don't we take absolute values instead?
$\sum e^2 = \sum (y - \hat{y})^2 = \sum (y - (b_0 + b_1 x))^2$
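The minimization above has a well-known closed-form solution for simple regression: b1 = Sxy / Sxx and b0 = ȳ − b1·x̄. The slides do not state it, so treat the sketch below as one possible illustration rather than the method used in the course. It uses made-up toy data and also hints at why the plain sum of residuals is a poor criterion: a flat line through the mean of y has residuals that sum to zero, yet it fits worse than the least-squares line.

```python
# Made-up toy data, used only to illustrate the arithmetic.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def ols_fit(xs, ys):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Residuals e = y - (b0 + b1 * x) for a given line."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def sse(res):
    """Sum of squared residuals."""
    return sum(e * e for e in res)


b0, b1 = ols_fit(xs, ys)
ols_res = residuals(xs, ys, b0, b1)

# A deliberately poor line: flat at the mean of y (slope 0).
y_bar = sum(ys) / len(ys)
flat_res = residuals(xs, ys, y_bar, 0.0)

print(f"OLS line:  b0={b0:.2f}, b1={b1:.2f}, SSE={sse(ols_res):.2f}")
print(f"Flat line: sum of residuals={sum(flat_res):.2f}, SSE={sse(flat_res):.2f}")
```

On this toy data the least-squares line is b0 = 2.2, b1 = 0.6 with a sum of squared residuals of 2.4. The flat line also has residuals summing to zero, but its sum of squared residuals is 6, because positive and negative residuals cancel in a plain sum.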
4.b.OLS Regression Properties
– The sum of the residuals from the least squares regression line is 0: $\sum (y - \hat{y}) = 0$
– The sum of the squared residuals is a minimum: Minimize $\sum (y - \hat{y})^2$
– The simple regression line always passes through the mean of the y variable and the mean of the x variable
– The least squares coefficients are unbiased estimates of β0 and β1
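The first three properties can be checked numerically on a small example. The sketch below uses made-up toy data and the standard closed-form least-squares estimates (an assumption, since the slides do not give the formulas). Unbiasedness is a statement about repeated sampling and is not checked here.

```python
# Made-up toy data, used only to check the properties numerically.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Standard closed-form least-squares estimates for simple regression.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar


def sse(a, b):
    """Sum of squared residuals for the line y = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Property 1: the residuals sum to zero.
residual_sum = sum(y - (b0 + b1 * x) for x, y in zip(xs, ys))

# Property 3: the fitted line passes through (x_bar, y_bar).
value_at_mean = b0 + b1 * x_bar

# Property 2: any nearby line has a larger sum of squared residuals.
best = sse(b0, b1)
nearby = [
    sse(b0 + d0, b1 + d1)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.1, 0.0, 0.1)
    if (d0, d1) != (0.0, 0.0)
]

print(f"sum of residuals: {residual_sum:.2e}")
print(f"fitted value at x_bar: {value_at_mean:.3f}, y_bar: {y_bar:.3f}")
print(f"SSE at OLS estimates: {best:.3f}, smallest nearby SSE: {min(nearby):.3f}")
```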
4.b.Interpretation of the Slope and
the Intercept
– b0 is the estimated average value of y when the value of x is zero. More often than not it does not have a physical interpretation
– b1 is the estimated change in the average value of y as a result of a one-unit change in x
$\hat{Y} = b_0 + b_1 X$
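Both interpretations follow directly from the fitted line. As a short worked check, writing the prediction at x as ŷ(x) (notation introduced here for convenience):

```latex
\begin{align*}
\hat{y}(0) &= b_0 + b_1 \cdot 0 = b_0 \\
\hat{y}(x + 1) - \hat{y}(x) &= \bigl[b_0 + b_1 (x + 1)\bigr] - \bigl[b_0 + b_1 x\bigr] = b_1
\end{align*}
```

For instance, with hypothetical estimates b0 = 2.2 and b1 = 0.6, the predicted y rises by 0.6 for each one-unit increase in x, whatever the starting value of x.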
4.b.Limitations of Regression
Analysis
– Parameter Instability - This happens in situations where correlations change over a period of time. This is very common in financial markets where economic, tax, regulatory, and political factors change frequently.
– Public knowledge of a specific regression relation may cause a large number of people to react in a similar fashion towards the variables, negating its future usefulness.
– If any regression assumptions are violated, predictions of the dependent variable and hypothesis tests may not be valid.
4.b.General Multiple Linear
Regression Model
– In simple linear regression, the dependent variable was assumed to depend on only one independent variable
– In the General Multiple Linear Regression model, the dependent variable derives its value from two or more independent variables.
– The General Multiple Linear Regression model takes the following form:
$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} + \varepsilon_i$
where:
$Y_i$ = i-th observation of the dependent variable Y
$X_{ki}$ = i-th observation of the k-th independent variable X
$b_0$ = intercept term
$b_k$ = slope coefficient of the k-th independent variable
$\varepsilon_i$ = error term of the i-th observation
n = number of observations
k = total number of independent variables
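As a rough illustration of this model with k = 2, the sketch below simulates data from assumed coefficients and recovers them by least squares, solving the normal equations (XᵀX)b = Xᵀy with a small hand-written Gaussian elimination. The coefficient values, sample size, and helper names (`solve`, `ols_multiple`) are assumptions made for the example; in practice a numerical library would normally be used.

```python
import random


def solve(a, rhs):
    """Solve the square system a @ x = rhs by Gaussian elimination
    with partial pivoting. Assumes the system is non-singular."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back-substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols_multiple(rows, ys):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical coefficients b0, b1, b2 (assumed for illustration only).
TRUE_B = [1.0, 2.0, -3.0]

rng = random.Random(1)
rows = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
ys = [
    TRUE_B[0] + TRUE_B[1] * x1 + TRUE_B[2] * x2 + rng.gauss(0, 0.5)
    for x1, x2 in rows
]

coef = ols_multiple(rows, ys)
fitted = [coef[0] + coef[1] * x1 + coef[2] * x2 for x1, x2 in rows]
resid = [y - f for y, f in zip(ys, fitted)]

print("estimated coefficients:", [round(c, 3) for c in coef])
print("sum of squared residuals:", round(sum(e * e for e in resid), 3))
```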
4.b.Estimated Regression Equation
– Just as we calculated the intercept and the slope coefficient in simple linear regression by minimizing the sum of squared errors, we estimate the intercept and slope coefficients in multiple linear regression in the same way.
• The sum of squared errors, $\sum_{i=1}^{n} \varepsilon_i^2$, is minimized and the slope coefficients are estimated.
– The resultant estimated equation becomes:
$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki}$
– Now the error in the i-th observation can be written as:
$\varepsilon_i = Y_i - \hat{Y}_i = Y_i - (b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki})$
4.b.Interpreting the Estimated
Regression Equation
– Intercept term (b0): the value of the dependent variable when the values of all independent variables are zero.
$b_0 = \text{value of } Y \text{ when } X_1 = X_2 = \dots = X_k = 0$
– Slope coefficient (bk): the change in the dependent variable for a unit change in the corresponding independent variable (Xk), keeping all other independent variables constant.
• In reality, when the value of an independent variable changes by one unit, the change in the dependent variable is not equal to the slope coefficient but also depends on the correlation among the independent variables.
• Therefore, the slope coefficients are also called partial slope coefficients.
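The difference between a partial slope and the overall change can be seen in a small simulation. The sketch below assumes a hypothetical model Y = 1 + 2·X1 + 3·X2 + error in which X2 moves closely with X1 (all values are invented for illustration). Regressing Y on X1 alone gives a slope near 5, because a change in X1 is usually accompanied by a change in X2, while the two-variable regression recovers a partial slope near 2. The two-regressor formulas used are the standard solution of the normal equations written in terms of centered sums.

```python
import random

rng = random.Random(7)
n = 500

# Hypothetical data: X2 is strongly correlated with X1.
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + rng.gauss(0, 0.5) for a in x1]
y = [1.0 + 2.0 * a + 3.0 * b + rng.gauss(0, 0.5) for a, b in zip(x1, x2)]


def centered_cross(u, v):
    """Sum of products of deviations from the means of u and v."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


s11 = centered_cross(x1, x1)
s22 = centered_cross(x2, x2)
s12 = centered_cross(x1, x2)
s1y = centered_cross(x1, y)
s2y = centered_cross(x2, y)

# Simple regression of Y on X1 alone (X2 is ignored).
simple_slope = s1y / s11

# Two-variable regression: partial slopes from the 2x2 normal equations.
det = s11 * s22 - s12 ** 2
partial_b1 = (s22 * s1y - s12 * s2y) / det
partial_b2 = (s11 * s2y - s12 * s1y) / det

print(f"slope of Y on X1 alone:  {simple_slope:.3f}")
print(f"partial slope b1:        {partial_b1:.3f}")
print(f"partial slope b2:        {partial_b2:.3f}")
```

The simple slope equals b1 + b2·(S12 / S11), so the gap between the two numbers is driven by how strongly X2 co-moves with X1.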
4.b.Assumptions of Multiple
Regression Model
– There exists a linear relationship between the dependent and independent variables.
– The expected value of the error term, conditional on the independent variables, is zero.
– The error terms are homoskedastic, i.e. the
variance of the error terms is constant for all the observations.
– The expected value of the product of error terms is always zero, which implies that the error terms are uncorrelated with each other.
– The error term is normally distributed.
– The independent variables do not have any exact linear relationship with each other.
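The last assumption can be screened for with a simple check. The sketch below, on made-up numbers, compares two candidate regressors: one that is an exact linear function of X1 and one that is not. When the relationship is exact, the correlation is 1 and the determinant of the centered cross-product matrix is zero, so the normal equations have no unique solution. The helper names are illustrative, and this pairwise check is only a rough screen; it would not catch an exact relationship that involves three or more variables.

```python
import math


def centered_cross(u, v):
    """Sum of products of deviations from the means of u and v."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))


def pearson(u, v):
    """Pearson correlation coefficient between u and v."""
    return centered_cross(u, v) / math.sqrt(
        centered_cross(u, u) * centered_cross(v, v)
    )


def cross_product_det(u, v):
    """Determinant S11 * S22 - S12^2 of the centered cross-product matrix."""
    return centered_cross(u, u) * centered_cross(v, v) - centered_cross(u, v) ** 2


# Made-up regressors for illustration.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_exact = [2.0 * a + 1.0 for a in x1]  # exact linear function of x1
x2_ok = [2.0, 1.0, 4.0, 3.0, 6.0]       # related to x1, but not exactly

r_exact = pearson(x1, x2_exact)
r_ok = pearson(x1, x2_ok)
det_exact = cross_product_det(x1, x2_exact)
det_ok = cross_product_det(x1, x2_ok)

print(f"exact case: r={r_exact:.3f}, determinant={det_exact:.3f}")
print(f"other case: r={r_ok:.3f}, determinant={det_ok:.3f}")
```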
Thank you!
© Pristine – www.edupristine.com