Business Analytics

(1)

Linear Regression

(2)

4.b.Introduction to Regression Analysis

• Regression analysis is used to:

– Predict the value of a dependent variable based on the value of at least one independent variable

– Explain the impact of changes in an independent variable on the dependent variable

• Dependent variable: the variable we wish to explain, usually denoted by Y.

• Independent variable: the variable used to explain the dependent variable, denoted by X. It is sometimes also referred to as the predictor variable.

(3)

4.b.Simple Linear Regression Model (SLRM)

– Only one independent variable, x. (When there is only one predictor variable, the prediction method is called simple regression.)

– The relationship between x and y is described by a linear function. (In other words, a simple linear regression is one where the predictions of Y, when plotted as a function of X, form a straight line.)

– Changes in y are generally assumed to be caused by changes in x.

(4)

4.b.SLRM Example

• I have a data set which contains 25,000 records of human heights and weights (source: http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html)

• You can download a CSV file of the data from: https://app.box.com/s/10z9keaeyucqc0nks8uq

(5)

4.b.Assumptions

1. A linear relationship exists between the dependent and the independent variable.

2. The independent variable is uncorrelated with the residuals.

3. The expected value of the residual term is zero:

$$E(\varepsilon_i) = 0$$

4. The variance of the residual term is constant for all observations (homoskedasticity):

$$E(\varepsilon_i^2) = \sigma_\varepsilon^2$$

5. The residual term is independently distributed; that is, the residual for one observation is not correlated with that of another observation:

$$E(\varepsilon_i \varepsilon_j) = 0, \quad j \neq i$$

6. The residual term is normally distributed.

(6)

4.b.Types of Regression Models

[Figure: four scatter plots titled "Positive Linear Relationship", "Negative Linear Relationship", "No Relationship", and "Relationship NOT Linear".]

(7)

4.b.Population Linear Regression

$$Y = \beta_0 + \beta_1 X + u$$

[Figure: the population regression line, plotted with Y against X. Labels: intercept = β₀; slope = β₁; predicted value of Y for x_i; an individual person's marks at x_i; u_i, the random error for this x value.]

(8)

4.b.Population Regression Function

$$Y = \beta_0 + \beta_1 X + u$$

where:

Y = dependent variable

β₀ = population y intercept

β₁ = population slope coefficient

X = independent variable

u = random error term, or residual

β₀ + β₁X is the linear component and u is the random error component.



But can we actually obtain this equation? If so, what information would we need?

(9)

4.b.Sample Regression Function

$$y_i = b_0 + b_1 x_i + e_i$$

[Figure: the sample regression line, plotted with Y against X. Labels: intercept = b₀; slope = b₁; predicted value of Y for x_i; observed value of y for x_i; e_i, the random error for this x value.]

(10)

4.b.Sample Regression Function

$$y_i = b_0 + b_1 x_i + e_i$$

where:

b₀ = estimate of the regression intercept

b₁ = estimate of the regression slope

x_i = independent variable

e_i = error term

Notice the similarity with the Population Regression Function.

Can we do something about the error term?

(11)

4.b.The error term (residual)

– It represents the influence of all the variables that we have not accounted for in the equation

– It represents the difference between the actual y values and the y values predicted by the Sample Regression Line

– Wouldn't it be good if we were able to reduce this error term?

– What are we trying to achieve by Sample Regression?

(12)

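4.b.Our Objective

To estimate the population regression line (PRL) from the sample regression line (SRL):

$$\hat{y}_i = b_0 + b_1 x_i \quad \text{(SRL)}$$

$$Y = \beta_0 + \beta_1 X + u \quad \text{(PRL)}$$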

(13)

4.b.One method to find b₀ and b₁

– Method of Ordinary Least Squares (OLS)

– b₀ and b₁ are obtained by finding the values of b₀ and b₁ that minimize the sum of the squared residuals

– Are there any advantages of minimizing the squared errors?

– Why don't we take the plain sum of the residuals?

– Why don't we take absolute values instead?

$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \bigl(y - (b_0 + b_1 x)\bigr)^2$$
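As a rough illustration of the least squares idea, the sketch below computes b₀ and b₁ with the standard closed-form solution for one predictor (slope = sum of cross-deviations divided by sum of squared x-deviations). The function names and the toy data are assumptions made for this example; they do not come from the heights and weights data set.

```python
def fit_simple_ols(xs, ys):
    """Return (b0, b1) that minimize sum((y - (b0 + b1 * x)) ** 2)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x must take at least two distinct values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared differences between observed y and b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_simple_ols(xs, ys)
sse = sum_squared_residuals(xs, ys, b0, b1)
print(f"b0 is about {b0:.3f}, b1 is about {b1:.3f}, SSE is about {sse:.3f}")
```

Moving either coefficient away from the fitted values (for example, `sum_squared_residuals(xs, ys, b0 + 0.1, b1)`) gives a larger sum of squared residuals, which is what "least squares" refers to.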













(14)

4.b.OLS Regression Properties

– The sum of the residuals from the least squares regression line is 0:

$$\sum (y - \hat{y}) = 0$$

– The sum of the squared residuals is a minimum:

$$\text{Minimize} \sum (y - \hat{y})^2$$

– The simple regression line always passes through the mean of the y variable and the mean of the x variable.

– The least squares coefficients are unbiased estimates of β₀ and β₁.
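The first three properties can be checked numerically on any small sample. The sketch below is a minimal check on hypothetical toy data (the numbers are invented for illustration). It fits the line with the usual closed-form formulas and then inspects the residuals. Unbiasedness is a statement about repeated sampling, so it cannot be verified from a single sample and is not checked here.

```python
# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 4.0, 7.0, 9.0]
ys = [3.0, 5.0, 4.0, 8.0, 11.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form least squares estimates for one predictor.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar


def sse_for(intercept, slope):
    """Sum of squared residuals for an arbitrary candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Property 1: the residuals sum to zero (up to floating-point rounding).
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)

# Property 2: nearby lines all have a larger sum of squared residuals.
sse = sse_for(b0, b1)
sse_nearby = [sse_for(b0 + d0, b1 + d1)
              for d0 in (-0.5, 0.5) for d1 in (-0.1, 0.1)]

# Property 3: the fitted line passes through (x_bar, y_bar).
value_at_x_bar = b0 + b1 * x_bar

print(abs(residual_sum) < 1e-9)
print(all(s > sse for s in sse_nearby))
print(abs(value_at_x_bar - y_bar) < 1e-9)
```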





(15)

4.b.Interpretation of the Slope and the Intercept

– b₀ is the estimated average value of y when the value of x is zero. More often than not, it does not have a physical interpretation.

– b₁ is the estimated change in the average value of y as a result of a one-unit change in x.

$$\hat{Y} = b_0 + b_1 X$$

[Figure: the fitted line plotted with y against x, with b₀ marked.]
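A short worked step may help show where these two readings come from. For a fitted line of the form ŷ = b₀ + b₁x, compare the fitted value at x + 1 with the fitted value at x, and evaluate the line at x = 0:

$$\hat{y}(x+1) - \hat{y}(x) = \bigl(b_0 + b_1 (x+1)\bigr) - \bigl(b_0 + b_1 x\bigr) = b_1, \qquad \hat{y}(0) = b_0 + b_1 \cdot 0 = b_0$$

The difference does not depend on x, which is why b₁ can be read as the change in the estimated average of y per one-unit change in x anywhere along the line.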

(16)

4.b.Limitations of Regression Analysis

– Parameter Instability - This happens in situations where correlations change over a period of time. This is very common in financial markets where economic, tax, regulatory, and political factors change frequently.

– Public knowledge of a specific regression relation may cause a large number of people to react in a similar fashion towards the variables, negating its future usefulness.

– If any regression assumptions are violated, predictions of the dependent variable and hypothesis tests will not be valid.

(17)

4.b.General Multiple Linear Regression Model

– In simple linear regression, the dependent variable was assumed to depend on only one variable (the independent variable).

– In the General Multiple Linear Regression model, the dependent variable derives its value from two or more variables.

– The General Multiple Linear Regression model takes the following form:

$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_k X_{ki} + \varepsilon_i$$

where:

Y_i = i-th observation of dependent variable Y

X_ki = i-th observation of k-th independent variable X

b₀ = intercept term

b_k = slope coefficient of k-th independent variable

ε_i = error term of i-th observation

n = number of observations

k = total number of independent variables

(18)

4.b.Estimated Regression Equation

– Just as we calculated the intercept and the slope coefficient in simple linear regression by minimizing the sum of squared errors, we estimate the intercept and slope coefficients in multiple linear regression in the same way.

• The Sum of Squared Errors is minimized and the slope coefficients are estimated:

$$\sum_{i=1}^{n} \hat{\varepsilon}_i^2$$

– The resultant estimated equation becomes:

$$\hat{Y}_i = \hat{b}_0 + \hat{b}_1 X_{1i} + \hat{b}_2 X_{2i} + \dots + \hat{b}_k X_{ki}$$

– Now the error in the i-th observation can be written as:

$$\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - \hat{b}_0 - \hat{b}_1 X_{1i} - \hat{b}_2 X_{2i} - \dots - \hat{b}_k X_{ki}$$

(19)

4.b.Interpreting the Estimated Regression Equation

– Intercept term (b₀): the value of the dependent variable when the values of all independent variables are zero.

$$b_0 = \text{value of } Y \text{ when } X_1 = X_2 = \dots = X_k = 0$$

– Slope coefficient (b_k): the change in the dependent variable from a unit change in the corresponding independent variable (X_k), keeping all other independent variables constant.

• In reality, when the value of an independent variable changes by one unit, the change in the dependent variable is not equal to the slope coefficient alone; it also depends on the correlation among the independent variables.

• Therefore, the slope coefficients are also called partial slope coefficients.
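The difference between a partial slope and the slope from a one-variable regression can be seen on a small example. The sketch below is an illustration under stated assumptions: the data are hypothetical and noise-free, generated from Y = 1 + 2·X₁ + 3·X₂ with correlated X₁ and X₂, and the helper names `solve_linear_system` and `fit_ols` are invented for this example. It fits by solving the least squares normal equations with Gaussian elimination, which is one of several ways to do so.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ols(rows, ys):
    """Least squares fit; rows holds the predictor values per observation.

    Returns [b0, b1, ..., bk] by solving the normal equations.
    """
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data with correlated predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
ys = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

multiple = fit_ols(list(zip(x1, x2)), ys)  # b0, b1, b2 with X2 held constant
simple = fit_ols([[a] for a in x1], ys)    # Y regressed on X1 alone

print("partial slope on X1:", round(multiple[1], 4))
print("slope on X1 when X2 is omitted:", round(simple[1], 4))
```

The partial slope on X₁ recovers the coefficient used to generate the data, while the one-variable slope is larger because X₂ tends to rise with X₁ in this sample and its effect is absorbed into the X₁ slope.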

(20)

4.b.Assumptions of Multiple Regression Model

– There exists a linear relationship between the dependent and independent variables.

– The expected value of the error term, conditional on the independent variables, is zero.

– The error terms are homoskedastic, i.e. the variance of the error terms is constant for all the observations.

– The expected value of the product of error terms is always zero, which implies that the error terms are uncorrelated with each other.

– The error term is normally distributed.

– The independent variables do not have any exact linear relationship with each other.
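A quick first screen for the last assumption is to look at the correlation between pairs of independent variables. The sketch below uses hypothetical toy columns and a hand-written Pearson correlation. Note that it only detects pairwise linear relationships: a variable that is a linear combination of several others can pass this screen, so it is a starting point rather than a complete test.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


# Hypothetical predictor columns, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 10.0]  # exactly 2 * x1: violates the assumption
x3 = [5.0, 3.0, 4.0, 1.0, 2.0]   # related to x1, but not exactly linearly

r12 = pearson_correlation(x1, x2)
r13 = pearson_correlation(x1, x3)

print("corr(x1, x2) =", round(r12, 4))
print("corr(x1, x3) =", round(r13, 4))
```

A correlation of exactly +1 or -1 means one variable is an exact linear function of the other, so the two slope coefficients cannot be estimated separately. Values that are high but short of ±1 do not violate the assumption, although they can make the estimated partial slopes unstable.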

(21)

Thank you!

Pristine

702, Raaj Chambers, Old Nagardas Road, Andheri (E), Mumbai-400 069. INDIA

www.edupristine.com