Dr. Nisha Arora
Forecasting Techniques:
Correlation & Regression Models
Outline
Concept
Applications
Scatter Plot
Correlation/ Rank Correlation
Coefficient of Determination
Linear Regression
General Applications
To predict a relation between
Height & Weight
Attendance in a particular course & Marks scored [other
factors – I.Q. level, hours of study, difficulty level of the course etc.]
Average rainfall & Yield of crop [other factors – soil
quality, fertilizer used etc.]
Advertisement & Sales [other factors - Product quality,
competitors strategy etc.]
Demand & Price
BMI of pregnant mothers & birth weight of new born
babies
Engineering Applications
Surface roughness for steel alloy as a function of
cutting speed and feed rate
Physico-chemical quality of groundwater relating them
to specific hydro geological processes
Feature selection in Machine learning
Fundamental quantum aspects of spin phenomena in
Nano magnetic structures
Sonar, displacement or velocity determination and
Scatter Plot
Response variable
• Suicide rates
Explanatory variable
• % of adults with guns
Relationship
• Linear, positive, and moderately strong
Scatter Plot:
Other examples
Response variable
• Rating performance
Explanatory variable
• Blood Glucose level
Relationship
• Linear, negative, and weak
Correlation
It is the measure of the strength of the linear relationship between two variables, denoted by “r” or “R”.
It may be positive, negative or zero.
Properties of Correlation
The magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables.
The sign of the correlation coefficient indicates the direction of the association.
The correlation coefficient ranges from -1 to +1.
Properties of Correlation
The correlation coefficient is unitless, and is not affected by change in the centre or scale of either variable, e.g., unit conversions.
The correlation coefficient is symmetric with respect to variables.
The correlation coefficient is sensitive to outliers.
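The properties listed above can be checked numerically. The sketch below is only an illustration: the height and weight values are hypothetical, and `pearson_r` is a helper written for this example rather than a library function.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds), for illustration only.
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [110, 120, 140, 155, 160, 180]

r = pearson_r(height_in, weight_lb)

# Unit conversion (inches -> cm, pounds -> kg) leaves r unchanged.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]
r_metric = pearson_r(height_cm, weight_kg)

# Swapping the roles of the two variables leaves r unchanged.
r_swapped = pearson_r(weight_lb, height_in)

# One extreme observation (a hypothetical outlier) pulls r down noticeably.
r_outlier = pearson_r(height_in + [61], weight_lb + [300])

print(round(r, 4), round(r_metric, 4), round(r_swapped, 4), round(r_outlier, 4))
```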
Quiz time
Does R = 0 mean the variables are independent?
No, because there may be a non-linear relationship between the variables.
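A small numerical check of this answer, using made-up points that lie exactly on the curve y = x². The variables are perfectly (but non-linearly) related, yet the correlation coefficient comes out as zero. The helper `pearson_r` is written for this illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# y is completely determined by x, but the relationship is not linear.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(r)  # zero: no linear association, despite full dependence
```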
Quiz time
Does correlation indicate causation (cause & effect relationship)?
No. Correlation is symmetric: it makes no difference which variable is the explanatory variable and which is the response variable, so correlation alone cannot establish cause and effect.
Quiz time
If I get R = 0.85 for the variables “hours of study” of college students in December in a particular city and “number of road accidents” in that time period and in the same area.
What does that mean?
Nothing, as it is a case of spurious (nonsense) correlation.
Quiz time
A national consumer magazine reported the following correlations.
• The correlation between car weight and car reliability is -0.30.
• The correlation between car weight and annual maintenance cost is 0.20.
Which of the following statements are true?
I. Heavier cars tend to be less reliable. II. Heavier cars tend to cost more to maintain. III. Car weight is related more strongly to reliability than to maintenance cost.
Quiz time
Options are (A) I only (B) II only (C) III only
(D) I and II only (E) I, II, and III
Quiz time
Solution is
The correct answer is (E).
Why
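• The correlation between car weight and reliability is negative.
• This means that reliability tends to decrease as car weight increases.
• The correlation between car weight and maintenance cost is positive.
• This means that maintenance cost tends to increase as car weight increases.
• The strength of a relationship is measured by the absolute value of the correlation: |-0.30| > |0.20|, so car weight is related more strongly to reliability than to maintenance cost.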
Quiz time
A researcher uses a regression equation to predict home heating bills (dollar cost), based on home size (square feet). The correlation between predicted bills and home size is 0.70.
What is the correct interpretation of this finding?
I. 70% of the variability in home heating bills can be explained by home size.
II. 49% of the variability in home heating bills can be explained by home size.
III. For each added square foot of home size, heating bills increased by 70 cents.
Quiz time
Solution is
The correct answer is II.
Why
• The coefficient of determination measures the proportion of variation in the dependent variable that is predictable from the independent variable.
• The coefficient of determination is equal to R²; in this case, (0.70)² or 0.49.
• Therefore, 49% of the variability in heating bills can be explained by home size.
Correlation Example
Calculate and analyse the correlation coefficient
between the number of study hours and the number of sleeping hours of different students.
Number of study hours (X): 2 4 6 8 10
Number of sleeping
Solution
Correlation Example
The heights (in inches) of fathers and their sons are as
follows
Is there any relationship between heights of fathers
and their sons ?
Find the coefficient of correlation.
Height of father (X): 65 66 67 67 68 69 70 72
R²
The coefficient of determination R²
The most common measure of the strength of the fit of a linear model
Calculated as the square of the correlation coefficient
It is always between 0 and 1.
Interpretation
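R² can be read as the share of the variation in y that the fitted line accounts for, that is 1 - SSE/SST. The sketch below checks, on hypothetical home-size and heating-bill figures, that this quantity matches the square of the correlation coefficient for a simple linear regression. The data and the helper names `fit_line` and `pearson_r` are assumptions made for the illustration.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: home size (square feet) and heating bill (dollars).
size_sqft = [1000, 1500, 2000, 2500, 3000]
bill_usd = [90, 120, 135, 170, 180]

intercept, slope = fit_line(size_sqft, bill_usd)
mean_bill = sum(bill_usd) / len(bill_usd)

# SSE: variation left unexplained by the line; SST: total variation in y.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(size_sqft, bill_usd))
sst = sum((y - mean_bill) ** 2 for y in bill_usd)

r_squared = 1 - sse / sst
r = pearson_r(size_sqft, bill_usd)
print(round(r_squared, 4), round(r ** 2, 4))  # the two values agree
```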
Correlation blunders
Correlation is not causation.
The observed correlation between two variables might
be due to the action of a third, unobserved variable.
Yule (1926) gave an example of high positive correlation
between yearly number of suicides and membership in the Church of England due not to cause and effect, but to other variables that also varied over time. (Can you suggest some?)
Mosteller and Tukey (1977, p. 318) give an example of
aiming errors made during World War II bomber flights in Europe. Bombing accuracy had a high positive correlation with amount of fighter opposition, that is, the more enemy fighters sent up to distract and shoot down the bombers, the more accurate the bombing run!
Limitation of Correlation
Though R measures how closely the two variables approximate a straight line, it is not a valid measure of the strength of a nonlinear relationship
When the sample size, n, is small, we also have to be careful about the reliability of the correlation
Regression
Regression analysis is a statistical tool that gives us
the ability to estimate the mathematical relationship
between a dependent variable (usually called y) and an independent variable(s) (usually called x).
The dependent variable is the variable for which we
want to make a prediction.
While various non-linear forms may be used, simple
linear regression models are the most common.
Correlation V/s Regression
Correlation
The correlation coefficient, R, measures the strength of bivariate association
Correlation does not imply cause and effect; it simply quantifies how well two variables relate to each other
Correlation is symmetric with respect to the variables
Correlation is almost always used when you measure both variables. It rarely is appropriate when one
Regression
The regression line is a prediction equation that estimates the value of y for any given x
Regression treats one variable as the predictor and the other as the response: the regression line is determined as the best way to predict Y from X. On its own, this does not establish cause and effect.
It is not symmetric with respect to the variables, as we get a different best-fit line if we swap the two (unless we have perfect data with no scatter).
With regression, the X variable is usually something we
Types of Regression Models
Simple regression
• Estimates the relationship between the dependent variable and one independent variable
Multiple regression
• Estimates the relationship between the dependent variable and two or more independent variables
Simple Linear Regression Model
Simple linear regression models are of the form
\[ Y = a + bX + e \]
Where
Y = Dependent variable (response)
X = Independent variable (predictor or explanatory)
a = Intercept
b = Slope of the regression line
e = Random error
a and b are called the parameters of the model.
Least Square Regression Line
If the scatter plot of our sample data suggests a linear relationship between two variables, we can summarize the relationship by drawing a straight line on the plot.
To fit a straight line to the data that makes the residuals of the points from the line as small as possible, there are two options
Minimize the sum of magnitudes of the residuals
Minimize the sum of squared residuals (least squares)
We will write an estimated regression line based on sample data as
\[ \hat{y} = a_0 + b_0 x \]
The method of least squares chooses the values for a_0 and b_0 to minimize the sum of squared errors
\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a_0 - b_0 x_i)^2 \]
Using calculus, we obtain estimating formulas:
\[ b_0 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} \]
\[ a_0 = \bar{y} - b_0 \bar{x} \]
Or
\[ b_0 = r \frac{S_y}{S_x} \]
Where
x̄ = mean of x
ȳ = mean of y
Sy = Standard deviation of y
Sx = Standard deviation of x
r = cor (x, y)
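The least-squares formulas for the slope and intercept can be sketched in a few lines. The data points below are made up for illustration, and the function names are assumptions of this example. The second function checks that the slope can equally be obtained as r times the ratio of the standard deviations.

```python
import math
import statistics

def least_squares_line(x, y):
    """Intercept and slope from the sums-based least-squares formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_y / n - slope * sum_x / n
    return intercept, slope

def slope_from_correlation(x, y):
    """Slope computed as r * Sy / Sx (sample standard deviations)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return r * statistics.stdev(y) / statistics.stdev(x)

# Hypothetical sample, roughly following y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = least_squares_line(x, y)
alt_slope = slope_from_correlation(x, y)
print(round(intercept, 2), round(slope, 2), round(alt_slope, 2))
```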
Properties of Regression Lines
Both the lines of regression pass through (x̄, ȳ).
\[ b_{yx} \, b_{xy} = r^2 \]
\[ 0 \le b_{yx} \, b_{xy} \le 1 \]
Here b_yx is the slope of the regression line of y on x, and b_xy is the slope of the regression line of x on y.
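The relation between the two regression slopes and the correlation coefficient can be verified numerically. This is a minimal sketch on made-up data; `regression_slope` is a helper defined here for the illustration.

```python
import math

def regression_slope(predictor, response):
    """Least-squares slope for predicting `response` from `predictor`."""
    n = len(predictor)
    mean_p = sum(predictor) / n
    mean_r = sum(response) / n
    spr = sum((p - mean_p) * (q - mean_r) for p, q in zip(predictor, response))
    spp = sum((p - mean_p) ** 2 for p in predictor)
    return spr / spp

# Hypothetical sample with some scatter.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 7, 5]

b_yx = regression_slope(x, y)  # slope of the line of y on x
b_xy = regression_slope(y, x)  # slope of the line of x on y

# Correlation coefficient computed directly for comparison.
mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)

product = b_yx * b_xy
print(round(product, 4), round(r ** 2, 4))  # the two values agree
```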
Point Estimation of Mean Response
The fitted regression line can be used to estimate the mean value of y for a given value of x.
Example
The weekly advertising expenditure (x) and weekly
sales (y) are presented in the following table.
Point Estimation of Mean Response
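From previous table we have:
\[ n = 10, \quad \sum x = 564, \quad \sum x^2 = 32604, \quad \sum y = 14365, \quad \sum xy = 818755 \]
The least squares estimates of the regression coefficients are
\[ b_0 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left( \sum x \right)^2} = \frac{10(818755) - (564)(14365)}{10(32604) - (564)^2} \approx 10.8 \]
\[ a_0 = \bar{y} - b_0 \bar{x} = 1436.5 - 10.8(56.4) \approx 828 \]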
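The arithmetic for this example can be reproduced from the summary sums alone. The sketch below uses only the totals reported above; the variable names are choices made for this illustration. The intercept is computed with the unrounded slope, which is why it rounds to 828.

```python
# Summary sums as reported in the example (n = 10 weeks).
n = 10
sum_x = 564       # total advertising expenditure
sum_x2 = 32604    # sum of squared expenditures
sum_y = 14365     # total sales
sum_xy = 818755   # sum of expenditure * sales

# Least-squares slope and intercept from the sums-based formulas.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
intercept = sum_y / n - slope * sum_x / n

print(round(slope, 1), round(intercept))  # 10.8 828
```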
Point Estimation of Mean Response
The estimated regression function is:
\[ \hat{y} = 828 + 10.8x \]
Sales = 828 + 10.8 × Expenditure
Interpretation
If the weekly advertising expenditure is increased by $1, we would expect the weekly sales to increase by $10.80.
For example, if the advertising expenditure is $50, then the estimated sales is:
Sales = 828 + 10.8(50) = 1368
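As a quick check of the interpretation, the fitted line can be wrapped in a small function. `predict_sales` is a name chosen for this sketch; the coefficients 828 and 10.8 are the estimates from the example.

```python
def predict_sales(expenditure):
    """Estimated weekly sales from the fitted line Sales = 828 + 10.8 * Expenditure."""
    return 828 + 10.8 * expenditure

# Point estimate of mean weekly sales at an expenditure of $50.
estimate_at_50 = predict_sales(50)

# Raising expenditure by $1 raises the estimate by the slope.
increase_per_dollar = predict_sales(51) - predict_sales(50)

print(round(estimate_at_50, 1), round(increase_per_dollar, 1))  # 1368.0 10.8
```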
Simple linear regression:
More example
Retail sales and floor space
It is customary in retail operations to assess the performance of stores partly in terms of their annual sales relative to their floor area (square feet).
We might expect sales to increase linearly as stores get larger, with of course individual variation among stores of the same size.
The slope b is, as usual, a rate of change: it is the expected increase in annual sales associated with each additional square foot of floor space.
The intercept a is needed to describe the line but has no statistical importance because no stores have area close to zero.
Floor space does not completely determine sales. The error term e in the model accounts for differences among individual stores with the same floor space. A store’s location, for example, is important.
Example: weekly advertising expenditure
y       x     y-hat     Residual (e)
1250    41    1270.8    -20.8
1380    54    1411.2    -31.2
1425    63    1508.4    -83.4
1425    54    1411.2    13.8
1450    48    1346.4    103.6
1300    46    1324.8    -24.8
1400    62    1497.6    -97.6
1510    61    1486.8    23.2
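The fitted values and residuals in this table can be recomputed from the listed (y, x) pairs, assuming the fitted line y-hat = 828 + 10.8x from the advertising example. The sketch covers only the rows shown in the table.

```python
# (sales y, advertising expenditure x) pairs as listed in the table.
rows = [
    (1250, 41), (1380, 54), (1425, 63), (1425, 54),
    (1450, 48), (1300, 46), (1400, 62), (1510, 61),
]

def predict_sales(x):
    """Fitted value from the estimated line y-hat = 828 + 10.8x."""
    return 828 + 10.8 * x

# Residual = observed y minus fitted y-hat, rounded to one decimal place.
table = []
for y, x in rows:
    y_hat = predict_sales(x)
    table.append((y, x, round(y_hat, 1), round(y - y_hat, 1)))

for row in table:
    print(row)
```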
Example: Do wages rise with experience?
• What is the least-squares regression line for predicting Wages from LOS (length of service)?
• Suppose a woman has been with her bank for 125 months. What do you predict she will earn (response variable)?
• If her actual wages are $433, then what is her residual?
Example: Do wages rise with experience?
The least-squares regression line for predicting Wages from LOS is
\[ \hat{y} = 349.4 + 0.5905x \]
If a woman has been with her bank for 125 months (x = 125), then her expected earnings are
ŷ = 349.4 + 0.5905 * 125 = 423.2125 ≈ $423 per week
If her actual wages are $433, then her residual is
residual = observed - predicted = 433 - 423 = $10
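The same prediction and residual can be computed in a couple of lines. `predict_wages` is a name chosen for this sketch; the coefficients are those of the fitted line in this example, and the prediction is rounded to whole dollars before taking the residual, as in the worked solution.

```python
def predict_wages(los_months):
    """Predicted weekly wages from length of service, using the fitted line."""
    return 349.4 + 0.5905 * los_months

predicted = predict_wages(125)           # 423.2125
observed = 433
residual = observed - round(predicted)   # observed minus predicted (rounded)

print(round(predicted, 4), residual)
```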
One dependent variable (Selling Price) and two independent variables (Square Footage and Age). The Input X Range therefore contains both columns together (Square Footage and Age, i.e. B2:C15). The rest of the process is the same.
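For readers who prefer code to a spreadsheet, the same two-predictor fit can be sketched in plain Python by solving the normal equations. Everything below is an assumption made for illustration: the data are synthetic, generated exactly from price = 50 + 0.1 × square footage - 2 × age (price in thousands), so the fit should recover those coefficients. The function names are not from any library.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_multiple_regression(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic (square footage, age in years) pairs; not real sales data.
predictors = [(1000, 10), (1500, 5), (2000, 20), (1200, 15), (1800, 8), (2500, 2)]
price = [50 + 0.1 * sqft - 2 * age for sqft, age in predictors]

intercept, b_sqft, b_age = fit_multiple_regression(predictors, price)
print(round(intercept, 3), round(b_sqft, 3), round(b_age, 3))
```

Solving the normal equations directly keeps the example dependency-free; for real data a numerical library's least-squares routine is usually the more robust choice.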