Correlation & Regression - Dr Nisha.ppsx

Academic year: 2020

(1)

Dr. Nisha Arora

Forecasting Techniques: Correlation & Regression Models

(2)

Outline

Concept

Applications

Scatter Plot

Correlation / Rank Correlation

Coefficient of Determination

Linear Regression

(3)

General Applications

To predict a relation between

Height & Weight

Attendance in a particular course & Marks scored [other factors – I.Q. level, hours of study, difficulty level of the course etc.]

Average rainfall & Yield of crop [other factors – soil quality, fertilizer used etc.]

Advertisement & Sales [other factors – product quality, competitors' strategy etc.]

Demand & Price

BMI of pregnant mothers & birth weight of newborn babies

(4)

Engineering Applications

Surface roughness for steel alloy as a function of cutting speed and feed rate

Physico-chemical quality of groundwater in relation to specific hydrogeological processes

Feature selection in machine learning

Fundamental quantum aspects of spin phenomena in nanomagnetic structures

Sonar, displacement or velocity determination and

(5)

Scatter Plot

Response variable: Suicide rates

Explanatory variable: % of adults with guns

Relationship: Linear, positive, and moderately strong

(6)

Scatter Plot: Other examples

Response variable: Rating performance

Explanatory variable: Blood glucose level

Relationship: Linear, negative, and weak

(7)

Correlation

It is the measure of the strength of the linear relationship between two variables, denoted by “r” or “R”.

It may be positive, negative or zero.

(8)

Properties of Correlation

The magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables.

The sign of the correlation coefficient indicates the direction of the association.

The correlation coefficient ranges from -1 to +1.

(9)

Properties of Correlation

The correlation coefficient is unitless, and is not affected by a change in the centre or scale of either variable, e.g., unit conversions.

The correlation coefficient is symmetric with respect to the variables.

The correlation coefficient is sensitive to outliers.
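
These properties can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper `pearson_r` and the height/weight figures are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: heights in inches and weights in pounds.
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [110, 120, 135, 150, 158, 175]

r = pearson_r(height_in, weight_lb)

# Unit conversion (a change of scale) leaves r unchanged.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.4536 for w in weight_lb]
r_converted = pearson_r(height_cm, weight_kg)

# Swapping the roles of the two variables leaves r unchanged (symmetry).
r_swapped = pearson_r(weight_lb, height_in)

# A single unusual observation (hypothetical) pulls r down noticeably.
r_outlier = pearson_r(height_in + [74], weight_lb + [90])

print(round(r, 4), round(r_converted, 4), round(r_swapped, 4), round(r_outlier, 4))
```

The first three printed values agree, while the fourth is visibly smaller, which is the sensitivity to outliers mentioned above.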

(10)
(11)

Quiz time

Does R = 0 mean that the variables are independent?

(12)

Quiz time

Does R = 0 mean that the variables are independent?

No, because there may be a non-linear relationship between the variables.
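
A small illustration of this answer, as a sketch with made-up numbers: y is completely determined by x through y = x², yet the correlation coefficient is zero because the relationship is not linear.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data: x is symmetric about zero and y depends on x exactly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

r = pearson_r(x, y)
print(r)  # zero, although y is a deterministic function of x
```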

(13)

Quiz time

Does correlation indicate causation (a cause-and-effect relationship)?

(14)

Quiz time

Does correlation indicate causation (a cause-and-effect relationship)?

No. Correlation is symmetric: it makes no difference which variable is treated as the explanatory variable and which as the response variable, so it cannot by itself identify a cause and an effect.

(15)

Quiz time

Suppose I get R = 0.85 for the variables “hours of study” of college students in December in a particular city and “number of road accidents” in the same time period and the same area.

What does that mean?

(16)

Quiz time

Suppose I get R = 0.85 for the variables “hours of study” of college students in December in a particular city and “number of road accidents” in the same time period and the same area.

What does that mean?

Nothing, as it is a case of spurious (nonsense) correlation.

(17)

Quiz time

A national consumer magazine reported the following correlations.

• The correlation between car weight and car reliability is -0.30.

• The correlation between car weight and annual maintenance cost is 0.20.

Which of the following statements are true?

I. Heavier cars tend to be less reliable.

II. Heavier cars tend to cost more to maintain.

III. Car weight is related more strongly to reliability than to maintenance cost.

(18)

Quiz time

Options are (A) I only (B) II only (C) III only

(D) I and II only (E) I, II, and III

(19)

Quiz time

Solution is

The correct answer is (E).

Why

• The correlation between car weight and reliability is negative.

• This means that reliability tends to decrease as car weight increases.

• The correlation between car weight and maintenance cost is positive.

(20)

Quiz time

A researcher uses a regression equation to predict home heating bills (dollar cost), based on home size (square feet). The correlation between predicted bills and home size is 0.70.

What is the correct interpretation of this finding?

I. 70% of the variability in home heating bills can be explained by home size.

II. 49% of the variability in home heating bills can be explained by home size.

III. For each added square foot of home size, heating bills increased by 70 cents.

(21)

Quiz time

Solution is

The correct answer is II.

Why

• The coefficient of determination measures the proportion of variation in the dependent variable that is predictable from the independent variable.

• The coefficient of determination is equal to $R^2$; in this case, $(0.70)^2 = 0.49$.

• Therefore, 49% of the variability in heating bills can be explained by home size.
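
The link between the correlation coefficient and the proportion of variation explained can be verified on a small data set. This is a sketch only: the home sizes and bills below are hypothetical, not the researcher's data, and serve to show that $r^2$ equals $1 - SSE/SST$ for a simple least-squares line.

```python
import math

# Hypothetical data: home size (square feet) and heating bill (dollars).
size = [1000, 1500, 1800, 2200, 2600, 3000]
bill = [110, 150, 140, 190, 230, 220]

n = len(size)
mean_x, mean_y = sum(size) / n, sum(bill) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(size, bill))
sxx = sum((x - mean_x) ** 2 for x in size)
syy = sum((y - mean_y) ** 2 for y in bill)   # total variation in y (SST)

r = sxy / math.sqrt(sxx * syy)               # correlation coefficient

slope = sxy / sxx                            # least-squares line
intercept = mean_y - slope * mean_x
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(size, bill))
r_squared = 1 - sse / syy                    # proportion of variation explained

print(round(r ** 2, 6), round(r_squared, 6)) # the two values agree
print(round(0.70 ** 2, 2))                   # the quiz value: 0.49
```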

(22)

Correlation Example

Calculate and analyse the correlation coefficient between the number of study hours and the number of sleeping hours of different students.

Number of study hours (X): 2 4 6 8 10

Number of sleeping

(24)

Solution

(25)

Correlation Example

The heights (in inches) of fathers and their sons are as follows.

Is there any relationship between the heights of fathers and their sons?

Find the coefficient of correlation.

Height of father (X): 65 66 67 67 68 69 70 72

(26)

$R^2$

The coefficient of determination $R^2$

The most common measure of the strength of the fit of a linear model

Calculated as the square of the correlation coefficient

It is always between 0 and 1.

Interpretation

(27)

Correlation blunders

Correlation is not causation.

The observed correlation between two variables might be due to the action of a third, unobserved variable.

Yule (1926) gave an example of high positive correlation between yearly number of suicides and membership in the Church of England due not to cause and effect, but to other variables that also varied over time. (Can you suggest some?)

Mosteller and Tukey (1977, p. 318) give an example of aiming errors made during World War II bomber flights in Europe. Bombing accuracy had a high positive correlation with amount of fighter opposition, that is, the more enemy fighters sent up to distract and shoot down the bombers, the more accurate the bombing run!

(28)

Limitation of Correlation

Though R measures how closely the two variables approximate a straight line, it does not validly measure the strength of a nonlinear relationship.

When the sample size, n, is small we also have to be careful with the reliability of the correlation.

(29)

Regression

Regression analysis is a statistical tool that gives us the ability to estimate the mathematical relationship between a dependent variable (usually called y) and one or more independent variables (usually called x).

The dependent variable is the variable for which we want to make a prediction.

While various non-linear forms may be used, simple linear regression models are the most common.

(30)

Correlation vs. Regression

Correlation

The correlation coefficient, R, measures the strength of bivariate association.

Correlation does not imply cause and effect; it simply quantifies how well two variables relate to each other.

Correlation is symmetric with respect to the variables.

Correlation is almost always used when you measure both variables. It rarely is appropriate when one

Regression

The regression line is a prediction equation that estimates the values of y for any given x.

It treats the variables asymmetrically: the regression line is determined as the best way to predict Y from X. On its own, this does not prove cause and effect.

It is not symmetric with respect to the variables: we get a different best-fit line if we swap the two (unless the data are perfect, with no scatter).

 With regression, the X variable is usually something we

(31)

Types of Regression Models

Simple regression

Estimates the relationship between the dependent variable and one independent variable

Multiple regression

Estimates the relationship between the dependent variable and two or more independent variables

(32)

Simple Linear Regression Model

Simple linear regression models are of the form

$Y = a + bX + e$

where

Y = Dependent variable (response)

X = Independent variable (predictor or explanatory)

a = Intercept

b = Slope of the regression line

e = Random error

a and b are called parameters of the model, which are

(33)

Least Square Regression Line

If the scatter plot of our sample data suggests a linear relationship between two variables, we can summarize the relationship by drawing a straight line on the plot.

To fit a straight line to the data that makes the residuals in a scatter plot from the line as small as possible, there are two options:

Minimize the sum of magnitudes of the residuals

Minimize the sum of squared residuals (least squares)

(34)

Least Square Regression Line

We will write an estimated regression line based on sample data as

$\hat{y} = a_0 + b_0 X$

The method of least squares chooses the values for $a_0$ and $b_0$ to minimize the sum of squared errors

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a_0 - b_0 x_i)^2$

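(35)

Least Square Regression Line

Using calculus, we obtain estimating formulas:

$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}$

$b_0 = \bar{y} - b_1 \bar{x}$

Or

$b_1 = r \frac{S_y}{S_x}$

Where

$b_1$ = estimated slope, $b_0$ = estimated intercept ($b_0$ and $a_0$, respectively, in the notation of the previous slide)

$\bar{x}$ = mean of x, $\bar{y}$ = mean of y

$S_y$ = standard deviation of y, $S_x$ = standard deviation of x, r = cor(x, y)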
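
The estimating formulas on this slide can be sketched in a few lines of code. The function names and the five data points are hypothetical, introduced only for illustration; the sketch shows that the slope from the deviation formula and the slope from $r S_y / S_x$ agree.

```python
import math

def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def slope_from_correlation(x, y):
    """Same slope computed as r * S_y / S_x (sample standard deviations)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    s_x = math.sqrt(sxx / (n - 1))
    s_y = math.sqrt(syy / (n - 1))
    return r * s_y / s_x

# Hypothetical data, small enough to check by hand.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = least_squares_line(x, y)
print(round(intercept, 4), round(slope, 4))    # 2.2 0.6
print(round(slope_from_correlation(x, y), 4))  # 0.6
```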

(36)

Properties of Regression Lines

Both the lines of regression pass through $(\bar{x}, \bar{y})$.

$b_{yx} \, b_{xy} = r^2$

$0 \le b_{yx} \, b_{xy} \le 1$
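
A quick numerical check of these properties, as a sketch on hypothetical data: `b_yx` is the slope of the regression of y on x, and `b_xy` the slope of the regression of x on y.

```python
# Hypothetical data, used only to illustrate the properties.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

b_yx = sxy / sxx                   # slope of the regression of y on x
b_xy = sxy / syy                   # slope of the regression of x on y
r_squared = sxy ** 2 / (sxx * syy)

# Intercepts of the two lines.
a_yx = mean_y - b_yx * mean_x
a_xy = mean_x - b_xy * mean_y

# Each line evaluated at the mean of its predictor returns the other mean.
y_at_mean_x = a_yx + b_yx * mean_x
x_at_mean_y = a_xy + b_xy * mean_y

print(round(b_yx * b_xy, 6), round(r_squared, 6))
print(y_at_mean_x, x_at_mean_y)
```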

(37)

Point Estimation of Mean Response

Fitted regression line can be used to estimate the mean value of y for a given value of x.

Example

The weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.

(38)

Point Estimation of Mean Response

From previous table we have:

$n = 10$, $\sum x = 564$, $\sum x^2 = 32604$, $\sum y = 14365$, $\sum xy = 818755$

The least squares estimates of the regression coefficients are

$b_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} = \frac{10(818755) - (564)(14365)}{10(32604) - (564)^2} = 10.8$

$b_0 = \bar{y} - b_1 \bar{x} = 1436.5 - 10.8(56.4) = 828$
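
The arithmetic on this slide can be reproduced from the five summary values alone (n = 10, Σx = 564, Σx² = 32604, Σy = 14365, Σxy = 818755). The variable names below are illustrative. Note that the intercept of 828 is obtained when the unrounded slope is carried through the calculation.

```python
# Summary values quoted on the slide.
n = 10
sum_x = 564
sum_x2 = 32604
sum_y = 14365
sum_xy = 818755

# Slope from the computational form of the least-squares formula.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: mean of y minus slope times mean of x.
mean_x = sum_x / n
mean_y = sum_y / n
intercept = mean_y - slope * mean_x

print(round(slope, 1), round(intercept))  # 10.8 828
```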

(39)

Point Estimation of Mean Response

The estimated regression function is:

$\hat{y} = 828 + 10.8x$

Sales = 828 + 10.8 Expenditure

Interpretation

If the weekly advertising expenditure is increased by $1 we would expect the weekly sales to increase by $10.80.

For example, if the advertising expenditure is $50, then the estimated sales is:

Sales = 828 + 10.8(50) = 1368

(40)

Simple linear regression:

More example

Retail sales and floor space

It is customary in retail operations to assess the performance of stores partly in terms of their annual sales relative to their floor area (square feet).

We might expect sales to increase linearly as stores get larger, with of course individual variation among stores of the same size.

(41)

The slope b is, as usual, a rate of change: it is the expected increase in annual sales associated with each additional square foot of floor space.

The intercept a is needed to describe the line but has no statistical importance because no stores have area close to zero.

Floor space does not completely determine sales. The term e in the model accounts for differences among individual stores with the same floor space. A store’s location, for example, is important.

(42)

Example: weekly advertising expenditure

y       x     y-hat     Residual (e)
1250    41    1270.8    -20.8
1380    54    1411.2    -31.2
1425    63    1508.4    -83.4
1425    54    1411.2     13.8
1450    48    1346.4    103.6
1300    46    1324.8    -24.8
1400    62    1497.6    -97.6
1510    61    1486.8     23.2
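
The fitted values and residuals in this table follow from the estimated line Sales = 828 + 10.8 Expenditure. A minimal sketch that recomputes them for the rows shown here (the function name `predict_sales` is an illustrative choice):

```python
# Rows of the table as (sales y, advertising expenditure x).
rows = [(1250, 41), (1380, 54), (1425, 63), (1425, 54),
        (1450, 48), (1300, 46), (1400, 62), (1510, 61)]

def predict_sales(expenditure):
    """Fitted line from the slides: sales = 828 + 10.8 * expenditure."""
    return 828 + 10.8 * expenditure

# Fitted value and residual (observed minus fitted) for every row.
fitted = [round(predict_sales(x), 1) for _, x in rows]
residuals = [round(y - predict_sales(x), 1) for y, x in rows]

for (y, x), y_hat, e in zip(rows, fitted, residuals):
    print(y, x, y_hat, e)

# Point estimate at an expenditure of 50, as in the earlier example.
print(round(predict_sales(50), 1))
```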

(43)
(44)

Example: Do wages rise with experience?

What is the least-squares regression line for predicting Wages from LOS?

Suppose a woman has been with her bank for 125 months. What do you predict she will earn (response variable)?

If her actual wages are $433, then what is her residual?

(45)
(46)
(47)

Example: Do wages rise with experience?

The least-squares regression line for predicting Wages from LOS is

$\hat{y} = 349.4 + 0.5905x$

If a woman has been with her bank for 125 months (x = 125), then her expected earning is

$\hat{y} = 349.4 + 0.5905 \times 125 = 423.2125 \approx 423$ (dollars per week)

If her actual wages are $433, then her residual is

observed - predicted = 433 - 423 = 10 (dollars)
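
The same prediction and residual in code, as a short sketch. The function name `predict_wages` is hypothetical; the coefficients are those of the fitted line above.

```python
def predict_wages(los_months):
    """Fitted line from the slide: wages = 349.4 + 0.5905 * length of service."""
    return 349.4 + 0.5905 * los_months

predicted = predict_wages(125)

# Residual = observed - predicted, using the prediction rounded to whole dollars
# as on the slide, and also without rounding for comparison.
residual = 433 - round(predicted)
residual_unrounded = 433 - predicted

print(round(predicted, 4), residual, round(residual_unrounded, 4))
```

Rounding the prediction before subtracting gives the residual of 10 shown on the slide; without rounding, the residual is slightly smaller.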

(48)
(49)
(50)

One dependent variable: Selling price. Two independent variables: Square Footage and Age. So the Input X Range will contain both data sets (Square Footage and Age together, i.e. B2:C15). The remaining process is the same.
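
Outside a spreadsheet, the same kind of two-predictor fit can be sketched in plain Python by solving the normal equations. Everything below is an assumption made for illustration: the helper names, the six houses, and the prices, which are generated from a known formula (50 + 0.1 × square footage − 2 × age) so that the recovered coefficients can be checked. It is not the data from the worksheet.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_multiple_regression(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]   # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data: (square footage, age in years) for six houses.
houses = [(1500, 10), (1800, 5), (2400, 20), (1200, 30), (2000, 15), (2600, 2)]
# Hypothetical selling prices (in $1000s) generated from a known linear rule.
prices = [50 + 0.1 * sqft - 2.0 * age for sqft, age in houses]

intercept, b_sqft, b_age = fit_multiple_regression(houses, prices)
print(round(intercept, 3), round(b_sqft, 3), round(b_age, 3))
```

Because the hypothetical prices contain no noise, the fit recovers the generating coefficients; with real data the coefficients would be estimates with residual scatter.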

(51)
(52)
(53)
(54)

References
