Unit 8 - Linear Regression and Correlation


Statistics For Social Sciences – MATH1208


The Least Squares Regression Line (or Equation)

We draw a least-squares line by ensuring that the line is as close as possible to as many data points as possible. The vertical distance between the actual value of Y, \(y\), and the value predicted by the line, \(\hat{y}\), is called the deviation or error.

Least-Squares Line


Thus, we want to find a line that minimizes the overall deviation of the data points from the line. The technique that finds the equation of the line that best fits the data, by minimizing the sum of the squared deviations between the actual data points and the line, is called the least-squares method.

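The least-squares method finds the equation of the line y = a + bx that minimizes

\[ \sum (y - \hat{y})^2 , \]

where \(\hat{y}\) is the value predicted by the line, or the total of the squared deviations. The values of a (the y-intercept, the value of y when x = 0) and b (the slope or gradient of the line) are found with these equations:

\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} \quad \text{and} \quad a = \bar{y} - b\bar{x} \]

or

\[ a = \frac{\sum y - b\sum x}{n} \quad \text{or} \quad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n}. \]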

The table for calculating the least-squares line is:

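Observation Number   x            y            xy               x^2
1                    \(x_1\)      \(y_1\)      \(x_1 y_1\)      \(x_1^2\)
2                    \(x_2\)      \(y_2\)      \(x_2 y_2\)      \(x_2^2\)
3                    \(x_3\)      \(y_3\)      \(x_3 y_3\)      \(x_3^2\)
…                    …            …            …                …
n                    \(x_n\)      \(y_n\)      \(x_n y_n\)      \(x_n^2\)
Total                \(\sum x\)   \(\sum y\)   \(\sum xy\)      \(\sum x^2\)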
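The table maps directly onto a short calculation: each column total feeds the formulas for b and a. The following is a minimal Python sketch, not part of the course text. The function name `least_squares_line` and the four data points are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x using the column totals of the table."""
    n = len(xs)
    sum_x = sum(xs)                                # total of the x column
    sum_y = sum(ys)                                # total of the y column
    sum_xy = sum(x * y for x, y in zip(xs, ys))    # total of the xy column
    sum_x2 = sum(x * x for x in xs)                # total of the x^2 column
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical data (not from the course text): the points lie exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = least_squares_line(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")
```

Because the made-up points lie exactly on a line, the fit should recover that line (a = 1, b = 2), which makes the sketch easy to check by hand.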


The equation of the least-squares regression line (y = a + bx) gives information about the relationship between the independent variable X and the dependent variable Y. The sign of the slope estimate tells whether the relationship is positive or negative. The value of the slope gives the change that will occur in Y when X is increased by one unit.

Example – Finding the Least-Squares Regression Line (Page 511, Pelosi and Sandifer)

A real estate company wishes to determine the relationship between house size and listing price. […] To keep the numbers manageable they decide to use the listing price in thousands of dollars (i.e., $665,000 is recorded as 665) and the size in thousands of square feet (i.e., 2350 square feet is 2.35). Find the equation of the least-squares regression line given: \(\sum x = 199.2\), \(\sum y = 33{,}177\), \(\sum xy = 113{,}999.80\), \(\sum x^2 = 692.033\) and n = 66.

Answer: slope b = 152.68584 and intercept a = 41.84818, so the equation of the regression line that relates the listing price (Y) and the house size (X) is y = 41.84818 + 152.68584x.

Translated back to the original numbers, the equation is […]

Interpretation – The equation indicates that there is a positive relationship between the size of the house (X) and the listing price (Y). That is, as the size of the house (X) increases, so does the price (Y). For every increase of one thousand square feet in the size of the house, the listing price increases by 152.68584 thousand dollars. This change is the slope of the line.
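The arithmetic in this example can be checked from the totals alone. The sketch below is illustrative only. It assumes the quoted figures are Σx = 199.2, Σy = 33,177, Σxy = 113,999.80 and Σx² = 692.033 with n = 66, and the function name `line_from_totals` is hypothetical.

```python
def line_from_totals(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for y = a + b*x from the column totals."""
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


# Totals assumed from the example (x in thousands of square feet, y in thousands of dollars).
a, b = line_from_totals(n=66, sum_x=199.2, sum_y=33177, sum_xy=113999.80, sum_x2=692.033)
print(f"slope b = {b:.5f}, intercept a = {a:.5f}")

# Estimated listing price for a 2,350 square foot house (x = 2.35), in thousands of dollars.
predicted_price = a + b * 2.35
print(f"predicted price: {predicted_price:.1f} thousand dollars")
```

Working from totals rather than the 66 raw observations mirrors how the example is stated, and the same function applies to any data set once its table totals are known.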

Exercise – (Page 512, Pelosi and Sandifer)

Some analysts are interested in the relationship between ‘An Insurance […] Find the least-squares regression line and interpret your result, using the table below.

Observation Number   Members, X (millions)   Revenues, Y (billions of $)
1                    4.24                    5.49
2                    3.19                    4.63
3                    1.83                    3.86
4                    1.62                    3.60
5                    2.07                    3.43
6                    2.30                    2.91
7                    1.83                    2.74
8                    2.15                    2.40
9                    0.97                    1.71
10                   0.89                    1.20
Total

Answer: b = 1.143924, […]

The equation means that for an increase of 1 million members, revenues will increase by $1.14 billion. The y-intercept of the line is 0.78, which means that when there are no members, the revenue generated is $0.78 billion.

Extrapolate x (independent variable) values and y (dependent variable) values for bivariate data using the Least Squares Regression Equation (e.g. x on y or y on x)

[…] Using the regression equation to estimate Y for values of X outside the observed range is called extrapolation.
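Whether an estimate is an extrapolation depends only on where the new X value sits relative to the observed X values. As a rough sketch, the code below fits the least-squares line to the members and revenues figures from the insurance exercise in this unit and flags any estimate made outside the observed range. The helper names `fit_line` and `predict`, and the two trial values 2.0 and 6.0, are illustrative assumptions.

```python
# Members (millions) and revenues (billions of $) from the insurance exercise in this unit.
members = [4.24, 3.19, 1.83, 1.62, 2.07, 2.30, 1.83, 2.15, 0.97, 0.89]
revenues = [5.49, 4.63, 3.86, 3.60, 3.43, 2.91, 2.74, 2.40, 1.71, 1.20]


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def predict(x_new, xs, a, b):
    """Return (estimate, is_extrapolation) for a new x value."""
    outside = x_new < min(xs) or x_new > max(xs)
    return a + b * x_new, outside


a, b = fit_line(members, revenues)
for x_new in (2.0, 6.0):   # trial values chosen for illustration
    estimate, outside = predict(x_new, members, a, b)
    label = "extrapolation" if outside else "within observed range"
    print(f"x = {x_new}: estimated y = {estimate:.2f} ({label})")
```

The observed membership figures run from 0.89 to 4.24 million, so an estimate at 6 million members relies on the line holding beyond the data and should be treated with more caution than one at 2 million.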

Plotting Scatter Diagrams for Bivariate Data

The process of obtaining a linear regression relationship for a given set of bivariate data is often referred to as fitting a regression line. Three methods commonly used to fit a regression line to a given set of bivariate data are inspection, semi-averages and least squares.

Inspection. […] data. The main disadvantage of this method is that different people would probably draw different lines using the same data. It sometimes helps to plot the mean point of the data, \((\bar{x}, \bar{y})\), and then ensure that the regression line passes through this point.

Semi-average. This technique involves splitting the data into two equal groups, plotting the mean point for each group and joining these points with a straight line.

Least squares. This is the standard method of obtaining a regression line.
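The semi-average method is simple enough to sketch in a few lines. The version below is one possible reading of the description, not the course's own procedure: it assumes the points are ordered by x before being split, and that with an odd number of points the middle one is left out. The function name `semi_average_line` is hypothetical, and the data are the output and expenditure figures from the energy exercise in this unit.

```python
def semi_average_line(xs, ys):
    """Return (a, b) for the line joining the mean points of the two halves."""
    pairs = sorted(zip(xs, ys))                  # assumption: order the points by x first
    half = len(pairs) // 2
    lower, upper = pairs[:half], pairs[-half:]   # odd count: the middle point is left out

    def mean_point(group):
        return (sum(x for x, _ in group) / len(group),
                sum(y for _, y in group) / len(group))

    (x1, y1), (x2, y2) = mean_point(lower), mean_point(upper)
    b = (y2 - y1) / (x2 - x1)                    # gradient of the line through both mean points
    a = y1 - b * x1                              # intercept, using the lower mean point
    return a, b


# Output (thousands of tons) and energy expenditure from the exercise in this unit.
output = [20, 22, 25, 26, 21, 23, 28, 20, 25, 29]
expenditure = [106, 138, 158, 172, 120, 142, 184, 102, 164, 192]
a, b = semi_average_line(output, expenditure)
print(f"semi-average line: y = {a:.2f} + {b:.2f}x")
```

Unlike the inspection method, two people applying this rule to the same data get the same line, although the result generally differs a little from the least-squares line.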


1. The table below shows the output (in thousands of tons) and the expenditure on energy (£) for a firm over ten monthly periods.

Output (x)         20   22   25   26   21   23   28   20   25   29
Expenditure (y)   106  138  158  172  120  142  184  102  164  192

a. Draw a scatter diagram for the data.

b. Calculate the mean point of the data and plot it on the diagram.

Answer: \(\bar{x} = 23.9\), \(\bar{y} = 147.8\)

c. Fit a regression line which passes through the mean point.

Answer: Try this! Using \(y = m(x - \bar{x}) + \bar{y}\), where the gradient (m) = […], the regression line is y = -100.76 + 10.4x.

d. […]

Ans: y = -83.8 + 9.69x (the least-squares line for these data)

e. Estimate the expenditure on energy if the following month’s output is planned at 27,000 tons.

Answer: The estimated energy expenditure at 27,000 tons is y = £177.83.
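The answers to this exercise can be checked numerically. The sketch below is illustrative: it computes the mean point, the line through the mean point for a gradient of m = 10.4 (an assumption, taken as the gradient for which that line has intercept -100.76), the least-squares line, and the estimate at x = 27. All variable names are made up for the example.

```python
output = [20, 22, 25, 26, 21, 23, 28, 20, 25, 29]                   # x, thousands of tons
expenditure = [106, 138, 158, 172, 120, 142, 184, 102, 164, 192]    # y, pounds

n = len(output)
x_bar = sum(output) / n
y_bar = sum(expenditure) / n

# Line through the mean point with a chosen gradient m: y = m(x - x_bar) + y_bar.
m = 10.4                                   # assumed gradient (see lead-in)
intercept_through_mean = y_bar - m * x_bar

# Least-squares slope b and intercept a for comparison.
sum_xy = sum(x * y for x, y in zip(output, expenditure))
sum_x2 = sum(x * x for x in output)
b = (n * sum_xy - sum(output) * sum(expenditure)) / (n * sum_x2 - sum(output) ** 2)
a = y_bar - b * x_bar

estimate_at_27 = a + b * 27                # planned output of 27,000 tons
print(f"mean point: ({x_bar}, {y_bar})")
print(f"line through mean point: y = {intercept_through_mean:.2f} + {m}x")
print(f"least-squares line: y = {a:.2f} + {b:.2f}x")
print(f"estimate at x = 27: {estimate_at_27:.2f}")
```

With the unrounded slope the least-squares intercept comes out near -83.7. An intercept of -83.8 follows if b is first rounded to 9.69. Either way the estimate at 27,000 tons rounds to the same value.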

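2. An emergency service wishes to see whether a relationship exists between the outside temperature and the number of emergency calls it receives for a 7-hour period. The data are shown below in the table.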

a. Construct a scatter plot to represent the given information. [4 mks]

b. Find the equation of the least squares regression line. [12 mks]

c. Predict the number of calls when the temperature is 80. [1 mk]

Temperature (x)    68   74   82   88   93   99   101
No. of calls (y)

Summary

Regression is concerned with obtaining a mathematical equation which describes the relationship between two variables.

The regression equation can be used for description, control and prediction.

Description is important when the user is simply trying to understand the way that two variables are related. Control describes when the model is used to set standards or reduce variability. Prediction is when the model is used to determine what the resulting Y value should be when X takes on certain values.


Both regression and correlation deal with bivariate quantitative data and the relationship between the two variables. Correlation analysis simply measures the strength of the linear relationship between two quantitative variables by measuring the degree of ‘scatter’ of the data values. The less scattered (or varied) the data values are, the stronger the correlation between the two variables. The output of the analysis is a single number. In correlation analysis, there is no need to identify which variable is dependent and which variable is independent.

Associate the Coefficient of Correlation with the corresponding Scatter Diagrams

Three types of relationships that can exist between two variables are perfect negative, none and perfect positive. The correlation coefficient is a measure of the strength of a linear relationship. A correlation of −1 corresponds to a perfect negative relationship (negative linear correlation). A correlation of 0 corresponds to no relationship (no correlation). A correlation of +1 corresponds to a perfect positive relationship (positive linear correlation).

[…] The correlation coefficient, r, ranges from −1 to 1. A value of r = 0 signifies that there is no linear correlation present, while the further r is from zero, towards −1 or +1, the stronger the correlation.
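The three reference cases (r = −1, 0 and +1) can be reproduced with small made-up data sets. The sketch below is illustrative only. The function name `correlation` and the data are assumptions, and the coefficient is computed with the usual totals-based product moment formula.

```python
import math


def correlation(xs, ys):
    """Product moment correlation coefficient r for paired data."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical data sets chosen to give the three reference cases.
xs = [-2, -1, 0, 1, 2]
r_positive = correlation(xs, [2 * x + 1 for x in xs])     # points on a rising line
r_negative = correlation(xs, [-2 * x + 1 for x in xs])    # points on a falling line
r_none = correlation(xs, [x * x for x in xs])             # symmetric curve, no linear trend
print(r_positive, r_negative, r_none)
```

The third data set is worth noting: the points follow a curve exactly, yet r is zero, because the coefficient only measures the strength of a linear relationship.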

• Graphically

• Analytically

[Figure: scatter diagrams of Y against X, with panels including No Relationship and Perfect Positive]

The correlation coefficient can be used as a statistic in its own right when prediction of one variable as a function of the other is not appropriate or necessary. For example, suppose a company is interested in knowing whether there is a relationship between the score on an aptitude test and the number of months that a person remains in an entry-level position. The company does not necessarily want to predict the number of months in the entry-level position; it simply wants to know whether the test score is related to that variable. In this case, the company could calculate the correlation coefficient between test score and the number of months in the entry-level job.

Positive Correlation

A positive correlation exists when increases in the value of one variable tend to be associated with increases in the value of the other variable. The correlation coefficient (r) will range from 0 to +1. Some examples of bivariate data which would be expected to be positively correlated are:

a. Years of employment and salary. As an employee gains experience and qualifications, it is expected that the salary will increase over time.

b. Number of vehicles licensed and road deaths.


d. Age of insured person and amount of premium. The older a person is, the greater the cost of the premium, since the chance of the person dying increases.

Negative Correlation

A negative correlation exists when increases in the value of one variable tend to be associated with decreases in the value of the other variable and vice versa. The correlation coefficient (r) ranges from −1 to 0. Some examples of bivariate data which one would expect to be negatively correlated are:

a. Number of weeks of experience and number of errors made. The more experience one has at a task, the fewer errors one would be expected to make.

b. Amount of goods sold and average cost per good.

Correlation provides a measure of how well a least-squares regression line ‘fits’ the given set of data. The stronger the correlation, the closer the data points are to the regression line and hence the more confidence one could have in using the regression line for estimation.

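Calculating the Product Moment Correlation Coefficient

The correlation coefficient, r, is calculated using the formula:

\[ r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}} \]

or

\[ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}. \]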
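The product moment correlation coefficient can be written in terms of column totals or in terms of deviations from the means, and the two forms give the same value. The sketch below checks this on the output and expenditure figures from the energy exercise earlier in this unit. It assumes the usual totals-based and deviation-based forms of r, and the variable names are illustrative.

```python
import math

output = [20, 22, 25, 26, 21, 23, 28, 20, 25, 29]                   # x
expenditure = [106, 138, 158, 172, 120, 142, 184, 102, 164, 192]    # y
n = len(output)

# Form 1: column totals, as in a calculation table with xy, x^2 and y^2 columns.
sum_x, sum_y = sum(output), sum(expenditure)
sum_xy = sum(x * y for x, y in zip(output, expenditure))
sum_x2 = sum(x * x for x in output)
sum_y2 = sum(y * y for y in expenditure)
r_totals = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

# Form 2: deviations from the mean point.
x_bar, y_bar = sum_x / n, sum_y / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(output, expenditure))
s_xx = sum((x - x_bar) ** 2 for x in output)
s_yy = sum((y - y_bar) ** 2 for y in expenditure)
r_deviations = s_xy / math.sqrt(s_xx * s_yy)

print(f"r from totals:     {r_totals:.4f}")
print(f"r from deviations: {r_deviations:.4f}")
```

The totals form suits hand calculation from a table, while the deviations form makes it clearer that r compares how x and y vary together with how much each varies on its own.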

Exercise

1. A real estate company wonders what the correlation coefficient is for the relationship between price and house size.

a. Calculate the correlation coefficient given: \(\sum x = 199.2\), \(\sum y = 33{,}177\), \(\sum xy = 113{,}999.80\), \(\sum x^2 = 692.033\), […]

Answer: r = 0.8615

b. Interpret the result.

Answer: The correlation coefficient is close to +1, which indicates a strong positive relationship between the price and the house size.

2. a) The data below relate the weekly maintenance cost (£) to the age (in months) of ten machines of similar type in a manufacturing company. Calculate the product moment correlation coefficient between age and cost.

Machine      1    2    3    4    5    6    7    8    9   10
Age (x)    […]
Cost (y)   190  240  250  300  310  335  300  300  350  395

Answer: r = 0.88

b) Interpret your result.

Answer: The result indicates a strong positive correlation between machine maintenance cost and machine age.

Calculate the Coefficient of Determination and Interpret these values

[…] product moment correlation coefficient). The coefficient of determination gives the proportion of all the variation in the y-values that is explained by the variation in the x-values.

Suppose that for turnover (y) measured against advertising expenditure (x), the correlation coefficient was calculated as r = 0.76. Then the coefficient of determination is \(r^2 = (0.76)^2 = 0.58\).

This means that only 58% of the variation in turnover is explained by advertising expenditure. In other words, 42% of the variation in turnover is due to factors other than advertising expenditure […] quality, changing trends or productivity.

Coefficient of determination (cd) = \(r^2\), where r is the product moment correlation coefficient.

Since \(-1 \le r \le 1\), it follows that \(0 \le r^2 \le 1\).
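The reading of the coefficient of determination as a "proportion of variation explained" can be checked directly: for a least-squares line, r² equals one minus the ratio of the unexplained (residual) variation to the total variation in y. The sketch below illustrates this with the members and revenues figures from the insurance exercise in this unit. The variable names are assumptions made for the example.

```python
import math

members = [4.24, 3.19, 1.83, 1.62, 2.07, 2.30, 1.83, 2.15, 0.97, 0.89]     # x
revenues = [5.49, 4.63, 3.86, 3.60, 3.43, 2.91, 2.74, 2.40, 1.71, 1.20]    # y
n = len(members)

sum_x, sum_y = sum(members), sum(revenues)
sum_xy = sum(x * y for x, y in zip(members, revenues))
sum_x2 = sum(x * x for x in members)
sum_y2 = sum(y * y for y in revenues)

# Least-squares line y = a + b*x and correlation coefficient r.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_squared = r ** 2

# Share of the variation in y that the fitted line accounts for.
y_bar = sum_y / n
total_variation = sum((y - y_bar) ** 2 for y in revenues)
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(members, revenues))
explained_share = 1 - unexplained / total_variation

print(f"r^2 = {r_squared:.4f}, explained share = {explained_share:.4f}")

# The worked figure from the text: r = 0.76 gives a coefficient of determination of 0.58.
cd_turnover = round(0.76 ** 2, 2)
print(cd_turnover)
```

The two printed quantities agree, which is why r² can be quoted as a percentage of variation explained without fitting anything beyond the correlation coefficient itself.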

Exercise

a. The correlation coefficient between maintenance cost and age of a set of ten machines was r = 0.88. Find the coefficient of determination.

Answer: 0.77 (2 dec. pl.)

b. Interpret the result.

Answer: About 77% of the variation in maintenance cost is explained by the variation in the machine’s age. The other 23% of the variation may be due to how much the machine is used and/or the operators’ experience.

2. An emergency service wishes to see whether a relationship exists between the outside temperature and the number of emergency calls it receives for a 7-hour period. The data are shown below in the table.

a. Compute the Product Moment Correlation Coefficient (r). [5 mks]

Temperature (x)    68   74   82   88   93   99   101
No. of calls (y)

b. Compute the Coefficient of