Statistics For Social Sciences – MATH1208
Unit 8 – Linear Regression and Correlation
The Least Squares Regression Line (or Equation)
We draw a least-squares line by ensuring that the line is as close as possible to as many data points as possible. The vertical distance between the predicted value of Y on the line, \(\hat{y}\), and the actual value of Y, y, is called the deviation or error.
Least-Squares Line
Thus, we want to find a line that minimizes the overall deviation of the data points from the line. The technique that finds the equation of the line that best fits the data, by minimizing the sum of the squared deviations between the actual data points and the line, is called the least-squares method.
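The least-squares method finds the equation of the line y = a + bx that minimizes \(\sum (y - \hat{y})^2\), the total of the squared deviations. The values of a (the y-intercept, or the value of y when x = 0) and b (the slope or gradient of the line) are found with these equations:
\(b = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}\) and \(a = \bar{y} - b\bar{x}\), or
\(a = \dfrac{\sum y - b\sum x}{n}\), or \(a = \dfrac{\sum y}{n} - b\,\dfrac{\sum x}{n}\).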
The table for calculating the least-squares line is:
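Observation Number   x          y          xy             x^2
1                    \(x_1\)    \(y_1\)    \(x_1 y_1\)    \(x_1^2\)
2                    \(x_2\)    \(y_2\)    \(x_2 y_2\)    \(x_2^2\)
3                    \(x_3\)    \(y_3\)    \(x_3 y_3\)    \(x_3^2\)
…                    …          …          …              …
n                    \(x_n\)    \(y_n\)    \(x_n y_n\)    \(x_n^2\)
Total                \(\sum x\) \(\sum y\) \(\sum xy\)    \(\sum x^2\)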
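The columns and totals of this table can be built mechanically. The Python sketch below does so for a small made-up data set (the values in `xs` and `ys` are illustrative assumptions, not taken from the course text) and then applies the usual least-squares formulas for the slope b and the intercept a.

```python
# Illustrative data only (an assumption, not from the course text).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

def least_squares_table(xs, ys):
    """Return the table rows (x, y, xy, x^2) and the four column totals."""
    rows = [(x, y, x * y, x * x) for x, y in zip(xs, ys)]
    totals = tuple(sum(column) for column in zip(*rows))
    return rows, totals

rows, (sum_x, sum_y, sum_xy, sum_x2) = least_squares_table(xs, ys)
n = len(rows)

# Slope and intercept from the column totals.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

for i, row in enumerate(rows, start=1):
    print(i, *row)
print("Total", sum_x, sum_y, sum_xy, sum_x2)
print(f"y = {a:.2f} + {b:.2f}x")
```

Keeping the totals separate from the final formulas mirrors the hand calculation: fill in the table first, then substitute the four totals and n.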
The equation of the least-squares regression line (y = a + bx) gives information about the relationship between the independent variable X and the dependent variable Y. The sign of the slope estimate, b, tells whether the relationship is positive or negative. The value of the slope gives the change that will occur in Y when X is increased by one unit.
Example – Finding the Least-Squares Regression Line (Page 511, Pelosi and Sandifer)
A real estate company wishes to determine the relationship between house size and listing price. To make the numbers manageable they decide to use the listing price in thousands of dollars (i.e., $665,000 is recorded as 665) and the size in thousands of square feet (i.e., 2350 square feet is 2.35). Find the equation of the least-squares regression line given: \(\sum x\) = 199.2, \(\sum y\) = 33,177, \(\sum xy\) = 113,999.80, \(\sum x^2\) = 692.033 and n = 66.
Answer: b = 152.68584 and a = 41.84818, so the equation of the regression line that relates the listing price (Y) to the house size (X) is y = 41.84818 + 152.68584x.
Translated back to the original numbers (price in dollars, size in square feet), the equation is y = 41,848.18 + 152.68584x.
Interpretation – The equation indicates that there is a positive relationship between the size of the house (X) and the listing price (Y). That is, as the size of the house (X) increases, so does the price (Y). The equation indicates that for every increase of one thousand square feet in the size of the house, the listing price increases by 152.68584 thousand dollars. This change is the slope of the line.
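The arithmetic behind this answer can be checked with a short Python sketch. It assumes the four totals quoted in the example are, in order, the sums of x, y, xy and x squared; with that reading, the formulas reproduce the quoted slope and intercept.

```python
# Totals quoted in the example; which total is which is an assumption
# (x = size in thousands of sq ft, y = price in thousands of dollars).
n = 66
sum_x = 199.2
sum_y = 33_177.0
sum_xy = 113_999.80
sum_x2 = 692.033

# Slope first, then the intercept from the two means.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * sum_x / n

print(f"b = {b:.5f}, a = {a:.5f}")

# Predicted listing price (in thousands of dollars) for a 2,350 sq ft house.
price_2350 = a + b * 2.35
print(f"predicted price: {price_2350:.1f} thousand dollars")
```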
Exercise – (Page 512, Pelosi and Sandifer)
Some analysts are interested in the relationship between ‘An Insurance … Find the least-squares regression line and interpret your result, using the table below.
Observation Number   Members, X (millions)   Revenues, Y (billions of $)
1                    4.24                    5.49
2                    3.19                    4.63
3                    1.83                    3.86
4                    1.62                    3.60
5                    2.07                    3.43
6                    2.30                    2.91
7                    1.83                    2.74
8                    2.15                    2.40
9                    0.97                    1.71
10                   0.89                    1.20
Total
Answer: b = 1.143924, a = 0.78, so the regression line is y = 0.78 + 1.14x.
The equation means that for an increase of 1 million members, revenues will increase by $1.14 billion. The y-intercept of the line is 0.78, which means that when there are no members, the revenue generated is $0.78 billion.
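As a check on this exercise, the sketch below fits the least-squares line directly to the ten observations in the table. The function name `fit_least_squares` is a hypothetical helper introduced here for illustration.

```python
members = [4.24, 3.19, 1.83, 1.62, 2.07, 2.30, 1.83, 2.15, 0.97, 0.89]   # X, millions
revenues = [5.49, 4.63, 3.86, 3.60, 3.43, 2.91, 2.74, 2.40, 1.71, 1.20]  # Y, $ billions

def fit_least_squares(xs, ys):
    """Return (a, b) for the least-squares line y = a + bx."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

a, b = fit_least_squares(members, revenues)
print(f"y = {a:.2f} + {b:.6f}x")
```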
Extrapolate x (independent variable) values and y (dependent variable) values for bivariate data using the Least Squares Regression Equation (e.g. x on y or y on x)
Using the regression equation to predict Y for values of X outside the observed range is called extrapolation.
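One way to keep track of extrapolation in practice is to flag any prediction made for an X value outside the observed range. The sketch below uses the rounded coefficients and the smallest and largest member counts from the insurance exercise above; the function `predict` and the constant names are hypothetical.

```python
# Rounded coefficients and observed X range from the insurance exercise.
A, B = 0.78, 1.14
X_MIN, X_MAX = 0.89, 4.24   # smallest and largest observed members (millions)

def predict(x):
    """Return (predicted y, True if x lies outside the observed range)."""
    y_hat = A + B * x
    is_extrapolation = not (X_MIN <= x <= X_MAX)
    return y_hat, is_extrapolation

for x in (2.0, 6.0):
    y_hat, extrapolated = predict(x)
    note = "extrapolation" if extrapolated else "within observed range"
    print(f"x = {x}: predicted y = {y_hat:.2f} ({note})")
```

A prediction flagged as an extrapolation relies on the linear pattern continuing beyond the data, which the data themselves cannot confirm.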
Plotting Scatter Diagrams for Bivariate Data
The process of obtaining a linear regression relationship for a given set of bivariate data is often referred to as fitting a regression line. Three methods commonly used to fit a regression line to a given set of bivariate data are inspection, semi-averages and least squares.
Inspection. This involves plotting the data on a scatter diagram and drawing, by eye, the line that appears to best fit the data. The main disadvantage of this method is that different people would probably draw different lines using the same data. It sometimes helps to plot the mean point of the data for x and y, and then ensure that the regression line passes through this point.
Semi-average. This technique involves splitting the data into two equal groups, plotting the mean point for each group and joining these points with a straight line.
Least squares. This is the standard method of obtaining a regression line.
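The semi-average technique can be sketched in a few lines of Python. The example uses the output and energy expenditure data from the exercise that follows. Two details are assumptions of this sketch, since the notes only say "two equal groups": the data are ordered by x before splitting, and the middle point is dropped when the number of observations is odd.

```python
output = [20, 22, 25, 26, 21, 23, 28, 20, 25, 29]                  # x, thousands of tons
expenditure = [106, 138, 158, 172, 120, 142, 184, 102, 164, 192]   # y, pounds

def semi_average_line(xs, ys):
    """Fit y = a + bx through the mean points of the lower and upper halves (by x)."""
    pairs = sorted(zip(xs, ys))
    half = len(pairs) // 2
    lower, upper = pairs[:half], pairs[-half:]   # middle point dropped if n is odd

    def mean_point(group):
        return (sum(x for x, _ in group) / len(group),
                sum(y for _, y in group) / len(group))

    (x1, y1), (x2, y2) = mean_point(lower), mean_point(upper)
    b = (y2 - y1) / (x2 - x1)   # gradient of the line joining the two mean points
    a = y1 - b * x1             # intercept, so that the line passes through (x1, y1)
    return a, b

a, b = semi_average_line(output, expenditure)
print(f"semi-average line: y = {a:.2f} + {b:.2f}x")
```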
1. The table below shows the output (in thousands of tons) and the expenditure on energy (£) for a firm over ten monthly periods.
Output (x)        20   22   25   26   21   23   28   20   25   29
Expenditure (y)   106  138  158  172  120  142  184  102  164  192
a. Draw a scatter diagram for the data.
b. Calculate the mean point of the data and plot it on the diagram.
Answer: \(\bar{x}\) = 23.9, \(\bar{y}\) = 147.8
c. Fit a regression line which passes through the mean point.
Ans: y = -83.8 + 9.69x
Answer: Try this! Using \(y = m(x - \bar{x}) + \bar{y}\), where m is the gradient of the line drawn through the mean point (here m = 10.4), the regression line is y = -100.76 + 10.4x.
e. Estimate the expenditure on energy if the following month’s output is planned at 27,000 tons.
Answer: The estimated energy expenditure at 27,000 tons is y = £177.83.
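The answers to this exercise can be reproduced with the Python sketch below, which computes the mean point, the least-squares line and the estimate at an output of 27 (thousand tons). Working with unrounded coefficients gives an intercept of about -83.69; the -83.8 quoted above appears to come from rounding the slope to 9.69 before computing the intercept. Both versions give the same estimate to two decimal places.

```python
output = [20, 22, 25, 26, 21, 23, 28, 20, 25, 29]                  # x, thousands of tons
expenditure = [106, 138, 158, 172, 120, 142, 184, 102, 164, 192]   # y, pounds

n = len(output)
x_bar = sum(output) / n
y_bar = sum(expenditure) / n

# Least-squares slope from the column totals, intercept from the mean point.
sum_xy = sum(x * y for x, y in zip(output, expenditure))
sum_x2 = sum(x * x for x in output)
b = (n * sum_xy - sum(output) * sum(expenditure)) / (n * sum_x2 - sum(output) ** 2)
a = y_bar - b * x_bar

estimate = a + b * 27   # planned output of 27 thousand tons

print(f"mean point: ({x_bar}, {y_bar})")
print(f"y = {a:.2f} + {b:.2f}x")
print(f"estimated expenditure at x = 27: {estimate:.2f}")
```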
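2. An emergency service wishes to see whether a relationship exists between the outside temperature and the number of emergency calls it receives for a 7-hour period. The data are shown below in the table.
Temperature (x)    68 74 82 88 93 99 101
No. of calls (y)
a. Construct a scatter plot to represent the given information. [4 mks]
b. Find the equation of the least squares regression line. [12 mks]
c. Predict the number of calls when the temperature is 80. [1 mk]
Summary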
Regression is concerned with obtaining a mathematical equation which describes the relationship between two variables.
The regression equation can be used for description, control and prediction.
Description is important when the user is simply trying to understand the way that two variables are related. Control describes when the model is used to set standards or reduce variability. Prediction is when the model is used to determine what the resulting Y value should be when X takes on certain values.
Both regression and correlation deal with bivariate quantitative data and the relationship between the two variables. Correlation analysis simply measures the strength of the linear relationship between two quantitative variables by measuring the degree of ‘scatter’ of the data values. The less scattered (or varied) the data values are, the stronger the correlation between the two variables. The output of the analysis is a single number. In correlation analysis, there is no need to identify which variable is dependent and which variable is independent.
Associate the Coefficient of Correlation with the corresponding Scatter Diagrams
Three types of relationships that can exist between two variables are perfect negative, none and perfect positive. The correlation coefficient is a measure of the strength of a linear relationship. A correlation of –1 corresponds to a perfect negative relationship (negative linear correlation). A correlation of 0 corresponds to no linear relationship (no correlation). A correlation of +1 corresponds to a perfect positive relationship (positive linear correlation). The correlation coefficient, r, ranges from –1 to 1. A value of r = 0 signifies that there is no correlation present, while the further r is from zero towards –1 or +1, the stronger the correlation.
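The three cases can be illustrated numerically. The sketch below computes r with the standard product moment formula for three small made-up data sets (all values are illustrative assumptions). The middle data set is U-shaped, which shows that r = 0 means no linear relationship rather than no relationship of any kind.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product moment correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

x = [1, 2, 3, 4, 5]                            # illustrative values
r_negative = pearson_r(x, [10, 8, 6, 4, 2])    # points fall exactly on a falling line
r_none = pearson_r(x, [3, 1, 2, 1, 3])         # symmetric U shape, no linear trend
r_positive = pearson_r(x, [2, 4, 6, 8, 10])    # points fall exactly on a rising line

print(r_negative, r_none, r_positive)
```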
Graphically
[Figure: scatter diagrams of Y against X; panel titles include No Relationship and Perfect Positive]
Analytic
The correlation coefficient can be used as a statistic in its own right when prediction of one variable as a function of the other is not appropriate or necessary. For example,
suppose a company is interested in knowing whether there is a relationship between the score on an aptitude test and the number of months that a person remains in an entry-level position. The company does not
necessarily want to predict the number of months in the entry-level position; it simply wants to know whether the test score is
related to that variable. In this case, the company could calculate the correlation coefficient between test score and the number of months in the entry level job.
Positive Correlation
A positive correlation exists when increases in the value of one variable tend to be associated with increases in the value of the other variable. The correlation coefficient (r) will lie between 0 and +1. Some examples of bivariate data which would be expected to be positively correlated are:
a. Years of employment and salary. As an employee gains experience and qualifications, it is expected that the salary will increase over time.
b. Number of vehicles licensed and road deaths.
d. Age of insured person and amount for premium. The older a person is the greater the cost of the premium, since the chance of the person dying increases.
Negative Correlation
A negative correlation exists when increases in the value of one variable tend to be
associated with decreases in the value of the other variable and vice versa. The
correlation (r) ranges from – 1 to 0. Some examples of bivariate data which one would expect to be negatively correlated are:
a. Number of weeks of experience and number of errors made. The more experience one has at a task, the fewer errors one would be expected to make.
b. Amount of goods sold and average cost per good.
Correlation provides a measure of how well a least-squares regression line ‘fits’ the
given set of data. The better the correlation, the closer the data points are to the
regression line and hence the more
confidence one could have in using the regression line for estimation.
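Calculating the Product Moment Correlation Coefficient
The correlation coefficient, r, is calculated using the formula:
\(r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}\) or \(r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}\).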
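A minimal Python sketch of the product moment calculation, applied to the insurance members and revenues data from the earlier exercise. It computes r two ways, from raw sums and from deviations about the means, to show that the two standard forms agree. The function names are hypothetical, and pairing these two particular forms is an assumption of this sketch.

```python
from math import sqrt

members = [4.24, 3.19, 1.83, 1.62, 2.07, 2.30, 1.83, 2.15, 0.97, 0.89]   # X, millions
revenues = [5.49, 4.63, 3.86, 3.60, 3.43, 2.91, 2.74, 2.40, 1.71, 1.20]  # Y, $ billions

def r_from_sums(xs, ys):
    """Raw-sums form: uses n and the totals of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    return (n * sum_xy - sum_x * sum_y) / sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

def r_from_deviations(xs, ys):
    """Deviation form: uses distances of each value from its mean."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

r1 = r_from_sums(members, revenues)
r2 = r_from_deviations(members, revenues)
print(f"r (raw sums) = {r1:.4f}, r (deviations) = {r2:.4f}")
```

The raw-sums form suits hand calculation from a table of totals, while the deviation form makes it clearer that r compares how x and y vary together with how each varies alone.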
Exercise
1. A real estate company wonders what the correlation coefficient is for the relationship between price and house size.
a. Calculate the correlation coefficient given: \(\sum x\) = 199.2, \(\sum y\) = 33,177, \(\sum xy\) = 113,999.80, \(\sum x^2\) = 692.033,
Answer: r = 0.8615
b. Interpret the result.
Answer: The correlation coefficient is close to +1, which indicates a strong positive linear relationship between the price and the house size.
2. a) The data below relates the weekly
maintenance cost (£) to the age (in months) of ten machines of similar type in a
manufacturing company. Calculate the product moment correlation coefficient between age and cost.
Machine 1 2 3 4 5 6 7 8 9 10
Cost (y) 190 240 250 300 310 335 300 300 350 395
Answer: r = 0.88
b) Interpret your result.
Answer: The result indicates a strong positive correlation between machine maintenance cost and machine age.
Calculate the Coefficient of Determination and Interpret these values
The coefficient of determination is \(r^2\) (the square of the product moment correlation coefficient). The coefficient of determination gives the proportion of all the variation in the y-values that is explained by the variation in the x-values.
Suppose that for turnover (y) measured against advertising expenditure (x), the correlation coefficient was calculated as r = 0.76. Then the coefficient of determination is \(r^2 = (0.76)^2 = 0.58\) (2 dec. pl.).
This means that only 58% of the variation in turnover is explained by advertising expenditure. In other words, 42% of the variation in turnover is due to factors other than advertising expenditure, such as quality, changing trends or productivity.
Coefficient of determination (cd) = \(r^2\), where r is the product moment correlation coefficient.
Since \(-1 \le r \le 1\), it follows that \(0 \le r^2 \le 1\).
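The turnover example amounts to squaring r and reading the result as a percentage. A minimal Python sketch of that arithmetic (the variable names are illustrative):

```python
r = 0.76                                 # correlation coefficient from the example
r_squared = r ** 2                       # coefficient of determination

explained = round(r_squared * 100)       # percent of variation in y explained by x
unexplained = 100 - explained            # percent attributed to other factors

print(f"r^2 = {r_squared:.2f}: {explained}% explained, {unexplained}% unexplained")
```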
Exercise
1. a. The correlation coefficient between maintenance cost and age of a set of ten machines was r = 0.88. Find the coefficient of determination.
Answer: 0.77 (2 dec. pl.)
b. Interpret the result.
Answer: 77% of the variation in maintenance cost is explained by the variation in the machine’s age. The other 23% of the variation may be due to factors such as the amount of machine use and/or the operators’ experience.
2. An emergency service wishes to see whether a relationship exists between the outside temperature and the number of emergency calls it receives for a 7-hour period. The data are shown below in the table.
a. Compute the Product Moment Correlation Coefficient (r). [5 mks]
Temperature (x)    68 74 82 88 93 99 101
No. of calls (y)
b. Compute the Coefficient of