• No results found

1 Correlation and Regression Analysis

N/A
N/A
Protected

Academic year: 2022

Share "1 Correlation and Regression Analysis"

Copied!
6
0
0

Loading.... (view fulltext now)

Full text

(1)

1 Correlation and Regression Analysis

In this section we will be investigating the relationship between two continuous variable, such as height and weight, the concentration of an injected drug and heart rate, or the consumption level of some nutrient and weight gain.

The tools used to explore this relationship, is the regression and correlation analysis.

These tools can be used to find out if the outcome from one variable depends on the value of the other variable, which would mean a dependency from one variable on the other.

Regression and correlation analysis can be used to describe the nature and strength of the relationship between two continuous variables.

1.1 Scatterplot

The first step in the investigation of the relationship between two continuous variables is a scatterplot!

Create a scatterplot for the two variables and evaluate the quality of the relationship.

Example:

Does the number of years invested in schooling pay off in the job market?

Apparently so – the better educated you are, the more money you will earn. The data in the following table give the median annual income of full-time workers age 25 or older by the number of years of schooling completed.

x=Years of Schooling y=Salary (dollars)

8 18,000

10 20,500

12 25,000

14 28,100

16 34,500

19 39,700

Start of with creating a scatterplot for X and Y.

(2)

The scatterplot shows a strong, positive, linear association between years and salary.

Questions to be answered with the help of the scatterplot:

1. Does a relationship exist that can be described by a straight line (which means is there a linear relationship)?

2. Is there a relationship, that is not linear?

3. If the scatterplot of the variables look like a cloud there is no relationship between both variables and one would stop at this point.

1.2 Correlation

If the scatterplot shows a reasonable linear relationship (straight line) calculate Pearson’s correlation coefficient to evaluate the strength of the linear relationship.

Notation:

Let (x1, y1), (x2, y2), . . . , (xn, yn) denote a sample of (x, y) pairs.

Definition:

Given the following sum of squares

Sxy =Xxy − (Px)(Py) n

Sxx =Xx2(Px)2 n Syy =Xy2 (Py)2

n Pearson’s Correlation Coefficient can be calculated as:

r = Sxy

qSxxSyy

Pearson’s correlation coefficient (named after Karl Pearson, 1857-1936) is a number between -1 and 1, that measures the strength of a linear relationship between two continuous variables.

The absolute value of the coefficient measures how closely the variables are related. The closer it is to 1 the closer the relationship. A correlation coefficient over 0.8 indicates a strong correlation between the variables.

Data patterns and Pearson’s Correlation Coefficient

(3)

The sign of the correlation coefficient tells you of the trend in the relationship. A positive (negative) coefficient means that one variable increases (decreases), when the other increases.

Continue Example:

Calculate Pearson’s correlation coefficient for years and salary. First find ¯x = 13.17, sx = 4.02 and ¯y = 27633, sy = 8290.

xi=Years of Schooling yi=Salary (dollars) xi· yi x2i yi2

8 18,000 144000 64 324,000,000

10 20,500 205000 100 420,250,000

12 25,000 300000 144 625,000,000

14 28,100 393400 196 789,610,000

16 34,500 552000 256 1,190,250,000

19 39,700 754300 361 1,576,090,000

So that Pxi = 79, Pyi = 165800, Pxiyi = 2348700, Px2i = 1121, Pyi2 = 4, 925, 200, 000.

This leads to

Sxy =Xxiyi (Pxi) (Pyi)

n = 2348700 − (79) (165800)

6 = 165666.65 Sxx =Xx2i (Pxi)2

n = 1121 − (79)2

6 = 80.8333 Syy =Xyi2 (Pyi)2

n = 4, 925, 200, 000 −(165800)2

6 = 343593333.333 So that

r = Sxy

qSxx· Syy = 165666.65

√80.8333 · 343593333.333 = 0.994.

(4)

The Pearson correlation coefficient of Years of schooling and salary r = 0.994.

A correlation of 0.9942 is very high and shows a strong, positive, linear association between years of schooling and the salary.

1.3 Linear Regression

In the example we might want to predict the expected salary for different times of schooling, or calculate the increase in salary for every year of schooling. For this purpose we can do a regression analysis.

Terms and Definition: If we want to use a variable x to draw conclusions concerning a variable y:

y is called dependent or response variable.

x is called independent, predictor, os explanatory variable.

If the relationship between two variables is linear is can be summarized by a straight line. A straight line can be described by an equation:

y = a + b x a is called the intercept and b the slope of the equation.

The slope is the amount by which y increases when x increases by 1 unit.

Fitting a straight line

Given data points (xi, yi) a and b shall now be chosen in that way that the corresponding linear line will have the “best fit” for the given data.

The criteria for “best fit” used in regression analysis is the sum of the squared differences between the data points and the line itself, that is the y deviations.

For data points (xi, yi), 1 ≤ i ≤ n this can be written as mina,b

Xn i

(yi− (a + bxi))2

In words: minimize the sum by choosing the appropriate parameters a and b.

The resulting line is called the least square line or sample regression line.

After the problem is stated it can be solved mathematically and the results are formulas, how to calculate the best parameters.

b = Sxy

Sxx and a = ¯y − b · ¯x.

Write the equation of the least squares line as ˆ

y = a + bx ˆ

y gives an estimate for y for a given value of x.

(5)

Continue Example:

Since the salary and the years of schooling show such a strong linear relationship and the salary can be viewed as depending on the years of schooling, do a linear regression analysis with the salary as the response variable and the years of schooling as the predictor variable.

Calculate b = Sxy

xx

= 165666.65

80.8333 = 2050.28 and a = ¯y − b¯x = 27633 − 2050.28 · 13.17 = 630.81 Our result is the least squares line

ˆ

y = a + bx = 630.81 + 2050.28 x

The slope equals $2050.28, that is for every year of schooling the average salary increases by this amount.

To estimate the average salary after 18 years of schooling we calculate ˆy with x = 18 ˆ

y = 630.81 + 2050.28 · 18 = 37535.85$

Don’t use the regression line for values outside the range of the observed values. This is a model that only has been proved valid for the given range.

Properties of the regression or least squares line

1. The least squares line passes always through the balance point (¯x, ¯y) of the data set.

2. The regression line of y on x should not be used to predict x, since it is not the line that minimizes the sum of squared x deviations.

Assessing the fit of a line

Once the least squares line has been obtained, it is natural to examine how effectively the line summarizes the relationship between x and y.

The first question that has to be answered is, if the line is an appropriate way to summarize the relationship. In order to answer this question, we will calculate the coefficient of determination r2.

Definition: The coefficient of determination for he regression of y on x is r2 = Sxy2

SxxSyy the square of Pearson’s Correlation Coefficient.

It gives the proportion of variation in y that can be attributed to a linear relationship between x and y.

Is r2 greater than 0.8, the model has a good fit and can be used to calculate reliable predictions of the dependent variable by using the independent variable.

In the example, the variable Years of Schooling explains r2 = 98.8% of the variation in the variable Salary. Which is very high. The plot showed that the data points are almost on a straight line.

Use the least squares line for predicting the annual salary of a person with 13 years of schooling.

(6)

ˆ

y(13) = a + b · 13 = 630.81 + 2050.28 · 13 = 27284.45$

This is just an estimate, from the other parts of the class, we know that a confidence interval can be found that gives more information.

References

Related documents

This Special Issue, stemming from the 2018 Australian Collaborative Education Network National Conference, integrates a number of perspectives and approaches to WIL

Political Parties approved by CNE to stand in at least some constituencies PLD – Partido de Liberdade e Desenvolvimento – Party of Freedom and Development ECOLOGISTA – MT –

Facet joint arthropathy - osteophyte formation and distortion of joint alignment MRI Axial T2 L3-L4 disk Psoas Paraspinal muscles Psoas Paraspinal NP AF MRI Axial T2 PACS, BIDMC

The Cisco Wireless IP Phone 7920 is an easy-to-use IEEE 802.11b wireless IP phone that provides comprehensive voice communications in conjunction with Cisco CallManager and

They further argue that debtors financial problems are primarily the result of unexpected life events that put them in a precarious financial position that has little to do with

As such, combining the results of the microstructures and phase constituents of the as-sprayed and annealed samples, one can basically conclude that the particle size influences

• Ensure that all school staff routinely working with hazardous chemicals receives proper training in all phases of chemical management, including safe storage, proper use,

Fig. 7 shows how this might be done within the com- plementary/symmetry architecture. Within a column of this architecture, the phases of the global clock could be conveyed along