Chapter 5. Linear Regression and Correlation Simple with answer. 2020.ppt

(1)

(2)

Introduction



Relationship between education spending and test scores

The correlation is negative (-0.2). The United States spends the second most on education of any country, yet has below-average test scores. Ethnically homogeneous Japan, South Korea and Finland spend at average rates and have the best test scores. Tiny, ethnically homogeneous and "hungry" Estonia spends less than half as much as the United States and Norway on education but has far better test scores. Source: Economy Industry USA View

The Organization for Economic Co-operation and Development (OECD) released the results of its 2009 global rankings on student performance in mathematics, reading, and science, on the Program for

(8)

Introduction



Relationship between per pupil spending and mean math scores in PISA 2012, by country

The figure shows the simple correlation between the mean scores in mathematics and the expenditure per pupil in secondary education for each of the countries that participated in PISA 2012. It is easy to see that countries like Qatar and Singapore spend similar amounts of dollars per student yet achieve very different PISA math scores.

The Organization for Economic Co-operation and Development (OECD) released the results of its 2012 global rankings on student performance in mathematics, reading, and science, on the Program for International Student Assessment,

(9)

Ranking of top countries in math, reading, and science is out — and the US didn't crack the top 10



Source: OECD. China is represented by the provinces of Beijing, Shanghai, Jiangsu, and Guangdong.

The PISA is a worldwide exam administered every three years that measures 15-year-olds in 72 countries. About 540,000 students took the exam in 2015. Asian countries topped the rankings across all

(10)

PISA tests: Singapore top in global education rankings/2015



"If you think maths is a hard subject you won't succeed," 10-year-old Hai Yang tells me.

A striking feature of Singapore's education:

*The whole class has just been working on a problem, taking it in turns to stand up and explain how they worked it out. And they do this in English, one of several languages spoken in Singapore. It turns out there is more than one way to reach the right solution.

*What is impressive is their commitment to understanding exactly how to do it.

"If we just blindly look at the teacher's answer, when we grow up we might not know how to do it any more."

Building blocks

*This is an approach known as maths mastery which some schools in the UK have begun using in an adapted form.

*"We believe in Singapore in the fundamentals, that in order for a child to be well educated you need to give them the fundamental language and grammar in various disciplines, a language where you can read, a language where you can understand numbers."

*Singapore has also thought a lot about how to make teaching a rewarding profession.

*Teachers can follow a career path that takes them towards being a principal, a researcher into education or a master classroom teacher. They get time to deepen their knowledge and prepare lessons.

*In Montfort Secondary school they are encouraging the teenage boys to make prototype products, ranging from a smart garden watering system to an electronic keyboard.

*Using your science and maths skills to solve real world problems is exactly the kind of ability the PISA tests are intended to measure. An empty room at the school is being turned into what they call a "makers lab".

*Simple tools and materials will be available for the pupils to use in their spare time to make things to take home. If they want to work out how to light up their guitar with LED lights, this is where they can do it.

*Another striking feature of Singapore's education is that head teachers are rotated between schools every six to eight years. There is also an increasing emphasis on collaboration.

*"Today teachers work in teams, they grow together, they research together, they work together."

High stakes

(11)

The Objective of Correlation and Regression



The objective of correlation is to establish whether two or more quantitative variables are related, without inferring a causal relationship. The objective of regression analysis is to establish a mathematical model that estimates the value of one variable from the values of the other variables. This technique is appropriate when:

A mathematical function or equation linking two metric-scaled (interval or ratio) variables is to be constructed, under the assumption that the values of one of the two variables depend on the values of the other.

Logistic regression analysis is used to examine relationships between variables when the dependent variable is nominal, even though independent variables are nominal, ordinal, interval, or some mixture thereof.

Suppose that one wanted to determine which program interventions were associated with a JOBS Program client's ability to get a job within six months of exiting the program. The outcome variable would be "job" or "no job", clearly a nominal variable. One could then use several independent variables such as job training, post-secondary education and the like to predict the odds of getting a job.

Multiple Regression Analysis Technique this technique is appropriate

(12)

Methodology



To perform a correlation and regression analysis, it is advisable to follow these steps:

1. Collect data from sources such as questionnaires, forms or databases, texts, brochures, magazines, the internet, direct measurements, etc.

2. Draw the scatter diagram, a graph showing the intensity and direction of the relationship between two variables, which suggests the model that could be used. Suggested models are best seen in up to three dimensions. This question is important: does the relationship appear to be linear or curved?

3. Calculate the values of the correlation coefficient and the coefficient of determination (note: the correlation coefficient measures the strength and direction of the linear association between the variables, and the coefficient of determination measures the percentage of variability of the dependent variable explained by the independent variable).

4. Set up the model suggested by the scatter diagram or by the experience of the investigator.

5. Estimate the regression line using a processing program with statistical applications (Excel, SPSS, Statgraphics, Minitab, SAS, Statistics, etc.) or by formulas.

(13)



Techniques for Examining Associations

Spearman Correlation

This technique is appropriate when: The degree of association between two sets of ranks (pertaining to two variables) is to be examined.

Illustrative research question(s) this technique can answer: “Is there a significant relationship between motivation levels of teachers and the quality of their performance?” Assume that the data on motivation and quality of performance are in the form of ranks, say, 1 through 50, for 50 teachers who were evaluated subjectively by their administrators on each variable.

Pearson Correlation

This technique is appropriate when: The degree of association between two metric-scaled (interval or ratio) variables is to be examined.

Illustrative research question(s) this technique can answer: “Is there a significant relationship between parents' age (measured in actual years) and their perceptions of the school's

(14)

Spearman Rank Coefficient (r_s)

• Used for monotonic relationships, which need not be linear.
• It is a non-parametric measure of correlation.
• This procedure makes use of the two sets of ranks that may be assigned to the sample values of X and Y.
• The Spearman rank correlation coefficient can be computed in the following cases:
  - Both variables are quantitative.
  - Both variables are qualitative ordinal.
  - One variable is quantitative and the other is qualitative ordinal.
• The value of r_s denotes the magnitude and nature of

(15)

Spearman Correlation



Example: Quality of life

Fourteen cities have been rated on an index that measures the quality of life.

Also, the percentage of the population that has moved into each city over the

past year has been determined. Have cities with higher quality of life scores

attracted more new residents?

Association between quality of life and percentage of new residents

City   Quality of life   Percentage of New Residents

A 25 5

B 10 4

C 15 3

D 30 6

E 20 3

F 25 9

G 10 5

H 15 3

I 30 7

J 20 8

K 15 5

L 17 6

M 20 7

(16)

Steps in SPSS for Spearman correlation

(17)

OUTPUT DATA – Spearman correlation



Correlations (Spearman's rho)

                                                        Quality of Life   Percentage of New Residents
Quality of Life               Correlation Coefficient   1.000             .586*
                              Sig. (2-tailed)                             .028
                              N                         14                14
Percentage of New Residents   Correlation Coefficient   .586*             1.000
                              Sig. (2-tailed)           .028
                              N                         14                14

*. Correlation is significant at the 0.05 level (2-tailed).

These variables have a moderate direct (positive) association: cities with higher quality-of-life scores tend to have a higher percentage of new residents. The value of r² is (0.586² = 0.3434), which

(18)

Simple Correlation (r) Pearson



It is also called Pearson's correlation or the product-moment correlation coefficient. It measures the direction (given by the sign of r) and strength (given by the value of r) of the association between two quantitative variables.

Direct or positive: the values of the two variables deviate in the same direction, i.e. an increase (or decrease) in the values of one variable corresponds, on average, to an increase (or decrease) in the values of the other variable. Examples:

•Student’s performance and number of hours studied
•Satisfaction and loyalty at work.

Inverse or negative: the variables deviate in opposite directions, i.e. an increase in the values of one variable corresponds, on average, to a decrease in the values of the other variable. Examples:

•TV viewing and class grades-students who spend more time watching TV tend to have lower grades (or phrased as students with higher grades tend to spend less time watching TV)

(19)

Pearson Correlation



The value of “r” ranges between (-1) and (+1).

The value of “r” denotes the strength of the association, as illustrated by the following scale. If r = 0 or close to zero, this means no association or correlation between the two variables.

Scale of r:
r = -1            perfect inverse correlation
-1 to -0.75       strong (inverse)
-0.75 to -0.25    moderate (inverse)
-0.25 to 0.25     weak; r = 0 means no relation
0.25 to 0.75      moderate (direct)
0.75 to 1         strong (direct)
r = 1             perfect direct correlation

(20)

Steps for Hypothesis Testing for ρ

Step 1: Hypotheses. We specify the null and alternative hypotheses:

Null hypothesis Ho: ρ = 0 (there is no association between performance and the time that students usually wake up)

Alternative hypothesis Ha: ρ ≠ 0 (there is an association between them)

Step 2: Test Statistic

\[
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
\quad \text{or} \quad
r = \frac{s_{xy}}{s_{x}\, s_{y}}
\]

Step 3: Sig. (P-value). We use the resulting test statistic to calculate the Sig. (P-value). The significance of the correlation can also be determined by using the correlation coefficient table for the degrees of freedom df = n − 2, where n is the number of observations of the x and y variables.

Step 4: Make a decision:

If Sig. (P-value) is smaller than the significance level α, we reject the null hypothesis in favor of the alternative. We conclude "there is sufficient evidence at the α level to conclude that there is a linear relationship in the population between the predictor x and response y."

(21)

Example



A sample of 12 students was selected, and data about their performance and the time they usually wake up were recorded, as shown in the following table. It is required to find the correlation between performance and the time that the student usually wakes up.

Student      Wake-up Time   Academic Performance
Kalisa       5.30           13.0
Seraphine    10.00          9.0
Manasse      8.00           13.0
Odette       9.00           11.0
Laurence     6.00           16.0
Pascal       7.00           10.0
Gallican     7.30           13.0
Marcel       6.00           11.0
Sandrine     5.00           14.0
Acqueline    9.30           10.0
Judith       5.30           16.5
Innoncent    7.30           12.0

Hypothesis

Ho: ρ = 0 (there is no association between performance and the time that students usually wake up)

Ha: ρ ≠ 0 (there is an association between them)

(22)

Steps in SPSS



As before, to perform a correlation and regression analysis it is advisable to follow these steps:

Step 1: Scatter Diagram (after collecting the data, draw the scatter diagram)

The starting point is to draw a scatter of points on a graph, with one variable on the X-axis and the other variable on the Y-axis; it is customary to represent the dependent variable on the vertical axis and the independent variable on the horizontal axis. When studying the relationship between two variables, one can be considered the cause and the other the result or effect. The variable that causes is called the exogenous or independent variable; the effect is the endogenous variable. The scatter plot gives an idea of the relationship (if any) between the variables, as suggested by the data. The closer the points are to a straight line, the stronger the linear relationship between the two variables.

(23)

Steps and Output of scatter dot

(24)

Step 2. Correlation

(25)



OUTPUT - Correlation

Correlations

                                             Wake up-Time   Academic performance
Wake up-Time           Pearson Correlation   1              -.720**
                       Sig. (2-tailed)                      .008
                       N                     12             12
Academic performance   Pearson Correlation   -.720**        1
                       Sig. (2-tailed)       .008
                       N                     12             12

**. Correlation is significant at the 0.01 level (2-tailed).

Test statistic: r = -.720. These variables have a strong inverse association. Sig. = .008

Decision and interpretation: We reject the null hypothesis, so we conclude "there is sufficient evidence at the 5% level to conclude that there is an inverse linear relationship in the population between the predictor ‘wake up’ and the response ‘academic performance’", i.e., wake-up time is related to academic performance. Sig. = .008 means the relationship is statistically significant, and r = -.720 means it is a strong inverse relationship between the time students wake up and their performance (that is, the later students get up, the lower their scores).

The coefficient of determination is the percentage of variation in the dependent variable ‘Y’ explained by the independent variable ‘X’.

How well does this line fit the data?

The value of r² = (-.720)² = 0.5184, i.e. 51.84% ≈ 52%

The 'goodness of fit' indicates the percentage of the variation in performance which is accounted for by the variation in wake-up time; in other words, 52% of the variance in performance is explained by the time that students wake up.
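The SPSS figures for this example can be cross-checked by hand. The following Python sketch is not part of the original slides: it applies the raw-sums form of the product-moment formula to the twelve (wake-up time, performance) pairs from the example table. It assumes the wake-up times were entered as the plain decimals printed in the table (5.30 as 5.3), and the function name `pearson_r` is only illustrative. It also computes the usual t statistic for Ho: ρ = 0, t = r·sqrt((n − 2)/(1 − r²)) with df = n − 2, which is the quantity behind the Sig. value.

```python
import math

# Wake-up time and academic performance for the 12 students in the example.
# Assumption: times are used as the decimals printed in the table (5.30 -> 5.3).
wake_up = [5.30, 10.00, 8.00, 9.00, 6.00, 7.00, 7.30, 6.00, 5.00, 9.30, 5.30, 7.30]
performance = [13.0, 9.0, 13.0, 11.0, 16.0, 10.0, 13.0, 11.0, 14.0, 10.0, 16.5, 12.0]


def pearson_r(x, y):
    """Pearson product-moment correlation computed from raw sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


n = len(wake_up)
r = pearson_r(wake_up, performance)
r_squared = r ** 2
# t statistic for testing rho = 0, with n - 2 degrees of freedom
t_stat = r * math.sqrt((n - 2) / (1 - r_squared))

print(f"r = {r:.3f}")            # about -0.720
print(f"r^2 = {r_squared:.3f}")  # about 0.518
print(f"t = {t_stat:.3f}, df = {n - 2}")
```

The unrounded r² (about 0.518) differs slightly from the 0.5184 on the slide only because the slide squares the already rounded value -.720.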

(26)

Example



Country % Immunization Mortality_rate

Bolivia 77 118

Brasil 69 65

Cambodia 32 184

Canadá 85 8

China 94 43

Czech_Republic 99 12

Egypt 89 55

Ethiopia 13 208

Finland 95 7

France 95 9

Greece 54 9

India 89 124

Italy 95 10

Japan 87 6

México 91 33

Poland 98 16

Russian_federation 73 32

Senegal 47 145

Turkey 76 87

United_Kingdom 90 9

A study was conducted to find whether there is any relationship between the mortality rate and the percentage of immunization in some countries of the world. The following set of data was found on the page "http://www.unicef.org/statistics/". Let us determine whether there is a relationship for this set of data. The first column lists the countries, and the second and third columns give the % of immunization and the mortality rate of each country.

(27)

Steps in SPSS to draw the scatter diagram



Graphs > Chart Builder > OK > from the Variables box, move the variable immunization to the “x-axis” and Mortality_rate to the “y-axis”, then click Groups/Point ID > move the variable country to Point ID > OK

(28)

Step 3. Regression Analysis



Scatter diagram of the mortality rate by % immunization with regression line inserted in some countries in the world

(29)

Steps in SPSS for Regression



Analyze > Regression > Linear

(30)

Interpretation from outcome of SPSS



•Checking the Model Fit

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .791ᵃ   .626       .605                40.13931

a. Predictors: (Constant), Immunization %

The model summary table reports the strength of the relationship between the model and the dependent variable. R = .791, the correlation coefficient, is the linear correlation between the observed and model-predicted values of the dependent variable. Its large value indicates a strong relationship.

R Square = .626, the coefficient of determination, is the squared value of the correlation coefficient. It shows that about 62.6% of the variation in mortality is explained by the model.

ANOVAᵃ

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   48497.050        1    48497.05      30.101   .000ᵇ
  Residual     29000.950        18   1611.16
  Total        77498.000        19

a. Dependent Variable: Mortality_rate
b. Predictors: (Constant), Immunization %

The significance value of the F statistic (.000) is less than 0.05, which means that the variation explained by the model is not due to chance.
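The entries of the Model Summary and ANOVA tables are linked by simple arithmetic. The sketch below is an illustration added for study purposes, not SPSS code: starting only from the two sums of squares and their degrees of freedom as printed in the ANOVA table, it recomputes the mean squares, F, R Square, Adjusted R Square and the standard error of the estimate. The variable names are illustrative.

```python
import math

# Values copied from the SPSS ANOVA table for the immunization example.
ss_regression = 48497.050
ss_residual = 29000.950
df_regression = 1
df_residual = 18
n = df_regression + df_residual + 1  # 20 countries

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / df_regression  # mean square = SS / df
ms_residual = ss_residual / df_residual

f_stat = ms_regression / ms_residual
r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual
std_error = math.sqrt(ms_residual)  # "Std. Error of the Estimate"

print(f"SS total = {ss_total:.3f}")            # 77498.000
print(f"F = {f_stat:.3f}")                     # about 30.101
print(f"R Square = {r_squared:.3f}")           # about 0.626
print(f"Adjusted R Square = {adj_r_squared:.3f}")  # about 0.605
print(f"Std. Error = {std_error:.5f}")         # about 40.13931
```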

(31)

Checking the coefficients of the regression line (parameter estimates)



This table shows the coefficients of the regression line:

•The first row (Constant) is the intercept, the point where the regression line crosses the Y axis. In other words, this is the predicted value of mortality when the independent variable (% of immunization) is 0.

•The second row (Immunization %) gives the slope, the value used in the regression equation to predict the dependent variable from the independent variable.

Coefficientsᵃ

                   Unstandardized Coefficients   Standardized Coefficients
Model              B          Std. Error         Beta                        t        Sig.
1 (Constant)       224.316    31.440                                         7.135    .000
  Immunization %   -2.136     .389               -.791                       -5.486   .000

a. Dependent Variable: Mortality_rate

The regression equation can be presented in many different ways, for example:

Predicted mortality = 224.316 - 2.136 * % of immunization

B₀ = 224.316: average mortality rate without any influence of the % of immunization (the constant).

B₁ = -2.136: decrease in the mortality rate for each additional % of immunization, as indicated by the nonzero correlation (slope of the line, B₁ ≠ 0).

(32)

Prediction of Mortality Rate



What rate of mortality could be predicted for the group of countries with

80% immunization?

The best estimate of the mortality is obtained by substituting the value of

80% for that of the independent variable, x, and calculating the

corresponding value of the Mortality.

Estimated mortality:

\[
\hat{Y} = 224.316 - 2.136\,X = 224.316 - 2.136 \times 80 = 53.436 \approx 53
\]

where X is the % of immunization and Ŷ is the predicted mortality rate.

The expected mortality rate would be about 53.

With these results we conclude:

1. The variables are linearly associated in the population from which the sample comes (the probability that the relationship found is explained by chance is very small, less than one per thousand).

2. The relationship is strong (r = -.791); the independent variable (% of immunization) explains 62.6% (r² = .626) of the variability of the dependent variable (mortality).

3. The relationship is inverse or negative: the mortality rate decreases on average by 2.136 for each one-point increase in the % of immunization in the countries under study.

(33)

Assignment 5



1. Find and interpret the relationship between Anxiety and Test Scores (follow all steps)

a. Draw and interpret the scatter diagram

b. Make a hypothesis for the correlation coefficient

c. Calculate and interpret the coefficient of determination (goodness of fit)

d. Calculate and interpret the hypothesis for regression (ANOVA). (Do the independent variables reliably predict the dependent variable?)

e. Write the regression equation for the model

f. Prediction. What test score could be predicted with an anxiety level of 5.5?

g. Check the assumptions about autocorrelation and normal distribution of the residuals

(x) Anxiety      10  8  2  1  5  6  10  8  2  3  5  6
(Y) Test score   2   3  9  7  6  5  2   4  7  7  6  4

OUTPUT FROM SPSS

Correlations

                                   Test_score   Anxiety
Pearson Correlation   Test_score   1.000        -.946
                      Anxiety      -.946        1.000
Sig. (1-tailed)       Test_score                .000
                      Anxiety      .000
N                     Test_score   12           12
                      Anxiety      12           12

Model Summaryᵇ

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .946ᵃ   .895       .884                .75214                       3.187

a. Predictors: (Constant), Anxiety
b. Dependent Variable: Test_score

Coefficientsᵃ

               Unstandardized Coefficients   Standardized Coefficients
Model          B        Std. Error           Beta                        t        Sig.
1 (Constant)   8.886    .458                                             19.385   .000
  Anxiety      -.676    .073                 -.946                       -9.212   .000

(34)

Assignment 5



Tests of Normality

                        Kolmogorov-Smirnovᵃ         Shapiro-Wilk
                        Statistic   df   Sig.       Statistic   df   Sig.
Standardized Residual   .149        12   .200*      .973        12   .937

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

2. In a study of the relationship between level of education and income, the following data were obtained. Find the relationship between them and comment.

Compute the Spearman rank correlation coefficient (because ‘level of education’ is an ordinal categorical variable) and test it for significance at the .05 level. What conclusion may be reached?

People Level Education (x) Income (y)

A Preparatory 25

B Primary 10

C Master’s degree 8

D Secondary 10

E Bachelor degree 15

F Illiterate 50

G Postgraduate diploma 60

a. Interpret descriptive statistics such as the mean, standard deviation and coefficient of variation.

b. Draw and interpret the scatter dot

c. Make a hypothesis for the

(35)

Assignment 5



Descriptive Statistics

                  Mean      Std. Deviation   N
Level_Education   4.5000    1.87083          6
Income            33.5714   24.10295         7

Correlations (Spearman's rho)

                                            Level_Education   Income
Level_Education   Correlation Coefficient   1.000             .928**
                  Sig. (2-tailed)                             .008
                  N                         6                 6
Income            Correlation Coefficient   .928**            1.000
                  N                         6                 7

**. Correlation is significant at the 0.01 level (2-tailed).

(36)

Assignment 5



3. A psychologist believes that those who score high on a need-achievement test will likely have a high salary to match. To test this theory, the psychologist has given questionnaires to a random sample of 17 subjects and has ranked the data so that the highest value in each category has been assigned a 1.

Subject                   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q
Rank - Need Achievement   1   8   4   10  12  2   13  6   16  11  14  3   9   7   15  17  5
Salary Rank               3   7   2   12  9   1   11  6   17  13  15  5   10  8   14  16  4

a. Compute and interpret the Spearman rank correlation coefficient and test it for significance at the .05 level. What conclusion may be reached?

b. Interpret the scatter dot

Correlations (Spearman's rho)

                                                    Rank - Need Achievement   Salary Rank
Rank - Need Achievement   Correlation Coefficient   1.000                     .949**
                          N                         17                        17
Salary Rank               Correlation Coefficient   .949**                    1.000
                          N                         17                        17
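Because the data in this exercise are already ranks with no ties, the coefficient can also be checked by hand with the classical shortcut formula r_s = 1 − 6·Σd² / (n(n² − 1)), where d is the difference between the two ranks of each subject. The formula is not shown on the slides; the Python sketch below is an added illustration of it, and the function name `spearman_from_ranks` is hypothetical.

```python
# Ranks for subjects A..Q, copied from the assignment table.
need_rank = [1, 8, 4, 10, 12, 2, 13, 6, 16, 11, 14, 3, 9, 7, 15, 17, 5]
salary_rank = [3, 7, 2, 12, 9, 1, 11, 6, 17, 13, 15, 5, 10, 8, 14, 16, 4]


def spearman_from_ranks(rank_x, rank_y):
    """Spearman r_s from two rank lists; valid only when there are no tied ranks."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


sum_d2 = sum((a - b) ** 2 for a, b in zip(need_rank, salary_rank))
r_s = spearman_from_ranks(need_rank, salary_rank)

print(f"sum of d^2 = {sum_d2}")  # 42
print(f"r_s = {r_s:.3f}")        # about 0.949
```

When ranks are tied, this shortcut is no longer exact; in that case the Pearson formula applied to the (average) ranks is the safer route.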

(38)

Assignment 5




5. Multiple choice

5.1 The slope (B₁) represents:
a. Predicted value of Y when X = 0
b. Predicted value of Y
c. Change in Y per unit change in X

5.2 The Y intercept (B₀) represents the:
a. Change in Y per unit change in X
b. Predicted value of Y when X = 0
c. Variation around the regression line

5.3 The coefficient of determination (r²) tells you:
a. The proportion of total variation that is explained
b. Whether the slope has any significance
c. Whether the regression sum of squares is greater than the total sum of squares

5.4 In performing a regression analysis involving two numerical variables, you assume:
a. The variance of X and Y are equal
b. That X and Y are independent
c. All of the above

5.5 The residuals represent:
a. The difference between the actual Y values and the mean of Y
b. The square root of the slope

(39)

Assignment 5




5.6 If the coefficient of correlation (r) = -1.00, then:
a. All the data points must fall exactly on a straight line with an inverse or negative slope.
b. All the data points must fall exactly on a straight line with a positive slope.
c. All the data points must fall exactly on a horizontal straight line with a zero slope.

5.7 Assuming a straight line (linear) relationship between X and Y, if the coefficient of correlation (r) = -0.30:
a. There is no correlation
b. Variable X is larger than variable Y
c. The slope is negative

5.8 In a simple linear regression model, the coefficient of correlation and the slope:
a. May have opposite signs
