
Rusan Chen

5. Correlation and Regressions: Relations between Variables

5.3. Multiple Regression

We introduced simple regression, which allows the prediction of the dependent variable from one independent variable. But in the real world, one phenomenon is often related to multiple factors. For example, in addition to orientation for friendship, willingness to speak could also relate to the learner’s personality, the instructor’s teaching method, specific teaching material, and class size. Multiple regression analysis allows the researcher to predict the dependent variable using multiple predictors.

Multiple regression analysis is a flexible statistical procedure that serves different research purposes. Besides prediction, multiple regression analysis is often used for selecting important predictors, controlling covariates, and detecting moderator effects between two variables. In this section, we will introduce multiple regression analysis when the selection of predictors is the major interest of the study.

Table 2.15

SPSS Output for the R-square from Simple Regression Analysis

Model   R       R-square   Adjusted R-square   Standard error of the estimate
1       0.568   0.323      0.266               3.53391

Table 2.16

SPSS Output for Simple Regression Analysis

Model   Variable      B        Standard error   Beta    t       Significance
1       (Constant)    13.418   5.613                    2.391   0.034
        Friendship     1.046   0.438            0.568   2.390   0.034

Note. B and Standard error are unstandardized coefficients; Beta is the standardized coefficient.

Gradman and Hanania (1991) noticed that incoming ESL students entering an intensive English program at Indiana University had very different English proficiency levels and achieved progress with varying degrees of success. The researchers were interested in identifying what language-learning background factors were associated with students’ ESL proficiency levels. Instead of focusing on one type of variable, Gradman and Hanania collected data on different types of variables, such as formal learning variables (e.g., age at start of English learning), attitude and motivation variables (e.g., family encouragement), exposure and use-in-class variables (e.g., native- or nonnative-speaking teacher), and extracurricular exposure variables (e.g., reading outside of class). The dependent variable in this study was the TOEFL score reported when the student entered the program. A major purpose of the study was to identify the most important factors that might have influenced the students’ TOEFL scores. Using multiple regression analyses, the study found that extracurricular reading and having native-speaking teachers were among the significant predictors, while oral speaking in and out of class was not important when predicting TOEFL scores. Assume that we have replicated Gradman and Hanania’s (1991) study and obtained data from 50 ESL students with the five variables presented in Table 2.17.

Four possible predictors of TOEFL scores were chosen as independent variables in the multiple regression analysis. The variable reading was obtained from a survey questionnaire that measured the amount of exposure to extracurricular reading of English literature. The variable month indicates how many months the student attended intensive English programs before taking the TOEFL test. Native is a categorical variable showing whether or not the student had a native English speaker as a teacher. The variable oral measured students’ oral communication ability in and out of the classroom.

Before conducting multiple regression analysis, it is natural to first examine whether the TOEFL score is significantly correlated with each of the potential predictors separately. The correlation matrix for the five variables is presented in Table 2.18, with p values indicating the significance of each Pearson correlation coefficient. For example, a correlation coefficient r of 0.52 between TOEFL and reading, with p less than 0.01, indicates that more exposure to extracurricular reading is associated with higher TOEFL scores.
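The significance of a Pearson correlation can be checked by hand with the usual t statistic, t = r × sqrt(n − 2) / sqrt(1 − r²), with n − 2 degrees of freedom. The sketch below applies it to the correlations with TOEFL reported in Table 2.18. The function name and the critical value of 2.011 (an approximate two-tailed 0.05 cutoff for 48 degrees of freedom taken from a t table) are assumptions made for illustration and are not part of the SPSS output.

```python
from math import sqrt

def t_for_correlation(r, n):
    """t statistic for testing that the population correlation is 0 (df = n - 2)."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

N = 50              # sample size in Table 2.18
T_CRITICAL = 2.011  # assumed approximate two-tailed 0.05 critical value, df = 48

# Correlations of each predictor with TOEFL, copied from Table 2.18.
correlations_with_toefl = {
    "Reading": 0.520,
    "Month": 0.479,
    "Native": 0.365,
    "Oral": 0.112,
}

t_values = {name: t_for_correlation(r, N) for name, r in correlations_with_toefl.items()}
significant = [name for name, t in t_values.items() if abs(t) > T_CRITICAL]

for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")
print("Significant at the 0.05 level:", significant)
```

Under these assumptions the check agrees with the p values in the table: reading, month, and native exceed the cutoff, and oral does not.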

From Table 2.18 we find that the TOEFL score is also significantly correlated with month and native. But the correlation coefficient between TOEFL and oral is not significant, with r at 0.112 and p at 0.438. Given that we already know from the correlation matrix that the TOEFL scores are significantly correlated with reading, month, and native, the reader may ask why we still need to conduct multiple regression analysis to determine the significant predictors. The answer is that when independent variables are intercorrelated, significant predictors resulting from multiple regression analysis are not necessarily the same as those judged from individual correlations. This is because the portion of variance in the dependent variable explained by each predictor will overlap when the predictors are correlated with each other.
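The overlap can be made concrete for two predictors. A standard textbook formula (not given in this chapter) expresses the R-square for two predictors in terms of the three pairwise correlations: R² = (r1² + r2² − 2 × r1 × r2 × r12) / (1 − r12²). The sketch below plugs in the values for reading and month from Table 2.18; the variable names are chosen here for illustration only.

```python
# Pairwise correlations copied from Table 2.18.
r_toefl_reading = 0.520
r_toefl_month = 0.479
r_reading_month = 0.228

# Variance each predictor explains on its own, simply added together.
sum_of_separate_r_squares = r_toefl_reading ** 2 + r_toefl_month ** 2

# Variance the two predictors explain jointly (two-predictor formula).
joint_r_square = (
    r_toefl_reading ** 2
    + r_toefl_month ** 2
    - 2 * r_toefl_reading * r_toefl_month * r_reading_month
) / (1 - r_reading_month ** 2)

print(f"Sum of separate r-squares: {sum_of_separate_r_squares:.3f}")
print(f"Joint R-square:            {joint_r_square:.3f}")
```

Because reading and month are positively correlated with each other, the jointly explained variance (about 0.41) is smaller than the sum of the two separate r-squares (about 0.50): part of what each predictor explains is shared.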

You may also notice that the variable native is a dichotomous variable with only two possible values, with 1 indicating having native-speaker teachers and 0 indicating no native-speaker teacher. Dichotomous variables can be used as legitimate predictors in regression analysis with meaningful interpretations.

The correlation between a dichotomous variable and a continuous variable (on ratio or interval scales) has a special name: point-biserial correlation. The calculation and the significance testing for point-biserial correlation are the same as for the Pearson product-moment correlation (Howell, 1992, p. 267).
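The equivalence can be illustrated numerically. The sketch below computes the point-biserial correlation from its usual group-means formula, (M1 − M0) / s × sqrt(p × q), where s is the population standard deviation of the continuous variable and p and q are the proportions of cases coded 1 and 0, and compares it with the ordinary Pearson formula. The eight data points are made up for illustration and are not taken from Table 2.17; the function names are likewise assumptions.

```python
from math import sqrt

# Made-up illustration data: 1 = native-speaker teacher, 0 = no native-speaker teacher.
group = [1, 1, 1, 0, 0, 1, 0, 1]
score = [560, 610, 585, 500, 530, 640, 515, 590]

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """Ordinary Pearson product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def point_biserial(binary, y):
    """Point-biserial correlation from the difference between group means."""
    ones = [v for b, v in zip(binary, y) if b == 1]
    zeros = [v for b, v in zip(binary, y) if b == 0]
    p = len(ones) / len(y)   # proportion coded 1
    q = 1 - p                # proportion coded 0
    my = mean(y)
    sd_population = sqrt(sum((v - my) ** 2 for v in y) / len(y))
    return (mean(ones) - mean(zeros)) / sd_population * sqrt(p * q)

r_pearson = pearson(group, score)
r_point_biserial = point_biserial(group, score)
print(f"Pearson: {r_pearson:.6f}  point-biserial: {r_point_biserial:.6f}")
```

The two functions return the same value up to floating-point rounding, which is why no separate procedure is needed for a 0/1 predictor.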

Table 2.17

Prediction of English Proficiency

TOEFL  Reading  Months  Native  Oral      TOEFL  Reading  Months  Native  Oral
547     9       14      1        9        541     5       15      1       11
649     8        9      1        8        566     9        0      1        7
571     5       14      0        6        538     4        6      1        4
595     9       11      1        7        522     6        4      1        8
560     7       11      1        9        503     5        0      1        8
570     6       12      0        5        521     9        3      1        7
500     6        0      0       10        472     7        0      0        6
491     4        6      1        9        570     5       15      1        9
635     6        9      1        7        581     7        9      1        6
582    10        8      1        6        490     4        8      1        5
453     6        5      0       11        549     9       11      1        6
564     8        4      1        7        533     7       10      1       10
483     2       10      0        7        515     9        8      1        7
545     6       15      0        9        653    11       13      1        4
504     7       14      1        8        596     8       12      1       10
580     7       11      1        2        525     6       10      1        6
610     6       12      1       11        540     2        4      1        8
600     8       15      1       11        559     9        6      1       10
584     5        8      0        7        479     7       10      0        3
503     5        3      0        4        681    11       15      1        8
596     7       11      1        9        481     6        5      1        2
583     8       11      1        6        633    11       15      1       10
544     8        6      1        8        487     6        4      1        9
512     2        4      1        4        636     8        8      1        6
514     7       15      0        7        534     8        2      1        5

Note. Reading = exposure to extracurricular reading; months = number of months attending intensive English program; native = native-speaking teacher coded as 1, nonnative-speaking teacher as 0; oral = use of oral English in and out of classroom.

Numerical calculation for multiple regression analysis depends on matrix algebra. Since the calculation can be cumbersome, it will not be introduced here. Instead we present the SPSS output in Tables 2.19 and 2.20 using the data in Table 2.17. We can see from Table 2.19 that the R-square equals 0.464, indicating that almost half of the variance in TOEFL scores can be explained by the four predictors combined.
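The other entries in Table 2.19 follow from the R-square. The multiple correlation R is its square root, and the adjusted R-square applies the usual correction for sample size and number of predictors, 1 − (1 − R²)(n − 1) / (n − k − 1). The formula is the standard one and is not stated in this chapter; the function and variable names below are illustrative.

```python
from math import sqrt

def adjusted_r_square(r_square, n, k):
    """Adjusted R-square for n cases and k predictors (plus an intercept)."""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

r_square = 0.464    # from Table 2.19
n_students = 50     # cases in Table 2.17
n_predictors = 4    # reading, month, native, oral

multiple_r = sqrt(r_square)
adjusted = adjusted_r_square(r_square, n_students, n_predictors)

print(f"R = {multiple_r:.3f}")                # 0.681
print(f"Adjusted R-square = {adjusted:.3f}")  # 0.416
```

Both values agree with the SPSS output in Table 2.19.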

Table 2.18

Correlation Matrix for the Five Variables Used in the Hypothetical Study

Variable  Statistic                  TOEFL       Reading     Month       Native      Oral
TOEFL     Pearson correlation        1           0.520(**)   0.479(**)   0.365(**)   0.112
          Significance (2-tailed)                0.000       0.000       0.009       0.438
          N                          50          50          50          50          50
Reading   Pearson correlation        0.520(**)   1           0.228       0.295(*)    0.087
          Significance (2-tailed)    0.000                   0.111       0.038       0.550
          N                          50          50          50          50          50
Month     Pearson correlation        0.479(**)   0.228       1           0.030       0.200
          Significance (2-tailed)    0.000       0.111                   0.836       0.163
          N                          50          50          50          50          50
Native    Pearson correlation        0.365(**)   0.295(*)    0.030       1           0.098
          Significance (2-tailed)    0.009       0.038       0.836                   0.500
          N                          50          50          50          50          50
Oral      Pearson correlation        0.112       0.087       0.200       0.098       1
          Significance (2-tailed)    0.438       0.550       0.163       0.500
          N                          50          50          50          50          50

**Correlation is significant at the 0.01 level. *Correlation is significant at the 0.05 level.

Table 2.19

SPSS Output for R-square from Multiple Regression Analysis Predicting TOEFL Scores

Model   R       R-square   Adjusted R-square   Standard error of the estimate
1       0.681   0.464      0.416               40.152

Using the estimated regression coefficients from Table 2.20, we can construct the regression equation: expected TOEFL score = 432.28 + 8.74 × Reading + 4.51 × Month + 31.40 × Native − 0.51 × Oral. Since the major interest of the study is to select the important predictors for the TOEFL scores, we need to identify significant predictors from the output. The values in the far right column in Table 2.20 are the p values for evaluating the null hypothesis that the regression coefficient is 0 in the population. From the p values we can determine that reading, month, and native are significant predictors with p less than 0.05, but oral is not a significant predictor for TOEFL scores with p equaling 0.842.
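Each t value in Table 2.20 is the unstandardized coefficient B divided by its standard error, and the p value comes from comparing that ratio with a t distribution on n − k − 1 = 45 degrees of freedom. The sketch below recomputes the ratios and applies a fixed cutoff in place of exact p values. The cutoff of 2.014 is an assumed approximate two-tailed 0.05 critical value for 45 degrees of freedom from a t table, and the dictionary layout is only an illustration.

```python
# (B, standard error) pairs copied from Table 2.20.
coefficients = {
    "Reading": (8.741, 2.863),
    "Month": (4.511, 1.307),
    "Native": (31.402, 14.409),
    "Oral": (-0.507, 2.538),
}

T_CRITICAL = 2.014  # assumed approximate two-tailed 0.05 critical value, df = 45

# t ratio for each predictor: coefficient divided by its standard error.
t_ratios = {name: b / se for name, (b, se) in coefficients.items()}
significant = [name for name, t in t_ratios.items() if abs(t) > T_CRITICAL]

for name, t in t_ratios.items():
    print(f"{name}: t = {t:.3f}")
print("Significant at the 0.05 level:", significant)
```

The recomputed ratios match the t column of Table 2.20 up to rounding of the printed coefficients, and the same three predictors are flagged as significant.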

The interpretation of the intercept at 432.28 is that for a student with all four independent variables at 0, the expected TOEFL score for that student is 432.28. Since it is not likely that a student will have all four predictors equaling 0, the intercept does not have a meaningful interpretation in this regression equation, although it is significantly different from 0 with p less than 0.001.

The interpretation of a slope in multiple regression involves the original measurement units. For example, the slope for month is 4.51, indicating that one additional month in an intensive English program is associated with an increase of 4.51 points in TOEFL score, while holding the other predictors constant. To hold a predictor constant means that the value for that variable is the same for all participants. The slope for reading is 8.74, showing that an increase of one unit on the extracurricular reading questionnaire is associated with an increase in TOEFL score of 8.74 points, holding the other predictors constant. The interpretation of the slope for a dichotomous variable is straightforward. Native is a dichotomous variable, and it is a significant predictor with a slope of 31.4, indicating that ESL students with native-speaker teachers are expected to score on average 31.4 points higher on the TOEFL than ESL students with nonnative-speaker teachers, holding the other predictors constant.
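The phrase "holding other predictors constant" can be read directly off the regression equation. In the sketch below, the symbols (a for the intercept, b for the slopes, r, m, o for fixed values of reading, month, and oral) are notation introduced here for illustration. Subtracting two predictions that differ only in one predictor cancels every other term:

```latex
\begin{aligned}
\hat{Y}(r,\, m+1,\, \text{native},\, o) - \hat{Y}(r,\, m,\, \text{native},\, o)
  &= b_{\text{Month}} \bigl[(m+1) - m\bigr] = 4.51, \\
\hat{Y}(r,\, m,\, 1,\, o) - \hat{Y}(r,\, m,\, 0,\, o)
  &= b_{\text{Native}} \,(1 - 0) = 31.40,
\end{aligned}
\qquad\text{where}\quad
\hat{Y} = a + b_{\text{Reading}}\, r + b_{\text{Month}}\, m
        + b_{\text{Native}} \cdot \text{native} + b_{\text{Oral}}\, o .
```

For the dichotomous predictor, the slope is therefore the expected difference between the two groups at any fixed values of the other predictors.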

Using the regression equation, we are able to calculate the expected TOEFL score for a student using the scores on the four predictors. For example, the first student in Table 2.17 has scores of 9 for reading, 14 for month, 1 for native, and 9 for oral. Using the regression equation, we obtain the expected TOEFL score: 432.28 + 8.74 × 9 + 4.51 × 14 + 31.40 × 1 − 0.51 × 9 = 600.89.
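The same calculation can be written as a small function, which makes it easy to repeat for any student in Table 2.17. This is a minimal sketch using the rounded coefficients from the regression equation; the function and dictionary names are illustrative.

```python
# Rounded coefficients from the regression equation in the text.
INTERCEPT = 432.28
SLOPES = {"Reading": 8.74, "Month": 4.51, "Native": 31.40, "Oral": -0.51}

def predict_toefl(student):
    """Expected TOEFL score: intercept plus each slope times the predictor value."""
    return INTERCEPT + sum(SLOPES[name] * student[name] for name in SLOPES)

# First student in Table 2.17.
first_student = {"Reading": 9, "Month": 14, "Native": 1, "Oral": 9}
expected = predict_toefl(first_student)
print(f"Expected TOEFL score: {expected:.2f}")  # 600.89

# One more month, everything else unchanged: the prediction rises by the month slope.
one_more_month = dict(first_student, Month=15)
print(f"Change for one extra month: {predict_toefl(one_more_month) - expected:.2f}")
```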

Table 2.20

SPSS Output for Multiple Regression Analysis Predicting TOEFL Scores

Model   Variable      B         Standard error   Beta     t        Significance
1       (Constant)    432.284   25.823                    16.740   0.000
        Reading         8.741    2.863            0.358    3.053   0.004
        Month           4.511    1.307            0.394    3.450   0.001
        Native         31.402   14.409            0.250    2.179   0.035
        Oral           −0.507    2.538           −0.022   −0.200   0.842

Note. B and Standard error are unstandardized coefficients; Beta is the standardized coefficient.

The correlation between the predicted values and the observed values is called the multiple correlation coefficient. In our example, the multiple correlation coefficient is 0.681. The difference between an observed value and the predicted value is called a residual. For the first student in Table 2.17, whose observed TOEFL score is 547, the residual is 547 − 600.89 = −53.89. The sum of all residuals in the sample is always 0.
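Both properties can be checked on a small example. The sketch below is deliberately simplified: it fits a one-predictor least-squares line (TOEFL on reading) to the first six students in the left panel of Table 2.17, so its coefficients are not those of Table 2.20. It then confirms that the residuals sum to zero and that the squared correlation between predicted and observed scores equals the R-square. All function and variable names are illustrative.

```python
from math import sqrt

# First six students in Table 2.17 (left panel): reading score and observed TOEFL.
reading = [9, 8, 5, 9, 7, 6]
toefl = [547, 649, 571, 595, 560, 570]

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Least-squares slope and intercept for a single predictor.
mx, my = mean(reading), mean(toefl)
slope = (sum((a - mx) * (b - my) for a, b in zip(reading, toefl))
         / sum((a - mx) ** 2 for a in reading))
intercept = my - slope * mx

predicted = [intercept + slope * a for a in reading]
residuals = [obs - pred for obs, pred in zip(toefl, predicted)]

ss_residual = sum(e ** 2 for e in residuals)
ss_total = sum((b - my) ** 2 for b in toefl)
r_square = 1 - ss_residual / ss_total
multiple_r = pearson(predicted, toefl)

print(f"Sum of residuals: {sum(residuals):.10f}")
print(f"R-square: {r_square:.4f}  squared multiple R: {multiple_r ** 2:.4f}")

# Residual for the first student under the full four-predictor equation in the text.
first_student_residual = 547 - 600.89
print(f"First student residual (full model): {first_student_residual:.2f}")
```

The zero sum holds for any least-squares regression that includes an intercept, which is the case for every model in this section.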

Stepwise regression selects the most important predictors one at a time to enter the regression equation and presents a final model with only the significant predictors in the equation. An important predictor is defined as the one that accounts for the largest share of variance in the dependent variable compared with the other predictors. Stepwise regression starts by calculating the variance explained by each of the predictors, selects the predictor that accounts for the most variance, and enters that variable into the regression equation. To select the second predictor, the additional variance explained by each of the remaining predictors is compared, and the predictor that adds the largest significant amount of explained variance is chosen to enter the equation. The procedure continues and then stops at the point where none of the remaining predictors accounts for any significant variance beyond the existing equation. Table 2.21 presents the SPSS output for the stepwise regression using the data in Table 2.17.
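The entry procedure described above can be sketched in a few lines. The code below is a simplified, forward-only illustration and not the SPSS algorithm: it enters predictors by the largest reduction in residual sum of squares and stops when the F statistic for the best remaining candidate falls below an assumed cutoff of 4.0, whereas statistical packages typically use p-value criteria and may also remove predictors that lose significance. The data are a made-up eight-case design (y depends on x1 and x2 but not on x3), and every name here is hypothetical.

```python
def ols_sse(columns, y):
    """Residual sum of squares for y regressed on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((y[r] - sum(b * x for b, x in zip(beta, X[r]))) ** 2 for r in range(n))

def forward_select(predictors, y, f_to_enter=4.0):
    """Enter predictors one at a time while the best candidate's F exceeds the cutoff."""
    n = len(y)
    selected = []
    remaining = dict(predictors)
    sse_current = ols_sse([], y)  # intercept-only model
    while remaining:
        base = [predictors[name] for name in selected]
        trial = {name: ols_sse(base + [col], y) for name, col in remaining.items()}
        best = min(trial, key=trial.get)
        df_residual = n - len(selected) - 2  # cases minus parameters after entry
        if df_residual <= 0 or trial[best] == 0:
            break  # F cannot be computed; stop the sketch here
        f_stat = (sse_current - trial[best]) / (trial[best] / df_residual)
        if f_stat < f_to_enter:
            break
        selected.append(best)
        sse_current = trial[best]
        del remaining[best]
    return selected

# Made-up data: three mutually uncorrelated predictors and a small error term.
data = {
    "x1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "x2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "x3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
noise = [-0.5, 0.5, 0.5, -0.5, 0.5, -0.5, -0.5, 0.5]
y = [10 + 4 * a + 2 * b + e for a, b, e in zip(data["x1"], data["x2"], noise)]

print(forward_select(data, y))
```

In this toy design x1 enters first because it explains the most variance, x2 enters second, and x3 is left out because it adds essentially nothing beyond the first two, mirroring the way oral is excluded in the example that follows.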

From Table 2.21, we see that reading was selected as the first predictor to enter into the equation, followed by month and native. Oral was excluded from the equation because the variance it explained was not significant. The final model resulting from this stepwise regression is that expected TOEFL = 429.29 + 8.73 × Reading + 4.46 × Month + 31.16 × Native. All predictors in the final model resulting from a stepwise regression are statistically significant.

Table 2.21

SPSS Stepwise Regression Output for the Prediction of TOEFL Scores

Model   Variable      B         Standard error   Beta    t        Significance
1       (Constant)    465.086   21.494                   21.638   0.000
        Reading        12.685    3.008           0.520    4.217   0.000
2       (Constant)    442.018   20.789                   21.262   0.000
        Reading        10.570    2.814           0.433    3.757   0.000
        Month           4.350    1.320           0.380    3.296   0.002
3       (Constant)    429.291   20.820                   20.620   0.000
        Reading         8.732    2.833           0.358    3.082   0.003
        Month           4.461    1.270           0.390    3.512   0.001
        Native         31.158   14.206           0.248    2.193   0.033

Note. B and Standard error are unstandardized coefficients; Beta is the standardized coefficient.

In document Adult Second Language Acquisition (Page 63-69)