CHAPTER 4. RESULTS
4.1 The relationship among the EPT and self-assessment and the TOEFL iBT
4.1.2. Rasch model analysis and confirmatory factor analysis of the self-assessment
To evaluate the quality of the self-assessment, a Rasch measurement analysis using the rating scale model was conducted for each of the subscales. All the valid responses to the self- assessment were used in the Rasch model analysis to increase the modeling accuracy (N = 347). The decision to run separate Rasch analyses for each subscale was made based on the results of the pilot study exploratory factor analysis that was used to check the unidimensionality
assumption. This decision was further confirmed with the results of the Rasch principal component analysis of residuals. As shown in the two rightmost columns in Table 4.4, the proportion of variance explained by the first dimension is substantial (from 75.2% to 79.9%) and the eigenvalue for the first contrast is less than 2.0 in all four subscales (Linacre, 2011). This configuration of principal components supported the unidimensionality assumption since one component accounted for the majority of the variance.
Table 4.4
Results of Rasch Measurement Analysis: Reliability and Unidimensionality Check for the Self- assessment (N = 347) Self-assessment Subscale Person Separation Person reliability Item Separation Item reliability Variance explained Eigenvalue for 1st contrast Reading (k=5) 2.99 .90 8.35 .99 79.6% 1.6 Listening (k=5) 2.83 .89 8.23 .99 78.1% 1.9 Speaking (k=5) 3.55 .93 3.55 .93 79.9% 1.5 Writing (k=5) 3.35 .92 5.12 .96 75.2% 1.7
Table 4.4 also shows the reliability information of the four subscales. The person reliability, an equivalent of Cronbach’s alpha in the Rasch measurement analysis, ranged from
0.89 to 0.93, indicating a high reliability in each self-assessment section. The person separation index of the self-assessment sections ranged from 2.83 to 3.55, indicating that the extent to which the respondents were spread in standard error units on the measured variables (Bond & Fox, 2007; Iramaneerat, Smith, & Smith, 2008). A higher person separation value suggests a wider spread of person on the scale. Bond and Fox (2007) maintain that both item and person separation index should be at least two. Another useful reference for person separation index value is to use the formula Strata = (4× Person separation index + 1)/3 for the number of
statistically distinctive levels of person ability. For example, a value of 2.99 in person separation index shows an existence of four discernible levels of person ability. The item reliability in the Rasch measurement analysis refers to the replicability of the item ordering if the items are administered to a group of subjects with similar abilities. The item reliability of the subscales ranged from 0.93 to 0.99 and the item separation index ranged from 3.55 to 8.35. The item separation index can be interpreted in the same way as the person separation index. Overall, the four subscales of the self-assessment tool exhibited high reliability.
To better understand the psychometric quality of the self-assessment items, Rasch item measures, and item-level fit statistics were reported for each subscale in Table 4.5. The item measure or endorsability of the self-assessment statement is estimated on the common scale with the logit as the measurement unit. Overall, the self-assessment items showed a relatively narrow range in item measure (-1.25 to 1.04), meaning that the self-assessment items were not very different in terms of item difficulty.
Table 4.5
Results of Rasch Measurement Analysis: Item measure and Fit Statistics in the Self-assessment (N = 347)
SA Items Item measure (logit)
S.E. Infit MNSQ Outfit MNSQ Point-biserial coefficient R ea ding Rd1 0.39 0.08 0.95 0.92 .86 Rd2 -0.25 0.08 0.78 0.75 .87 Rd3 0.35 0.08 1.00 0.98 .85 Rd4 -1.25 0.08 0.99 0.94 .83 Rd5 0.76 0.08 1.26 1.28 .81 List ening Lsn1 -0.71 0.07 0.99 0.94 .81 Lsn2 -0.43 0.07 1.33 1.38 .75 Lsn3 -0.09 0.07 0.63 0.65 .87 Lsn4 0.20 0.07 0.81 0.78 .86 Lsn5 1.04 0.07 1.19 1.21 .82 S pe aking Spk1 0.14 0.09 1.32 1.30 .84 Spk2 0.15 0.09 0.89 0.87 .90 Spk3 -0.66 0.09 1.02 1.02 .87 Spk4 0.19 0.09 0.81 0.81 .90 Spk5 0.17 0.09 0.91 0.91 .89 W riting Wrt1 -0.14 0.09 0.98 0.99 .85 Wrt2 -0.52 0.09 1.19 1.19 .83 Wrt3 0.86 0.09 0.91 0.91 .88 Wrt4 0.01 0.09 1.06 1.05 .86 Wrt5 -0.21 0.09 0.82 0.78 .89
According to Bond and Fox (2007), the infit and outfit mean squares (MNSQ) as unstandardized fit statistics are used to assess whether an item functions as the Rasch model expects. Infit is a weighted fit statistic and is less sensitive to outliers, compared with outfit, an unweighted fit statistic. The expected mean square value of the infit and outfit statistics is 1.0. The value range from 0.5 to 1.5 is generally regarded as an acceptable fit to the Rasch model (Green, 2013). As shown in Table 4.5, all the mean square values of the infit and outfit statistics were within the range of 0.5 to 1.5, indicating that the self-assessment items were functioning as
the Rasch model predicted. Another relevant item quality index is the point-biserial correlation coefficient of each item, which is the discrimination index in the classical test theory (CTT) framework. All the self-assessment items had relatively high point-biserial correlation
coefficients (.75 to .90), which indicates that the self-assessment items had a good discrimination among the participants.
Similar to the consideration of item fit in Rasch analysis, the categories of the Likert scale should also exhibit a good model fit. In a rating scale model, an estimation of the threshold value of the Likert scale is more meaningful as it tests the functionality of the scale. According to Bond and Fox (2007), the following four characteristics of a rating scale should be checked: the count frequency, monotonicity of category average measures, threshold or step calibrations, and category fit. It is recommended that the count for each category should be no less than 10, the distance between thresholds should be at least 1.4 logits, but less than 5 logits, and the infit statistics for each category should be acceptable. Tables 4.6 to 4.9 contain the category functioning information for each subscale of the self-assessment items.
The categories 1-6 in Tables 4.6 to 4.9 correspond to the 6-point Liker scale levels from “Strongly Disagree” to “Strongly Agree”. The count information refers to the frequency of each category being selected for all the items by the ESL students. The total number of the counts should be close to 1735 for each self-assessment section (5 items × 347 ESL student
respondents) with some invalid or extreme responses automatically removed in the Rasch analysis. The infit and outfit mean square (MNSQ) can be interpreted in the same way as for the item infit and outfit MNSQ. Both fit indices indicate the degree to which each category functions as the model expects, with one being perfectly fit to the model the range from 0.5 to 1.5 being acceptable. Structure calibrations are the estimated difficulty for choosing one category over its
adjacent lower category. The bottom level of the Likert scale (1 Strongly Disagree) does not have a lower category for the transition and therefore marked ‘None” in structure calibration. Table 4.6
Results of Rasch Measurement Analysis: Rating Scale Category Statistics for the Self-assessment Reading Section (N = 347)
Category Count (%) Obs. average Infit/Outfit MNSQ Structure calibration
1 12 (1%) -4.66 2.01/1.41 None 2 120 (8%) -1.73 1.48/1.45 -5.80 3 264 (17%) -0.05 0.98/1.02 -1.65 4 408 (26%) 1.62 0.90/0.89 0.47 5 550 (35%) 3.15 0.85/0.80 2.10 6 231 (15%) 5.02 0.99/0.97 4.88
As shown in Table 4.6, the observed average ability of the participants increased monotonically from -4.66 to 5.02 on the logit scale, which indicates that the participants with higher ability tended to endorse or choose a higher category. For example, category 1 or “Strongly Disagree” on the 6-point Likert scale was selected 12 times for the five self- assessment reading items. The average ability of the ESL students who selected “Strongly Disagree” was -4.66 logits. Both infit and outfit mean squares of each category were within the acceptable range (from 0.5 to 1.5), except for the first item in reading (Infit: 2.01, Outfit: 1.41). The distances between thresholds were greater than 1.4 logits, but less than 5 logits. This clear monotonic change pattern in threshold value can be observed in the category probability curves for the Reading (SA_R1), Listening (SA_L1), Speaking (SA_S1) and Writing (SA_W1)
subscales respectively (see Figure 4.1). Different categories are represented in curves in different color in Figure 4.1. For example, the red curves represent category 1 (Strongly Disagree) and the light green curves on the rightmost side of each chart represent category 6 (Strongly Agree). The y-axis of the charts shows the probability of endorsing or selecting a category by a respondent with certain ability and the x-axis shows ability level in logits. Ideally, each curve will has certain overlap with its adjacent curves while maintaining a clear un-overlapped peak, if each
category functions as the model expects. The information from Table 4.6 and the first chart in Figure 4.1 jointly showed that the Likert scale for the reading items functioned well.
Figure 4.1. Category probability curves for the subscales of self-assessment. SA_R = reading
self-assessment, SA_L = listening self-assessment, SA_S = speaking self-assessment, SA_W = writing assessment.
The Likert scale for the listening items also exhibited good qualities, as shown in Table 4.7. As expected, the observed average ability of the participants increased monotonically along the Likert scale. Both infit and outfit mean squares of each category were within the acceptable range from 0.5 to 1.5. The distances between thresholds were greater than 1.4 logits, but less
than 5 logits. This clear monotonic change pattern in threshold can be observed in the category probability curve in Figure 4.1.
Table 4.7
Results of Rasch Measurement Analysis: Rating Scale Category Statistics for the Self-assessment Listening Section (N = 347)
Category Count (%) Obs. average Infit /Outfit MNSQ Structure calibration
1 62 (4%) -2.91 1.25/1.37 None 2 158 (10%) -1.66 1.03/1.13 -3.22 3 343 (21%) -0.38 0.93/0.94 -1.75 4 459 (28%) 0.68 0.83/0.80 -0.12 5 428 (26%) 2.12 0.96/0.93 1.43 6 192 (12%) 3.67 1.16/1.11 3.66
The Likert scale for the speaking and writing items functioned well as shown in Tables 4.8 and 4.7. The statistics about the revised scale are listed on the right side of Tables 4.6 and 4.7. As shown below, the observed average ability of the participants increased monotonically in the revised scale. Both infit and outfit mean squares were within the acceptable range from 0.5 to 1.5. The distances between thresholds were greater than 1.4 logits, but less than 5 logits. The monotonic change patterns in threshold can be observed in the category probability curve in Figure 4.1.
Table 4.8
Results of Rasch Measurement Analysis: Rating Scale Category Statistics for the Self-assessment Speaking Section (N = 347)
Category Count (%) Obs. average Infit/Outfit MNSQ Structure calibration 1 43 (3%) -5.26 1.32/1.28 None 2 137 (9%) -2.75 1.01/1.00 -5.22 3 338 (21%) -0.70 0.96/0.97 -2.54 4 502 (31%) 1.21 0.83/0.81 -0.13 5 444 (28%) 3.38 0.99/0.99 2.39 6 136 (9%) 5.24 1.25/1.17 5.51
Table 4.9
Results of Rasch Measurement Analysis: Rating Scale Category Statistics for the Self-assessment Writing Section (N = 347)
Category Count (%) Obs. average Infit/Outfit MNSQ Structure calibration 1 25 (2%) -4.54 1.16/1.16 None 2 115 (7%) -2.54 0.98/0.98 -5.03 3 339 (21%) -0.70 1.01/1.04 -2.69 4 532 (33%) 1.16 0.90/0.88 -0.20 5 467 (29%) 3.24 0.96/0.96 2.29 6 122 (8%) 5.30 1.14/1.0/ 5.64
Overall, the Rasch analysis results indicate that the self-assessment items have a good quality in terms of reliability, item discrimination, item fit statistics, and scale functioning.