results: Comparing person estimates from models. The analysis in this phase examined the practical implications of the different models fit in Phase III. Supplementing the model comparisons in the previous phase, this section aimed to answer RQ2 (“How do the specified models (IRT and BN) compare in terms of possible interpretations from the results and applicability to instructional and assessment outcomes?”) by looking at the information the models can offer beyond simple scoring. The analysis involved the simple scoring of reflection questions, followed by comparisons of student estimates from the different models and comparisons of student groupings in terms of the distribution of student scores.

In order to obtain the simple scoring values for all of the students, the coded data from the reflection questions were summed. This includes the four drop-down responses (DD1-DD4) and the coded points for the reflection questions. The use of rubrics similar to those that were used to code the reflection data is a common practice in science classrooms to evaluate student answers (e.g., Lunsford & Melear, 2004). Student scores using the simple scoring method ranged from 1 to 10, with an average score of 5.06. Table 28 shows the mean values for the student pairs at each simple score level in the actual dataset, from simple scoring and from the models in Phase III (see Appendix N for the full table of values). The actual dataset, rather than the simulated one, was used for the analysis in this phase.

The values for students from the IRT models were their theta (θ) values, which estimate their ability levels; these were MAP estimates derived from the ‘fscores’ function in the ‘mirt’ package in R. For the multidimensional IRT model, the student ability estimates for each of the dimensions (DCIs and SPs) were given. For the LLTM3 model, the estimates were obtained using the ‘randef’ function to get an estimate of the random effects (student ability). In contrast to IRT models, where estimating student θ values is an established practice, the ranking of students on multiple variables using a BN is not as straightforward. For this analysis, the student results from cross-validation, where each pair was given a probability from the model as to whether the student pair should be coded for the class variable, were used to generate an average probability of students answering reflection questions correctly.
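The two non-IRT scoring rules described above can be illustrated with a minimal Python sketch. All of the data below are hypothetical and invented for illustration: the number of reflection items, the rubric points, and the cross-validated probabilities are assumptions, not values from the study, and the helper names `simple_score` and `bn_person_estimate` are not part of any package.

```python
from statistics import mean

# Hypothetical coded responses for three student pairs (NOT study data).
# DD1-DD4 are the drop-down items; "reflection" holds rubric points.
coded = {
    "pair_01": {"DD1": 1, "DD2": 0, "DD3": 1, "DD4": 1, "reflection": [2, 1]},
    "pair_02": {"DD1": 0, "DD2": 0, "DD3": 1, "DD4": 0, "reflection": [1, 0]},
    "pair_03": {"DD1": 1, "DD2": 1, "DD3": 1, "DD4": 1, "reflection": [2, 2]},
}

def simple_score(record):
    """Sum the four drop-down codes and the coded reflection points."""
    drop_downs = sum(record[item] for item in ("DD1", "DD2", "DD3", "DD4"))
    return drop_downs + sum(record["reflection"])

# Hypothetical cross-validated probabilities, one per class variable (NOT study data).
cv_probs = {
    "pair_01": [0.62, 0.55, 0.48],
    "pair_02": [0.41, 0.39, 0.46],
    "pair_03": [0.71, 0.66, 0.58],
}

def bn_person_estimate(probs):
    """Average the per-variable probabilities into one value per pair."""
    return mean(probs)

simple_scores = {pair: simple_score(rec) for pair, rec in coded.items()}
bn_estimates = {pair: bn_person_estimate(p) for pair, p in cv_probs.items()}

for pair in coded:
    print(pair, simple_scores[pair], round(bn_estimates[pair], 2))
```

The averaging step is what places the BN output on a single scale per pair, which is what makes it comparable to a θ estimate or a summed score.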

Table 28

Comparison of Mean Person Estimates from Phase III Models by Simple Score

Simple Score  n (pairs)  Uni. Rasch  Multi-DCI  Multi-SP  LLTM3  Expert BN  Empirical BN  DBN
1             2          -0.44       -0.77      -0.30     -0.43  0.47       0.43          0.53
2             8          -0.27       -0.77       0.14     -0.26  0.42       0.42          0.54
3             5          -0.43       -0.46      -0.69     -0.42  0.45       0.44          0.49
4             12         -0.15       -0.29      -0.05     -0.15  0.45       0.46          0.47
5             15         -0.08        0.03      -0.25     -0.07  0.47       0.47          0.47
6             19          0.07        0.21      -0.01      0.07  0.53       0.53          0.52
7             10          0.33        0.51       0.37      0.32  0.53       0.57          0.55
8             5           0.52        0.90       0.46      0.49  0.54       0.59          0.46
10            1           1.38        1.89       1.85      1.34  0.66       0.76          0.62
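As a consistency check, the pair counts in Table 28 can be used to recompute the mean simple score. The short Python sketch below uses only the simple score and n (pairs) columns of the table; the variable names are illustrative.

```python
# (simple score, number of pairs) taken from Table 28
table_28_counts = [
    (1, 2), (2, 8), (3, 5), (4, 12), (5, 15),
    (6, 19), (7, 10), (8, 5), (10, 1),
]

total_pairs = sum(n for _, n in table_28_counts)
weighted_mean = sum(score * n for score, n in table_28_counts) / total_pairs

print(total_pairs)              # 77
print(round(weighted_mean, 2))  # 5.06
```

The counts sum to 77 pairs and the weighted mean is 390 / 77 ≈ 5.06, which matches the average simple score reported in the text.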

In order to further explore the variations in scoring values among the models, the correlations of values between models were compared (see Table 29). The correlations among the IRT models were the strongest, which makes sense because they are from the same family of models. However, the estimates for the two dimensions of the multidimensional IRT model were not highly correlated with each other (0.49), which echoes the low correlation of the dimensions from Phase III. The Multi-SP estimate was also not highly correlated with the simple score (0.25). The LLTM3 model and unidimensional Rasch model results were very highly correlated with each other (0.97), suggesting that the patterns of the student θ values from these models were nearly the same. Another high correlation was between the simple scoring and empirically-structured BN results (0.83). The DBN had the lowest level of correlation with all of the models. Although this could be attributed to the lower level of accuracy of this model relative to the other BNs, it could also suggest that the inclusion of temporal information provided different information about students than the other models.

Table 29

Correlations of Scoring Values from Simple Scoring and Phase III Models

           Simple  Uni-Rasch  Multi-DCI  Multi-SP  LLTM3  Exp. BN  Emp. BN
Uni-Rasch  0.51
Multi-DCI  0.61    0.88
Multi-SP   0.25    0.84       0.49
LLTM3      0.53    0.97       0.86       0.84
Exp. BN    0.62    0.35       0.37       0.23      0.33
Emp. BN    0.83    0.58       0.61       0.38      0.56   0.69
DBN        0.08   -0.01      -0.01       0.01     -0.02   0.33     0.23
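The patterns in Table 29 can be read off programmatically. The Python sketch below copies the lower triangle of the table and looks up the strongest pair, the model that best tracks the simple score, and the model with the lowest average correlation. The averaging in `mean_correlation` is an illustrative summary chosen here, not a statistic reported in the study.

```python
labels = ["Simple", "Uni-Rasch", "Multi-DCI", "Multi-SP",
          "LLTM3", "Exp. BN", "Emp. BN", "DBN"]

# Lower triangle of Table 29; row i holds correlations of labels[i + 1]
# with labels[0..i].
lower = [
    [0.51],
    [0.61, 0.88],
    [0.25, 0.84, 0.49],
    [0.53, 0.97, 0.86, 0.84],
    [0.62, 0.35, 0.37, 0.23, 0.33],
    [0.83, 0.58, 0.61, 0.38, 0.56, 0.69],
    [0.08, -0.01, -0.01, 0.01, -0.02, 0.33, 0.23],
]

# Map each (row label, column label) pair to its correlation.
pairs = {}
for i, row in enumerate(lower, start=1):
    for j, r in enumerate(row):
        pairs[(labels[i], labels[j])] = r

def mean_correlation(label):
    """Average correlation of one model with all the others (illustrative)."""
    values = [r for pair, r in pairs.items() if label in pair]
    return sum(values) / len(values)

strongest_pair = max(pairs, key=pairs.get)
best_match_for_simple = max((p for p in pairs if "Simple" in p), key=pairs.get)
weakest_model = min(labels, key=mean_correlation)

print(strongest_pair, pairs[strongest_pair])
print(best_match_for_simple, pairs[best_match_for_simple])
print(weakest_model, round(mean_correlation(weakest_model), 2))
```

Run on the table values, this picks out the LLTM3 and unidimensional Rasch pair (0.97), the empirically-structured BN as the closest match to the simple score (0.83), and the DBN as the model with the lowest average correlation.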

The relationship between simple score and the scoring values of the Phase III models is depicted in Figure 23. The two graphs are separated by IRT and BN models, with the score (theta or average probability) on the vertical axis and simple score on the horizontal axis. If the estimates of ability from the models were linearly related to the simple scores, the scatterplot of estimates would be expected to trend upward in a linear manner in the graphs. However, neither of these graphs demonstrates a straightforward linear relationship. These graphs also illustrate the spread of abilities that can be seen within simple score values. The wide spread of values, such as the one in the IRT graph at simple score 5, suggests that these models may be providing information that further differentiates students in the middle scoring groups.
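The within-score spread discussed here can be quantified as the range of estimates at each simple score level. The following Python sketch uses hypothetical θ values invented for illustration; they are not the estimates plotted in Figure 23.

```python
from collections import defaultdict

# Hypothetical (simple score, theta) observations; NOT values from the study.
estimates = [
    (4, -0.60), (4, 0.10),
    (5, -0.90), (5, -0.10), (5, 0.70),
    (6, -0.20), (6, 0.30),
]

# Group the estimates by simple score level.
by_score = defaultdict(list)
for score, theta in estimates:
    by_score[score].append(theta)

# Range (max minus min) of the estimates within each level.
spread = {score: max(t) - min(t) for score, t in sorted(by_score.items())}

for score, width in spread.items():
    print(score, round(width, 2))
```

A larger range at a given level means that pairs with the same summed score are being separated more by the model at that level.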

Figure 23. Scatterplots of person estimates by simple score.

Comparing the estimates of different models within the graph, most of the models appear to have a similar spread and do not appear to have systematically higher or lower estimates across simple score levels. An exception is the DBN probabilities, which appear to be higher than the other BN probabilities in simple scoring levels 1-3 and lower in levels 8-10. The estimates for the SP dimension of the multidimensional IRT model are in line with much of the IRT scatterplot, but also include outliers in each of the simple scoring levels that fall above or below the main group of estimates. These outliers represent students whose estimate of SP ability does not align with the estimate of their ability from the other models.

Another area of interest when considering the practical value of different models is the set of potential student groupings suggested by the models. One way of exploring this was to examine the distribution of scores generated by the different models. Figure 24 provides histograms for each of the models in which the students are split into five groups (referred to here as quintiles) defined by equipartitioning the range of values for each model, that is, by dividing the range into five equal-width intervals. The distribution of scores for the Empirical BN, for instance, suggests that the majority of the students were grouped into the same three categories based on their scores and that the model separated out two students at the top of the distribution.
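Equipartitioning the range of values into five groups can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `equal_width_groups` and the probabilities are hypothetical, and placing the maximum value in the top group is a convention chosen here, not one confirmed by the source.

```python
def equal_width_groups(values, n_groups=5):
    """Assign each value to one of n_groups equal-width intervals
    spanning [min(values), max(values)]; groups are numbered from 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_groups
    groups = []
    for v in values:
        if width == 0:
            idx = 0  # all values identical: a single group
        else:
            # The maximum would land in group n_groups + 1, so cap it.
            idx = min(int((v - lo) / width), n_groups - 1)
        groups.append(idx + 1)
    return groups

# Hypothetical average probabilities for nine pairs (NOT study data).
probs = [0.40, 0.42, 0.45, 0.47, 0.50, 0.52, 0.55, 0.74, 0.80]

groups = equal_width_groups(probs)
counts = [groups.count(g) for g in range(1, 6)]
print(groups)
print(counts)
```

Because the intervals have equal width rather than equal counts, some groups can be empty while a few high values sit alone in the top group, which is the kind of pattern described for the Empirical BN.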

Summary of Phase IV results. The fourth phase of analysis extended the comparisons among models by considering the value of the models from Phase III against information that could be obtained from a simple scoring of the reflection responses. The analysis focused on the values for the pairs (theta from the IRT models and probabilities from the BN models) from the actual dataset. A direct comparison of the values through examining the means, correlations, and spread on scatterplots suggested similarities and differences among the person estimates for the models and the simple scores. The empirically-structured BN had the highest correlation between the pairs’ probabilities and the simple score (0.83). The expert-structured BN and MIRT-DCI estimates were also strongly correlated with the simple score (0.62 and 0.61, respectively). By contrast, the DBN and MIRT-SP estimates had the weakest correlations with the simple score (0.08 and 0.25, respectively), and the DBN was only weakly correlated with every other model. The scatterplots of estimates, organized by simple score, echoed these results by illustrating the wide spread of points for the estimates of the DBN and MIRT-SP models. In order to explore possible groupings of students, the distributions of the student values, split into quintiles, were compared. These results help answer the second research question by exploring the practical value the models may add by providing information beyond what could be obtained by hand-scoring.

Phase V: Applying methodology to a different simulation. The fifth and final phase of
