Exploring the Predictive Power of Common Misconceptions with Tabulation Methods

Abstract

In this chapter, multiple tabling models to predict student correctness on the next problem are created from features focusing on the commonality of student misconceptions. These models were tested on two large datasets (open-response questions and multiple-choice questions) consisting of several hundred thousand rows of data on mathematical problems from the ASSISTments tutoring system. These models were compared to Knowledge Tracing and Performance Factor Analysis, which are existing standards for predicting student performance. We showed that our simplest tabling model predicted as well as the standard models for open- response questions indicating that the commonality of incorrect answers can potentially be a useful predictive factor. We also showed that there was no difference between the complexity of the tabling models with either open-response or multiple-choice questions. Although our models performed as well as KT and PFA, all models did poorly when compared to a simple baseline. Further research was done to determine the root problem of why our models did not predict well with a large dataset.

Introduction

Student incorrect responses were first studied in 1978 by John Seely Brown and Richard R. Burton (Brown & Burton, 1978). They manually analyzed student incorrect responses for multi- digit subtraction problems. These incorrect responses were then attributed to an incorrect process or “bug” performed by the student. From this analysis they created a procedural network, where a skill is broken down into sub-skills. Their main goal was to construct a database of all the possible buggy procedures with their series of “BUGGY” programs that helped to identify bugs (Burton, 1982). This work was extended by Brown and VanLehn with their introduction of “Repair Theory” (Brown & VanLehn, 1980). The idea of Repair Theory is that when a student realizes that he/she has performed a buggy operation he/she will attempt to repair the bug. This work looked at how bugs are caused and what bugs can be predicted.

Several researchers explored the area of finding common incorrect processes 1990’s summarized

attempted to predict student misconceptions and react appropriately. More recent algorithms have been made to automatically discover these buggy rules. One of the most recent examples is a machine learning algorithm developed in (Selent & Heffernan, 2014). In their work, a

machine-learning algorithm was developed to automatically discover all buggy rules for a given set of problems generated by the same template. This algorithm works by first taking the input symbols of the problem (the numbers in the problem), a set of basic operations (addition,

subtraction, multiplication, and division), and all the student incorrect responses for all problems generated by a single template. Next, the algorithm derives all possible incorrect processes for all the incorrect responses and generalizes the incorrect response across all problems generated by a given template. Finally the incorrect processes are assigned to the most likely generalized incorrect processes. The output of the algorithm is the machine learned process for the incorrect response (how the student arrived at their incorrect answer), as well as a percentage breakdown for all the incorrect processes. Although the method has a few weaknesses in terms of

computation time and computer memory, it is a sufficient solution to the problem of finding bugs.

Despite such a large amount of past work done on predicting student incorrect responses, there is little work done to use these incorrect response to predict future performance or correctness on the next question for open-response questions. The National Council on Measurement in Educational Measurement (NCME) community has come up with methods that weight incorrect responses differently. Work done by DeMars shows that polytomous models (weighting the incorrect responses differently) for multiple-choice questions predict better than dichotomous models (weighting all incorrect responses the same) (DeMars, 2008). Our contribution in this chapter is to use the idea of weighting incorrect responses differently and apply it to open-

response questions. DeMars notes that, “The polytomous models, which weighted the distractors differentially, yielded small increases in reliability compared to their dichotomous

counterparts.”. We believe there is value to this and hope to find larger improvements with open-response questions as opposed to the multiple-choice questions DeMars used.

It is expected that knowing the student’s common wrong answers will provide more predictive power, especially for open-response questions. These expectations are based on some relevant prior work done by Yigal Attali (Attali, 2014). In his work a randomized controlled trial was run to see what types of practice and feedback on the practice GRE test leads to better performance

on the real GRE test. It was shown that practice tests with open-response questions and immediate correctness lead to better GRE performance than practice with multiple-choice questions. This suggests that student attempts help them learn more and that multiple-choice causes some students to guess more and use less effort.

The goal of this chapter is to see how well we can predict student performance on the next question by knowing the commonality of their previous incorrect response. In this chapter we look at three questions. How well can we predict student performance on the next question given that (1) we know what the most common wrong answer is for open-response questions, (2) we know the top few common wrong answers for open-response questions, and (3) we know all the possible incorrect answers for multiple-choice questions. We want to answer the question; do common wrong answers on open-response questions provide more information than common wrong answers on multiple choice questions? If this is true, then we also want to answer the question, Can we better predict student performance on the next question given that we know what common wrong answers on previous question?

To answer these questions we use two datasets, described in the data section, and build a tabling model for each dataset that considers the commonality of the incorrect answers, described in the model section. We compare the predictive accuracy of our tabling models to the Knowledge Tracing (KT) model (Corbett & Anderson, 1994) and the Performance Factor Analysis (PFA) model (Pavlik, Cen, & Koedinger, 2009). Both KT and PFA are current standards in the field for predicting student performance. Knowledge tracing works by creating a Bayesian network with a latent student knowledge node (whether the student knows the concept, K) and an observable student performance node (whether the student answered the problem correctly, Q) for each time slice. The previous knowledge node transitions to the next knowledge node, creating the

Bayesian network shown in Figure 6.1.

Performance Factor Analysis works by creating a regression model, shown in Equation 6.1, that uses prior successes and failures for each student/skill pair and learns a skill difficulty parameter

(𝛽) as well as parameters (𝛾) and 𝜌 for the benefit of prior successes and failures. 𝑠 and 𝑓 respectively represent the number of successes and failures the student has had prior to seeing the current problem. 𝑚 is the logit value for the accumulated learning for a given student at the time the student encounters the current knowledge. Equation 6.2 is the logistic function that returns the student’s observed prediction.

Figure 6.1: Standard Bayesian Knowledge Tracing model (Corbett & Anderson, 1994)

Equation 6.1. 𝑚(𝑖, 𝑗 ∈ 𝐾𝐶𝑠, 𝑠, 𝑓) = ∑𝑗∈𝐾𝐶𝑠(𝛽𝑗+ 𝛾𝑗𝑆𝑖,𝑗+ 𝜌𝑗𝑓𝑖,𝑗) Equation 6.2. 𝑝(𝑚) = 1/(1 + 𝑒−𝑚)

Tabling Models

A tabling model provides a mapping from data to predictions based on the attributes of the data. This generates a probability table to use for predicting. Tabling has been applied in past research

in (Wang et al, 2011). Their tabling method provided a complement to the Knowledge Tracing

model by using past response sequences to predict future responses. Our tabling method expands on that idea by looking at what types of incorrect responses were generated in addition to basic correctness. There are three tabling method tested in this chapter. Each method incrementally increases in complexity compared to the previous model.

The first tabling method is built using four values calculated from the training set: 1) the average next problem correctness probability for all correct answers; 2) the average next problem

correctness probability of all cases where the students asked for hints; 3) the average next problem correctness probability of all common wrong answers; 4) the average next problem correctness probability of all uncommon wrong answers. When selecting common wrong

answers from a given problem, we pick the answers that cover more than 4 data entries and more than 10 percent of all wrong answer entries under the given problem. All other wrong answers

fall into the uncommon wrong answer category. When a student asks for a hint immediately, it is recorded as a blank response because their first response was to ask for a hint without typing any input in. We are focusing on incorrect answers and a blank answer (asking for a hint) is a special case that we consider.

By binning the student responses into four categories; correct entry, hint entry, common wrong answer entry, and uncommon wrong answer entry, we may use these four values to predict the students’ next problem correctness. In doing so we hope to create a rudimentary tabling model that considers the commonality of student incorrect responses to compare to the other models. We also use this simple tabling model to compare to our other tabling models to determine which added granularities contribute to better predictions.

The second tabling method adds more complexity to the prediction table where each common wrong answer and each problem will have an individual probability. This is different from our first model, which groups all the common incorrect responses together for a single probability value. Correct responses and hint use are modeled in the same way as the first tabling method where the probabilities are calculated using the entire training set. In making our predictions we prefer using only the good common wrong answers - common wrong answers that are

significantly more predictive than uncommon wrong answers. In selecting the good common wrong answers, we picked the common wrong answers whose prediction values differ from the average next problem correctness probability of all uncommon wrong answers by at least 0.3. We use this table method to compare to our first tabling method to see if the specific common incorrect responses give more predictive power than treating all common incorrect responses the same.

The third tabling method expands on the second tabling method. A probability value is

calculated for correct responses and hint use for each problem in order to predict correctness on the next question. This is different from models 1 and 2, which use global probabilities for correct responses and hint use. By implementing this table we hope to see if the same approach used in the second method works with correct answers and situations where students asked for hints.

An example of each tabling method is shown in Tables 6.1 - 6.3. Only a table for a single problem is shown where the entire model learns the probability tables for each problem by calculated all answers of the given type in the entire training set.

Answer Type Prediction

Correct 0.76775

Hint 0.43836

Common wrong answers 0.53248

Uncommon wrong answers 0.52669

Table 6.1: An example of what tabling model 1 looks like.

Problem ID Answer Prediction

71367 Correct 0.76775

71367 Hint 0.43836

71367 21/4 = 5/c 1

71367 Uncommon 0.707317

71360 Correct 0.76775

Table 6.2: An example of what tabling method 2 looks like. Common/uncommon wrong answers are used, where

we consider the common wrong answer text as shown as “21/4 = 5/c” in this example. A probability of 1 is used for students who make this common wrong answer, meaning that every student in the training set got the next question correct after making this mistake. Note that we are still using the same values from tabling model 1 in predicting

Correct and Hint entries, as presented in the last row the Correct prediction values are the same

Problem ID Answer Prediction

71367 Correct 0.931034

71367 Hint 0.321429

71367 21/4 = 5/c 1

71367 Uncommon 0.707317

Table 6.3: Table 6.3 takes one step further from Table 6.2, giving each problem its individual Correct and Hint predictions.

Data

For our experiments we use two skill builder datasets of responses on mathematical questions, one for open-response questions and one for multiple-choice questions. The data ranges from years 2008-2014 and grades 4-12 from the ASSISTments online tutoring system. A skill builder is a set of problems where a student must get a certain number of problems correct in a row (usually three) in order to complete. A skill builder typically consists of a bank of 50-100 possible questions that are randomly drawn from and given to the student. These questions are generated by one or more templates. For example, a template problem for the skill “Order of Operation”, looks like “a + b * c”. This template generates the problem instances “1 + 2 * 3”, and “5 + 4 * 9” as well as several other similar problems by substituting different numbers in for the variables ‘a’, ‘b’, and ‘c’ (Razzaq et al., 2009). The reason why we chose to only focus on data from skill builders is because this data is more realistic and less subject to noise, which is a common problem in online tutoring systems. Skill builders are used by several teachers and are not specific to a certain demographic. Due to the randomized questions, skill builders are also less subject to cheating by students. Students are given correctness feedback, on whether they got the problem right or wrong, after submitting an answer for a problem in.

We continued to refine our datasets based on a few requirements of the models we train and test on. To give all models a fair chance we needed a single data set where all the models can be trained equally on. Our tabling method focuses on the commonality of incorrect answers and does not require any data on knowledge components, which is analogous to what are called “skills” in the ASSISTments system. This is an advantage of our model, since PFA requires each question to have a knowledge component (skill) associated with it. Therefore we used datasets that PFA can be trained correctly on. We chose to remove all problems that were associated with more than a single skill for simplicity. For the problems that did not have an associated skill, we used the template id the problem was generated by to group those problems into an unnamed “skill”.

In order for our datasets to be modeled by KT, we had to limit the number of problems a student attempted for a given skill. If we did not limit the number of problems for a student/skill pair, the Bayesian network for KT would need a latent knowledge node and an observable node for all of the student responses for a student/skill pair. Due to outliers in our original datasets, this results in a Bayesian network of 229 knowledge/observable node pairs, which takes too long to

train on. For this reason, we removed all data where a student had completed more than ten problems for a given skill. To be clear, the first ten problems for a given student/skill were not removed. Only the data after ten problems for a given student/skill was removed. To get more accurate predictions we also removed data for skills we felt did not have sufficient data to train on. Any skill that had less than 10,000 data rows was removed from the dataset.

Our final dataset for open ended questions consisted of 343,024 rows of data with 6,796 unique problems, 32,068 students and 14 skills. Our final dataset for multiple-choice questions

consisted of 129,728 rows of data with 1,410 unique problems, 20,870 students and 7 skills. Both datasets are available at the following URL address7. The datasets are split up into pieces due to size restrictions.

We note that to have a single dataset to train and test all models equally on, we had to modify the original dataset to compensate for the weaknesses of the other standard models. We consider these weaknesses unquantifiable, but worth considering when analyzing the results.

Analysis

For both datasets we randomly split our data into a training set and test set based on the student ID. All of the student response data is associated with a unique ID for each student. If the student ID is divisible by five, all data of that student are in the test set and the remaining data are part of the training set. Our questions are about predicting student performance on the next question. Therefore all data where there is no next-question, such as the last question for a given student/skill pair, was not predicted on for the test set. This resulted in a training set of

consisting of 275,183 (80 %) rows of data and a test set of 67,842 (20 %) rows of data for open- response questions and a training set of 129,728 (84 %) rows of data and a test set of 25,261 (16 %) rows of data for multiple-choice questions.

To analyze our results we use four different metrics to measure the prediction accuracy of the models. We use Accuracy, Root Mean Squared Error (RMSE), Area Under the ROC Curve (AUC), and R2to measure the accuracy of the predictions from all the models. We also included the number of parameters each model needed to learn. Our predictions are a range of

probabilities from [0, 1], representing the prediction probability that the student will get the next question correct.

7_{https://sites.google.com/site/zlas2015/data}

Accuracy is calculated by the number of correct predictions divided by the total number of predictions. To calculate basic accuracy, we use 0.5 as a cutoff point to round for our accuracy metric. If our prediction value is < 0.5 we round to 0 and if our prediction value is > 0.5 we round to 1. Once our prediction is converted to a binary value (0 or 1), we then calculate the absolute difference from the actual value and use that as our accuracy measure. A higher accuracy indicates a better prediction.

The RMSE metric can range anywhere from [0, 1] where a ‘0’ represents a perfect model and a ‘1’ represents the worst possible model. A lower RMSE indicates a better prediction. Since RMSE is not normalized, high values of RMSE may not necessarily be poor predictions. AUC can range anywhere from [0, 1], where a higher value indicates a better prediction and a value of ‘1’ represents a perfect model. A model should never have a value below 0.5, since that would mean the model is predicting worse than chance.

R2 measures how much variance is explained by the model or how similar two models are. R2 is normalized on a [0, 1] scale where an R2 value of ‘1’ means that the two models are exactly the same. A higher R2 indicates better predictive accuracy when comparing predictions to the actual values.

ZeroR predictions (predicting the majority class) are provided as a baseline model to compare all the other model metrics to because “good” and “bad” metric values are dependent on the specific datasets

Since a single student attempts multiple problems, the data samples are not independent of each other. For this reason we report both overall metrics as well as metrics calculated per student.

In document Creating Systems and Applying Large-Scale Methods to Improve Student Remediation in Online Tutoring Systems in Real-time and at Scale (Page 125-138)