Comparison of machine learning methods for intelligent tutoring systems
Wilhelmiina Hämäläinen and Mikko Vinni
Department of Computer Science, University of Joensuu,
P.O. Box 111, FI-80101 Joensuu, FINLAND
{whamalai, mvinni}@cs.joensuu.fi
Abstract. To implement real intelligence or adaptivity, the models for intelligent tutoring systems should be learnt from data. However, educational data sets are so small that machine learning methods cannot be applied directly. In this paper, we tackle this problem and give general guidelines for creating accurate classifiers for educational data. We describe an experiment in which we were able to predict course success with more than 80% accuracy in the middle of the course, given only a hundred rows of data.
1 Introduction
The ability to learn is often considered one of the main characteristics of intelligence. In this sense, most intelligent tutoring systems are far from intelligent. They are like old-fashioned expert systems, which perform mechanical reasoning according to predefined rules. They may use very intelligent-sounding methods like Bayesian networks, decision trees or fuzzy logic, but almost always these methods are used only for model representation. They determine only how to reason in the model, while the model itself is predefined.
Table 1. The basic terminology used in this paper.
Concept              Meaning
Model structure      Defines the variables and their relations in the model. E.g. nodes and edges in a Bayesian network, or independent and dependent variables in linear regression.
Model parameters     Assigned numerical values, which describe the variables and their relations in the given model. E.g. prior and conditional probabilities in a Bayesian network, or regression coefficients in linear regression.
Modelling paradigm   General modelling principles like definitions, assumptions and techniques for constructing and using models. E.g. Bayesian networks, linear regression, neural networks, or decision trees.
This situation is very surprising compared to other fields, where machine learning methods are widely used. In modern adaptive systems, both the model structure and the model parameters (see the terminology in Table 1) are learnt from data. However, in educational applications it is rare that even the model parameters are learnt from data.
The main problem is the lack of data. Educational data sets are very small: their size is that of a class, which is often only 50-100 rows. In a distance learning setting, course sizes are larger, and it is possible to pool data from several years if the course has remained unchanged. Still, we can be happy if we get data from 200-300 students. This is very little when we recall that most machine learning methods require thousands of rows of data.
Another problem is that the data is often mixed and contains both numeric and categorial variables. Numeric variables can always be transformed to categorial, but the opposite is generally not possible (we can transform all variables to binary, but at the cost of model complexity). This is not necessarily a problem, because the best methods for really small data sets use categorial data. Educational data also has one advantage compared to several other domains: the data sets are usually very clean, i.e. the values are correct and do not contain any noise from measuring devices.
In the following, we will tackle these problems. In Section 2, we describe the general approach, which allows us to infer also the model structure from data. In Section 3, we give general guidelines for modelling educational data. We concentrate on constructing a classifier from real student data, although some principles apply to other prediction tasks as well. In Section 4, we report our empirical results, which demonstrate that quite accurate classifiers can be constructed from small data sets (around 100 rows of data) with careful data preprocessing and selection of modelling paradigms. We compare the classification accuracy of multiple linear regression, support vector machines and three variations of naive Bayes classifiers. In Section 5, we introduce related research, and in Section 6, we draw the final conclusions.
2 Approach
A combination of descriptive and predictive modelling is often very fruitful when we do not have enough data to learn models purely from data but, on the other hand, do not want to rely on any ad hoc assumptions. The idea is to analyze the data first by descriptive techniques (classical data mining) to discover existing patterns in the data. The patterns can be, for example, association rules, correlations or clusterings. Often the process is iterative, and we have to try different feature sets before we find any meaningful patterns. In a successful case, we discover significant dependencies between the outcome variable and other variables and get information about the form of the dependency. All this information is utilized in the predictive modelling phase (classical machine learning), which produces the actual models for prediction.
In the second phase, the system constructor should first decide on the most appropriate modelling paradigm or paradigms and then restrict the set of suitable model structures. The selection of the modelling paradigm depends on several factors, like data size, number and type of attributes, and form of dependencies (linear or non-linear). In the next section, we will give some general guidelines for the educational domain, but the most successful descriptive paradigms also hint at suitable predictive paradigms. For example, strong correlations support linear regression and strong associations support Bayesian methods.
If we have very little data, the model structure cannot be defined from data, but we can compose it according to descriptive patterns found in the first phase.
If the descriptive analysis suggests several alternative model structures, it is best to test all of them, and select the one with smallest generalization error.
The model parameters are always learnt from data. In small data sets it often happens that some parameters cannot be defined, because of the lack of data.
A couple of missing variables can be handled by well-known heuristic tricks, but several missing variables is a sign of too complex a model.
3 Classifiers for educational data
The main problem in intelligent tutoring systems (ITSs) is classification. Before we can select any tutoring action, we should classify the situation: whether the student masters the topic or not, whether she or he will pass the course or not. However, the problem contains so many uncertain or unknown factors that we cannot classify the students deterministically into two mutually exclusive classes. Rather, we should use additional class values (e.g. the mastering level is good, average, or poor) or estimate the class probabilities. The latter approach is preferable, because additional variables always increase the model complexity and make it more inaccurate. On the other hand, the class probabilities can always be interpreted as intermediate class values, if needed.
Typically, the student profile contains too many attributes for building accurate classifiers, and we should select only the most influential factors for prediction purposes. In addition, the domains of numeric attributes are large, and the data is not representative. In practice, we have to combine attributes and reduce their domains as much as possible without losing their predictive power.
As a rule of thumb, it is suggested that the data set should contain at least 10 rows of data for each model parameter. In the simplest models, like naive Bayes classifiers using binary attributes, this means about $n/20$ independent attributes, where $n$ is the data size.
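The arithmetic behind this rule of thumb can be sketched as follows. The sketch assumes a binary class variable and counts one free parameter for the class prior plus two conditional probabilities per binary attribute; the function names are illustrative and do not come from the paper.

```python
ROWS_PER_PARAMETER = 10  # rule of thumb quoted in the text


def naive_bayes_parameter_count(num_binary_attributes: int) -> int:
    """Free parameters of a naive Bayes model with a binary class.

    Assumption: one parameter for the class prior P(C = 1) and, for each
    binary attribute X, the conditionals P(X = 1 | C = 0) and P(X = 1 | C = 1).
    """
    return 1 + 2 * num_binary_attributes


def max_binary_attributes(num_rows: int) -> int:
    """Largest attribute count whose parameters still get 10 rows each."""
    max_parameters = num_rows // ROWS_PER_PARAMETER
    return max(0, (max_parameters - 1) // 2)


# Data sizes mentioned in the text as typical for educational data.
limits = {n: max_binary_attributes(n) for n in (100, 200, 300)}
```

Under these assumptions, a data set of 100 rows supports four binary attributes, which is of the same order as the approximation n/20 = 5.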
The other consequence of this simple calculation is the model complexity. We should select modelling paradigms, which use as simple models as possible. In the following, we will suggest good candidates for both numeric and categorial data.
All these classifiers are able to produce class probabilities or similar measures instead of deterministic class decisions. We have excluded such commonly used methods as nearest neighbour classifiers, neural networks and variations of decision trees, which would require much more data to work accurately.
3.1 Classifiers for numeric data
The simplest predictive model for numeric data is linear regression. In multiple linear regression we can predict the dependent variable $Y$, given independent variables $X_1, ..., X_k$, by a linear equation $Y = \alpha_k X_k + \alpha_{k-1} X_{k-1} + ... + \alpha_1 X_1 + \alpha_0$. Although the result is a numeric value, it can easily be interpreted as a class value in a binary class problem.
The main assumption in linear regression is that the relationship should be approximately linear. However, the model can tolerate quite large deviations from linearity, if the trend is correct. On the other hand, we can use linear regression as a descriptive tool to identify how linear or non-linear the relationship is. This is done by checking the square of the multiple correlation coefficient
$$r^2 = \frac{Var(Y) - MSE}{Var(Y)},$$
where $Var(Y)$ is $Y$'s variance and $MSE$ is the mean of squared errors. As a rule of thumb, if $r^2 > 0.4$, the linear tendency is significant, and we can quite safely use linear regression.
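A minimal sketch of this check is given below. For brevity it fits a single predictor instead of a full multiple regression, the data is invented for illustration, and the 0.5 cut-off used to turn a numeric prediction into a class value is an assumption, not a value taken from the paper.

```python
from statistics import mean


def fit_simple_regression(xs, ys):
    """Least-squares fit of y = a1 * x + a0 for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a1 = sxy / sxx
    a0 = y_bar - a1 * x_bar
    return a1, a0


def r_squared(ys, predictions):
    """r^2 = (Var(Y) - MSE) / Var(Y), using the population variance."""
    y_bar = mean(ys)
    var_y = mean((y - y_bar) ** 2 for y in ys)
    mse = mean((y - p) ** 2 for y, p in zip(ys, predictions))
    return (var_y - mse) / var_y


def to_class(prediction, threshold=0.5):
    """Interpret a numeric prediction as a binary class (assumed cut-off)."""
    return 1 if prediction >= threshold else 0


# Invented toy data: exercise points and final result (0 = fail, 1 = pass).
points = [0, 2, 3, 5, 6, 8, 9, 11, 12]
passed = [0, 0, 0, 0, 1, 1, 1, 1, 1]

a1, a0 = fit_simple_regression(points, passed)
predictions = [a1 * x + a0 for x in points]
r2 = r_squared(passed, predictions)
classes = [to_class(p) for p in predictions]
```

On this toy data the value of `r2` exceeds the 0.4 limit, so the linear model would be accepted under the rule of thumb.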
The main limitations of applying linear regression in the educational domain concern outliers and collinearity. The linear regression model is very sensitive to outliers, and educational data almost always contains exceptional students, who can pass the course with minimal effort or fail without any visible reason. If the data contains several clear outliers, robust regression [4] can be tried instead.
Collinearity means strong linear dependencies between independent variables. These are very typical in educational data, where all factors are more or less related to each other. It is hard to give exact limits for when the correlations are harmless, but in our experiment the accuracy of the linear regression classifier suffered when the correlation coefficient was $r > 0.7$. Weaker correlations did not have any significant effect.
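A collinearity check of this kind can be sketched as follows. The 0.7 limit is the value reported above; the attribute values are invented, and treating strong negative correlations as harmful too (by comparing the absolute value) is our assumption.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def collinear_pairs(columns, limit=0.7):
    """Return attribute pairs whose absolute correlation exceeds the limit."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > limit:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Invented exercise points for three attributes.
columns = {
    "E": [1, 4, 6, 9, 12, 15],
    "F": [0, 2, 3, 5, 6, 8],
    "D": [5, 1, 4, 0, 3, 2],
}
suspicious = collinear_pairs(columns)
```

Only the pair (E, F) is flagged in this toy example; one attribute of a flagged pair could then be dropped or the two combined before fitting the regression model.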
Support vector machines (SVM) [7] are another good candidate for classifying educational data. The idea is to find data points ("support vectors") which define the widest linear margin between two classes. Non-linear class boundaries can be handled by two tricks: first, the data can be mapped to a higher dimension, where the boundary is linear, and second, we can define a soft margin, which allows some misclassification. To avoid overfitting while still achieving good classification accuracy, a compromise between these two approaches is selected.
Support vector machines are especially well suited to small data sets, because the classification is based on only some data points, and the ratio of data dimension to data size has no effect on model complexity. In practice, SVMs have produced excellent results, and they are generally considered to be among the best classifiers. The only shortcoming is the "black-box" nature of the model. This is in contrast with the general requirement that models in ITSs should be transparent. In addition, selecting an appropriate kernel function and other parameters is difficult, and often we have to test different settings empirically.
3.2 Bayesian classifiers for categorial data
The previously mentioned classifiers are suitable only for numeric and binary data. Now we turn to classifiers which use categorial data. This type of classifier is more general, because we can always discretize numeric data into categorial data with the desired precision. The resulting models are simpler, more robust and generalize better when the course content and students change.
Bayesian networks are a very attractive method for the educational domain, where uncertainty is always involved and we look for a transparent, easily understandable model. Unfortunately, general Bayesian networks are too complex for small data sets, and the models overfit easily. Naive Bayes models avoid this problem. The network structure consists of only two layers, with the class variable in the root node and all the other variables in the leaf nodes. In addition, it is assumed that all leaf nodes are conditionally independent, given the class value. In reality this so-called Naive Bayes assumption is often unrealistic, but in practice the naive Bayes model has worked very well. One reason is that, according to [1], the Naive Bayes assumption is not a necessary but only a sufficient condition for naive Bayes optimality. In empirical tests, naive Bayes classifiers have often outperformed more sophisticated classifiers like decision trees or general Bayesian networks, especially with small data sets (up to 1000 rows) [1].
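How a naive Bayes model turns its parameters into class probabilities can be sketched as follows. The function name and all probability values are invented for illustration; only the two-layer structure and the conditional independence assumption come from the description above.

```python
def naive_bayes_posterior(prior, conditionals, observation):
    """Class probabilities P(C | observation) under the Naive Bayes assumption.

    prior:        {class_value: P(C = class_value)}
    conditionals: {class_value: {attribute: P(attribute = 1 | C = class_value)}}
    observation:  {attribute: 0 or 1}
    """
    joint = {}
    for c, p_c in prior.items():
        likelihood = p_c
        for attribute, value in observation.items():
            p_one = conditionals[c][attribute]
            # Leaves are treated as independent given the class, so the
            # per-attribute probabilities are simply multiplied.
            likelihood *= p_one if value == 1 else 1.0 - p_one
        joint[c] = likelihood
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}


# Invented parameters: class node with two binary leaves (1 = "lot").
prior = {"fail": 0.4, "pass": 0.6}
conditionals = {
    "fail": {"A": 0.3, "B": 0.2},
    "pass": {"A": 0.9, "B": 0.8},
}
posterior = naive_bayes_posterior(prior, conditionals, {"A": 1, "B": 0})
```

The result is a probability for each class value rather than a hard decision, which matches the recommendation to estimate class probabilities.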
In the educational domain, the Naive Bayes assumption is nearly always violated, because the variables are often interconnected. However, the naive Bayes classifier can tolerate surprisingly strong dependencies between independent variables. In our experiments, the model accuracy suffered only when the conditional probability between two leaf node values was $P(F = 0 \mid E = 0) = 0.96$. The average mutual information between those variables was also high, $AMI(E, F) = 0.178$, of the same magnitude as the dependencies between the class variable and the leaf variables ($AMI \in [0.130, 0.300]$). The effect on classification accuracy was about the same as in the linear regression model.
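The dependency measure can be illustrated with the sketch below. This section does not give the exact formula used for the average mutual information, so the sketch assumes the standard Shannon mutual information estimated from relative frequencies, with base-2 logarithms; the base in particular is an assumption.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """I(X; Y) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y))).

    All probabilities are relative frequencies in the paired samples.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (count / n) * log2(count * n / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )


# Invented binary attributes: identical columns and independent columns.
identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Values near zero indicate nearly independent attributes, while larger values indicate dependencies that may violate the Naive Bayes assumption.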
Tree augmented naive Bayes models (TAN models) [2] extend naive Bayes models by allowing additional dependencies. The TAN model structure is otherwise like that of the naive Bayes model, but each leaf node can depend on another leaf node in addition to the class variable. This is often a good compromise between a naive Bayes model and a general Bayesian network: the model structure is simple enough to avoid overfitting, but strong dependencies can be taken into account. In the empirical tests by [2] the TAN model outperformed the standard naive Bayes model, but in our experiments the improvements were not so striking.
Bayesian multinets [3] generalize the naive Bayes classifier further. In Bayesian multinets we can define a different network structure for every class value. This is especially useful when classes have different independence assertions. For example, when we try to classify the course outcomes, it is very common that failed and passed students have different dependencies. In addition, the class of failed students is often much smaller and thus harder to recognize, because of less accurate parameter values. When we define separate model structures for both classes, the failed students can be modelled more accurately. In addition, the resulting model is often much simpler (and never more complex) than the corresponding standard naive Bayes model, which improves generalization.
Friedman et al. [2] have observed that the classification accuracy of both TAN and Bayesian multinet models can be further improved by certain parameter smoothing operations. In "Dirichlet smoothing", or standard parameterization, the conditional probability of $X_i$ given its parent $\Pi_{X_i}$ is calculated by
$$P(X_i = x_k \mid \Pi_{X_i} = y_j) = \frac{m(X_i = x_k, \Pi_{X_i} = y_j) + \alpha_{ijk}}{m(\Pi_{X_i} = y_j) + \alpha_{ij}},$$
where $m(\cdot)$ is the number of data rows that satisfy the given condition and $\alpha_{ij} = \sum_k \alpha_{ijk}$ are Dirichlet hyperparameters. If $\alpha_{ijk} = 0$, the parameters reduce to relative frequencies. Other values can be used to integrate domain knowledge. When nothing else is known, a common choice is to use values $\alpha_{ijk} = 1$. This correction is known as Laplace smoothing. In empirical tests by [2] it produced significant improvements especially in small data sets, but in our experiments the improvements were quite insignificant.
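The smoothed estimate can be computed as in the following sketch. It assumes that the same hyperparameter is used for every value of the attribute, so that the sum over k equals the number of values times that hyperparameter; the function and argument names are illustrative only.

```python
def smoothed_conditional(count_joint, count_parent, alpha_ijk=1.0, num_values=2):
    """Estimate P(X_i = x_k | parent = y_j) with Dirichlet smoothing.

    count_joint:  rows with X_i = x_k and parent = y_j
    count_parent: rows with parent = y_j
    num_values:   number of values X_i can take (2 for binary attributes)

    Assumption: alpha_ijk is identical for all k, so
    alpha_ij = num_values * alpha_ijk.
    """
    alpha_ij = num_values * alpha_ijk
    return (count_joint + alpha_ijk) / (count_parent + alpha_ij)


# Laplace smoothing (alpha_ijk = 1) keeps an unseen combination above zero.
unseen = smoothed_conditional(0, 10)
# With alpha_ijk = 0 the estimate is the plain relative frequency.
relative_frequency = smoothed_conditional(3, 10, alpha_ijk=0.0)
```

The example shows why smoothing matters in small data sets: without it, a value combination that never occurs in the data would receive probability zero and veto the whole class.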
4 Empirical results
Our empirical tests are related to the distance learning Computer Science program ViSCoS at the University of Joensuu, Finland. One of the main goals in ViSCoS is to develop an intelligent tutoring system based on real student data. Java Programming courses have been selected as our pilot courses because of their importance and large drop-out and failure rates. In the first phase, we have developed classifiers which can predict the course success (either passed or failed/drop-out) as early as possible. Currently this information serves only course teachers, but the next step is to integrate intelligent tutors into the course environment.
4.1 Data
Our data set consisted of only exercise points and final grades. The data had been collected in two programming courses (Prog.1 and Prog.2) in two consecutive academic years, 2002-2003 and 2003-2004. In the Prog.1 course, the students learn to use Java in a procedural way, and object-oriented programming is studied only in Prog.2.
The first problem in our modelling task was feature extraction and feature selection. Each class consisted of only 50-60 students, but each student had solved exercises in 19 weeks. To increase the data size and decrease the number of attributes, we created new attributes, which abstracted away the differences between the two academic years. This was relatively easy, because the course objectives had remained the same. The exercises could be divided into six categories: basic programming skills, loops and arrays, applets, object-oriented programming, graphical applications and error handling. The exercise points in each category were summed and normalized so that the maximum points were the same in both years. After that the data sets could be combined. The resulting Prog.1 data set contained 125 rows and four attributes, and the Prog.2 data set contained 88 rows and eight attributes.
Table 2. Selected attributes, their numerical domain (NDom), binary-valued qualitative domain (QDom), and description.
Attr.  NDom.        QDom.          Description
A      {0, .., 12}  {little, lot}  Exercise points in basic programming structures.
B      {0, .., 14}  {little, lot}  Exercise points in loops and arrays.
C      {0, .., 12}  {little, lot}  Exercise points in applets.
D      {0, .., 8}   {little, lot}  Exercise points in object-oriented programming.
E      {0, .., 19}  {little, lot}  Exercise points in graphical applications.
F      {0, .., 10}  {little, lot}  Exercise points in error handling.
TP1    {0, .., 30}  {little, lot}  Total points in Prog.1.
TP2    {0, .., 30}  {little, lot}  Total points in Prog.2.
FR1    {0, 1}       {fail, pass}   Final result in Prog.1.
FR2    {0, 1}       {fail, pass}   Final result in Prog.2.
The attributes, their domains and descriptions are presented in Table 2. In the original data set, all attributes were numerical, but for Bayesian classifiers we converted them to binary-valued qualitative attributes.
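The two preprocessing steps, normalizing the category points to a common maximum and converting them to binary-valued qualitative attributes, can be sketched as follows. The yearly maximum of 16 points and the cut-off at half of the maximum are hypothetical values chosen for illustration; the paper does not state them in this section.

```python
def normalize_points(points, year_max, common_max):
    """Rescale points so that a yearly maximum maps to the common maximum."""
    return points * common_max / year_max


def to_qualitative(points, max_points, cut_fraction=0.5):
    """Map numeric points to {"little", "lot"}; the cut-off is an assumption."""
    return "lot" if points >= cut_fraction * max_points else "little"


# Hypothetical example for attribute B, whose common domain is {0, .., 14}:
# a student earned 8 of 16 possible points in one academic year.
COMMON_MAX_B = 14
normalized = normalize_points(8, year_max=16, common_max=COMMON_MAX_B)
label = to_qualitative(normalized, COMMON_MAX_B)
```

After normalization, rows from both academic years share the same attribute domains and can be pooled into one data set.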
4.2 Classifiers
For model construction, we performed descriptive data analysis, where we searched for dependencies between numeric and categorial attributes by correlation, correlation ratio, mutual information and association rules. The analysis revealed that the total points TP in both courses depended heavily on exercise points and that the dependency was quite linear. Even stronger dependencies were discovered by association rules between the final results FR and binary versions of the exercise attributes. The dependency analysis also revealed dependencies between exercise attributes. Most of them were moderate, except for the dependency between E and F. In addition, we found that the final results in Prog.2 depended strongly on the Prog.1 attributes B, C and TP1. Attribute B had the most effect among successful students and C among failed students, but generally TP1 was the most influential factor.
The goal was to predict the final course results FR1 and FR2 as early as possible, and thus we tested seven different cases:
1. A ⇒ FR1
2. A, B ⇒ FR1
3. A, B, C ⇒ FR1
4. TP1 ⇒ FR2
5. TP1, D ⇒ FR2
6. TP1, D, E ⇒ FR2
7. TP1, D, E, F ⇒ FR2
Two numeric classification methods (multiple linear regression and support vector machines) were applied to numeric data, and three versions of naive Bayes classifiers (standard NB, TAN and Bayesian multinets) to categorial data. In the numeric methods, we had to learn seven different models in both paradigms, but for naive Bayes classifiers, it was enough to learn just one model for Prog.1
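[Figure: network structures of the models. NB1, TAN1 and BMN1: class node FR1 with leaf nodes A, B, C; BMN1 has separate structures for FR1=0 and FR1=1. NB2, TAN2 and BMN2: class node FR2 with leaf nodes TP1, D, E, F; BMN2 has separate structures for FR2=0 (nodes D, E, F, C) and FR2=1 (nodes D, E, F, B).]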