
Masaryk University Faculty of Informatics


Clustering Analysis

in Educational Data

Bachelor’s Thesis Petr Boroš


Declaration

I declare that this thesis is my own work and has not been submitted in any form for another degree or diploma at any university or other institution of tertiary education. Information derived from the work of others has been acknowledged in the text and a list of references is given.


Acknowledgement

I would like to express my deepest gratitude to doc. Mgr. Radek Pelánek, Ph.D. for being my advisor and, in particular, for his unceasing support and wise guidance during both the research and the preparation of this thesis. I would also like to thank Bc. Juraj Nižnan and Mgr. Jiří Řihák for being part of the Tutor research team and sharing lots of interesting insights. Last, but not least, I would like to thank my family and friends for creating such a great study environment.


Keywords

cluster analysis, similarity graphs, Laplacian matrices, spectral clustering, educational data, Problem Solving Tutor


Abstract

Spectral clustering is a complex clustering technique based on linear algebra. An intelligent tutoring system is a machine-based tool that helps students reach their learning goals more effectively. Problem Solving Tutor is an intelligent tutoring system developed at our university and used by over 10,000 students. We explain the theory behind spectral clustering and then propose an application of spectral clustering in intelligent tutoring systems, namely automatic concept detection – finding groups of similar problem instances within a problem class. The proposed method is evaluated on real user data from Problem Solving Tutor. With the help of this method we also analyze one particular problem from Problem Solving Tutor.


1 Introduction

Education is as old as human society itself. Computers were invented relatively recently, but they have changed the world considerably. One of the fields where computers and education meet is virtual tutoring. We have seen that intelligent tutoring systems help people learn more effectively and efficiently. How can we raise the performance of these systems and make them better?

Intelligent tutoring systems

Every student in a school class of 20 people has different needs. However, a common practice is to use one identical book for all of them. The order of exercises is the same for every student, although one student may run into difficulty when solving them while her classmate might be bored by them. A teacher usually mitigates this problem by an individual approach, but this requires a lot of effort.

The rise of the computing era has opened new possibilities in a lot of different fields, and education is one of them. Nowadays it is possible to have a system which gives exercises to a student, tracks her results and then modifies the order of problems so that it suits her best and makes the process of learning more effective. Such systems are called intelligent tutoring systems.

Future of virtual tutoring

Intelligent tutoring systems have not had a large impact on the world’s education so far. They are expensive to develop and, until relatively recently, computing power was costly to deploy. However, both of these obstacles are shrinking and education will be facilitated by intelligent systems more and more [3].

There are academic conferences covering both modeling in ITS in general (such as the AIED conference) and educational data mining, which is an integral part of ITS (such as the EDM conference). There are also projects such as Khan Academy which represent a big leap forward in bringing virtual tutoring closer to people [17]. However, we are not aware of a wider penetration of intelligent tutoring systems in common education¹. We believe that ITSs must perform better in order to achieve that goal.

Clustering and its application in ITS

Intelligent tutoring systems are largely dependent on mathematical statistics. One of the branches of statistics deals with clustering as a method of data analysis. Many different clustering algorithms are known. This thesis investigates how one of them, spectral clustering, can be applied to educational data.

Applying clustering methods in ITS could make them perform better because we can, for example, incorporate the knowledge gained by clustering into prediction models.

An intelligent tutoring system called Problem Solving Tutor is being developed at our faculty. It will serve as the principal framework for the evaluation of the method.

Related work

A brief history of clustering educational data can be found in [14]. According to the author, little has been written on how to use clustering analysis as part of an e-learning strategy (which is the motivation for this thesis).

A good general introduction to clustering can be found in the comprehensive [5], which covers the foundations of the methods and goes well beyond this thesis. A survey of all the basic algorithms, including k-means, can also be found there. This thesis largely uses the spectral clustering algorithm. The spectral clustering method is well introduced in [18] and some more complex variations are described in [12, 16].

The thesis is grounded in data from Problem Solving Tutor. You can find the application at http://tutor.fi.muni.cz and information about the prediction model can be found in [9, 10].

1. At the time of writing the thesis, Carnegie Learning, with over half a million users, seems to be the most popular system. Even though this number seems large, it is just a small fraction of the people involved in education systems around the world.


Content and contribution

In the following chapter we introduce Problem Solving Tutor, explain how it works, demonstrate an example of a problem solved by a student and finally show what data can be used for cluster analysis. In the third chapter we discuss clustering in general and build the knowledge needed for spectral clustering. We define and introduce the main mathematical tools for the application of spectral clustering.

The fourth chapter is about the application of spectral clustering and its evaluation. We present three different views on the method and evaluate them separately. In the last chapter we conclude and suggest future work.


2 Educational data

This thesis is about educational data. In this chapter we show an example of an intelligent tutoring system and then discuss what data can be used for the analysis. We explain that obtaining labeled data is tricky and that performing a rigorous analysis is difficult.

2.1 Problem Solving Tutor

Problem Solving Tutor is a free web-based application for practising various exercises. A student creates a profile and solves problems, and the system tracks her progress. A prediction model inside the system predicts the student’s results and tries to suggest an exercise which would suit her best. More about the prediction model can be found in [10].

Figure 2.1: An example of a Problem Solving Tutor exercise: Binary Crossword

Figure 2.1 shows an example of a problem from Problem Solving Tutor called Binary Crossword. The student fills the cells with zeros and ones so that all the constraints are satisfied. The constraints are often interconnected and the exercise contains some self-reference.


Problem Solving Tutor measures the time it takes the student to solve the problem. Then it decides whether it is better to suggest the student the next exercise in a row, or to skip a few exercises because practising them would not be efficient.

2.2 Problems, problem instances and concepts

In order to be precise, we need to define a few terms. Exercises in Problem Solving Tutor are organized in units – for example, Robotanist is one unit composed of several exercises. We will call the unit a problem and a concrete exercise (for example, see Figure 2.1) a problem instance or just an instance. A problem is then composed of instances, but there is no detailed classification of the instances. We will present a problem called Grapher where this classification would be convenient.

Figure 2.2: Grapher: 3 different instances

In Grapher, the student is given a graph of a function and needs to write the equation of the function in order to solve the exercise. You can see an example of 3 different instances in Figure 2.2. There is a polynomial, a logarithmic, and a goniometric (trigonometric) function (and there are more instances of every kind in Grapher). Clearly, there is a natural classification of the instances – a polynomial class, a logarithmic class, and a goniometric class. A student who thrives in polynomial functions is not necessarily good at logarithmic functions.

We will call these inherent classes concepts. The prediction model would perform better if it knew about concepts within problems, because then it would be able to differentiate between instances which are within one problem but do not exercise the same ability.

In Grapher, the concepts are clear and one could easily assign them. Nevertheless, the situation is not as easy in other problems. In some problems we have some hypotheses about concepts; in other problems we do not. It would be useful to have the concepts mapped for all the problems, but a mapping assigned by an expert is expensive and error-prone. In this thesis we propose a method for automatic concept detection.

2.3 Input data

Student times

Because we want to develop a method for clustering many different problems, we cannot use any detailed information about the process of solution. This would make the method too narrowly oriented. We can however use the time it took the student to solve an instance, because this information is available within all problems.

The input data then form a students × instances matrix. An example of this matrix for 3 students and 4 instances is shown below this paragraph. For example, Student 2 solved Instance 1 in 42 seconds. Note that some elements are empty as not all students solved all instances. One instance is represented by a column vector of this matrix.

Instance 1 Instance 2 Instance 3 Instance 4

Student 1 212 357 150

Student 2 42 113
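The matrix described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record format and the helper names `build_time_matrix` and `instance_vector` are hypothetical, and apart from the entry for Student 2 and Instance 1 (42 seconds), all values are invented for the example. Missing cells are stored as `None`.

```python
# Hypothetical solving-time records: (student, instance, seconds).
# Only the (Student 2, Instance 1, 42) entry is taken from the text;
# the other records are invented for illustration.
records = [
    ("Student 1", "Instance 2", 120.0),
    ("Student 2", "Instance 1", 42.0),
    ("Student 2", "Instance 3", 95.0),
    ("Student 3", "Instance 1", 60.0),
    ("Student 3", "Instance 2", 210.0),
]


def build_time_matrix(records):
    """Return (students, instances, matrix); unsolved cells are None."""
    students = sorted({s for s, _, _ in records})
    instances = sorted({i for _, i, _ in records})
    row = {s: r for r, s in enumerate(students)}
    col = {i: c for c, i in enumerate(instances)}
    matrix = [[None] * len(instances) for _ in students]
    for s, i, t in records:
        matrix[row[s]][col[i]] = t
    return students, instances, matrix


def instance_vector(matrix, c):
    """Column c of the matrix: the times of all students for one instance."""
    return [r[c] for r in matrix]


students, instances, times = build_time_matrix(records)
```

Here `instance_vector(times, 0)` is the column vector that characterizes Instance 1.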

Student skills

In Section 4.3 we will cluster not only the instances of a single problem, but also the problems themselves. Therefore we need a representation of a problem. As stated earlier, there is a prediction model in Problem Solving Tutor. The model uses stochastic gradient descent to determine a “skill” for every user and problem. A skill is a number which represents the ability of the user to solve the given problem. More on how the model works can be found in [10]. We will use the vector of user skills as a characterization of a problem.

2.4 Labeled data

A general problem with educational data analysis is that the data are usually not labeled [14]. When running experiments, we want to know the desired output so that we are able to measure the performance of the method. When working with educational data, we usually do not have this additional information – in our concrete case we do not know what the concepts within a problem are.

We overcome this difficulty by three different approaches in Sections 4.1, 4.2 and 4.3. First, we mix two different problems together, obtaining a single problem with two concepts. Second, we use an expert concept mapping – concepts defined by the author of the problem. Third, we use our own insight into the studied problems to analyze the output.

Note that the third approach in particular is not very rigorous and may present a research pitfall as “people’s beliefs are shaped largely by their desires”.


3 Preliminaries

In this chapter we introduce clustering as a statistical method in general and then we take a closer look at two concrete algorithms. First, the k-means algorithm, which is a well-known and rather simple method for clustering data. Second, the spectral clustering method, which was originally introduced in [4] and presents a sophisticated approach to data analysis. Before spectral clustering itself we introduce so-called similarity graphs and their Laplacians, which are used by the spectral clustering algorithm.

3.1 What is clustering

Suppose we have some data (e.g. points in a plane) and we want to separate them into several groups so that items in one group are somehow “similar”. This separation is called clustering and the groups are called clusters. More formally, for a set of vectors $X = \{x_1, x_2, \dots, x_n\}$, a clustering technique assigns a label $l_i$ to each vector $x_i$. There is a variety of clustering methods [15] and their use depends on the domain of application, e.g. on whether we need to specify the number of clusters or let the algorithm determine it.

There are several notes to mention. First, this is of course not the only possible definition of clustering. There are more general definitions, e.g. those allowing fuzzy clustering [8]. Second, we did not discuss the “similarity” among items with the same label in the definition. This is for a practical reason – there are many clustering algorithms and they measure “similarity” differently. We will show a simple algorithm called k-means which uses the basic Euclidean distance as a metric. Figure 3.1 shows a general example of clustering. The points in the plane are the input data and the colours are assigned by the clustering algorithm (one colour means one cluster).

3.2 K-means algorithm

K-means is a simple and well-known clustering method. We present it for two reasons. First, it is a good introductory example of a concrete clustering algorithm as it is very straightforward. Second, it is used as a subroutine in the spectral clustering algorithm, which is the main topic of this thesis, and thus we will need it further on.

Figure 3.1: Example of clustering in general.

Intuitively, the algorithm selects k points¹ and thus splits the plane into k parts. Then it computes the centroid of all data points within each part and uses those centroids as the new k points. After a couple of iterations it outputs k clusters.

K-means might be slow and is sensitive to initialization. On the other hand, it is very simple and easy to implement.

Pseudocode of the k-means algorithm can be found in Algorithm 3.1. Figure 3.2 shows an example run of the k-means algorithm (squares represent data points and circles represent centroids).
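The procedure described in this section can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the thesis. It assumes that the initial centroids are drawn from the data points themselves (rather than from arbitrary points in the plane), that an empty cluster keeps its previous centroid, and that iteration stops after `max_iter` rounds at the latest; the function names are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def assign(data, centroids):
    """Index of the nearest centroid for every data point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
            for x in data]


def k_means(data, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)   # assumption: start from k random data points
    labels = assign(data, centroids)
    for _ in range(max_iter):
        for j in range(k):
            members = [x for x, l in zip(data, labels) if l == j]
            if members:               # an empty cluster keeps its old centroid
                centroids[j] = centroid(members)
        new_labels = assign(data, centroids)
        if new_labels == labels:      # clusters did not change: converged
            break
        labels = new_labels
    return labels, centroids


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
labels, centers = k_means(points, 2)
```

On this small example the two left-hand points end up in one cluster and the two right-hand points in the other, whichever data points are drawn initially.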

1. Usually at random, but there are more sophisticated methods.

Algorithm 3.1 k-means algorithm

input data (points in plane), integer k
centroids ← select k random points in plane
for c in centroids do
    clusters[c] ← points x ∈ data for which ||c − x|| ≤ ||c′ − x|| for all c′ ∈ centroids
end for
while clusters changed during the iteration do
    for c in centroids do
        c ← centroid of points in clusters[c]
    end for
    for c in centroids do
        clusters[c] ← points x ∈ data for which ||c − x|| ≤ ||c′ − x|| for all c′ ∈ centroids
    end for
end while
output clusters

3.3 Similarity graphs

In this section we define some terms that will be needed in the spectral clustering algorithm. First, we need to measure the similarity of two vectors. A common way to do this is to use a correlation coefficient. We will use Spearman’s rank correlation coefficient ρ because it is non-parametric and simple to compute. However, it can easily be substituted by a different vector similarity measure, as the rest of the application does not depend on it. We will now define two different similarity graphs.

Definition 1. Let $X = \{x_1, \dots, x_n\}$ be a set of vectors and $\varepsilon$ be a value in $[-1, 1]$. The threshold graph $G = (V, E)$ is an undirected graph with vertices $V = \{v_1, \dots, v_n\}$ and edges $(v_i, v_j) \in E \Leftrightarrow \rho(x_i, x_j) \ge \varepsilon$.

The threshold graph is sometimes referred to as the ε-neighborhood graph.

Definition 2. Let $X = \{x_1, \dots, x_n\}$ be a set of vectors and $k$ be an integer in $[1, n-1]$. The k-nearest neighbour graph $G = (V, E)$ is an undirected graph with vertices $V = \{v_1, \dots, v_n\}$ constructed as follows: for every vertex $v_i$ we sort the other vertices as $(v_{\pi_1}, \dots, v_{\pi_{n-1}})$ so that $\rho(x_i, x_{\pi_1})$ is the highest, $\rho(x_i, x_{\pi_2})$ the second highest, and so on. Then we connect $v_i$ with the vertices $\{v_{\pi_1}, \dots, v_{\pi_k}\}$.

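Note that the k-nearest neighbour graph is not directed and thus not necessarily k-regular: a vertex may gain additional edges from vertices that list it among their own k nearest neighbours. This is slightly counterintuitive but it does not present a problem.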
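The two definitions above can be sketched with NumPy as follows. The sketch makes several assumptions that are not taken from the thesis: the vectors are the rows of `X` and contain no missing values, ties in ranks are broken by position rather than averaged, and self-loops are dropped from the threshold graph. The function names are hypothetical.

```python
import numpy as np


def ranks(v):
    """Ranks 0..n-1 of the entries of v (ties broken by position)."""
    order = np.argsort(v, kind="stable")
    r = np.empty(len(v), dtype=float)
    r[order] = np.arange(len(v))
    return r


def spearman_matrix(X):
    """Pairwise Spearman correlation of the rows of X (Pearson on ranks)."""
    R = np.array([ranks(row) for row in X])
    return np.corrcoef(R)


def threshold_graph(S, eps):
    """Adjacency matrix with an edge (i, j) whenever S[i, j] >= eps."""
    A = (S >= eps).astype(float)
    np.fill_diagonal(A, 0.0)          # assumption: no self-loops
    return A


def knn_graph(S, k):
    """Undirected k-nearest neighbour graph built from similarities S."""
    n = S.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        sims = S[i].copy()
        sims[i] = -np.inf             # a vertex is not its own neighbour
        for j in np.argsort(-sims, kind="stable")[:k]:
            A[i, j] = A[j, i] = 1.0   # the edge is stored in both directions
    return A


X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 9.0],
              [4.0, 3.0, 2.0, 1.0],
              [9.0, 5.0, 3.0, 0.0]])
S = spearman_matrix(X)
A_threshold = threshold_graph(S, 0.25)
A_knn = knn_graph(S, 1)
```

Because every edge is stored in both directions, a vertex in `A_knn` can have more than k neighbours, which is the reason the graph is not necessarily k-regular.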

3.4 Graph Laplacians

The last things we have to define before introducing spectral clustering are Laplacian matrices, or graph Laplacians. They are an important part of spectral graph theory and can be used in several areas of mathematical research [11].

Definition 3. The unnormalized graph Laplacian matrix of graph G is defined as L=D−A where D is G’s diagonal degree matrix and A is G’s adjacency matrix.

Regardless of the concrete graph, all Laplacians have some interesting properties; for instance, an $n \times n$ Laplacian has $n$ non-negative real-valued eigenvalues $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$. More of the properties and their proofs can be found in [18]. Laplacian eigenvectors describe many properties of the graph [11]. Even though the unnormalized Laplacian can be used in basic spectral clustering, there are more elaborate versions of the algorithm using normalized graph Laplacians. There are several versions of normalized graph Laplacians [2], but we will present just one of them as we found out that they yield more or less equivalent results.

Definition 4. The normalized graph Laplacian matrix of graph G is defined as $L_{\mathrm{sym}} = I - D^{-1/2} A D^{-1/2}$, where I is the identity matrix, D is G’s diagonal degree matrix and A is G’s adjacency matrix.
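Definitions 3 and 4 translate directly into NumPy. The following sketch assumes a symmetric adjacency matrix in which every vertex has at least one edge (otherwise $D^{-1/2}$ is undefined); the function names and the small path graph are chosen only for illustration.

```python
import numpy as np


def unnormalized_laplacian(A):
    """L = D - A, where D is the diagonal degree matrix."""
    D = np.diag(A.sum(axis=1))
    return D - A


def normalized_laplacian(A):
    """L_sym = I - D^{-1/2} A D^{-1/2}; assumes every degree is positive."""
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]


# Path graph on three vertices: 0 - 1 - 2
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = unnormalized_laplacian(A)
L_sym = normalized_laplacian(A)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
eig_L = np.linalg.eigvalsh(L)
eig_L_sym = np.linalg.eigvalsh(L_sym)
```

For this path graph the eigenvalues of L are 0, 1 and 3 and those of L_sym are 0, 1 and 2, so both spectra are non-negative and start at zero, as stated above.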

3.5 Spectral clustering

Having introduced the k-means algorithm, similarity graphs and Laplacian matrices, we can introduce the spectral clustering algorithm. One of the big disadvantages of k-means is that it separates clusters by straight lines, i.e. it splits the plane into half-planes in the case of two clusters (when using Euclidean distance as the metric). Figure 3.3 illustrates this problem. On the left-hand side, the data are clustered by spectral clustering and on the right-hand side by k-means. We can clearly see that spectral clustering performs better. More examples can be found in [12].

We will present two different versions of spectral clustering and then evaluate them both in the following chapters. We will see that the more complex version of the algorithm produces better results.

First, we present unnormalized spectral clustering in Algorithm 3.2. It is not clear who the original author of this method is, but some historical notes about the development of spectral clustering can be found in [18].


Figure 3.3: Spectral clustering vs. k-means

Algorithm 3.2 Unnormalized spectral clustering algorithm

input similarity matrix S, integer k
construct a similarity graph G from S
compute the unnormalized Laplacian L of G
let U be the matrix with the first k eigenvectors u_1, u_2, ..., u_k of L (those of the k smallest eigenvalues) as columns
let y_1, y_2, ..., y_n be the rows of U
run k-means on y_1, y_2, ..., y_n, obtain clusters C_1, C_2, ..., C_k
output C_1, C_2, ..., C_k

Algorithm 3.3 Normalized spectral clustering algorithm

input similarity matrix S, integer k
construct a similarity graph G from S
compute the normalized Laplacian L_sym of G
let U be the matrix with the first k eigenvectors u_1, u_2, ..., u_k of L_sym (those of the k smallest eigenvalues) as columns
let y_1, y_2, ..., y_n be the rows of U normalized to norm 1
run k-means on y_1, y_2, ..., y_n, obtain clusters C_1, C_2, ..., C_k
output C_1, C_2, ..., C_k

Second, we show normalized spectral clustering according to Ng et al. [12] in Algorithm 3.3. This algorithm yielded the best results among all the versions of spectral clustering we tried.
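The steps of the normalized variant can be sketched with NumPy as below. This is an illustrative sketch under several assumptions and not the code used for the experiments: the input is already an adjacency matrix of a similarity graph with positive degrees, and the k-means subroutine uses a deterministic farthest-point initialization instead of random initialization so that the example is reproducible. All function names are hypothetical.

```python
import numpy as np


def farthest_first(Y, k):
    """Deterministic initial centroids: first row, then farthest rows."""
    idx = [0]
    for _ in range(k - 1):
        d = ((Y[:, None, :] - Y[idx][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    return Y[idx].copy()


def k_means(Y, k, max_iter=100):
    centroids = farthest_first(Y, k)  # assumption: replaces random init
    labels = np.full(len(Y), -1)
    for _ in range(max_iter):
        d = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = Y[labels == j].mean(axis=0)
    return labels


def normalized_spectral_clustering(A, k):
    d = A.sum(axis=1)                                # degrees, assumed > 0
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)                  # eigenvalues ascending
    U = vecs[:, :k]                                  # k smallest eigenvalues
    Y = U / np.linalg.norm(U, axis=1, keepdims=True) # rows scaled to norm 1
    return k_means(Y, k)


# Two triangles (vertices 0-2 and 3-5) joined by the single edge 2-3
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
labels = normalized_spectral_clustering(A, 2)
```

On this toy graph the two triangles are recovered as the two clusters. The unnormalized variant differs only in using L = D − A and skipping the row normalization.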

Note that spectral clustering is counter-intuitive and it is not obvious at first glance what it does. We will try to explain the basic intuition behind it in the following paragraphs. For more details, insights and formal proofs see Sections 5–8 in [18].

Suppose we want to separate a weighted graph (the weight being the similarity between two vertices) into two groups so that each of the groups contains “the most similar” vertices. A common way to do this is to find a minimum cut which splits the graph. For a larger number of clusters we can generalize the problem to minimum k-cut [6].

One of the difficulties with this approach is that the output clusters might not be balanced. For example, there can be one cluster with a single vertex and one cluster with the rest of the vertices. This is usually an undesired result when applying a clustering algorithm.

We can overcome this issue by penalizing unbalanced clusters. The two most common objective functions that encode this are RatioCut and Ncut [7, 16, 18]. Finding the optimal solution is NP-hard. By algebraic manipulation of Laplacian matrix properties it can be shown that spectral clustering approximates the optimal value of RatioCut and Ncut (in the unnormalized and normalized version, respectively). The formal proof can be found in Section 5 of [18].
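For reference, the two objective functions mentioned above are commonly written as follows (in the notation of the tutorial [18], which is an assumption about notation rather than something defined in this thesis): $A_1, \dots, A_k$ is a partition of the vertices, $\bar{A_i}$ is the complement of $A_i$, $w_{uv}$ is the weight of the edge between $u$ and $v$, and $d_u$ is the degree of vertex $u$.

```latex
\begin{align*}
W(A, B) &= \sum_{u \in A,\; v \in B} w_{uv}, \qquad
\operatorname{vol}(A) = \sum_{u \in A} d_u, \\
\operatorname{cut}(A_1, \dots, A_k) &= \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A_i}), \\
\operatorname{RatioCut}(A_1, \dots, A_k) &= \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{|A_i|}, \\
\operatorname{Ncut}(A_1, \dots, A_k) &= \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{\operatorname{vol}(A_i)}.
\end{align*}
```

Both objectives divide the weight of each cut by the size of the cluster, measured by the number of vertices in RatioCut and by the total degree in Ncut, so a cluster consisting of a single vertex makes the sum large.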


4 Application and evaluation

We propose an application of spectral clustering to educational data. We take data from Problem Solving Tutor and run experiments on them (the concrete algorithms used are described in Section 3.5). We evaluate the results in this chapter as well.

When evaluating a clustering technique, we need a labeled data set, that is, data relevant to the domain of application with pre-assigned labels. Then we hide the labels for a while, apply the clustering algorithm and see how successful it was in matching the original labeling.

We try to perform such an analysis in Sections 4.1, 4.2 and 4.3. In the first case, we mix two different problems together and let the algorithm decide which instance belongs to which problem. In the second case, we take a closer look at one particular problem from Problem Solving Tutor which has instances labeled by an expert. In the third case we try to cluster problems instead of instances. In the last section we propose further applications of the method.

4.1 Two mixed problems

There are 30 different problems in Problem Solving Tutor. We took the 8 most popular problems (Sokoban, Loop Finder, Nurikabe, Binary Crossword, Tilt Maze, Robotanist, Rush Hour, Region Puzzle) and mixed every pair of problems, obtaining 28 different data sets. We temporarily removed the labels of the original problems, so that every data set is just a set of time vectors (see Section 2.3). Then we ran the spectral clustering algorithm, parametrized to create 2 clusters, on each of those data sets. Finally, we measured the result as the ratio of correctly assigned problem instances to all problem instances. Note that this ratio cannot fall below 50 %, because the two cluster labels are interchangeable; 50 % therefore corresponds to random assignment and 100 % corresponds to a perfect result.
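The evaluation measure can be sketched as follows. The sketch assumes that both the original problems and the clusters are encoded as 0 and 1; since the clustering algorithm does not know which cluster corresponds to which problem, the better of the two possible matchings is taken, which is why the ratio never drops below 50 %. The function name and the example labels are hypothetical.

```python
def two_cluster_accuracy(true_labels, predicted_labels):
    """Fraction of correctly assigned instances, up to swapping the clusters."""
    n = len(true_labels)
    agree = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return max(agree, n - agree) / n


# Invented example: three instances of each problem, one instance misplaced
truth = [0, 0, 0, 1, 1, 1]
found = [1, 1, 0, 0, 0, 0]
accuracy = two_cluster_accuracy(truth, found)
```

In this invented example five of the six instances are matched once the cluster names are swapped, so `accuracy` is 5/6.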

We evaluated both the threshold graph and the k-nearest neighbours graph as the similarity graph. We found empirically that the best setting of the parameters is around ε = 0.25 and k equal to half of the number of instances.


First, we present the results for this particular setting in Tables 4.1 and 4.2. A more detailed discussion of the parameters follows.

                   Sok.  Loo.  Nur.  Bin.  Til.  Rob.  Rus.  Reg.
Sokoban               –  94.2  86.2  95.2  89.3  73.9  85.4  56.6
Loop Finder        94.2     –  86.6  98.6  96.0  89.2  96.6  84.1
Nurikabe           86.2  86.6     –  94.2  92.9  83.5  76.6  84.9
Binary Crossword   95.2  98.6  94.2     –  95.2  69.0  98.3  85.9
Tilt Maze          89.3  96.0  92.9  95.2     –  87.2  87.7  78.0
Robotanist         73.9  89.2  83.5  69.0  87.2     –  86.9  61.5
Rush Hour          85.4  96.6  76.6  98.3  87.7  86.9     –  60.9
Region Puzzle      56.6  84.1  84.9  85.9  78.0  61.5  60.9     –
mean               83.0  92.2  86.4  90.9  89.5  78.7  84.6  73.1

Table 4.1: Spectral clustering on threshold graph (ratio of correctly assigned instances in %)

                   Sok.  Loo.  Nur.  Bin.  Til.  Rob.  Rus.  Reg.
Sokoban               –  96.2  90.4  95.2  92.2  81.2  88.5  58.2
Loop Finder        96.2     –  91.8  97.2  98.5  94.9  95.3  92.0
Nurikabe           90.4  91.8     –  94.2  93.6  87.8  79.4  78.0
Binary Crossword   95.2  97.2  94.2     –  94.6  76.2  97.5  86.5
Tilt Maze          92.2  98.5  93.6  94.6     –  90.5  86.9  66.8
Robotanist         81.2  94.9  87.8  76.2  90.5     –  86.9  67.6
Rush Hour          88.5  95.3  79.4  97.5  86.9  86.9     –  64.9
Region Puzzle      58.2  92.0  78.0  86.5  66.8  67.6  64.9     –
mean               86.0  95.1  87.9  91.6  88.3  83.6  85.6  73.4

Table 4.2: Spectral clustering on k-nearest neighbours graph (ratio of correctly assigned instances in %)

The overall mean performance is 84.8 % for the threshold graph and 86.5 % for the k-nearest neighbours graph. We can see that the results are very good (over 90 %) for problems requiring strong logical reasoning skills (like Loop Finder), but might fall to about 70 % for problems which are based on intuition and luck (like Region Puzzle). This is quite unsurprising as the second kind of problems inevitably contains more noise. The question whether we can do something about those problems as well remains open.

Although random initialization is used in the k-means subroutine, the clustering is very stable (variance is 0.00 for the majority of pairs).

Comparison with k-means

Juraj Nižnan in his thesis [13] applied a simpler clustering method to the same data set as we did. He applied the basic k-means algorithm to the vector of correlations between a single problem and all others. The results can be found in Table 4.3. We can see that, with an overall performance of 82.2 %, this approach yields worse results than the more sophisticated spectral clustering.

                   Sok.  Loo.  Nur.  Bin.  Til.  Rob.  Rus.  Reg.
Sokoban               –  95.5  71.9  93.8  82.1  80.4  78.8  74.7
Loop Finder        95.5     –  81.5  94.5  97.5  82.2  89.3  91.0
Nurikabe           71.9  81.5     –  96.1  91.6  77.2  63.6  67.6
Binary Crossword   93.8  94.5  96.1     –  97.4  61.3  96.6  88.9
Tilt Maze          82.1  97.5  91.6  97.4     –  84.4  80.2  75.8
Robotanist         80.4  82.2  77.2  61.3  84.4     –  80.1  60.0
Rush Hour          78.8  89.3  63.6  96.6  80.2  80.1     –  68.4
Region Puzzle      74.7  91.0  67.6  88.9  75.8  60.0  68.4     –
mean               82.5  90.2  78.5  89.8  87.0  75.1  79.6  75.2

Table 4.3: k-means on correlations between problems (by J. Nižnan [13]; ratio of correctly assigned instances in %)

Setting of parameters

Now we elaborate on the setting of the parameters. Spectral clustering looks complicated, but there are actually just a few parameters that need to be set. We have to set the number of clusters, which presents a problem in general but not in this scenario where we mix two problems. Then we have to choose a similarity graph. There is no agreement on which similarity graph is the right one to choose as it may differ depending on the domain of application. We have chosen the threshold graph and the k-nearest neighbours graph for the reasons described in Section 3.3.

For these two graphs we have to determine the right ε and k. To the best of our knowledge, there is no better way to do this than by experiment. For ε, we tried all values between −0.35 and 0.35 (with a step of 0.05).

Figure 4.1: How to set ε (x-axis: ε, ticks from −0.3 to 0.3; y-axis: performance in %, ticks from 60 to 90)

Determining ε is easy because it “measures” the similarity of two problem instances and thus there is a chance of a single value suiting all pairs of problems. For k, the situation is more complicated because it is connected with topological properties of the underlying graph.

We tried to find a single value of k that would give good results for all pairs of problems, and 30 ≤ k ≤ 50 proved to be a suitable setting; however, the validity of this value might be questioned when the number of problem instances is increased. Thus we decided to vary k based on the number of problem instances. We present a graph (Figure 4.2) which shows that setting k to half of the number of instances is a reasonable choice. Note that as the number of instances varies for every pair of problems, we normalized the x-axis (so it shows the fraction of problem instances, not their exact number).

The gratifying finding from those graphs is that the curves are smooth and there are no significant peaks, which indicates that the algorithm is not very sensitive to the exact setting of the parameters.
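The experimental search described in this section can be sketched as a simple grid search. The helper names are hypothetical, and `score` stands for any user-supplied function that runs the clustering for a given parameter value and returns the mean performance; the stand-in score in the example is invented and is not the measured curve.

```python
def epsilon_grid(lo=-0.35, hi=0.35, step=0.05):
    """Candidate thresholds from lo to hi (inclusive) with the given step."""
    n = round((hi - lo) / step)
    return [round(lo + i * step, 2) for i in range(n + 1)]


def k_for(num_instances, fraction=0.5):
    """k as a fraction of the number of instances, at least 1."""
    return max(1, int(num_instances * fraction))


def best_parameter(candidates, score):
    """Candidate with the highest score; `score` is supplied by the caller."""
    return max(candidates, key=score)


grid = epsilon_grid()
# Invented stand-in for the measured performance, peaking at 0.25
best_eps = best_parameter(grid, lambda eps: -abs(eps - 0.25))
k = k_for(60)
```

The grid contains 15 candidate values of ε, and `k_for(60)` gives k = 30 for a data set with 60 instances.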

Figure 4.2: How to set k (x-axis: k as a fraction of the number of problem instances, ticks from 1/8 to 3/4; y-axis: performance in %, ticks from 81 to 87)

4.2 Binary Crossword

Taking two different problems and mixing them is useful for justifying the validity of the method; however, it is useless in practice. One of the suggested applications of clustering in educational data is finding concepts within a single problem. Binary Crossword from Problem Solving Tutor has an expert-based concept mapping. Therefore we may run our algorithm on the Binary Crossword problem and compare the computer-generated results with the expert mapping.

Of course, we may always question the validity of the expert mapping. We will however see that the algorithm matches the expected result quite well and in addition brings interesting insight into this particular problem.

Recall that Binary Crossword is a problem where the student fills zeros and ones into cells so as to fulfill some constraints. There are 3 expert-based concepts in Binary Crossword – binary numbers (counting in the binary numeral system), logical operations (working with AND, OR, etc.) and crosswords (complex tasks, often self-referential). You can see an example of Binary Crossword in Figure 4.3.


Figure 4.3: Example of Binary Crossword

One of the advantages of spectral clustering is the possibility to use the computed eigenvectors to plot data in a low dimensional space [1]. Even though we can express the performance of the algorithm as 0–100 % score as in the previous case, it is much more interesting to see the actual distribution of problem instances in 2D plane.

You can see this projection in Figure 4.4. Every point represents one instance of Binary Crossword. The concepts mapped by the expert are depicted by different colors (green, blue, red; see the legend). Dashed lines, created by our algorithm, split the plane into three clusters. This picture is more explanatory than just listing the clusters. The distance between instances should roughly correspond to their similarity, and we can see that some instances are more similar to each other and some are less.

Axes are not labeled. They correspond to the second and third eigenvector (the first eigenvector is constant, see Section 3.4 and [18]), but depicting concrete values would be meaningless.
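How such a projection is obtained can be sketched with NumPy. The sketch assumes a connected similarity graph and uses the unnormalized Laplacian L = D − A, for which the first eigenvector is constant; whether the figure was produced from the unnormalized or the normalized Laplacian is not stated here, so this choice is an assumption. The function name and the toy graph are hypothetical.

```python
import numpy as np


def embedding_2d(A):
    """Eigenvalues, first eigenvector and 2D coordinates from L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)    # eigenvalues in ascending order
    first = vecs[:, 0]                # belongs to eigenvalue 0
    coords = vecs[:, 1:3]             # second and third eigenvector
    return vals, first, coords


# Toy graph: two triangles (0-2 and 3-5) joined by the edge 2-3
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
vals, first, coords = embedding_2d(A)
```

Each row of `coords` gives the position of one vertex in the plane. The first eigenvector carries no information because all of its entries are equal, which is why it is skipped.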

We can see that all binary numbers instances have been correctly assigned to one cluster, except for a group of six instances. We analyzed those instances and found out that five of them are exercises on addition and subtraction. We think that, as those operations are linked with logical operations, their assignment to this cluster makes sense and brings further insight into the Binary Crossword problem.

Figure 4.4: Binary Crossword – projection of instances (legend: Binary numbers, Crosswords, Logical operations, Arithmetics; dashed lines separate the 1st, 2nd and 3rd cluster; instances (a)–(d) are marked)

Now we take a look at four separate instances which we found interesting to discuss in more detail. These instances are marked in Figure 4.4 and you can find the concrete exercises in Figure 4.3.

The first instance (a) is a typical representative of the binary numbers concept. The student needs to write decimal numbers in the binary system. According to the clustering output it lies deep in the second cluster, which is the correct assignment.

The second instance (b) is an example of the logical operations concept. The student performs logical conjunction and negation. Again, it lies in the core of the third cluster.


The most interesting is the crosswords concept. It contains quite different instances and it is not easy to characterize them. Instance (c) is an example of a crossword. The last instance (d) lies on the border of the second and third cluster. It was assigned to crosswords by an expert; however, we can see that it contains logical operations and binary numbers, so its position makes sense.

In this section we have analyzed one particular problem and performed an example run of our algorithm. We tried to evaluate the results; however, this evaluation is largely based on experience with this particular problem. We believe there is no easy way to evaluate the results more formally and rigorously.

4.3 Problem-level clustering

So far we have clustered instances of one particular problem (in the case of Binary Crossword) or told apart instances of two mixed problems. We would like to take a step further and try clustering whole problems, i.e. not their instances. When clustering instances, we took a vector of user times as the characterization of an instance. Now we take a vector of user skills as the characterization of a problem. (See Section 2.3 for further details on input data.) Of course, skills are computed by a prediction algorithm and are only as good as this algorithm is. However, we have decided not to question the quality of the computed skills in this thesis. More on the prediction model used can be found in [10].

A downside of clustering problems is that we cannot rigorously measure the performance of the algorithm. In the case of two mixed problems we knew what the right answer should be. In the case of Binary Crossword we had an expert mapping of instances to clusters.

In this case the situation is more difficult. One problem is that there is no mapping of whole problems. Problems in Problem Solving Tutor are split into two categories: logical (“puzzle”) and educational (“math” and “informatics”). However, this taxonomy should be treated with caution: for example, Robotanist is in the educational category, but the student needs a good command of logical thinking to solve its more demanding instances.


We have selected the 11 most solved problems from Problem Solving Tutor (Binary Crossword, Broken Calculator, Loop Finder, Number Maze, Polyminoes, Region Puzzle, Robotanist, Rush Hour, Sokoban, Tents and Tilt Maze) and performed the analysis on them. We used the same technique as in the previous section (spectral clustering with a k-nearest neighbours graph). We tried to partition the data into 2 and 3 clusters.
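The pipeline named above (similarity matrix, k-nearest neighbours graph, Laplacian eigenvectors, k-means) can be sketched as follows. This is a minimal illustration and not the code from the thesis attachment: the unnormalized Laplacian, the symmetrization rule of the graph, the farthest-first k-means initialization, the Gaussian similarity and the synthetic data are all assumptions made for the example, and all function names are hypothetical.

```python
import numpy as np

def knn_graph(S, k):
    """Symmetric k-nearest-neighbour graph built from a similarity matrix S.

    Assumes non-negative similarities; i and j are connected if either one
    has the other among its k most similar vertices.
    """
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        sims = S[i].copy()
        sims[i] = -np.inf                     # exclude self-loops
        neighbours = np.argsort(sims)[-k:]    # indices of the k most similar vertices
        W[i, neighbours] = S[i, neighbours]
    return np.maximum(W, W.T)

def spectral_embedding(W, n_clusters):
    """Rows are points given by the first n_clusters eigenvectors of L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    _, vectors = np.linalg.eigh(L)            # eigenvalues in ascending order
    return vectors[:, :n_clusters]

def kmeans(X, n_clusters, n_iter=100):
    """Plain k-means with a deterministic farthest-first initialization."""
    centers = [X[0]]
    for _ in range(n_clusters - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Synthetic stand-in for the real feature vectors: two well-separated groups.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(8, 5)),
               rng.normal(10.0, 0.5, size=(8, 5))])
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
S = np.exp(-sq_dists / 2.0)                   # Gaussian similarity (assumption)

embedding = spectral_embedding(knn_graph(S, k=5), n_clusters=2)
labels = kmeans(embedding, n_clusters=2)
```

The `embedding` array is the intermediate result that exists just before the k-means step; `labels` is the final partition.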

There is little sense in just listing the clusters. Instead, we present the results in Figure 4.5 (2 clusters) and Figure 4.6 (3 clusters). Note that the first figure is one-dimensional only and should roughly correspond to splitting the problems into logical and educational categories. The interpretation of the second figure is less clear; however, it shows which problems are more similar to others (the smaller the Euclidean distance between two problems, the more similar they are). Recall that these figures come from the last step of the spectral clustering algorithm, just before the k-means run.

[Figure: problems placed along a single axis; labels: Tilt Maze, Number Maze, Rush Hour, Polyminoes, Region Puzzle, Sokoban, Tents, Loop Finder, Robotanist, Broken Calculator, Binary Crossword]

Figure 4.5: Problem-level clustering (1D)

4.4 Further applications

We have seen that the algorithm performs very well in telling two problems apart. We have also seen that it can bring interesting insight into problems. Although deeper knowledge of the problem is needed to interpret the analysis, the algorithm output is reasonable and can be trusted (at least within the segment of similar problems).


[Figure: problems placed in a plane; labels: Sokoban, Number Maze, Tilt Maze, Rush Hour, Tents, Region Puzzle, Polyminoes, Loop Finder, Robotanist, Broken Calculator, Binary Crossword]

Figure 4.6: Problem-level clustering (2D)

One of the direct applications is, of course, to find concepts in problems automatically. This has two impacts. First, intelligent tutoring systems can perform better skill prediction, which is crucial for the users of the system. Second, the problems can be automatically organized more logically, focusing on the development of different skills rather than on the different problems themselves. To the best of our knowledge, both of those goals are being addressed by the Problem Solving Tutor research and development team and will be implemented in the next version of the system.

Another way to go is to cluster not problems and instances but students. There has been some research on this topic before [14], but to the best of our knowledge, this has not been implemented in Problem Solving Tutor so far.

Our last suggestion is to investigate hierarchical clustering. We have examined only partitional methods (i.e. those that split data into several classes), but some problems may be naturally composed of nested concepts and the concept tree might not be flat.
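To make the idea of a non-flat concept tree concrete, the following sketch runs agglomerative clustering from scipy on hypothetical toy points that form two coarse groups, each made of two finer groups. The data, the average-linkage method and the chosen cluster counts are assumptions for illustration only; they are not results or settings from this thesis.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: four tight pairs forming two coarse groups.
points = np.array([
    [0.0, 0.0], [0.0, 0.1],     # fine group 1 (coarse group A)
    [1.0, 0.0], [1.0, 0.1],     # fine group 2 (coarse group A)
    [10.0, 0.0], [10.0, 0.1],   # fine group 3 (coarse group B)
    [11.0, 0.0], [11.0, 0.1],   # fine group 4 (coarse group B)
])

# Build the merge tree once, then cut it at two different levels.
tree = linkage(points, method="average")
coarse = fcluster(tree, t=2, criterion="maxclust")   # at most 2 clusters
fine = fcluster(tree, t=4, criterion="maxclust")     # at most 4 clusters
```

A single tree yields both the coarse and the fine partition, whereas a partitional method has to be rerun for each number of clusters.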


5 Conclusion

The thesis focused on the application of spectral clustering to educational data. We have introduced Problem Solving Tutor as our benchmark for testing the algorithms. We have defined clustering in general and then presented the k-means algorithm and various kinds of spectral clustering algorithms.

We have shown that the proposed spectral clustering method gives very good results (about 85 %) when telling two problems apart, and we may therefore conclude that it can be used for clustering educational data. We have also seen on the example of Binary Crossword that the method's output can bring interesting results when combined with expert knowledge. Finally, we have seen that we can use clustering not only on the level of instances, but also on the level of whole problems.
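When the correct assignment is known, as in the two mixed problems setting, one common way to score a clustering is the share of items placed in the right group under the best matching of cluster labels to true labels. The sketch below shows this score; whether it matches the exact measure behind the figure quoted above is an assumption, and the function name and the example labels are hypothetical. It assumes there are no more clusters than true classes.

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(true_labels, predicted_labels):
    """Best share of correctly assigned items over all cluster-to-class matchings.

    Cluster ids are arbitrary, so every one-to-one assignment of clusters to
    true classes is tried. Assumes no more clusters than classes.
    """
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(predicted_labels)
    best = 0.0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        mapped = np.array([mapping[c] for c in predicted_labels])
        best = max(best, float(np.mean(mapped == true_labels)))
    return best

# Hypothetical example: eight instances from two problems, one misassigned.
truth = [0, 0, 0, 0, 1, 1, 1, 1]
found = [1, 1, 1, 0, 0, 0, 0, 0]
score = clustering_accuracy(truth, found)   # 7 of 8 correct after relabeling
```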

Further work should go in two directions. First, develop the method further and analyze different parameter settings, experiment with hierarchical clustering, and try to approach the problem from the other side by clustering students. Second, bring the results to the users of intelligent tutoring systems, that is, include them in prediction models and raise their performance.

Results from this thesis (together with other results) will be published in [1] in proceedings of the 16th International Conference on Artificial Intelligence in Education (AIED 2013).


Bibliography

[1] P. Boroš, J. Nižnan, R. Pelánek, and J. Řihák. Automatic detection of concepts from problem solving times. In Artificial Intelligence in Education, 2013 (to appear).

[2] F. R. K. Chung. Spectral Graph Theory, volume 92. American Mathematical Society, 1997.

[3] A. T. Corbett, K. R. Koedinger, and J. R. Anderson. Intelligent tutoring systems. Handbook of human-computer interaction, pages 849–874, 1997.

[4] W. E. Donath and A. J. Hoffman. Lower bounds for the partitioning of graphs. IBM Journal of Research and Development, 17(5):420–425, 1973.

[5] G. Gan, C. Ma, and J. Wu. Data clustering: theory, algorithms, and applications, volume 20. Society for Industrial and Applied Mathematics, 2007.

[6] N. Guttmann-Beck and R. Hassin. Approximation algorithms for minimum k-cut. Algorithmica, 27(2):198–207, 2000.

[7] L. Hagen and A. B. Kahng. New spectral methods for ratio cut partitioning and clustering. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 11(9):1074–1085, 1992.

[8] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3):264–323, September 1999.

[9] P. Jarušek and R. Pelánek. Modeling and predicting students problem solving times. In SOFSEM 2012: Theory and Practice of Computer Science, pages 637–648. Springer, 2012.

[10] P. Jarušek and R. Pelánek. Analysis of a simple model of problem solving times. In Proc. of Intelligent Tutoring Systems (ITS), volume 7315 of LNCS, pages 379–388. Springer, 2012.

[11] B. Mohar and Y. Alavi. The laplacian spectrum of graphs. Graph theory, combinatorics, and applications, 2:871–898, 1991.


[12] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2:849–856, 2002.

[13] J. Nižnan. Learning from problem-solving data. Master’s thesis. Masarykova univerzita, Brno, 2013 (to appear).

[14] C. Romero, S. Ventura, M. Pechenizkiy, and R. S. J. D. Baker. Handbook of educational data mining. CRC Press, 2011.

[15] R. Xu and D. Wunsch II. Survey of clustering algorithms. Neural Networks, IEEE Transactions on, 16(3):645–678, 2005.

[16] J. Shi and J. Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905, 2000.

[17] C. Thompson. How Khan Academy is changing the rules of education. Wired Magazine, 126, 2011.

[18] U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395–416, 2007.


Contents of the electronic attachment

An archive with the source code used during the experiments is included in the thesis repository in IS MU. The code is written in Python 2.7 and uses, above all, the matplotlib, networkx, numpy and scipy packages. The archive contains the following files:

binary.py Binary crossword

cormatrix.py Correlation matrix precomputation

problemlevel.py Problem-level clustering

twomixed_eps.py Two mixed problems (threshold graph)

twomixed_knn.py Two mixed problems (k-nearest neighbours graph)

cormatrix Directory with precomputed cormatrices

data Directory with data

pldata Directory with data for problem-level clustering

utils Directory with utility (common) scripts

All of the source code is original, with the only exception of utils/data_loader.py, utils/model.py and utils/problem_data.py, which were originally created by Radek Pelánek and modified by the author of the thesis.

Note that although the source code is fully operational, its purpose was to run experiments during the research. We advise against using this source code in a real implementation (although it can be used as a springboard for further development).