Corpus-Based Methods for the Unsupervised Grading of Short Answer Questions


_______________

A Thesis Presented to the

Faculty of

San Diego State University

_______________

In Partial Fulfillment

of the Requirements for the Degree Master of Science

in

Applied Mathematics

_______________

by

Eric Shaun Proffitt Summer 2015


Copyright © 2015 by


ABSTRACT OF THE THESIS

Corpus-based Methods for the Unsupervised Grading of Short Answer Questions

by

Eric Shaun Proffitt

Master of Science in Applied Mathematics San Diego State University, 2015

Information retrieval (IR) approaches to semantic relevancy indexing can be extended from the traditional query-document paradigm to the question-answer paradigm within the context of automatic grading. The focus of this paper is to evaluate the success of unsupervised corpus-based approaches to short answer automatic grading IR, applied to the corpus of student responses¹ themselves. We illustrate our methods on two datasets, the first a dataset of 270 answers to a quiz question asked in a first semester calculus class at San Diego State University², and the second a dataset of 29 answers to a quiz question asked in an introductory computer science class at the University of North Texas.

¹ The words ‘response’ and ‘answer’ are used interchangeably in this paper.

² California State University, San Diego


LIST OF TABLES

Table 4.1. Precision/Accuracy results for the calculus dataset, broken down by number of topics (4-7). Parameter values set to C = 65 and α = 0.5.

Table 4.2. The average precision and discounted cumulative gain for the LSA and LDA word-to-word methods (5 topics).

Table 4.3. The mean precision and 95% confidence intervals for LSA-CSS and LDA-CSS bootstrapped at 100 resamples (5 topics).

Table 4.4. The 20 terms most associated with correct answers (LSA-CSS, 5


LIST OF FIGURES

Figure 4.1. Precision graph for the LSA and LDA methods.

Figure 4.2. Precision-recall curves for the LSA and LDA methods (5 topics).

Figure 4.3. Precision graph for the LSI and LSA-CSS methods, applied to the


CHAPTER 1 INTRODUCTION

The recent growth of automatic knowledge assessment in education and testing has paralleled that of machine learning in industry as a whole. Computer assisted essay grading of the TOEFL and GRE, for instance, is now commonplace, with ETS’s E-rater system (Powers et al., 2000; Attali & Burstein, 2005) taking over between one third and one half of the required work. Applications in tutoring, such as the AutoTutor system (Wiemer-Hastings et al., 2005), have learned extensively from tutor conversation patterns, and can now conduct fully automated human-to-human tutoring sessions. Automated learning applications such as these share a common natural language processing (NLP) core, and within NLP, an important sub-discipline is that of information retrieval, which seeks to rank a set of documents $d_1, \ldots, d_n$ according to their relevancy with respect to a query $q$. Queries designate the information requirements indicated by the user; however, which documents are considered most relevant to the query depends entirely on the similarity function/metric which relates them.

In keeping with machine learning convention, learning in IR is either supervised or unsupervised. If the document set is annotated, then a ranking function can be trained in the traditional supervised manner. However in the unsupervised setting, which is the focus of this paper, a similarity metric must instead be proposed a priori, since there exist no training examples from which to learn it.

The advent of unsupervised learning in text similarity dates back to several papers published in the 1950s. Among these papers, arguably the most influential was one published by Luhn (1957), which proposed a measure of text similarity based on direct lexical overlap. In subsequent years, Luhn’s methods were refined into what is now known as the classical vector space model (VSM) of information retrieval (Salton & Lesk, 1971). In an attempt to account for interterm correlation, generalizations such as GVSM and LSI (Wong et al., 1985; Manning et al., 2009) were later developed, and their success in capturing relationships between distinct words resulted in considerable improvements in accuracy. These methods, and many like them, can be applied both to the corpus of to-be-ranked documents themselves and to large external corpora. One such example of the latter is explicit semantic analysis (ESA) (Gabrilovich and Markovitch, 2007), which uses LSA to mine Wikipedia for semantic patterns of text similarity.

Also actively researched are non-vector space models, where the focus is on both word-to-word and text-to-text metrics. Word-to-word similarity is more complex, as it requires an extra step in deciding how best to aggregate for comparison at the textual level (Lintean et al., 2010; Rus & Lintean, 2012); however, there is evidence that it results in more accurate rankings when compared to direct text-to-text metrics (Niraula et al., 2013).

Focusing our attention now on question-answer IR in particular, previous work has been done on both knowledge-based and corpus-based approaches (Mohler & Mihalcea, 2009). The Oxford-UCLES system (Sukkarieh et al., 2004), for instance, requires knowledge of keywords and their synonyms, and a later refinement (Pulman and Sukkarieh, 2005) builds a model using a variety of supervised methods on scored training data, alongside observed syntactic patterns. On the other hand, AutoTutor (Wiemer-Hastings et al., 1999), a corpus-based system, uses a version of latent semantic analysis (LSA) (Landauer and Dumais, 1997), and SELSA (Kanejiya et al., 2003), a hybrid knowledge/corpus-based approach, combines LSA with parts-of-speech tagging.

In this paper, we will rank the similarity of student responses to an instructor-provided solution using a variety of unsupervised corpus-based approaches to automatic short answer grading. Specifically, we will apply these methods to the corpus of student responses themselves, which, to our knowledge, has not yet been done in the literature. In addition, we introduce several well motivated refinements in the application of word-to-word based approaches to information retrieval, and short answer grading in particular. We then conduct two experiments, the first our primary experiment in which all vector space and word-to-word models are compared on a new dataset of 270 quiz-question responses asked in a first semester calculus class at San Diego State University, and the second a small ancillary experiment, using only the vector space and word-to-word methods which performed best on the calculus dataset, applied to a dataset of 29 responses to a short answer question asked in an introductory computer science class.


CHAPTER 2 METHODS

Beyond direct lexical overlap, bag-of-words approaches often succeed or fail based on their ability to capture word co-occurrence:

$$\text{car} \overset{+}{\leftrightarrow} \text{drive}$$

and synonymy:

$$\text{car} \overset{+}{\leftrightarrow} \text{drive} \overset{+}{\leftrightarrow} \text{automobile}$$

*The plus sign (+) denotes clear positive correlation.

Co-occurrence is arguably the simpler problem, since words of co-occurrence are by definition positively correlated. Synonyms, on the other hand, may exhibit the full range of correlation. For example, the words car and automobile may be positively correlated within a collection of texts on different types of motor vehicles (planes, trains, automobiles, etc.), uncorrelated within a text on English words of French origin, and negatively correlated in reference to a vehicle being driven (“She drives the car.” or “She drives the automobile.”).

In addition, there are often further complexities to consider: a polysemous term might need its self-similarity value tempered, term co-occurrence may only be true conditionally, etc. However, success in teasing out these relationships generally results in performance far superior to that of direct lexical overlap.

In what follows we briefly detail a series of unsupervised learning methods. We begin with the vector space approaches: the vector space model (VSM), the generalized vector space model (GVSM), and latent semantic indexing (LSI). We then move to the word-to-word semantic similarity methods: pointwise mutual information (PMI), latent semantic analysis (LSA), and latent Dirichlet allocation (LDA). We also consider three methods for aggregation: greedy matching, optimal matching, and corpus-based semantic similarity (CSS).

2.1 VECTOR SPACE MODEL

The vector space model (VSM) [23] is the simplest model we discuss; it provides a direct measure of the lexical overlap between documents. It does not take into account any relationships which may exist between distinct words, and is thus blind to both word co-occurrence and synonymy. The semantic similarity of two documents $d_1$ and $d_2$ is the cosine similarity² of their associated document vectors in term space:

$$\operatorname{cossim}(d_1, d_2) \tag{2.1}$$
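As a minimal sketch of Equation (2.1), the cosine similarity of two document vectors can be computed as below. The four-term vocabulary and the binary term weights are assumptions made purely for illustration; the function name `cossim` simply mirrors the notation in the text.

```python
import math

def cossim(x, y):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical vocabulary: [limit, speed, average, derivative]
d1 = [1, 1, 1, 0]
d2 = [1, 1, 0, 1]
similarity = cossim(d1, d2)  # two shared terms, three terms per document
```

Because only shared terms contribute to the dot product, two documents with no terms in common receive a similarity of zero regardless of how closely related their words are.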

2.2 GENERALIZED VECTOR SPACE MODEL

Unlike VSM, which treats term vectors as orthogonal and thus independent, the generalized vector space model (GVSM) [28] dispenses with term orthogonality by inverting the roles of documents and terms in vector space. Terms are now considered as vectors in document space, with the angle and length of a term determined by its (unnormalized) distribution over the set of documents. A new document vector is then reformed from its constituent terms by taking a weighted sum of the term vectors, with the weights being drawn from the entries of the original document vector. The semantic similarity of two documents $d_1$ and $d_2$ is computed by applying to them as a change of basis the document-term matrix $D$, followed by evaluating their cosine similarity:

$$\operatorname{cossim}(D d_1, D d_2) \tag{2.2}$$

We can interpret the term-orthogonality-constraint relaxation as a method of discovering word co-occurrence, as terms which co-occur will tend to have similar distributions over documents, and will thus contribute similarly to the reformed document vector.

² $\operatorname{cossim}(x, y) = \dfrac{x \cdot y}{\|x\|_2 \|y\|_2}$
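The effect of the change of basis in Equation (2.2) can be sketched on a toy corpus. The example assumes that $D$ is stored with documents as rows and terms as columns, so that $Dd$ maps a term-space vector into document space; the three-document corpus and the helper `matvec` are hypothetical.

```python
import math

def cossim(x, y):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Hypothetical document-term matrix: rows are documents,
# columns are the terms [car, drive, automobile].
D = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]
d1 = [1, 0, 0]  # a document containing only "car"
d2 = [0, 1, 0]  # a document containing only "drive"

vsm_sim = cossim(d1, d2)                          # no lexical overlap
gvsm_sim = cossim(matvec(D, d1), matvec(D, d2))   # co-occurrence is picked up
```

In this toy case VSM scores the pair at zero, while GVSM assigns a positive similarity because "car" and "drive" appear together in two of the three documents.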

2.3 LATENT SEMANTIC INDEXING

Another method of measuring semantic similarity is that of latent semantic indexing (LSI). Proposed by Deerwester et al. in 1988 [3], LSI is similar to GVSM in its approach to finding appropriate term vectors via a change of basis. However, instead of using the document-term matrix directly, the singular value decomposition $D = U \Sigma V^T$ is applied to the document-term matrix, and from this decomposition one forms the topic-term matrix $T_k = \Sigma_k^{-1} U^T$, where $\Sigma_k$ is the diagonal matrix of size $k \times k$ containing the singular values of $D$, $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_k$. Thus a document’s term weights weigh the corresponding term’s correlation across topics, as opposed to GVSM, which weighs a term’s (unnormalized) distribution over documents.

We motivate our choice of change of basis in the hope that topic vectors will capture not only terms which strongly co-occur, but also terms which are thematically related. However, unlike GVSM, where all documents contribute, presumably, an equal amount of useful information to our understanding of a term, the monotonic decrease in singular values suggests that past a certain point a topic vector is likely composed of mostly random noise. Thus by choosing $l < k$ we can form $\Sigma_l$, which is an $l \times k$ matrix containing just the top $l$ rows of $\Sigma_k$, and then form the approximation to the topic-term matrix, $T_l = \Sigma_l^{-1} U^T \simeq \Sigma_k^{-1} U^T$. Formally, the semantic similarity of two documents $d_1$ and $d_2$ is computed as follows [14]:

$$\operatorname{cossim}(T_l d_1, T_l d_2) \tag{2.3}$$
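To see what the change of basis in Equation (2.3) does, consider a special case chosen here for illustration: $d$ is the $j$-th document of the corpus, stored as column $j$ of $D$, so that $d = D e_j$ with $e_j$ the $j$-th standard basis vector. Two further reading assumptions are made: $U_l$ denotes the first $l$ columns of $U$, and $\Sigma_l^{-1}$ is taken to be the inverse of the leading $l \times l$ block of singular values. Using the orthonormality of the columns of $U$:

```latex
\begin{aligned}
T_l d &= \Sigma_l^{-1} U_l^T \, U \Sigma V^T e_j \\
      &= \Sigma_l^{-1} \begin{bmatrix} I_l & 0 \end{bmatrix} \Sigma V^T e_j \\
      &= \begin{bmatrix} I_l & 0 \end{bmatrix} V^T e_j .
\end{aligned}
```

Under these assumptions the topic-space representation of a corpus document is simply the first $l$ entries of the $j$-th row of $V$, so truncating to $l$ topics amounts to discarding the coordinates associated with the smallest singular values.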

2.4 WORD-TO-WORD METHODS

Using a bag-of-words approach, it is common for methods of measuring text similarity to aggregate from the somewhat simpler problem of word similarity [21]. Such an approach has the advantage of breaking the problem of text similarity down into many smaller problems, but presents a challenge in deciding how best to reconstitute the whole. Vector addition provides one method of reconstitution; however, as one leaves the vector space paradigm behind, it becomes less clear how best to re-expand to the textual level. Nevertheless three standard methods have been discussed in the literature, all taking similar but distinct approaches: the greedy matching method, the optimal matching method, and corpus-based semantic similarity [16, 18].

The greedy method of aggregating word-to-word similarity begins by iterating through terms in the first document, pairing each term with the most similar among all those still unpaired terms in the second document, and then summing the result. A symmetric variant of this repeats the process, except beginning with terms in the second document and pairing with those in the first. While simple to implement, one drawback of the greedy method is that the value returned is dependent on the order in which terms are paired. Fortunately this dependence on term order can be overcome by transitioning from the greedy to the optimal matching method.

The optimal matching method resolves the problem of term order dependence by searching among all term orderings for the permutation which produces the largest similarity value. This method is in fact an implementation of the well known assignment problem in combinatorial optimization. In general, considering the permutations one-by-one is a problem with an expected runtime which scales as ω(n!) with the number of terms. Fortunately there exists a polynomial time solution to the assignment problem, namely the Hungarian algorithm [7]. Less fortunate, though manageable for short documents, is that in the symmetric version one document is likely to be longer than the other, meaning that if we let $n$ be the number of terms in the longer document and $k$ the number of terms in the shorter, then the number of subset choices of terms to be paired from the longer document to the shorter will be $\binom{n}{k}$.
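The difference between the two matching schemes can be sketched on a small similarity matrix. This is an illustrative sketch only: the similarity values are invented, the function names are hypothetical, and the optimal matcher enumerates assignments exhaustively for clarity rather than using the Hungarian algorithm, which would return the same value in polynomial time.

```python
from itertools import permutations

def greedy_match(S):
    """One-way greedy matching. Rows of S are terms of document 1, columns are
    terms of document 2; each row term takes the best still-unpaired column."""
    unpaired = set(range(len(S[0])))
    total = 0.0
    for row in S:
        if not unpaired:
            break
        best = max(unpaired, key=lambda j: row[j])
        total += row[best]
        unpaired.remove(best)
    return total

def optimal_match(S):
    """Exhaustive optimal matching; assumes len(S) <= len(S[0])."""
    n_rows, n_cols = len(S), len(S[0])
    return max(
        sum(S[i][j] for i, j in enumerate(cols))
        for cols in permutations(range(n_cols), n_rows)
    )

# Hypothetical word-to-word similarities (values invented for illustration).
S = [
    [0.9, 0.8],
    [0.7, 0.1],
]
greedy_total = greedy_match(S)    # pairs (0, 0) first, leaving (1, 1)
optimal_total = optimal_match(S)  # pairs (0, 1) and (1, 0)
```

Here the greedy pass commits the first term to its best partner and leaves the second term with a poor match, whereas the optimal assignment accepts a slightly worse first pairing in exchange for a much better second one.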

The third and final method discussed is corpus-based semantic similarity (CSS). CSS is similar to both greedy and optimal matching in that it pairs each term in one document with the most similar term in the other. However, with CSS there are no restrictions on the number of pairs a term can belong to. Thus CSS is in fact an identical relaxation of both greedy and optimal matching.
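A minimal sketch of the CSS pairing step follows, using an invented similarity matrix whose rows are terms of one document and whose columns are terms of the other. The function name `css_one_way` is hypothetical, and normalization is deliberately left out.

```python
def css_one_way(S):
    """Sum, over the row terms, of the best similarity to any column term.
    A column term may be reused by several row terms."""
    return sum(max(row) for row in S)

# Hypothetical word-to-word similarities (values invented for illustration).
S = [
    [0.9, 0.8],
    [0.7, 0.1],
]
S_transposed = [list(col) for col in zip(*S)]

forward = css_one_way(S)              # document 1 terms paired into document 2
backward = css_one_way(S_transposed)  # document 2 terms paired into document 1
```

Since every term is free to choose its best partner, each one-way CSS sum is at least as large as the corresponding one-to-one matching sum on the same matrix, and it requires no search over assignments.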

As a final note, word-to-word similarity aggregation schemes generally work best with some form of normalization. A standard choice is to divide each one-way sum of pairings by the sum of the term weights in the document from which the pairs originate [16]. We propose a more flexible choice however by raising the sum of term weights to a parameter, α:

$$\left(\sum_{t \in D} t\right)^{\alpha} \quad \text{for } \alpha \in [0, 1]. \tag{2.4}$$

This new parameter $\alpha$ is designed to compensate for sparsity in the document vectors. More specifically, the smaller the value of $\alpha$, the more heavily sparse documents are punished.
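The role of $\alpha$ can be illustrated with a small sketch. The pairing sums and binary term weights below are invented, and the function name is hypothetical; the example compares a one-term document with a perfect match against a four-term document whose terms each match at 0.8.

```python
def normalized_one_way(pair_sum, term_weights, alpha):
    """Divide a one-way sum of pairings by (sum of term weights) ** alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    total_weight = sum(term_weights)
    if total_weight == 0:
        return 0.0
    return pair_sum / (total_weight ** alpha)

# Hypothetical documents with binary term weights.
sparse_pair_sum, sparse_weights = 1.0, [1]          # one term, matched at 1.0
dense_pair_sum, dense_weights = 3.2, [1, 1, 1, 1]   # four terms, matched at 0.8 each

sparse_full = normalized_one_way(sparse_pair_sum, sparse_weights, 1.0)
dense_full = normalized_one_way(dense_pair_sum, dense_weights, 1.0)
sparse_half = normalized_one_way(sparse_pair_sum, sparse_weights, 0.5)
dense_half = normalized_one_way(dense_pair_sum, dense_weights, 0.5)
```

With $\alpha = 1$ the one-term document outranks the four-term document, while with $\alpha = 0.5$ the ordering reverses, which is the sense in which smaller values of $\alpha$ penalize sparsity.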

All three aggregation methods discussed above are modular in the sense that the word similarity metric is by design left unspecified. In the following subsections, we briefly detail three corpus-based choices for this metric: pointwise mutual information, latent semantic analysis, and latent Dirichlet allocation.

2.4.1 Pointwise Mutual Information

Due to its simplicity and straightforward application, pointwise mutual information (PMI) has been proposed as one possible corpus-based metric of semantic similarity [16]. It is calculated as:

$$\operatorname{pmi}(t_1, t_2) = \log \frac{p(t_1, t_2)}{p(t_1)\, p(t_2)} \tag{2.5}$$

for terms $t_1$ and $t_2$. PMI takes the log of the probability that two terms co-occur in a document from the corpus, divided by the product of the individual probabilities that each term occurs in a document independent of the other.
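A minimal sketch of Equation (2.5) with document-level probabilities is given below. Several choices are assumptions rather than details taken from the text: documents are represented as sets of terms, the natural logarithm is used, and `None` is returned whenever a probability is zero. The toy corpus is invented.

```python
import math

def pmi(t1, t2, documents):
    """Document-level PMI of two terms; None when any probability is zero."""
    n = len(documents)
    p1 = sum(t1 in doc for doc in documents) / n
    p2 = sum(t2 in doc for doc in documents) / n
    p12 = sum(t1 in doc and t2 in doc for doc in documents) / n
    if p1 == 0 or p2 == 0 or p12 == 0:
        return None
    return math.log(p12 / (p1 * p2))

# Hypothetical corpus of four short answers, each reduced to a set of terms.
docs = [
    {"limit", "speed"},
    {"limit", "speed"},
    {"average"},
    {"derivative"},
]
value = pmi("limit", "speed", docs)        # terms that always appear together
missing = pmi("limit", "average", docs)    # terms that never co-occur
```

How to treat pairs that never co-occur is a design decision; returning `None` here simply leaves that choice to the caller.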


2.4.2 Latent Semantic Analysis

In order to better capture thematically related words, an alternative method is that of latent semantic analysis. Given the topic-term matrix $T_l$ (see Section 2.3), the semantic similarity of two terms is defined to be the dot product of their associated columns in the topic-term matrix:

$$T_l^{(i)} \cdot T_l^{(j)} \tag{2.6}$$

If we interpret the entries of column $i$ in the topic-term matrix as term $t_i$’s correlation with each of the topics, then two term vectors in topic space will point in a similar direction if they are correlated similarly across topics.

2.4.3 Latent Dirichlet Allocation

Another topic-based approach is that of latent Dirichlet allocation (LDA) [2]. At its core, LDA is a probabilistic graphical model which treats documents as distributions over topics, and topics as distributions over terms. To generate a document $d$, one chooses a length-$k$ vector of topic probabilities $\theta_d$ specifying the probability that each of the $n = 1, \ldots, N_d$ word slots of $d$ is assigned to a particular topic $z_{d,n}$. Once a topic is chosen for a slot, a word $w_{d,n}$ is chosen to fill that slot according to the topic $z_{d,n}$’s associated distribution over terms. The Bayesian hierarchical model is as follows:

$$\begin{aligned}
\theta_d &\sim \operatorname{Dirichlet}(\alpha) \\
z_{d,n} &\sim \operatorname{Categorical}(\theta_d) && \text{for } n = 1, \ldots, N_d \\
w_{d,n} &\sim p(w_{d,n} \mid z_{d,n}, \beta) && \text{for } n = 1, \ldots, N_d
\end{aligned} \tag{2.7}$$

where $\beta$ is a $k \times V$ matrix with each row of $\beta$ specifying the probability mass function of a topic over the vocabulary. The value $z_{d,n}$ thus tells us which probability mass function to use.
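The generative process of the hierarchical model can be sketched by direct sampling. This is an illustration of the model's structure only, not of parameter estimation; the two-topic $\beta$, the vocabulary, and the value of $\alpha$ are invented, and the Dirichlet draw is obtained by normalizing independent gamma variates.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def generate_document(alpha, beta, vocab, n_words, rng):
    """Sample topic proportions and then one (topic, word) pair per word slot."""
    theta = sample_dirichlet(alpha, rng)               # theta_d ~ Dirichlet(alpha)
    topic_ids = range(len(beta))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]   # z_{d,n} ~ Categorical(theta_d)
        w = rng.choices(vocab, weights=beta[z])[0]     # w_{d,n} ~ p(w | z_{d,n}, beta)
        words.append(w)
    return theta, words

# Hypothetical vocabulary and 2 x 3 topic-term matrix (each row sums to 1).
vocab = ["limit", "speed", "average"]
beta = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
rng = random.Random(0)
theta, words = generate_document([0.5, 0.5], beta, vocab, 6, rng)
```

Each word slot first selects a row of $\beta$ through its topic and then draws a word from that row, which is exactly the dependency structure the factorization of the joint distribution relies on.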

If we assume $\alpha$ and $\beta$ are fixed, then for each document $d$, the hierarchical model above encodes the following independencies: the probability of a particular word appearing at the $n$th slot in a document depends only on the topic to which said word is purported to belong, the probability that the $n$th word slot belongs to a particular topic depends only on the topic proportions for said document, and the topic proportions for said document depend only on $\alpha$. With these independencies, along with the assumption that each word slot’s topic choice is made independently, we obtain the following factorization:

$$p(w_d, z_d, \theta_d \mid \alpha, \beta) = p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid z_{d,n}, \beta) \tag{2.8}$$

To obtain the distribution for a generic document $d$, which is technically a distribution telling us the probability of a particular vector of word choices $w_d$, we marginalize over $z_d$ and $\theta_d$:

$$p(d \mid \alpha, \beta) = p(w_d \mid \alpha, \beta) = \sum_{z_d} \int p(w_d, z_d, \theta_d \mid \alpha, \beta)\, d\theta_d \tag{2.9}$$

Next, we assume that documents are independent and compute the total probability of a corpus:

$$p(D \mid \alpha, \beta) = \prod_{d} p(d \mid \alpha, \beta) \tag{2.10}$$

Typically for each document $d$, one is interested in the topic proportions $\theta_d$. To this end, given the corpus of documents $D$, an iterative method is applied to the log-likelihood function $L(\alpha, \beta \mid D)$ in order to estimate the parameters $\alpha$ and $\beta$, which are then used to compute the marginal distribution:

$$p(\Theta_D \mid D, \alpha, \beta) = \prod_{d} \sum_{z_d} \frac{p(\theta_d, z_d, w_d \mid \alpha, \beta)}{p(w_d \mid \alpha, \beta)} \tag{2.11}$$

However, for our purposes, we are only interested in the matrix $\beta$ of each topic’s distribution over terms. Thus once $\beta$ is computed, we can immediately compare the similarity of terms $i$ and $j$ by taking the dot product of their mean subtracted (unnormalized) distributions over topics:

$$\left(\beta^{(i)} - \bar{\beta}^{(i)}\right) \cdot \left(\beta^{(j)} - \bar{\beta}^{(j)}\right) \tag{2.12}$$

The non-mean subtracted dot product $\beta^{(i)} \cdot \beta^{(j)}$ has been proposed previously in the literature (Rus et al., 2013); however, there are two plausible reasons for why mean subtraction produces superior results.

First, by applying mean subtraction, a term with a relatively equal probability of occurring in each of the topics will have the size of its probability vector attenuated considerably. Terms such as this can be interpreted as showing no preference for any particular topic, and thus the fact that the term exists in both documents is less likely to be indicative of text-to-text similarity, either because the term is uninformative and functions in the corpus much like a stop-word,⁴ or because the term is highly polysemous, and its semantic self-similarity value should be discounted on account of the likelihood that the two texts under consideration are using it in different ways.

Second, the mean subtracted dot product is better able to capture the similarity between terms whose distributions match up both at their respective peaks and at their respective troughs. When the number of topics is large this is less important, since terms will have a low probability of occurrence in the vast majority of topics. However, when the theme of discussion is well defined, and the number of topics small, the fact that a term doesn’t belong to a specific topic is more likely to be indicative of the types of situations in which the term is being used.
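Both effects can be seen in a small sketch of the mean subtracted dot product. The two-topic, three-term matrix $\beta$ below is invented for illustration (rows are topics, columns are terms), and the function names are hypothetical.

```python
def plain_dot(beta, i, j):
    """Dot product of columns i and j of beta (terms as vectors over topics)."""
    return sum(row[i] * row[j] for row in beta)

def mean_subtracted_dot(beta, i, j):
    """Dot product of columns i and j after subtracting each column's own mean."""
    col_i = [row[i] for row in beta]
    col_j = [row[j] for row in beta]
    mean_i = sum(col_i) / len(col_i)
    mean_j = sum(col_j) / len(col_j)
    return sum((a - mean_i) * (b - mean_j) for a, b in zip(col_i, col_j))

# Hypothetical 2 x 3 topic-term matrix; each row sums to 1.
beta = [
    [0.25, 0.50, 0.25],
    [0.25, 0.00, 0.75],
]

flat_self_plain = plain_dot(beta, 0, 0)          # term 0 is equally likely in both topics
flat_self_centered = mean_subtracted_dot(beta, 0, 0)
opposed = mean_subtracted_dot(beta, 1, 2)        # terms 1 and 2 prefer different topics
```

The term with no topic preference keeps a positive self-similarity under the plain dot product but is attenuated to zero after mean subtraction, while two terms that peak in different topics receive a negative score rather than a small positive one.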

Finally, an important distinction between LSA and LDA is that LDA is inherently random. Topic quality in LDA can vary considerably depending on the final estimates for α and β. Therefore to obtain a more robust set of topics, it is best to run the topic model many times and average the result.

⁴ For example the word ‘calculus’ would be informative in a general corpus, but not within the corpus of

2.5 PSEUDO-RELEVANCE FEEDBACK

A common approach to unsupervised IR is that of pseudo-relevance feedback (PRF). PRF allows the ranking engine to automatically improve its quality by taking the top ranked (and sometimes bottom ranked) results from an initial query and using them to augment the query, followed by a rerank of the documents. The obvious advantage of such an approach is that the initial ranking provides the machine with access to pseudo-training data which it can use to bootstrap itself towards a superior ranking. On the other hand, the downsides are often an increase in the number of hyper-parameters to be fitted (number of feedbacks, feedback cutoffs, etc.), as well as an expected increase in rank quality variance, since poor initial rankings will be magnified as new feedback queries are performed. Feedback itself inevitably results in query drift, as content and themes of the new expanded query drift further and further away from the author’s original intent; thus a certain threshold in rank quality is often required in order to ensure that any drift improves rather than detracts from this quality.

In terms of a bag-of-words approach to IR, the specific advantage of PRF is the expectation that synonyms and words of co-occurrence related to terms found in the query will be gathered up along with the top ranked documents. These words are then used to expand the query in such a way that upon performing a second document ranking, new documents will rise to the top which may not share much in common with the query at the level of direct lexical overlap.

One well known and oft used implementation of PRF is that of the Rocchio algorithm [15], which augments the original query vector by adding to it the most relevant document vectors, and subtracting from it those which are least relevant. Although the Rocchio algorithm was initially developed for general query-document IR, it has been adapted with mixed results for question-answer IR, specifically in the domain of short answer grading in the sciences [17]. The update step for the Rocchio algorithm is as follows:

$$Q_m = a \cdot Q_o + b \cdot \frac{1}{|D_r|} \sum_{D_j \in D_r} D_j - c \cdot \frac{1}{|D_{nr}|} \sum_{D_k \in D_{nr}} D_k \tag{2.13}$$

where $Q_o$ is the original query, $a$, $b$, and $c$ are weights, $D_r$ the set of relevant documents, and $D_{nr}$ the set of non-relevant documents.
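A single Rocchio update following Equation (2.13) can be sketched as below. The weights, the query, and the document vectors are invented for illustration and are not the values used in our experiments; treating an empty feedback set as contributing a zero vector is likewise an assumption.

```python
def rocchio_update(q_o, relevant, non_relevant, a, b, c):
    """Return Q_m = a*Q_o + b*centroid(relevant) - c*centroid(non_relevant)."""
    def centroid(docs):
        # An empty feedback set contributes nothing (assumption).
        if not docs:
            return [0.0] * len(q_o)
        return [sum(col) / len(docs) for col in zip(*docs)]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    return [a * q + b * r - c * n for q, r, n in zip(q_o, rel, non)]

# Hypothetical three-term vocabulary and feedback sets.
q_o = [1.0, 0.0, 0.0]
top_ranked = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]   # treated as relevant
bottom_ranked = [[0.0, 0.0, 1.0]]                 # treated as non-relevant

q_m = rocchio_update(q_o, top_ranked, bottom_ranked, a=1.0, b=0.5, c=0.25)
```

The updated query gains weight on the second term, which appears only in the top ranked documents, and acquires a negative weight on the third term, which appears only in the bottom ranked one.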

Our experiments indicated that no more than a single update of relevance feedback worked best, as more updates either did little to change the ranking or resulted in unacceptable levels of query drift. However the tiny improvements came at the cost of a considerable amount of hyper-parameter tuning which would have been difficult to justify, and suggested that such a method is not robust and would be unlikely to consistently improve performance.


CHAPTER 3 DATASETS

3.1 CALCULUS DATASET

The motivating problem for this paper derives from the grading of short answer responses to the following question asked in a large enrollment, first semester calculus class:

One could argue that instantaneous speed is a theoretical construct because you can’t define speed without traveling a distance over a given period of time, yet the word instantaneous implies you are measuring an infinitely small period of time. How does calculus enable us to get around this seeming contradiction?

The instructor-provided solution to the question is as follows:

In order to get around this contradiction, calculus uses the concept of limits. To find instantaneous speed, one looks at the sequence of average speeds centered at a point over smaller and smaller timescales. If this sequence approaches a specific number, then this number is referred to as the limit of the sequence, and is defined to be the instantaneous speed at that point.

The data set under consideration includes responses from 270 students enrolled in the class. The answers, which range from a sentence fragment up to several sentences in length, were recorded and then independently scored on a 1-4 scale by four human judges (a professor and three teaching assistants), with the final score the average of the four. The question, solution, and responses were made case insensitive, stripped of all punctuation, stemmed, and spell-corrected. Finally, a go-list¹ was used which allowed through only math terminology or words often used in conjunction with this terminology.

¹ We have not seen the term ‘go-list’ used in the literature, however it seemed appropriate to denote a

3.2 COMPUTER SCIENCE DATASET

The computer science dataset consists of 29 student responses to a short answer question assigned in an introductory computer science class at the University of North Texas. The question is as follows:

What is the role of a prototype program in problem solving?

The instructor-provided solution is as follows:

To simulate the behavior of portions of the desired software.

The answers, which range from a sentence fragment up to several sentences in length, were recorded and then independently scored on a 1-5 scale by two human judges, with the final score the average of the two. The data was cleaned in a fashion identical to that of the calculus dataset and filtered using a stop-list plus the removal of all single appearance words in the answer corpus, as such words would not improve performance for any of the corpus-based methods discussed in this paper.


CHAPTER 4 EXPERIMENTAL SETUP AND RESULTS

4.1 ANALYSIS OF THE CALCULUS DATASET

Our experiments are focused on evaluating the efficacy of corpus-based methods in short answer grading. Specifically, we are interested in comparing vector space models to word-to-word based models, LSA to LDA, and optimal matching to CSS; greedy matching is ignored as summarily inferior to optimal matching. The success of these methods will be determined by comparing accuracy, precision and recall using a cutoff value to determine how many of the top ranked answers will be considered by the algorithm to be ‘correct’.

In preparing the calculus dataset, all responses which scored a 2.5 or below were classified as 0, and all responses above a 2.5 were classified as 1. With this score cutoff, the true number of correct answers was 65, making the percentage of correct answers approximately 24% of the whole. The corpus used for learning was the collection of student responses themselves, plus the solution provided by the instructor. With this corpus, our go-list allowed through a total of 103 math and math-related terms.

When applying both the vector space models and the PMI models, a simple binary weighting was used as it was found to be superior to idf¹ weighting. For LSA and LDA, idf was used since for LSA it tended to form far more coherent topic-term matrices, and for LDA far more coherent topics. In addition, our experiments indicated that a small number of singular vectors/topics worked best, which in retrospect seems reasonable considering the small number of terms and short document lengths.

For the aggregation methods (optimal/CSS), the $\alpha$ parameter was set to 0.5 in an attempt to correct for go-list induced document sparsity, as it was found that not punishing sparsity tended to result in unjustifiably large similarity values for documents composed of only one or two terms. Choices for $\alpha$ in a range around 0.5 also worked well, so long as there was some amount of punishment for sparsity. For the correct answer cutoff a value of $C = 65$ was chosen, as it corresponds to the equal error rate (precision = recall), which in Figure 4.2 is denoted by the four intersection points of the black line $y = x$ with the precision-recall curves. For LDA, in order to produce a robust set of topics, the (unnormalized) distribution of terms over topics was averaged over 500 runs of the LDA algorithm.

¹ Inverse document frequency: $\operatorname{idf}(t) = \log \dfrac{N}{n_t + 1}$, where $N$ is the number of documents in the corpus, and $n_t$

Method         4             5             6             7
VSM            0.446/0.733   Same          Same          Same
GVSM           0.538/0.777   Same          Same          Same
LSI            0.554/0.785   0.615/0.815   0.631/0.822   0.615/0.815
PMI-CSS        0.569/0.793   Same          Same          Same
PMI-Optimal    0.538/0.778   Same          Same          Same
LSA-CSS        0.708/0.859   0.738/0.874   0.738/0.874   0.692/0.852
LSA-Optimal    0.692/0.852   0.708/0.859   0.692/0.852   0.615/0.815
LDA-CSS        0.692/0.852   0.723/0.867   0.723/0.867   0.692/0.852
LDA-Optimal    0.677/0.844   0.708/0.859   0.646/0.830   0.615/0.815

Table 4.1. Precision/Accuracy results for the calculus dataset, broken down by number of topics (4-7). Parameter values set to C = 65 and α = 0.5.
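The relationship between the precision and accuracy figures at a fixed cutoff can be sketched as follows. The totals (270 answers, 65 correct, $C = 65$) come from the text; the count of 48 true positives is inferred here from the reported precision of 0.738 for LSA-CSS with 5 topics and is an assumption, as is the function name.

```python
def precision_accuracy(true_positives, cutoff, n_correct, n_total):
    """Precision and accuracy when the top `cutoff` ranked answers are labelled correct."""
    false_positives = cutoff - true_positives
    true_negatives = (n_total - n_correct) - false_positives
    precision = true_positives / cutoff
    accuracy = (true_positives + true_negatives) / n_total
    return precision, accuracy

# 270 answers, 65 of them correct, top C = 65 labelled correct.
# 48 true positives is an inferred value, not one stated in the text.
precision, accuracy = precision_accuracy(48, 65, 65, 270)
```

Because the cutoff equals the number of truly correct answers, every false positive is matched by exactly one false negative, so precision and recall coincide and accuracy is determined by precision alone.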

Referencing Table 4.1, with the exception of PMI and LSI, which performed similarly, word-to-word methods outperformed vector space methods. In hindsight this finding seems reasonable, considering the rigidity of vector addition when compared to word-to-word matching, along with the fact that VSM, GVSM, and PMI are only capable of capturing term co-occurrence, while LSA, LSI and LDA take advantage of clusters of thematically related words.

Referring to both Table 4.1 and Figure 4.1, it appears that by a slight margin LSA outperforms LDA. Thus given the general consensus that LDA should produce superior topics, LSA’s superior performance may be due in part to a poor choice of word-to-word comparison metric. One alternative to the mean subtracted dot product would be to choose a probability distribution comparison method, such as the Hellinger distance or KL-divergence; however, our experiments, as well as those of others [22], indicate that these methods do not fare as well as the term vector dot product.

Referring to Table 4.2 and Figure 4.2, the average precision (area under the precision-recall curve) and discounted cumulative gain (DCG)² further confirm the slight superiority of LSA over LDA and of CSS over optimal matching, not only at the equal error rate ($C = 65$) but across all correct answer cutoffs.
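A minimal sketch of the discounted cumulative gain computation is given below, following the form $\sum_{i=1}^{C} (2^{rel_i} - 1)/\log(i + 1)$. The base of the logarithm is assumed here to be natural, and the relevance values in the example are invented; neither should be read as the settings behind the reported DCG figures.

```python
import math

def dcg(relevances, cutoff):
    """Discounted cumulative gain over the top `cutoff` ranked items.
    `relevances` lists the relevance of each item in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log(i + 1)   # natural log is an assumption
        for i, rel in enumerate(relevances[:cutoff], start=1)
    )

# Hypothetical ranking: relevant, irrelevant, relevant.
ranked_relevances = [1, 0, 1]
score = dcg(ranked_relevances, cutoff=3)
```

Relevant answers placed lower in the ranking contribute less to the total, so the measure rewards rankings that put the best answers first.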

Comparing optimal matching to CSS (Table 4.1 and Figure 4.1) is also informative. For the equal error rate cutoff, CSS outperforms optimal matching across all three word-to-word methods and all four topic numbers. In addition, for both the average precision and DCG metrics, CSS outperforms optimal matching. Thus since optimal matching is more computationally expensive, there seems little reason to choose it over the simpler and faster CSS.

[Figure: precision (vertical axis) against number of topics, 4 to 7 (horizontal axis), for the LSA-CSS, LSA-Optimal, LDA-CSS, and LDA-Optimal models.]

Figure 4.1. Precision graph for the LSA and LDA methods.

² $\mathrm{DCG} = \sum_{i=1}^{C} \dfrac{2^{\mathrm{rel}_i} - 1}{\log(i + 1)}$

[Figure: precision (vertical axis) against recall (horizontal axis) for the LSA-CSS, LSA-Optimal, LDA-CSS, and LDA-Optimal models.]

Figure 4.2. Precision-recall curves for the LSA and LDA methods (5 topics).

Method         Avg. Precision   DCG (C = 65)
LSA-CSS        0.779            109.565
LSA-Optimal    0.772            107.230
LDA-CSS        0.765            110.092
LDA-Optimal    0.730            105.975

Table 4.2. The average precision and discounted cumulative gain for the LSA and LDA word-to-word methods (5 topics).

4.1.1 Bootstrapping

In order to determine the robustness of precision values obtained using the LSA and LDA methods, we bootstrap 100 resamples from the calculus dataset, treating precision as a statistical indicator of the quality of the topic-term matrix and topic set produced by LSA and LDA, respectively. Computational limitations prevented us from bootstrapping the optimal matching method; however, considering the small differences between optimal and CSS, bootstrapping only LSA-CSS and LDA-CSS seemed a reasonable compromise.

The number of topics was set to five, and although each resampling resulted in a different equal error rate cutoff C, the expected value of C remained at 65. We thus chose to keep the correct answer cutoff at this value. It is also important to note that the expected number of distinct answers in a resample is 171, meaning that LSA and LDA only have approximately 63% of the total amount of information in the answer corpus with which to form coherent topics. Thus the bootstrapped mean precision is expected to be biased downward relative to the precision obtained on the full dataset.
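The figures of 171 distinct answers and roughly 63% follow from a standard result: when n items are drawn with replacement from n items, each item is missed with probability (1 - 1/n)^n, so the expected number of distinct items is n(1 - (1 - 1/n)^n). A short check, with a function name chosen here for illustration:

```python
def expected_distinct(n):
    """Expected number of distinct items in a bootstrap resample of size n drawn from n items."""
    return n * (1.0 - (1.0 - 1.0 / n) ** n)

n_answers = 270  # size of the calculus dataset
distinct = expected_distinct(n_answers)
print(round(distinct), round(distinct / n_answers, 3))
```

As n grows, the fraction approaches 1 - 1/e, which is about 0.632.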

The results in Table 4.3 indicate that the LSA and LDA methods are quite robust. They present a moderate but expected drop in mean precision. Nevertheless the methods produce small 95% confidence intervals with lower bounds which fall well above even LSI precision. This indicates that the LSA and LDA methods are likely to generalize well to other short answer datasets.

Method     Mean Precision    95% CI
LSA-CSS    0.686             [0.664, 0.708]
LDA-CSS    0.667             [0.647, 0.689]

Table 4.3. The mean precision and 95% confidence intervals for LSA-CSS and LDA-CSS bootstrapped at 100 resamples (5 topics).
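The bootstrap procedure behind a table of this kind can be sketched as below. This is a generic illustration rather than the experimental code: the text does not state how the confidence intervals were formed, so a simple percentile interval is assumed, and the `statistic` callable is a hypothetical stand-in for the full pipeline (refitting LSA or LDA on the resample, ranking the answers, and computing precision at C = 65).

```python
import random
import statistics

def bootstrap_ci(items, statistic, n_resamples=100, level=0.95, seed=0):
    """Return the mean of the bootstrapped statistic and a percentile interval."""
    rng = random.Random(seed)
    n = len(items)
    values = []
    for _ in range(n_resamples):
        # Draw n items with replacement, then recompute the statistic.
        resample = [items[rng.randrange(n)] for _ in range(n)]
        values.append(statistic(resample))
    values.sort()
    tail = (1.0 - level) / 2.0
    lo = values[int(tail * n_resamples)]
    hi = values[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return statistics.mean(values), (lo, hi)

# Toy usage: the statistic is just the fraction of correct flags in the resample.
flags = [1] * 6 + [0] * 4
print(bootstrap_ci(flags, statistics.mean))
```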

4.1.2 Word Significance

Once the student answers have been ranked according to their similarity to the solution, the terms can then be ranked according to the significance of their association with the set of correct answers. Upon selecting a correct answer cutoff C, for each term t, a two-sample t-statistic is computed contrasting the proportions of correct and incorrect answers containing term t:

$$T = \frac{\mu_C - \mu_I}{\sqrt{\dfrac{s_C^2}{N_C} + \dfrac{s_I^2}{N_I}}} \tag{4.1}$$

where $\mu_C$ and $\mu_I$ are the respective proportions (sample means), and $s_C^2 = \mu_C(1 - \mu_C)$ and $s_I^2 = \mu_I(1 - \mu_I)$ their associated sample variances.
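Equation (4.1) can be evaluated per term as in the sketch below. It assumes that N_C and N_I denote the numbers of answers classified as correct and incorrect, which the text does not state explicitly; the function name, argument names, and the handling of a zero denominator are likewise choices made for this illustration.

```python
import math

def term_t_statistic(correct_with_term, n_correct, incorrect_with_term, n_incorrect):
    """Two-sample t-statistic contrasting the proportions of correct and
    incorrect answers that contain a given term."""
    mu_c = correct_with_term / n_correct
    mu_i = incorrect_with_term / n_incorrect
    # Bernoulli sample variances s^2 = mu (1 - mu), each divided by its sample size.
    variance = mu_c * (1.0 - mu_c) / n_correct + mu_i * (1.0 - mu_i) / n_incorrect
    diff = mu_c - mu_i
    if variance == 0.0:
        # Assumed convention: no spread in either group.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(variance)

# Hypothetical counts: the term appears in 50 of 100 correct and 20 of 100 incorrect answers.
print(term_t_statistic(50, 100, 20, 100))
```

Sorting the vocabulary by this value in decreasing order gives a term ranking of the kind described here.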

 1. limit*          11. order*
 2. approach*       12. infinite
 3. close           13. travel
 4. infinitely      14. approximate
 5. number*         15. between
 6. formula         16. process
 7. value           17. epsilon
 8. change          18. degree
 9. mathematics     19. decrease
10. indefinitely    20. small*

Table 4.4. The 20 terms most associated with correct answers (LSA-CSS, 5 topics). The starred (*) terms are found in the solution.

Setting C = 65, α = 0.5, and using LSA-CSS with 5 topics, we rank the t-values from high to low. Table 4.4 displays the 20 terms most associated with correct answers. As expected, a number of the terms shown are those found in the solution itself; however, the list also contains many likely candidates for synonyms and words of co-occurrence which don’t appear in the solution. As an example, consider the sentence fragment from the solution:

    If this sequence approaches a specific number,

This could be replaced with the alternative fragment:

    If this sequence becomes infinitely close to a specific value,

which switches out the words ‘approach’ and ‘number’ for the words ‘infinitely’, ‘close’ and ‘value’, all three of which are found in the top 10 highest ranking terms most associated with the set of correct answers.

4.2 Analysis of the Computer Science Dataset

One issue not addressed in the calculus dataset is whether a topic-term matrix built using LSA will be of sufficient quality when the number of responses is small. We limit our analysis of the computer science dataset to a comparison of the efficacy of LSI against LSA-CSS across the topic values 3-22.

In preparation for analysis, a score cutoff of C = 3 was chosen, making the number of true correct answers 15 and thus 52% of the total number of responses. In addition, an idf weighting was used and the sparsity parameter was set to α = 0.5.

As seen in Figure 4.3, LSI and LSA-CSS in fact perform similarly across topics, with both methods attaining precision/accuracy values of 0.933/0.931. Nevertheless LSA-CSS performs far more robustly across topic values, achieving levels of precision superior or equal to that of LSI for 90% of the topic values considered. Since the user will have no access to a validation set with which to manually fit the topic value hyper-parameter, this ability to maintain high levels of precision across a wide range of topic values should be considered a valuable feature when choosing between unsupervised topic-based methods in information retrieval.

[Figure 4.3 plot: precision (y-axis) versus number of topics (x-axis, 3 to 22) for the LSI and LSA-CSS models.]

Figure 4.3. Precision graph for the LSI and LSA-CSS methods, applied to the computer science dataset.


CHAPTER 5 DISCUSSION AND CONCLUSIONS

In this paper we conducted the first analysis of a new short answer question dataset using unsupervised corpus-based approaches. In addition, we successfully used the corpus of student responses themselves, overcoming the small corpus size and short response length by introducing the concept of a go-list. The go-list becomes effective in question-answer IR where antecedent knowledge of the question topic allows one to restrict the vocabulary to the topic’s terminology. Previous studies have indicated that corpus relevance is more important than size in question-answer IR (Mohler & Mihalcea, 2009), and given the close connection between a question and its responses, our experiments further confirm this position.

We also introduced several important refinements to the word-to-word based methods: mean subtraction and model averaging for the LDA method, along with the introduction of an extra normalization parameter α, useful for word-to-word methods when documents are short and extremely sparse. We feel these refinements are well justified in the paper.

Based on our experiments with word-to-word methods, there is little evidence that requiring that terms belong to at most one pair, such as in the greedy and optimal matching methods, improves precision. Therefore, since optimal matching depends on the solution to a combinatorial optimization problem, it seems sensible to prefer the simpler corpus-based semantic similarity. Also of note is that despite LDA’s reputation for producing a more coherent set of topics, the LSA method slightly outperforms LDA, indicating that there might be room for improvement in deciding how best to exploit LDA in the formation of word-to-word similarity metrics. Moreover, as expected, word-to-word methods which build topics outperform on all counts those methods which look only for word co-occurrence or lexical overlap. Finally, our experiments indicate that LSA-CSS performs well even on small datasets, attaining precision comparable to that of LSI while performing better across a wider range of topic values.

CHAPTER 6 COMPUTATIONAL LINGUISTICS GLOSSARY

Corpus - A large or complete collection of writings.

Part-of-speech Tagging - The process of marking up a word in a text (corpus) as corresponding to a particular part of speech.

Polysemy - The coexistence of many possible meanings for a word or phrase.

Semantics - The meaning of a word, phrase, sentence, or text.

Stemming - A process for reducing inflected (or sometimes derived) words to their word stem, base or root form.

Stop-list - List of words to be filtered out before the processing of natural language data (text).

Supervised Learning - Learning to predict labels or scores from labeled example data.

Syntax - The arrangement of words and phrases to create well-formed sentences in a language.


BIBLIOGRAPHY

[1] Y. ATTALI AND J. BURSTEIN, Automated essay scoring with e-rater v.2.0, 2005.

[2] D. BLEI, A. NG, AND M. JORDAN, Latent dirichlet allocation, Journal of Machine Learning Research, 3 (2003), pp. 993–1022.

[3] S. DEERWESTER, S. DUMAIS, G. FURNAS, T. LANDAUER, AND R. HARSHMAN, Improving information retrieval with latent semantic indexing, in Proceedings of the 51st Annual Meeting of the American Society for Information Science, 1988, pp. 36–40.

[4] E. GABRILOVICH AND S. MARKOVITCH, Computing semantic relatedness using wikipedia-based explicit semantic analysis, in Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007, pp. 1606–1611.

[5] D. HIGGINS, J. BURSTEIN, D. MARCU, AND C. GENTILE, Evaluating multiple aspects of coherence in student essays, in Proceedings of the annual meeting of the North American Chapter of the Association for Computational Linguistics, 2004.

[6] D. KANEJIYA, A. KUMAR, AND S. PRASAD, Automatic evaluation of students’ answers using syntactically enhanced lsa, in Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing, vol. 2, 2003, pp. 53–60.

[7] H. KUHN, The Hungarian Method for the Assignment Problem, Springer, 2009, pp. 29–47.

[8] A. KUMAR AND S. SRINIVAS, On the performance of latent semantic indexing-based information retrieval, Journal of Computing and Information Technology, 17 (2009), pp. 259–264.

[9] T. LANDAUER AND S. DUMAIS, A solution to plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review, 104 (1997), pp. 211–240.

[10] C. LEACOCK AND M. CHODOROW, Combining Local Context and WordNet Similarity for Word Sense Identification, The MIT Press, 1998.

[11] C. LEACOCK AND M. CHODOROW, C-rater: Automated scoring of short-answer questions, in Computers and Humanities, vol. 37, 2003, pp. 389–405.

[12] M. LINTEAN, C. MOLDOVAN, V. RUS, AND M. D., The role of local and global weighting in assessing the semantic similarity of texts using latent semantic analysis, in Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, 2010.

[13] H. LUHN, A statistical approach to mechanized encoding and searching of literary information, IBM Journal of Research and Development, 1 (1957), pp. 309–317.

[14] C. MANNING, P. RAGHAVAN, AND H. SCHÜTZE, An Introduction to Information Retrieval, Cambridge University Press, 2009, pp. 412–417.

[15] C. MANNING, P. RAGHAVAN, AND H. SCHÜTZE, An Introduction to Information Retrieval, Cambridge University Press, 2009, p. 181.

[16] R. MIHALCEA, C. CORLEY, AND C. STRAPPARAVA, Corpus-based and knowledge-based measures of text semantic similarity, in American Association for Artificial Intelligence, vol. 1, 2006, pp. 775–780.

[17] M. MOHLER AND R. MIHALCEA, Text-to-text semantic similarity for automatic short answer grading, in Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, 2009, pp. 567–575.

[18] N. NIRAULA, R. BANJADE, D. STEFANESCU, AND V. RUS, Experiments with semantic similarity measures based on lda and lsa, in Statistical Language and Speech Processing, vol. 7978, 2013, pp. 188–199.

[19] D. POWERS, J. BURSTEIN, M. CHODOROW, M. FOWLES, AND K. KUKICH, Comparing the validity of automated and human essay scoring, Technical Report 98-08aR, GRE Board Research Report, 2000.

[20] S. PULMAN AND J. SUKKARIEH, Automatic short answer marking, in Proceedings of the Second Workshop on Building Educational Applications Using NLP, 2005, pp. 9–16.

[21] V. RUS AND M. LINTEAN, A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics, in Proceedings of the Seventh Workshop on Innovative Use of Natural Language Processing for Building Educational Applications, 2012, pp. 157–162.

[22] V. RUS, N. NIRAULA, AND B. RAJENDRA, Similarity measures based on latent dirichlet allocation, in Computational Linguistics and Intelligent Text Processing, vol. 7816, 2013, pp. 459–470.

[23] G. SALTON, A. WONG, AND Y. C.S., A vector space model for automatic indexing, Communications of the ACM, 18 (1975), pp. 613–620.

[24] J. SUKKARIEH, S. PULMAN, AND N. RAIKES, Auto-marking 2: An update on the ucles-oxford university research into using computational linguistics to score short, free text responses, in International Association of Education Assessment, 2004.

[25] P. TURNEY, Mining the web for synonyms: Pmi-ir versus lsa on toefl, in Proceedings of the 12th European Conference on Machine Learning, 491-502, ed., 2001.

[26] P. WIEMER-HASTINGS, E. ARNOTT, AND D. ALLBRITTON, Initial results and mixed directions for research methods tutor, in Supplementary Proceedings of the 12th International Conference on Artificial Intelligence in Education, 2005.

[27] P. WIEMER-HASTINGS, K. WIEMER-HASTINGS, AND A. GRAESSER, Improving an intelligent tutor’s comprehension of students with latent semantic analysis, in Artificial Intelligence in Education, 1999, pp. 535–542.

[28] S. WONG, W. ZIARKO, AND P. WONG, Generalized vector spaces model in information retrieval, in Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1985, pp. 18–25.

[29] Z. WU AND M. PALMER, Verbs semantics and lexical selection, in Proceedings of the

APPENDIX


LSA/LDA-CSS ALGORITHM

Arguments

answers: The document-term matrix of student responses.

solution: The solution vector provided by the instructor.

numOfTopics: The number of singular vectors for LSA and the number of topics for LDA.

scores: The answer scores.

numCorr: The number of ’correct’ answers taken from the top of the ranking.

cutoff: The score cutoff such that responses scored above are considered correct.

docTermMatrix: A document term matrix (in this case the solution plus answers).

program LSAcomputeSimMatrix
    docTermMatrix ← IDF(docTermMatrix)
    U, Σ, V ← SVD(docTermMatrix)
    l ← numOfTopics
    T_l ← Σ_k^{-1} U^T
    for i,j in numOfTerms:
        simMatrix[i,j] ← T_l(i) · T_l(j)
    return simMatrix

program LDAcomputeSimMatrix
    docTermMatrix ← IDF(docTermMatrix)
    β ← LDA(docTermMatrix, numOfTopics)
    for i,j in numOfTerms:
        simMatrix[i,j] ← …
    return simMatrix

program LSA/LDA-CSS
    simMatrix ← computeSimMatrix(rowstack(solution, answers), numOfTopics)
    for n in numOfAnswers:
        similarity[n] ← CSS(solution, answers[n], simMatrix, α)
    rankedAnswers ← decreasingSort(answers, scores, similarity)
    precision ← computePrecision(rankedAnswers, numCorr, cutoff)
    recall ← computeRecall(rankedAnswers, numCorr, cutoff)
    accuracy ← computeAccuracy(rankedAnswers, numCorr, cutoff)
    return precision, recall, accuracy
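The three evaluation steps at the end of the listing can be sketched in Python as follows. This is an illustration under stated assumptions, not the original implementation: the top numCorr ranked answers are treated as predicted correct, an answer is treated as truly correct when its instructor score is at least the cutoff (the listing says "scored above", so the boundary case is an assumption), and the single function `evaluate_ranking` is a hypothetical stand-in for computePrecision, computeRecall and computeAccuracy.

```python
def evaluate_ranking(ranked_scores, num_corr, cutoff):
    """Precision, recall and accuracy for a ranking.

    ranked_scores: instructor scores ordered by decreasing similarity to the solution.
    num_corr: number of top-ranked answers predicted to be correct.
    cutoff: minimum score for an answer to count as truly correct (assumed inclusive).
    """
    tp = fp = fn = tn = 0
    for rank, score in enumerate(ranked_scores):
        predicted = rank < num_corr
        actual = score >= cutoff
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(ranked_scores) if ranked_scores else 0.0
    return precision, recall, accuracy

# Toy ranking of six answers; three are truly correct and three are predicted correct.
print(evaluate_ranking([5, 4, 1, 3, 0, 2], num_corr=3, cutoff=3))
```

When num_corr equals the number of truly correct answers, as in this toy example, false positives and false negatives are equal in number, so precision and recall coincide. This is the equal error rate setting used in the experiments.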
