Most previous work on ATS relies on machine learning to model the relationship between texts and scores (Page, 1968; Larkey, 1998; Yigal and Burstein, 2006; Yannakoudakis, 2013; Zesch et al., 2015; Dong and Zhang, 2016; Taghipour and Ng, 2016; Tay et al., 2018). In this section, we introduce several machine learning models used in ATS. More details about different machine learning techniques can be found in the textbooks and surveys of Mitchell (1997), Bishop (2006), Murphy (2012) and Goodfellow et al. (2016). The aim of machine learning is to build a system that learns from experience related to a task, so that its performance on that task improves according to a performance measure (Mitchell, 1997).
There are several tasks in machine learning, one of which is supervised learning. In supervised learning, the system learns a function $f$ that predicts the output $y_i \in Y$ given the input $x_i \in X$ without explicit human instructions. In automated assessment, the output could be a scalar $y_i$, which is the score of the text marked by a human examiner, or a vector $\mathbf{y}_i$, which represents different aspects of the quality of a text. In this thesis, we assume each output is a scalar $y_i$. The input $x_i$ is a text written by a learner. This learning process requires an annotated dataset, a machine learning model and an appropriate optimisation method to learn the relationship between the input and output spaces. In automated assessment, the annotated dataset consists of a collection of texts with scores marked by human examiners; we define this dataset as the training set $\{x\}_{\text{train}}$ in supervised learning. The procedure by which the system learns this relationship is called training. The machine learning model can only read data in a specific format, and the process that maps each text to a model-readable format is defined as feature extraction. The model-readable format of $x_i$ could hence be a vector $\mathbf{x}_i$ of size $D$:
$$\mathbf{x}_i = \langle x_{i,1}, x_{i,2}, \ldots, x_{i,D} \rangle$$
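As a minimal sketch of feature extraction, the following maps a text to a vector of size D = 3. The function name `extract_features` and the three features (token count, unique-token count, mean word length) are assumptions chosen for illustration only; they are not the feature set used in this thesis.

```python
import re
from typing import List


def extract_features(text: str) -> List[float]:
    """Map a raw text x_i to a fixed-size vector <x_i1, x_i2, x_i3>.

    The three features are illustrative assumptions, not the thesis's features.
    """
    # Lower-case the text and keep alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return [0.0, 0.0, 0.0]
    n_tokens = float(len(tokens))                         # length of the text
    n_unique = float(len(set(tokens)))                    # number of unique terms
    mean_len = sum(len(t) for t in tokens) / len(tokens)  # average word length
    return [n_tokens, n_unique, mean_len]


x_i = extract_features("The cat sat on the mat.")
```

Here `extract_features` returns one flat vector per text, which is only one possible model-readable format.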
The model-readable format could also be a sequence of vectors or take other forms, depending on how the model is designed. During training, we want the predictions $\hat{y}_i$ made by the model on the input $x_i$ to be as close as possible to the gold score $y_i$. We therefore need some performance measure $P$ to describe this closeness quantitatively, and the model should be optimised on a pre-defined target function that reflects the performance measure we are interested in. In other words, $P$ should ideally improve as we optimise the pre-defined target function.
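The distinction between the target function and the performance measure P can be illustrated with a small sketch. It assumes mean squared error as the target function and exact agreement after rounding as P; both choices, the toy data, and the one-feature linear model are assumptions for illustration, not the setup used in this thesis.

```python
from typing import List, Tuple


def mse(preds: List[float], golds: List[float]) -> float:
    """Target function (assumed): differentiable, so gradient descent can minimise it."""
    return sum((p - g) ** 2 for p, g in zip(preds, golds)) / len(golds)


def exact_agreement(preds: List[float], golds: List[float]) -> float:
    """Performance measure P (assumed): share of predictions that round to the gold score."""
    return sum(round(p) == g for p, g in zip(preds, golds)) / len(golds)


def train(xs: List[float], ys: List[float],
          lr: float = 0.1, epochs: int = 1000) -> Tuple[float, float]:
    """Fit y ~ w * x + b by batch gradient descent on the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        w -= lr * 2 * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * 2 * sum(errors) / n
    return w, b


# Toy data: one feature per text and integer gold scores from 1 to 5.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0]

before = [0.0 for _ in xs]          # predictions of the untrained model (w = b = 0)
w, b = train(xs, ys)
after = [w * x + b for x in xs]     # predictions of the trained model

loss_before, loss_after = mse(before, ys), mse(after, ys)
p_before, p_after = exact_agreement(before, ys), exact_agreement(after, ys)
```

Exact agreement is not differentiable, so it cannot be minimised by gradient descent directly; in this sketch it improves only because the MSE that is actually optimised pulls the predictions towards the gold scores.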
After we train a model, we need to ensure the model performs well on future unseen data. To estimate future performance, we reserve an annotated dataset and evaluate the trained model on this held-out dataset. This procedure is called testing, and the unseen annotated dataset $\{x\}_{\text{test}}$ used in testing is the test set. The training and test sets should have the same format, and both should have been marked and annotated by human examiners.
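The held-out evaluation described above can be sketched as follows. The helper names, the 80/20 split, the synthetic dataset and the mean-score baseline standing in for a trained model are all assumptions for illustration.

```python
import math
import random
from typing import Callable, List, Tuple

Example = Tuple[str, float]  # (text, gold score marked by a human examiner)


def split_train_test(dataset: List[Example], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Shuffle once with a fixed seed, then reserve a held-out test set."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    n_test = max(1, round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


def fit_mean_baseline(train_set: List[Example]) -> Callable[[str], float]:
    """Stand-in 'model': always predicts the mean score seen during training."""
    mean_score = sum(score for _, score in train_set) / len(train_set)
    return lambda text: mean_score


def rmse(model: Callable[[str], float], dataset: List[Example]) -> float:
    """Root mean squared error of the model's predictions on a dataset."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in dataset) / len(dataset))


# Synthetic annotated data: 20 placeholder texts with scores from 1 to 5.
dataset = [(f"essay {i}", float(i % 5 + 1)) for i in range(20)]
train_set, test_set = split_train_test(dataset)
model = fit_mean_baseline(train_set)   # the model only ever sees the training set
test_rmse = rmse(model, test_set)      # evaluated on unseen, held-out texts
```

The test set is used only in the final call to `rmse`, so the reported error is an estimate of performance on texts the model has not seen.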
In order to optimise the performance measure $P$, we need to define some configurational parameters before training. Unlike the parameters of the model, which are learned directly during training, these configurational parameters cannot be learned directly from the training process. They are called the hyper-parameters of the model. The hyper-parameters of a model could be the parameters that control the complexity of the learned model parameters (for example, the regularisation term), the learning rate of the optimisation method, or the point at which we stop training the model. To decide appropriate values for the hyper-parameters, we can tune the model on a development set, which is another annotated dataset following the same annotation procedure as the training and test sets. The hyper-parameter values that perform best on the development set are then used in the trained model that we evaluate on the test set. This hyper-parameter tuning process is called validation.
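A minimal sketch of validation, assuming a one-feature ridge regression without an intercept whose regularisation strength `lam` is the only hyper-parameter. The model, the candidate values and the toy data are assumptions for illustration; the point is that `lam` is selected on the development set and the test set is touched only once at the end.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (feature value, gold score)


def fit_ridge(train_set: List[Example], lam: float) -> float:
    """Closed-form 1-D ridge solution: w = sum(x * y) / (sum(x * x) + lam)."""
    sxy = sum(x * y for x, y in train_set)
    sxx = sum(x * x for x, _ in train_set)
    return sxy / (sxx + lam)


def mse(w: float, dataset: List[Example]) -> float:
    """Mean squared error of the predictions w * x on a dataset."""
    return sum((w * x - y) ** 2 for x, y in dataset) / len(dataset)


def tune(train_set: List[Example], dev_set: List[Example],
         candidates: List[float]) -> Tuple[float, float]:
    """Train one model per candidate and keep the one with the lowest dev error."""
    best_lam = min(candidates, key=lambda lam: mse(fit_ridge(train_set, lam), dev_set))
    return best_lam, fit_ridge(train_set, best_lam)


# Toy annotated data, split into training, development and test sets.
train_set = [(1.0, 2.2), (2.0, 3.9), (3.0, 6.3)]
dev_set = [(1.5, 3.0), (2.5, 5.0)]
test_set = [(4.0, 8.0)]

best_lam, w = tune(train_set, dev_set, candidates=[0.0, 0.1, 1.0, 10.0])
test_mse = mse(w, test_set)  # the test set is used once, after tuning is finished
```

The same loop would apply to other hyper-parameters mentioned above, such as the learning rate or the stopping point, by replacing the candidate list and the training routine.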
In contrast to supervised learning, unsupervised learning does not involve any annotated dataset; instead, it identifies patterns and structure in unannotated data that might be useful for downstream tasks. At the intersection of supervised and unsupervised learning is semi-supervised learning, which utilises both labelled and unlabelled data instances to learn the relation between the input space $X$ and the output space $Y$. Only limited work in automated assessment (Chen et al., 2010) has studied unsupervised learning.⁵ Most previous work in this field relies on supervised learning (Page, 1968; Larkey, 1998; Briscoe et al., 2010; Yannakoudakis, 2013). We focus on supervised learning in this thesis, and our work in the following chapters is based on it, because the patterns learned by unsupervised learning might not be the ones needed to predict the scores we are interested in.
⁵ Although Chen et al. (2010) did not explicitly use essay scores in training their model, their work still needs some supervised signals, such as knowledge of the historical score distribution, from which they knew that the number of unique terms in each essay correlates strongly with the essay scores in their dataset.