In this section, we describe the models used in this chapter which leverage multiple datasets from various sources.
6.2.1 Two-Stage Model
We can use the regression-based weighted frustratingly easy domain adaptation (W-FEDA) model from Chapter 5 to utilise multiple datasets. However, this model has a potential drawback: different exams use incompatible marking schemes. For example, a score of 16 in the PET exam does not mean the same as a score of 16 in the FCE exam. When we train a model to predict PET scores, identical numerical scores from different marking schemes might confuse it. Although W-FEDA might mitigate this problem by giving each dataset its own domain-specific weight parameter space, the parameter space shared by all datasets might still be a compromise between the absolute scores from different marking schemes.
Hence, we suggest that when training a transfer-learning-based automated grader on multiple datasets from different sources to mark the target dataset, it is better to transfer only the ranking of language quality among the texts in the source datasets, rather than their absolute scores. The benefit of transferring rankings is that, when we combine datasets, we do not need to worry about reconciling different marking schemes.
We make a strong assumption here: given any pair of texts $\text{text}_1$ and $\text{text}_2$, if the score of $\text{text}_1$ is higher than that of $\text{text}_2$ on one grading scale, the relative order of the two texts is preserved when they are marked under a different marking scheme. Under this assumption, transferring the relative order might pass less noise and more useful knowledge from the different datasets into the shared parameter space, because the order holds on every scale. Cummins et al. (2016b)’s model also utilises multiple datasets by transferring relative order instead of absolute scores. In the following section, we describe their work in detail.

¹ The idea of the two-stage model in Section 6.2 was published as a paper in the proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Cummins et al., 2016b). In this paper, I contributed the idea of the two-stage model. Cummins implemented the model and evaluated it on the ASAP dataset. Briscoe gave feedback and suggestions for the paper.
6.2.1.1 Cummins et al.’s Model
Figure 6.1: First stage of Cummins et al.’s model.
In the example from Cummins et al. (2016b)’s original work shown in Figure 6.1, we assume that we have seven texts with feature vectors $x_1$ to $x_7$. These texts come from three different tasks $T_1$, $T_2$, and $T_3$ with three different grading scales (the orange, red and green numbers in Figure 6.1). Each text, with feature vector $x_i$, is marked on the grading scale of the task to which it belongs.² We use unweighted FEDA (Section 5.3.1) to augment each feature vector to $\Phi(x_i)$.
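To make the augmentation step concrete, the following is a minimal sketch of unweighted FEDA as it is usually formulated: each feature vector is copied into a shared block and into the block of its own task, and the blocks of all other tasks are filled with zeros. The function name `feda_augment` and the ordering of the blocks are illustrative assumptions; the exact layout used in Section 5.3.1 may differ.

```python
def feda_augment(x, task_index, num_tasks):
    """Return [shared copy | task-0 block | ... | task-(K-1) block].

    Only the shared block and the block of `task_index` hold a copy of x;
    all other task blocks are zero. The block order is an assumption.
    """
    if not 0 <= task_index < num_tasks:
        raise ValueError("task_index out of range")
    zeros = [0.0] * len(x)
    augmented = list(x)  # shared block, seen by every task
    for k in range(num_tasks):
        augmented.extend(x if k == task_index else zeros)
    return augmented


# Toy example: a 2-dimensional feature vector from the second of three tasks.
phi = feda_augment([0.5, 2.0], task_index=1, num_tasks=3)
print(phi)  # [0.5, 2.0, 0.0, 0.0, 0.5, 2.0, 0.0, 0.0]
```

With this layout, weights on the shared block are updated by texts from every task, whereas weights on a task block are updated only by texts from that task.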
The automated scoring procedure of their model is split into two stages. In the first stage, we use a perceptron-based pairwise-ranking model (Joachims, 2002), which Briscoe et al. (2010) and Yannakoudakis (2013) have used for ranking in ATS (Section 2.4). The model constraints during training are similar to those of the ranking SVM (Section 2.3.3), as given in the following equation:
$$w^{T}\left(\Phi(x_i) - \Phi(x_j)\right) > M \quad \text{for } (x_i, x_j) \in r \tag{6.1}$$

² The definition of a task given by Cummins et al. (2016b) is different from the task we define in Chapter 5. The task here is similar to a prompt in this thesis. Different tasks could be marked on the same grading scale or on different scales.
In the above equation, $(x_i, x_j) \in r$ ranges over all pairs of data instances in the training set where $x_i$ has a higher rank than $x_j$. During training, the model learns a binary classifier (top right in Figure 6.1) with a weight vector $w$ that minimises the misclassification rate of the difference vectors $\Phi(x_i) - \Phi(x_j)$, so that for most pairs $(x_i, x_j)$ the value $w^{T}(\Phi(x_i) - \Phi(x_j))$ exceeds $M$, a hyper-parameter that controls the model margin. We denote the weight vector learned by the perceptron algorithm as $w^{*}$.
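The constraint in Equation 6.1 can be illustrated with a small margin perceptron trained on difference vectors. This is a sketch under stated assumptions rather than Cummins et al.’s implementation: we assume that pairs are formed only between texts of the same task, that the update rule adds the difference vector to $w$ whenever a pair violates the margin, and that training stops after a fixed number of epochs. The function names, the learning rate `lr` and the toy data are all hypothetical.

```python
def build_pairs(features, scores, tasks):
    """Difference vectors phi_i - phi_j for pairs where text i outscores text j.

    Assumption: only texts from the same task are paired, so scores on
    different grading scales are never compared directly.
    """
    diffs = []
    for i in range(len(features)):
        for j in range(len(features)):
            if tasks[i] == tasks[j] and scores[i] > scores[j]:
                diffs.append([a - b for a, b in zip(features[i], features[j])])
    return diffs


def train_margin_perceptron(diffs, margin=1.0, epochs=100, lr=1.0):
    """Update w whenever a pair violates w . (phi_i - phi_j) > margin."""
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        violations = 0
        for d in diffs:
            if sum(wk * dk for wk, dk in zip(w, d)) <= margin:
                w = [wk + lr * dk for wk, dk in zip(w, d)]
                violations += 1
        if violations == 0:  # every pair satisfies the margin constraint
            break
    return w


# Toy data: one feature per text, two tasks marked on different scales.
features = [[1.0], [2.0], [0.5], [1.5], [3.0]]
scores = [3, 5, 12, 16, 20]
tasks = ["A", "A", "B", "B", "B"]

diffs = build_pairs(features, scores, tasks)
w = train_margin_perceptron(diffs, margin=1.0)
```

In this toy example the scores 5 (task A) and 12 (task B) are never compared, so the incompatibility of the two scales does not enter the training signal.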
Similar to the ranking SVM, during prediction the perceptron model predicts a ranking score $\hat{y}_i^{\text{rank}}$ for an incoming data instance $x_i$ as $\hat{y}_i^{\text{rank}} = w^{*} \cdot \Phi(x_i)$. We can obtain the order of a dataset $\{x_i\}_{i=1}^{N}$ by calculating the ranking score for each instance and sorting the dataset by the predicted ranking scores.
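The prediction step amounts to a dot product followed by a sort. The sketch below assumes that the feature vectors have already been augmented and that a higher ranking score indicates a better text; the weight vector and the feature values are hypothetical.

```python
def ranking_scores(w_star, features):
    """Ranking score for each instance: the dot product w* . phi(x_i)."""
    return [sum(wk * xk for wk, xk in zip(w_star, phi)) for phi in features]


def rank_order(scores):
    """Indices of the instances sorted from highest to lowest ranking score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


w_star = [2.0, -1.0]  # hypothetical learned weights
features = [[1.0, 0.5], [0.2, 0.1], [1.5, 2.0]]

scores = ranking_scores(w_star, features)
order = rank_order(scores)
print(order)  # [0, 2, 1]
```

The ranking scores themselves carry no meaning on any grading scale; only their relative order is used at this point.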
After training the ranking model, we still do not know the score of each text on its original grading scale, only an ordering of all the texts. Therefore, we need another stage to map the output of the ranking model to a predicted score for each text on its corresponding grading scale. In this second stage, for a feature vector $\Phi(x_i)$ with predicted ranking score $\hat{y}_i^{\text{rank}}$, a linear regression model $\mathbb{R}^1 \to \mathbb{R}^1$ is built over all the texts from the same task to learn the relation between $\hat{y}_i^{\text{rank}}$ and the score $y_i$. We use this linear regression model to predict the score of $x_i$ and round the result to obtain $\hat{y}_i$. We therefore have one linear regression model per task, giving three regression models in the example of Figure 6.1.
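The second stage can be sketched as one ordinary least-squares line per task. The code below is an illustration under assumptions, not the original implementation: the helper names are hypothetical, the toy ranking scores and gold scores are invented for the example, and rounding uses Python’s built-in `round`, which rounds exact halves to the nearest even integer.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y  # degenerate case: always predict the mean score
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x


def fit_per_task(rank_scores, gold_scores, tasks):
    """One regression per task, mapping ranking score to that task's scale."""
    grouped = defaultdict(lambda: ([], []))
    for r, y, t in zip(rank_scores, gold_scores, tasks):
        grouped[t][0].append(r)
        grouped[t][1].append(y)
    return {t: fit_line(xs, ys) for t, (xs, ys) in grouped.items()}


def predict_score(models, rank_score, task):
    """Map a ranking score to a rounded score on the given task's scale."""
    a, b = models[task]
    return round(a * rank_score + b)


# Toy data: ranking scores from the first stage and gold scores on two scales.
rank_scores = [1.0, 2.0, 3.0, 1.0, 3.0]
gold_scores = [2, 4, 6, 10, 20]
tasks = ["T1", "T1", "T1", "T2", "T2"]

models = fit_per_task(rank_scores, gold_scores, tasks)
print(predict_score(models, 2.4, "T1"))  # 5
print(predict_score(models, 2.4, "T2"))  # 17
```

The same ranking score of 2.4 maps to different absolute scores depending on the task, which is the behaviour the second stage is designed to provide.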
In summary, we can interpret Cummins et al.’s model as a combination of multi-task learning and transfer learning. The first stage is a multi-task learning model optimised on all datasets equally, and the second stage distils the knowledge from the first stage and transfers it to each task separately.
In this chapter, we define a two-stage model as a model that first learns a ranking model and then builds a regression model on the outputs of the ranking model. In contrast, the W-FEDA model described in Section 6.3.1 predicts the score of each text in one stage, and we define this type of model as a one-stage model.³
We slightly modify Cummins et al.’s model by replacing the first-stage perceptron model with the ranking SVM model in LIBLINEAR (Lee and Lin, 2014). We do so for two reasons: the ranking SVM and their perceptron model perform similarly on ATS (Yannakoudakis, 2013), and the LIBLINEAR ranking SVM implementation is highly optimised for convergence speed, which allows the model to be trained without sampling the data to reduce the training set size. We use the L2-regularised version of the ranking SVM in this chapter.
6.2.1.2 Two-Stage Feature-Rich Model
Cummins et al.’s model has two possible drawbacks. First, the first stage condenses all the features of each text into a single ranking score, and these features are no longer visible in the second stage. In other words, the ranking model is optimised on the order (binary classification error rate) rather than on the absolute scores, and the regression model, which sees only a single ranking score, might not have enough information to map the order back to the original scores. Second, Cummins et al. built one single model that serves every dataset, whereas the amount of knowledge that should be transferred from the source datasets to the target dataset to give the best model performance and mitigate potential negative transfer (Rosenstein et al., 2005) might vary from one dataset to another.
We propose a variation of Cummins et al.’s model that addresses these problems. We hypothesise that, in the second stage, a transfer-learning-based model with richer input features should perform better than Cummins et al.’s original approach. We therefore take the ranking score $\hat{y}_i^{\text{rank}}$ predicted by the ranking model and concatenate it, as an extra feature, with the features $x_i$ we identified in Chapter 3. In other words, we use the ranking score as an additional feature and feed it into the baseline SVR model from Chapter 3. We include a weighting hyper-parameter $\zeta \in [0, 1]$ to control the influence of the ranking score in the second stage. The feature representation used to train the second-stage regression model can be written as:
$$x_i^{FR} = \zeta x_i \oplus (1 - \zeta)\,\hat{y}_i^{\text{rank}} \tag{6.2}$$
All the predicted ranking scores $\hat{y}^{\text{rank}}$ are normalised to $[0, 1]$ so that they are on the same scale as the features with which they are concatenated, and we use the SVR model from Chapter 3 as the second-stage model to learn the relation between the new feature vectors $x_i^{FR}$ and the scores $y_i$.
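The construction of the feature-rich representation in Equation 6.2 can be sketched as follows. We assume min-max scaling as the normalisation to $[0, 1]$, since the text does not fix a particular method, and the function names and toy values are hypothetical. The SVR model itself is not reproduced here; the sketch covers only the inputs that would be passed to it.

```python
def min_max_normalise(values):
    """Scale values linearly to [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def feature_rich(features, rank_scores, zeta):
    """Build zeta * x_i concatenated with (1 - zeta) * normalised ranking score."""
    if not 0.0 <= zeta <= 1.0:
        raise ValueError("zeta must lie in [0, 1]")
    normalised = min_max_normalise(rank_scores)
    return [
        [zeta * v for v in x] + [(1.0 - zeta) * r]
        for x, r in zip(features, normalised)
    ]


# Toy data: two texts with two features each and raw ranking scores.
features = [[0.2, 0.8], [0.6, 0.4]]
rank_scores = [-1.0, 3.0]

x_fr = feature_rich(features, rank_scores, zeta=0.5)
```

Setting $\zeta = 1$ removes the influence of the ranking score and leaves only the original features, whereas values of $\zeta$ close to 0 make the second stage rely almost entirely on the ranking score, which approaches the behaviour of the original two-stage model.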