In this section, we describe the performance of classifiers built on the data that we collected from both discourse relations and Mechanical Turk annotations. We also validate the proxy data from discourse relations by testing how accurately a classifier trained on the discourse relations data performs when applied to the direct annotations that we collected from people.
5.4.1 Three types of classifiers
On the discourse relations data, we built two classifiers for distinguishing general and specific sentences: one trained on sentences from Instantiation relations, and one on sentences from Specification relations. We built a separate classifier on the examples collected from direct annotation. Each classifier was trained and tested on sentences from the same source. For example, we trained a classifier on the general and specific sentences collected from the Instantiations data and tested its accuracy on a held-out set of sentences, also from Instantiation relations.
We train a logistic regression classifier with each set of features described above and evaluate the predictions using 10-fold cross validation. For the Instantiations and Specifications data, the general and specific categories have an equal number of sentences, so the random baseline accuracy is 50%. For the Mechanical Turk annotations, the majority class baseline (specific) is 56%. Table 5.8 shows the accuracy of our features.
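The evaluation protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the helper names `k_fold_indices`, `majority_baseline` and `cross_validate` are hypothetical, and `train_and_predict` is a placeholder for whatever classifier (for example, logistic regression) the caller supplies.

```python
import random
from collections import Counter


def k_fold_indices(n_examples, k=10, seed=0):
    """Shuffle example indices and split them into k nearly equal folds."""
    if n_examples < k:
        raise ValueError("need at least one example per fold")
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def cross_validate(features, labels, train_and_predict, k=10, seed=0):
    """Mean accuracy over k folds.

    train_and_predict(train_x, train_y, test_x) must return one predicted
    label per test example; it stands in for the real classifier.
    """
    accuracies = []
    for fold in k_fold_indices(len(labels), k, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        test_x = [features[i] for i in fold]
        predicted = train_and_predict(train_x, train_y, test_x)
        correct = sum(p == labels[i] for p, i in zip(predicted, fold))
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)
```

With balanced classes the majority baseline coincides with the 50% random baseline; with a 56/44 split it is 56%, which is the reference point used for the Mechanical Turk annotations.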
The classifier on the Turker data, despite having fewer training examples, is overall the best performing. With all the non-lexical features put together, the accuracy is 79%. Using only lexical features gives worse results, 71% on this data. The system trained on Instantiations examples is also promising, with 75% accuracy for both lexical and non-lexical features. Lexical features are less sparse on larger data, and this could be contributing to the better performance of these features on the Instantiations data.
The Specifications data, however, yields much lower performance: the best accuracy is only 12 percentage points above the baseline. It is possible that in Specification relations, the specificity of the second sentence is only relative to that of the first. For Instantiations, on the other hand, there are individual characteristics related to the generality or specificity of sentences. We further verify the suitability of the discourse relations data for this task in the next section.

Features            Instantiations   Specifications   Turk annotations
NE+CD                    68.6             56.1              73.0
language models          65.8             55.7              71.1
word specificity         63.6             57.2              70.2
syntax                   63.3             57.3              69.4
polarity                 63.0             53.4              67.9
sentence length          54.0             57.2              56.6
all non-lexical          75.0             62.0              79.4
lexical (words)          74.8             59.1              71.5
all features             75.9             59.5              78.2

Table 5.8: Accuracy of different features for classifying general versus specific sentences
Among the non-lexical features, the NE+CD class is the strongest, with an accuracy of 68% for Instantiations and 73% on the manual annotations. Language models, syntax, polarity and word specificity features also outperform the baseline by about 15% accuracy. The sentence length features are the least indicative. The non-lexical feature classes, though not that strong individually, combine to give performance at least as good as that of the word features. The combination of lexical and non-lexical categories does not substantially outperform the accuracies obtained by each category individually.
5.4.2 Validating the use of discourse relations to create examples
We have assumed, from the definitions of Instantiation and Specification relations, that their first sentences (Sent1) are general and their second sentences (Sent2) are specific. Further, we used these two sentences independently in two different classes. Now we test this intuition directly. We seek to answer two questions:

• Do people's ratings of these sentences match the judgement of generality as we have assumed?

• How well does a classifier trained on the discourse relations data perform on the direct annotations obtained through Mechanical Turk?

Instantiations data
         General                    Specific
Sent1    29                         3
         L5(14), L4(9), L3(6)       L5(1), L4(1), L3(1)
Sent2    6                          26
         L5(1), L4(3), L3(2)        L5(13), L4(9), L3(4)

Specifications data
         General                    Specific
Sent1    10                         6
         L5(4), L4(3), L3(3)        L5(1), L4(1), L3(4)
Sent2    8                          8
         L5(5), L4(3), L3(0)        L5(5), L4(2), L3(1)

Table 5.9: Annotator judgements of general/specific nature for Instantiation and Specification sentences
To answer the first question, we included sentences from Instantiation and Specification relations in the dataset for Turk annotations. There were 32 Instantiation and 16 Specification relations in the three WSJ articles we annotated, and each of these relations is associated with two sentences, Sent1 and Sent2.
In Table 5.9, we provide the annotator judgements and agreement levels on these sentences. The number of sentences x in each category with a certain level of agreement y is indicated as Ly(x). So L5(3) means that three sentences had full agreement (level 5).
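The Ly(x) notation can be derived mechanically from the raw votes. The sketch below is illustrative only: it assumes five annotators per sentence and two labels (so there are no ties and the agreement level is 3, 4 or 5), and the function names `majority_and_level`, `summarize` and `format_cell` are hypothetical.

```python
from collections import Counter


def majority_and_level(votes):
    """Return (majority label, number of annotators who chose it)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count


def summarize(judgements):
    """Map each majority label to a Counter of agreement levels.

    judgements: list of per-sentence vote lists (assumed: five votes each).
    """
    summary = {}
    for votes in judgements:
        label, level = majority_and_level(votes)
        summary.setdefault(label, Counter())[level] += 1
    return summary


def format_cell(level_counts):
    """Render counts in the Ly(x) notation, highest agreement first."""
    ordered = sorted(level_counts.items(), reverse=True)
    return ", ".join(f"L{y}({x})" for y, x in ordered)


# Toy votes, invented for illustration (not the thesis data).
toy = [
    ["general"] * 5,
    ["general"] * 4 + ["specific"],
    ["specific"] * 3 + ["general"] * 2,
    ["general"] * 5,
]
cells = {label: format_cell(c) for label, c in summarize(toy).items()}
```

Unlike Table 5.9, this sketch omits levels with a zero count rather than printing them as, for example, L3(0).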
For Instantiations, we find that the majority of Sent1 are judged as general and the majority of Sent2 are judged as specific, more than 80% in each case (29 of 32 for Sent1 and 26 of 32 for Sent2). But for both Sent1 and Sent2, there is one sentence which all the annotators agreed belongs to the class opposite to the one assumed. So there are some cases where, without context, the judgement can be rather different. But such examples are infrequent in the Instantiation sentences. Hence this dataset closely approximates the general-specific distinction which we wished to learn.

                              Accuracy
Data           No. examples   All features   Non-lexical   Lexical
WSJ                 293           73.7           76.7        71.6
AQUAINT             287           59.2           81.1        67.5
NYT-science         305           67.2           74.4        58.3

Table 5.10: Accuracies of the Instantiations-trained classifier on the Mechanical Turk annotations
On the other hand, Specifications show a weaker pattern. For Sent1, a majority (62.5%) of the sentences are still judged as general. However, for Sent2, the examples are equally split between the general and specific categories. Hence it is not surprising that the Instantiation sentences have more detectable properties associated with the first general sentence and the second specific sentence, and that the classifier trained with these examples obtains better performance than one trained on Specifications.
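The proportions discussed above follow directly from the majority-label counts in Table 5.9. The short check below recomputes them; the dictionary layout and the function name `share_matching_assumption` are choices made for this sketch only.

```python
# Majority-label counts copied from Table 5.9: (general, specific).
TABLE_5_9 = {
    ("Instantiations", "Sent1"): (29, 3),
    ("Instantiations", "Sent2"): (6, 26),
    ("Specifications", "Sent1"): (10, 6),
    ("Specifications", "Sent2"): (8, 8),
}

# Label each position is assumed to carry under the discourse-relation proxy.
ASSUMED = {"Sent1": "general", "Sent2": "specific"}


def share_matching_assumption(relation, position):
    """Percentage of sentences whose majority judgement matches the assumption."""
    general, specific = TABLE_5_9[(relation, position)]
    matching = general if ASSUMED[position] == "general" else specific
    return 100.0 * matching / (general + specific)


for key in TABLE_5_9:
    print(key, round(share_matching_assumption(*key), 1))
```

For Instantiations this gives 90.6% (Sent1) and 81.2% (Sent2); for Specifications it gives 62.5% (Sent1) and 50.0% (Sent2).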
Therefore the Instantiations examples appear to be reliable data for our task, while Specification relations do not appear useful for the binary distinction we make in this work. So we further test the validity of the Instantiations data by training a classifier on the Instantiations examples and testing it on the annotations obtained directly through Mechanical Turk. High performance on this task would indicate that the Instantiations data, while still a proxy, provides a distinction similar to that given by people's ratings.
Table 5.10 shows the results for this task. A classifier was trained on the Instantiations data and tested on each of the three sets of annotations from WSJ, AQUAINT and NYT. Since there were sentences in the WSJ test data which overlapped with our Instantiations training set, we removed the overlapping sentences and retrained our classifier when testing on the WSJ data. We find that the Instantiations-based classifier has about the same accuracy on the directly annotated data as when tested on a held-out sample of Instantiations sentences. The highest accuracies are obtained using the non-lexical features, similar to our findings in the previous section. For this feature set, the accuracies are around 75% on the WSJ and NYT data. For the AQUAINT annotations, the accuracy is even higher, 81%. When using the word features or all features, the accuracies are not as high, probably because the lexical items present in the WSJ corpus differ from those in the other corpora. Accordingly, the word-based classifier's accuracy is 71% on the WSJ data but only 58% on the NYT data. The non-lexical features, on the other hand, have similarly high accuracies on the different corpora.

Feature set     Accuracy
Nonlexical        72.7
Words             72.1
All features      74.2

Table 5.11: Accuracy on combined set of Instantiations and manually annotated data
These experiments validate that the Instantiations examples provide a suitable and useful dataset for the general/specific distinction.
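The overlap-removal step used before testing on the WSJ annotations can be sketched as below. This is an assumption-laden illustration: the text does not say how overlapping sentences were matched, so matching on case- and whitespace-normalized sentence text is a choice made here, and `remove_overlap` and `accuracy` are hypothetical helper names.

```python
def remove_overlap(train_examples, test_examples):
    """Drop training examples whose sentence text also occurs in the test set.

    Each example is a (sentence, label) pair. Matching on lower-cased,
    whitespace-normalized text is an assumption made for this sketch.
    """
    def normalize(sentence):
        return " ".join(sentence.lower().split())

    test_sentences = {normalize(sentence) for sentence, _ in test_examples}
    return [(s, y) for s, y in train_examples
            if normalize(s) not in test_sentences]


def accuracy(predict, test_examples):
    """Percentage of test examples for which predict(sentence) equals the label."""
    correct = sum(predict(sentence) == label
                  for sentence, label in test_examples)
    return 100.0 * correct / len(test_examples)
```

The classifier would then be retrained on the filtered training set, so that no test sentence has been seen during training, before `accuracy` is computed on the held-out annotations.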
5.4.3 Combined classifier
Since both Instantiations and the direct annotations gave good accuracies, we also combined them to obtain a larger set of examples. The combined set has 1,768 general sentences and 1,858 specific sentences. The distribution is therefore almost equal (49% general and 51% specific), and the baseline random performance would be 50% accuracy. The 10-fold cross validation accuracies from non-lexical, word and ‘all features’ on this full set are shown in Table 5.11. The best accuracy, 74%, was obtained by combining all features. Individually, the lexical and non-lexical categories give between 72% and 73%.
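As a quick arithmetic check on the class balance of the combined set, the shares can be recomputed from the two sentence counts given above. The function name `class_distribution` is hypothetical.

```python
def class_distribution(counts):
    """Return each class's share of the data as a percentage."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Sentence counts for the combined Instantiations + Mechanical Turk set.
combined = {"general": 1768, "specific": 1858}
shares = class_distribution(combined)
for label, share in shares.items():
    print(f"{label}: {share:.1f}%")
```

The two shares round to 49% and 51%, which is why a 50% random baseline is a reasonable reference point for Table 5.11.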
Since our classification approach has sufficient training data and a good accuracy of close to 75%, we used it to analyze writing quality for two genres: summarization and science journalism. Before discussing these, we provide further analysis of the manual annotations, which helped us obtain a score for specificity rather than a binary prediction.