As we noted in the introduction to this chapter, we have deliberately used only features that we hope can be related to quality in a direct manner. Having a large class of features in which individual features have no clear relationship to a writing facet would limit our ability to claim that any definable facet is indicative of text quality. Rather, the analysis would only denote individual myopic features as significantly predictive. For example, suppose that we find personal pronouns to occur significantly more often in the very good category of articles than in the typical category. This result does not necessarily indicate that a narrative style is indicative of good quality or that references to people are more common in good samples. For tasks such as text quality prediction, however, interpretable results are preferable. In this section, we describe an annotation study in which we directly tested whether our features capture the intended aspect with good accuracy. During this annotation, our aim is only to understand how well the features represent their facets, separate from whether each feature is indicative of text quality.
For this analysis, we selected eight features: one from each of our six facets, except for ‘beautiful language’ and ‘affective content’, for which we selected two each. For beautiful language, we select avr char perp all, which indicates the average perplexity of words under the n-gram character model, and surp wd, a word-pair feature which measures the number of unusual phrases (normalized by the number of words). For affective content, we select the features measuring the total proportion of polarity words (polar prop) and the proportion of total words which have negative sentiment (neg prop).
To obtain text examples, we selected a random sample of articles from our corpus (without regard to quality categories). However, we biased the sample to be representative of the different topics in our corpus. We use the set of “science” tags from Chapter 3 (Section 3.1.2) for this purpose. These tags are taken from the NYT corpus metadata and indicate a minimal set of science-related topics in the NYT. There were 14 tags in that set. We exclude the ‘Research’ tag since it does not indicate a specific topic. For each of the remaining 13 tags, we randomly sample 25 articles from the corpus which contain that tag. In this way, we obtain a small representative sample of our corpus with a total of 325 articles.
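The per-tag sampling can be sketched as follows. This is a minimal illustration and not the code used in the study: the article representation (a dict with hypothetical `id` and `tags` fields) is assumed, and so is the choice to skip articles already drawn under an earlier tag so that 13 tags at 25 articles each yield 325 distinct articles.

```python
import random

def sample_by_tag(articles, tags, per_tag=25, seed=0):
    """Draw per_tag articles for each tag, never picking an article twice.

    `articles` is a list of dicts with hypothetical keys "id" and "tags".
    Skipping already-sampled articles is an assumption made here so that
    the final sample size is exactly len(tags) * per_tag.
    """
    rng = random.Random(seed)
    sample = {}
    for tag in tags:
        pool = [a for a in articles
                if tag in a["tags"] and a["id"] not in sample]
        # random.sample raises ValueError if the pool is smaller than per_tag.
        for article in rng.sample(pool, per_tag):
            sample[article["id"]] = article
    return list(sample.values())
```

With 13 tags and `per_tag=25`, the returned list has 325 entries, provided every tag has at least 25 unused articles.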
Since it would be difficult to judge the presence of a facet in a full article, or further to indicate its extent in the article, we create smaller snippets from the articles, each approximately 200 words long. We create snippets starting from each paragraph boundary in the article and do not truncate a snippet in the middle of a sentence. The resulting snippets are quite coherent, and a total of 6192 snippets were obtained.
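One way to build such snippets is sketched below. The details are assumptions rather than the procedure actually used: paragraphs are taken to be separated by blank lines, sentences are split with a naive regular expression, a snippet is extended sentence by sentence until it reaches the target length (so it may slightly exceed 200 words), and starting points too close to the end of the article to reach the target are dropped.

```python
import re

def split_sentences(paragraph):
    # Naive splitter (an assumption): break after ., ! or ? plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def make_snippets(article_text, target_words=200):
    """Build one snippet starting at each paragraph boundary.

    Whole sentences are appended until the snippet has at least
    target_words words, so a snippet never ends mid-sentence.
    """
    paragraphs = [p for p in article_text.split("\n\n") if p.strip()]
    sentences_by_par = [split_sentences(p) for p in paragraphs]
    snippets = []
    for start in range(len(paragraphs)):
        words = 0
        chosen = []
        for sentences in sentences_by_par[start:]:
            for sentence in sentences:
                chosen.append(sentence)
                words += len(sentence.split())
                if words >= target_words:
                    break
            if words >= target_words:
                break
        # Drop starts that run out of text before reaching the target.
        if words >= target_words:
            snippets.append(" ".join(chosen))
    return snippets
```

Because every paragraph boundary is a starting point, snippets from the same article overlap heavily, which is one reason to keep only one snippet per article when selecting examples for annotation.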
For each feature, we compute its value for all the snippets. Then, we select the 50 snippets with the highest feature value, the 50 with the lowest value for the feature and 50 samples chosen randomly without regard to feature value.29 We provided these snippets in random order and asked annotators to indicate the degree to which the facet represented by the feature is present in the snippet. For example, for the affective content feature, we asked an annotator to rate the passage for the degree to which sentiment and emotion are present in the snippet. The annotators used a scale from 1 to 10, where 10 indicates that the facet is present to a very high degree and 1 indicates that the facet is almost absent.
29 We select only one snippet per article to avoid biasing the annotation data towards a few articles.
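The selection of top, bottom and random snippets for a feature can be sketched as below. The snippet representation (a dict with a hypothetical `article_id` key plus one key per feature) is assumed, and the one-snippet-per-article constraint is applied within each of the three groups separately, which is one possible reading of the footnote.

```python
import random

def pick_one_per_article(ordered, k):
    """Take the first k snippets in `ordered`, skipping repeated articles."""
    seen = set()
    picked = []
    for snippet in ordered:
        if snippet["article_id"] in seen:
            continue
        seen.add(snippet["article_id"])
        picked.append(snippet)
        if len(picked) == k:
            break
    return picked

def select_for_annotation(snippets, feature, k=50, seed=0):
    """Return (top, bottom, random) groups of up to k snippets each."""
    by_value = sorted(snippets, key=lambda s: s[feature])
    bottom = pick_one_per_article(by_value, k)        # lowest feature values
    top = pick_one_per_article(by_value[::-1], k)     # highest feature values
    shuffled = list(snippets)
    random.Random(seed).shuffle(shuffled)             # ignores feature value
    rand = pick_one_per_article(shuffled, k)
    return top, bottom, rand
```

The three groups would then be pooled and shuffled before being shown to an annotator, so that the rating is not influenced by knowing which group a snippet came from.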
Note that our annotation procedure is based on texts ranked high and low according to certain feature values. An alternative method is to first directly obtain ratings for each facet on a collection of snippets and then compute the extent to which our features reflect these ratings. In the latter approach, it is unclear how large a collection we should annotate in order to obtain samples which have high and low degrees of presence for all the aspects that we consider. So we choose our two-step approach of first obtaining feature values on the texts and then estimating the accuracy of the induced rankings.
Our annotators were undergraduate students from the University of Pennsylvania’s engineering and psychology departments, all native speakers of English. During a training phase, each student was assigned two aspects which they studied in detail. A description of the facet was provided together with example snippets that were manually chosen to reflect high, low and medium presence of the facet. Each facet was also assigned to two different annotators. They annotated a sample of 10 snippets individually, and the two annotators who rated the same facet discussed their ratings with each other.30 Even during the training sessions, we found that the annotators had reasonable agreement in their ratings and were able to resolve differences through discussion.
After training, each annotator annotated the 150 snippets belonging to the top, bottom and random values (50 each) of a feature. Another annotator annotated a random sample of 30 snippets (from the 150) in order to measure agreement. If a feature captures a particular aspect, then the snippets ranked at the top should receive higher ratings from annotators than those ranked low by the feature. We include the set of random snippets to check the prevalence of an aspect. If snippets chosen at random receive high ratings for the aspect from the annotators, it would indicate that the aspect is highly prevalent in the texts in our corpus, so a feature based on this aspect is unlikely to be useful for differentiating the articles.
The results are shown in Table 6.4 for the eight selected features. The second column indicates annotator agreement, which we measure as the Pearson correlation between the ratings of the two annotators on the common 30 snippets. A ‘*’ indicates that the correlation was significant with a p-value less than 0.05. The next three columns indicate the mean value of the annotator rating for the top, bottom and random snippets. The last column indicates whether the mean value for top ranked snippets is significantly higher than for bottom ranked snippets (T>B) and whether the top and bottom snippets have ratings significantly different from randomly chosen snippets. High or low trends are indicated by > and < symbols. The values in two classes of snippets were compared using a two-sided t-test, and a p-value of less than 0.05 was taken to indicate significance.

                               Mean ratings from annotator
Feature            Agreement   Top (T)  Bottom (B)  Random (R)   Significance
total visual       0.57*       4.72     1.88        2.84         T>B, T>R, B<R
animate prop       0.94*       6.72     1.30        4.04         T>B, T>R, B<R
narrative          0.78*       7.34     3.72        4.52         T>B, T>R
avr char perp all  0.09        4.50     4.62        4.30
surp wd            0.47*       4.80     4.08        4.12         T>B, T>R
polar prop         0.71*       4.68     1.96        2.86         T>B, T>R, B<R
neg prop           0.69*       4.96     1.28        2.48         T>B, T>R, B<R
res total prop     0.71*       3.84     1.30        2.46         T>B, T>R, B<R

Table 6.4: Agreement (Pearson correlation) of annotators and mean values of ratings for the different splits in feature value. The last column indicates whether the ratings for the splits are significantly different. Significant correlations in the second column are marked with a ‘*’.
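The two statistics used in this analysis can be sketched with the standard library alone. This is an illustration under assumptions: the text does not say which variant of the t-test was used, so the sketch computes Welch's unequal-variance statistic, and it stops at the statistic and degrees of freedom because obtaining the two-sided p-value requires the t distribution (available, for instance, through `scipy.stats.ttest_ind`).

```python
import math
from statistics import mean, variance

def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two rating groups.

    Welch's variant is an assumption; the source only says "two-sided t-test".
    """
    va = variance(a) / len(a)   # squared standard error of group a
    vb = variance(b) / len(b)   # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Here `pearson` would be applied to the two annotators' ratings on the 30 shared snippets, and `welch_t` to each pair of groups (top versus bottom, top versus random, bottom versus random) using the 50 ratings in each group.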
We find that for most of our features, the two annotators had high agreement in their judgements of whether the text ranked high or low with regard to the corresponding facet. Most of the correlations between the annotators’ ratings are 0.5 and above. The highest agreement is for the animacy feature, with a correlation of 0.94. For the avr char perp all feature, there is essentially no correlation between the annotators (0.09). The proportion of visual words and the surp wd features have correlations around 0.5. The narrative sub-genre, polarity and research content features have correlations around 0.7.
For the differences between top, bottom and random snippets, most of the features showed the desirable trends. The annotators rated the snippets ranked at the top by feature value as having a high presence of the aspect and the bottom snippets as having a much lower presence of the aspect. Similarly, for five of the eight features, both top and bottom snippets are rated significantly differently from random snippets, indicating that these features create a useful distinction between texts according to the facet they represent. The only feature for which no significant results were obtained is the one for unusual words (avr char perp all). Note also that the annotators did not show any agreement in their ratings for this feature. This result indicates that either this feature does not capture the ‘unusual words’ aspect or that people do not perceive unusual words as related to beautiful writing. Notably, all the features with the exception of those for ‘beautiful language’ are designed to reflect a facet of writing (such as sentiment) without reference to whether the text is considered interesting or of high or low quality. In the case of the ‘beautiful language’ features, however, we are directly asking annotators to judge the attractive nature of the writing, and this could increase the variability in ratings according to a person’s preferences and opinions. Future work should focus on how different aspects can be annotated separately from questions of quality judgement.