3.4 Comparison with ratings of a student annotator

The seed articles for our corpus were obtained using the judgements of leading journalists. For expanding this set, we used a simple heuristic based on the authors of these seed articles. Therefore we can consider the resulting categories as approximating the judgements of the expert journalists.

very good + great writing                  typical writing
Author                No. (%) of articles  Author             No. (%) of articles
Altman, Lawrence K    417 (9.8)            Fountain, Henry    466 (2.4)
Kolata, Gina          407 (9.6)            Pollack, Andrew    380 (1.9)
Wade, Nicholas        371 (8.7)            Markoff, John      306 (1.6)
Grady, Denise         354 (8.3)            Lohr, Steve        280 (1.4)
Chang, Kenneth        298 (7.0)            Revkin, Andrew C   213 (1.1)
Brody, Jane E         273 (6.4)            Schwartz, John     209 (1.1)
Wilford, John Noble   254 (6.0)            Pear, Robert       183 (0.9)
Stolberg, Sheryl Gay  253 (5.9)            Leary, Warren E    167 (0.8)
Mcneil, Donald G Jr   170 (4.0)            Glanz, James       165 (0.8)
Overbye, Dennis       166 (3.9)            Goode, Erica       160 (0.8)
Broad, William J      157 (3.7)            Goodstein, Laurie  146 (0.7)
Harris, Gardiner      140 (3.3)            Blakeslee, Sandra  133 (0.7)
Carey, Benedict       132 (3.1)            Feder, Barnaby J   132 (0.7)
Harmon, Amy           122 (2.9)            Hafner, Katie      132 (0.7)
Gorman, James         100 (2.4)            Eisenberg, Anne    130 (0.7)
Total                 3614 (85.0)                             3202 (16.3)

Author pair                                             No. (%) of examples
Author of very good article  Author of typical article
Wade, Nicholas               Pollack, Andrew            942 (2.2)
Overbye, Dennis              Glanz, James               622 (1.5)
Wilford, John Noble          Leary, Warren E            516 (1.2)
Carey, Benedict              Goode, Erica               375 (0.9)
Chang, Kenneth               Leary, Warren E            372 (0.9)
Chang, Kenneth               Glanz, James               346 (0.8)
Kolata, Gina                 Pollack, Andrew            324 (0.8)
Wilford, John Noble          Glanz, James               323 (0.8)
Altman, Lawrence K           Pollack, Andrew            320 (0.8)
Grady, Denise                Pollack, Andrew            266 (0.6)
Altman, Lawrence K           Bradsher, Keith            220 (0.5)
Kolata, Gina                 Duenwald, Mary             218 (0.5)
Chang, Kenneth               Fountain, Henry            211 (0.5)
Overbye, Dennis              Leary, Warren E            207 (0.5)
Grady, Denise                Duenwald, Mary             195 (0.5)
Total                                                   5457 (13%)

Table 3.10: The 15 most frequent author pairs of very good and typical articles in the topic normalized corpus

In this section, we provide the results of a small annotation study where we asked an undergraduate student to provide personalized ratings for a few articles from our corpus. We wanted to study the following questions:

1. How much do an individual’s ratings agree with those of the experts?

2. Is there a noticeable difference between the great and very good categories?

3. How accurate is the similarity measure used for creating the article mappings in the topic normalized corpus?

As we discussed in Chapter 2, in this thesis we use the ratings of experts as our gold standard because that definition helps us focus on the linguistic properties of the text. We performed the following annotation study in order to understand how a person from the target population of an application (such as a recommendation system) would rate the same articles. People differ in which topics they like and have personal preferences for style of writing. It is hard to control for these preferences during annotation. However, it is useful to know how the judgements of a target population relate to the expert ratings that we use for developing the text quality measures. This annotation study is a preliminary analysis with this aim.

We hired an undergraduate student to do the annotations. The student had no prior knowledge of or experience in natural language processing techniques or linguistics. From our topic normalized corpus we chose 20 pairs of (great, typical) articles and 20 pairs of (very good, typical) articles for annotation. We also created 10 pairs in which both articles came from the great or very good categories. In each case, the typical article is one of the 10 most similar articles to the good sample, but the pairs span a range of similarity values, as noted in the previous section.

The student read each article in a pair and answered two questions. The order of articles in a pair was randomly assigned and the pairs were also randomly presented. A computer interface was used for the annotation. It showed the two articles on the screen and the following questions.

Is the topic of the articles the same? For example, when both articles are about ‘controversies related to vaccination’ we may consider them highly similar. When both are about ‘vaccines’, they are medium similar and when one is about archaeology and other about chemistry, you may consider them not at all similar. Therefore varying degrees of similarity can be assigned to articles. The scale for this rating is 1 (not same) to 10 (almost exactly same).

Which article is more interesting to read? Give an overall rating for how much you would prefer to read one article versus another. You may find one article more interesting because it is more informative, written creatively or captivates your attention. Indicate your preference on the following scale: a) prefer article A very much b) prefer article A somewhat c) no preference d) prefer article B somewhat e) prefer article B very much
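The randomized presentation described above (random order of the two articles within each pair, and random order of the pairs) could be set up along the following lines. This is only a minimal sketch: the function name `build_presentation`, the fixed seed, and the pair and article identifiers are hypothetical and are not taken from the annotation interface actually used.

```python
import random


def build_presentation(pairs, seed=0):
    """Shuffle the pair order and randomly assign each article to slot A or B.

    `pairs` is a list of (pair_id, article_1, article_2) tuples.
    The seed is an assumption, used only so that a session can be reproduced.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)  # random order in which the pairs are shown
    presentation = []
    for pair_id, first, second in shuffled:
        if rng.random() < 0.5:  # random left/right position within the pair
            first, second = second, first
        presentation.append({"pair_id": pair_id, "A": first, "B": second})
    return presentation


# Hypothetical identifiers following the study design: 20 + 20 + 10 = 50 pairs.
pairs = (
    [(f"great-typical-{i}", f"great_{i}", f"typical_g{i}") for i in range(20)]
    + [(f"verygood-typical-{i}", f"verygood_{i}", f"typical_v{i}") for i in range(20)]
    + [(f"good-good-{i}", f"good_a{i}", f"good_b{i}") for i in range(10)]
)
session = build_presentation(pairs)
```

Hiding the category of each article behind the neutral labels A and B keeps the annotator from inferring which article the experts preferred.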

We provided the annotator with 10 practice pairs of articles to familiarize herself with the task and scales for ratings. Then the 50 pairs that we described above were provided. First, we provide an analysis of the similarity ratings from the annotator. We compare the automatic measure we used for pairing articles (cosine overlap of topic words) with the annotator’s ratings for similarity. These values are plotted in Figure 3.1. The Pearson correlation between the automatic measure and annotator scores is rather high, 0.57 (p-value of 1.5e-5). Therefore the similarity metric used for topic normalization is quite reliable.
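The comparison above can be illustrated with a short sketch. It assumes that the automatic measure is cosine similarity over topic-word counts; the exact weighting of topic words is not restated in this section, so this is an assumption. The function names and the toy word lists are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


def cosine_overlap(words_a, words_b):
    """Cosine similarity between two bags of topic words (assumed count weighting)."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty word list shares nothing with any article
    return dot / (norm_a * norm_b)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy, made-up topic words for two articles (not data from the corpus).
similarity = cosine_overlap(
    ["vaccine", "autism", "controversy"],
    ["vaccine", "influenza", "season"],
)
```

In the study, one such automatic similarity value per pair would be correlated with the annotator's 1 to 10 rating for the same pair, for example with `pearson(automatic_scores, annotator_scores)` over the 50 pairs.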

For the ratings of quality, we have summarized the results in Table 3.11. The first column indicates what type of pair was compared. The ‘Good is better’ column presents the number of examples where the annotator chose the great or very good article as better than the typical article. We had two levels of preference, ‘very much better’ and ‘better’. We present the combined counts for both these levels since the number of examples in our annotation study is not large. Similarly, we indicate the number of times a typical article was preferred over the great or very good articles. ‘No pref.’ indicates that neither article was preferred over the other.

For the pairs comparing a great article with a typical article, we find that the great article is chosen as better in 14 out of 20 pairs. This result indicates that the annotator had a clear preference for the great articles, aligning with the judgements of the expert journalists.

Figure 3.1: Similarity values computed using topic words versus annotator’s similarity ratings

Type of pair            No. pairs  No pref.  Good is better  Typical is better
great vs. typical       20         1         14              5
very good vs. typical   20         9         6               5
great vs. very good     10         5         3               2
Total pairs             50

The trend for the very good versus typical articles is not as strong. Close to half the pairs were judged as ‘no preference’ and the remaining cases were almost equally divided between preferring the very good (6 times) and typical (5 times) articles. Our simple heuristic of treating the articles written by the authors of great articles as very good writing is therefore not supported by these ratings. Further examination should be done to understand whether our heuristic works well. For example, these ratings were provided by a student. It would be interesting to examine how a professional writer or journalism student would rate the same articles. Furthermore, obtaining ratings from a number of annotators and averaging them would provide better normalization over people’s individual preferences. For now, we will continue to use the categories developed by our heuristics and leave further annotation and cleaning of article categories for future work.

For the comparison of great and very good articles, however, the results are close to what we expected. Half the pairs are rated as ‘no preference’, indicating that both articles could be of good quality.
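As a rough check on how strong these preferences are, the counts reported above for the two comparisons against typical articles can be turned into proportions and, as one possible option, an exact two-sided sign test that ignores the ‘no preference’ pairs. The sign test is not part of the original study; it is added here purely as an illustrative assumption, and the names `counts`, `sign_test_p` and `summary` are hypothetical.

```python
from math import comb

# Counts copied from the results reported above:
# (no preference, good article preferred, typical article preferred)
counts = {
    "great vs. typical": (1, 14, 5),
    "very good vs. typical": (9, 6, 5),
}


def sign_test_p(k, n):
    """Exact two-sided binomial p-value for k wins out of n non-tied pairs (p = 0.5)."""
    if n == 0:
        return 1.0
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


summary = {}
for pair_type, (no_pref, good, typical) in counts.items():
    total = no_pref + good + typical
    summary[pair_type] = {
        "good_share": good / total,        # fraction of pairs where the good article won
        "no_pref_share": no_pref / total,  # fraction of pairs with no preference
        "sign_test_p": sign_test_p(good, good + typical),  # ties are dropped
    }
```

With only 20 pairs per condition and a single annotator, any such test has little power, so the output is best read as a descriptive aid, not as evidence for or against the heuristic.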

In document Predicting Text Quality: Metrics for Content, Organization and Reader Interest (Page 67-73)