4.1 Syntax as a rough proxy

In order to learn patterns in intentional structure, we need a way to identify the author intention behind each sentence. One approach would be to obtain intention annotations for a reasonable amount of data and use them to train a classifier. But such annotations are publicly available only for one corpus of chemistry journal articles [84] and, more recently, for a corpus of computational linguistics conference publications [158]. To perform the task on a different genre, or even a different subgenre of academic writing such as review summaries, we would need to obtain separate annotations. Further, annotating intentional structure involves several challenges. For many genres, even the related area of science journalism, it would be challenging to predefine the intention categories and obtain reliable annotations. Academic articles have a restricted structure and can moreover be analyzed as individual sections, but it is unclear whether a similar strategy can be developed for other genres. Therefore, rather than rely on manual annotation, we use an insight about sentence syntax to propose an approximate indicator of sentence intention.

We introduce the idea that the syntax of a sentence can act as a rough proxy for its intentional structure. The motivation for using syntax comes from the observation that certain sentence types such as questions and definitions have distinguishable and unique syntactic structure.

For instance, in our previously introduced example from Wikipedia articles (Table 4.1), several syntactic patterns can be found. The first sentences of these articles have the prototypical syntax of definition sentences. Definitions usually have the same structure: they start with the concept to be defined, expressed as a noun phrase, followed by a copular verb (is/are). The predicate contains two parts: the first is a noun phrase reporting the concept as part of a larger class (e.g. an aqueduct is a water supply), and the second is a relative clause listing unique properties of the concept. The second sentences of the articles (1b and 2b), which provide specific details, also have some distinguishing syntactic features, such as the presence of a topicalized phrase providing the focus of the sentence. In this way, the two articles, which have a similar sequence of communicative goals, also have similar syntactic patterns for the sentences in the sequence.

A number of recent studies also support the idea of syntactic patterns in discourse. Cocco et al. (2011) [22] show that significant associations exist between certain part-of-speech tags and sentence types such as explanation, dialog and argumentation in French short stories. For the task of discourse parsing, Lin et al. (2009) [89] report that the syntactic productions from adjacent sentences are powerful features for predicting which discourse relation (cause, contrast, etc.) holds between them.

There is also evidence from the entrainment literature that certain grammatical productions are repeated in adjacent sentences more often than would be expected by chance [38, 140]. Motivated by such patterns, Dubey, Keller and Sturt (2006) [37] and Cheung and Penn (2010) [20] build parsers that take advantage of the syntax of adjacent sentences when parsing the current sentence. The idea is that a production used in the immediately preceding sentence is likely to be relevant for the current sentence as well, given the evidence from syntactic entrainment.

However, these entrainment-based studies have focused only on the repetition of grammatical productions in adjacent sentences. We performed a pilot study to examine whether other types of syntactic patterns are also present in adjacent sentences. In this study we considered all pairs of grammatical productions and investigated whether they appear in adjacent sentences more often than expected by chance.

We use the gold-standard parse trees from the Penn Treebank [100] for this study. Our unit of analysis is a pair of adjacent sentences (S₁, S₂), and we use Section 0 of the corpus, which has 99 documents and 1727 sentence pairs. We enumerate all productions that appear in the syntactic parse of any sentence and exclude those that appear fewer than 25 times, resulting in a list of 197 unique productions. Then all ordered pairs (p₁, p₂) of productions are formed.¹³ There are a total of 197² = 38,809 production pairs.
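The enumeration step can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes each parse is available as a bracketed string with a labeled root node, the helper names (`parse`, `productions`, `frequent_productions`, `ordered_pairs`) are hypothetical, and counting every occurrence of a production (rather than the number of sentences containing it) is an assumption, since the text does not specify which count is thresholded.

```python
import re
from collections import Counter
from itertools import product


def parse(tree_str):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", tree_str)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()


def productions(tree):
    """Yield non-lexical productions 'LHS → RHS' found in a parsed tree."""
    label, children = tree
    subtrees = [c for c in children if isinstance(c, tuple)]
    # Skip lexical rules such as NNP → Phelan: they have word children.
    if subtrees and len(subtrees) == len(children):
        yield f"{label} → {' '.join(c[0] for c in subtrees)}"
    for c in subtrees:
        yield from productions(c)


def frequent_productions(trees, min_count=25):
    """Productions occurring at least min_count times (assumed: token counts)."""
    counts = Counter(p for t in trees for p in productions(t))
    return sorted(p for p, c in counts.items() if c >= min_count)


def ordered_pairs(prods):
    """All ordered pairs (p1, p2), so (a, b) and (b, a) are distinct."""
    return list(product(prods, repeat=2))
```

With 197 frequent productions, `ordered_pairs` returns 197 × 197 = 38,809 pairs, which matches the count reported above.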

For each pair, we compute a 2×2 contingency table with the following components:

• c(p₁ p₂) = number of sentence pairs where p₁ ∈ S₁ and p₂ ∈ S₂

• c(p₁ ¬p₂) = number of sentence pairs where p₁ ∈ S₁ and p₂ ∉ S₂

• c(¬p₁ p₂) = number of sentence pairs where p₁ ∉ S₁ and p₂ ∈ S₂

• c(¬p₁ ¬p₂) = number of sentence pairs where p₁ ∉ S₁ and p₂ ∉ S₂
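The four counts can be computed in a single pass over the sentence pairs. The sketch below assumes each adjacent sentence pair is represented as a tuple of two sets of production strings; the function name `contingency` and this data layout are illustrative choices, not part of the original study.

```python
def contingency(sentence_pairs, p1, p2):
    """Return (c11, c10, c01, c00) for productions p1 and p2.

    sentence_pairs: iterable of (set_of_productions_in_S1,
                                 set_of_productions_in_S2).
    c11 = p1 in S1 and p2 in S2      c10 = p1 in S1 and p2 not in S2
    c01 = p1 not in S1, p2 in S2     c00 = neither
    """
    c11 = c10 = c01 = c00 = 0
    for s1, s2 in sentence_pairs:
        in1, in2 = p1 in s1, p2 in s2
        if in1 and in2:
            c11 += 1
        elif in1:
            c10 += 1
        elif in2:
            c01 += 1
        else:
            c00 += 1
    return c11, c10, c01, c00
```

The four cells always sum to the number of sentence pairs (1727 for Section 0), which is a convenient sanity check.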

We remove the pairs where c(p₁ p₂) is less than three. Then we use a chi-square test to determine whether the observed count c(p₁ p₂) is significantly (95% confidence level) greater or smaller than the value expected if occurrences of p₁ and p₂ were independent.
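A test of this kind can be sketched with the standard Pearson chi-square statistic for a 2×2 table. The text does not say whether a continuity correction was applied, so the version below (no correction) is an assumption, as is the function name `chi_square_2x2`. For one degree of freedom the p-value has the closed form erfc(√(χ²/2)), which avoids any external dependency.

```python
import math


def chi_square_2x2(c11, c10, c01, c00):
    """Pearson chi-square test (1 d.o.f., no continuity correction).

    Returns (chi2, p_value, direction), where direction says whether the
    observed c11 is 'more' or 'less' than expected under independence.
    """
    n = c11 + c10 + c01 + c00
    row1, row0 = c11 + c10, c01 + c00      # p1 present / absent in S1
    col1, col0 = c11 + c01, c10 + c00      # p2 present / absent in S2
    denom = row1 * row0 * col1 * col0
    if denom == 0:                         # a marginal is empty: no test
        return 0.0, 1.0, "none"
    chi2 = n * (c11 * c00 - c10 * c01) ** 2 / denom
    # Survival function of chi-square with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    expected = row1 * col1 / n             # expected c11 if independent
    if c11 > expected:
        direction = "more"
    elif c11 < expected:
        direction = "less"
    else:
        direction = "none"
    return chi2, p_value, direction
```

The direction is reported separately because the chi-square statistic itself is symmetric: it measures the size of the departure from independence, not its sign.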

Given that we are performing the tests for a large number of production pairs (38,809), there is an increased chance of Type I errors (rejecting the null hypothesis when it is actually true). To mitigate this issue, we apply a Bonferroni correction to the p-values from the test. To ensure that an overall 95% confidence level is maintained for the full set of tests, individual p-values should be less than 0.05 / 38,809 ≈ 1.29 × 10⁻⁶. This approach is one of the more conservative techniques for reducing Type I errors.

¹³ (p₁, p₂) and (p₂, p₁) are considered different pairs.
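The Bonferroni threshold described above is a one-line computation. The sketch below uses hypothetical helper names (`bonferroni_threshold`, `significant_pairs`) and follows the text in dividing by the full number of production pairs.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that keeps the family-wise level at alpha."""
    return alpha / num_tests


def significant_pairs(p_values, alpha=0.05, num_tests=None):
    """Keep the keys of p_values whose p-value is below the corrected level.

    p_values: dict mapping a production pair to its p-value.
    num_tests defaults to len(p_values) if not given.
    """
    if num_tests is None:
        num_tests = len(p_values)
    threshold = bonferroni_threshold(alpha, num_tests)
    return [pair for pair, p in p_values.items() if p < threshold]


threshold = bonferroni_threshold(0.05, 197 * 197)   # 0.05 / 38,809
```

Passing `num_tests=38809` explicitly reproduces the setting in the text, where the divisor is the total number of ordered pairs rather than the number of pairs that survive the count filter.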

For this corrected p-value, 25 production pairs turn out to occur significantly more often than chance. No pair was detected as occurring less often than expected. The 25 pairs of the first kind are listed in Table 4.2 along with the number of times they occurred together, c(p₁ p₂). We also divide these pairs into three simple categories: ‘repetitions’, ‘related to quantities’ and ‘other’.

In Dubey, Sturt and Keller (2005) and Cheung and Penn (2010), a similar test was performed to identify productions that are repeated very often in adjacent sentences. They use a slightly different test, which examines whether the probability with which a production appears in a second sentence S₂, given that it appeared in the previous sentence S₁, is greater than the probability with which it generally appears in S₂. Cheung and Penn also compute these productions on the Penn Treebank, albeit on different sections from those in our analysis. Nevertheless, we present some of their results for comparison. In the last column of Table 4.2, we show the top 10 productions that Cheung and Penn report in their paper as having the highest entrainment. Their list is weighted by the frequency of the production.
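The comparison underlying this repetition test can be illustrated as below. This sketch only estimates the two probabilities being compared; how the cited authors assess significance or apply the frequency weighting is not described here, so neither is implemented. The function name and the set-based representation of sentence pairs are assumptions.

```python
def repetition_probabilities(sentence_pairs, p):
    """Estimate P(p in S2 | p in S1) and P(p in S2) for one production p.

    sentence_pairs: list of (set_of_productions_in_S1,
                             set_of_productions_in_S2).
    Returns (conditional, prior); repetition is suggested when
    conditional > prior.
    """
    n = len(sentence_pairs)
    in_s1 = sum(1 for s1, _ in sentence_pairs if p in s1)
    in_s2 = sum(1 for _, s2 in sentence_pairs if p in s2)
    both = sum(1 for s1, s2 in sentence_pairs if p in s1 and p in s2)
    prior = in_s2 / n if n else 0.0
    conditional = both / in_s1 if in_s1 else 0.0
    return conditional, prior
```

Unlike the pairwise chi-square analysis, this formulation only considers the case p₁ = p₂, which is why it cannot surface patterns that involve two different productions.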

A small fraction of the significant pairs (7/25) that we found are indeed repetitions, as pointed out by prior work. Most of these involve quantifier phrases and noun phrases, similar to the top list of Cheung and Penn. However, we also found other regularities that are not repetitions of productions. Some of these sequences are related to quantities and can be explained by the fact that these articles come from the finance domain and often discuss prices and shares. But there is also a class of pairs that are neither repetitions nor readily identifiable as domain-specific.

We analyzed example sentences with these sequence patterns to understand some of the trends. The most frequent pattern in the ‘other’ category, (VP → VB VP | NP-SBJ → NNP NNP), contains a bare verb in S₁ and proper names as the subject of S₂. We found that in such sentence pairs, S₁ is often associated with modals and presents hypotheses or speculations. The following sentence S₂ often has an entity, a person or organization, giving their opinion on the hypothesis. This pattern roughly corresponds to a ‘speculate’ followed by ‘endorse’ sequence of intentions in the sentences. An example sentence pair with these productions is shown below. The spans corresponding to the left-hand side non-terminals in the productions are indicated by square brackets.

| Category (our study) | p₁ | p₂ | c(p₁ p₂) | Cheung and Penn (2010) |
|---|---|---|---|---|
| Repetitions | VP → VBD SBAR | VP → VBD SBAR | 83 | QP → # CD CD |
| | QP → $ CD CD | QP → $ CD CD | 18 | NP → JJ NNPS |
| | NP → $ CD -NONE- | NP → $ CD -NONE- | 16 | NP → NP , ADVP |
| | NP → QP -NONE- | NP → QP -NONE- | 15 | NP → DT JJ CD NN |
| | NP-ADV → DT NN | NP-ADV → DT NN | 10 | PP → IN NP NP |
| | NP-LOC → NP , NP | NP-LOC → NP , NP | 3 | QP → IN $ CD |
| | NP → NP NP-ADV | NP → NP NP-ADV | 7 | NP → NP : NP |
| Related to quantities | NP → QP -NONE- | QP → $ CD CD | 16 | INTJ → UH |
| | QP → $ CD CD | NP → QP -NONE- | 15 | ADVP → IN NP |
| | NP → NP NP-ADV | NP → QP -NONE- | 11 | NP → CD CD |
| | NP-ADV → DT NN | NP → QP -NONE- | 11 | |
| | NP → NP NP-ADV | NP-ADV → DT NN | 9 | |
| | NP-ADV → DT NN | NP → NP NP-ADV | 8 | |
| | NP-ADV → DT NN | NP → $ CD -NONE- | 8 | |
| | NP → $ CD -NONE- | NP-ADV → DT NN | 8 | |
| | NP → NP NP-ADV | QP → CD CD | 6 | |
| | QP → CD CD | NP → NP NP-ADV | 5 | |
| | FRAG → NP-SBJ NP | NP → $ CD -NONE- | 3 | |
| Other | VP → VB VP | NP-SBJ → NNP NNP | 27 | |
| | NP-SBJ-1 → NNP NNP | VP → VBD NP | 13 | |
| | NP-PRD → NP PP | NP-PRD → NP SBAR | 7 | |
| | NP-LOC → NNP | S-TPC-1 → NP-SBJ VP | 6 | |
| | NP-SBJ → NP , NP-LOC , | NP-LOC → NP , NP | 3 | |
| | NP-LOC → NNP | NP-LOC → NP , NP | 3 | |
| | FRAG → NP-SBJ NP | NP-LOC → NP , NP | 3 | |

Table 4.2: The left columns list the production pairs that we identified as occurring in adjacent sentences significantly more often than chance. The top 10 productions that Cheung and Penn (2010) found to be repeated very often are in the rightmost column; this column is a separate list and is not aligned with the rows of our study.

“Markey said we could [have done this in public” because so little sensitive information was disclosed]_VP, the aide said. [Mr. Phelan]_NP-SBJ then responded that he would have been happy just writing a report to the panel, the aide added.

Similarly, in the adjacent sentence pairs from our corpus containing the pair (NP-LOC → NNP | S-TPC-1 → NP-SBJ VP), p₁ often introduced a location name and was associated with the title of a person or organization. The next sentence has a quote from that person, where the quotation forms the topicalized clause in p₂. Here the intentional structure is ‘introduce X’ / ‘statement by X’, as in the following example:

Two years ago, the Rev. Jeremy Hummerstone, vicar of Great Torrington, [Devon]_NP-LOC, got so fed up with ringers who didn’t attend service he sacked the entire band; the ringers promptly set up a picket line in protest. [“They were a self-perpetuating club that treated the tower as sort of a separate premises]_S-TPC-1,” the Vicar Hummerstone says.

These results show that, at least for one domain, the syntax of adjacent sentences exhibits interpretable patterns. Even though the Penn Treebank contains function tags and traces which are not provided by automatic parsers, we can expect that other such syntactic patterns would be present in most domains and genres. Our metric for organization quality aims to characterize syntactic patterns on a broad scale. The model relies on two assumptions which summarize our intuitions about syntax and intentional structure:

1. Sentences with similar syntax are likely to have the same intention or communicative goal.

2. Regularities in intentional structure of articles will be manifested in syntactic regularities between adjacent sentences.

Below we describe the models we developed to learn such syntactic patterns.

In document Predicting Text Quality: Metrics for Content, Organization and Reader Interest (Page 78-83)