Sentence Transformations - Data-Driven Detection of Necessary Transformations for Automatic Tex


2.4 Data-Driven Detection of Necessary Transformations for Automatic Text Simplification

2.4.2 Sentence Transformations

A similar idea of using a parallel corpus of original and manually adapted texts to learn the transformations necessary for automatic text simplification was used by Petersen and Ostendorf (2007), Gasperin et al. (2009), and Drndarević and Saggion (2012). This time, the studies focused on specific sentence transformations such as splitting and deletion. Although the studies cannot be directly compared, as they were performed on corpora in different languages and for different target populations, they reveal some interesting phenomena which seem to be independent of the target population and language. The experiments presented in Chapter 4 were mainly inspired by those three previous studies (Petersen and Ostendorf, 2007; Gasperin et al., 2009; Drndarević and Saggion, 2012). Table 2.5 provides a quick overview of the differences and similarities among those three studies.

Table 2.5: Studies on necessary sentence transformations for ATS

                  Petersen-07          Gasperin-09    Drndarevic-12
Language          English              Portuguese     Spanish
Target            Language learners    Low literacy   People with ID
Text genre        News                 News           News
# of sentences    2588                 2685           246
Sent. splitting   Yes                  Yes            No
Sent. deletion    Yes                  No             Yes
Classifier        C4.5 decision tree   SMO (SVM)      SVM

The columns ‘Petersen-07’, ‘Gasperin-09’, and ‘Drndarevic-12’ represent the studies by Petersen and Ostendorf (2007), Gasperin et al. (2009), and Drndarević and Saggion (2012).

In all three cases, the authors were interested in developing a system for automatic simplification of texts for their specific target population. Although the user groups and languages were different, the text genre was the same (news articles) and the observed transformations were similar: some sentences or phrases were deleted, long sentences were split into several shorter ones, long descriptive phrases were shortened, etc. The authors were not interested in changes to vocabulary in any of the three studies, but rather focused on:

1. Differences in part-of-speech usage and phrase types between original and simplified sentences (Petersen and Ostendorf, 2007);

2. Characteristics of sentences which were chosen to be split (Petersen and Ostendorf, 2007; Gasperin et al., 2009);

3. Characteristics of sentences which were deleted (Petersen and Ostendorf, 2007; Drndarević and Saggion, 2012).

Petersen and Ostendorf (2007) used a corpus of 104 original news articles and their abridged versions developed by Literacyworks, which is freely available on the internet¹¹. Gasperin et al. (2009) used corpora from two of the main Brazilian newspapers, Zero Hora and Folha de São Paulo. The first one (a total of 2,116 original sentences) comprises general news articles, while the second one (a total of 569 original sentences) contains texts from the science section. Drndarević and Saggion (2012) used the corpus of news articles obtained from the Spanish news agency Servimedia¹² and compiled under the Simplext project (Saggion et al., 2011).

Petersen and Ostendorf (2007) reported that out of a total of 2,539 original sentences (100%), 30% were dropped, 19% were split (into two or more abridged sentences), 7% were merged (two original sentences merged into one abridged), and 47% of sentences had ‘1-1’ alignment (one original sentence corresponds to one abridged sentence). The proportions of deleted, split, and ‘1-1’ aligned sentences in the Simplext corpus (Štajner et al., 2013) were similar (Table 2.6). The only difference was that in the Simplext corpus, the numbers of split and deleted sentences were practically the same. The analysis of the corpora used by Gasperin et al. (2009), however, revealed a significant difference in comparison with the other two, as there were almost no deleted sentences. This might be interpreted as an interesting difference in simplification strategies when simplifying texts for different target groups. It seems that simplification of texts for language learners and people with intellectual disabilities requires a fair amount of content reduction (reflected in the number of deleted sentences), while simplification for people with low literacy tries to keep all the information which was present in the original text.

¹¹ http://literacynet.org/cnnsf/index cnnsf.html

¹² http://www.servimedia.es/

Table 2.6: Distribution of sentence transformations

                 LiteracyWorks       Wikipedia   PorSimples     Simplext
Language         English             English     Portuguese     Spanish
Genre            News                Wikipedia   News           News
Target           Language learners   Various     Low literacy   People with ID
# of sentences   2,588               90,000      2,685          246
Split            18%                 11%         29%            23%
Deleted          29%                 31%         0.3%           21%
Merged           6%                  7%          0.3%           Unknown
‘1-1’            46%                 51%         70%            55%

The columns ‘LiteracyWorks’, ‘Wikipedia’, ‘PorSimples’, and ‘Simplext’ represent the following four studies conducted on the corresponding corpora: (Petersen and Ostendorf, 2007), (Coster and Kauchak, 2011b), (Gasperin et al., 2009), and (Štajner et al., 2013).
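A distribution like the one in Table 2.6 can be derived from sentence alignments by labelling each alignment group and counting. The following is only a minimal sketch, not the procedure used in any of the cited studies: the function names, the (n_original, n_simplified) input format, and the choice to weight each group by the number of original sentences it contains are all assumptions made for illustration.

```python
from collections import Counter


def alignment_type(n_original: int, n_simplified: int) -> str:
    """Label an alignment group by how many sentences sit on each side."""
    if n_original == 1 and n_simplified == 0:
        return "deleted"   # original sentence has no simplified counterpart
    if n_original == 1 and n_simplified == 1:
        return "1-1"       # one original sentence, one simplified sentence
    if n_original == 1 and n_simplified >= 2:
        return "split"     # one original sentence became several
    if n_original >= 2 and n_simplified == 1:
        return "merged"    # several original sentences became one
    return "other"         # e.g. inserted sentences or n-to-m groups


def transformation_distribution(groups):
    """Return the share of original sentences per transformation type.

    `groups` is an iterable of (n_original, n_simplified) pairs, one per
    alignment group (hypothetical input format). Each group is weighted by
    the number of original sentences it contains (an assumption).
    """
    counts = Counter()
    for n_original, n_simplified in groups:
        counts[alignment_type(n_original, n_simplified)] += n_original
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


if __name__ == "__main__":
    # Toy alignment groups, invented purely for illustration.
    toy_groups = [(1, 1), (1, 1), (1, 2), (1, 0), (2, 1)]
    for label, share in sorted(transformation_distribution(toy_groups).items()):
        print(f"{label}: {share:.0%}")
```

Whether percentages are computed over original sentences or over alignment groups changes the figures, which is one more reason why numbers from different corpora should be compared with care.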

Coster and Kauchak (2011b) introduced a new dataset for text simplification by aligning the sentences from English Wikipedia¹³ (EW) and Simple English Wikipedia¹⁴ (SEW). Simple English Wikipedia offers content similar to that of English Wikipedia, presented using simpler vocabulary and grammar in order to facilitate its comprehension by children, English language learners, people with low literacy levels, and other people with reading difficulties.

¹³ http://en.wikipedia.org/

¹⁴ http://simple.wikipedia.org

Sentences from the EW and SEW were automatically aligned. Two human evaluators judged the automatic sentence alignment in the EW-SEW dataset to be correct in 91% of the cases (on a small sample of 100 sentences), while the other 9% were only partially correct. Out of 137,000 aligned sentence pairs, 27% were identical and were excluded from further analysis. 23% of the remaining original sentences could not be aligned with any simplified sentence, and 27% of the remaining simplified sentences could not be aligned with any original sentence. Among the remaining sentence pairs, the ‘1-1’ alignment (one original to one simple sentence) was found in 37% of the cases, the ‘1-2’ alignment (one original to two simple sentences) in 8% of the cases, and the ‘2-1’ alignment (two original to one simple sentence) in 5% of the cases (Table 2.6). The large number of simplified sentences which could not be aligned with any original sentence is a consequence of the fact that the texts in the SEW were not made as direct simplifications of the corresponding original articles; rather, they were written independently on the same topic. For the same reason, the number of deleted sentences in Table 2.6 is not directly comparable with the number of deleted sentences in the other TS corpora.

Based on word alignment learned using GIZA++ (Och and Ney, 2003), Coster and Kauchak (2011b) focused their study on word transformations and calculated the percentage of sentences which included:

• Rewordings (a normal word is changed to a different simple word): 65%,

• Deletions (a normal word is deleted): 47%,

• Reorderings (non-monotonic alignment): 34%,

• Merges (multiple normal words are condensed to a single simple word): 31%,

• Splits (a normal word is split into multiple simple words): 27%.
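The five categories above can be read off a word alignment fairly mechanically. The sketch below illustrates one possible reading of those definitions; it is not the implementation of Coster and Kauchak (2011b), and it does not parse GIZA++ output. The function names and the input format (token lists plus a collection of (source index, target index) pairs) are hypothetical, and treating case-insensitive string inequality as a "rewording" is a simplifying assumption.

```python
from collections import Counter, defaultdict


def word_operations(source, target, alignment):
    """Return the set of word-level operations observed in one sentence pair.

    source, target: lists of tokens (normal and simple sentence).
    alignment: iterable of (source_index, target_index) pairs
               (hypothetical format, assumed to come from a word aligner).
    """
    pairs = list(alignment)
    src_to_tgt = defaultdict(set)
    tgt_to_src = defaultdict(set)
    for i, j in pairs:
        src_to_tgt[i].add(j)
        tgt_to_src[j].add(i)

    ops = set()
    for i, word in enumerate(source):
        targets = src_to_tgt.get(i, set())
        if not targets:
            ops.add("deletion")      # normal word has no simple counterpart
        elif len(targets) > 1:
            ops.add("split")         # one normal word, several simple words
        else:
            (j,) = targets
            if len(tgt_to_src[j]) > 1:
                ops.add("merge")     # several normal words, one simple word
            elif target[j].lower() != word.lower():
                ops.add("rewording")  # one-to-one link to a different word

    # Non-monotonic alignment: two links that cross each other.
    if any(i1 < i2 and j1 > j2 for i1, j1 in pairs for i2, j2 in pairs):
        ops.add("reordering")
    return ops


def operation_rates(aligned_pairs):
    """Share of sentence pairs that contain each operation at least once."""
    counts = Counter()
    n_pairs = 0
    for source, target, alignment in aligned_pairs:
        n_pairs += 1
        counts.update(word_operations(source, target, alignment))
    if n_pairs == 0:
        return {}
    return {op: count / n_pairs for op, count in counts.items()}


if __name__ == "__main__":
    # Toy sentence pairs and alignments, invented for illustration only.
    toy_corpus = [
        (["The", "physician", "left"], ["The", "doctor", "left"],
         [(0, 0), (1, 1), (2, 2)]),
        (["He", "subsequently", "left"], ["He", "left"],
         [(0, 0), (2, 1)]),
    ]
    print(operation_rates(toy_corpus))
```

Because a single sentence pair can exhibit several operations at once, the resulting percentages are not expected to sum to 100%, which is consistent with the figures listed above.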
