AUTOMATIC TEXT SIMPLIFICATION
3.2 Data-Driven Approaches to ATS
3.2.3 Lexico-Syntactic Simplification
Zhu et al. (2010) proposed a tree-based simplification model inspired by syntax-based machine translation (Yamada and Knight, 2001). It was the first statistical simplification model to cover splitting, dropping, reordering and substitution. Zhu et al. (2010) paired the corresponding articles in EW and SEW, extracted plain texts, used the Stanford parser (Klein and Manning, 2003b) for sentence boundary detection and tokenisation, applied sentence-level TF*IDF to align the corresponding sentence pairs (original and simplified), and trained the tree-based text simplification model (TSM) on the full parse trees. A few examples of the output of their system are presented in Table 3.6.
Table 3.6: Examples of the output of the TS system proposed by Zhu et al. (2010)
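The sentence-alignment step of this pipeline can be illustrated with a minimal sketch. It assumes that each sentence is treated as a document for the IDF statistics, that similarity is measured by the cosine between TF*IDF vectors, and that a pair is kept only above a similarity threshold. The tokeniser, the smoothed IDF formula, the threshold value and the function names are illustrative assumptions, not details reported by Zhu et al. (2010).

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tfidf_vectors(sentences):
    """Return one sparse TF*IDF vector (a dict) per sentence.

    Each sentence counts as one 'document' when computing IDF.
    """
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumption) so that no term receives zero weight.
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}
    return [
        {tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(original, simple, threshold=0.3):
    """Pair each simple sentence with its most similar original sentence.

    Returns (original_index, simple_index, score) triples; the threshold
    is a hypothetical value chosen for this example only.
    """
    if not original:
        return []
    vectors = tfidf_vectors(original + simple)
    orig_vecs, simp_vecs = vectors[:len(original)], vectors[len(original):]
    pairs = []
    for j, sv in enumerate(simp_vecs):
        scores = [cosine(ov, sv) for ov in orig_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((best, j, scores[best]))
    return pairs


# Original (EW) and simplified (SEW) sentences quoted from Zhu et al. (2010).
original = [
    "An umbrella term is a word that provides a superset or grouping of "
    "related concepts, also called a hypernym.",
    "Almost as soon as he leaves, Annius and the guard Publius arrive to "
    "escort Vitellia to Titus, who has now chosen her as his empress.",
]
simple = [
    "Almost as soon as he leaves, Annius and the guard Publius arrive to "
    "take Vitellia to Titus, who has now chosen her as his empress.",
    "An umbrella term is a word that provides a superset or grouping of "
    "related concepts.",
]

pairs = align(original, simple)
for i, j, score in pairs:
    print(f"original[{i}] <-> simple[{j}] (cosine = {score:.2f})")
```

In this toy input the two article pairs share no vocabulary with each other, so each simplified sentence is matched to its own source sentence; real article pairs overlap far more, which is why the threshold matters in practice.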
Ex. Version Sentence
(1) EW “Genetic engineering has expanded the genes available to breeders to utilize in creating desired germlines for new crops.”
TSM “Engineering has expanded the genes available to breeders to use in making germlines for new crops.”
SEW “New plants were created with genetic engineering.”
(2) EW “An umbrella term is a word that provides a superset or grouping of related concepts, also called a hypernym.”
TSM “An umbrella term is a word. A word provides a superset of related concepts, called a hypernym.”
SEW “An umbrella term is a word that provides a superset or grouping of related concepts.”
(3) EW “Almost as soon as he leaves, Annius and the guard Publius arrive to escort Vitellia to Titus, who has now chosen her as his empress.”
TSM “Annius and the guard Publius arrive to take Vitellia to Titus. Titus has now chosen her as his empress.”
SEW “Almost as soon as he leaves, Annius and the guard Publius arrive to take Vitellia to Titus, who has now chosen her as his empress.”
EW – English Wikipedia (original); TSM – tree-based TS model proposed by Zhu et al. (2010); SEW – Simple English Wikipedia. All examples are taken from the study by Zhu et al. (2010).
The first example (Table 3.6) illustrates dropping (Genetic, and desired) and substitution (utilize → use, and creating → making) performed by the proposed ATS model. In the second example, the TSM system performs dropping (also) and sentence splitting operations. The third example combines sentence splitting with substitution (escort → take). The TSM system outperformed the standard PB-SMT system in the Moses toolkit trained on the same dataset and several other baselines (Zhu et al., 2010).
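The operations visible in the first example of Table 3.6 can be approximately recovered by a token-level comparison of the original and simplified sentences. The following sketch uses Python's standard difflib module for this purpose; it is an analysis aid under simple assumptions (lowercased alphanumeric tokens, surface matching only) and is not part of the TSM model itself. The function names are hypothetical.

```python
import difflib
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def edit_operations(original, simplified):
    """List the non-matching spans between two sentences.

    Each item is (tag, original_tokens, simplified_tokens), where tag is
    'delete' (dropping), 'replace' (substitution) or 'insert'.
    """
    a, b = tokenize(original), tokenize(simplified)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# EW and TSM versions of example (1), quoted from Zhu et al. (2010).
ew = ("Genetic engineering has expanded the genes available to breeders "
      "to utilize in creating desired germlines for new crops.")
tsm = ("Engineering has expanded the genes available to breeders "
       "to use in making germlines for new crops.")

ops = edit_operations(ew, tsm)
for tag, old, new in ops:
    print(tag, old, "->", new)
```

A surface diff of this kind reports the dropping of Genetic and the substitution utilize → use separately, but it merges the adjacent substitution creating → making and the dropping of desired into a single replaced span, so it only approximates the operation inventory described above.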
Woodsend and Lapata (2011a) followed the idea presented by Yatskar et al. (2010) but, instead of learning only lexical simplifications, they used quasi-synchronous grammar (Smith and Eisner, 2006) to learn a wide range of rewriting transformations for text simplification. Woodsend and Lapata (2011a) trained two systems, one using SEW revision histories (REVH), and the other using the simplification corpus made of aligned sentences from EW and SEW (ALIGNED). The proposed systems were fully automated and required no human intervention at any stage. A comparison of the output of those systems with the ‘gold standard’ Simple English Wikipedia articles and two baselines showed that the proposed systems produce informative articles that are simpler to read than the output of the baselines (Woodsend and Lapata, 2011a). As a lexical simplification baseline, the authors used simplification lists made by Spencer Kelly (SPLIST, see Section 3.2.1). The other baseline was the tree-based TS system proposed by Zhu et al. (2010). Two examples of original sentences and their simplified versions produced by various systems are presented in Table 3.7.
Table 3.7: Comparison of lexico-syntactic data-driven TS systems
Version Sentence
EW “Wonder has recorded several critically acclaimed albums and hit singles, and writes and produces songs for many of his label mates and outside artists as well.”
Zhu et al. “Wonder has recorded several praised albums and writes and produces songs. Many of his label mates and outside artists as well.”
ALIGNED “Wonder has recorded several critically acclaimed albums and hit singles. He produces songs for many of his label mates and outside artists as well. He writes.”
REVH “Wonder has recorded many critically acclaimed albums and hit singles. He writes. He makes songs for many of his label mates and outside artists as well.”
SEW “He has recorded 23 albums and many hit singles, and written and produced songs for many of his label mates and other artists as well.”
EW “The London journeys In 1790, Prince Nikolaus died and was succeeded by a thoroughly unmusical prince who dismissed the entire musical establishment and put Haydn on a pension.”
Zhu et al. “The London journeys in 1790, prince Nikolaus died and was succeeds by a son became prince. A son became prince told the entire musical start and put he on a pension.”
ALIGNED “The London journeys In 1790, Prince Nikolaus died. He was succeeded by a thoroughly unmusical prince. He dismissed the entire musical establishment. He put Haydn on a pension.”
REVH “The London journeys In 1790, Prince Nikolaus died. He was succeeded by a thoroughly unmusical prince. He dismissed the whole musical establishment. He put Haydn on a pension.”
SEW “The London journeys In 1790, Prince Nikolaus died and his son became prince. Haydn was put on a pension.”
EW – English Wikipedia (original); Zhu et al. – tree-based ATS system proposed by Zhu et al. (2010); ALIGNED – ATS system by Woodsend and Lapata (2011a) trained on aligned sentences; REVH – ATS system by Woodsend and Lapata (2011a) trained using SEW revision histories; SEW – Simple English Wikipedia. All examples are taken from the study by Woodsend and Lapata (2011a).
Narayan and Gardent (2014) combined a probabilistic module for splitting and deletion with a monolingual translation model for phrase substitution and reordering. The proposed ATS system is based on deep semantic representations (the Discourse Representation Structure, DRS (Kamp, 1981), assigned by Boxer (Curran et al., 2007)). The use of deep semantic representations (instead of sentences and full parse trees used in previous studies) facilitates completion (re-creation of the shared element) in the split sentences and better control over deletion of sentence parts, avoiding deletion of obligatory arguments (Narayan and Gardent, 2014). The following examples of an original sentence (3) and its simplified versions (4) and (5) obtained by the systems proposed by Zhu et al. (2010) and Woodsend and Lapata (2011a), respectively, illustrate the need for the use of deep semantic representation in the splitting operation (Narayan and Gardent, 2014).
(3) “The judge ordered that Chapman should receive psychiatric treatment in prison and sentenced him to twenty years to life.”
(4) “The judge ordered that Chapman should get psychiatric treatment. In prison and sentenced him to twenty years to life.”
(5) “The judge ordered that Chapman should receive psychiatric treatment in prison. It sentenced him to twenty years to life.”
Zhu et al. (2010) fail to copy the shared argument The judge to the second sentence (4), while Woodsend and Lapata (2011a) do not replace the antecedent The judge with a correct pronoun (5). Those errors are due to the fact that both systems (Zhu et al., 2010; Woodsend and Lapata, 2011a) rely solely on syntax. By contrast, the semantically based system proposed by Narayan and Gardent (2014) correctly copies the shared element The judge to the second simplified sentence.
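The intuition behind completion can be shown with a toy sketch. In a semantic representation, both events of sentence (3) refer to the same discourse referent, so the shared agent is available to each of them when the sentence is split, even though it appears only once in the surface string. The data structures below are a deliberately simplified, hypothetical stand-in for a DRS; they reproduce neither Boxer's output format nor the probabilistic splitting model of Narayan and Gardent (2014).

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A toy predicate-argument record (a stand-in for a DRS event)."""
    agent: str      # identifier of the discourse referent filling the agent role
    predicate: str  # remainder of the clause, starting with the verb


def split_with_completion(referents, events):
    """Realise each event as its own sentence, re-creating the shared agent."""
    sentences = []
    for event in events:
        clause = f"{referents[event.agent]} {event.predicate}"
        sentences.append(clause[0].upper() + clause[1:] + ".")
    return sentences


# Hand-built representation of sentence (3); both events share referent x1.
referents = {"x1": "the judge"}
events = [
    Event("x1", "ordered that Chapman should receive psychiatric treatment in prison"),
    Event("x1", "sentenced him to twenty years to life"),
]

split_sentences = split_with_completion(referents, events)
for sentence in split_sentences:
    print(sentence)
```

Because the agent is looked up per event rather than copied from the surface string, the second sentence begins with The judge instead of being left without a subject, as in (4), or receiving an incorrect pronoun, as in (5).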
In the next example of an original sentence (6) and its simplified version (7) produced by the system proposed by Zhu et al. (2010), the system incorrectly deletes the obligatory argument gifts and changes the sentence meaning to giving knights and warriors instead of giving gifts to knights and warriors (Narayan and Gardent, 2014).
(6) “Women would also often give knights and warriors gifts that included thyme leaves as it was believed to bring courage to the bearer.”
(7) “Women also often give knights and warriors. Gifts included thyme leaves as it was thought to bring courage to the saint.”
The probabilistic model trained on semantic representations proposed for handling deletion by Narayan and Gardent (2014) avoids such a deletion of obligatory arguments of a predicate, and thus leads to better meaning preservation.