4.3 Natural Language Processing Tools Used
4.3.1 Morphological Analysis
LX-Suite (Barreto et al., 2006; Branco & Silva, 2006; Silva, 2007) splits a text into paragraphs and sentences, splits sentences into words, and then annotates each word with its lemma (i.e. its dictionary form), part of speech (whether it is a noun, verb, adjective, etc.), and inflectional morphology (gender and number for nouns and adjectives; person, number and tense for verbs; etc.). It additionally recognizes multi-word names.
Figure 4.1 shows the morphological annotation produced at this stage for an example sentence occurring in a document input to the system. In that figure, the topmost box contains the raw text. The middle box shows the direct output of LX-Suite. The box at the bottom contains the output of LX-Suite, converted to an XML format in such a way that the removal of all XML tags results in the original, unannotated text. This is convenient for alignment purposes, as explained below.
This last representation is the one that is used in subsequent phases. Sentences are enclosed in s tags. Words are associated with w elements and annotated with: their dictionary form (the lemma attribute), their part-of-speech (pos), and their inflectional morphology (morph). There is also a numeric identifier, useful for further processing and debugging (the id attribute).
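As an illustration, this representation could be read as in the following minimal sketch. It assumes Python and its standard xml.etree.ElementTree module; the function name read_words is hypothetical, and the fragment is copied from Figure 4.1. It is not the code used in the system described here.

```python
import xml.etree.ElementTree as ET

# A fragment in the XML format of the bottom box of Figure 4.1.
SENTENCE = (
    '<s><w id="19" lemma="CONTROLO" pos="CN" morph="ms">controlo</w> '
    '<w id="20" lemma="de" pos="PREP">de</w> '
    '<w id="21" lemma="TRÁFEGO" pos="CN" morph="ms">tráfego</w></s>'
)


def read_words(xml_sentence):
    """Return one dict per w element, in document order (hypothetical helper)."""
    root = ET.fromstring(xml_sentence)
    words = []
    for w in root.iter("w"):
        words.append({
            "id": int(w.get("id")),
            "form": w.text,
            "lemma": w.get("lemma"),
            "pos": w.get("pos"),
            # Uninflected words (e.g. prepositions) carry no morph attribute.
            "morph": w.get("morph"),
        })
    return words


WORDS = read_words(SENTENCE)
for word in WORDS:
    print(word["id"], word["form"], word["lemma"], word["pos"], word["morph"])
```

The sketch only covers plain w elements with text content; the parts of contracted forms, which carry a surface attribute instead, are not handled here.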
In the output of LX-Suite, punctuation marks are represented as separate word tokens, and contractions are split into their component elements. For instance, the contracted forms do and da in Figure 4.1 are separated into de “of” and o or a “the.” The parts of speech annotated in that figure are: preposition (PREP), name (PNM), punctuation (PNT), adverb (ADV), definite article (DA), verb (V), common noun (CN), adjective (ADJ), and relative pronoun (REL).
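The treatment of contractions in the XML format of Figure 4.1 can be illustrated with a small sketch, again assuming Python and xml.etree.ElementTree (an assumption made for illustration, not the system's own code). In that format the two parts of a contraction are empty w elements with a surface attribute, and the contracted form itself is kept as the text of a cs element, so the token view and the text view can both be recovered.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the bottom box of Figure 4.1, wrapped in an s element.
FRAGMENT = (
    '<s><w id="14" lemma="GRAVAÇÃO" pos="CN" morph="fp">gravações</w> '
    '<c><w id="16" lemma="de" pos="PREP" surface="de"/>'
    '<w id="17" lemma="o" pos="DA" morph="ms" surface="o"/><cs>do</cs></c> '
    '<w id="19" lemma="CONTROLO" pos="CN" morph="ms">controlo</w></s>'
)

root = ET.fromstring(FRAGMENT)

# Text view: concatenating all text nodes drops the markup and keeps the
# contracted form "do", because the split parts have no text content.
TEXT = "".join(root.itertext())

# Token view: the parts of a contraction are read from the surface attribute.
TOKENS = [w.text if w.text is not None else w.get("surface")
          for w in root.iter("w")]

print(TEXT)
print(TOKENS)
```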
LX-Suite is not completely error-free (note the name TWA800 in Figure 4.1, annotated as an adjective), but its error rates are very low and state-of-the-art for this sort of tool. For instance, the part-of-speech tagger has an accuracy of 96.87% (Branco & Silva, 2006).
Two sources of annotation are often needed together: the original TimeML annotations and the annotations provided by natural language tools such as LX-Suite. The two groups of annotations therefore have to be combined.
The challenge here is that one cannot simply send the annotated data to LX-Suite, as it has no way of knowing what is an annotation and what is linguistic material. Additionally, LX-Suite changes the input text when it splits sentences into words: because it separates punctuation and splits contractions, the number of word tokens, as delimited by whitespace, differs between its input and its output. Therefore, the linguistic material in the two annotated formats needs to be aligned somehow. The approach used is to convert the LX-Suite output, shown in the middle box of Figure 4.1, to the XML format shown in the bottom box
Em Washington, hoje, a Federal Aviation Administration publicou gravações do controlo de tráfego aéreo da noite em que o voo TWA800 caiu.
<s>Em/PREP[O]Washington/PNM[B-LOC] ,*//PNT[O]hoje/ADV[O],*//PNT[O]
a/DA#fs[O] Federal/PNM[B-ORG] Aviation/PNM[I-ORG] Administration/PNM[I-ORG]
publicou/PUBLICAR/V#ppi-3s[O]gravações/GRAVAÇÃO/CN#fp[O]de_/PREP[O]
o/DA#ms[O] controlo/CONTROLO/CN#ms[O]de/PREP[O]
tráfego/TRÁFEGO/CN#ms[O] aéreo/AÉREO/ADJ#ms[O]de_/PREP[O]a/DA#fs[O]
noite/NOITE/CN#fs[O] em/PREP[O] que/REL[O]o/DA#ms[O] voo/VOO/CN#ms[O]
TWA800/TWA800/ADJ#ms[O]caiu/CAIR/V#ppi-3s[O].*//PNT[O] </s>
<s><w id="3" lemma="Em" pos="PREP">Em</w> <w id="4" lemma="Washington" pos="PNM">Washington</w><w id="5" lemma="," pos="PNT">,</w> <w id="6" lemma="hoje" pos="ADV">hoje</w><w id="7" lemma="," pos="PNT">,</w> <w id="8" lemma="a" pos="DA" morph="fs">a</w> <w id="9" lemma="Federal"
pos="PNM">Federal</w> <w id="10" lemma="Aviation" pos="PNM">Aviation</w> <w id="11" lemma="Administration" pos="PNM">Administration</w> <w id="13" lemma="PUBLICAR" pos="V" morph="ppi-3s">publicou</w> <w id="14"
lemma="GRAVAÇÃO" pos="CN" morph="fp">gravações</w> <c><w id="16"
lemma="de" pos="PREP" surface="de"/><w id="17" lemma="o" pos="DA" morph="ms" surface="o"/><cs>do</cs></c> <w id="19" lemma="CONTROLO" pos="CN"
morph="ms">controlo</w> <w id="20" lemma="de" pos="PREP">de</w> <w id="21" lemma="TRÁFEGO" pos="CN" morph="ms">tráfego</w> <w id="22" lemma="AÉREO" pos="ADJ" morph="ms">aéreo</w> <c><w id="24" lemma="de" pos="PREP"
surface="de"/><w id="25" lemma="a" pos="DA" morph="fs"
surface="a"/><cs>da</cs></c> <w id="27" lemma="NOITE" pos="CN"
morph="fs">noite</w> <w id="28" lemma="em" pos="PREP">em</w> <w id="29" lemma="que" pos="REL">que</w> <w id="30" lemma="o" pos="DA"
morph="ms">o</w> <w id="31" lemma="VOO" pos="CN" morph="ms">voo</w> <w id="32" lemma="TWA800" pos="ADJ" morph="ms">TWA800</w> <w id="34" lemma="CAIR" pos="V" morph="ppi-3s">caiu</w><w id="35" lemma="." pos="PNT">.</w></s>
Figure 4.1: Morphological annotation of raw input. The sentence translates to English as “In Washington today, the Federal Aviation Administration released air traffic control tapes from the night the TWA Flight eight hundred went down.”
of that figure. This format has the property that removing all the XML tags yields the original text. This characteristic is important for alignment with the TempEval annotations because TimeML also has this property. As a result, alignment can be performed by comparing character positions in the underlying text, ignoring the annotations.
This is how the two kinds of annotations are combined. Figure 4.2 shows the TimeML annotation for the sentence in Figure 4.1 (top box) and the result of combining it with the automatic morphological annotation (bottom box).
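The character-based alignment described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used here: it is written in Python, the helper names element_spans and align are hypothetical, the TimeML fragment and its eid value are invented for the example, and the regular expression assumes well-formed markup with double-quoted attribute values that contain no ">" character. Each annotated string is scanned once, recording for every element the character span it covers in the tag-free text; TimeML elements are then paired with the words whose spans fall inside theirs.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z0-9_]+)([^>]*?)(/?)>")
ATTR = re.compile(r'([\w:-]+)="([^"]*)"')


def element_spans(markup):
    """Return (plain_text, spans); a span is (name, attrs, start, end),
    with start and end given as offsets into plain_text."""
    parts, spans, stack = [], [], []
    offset = 0  # plain-text characters seen so far
    cursor = 0  # position in the markup string
    for m in TAG.finditer(markup):
        chunk = markup[cursor:m.start()]
        parts.append(chunk)
        offset += len(chunk)
        cursor = m.end()
        closing, name, raw_attrs, self_closing = m.groups()
        attrs = dict(ATTR.findall(raw_attrs))
        if closing:
            open_name, open_attrs, start = stack.pop()
            spans.append((open_name, open_attrs, start, offset))
        elif self_closing:
            spans.append((name, attrs, offset, offset))  # zero-width element
        else:
            stack.append((name, attrs, offset))
    parts.append(markup[cursor:])
    return "".join(parts), spans


def align(timeml_markup, morph_markup):
    """Pair each TimeML element with the w elements inside its span."""
    timeml_text, timeml_spans = element_spans(timeml_markup)
    morph_text, morph_spans = element_spans(morph_markup)
    if timeml_text != morph_text:
        raise ValueError("the two annotations do not cover the same text")
    # Zero-width w elements (the parts of contractions) are skipped here.
    words = [s for s in morph_spans if s[0] == "w" and s[2] < s[3]]
    return [(name, attrs,
             [w[1] for w in words if start <= w[2] and w[3] <= end])
            for name, attrs, start, end in timeml_spans]


# Invented TimeML fragment (hypothetical eid) over text from Figure 4.1.
TIMEML = '<EVENT eid="e1">publicou</EVENT> gravações do controlo'
MORPH = (
    '<w id="13" lemma="PUBLICAR" pos="V" morph="ppi-3s">publicou</w> '
    '<w id="14" lemma="GRAVAÇÃO" pos="CN" morph="fp">gravações</w> '
    '<c><w id="16" lemma="de" pos="PREP" surface="de"/>'
    '<w id="17" lemma="o" pos="DA" morph="ms" surface="o"/><cs>do</cs></c> '
    '<w id="19" lemma="CONTROLO" pos="CN" morph="ms">controlo</w>'
)

ALIGNED = align(TIMEML, MORPH)
for name, attrs, inside in ALIGNED:
    print(name, attrs, [w["lemma"] for w in inside])
```

Comparing the two tag-free strings before pairing makes the sketch fail early if the annotations were produced over different texts. A fuller version would also have to decide how to attach the zero-width parts of contractions, for instance through the span of the enclosing c element.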