LX-Parser and LX-DepParser - Natural Language Processing Tools Used

4.3 Natural Language Processing Tools Used

4.3.2 LX-Parser and LX-DepParser

In some of the work reported below, syntactic information is used. This information

is derived from two parsers: LX-Parser (Silva et al., 2010), a constituency parser,

and LX-DepParser (Reis, 2010), a dependency parser.

Figure 4.3 shows a syntactic tree produced by LX-Parser for a sentence that

is a shorter version of this chapter’s working example, so that it fits on the page. The actual output format of LX-Parser is a bracketed representation of the tree, as shown in Figure 4.4.
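A bracketed representation of this kind can be read back into a tree with a short recursive reader. The following is a minimal sketch, not part of LX-Parser: the names `Node`, `parse_bracketed` and `leaves` are hypothetical, and the sketch assumes the conventions visible in Figure 4.4, namely that every node is written as `(LABEL child ...)` and that each word is itself wrapped in parentheses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node; a node without children is a word (leaf)."""
    label: str
    children: list = field(default_factory=list)


def parse_bracketed(text):
    """Parse a string such as '(PP (P (Em)) (NP (N (Washington))))'."""
    # Pad the parentheses with spaces so that split() isolates them.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        if pos >= len(tokens):
            raise ValueError("missing label at end of input")
        node = Node(tokens[pos])
        pos += 1
        # Every child is itself a parenthesised node.
        while pos < len(tokens) and tokens[pos] == "(":
            node.children.append(read_node())
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("expected ')' at token %d" % pos)
        pos += 1
        return node

    root = read_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return root


def leaves(node):
    """Return the words of the tree from left to right."""
    if not node.children:
        return [node.label]
    return [word for child in node.children for word in leaves(child)]


# Fragment taken from the bracketed output in Figure 4.4.
tree = parse_bracketed("(PP (P (Em)) (NP (N (Washington))))")
print(tree.label, leaves(tree))
```

Reading the words back with `leaves` is a quick way to check that the tree covers the same tokens as the input sentence.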

LX-Parser is based on the Stanford parser of Klein & Manning (2003). It was

trained for Portuguese with mostly news articles. Under the Parseval metric it

achieves an F-measure of 88% (value obtained through 10-fold cross-validation).

LX-DepParser produces dependency graphs for input sentences. An example

can be seen in Figure 4.5. Once again, the parser’s output is actually textual. More specifically, it follows the CoNLL format. It is organized in rows and columns, with each row representing a word and each column a specific piece of information about that word. A slightly abridged example, from which columns irrelevant to the present discussion were removed, can be seen in Figure 4.6. There, the leftmost column contains a numeric identifier for the word. The second column shows the surface form of the word as it occurs in the text. The third, fourth and fifth columns contain properties of the word identified by LX-Suite: respectively its lemma, part-of-speech and inflection tag. The last two columns describe the dependency graph. The sixth column contains the identifier of the word that the current word depends on, and the last column shows the name of the dependency relation. The main verb


<s>Em Washington,<TIMEX3 tid="t53" type="DATE" value="1998-01-14" temporalFunction="true" functionInDocument="NONE"

anchorTimeID="t52">hoje</TIMEX3>, a Federal Aviation Administration<EVENT eid="e1" class="OCCURRENCE" stem="publicar" aspect="NONE" tense="PPI"

polarity="POS" pos="VERB">publicou</EVENT> gravações do controlo de tráfego aéreo da<TIMEX3 tid="t54" type="TIME" value="1998-XX-XXTNI"

temporalFunction="true" functionInDocument="NONE"

anchorTimeID="t52">noite</TIMEX3> em que o voo TWA800<EVENT eid="e2" class="OCCURRENCE" stem="cair" aspect="NONE" tense="PPI" polarity="POS" pos="VERB">caiu</EVENT>.</s>

<s><w pos="PREP">Em</w> <w pos="PNM">Washington</w><w pos="PNT">,</w> <TIMEX3 tid="t53" type="DATE" value="1998-01-14" temporalFunction="true" functionInDocument="NONE" anchorTimeID="t52"><w pos="ADV">hoje</w></TIMEX3><w pos="PNT">,</w> <w pos="DA"

morph="fs">a</w> <w pos="PNM">Federal</w> <w pos="PNM">Aviation</w> <w pos="PNM">Administration</w> <EVENT eid="e1" class="OCCURRENCE"

stem="publicar" aspect="NONE" tense="PPI" polarity="POS" pos="VERB"><w pos="V" lemma="PUBLICAR" morph="ppi-3s">publicou</w></EVENT> <w pos="CN" lemma="GRAVAÇÃO" morph="fp">gravações</w> <c><w pos="PREP" surface="de"/><w pos="DA" morph="ms" surface="o"/><cs>do</cs></c> <w

pos="CN" lemma="CONTROLO" morph="ms">controlo</w> <w pos="PREP">de</w> <w pos="CN" lemma="TRÁFEGO" morph="ms">tráfego</w> <w pos="ADJ"

lemma="AÉREO" morph="ms">aéreo</w> <c><w pos="PREP" surface="de"/><w pos="DA" morph="fs" surface="a"/><cs>da</cs></c> <TIMEX3 tid="t54" type="TIME" value="1998-XX-XXTNI" temporalFunction="true" functionInDocument="NONE"

anchorTimeID="t52"><w pos="CN" lemma="NOITE" morph="fs">noite</w></TIMEX3> <w pos="PREP">em</w> <w pos="REL">que</w><w pos="DA"

morph="ms">o</w> <w pos="CN" lemma="VOO" morph="ms">voo</w> <w pos="ADJ" lemma="TWA800" morph="ms">TWA800</w> <EVENT eid="e2" class="OCCURRENCE" stem="cair" aspect="NONE" tense="PPI" polarity="POS" pos="VERB"><w pos="V" lemma="CAIR" morph="ppi-3s">caiu</w></EVENT><w pos="PNT">.</w></s>

Figure 4.3: Example parse tree produced by LX-Parser. The sentence translates to English as In Washington today, the Federal Aviation Administration released air traffic control tapes.

(S (S (PP (P (Em)) (NP (N (Washington)))) (S (ADV’ (PNT (,)) (ADV (hoje)) (PNT (,))) (S (NP (ART (a)) (N’ (N (Federal)) (N (Aviation)) (N (Administration)))) (VP (V (publicou)) (NP (N’ (N (gravações)) (PP (P (de)) (NP (ART (o)) (N’ (N (controlo)) (PP (P (de)) (NP (N’ (N (tráfego))) (A (aéreo))))))))))))))

Figure 4.4: Example output of LX-Parser corresponding to the tree in Figure 4.3.

The sentence translates to English as In Washington today, the Federal Aviation Administration released air traffic control tapes.

is represented as depending on word 0 (which does not exist), with the dependency relation being ROOT.
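The seven-column layout described above can be loaded with a few lines of code. The sketch below is illustrative only: the names `Token`, `read_rows` and `find_root` are hypothetical, the sample rows are made up for this example (they are not the contents of Figure 4.6), and `DEP` is a placeholder rather than an actual LX-DepParser relation name. The only conventions taken from the text are the column order and the use of head 0 with the relation ROOT for the main verb; the underscore for an empty field follows common CoNLL practice.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int      # column 1: numeric identifier of the word
    form: str     # column 2: surface form
    lemma: str    # column 3: lemma
    pos: str      # column 4: part-of-speech tag
    infl: str     # column 5: inflection tag
    head: int     # column 6: identifier of the governing word (0 = none)
    deprel: str   # column 7: name of the dependency relation


# Hypothetical rows for "o voo caiu"; 'DEP' is a placeholder label.
SAMPLE = """\
1\to\t_\tDA\tms\t2\tDEP
2\tvoo\tVOO\tCN\tms\t3\tDEP
3\tcaiu\tCAIR\tV\tppi-3s\t0\tROOT
"""


def read_rows(text):
    """Turn one sentence in the abridged column format into Token objects."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        idx, form, lemma, pos, infl, head, deprel = line.split("\t")
        tokens.append(Token(int(idx), form, lemma, pos, infl, int(head), deprel))
    return tokens


def find_root(tokens):
    """Return the word attached to the non-existent word 0."""
    roots = [t for t in tokens if t.head == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %d" % len(roots))
    return roots[0]


sentence = read_rows(SAMPLE)
print(find_root(sentence).form)
```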

LX-DepParser was developed based on the MSTParser (McDonald et al., 2005)

and trained on the same corpus as LX-Parser. Its accuracy is 86.8%.

As these examples show, the two parsers sometimes produce conflicting analyses. For instance, the dependency representation corresponding to the syntactic tree for this sentence would have the word hoje “today” depending on the main verb form publicou “released”, since the structure in Figure 4.3 is meant to indicate that this adverbial modifies a syntactic constituent headed by that verb. Instead, the dependency parser wrongly analyzes it as a modifier of the prepositional phrase em Washington “in Washington”. Therefore, if syntactic information is to be exploited for temporal relation classification (or any other problem), the choice of parser can affect the results.
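One simple way to make such disagreements visible is to compare, word by word, the head that each analysis assigns. The sketch below is only an illustration: `head_disagreements` is a hypothetical helper, and the two dictionaries are hand-written for this example. In particular, treating the preposition Em as the head of the phrase em Washington is an assumption made here, since the text does not say which word the parser actually selects.

```python
def head_disagreements(heads_a, heads_b):
    """Return the words whose head differs between two analyses.

    Each argument maps a word to the word it depends on. Only words
    present in both analyses are compared.
    """
    shared = heads_a.keys() & heads_b.keys()
    return sorted(w for w in shared if heads_a[w] != heads_b[w])


# Heads implied by the constituency tree (hand-written for illustration).
tree_heads = {"Washington": "Em", "hoje": "publicou"}
# Heads from the dependency analysis; 'Em' as head of the phrase is assumed.
parser_heads = {"Washington": "Em", "hoje": "Em"}

print(head_disagreements(tree_heads, parser_heads))
```

Keying the dictionaries by word form keeps the example short; a real comparison would use the numeric word identifiers, since the same form can occur more than once in a sentence.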

The representations produced by these parsers are aligned with the word tokens coming from LX-Suite, so no additional alignment efforts are required.

Figure 4.5: Example dependency graph produced by LX-DepParser

4.4 Classifier Features
