
Temporal Information Extraction

In document Temporal search in web archives (Page 50-54)

2.2 Natural Language Processing

2.2.3 Temporal Information Extraction

Extracting temporal information such as temporal expressions and event expressions from text documents is important to obtain a temporal interpretation of the documents’ contents. Later, in Chapter 5, we describe techniques that make use of temporal expressions (e.g., “January 15, 2009”, “last week”, and “in 2009”).

In this section, we provide a brief overview of temporal expressions and existing approaches for their extraction and annotation.

Classes of Temporal Expressions

Alonso et al. [AGBY07] distinguish three classes of temporal expressions:

• Explicit. These include temporal expressions such as “January 5, 2009”, “December 1996”, or “in 1945” that have an immediate interpretation.

Algorithm 1: Viterbi algorithm (Data: HMM and observed output {σ1, . . . , σm}; Result: maximum probability state sequence ρ)

• Implicit. These include temporal expressions such as “Christmas 2001”, “Boxing Day 1995”, or “New Year’s Eve 2000”. Their interpretation requires background knowledge (e.g., that the expression New Year’s Eve implicitly refers to December 31).

• Relative. These include temporal expressions such as “yesterday”, “last week”, or “in January”. When interpreting them, a temporal anchor (e.g., the publication time of the document) is needed.
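The three classes above can be illustrated with a minimal sketch. It is not the method of [AGBY07]: the function name `classify`, the regular expressions, and the tiny month and holiday vocabularies are all assumptions made for illustration, and they cover little more than the examples quoted in the list.

```python
import re

# Hypothetical, deliberately tiny vocabularies; a real system needs far larger ones.
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
HOLIDAYS = r"(?:Christmas|Boxing Day|New Year's Eve)"

# Explicit: a year, optionally preceded by "in" or by a month (and day).
EXPLICIT = re.compile(rf"^(?:in )?(?:{MONTHS} (?:\d{{1,2}}, )?)?\d{{4}}$")
# Implicit: a named day plus a year; the calendar date needs background knowledge.
IMPLICIT = re.compile(rf"^{HOLIDAYS} \d{{4}}$")
# Relative: meaningful only with respect to a temporal anchor.
RELATIVE = re.compile(
    rf"^(?:yesterday|today|tomorrow|last week|next week|in {MONTHS})$"
)


def classify(expression: str) -> str:
    """Assign a temporal expression to one of the three classes, or 'unknown'."""
    # Normalize the typographic apostrophe so "New Year’s Eve" matches.
    text = expression.strip().replace("’", "'")
    if IMPLICIT.match(text):
        return "implicit"
    if EXPLICIT.match(text):
        return "explicit"
    if RELATIVE.match(text):
        return "relative"
    return "unknown"


for example in ["January 5, 2009", "New Year’s Eve 2000", "in January"]:
    print(example, "->", classify(example))
```

Note that “in 1945” and “in January” look alike on the surface but fall into different classes, because only the former fixes a point on the timeline without an anchor.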

Extraction of Temporal Expressions

The extraction of temporal expressions can be separated into an identification phase and an interpretation phase. In the identification phase, parts of the text that constitute temporal expressions are identified. In the interpretation phase, the meaning of the identified temporal expressions is determined by mapping them onto the timeline. For relative temporal expressions, this includes determining the right temporal anchor and resolving the temporal expression relative to this anchor. Notice that the separation into these two phases is conceptual – actual tools may interleave the two phases.
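The two phases can be sketched as two separate functions. This is a toy illustration, not how any of the tools discussed here work: the names `identify`, `interpret`, and `PATTERN` are hypothetical, only four kinds of expressions are covered, the timeline is modeled as pairs of days, weeks are assumed to run from Monday to Sunday, and “today” is always taken to mean the anchor day.

```python
import re
from datetime import date, timedelta

# Hypothetical hand-written pattern covering only a few expressions.
PATTERN = re.compile(r"\b(?:yesterday|today|last week|in \d{4})\b")


def identify(text):
    """Identification phase: return (start, end, surface string) for each match."""
    return [(m.start(), m.end(), m.group(0)) for m in PATTERN.finditer(text)]


def interpret(expression, anchor):
    """Interpretation phase: map an expression onto a (first day, last day) interval.

    `anchor` is the temporal anchor, e.g. the publication date of the document.
    """
    if expression == "today":
        return (anchor, anchor)
    if expression == "yesterday":
        day = anchor - timedelta(days=1)
        return (day, day)
    if expression == "last week":
        # Assumption: weeks run Monday to Sunday.
        monday_this_week = anchor - timedelta(days=anchor.weekday())
        start = monday_this_week - timedelta(days=7)
        return (start, start + timedelta(days=6))
    year_match = re.fullmatch(r"in (\d{4})", expression)
    if year_match:
        # Explicit expression: no anchor needed.
        year = int(year_match.group(1))
        return (date(year, 1, 1), date(year, 12, 31))
    raise ValueError(f"cannot interpret {expression!r}")


text = "He resigned yesterday; elections were held in 2005."
anchor = date(2007, 2, 1)
for start, end, surface in identify(text):
    print(surface, "->", interpret(surface, anchor))
```

Keeping the phases apart makes the role of the anchor visible: `identify` never looks at it, while `interpret` needs it for every relative expression.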

State-of-the-art extraction tools for temporal expressions, such as GUTime [MW00] (part of the TARSQI [VMS+05] toolkit) and TimexTag [AvRdR07], differ in how much they rely on hand-crafted versus learnt rules. GUTime [MW00], on the one hand, uses a hand-crafted set of regular expressions to identify temporal expressions; to interpret them, it employs a combination of hand-crafted and learnt rules. Machine learning is applied, for instance, to distinguish whether “today” refers to the publication time of the document or carries the meaning of “nowadays”. TimexTag [AvRdR07], on the other hand, relies entirely on statistical learning in both the identification and the interpretation phase.

TimeML

TimeML [PCI+03, TIMEML] is a markup language to annotate temporal expressions and event expressions in natural language text. TimeML annotates temporal expressions by enclosing them with the TIMEX3 tag. Figure 2.3 shows an excerpt from a New York Times article, published on February 1, 2007, with temporal expressions annotated using TARSQI.

Hong Kong is poised to hold the first election in more than half <TIMEX3 tid="t3" TYPE="DURATION" VAL="P100Y">a century</TIMEX3> that includes a democracy advocate seeking high office in territory controlled by the Chinese government in Beijing. A pro-democracy politician, Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations to appear on the ballot to become the territory’s next chief executive. But he acknowledged that he had no chance of beating the Beijing-backed incumbent, Donald Tsang, who is seeking re-election.

Under electoral rules imposed by Chinese officials, only 796 people on the election committee -- the bulk of them with close ties to mainland China -- will be allowed to vote in the <TIMEX3 tid="t5" TYPE="DATE" VAL="20070325">March 25</TIMEX3> election. It will be the first contested election for chief executive since Britain returned Hong Kong to China in <TIMEX3 tid="t6" TYPE="DATE" VAL="1997">1997</TIMEX3>. Mr. Tsang, an able administrator who took office during the early stages of a sharp economic upturn in <TIMEX3 tid="t7" TYPE="DATE" VAL="2005">2005</TIMEX3>, is popular with the general public. Polls consistently indicate that three-fifths of Hong Kong’s people approve of the job he has been doing. It is of course a foregone conclusion -- Donald Tsang will be elected and will hold office for <TIMEX3 tid="t9" beginPoint="t0" endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years</TIMEX3>, said Mr. Leong, the former chairman of the Hong Kong Bar Association.

Figure 2.3: New York Times article annotated using TARSQI

Let us explain some features of TimeML and TARSQI by means of this example. TimeML distinguishes different types of temporal expressions – in the example the two types DURATION and DATE are present. Temporal expressions have unique identifiers in TimeML. The publication time of the document – if available – has the unique identifier t0. When interpreting a temporal expression, TARSQI takes into account the publication time of the document. The second temporal expression found (i.e., “Wednesday”) is thus correctly interpreted as January 31, 2007, as can be seen from the value 20070131 of the VAL attribute. For the last temporal expression found (i.e., “another five years”) TARSQI correctly determines that this refers to a five-year period (as can be seen from the value P5Y of the VAL attribute). The begin boundary of this duration is determined as the publication time of the document (hence beginPoint="t0"); its end boundary refers to the temporal expression having the identifier t8 that is thus introduced only implicitly.
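As a rough illustration of how such annotations can be consumed, the sketch below reads TIMEX3 elements with Python's standard `xml.etree.ElementTree` module and re-derives the value for “Wednesday” from the publication time. Several things here are assumptions: the `<doc>` wrapper element, the function names, and above all the rule “the most recent such weekday strictly before the anchor”, which is one plausible reading of the example and not a description of TARSQI's actual logic.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

# Two sentences shortened from Figure 2.3; the TIMEX3 tags are copied unchanged.
SNIPPET = (
    'Alan Leong, announced <TIMEX3 tid="t4" TYPE="DATE" '
    'VAL="20070131">Wednesday</TIMEX3> that he had obtained enough nominations. '
    'Donald Tsang will hold office for <TIMEX3 tid="t9" beginPoint="t0" '
    'endPoint="t8" TYPE="DURATION" VAL="P5Y">another five years</TIMEX3>.'
)


def timex3_annotations(fragment):
    """Return one dict per TIMEX3 element: its attributes plus the covered text."""
    # Wrap the fragment in a hypothetical <doc> root so that it parses as XML.
    root = ET.fromstring(f"<doc>{fragment}</doc>")
    return [{"text": el.text, **el.attrib} for el in root.iter("TIMEX3")]


def most_recent_weekday(anchor, weekday):
    """Latest date strictly before `anchor` that falls on `weekday` (Monday = 0).

    Assumed resolution rule for illustration only.
    """
    days_back = (anchor.weekday() - weekday) % 7 or 7
    return anchor - timedelta(days=days_back)


publication_time = date(2007, 2, 1)  # identifier t0 in the figure
annotations = timex3_annotations(SNIPPET)
wednesday = most_recent_weekday(publication_time, weekday=2)

for annotation in annotations:
    print(annotation["tid"], annotation["TYPE"], annotation["VAL"], annotation["text"])
print("Resolved 'Wednesday':", wednesday.strftime("%Y%m%d"))
```

Under this rule the resolved date agrees with the VAL attribute of t4 in the figure. The duration t9 carries its boundaries as references (beginPoint and endPoint) to other identifiers, so computing its end date would additionally require the value of t8, which the excerpt does not show.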

Our description in this section focuses on temporal expressions and their extraction. For a broader perspective on temporal information extraction and temporal issues in natural language processing, we refer to the survey by Verhagen and Moszkowicz [VM09] and the textbook by Mani et al. [MPG05].
