HeidelTime’s Algorithm with Domain-dependent Normalization Strategies

3.5 HeidelTime, a Multilingual, Cross-domain Temporal Tagger

3.5.5 HeidelTime’s Algorithm with Domain-dependent Normalization Strategies

In this section, we present HeidelTime’s algorithm with its different phases and the domain-dependent normalization strategies used to fully normalize underspecified and relative temporal expressions.

HeidelTime’s Algorithm

As shown in Figure 3.8, HeidelTime expects as input part-of-speech tagged sentences and user-specified parameters defining which types of expressions are to be annotated (parameter annotate) and which language and domain are used (parameters lang and domain, respectively). In an initialization phase, the parameters are read (line 1) and the resources of the corresponding language are interpreted by HeidelTime’s resource interpreter (line 2) as described in Section 3.5.3. Then, HeidelTime performs the extraction and normalization of temporal expressions by running the following phases: (i) the extraction phase, (ii) the normalization phase, (iii) the disambiguation phase, and (iv) the cleaning phase. In Figure 3.8, these phases are called in lines 6, 7, 9, and 10, respectively. The extraction and normalization phases are called for every sentence (line 4) and for every annotation type (line 5). Note that for each sentence, all rules are applied, as will be further explained in Section 3.5.6.

Resources (one set per language: English, German, Spanish, …): Pattern Resources, Normalization Resources, Rule Resources

Parameters: lang="English", domain="news", annotate = Dates, Times, Durations, Sets (all four checked)

Algorithm

Input: POS-tagged sentences, parameters
 1  readParameters(lang, annotate, domain)
 2  interpretResources(lang)
 3  allTimexes = ()
 4  foreach sent in document
 5      foreach type in annotate
 6          t = extractTimexes(sent, type)
 7          normalizeTimexes(t)
 8          allTimexes.add(t)
 9  disambiguateTimexes(allTimexes, domain)
10  removeInvalidTimexes(allTimexes)

Figure 3.8: HeidelTime’s algorithm reading parameters and resources.
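The control flow of Figure 3.8 can be sketched in Python as follows. This is only an illustration of the phase ordering, not HeidelTime's actual source code: the Timex class, the function tag_document, and the convention of passing the four phases in as callables are assumptions made for the example. The comments refer to the line numbers of the figure.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Timex:
    """Hypothetical container for one extracted temporal expression."""
    text: str          # matched surface string
    type: str          # e.g. "DATE", "TIME", "DURATION", "SET"
    value: str = ""    # normalized value, possibly still underspecified


def tag_document(
    sentences: Sequence[str],
    annotate: Sequence[str],
    domain: str,
    extract: Callable[[str, str], List[Timex]],
    normalize: Callable[[List[Timex]], None],
    disambiguate: Callable[[List[Timex], str], None],
    remove_invalid: Callable[[List[Timex]], List[Timex]],
) -> List[Timex]:
    all_timexes: List[Timex] = []                 # line 3
    for sent in sentences:                        # line 4
        for timex_type in annotate:               # line 5
            found = extract(sent, timex_type)     # line 6: extraction phase
            normalize(found)                      # line 7: local normalization
            all_timexes.extend(found)             # line 8
    # The last two phases run once, after all sentences are processed.
    disambiguate(all_timexes, domain)             # line 9: disambiguation phase
    return remove_invalid(all_timexes)            # line 10: cleaning phase
```

Passing the phases as parameters keeps the sketch self-contained; it also mirrors the point that extraction and normalization are per sentence and per type, whereas disambiguation and cleaning operate on the complete list.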

Extraction Phase & Normalization Phase: Extraction and Local Normalization

During the extraction phase, the extraction parts of the rules are searched in the sentences. During the normalization phase, the – possibly underspecified – normalized values are assigned to the extracted expressions. In the previous section, we detailed the syntax of the rule language and described that further constraints (pos_constraint, offset) may have to be satisfied in the extraction phase, and further attributes (mod, freq, and quant) may have to be normalized in the normalization phase.

Disambiguation Phase: Addressing Underspecified and Overlapping Expressions

After all sentences are processed, underspecified and ambiguous expressions are analyzed in the disambiguation phase. First, all extracted expressions that are part of other temporal expressions are removed. For example, in the phrase “On January 24, 2009, ...”, HeidelTime’s rules match (i) “January 24, 2009”, (ii) “January 24”, (iii) “January”, and (iv) “2009”, but all expressions except the longest one (i) are removed. If overlapping expressions are extracted, e.g., “late Monday” and “Monday morning”, the situation is more difficult and is therefore resolved after the value normalization is finished, as detailed below.
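The removal of contained expressions can be illustrated with a short Python sketch. The names Span, span_of, and drop_contained are hypothetical, and the character-offset representation is an assumption made for the example; the sketch only shows the containment test, under which merely overlapping expressions survive.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    text: str


def span_of(phrase: str, text: str) -> Span:
    """Locate the first occurrence of text in phrase (helper for the example)."""
    start = phrase.find(text)
    return Span(start, start + len(text), text)


def drop_contained(spans: List[Span]) -> List[Span]:
    """Remove every span that lies completely inside another, longer span."""
    def inside(inner: Span, outer: Span) -> bool:
        same_extent = (inner.start, inner.end) == (outer.start, outer.end)
        return outer.start <= inner.start and inner.end <= outer.end and not same_extent

    return [s for s in spans if not any(inside(s, other) for other in spans)]


phrase = "On January 24, 2009, ..."
matches = [span_of(phrase, t) for t in
           ("January 24, 2009", "January 24", "January", "2009")]
kept = drop_contained(matches)          # only "January 24, 2009" remains

phrase2 = "late Monday morning"
overlapping = [span_of(phrase2, "late Monday"), span_of(phrase2, "Monday morning")]
still_both = drop_contained(overlapping)  # neither contains the other
```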

In the next step, all remaining temporal expressions are searched for values starting with “UNDEF”. For these expressions, the reference time and the relation to the reference time are determined, and the values are disambiguated according to this information – depending on the domain.

Then, the overlapping expressions described above are disambiguated. For this task, different strategies may be applied. One possible strategy, which was the first one we realized (see Strötgen and Gertz, 2013a), is to keep only one of two overlapping expressions. Another, more promising and currently applied strategy is to merge both expressions into a single one if both expressions are of the same type (or of types date and time) and if neither of the expressions is matched by a negative rule. If one of the expressions is matched by a negative rule, that expression is removed instead. Thus, HeidelTime does not rely on its rules alone but can also merge expressions, similar to Chronos (Negri and Marseglia, 2004), which used specified composition rules (cf. Section 3.2.5). While determining the new extent is straightforward (e.g., “late Monday” and “Monday morning” are merged into “late Monday morning”), a distinction of cases is needed for the normalization:

• The value attribute is set in the following way: (i) If the two expressions have the same value attribute, this value is used for the merged expression as well. (ii) If they have value attributes of different granularity, the more fine-grained value is used. (iii) If the granularities are equal for both expressions but the values are not identical, the value of the first expression is used.

• Other normalization attributes, such as the modifier attribute, are set in the following way: (i) If an attribute is identical for both expressions or only one of the overlapping expressions has an attribute, it is used for the merged expression. (ii) If two expressions have different attribute contents, the attribute content of the first expression is used.
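The two case distinctions above can be sketched in Python as follows. This is an illustration under stated assumptions rather than HeidelTime's implementation: the Timex class, the functions can_merge and merge, and in particular the granularity measure (counting date fields and adding one for a time part) are hypothetical, as are the example values and the choice to give the merged expression the type of the expression that supplies the value.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Timex:
    start: int
    end: int
    type: str                      # e.g. "DATE" or "TIME"
    value: str
    attrs: Dict[str, str] = field(default_factory=dict)   # e.g. {"mod": "END"}
    negative: bool = False         # matched by a negative rule?


def granularity(value: str) -> int:
    """Assumed measure: number of date fields, plus one if a time part follows."""
    date_part, _, time_part = value.partition("T")
    return len(date_part.split("-")) + (1 if time_part else 0)


def can_merge(a: Timex, b: Timex) -> bool:
    compatible = a.type == b.type or {a.type, b.type} == {"DATE", "TIME"}
    return compatible and not (a.negative or b.negative)


def merge(first: Timex, second: Timex) -> Timex:
    # Value: the finer-grained one wins; identical values or equal
    # granularity fall back to the first expression.
    source = second if granularity(second.value) > granularity(first.value) else first
    # Other attributes: one-sided attributes are kept, conflicts go to the first.
    attrs = dict(second.attrs)
    attrs.update(first.attrs)
    return Timex(min(first.start, second.start), max(first.end, second.end),
                 source.type, source.value, attrs)


text = "late Monday morning"
late_monday = Timex(0, 11, "DATE", "1998-04-27", {"mod": "END"})   # illustrative values
monday_morning = Timex(5, 19, "TIME", "1998-04-27TMO")
merged = merge(late_monday, monday_morning) if can_merge(late_monday, monday_morning) else None
```

In this example the merged expression covers “late Monday morning”, takes the finer-grained value of “Monday morning”, and keeps the modifier that only “late Monday” carries.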

In addition, the user is informed about overlapping expressions (see footnote 40) since these indicate that the rules can probably be improved. In the example, a rule for expressions such as “late Monday morning” should be added. The user can modify the corresponding rules or create new rules, which is quite simple due to the strict separation between the source code and the resources and due to the well-defined rule syntax.

Cleaning Phase: Removing Invalid Expressions

In the cleaning phase, all invalid temporal expressions are deleted, i.e., expressions that were identified by negative rules and thus carry the value “REMOVE”. Since all shorter expressions within these expressions have already been deleted in the disambiguation phase, negative rules correctly fulfill their task of blocking parts of expressions for other rules once the cleaning phase has run. The following example illustrates this procedure. In the phrase “in 2000 kilometers”, the expression “2000” is extracted as a temporal expression by a positive rule. However, “2000 kilometers” is matched by a negative rule (a rule similar to rule_negative_r1 presented in the Rule Syntax Example 5, page 82). During the disambiguation phase, the expression “2000” is removed since it is covered by the longer matched expression “2000 kilometers”. During the cleaning phase, “2000 kilometers” is then removed since it was matched by a negative rule with the value “REMOVE”, so that no expression remains in the phrase “in 2000 kilometers”.
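The interplay of the two phases in the “in 2000 kilometers” example can be traced with a small Python sketch. The names Match, match_of, drop_contained, and clean are hypothetical, and the offset-based representation is an assumption for illustration; only the value “REMOVE” is taken from the text.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    value: str   # normalized value, or "REMOVE" for a negative rule


def match_of(phrase: str, text: str, value: str) -> Match:
    start = phrase.find(text)
    return Match(start, start + len(text), value)


def drop_contained(matches: List[Match]) -> List[Match]:
    """Disambiguation step: drop matches lying inside a longer match."""
    def inside(inner: Match, outer: Match) -> bool:
        same_extent = (inner.start, inner.end) == (outer.start, outer.end)
        return outer.start <= inner.start and inner.end <= outer.end and not same_extent

    return [m for m in matches if not any(inside(m, o) for o in matches)]


def clean(matches: List[Match]) -> List[Match]:
    """Cleaning step: drop everything that a negative rule marked as REMOVE."""
    return [m for m in matches if m.value != "REMOVE"]


phrase = "in 2000 kilometers"
extracted = [
    match_of(phrase, "2000", "2000"),               # positive rule
    match_of(phrase, "2000 kilometers", "REMOVE"),  # negative rule
]
after_disambiguation = drop_contained(extracted)   # only the negative match is left
after_cleaning = clean(after_disambiguation)       # nothing is left
cleaning_only = clean(extracted)                   # "2000" would wrongly survive
```

The last line shows why the order matters: without the preceding removal of contained expressions, the negative rule would not block the shorter match.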

Domain-dependent Normalization Strategies

To further detail the disambiguation phase, we use two examples from Figure 3.2 (page 49). In the news document (Figure 3.2(a)) and the narrative document (Figure 3.2(b)), the expressions “December” and “December 25” occur. In HeidelTime’s normalization phase, they are normalized to “UNDEF-year-12” and “UNDEF-year-12-25”, respectively. During the disambiguation phase, these values have to be fully specified. For this, HeidelTime applies domain-dependent normalization strategies (cf. Section 3.3.8). For narrative documents, HeidelTime assumes the previously mentioned temporal expression of the type date to be the reference time. Assuming a chronological order of the reference time and the underspecified expression, the value of the expression “December 25” is correctly normalized to “1979-12-25”.
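A minimal Python sketch of the narrative strategy is given below. It encodes one possible reading of the chronological-order assumption, namely that the resolved date must not precede the reference time, so the year is advanced by one if necessary. This reading, the function name resolve_narrative, and the reference dates used in the example are assumptions; the reference time of Figure 3.2(b) is not restated here.

```python
from datetime import date


def resolve_narrative(value: str, reference: date) -> str:
    """Fill the year of 'UNDEF-year-MM' or 'UNDEF-year-MM-DD' from the reference time.

    Assumed reading of 'chronological order': the resolved date is not
    allowed to lie before the reference time.
    """
    prefix = "UNDEF-year-"
    if not value.startswith(prefix):
        return value                      # already fully specified
    rest = value[len(prefix):]            # "12" or "12-25"
    fields = tuple(int(part) for part in rest.split("-"))
    ref_fields = (reference.month, reference.day)[:len(fields)]
    year = reference.year if fields >= ref_fields else reference.year + 1
    return f"{year:04d}-{rest}"


# Hypothetical reference time: a date mentioned earlier in the narrative text.
previous_date = date(1979, 12, 1)
resolved = resolve_narrative("UNDEF-year-12-25", previous_date)
```

With a reference time in December 1979 that does not lie after the 25th, the sketch yields the value given in the text for “December 25”.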

40 HeidelTime outputs the following information to stderr (standard error output stream): the two overlapping expressions and the names of the rules which matched the two expressions.


For news documents, HeidelTime assumes the document creation time to be the reference time, and the relation to the document creation time has to be identified using tense information of the sentence. This is done by exploiting the part-of-speech tags of the verbs in the sentence. If past tense is determined, the year of the value is set to the year of the last December before the document creation time. If present or future tense is identified, it is set to the year of the next December after the document creation time. In the example, the document creation time is “1998-04-28” and the tense of the sentence (the verb “cited”) is determined as past tense, so the value of the expression “December” is correctly disambiguated to “1997-12”. In general, HeidelTime performs the domain-dependent normalization as suggested in Section 3.3.8 and as illustrated in Figure 3.5 (page 60).
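The news strategy can be sketched in Python as follows. The function names sentence_tense and resolve_news are hypothetical, and the tense heuristic over Penn Treebank tags (VBD for past, MD for future, VBP/VBZ for present) is a simplifying assumption, not HeidelTime's actual tense detection. How a month equal to the month of the document creation time is treated is not specified in the text; the sketch assumes the current year in that case.

```python
from datetime import date
from typing import Optional, Sequence


def sentence_tense(pos_tags: Sequence[str]) -> Optional[str]:
    """Crude, assumed heuristic over Penn Treebank verb tags."""
    if "VBD" in pos_tags:
        return "PAST"
    if "MD" in pos_tags:
        return "FUTURE"
    if "VBP" in pos_tags or "VBZ" in pos_tags:
        return "PRESENT"
    return None


def resolve_news(value: str, dct: date, tense: str) -> str:
    """Fill the year of 'UNDEF-year-MM[-DD]' relative to the document creation time."""
    prefix = "UNDEF-year-"
    if not value.startswith(prefix):
        return value
    rest = value[len(prefix):]
    fields = tuple(int(part) for part in rest.split("-"))
    ref_fields = (dct.month, dct.day)[:len(fields)]
    if tense == "PAST":
        # Most recent occurrence not after the document creation time.
        year = dct.year if fields <= ref_fields else dct.year - 1
    elif tense in ("PRESENT", "FUTURE"):
        # Next occurrence not before the document creation time.
        year = dct.year if fields >= ref_fields else dct.year + 1
    else:
        raise ValueError(f"unknown tense: {tense!r}")
    return f"{year:04d}-{rest}"


dct = date(1998, 4, 28)
tense = sentence_tense(["NNP", "VBD", "NNP"])       # a sentence with a verb like "cited"
december = resolve_news("UNDEF-year-12", dct, tense)
```

For the example from the text, the past-tense verb leads to the year before the document creation time, so “December” receives the value “1997-12”.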

Summary

In this and the previous sections, we have shown that HeidelTime’s rule syntax is well-defined and can be used to extract and normalize different types and occurrences of temporal expressions. We explained some examples for English temporal expressions and detailed HeidelTime’s algorithm with its domain-dependent normalization strategies. In the next sections, we will discuss typical aspects of rule-based systems and detail how HeidelTime’s resources were developed for English and for several other languages, and how language resources can be developed for further languages.

In document Domain-sensitive Temporal Tagging for Event-centric Information Retrieval (Page 100-103)