3.5 HeidelTime, a Multilingual, Cross-domain Temporal Tagger
3.5.7 Resource Development Process
In this section, we describe the resource development process for different languages. First, we detail the evolution of HeidelTime’s English resources and domain capabilities. Then, we present a general strategy for adding resources for further languages and briefly explain how the resources for the languages currently supported by HeidelTime were added. Knowing which corpora were used to develop HeidelTime’s language resources is crucial for interpreting the evaluation results in Section 3.6, where many different corpora are used to evaluate HeidelTime.43
English Resources
In the context of TempEval-2, we developed HeidelTime’s first version of English resources using the TempEval-2 training data, which corresponds to the TimeBank corpus (Verhagen et al., 2010). We developed a precision- and a recall-optimized rule set (Strötgen and Gertz, 2010a), but later dropped the recall-optimized one since we decided to put HeidelTime’s focus on high-quality normalization of temporal expressions rather than trying to increase the recall of the extraction task at the expense of normalization quality. Note that the TempEval-2 challenge only addressed temporal tagging of documents of the news domain, and thus, HeidelTime was developed to interpret temporal expressions according to the news domain strategy.
For processing narrative-style documents such as Wikipedia articles, we then added the second normalization strategy to HeidelTime and extended the pattern, normalization, and rule resources. However, these were only minor extensions, and the main effort was put into developing the new normalization strategy for relative and underspecified expressions (cf. Section 3.3.8 and Section 3.5.5). In addition, these adaptations were not performed using an annotated corpus since, at that point in time, no temporally annotated corpus for narrative-style documents existed. WikiWars was published in 2010 as the first corpus containing such documents (Mazur and Dale, 2010). Thus, in the context of our work on spatio-temporal document exploration (Strötgen and Gertz, 2010b), we manually checked the results on some Wikipedia articles, developed the narrative normalization strategy accordingly, and adapted the English resources when necessary. The result of this work corresponds to the first publicly available version of HeidelTime’s English resources (initial version), and the WikiWars corpus was only used for evaluation.
After having successfully addressed the news and the narrative domain, we studied how other domains, namely colloquial and scientific documents, differ from them, as well as the resulting challenges and possible strategies to address them (Strötgen and Gertz, 2012b). In this context, we developed HeidelTime’s normalization strategies for colloquial and autonomic documents (cf. Section 3.3.8). In addition, English-colloquial and English-scientific resources were developed.
43 Note that HeidelTime is a dynamic system: since making it publicly available, we have kept receiving feedback with suggestions on how to improve it. In addition, whenever we apply HeidelTime and analyze its tagging results, we try to identify tagging errors and think about possible improvements, which are usually easy to integrate due to HeidelTime’s well-defined rule syntax. Thus, we regularly update HeidelTime’s resources to further increase its quality for extracting and normalizing temporal expressions on different domains. Due to this dynamic nature, the resource development process described in this section covers the evolution of HeidelTime’s resources up to the current version (version 1.5, released September 17, 2013).
For the development of the English-colloquial resources, we added several non-standard language expressions, which are often used as synonyms for temporal expressions in colloquial text such as tweets and short messages. For this, the entries of all pattern resources were checked for synonyms using the noslang dictionary44, which contains more than 5,000 entries of so-called Internet slang and acronym formulations that are also often used in SMS. Then, all synonyms were added to the pattern and normalization resources. When processing colloquial texts, one has to select “english-colloquial” as language, in addition to setting the domain to “colloquial”.
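The synonym expansion described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `expand_with_slang`, the dictionary layout, and the normalized values are hypothetical and do not reproduce HeidelTime’s resource files or the noslang dictionary.

```python
def expand_with_slang(patterns, normalization, slang):
    """Add slang variants of known patterns to copies of both resources.

    patterns:      list of pattern entries (hypothetical stand-in for a pattern resource)
    normalization: dict mapping a pattern entry to its normalized value
    slang:         dict mapping a slang variant to its standard form
    """
    expanded_patterns = list(patterns)
    expanded_norm = dict(normalization)
    known = set(patterns)
    for variant, standard in slang.items():
        # Only variants of entries that already exist in the pattern resource are added.
        if standard in known and variant not in known:
            expanded_patterns.append(variant)
            known.add(variant)
            # The variant inherits the normalized value of its standard form.
            if standard in normalization:
                expanded_norm[variant] = normalization[standard]
    return expanded_patterns, expanded_norm


# Toy data, invented for illustration only.
patterns = ["today", "tomorrow"]
normalization = {"today": "DAY+0", "tomorrow": "DAY+1"}
slang = {"2moro": "tomorrow", "2day": "today", "l8r": "later"}

new_patterns, new_norm = expand_with_slang(patterns, normalization, slang)
```

Entries such as "l8r" are skipped in this sketch because their standard form is not part of the pattern resource, which mirrors the idea of checking only existing pattern entries for synonyms.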
For English-scientific, we added some phrases that are often used to refer to a time point zero. Furthermore, we mainly adapted the normalization resources for patterns referring to unresolvable expressions (cf. Section 3.3.8). The strategies to handle colloquial and autonomic documents, and also the pattern and normalization resources for English-colloquial and English-scientific, have been developed by analyzing the newly developed corpora Time4SMS and Time4SCI (cf. Section 3.3.6). This should be taken into account when interpreting HeidelTime’s evaluation results on these corpora.
In the context of TempEval-3 (UzZaman et al., 2013), we used the TempEval-3 training data to further boost HeidelTime’s extraction and normalization quality for English (Strötgen et al., 2013). However, we only used the two gold standard corpora (corrected versions of the TimeBank corpus and the AQUAINT corpus, cf. Section 3.2.3) and not a newly published silver standard corpus, which contains merged results of three state-of-the-art temporal taggers. This decision was made after an initial analysis of the silver standard, which did not seem to be helpful for developing and improving a rule-based system. The changes to improve HeidelTime’s extraction and normalization quality on the TempEval-3 corpus were validated on several other English corpora to avoid overfitting to the TempEval-3 training data.
In summary, we developed English resources for temporal tagging of documents from four domains: news, narrative, colloquial, and autonomic. Note that the domain-dependent normalization strategies are language-independent, but that for autonomic and colloquial documents additional language-dependent pattern and normalization resources have been developed. In Section 3.6, we will present HeidelTime’s evaluation results on different corpora and domains. Note that the evaluation corpora had not been used to develop HeidelTime’s English resources until we boosted HeidelTime for TempEval-3. Therefore, in Section 3.6, we will clearly point out whether a corpus was used during development or exclusively as an evaluation corpus.
General Resource Development Process for Further Languages
In the following, we describe the resource development process as a general strategy to add capabilities for a new language to HeidelTime. We followed this strategy to add German, Spanish, Italian, Arabic, and Vietnamese resources, and it is also the strategy we suggest for adding further languages. Note that while some (semi-)automatic approaches for adapting a temporal tagger to additional languages have been described in Section 3.4.4, these did not perform as well as manually adapted systems. Thus, our strategy requires some manual effort. However, we agree with Negri (2007) that this process can be quite fast if the developer has (at least) basic knowledge about the language and is familiar with the system’s architecture. Due to HeidelTime’s well-defined rule syntax and the strict separation between the source code and language-dependent resources, the latter point is not even very important when developing HeidelTime resources for further languages.
Linguistic Preprocessing: Except for the resources, all HeidelTime internals are indeed language-independent. However, HeidelTime requires linguistic preprocessing, namely sentence splitting, tokenization, and part-of-speech tagging (cf. Section 3.5.2). These tasks are language-dependent and have to be addressed when one wants to extend HeidelTime for further languages. As will be detailed in Section 3.5.8, HeidelTime is based on the unstructured information management architecture UIMA (Ferrucci and Lally, 2004b). Thus, the preprocessing tasks have to be performed by an analysis engine. Either one of the wrappers of the UIMA HeidelTime kit (cf. Section 3.5.8) already can process the language of interest, or an analysis engine for these preprocessing tasks has to be developed. This can usually be done by writing a UIMA wrapper for an existing linguistic preprocessing tool for the language of interest.
44
Resource Development Process: The linguistic compositions to form temporal expressions are language-dependent. Thus, it is important to develop language-dependent rules. However, the meaning of temporal expressions in different languages is often very similar. For example, all current HeidelTime languages contain patterns (or words) referring to names of months such as “January” (English), for which translations to the seven languages are amongst others: “Januar” (German), “januari” (Dutch), “enero” (Spanish), “gennaio” (Italian), “janvier” (French), “يناير (/ynayr/)” (Arabic), and “tháng một” (Vietnamese). Note that there are variations in how one refers to the month “January” in the different languages, but the meaning of “January” can be expressed by these patterns.
Translation of Pattern Files: As described in Section 3.5.3, HeidelTime’s language-dependent resources contain so-called pattern files, which are read by HeidelTime’s resource interpreter and later accessed by the extraction part of the rules. These pattern files contain pieces of temporal information, e.g., names of months, names of weekdays, but also numbers, which can refer to days of a month, and so on. The first step in the resource development process for a new language is to develop the pattern information. The goal is that the pattern files contain all the patterns that are usually used in the target language to form temporal expressions. For this, we start with the pattern files of the source language (usually English) and translate all the content that also exists in the target language. Note that pattern files can be removed, and new pattern files can be added if necessary.
Translation of Normalization Files: Closely related to the pattern resources are HeidelTime’s normalization resources, which can be accessed by the normalization parts of the rules. Here, the meaning of the patterns is stored, for example, that “01” is the normalized value of expressions referring to the month January. It is possible to put normalization information of patterns from different pattern files into the same normalization resource. For example, there may be different patterns for expressions referring to a month which can be used in different contexts (and thus in different rules), but the normalization information of all the month patterns may be stored in the same normalization resource. Based on the source normalization resources (usually English), the normalization resources for the target language are created.
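The interplay of pattern and normalization resources can be illustrated with a small sketch. It is not HeidelTime’s resource format or rule syntax: the lists `MONTH_LONG` and `MONTH_SHORT`, the dictionary `NORM_MONTH`, and both functions are hypothetical names chosen for this example. It shows two pattern lists sharing one normalization resource, as described above.

```python
import re

# Hypothetical in-memory stand-ins for two pattern files and one shared
# normalization resource (only three months, for brevity).
MONTH_LONG = ["January", "February", "March"]
MONTH_SHORT = ["Jan", "Feb", "Mar"]
NORM_MONTH = {
    "January": "01", "Jan": "01",
    "February": "02", "Feb": "02",
    "March": "03", "Mar": "03",
}


def compile_pattern(entries):
    """Join pattern entries into one regex alternation, longest entry first."""
    alternatives = sorted(entries, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(entry) for entry in alternatives) + ")"


def normalize_month_year(text):
    """Extract 'month year' expressions and normalize them to YYYY-MM."""
    month = compile_pattern(MONTH_LONG + MONTH_SHORT)
    rule = re.compile(r"\b(" + month + r")\.? (\d{4})\b")
    results = []
    for match in rule.finditer(text):
        # The extraction part found the expression; the normalization
        # resource supplies the meaning of the matched month pattern.
        value = f"{match.group(2)}-{NORM_MONTH[match.group(1)]}"
        results.append((match.group(0), value))
    return results


print(normalize_month_year("In Jan. 2013 and March 2014"))
```

In this sketch, adapting to a new language would mean translating the entries of the two pattern lists and the keys of `NORM_MONTH`, while the normalized values stay the same.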
Rule Development and Iterative Resource Improvement: For the rule development, the following strategy can be applied:
1. Based on the source rules (English) and knowledge about the target language, a few simple rules for the target language are developed.
2. The training documents are processed with these simple rules and checked for incompletely matched expressions. Based on them, the simple rules can be improved and extended, and – whenever necessary – further patterns and normalization information can be added to the resources. This is, for instance, usually necessary for modifiers, which can be expressed in many different ways.
3. In the next step, the training documents are checked for undetected temporal expressions, and rules are created to match such expressions. Here, the goal should be to write the extraction part of the rules as precisely as necessary and as generally as possible. In addition, more complex source rules can be translated to achieve high coverage in the target language although such rules might not have been necessary for the training corpus of the target language. In this way, the resources for the target language can benefit from the high quality of the source language which would not be possible if a temporal tagger for the new language is developed from scratch. For instance, the Spanish HeidelTime resources benefited from the high quality of the English resources, which were used as the starting point in the Spanish development process (Strötgen et al., 2013).
4. Finally, steps (2) and (3) are repeated for the adapted resources. This should be done until the rules cannot be improved or modified further without worsening the already obtained results. Note that parts of this process can be performed automatically as suggested by Negri et al. (2006) and Spreyer and Frank (2008). However, to achieve high-quality temporal tagging resources for the target language, a manual inspection of the new resources is necessary. Otherwise, as reported by Negri et al. (2006), the quality remains lower than that of a temporal tagger tailored to the target language.
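The checks in steps (2) and (3) can be sketched as a comparison between annotated expressions and rule output. The following is a minimal illustration, assuming character-offset spans and plain regular expressions as rules; the helper names `span_of`, `apply_rules`, and `classify_matches` are hypothetical and not part of HeidelTime.

```python
import re


def span_of(text, phrase):
    """Return the (start, end) character span of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def apply_rules(rules, text):
    """Apply every regex rule to the text and collect the matched spans."""
    spans = set()
    for rule in rules:
        for match in re.finditer(rule, text):
            spans.add(match.span())
    return spans


def classify_matches(gold_spans, system_spans):
    """Sort gold expressions into exact, partial (incomplete), and missed."""
    system = set(system_spans)
    report = {"exact": [], "partial": [], "missed": []}
    for gold in gold_spans:
        if gold in system:
            report["exact"].append(gold)
        elif any(s[0] < gold[1] and gold[0] < s[1] for s in system):
            # Overlapping but not identical: an incompletely matched expression (step 2).
            report["partial"].append(gold)
        else:
            # No overlap at all: an undetected expression (step 3).
            report["missed"].append(gold)
    return report


text = "The meeting on 3 May 2012 was moved to early June."
gold = [span_of(text, "3 May 2012"), span_of(text, "early June")]

# First iteration: one simple rule, giving one partial match and one miss.
simple_rules = [r"\bMay \d{4}\b"]
print(classify_matches(gold, apply_rules(simple_rules, text)))

# Next iteration: the rule is extended and a new rule is added.
improved_rules = [r"\b\d{1,2} May \d{4}\b", r"\bearly June\b"]
print(classify_matches(gold, apply_rules(improved_rules, text)))
```

The partial and missed lists are what a developer would inspect manually before the next iteration. The sketch covers extraction only, so normalization quality would still have to be checked separately.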
Corpora Used during Resource Development
For developing HeidelTime resources for German (Strötgen and Gertz, 2011) as well as for Spanish, Italian, Arabic, and Vietnamese (Strötgen et al., 2014a), we followed the strategy described above. Thus, we used some corpora during the language resource development process as described in the following.
HeidelTime’s German resources were developed after the English ones. For our work on multilingual document similarity (cf. Section 6.5 and Strötgen et al., 2011), we used some German Wikipedia articles to improve the German rules. However, at this point in time, we had not yet developed WikiWarsDE, and thus, we did not use the WikiWarsDE corpus for the development of the German resources. Instead, we manually checked the Wikipedia articles for incorrectly annotated expressions to detect errors.
We then developed Spanish resources in the context of the TempEval-3 competition (Strötgen et al., 2013). Thus, we used the Spanish TempEval-3 training data for developing the Spanish HeidelTime resources. In parallel, Italian, Arabic, and Vietnamese resources were developed (Strötgen et al., 2014a). Since none of these languages was part of the TempEval-3 challenge, we had to use other corpora during the development process. For Italian, we used the Italian TempEval-2 training corpus. For Arabic, we split the existing Arabic part of the ACE multilingual 2005 training corpus into a training and a test set. Note that the training set is TIMEX2-annotated and does not contain any normalization information. Thus, special attention had to be paid to the normalization quality by manually validating the normalization of the matched expressions during the resource development. For Vietnamese, no annotated training data was available, so we used some unannotated Wikipedia articles, similarly to German. In an iterative way, these were manually checked for incomplete and missed temporal expressions, as well as for the quality of the normalization of the extracted temporal expressions.
Since the normalization strategies, the rule syntax, and the English resources were already available, the development of the resources for the other languages was straightforward. Although we had to deal with some language-specific challenges (in particular for Arabic), adding HeidelTime resources for new languages is much faster than building a new temporal tagger for the language of interest.
3 Cross-domain Temporal Tagging
type            attributes
Timex3          filename, sentenceId, firstTokenId, foundbyRule,
                timexType, timexValue, timexQuant, timexFreq, timexMod
Sentence        filename, sentenceId
Token           filename, sentenceId, tokenId, pos
DCT             filename, value, timexId
Timex3Interval  Timex3,
                timexValueEB, timexValueLB, timexValueEE, timexValueLE
Table 3.8: All types and their attributes as defined in HeidelTime’s UIMA type system.
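For readers who think in terms of data structures, the types of Table 3.8 can be mirrored as plain Python dataclasses. This is only a sketch: the table lists attribute names, so all field types and default values below are assumptions, and the entry “Timex3” for Timex3Interval is read here as inheritance from Timex3, which is an interpretation rather than something the table states.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    filename: str
    sentenceId: int  # field types are assumptions; the table gives names only


@dataclass
class Token:
    filename: str
    sentenceId: int
    tokenId: int
    pos: str


@dataclass
class DCT:
    filename: str
    value: str
    timexId: str


@dataclass
class Timex3:
    filename: str
    sentenceId: int
    firstTokenId: int
    foundbyRule: str
    timexType: str
    timexValue: str
    timexQuant: str = ""  # empty defaults are an assumption of this sketch
    timexFreq: str = ""
    timexMod: str = ""


@dataclass
class Timex3Interval(Timex3):
    # Assumed reading of the "Timex3" entry: all Timex3 attributes are inherited.
    timexValueEB: str = ""
    timexValueLB: str = ""
    timexValueEE: str = ""
    timexValueLE: str = ""


# Invented example values, for illustration only.
interval = Timex3Interval(
    filename="doc.txt", sentenceId=1, firstTokenId=4, foundbyRule="example_rule",
    timexType="DATE", timexValue="2013", timexValueEB="2013-01-01",
)
```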
Motivated by the simplicity of adding language resources to HeidelTime, the resources for the other two languages currently supported by HeidelTime (Dutch and French) have been independently developed by other researchers: van de Camp and Christiansen from Tilburg University addressed Dutch (van de Camp and Christiansen, 2012) while Moriceau and Tannier from LIMSI (Paris) addressed French (Moriceau and Tannier, 2014). Both followed a strategy similar to the one we used for the other languages.