Different types of information are tried in the solutions for the automatic classification of temporal relations.
Several classifier features are tested, and many of them are new. These are presented in Section 4.4.
4.2 Baselines
In order to evaluate these classifier features, simple classifiers are employed as baselines. These baselines use a minimal set of relatively shallow features. They are presented in this section.
The baselines consist of machine-learned classifiers similar to the ones used by Hepple et al. (2007), one of the participating systems of the first TempEval. That system used machine learning algorithms implemented in Weka (Witten & Frank, 1999). Here we follow the same approach and test several of the algorithms implemented in Weka. These baseline classifiers are also similar to the classifiers used in Section 3.8.2 to compare TimeBankPT to the original English corpus that it is based on, with a few differences made explicit here. We present results for the same algorithms that Hepple et al. (2007) used in the first TempEval, and additionally for a decision tree algorithm, Weka's trees.J48. This last one was chosen because it is fast to train and produces human-readable models, which is useful during development and for error mining.
The algorithms that were employed are:
• rules.DecisionTable is a decision table classifier;
• trees.J48 is Weka’s implementation of the C4.5 algorithm to generate decision trees;
• rules.JRip is a propositional rule learner implementing the RIPPER algorithm;
• lazy.KStar is a nearest neighbor classifier that uses an entropy-based similarity function;
• bayes.NaiveBayes is a Bayesian classifier;
• functions.SMO is an algorithm to train support vector machines.
The default parameters are used for all of these algorithms, both in these baselines and in the final classifiers.
At this point, it should be mentioned again that the tasks of TempEval are to determine the type of temporal relations. Each training or test instance thus corresponds to a temporal relation, i.e. a TLINK element in the TimeML annotations (see Figures 2.5 and 3.2). The classification problem is to determine the type of the relation that holds between an event (referred to by the eventID attribute of TLINK elements) and another temporal entity, which can be a time (pointed to by the relatedToTime attribute), in the case of Task A Event-Timex and Task B Event-DocTime, or, in the case of Task C Event-Event, another event (given by the relatedToEvent attribute).
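As an illustration of how one instance per TLINK element can be read off the annotations, the following Python sketch may help. The XML fragment is made up for the example, and the relType attribute (where TimeML stores the relation type) and the lid attribute are assumptions that are not taken from the text above.

```python
import xml.etree.ElementTree as ET

# Made-up TimeML-like fragment, for illustration only.
SAMPLE = """
<TimeML>
  <TLINK lid="l1" relType="OVERLAP" eventID="e1" relatedToTime="t1"/>
  <TLINK lid="l2" relType="BEFORE" eventID="e2" relatedToEvent="e3"/>
</TimeML>
"""


def tlink_instances(xml_text):
    """Return one (event, other entity, kind, label) tuple per TLINK."""
    instances = []
    root = ET.fromstring(xml_text.strip())
    for tlink in root.iter("TLINK"):
        event = tlink.get("eventID")
        if tlink.get("relatedToTime") is not None:
            # Event related to a time (Task A or Task B style instance).
            other, kind = tlink.get("relatedToTime"), "time"
        else:
            # Event related to another event (Task C style instance).
            other, kind = tlink.get("relatedToEvent"), "event"
        # relType is assumed to hold the class label to be predicted.
        instances.append((event, other, kind, tlink.get("relType")))
    return instances
```

The last element of each tuple is the class label; the remaining elements identify the two entities whose attributes feed the classifier features.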
For the features that are employed in the baseline classifiers we also took inspiration from the approach of Hepple et al. (2007) and from our approach in Section 3.8.2. The same features described there are used in these baselines as well. These are good features for baselines since they are easily computed from the annotated data. The event- features correspond to attributes of EVENT elements, with the exception of the event-string feature, which takes as value the character data inside the corresponding TimeML EVENT element. In a similar spirit, the timex3- features are taken from the attributes of TIMEX3 elements with the same name. The order- features are computed from the document's textual content. The feature order-event-first encodes whether, in the text, the event term precedes the time expression it is related to by the temporal relation to classify. The classifier feature order-event-between describes whether any other event is mentioned in the text between the two expressions for the entities that are in the temporal relation, and similarly order-timex3-between is about whether there is an intervening temporal expression. Finally, order-adjacent is true if and only if both order-timex3-between and order-event-between are false (even if other words occur between the expressions denoting the two entities in the temporal relation).
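The definitions of the four order- features can be sketched in Python as follows. The representation of a document as a list of (kind, identifier) pairs in text order is an assumption made for this example, as are the function and variable names.

```python
def order_features(mentions, event_id, timex_id):
    """Compute the four order- features for one event/time pair.

    `mentions` lists every EVENT and TIMEX3 in text order as
    (kind, identifier) pairs, with kind "event" or "timex3".
    This representation is hypothetical.
    """
    position = {ident: i for i, (_, ident) in enumerate(mentions)}
    ev, tx = position[event_id], position[timex_id]
    lo, hi = min(ev, tx), max(ev, tx)
    # Kinds of the annotated expressions strictly between the two entities.
    between = [kind for kind, _ in mentions[lo + 1:hi]]
    event_between = "event" in between
    timex3_between = "timex3" in between
    return {
        "order-event-first": ev < tx,
        "order-event-between": event_between,
        "order-timex3-between": timex3_between,
        # Adjacent: no intervening event or time expression,
        # regardless of other words occurring in between.
        "order-adjacent": not (event_between or timex3_between),
    }
```

For instance, with mentions `[("event", "e1"), ("event", "e2"), ("timex3", "t1")]`, the pair (e1, t1) has an intervening event and is therefore not adjacent, whereas the pair (e2, t1) is.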
One difference between the baseline models and the models described in Section 3.8.2 is that the final sets of features employed in the classifiers of Section 3.8.2 are the same as the ones used by Hepple et al. (2007) for English: since the point was to compare classifier performance on the two data sets, the same features were used. That is, the feature sets employed there are the ones reported by Hepple et al. (2007) and optimized for English. The feature sets in these baselines are, in turn, optimized for Portuguese.
More specifically, we tried all possible combinations of these features. The resulting classifiers are evaluated using 10-fold cross-validation on the training data. The best classifier was kept as the baseline for the rest of the work reported here.
Attribute               Task A    Task B    Task C
event-class             d×rkns    ×jrk×s    djrkns
event-stem              ×jrk××    ××r×n×    ××××××
event-aspect            ××rk×s    ××××n×    ××rkn×
event-tense             ××rkn×    djrkns    djrkns
event-polarity          d×××ns    ××××n×    ××r×n×
event-pos               ××r××s    ×××k××    ×j××ns
event-string            ××××ns    ××××××    ××××××
order-adjacent          ××××n×    n/a       n/a
order-event-first       djrkns    n/a       n/a
order-event-between     djrkns    n/a       n/a
order-timex3-between    ×jrk×s    n/a       n/a
timex3-mod              ××r×ns    ×××k××    n/a
timex3-type             d×rk×s    ××rk××    n/a
Table 4.1: Feature combinations used in the baseline classifiers. Features inspired by the ones used by Hepple et al. (2007) in TempEval. Key: d means the feature is used with DecisionTable; j, with J48; r, with JRip; k, with KStar; n, with NaiveBayes; and s, with SMO.
Table 4.1 shows the sets of features that yield these best results and are employed in the baseline classifiers.
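The exhaustive search over feature combinations described in this section can be sketched in Python as follows. The `score` argument is a hypothetical placeholder for training one of the Weka algorithms on a feature subset and returning its cross-validated accuracy; restricting the search to non-empty subsets is also an assumption of this sketch.

```python
from itertools import combinations

# Feature names as listed in Table 4.1 (all 13 apply to Task A only).
FEATURES = [
    "event-class", "event-stem", "event-aspect", "event-tense",
    "event-polarity", "event-pos", "event-string",
    "order-adjacent", "order-event-first", "order-event-between",
    "order-timex3-between", "timex3-mod", "timex3-type",
]


def feature_subsets(features):
    """Yield every non-empty combination of the given features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def best_subset(features, score):
    """Return the subset for which `score` is highest.

    `score` is a hypothetical callable mapping a tuple of feature
    names to a number, e.g. 10-fold cross-validation accuracy.
    """
    return max(feature_subsets(features), key=score)
```

With 13 features there are 2^13 - 1 = 8191 non-empty subsets per algorithm, so the search stays tractable only because the initial feature set is small.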
Table 4.2 presents the evaluation results for the best feature combination for each task and algorithm, using 10-fold cross-validation on the training data.
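For readers unfamiliar with the evaluation scheme, a minimal Python sketch of k-fold cross-validation is given below. It is not Weka's implementation: the way instances are assigned to folds and the `train_fn` interface (a function that takes training instances and labels and returns a prediction function) are assumptions made for the example.

```python
def k_fold_indices(n_instances, k=10):
    """Yield k (train, test) index lists covering range(n_instances)."""
    # Instance i goes to fold i mod k (an arbitrary choice for this sketch).
    folds = [list(range(i, n_instances, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def cross_val_accuracy(instances, labels, train_fn, k=10):
    """Fraction of instances classified correctly when held out."""
    correct = 0
    for train, test in k_fold_indices(len(instances), k):
        # Train on k - 1 folds, then predict the remaining fold.
        predict = train_fn([instances[i] for i in train],
                           [labels[i] for i in train])
        correct += sum(predict(instances[i]) == labels[i] for i in test)
    return correct / len(instances)
```

Each instance is used for testing exactly once, so the returned figure is computed over the whole training set even though no classifier is ever tested on data it was trained on.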
The results in Table 4.2 are better than the results in Table 3.7, in Section 3.8.2, because in the former case feature selection was performed with the Portuguese data, whereas in the latter the combination of features used was the same as the one used for English by Hepple et al. (2007), although the initial set of features is identical.
These are the classifiers that will be used for the comparison with the additional features to be tried. As mentioned before in Chapter 3, the data used are organized in a training set and an evaluation set. The training part is around 60,000 words long, and the test data contain around 9,000 words. When tested on the held-out test data, these six classifiers present the scores in Table 4.3. These scores will also be compared at the end.
Classifier                 Task A    Task B    Task C
DecisionTable              55.5      79.3      52.2
J48                        57.1      79.7      55.6
JRip                       53.8      78.8      52.7
KStar                      58.3      79.5      56.8
NaiveBayes                 57.5      80.2      54.2
SMO                        57.2      79.8      57.0
Majority class baseline    49.4      62.4      41.8
Table 4.2: Performance of the baseline classifiers, using 10-fold cross-validation on the training data
These baselines are easily reproducible: they are based on freely available software, and the features that are employed are easily computed from the annotated data, with no need to run any natural language processing tools whatsoever (or any other tool).
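The majority class baseline reported in the last row of Table 4.2 always predicts the most frequent relation type. A minimal Python sketch of such a baseline is shown below; the function name and the example labels are illustrative assumptions, and the sketch does not reproduce the figures in the table.

```python
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Return the majority label and its accuracy (in percent) on the test labels."""
    # The majority class is estimated from the training labels only.
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority for label in test_labels)
    return majority, 100.0 * hits / len(test_labels)
```

Any classifier that uses features should score above this figure, which is why it serves as a lower bound for the six algorithms.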
A few comments on the selected features are in order. Task A Event-Timex seems to benefit from some of the order- features considerably, as they are present in the optimal feature set of every classifier for this task. Task A Event-Timex is about temporal relations between events and times mentioned in the same sentence. When they are mentioned close enough in the text, it is often the case that the time expression is syntactically dependent on the event term, in which case the temporal relation is very frequently OVERLAP. In some other cases, these two entities are mentioned in the same sentence very far apart from each other, and the temporal relation between them is more indirect, and it is often not OVERLAP.
For Task B Event-DocTime and Task C Event-Event, verb tense seems to be a very important classifier feature. For Task B Event-DocTime, it is the only feature that is present in the best feature combinations of all algorithms. This is expected, since this task is about relating an event with the document creation time, and verb tense locates the event denoted by a verb relative to the speech time, which is the same as the document creation time. For Task C Event-Event, the information carried in the class attribute of EVENT elements, encoded in the event-class feature, is also useful. Task C Event-Event is about temporal relations between the main events of two consecutive sentences. The feature class distinguishes, among other things,
between states and other types of situations (see Section 3.3.1 and Section 2.2.2).