
4.5 Feature Selection and Results

4.5.3 Error Analysis

Consider a simplified problem where we use two Boolean features, the class values are A and B, and the training set is composed of the instances in Table 4.11.

A reasonable classifier trained on this data will assign a new instance with the feature vector <TRUE, FALSE> to class B, because of training instance no. 4. But for an instance like <TRUE, TRUE> it receives conflicting evidence about the class: instances 1, 2 and 3 share this feature vector, but while instance 1 has class A, the other two have class B. The feature vector <FALSE, FALSE> is affected by the same problem: instances 5, 6 and 7 exhibit these features and have class A, whereas instances 8 and 9 have class B. If in these cases our reasonable classifier assigns the majority class of each group of instances that share the same feature values, it will assign class B to instance no. 1 and class A to instances 8 and 9. If the classifier does not choose the majority class within each group of instances with the same features, the error will be higher, at least on this training data. Therefore, this classifier cannot be 100% accurate even when classifying instances that it has seen during training, and such a classifier will likely not be 100% accurate on unseen instances represented with the same set of features. This problem can be solved by adding features that differentiate these instances.
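The counting argument above can be sketched in a few lines of Python. This is only an illustration: the toy data is reconstructed from the prose description of Table 4.11 (assuming that the instances in each group have identical feature vectors), and the function name `conflict_stats` is hypothetical.

```python
from collections import Counter, defaultdict

# Toy training set reconstructed from the prose description of Table 4.11
# (instance number -> (feature vector, class)). This is an assumption:
# the table itself is not reproduced in this section.
training = {
    1: ((True, True), "A"),
    2: ((True, True), "B"),
    3: ((True, True), "B"),
    4: ((True, False), "B"),
    5: ((False, False), "A"),
    6: ((False, False), "A"),
    7: ((False, False), "A"),
    8: ((False, False), "B"),
    9: ((False, False), "B"),
}


def conflict_stats(instances):
    """Count groups of identical feature vectors with conflicting classes."""
    groups = defaultdict(list)
    for features, label in instances.values():
        groups[features].append(label)

    # A group is conflicting when its instances do not all share one class.
    conflicting = [labels for labels in groups.values() if len(set(labels)) > 1]
    affected = sum(len(labels) for labels in conflicting)
    # Instances outside the majority class of their group are misclassified
    # even by a classifier that always picks the majority class of the group.
    minority = sum(
        len(labels) - Counter(labels).most_common(1)[0][1]
        for labels in conflicting
    )
    return {
        "groups": len(conflicting),
        "affected": affected,
        "minority": minority,
        "min_error": minority / len(instances),
    }


stats = conflict_stats(training)
print(stats)
```

On this toy data the sketch finds two conflicting groups covering eight instances, three of which (instances 1, 8 and 9) fall outside the majority class of their group, so the training error cannot go below 3/9.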

The same problem can be seen with the classifiers and the sets of features used in our work.

Even with all the features employed, for each task there are still some groups of instances such that all instances in that group are identical (their feature vectors are identical) but not all instances in that group have the same class. This means that in any of these tasks, 100% accuracy is impossible with these features.

Using the optimal set of features for the best classifiers for each task (the support vector machines), this amounts to the following numbers. For Task A Event-Timex, there are 7 such groups in the training data, affecting 17 instances. The number of instances that do not exhibit the majority class in their group is 7. Therefore, in the training data at least, error has to be at least 0.5%.

For the other tasks these numbers are higher. For Task B Event-DocTime, there are 16 groups and 35 instances. The number of instances associated with the minority class is 16, or 0.6% of all instances in the training set. In the case of Task C Event-Event, there are 20 such groups, encompassing 64 training instances. 25 instances do not have the same class as the majority class in their group, or 1.4% of the total number of training instances. The test data do not show this problem.

With the simple features baselines (the baseline classifiers that employ a smaller set of features), this problem is much more pronounced. For Task A Event-Timex, there are 341 instances in the training data affected by it, in 95 groups. Around 10% of the total number of instances belong to one of these groups and do not have the majority class of their group. 10% is also roughly the gain in accuracy on the training data for this task when we go from the simple features baselines to the final classifiers. For Task B Event-DocTime, 459 instances are affected, in 120 groups. Around 7% of the instances belong to one of these groups and do not have the majority class of their group. For Task C Event-Event, this amounts to just 3%. Task A Event-Timex is the task where our work showed the greatest improvement, but also the one where more features were most clearly needed to properly distinguish the instances. The very small size of this problem in the final models indicates that further progress may be difficult to achieve merely by introducing more features.

These classifiers always produce an answer (no instance is left unclassified), but recall and precision measures can still be computed for each class value, and we can take their average, weighted by their frequency, as global recall, precision and F-measure.¹

Table 4.12, Table 4.13 and Table 4.14 show the precision, recall and F-measure scores for Task A Event-Timex, Task B Event-DocTime and Task C Event-Event respectively, broken down by class. They show that some classes are much harder than others. The vague classes BEFORE-OR-OVERLAP, OVERLAP-OR-AFTER and VAGUE show null scores (on the test data, precision, recall and F-measure are all 0 for each of them), probably at least partly because of their low frequency in the data. The instances of these classes are also naturally harder to classify, since they

¹ Precision is defined as the number of true positives tp divided by the sum of the number of true positives and the number of false positives fp:

\[ P = \frac{\mathit{tp}}{\mathit{tp} + \mathit{fp}} \tag{4.1} \]

Recall is the number of true positives divided by the sum of the number of true positives and the number of false negatives fn:

\[ R = \frac{\mathit{tp}}{\mathit{tp} + \mathit{fn}} \tag{4.2} \]

The F-measure is the harmonic mean of precision and recall:

\[ F = \frac{2 \times P \times R}{P + R} \tag{4.3} \]

For instance, for the class OVERLAP the true positives are the instances correctly classified as OVERLAP, the false positives are the instances incorrectly classified as OVERLAP, and the false negatives are the instances that should have been classified as OVERLAP but were not.
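As a minimal sketch of how these definitions combine with the frequency-weighted average, the following Python snippet computes per-class precision, recall and F-measure from gold and predicted labels. The function name `per_class_scores`, the example labels, and the convention of scoring 0 when a denominator is 0 are assumptions made for illustration; they are not taken from the experimental setup.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return ({class: (P, R, F)}, (weighted P, weighted R, weighted F))."""
    classes = sorted(set(gold) | set(predicted))
    support = Counter(gold)  # class frequency in the gold annotations
    pairs = list(zip(gold, predicted))
    scores = {}
    for c in classes:
        tp = sum(1 for g, p in pairs if g == c and p == c)
        fp = sum(1 for g, p in pairs if g != c and p == c)
        fn = sum(1 for g, p in pairs if g == c and p != c)
        # Assumed convention: a measure with a zero denominator scores 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_measure = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores[c] = (precision, recall, f_measure)

    # Global scores: per-class scores averaged with class frequency as weight.
    total = len(gold)
    weighted = tuple(
        sum(support[c] * scores[c][i] for c in classes) / total
        for i in range(3)
    )
    return scores, weighted


# Made-up labels for illustration only; not data from the experiments.
gold = ["OVERLAP", "OVERLAP", "OVERLAP", "BEFORE", "BEFORE", "VAGUE"]
pred = ["OVERLAP", "OVERLAP", "BEFORE", "BEFORE", "OVERLAP", "OVERLAP"]
scores, weighted = per_class_scores(gold, pred)
print(scores)
print(weighted)
```

Because every instance receives exactly one class, the frequency-weighted recall computed this way coincides with overall accuracy, while the weighted F-measure is the average of the per-class F-measures rather than the harmonic mean of the two weighted averages.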


                      10-fold cross-validation     Evaluation on test data
Class                 P       R       F            P       R       F
OVERLAP               0.716   0.804   0.758        0.754   0.86    0.804
BEFORE                0.682   0.63    0.655        0.615   0.421   0.5
AFTER                 0.621   0.649   0.634        0.452   0.633   0.528
BEFORE-OR-OVERLAP     0       0       0            0       0       0
OVERLAP-OR-AFTER      0       0       0            0       0       0
VAGUE                 0.5     0.04    0.074        0       0       0
Weighted avg.         0.65    0.683   0.661        0.596   0.669   0.625

Table 4.12: Precision (P), recall (R) and F-measure (F) of the support vector machine for Task A Event-Timex, broken down by class

are exactly those for which the human annotators could not make a specific decision. The majority classes (OVERLAP for Task A Event-Timex and Task C Event-Event, and BEFORE for Task B Event-DocTime) seem to be the easiest: their F-measures on unseen data (0.804 for Task A Event-Timex, 0.874 for Task B Event-DocTime and 0.653 for Task C Event-Event) are much higher than the weighted average F-measure for the same task and evaluation method.

The majority classes always show higher recall than precision, reflecting a general bias toward the majority class even with all the new features. For Task A Event-Timex, this is also the case for the second most frequent class (AFTER) on the unseen test data. The other classes show much poorer recall, which means that this classifier is strongly biased toward the two most frequent classes.
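The recall-over-precision pattern for the majority classes can be checked directly against the figures reported in Tables 4.12 to 4.14. The snippet below is only a convenience sketch: the values are copied from those tables, while the dictionary layout and the helper name `recall_margin` are illustrative choices.

```python
# (precision, recall) of the majority class of each task, copied from
# Tables 4.12-4.14: 10-fold cross-validation first, then test data.
majority_class_scores = {
    "Task A Event-Timex (OVERLAP)": [(0.716, 0.804), (0.754, 0.86)],
    "Task B Event-DocTime (BEFORE)": [(0.881, 0.938), (0.808, 0.952)],
    "Task C Event-Event (OVERLAP)": [(0.625, 0.717), (0.65, 0.656)],
}


def recall_margin(pairs):
    """Recall minus precision for each evaluation setting."""
    return [round(recall - precision, 3) for precision, recall in pairs]


margins = {
    task: recall_margin(pairs)
    for task, pairs in majority_class_scores.items()
}
for task, margin in margins.items():
    print(task, margin)
```

A positive margin in every setting corresponds to the bias described above; the margin is smallest for Task C Event-Event on the test data.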

In the case of Task B Event-DocTime, recall is higher than precision for the majority class and the third most frequent class (BEFORE and AFTER, respectively). The OVERLAP class, which is the second most frequent class, shows the opposite pattern, with precision higher than recall. The most useful feature for this classifier is verb tense, so this difficulty may be linked to the tense system of Portuguese, possibly to the ambiguity of the present tense. This tense can describe ongoing events, but also past events (the historical use of the present tense) and future events. In Portuguese, it is used to describe future events much more often than in English.

In Task C Event-Event, the majority class is once again OVERLAP and the second most frequent class is BEFORE. Here, once again recall lines up with frequency.

                      10-fold cross-validation     Evaluation on test data
Class                 P       R       F            P       R       F
OVERLAP               0.718   0.685   0.701        0.847   0.61    0.709
BEFORE                0.881   0.938   0.909        0.808   0.952   0.874
AFTER                 0.728   0.762   0.745        0.686   0.729   0.707
BEFORE-OR-OVERLAP     0       0       0            0       0       0
OVERLAP-OR-AFTER      0.333   0.057   0.098        0       0       0
VAGUE                 0.364   0.111   0.17         0       0       0
Weighted avg.         0.798   0.825   0.809        0.764   0.792   0.769

Table 4.13: Precision (P), recall (R) and F-measure (F) of the support vector machine for Task B Event-DocTime, broken down by class

                      10-fold cross-validation     Evaluation on test data
Class                 P       R       F            P       R       F
OVERLAP               0.625   0.717   0.668        0.65    0.656   0.653
BEFORE                0.53    0.691   0.6          0.425   0.627   0.507
AFTER                 0.573   0.62    0.596        0.521   0.595   0.556
BEFORE-OR-OVERLAP     0       0       0            0       0       0
OVERLAP-OR-AFTER      0       0       0            0       0       0
VAGUE                 0       0       0            0       0       0
Weighted avg.         0.494   0.581   0.533        0.49    0.55    0.515

Table 4.14: Precision (P), recall (R) and F-measure (F) of the support vector machine for Task C Event-Event, broken down by class

There are many test instances misclassified as BEFORE, which is reflected in its relatively low precision