
“He has called the garage to make an appointment for the car.”

4.5 Machine learning

Machine learning algorithms extrapolate from examples to new input cases, either by extracting regularities from the examples, for instance in the form of rules or decision trees, or by a more direct use of analogy, as in lazy learning algorithms such as memory-based learning. We chose two machine learning algorithms for our study: rule induction as implemented in RIPPER (Cohen, 1995) (version 1, release 2.4), and memory-based learning (MBL; Aha et al., 1991; Daelemans et al., 1999) as implemented in the TiMBL software package (Daelemans et al., 2002).

Rule induction

Rule induction is an instance of “eager” learning, where effort is invested in searching for a minimal-description-length rule set that covers the classifications in the training data. The rule set can then be used for classifying new instances of the same task.

RIPPER (Cohen, 1995) induces rule sets for each of the classes in the data, maximizing accuracy and coverage for each induced rule. The method starts with the ordering of all classes in the training data (for the experiments described here, NOUN and VERB). The rule induction algorithm finds a rule set that separates the least frequent class from the remaining classes. All instances covered by the learned rule set are then removed from the data set, and the algorithm separates the next least frequent class from the remaining classes. This process is repeated until a single class remains. This class, which is the most frequent one, is used as the default class.

Memory-based learning

Memory-based learning, in contrast, is “lazy”: learning is merely the storage of training examples in memory, and generalization is achieved by using intelligent similarity metrics. The category of the most similar example(s) is used as a basis for extrapolating the category of the test example.

Memory-based learning treats a set of labelled (classified) training instances as points in a multi-dimensional feature space, and stores them as such in an ‘instance base’ in memory. An instance consists of a fixed-length vector of feature-value pairs, and an information field containing the classification of that particular instance. After the instance base is stored, new (test) instances are classified by matching them to all instances in memory, and by calculating with each match the ‘distance’ between the instance in memory and the new instance. The classification of new material in MBL essentially follows the k-nearest neighbor classification rule (Cover and Hart, 1967) of searching for nearest neighbors in memory, and extrapolating their (majority) class to the new instance.
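The classification step described above can be sketched as follows. This is a minimal illustration, not the TiMBL implementation: it uses the plain overlap metric (the number of mismatching feature values) and an unweighted majority vote over the k nearest stored instances. The function names and the toy instances are invented for the example.

```python
from collections import Counter

def overlap_distance(a, b):
    """Overlap metric: number of feature positions at which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def classify(instance_base, query, k=1):
    """Return the majority class among the k stored instances nearest to the query."""
    ranked = sorted(instance_base, key=lambda item: overlap_distance(item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical instance base: (N1, P, N2, V) feature vectors with their class.
instance_base = [
    (("afspraak", "voor", "auto", "maken"), "NOUN"),
    (("boek", "van", "vriend", "lezen"), "NOUN"),
    (("soep", "met", "lepel", "eten"), "VERB"),
    (("brief", "met", "pen", "schrijven"), "VERB"),
]

query = ("pap", "met", "lepel", "eten")
print(classify(instance_base, query, k=1))
print(classify(instance_base, query, k=3))
```

Nothing is learned before classification time: "training" is only the construction of `instance_base`, and all work is done when a query is compared against it.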

The strength of memory-based language processing is that it performs no abstraction, for instance through defining rules, which allows it to deal with productive but low-frequency exceptions (Daelemans et al., 1999). Taking these exceptions into account is useful, since it is difficult to discriminate between noise on the one hand, and valid exceptions and irregularities on the other hand.

4.5.1 Experiments

A central issue in the application of machine learning is the setting of algorithmic parameters; both RIPPER and MBL have several parameters whose values can seriously affect the bias and result of learning. In addition, the particular features that are selected, as well as the amount of data available, determine which parameter values are optimal. Few reliable rules of thumb are available for setting parameters. To estimate appropriate settings, a large search space has to be searched in some way, after which one can only hope that the estimated best parameter setting is also good for the test material, since it may be overfitted to the training material.

Fortunately, we were able to do a pseudo-exhaustive search (testing a selection of sensible numeric values where in principle there is an infinite number of settings), since the CGN data set is small (1004 instances). For MBL, we varied the following parameters systematically in all combinations (see Daelemans et al. (2002) for a description of these parameters):

• the k in the k-nearest neighbor classification rule: 1, 3, 5, 7, 9, 11, 13, 15, 19, 21, 25, 29, 35, 39, 45, 49, 55 and 65

• the type of feature weighting: none, gain ratio, information gain, and chi-squared

• the similarity metric: overlap, or MVDM with back-off to overlap at levels 1 (no back-off), 2, and 5

• the type of distance weighting: none, inverse distance, inverse linear distance, and exponential decay with α = 1, α = 2 and α = 4

For RIPPER we varied the following parameters:

• the minimal number of instances to be covered by rules: 1, 2, 5, 10, 25, 50

• the class order for which rules are induced: increasing and decreasing frequency

• allowing negation in nominal tests or not

• the number of rule set optimization steps: 0, 1, 2
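To give an impression of the size of the two search spaces, the sketch below enumerates all combinations of the values listed above. The dictionary keys and value labels are names chosen for this illustration and are not TiMBL or RIPPER option names; the sketch also assumes that the similarity metric has four variants (overlap, and MVDM at the three back-off levels).

```python
from itertools import product

# Parameter values as listed in the text; keys and labels are illustrative only.
mbl_grid = {
    "k": [1, 3, 5, 7, 9, 11, 13, 15, 19, 21, 25, 29, 35, 39, 45, 49, 55, 65],
    "feature_weighting": ["none", "gain ratio", "information gain", "chi-squared"],
    "metric": ["overlap", "MVDM back-off 1", "MVDM back-off 2", "MVDM back-off 5"],
    "distance_weighting": ["none", "inverse distance", "inverse linear",
                           "exp decay a=1", "exp decay a=2", "exp decay a=4"],
}

ripper_grid = {
    "min_covered": [1, 2, 5, 10, 25, 50],
    "class_order": ["increasing frequency", "decreasing frequency"],
    "negation": [True, False],
    "optimization_steps": [0, 1, 2],
}

def settings(grid):
    """Return every combination of parameter values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]

print(len(settings(mbl_grid)))     # 18 * 4 * 4 * 6 = 1728
print(len(settings(ripper_grid)))  # 6 * 2 * 2 * 3 = 72
```

Under these assumptions the full matrix contains 1728 MBL settings and 72 RIPPER settings, which is only feasible to evaluate exhaustively because the data set is small.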


We ran the full matrix of all combinations of these parameters for both algorithms in a nested 10-fold cross-validation experiment. First, the original data set was split into ten partitions of 90% training material and 10% test material. Second, nested 10-fold cross-validation experiments were performed on each 90% training set, splitting it again ten times. All parameter variants were applied to each of these 10 × 10 experiments. Per main fold, a nested cross-validation average performance was computed; the setting with the highest average F-score on noun attachment was then applied to the full 90% training set and tested on the 10% test set.
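The nested procedure can be sketched as follows. This is a simplified outline under stated assumptions: `score` is a hypothetical callback that trains a classifier with a given setting on the training part and returns its F-score on noun attachment for the test part, and the interleaved way of forming folds is chosen for brevity (the text does not specify how the partitions were drawn).

```python
def folds(items, n_folds=10):
    """Yield (train, test) pairs; fold i holds every n_folds-th item starting at i."""
    for i in range(n_folds):
        test = items[i::n_folds]
        train = [x for j, x in enumerate(items) if j % n_folds != i]
        yield train, test

def nested_cv(data, candidate_settings, score, n_folds=10):
    """For each main fold, select a setting on inner folds, then test it once."""
    results = []
    for outer_train, outer_test in folds(data, n_folds):

        def inner_average(setting):
            # Average performance of one setting over the inner folds.
            inner = [score(setting, tr, te) for tr, te in folds(outer_train, n_folds)]
            return sum(inner) / len(inner)

        # Only the 90% training part is used to choose the setting.
        best = max(candidate_settings, key=inner_average)
        # The chosen setting is retrained on the full 90% and tested on the 10%.
        results.append((best, score(best, outer_train, outer_test)))
    return results

# Toy usage with a dummy scoring function that ignores the data.
demo = nested_cv(list(range(100)), [1, 2, 3], lambda s, tr, te: s / 10)
print(demo[0])
```

The point of the nesting is that the 10% test part of a main fold never influences the choice of parameter setting for that fold.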

4.5.2 Results

First, we report on the results obtained directly from the nested cross-validation experiment on the Spoken Dutch Corpus data. Second, we report on applying the best overall parameter settings of RIPPER and MBL to the external validation corpus of newspaper and e-mail data.

Internal results: Spoken Dutch Corpus data

First, we carried out experiments using MBL and RIPPER to obtain the performance score per feature (i.e. for the lexical features and the co-occurrence strength value). Table 4.2 for MBL and Table 4.3 for RIPPER show that, when optimizing on noun attachment, the scores for all five features are reasonably robust and add information. However, the performance measures for the best parameter setting using all features are considerably higher. When testing on single features the performance is lower, especially on verb attachment. For MBL and RIPPER only the P-feature obtains a performance on verb attachment that approximates the performance for testing on all features. For MBL the co-occurrence feature also shows a reasonable performance on both noun and verb attachment. Appendix D gives the results for MBL for testing on combinations of two features. These results show that the performance when using two features is better than with single features, but not as good as when using all features.

Table 4.2: Performance measures in percentages for predicting PP attachment in the CGN material (1004 instances) by MBL.

MBL                   NOUN attachment            VERB attachment
            accuracy  precision  recall  Fβ=1    precision  recall  Fβ=1
all         77        81         81      81      71         69      70
N1          67        69         85      76      63         39      48
P           73        81         72      76      63         73      67
N2          62        64         86      74      52         24      32
V           59        62         82      71      46         22      30
Cooc(N P)   68        74         73      74      59         60      59
baseline    60        60         100     75      -          0       -

Table 4.3: Performance measures in percentages per feature for predicting PP attachment in the CGN material (1004 instances) by RIPPER.

RIPPER                NOUN attachment            VERB attachment
            accuracy  precision  recall  Fβ=1    precision  recall  Fβ=1
all         70        74         83      77      52         50      49
N1          64        63         98      77      83         11      18
P           69        74         81      76      64         52      53
N2          66        64         98      78      87         17      27
V           62        61         99      76      55         4       7
Cooc(N P)   65        65         93      76      49         21      27
baseline    60        60         100     75      -          0       -

The performance measures for both algorithms are considerably higher than the baseline, which indicates the performance when noun attachment is always predicted. MBL produces the highest accuracy, 77%, which is significantly higher than the accuracy of RIPPER, 70% (t = 2.87, p < 0.05, df = 18). MBL also produces the highest F-score, 81%, which is significantly higher than that of RIPPER, 77% (t = 2.97, p < 0.05, df = 18).
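As a reminder of how the Fβ=1 columns relate to precision and recall, the sketch below computes the F-score as the weighted harmonic mean of the two. The function name and the use of `None` for an undefined value are choices made for this illustration. It also shows why the baseline rows have no precision or F-score for verb attachment: a classifier that always predicts NOUN never predicts VERB, so verb precision is undefined.

```python
def f_score(precision, recall, beta=1.0):
    """F-score from precision and recall; None when it is undefined."""
    if precision is None or precision + recall == 0:
        return None
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Baseline on the CGN material: always predict noun attachment.
print(f_score(60, 100))         # 75.0, as in the baseline rows
print(f_score(None, 0))         # None: no VERB predictions were made

# MBL using only N1 (Table 4.2): precision 69, recall 85.
print(round(f_score(69, 85)))   # 76
```

Because the tables report rounded percentages, recomputing an F-score from the printed precision and recall does not always reproduce the printed F-score exactly.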

The best overall cross-validated setting for MBL was no feature weighting, k = 25, MVDM, and exponential decay distance weighting with α = 2. It has been argued in the literature that a high k combined with distance weighting is a sensible combination (Zavrel et al., 1997). More surprisingly, no feature weighting means that every feature is regarded as equally important.
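The interplay between a high k and distance weighting can be illustrated with a small sketch. It assumes the plain form of exponential decay, in which a neighbor at distance d votes with weight exp(-α·d); the exact formulation used in TiMBL may differ in detail, and the distances and labels below are invented for the example.

```python
import math
from collections import defaultdict

def weighted_vote(neighbours, alpha=2.0):
    """Pick the class with the largest summed weight exp(-alpha * distance)."""
    totals = defaultdict(float)
    for distance, label in neighbours:
        totals[label] += math.exp(-alpha * distance)
    return max(totals, key=totals.get)

# Hypothetical neighbor set: one very close VERB instance, three distant NOUN ones.
neighbours = [(0.1, "VERB"), (0.9, "NOUN"), (1.0, "NOUN"), (1.1, "NOUN")]

print(weighted_vote(neighbours, alpha=2.0))  # close neighbor dominates
print(weighted_vote(neighbours, alpha=0.0))  # alpha = 0 reduces to a majority vote
```

With decay, distant neighbors still contribute, but a large neighborhood cannot easily outvote a single very similar instance.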

For RIPPER, the best overall cross-validated parameter setting is to allow a minimum of one case to be covered by a rule, induce rules on the most frequent class first (noun attachment), allow negation (which is, however, not used in the end), and run one optimization round. The most common best rule set is the following:

1. if P = van then NOUN

2. if Cooc(N P) > 0.07 then NOUN

3. if P = voor then NOUN

4. if there is no verb then NOUN

5. else VERB

These few rules test for the presence of the two prepositions van (from, of) and voor (for, before), which often co-occur with noun attachment (in the whole data set, 351 out of 406 occurrences of the two prepositions), for a Cooc(N P) value above a threshold similar to the optimal co-occurrence threshold reported earlier (0.07), and for the absence of a verb (which occurs in 27 instances).
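Read as an ordered decision list, the rule set amounts to the following function. The function and argument names are chosen for this illustration, and representing "no verb" by `None` is an assumption; the conditions themselves are those of the five rules.

```python
def classify_pp(prep, cooc, verb):
    """Apply the five rules in order; the first matching rule decides."""
    if prep == "van":       # rule 1
        return "NOUN"
    if cooc > 0.07:         # rule 2: co-occurrence strength Cooc(N P)
        return "NOUN"
    if prep == "voor":      # rule 3
        return "NOUN"
    if verb is None:        # rule 4: no verb present
        return "NOUN"
    return "VERB"           # rule 5: default

print(classify_pp("van", 0.00, "lezen"))
print(classify_pp("met", 0.01, "eten"))
```

Only an instance that has a verb, a preposition other than van or voor, and a co-occurrence value of at most 0.07 is assigned verb attachment.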


External results: newspaper and e-mail data

We evaluated the results of applying the overall best settings to the held-out data (i.e. the 157-sentence external newspaper and e-mail material). Performance measures for MBL are given in Table 4.4 and for RIPPER in Table 4.5. These results roughly correspond with the previous results (i.e. the proportions are the same). Performance measures are again considerably above baseline, although lower than for the CGN data. MBL attains lower precision but higher recall than RIPPER on noun attachment. Again, when testing on single features the performance is lower, especially on verb attachment. For RIPPER only the P-feature obtains a performance on verb attachment that approximates the performance for testing on all features. For MBL the same is true for both the P-feature and the co-occurrence feature.

Table 4.4: Performance measures in percentages per feature for predicting PP attachment in the newspaper and e-mail material (157 instances) by MBL.

MBL                   NOUN attachment            VERB attachment
            accuracy  precision  recall  Fβ=1    precision  recall  Fβ=1
all         66        69         74      72      62         55      58
N1          61        60         94      74      69         16      27
P           64        68         70      69      58         57      58
N2          57        57         96      72      43         4       8
V           55        57         90      70      36         7       12
Cooc(N P)   65        68         74      71      60         52      56
baseline    58        58         100     73      -          0       -

Table 4.5: Performance measures in percentages per feature for predicting PP attachment in the newspaper and e-mail material (157 instances) by RIPPER.

RIPPER                NOUN attachment            VERB attachment
            accuracy  precision  recall  Fβ=1    precision  recall  Fβ=1
all         66        70         71      71      61         60      60
N1          58        58         100     73      100        1       3
P           64        67         72      70      58         52      55
N2          57        57         99      73      50         1       3
V           57        57         100     73      0          0       0
Cooc(N P)   62        62         91      74      67         24      35
baseline    58        58         100     73      -          0       -