Evaluation on a larger dataset - Unsupervised Learning of Multiword Expressions

The evaluation results presented so far look promising. However, there are concerns about the statistical significance of the accuracy improvement that the system achieves over the baseline. We employed two non-parametric statistical tests: Fisher’s exact test and McNemar’s test. The former is suitable for contingency tables with small values, while the latter is not. The values in the contingency tables of the current experiments are borderline in this respect; thus, we applied both tests. According to Fisher’s exact test, the proposed system with manually estimated parameters performs significantly better than the 1c1word baseline only for some similarity values (sim): 15%, 40%, [50%, 55%], and [70%, 95%] (see figure 4.3). In contrast, according to McNemar’s test the same improvement is statistically insignificant for all similarity values (sim). Our intuition is that this result is most probably due to the very small size of the dataset.
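The two tests can be sketched in a few lines of Python. This is only an illustration, not the implementation used in the experiments: the function names and the example counts are hypothetical, Fisher’s test is computed in its two-sided form, and McNemar’s test uses the chi-square approximation with continuity correction, which is one common variant.

```python
from math import comb, erfc, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalised hypergeometric weight of a table with x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the weights of all tables that are no more likely than the observed one
    extreme = sum(w for w in map(weight, range(lo, hi + 1)) if w <= observed)
    return extreme / comb(row1 + row2, col1)


def mcnemar_chi2(b, c, correction=True):
    """McNemar's test on the discordant counts b and c of a paired 2x2 table.

    b: items only the first system gets right; c: items only the second gets right.
    Returns (statistic, p-value) under a chi-square distribution with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = max(abs(b - c) - (1 if correction else 0), 0)
    stat = diff * diff / (b + c)
    # Survival function of chi-square with 1 d.o.f.: P(X > stat) = erfc(sqrt(stat / 2))
    return stat, erfc(sqrt(stat / 2))


# Hypothetical counts, for illustration only
p_fisher = fisher_exact_two_sided(8, 2, 1, 5)
stat, p_mcnemar = mcnemar_chi2(10, 2)
```

Because the chi-square approximation depends on the discordant counts being reasonably large, McNemar’s test in this form becomes unreliable on very small datasets, whereas Fisher’s test is exact for any table size.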

To check whether the proposed system actually performs better than the baseline, we created a larger dataset, shown in table 4.4. It consists of 100 multiword expressions, half of which are compositional and half non-compositional. This dataset is a superset of the previous one.

On this new dataset, we evaluated the proposed system using exactly the same evaluation settings as before. Figure 4.8 presents a comparison between the accuracies achieved by the 1c1word baseline, the system with manually estimated parameters, and the system with parameters automatically estimated by the best performing weighted and unweighted graph connectivity measures. We observe that for the meaningful range of similarity values (sim), [20%, 95%], the system with manually selected parameters performs better than the 1c1word baseline. According to Fisher’s exact test, this increase in accuracy is statistically significant for the whole range. In contrast, according to McNemar’s test, the increase is significant only for [20%, 45%], 65%, and 90%.
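Both tests operate on 2x2 tables derived from the per-expression decisions of the two systems being compared. The sketch below shows one plausible way to build these tables; the labels, the function name and the toy decisions are assumptions made for illustration and are not the data of table 4.4, and the way the tables were actually constructed in the experiments is not stated here.

```python
def contingency_tables(gold, system, baseline):
    """Build the 2x2 tables used by the two significance tests.

    gold, system, baseline: equally long lists of labels, one per expression.
    Returns (unpaired, paired):
      unpaired: rows = system / baseline, columns = correct / incorrect
      paired:   rows = system correct / incorrect,
                columns = baseline correct / incorrect
    """
    sys_ok = [s == g for s, g in zip(system, gold)]
    base_ok = [b == g for b, g in zip(baseline, gold)]
    n = len(gold)

    # Unpaired table: each system contributes its own correct/incorrect counts
    unpaired = [
        [sum(sys_ok), n - sum(sys_ok)],
        [sum(base_ok), n - sum(base_ok)],
    ]

    # Paired table: each expression is counted once, by the joint outcome
    paired = [[0, 0], [0, 0]]
    for s, b in zip(sys_ok, base_ok):
        paired[0 if s else 1][0 if b else 1] += 1
    return unpaired, paired


# Toy decisions ("C" = compositional, "N" = non-compositional), illustration only
gold = ["C", "C", "C", "N", "N", "N"]
system = ["C", "C", "N", "N", "N", "C"]
baseline = ["C", "N", "N", "C", "N", "C"]
unpaired, paired = contingency_tables(gold, system, baseline)
```

The off-diagonal cells of the paired table, `paired[0][1]` and `paired[1][0]`, are the discordant counts on which McNemar’s test is based; expressions that both systems classify correctly, or both incorrectly, do not affect it.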

Compositional multiword expressions

action officer, basic color, car battery, box white oak, cartridge brass, checker board, closed chain, common iguana, corn whiskey, corner kick, cream sauce, cubic meter, eastern pipistrel, field mushroom, flight simulator, graphic designer, hard candy, honey cake, ill health, jazz band, jet plane, king snake, labor camp, laser beam, lemon tree, life form, love letter, luggage van, male parent, medical report, memory device, mythical monster, parking brake, petit juror, red fox, relational adjective, sausage pizza, savoy cabbage, surface fire, taxonomic category, tea table, telephone service, thick skin, touch screen, toxic waste, upland cotton, water snake, water tank, wood aster, parenthesis-free notation

Non-compositional multiword expressions

agony aunt, air conditioner, black maria, dead end, dutch oven, fire brigade, fish finger, fool’s paradise, goat’s rue, golden trumpet, green light, high jump, joint chiefs, lip service, living rock, magnetic head, monkey puzzle, motor pool, oyster bed, palm reading, paper chase, paper gold, paper tiger, personal equation, personal magnetism, petit four, picture palace, pill pusher, pink lady, pink shower, powder monkey, prince Albert, public eye, quick time, rat race, red devil, red dwarf, red tape, road agent, round window, sea lion, small beer, small voice, spin doctor, stocking stuffer, sweet bay, teddy boy, think tank, vegetable sponge, winter sweet

Figure 4.8: Comparison of baseline system, manual parameter tuning and the two best performing graph connectivity measures

As far as automatic parameter tuning is concerned, the best performing unweighted and weighted graph connectivity measures are average graph entropy and weighted average graph entropy, respectively. They perform very similarly to each other and it is not clear which one is better. However, both systems with automatically tuned parameters perform significantly better than the baseline for the meaningful range of similarity values (sim), [20%, 95%], according to both statistical significance tests.

Interestingly, most accuracy values of the four systems in figure 4.8 for similarity values in [0%, 75%] are below 50%, even though the dataset consists of an equal number of compositional and non-compositional multiword expressions. There are several reasons why the accuracy of all systems is lower than 50% in this range. A major one is that small similarity values are expected to judge most multiword expressions as compositional. At the same time, some vectors are very noisy, since the data is downloaded from the web. Moreover, due to the great differences in frequency among the multiword expressions, different settings are most suitable for each of them. This is only taken into account by the parameter estimation scheme that employs graph connectivity measures.

Figures 4.9 and 4.10 show the accuracy achieved by the systems using unweighted and weighted graph connectivity measures for automatic parameter estimation, respectively. We observe that the worst performing ones, average degree and weighted average degree, are still not much worse than the others. The remaining ones, the unweighted and weighted versions of average cluster coefficient, edge density, and average graph entropy, perform similarly. Average graph entropy and weighted average graph entropy achieve the highest accuracy values.

Figure 4.9: Unweighted graph connectivity measures.

Figure 4.10: Weighted graph connectivity measures.
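For orientation, the unweighted measures named above can be sketched on a small undirected graph as follows. Average degree, edge density and average cluster coefficient follow their standard graph-theoretic definitions. The entropy term is an assumption: it treats each vertex’s share of the total degree as a probability, and this passage does not confirm that this matches the definition used in the thesis. The function name and the example graph are hypothetical.

```python
from itertools import combinations
from math import log2


def connectivity_measures(edges):
    """Unweighted connectivity measures of an undirected graph.

    edges: iterable of (u, v) pairs with u != v; the graph needs at least two vertices.
    """
    # Adjacency map; sets make duplicate edges harmless
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2

    def clustering(u):
        # Fraction of pairs of neighbours of u that are themselves connected
        k = len(adj[u])
        if k < 2:
            return 0.0
        links = sum(1 for a, b in combinations(adj[u], 2) if b in adj[a])
        return links / (k * (k - 1) / 2)

    def entropy(u):
        # ASSUMPTION: p(u) = degree(u) / (2 * |E|), entropy term -p * log2(p)
        p = len(adj[u]) / (2 * m)
        return -p * log2(p)

    return {
        "average_degree": 2 * m / n,
        "edge_density": m / (n * (n - 1) / 2),
        "average_clustering": sum(clustering(u) for u in adj) / n,
        "average_entropy": sum(entropy(u) for u in adj) / n,
    }


# Hypothetical example: a triangle a-b-c with one extra vertex d attached to c
measures = connectivity_measures([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
```

The weighted variants would replace vertex degrees and edge counts with sums of edge weights; they are omitted here because their exact form is not given in this passage.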

A system whose parameters are automatically tuned can perform better than one whose parameters were chosen manually. The reason is that during manual parameter estimation the best “universal” parameter combination was chosen. This means that the parameters are the same for all multiword expressions and their corresponding semantic heads. In contrast, the automatic parameter estimation scheme presented in section 4.6 selects a different parameter setting for each word or multiword expression whose senses are induced.
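The difference between a universal setting and per-item settings can be illustrated with a short sketch. It assumes that every candidate parameter setting receives a score for each item (for example a graph connectivity value) and that the highest score is preferred; the function names, setting names and scores are hypothetical, and the universal setting in the experiments was chosen manually rather than by this rule.

```python
def select_per_item(scores):
    """Pick, for every item, the setting with the highest score.

    scores: {item: {setting: score}}
    """
    return {
        item: max(by_setting, key=by_setting.get)
        for item, by_setting in scores.items()
    }


def select_universal(scores):
    """Pick the single setting with the highest total score over all items."""
    shared = set.intersection(*(set(s) for s in scores.values()))
    # sorted() makes the choice deterministic when totals tie
    return max(sorted(shared), key=lambda p: sum(scores[i][p] for i in scores))


# Hypothetical scores for two expressions and two candidate settings
scores = {
    "red tape": {"p1": 0.2, "p2": 0.7},
    "jazz band": {"p1": 0.9, "p2": 0.5},
}
per_item = select_per_item(scores)    # a different setting for each expression
universal = select_universal(scores)  # one setting shared by both
```

In this toy case the universal choice is suboptimal for "jazz band", while per-item selection gives each expression the setting that scores best for it.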

4.9 Further evaluation of unsupervised parameter tuning
