The evaluation results presented so far look promising. However, it is unclear whether the improvement in accuracy that the system achieves over the baseline is statistically significant. We employed two non-parametric statistical tests: Fisher’s exact test and McNemar’s test. The latter is not suitable for contingency tables with small cell values, while the former is. The values in the contingency tables of the current experiments are borderline in this respect; thus, we applied both tests. According to Fisher’s exact test, the proposed system with manually estimated parameters performs significantly better than the 1c1word baseline only for some similarity values (sim): 15%, 40%, [50%, 55%], and [70%, 95%] (see figure 4.3). In contrast, according to McNemar’s test the same improvement is statistically insignificant for all similarity values (sim). We suspect that the most likely reason is the very small size of the dataset.
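The two tests can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function names are hypothetical, the exact (binomial) variant of McNemar’s test is assumed, and the counts in the usage lines are invented purely for demonstration. How the contingency tables were actually built is not specified here; the sketch assumes that Fisher’s test receives a 2x2 table of correct and incorrect decisions per system, and that McNemar’s test receives the two counts of items on which the systems disagree.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table that is no more likely
    # than the observed one (small tolerance for float rounding).
    p_value = sum(p for p in map(prob, range(lo, hi + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


def mcnemar_exact(b, c):
    """Exact two-sided McNemar test on the discordant counts.

    b: items the first system gets right and the second gets wrong
    c: items the second system gets right and the first gets wrong
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis each discordant item is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented counts, for illustration only.
# Rows: system, baseline; columns: correct, incorrect.
print(fisher_exact_two_sided(30, 10, 20, 20))
# 12 items only the system gets right, 3 items only the baseline gets right.
print(mcnemar_exact(12, 3))
```

Because the two tests look at different aspects of the results (overall counts versus disagreements on the same items), they can reach different conclusions on the same experiment.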
To check whether the proposed system actually performs better than the baseline, we created a larger dataset, shown in table 4.4. It consists of 100 multiword expressions, half of which are compositional and half non-compositional. This dataset is a superset of the previous one.
On this new dataset, we evaluated the proposed system using exactly the same evaluation settings as before. Figure 4.8 presents a comparison between the accuracies achieved by the 1c1word baseline, the system with manually estimated parameters, and the system with parameters automatically estimated by the best performing weighted and unweighted graph connectivity measures. We observe that for the meaningful range of similarity values (sim), [20%, 95%], the system with manually selected parameters performs better than the 1c1word baseline. According to Fisher’s exact test, this increase in accuracy is statistically significant for the whole range. In contrast, according to McNemar’s test, the increase is significant only for [20%, 45%], 65%, and 90%.
Compositional multiword expressions:
action officer, basic color, car battery, box white oak, cartridge brass, checker board, closed chain, common iguana, corn whiskey, corner kick, cream sauce, cubic meter, eastern pipistrel, field mushroom, flight simulator, graphic designer, hard candy, honey cake, ill health, jazz band, jet plane, king snake, labor camp, laser beam, lemon tree, life form, love letter, luggage van, male parent, medical report, memory device, mythical monster, parking brake, petit juror, red fox, relational adjective, sausage pizza, savoy cabbage, surface fire, taxonomic category, tea table, telephone service, thick skin, touch screen, toxic waste, upland cotton, water snake, water tank, wood aster, parenthesis-free notation

Non-compositional multiword expressions:
agony aunt, air conditioner, black maria, dead end, dutch oven, fire brigade, fish finger, fool’s paradise, goat’s rue, golden trumpet, green light, high jump, joint chiefs, lip service, living rock, magnetic head, monkey puzzle, motor pool, oyster bed, palm reading, paper chase, paper gold, paper tiger, personal equation, personal magnetism, petit four, picture palace, pill pusher, pink lady, pink shower, powder monkey, prince Albert, public eye, quick time, rat race, red devil, red dwarf, red tape, road agent, round window, sea lion, small beer, small voice, spin doctor, stocking stuffer, sweet bay, teddy boy, think tank, vegetable sponge, winter sweet
Figure 4.8: Comparison of baseline system, manual parameter tuning and the two best performing graph connectivity measures
As far as automatic parameter tuning is concerned, the best performing unweighted and weighted graph connectivity measures are average graph entropy and weighted average graph entropy, respectively. They perform very similarly to each other and it is not clear which one is better. However, both systems with automatically tuned parameters perform significantly better than the baseline for a meaningful range of similarity values (sim), [20%, 95%], according to both statistical significance tests.
Interestingly, most accuracy values of the four systems in figure 4.8 for similarity values in [0%, 75%] are below 50%, even though the dataset consists of an equal number of compositional and non-compositional multiword expressions. There are several reasons for this. A major one is that small similarity values are expected to judge most multiword expressions as compositional. At the same time, some vectors are very noisy, since the data is downloaded from the web. Moreover, because the multiword expressions differ greatly in frequency, different parameter settings suit different expressions. Only the parameter estimation scheme that employs graph connectivity measures takes this into account.
Figures 4.9 and 4.10 show the accuracy achieved by the systems using unweighted and weighted graph connectivity measures for automatic parameter estimation, respectively. We observe that the worst performing ones, average degree and weighted average degree, are still not much worse than the others. The remaining ones, the unweighted and weighted versions of average cluster coefficient, edge density, and average graph entropy, perform similarly. Average graph entropy and weighted average graph entropy achieve the highest accuracy.
Figure 4.9: Unweighted graph connectivity measures.
Figure 4.10: Weighted graph connectivity measures.
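For orientation, the unweighted connectivity measures named above can be sketched as follows. This is an illustrative sketch only: the graph is assumed to be undirected and stored as a dictionary mapping each vertex to the set of its neighbours, the function names are hypothetical, and the entropy formulation (vertex probability taken as degree divided by twice the number of edges, averaged over vertices) is an assumption that may differ from the definition used elsewhere in this work.

```python
from math import log2


def average_degree(adj):
    """Mean number of neighbours per vertex."""
    return sum(len(nb) for nb in adj.values()) / len(adj)


def edge_density(adj):
    """Fraction of all possible undirected edges that are present."""
    n = len(adj)
    if n < 2:
        return 0.0
    edges = sum(len(nb) for nb in adj.values()) / 2
    return edges / (n * (n - 1) / 2)


def average_cluster_coefficient(adj):
    """Mean local clustering coefficient (0 for vertices of degree < 2)."""
    total = 0.0
    for nb in adj.values():
        k = len(nb)
        if k < 2:
            continue
        nbs = list(nb)
        # Count edges among the neighbours of the current vertex.
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbs[j] in adj[nbs[i]])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


def average_graph_entropy(adj):
    """Assumed definition: mean over vertices of -p log2 p,
    where p = degree / (2 * number of edges)."""
    twice_edges = sum(len(nb) for nb in adj.values())
    if twice_edges == 0:
        return 0.0
    entropy = 0.0
    for nb in adj.values():
        p = len(nb) / twice_edges
        if p > 0:
            entropy -= p * log2(p)
    return entropy / len(adj)


# Toy graph: a triangle (a, b, c) with one extra vertex d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
for measure in (average_degree, edge_density,
                average_cluster_coefficient, average_graph_entropy):
    print(measure.__name__, measure(toy))
```

The weighted variants would replace neighbour counts with sums of edge weights; they are omitted here because their exact form depends on how edge weights are defined.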
A system whose parameters are automatically tuned can perform better than one whose parameters were chosen manually. The reason is that during manual parameter estimation the best “universal” parameter combination was chosen. This means that the parameters are the same for all multiword expressions and their corresponding semantic heads. In contrast, the automatic parameter estimation scheme presented in section 4.6 selects a different parameter setting for each word or multiword expression whose senses are induced.
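The difference between a universal and a per-expression parameter choice can be illustrated with a small sketch. Everything in it is hypothetical: the function names, the setting labels, and the scores are invented, and it assumes that each candidate setting receives a numeric score per expression (for example a graph connectivity value) and that a higher score is better.

```python
def select_per_item(scores):
    """Pick, for every item, the setting with the highest score.

    scores: {item: {setting: score}}
    """
    return {item: max(by_setting, key=by_setting.get)
            for item, by_setting in scores.items()}


def select_universal(scores):
    """Pick the single setting with the highest total score over all items."""
    totals = {}
    for by_setting in scores.values():
        for setting, value in by_setting.items():
            totals[setting] = totals.get(setting, 0.0) + value
    return max(totals, key=totals.get)


# Invented scores for two expressions and two candidate settings.
scores = {
    "red tape": {"setting_1": 0.2, "setting_2": 0.7},
    "water tank": {"setting_1": 0.9, "setting_2": 0.5},
}
print(select_universal(scores))  # one setting shared by all expressions
print(select_per_item(scores))   # possibly a different setting per expression
```

In this toy case the universal choice is a compromise that is not the best setting for every expression, whereas the per-item choice adapts to each one.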