Chapter 2 Methods and Measures
2.3 Psychometric measures
class 45.4 did. Class 45.4 also performs particularly badly in French because its member verbs are low in frequency.
Some errors are due to polysemy, partly because the French version of the gold standard was not controlled for this factor. Some verbs have their predominant senses in classes which are missing from the gold standard, e.g. the most frequent sense of retenir is memorize, not keep as in the gold standard class 13.5.1 GET.
Finally, some errors are not true errors but demonstrate the capability of clustering to learn novel information. For example, the CHANGE OF STATE class 45.4 includes many antonyms (e.g. weaken vs. strengthen). Clustering (using F17) separates these antonyms, so that the verbs adoucir, atténuer and tempérer appear in one cluster and consolider and renforcer in another. Although these verbs share the same alternations, their SPs are different. For the same reason, verbs in the LIGHT EMISSION class 43.1 end up in different clusters, depending on whether they describe abstract or concrete light emission.
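The antonym effect can be illustrated with a toy sketch. It assumes that each verb is represented by a subcategorisation part concatenated with a selectional-preference part, and that similarity is measured with the cosine; both are assumptions made for illustration rather than a description of F17. All numbers are invented and do not come from the corpus data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented relative frequencies, for illustration only.
# SCF part: three hypothetical frames, identical for all three verbs
# (they share the same alternations).
scf = {
    "adoucir": [0.6, 0.3, 0.1],
    "atténuer": [0.6, 0.3, 0.1],
    "renforcer": [0.6, 0.3, 0.1],
}
# SP part: two hypothetical argument-head clusters, which differ by polarity.
sp = {
    "adoucir": [0.9, 0.1],
    "atténuer": [0.8, 0.2],
    "renforcer": [0.1, 0.9],
}

def features(verb, use_sp):
    """Concatenate the SCF part with the SP part when use_sp is True."""
    return scf[verb] + (sp[verb] if use_sp else [])

# With SCFs alone the antonyms are indistinguishable.
sim_scf_only = cosine(features("adoucir", False), features("renforcer", False))
# Adding SPs pulls the antonyms apart while near-synonyms stay close.
sim_antonyms = cosine(features("adoucir", True), features("renforcer", True))
sim_same_pole = cosine(features("adoucir", True), features("atténuer", True))
```

Under these invented values, the antonym pair is identical on the SCF part but clearly less similar than the near-synonym pair once the SP part is added, which is the behaviour described above.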
The opposite effect can be observed when clustering maps together classes which are actually semantically and syntactically related (e.g. 36.1 CORRESPOND and 37.7 SPEAK).
Such classes are distinct in Levin and VerbNet because these resources do not draw links between semantically similar classes belonging to different main classes.
Cases such as these show the potential of clustering in discovering novel, valuable information in data. It is encouraging that we have observed this effect in this first clustering experiment in French.
medium-high frequency verbs (Ó Séaghdha and Copestake, 2008; Vlachos et al., 2009b).
As seen in section 6.7.1, such differences in data can have a significant impact on performance.
However, parser and feature extraction performance can also play a big role in overall accuracy, and should therefore be investigated further. When we evaluated our basic SCF feature (equivalent to F1) using the same corpus data and gold standard but an older version of the RASP parser and the SCF extraction system in section 3.4, the F-measure dropped dramatically, from 57.8 to 38.3. The relatively low performance of basic LP features in French suggests that at least some of the current errors are due to parsing. Future research should therefore investigate the source of error at different stages of processing.
In the future, it would also be interesting to investigate whether performance on French can be further enhanced by language-specific tuning (e.g. by experimenting with language-specific features such as auxiliary classes).
Methodology similar to ours has yielded promising results on semantic verb classification in German (Schulte im Walde, 2006) and Japanese (Suzuki and Fukumoto, 2009).
However, these studies have not focussed on Levin-style classes, and have not explored cross-linguistic transfer. The works most related to ours are those of Merlo et al. (2002) and Ferrer (2004). Our results contrast with those of Ferrer, who showed that a clustering approach does not transfer well from English to Spanish. However, her experiment used basic SCF and named entity features only, and a clustering algorithm less suitable for high-dimensional data.
Like us, Merlo et al. (2002) created a gold standard by translating Levin classes to another language (Italian). They also applied a classification approach developed for English to Italian, and reported good overall performance using features developed for English. Although the experiment was very small in scale (involving three classes and a few features only), and although it used a supervised classification technique, the results are in agreement with our results from this larger, unsupervised experiment with French.
In their recent experiment, Falk et al. (2012) built on some of the work we have described in this chapter (Sun et al., 2010). They made use of existing syntactic and semantic lexical resources to cluster 2183 French verbs in our gold standard classes (a superset of our gold standard). They experimented with a new clustering method and new feature sets. They obtained a better result (an F-measure of 70), but this result is not comparable with ours because the gold standard was not identical. Also, manually specified rather than automatically acquired features were used in the experiment. In addition, we found two potential flaws in the experiment which can affect the results (see footnote 5):
1. In order to obtain the thematic grid feature from VerbNet, a classifier was trained to map French verbs to VerbNet classes. The gold standard verbs and classes were used to train the classifier (see footnote 3 on page 2 in their paper). In other words, the gold standard was used for feature extraction. This makes the clustering result higher than in fully automatic work, as the thematic grid feature is already implicitly encoded in the class label.
2. F-measure was used to select the number of clusters for K-Means and IGNG (see page 4 in their paper). This means that the gold standard was used to guide the clustering. This also makes the result unrealistically high from the perspective of automatic acquisition, as the reported best F-measure cannot be obtained when the gold standard is unknown.
5 The second point was confirmed with the first author via personal communication. We were not able to get a clarification regarding the first point.
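The second point can be made concrete with a minimal sketch. It assumes that F is the harmonic mean of modified purity and weighted class accuracy, a common choice in verb clustering evaluation; this section does not define F, so the definitions below are an assumption. The function names and the toy data are hypothetical and are not taken from either experiment.

```python
from collections import Counter, defaultdict

def f_measure(gold, pred):
    """Harmonic mean of modified purity and weighted class accuracy
    (assumed definitions; not confirmed by the text)."""
    n = len(gold)
    by_cluster = defaultdict(Counter)  # cluster id -> counts of gold classes
    by_class = defaultdict(Counter)    # gold class -> counts of cluster ids
    for g, p in zip(gold, pred):
        by_cluster[p][g] += 1
        by_class[g][p] += 1
    # Modified purity: sum the prevalent-class counts, ignoring clusters
    # whose prevalent class is represented by a single verb.
    prevalent = (max(c.values()) for c in by_cluster.values())
    mpur = sum(m for m in prevalent if m >= 2) / n
    # Weighted class accuracy: for each class, count its verbs that fall
    # in its dominant cluster.
    acc = sum(max(c.values()) for c in by_class.values()) / n
    return 0.0 if mpur + acc == 0 else 2 * mpur * acc / (mpur + acc)

def oracle_k(gold, candidates):
    """Pick the K whose clustering maximises F against the gold standard.
    This step reads the gold labels, so it is not available when the
    gold standard is unknown."""
    return max(candidates, key=lambda k: f_measure(gold, candidates[k]))

# Toy data: six verbs in two gold classes, clustered with K = 1, 2 and 3.
gold = ["A", "A", "A", "B", "B", "B"]
candidates = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}
best_k = oracle_k(gold, candidates)  # selects K = 2 on this toy data
```

In this sketch the choice of K depends entirely on the gold labels. A fully automatic setting would have to fix K in advance or select it with a criterion computed from the features alone, and the resulting F-measure would generally be no higher than the oracle value.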
In sum, the experiments reported in this chapter further support the linguistic hypothesis that Levin-style classification can be cross-linguistically applicable or overlapping (Levin, 1993). A clustering technique such as the one presented here could be used as a helpful tool to investigate this hypothesis further, and to find out whether classifications are similar across a wider range of more diverse languages. From the NLP perspective, the fact that an unsupervised technique developed for one language can be applied to another language without substantial changes in the methodology means that automatic techniques can be used to hypothesise useful Levin-style classes in a cost-effective manner (Kipper et al., 2008). This, in turn, can facilitate the creation of VerbNets for new languages.
Task-based evaluation of verb classification
VerbNet has proved useful for many practical NLP tasks including automatic verb acquisition (Swift, 2005), semantic role labelling (Swier and Stevenson, 2004), robust semantic parsing (Shi and Mihalcea, 2005), word sense disambiguation (Dang, 2004), building conceptual graphs (Hensman and Dunnion, 2004), and creating a unified lexical resource for knowledge extraction (Crouch and King, 2005). To our knowledge, automatically acquired classification has not yet been evaluated in the context of an NLP task, although such an evaluation would be important. We apply our automatically acquired verb and noun classes (SPs) to two NLP tasks: metaphor identification and argumentative zoning. We did this work in collaboration with Ekaterina Shutova and Yufan Guo. For the two tasks, the project plan, system design, experiments and evaluation were carried out by Ekaterina Shutova and Yufan Guo, respectively. The author's contribution was to provide the lexical classifications and the related statistics for the two tasks. We summarise the resulting work in this chapter. Details of the work can be found in the following publications:
Shutova et al. (2010); Guo et al. (2010, 2011b). All the examples, figures and tables in this chapter were originally authored by Ekaterina Shutova and Yufan Guo for the publications above.