Encouraged by the good results obtained using the SP feature, we conducted a preliminary experiment with a semantic feature important for manual classification, but previously not used in automatic verb classification: diathesis alternations (DAs).
DAs are the regular alternations of the syntactic expression of verbal arguments, sometimes accompanied by a change in meaning. For example:
• The man broke the window.
• The window broke.
In Levin’s classification, a verb class is usually characterized in terms of DAs. For example, COOK verbs (e.g. bake, cook, fry, toast, ...) can take DAs such as the causative alternation, the middle alternation and the instrument subject alternation.
A few studies have addressed automatic DA detection (McCarthy and Korhonen, 1998; Lapata, 1999; McCarthy, 2000; Tsang and Stevenson, 2004), but they all rely on WordNet. No prior work has incorporated automatically acquired DAs to aid verb classification.
We can define two approaches to DA acquisition: detection and approximation. Detection is similar to the previous work (McCarthy and Korhonen, 1998; Lapata, 1999; McCarthy, 2000; Tsang and Stevenson, 2004): SCFs are first acquired and then supplemented with semantic features (e.g. SPs) in order to detect whether a DA is present. The WordNet SPs can be replaced with automatically acquired SPs, which have the benefit that they can be ported across tasks.
Specific approximations of features related to DAs have been attempted in an earlier verb classification experiment; the causativity feature in Merlo and Stevenson (2001) (section 2.3.2) is one example. Since our goal is to improve verb classification, we do not need to know which sentences are alternating in order to make use of DAs. We only need to make DAs part of the classification model. One way is to model DAs as correlations between frames: if we observe that two types of frames co-occur frequently enough, we assume that a DA potentially holds between them. One drawback of the approximation approach is false positives (pairs of frames that co-occur frequently but do not form a DA). In what follows, we evaluate the potential usefulness of approximation (with false positives) for verb classification. We discuss the two approaches in the subsequent sections.
Frame          Example sentence                        Example frequency
NP+PP(on)      Jessica sprayed paint on the wall.      40
NP+PP(with)    Jessica sprayed the wall with paint.    30
PP(with)       *The wall sprayed with paint.           0
PP(on)         Jessica sprayed paint on the wall.      30

Table 8.1: Example frames for the verb spray
Figure 8.1: Graphical model for the joint probability of pairs of frames. $v$ represents a verb, $a$ represents a DA and $f$ represents a specific frame out of a total of $M$ possible frames.

Diathesis alternation approximation
A DA can be approximated by a pair of frames. We define a frame as an SCF parameterized for the preposition. Example frames for the verb “spray” are shown in table 8.1.
The feature value of a single frame feature is the frequency of the frame. Two frames $f_v(i)$, $f_v(j)$ of a verb $v$ can be transformed into a feature pair $(f_v(i), f_v(j))$ as an approximation to a DA. The feature value of the DA feature $(f_v(i), f_v(j))$ is approximated by the joint probability of the pair of frames $p(f_v(i), f_v(j) \mid v)$, obtained by integrating over all the possible DAs. In other words, the key assumption is that the joint probability of two frames correlates strongly with the DAs, provided that the joint probability is properly modelled by taking the hidden DAs into account. We use the DA feature $(f_v(i), f_v(j))$ with its value $p(f_v(i), f_v(j) \mid v)$ as a new feature for verb clustering.
As a comparison point, we can ignore the DA and make a frame independence assumption. The joint probability is then decomposed as:

$$p(f_v(i), f_v(j) \mid v)' \triangleq p(f_v(i) \mid v) \cdot p(f_v(j) \mid v) \tag{8.1}$$

Since SCFs are generated by the underlying meaning components (Levin and Hovav, 2006), they are dependent. The dependency of the frames is represented by the simple graphical model shown in figure 8.1. The verb $v$ and the frames $f$ are observed, and the alternation $a$ is hidden. The aim is to approximate but not to detect a DA, so $a$ is summed out:
$$p(f_v(i), f_v(j) \mid v) = \sum_{a} p(f_v(i), f_v(j) \mid a) \cdot p(a \mid v) \tag{8.2}$$
In order to evaluate this sum, we make a relaxation¹: the sum in equation 8.2 is replaced with the maximum (max). This is a reasonable relaxation, as a pair of frames rarely participates in more than one type of DA.

$$p(f_v(i), f_v(j) \mid v) \approx \max_{a} \big( p(f_v(i), f_v(j) \mid a) \cdot p(a \mid v) \big) \tag{8.3}$$

The second relaxation further relaxes the first by replacing the max with the least upper bound (sup): if $f_v(i)$ occurs $a$ times, $f_v(j)$ occurs $b$ times and $b < a$, the number of times that a DA occurs between $f_v(i)$ and $f_v(j)$ must be less than or equal to $b$.
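$$
\begin{aligned}
p(f_v(i), f_v(j) \mid v) &\approx \sup\{p(f_v(i), f_v(j) \mid a)\} \cdot \sup\{p(a \mid v)\} \\
\sup\{p(f_v(i), f_v(j) \mid a)\} &= Z^{-1} \cdot \min(f_v(i), f_v(j)) \\
\sup\{p(a \mid v)\} &= 1 \\
Z &= \sum_{m} \sum_{n} \min(f_v(m), f_v(n))
\end{aligned}
\tag{8.4}
$$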
So we end up with a simple form:
$$p(f_v(i), f_v(j) \mid v) \approx Z^{-1} \cdot \min(f_v(i), f_v(j)) \tag{8.5}$$

The equation is intuitive: if $f_v(i)$ occurs 40 times and $f_v(j)$ occurs 30 times, a DA between $f_v(i)$ and $f_v(j)$ can occur at most 30 times. This upper bound is used as the feature value of the DA feature.
The original feature vector $f$ of dimension $M$ is transformed into an $M^2$-dimensional feature vector $\tilde{f}$. Table 8.2 shows the transformed feature space for the example verb spray. We can see that the feature space matches our expectation well: the valid DA has a value greater than 0 and the wrong DA is assigned the value 0.
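As a minimal sketch of equation 8.5, the snippet below computes the frame pair feature values from the spray frequencies in table 8.1. The function name `da_pair_features` and the dictionary layout are illustrative assumptions, not part of the original system. The normaliser $Z$ is summed here over unordered frame pairs, including each frame paired with itself; this is an assumption, chosen because it yields $Z = 190$ and so reproduces the values 0.158 and 0.211 reported in table 8.2.

```python
from itertools import combinations_with_replacement


def da_pair_features(frame_counts):
    """Approximate DA features as min(f_i, f_j) / Z (equation 8.5).

    frame_counts maps a frame label to its frequency for one verb.
    Pairs are unordered and include self-pairs (an assumption, see lead-in).
    """
    frames = list(frame_counts)
    pairs = list(combinations_with_replacement(frames, 2))
    # Upper bound on how often a DA can hold between the two frames.
    upper_bounds = {
        (a, b): min(frame_counts[a], frame_counts[b]) for a, b in pairs
    }
    z = sum(upper_bounds.values())
    if z == 0:
        # Verb with no observed frames: all features are zero.
        return {pair: 0.0 for pair in pairs}
    return {pair: bound / z for pair, bound in upper_bounds.items()}


# Frequencies taken from table 8.1.
spray = {"NP+PP(on)": 40, "NP+PP(with)": 30, "PP(with)": 0, "PP(on)": 30}
features = da_pair_features(spray)
for pair, value in features.items():
    print(pair, round(value, 3))
```

Any pair that involves the unattested frame PP(with) receives the value 0, while the locative pair (NP+PP(on), NP+PP(with)) receives 30/190.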
Preliminary experiment
In order to evaluate the usefulness of this model, a preliminary verb clustering experiment was performed using three feature sets:
• F1: F-SCF+PP(B) (See table 2.4)
• F2: The frame pair features built from F1 with frame independence assumption (equation 8.1). This feature is not a proper DA feature as it ignores the inter-dependency of the frames which are produced by the underlying DA.
• F3: The frame pair features (DAs) built from F1 with the frame dependency assumption (equation 8.4).
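To make the F2 baseline concrete, the following sketch builds frame pair features under the independence assumption of equation 8.1, using the spray frequencies from table 8.1. The function name `independent_pair_features`, the use of relative frequencies as estimates of $p(f_v(i) \mid v)$, and the restriction to unordered pairs are assumptions made for illustration only.

```python
from itertools import combinations_with_replacement


def independent_pair_features(frame_counts):
    """Frame pair features under frame independence (equation 8.1).

    Each pair value is p(f_i | v) * p(f_j | v), with the marginals
    estimated as relative frequencies (an illustrative assumption).
    """
    total = sum(frame_counts.values())
    frames = list(frame_counts)
    pairs = list(combinations_with_replacement(frames, 2))
    if total == 0:
        return {pair: 0.0 for pair in pairs}
    probs = {frame: count / total for frame, count in frame_counts.items()}
    return {(a, b): probs[a] * probs[b] for a, b in pairs}


# Frequencies taken from table 8.1.
spray = {"NP+PP(on)": 40, "NP+PP(with)": 30, "PP(with)": 0, "PP(on)": 30}
f2 = independent_pair_features(spray)
for pair, value in f2.items():
    print(pair, round(value, 3))
```

Under this assumption each pair value is fully determined by the two single-frame probabilities, so the pair features carry no information about how the frames depend on each other.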
¹ A relaxation is a method used in mathematical optimization that relaxes a strict requirement, either by substituting an easier requirement for it or by dropping it completely.
Frame pair                    Possible alternation    Occurrence    Feature value
NP+PP(on), NP+PP(with)        Locative                30            0.158
NP+PP(on), PP(with)           Causative(with)         0             0
NP+PP(on), PP(on)             Causative(on)           30            0.158
NP+PP(with), PP(with)         ?                       0             0
NP+PP(with), PP(on)           ?                       30            0.158
PP(with), PP(on)              ?                       0             0
NP+PP(on), NP+PP(on)          ?                       40            0.211
NP+PP(with), NP+PP(with)      ?                       30            0.158
PP(with), PP(with)            ?                       0             0
PP(on), PP(on)                ?                       30            0.158

Table 8.2: Example frame pair features for the verb spray
The datasets are test sets 7-11 (3-14 classes) in Joanis et al. (2007), and the 17-class test set (T2) of Sun et al. (2008b) introduced in chapter 3.
The SPEC clustering algorithm was used. In the work described in chapter 3, we used a divergence-based distributional similarity measure. Due to the high dimensionality of the quadratic feature space, the computational cost of the divergence similarity measure (e.g. equation 3.1) is prohibitive, so we use the Bhattacharyya kernel (Jebara and Kondor, 2003) to improve computational efficiency.
$$w_b(v, v') = \sum_{d=1}^{D} (v_d \, v'_d)^{1/2} \tag{8.6}$$
The mean-field bound of the Bhattacharyya kernel is very similar to the KL divergence kernel (Jebara et al., 2004). The form of the Bhattacharyya kernel is relatively simple, which also helps the theoretical analysis in the next section.
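Equation 8.6 can be sketched in a few lines of Python. The function name `bhattacharyya_kernel` and the toy vectors are hypothetical; the vectors are assumed to be non-negative and of equal dimension $D$ (e.g. normalised feature values for two verbs).

```python
import math


def bhattacharyya_kernel(v, w):
    """Bhattacharyya kernel of equation 8.6: sum over d of sqrt(v_d * w_d).

    v and w are equal-length sequences of non-negative feature values.
    """
    if len(v) != len(w):
        raise ValueError("feature vectors must have the same dimension")
    return sum(math.sqrt(a * b) for a, b in zip(v, w))


# Toy vectors for illustration only (not taken from the experiments).
verb_a = [0.5, 0.5, 0.0]
verb_b = [0.5, 0.0, 0.5]
print(bhattacharyya_kernel(verb_a, verb_a))  # identical distributions
print(bhattacharyya_kernel(verb_a, verb_b))  # partially overlapping
```

A single pass over the $D$ dimensions suffices for each pair of verbs, which is why the kernel stays cheap even in the quadratic feature space.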
To further reduce the computational complexity, only a set of features with high frequency over the instances was used. For 3-6 way classifications (Joanis et al.’s test sets 7-9), 50 features were used; for 7-17 way classifications, 100 features were used. In the next section, we will show that F3 outperforms F1 regardless of the feature number setting.
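One possible reading of this selection step is sketched below: feature frequencies are summed over all verbs and the $k$ most frequent features are kept. The function name `top_k_features`, the data layout and the counts for the second verb are hypothetical; the exact selection criterion used in the experiment is not specified beyond "high frequency features over instances".

```python
from collections import Counter


def top_k_features(verb_vectors, k):
    """Keep the k features with the highest total frequency over all verbs.

    verb_vectors maps each verb to a dict of feature -> frequency.
    Returns the same structure restricted to the selected features.
    """
    totals = Counter()
    for features in verb_vectors.values():
        totals.update(features)  # adds the counts feature by feature
    kept = [name for name, _ in totals.most_common(k)]
    return {
        verb: {name: features.get(name, 0) for name in kept}
        for verb, features in verb_vectors.items()
    }


# "spray" counts follow table 8.1; "load" counts are invented toy data.
data = {
    "spray": {"NP+PP(on)": 40, "NP+PP(with)": 30, "PP(on)": 30},
    "load": {"NP+PP(on)": 10, "NP+PP(with)": 25, "NP": 5},
}
reduced = top_k_features(data, k=2)
print(reduced)
```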
The results are shown in table 8.3. F2 performs worse than F3 on every test set, and worse than F1 for 3-6 way classification. This indicates that the frame independence assumption is a poor one. F3 yields substantially better results than both F2 and F1. This experiment shows that DA features are clearly more effective than the frame features on these two datasets, even when relaxations are used.

Analysis with the Bhattacharyya kernel
In this section, we examine the effect of the DA features by investigating their impact on the kernel, especially the correlation with the feature frequency.
               Joanis et al.                               Sun et al.
Feature set    7        8        9        10       11
F1             54.54    49.97    35.77    46.61    38.81    60.03
F2             50.00    49.50    32.79    54.13    40.61    64.00
F3             56.36    53.79    52.90    66.32    50.97    69.62

Table 8.3: Results when using F3 (DA), F2 (pair of independent frames) and F1 (single frame) features with the Bhattacharyya kernel
Figure 8.2: Comparison between frame features (in blue) and DA features (in red) with different feature number settings. DA features clearly outperform frame features. The left figure is the result on test set 10 (8 ways). The right figure is the result on test set 11 (14 ways). The x axis is the number of features. The y axis is the F-measure.
We prove that the DA feature increases the impact of the middle-range frequency frames on the Bhattacharyya kernel. The high frequency features still have a larger impact in general than the low frequency ones. This is a desirable property, as high frequency features are often considered to be more reliable than low frequency ones. The details of the mathematical proof are given in Appendix C.
An experiment was carried out using F1 and F3 features on test sets 10 and 11 of Joanis et al. (2007). The frequency-ranked frames were added to the clustering one at a time, starting from the most frequent one. The results are shown in figure 8.2. F3 (in red) clearly outperforms F1 (in blue) on all the feature number settings. After some highly frequent frames have been added (30 for test set 10 and 60 for test set 11), the performance of F1 does not improve further. This is in line with the mathematical proof in equation C.2: the kernel value is dominated by the top frequency frames in F1. The performance of F3, in contrast, generally improves with almost all the frames, including the mid-range frequency frames. However, the improvement becomes less significant for the frames with relatively low frequency.
In conclusion, this experiment demonstrates that the performance of frame features is dominated by the high frequency frames, whereas the DA features reduce this dominance by enabling the mid-range frequency frames to further improve the performance.
Future work
Our preliminary experiment shows, for the first time, that automatically acquired DAs can provide a useful feature for verb classification. In the future, we plan to evaluate the performance of DA features in a larger scale experiment. We have not yet been able to perform large scale experiments, because the dimensionality of the transformed feature space is too high (quadratic in the size of the original feature space). An unsupervised dimensionality reduction technique (e.g. Zhao and Liu (2007)) will need to be used in order to improve the computational efficiency. Moreover, we plan to integrate the DA feature with other features (e.g. SPs) in order to further improve the accuracy of verb clustering.
Detecting diathesis alternations from selectional preferences
A few studies, including McCarthy and Korhonen (1998), Lapata (1999) and McCarthy (2000), have attempted DA detection using SPs, with WordNet (Miller, 1995) classes employed as SP classes. We plan to investigate whether SP acquisition using our new unsupervised technique (section 3.2) could be used for DA detection. Compared with the latent variable model, this approach aims to actually detect DAs and find the participating instances instead of just approximating DAs.
We will investigate the best approaches to DA detection using automatically acquired SPs. The method needs to be general enough to cover most types of DAs and efficient enough for a large scale experiment.
One of the main problems in previous work on DA detection has been data sparseness in the syntactic slots for which SPs are acquired. In order to detect DAs reliably, we plan to experiment with a very large corpus (e.g. the Gigaword corpus (Graff et al., 2003)). We will compare the resulting DAs to the DAs listed in Levin (1993). We will also evaluate the usefulness of DA features in the verb classification task.