CHAPTER 3 : Lexical and Non-Compositional Entailment
3.2 Supervised Model for Lexical Entailment Classification
We now turn to the task of automatically determining the basic entailment relation that holds between two natural language strings. We aim to build a statistical classifier which takes as input a pair of linguistic expressions and returns one of the basic entailment relations defined in Section 2.4. In Section 3.3, we will use this classifier to automatically add fine-grained semantic relations to each of the phrase pairs in PPDB.
3.2.1. Classifier Configuration
We train our classifier using the labeled datasets collected in Section 3.1. Because of the low frequency of exclusion relations in PPDB (Table 12), we do not attempt to automatically differentiate between the finer-grained ⊣_opp and ⊣_alt relations. Additionally, for simplicity, we fix the direction of the ⊏ and ⊐ pairs so that all are considered as ⊐ relations. Thus, we build our classifier to distinguish between 5 classes: {≡, ⊐, ⊣, ∼, ≁}. We use the scikit-learn toolkit (http://scikit-learn.org) to train a logistic regression classifier. In order to overcome the imbalanced distribution of our data, we subsample training examples from each class inversely proportionally to the class’s frequency in the training data (Table 12); this corresponds to the class_weight=‘auto’ parameter setting. We tune the regularization parameter using cross-validation on the training data.
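As a rough illustration of weighting classes inversely to their frequency, the sketch below computes one weight per class as n_samples / (n_classes × class_count). This is the formula documented for scikit-learn's `class_weight='balanced'` option; whether it matches the exact behavior of the `class_weight='auto'` setting used here is an assumption. The function name and the toy labels are hypothetical and are not the thesis data.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Hypothetical toy label distribution (not the PPDB data).
toy_labels = ["unrelated"] * 6 + ["entailment"] * 3 + ["equivalence"] * 1
weights = inverse_frequency_weights(toy_labels)
for label, weight in sorted(weights.items()):
    print(label, round(weight, 3))
```

With these weights every class contributes the same total weight to the training objective, so rare classes are not swamped by frequent ones.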
3.2.2. Feature Groups
We compute a variety of features, which we organize into six feature groups, named as follows and described below: Lexical, WordNet, Distributional, Pattern, Paraphrase, and Translation. For more precise definitions and feature templates, see Appendix A.3. For analysis purposes, we differentiate between features which rely on patterns derived from large monolingual corpora and those which rely on patterns derived from bilingual parallel corpora. When relevant, Monolingual refers to the combination of the Distributional and Pattern feature groups, and Bilingual refers to the combination of the Paraphrase and Translation feature groups.
In the descriptions below, w1 and w2 refer to lexical items and t1 and t2 are their respective syntactic categories.
Lexical Features
We compute a variety of simple lexical features for each phrase pair, including: the lemmas, part-of-speech tags, and phrase lengths of w1 and w2; the substrings shared by w1 and w2; and the Levenshtein, Jaccard, and Hamming distances between w1 and w2. This feature group is referred to as Lexical.
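The three string distances named above can be sketched as follows. The exact feature definitions are given in Appendix A.3 and are not reproduced here, so the details below are assumptions: Jaccard distance is computed over whitespace tokens, and Hamming distance pads the shorter string so that unequal lengths are handled. All function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over whitespace tokens (an assumption)."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def hamming(a, b):
    """Count differing positions; the shorter string is padded (an assumption)."""
    n = max(len(a), len(b))
    return sum(x != y for x, y in zip(a.ljust(n, "\0"), b.ljust(n, "\0")))

print(levenshtein("kitten", "sitting"))
print(jaccard_distance("little boy", "boy"))
print(hamming("boy", "bot"))
```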
WordNet Features
For each pair ⟨(w1, t1), (w2, t2)⟩, we include indicator features to capture the relation or relations to which the pair can be assigned according to WordNet. This feature group is referred to as WordNet.
Distributional Features
We follow Lin and Pantel (2001) in building distributional context vectors from dependency-parsed corpora. Given dependency context vectors for w1 and w2, we compute the number of shared contexts, as well as the cosine similarity, Jaccard distance, and several previously proposed distributional similarity measures. Specifically, we compute lin similarity, a symmetric similarity measure proposed by Lin (1998), as defined below:
\[
\text{lin similarity} = \frac{\sum_{c \in W_1 \cap W_2} \left( W_1(c) + W_2(c) \right)}{\sum_{c \in W_1} W_1(c) + \sum_{c \in W_2} W_2(c)} \tag{3.1}
\]
where Wi is the set of contexts in which wi appears and Wi(c) is the number of times wi has been observed in context c. We also compute weeds similarity, a variation proposed by Weeds et al. (2004) and aimed at capturing asymmetric similarity, as defined below.
\[
\text{weeds similarity} = \frac{\sum_{c \in W_1 \cap W_2} W_1(c)}{\sum_{c \in W_1} W_1(c)} \tag{3.2}
\]
This group of features is referred to collectively as Distributional.
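Equations 3.1 and 3.2 can be sketched directly over context-count vectors. The snippet below represents each Wi as a `Counter` mapping a context to the number of times wi was observed in it; the context strings and counts are hypothetical, and returning 0.0 for an empty vector is an assumption made to avoid division by zero.

```python
from collections import Counter

def lin_similarity(w1, w2):
    """Eq. 3.1: shared-context mass from both words over their total mass."""
    shared = w1.keys() & w2.keys()
    numerator = sum(w1[c] + w2[c] for c in shared)
    denominator = sum(w1.values()) + sum(w2.values())
    return numerator / denominator if denominator else 0.0

def weeds_similarity(w1, w2):
    """Eq. 3.2: fraction of w1's context mass that falls in contexts shared with w2."""
    shared = w1.keys() & w2.keys()
    numerator = sum(w1[c] for c in shared)
    denominator = sum(w1.values())
    return numerator / denominator if denominator else 0.0

# Hypothetical dependency-context counts, for illustration only.
boy = Counter({"nsubj:run": 3, "nsubj:play": 1})
child = Counter({"nsubj:run": 1, "nsubj:play": 1, "nsubj:cry": 6})

print(lin_similarity(boy, child))    # symmetric in its arguments
print(weeds_similarity(boy, child))  # all of boy's contexts are shared
print(weeds_similarity(child, boy))  # only part of child's contexts are shared
```

In this toy case the asymmetric score is high in one direction and low in the other, which is the kind of signal that can indicate the direction of an entailment pair.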
Lexico-Syntactic Pattern Features
Hearst (1992) and Snow et al. (2004) exploit certain textual patterns (e.g. “x and other y”) in order to infer hypernym relations from text. We follow Snow et al. (2004) in using dependency-parsed corpora to automatically recognize these “lexico-syntactic patterns”, but extend the approach to cover all of our basic relations. We refer to the features in this group collectively as Pattern.
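To make the “x and other y” pattern concrete, the sketch below matches it with a plain regular expression over raw text. This is a deliberately simplified stand-in: the features described here are extracted from dependency parses, not surface strings, and the regex only handles single-word x and y. The pattern and function names are hypothetical.

```python
import re

# Surface approximation of the "x and other y" pattern (single-word x and y only).
AND_OTHER = re.compile(r"\b(\w+) and other (\w+)\b")

def hypernym_candidates(text):
    """Return (hyponym, hypernym) candidate pairs matched by the pattern."""
    return [(m.group(1), m.group(2)) for m in AND_OTHER.finditer(text.lower())]

print(hypernym_candidates("She keeps dogs and other animals at home."))
```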
Paraphrase Features
PPDB is distributed with a variety of features, which we include in our classifier. These include 33 different measures used to rank the quality of the paraphrases, including distributional similarity, bilingual alignment probabilities, and lexical similarity. These features combined are referred to as Paraphrase features.
Translation Features
PPDB is based on the “bilingual pivoting” method, in which two phrases are considered paraphrases if they share a foreign translation. The English PPDB was built by pivoting through 24 foreign languages. We use the pivot words from all of these languages to derive a set of features, including the number of foreign language translations shared by w1 and w2 for each of the languages separately and collectively. We compute translation similarity, an asymmetric measure of the bilingual similarity of two words, as follows.
\[
\text{translation similarity} = \frac{|\tau^*(w_1) \cap \tau^*(w_2)|}{|\tau^*(w_1)|} \tag{3.3}
\]
where τ∗(wi) is the set of all the translations of wi across all 24 languages. We refer to this
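Equation 3.3 can be sketched as a set computation. Below, τ∗(wi) is represented as a set of (language, translation) pairs, which is an assumed encoding; the translation sets are hypothetical toy data, and returning 0.0 when w1 has no translations is likewise an assumption.

```python
def translation_similarity(trans1, trans2):
    """Eq. 3.3: share of w1's translations that are also translations of w2."""
    if not trans1:
        return 0.0
    return len(trans1 & trans2) / len(trans1)

# Hypothetical translation sets, encoded as (language, translation) pairs.
tau_couch = {("fr", "canape"), ("es", "sofa")}
tau_sofa = {("fr", "canape"), ("es", "sofa"), ("de", "sofa"), ("it", "divano")}

print(translation_similarity(tau_couch, tau_sofa))
print(translation_similarity(tau_sofa, tau_couch))
```

Because the denominator depends only on the first argument, swapping the arguments generally changes the score, which is what makes the measure asymmetric.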
3.2.3. Feature Analysis
The features used in our classifier are largely based on previously used methods for automatically inferring related words from text. However, in most prior work, these methods are used in isolation, or in applications which focus on a specific type of semantic relation (e.g. synonymy or hypernymy). It is therefore interesting to analyze the strengths and weaknesses of each feature group for differentiating between our five fine-grained entailment relations.
All of the results below are obtained by running ten-fold cross validation on the training split of the PpdbSick dataset (Section 3.1.3).
Ablation Analysis
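Table 15 shows the classifier’s overall performance. The classifier achieves good overall performance, even for relations which are relatively infrequent in the training data: the two rarest classes, Equivalence and Exclusion (8% of the data each), reach F1 scores of 0.57 and 0.49, respectively.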
Relation                Frequency   Accuracy   F1
Unrelated (≁)           39%         88%        0.79
Equivalence (≡)         8%          81%        0.57
Entailment (⊐)          26%         76%        0.68
Exclusion (⊣)           8%          73%        0.49
Otherwise Related (∼)   19%         64%        0.51
Table 15: Accuracy and F1 score by classifier on 10-fold cross validation over PpdbSick training data.
Table 16 shows the performance when ablating each of the feature groups. The Bilingual features (Paraphrase and Translation) are especially important for distinguishing the Equivalence class (≡), and the Pattern and WordNet features are important for the Exclusion class (⊣). The Lexical feature group exhibits strong performance for classifying all relation types; this is likely because this group indirectly captures both negation words (e.g. “no”) and substring features (“little boy” ⊏ “boy”).
               ∆F1 when excluding
     All    Lexical   Distr.   Pattern   Para.    Trans.   WordNet
≁    79.0   -1.99     -0.24    -1.23     -1.67    -0.24    -0.12
≡    56.8   -3.53     +0.22    -0.75     -2.44    -3.67    +0.46
⊐    67.9   -4.58     -0.25    -0.83     -0.76    -0.65    -1.59
⊣    48.5   -4.02     -0.76    -2.88     +0.29    -0.00    -2.23
∼    50.6   -4.93     -0.46    -0.75     -1.19    -0.89    -0.32
Table 16: Change in F1 score (× 100) achieved by classifier when ablating each feature group.
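One way to read Table 16 is to rank, for each relation, the feature groups by how much F1 drops when they are removed. The sketch below does this with the values transcribed from the table; the dictionary layout, the English relation keys (taken from the relation names in Table 15), and the function name are illustrative choices.

```python
GROUPS = ["Lexical", "Distr.", "Pattern", "Para.", "Trans.", "WordNet"]

# relation: (F1 x 100 with all features, change in F1 when excluding each group)
TABLE_16 = {
    "unrelated":     (79.0, [-1.99, -0.24, -1.23, -1.67, -0.24, -0.12]),
    "equivalence":   (56.8, [-3.53, +0.22, -0.75, -2.44, -3.67, +0.46]),
    "entailment":    (67.9, [-4.58, -0.25, -0.83, -0.76, -0.65, -1.59]),
    "exclusion":     (48.5, [-4.02, -0.76, -2.88, +0.29, -0.00, -2.23]),
    "other_related": (50.6, [-4.93, -0.46, -0.75, -1.19, -0.89, -0.32]),
}

def rank_groups(relation):
    """Feature groups ordered from largest to smallest F1 drop when ablated."""
    _, deltas = TABLE_16[relation]
    return [group for _, group in sorted(zip(deltas, GROUPS))]

def ablated_f1(relation, group):
    """F1 (x 100) obtained after removing one feature group."""
    full_f1, deltas = TABLE_16[relation]
    return full_f1 + deltas[GROUPS.index(group)]

for relation in TABLE_16:
    print(relation, rank_groups(relation)[:3])
```

For example, the three largest drops for Equivalence come from removing Translation, Lexical, and Paraphrase, and for Exclusion from removing Lexical, Pattern, and WordNet.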
Monolingual vs. Bilingual Similarity Metrics
Table 17 shows the “most similar” pairs in the PpdbSick training set, according to the various types of similarity metric defined among our features (see Section 3.2.2). Our symmetric monolingual score (lin similarity, Eq. 3.1) consistently identifies Exclusion (⊣) pairs, while our asymmetric monolingual score (weeds similarity, Eq. 3.2) is good for identifying Entailment (⊐) pairs; none of the monolingual scores we explored were effective in making the subtle distinction between Equivalent and Entailment. In contrast, the bilingual similarity metric (translation similarity, Eq. 3.3) is fairly precise for identifying Equivalent pairs, but provides less information for distinguishing between the different types of non-equivalent relations, such as distinguishing Entailment (⊐) from Unrelated (≁). These differences are further exhibited in the confusion matrices shown in Figure 6: when the classifier is trained using only the Monolingual feature groups, it misclassifies 26% of Exclusion pairs as Equivalent, whereas the classifier trained with the Bilingual feature groups makes this error only 6% of the time. However, the classifier trained with the Bilingual feature groups completely fails to predict the Entailment class, calling over 80% of such pairs Equivalent or Otherwise Related (∼).