Evaluation - Word meaning in context : a probabilistic model and its application to question an

We test our method for context-sensitive similarity between syntactic patterns on two related tasks: 1) given a pattern p1 instantiated in a particular context c and a second pattern p2, assess the similarity between the two patterns given c; and 2) given a pattern p and a context c, generate the top-N most similar patterns in context c.

The first evaluation scenario is similar to that of the related work on contextual preferences for DIRT. In these evaluations, given an instantiated pattern p1 and a second phrase p2, systems have to decide if the paraphrase holds. Unfortunately, since none of these evaluation data sets were made available to us, a comparison with these previous methods was not possible. To evaluate our proposal, we compare it against the original DIRT algorithm. For this purpose, we build an evaluation data set by adapting the data present in the lexical substitution task.

To our knowledge, none of the previous related work has approached the second, more difficult task, that of inducing context-sensitive paraphrases. This task consists of generating context-appropriate paraphrases rather than assessing whether a given paraphrase is correct. Throughout this section we use our framework to generate context-sensitive paraphrases for a set of patterns extracted from question answering (QA) data. In this chapter, we exemplify the main issues we observe when performing paraphrase induction, while in Chapter 9 we induce paraphrases for question expansion in QA.

We follow Lin and Pantel and, for each pattern occurring in the corpus, we also add its inverse, in which the X and Y fillers are interchanged, and the direction of the rule is reversed. This way, one can also identify similar patterns in which the X filler of one pattern matches the Y filler of the second pattern.
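The inversion step can be sketched as follows. The token encoding (edges written as "<label" or ">label", everything else a word on the path) and the names `invert` and `add_inverses` are illustrative assumptions, not the representation used in the experiments.

```python
def invert(pattern):
    """Return the inverse of a dependency pattern.

    A pattern is the tuple of tokens between the X and Y slots. Edges are
    strings such as "<subj" (arrow pointing left) or ">obj" (arrow pointing
    right); every other token is a word on the path.
    """
    flipped = []
    for token in reversed(pattern):
        if token.startswith("<"):
            flipped.append(">" + token[1:])
        elif token.startswith(">"):
            flipped.append("<" + token[1:])
        else:
            flipped.append(token)
    return tuple(flipped)


def add_inverses(counts):
    """Extend {(pattern, x_filler, y_filler): count} with inverse patterns.

    The X and Y fillers are interchanged for each inverse entry.
    """
    extended = dict(counts)
    for (pattern, x, y), n in counts.items():
        key = (invert(pattern), y, x)
        extended[key] = extended.get(key, 0) + n
    return extended


# X <-subj- shed -obj-> Y becomes X <-obj- shed -subj-> Y
shed = ("<subj", "shed", ">obj")
print(invert(shed))
```

With this encoding, a count observed for (dog, pound) under the original pattern is also recorded for (pound, dog) under its inverse.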

7.4.1 Experimental setup

Data We use the lexical substitution data (LST for short, described in detail in Section 5.4.1) to build a set of instantiated patterns together with appropriate substitutes. We start by parsing the sentences using the Stanford dependency parser. We then extract all dependency paths containing the target word from each LST sentence. An example of such a path is pound ←obj− shed −subj→ dog for the target word shed and the following sentential context:

(5) Feeding an Overweight Dog [ offsite link ] To help your overweight dog shed some pounds, you might need to change his eating habits - either what or how much he consumes.

In the next step, we use the word substitutes provided by the LST data to build pattern substitutes. An example is obtaining the pattern dog ←subj− lose −obj→ pound as a substitute for the pattern dog ←subj− shed −obj→ pound. The confidence score assigned to it is given by the number of people that suggested lose as a good alternative for shed.
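A minimal sketch of this substitution step, assuming a pattern is stored as a tuple of tokens that includes its two anchors. The function name `pattern_substitutes` and the token encoding are hypothetical; the substitute counts are those listed for this instance in Table 7.3.

```python
def pattern_substitutes(pattern, target, gold):
    """Swap the target word in `pattern` for each gold substitute.

    `pattern` is a tuple of path tokens, `target` is the word to replace,
    and `gold` maps each substitute word to the number of annotators who
    proposed it, which is used as the confidence score.
    """
    result = {}
    for word, votes in gold.items():
        replaced = tuple(word if tok == target else tok for tok in pattern)
        result[replaced] = votes
    return result


pattern = ("dog", "<subj", "shed", ">obj", "pound")
gold = {"lose": 5, "relinquish": 1, "discard": 1}
for substitute, score in pattern_substitutes(pattern, "shed", gold).items():
    print(" ".join(substitute), score)
```

The sketch replaces every token equal to the target, so it assumes the target word does not also occur as an anchor.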

Pattern in context                        Gold substitutes                   Score
virus ←obj− shed −prep→ to −pobj→ cat     ←obj− pass −prep→ to −pobj→        2
                                          ←obj− give −prep→ to −pobj→        2
                                          ←obj− transmit −prep→ to −pobj→    2
pound ←subj− shed −obj→ dog               ←subj− lose −obj→                  5
                                          ←subj− relinquish −obj→            1
                                          ←subj− discard −obj→               1

Table 7.3: Data instances obtained from the LST data.

Table 7.3 shows instances of target patterns in the obtained data, together with their correct substitutes. A system is presented with such a target pattern together with the full set of candidate substitutes, obtained by pooling all the substitutes for that target word. The similarity scores returned by a system are used to rank this list, ideally with the correct substitutes ranked at the top.
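The pooling and ranking procedure can be illustrated with a short sketch. The function names and the toy similarity scores are hypothetical and only stand in for the scores a system would return; the substitute words come from Table 7.3.

```python
def pool_substitutes(gold_per_instance):
    """Union of the gold substitutes over all instances of one target word."""
    pool = set()
    for gold in gold_per_instance:
        pool.update(gold)
    return sorted(pool)


def rank_candidates(pool, score):
    """Order the pooled candidates by decreasing similarity score."""
    return sorted(pool, key=score, reverse=True)


# Gold substitutes of the two `shed` instances shown in Table 7.3.
instances = [
    {"pass": 2, "give": 2, "transmit": 2},
    {"lose": 5, "relinquish": 1, "discard": 1},
]
pool = pool_substitutes(instances)

# Hypothetical scores for the context (dog, pound); not real system output.
toy_scores = {"lose": 0.9, "discard": 0.4, "give": 0.3,
              "relinquish": 0.2, "pass": 0.1, "transmit": 0.05}
print(rank_candidates(pool, toy_scores.get))
```

Every instance of a target word is ranked against the same pool, so a context-insensitive scorer necessarily returns the same order for all of them.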

The particular syntax-based representation that the DIRT method uses is best suited for learning verbal paraphrases, i.e. patterns which are verb-rooted (Lin and Pantel [2001a]). For this reason we only use the verb subset of the LST data.

Models We test our method using LDA for latent class induction against the DIRT algorithm baseline, both using the same input frequency matrix.

The input frequency matrix is extracted from the XIE GigaWord fragment containing approximately 100 million tokens. We parse the text with the Stanford dependency parser to obtain dependency graphs from which we extract patterns together with counts of their left and right fillers. We extract paths containing at most four words, including the two noun anchors. Furthermore, we impose a frequency threshold on patterns and words, leading to a collection of ≈80K distinct paths, with filler nouns ranging over a vocabulary of ≈40K words.

We use the LDA model to estimate latent senses using the Gibbs sampling algorithm. As in the previous chapter, we set α = 50/K and β = 0.01. We test five values of K: {800, 1000, 1200, 1400, 1600}. These are chosen to be large since they represent the global set of meanings shared by all the patterns in the collection. The DIRT method is implemented following the description in Lin and Pantel [2001b].
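For reference, the hyperparameter grid can be written out explicitly. This is only a configuration sketch that reads the Dirichlet prior as α = 50/K; the names `K_VALUES`, `BETA` and `lda_settings` are hypothetical and not tied to any particular LDA implementation.

```python
K_VALUES = [800, 1000, 1200, 1400, 1600]  # numbers of latent senses tested
BETA = 0.01                               # prior on the sense-word distributions


def lda_settings(k_values=K_VALUES, beta=BETA):
    """One settings dictionary per K, with alpha tied to K as 50 / K."""
    return [{"K": k, "alpha": 50 / k, "beta": beta} for k in k_values]


for setting in lda_settings():
    print(setting)
```

Under this reading, α ranges from 0.0625 for K = 800 down to 0.03125 for K = 1600.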

7.4.2 Results

We start by investigating the effect of parameter K on the performance of the models. Figure 7.1 plots the Kendall τ_b score obtained with each of the five K values. The similarity measure used is scalar product. Similarly to the other experimental evaluations in Chapters 5 and 6, we also build a mixture model which averages the similarity scores returned by each individual K setting.
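The two ingredients of this evaluation, score averaging and Kendall τ_b, can be sketched in a few lines. The function names are hypothetical, and the implementation is a plain quadratic-time version of the standard τ_b definition rather than the evaluation code used for the reported numbers.

```python
from math import sqrt


def mixture_scores(per_k_scores):
    """Average, per candidate, the similarity scores of the individual models.

    `per_k_scores` is a list with one {candidate: score} dict per K setting;
    all dicts are assumed to cover the same candidates.
    """
    candidates = per_k_scores[0].keys()
    return {c: sum(s[c] for s in per_k_scores) / len(per_k_scores)
            for c in candidates}


def kendall_tau_b(xs, ys):
    """Kendall tau-b between two equally long score lists, correcting for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from both factors
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x)
                 * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0
```

For one instance, `xs` would hold the gold confidence scores of the pooled candidates (zero for non-gold candidates) and `ys` the scores assigned by the system; how per-instance values are aggregated is not specified here.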

As can be observed, the individual LDA models outperform DIRT for all K values. As suggested by the previous experiments in Chapter 6, the mixture model outperforms all of the individual models. This is an advantage, since tuning the parameter K becomes unnecessary.

Figure 7.1: LDA and LDA-MIX (scalar product similarity) vs. DIRT

In Figures 7.2 and 7.3 we plot the same models, this time using cosine and inverse Jensen-Shannon (JS) as similarity measures. Cosine performs similarly to scalar product, while in the case of JS we notice a significant drop in performance, with the K = 1600 setting performing slightly worse than DIRT. However, the mixture model still outperforms DIRT.
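The three similarity measures can be sketched over two distributions given as equal-length lists of probabilities. Turning the JS divergence into a similarity by taking its reciprocal is an assumption made for this illustration; the exact form of the inverse JS measure is not spelled out in this section.

```python
from math import log2, sqrt


def scalar_product(p, q):
    return sum(a * b for a, b in zip(p, q))


def cosine(p, q):
    norm = sqrt(scalar_product(p, p)) * sqrt(scalar_product(q, q))
    return scalar_product(p, q) / norm if norm else 0.0


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, bounded between 0 and 1 with log base 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def inverse_js(p, q):
    """Assumed similarity form: reciprocal of the JS divergence."""
    d = js_divergence(p, q)
    return 1.0 / d if d > 0 else float("inf")


p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(scalar_product(p, q), cosine(p, q), inverse_js(p, q))
```

Scalar product is sensitive to how peaked the two distributions are, whereas cosine normalizes this away, which is one reason the measures can rank candidates differently.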

Figure 7.2: LDA and LDA-MIX (cosine similarity) vs. DIRT

Figure 7.3: LDA and LDA-MIX (JS similarity) vs. DIRT

The results of the LDA-MIX model using scalar product, in both the Kendall τ_b and GAP evaluation metrics, are given in Table 7.4. We also test an LDA model that ignores context, which scores between 11 and 14 τ_b depending on the similarity measure used. This is lower than DIRT (14.5 τ_b), indicating that DIRT is indeed a good method for computing (isolated) pattern similarity. We perform significance testing using randomized shuffling as described in Chapter 5. The LDA methods using context outperform DIRT at significance level p < 0.005. Using scalar product as similarity outperforms cosine, which in turn outperforms JS divergence.
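A paired randomization test of this kind can be sketched as follows. The details of the procedure in Chapter 5 are not reproduced here, so the statistic (difference in mean per-instance score), the two-sided comparison and the add-one smoothing of the p-value are assumptions; the function name is hypothetical.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired randomization test on the difference in mean score.

    `scores_a` and `scores_b` hold the per-instance scores of two systems on
    the same instances. In each trial the two scores of every instance are
    swapped with probability 0.5, and the shuffled difference is compared
    with the observed one.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

A difference is then reported as significant at a given level if the returned p-value falls below it.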

Model      τ_b      GAP
Random     0.0      34.91
DIRT       14.53    48.06
LDA_sp     21.27    52.37
LDA_cos    21.38    51.12
LDA_JS     17.31    50.06

Table 7.4: Results on Lexical Substitution data
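For readers unfamiliar with GAP, the following sketch implements one common formulation of generalized average precision for weighted gold substitutes. Whether it matches the exact variant behind the numbers in Table 7.4 is an assumption, and the function name is hypothetical.

```python
def gap(ranking, gold):
    """Generalized average precision of a ranked candidate list.

    `ranking` lists the candidates from best to worst and `gold` maps each
    gold substitute to its weight; candidates absent from `gold` weigh 0.
    """
    def weighted_precision(weights):
        total = 0.0
        running = 0.0
        for i, w in enumerate(weights, start=1):
            running += w
            if w > 0:
                total += running / i  # mean gold weight of the top i items
        return total

    system = [gold.get(c, 0) for c in ranking]
    ideal = sorted(gold.values(), reverse=True)
    denom = weighted_precision(ideal)
    return weighted_precision(system) / denom if denom else 0.0


gold = {"lose": 5, "relinquish": 1, "discard": 1}
print(gap(["lose", "relinquish", "discard", "give"], gold))
print(gap(["give", "lose", "relinquish", "discard"], gold))
```

The score is 1 when the gold substitutes are ranked first in order of decreasing weight, and it decreases as non-gold candidates move above them.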

In Table 7.5 we list the rankings returned for three occurrences of the pattern X ←subj− shed −obj→ Y. The contexts considered are (you, blood), extracted from sentence (6), (dog, pound) (sentence (5)) and (study, light) (sentence (7)), each of them illustrating a different sense of the verb shed. The gold substitutes are highlighted in bold and the confidence score is given in parentheses.

(6) You have shed blood for us and we thank you .

(7) A mouse study sheds light on the mixed results coming from investigations into the cognitive effects of hormone replacement therapy.

The DIRT method is context-insensitive and therefore returns the same ranking for all instances (first column of Table 7.5). The context-sensitive methods allow us to obtain more informative, instance-specific rankings. For each of these contexts, the rankings differ to a great extent and favor the context-appropriate substitutes, such as lose for dog shed pound or reveal for study shed light. It is interesting to notice that shed light is ambiguous, as it can also refer to radiating light; although study dismisses this meaning, it is still reflected in the ranking obtained, as the substitutes give and emit rank second and third.

In document Word meaning in context : a probabilistic model and its application to question answering (Page 100-104)