3.1.1 Paraphrase Methods without Deep Learning

Early approaches to paraphrase detection emphasized lexical matching; for instance, Zhang and Patrick (2005) used rules to transform input sentences into canonical forms, then extracted lexical matching features. However, it quickly became apparent that simple lexical matching was not enough.

Mihalcea et al. (2006) described a semantic similarity method incorporating eight corpus- and knowledge-based measures of word-level semantic similarity, ranging from pointwise mutual information (PMI) to WordNet-based methods. The combination of these measures significantly outperformed simple lexical matching. Numerous others followed similar knowledge- and corpus-based approaches. Kozareva and Montoyo (2006) compared the performance of kNN, SVM, and MaxEnt classifiers; their feature set included word overlap, longest common subsequence (LCS), and word similarity features based on WordNet. Fernando and Stevenson (2008) used WordNet with a similarity matrix to recognize paraphrases that replaced words with synonyms or near synonyms. Ramage et al. (2009) performed a random walk over a graph that incorporated WordNet and corpus statistics, with a bias towards the neighborhood of the bag-of-words representation of the input sentence. Islam and Inkpen (2009) measured the semantic similarity of sentences using the semantic similarity of their words (calculated with a PMI-based method) and modified versions of LCS string matching. Ul-Qayyum and Altaf (2012) used LCS and bag of words to identify lexical overlap, then added a number of semantic heuristic features, such as synonymy and antonymy.
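As a rough illustration of the lexical features that recur in these systems, the following minimal sketch computes a word-overlap score and an LCS-based score for a sentence pair. It is not taken from any of the cited papers: the whitespace tokenizer, the Jaccard form of the overlap, the length normalization, and all function names are assumptions made for this example.

```python
def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return sentence.lower().split()


def word_overlap(tokens_a, tokens_b):
    """Jaccard overlap between the two token lists, treated as sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def lcs_length(tokens_a, tokens_b):
    """Length of the longest common subsequence, by dynamic programming."""
    # prev[j] holds the LCS length of the tokens of A seen so far and B[:j].
    prev = [0] * (len(tokens_b) + 1)
    for a in tokens_a:
        curr = [0]
        for j, b in enumerate(tokens_b, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lexical_features(sentence_a, sentence_b):
    """Return two illustrative lexical features for a sentence pair."""
    a, b = tokenize(sentence_a), tokenize(sentence_b)
    longest = max(len(a), len(b))
    return {
        "word_overlap": word_overlap(a, b),
        # Normalizing by the longer sentence is an assumption of this sketch.
        "lcs_ratio": lcs_length(a, b) / longest if longest else 0.0,
    }


features = lexical_features("the cat sat on the mat", "the cat lay on the mat")
```

Features of this kind would typically be passed to a classifier alongside the WordNet- or PMI-based word similarity scores described above, which this sketch does not attempt to reproduce.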

Others have emphasized the structure of sentences. Wan et al. (2006) used n-gram overlap, dependency relation overlap, dependency tree-edit distances, and difference in sentence lengths as features. Rus et al. (2008) combined dependency graph matching with lexical overlap enhanced by synonymy and antonymy information. Das and Smith (2009) combined a logistic regression model trained on lexical overlap features with a generative, quasi-synchronous grammar model that estimated whether the class paraphrase or not paraphrase maximized the probability of seeing the two sentences. Their approach thus took both lexical overlap and syntax into account. Heilman and Smith (2010) described a tree-edit distance measure for semantic similarity; if the dependency tree of one sentence can be transformed into the other in relatively few steps, the sentences are more likely to be paraphrases. Bu et al. (2012) used a string rewriting kernel to measure the lexical and structural similarity between pairs of strings without having to construct syntactic trees. Bach et al. (2014) emphasized that sentences are composed of elementary discourse units and computed the similarity of sentences based on the similarities of their component discourse units.

Rather than attempting to detect sentence similarities, Qiu et al. (2006) sought to detect dissimilarities and determine whether they were important. Their two-step process identified predicate-argument tuples that could be aligned between the two sentences, then passed tuples that could not be aligned to a classifier that determined whether the dissimilarity mattered. Some latent variable models have been developed for this task in recent years. Guo and Diab (2012) used a latent variable model that builds a semantic profile of each sentence based on both the observed words and the missing words of that sentence. Their weighted matrix factorization approach is similar to SVD, except that it enables them to force the representation of missing words in a sentence to be zero. Xu et al. (2014) sought to detect tweets that are paraphrases using a joint word-sentence approach that assumes that two tweets sharing a topic and at least one “anchor” word pair are paraphrases. Their latent variable model is specific to the short context and unique wording that appear in tweets.
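To make the contrast with SVD concrete, a generic weighted matrix factorization objective can be written as below. This is a sketch of the general technique, not necessarily the exact formulation in Guo and Diab (2012); the symbols are assumptions introduced here: X is the word-by-sentence matrix, P and Q hold the latent word and sentence vectors, W weights each cell, w_m is a small weight for missing words, and λ is a regularization coefficient.

```latex
\min_{P,\,Q} \; \sum_{i}\sum_{j} W_{ij}\left(P_{\cdot,i}^{\top} Q_{\cdot,j} - X_{ij}\right)^{2}
  + \lambda \lVert P \rVert_{2}^{2} + \lambda \lVert Q \rVert_{2}^{2},
\qquad
W_{ij} =
\begin{cases}
  1   & \text{if } X_{ij} \neq 0 \\
  w_m & \text{if } X_{ij} = 0
\end{cases}
```

Under these assumptions, a cell for a missing word has target value zero but contributes to the loss only with the small weight w_m, so observed and missing words can be treated differently. Plain SVD corresponds to weighting every cell equally.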

Several approaches have related paraphrase to machine translation. Wu (2005) used inversion-transduction grammars, which had previously been applied to machine translation and alignment, to outperform the baseline without using a thesaurus, lexical similarity model, or parameter training. Finch et al. (2005) used evaluation metrics developed for machine translation to predict whether two sentences were paraphrases. More recently, Madnani et al. (2012) were quite successful in training a meta-classifier using eight machine translation metrics as features.

Vector- and matrix-based approaches have seen recent successes as well. Blacoe and Lapata (2012) compared shallow methods for composing word embeddings with deeper ones such as recurrent neural networks, and found that the shallow approaches performed approximately as well while requiring less computation. Milajevs et al. (2014) compared a number of types of word vectors using several compositional methods. For paraphrase, they found that unlemmatized neural word embeddings gave superior results compared to co-occurrence vectors.

Ji and Eisenstein (2013) invented a term-weighting scheme called TF-KLD, an alternative to the commonly used TF-IDF that takes advantage of paraphrase training data in calculating a term’s weight. They built a matrix in which each row represented an input sentence using this scheme, then applied a matrix decomposition technique called nonnegative matrix factorization to generate shorter vectors for each sentence. They compared sentence pairs using an SVM on the elementwise sum of the two sentence vectors concatenated with the absolute value of their elementwise difference. They also tested a simpler cosine similarity measure, but found the SVM more effective. Yin and Schütze (2015b) built on the method of Ji and Eisenstein (2013), modifying the TF-KLD scheme to TF-KLD-KNN, which handled words and phrases not seen in the training data using k-nearest-neighbors. They used embeddings not only for individual words, but also for both continuous and discontinuous phrases. Finally, they appended the eight machine translation metrics from Madnani et al. (2012) to the vectors they input into the SVM, generating a slight improvement in performance over previous methods.
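The pair representation described above (elementwise sum concatenated with absolute elementwise difference) and the cosine baseline can be sketched as follows. This is only an illustration: the function names and the toy vectors are hypothetical, and the TF-KLD weighting, the matrix factorization step, and the SVM itself are not reproduced.

```python
import math


def pair_features(vec_a, vec_b):
    """Concatenate the elementwise sum with the absolute elementwise difference."""
    if len(vec_a) != len(vec_b):
        raise ValueError("sentence vectors must have the same length")
    summed = [x + y for x, y in zip(vec_a, vec_b)]
    abs_diff = [abs(x - y) for x, y in zip(vec_a, vec_b)]
    return summed + abs_diff


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy latent sentence vectors (made-up values, for illustration only).
sentence_a = [1.0, 2.0]
sentence_b = [3.0, 0.5]
features = pair_features(sentence_a, sentence_b)
similarity = cosine_similarity(sentence_a, sentence_b)
```

Both halves of the feature vector are symmetric in the two inputs, so the representation does not depend on the order in which the sentences of a pair are presented.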