Tanveer J Siddiqui*
3 CORPUS-BASED TECHNIQUES
There are two possible approaches to corpus-based WSD: supervised and unsupervised. Supervised approaches use a machine-learning algorithm to learn a classifier for disambiguation. Unsupervised approaches use a raw corpus but suffer from low accuracy.
3.1 Supervised Techniques
Supervised WSD uses a machine-learning algorithm to learn a classifier automatically from a sense-annotated corpus. Existing WSD work has used decision lists, decision trees, Naive Bayes (NB), neural networks, support vector machines, etc. for disambiguation. The decision list is one of the most efficient supervised algorithms: it learns a list of rules by extracting features from a training corpus and applies these rules to identify the correct sense of a word. Commonly used features include part of speech (POS), collocation vectors, neighboring words and their POS tags, co-occurrence vectors, etc. These features are used to create rules of the form (feature value, sense, score), which constitute the decision list.
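The rule format (feature value, sense, score) can be illustrated with a minimal sketch. The smoothed log-odds score, the smoothing constant alpha, the function names and the feature strings below are assumptions made for illustration; they are not taken from the cited work.

```python
import math
from collections import defaultdict


def train_decision_list(examples, alpha=0.1):
    """Build rules (feature, sense, score), sorted by decreasing score.

    examples: list of (set_of_features, sense) pairs; at least two
    distinct senses are assumed.
    alpha: hypothetical smoothing constant that keeps the ratio finite.
    """
    counts = defaultdict(lambda: defaultdict(float))
    senses = set()
    for features, sense in examples:
        senses.add(sense)
        for f in features:
            counts[f][sense] += 1.0

    rules = []
    for f, by_sense in counts.items():
        total = sum(by_sense.values())
        for sense in sorted(senses):
            # Smoothed estimate of P(sense | feature).
            p = (by_sense.get(sense, 0.0) + alpha) / (total + alpha * len(senses))
            # Log-odds of this sense against all other senses.
            rules.append((f, sense, math.log(p / (1.0 - p))))
    rules.sort(key=lambda r: r[2], reverse=True)
    return rules


def classify(rules, features, default=None):
    """Return the sense of the highest-scoring rule whose feature is present."""
    for f, sense, _score in rules:
        if f in features:
            return sense
    return default


examples = [
    ({"w-1=river", "pos=NN"}, "bank/shore"),
    ({"w+1=account", "pos=NN"}, "bank/finance"),
    ({"w-1=river"}, "bank/shore"),
    ({"w+1=loan"}, "bank/finance"),
]
rules = train_decision_list(examples)
print(classify(rules, {"w+1=loan", "pos=NN"}))
```

Only the single best matching rule decides the sense, which is what distinguishes a decision list from classifiers that combine the evidence of all features.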
NB classifiers have been used extensively in WSD [14] [22] [27] and have proven quite effective. Mooney [14] experimentally compared seven learning algorithms on the problem of learning to disambiguate from context. In this evaluation, simple Bayesian and neural-network methods performed better than alternatives such as decision-tree, rule-based, and instance-based techniques. Le and Shimazu [27] used rich knowledge, represented by ordered words in a local context and collocations, in their NB classifier and achieved 92.3% accuracy for four common test words (interest, line, hard, serve). In [22], an ensemble of Naive Bayesian classifiers is used, built on co-occurrence features of varying size. This simple classifier achieved an accuracy comparable to the best previously published results on the line and interest corpora. Towell and Voorhees [20] used a combination of two feed-forward networks to form a contextual representation that is used for disambiguation. The neural networks separately extract the topical and local contexts of a target word from a set of sample sentences that are tagged with the correct sense of the target word.
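To make the NB approach concrete, the following is a minimal sketch of a bag-of-words Naive Bayes sense classifier. It is not the feature set of [14], [22] or [27]; the class name, the Laplace smoothing constant alpha and the toy contexts are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesWSD:
    """Bag-of-words Naive Bayes over the context words of a target word."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # assumed Laplace smoothing
        self.sense_counts = Counter()            # number of examples per sense
        self.word_counts = defaultdict(Counter)  # word counts per sense
        self.vocab = set()

    def fit(self, contexts, senses):
        for words, sense in zip(contexts, senses):
            self.sense_counts[sense] += 1
            for w in words:
                self.word_counts[sense][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, words):
        total = sum(self.sense_counts.values())
        best_sense, best_logp = None, -math.inf
        for sense, n in self.sense_counts.items():
            logp = math.log(n / total)  # log prior of the sense
            denom = sum(self.word_counts[sense].values()) + self.alpha * len(self.vocab)
            for w in words:
                if w in self.vocab:     # words unseen in training are ignored
                    logp += math.log((self.word_counts[sense][w] + self.alpha) / denom)
            if logp > best_logp:
                best_sense, best_logp = sense, logp
        return best_sense


contexts = [
    ["bank", "rate", "loan"],
    ["rate", "percent", "pay"],
    ["hobby", "music", "keen"],
    ["keen", "politics", "show"],
]
senses = ["money", "money", "attention", "attention"]
model = NaiveBayesWSD().fit(contexts, senses)
print(model.predict(["loan", "rate"]))
```

The classifier picks the sense that maximizes the log prior plus the sum of the smoothed log likelihoods of the context words, under the usual assumption that the words are conditionally independent given the sense.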
Gale et al. [6] used a k-NN classifier to disambiguate six ambiguous nouns in the Hansard corpus. The k-nearest neighbor (k-NN) classifier works by storing the contexts in the training set. During testing, the test instance is matched against the stored contexts and the k contexts most similar to it are selected. Each of these contexts is then assigned a score, and the sense with the maximum score is assigned to the test instance. Gale et al. achieved an accuracy of 90% with this simple classifier. Rezapour et al. [33] proposed a supervised learning algorithm for WSD based on k-NN. To improve accuracy, they used a heuristic to weight the extracted features. Instead of giving equal weight to all features, the heuristic gives more weight to features that are important for disambiguation; it attempts to capture the importance of a feature in disambiguating a word based on its occurrence frequency. They used the TWA corpus (WWW.cse.unt.edu/~rada/downloads.html) and 5-fold cross-validation to estimate the performance of the algorithm. When compared with the existing corpus-based methods proposed in [7] [3] [9] [13], their method outperformed all but Dagan & Itai [9]. Singh et al. [37] evaluated an NB classifier using 11 different features on a dataset consisting of 60 polysemous Hindi nouns [34]. They obtained a maximum accuracy of 86.1% when the base forms of nouns appearing in the context were used as features.
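The k-NN procedure described in this paragraph can be sketched as follows. Cosine similarity over bag-of-words vectors and similarity-weighted voting are assumptions chosen for illustration; neither [6] nor [33] is reproduced here, and the function names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_sense(train, test_words, k=3):
    """train: non-empty list of (context_words, sense) pairs."""
    test_vec = Counter(test_words)
    # Score every stored training context against the test context.
    scored = [(cosine(Counter(words), test_vec), sense) for words, sense in train]
    scored.sort(key=lambda s: s[0], reverse=True)
    # Each of the k nearest contexts adds its similarity to its sense's score.
    votes = defaultdict(float)
    for sim, sense in scored[:k]:
        votes[sense] += sim
    return max(votes, key=votes.get)


train = [
    (["river", "water", "fish"], "shore"),
    (["water", "boat", "mud"], "shore"),
    (["money", "loan", "account"], "finance"),
    (["loan", "interest", "cash"], "finance"),
]
print(knn_sense(train, ["loan", "money"], k=3))
```

A feature-weighting heuristic of the kind used in [33] would enter this sketch through the vectors passed to cosine, for example by multiplying each count by a per-feature weight.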
Support vector machines (SVMs) are based on the idea of learning a hyperplane from the training data that separates positive and negative examples and is positioned so as to maximize the distance to the closest positive and negative examples (called support vectors). SVMs have been shown to achieve the best results in WSD compared with several other supervised techniques [21] [24].
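In the standard hard-margin formulation, the maximum-margin idea can be written as an optimization problem. The symbols are introduced here for illustration and do not appear in the source: training examples $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$, weight vector $w$ and bias $b$ defining the hyperplane $w \cdot x + b = 0$.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Under this formulation the margin between the two classes is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin, and the support vectors are the examples for which the constraint holds with equality.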
3.2 Semi-supervised and Unsupervised Techniques
Unlike supervised algorithms, semi-supervised techniques use only a small number of seed instances for training. The most important contribution in this direction is that of Yarowsky [11], who used decision lists for disambiguation. His algorithm starts with a few representative training examples, called the seed set, for each sense. These examples are extracted using seed collocations that are strong indicators of a particular sense. A supervised algorithm is then used to identify collocations within a user-specified window that reliably partition the seed training data. These collocations are ranked by log-likelihood ratio, and those whose log-likelihood ratio exceeds a set threshold constitute the decision list. The resulting classifier is then applied to the entire sample set to tag additional instances, and new collocations are extracted from the tagged instances. The process is iterated until convergence. Yarowsky additionally applied two important properties of natural language to tag the remaining instances:
1. One sense per discourse: The sense of a target word is highly consistent within any given document or discourse.
2. One sense per collocation: Nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order and syntactic relationship.
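The bootstrapping loop described above can be sketched for a two-sense word as follows. This is a simplified illustration, not Yarowsky's implementation: single context words stand in for collocations, the one-sense-per-discourse step is omitted, and the sense names "A" and "B", the threshold, the smoothing constant alpha and the function names are assumptions.

```python
import math
from collections import defaultdict


def build_decision_list(contexts, labels, threshold, alpha=0.1):
    """Keep collocations whose |log(P(A|c) / P(B|c))| reaches the threshold."""
    counts = defaultdict(lambda: {"A": 0.0, "B": 0.0})
    for words, sense in zip(contexts, labels):
        if sense is not None:
            for w in set(words):
                counts[w][sense] += 1.0
    rules = []
    for w, c in counts.items():
        llr = math.log((c["A"] + alpha) / (c["B"] + alpha))  # smoothed ratio
        if abs(llr) >= threshold:
            rules.append((abs(llr), w, "A" if llr > 0 else "B"))
    rules.sort(reverse=True)  # strongest evidence first
    return rules


def bootstrap(contexts, seeds, threshold=1.5, max_iter=20):
    """seeds: dict mapping a seed collocation to sense 'A' or 'B'."""
    # Step 1: tag only the contexts that contain a seed collocation.
    labels = [None] * len(contexts)
    for i, words in enumerate(contexts):
        for w in words:
            if w in seeds:
                labels[i] = seeds[w]
                break
    rules = []
    for _ in range(max_iter):
        # Step 2: learn a decision list from the currently tagged contexts.
        rules = build_decision_list(contexts, labels, threshold)
        # Step 3: apply it to the entire sample set.
        new_labels = list(labels)
        for i, words in enumerate(contexts):
            for _score, w, sense in rules:
                if w in words:
                    new_labels[i] = sense
                    break
        if new_labels == labels:  # Step 4: stop at convergence.
            break
        labels = new_labels
    return labels, rules


contexts = [
    ["plant", "life", "species"],
    ["manufacturing", "plant", "workers"],
    ["species", "plant", "grows"],
    ["workers", "plant", "strike"],
    ["grows", "garden", "plant"],
    ["strike", "union", "plant"],
]
labels, rules = bootstrap(contexts, {"life": "A", "manufacturing": "B"})
print(labels)
```

In this toy run only two contexts contain a seed, yet the labels spread to the remaining contexts over successive iterations through shared words such as "species" and "workers". The target word itself never enters the list because it occurs equally often with both senses.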
A completely unsupervised sense disambiguation algorithm can only discriminate word senses without assigning sense tags. These algorithms require manual evaluation, as the sense clusters they derive may not match the actual senses. Existing unsupervised WSD work is based on contextual clustering, word clustering [15] or graph-based approaches. Lin [15] clustered two words if they share some syntactic relationship. If w1, w2, w3, …, wn are the content words in a context and w is the target word, then the similarity between w and wi is determined by the information content of their syntactic features. Co-occurrence graph-based clustering uses graphs instead of vectors. Nodes in the graph correspond to words and edges correspond to syntactic relations between them. One notable algorithm using the co-occurrence graph-based approach is Hyperlex, proposed by Veronis [28]. In this algorithm, a co-occurrence graph is first built such that the nodes are words occurring in the paragraphs of a text corpus, and an edge is added between a pair of words if they co-occur in the same paragraph. The relative co-occurrence frequency of the two words connected by an edge is used to assign a weight to that edge. An iterative algorithm is then applied in which the node with the highest relative degree is selected as a hub. The hubs are linked to the target word with zero-weight edges and the minimum spanning tree (MST) of the entire graph is computed. The MST is used to disambiguate instances of the target word.
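The Hyperlex steps outlined above can be sketched in simplified form. The edge weight 1 - max(P(a|b), P(b|a)), the min_degree cut-off, the rule of discarding a hub's neighbours after selecting it, and all function names are assumptions made for illustration; the full algorithm in [28] includes further filtering and scoring that is not shown.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_graph(paragraphs, target):
    """Co-occurrence graph over words that share a paragraph with the target."""
    freq, pair = Counter(), Counter()
    for words in paragraphs:
        vocab = sorted(set(words) - {target})
        freq.update(vocab)
        pair.update(combinations(vocab, 2))
    graph = defaultdict(dict)
    for (a, b), n in pair.items():
        # Assumed weighting: a small weight means a strong association.
        graph[a][b] = graph[b][a] = 1.0 - max(n / freq[a], n / freq[b])
    return graph


def select_hubs(graph, min_degree=2):
    """Repeatedly take the highest-degree node, then drop it and its neighbours."""
    remaining, hubs = set(graph), []
    while remaining:
        hub = max(sorted(remaining), key=lambda v: len(set(graph[v]) & remaining))
        if len(set(graph[hub]) & remaining) < min_degree:
            break
        hubs.append(hub)
        remaining -= set(graph[hub]) | {hub}
    return hubs


def spanning_tree_hub_map(graph, hubs):
    """Map each word to the hub whose MST subtree contains it (Prim's algorithm).

    Weights are non-negative, so the zero-weight target-hub edges can always
    be part of a minimum spanning tree; the tree is therefore grown from a
    start set in which the target and all hubs are already connected.
    """
    hub_of = {h: h for h in hubs}
    visited = set(hubs)
    while True:
        edges = [(w, u, v) for u in visited
                 for v, w in graph[u].items() if v not in visited]
        if not edges:
            break
        _w, u, v = min(edges)  # cheapest edge leaving the current tree
        visited.add(v)
        hub_of[v] = hub_of[u]
    return hub_of


def disambiguate(hub_of, context):
    """Pick the hub (sense) under which most context words fall in the MST."""
    votes = Counter(hub_of[w] for w in context if w in hub_of)
    return votes.most_common(1)[0][0] if votes else None


paragraphs = [
    ["bar", "beer", "drink"], ["bar", "beer", "pub"], ["bar", "beer", "wine"],
    ["bar", "lawyer", "court"], ["bar", "lawyer", "exam"], ["bar", "lawyer", "judge"],
]
graph = build_graph(paragraphs, "bar")
hubs = select_hubs(graph)
hub_of = spanning_tree_hub_map(graph, hubs)
print(hubs, disambiguate(hub_of, ["judge", "court"]))
```

Each hub plays the role of one induced sense of the target word, so the output is a sense cluster rather than a dictionary sense tag, in line with the remark above that purely unsupervised methods discriminate senses without labeling them.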