Chapter 3 Related work

3.2 Extracting keyphrases from text

3.2.1 Machine learning

The history of supervised keyphrase extraction began with two competing methods: GenEx (Turney, 1999), developed first, closely followed by Kea (Witten et al., 1999). Kea received more attention because it is publicly available and simple enough to be extended with new features. It serves as the state-of-the-art baseline, over which new systems, including the Maui algorithm developed in this thesis, can potentially improve by using better candidate generation, more features and a different classifier. This section first describes GenEx and Kea, and then other methods that build on these two.

Turney (1999) proposes GenEx, a hybrid genetic algorithm for keyphrase extraction, consisting of two components: Genitor and Extractor. The Extractor is applied to document text in order to determine a set of weighted keyphrases. Candidate keyphrases are all phrases consisting of up to three consecutive words that are not stopwords. The candidates are stemmed by truncation to five characters.²

Next, each candidate is scored by its frequency multiplied by its position in text. Scores of multi-word candidates are boosted. After selecting the most frequent full form for each stemmed phrase, Extractor presents the top-ranked phrases as output based on 12 numeric parameters, such as the boosting factor for longer phrases and the size of the final keyphrase set. Genitor is a genetic algorithm that uses training data to determine the best parameter settings. Evaluated on 360 articles from various domains, GenEx achieves precision of 24%, while recall is not reported. Human subjects judged 80% of keyphrases as acceptable.
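The Extractor steps described above can be illustrated with a minimal Python sketch. It is not Turney's implementation: the stopword list, the form of the position weight, the boost factor and the function names are assumptions chosen for illustration, standing in for GenEx's 12 tuned parameters.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (assumption; GenEx uses a much larger one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "for", "to", "are"}

def truncate_stem(word, length=5):
    """Stem by truncation: keep at most `length` leading characters."""
    return word[:length]

def extractor_sketch(text, max_words=3, boost=1.5, top_n=3):
    """Rank candidate phrases by frequency x position weight (hypothetical form)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    freq = Counter()                    # stemmed phrase -> frequency
    first_pos = {}                      # stemmed phrase -> index of first occurrence
    full_forms = defaultdict(Counter)   # stemmed phrase -> counts of surface forms
    for i in range(len(tokens)):
        for n in range(1, max_words + 1):
            words = tokens[i:i + n]
            # Skip incomplete n-grams and any phrase containing a stopword.
            if len(words) < n or any(w in STOPWORDS for w in words):
                continue
            stem = " ".join(truncate_stem(w) for w in words)
            freq[stem] += 1
            first_pos.setdefault(stem, i)
            full_forms[stem][" ".join(words)] += 1
    scores = {}
    for stem, count in freq.items():
        # Assumed position weight: earlier first occurrences score higher.
        position_weight = 1.0 - first_pos[stem] / len(tokens)
        score = count * position_weight
        if " " in stem:                 # boost multi-word candidates
            score *= boost
        scores[stem] = score
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    # Output the most frequent full form of each top-ranked stemmed phrase.
    return [full_forms[s].most_common(1)[0][0] for s in ranked[:top_n]]
```

In this sketch, "keyphrase" and "keyphrases" share the stem "keyph" and are counted together, while a genetic algorithm such as Genitor would tune values like `boost` and `top_n` against training data.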

²Stemming by truncation is fast, but has disadvantages: words with different meanings are stemmed to the same string (e.g. center and century), allomorphs are disregarded (moderate and modest receive different stems) and short words remain unstemmed (e.g. terms and term).

Witten et al. (1999) develop Kea, the Keyphrase Extraction Algorithm,³ based on similar principles but using a different learning technique. In the candidate generation stage, Kea first determines textual sequences defined by orthographical boundaries such as punctuation marks, numbers, and newlines. These sequences are then split into tokens. Next, Kea extracts candidate phrases that consist of one or more words and do not begin or end with a stopword. The minimum and maximum length of a keyphrase can be set by the user. Each candidate is stemmed using the iterated Lovins (1968) stemmer and the most frequent full version is saved for the output. In the filtering stage, two features are computed for each candidate: the TF×IDF measure (a phrase's frequency in a document weighted by its inverse frequency in the document collection, discussed in Section 5.2.1) and the position of the first occurrence (Section 5.2.2). A Naïve Bayes classifier (Domingos and Pazzani, 1997) analyzes training data and creates two sets of weights: one for candidates matching manually assigned keyphrases and one for all other candidates. The overall probability of each candidate being a keyphrase is then calculated based on these weights. The candidates are ranked according to their probabilities, and the top-ranked phrases are included in the resulting keyphrase set. After training on 100 documents and testing on 500, Kea extracts 0.9 correct keyphrases among the top 5. The authors do not report precision and recall values.
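The two features and the Naïve Bayes combination can be sketched as follows. The formulas are one common formulation and are assumed here rather than taken from Kea's source code; in particular, Kea discretizes feature values before estimating the likelihoods, which this sketch leaves to the caller. All function and parameter names are hypothetical.

```python
import math

def tfidf(phrase_count, doc_length, doc_freq, num_docs):
    """Relative frequency in the document times inverse document frequency."""
    tf = phrase_count / doc_length
    idf = -math.log2(doc_freq / num_docs)
    return tf * idf

def first_occurrence(tokens, phrase_tokens):
    """Fraction of the document that precedes the phrase's first appearance."""
    n = len(phrase_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i / len(tokens)
    return None  # the phrase does not occur in the document

def keyphrase_probability(prior_yes, likelihoods_yes, likelihoods_no):
    """Naive Bayes: combine per-feature likelihoods under both classes."""
    p_yes = prior_yes * math.prod(likelihoods_yes)
    p_no = (1.0 - prior_yes) * math.prod(likelihoods_no)
    return p_yes / (p_yes + p_no)
```

Candidates would then be sorted by `keyphrase_probability` in descending order, and the top-ranked ones returned as the keyphrase set.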

Frank et al. (1999) compare Turney's GenEx and Kea directly on the same data sets and find that their precision is similar, but Kea creates the model much faster (it takes minutes instead of the hours required by GenEx). They introduce a new feature, called keyphrase frequency, which counts the number of times a candidate appears as a keyphrase in the training collection. Adding this feature significantly improves the results. Depending on the corpus, Kea's top 5 keyphrases contain on average 1.35 or 1.46 correct keyphrases, which corresponds to precision of 27% and 29%, respectively. The results continue to improve as the size of the training collection increases.
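As a rough illustration, the keyphrase frequency feature and the conversion from average correct keyphrases to precision can be written as below. The function names and the input layout (one list of keyphrases per training document) are assumptions made for this sketch.

```python
from collections import Counter

def keyphrase_frequency(training_keyphrase_sets):
    """Count in how many training documents each phrase was assigned as a keyphrase."""
    counts = Counter()
    for keyphrases in training_keyphrase_sets:
        counts.update(set(keyphrases))  # count each phrase once per document
    return counts

def precision_at_k(avg_correct, k=5):
    """Precision when `avg_correct` of the top `k` extracted phrases are correct."""
    return avg_correct / k
```

With five extracted phrases per document, 1.35 correct phrases give 1.35 / 5 = 27% precision and 1.46 give about 29%, matching the figures quoted above.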

Turney (2003) modifies Kea by adding a semantic feature enhancing the coherence of the resulting keyphrase sets. Coherence is computed using Pointwise Mutual Information (PMI) (Church and Hanks, 1989). Turney first ranks candidate keyphrases using Kea's three features (Frank et al., 1999). Next, he uses PMI to measure the similarity of the top L candidates to the top K candidates, where K < L. PMI is computed from co-occurrences retrieved using a search engine. For the top L candidates, Turney computes how often they co-occur with the top K candidates in the search results. These values are added as new features for final processing by the classifier. Evaluated on two different collections, the quality of the resulting keyphrase sets improves. However, querying the search engine significantly slows down the extraction process. Csomai and Mihalcea (2008) compute PMI statistics offline, using co-occurrence in Wikipedia articles, which is a faster alternative. Their technique is discussed below.
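PMI itself is a simple ratio of observed to expected co-occurrence. The sketch below assumes raw counts are already available (for example, search engine hit counts or offline corpus statistics); the function name and argument layout are hypothetical.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (base 2) from raw co-occurrence counts.

    count_xy: contexts containing both phrases
    count_x, count_y: contexts containing each phrase
    total: total number of contexts
    """
    if count_xy == 0:
        return float("-inf")  # the two phrases never co-occur
    # log2( p(x, y) / (p(x) * p(y)) ), with probabilities estimated from counts
    return math.log2((count_xy * total) / (count_x * count_y))
```

A positive value means two candidates co-occur more often than chance would predict, which is the signal used to favour coherent keyphrase sets.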

Hulth (2004) proposes both new candidate generation and filtering methods. For candidate generation, she compares the original n-gram extraction with shallow parsing and part-of-speech sequence matching, which extract only valid noun phrases. The most accurate results are achieved with parsing and the least accurate with n-gram extraction. For filtering, Hulth separates TF×IDF into term frequency and inverse document frequency features, adopts Kea's first-occurrence feature, and adds a new feature that records the part-of-speech pattern of the candidate. Certain patterns are more likely to denote keyphrases. Experiments with classifiers, including Naïve Bayes, bagged decision trees and other ensembles of classifiers, show that a combination of several prediction models yields the best results: an F-measure of over 45%, one of the highest reported among keyphrase extraction methods. However, precision and recall are computed based not on the total number of assigned keyphrases, but on those that actually appear in the documents. Therefore Hulth's figures are not directly comparable with others.
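The effect of this evaluation choice can be shown with a small sketch. The exact-substring test for whether a keyphrase appears in the document is a simplifying assumption (real evaluations typically compare stemmed forms), and the function names are hypothetical.

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall and F-measure of extracted phrases against a gold set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

def restrict_to_document(gold, document_text):
    """Keep only the gold keyphrases that literally occur in the document."""
    return [k for k in gold if k in document_text]

document = "keyphrase extraction with machine learning"
gold = ["keyphrase extraction", "machine learning",
        "topic indexing", "digital libraries"]
extracted = ["keyphrase extraction", "naive bayes"]

against_all = precision_recall_f1(extracted, gold)
against_present = precision_recall_f1(extracted, restrict_to_document(gold, document))
```

In this toy case recall doubles when the gold set is restricted to phrases present in the document, which is why figures computed under the two conventions should not be compared directly.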

Nguyen and Kan (2007) extend Kea with several new features: the part-of-speech sequence as in Hulth (2004), the suffix sequence of the candidate and a binary feature recording whether the candidate is an acronym. They use a classifier to identify a document's structural parts, such as introduction, applications and references, and include this information as a nominal feature, which lists the parts in which a candidate occurs. Given 120 test documents, the authors achieve somewhat better results than the original Kea baseline: precision improved from 30% to 32% (recall is not reported). Unfortunately, the individual contribution of their new features is not clear.

Csomai and Mihalcea (2008) propose a supervised method for back-of-the-book indexing, a task related to keyphrase extraction. They combine common features, such as term frequency, TF×IDF and term length, with novel features, which utilize discourse, syntactic and encyclopedic information. To compute the discourse features, a shallow parser first extracts noun phrases sentence by sentence; these are then treated as nodes in a graph. Edges in this graph are weighted with scores derived using latent semantic analysis (LSA) and PMI (Turney, 2003). Next, PageRank (Brin and Page, 1998) is applied to find the most central nodes in this graph. After adding new noun phrases from each batch of sentences, the scores are re-computed. Three features record a) how often a noun phrase receives the central rank across all sentences; b) how often it is central given its total number of occurrences; and c) the maximum centrality it achieves.
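The centrality computation can be sketched with a plain power-iteration PageRank over a weighted, undirected noun-phrase graph. This is an illustration under stated assumptions, not the authors' implementation: the damping factor, the iteration count, the graph representation and the example edge weights are all hypothetical.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration.

    weights: dict mapping node -> {neighbour: edge weight}; the graph is
    assumed symmetric, so weights[a][b] == weights[b][a].
    """
    nodes = list(weights)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            incoming = 0.0
            for m, w in weights[n].items():
                # Neighbour m passes on rank in proportion to the edge weight.
                incoming += rank[m] * w / sum(weights[m].values())
            new_rank[n] = (1.0 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank

# Hypothetical similarity-weighted graph of noun phrases.
graph = {
    "keyphrase extraction": {"machine learning": 0.8, "training data": 0.6},
    "machine learning": {"keyphrase extraction": 0.8},
    "training data": {"keyphrase extraction": 0.6},
}
ranks = pagerank(graph)
most_central = max(ranks, key=ranks.get)
```

Re-running such a computation after each batch of sentences and recording which phrase is most central would yield the raw counts behind the three discourse features.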

Csomai and Mihalcea transform Hulth's (2004) nominal part-of-speech feature into a numeric one by computing the probability that the phrase's part-of-speech pattern denotes a keyphrase. A further encyclopedic feature is the Wikipedia keyphraseness that is also used in Maui (Section 5.2.3). In the evaluation, Csomai and Mihalcea automatically create back-of-the-book indexes for 30 manually annotated books, after training on 259 books. The best results, an F-measure of 28% in both cases, are reported using a multilayer perceptron and a decision tree classifier. The authors note that nearly as good results were achieved with only 10% of the training corpus, 25 books.
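One plausible way to obtain such a numeric feature is a relative-frequency estimate over labelled training candidates. The estimator, the tag names and the function name below are assumptions for illustration; the source does not specify how the probability is estimated.

```python
from collections import Counter

def pattern_probabilities(training_candidates):
    """Estimate P(keyphrase | part-of-speech pattern) by relative frequency.

    training_candidates: iterable of (pos_pattern, is_keyphrase) pairs.
    """
    totals, positives = Counter(), Counter()
    for pattern, is_keyphrase in training_candidates:
        totals[pattern] += 1
        if is_keyphrase:
            positives[pattern] += 1
    return {p: positives[p] / totals[p] for p in totals}

# Hypothetical labelled candidates.
examples = [
    ("ADJ NOUN", True),
    ("ADJ NOUN", False),
    ("NOUN NOUN", True),
    ("VERB", False),
]
probabilities = pattern_probabilities(examples)
```

Each candidate's pattern is then replaced by its estimated probability, so the classifier receives a single number instead of a nominal value with many possible categories.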

Because keyphrase extraction is a clearly defined task and both the data sets and the baseline systems are publicly available, this area has been exhaustively explored in the machine learning community. Many other experiments have been reported but are omitted from this overview for reasons of space (e.g. Barrière and Jarmasz, 2004; HaCohen-Kerner et al., 2005; D’Avanzo and Magnini, 2005).

Machine learning provides an elegant solution for the keyphrase extraction task. The methods are subdivided into clear steps: identifying the candidates, defining the features, computing feature values, and finally deciding on the significance of a candidate. Maui's design follows these principles and, like many methods described in this section, builds on the original Kea system (Witten et al., 1999; Frank et al., 1999).

The results achieved by some supervised keyphrase extraction methods are impressive. As reported above, 80% of automatically determined keyphrases are acceptable (Turney, 1999) and over 45% match keyphrases assigned by the authors (Hulth, 2004). However, existing approaches are rarely applied in real-world situations. Some of them are fairly complex, others have unsustainably long processing times; but the main problem is perhaps the requirement of training data. The best results have been reported in experiments incorporating a few hundred manually indexed documents. This is significantly less than in classification-based approaches to topic indexing (Section 3.1), but is still a major obstacle in practice.