3.2.2 Heuristic methods

There are many universally applicable keyphrase extraction techniques that do not require training data; in other words they are unsupervised. Researchers manually analyze the data and identify the strongest properties of typical keyphrases, which they then combine into a fixed scoring function. Examples of this kind of ranking were mentioned in Section 3.1.2 where candidate vocabulary terms were filtered to identify the topics. This section presents keyphrase extraction methods that use heuristic filtering.

Barker and Cornacchia (2000) describe one of the earliest keyphrase extraction systems that utilizes part-of-speech information for parsing grammatically correct candidates. They use a dictionary to assign basic part-of-speech tags to each word and then extract all nouns and, optionally, their adjectival or nominal modifiers. All noun phrases computed in this fashion serve as candidate keyphrases. In the filtering stage, Barker and Cornacchia compute the frequency of the head noun in each candidate phrase and keep all candidates with the N most frequent heads. Each candidate is then scored using its frequency multiplied by phrase length and the top K highest scoring phrases are selected as keyphrases. N and K are user-specified thresholds. Evaluation experiments involving human judges have shown that this unsupervised approach performs as well as the more complex GenEx system (Turney, 1999).
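The filtering and scoring steps described above can be illustrated with a minimal Python sketch. It is not Barker and Cornacchia's implementation: the function name rank_noun_phrases is hypothetical, the noun phrases are assumed to be extracted already (one entry per occurrence), the head noun is assumed to be the last word of each phrase, and head frequencies are counted over the candidate occurrences only.

```python
from collections import Counter


def rank_noun_phrases(candidates, n_heads, k_phrases):
    """Rank candidate noun phrases (one list entry per occurrence in the text)."""
    phrases = [tuple(p.lower().split()) for p in candidates]
    # Assumption: the head noun is the last word of an English noun phrase.
    head_freq = Counter(p[-1] for p in phrases)
    top_heads = {head for head, _ in head_freq.most_common(n_heads)}
    # Keep only candidates whose head is among the N most frequent heads.
    phrase_freq = Counter(p for p in phrases if p[-1] in top_heads)
    # Score = phrase frequency multiplied by phrase length in words.
    scored = {p: freq * len(p) for p, freq in phrase_freq.items()}
    ranked = sorted(scored.items(), key=lambda item: (-item[1], item[0]))
    return [(" ".join(p), score) for p, score in ranked[:k_phrases]]


candidates = [
    "neural network", "neural network", "network",
    "deep neural network", "training data", "data", "gradient",
]
print(rank_noun_phrases(candidates, n_heads=2, k_phrases=3))
```

Multiplying by phrase length favours longer, more specific phrases over their bare heads, which would otherwise dominate because they occur more often.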

A different way of improving candidate generation is proposed by Paice and Black (2003). They add new steps to the standard n-gram extraction process, which results in a stronger conflation factor. Given document n-grams up to a length of four, stopwords are removed and the remaining words are stemmed and sorted alphabetically. For example, similar phrases such as algorithm efficiency, efficiency of algorithms, the algorithm's efficiency and even the algorithm is very efficient map to the same “pseudo phrase” algorithm effici. The most frequent original form is preserved for display in the final result. This conflation strategy identifies morphological similarity more efficiently than mere stemming and provides a stronger boost factor for the overall score of a phrase group. In the filtering stage, each pseudo phrase is assigned a weight that depends on the frequency of the phrase, on N, the number of words in it, and on W, the sum of their individual frequencies (the weighting formula itself is missing from this copy of the text). Next, the best scoring candidate phrases are collected. Paice and Black discuss how to use the generated keyphrases for information extraction, but do not provide an evaluation of their method.
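The pseudo-phrase conflation step can be sketched in Python as follows. This is an illustration under stated assumptions, not Paice and Black's code: the stopword list and the suffix rules in toy_stem are deliberately tiny stand-ins for a real stopword list and stemmer, and all function names are hypothetical. The weighting step is left out.

```python
import re
from collections import Counter, defaultdict

# Assumption: toy stopword list and suffix rules, just enough for the example.
STOPWORDS = {"a", "an", "the", "of", "is", "very"}
SUFFIXES = ("ency", "ent", "s")


def toy_stem(word):
    """Strip the first matching suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def pseudo_phrase(phrase):
    """Remove stopwords, stem the remaining words and sort them alphabetically."""
    text = re.sub(r"['’]s\b", "", phrase.lower())  # drop possessive endings
    words = re.findall(r"[a-z]+", text)
    stems = [toy_stem(w) for w in words if w not in STOPWORDS]
    return " ".join(sorted(stems))


def conflate(phrases):
    """Map each pseudo phrase to its most frequent original surface form."""
    groups = defaultdict(Counter)
    for phrase in phrases:
        groups[pseudo_phrase(phrase)][phrase] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in groups.items()}


print(pseudo_phrase("the algorithm is very efficient"))
print(conflate(["algorithm efficiency", "efficiency of algorithms",
                "algorithm efficiency", "the algorithm's efficiency"]))
```

Sorting the stems is what lets phrases with different word order, such as algorithm efficiency and efficiency of algorithms, fall into the same group.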

Like Barker and Cornacchia (2000), Mihalcea and Tarau (2004) begin the extraction process by annotating the documents with part-of-speech tags. Unlike many other systems that filter candidates using weighting formulas, their unsupervised method uses a graph-based ranking model. First, all nouns and adjectives are extracted and added as nodes in the document graph. Edges are added between those words that co-occur within a pre-defined window. The graphs are constructed from abstracts only and are therefore relatively small. The nodes are weighted iteratively using TextRank, a graph ranking technique similar to PageRank (Brin and Page, 1998). The top third of the best scoring nodes are analyzed in the post-processing stage to determine single and multi-word keyphrases. On the same data set, this approach outperforms Hulth’s (2003) supervised keyphrase extraction in terms of F-measure (36% instead of 34%); however, its recall is much lower (43% instead of 52%). In later work (see Section 3.2.1), Hulth (2004) achieved better results than those in Hulth (2003), which were used for evaluating TextRank.
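The graph construction and iterative ranking can be sketched in a few lines of Python. This is a simplified illustration, not Mihalcea and Tarau's implementation: the input is assumed to be a word sequence already reduced to nouns and adjectives, the window is applied to that filtered sequence, the graph is unweighted and undirected, and the damping factor of 0.85 and the fixed iteration count are assumptions borrowed from common PageRank practice. The function names are hypothetical.

```python
def build_graph(words, window=2):
    """Undirected co-occurrence graph over a filtered word sequence.

    Two words are linked if they are fewer than `window` positions apart.
    """
    graph = {w: set() for w in words}
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(graph, damping=0.85, iterations=50):
    """PageRank-style update: each node shares its score among its neighbours."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        scores = {
            node: (1 - damping)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in neighbours)
            for node, neighbours in graph.items()
        }
    return scores


words = ["network", "graph", "network", "node", "network", "edge"]
scores = textrank(build_graph(words))
ranked = sorted(scores, key=scores.get, reverse=True)
top_third = ranked[: max(1, len(ranked) // 3)]
print(top_third)
```

In this toy input, network is linked to every other word and therefore collects the highest score; the post-processing that joins adjacent top-ranked words into multi-word keyphrases is not shown.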

Paukkeri et al. (2008) propose a language-independent keyphrase extraction method. Instead of the popular TF×IDF weighting, they rank all candidate n-grams up to a length of four words based on counts determined from the multilingual reference corpus Europarl. All n-grams are ranked according to their frequency in the document, divided by their frequency in the reference corpus of the same language. The frequencies are then normalized so that the most frequent phrase receives the count of one and longer n-grams have the same distribution as single words. For the evaluation, Paukkeri et al. use Wikipedia articles in different languages and treat their internal links as keyphrases. Precision and recall values are better than in the TF×IDF baseline, averaging 15% and 25% respectively across all languages. Note that the average number of links in Wikipedia articles is significantly higher than the number of manually assigned topics in a typical keyphrase set. Thus, Paukkeri et al. perform terminology extraction rather than keyphrase extraction.
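The core idea of ranking by the ratio of document frequency to reference-corpus frequency can be sketched as follows. This is only an illustration: it uses relative frequencies and add-one smoothing for n-grams unseen in the reference corpus, both of which are assumptions, and it does not reproduce the normalization across n-gram lengths described by Paukkeri et al. The function name and the toy counts are hypothetical.

```python
from collections import Counter


def ratio_rank(doc_ngrams, ref_counts, top_k=5):
    """Rank n-grams by document frequency relative to reference frequency."""
    doc_counts = Counter(doc_ngrams)
    doc_total = sum(doc_counts.values())
    ref_total = sum(ref_counts.values())
    scores = {}
    for ngram, count in doc_counts.items():
        # Assumption: add-one smoothing avoids division by zero for
        # n-grams that never occur in the reference corpus.
        ref_rel = (ref_counts.get(ngram, 0) + 1) / (ref_total + 1)
        scores[ngram] = (count / doc_total) / ref_rel
    return sorted(scores, key=lambda g: (-scores[g], g))[:top_k]


doc = ["the", "the", "the", "keyphrase", "keyphrase", "corpus"]
reference = {"the": 1000, "corpus": 5}  # toy reference counts
print(ratio_rank(doc, reference))
```

A word such as the is frequent in the document but even more frequent in the reference corpus, so it drops to the bottom, while document-specific terms rise to the top.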

Unsupervised methods employ more accurate candidate generation techniques than supervised ones. While the majority of machine learning approaches simply extract word n-grams, heuristic methods compensate for the lack of training data with more complex analysis using shallow parsing (Barker and Cornacchia, 2000), morphological conflation (Paice and Black, 2003) and reference corpora (Paukkeri et al., 2008). However, unlike supervised methods, they do not take into account the characteristics of a particular document set. Depending on the domain and document type, the significance of ranking features may vary. Thus it is questionable whether a fixed ranking function derived from particular documents will perform as well on any collection. Machine learning techniques are more flexible in this respect and are therefore applied in this thesis.

The main disadvantage of both supervised and unsupervised extraction is that the resulting keyphrases are inconsistent. In term assignment (Section 3.1), a pre-specified vocabulary controls the terminology for referring to concepts. In keyphrase extraction, topics can only be as consistent as the word choices made by the document’s authors. Section 3.4.2 discusses this problem in more detail.

Keyphrase extraction is often applied to related tasks such as terminology extraction (Paukkeri et al., 2008), back-of-the-book indexing (Csomai and Mihalcea, 2008), information extraction (Paice and Black, 2003) and text summarization (Mihalcea and Tarau, 2004). Instead of limiting the output to the top-scoring keyphrases, all significant terms in a given document or collection can be extracted. Coherence analysis and lexical patterns can be applied to identify relations between the keyphrases. Text summaries can be generated using sentences containing the keyphrases.

Among the topic indexing tasks, keyphrase extraction has the clearest objective: to extract the main phrases from text, with no prerequisites such as a vocabulary. This has led to much competition and the creation of many versatile methods and applications. In this thesis, keyphrase extraction is integrated with the less thoroughly explored area of term assignment and with the newly discovered task of automatic tagging.