tendency to form clusters is greatly diminished by their lack of occurrences. Finally, although the Syllable approach scores, on average, higher than the other methods, it tends to penalize smaller concepts, such as "air", "CDC" (acronym for Centers for Disease Control) or "CBP" (acronym for Calcium Binding Protein), which do occur in the English corpus.

As for multi-words, the lower precision of LocalMaxs is due to the fact that the method classifies terms by comparing them with their immediate neighbors. For instance, irrelevant multi-words such as "which is", "from the", "rather than" and "responsible for", among others, tend to be considered relevant by this extractor. This happens essentially because including a new word before or after the multi-word does not increase its SCP_f(.) score. For instance, the immediate neighbors of "responsible for" include terms such as "branch responsible for", "responsible for suppressing", "responsible for skin", etc. Although they seem more relevant than "responsible for", these neighbors are infrequent, which results in lower scores. As for the recall, it may be due to the fact that the method tends to prefer the largest terms. For instance, "genetic information", which is undoubtedly a concept, is not considered as such by LocalMaxs because it has better-scoring immediate neighbors, such as "genetic information research" or "cell’s genetic information".
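The neighbor comparison described above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this thesis: the function names (`ngram_counts`, `scp_f`, `beats_larger_neighbors`) are hypothetical, probabilities are approximated as raw counts divided by the number of tokens, the SCP_f formula is assumed to be the commonly cited one (squared n-gram probability over the average product of the probabilities of its two-part splits), and only the comparison against the larger (n+1)-gram neighbors is shown. The full LocalMaxs criterion, as commonly described, also compares a term against the (n-1)-grams it contains; that part is omitted here.

```python
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every contiguous n-gram (as a tuple) for n = 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def scp_f(ngram, counts, n_tokens):
    """Assumed SCP_f: p(W)^2 over the average of p(prefix) * p(suffix)."""
    n = len(ngram)
    if n < 2:
        raise ValueError("SCP_f is only defined for n-grams with n >= 2")

    def p(gram):
        # Approximation: same denominator for every n-gram size.
        return counts[gram] / n_tokens

    avg = sum(p(ngram[:i]) * p(ngram[i:]) for i in range(1, n)) / (n - 1)
    return p(ngram) ** 2 / avg if avg else 0.0


def beats_larger_neighbors(ngram, counts, n_tokens):
    """True if no (n+1)-gram extending `ngram` by one word scores as high.

    Simplification: only the larger neighbors are checked. `counts` must
    contain n-grams up to length len(ngram) + 1.
    """
    n = len(ngram)
    score = scp_f(ngram, counts, n_tokens)
    neighbors = [
        g for g in counts
        if len(g) == n + 1 and (g[:n] == ngram or g[1:] == ngram)
    ]
    return all(score > scp_f(g, counts, n_tokens) for g in neighbors)
```

Under these assumptions, a pair such as "responsible for" that is frequent while each of its three-word extensions is rare would pass the check, which is the false-positive behavior discussed above. Conversely, a pair such as "genetic information" would fail it whenever an extension such as "genetic information research" scores higher.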

4.3

Summary

In the first part of this thesis I presented a new methodology for the extraction of single-word and multi-word concepts from large texts. This methodology uses tools and ideas, such as the specificity of terms and the Rel_var metric, which may be usable outside the scope of the extraction of concepts. For instance, the idea of specificity can be used in the identification of anchor points in parallel texts for the task of automatic translation: if the texts are truly parallel (one being the exact translation of the other), the specificity of a term in language A should be similar to the specificity of the translated term in language B.

Most approaches depend on language-specific tools, such as parsers, Part-of-Speech taggers and external lexicons; the ConceptExtractor, in contrast, is a language-independent approach. However, the main criterion for its successful use on an untested language is that the terms in that language must follow the same basic "rule" as in the tested languages: the single-word concepts in compound concepts must tend to co-occur in fixed positions relative to each other. That is the basis of this approach.

Regarding other language-independent approaches, besides the fact that most are incapable of extracting single-words and multi-words using the same methodology, I have shown that the ConceptExtractor achieves higher results in comparison.

However, the ConceptExtractor is not without its drawbacks. Most of these drawbacks arise from the fact that some multi-word concepts, such as President of the United, score high in their specificity, although they are clearly incomplete. In this specific case, although one cannot say that President of the United does not contain any concept, clearly President of the United States or President of the United Nations are better and more complete concepts. These are frontier cases, although quite uncommon. A possible solution could be to include a new rule for concepts, such as "multi-word concepts must start and end with complete concepts". However, the problem would be to define, programmatically or statistically, what a complete concept is. Algorithms such as LocalMaxs could be of help for those highly specific situations, but not as complete replacements.

Another improvement could be made in the identification of synonyms and of singular-plural concepts. For instance, although abortion and abortions are the same basic concept, both the extractor and downstream applications are unaware of the similarity.

Finally, although the ConceptExtractor presents quite encouraging results, future work could further improve the performance of the extractor.

Part II

Applicability of automatically extracted concepts

5

Extraction of explicit and implicit keywords from documents

Part II of this thesis presents some applications for concepts automatically extracted by the ConceptExtractor, as described in Part I. In this chapter, I present a concept-based approach for the extraction of explicit and implicit keywords from documents. This approach is language-independent, and comparative results for three different European languages are presented. The work in this chapter was published in [VS13a].

5.1

About explicit and implicit keywords

Keywords are semantically relevant terms that are used to reflect the core content of documents. Some of the earliest work on the automatic extraction of keywords was presented in [Luh58], [Jon72] and [SY73]. However, in many applications, as in library collections, the extraction of keywords remains mainly a manual process.

In the context of this thesis, I argue that keywords are essentially concepts that are meaningful in the documents: they describe the content of either a whole document or a part of it. The approach starts by automatically extracting the concepts of the documents, using the ConceptExtractor. This extraction reduces the search space from all possible sequences of single-words and multi-word expressions to a much smaller set of semantically meaningful concepts. Then, by applying Tf-Idf to the extracted concepts, the top-ranked concepts are selected as explicit keywords of the document.
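The ranking step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the exact procedure used in this work: the concepts of each document are assumed to have been extracted already (the input lists below stand in for the output of the ConceptExtractor), the function name `rank_keywords` and the `top_k` parameter are hypothetical, and Tf-Idf is taken in one common variant, relative term frequency multiplied by log(N / df).

```python
import math
from collections import Counter


def rank_keywords(doc_concepts, top_k=3):
    """Rank the concepts of each document by Tf-Idf.

    doc_concepts maps a document id to the list of concept occurrences
    found in that document (repetitions included).
    Returns a dict: document id -> list of the top_k concepts.
    """
    n_docs = len(doc_concepts)

    # Document frequency: number of documents in which a concept occurs.
    df = Counter()
    for concepts in doc_concepts.values():
        df.update(set(concepts))

    ranked = {}
    for doc_id, concepts in doc_concepts.items():
        counts = Counter(concepts)
        total = len(concepts)
        # Assumed variant: tf = count / total, idf = log(N / df).
        scores = {
            c: (n / total) * math.log(n_docs / df[c])
            for c, n in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked[doc_id] = sorted(scores, key=lambda c: (-scores[c], c))[:top_k]
    return ranked
```

With this variant, a concept that occurs in every document receives a score of zero, so it can only be selected when a document has fewer than `top_k` more distinctive concepts.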