One of the main goals of using external repositories and generating new features is to enable reasoning over a text document. Reasoning about the different concepts in a natural language is, after all, a common human task. As described in the previous chapter, researchers have worked for decades to give machines a similar capability. My thesis is an attempt in the same direction: I use external repositories to generate features that enable machines to reason by measuring the semantic distance between different human concepts.
Semantic Distance is a generic measure of how close or distant two units of text are in terms of their meanings [125]. The units of text can be words, groups of words, sentences or paragraphs, and are sometimes referred to as concepts. An example of two units that are close in meaning is Apple and Watermelon: both are edible fruits containing seeds. However, the word Apple in a sentence such as “Apple released the latest Mac last June” carries a different meaning, since it refers to the company rather than the fruit. The meaning of a unit of text therefore depends on the context it is placed in. The same can be said about semantic distance between different units of text.
It is therefore possible to outline a set of rules that the repositories utilized for FG should meet. First, the repository should contain Text Units (TUs) or concepts defined by humans. Each TU or concept should have a corresponding meaning that humans understand when it is placed in a specific context. Second, each TU should have its different meanings clearly defined and distinguished from one another based on the context it appears in. As shown in the previous example, the word Apple may refer to the company or the fruit. Third, it should be possible to induce the semantic distance between the different TUs and their different meanings. For example, it should be possible to quantify the relation between the fruits Apple and Watermelon and to say that they are more closely related than Apple and the animal Lion.
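The three rules can be illustrated with a minimal sketch. It assumes a hypothetical toy hierarchy in which every sense has a single parent; the sense labels, the `PARENT` table and the inverse path-length score are illustrative assumptions, not the contents or the measure of any particular repository.

```python
# Hypothetical toy repository: each sense points to its parent concept.
# Ambiguous words are stored as separate senses (rule 2).
PARENT = {
    "apple#fruit": "fruit",
    "watermelon#fruit": "fruit",
    "fruit": "food",
    "food": "entity",
    "lion#animal": "feline",
    "feline": "animal",
    "animal": "entity",
    "apple#company": "company",
    "company": "organization",
    "organization": "entity",
}


def ancestors(sense):
    """Return the chain from a sense up to the root, inclusive."""
    chain = [sense]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a, b):
    """Count the edges between two senses via their lowest common ancestor."""
    up_b = {node: depth for depth, node in enumerate(ancestors(b))}
    for depth_a, node in enumerate(ancestors(a)):
        if node in up_b:
            return depth_a + up_b[node]
    raise ValueError("the two senses share no ancestor")


def similarity(a, b):
    """Assumed score: shorter paths give values closer to 1 (rule 3)."""
    return 1.0 / (1.0 + path_length(a, b))
```

Under these assumptions, `similarity("apple#fruit", "watermelon#fruit")` is larger than `similarity("apple#fruit", "lion#animal")`, and the company sense of Apple is scored separately from the fruit sense.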
The rules described above do not have to be stated explicitly in the repository. As long as a method exists to adapt or induce the missing information, the repository may still be suitable for the tasks at hand. After a suitable repository is chosen, a mapping algorithm is applied within the parser and/or analyzer to match the extracted TUs and concepts to the test documents at hand.
Several repositories meet these requirements. Some are domain-specific while others are more generic. Some were prepared by human experts in different fields (closed repositories), while others are the result of a joint effort by the web community (open repositories). Among the common closed repositories is WordNet, which has been used to enrich text documents for different types of applications including classification, summarization and categorization. Another example for one of these
Classification System (CCS) [127] ontology is another example of a closed ontology built and maintained by experts. It undergoes periodic updates and redesigns, yet it tends to lag behind as a classification of computer science concepts. It has gone through six versions: the first was published in 1964, with revisions following in 1982, 1983, 1987, 1991 and, most recently, 1998. The Medical Subject Headings (MeSH) is yet another domain-specific ontology; it defines over 18,000 categories and is linked to scientific articles. A shortcoming shared by all closed ontologies is the cost, in time, labour and other resources, of designing and building them in the first place. Once built, an ontology needs to be continuously maintained and updated to reflect the addition and evolution of concepts. CYC [128], for example, has been under continuous development and maintenance by experts for almost two decades, yet it is still incomplete and far from comprehensive. Its aim is to contain all the common-sense knowledge that an average adult should already know, but it is not intended to cover people’s general information needs. Its author has estimated that it would take 350 man-years of effort to complete the CYC project [129]. It has a smaller open-source version called OpenCYC.
Open ontologies, on the other hand, are usually created and maintained by the web community and have the advantage of being more up to date. Wikipedia is an example of such a large-scale knowledge repository. It has been developed and maintained by the web community, and its accuracy has been reported to surpass that of some expert-made ontologies [130]. The Open Directory Project (ODP) and Wiktionary are other open ontologies. While the breadth of open ontologies exceeds that of closed ones in most cases, their format and structure are not as easy to handle. This is because they were developed from the start to be browsed by web surfers, without the intent of being used in NLP applications.
With the different types of ontologies available, the NLP community has always had to deal with several issues. First, the most suitable ontology must be chosen for the task at hand, depending on the scope and domain of the ontology and on the targeted application. Second, it is important to understand how the concepts within the ontology are defined, what they mean or represent, and how they are used by humans. For concepts that overlap in meaning with others or share the same lexical forms, it may be important to differentiate between them or to define a degree of relevancy among them. Third, it is necessary to decide how to map the concepts or entries in the chosen ontology to text documents. Ideally, the degree of relevancy would be declared during the mapping process.
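The third issue, mapping ontology entries to a document while declaring a degree of relevancy, can be sketched as follows. This is only an illustration under stated assumptions: the `CONCEPTS` set stands in for the labels a real ontology would supply, the matching is a simple longest-phrase-first lookup, and relevancy is assumed to be the relative frequency of each matched concept. None of these choices is taken from the methodologies described in this thesis.

```python
import re

# Hypothetical concept labels; a real ontology would supply these.
CONCEPTS = {"apple", "mac", "operating system", "fruit"}
MAX_WORDS = max(len(label.split()) for label in CONCEPTS)


def map_concepts(text):
    """Map concept labels onto a text and return a relevancy per concept."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = {}
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first so that multi-word
        # concepts are preferred over their single-word parts.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in CONCEPTS:
                counts[phrase] = counts.get(phrase, 0) + 1
                i += n
                break
        else:
            i += 1  # no concept starts at this token
    total = sum(counts.values())
    # Assumed relevancy: share of all concept matches in the document.
    return {c: k / total for c, k in counts.items()} if total else {}
```

For a sentence such as "Apple released the latest Mac last June. The Mac operating system was updated.", the sketch matches "apple", "mac" (twice) and "operating system", and assigns "mac" the highest relevancy. It does not disambiguate senses; a fuller mapping would combine this step with a context-based choice between, for example, the company and fruit senses of Apple.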
In my work, the methodologies developed rely on two repositories. The first is WordNet, a hierarchically structured repository created by linguistic experts and rich in explicitly defined lexical relations. The second is Wikipedia, an open world-knowledge ontology. In the next section, I give an overview of each repository and highlight how semantic distance is computed from the features extracted from each. I also describe how the issues mentioned above are addressed by my use of the chosen repositories.
It should be noted that with WordNet I use the term Semantic Similarity to describe how close in meaning two TUs or concepts are, while Semantic Relatedness is used with Wikipedia. Both are types of Semantic Distance and have been used interchangeably in the literature in certain contexts [125]. However, the former is in
and Oranges are similar while Apples and Seeds are related. On the other hand, Apples and Lions are unrelated and not similar.
Figure 3.2: An example showing that Similarity is a subset of Relatedness