
When humans summarize a document, they usually attempt to understand it first. This requires an understanding of the language the document is written in, and it may also require background knowledge about the concepts mentioned in the document. When machines face a similar task, these human factors must be taken into account. Machines need to be supplied with background knowledge, and the most suitable source for this is an encyclopaedia. This is supported by the breadth hypothesis proposed by Lenat in [138], which states that “to behave intelligently in unexpected situations, an agent must be capable of falling back on increasingly general knowledge”. However, the use of an encyclopaedia presents yet another set of challenges.

First, using the textual data available in an encyclopaedia requires natural language understanding. In addition, common sense may be required for understanding text documents, especially for humans [3]. In an attempt to address part of the problem, Lenat started the CYC project, whose aim is to create a repository containing all the common sense knowledge an adult person would have. It is not its purpose to resolve people’s information needs. As mentioned earlier, the author estimates that 350 man-years are required to complete building the repository. A smaller version of the repository exists in

In this thesis, I attempt to use Wikipedia, the largest encyclopaedia known to date [139], for the task of automatic document summarization.

3.4.1 Wikipedia

Wikipedia is known to be the largest available, fastest growing, and most recent encyclopaedia. It is hosted and funded by the Wikimedia Foundation6, a non-profit organization which also hosts related projects such as Wikibooks and Wikinews. Its articles, over 15 million in total, are written, revised, updated and maintained by over 153,000 volunteer editors, and it spans over 240 languages. Its nearest competitor, the Encyclopaedia Britannica, has been in development since the 1700s and has approximately 120 thousand articles7, roughly two orders of magnitude fewer than Wikipedia.

The number of articles varies by language, ranging from a few pages to 3,289,927 pages for the English version8. An article can be seen as the basic unit in Wikipedia, describing a single topic thoroughly while being constantly revised and updated, so that its depth and breadth increase with time. The continuous updates and revisions to articles give Wikipedia a unique adaptability, allowing it to reflect the most recent major events or concepts.

The issue of Wikipedia’s accuracy has captured the interest of the media and many researchers. In a study [140] in which academics compared a selection of Wikipedia articles against their Britannica counterparts, subtle errors such as omissions and misleading statements were found in both. However, the study concluded that Wikipedia approaches the accuracy of Britannica. In [141], selected Wikipedia articles were compared against their equivalents in the Medscape Drug Reference, and no factual errors were found. However, it was noted that Wikipedia answered only 40% of the questions addressed, while its expert-made counterpart answered 83%, which hints that Wikipedia has an omission problem. Nevertheless, as mentioned in the more recent study in [142], Wikipedia was still found to compare favourably against all other sources which people would turn to if Wikipedia did not exist, and its strengths outweigh its weaknesses.

6 http://en.wikipedia.org/wiki/Wikimedia

7 http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons

8

Figure 3.4: the growth rate over time of the English Wikipedia9

Each article in Wikipedia contains text describing the topic of the article along with links (internal and external) to other relevant topics. The aim of the links is to provide the readers with insight and additional information about other relevant topics. In addition to the text describing the article, each article is uniquely labelled with a set of terms forming a title. When two or more articles discuss a topic with a word carrying more than one meaning, the title is usually augmented with a descriptive term to differentiate between the two articles. For example, the term bar carries more than one meaning and each meaning has an associated article with a unique title describing that meaning as in Bar

differentiate the different senses of a term is not always the case, as is evident for the topics titled Tree and Tree (data structure), where the former refers to the plant while the latter refers to the computer data structure. This can be viewed as an inconsistency within Wikipedia caused by the participation of a large number of editors holding different opinions.

Another aspect worth mentioning is Wikipedia’s disambiguation pages, which have been created for ambiguous terms carrying more than one meaning. A disambiguation page provides links to different articles, each describing one meaning of the term. For instance, Rice (disambiguation) is the title of the disambiguation page for the term Rice, listing links to articles that cover the different meanings of the word. There also exist redirect links, which provide alternative terms for topics that already exist within Wikipedia. Redirect links cover alternative names, abbreviations, shortcuts, alternative spellings, and likely misspellings. For example, the article titled United Kingdom has the redirect link UK pointing to it.
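The way titles, redirect links and disambiguation pages interact can be illustrated with a minimal sketch. The three lookup tables and the function name `candidate_articles` are hypothetical and hold toy data only; in practice such tables would be populated from a Wikipedia dump, and this is not necessarily how the system described in this thesis resolves terms.

```python
from typing import Dict, List, Set

# Hypothetical toy tables; real ones would be built from a Wikipedia dump.
ARTICLES: Set[str] = {"United Kingdom", "Tree", "Tree (data structure)"}
REDIRECTS: Dict[str, str] = {"UK": "United Kingdom"}
DISAMBIGUATION: Dict[str, List[str]] = {
    "Tree (disambiguation)": ["Tree", "Tree (data structure)"],
}


def candidate_articles(term: str) -> List[str]:
    """Return the article titles a surface term may refer to."""
    # Follow a redirect link first, if one exists for the term.
    title = REDIRECTS.get(term, term)
    # An ambiguous term yields every sense listed on its disambiguation page.
    senses = DISAMBIGUATION.get(f"{title} (disambiguation)")
    if senses is not None:
        return list(senses)
    # Otherwise the term maps to at most one article.
    return [title] if title in ARTICLES else []


print(candidate_articles("UK"))
print(candidate_articles("Tree"))
```

A term with several candidates still has to be disambiguated by context; the sketch only enumerates the candidates.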

One more aspect of Wikipedia is its categories and their overlapping tree structure. Every article is assigned to one or more categories. Every category other than the top-level one belongs to one or more parent categories and can contain subcategories. There is one top-level category named Contents, which has only subcategories and no parents. All other categories are below this top-level category. The whole structure of the articles and their categories in Wikipedia can be viewed as a directed acyclic graph.
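As an illustration of the directed acyclic graph view, the following sketch collects every category reachable from an article by walking the parent links upwards. The category names (other than Contents), the two dictionaries and the function name `ancestor_categories` are assumptions made for the example, not data taken from Wikipedia.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical toy graph: article -> categories, category -> parent categories.
ARTICLE_CATEGORIES: Dict[str, List[str]] = {"Lion": ["Mammals", "Predators"]}
CATEGORY_PARENTS: Dict[str, List[str]] = {
    "Mammals": ["Animals"],
    "Predators": ["Animals"],
    "Animals": ["Contents"],
    "Contents": [],  # the single top-level category has no parents
}


def ancestor_categories(article: str) -> Set[str]:
    """Collect all categories reachable from an article via parent links."""
    seen: Set[str] = set()
    queue = deque(ARTICLE_CATEGORIES.get(article, []))
    while queue:
        category = queue.popleft()
        if category in seen:
            # A category can be reached along several paths in a DAG.
            continue
        seen.add(category)
        queue.extend(CATEGORY_PARENTS.get(category, []))
    return seen


print(sorted(ancestor_categories("Lion")))
```

The `seen` set matters because categories overlap: here Animals is reachable through both Mammals and Predators but is visited only once.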

3.4.2 Semantic Relatedness

In Wikipedia, just as in other encyclopaedias, articles describe a variety of topics. Each article can be viewed as a concept attached to a body of text (the article content) describing the article’s main topic. The articles and their content are in the same form as the documents to be summarized since they are all text. The use of a devised semantic similarity measure allows for augmenting text documents with features extracted from Wikipedia and its large amount of world knowledge. In effect, this replaces the need for understanding the actual content of text documents and allows bypassing the difficulties highlighted above. Take the concept “Lion” as an example. One way to describe it is by the definition “large gregarious predatory feline of Africa and India having a tawny coat with a shaggy mane in the male” as given in WordNet. Another way is to say that it is strongly related to “Big Cat”, “Scavenger”, “Felidae” and “Mammal” and is less strongly related to “Tiger” and “Leopard”.

The goal of the Wikipedia Feature Generator is to enrich the representation of a text document by augmenting it with features extracted from Wikipedia. The features include the detected concepts and the degree of relevancy between them, both within a document and across the other documents in the same document set (for example, in multi-document summarization tasks). Each detected concept is represented by an attribute vector whose elements are all the other concepts, each paired with its degree of relatedness to the main concept. Concepts with a weak association, that is, a small relevancy degree, are removed from the vector. To compute the relevancy degree between all concepts, I use the features extracted from Wikipedia for the task, including the article titles, redirect links, article content, article categories, and article links. Figure 3.5 gives an overview of feature generation for a Wikipedia article titled “Mouse”. More details about the methods used and the

Figure 3.5: an Example for Wikipedia Features Generation
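The attribute-vector representation described above can be sketched as follows. The relatedness measure used here, Jaccard overlap between the sets of outgoing article links, is only a stand-in chosen for brevity and is not the measure devised in this thesis; the link sets, the threshold value and the function names are likewise hypothetical.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Stand-in relatedness: overlap of two link sets (not the thesis measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def attribute_vectors(
    links: Dict[str, Set[str]], threshold: float
) -> Dict[str, Dict[str, float]]:
    """Map each concept to the other concepts related to it above a threshold."""
    vectors: Dict[str, Dict[str, float]] = {}
    for concept, concept_links in links.items():
        vector: Dict[str, float] = {}
        for other, other_links in links.items():
            if other == concept:
                continue
            score = jaccard(concept_links, other_links)
            # Weakly associated concepts are dropped from the vector.
            if score >= threshold:
                vector[other] = score
        vectors[concept] = vector
    return vectors


# Hypothetical outgoing links of three detected concepts.
toy_links = {
    "Mouse": {"Rodent", "Mammal", "Tail"},
    "Rat": {"Rodent", "Mammal", "Sewer"},
    "Computer keyboard": {"Computer", "Input device"},
}
print(attribute_vectors(toy_links, threshold=0.3))
```

With this toy data only the Mouse and Rat concepts remain in each other's vectors, since neither shares any link with the third concept.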

It should be noted that filtering and preprocessing are first applied to the Wikipedia dump used. Although there are 3.5 million content pages in the English version of Wikipedia, they are not all equally important. Some of the articles are too short, while others contain only statistical data, tables or dates. The developed filtering module applies a set of rules to ensure that all concepts used in any task are attached to rich textual content.
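The kind of rule such a filtering module might apply can be sketched as below. The actual rules are not listed in this section, so the two checks (a minimum word count and a cap on the share of number-like tokens), their threshold values and the function name `has_rich_text` are all assumptions made for illustration.

```python
import re

# Hypothetical thresholds; the real filtering rules are not given in the text.
MIN_WORDS = 50
MAX_NUMERIC_RATIO = 0.5

# Tokens made up only of digits and number punctuation (dates, figures, ratios).
NUMBER_LIKE = re.compile(r"[\d.,:/%-]+")


def has_rich_text(
    text: str,
    min_words: int = MIN_WORDS,
    max_numeric_ratio: float = MAX_NUMERIC_RATIO,
) -> bool:
    """Return True if an article body looks like prose rather than a stub or table."""
    tokens = text.split()
    if len(tokens) < min_words:
        return False  # too short to describe a concept
    numeric = sum(1 for token in tokens if NUMBER_LIKE.fullmatch(token))
    # Reject pages dominated by statistics, table cells or dates.
    return numeric / len(tokens) <= max_numeric_ratio


print(has_rich_text("word " * 60))
print(has_rich_text("short stub"))
print(has_rich_text("1999 2000 " * 40))
```

Keeping each rule as a simple predicate makes it straightforward to add further checks without changing how the articles are iterated.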