The occupation taxonomies in existing knowledge bases, including Wikipedia, Freebase, YAGO and DBpedia, are not suitable for establishing personal name concepts. Firstly, Wikipedia and Freebase categories are not well-formed arrangements and are based on thematic domains. Secondly, the YAGO knowledge base addresses these problems by constructing a backbone of classes from WordNet synsets and connecting them to Wikipedia categories to form a semantic taxonomy. However, this thesis found that some connections between WordNet and Wikipedia are wrong, because some of the WordNet classes are not related to the Wikipedia categories assigned to them. For example, the Wikipedia category President of FIFA is assigned to the WordNet class head of state president, but the president of FIFA is not a head of state. Finally, the DBpedia ontology is well organised, but its hierarchy is not deep enough to create personal name concepts.
An occupation is an important feature when distinguishing one person from another [48]. This thesis aims to propose a new occupation taxonomy for generating a personal name concept for each instance. To define the characteristics of each person, the basic elements of the occupation taxonomy should be semantic and organised in multiple levels. Therefore, we propose a new occupation ontology architecture based on the YAGO taxonomy and web directories to create an occupation taxonomy for personal name concepts.
The NED framework consists of three components: extractor, searcher and disambiguator. The extractor extracts a list of proper names from a document. The searcher generates a set of candidate entities for each proper name by matching the proper name against entity surface forms. However, the exact-match lookup used in previous work is insufficient for generating candidate entities, because it treats two names (the mentioned name and the entity surface form) as equivalent only if all their letters are the same. This fails under name variation, a key challenge in NED: a proper name often has multiple representations, and typographical errors occur. The disambiguator resolves lexical ambiguity by linking the best candidate to a mentioned name. Two fundamental techniques, contextual similarity and topical coherence, are usually used in NED tasks. However, contextual similarity is not flexible enough for natural language. We therefore introduce a new approach based on tree matching and topical coherence for the disambiguator.
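The limitation of exact-match lookup in the searcher can be illustrated with a minimal sketch. The surface-form dictionary, the entity identifiers and the function names below are hypothetical, and the approximate matcher uses Python's standard difflib purely as a stand-in; it is not the matching method proposed in this thesis.

```python
from difflib import get_close_matches

# Hypothetical surface-form dictionary: lower-cased surface form -> entity IDs.
SURFACE_FORMS = {
    "barack obama": ["Barack_Obama"],
    "george bush": ["George_H._W._Bush", "George_W._Bush"],
    "david beckham": ["David_Beckham"],
}


def exact_candidates(mention):
    """Exact-match lookup: succeeds only if every letter is the same."""
    return SURFACE_FORMS.get(mention.lower(), [])


def fuzzy_candidates(mention, cutoff=0.85):
    """Approximate lookup that tolerates small spelling variations.

    The cutoff value is an illustrative assumption, not a tuned parameter.
    """
    keys = get_close_matches(mention.lower(), list(SURFACE_FORMS), n=3, cutoff=cutoff)
    candidates = []
    for key in keys:
        for entity in SURFACE_FORMS[key]:
            if entity not in candidates:
                candidates.append(entity)
    return candidates


if __name__ == "__main__":
    print(exact_candidates("Barak Obama"))  # misspelt mention, exact lookup
    print(fuzzy_candidates("Barak Obama"))  # misspelt mention, approximate lookup
```

In this sketch the misspelt mention finds no candidate under exact matching, whereas the approximate lookup still retrieves the intended entity.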
There are three main advantages to using STM to compare two personal name concepts. Firstly, the structure of a personal name concept is a tree. Secondly, the computation needed to decide whether a pair of personal name concepts is similar is short. This is because STM is a top-down algorithm that compares the two root nodes first; if the root nodes are dissimilar, there are significant differences in conceptualisation between the two personal names. Thirdly, when considering the similarity between two trees, it is not necessary for every node in the two trees to be similar; it is enough for some nodes to be similar. STM returns the maximum number of matching nodes and the matching pairs. However, the performance of STM is limited because it assigns equal weight to every node in the tree. Additionally, it is time-consuming and unnecessary to compare an ambiguous personal name concept with every identified person in a web document.
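The top-down behaviour of STM described above can be sketched as follows. This is a minimal version of the standard Simple Tree Matching recurrence that returns only the number of matching nodes (the matching pairs are omitted for brevity). The Node class and the occupation labels are hypothetical illustrations, not the concept structure defined later in this thesis.

```python
class Node:
    """A labelled tree node; the labels used below are illustrative only."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []


def simple_tree_matching(a, b):
    """Return the maximum number of matching nodes between trees a and b."""
    # Top-down step: dissimilar roots end the comparison immediately.
    if a.label != b.label:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j] is the best score using the first i subtrees of a
    # and the first j subtrees of b, preserving sibling order.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    # Add one for the matching pair of roots.
    return best[m][n] + 1


# Two hypothetical personal name concepts that share a root and some nodes.
concept_a = Node("person", [Node("politician", [Node("president")]), Node("writer")])
concept_b = Node("person", [Node("politician", [Node("governor")]), Node("writer")])
# A concept with a different root is rejected after a single comparison.
concept_c = Node("organisation", [Node("politician"), Node("writer")])
```

Note that every matched node contributes exactly one to the score, which is the equal-weight limitation mentioned above.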
The seed tree growth technique in PTA provides a useful idea for creating a comparison tree in our work. It integrates the personal name conceptualisations that have the same root node, and the tree with the maximum number of nodes becomes the comparison tree. A comparison tree can reduce the number of times similarity matching occurs between two trees.
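The selection step described above can be sketched as follows. This covers only the choice of the largest tree among those sharing a root node; it is not the full PTA seed tree growth procedure. The tuple representation of a tree, (label, children), and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def count_nodes(tree):
    """Count the nodes of a tree given as (label, list_of_child_trees)."""
    _label, children = tree
    return 1 + sum(count_nodes(child) for child in children)


def pick_comparison_trees(trees):
    """Group trees by root label and keep the largest tree of each group."""
    groups = defaultdict(list)
    for tree in trees:
        groups[tree[0]].append(tree)
    return {root: max(group, key=count_nodes) for root, group in groups.items()}


# Hypothetical personal name conceptualisations.
small = ("person", [("politician", [])])
large = ("person", [("politician", [("president", [])]), ("writer", [])])
other = ("organisation", [])
comparison = pick_comparison_trees([small, large, other])
```

Each incoming concept then needs to be matched against one comparison tree per root label rather than against every tree.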
As a result, we introduce a new algorithm, Simple Partial Tree Matching (SPTM), for personal name disambiguation, which combines the STM and PTA algorithms.
Chapter 3
Personal Name Surface Form and OAPnDis
This chapter introduces three main ideas used in personal name entity linking: Personal Name Surface Form Modules (PNSFM), the Occupation Architecture for Personal Name Disambiguation (OAPnDis) and the Personal Name Disambiguation Data Catalogue (PNDDC). Firstly, Section 3.2 describes the PNSFMs, which are used to generate personal name surface forms and to extract occupation categories, the occupations of personal name entities and their relations. Secondly, Section 3.3 presents an overview of OAPnDis, a new occupation architecture that is used to generate personal name concepts. Finally, Section 3.4 presents a detailed description of the PNDDC, which is used in personal name entity linking.
3.1 Motivation
Data catalogues are important components of Personal Name Entity Linking (PNEL) and play a critical role in its precision. The major problems of PNEL, however, are lexical and referential ambiguity. Referential ambiguity means that a name can be represented in multiple forms [7, 24], such as abbreviations (e.g. John F. Kennedy vs. JFK), shortened forms (e.g. Willard Carroll Smith vs. Will Smith), alternate spellings (e.g. Barack Obama vs. Barak Obama) and aliases (e.g. DB7 vs. David Beckham). Lexical ambiguity means that a single name can refer to different persons [7, 24]. For example, the name George Bush may refer to at least six well-known people on Wikipedia, as shown in Table 3.1.
Table 3.1 The Wikipedia disambiguation page focuses on the personal name: George Bush [71].
ID   Personal Names                   Professional Category
1    George H. W. Bush                41st President of the United States
2    George W. Bush                   43rd President of the United States
3    George Bush (biblical scholar)   19th-century biblical scholar and preacher
4    George Washington Bush           First black settler in what is now the state of Washington
5    George Bush (NASCAR)             Former NASCAR driver
6    George P. Bush (born 1976)       Attorney and real estate developer
Traditionally, knowledge bases such as YAGO [42], Wikipedia and Freebase [30] provide information about referential ambiguity for each person. For example, Wikipedia uses redirect pages, YAGO uses the property 'means' and Freebase uses the property 'Also known as' to resolve the referential ambiguity of each person. These knowledge bases therefore provide data on the referential ambiguity of names that is useful for generating a personal name dictionary.
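As a minimal sketch of how such alias data could be turned into a personal name dictionary, the following inverts a set of alias records into a mapping from surface form to entities. The records, the entity identifiers and the function name are hypothetical; they stand in for values that would be read from redirect pages or from the 'means' and 'Also known as' properties.

```python
from collections import defaultdict

# Hypothetical alias records: entity ID -> known surface forms.
ALIASES = {
    "John_F._Kennedy": ["John F. Kennedy", "JFK"],
    "Will_Smith": ["Willard Carroll Smith", "Will Smith"],
    "George_H._W._Bush": ["George H. W. Bush", "George Bush"],
    "George_W._Bush": ["George W. Bush", "George Bush"],
}


def build_name_dictionary(aliases):
    """Invert alias records into: lower-cased surface form -> set of entity IDs."""
    dictionary = defaultdict(set)
    for entity, forms in aliases.items():
        for form in forms:
            dictionary[form.lower()].add(entity)
    return dictionary


name_dictionary = build_name_dictionary(ALIASES)
```

In this structure referential ambiguity appears as several keys pointing to the same entity, while lexical ambiguity appears as one key, such as george bush, pointing to several entities.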
Knowledge bases use multiple features, such as occupation, nationality and date of birth, to resolve lexical ambiguity by appending these features to the personal name. Occupation is widely used in Wikipedia or Google snippets to distinguish people who have the same name. Furthermore, the experimental results of a study [48] show that occupation is the most important feature for resolving lexical ambiguity. Therefore, building personal name entities and a corresponding 'occupations catalogue' is an important task in PNEL.
Knowledge bases and web directories such as Wikipedia, Freebase and Dmoz [23] use a thematic domain hierarchy for grouping contents and entities. The categories have a hierarchical arrangement; however, they are not well-formed and are barely effective for ontological purposes [63, 65]. For example, Stephen King is assigned to the super-category Fiction, even though Stephen King is a writer and not a work of fiction. Therefore, a thematic domain hierarchy sometimes assigns a person to categories that are meaningless for that person, which may affect the performance of personal name disambiguation.
YAGO and DBpedia [20] provide an ontology and semantic classes. However, the taxonomy in YAGO suffers from erroneous connections between WordNet and Wikipedia classes, in that some connected classes are not related [22]. The major problem in DBpedia is that its hierarchy has too few levels to generate a personal name concept for each person. It is more valuable in personal name disambiguation if the system can create a concept of each person based on their occupation taxonomy, or if