

2.1.2 YAGO

YAGO [42, 64] is an ontology developed as part of the YAGO-NAGA project at the Max Planck Institute for Informatics. YAGO stores information as RDF triples consisting of a subject (S), a property (P) and an object (O). Each such SPO triple is called a fact. For example,

Nicolas Sarkozy (S) PresidentOf (P) France (O)

is a fact stored in YAGO. YAGO collects individual entities and their categories from Wikipedia 'infoboxes' and links them to the clean taxonomy of WordNet. YAGO contains 365,372 classes, 2,648,387 entities and 104 relations [40]. The taxonomy of YAGO is well-formed and meaningful. For example, in the instance shown in Figure 2.1, Zinedine Zidane is a soccer player and, because every soccer player is a person, he is also a person.

Zinedine Zidane → instanceOf → Soccer Player
Soccer Player → subclassOf → Person
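The way a fact store of this kind supports the inference "Zidane is a person" can be illustrated with a minimal sketch. It is not YAGO's implementation: the set-of-tuples representation, the underscore-joined identifiers and the helper name classes_of are assumptions made for illustration only, and the facts are the ones quoted in this section.

```python
# Illustrative sketch only: facts are held as (subject, property, object) tuples.
# The identifiers and the function name are hypothetical, not YAGO's API.
FACTS = {
    ("Nicolas_Sarkozy", "PresidentOf", "France"),
    ("Zinedine_Zidane", "instanceOf", "Soccer_Player"),
    ("Soccer_Player", "subclassOf", "Person"),
}


def classes_of(entity, facts):
    """Return all classes of an entity: its direct classes (instanceOf)
    plus every superclass reachable through subclassOf edges."""
    found = set()
    # Start from the classes the entity is directly an instance of.
    frontier = [o for s, p, o in facts if s == entity and p == "instanceOf"]
    while frontier:
        cls = frontier.pop()
        if cls in found:
            continue  # already visited; also guards against cycles
        found.add(cls)
        # Climb one level up the taxonomy.
        frontier.extend(o for s, p, o in facts if s == cls and p == "subclassOf")
    return found


if __name__ == "__main__":
    print(sorted(classes_of("Zinedine_Zidane", FACTS)))
```

Following the instanceOf edge once and then the subclassOf edges transitively is what makes a well-formed taxonomy useful: a query for persons can also return entities that are only recorded as soccer players.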

2.1 Ontology and Entity on Knowledge Base 15

Fig. 2.1 YAGO Structure [42]

The YAGO structure consists of three major components: classes (concepts, entity types), a set of individual entities, and literals (names, phrases). Figure 2.1 shows an excerpt of the YAGO knowledge base. These components are described as follows:

1. Classes: a class is a group of entities that share particular characteristics (e.g. Person, Location, Country and City). YAGO's classes are derived from two main sources: WordNet and Wikipedia. YAGO allows each class to be a subclass of one or multiple classes (the YAGO taxonomy). Entity is the root class of the YAGO taxonomy. A subclass is connected to its superclass via the property subclassOf. An example of the 'subclassOf' connection is as follows:

NationalLeader → subclassOf → Politician

WordNet [69] is a lexical database of English developed at Princeton University. WordNet groups words by their sense: a synset is a set of words that share one sense. Words that have multiple meanings (ambiguous words) can be assigned to several synsets. YAGO considers only nouns and the super-subordinate relationship among synsets (hyperonymy, hyponymy) to organise its taxonomic classes. YAGO establishes classes from these synsets and links them to Wikipedia categories.

16 Theoretical Background

Table 2.1 YAGO’s classes distribution [22]

The lower classes from the Wikipedia categories are mapped to the higher classes in WordNet by determining the most frequent sense of the head word in WordNet. YAGO allows only the conceptual categories in Wikipedia to become classes [64]. A conceptual category is a category whose head word is in plural form. YAGO identifies the head of a category name through shallow noun phrase parsing. For example, 'wikicategory_Afghan_politicians' is assigned as a subclass of the WordNet class 'politician' because the head word of the compound 'Afghan politicians' is 'politicians', the plural form of 'politician'. The upper and lower classes for these connections are as follows:

wikicategory_Afghan_politicians → subclassOf → wordnet_politician_110451263
wordnet_politician_110451263 → subclassOf → wordnet_leader_109623038

Table 2.1 shows the total number of classes from WordNet and Wikipedia at each level of the YAGO taxonomy. The YAGO taxonomy has 19 levels of depth, and most of the classes in YAGO are derived from Wikipedia. 90% of YAGO classes are at depths 4-10.

2. Entities: the set of individual entities consists of instances such as people, buildings, classes or countries. YAGO [40] divides entities into six categories: people, groups (e.g. music bands, football clubs, universities or companies), artifacts (e.g. buildings, paintings, books, songs or albums), events (e.g. wars, sports competitions like the Olympics or world championship tournaments), locations and other. Each individual entity is an instance of at least one class and is connected to its class via the property type. The connection between an instance and its class is as follows:

Zinedine_Zidane type Soccer_Player

Table 2.2 YAGO's instances distribution [22]

A previous study [22] showed that 96.34% of individual entities in YAGO come from Wikipedia. Furthermore, most of the individual entities are located in the leaf classes, and 90% of individual entities are located at depths 4-9 of the YAGO taxonomy. The distribution of instances is detailed in Table 2.2.

3. Literals: YAGO deals with ambiguity and synonymy by mapping alternative names to entities via the relation means. Quotes are used to distinguish literals from entities. The alternative names are derived from Wikipedia redirect pages. An example of a literal is as follows:

"Zizou" means Zinedine_Zidane
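Two of the mechanisms described in the list above can be sketched briefly: the plural-head test that decides whether a Wikipedia category is conceptual, and the lookup of entities by alternative name through means. The sketch below is a deliberately crude approximation and not YAGO's actual parser: the preposition list, the "ends in s" plural test and all function names are assumptions for illustration. In particular, a singular noun that ends in "s" (such as "politics") would fool it, and it returns a plain singular word instead of a WordNet synset identifier.

```python
from collections import defaultdict

PREFIX = "wikicategory_"
# Assumed, incomplete list of prepositions that end the pre-modifier + head part.
PREPOSITIONS = {"of", "in", "from", "by", "at", "on", "for", "with"}


def head_word(category):
    """Naive stand-in for shallow noun phrase parsing: the head is the last
    token before the first preposition (or the last token overall)."""
    name = category[len(PREFIX):] if category.startswith(PREFIX) else category
    tokens = name.split("_")
    for i, tok in enumerate(tokens):
        if i > 0 and tok.lower() in PREPOSITIONS:
            tokens = tokens[:i]
            break
    return tokens[-1].lower()


def class_link(category):
    """Return a (category, 'subclassOf', singular head) triple if the head
    word looks plural, i.e. the category is treated as conceptual; else None."""
    head = head_word(category)
    looks_plural = len(head) > 2 and head.endswith("s") and not head.endswith("ss")
    if not looks_plural:
        return None
    singular = head[:-3] + "y" if head.endswith("ies") else head[:-1]
    return (category, "subclassOf", singular)


def build_name_index(means_pairs):
    """Index (name, entity) pairs of the 'means' relation by name, so that an
    ambiguous name can map to several entities."""
    index = defaultdict(set)
    for name, entity in means_pairs:
        index[name].add(entity)
    return index


if __name__ == "__main__":
    print(class_link("wikicategory_Afghan_politicians"))
    print(class_link("wikicategory_History_of_France"))  # hypothetical category name
    names = build_name_index([("Zizou", "Zinedine_Zidane")])
    print(names["Zizou"])
```

Keeping a set of entities per name reflects the two cases mentioned in the text: several names may point to the same entity (synonymy), and one name may point to several entities (ambiguity).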

YAGO has been used by many researchers. Melo et al. [21] integrated entities from YAGO into the Suggested Upper Merged Ontology (SUMO). SUMO is a large-scale formal ontology, and the integration adds entities from specific domains, such as countries, cities, companies or actors; the result is a formal ontology on a rich scale. A further study [57] applied YAGO to automatically generate groups of queries and to assign the search results to an appropriate category. Limaye et al. [46] used entities, types and their relationships in YAGO to identify entities that are extracted from web tables. Furthermore, Hu et al. [41] used YAGO and the Internet Movie Database (IMDB) to recommend movies and actors to users.

YAGO is a suitable knowledge base for creating our data catalogue, and it is our primary source for creating a conceptualisation of personal name entities. This is because YAGO contains a large number of personal name entities, with facts that are automatically extracted from Wikipedia and WordNet. Furthermore, the YAGO taxonomy merges Wikipedia categories with the concepts of WordNet. Therefore, the YAGO taxonomy is well-formed and semantically accurate, and it provides multiple levels of a hierarchical taxonomy.