
Entity Extraction from the Semantic Web

5.2 Extraction Techniques

5.2.5 Entity Extraction from the Semantic Web

The previously described entity extraction algorithms all use the Visible Web (compare Section 2.2.3) as their extraction corpus. In the Visible Web, unstructured and semi-structured documents (see Section 2.2.1) dominate. The Semantic Web is growing quickly and contains billions of structured data entries that can also serve as an extraction corpus. The Semantic Web Extractor (SWE) targets exactly this corpus, with the goal of extracting as many correct entities for our given ontology as possible. This task is often referred to as entity list completion.


Querying the Semantic Web

Information pieces on the Semantic Web adhere to ontologies, which makes the data easy for machines to read and process. However, the information is highly distributed over thousands of Web pages. The Semantic Web contains a few major sources, such as DBpedia and Freebase (see Figure 1.2), but Web pages containing RDFa, for example, contribute further triples to the Web of Data. In the conception of the algorithm, we assume that there is an index¹² or triple store that has aggregated triples from many different sources. We then use this index as the single query point, removing the need to work with distributed data.

Figure 5.13 shows the different processes in the SWE. The next sections explain these steps in more detail.

Figure 5.13: Overview of the Processes for the Semantic Web Entity Extraction

Detecting Ontology Concepts with Seed Entities

Figure 5.14 illustrates the first step of the SWE in greater detail.

Figure 5.14: Concept URI Detection Process of the Semantic Web Entity Extractor

Having only the concept names (concept) and a few instances per concept (SeedSet_concept), we need to find out which ontological concepts the seeds belong to. We therefore load a combination of seeds and query the index with each seed and its concept. We then find the seedURI that is about our seed and extract all triple objects that might state the type of the entity, that is, the objects of triples whose predicate is an element of the TypePredicateSet. Next, we put all type candidates found into a candidate list, the TypeCandidateList. After doing this for all seed entities, we need to determine which of the concepts in our candidate list all the seed entities belong to. We therefore eliminate all concepts that appeared fewer times than the number of seed entities we have.

Furthermore, we might have extracted concepts together with their super concepts. We want the most detailed concepts and therefore need to remove their broader super concepts. We do this by resolving each concept URI and eliminating its super concepts from the candidate set; equivalently, a concept is eliminated if it turns out to be a super concept of another concept in the candidate set. If the TypeCandidateSet is not empty after this process, we move on to entity extraction. If we have not detected a concept, we use a different combination of seeds and try again until we find at least one common concept. To make this process clear, we provide the following example:

Let us assume we have the following seed entities of the actor concept: SeedSet_Actor = {Jim Carrey, Josh Brolin}. First, we search the index for URIs with information about these seeds. From the result list, we now need to find the URI that is most likely about the seed entity. We do this by detecting the label of the subject that is described by the URI (see Section 5.2.5). If the label matches our seed entity name exactly, we take the URI as the subject and query the index for all triples that belong to that subject.
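The exact-match selection of the seedURI could be sketched as follows. This is only an illustration under stated assumptions: the function name `find_seed_uri`, the dictionary of search results, and the `example.org` URI are hypothetical, and the labels are assumed to have been determined already.

```python
def find_seed_uri(seed, labels_by_uri):
    """Return the first URI whose label equals the seed name exactly, else None."""
    for uri, label in labels_by_uri.items():
        if label == seed:
            return uri
    return None


# Hypothetical search results: candidate URI -> label of the subject it describes.
results = {
    "http://example.org/tv/the_jim_carrey_show": "The Jim Carrey Show",
    "http://rdf.freebase.com/ns/en.jim_carrey": "Jim Carrey",
}

seed_uri = find_seed_uri("Jim Carrey", results)
```

A seed without an exactly matching label yields `None`, in which case no triples would be retrieved for it.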

In our example we have found that the URI http://rdf.freebase.com/ns/en.jim_carrey matches our seed Jim Carrey. We would then retrieve the following triples for this URI (s is the subject, p is the predicate, and o is the object):

s: <http://rdf.freebase.com/ns/en.jim_carrey>
p: <http://rdf.freebase.com/ns/common.topic.image>
o: <http://rdf.freebase.com/ns/wikipedia.images.commons_id.4945051>

s: <http://rdf.freebase.com/ns/en.jim_carrey>
p: <http://rdf.freebase.com/ns/base.popstra.celebrity.supporter>
o: <http://rdf.freebase.com/ns/m.064hfhb>

s: <http://rdf.freebase.com/ns/en.jim_carrey>
p: <http://rdf.freebase.com/ns/film.actor.film>
o: <http://rdf.freebase.com/ns/m.0jykww>

s: <http://rdf.freebase.com/ns/en.jim_carrey>
p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
o: <http://rdf.freebase.com/ns/film.actor>

s: <http://rdf.freebase.com/ns/en.jim_carrey>
p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
o: <http://rdf.freebase.com/ns/award.award_nominee>

s: <http://rdf.freebase.com/ns/en.jim_carrey>
p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
o: <http://rdf.freebase.com/ns/people.person>

In the next step, we need to find out which type our seed has. Different vocabularies can be used to state the type of an entity. Because we want the algorithm to extract from arbitrary ontologies, we use only the following ontology-independent predicates in the TypePredicateSet:


p: http://www.w3.org/1999/02/22-rdf-syntax-ns#type

p: http://www.w3.org/2004/02/skos/core#subject

p: http://purl.org/dc/terms/subject

We remove all triples whose predicate is not an element of the TypePredicateSet. We are now left with the following triples:

s: <http://rdf.freebase.com/ns/en.jim_carrey>
p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
o: <http://rdf.freebase.com/ns/film.actor>

s: <http://rdf.freebase.com/ns/en.jim_carrey>
p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
o: <http://rdf.freebase.com/ns/award.award_nominee>

s: <http://rdf.freebase.com/ns/en.jim_carrey>
p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
o: <http://rdf.freebase.com/ns/people.person>
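The filtering step can be sketched in a few lines of Python. The tuple representation of triples and the name `type_candidates` are assumptions made for illustration. The rdf:type predicate is included in the set because it is the predicate of the type triples in the example.

```python
# Hypothetical sketch: triples are modelled as (subject, predicate, object) tuples.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FB = "http://rdf.freebase.com/ns/"

TYPE_PREDICATE_SET = {
    RDF_TYPE,
    "http://www.w3.org/2004/02/skos/core#subject",
    "http://purl.org/dc/terms/subject",
}


def type_candidates(triples, type_predicates=TYPE_PREDICATE_SET):
    """Return the objects of all triples whose predicate states a type."""
    return [o for (_s, p, o) in triples if p in type_predicates]


jim_carrey_triples = [
    (FB + "en.jim_carrey", FB + "common.topic.image",
     FB + "wikipedia.images.commons_id.4945051"),
    (FB + "en.jim_carrey", FB + "film.actor.film", FB + "m.0jykww"),
    (FB + "en.jim_carrey", RDF_TYPE, FB + "film.actor"),
    (FB + "en.jim_carrey", RDF_TYPE, FB + "award.award_nominee"),
    (FB + "en.jim_carrey", RDF_TYPE, FB + "people.person"),
]

candidates = type_candidates(jim_carrey_triples)
```

Only the objects of the three type triples remain in `candidates`; the image and film triples are dropped.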

After processing the first seed, the TypeCandidateList contains the following URIs:

o: http://rdf.freebase.com/ns/film.actor

o: http://rdf.freebase.com/ns/award.award_nominee

o: http://rdf.freebase.com/ns/people.person

We can now process our next seed, seed = Josh Brolin. We use the same procedure as with the first seed, with one difference: when finding the seedURI for the seed, we limit the results to URIs from the same ontology as the first URI. This step is necessary for finding the common entity type among all seeds in the SeedSet. For example, if the first seedURI is from DBpedia (http://dbpedia.org/resource/Jim_Carrey) and the second one is from Freebase (http://rdf.freebase.com/ns/en.josh_brolin), their candidate types will not match, since each ontology is likely to use its own terms for the type. In DBpedia, Jim Carrey is of type http://dbpedia.org/ontology/Actor, while in Freebase his type is http://rdf.freebase.com/ns/film.actor.
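One simple way to approximate "URIs from the same ontology" is to compare the host part of the URIs. The following sketch makes that assumption explicit; the text does not state how the SWE identifies the source ontology, and the function names are hypothetical.

```python
from urllib.parse import urlparse


def same_source(uri_a, uri_b):
    """Assumption: two URIs belong to the same ontology if their hosts match."""
    return urlparse(uri_a).netloc == urlparse(uri_b).netloc


def restrict_to_source(first_seed_uri, result_uris):
    """Keep only result URIs that come from the same source as the first seedURI."""
    return [uri for uri in result_uris if same_source(first_seed_uri, uri)]


first_seed_uri = "http://rdf.freebase.com/ns/en.jim_carrey"
results = [
    "http://dbpedia.org/resource/Josh_Brolin",
    "http://rdf.freebase.com/ns/en.josh_brolin",
]

filtered = restrict_to_source(first_seed_uri, results)
```

With the Freebase URI of the first seed as reference, only the Freebase URI for Josh Brolin remains in `filtered`.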

After processing our second seed, seed = Josh Brolin, and finding its seedURI http://rdf.freebase.com/ns/en.josh_brolin, we can add types to the TypeCandidateList from the following triples:

s: <http://rdf.freebase.com/ns/en.josh_brolin>
p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
o: <http://rdf.freebase.com/ns/film.actor>

s: <http://rdf.freebase.com/ns/en.josh_brolin>
p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
o: <http://rdf.freebase.com/ns/tv.tv_guest_role>

s: <http://rdf.freebase.com/ns/en.josh_brolin>
p: <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
o: <http://rdf.freebase.com/ns/people.person>

After adding the types to the TypeCandidateList, it contains the following URIs:

o: http://rdf.freebase.com/ns/film.actor

o: http://rdf.freebase.com/ns/award.award_nominee

o: http://rdf.freebase.com/ns/people.person

o: http://rdf.freebase.com/ns/film.actor

o: http://rdf.freebase.com/ns/tv.tv_guest_role

o: http://rdf.freebase.com/ns/people.person

We now want to find all types that all seeds from the SeedSet have in common, that is, we remove all types from the TypeCandidateList that occur fewer times than |SeedSet|. After this step our set is reduced to two types:

o: http://rdf.freebase.com/ns/film.actor

o: http://rdf.freebase.com/ns/people.person
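The counting step can be illustrated with a short Python sketch. The function name `common_types` is hypothetical, and counting each type at most once per seed is an added assumption that guards against duplicate type triples for a single seed.

```python
from collections import Counter

FB = "http://rdf.freebase.com/ns/"


def common_types(types_per_seed):
    """Keep only the types that were found for every seed."""
    # Assumption: a type counts at most once per seed, so duplicates within
    # one seed's triples cannot inflate the count.
    counts = Counter(t for types in types_per_seed for t in set(types))
    return {t for t, n in counts.items() if n >= len(types_per_seed)}


types_per_seed = [
    # Jim Carrey
    [FB + "film.actor", FB + "award.award_nominee", FB + "people.person"],
    # Josh Brolin
    [FB + "film.actor", FB + "tv.tv_guest_role", FB + "people.person"],
]

common = common_types(types_per_seed)
```

For the two seeds of the example, `common` contains exactly the `film.actor` and `people.person` URIs.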

Our goal, however, is to find the most precise concept that all seeds share. In a last step, we therefore resolve the URIs from the TypeCandidateSet and remove the super concepts of each candidate from the set. More precisely, we remove every concept that is the object of the predicate http://www.w3.org/2000/01/rdf-schema#subClassOf for a subject from the TypeCandidateSet. In our example, we can remove freebase:people.person from the set after discovering that freebase:film.actor is a subclass of that concept. Our final detected concept URI in this example is therefore http://rdf.freebase.com/ns/film.actor.
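A minimal sketch of the super concept removal follows. It assumes that the rdfs:subClassOf objects of each candidate have already been retrieved into a dictionary (`superclasses_of`, a hypothetical stand-in for resolving the concept URI) and it considers only the super concepts listed there, without following longer subclass chains.

```python
FB = "http://rdf.freebase.com/ns/"


def most_specific(candidates, superclasses_of):
    """Drop every candidate that is a super concept of another candidate."""
    broader = set()
    for concept in candidates:
        # Ignore a reflexive subClassOf statement so that a concept
        # never eliminates itself.
        broader |= superclasses_of.get(concept, set()) - {concept}
    return set(candidates) - broader


type_candidates = {FB + "film.actor", FB + "people.person"}

# Hypothetical lookup result: concept URI -> objects of its rdfs:subClassOf triples.
superclasses_of = {FB + "film.actor": {FB + "people.person"}}

detected = most_specific(type_candidates, superclasses_of)
```

Here `detected` holds only the `film.actor` URI, matching the outcome of the example.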

Extraction of Entities

Figure 5.15 shows the second step of the SWE in greater detail.

Figure 5.15: Entity Extraction Process of the Semantic Web Entity Extractor

After we have detected the common type URIs for all the seeds from the SeedSet, we can query the index to find more entity mentions and extract them. First, we collect an EntityCandidatesSet. Figure 5.15 shows the first way to accomplish this: we resolve each URI from the TypeCandidateSet and add to the EntityCandidatesSet every subject URI of a triple whose predicate is an element of the TypePredicateSet and whose object is a URI from the TypeCandidateSet. In a second step, we query the index for all combinations of predicates and detected types, which results in numQueries = |TypePredicateSet| × |TypeCandidateSet| queries. In our example, we would query the index with one detected concept URI (freebase:film.actor) and the three URIs from the TypePredicateSet, that is, with three queries. From each result, we add the subject URI to the EntityCandidatesSet.
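The index queries for all predicate and type combinations might look like the following sketch. The in-memory list of triples stands in for the real index, and the function name and the `example.org` URIs are hypothetical.

```python
from itertools import product

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FB = "http://rdf.freebase.com/ns/"

TYPE_PREDICATE_SET = {
    RDF_TYPE,
    "http://www.w3.org/2004/02/skos/core#subject",
    "http://purl.org/dc/terms/subject",
}


def collect_entity_candidates(index, type_predicates, type_candidates):
    """Issue one query per (predicate, type) pair and collect the subject URIs."""
    candidates = set()
    num_queries = 0
    for predicate, concept in product(type_predicates, type_candidates):
        num_queries += 1
        # Stand-in for an index query: match predicate and object, keep the subject.
        candidates.update(
            s for (s, p, o) in index if p == predicate and o == concept
        )
    return candidates, num_queries


# Toy index; the last triple is a hypothetical non-actor entity.
index = [
    (FB + "en.jim_carrey", RDF_TYPE, FB + "film.actor"),
    (FB + "en.josh_brolin", RDF_TYPE, FB + "film.actor"),
    (FB + "en.josh_brolin", RDF_TYPE, FB + "people.person"),
    ("http://example.org/some_band", RDF_TYPE, "http://example.org/Band"),
]

entity_candidates, num_queries = collect_entity_candidates(
    index, TYPE_PREDICATE_SET, {FB + "film.actor"}
)
```

With three type predicates and one detected type, three queries are issued, and the two actor URIs end up in `entity_candidates`.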

After having collected candidate URIs in the EntityCandidatesSet, we need to find the corresponding label for each of the candidate URIs. To retrieve the label, we resolve the entity candidate URI and analyze all triples whose predicate is an element of the LabelPredicateSet. Again, we want to be ontology-independent and use only generic vocabulary¹³. The LabelPredicateSet contains the following predicates:

p: http://www.w3.org/2000/01/rdf-schema#label

p: http://www.purl.org/dc/elements/1.1/title

If we find a label, we extract the entity. If there are several labels, we search until we find the English one. Usually the language is denoted with a @LANGUAGE_CODE after the literal. If we are unable to find the label by analyzing the entity candidate's triples, we try to guess the label from the entity candidate URI itself. The URI does not have to be human-readable or contain the entity label, but sometimes it does. For example, the following two URIs describe the same entity, Jim Carrey:

s: http://sw.opencyc.org/concept/Mx4rvo3dlZwpEbGdrcN5Y29ycA

s: http://dbpedia.org/resource/Jim_Carrey

If we were not able to find a proper label, we take the last part of the URI (everything after the last slash), clean the string (replace underscores with spaces), and perform a plausibility check. For this last check, we use simple heuristics. We consider a label implausible if

• the longest consecutive string in the label is longer than 25 characters, or

• the longest consecutive string in the label is longer than 15 characters and contains more than two digits, or

• the label starts with an @ symbol.

In the example above, if we did not find a proper label for either of the two URIs, we would consider “Mx4rvo3dlZwpEbGdrcN5Y29ycA” implausible, but would extract the term “Jim Carrey” from the DBpedia URI.
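The label lookup with its URI fallback can be sketched as follows. Several details are assumptions rather than part of the described algorithm: literals are modelled as strings of the form `"text"@lang`, "longest consecutive string" is interpreted as the longest whitespace-free token, the first label is used when no English one exists, and all function names are hypothetical. The predicate URIs are copied from the LabelPredicateSet above.

```python
LABEL_PREDICATE_SET = {
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.purl.org/dc/elements/1.1/title",
}


def split_literal(literal):
    """Split '"text"@lang' into (text, lang); lang is None if there is no tag."""
    text, sep, lang = literal.rpartition("@")
    if not sep or not text.endswith('"'):
        return literal.strip('"'), None
    return text.strip('"'), lang


def is_plausible(label):
    """Apply the three plausibility heuristics to a guessed label."""
    if not label or label.startswith("@"):
        return False
    longest = max(label.split(), key=len)
    if len(longest) > 25:
        return False
    if len(longest) > 15 and sum(c.isdigit() for c in longest) > 2:
        return False
    return True


def guess_label_from_uri(uri):
    """Take everything after the last slash and replace underscores with spaces."""
    candidate = uri.rsplit("/", 1)[-1].replace("_", " ").strip()
    return candidate if is_plausible(candidate) else None


def extract_label(uri, triples):
    """Prefer an English label from the triples, else fall back to the URI."""
    labels = [split_literal(o) for (_s, p, o) in triples if p in LABEL_PREDICATE_SET]
    for text, lang in labels:
        if lang == "en":
            return text
    if labels:
        return labels[0][0]  # assumption: use the first label if none is English
    return guess_label_from_uri(uri)
```

For a candidate without label triples, `extract_label("http://dbpedia.org/resource/Jim_Carrey", [])` falls back to the URI and yields `Jim Carrey`, whereas the OpenCyc URI is rejected by the plausibility check and yields `None`.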

We perform these steps for each candidate entity from the EntityCandidateSet and proceed to the ranking step once we have processed all candidates.

¹³ An example of an ontology-dependent vocabulary is http://rdf.freebase.com/ns/type.object.name, which is similar to http://www.w3.org/2000/01/rdf-schema#label, but is primarily used by Freebase and not by other ontologies.

Ranking Extractions

Figure 5.16 shows the third and last step of the SWE in greater detail.

Figure 5.16: Entity Ranking Process of the Semantic Web Entity Extractor

After we extract entities from the EntityCandidateSet , we have an ExtractedEntitiesSet . To ensure we only use the most likely entities, we need to rank the entities from the set.

To rank the entities, we calculate the similarity of each entity from the ExtractedEntitiesSet with the entities from the SeedSet. The more similar the entity, the higher the rank. As a similarity function, we use the Jaccard similarity coefficient as shown in Equation 5.4, where Triples_seeds is the union of the sets of all predicates and objects (URIs and literals) that were found for the seed entities from the SeedSet. Triples_entity is the set of all predicates and objects that were found for the entity that is being ranked.

\[
\mathrm{Rank}(entity) = \frac{|Triples_{seeds} \cap Triples_{entity}|}{|Triples_{seeds} \cup Triples_{entity}|} \tag{5.4}
\]

The rationale behind this similarity approach is that the ranking should be relative to the entities from the SeedSet. For example, if we used only comedy actors as seeds, other comedy actors should be ranked higher than musical actors. The number of predicates and objects that comedy actors have in common is usually higher than the number that musical actors and comedy actors have in common.
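The ranking of Equation 5.4 can be illustrated with a small Python sketch. Modelling each entity's predicates and objects as a set of (predicate, object) pairs is an assumption, and the toy entities and `ex:` terms are hypothetical and serve only to show the effect described above.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_entities(seed_pairs, entity_pairs):
    """Rank entities by their similarity to the union of all seed pairs."""
    triples_seeds = set().union(*seed_pairs)
    scores = {name: jaccard(triples_seeds, pairs) for name, pairs in entity_pairs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (predicate, object) pairs for two comedy-actor seeds.
seed_pairs = [
    {("rdf:type", "ex:Actor"), ("rdf:type", "ex:Person"), ("ex:genre", "ex:Comedy")},
    {("rdf:type", "ex:Actor"), ("rdf:type", "ex:Person")},
]

entity_pairs = {
    "musical_actor": {("rdf:type", "ex:Actor"), ("rdf:type", "ex:Person"), ("ex:genre", "ex:Musical")},
    "comedy_actor": {("rdf:type", "ex:Actor"), ("rdf:type", "ex:Person"), ("ex:genre", "ex:Comedy")},
}

ranking = rank_entities(seed_pairs, entity_pairs)
```

In this toy setting the comedy actor shares all three pairs with the seeds and scores 1.0, while the musical actor shares two of four pairs in the union and scores 0.5, so the comedy actor is ranked first.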