• No results found

4.4 Advanced and Adaptive Entity Modelling in the Web of Data

4.4.1 Baseline Entity Models

In this section we formalise and draw upon two existing entity modelling approaches within a generative language modelling framework.

Retrieval Framework

We study the entity retrieval problem in a generative language modelling setting. Language models (LMs) are attractive because of their solid theoretical foundations that couples with good empirical performance [124]. LMs have been successfully applied to a wide range of entity-related search tasks; see, e.g., [9, 30, 39, 54]. This is in line with our observations in Section 4.3. A general overview is given in Sections 2.3.1 and 2.3.2 for the fielded variants. The following notation deviates from the background section in that it regards entities instead of documents. We rank candidate entities (e) according to their probability of being relevant given query q: P (e|q). Instead of estimating this probability directly, we apply Bayes’ rule and drop the denominator as it does not influence the ranking (for a given query): P (e|q) = P(q|e)P (e)P(q) rank= P (q|e)P (e). Here, P (e) is the prior probability of choosing a particular entity e, that we subsequently attempt to draw the query q from, with probability P (q|e). Here, we assume that P (e) is uniform, thus, does not affect the ranking. Entity priors could be used to incorporate query-independent features into the ranking, for example, based on the RDF graph structure [26, 33]. Each entity e is represented by a multinomial probability distribution over the vocabulary of terms. The entity model θeis used to predict how likely the entity

would produce a given term t, that is, P (t|θe). Assuming that query terms are

sampled identically and independently, the query likelihood is obtained by taking the product across all the terms in the query, such that:

P (q|θe) =



t∈q

P (t|θe)tft ,q, (4.1)

where tft ,q is the (raw) frequency of term t in the query. Note that P (t|θe) > 0

must be ensured for all vocabulary terms, otherwise the product might end up being zero.

The main question we will be concerned with for the remainder of this chapter is the estimation of the entity model, i.e., the probability distribution, P (t|θe).

Unstructured Entity Model

The simplest approach to constructing entity models is to fold all text associated with the entity into one “bag-of-words”; see Table 4.2(a) for an illustration. Follow- ing the standard language modelling approach to document retrieval, we implement the entity model as a Dirichlet-smoothed multinomial distribution:

P (t|θe) =

tft,e+ μP (t|θc)

|e| + μ , (4.2)

where tft,e is the raw frequency of term t in the representation of e and|e| is the

Table 4.2: Examples of baseline entity models corresponding to Figure 1.3. URIs are resolved (i.e., replaced with the name of the corresponding entity).

(a) Unstructured Entity Model.

Text Audi A4

1996 2002 2005 2007 The Audi A4 is a compact executive car... Volkswagen Group Compact executive cars All wheel drive vehicles Front wheel drive vehicles Product

Audi A4 Audi 80 Audi A5 Audi A4

(b) Structured Entity Model.

Pred. type Value

Name Audi A4

Attributes 1996 2002 2005 2007 The Audi A4 is a compact executive car... OutRelations Volkswagen Group

Compact executive cars All wheel drive vehicles Front wheel drive vehicles Product

Audi A4 InRelations Audi 80

Audi A5 Audi A4

average entity representation size in the collection. The term generation process is shown in Figure 4.1 (Left).

Several SemSearch participants employed variations of this approach: consider- ing only triples where the entity stands as a subjects (thereby ignoring incoming relations) [13, 30, 56] or extracting text only from the subject itself [36].

Structured Entity Model using Predicate-folding

Instead of folding all predicates together, they might be grouped into multiple categories in order to preserve some of the original structure. Prior work presents examples of such grouping based on the type of the predicates [13, 30, 47, 90] or based on their manually determined importance [21]. We group RDF triples into four main predicate types ptfor a given entity e as follows:

• Name: the subject is e, the object is a literal, and the predicate comes from

a predefined list (e.g., foaf:name or rdfs:label ) or ends with “name”, “label”, or “title”.

• Attributes: the subject is e, the object is a literal, and the predicate is not of

type name.

• OutRelations: the subject is e and the object is a URI. We resolve the URI

by replacing it with the name of the corresponding entity; see Section 4.3.1 for details.

• InRelations: e stands as the object; the subject entity URI is resolved.

Figure 4.2(b) presents an example. A separate language model θpt

e is estimated

for each predicate type from all predicates of that type (associated with the given entity): P (t|θpt e ) = tft,pt,e+ μptP (t|θ pt c ) |pt, e| + μpt , (4.3)

where tft,pt,e is the sum of the term frequencies for t for all predicates of type pt

associated with e, and|pt, e| =



ttft,pt,e. Essentially, we concatenate all text from

the predicates of type pt and then apply Dirichlet smoothing using a collection-

wide background model P (t|θpt

c ) (that is, a maximum-likelihood estimate from all

predicates of that type in the collection). The smoothing parameter μpt is set to

the average predicate type representation length in the collection.

Subsequently, the entity model is a linear mixture of the predicate type language models (P (t|θpt

e )). This is additionally weighted with the importance of that pred-

icate type (P (pt)): P (t|θe) =  pt P (t|θpt e )P (pt). (4.4)

Figure 4.1 (Middle) illustrates the term generation process. Viewing entities as documents and predicate types as document fields, this model is equivalent with the Mixture of Language Models approach by Ogilvie and Callan [86].