Probabilistic Framework - Domain Description

5.4 Domain Description - Information Extraction

5.4.2 Probabilistic Framework

A probabilistic classifier for relationships intuitively answers the question: “Which relationship is likely expressed when two entities appear with these patterns?” It is also manually verifiable, which makes it a good candidate for a prototype application.

Before discussing solutions to this problem, the general idea of probabilistic classification is outlined.

Table 5.4: Terminology

Symbol        Meaning
C             Concept
𝒞             Set of all concepts
S, O          Subject/object concept of the triple
L             Term/label
L_S, L_O      Term expressing the subject/object of a triple
T_S, T_O      Semantic type/class of the subject/object
P             Surface pattern
M_D, M_R      Domain and range prior probability matrices

Derivation of the Classifier. The following paragraphs derive the probabilistic extraction framework. IE is thereby cast as a classification of concept pairs into relationship types, using surface patterns as features. The terminology used in the derivation is given in Table 5.4.

It is important to observe that a distinction is made between the concepts that participate in a relationship and the labels that denote those concepts in text. The extraction algorithm operates on concepts rather than labels. However, since concepts do not themselves appear in text, they have to be grounded in text through their denoting labels. Because this mapping is ambiguous, a single occurrence of a term pair in text is not enough to indicate the concept pair that is sought. Just as multiple pattern occurrences are mapped to relationship types, mapping multiple occurrences of different terms to one concept ensures proper reference.

Suppose a concept pair ⟨S, O⟩ from the set of all concepts on Wikipedia, 𝒞, is to be classified into one or more relationship types R. The general goal is hence to estimate the conditional probability p(R|S, O) and to identify all relationships R_{S,O} that relate S to O with sufficiently high confidence (Equation 5.14).

\[ R_{S,O} = \{\, R \in \mathcal{R} \mid p(R \mid S, O) > \theta_{rel} \,\} \tag{5.14} \]
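The selection step in Equation 5.14 can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function name, the relationship names, the scores, and the threshold value are all hypothetical.

```python
from typing import Dict, Set


def select_relationships(posteriors: Dict[str, float], threshold: float) -> Set[str]:
    """Return every relationship whose score p(R|S,O) strictly exceeds the threshold."""
    return {rel for rel, p in posteriors.items() if p > threshold}


# Hypothetical scores p(R|S,O) for a single concept pair (invented values).
scores = {"capital_of": 0.62, "located_in": 0.31, "born_in": 0.02}
selected = select_relationships(scores, threshold=0.25)
```

Because the result is a set rather than a single best label, a concept pair can be assigned several relationship types at once, which matches the "one or more relationship types" reading of the classification task.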

The features of the classifier are the patterns found in text that are potential manifestations of the concept pair and the relationship, as described in Section 5.4.1. Free text, however, will contain ambiguous terms that denote S and O. Section 5.3.3 explained how to find probabilities for synonyms of article names on Wikipedia (Equation 5.12).

Ideally, the joint probability that a term pair indicates a concept pair, i.e. p(S, O|L_S, L_O), would be computed, because the two terms help disambiguate each other. For example, the term pair ⟨table, chair⟩ puts both terms in the realm of furniture, whereas the term pair ⟨table, column⟩ puts them in the spreadsheet or database field. Unfortunately, this joint probability a) is expensive to obtain, considering that over 9 million terms were identified that link to about 3.7 million Wikipedia articles, b) restricts the classifier to term pairs that have already been analyzed, and c) is even more corpus-dependent than the probability that a single term denotes a single concept. For this reason we assume conditional independence of p(S|L_S) and p(O|L_O).
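The effect of this independence assumption can be illustrated with a small Python sketch in which the joint p(S, O|L_S, L_O) is approximated by the product p(S|L_S) · p(O|L_O). The sense table, the concept names, and all probability values are invented for illustration; in the framework described here, the per-term probabilities would come from the synonym statistics of Section 5.3.3.

```python
from typing import Dict

# term -> {concept: p(concept | term)}; all values are invented for illustration
SenseTable = Dict[str, Dict[str, float]]


def pair_probability(senses: SenseTable, term_s: str, term_o: str,
                     subject: str, obj: str) -> float:
    """Approximate p(S, O | L_S, L_O) as p(S | L_S) * p(O | L_O)."""
    p_s = senses.get(term_s, {}).get(subject, 0.0)
    p_o = senses.get(term_o, {}).get(obj, 0.0)
    return p_s * p_o


senses: SenseTable = {
    "table": {"Table (furniture)": 0.6, "Table (database)": 0.4},
    "chair": {"Chair": 0.9, "Chairperson": 0.1},
    "column": {"Column (database)": 0.3, "Column (architecture)": 0.7},
}

furniture = pair_probability(senses, "table", "chair", "Table (furniture)", "Chair")
database = pair_probability(senses, "table", "column",
                            "Table (database)", "Column (database)")
```

Note what the approximation gives up: the factor for "table" is the same whichever term accompanies it, so the mutual disambiguation that a true joint probability would provide is lost. In exchange, the table only needs one entry per term rather than one per term pair.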

A factor that helps in classification, but depends on the availability of background knowledge, is information on the domain and range of relationships, expressed here as the probability p(R|T_S, T_O) of seeing a relationship given a domain T_S and a range T_O. Here as well, the joint probabilities are too restrictive for open extraction, so we assume independence of p(R|T_S) and p(R|T_O).

Depending on the background knowledge that is present in the form of ontologies, taxonomies and dictionaries, the terms p(S|L_S), p(O|L_O), p(R|T_S) and p(R|T_O) may be only partially available or not available at all. In this case the classifier operates purely on the language model, rather than on the combined semantic model/language model. In a well-designed ontology, relationships will be assigned domains and ranges. In community-created sources such as DBpedia, however, this is more difficult. Even though an ontology exists that covers the entities on DBpedia, it is too coarse-grained to properly match the models produced in the Domain Definition step, and it does not assign domain and range restrictions to all properties. In this case, the probabilities p(R|T_S) and p(R|T_O) can be derived bottom-up, by analyzing the category coverage of the facts in the KB, as shown in Section 5.4.5.
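One plausible reading of such a bottom-up derivation is a relative-frequency estimate over the facts in the KB. The Python sketch below is an assumption made for illustration only; the actual procedure is the one given in Section 5.4.5 and may differ. The function name, the fact triples, and the type assignments are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Fact = Tuple[str, str, str]  # (subject, relationship, object)


def estimate_domain_priors(facts: Iterable[Fact],
                           types_of: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate p(R | t_S) as the share of facts with a subject of type t_S that use R.

    This relative-frequency estimate is an assumption of this sketch.
    """
    pair_counts: Counter = Counter()  # (relationship, type) -> number of facts
    type_counts: Counter = Counter()  # type -> number of facts with such a subject
    for subject, relationship, _ in facts:
        for t in types_of.get(subject, []):
            pair_counts[(relationship, t)] += 1
            type_counts[t] += 1
    priors: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (relationship, t), n in pair_counts.items():
        priors[relationship][t] = n / type_counts[t]
    return dict(priors)


# Hypothetical KB facts and type assignments (invented for illustration).
facts = [
    ("Berlin", "capital_of", "Germany"),
    ("Paris", "capital_of", "France"),
    ("Berlin", "located_in", "Europe"),
    ("Goethe", "born_in", "Frankfurt"),
]
types_of = {"Berlin": ["City"], "Paris": ["City"], "Goethe": ["Person"]}
domain_priors = estimate_domain_priors(facts, types_of)
```

The range priors p(R|t_O) could be estimated the same way by counting over the object position instead of the subject position.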

The third and central component of the probabilistic framework is the set of relationship-pattern probabilities p(R|P), i.e. the probability of seeing a relationship in the presence of a specific pattern or, in vector form, a vector of relationship probabilities given a vector of pattern frequencies. Separating p(R|P) allows us to build a fixed pattern representation for relationships.

Figure 5.5 depicts a Bayesian Network that graphically models the classifier, showing how it operates on a semantic model and a statistical language model in a unified manner.

Figure 5.5: Network representation of the classifier.

The probability p(R, S, O) of a relationship occurring with a subject and an object can be rewritten in terms of p(R, S, O, P, L_S, L_O, T_S, T_O), which, based on the Bayesian Network, is formalized in Equation 5.15. The equation makes use of the independence assumptions for p(S|L_S), p(O|L_O), p(R|T_S) and p(R|T_O), and approximates p(R|T_S, P, T_O) as the product of p(R|T_S), p(R|T_O) and p(R|P). Specifically, since the probability that a relationship R holds between S and O is computed over all the patterns that L_S and L_O appear in, the classifier sums over the probabilities of all occurrences of a pattern with S and O, each weighted by the probability that the pattern indicates the relationship R. For the types, the probability is maximized over the domain and range types, reflecting that T_S and T_O form hierarchies: the type that has the strongest support as a domain or range for R is chosen.

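\[
p(R, S, O) \approx \sum_{\substack{L_S \in S \\ L_O \in O \\ P \in \mathit{docs}}} p(P \mid L_S, L_O) \cdot p(R \mid P) \cdot p(S \mid L_S) \cdot p(O \mid L_O) \cdot \max_{t_S \in T_S} p(R \mid t_S) \cdot \max_{t_O \in T_O} p(R \mid t_O) \tag{5.15}
\]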
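To make the structure of Equation 5.15 concrete, the following Python sketch evaluates it over small lookup tables. It is an illustration under stated assumptions, not the thesis implementation: the function name, the dictionary layout, and every probability value are hypothetical, and "L_S ∈ S" is read as "the labels known to denote S".

```python
from typing import Dict, List, Tuple


def score_relationship(
    relationship: str,
    subject: str,
    obj: str,
    labels: Dict[str, List[str]],                            # concept -> denoting terms
    p_concept_given_term: Dict[str, Dict[str, float]],       # term -> concept -> p(C|L)
    pattern_probs: Dict[Tuple[str, str], Dict[str, float]],  # (L_S, L_O) -> pattern -> p(P|L_S,L_O)
    p_rel_given_pattern: Dict[str, Dict[str, float]],        # pattern -> relationship -> p(R|P)
    types: Dict[str, List[str]],                             # concept -> types, including ancestors
    domain_prior: Dict[str, Dict[str, float]],               # relationship -> type -> p(R|t_S)
    range_prior: Dict[str, Dict[str, float]],                # relationship -> type -> p(R|t_O)
) -> float:
    """Evaluate the right-hand side of Equation 5.15 for one relationship."""
    evidence = 0.0
    for term_s in labels.get(subject, []):
        p_s = p_concept_given_term.get(term_s, {}).get(subject, 0.0)
        for term_o in labels.get(obj, []):
            p_o = p_concept_given_term.get(term_o, {}).get(obj, 0.0)
            for pattern, p_pat in pattern_probs.get((term_s, term_o), {}).items():
                p_rel = p_rel_given_pattern.get(pattern, {}).get(relationship, 0.0)
                evidence += p_pat * p_rel * p_s * p_o
    # Pick the best-supported type in each hierarchy. An empty type list yields 0.0
    # in this sketch; a neutral factor of 1.0 would be an alternative design.
    best_domain = max((domain_prior.get(relationship, {}).get(t, 0.0)
                       for t in types.get(subject, [])), default=0.0)
    best_range = max((range_prior.get(relationship, {}).get(t, 0.0)
                      for t in types.get(obj, [])), default=0.0)
    return evidence * best_domain * best_range


# Invented toy data for one concept pair.
labels = {"Berlin": ["Berlin"], "Germany": ["Germany", "Deutschland"]}
p_concept_given_term = {
    "Berlin": {"Berlin": 0.8},
    "Germany": {"Germany": 0.9},
    "Deutschland": {"Germany": 1.0},
}
pattern_probs = {
    ("Berlin", "Germany"): {"X is the capital of Y": 0.5, "X in Y": 0.5},
    ("Berlin", "Deutschland"): {"X is the capital of Y": 1.0},
}
p_rel_given_pattern = {
    "X is the capital of Y": {"capital_of": 0.9},
    "X in Y": {"capital_of": 0.1, "located_in": 0.7},
}
types = {"Berlin": ["City", "Place"], "Germany": ["Country", "Place"]}
domain_prior = {"capital_of": {"City": 0.5, "Place": 0.2}}
range_prior = {"capital_of": {"Country": 0.6, "Place": 0.1}}

score = score_relationship("capital_of", "Berlin", "Germany", labels,
                           p_concept_given_term, pattern_probs, p_rel_given_pattern,
                           types, domain_prior, range_prior)
```

In this sketch the summed evidence is not renormalized across term pairs, so the result is best read as a score for ranking or thresholding rather than as a calibrated probability.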

The values for p(R|P) are the most difficult to derive, because a distant supervision approach is used for training, without access to negative training data from which to learn these probabilities. Also, as mentioned above, no a priori knowledge of the relationship semantics is assumed, and thus the extent of the semantic overlap between relationships needs to be obtained during training. In the following, I describe the distantly supervised training process that creates a vector space representation of p(R|P). The next subsection describes the general acquisition procedure, and Subsection 5.4.4 details the derivation of the pertinence measure that modifies p(R|P) to account for intensional and extensional relational similarity.
