
8.3.3 Generalization into Patterns

The main goal is to generalize the extracted candidate patterns. To ensure flexibility in our approach, this process is performed using operations and strategies. A confidence score enables the selection of the best patterns.

Operations and Strategies

In our context, various operations are applied to candidate patterns to generalize them. Given a set of candidate patterns CP, an operation o ∈ O returns a subset CP′ of these candidate patterns. We do not describe all individual operations; rather, we present the main categories with a few examples.

• Clean. As the name suggests, this type of operation cleans the candidate patterns by removing useless words, plural forms, etc. It also discards irrelevant candidate patterns (e.g., those whose middle text t_m is too short).

• Tagging. This category enables the processing of natural language. For instance, one such operation uses Part-of-Speech (POS) tagging to annotate the candidate patterns and generalize them. Another relies on Named Entity Recognition to detect entities in the candidate patterns (dates, locations, works, persons, etc.).

• Merge. Candidate patterns may be very similar, in which case it is worthwhile to merge them. The merging decision usually requires a rule with a comparison function and a decision-maker. In this context, the comparison functions are similarity measures (N-grams, Jaro-Winkler, etc.), while the decision-maker is a threshold. For example, such a rule could be (3-grams similarity with threshold > 0.8).
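The merge rule can be illustrated with a minimal sketch. It assumes character 3-grams compared with a Dice coefficient and a greedy pass over the candidates; the function names, the choice of Dice, and the greedy order are illustrative assumptions, not the implementation used in this work.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over character n-gram sets (one possible comparison function)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        # Strings shorter than n have no n-grams; fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def merge(candidates, threshold=0.8):
    """Greedy merge: drop a candidate when an already kept one is similar above the threshold."""
    kept = []
    for cand in candidates:
        if all(ngram_similarity(cand, k) <= threshold for k in kept):
            kept.append(cand)
    return kept


merged = merge(["is a parody of", "is a parody of the", "was written by"])
# merged == ["is a parody of", "was written by"]
```

The operation returns a subset of its input, in line with the definition of an operation given above. Swapping in another similarity measure (e.g., Jaro-Winkler) only requires replacing the comparison function.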

We could define more operations based on other works. For instance, we could rely on the Falcons tool to remove entities in the candidate patterns [47], use different similarity metrics to merge the candidate patterns [50], etc. All these operations can be combined into a strategy.

A strategy is defined as a sequence of operations applied to a set of candidate patterns, and it returns a set of patterns. More formally, a strategy s = ⟨o_1, o_2, ..., o_n⟩ is a function such that s(CP) → P. The advantage of strategies is flexibility, since designing a new strategy is simple. In addition, a pool of different strategies reduces the probability of a “blockage” of the system. One of the simplest strategies consists of merging identical candidate patterns and POS-tagging them. In the next section, we present a contextual strategy based on frequent terms.
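The definition of a strategy as a sequence of operations can be sketched as plain function composition. The two operations below (`clean` and `merge_identical`) and the helper `make_strategy` are hypothetical names introduced for illustration; POS-tagging is left out to keep the sketch self-contained.

```python
from functools import reduce


def clean(candidates, min_len=3):
    """Hypothetical Clean operation: normalise whitespace, drop too-short middle texts."""
    normalised = (" ".join(c.split()) for c in candidates)
    return [c for c in normalised if len(c) >= min_len]


def merge_identical(candidates):
    """Merge identical candidate patterns, keeping the first-seen order."""
    return list(dict.fromkeys(candidates))


def make_strategy(*operations):
    """Build s = <o_1, ..., o_n>: each operation receives the output of the previous one."""
    def strategy(candidates):
        return reduce(lambda cps, op: op(cps), operations, list(candidates))
    return strategy


simple = make_strategy(clean, merge_identical)
patterns = simple(["is a  parody of", "is a parody of", "of", "was written by"])
# patterns == ["is a parody of", "was written by"]
```

Because a strategy is only an ordered list of operations, a new strategy is obtained by reordering or replacing the arguments of `make_strategy`.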

The Contextual Strategy

This strategy, used later in the experiments, is based on term frequency. Our intuition relies on two observations. First, most candidate patterns contain a few interesting terms that denote the type of relationship. For instance, the sentence “Bored of the Rings is a parody of Lord of the Rings” mainly includes one meaningful term (parody). This means that the verb “is” or the determiner “a” could be replaced by other terms of the same nature. Second, many similar approaches only consider the text between the labels of the two entities, and therefore miss interesting patterns. In contrast, SPIDER takes into account the whole context surrounding the two entities when needed and identifies the part(s) which should be stored as a pattern.

That is the reason for indexing the most frequent terms from all candidate patterns after a cleaning process. By applying stemming to these frequent terms and matching them against the WordNet dictionary, we are able to build clusters of concepts based on WordNet relationships such as synonyms, direct hyponyms, related terms, etc. Each cluster is labeled with one of its terms, i.e., the most central one for representing the concept given the Resnik distance between all terms [170]. The main issue is the selection of the relevant cluster(s) for the given type of relationship. Indeed, several clusters which represent distinct concepts could be created. For instance, the two entities “Lord of the Rings” and “Tolkien” may lead to the creation of three clusters: book, fantasy and writer. Therefore, we apply the Resnik distance between the label of each cluster and the type of relationship to select the relevant cluster(s). Finally, we use a POS-tagger for all words that are not frequent. A frequent term may be replaced by any related term from its cluster. This generality enables the merging of similar patterns. Examples of patterns are shown in Figure 8.2.
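The last step of the contextual strategy (frequent terms replaced by their cluster, other words by a POS tag) can be sketched as follows. This is a toy illustration under strong assumptions: the cluster and the POS lookup are hand-written stand-ins for the WordNet-based clustering and for a real POS-tagger, stemming and Resnik-based cluster selection are omitted, and the `{label}` output notation is hypothetical.

```python
def generalize(tokens, clusters, pos_tags):
    """Replace frequent terms by their cluster label and all other words by a POS tag."""
    term_to_label = {term: label
                     for label, terms in clusters.items()
                     for term in terms}
    pattern = []
    for tok in tokens:
        key = tok.lower()
        if key in term_to_label:
            pattern.append("{" + term_to_label[key] + "}")
        else:
            # "UNK" marks words the toy lookup does not know.
            pattern.append(pos_tags.get(key, "UNK"))
    return pattern


# Hand-written stand-in for a WordNet cluster labeled "parody" (assumption).
clusters = {"parody": {"parody", "spoof", "pastiche"}}
# Hand-written stand-in for a POS-tagger, using Penn Treebank tags (assumption).
pos_tags = {"is": "VBZ", "was": "VBD", "a": "DT", "of": "IN"}

p1 = generalize("is a parody of".split(), clusters, pos_tags)
p2 = generalize("is a spoof of".split(), clusters, pos_tags)
# p1 == p2 == ["VBZ", "DT", "{parody}", "IN"]
```

Both candidate patterns collapse to the same generalized form, which is what allows them to be merged.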

8.3.4 Selection of Patterns

The last issue deals with the selection or ranking of the generic patterns. To this end, a confidence score, noted $\mathit{conf}(p)$, is computed for each pattern p with Formula 8.1. Our intuition is to exploit all the information which allowed the discovery of the patterns and to compare a pattern with the other patterns of the same type of relationship.

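$$
\mathit{conf}(p) = \frac{\alpha \, \mathit{sup}_p + \beta \, \mathit{occ}_p + \gamma \, \mathit{prov}_p}{\alpha + \beta + \gamma} \qquad (8.1)
$$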

The support $\mathit{sup}_p$ is defined as the ratio between the number of examples $ex_p$ that this pattern is able to discover and the total number of examples $ex_\tau$ discovered by all patterns of the same type of relationship $\tau$. Note that the support cannot be computed at the first iteration.

Similarly, the occurrence $\mathit{occ}_p$ stands for the number of candidate patterns which led to the generation of the pattern p. It is normalized by the total number of candidate patterns $\mathit{occ}_\tau$ used to generalize all patterns of the same type of relationship $\tau$; Formula 8.1 uses this normalized value.

$$
\mathit{sup}_p = \frac{ex_p}{ex_\tau} \qquad\qquad \mathit{occ}_p = \frac{\mathit{occ}_p}{\mathit{occ}_\tau}
$$
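As a worked illustration of Formula 8.1 and of the two ratios above, the following sketch computes a confidence score from raw counts. The function names, the default weights of 1.0, the example numbers, and the choice of returning 0.0 when a ratio is undefined (e.g., the support at the first iteration) are all assumptions made for the example.

```python
def support(ex_p, ex_tau):
    """sup_p = ex_p / ex_tau; 0.0 when no example exists yet (assumed convention)."""
    return ex_p / ex_tau if ex_tau else 0.0


def occurrence(occ_p, occ_tau):
    """Normalized occurrence: occ_p / occ_tau; 0.0 when undefined (assumed convention)."""
    return occ_p / occ_tau if occ_tau else 0.0


def confidence(sup, occ, prov, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted average of support, occurrence and provenance (Formula 8.1)."""
    return (alpha * sup + beta * occ + gamma * prov) / (alpha + beta + gamma)


# Hypothetical counts: the pattern found 12 of 40 examples and stems from 5 of 20 candidates.
sup = support(12, 40)        # 0.3
occ = occurrence(5, 20)      # 0.25
conf = confidence(sup, occ, prov=0.5)   # approximately 0.35 with equal weights
```

Since the three components lie in [0, 1] and the weights are normalized by their sum, the confidence score also lies in [0, 1] for non-negative weights.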

The provenance $\mathit{prov}_p$ refers to the relevance of the documents from which the candidate patterns that generalize a given pattern have been extracted. The relevance is evaluated with three metrics. The relevance score applies tf-idf to the content of the document, and its values are bounded by a maximal value which depends on the query. PageRank is widely known due to the Google search engine; the PageRank scores of our collection are in the range [0.15, 10]. Finally, SpamScore indicates the probability that a document is spam [129]. The idea is to average the scores returned by these three metrics over all documents from which patterns have been derived. We note $D_p$ this set of documents for the pattern p, and $d_p^i$ a document of this set. The following formula computes a score in the range [0, 1] to evaluate the average relevance of this set of documents, and thus the provenance of the pattern:

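$$
\mathit{prov}_p = \frac{\displaystyle\sum_{d_p^i \in D_p} \frac{1}{3} \left( \frac{\mathit{relevance}(d_p^i)}{\max(\mathit{relevance}(D_p))} + \frac{\mathit{spamscore}(d_p^i)}{100} + \frac{\mathit{prank}(d_p^i)}{10} \right)}{|D_p|}
$$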
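The provenance formula can be sketched in a few lines. The dictionary layout of a document, the example values, and the reading of max(relevance(D_p)) as the largest relevance score within the set D_p are assumptions made for this illustration.

```python
def provenance(docs):
    """Average, over the documents of D_p, of the mean of three normalized scores.

    Each document is a dict with the keys 'relevance', 'spamscore' (0 to 100)
    and 'prank' (0 to 10); this layout is hypothetical.
    """
    if not docs:
        return 0.0
    # Assumption: the normalizing maximum is taken over the documents of D_p.
    max_rel = max(d["relevance"] for d in docs)
    total = 0.0
    for d in docs:
        rel = d["relevance"] / max_rel if max_rel else 0.0
        total += (rel + d["spamscore"] / 100 + d["prank"] / 10) / 3
    return total / len(docs)


docs = [
    {"relevance": 8.0, "spamscore": 90, "prank": 5.0},   # (1.0 + 0.9 + 0.5) / 3
    {"relevance": 4.0, "spamscore": 70, "prank": 2.0},   # (0.5 + 0.7 + 0.2) / 3
]
prov = provenance(docs)   # approximately 0.633
```

Each of the three terms is divided by its maximal value before averaging, which is what keeps the result in the range [0, 1].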

8.4 Relationship Discovery

In the previous step, we have generated patterns for a given type of relationship. Our approach aims at discovering relationships between entities in three different use cases: discovering the type of relationship, searching for an entity, and discovering new examples. Since these use cases roughly tackle the same challenge from different angles, we first describe the different issues related to the exploitation of the patterns.

In document Extracting Knowledge for Cultural Heritage Knowledge Base Population (Page 170-173)