8.3.3 Generalization into Patterns
The main goal is to generalize the extracted candidate patterns. To ensure flexibility in our approach, this process is performed using operations and strategies. A confidence score enables the selection of the best patterns.
Operations and Strategies
In our context, various operations are applied to candidate patterns to generalize them. Given a set of candidate patterns CP, an operation o ∈ O returns a subset CP′ of these candidate patterns. We do not describe all individual operations; rather, we present the main categories with a few examples.
• Clean. As the name suggests, this type of operation cleans the candidate patterns by removing useless words, plural forms, etc. It also discards irrelevant candidate patterns (e.g., those whose middle text t_m is too short).
• Tagging. This category enables the processing of natural language. For instance, one such operation relies on Part-of-Speech (POS) tagging to annotate the candidate patterns and generalize them. Another relies on Named Entity Recognition to detect entities in the candidate patterns (dates, locations, works, persons, etc.).
• Merge. Candidate patterns may be very similar, in which case it is worth merging them. The decision to merge usually requires a rule with a comparison function and a decision-maker. In this context, the comparison functions are similarity measures (N-grams, Jaro-Winkler, etc.) while the decision-maker is a threshold. For example, such a rule could be (3-grams similarity with threshold > 0.8).
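The merge rule above (a comparison function plus a threshold) can be sketched as follows. This is a minimal illustration, not the implementation used in the approach: the function names are hypothetical, and the Jaccard overlap of character trigrams is only one possible reading of "3-grams similarity".

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def trigram_similarity(a, b):
    """Jaccard overlap of the trigram sets of a and b, in [0, 1] (assumed measure)."""
    ga, gb = ngrams(a), ngrams(b)
    return len(ga & gb) / len(ga | gb)


def should_merge(a, b, compare=trigram_similarity, threshold=0.8):
    """Merge rule: the comparison function feeds a threshold decision-maker."""
    return compare(a, b) > threshold


# Identical candidate patterns are always merged; unrelated ones are not.
same = should_merge("is a parody of", "is a parody of")
different = should_merge("is a parody of", "was written by")
```

Passing another similarity measure (e.g., Jaro-Winkler) as `compare` changes the comparison function without touching the decision-maker.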
We could define more operations based on other works. For instance, we could rely
on the Falcons tool to remove entities in the candidate patterns [47], use different
similarity metrics to merge the candidate patterns [50], etc. All these operations can
be combined into a strategy.
A strategy is defined as a sequence of operations applied to a set of candidate patterns, and it returns a set of patterns. More formally, a strategy s = ⟨o1, o2, ..., on⟩ is a function such that s(CP) → P. The advantage of strategies is flexibility, since designing a new strategy is simple. In addition, a pool of different strategies reduces the probability of a “blockage” of the system. One of the simplest strategies consists of merging identical candidate patterns and POS-tagging them. In the next section, we present a contextual strategy based on frequent terms.
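To illustrate the definition of a strategy as a sequence of operations, the following sketch chains two toy operations. All names are hypothetical, the two-word minimum in the cleaning step is an arbitrary assumption, and the POS-tagging step of the simple strategy is left out because it would require an external NLP library.

```python
from functools import reduce


def clean(candidates):
    """Toy Clean operation: normalize case and whitespace, drop too-short texts."""
    cleaned = [" ".join(c.lower().split()) for c in candidates]
    return [c for c in cleaned if len(c.split()) >= 2]  # assumed minimum length


def merge_identical(candidates):
    """Toy Merge operation: keep one copy of each identical candidate, in order."""
    return list(dict.fromkeys(candidates))


def make_strategy(*operations):
    """Build s = <o1, ..., on>: apply each operation to the output of the previous one."""
    def strategy(candidates):
        return reduce(lambda cp, op: op(cp), operations, candidates)
    return strategy


simple_strategy = make_strategy(clean, merge_identical)
patterns = simple_strategy(
    ["is a  parody of", "Is a parody of", "of", "was written by"]
)
# patterns == ['is a parody of', 'was written by']
```

Since a strategy is just an ordered list of functions, a new one is obtained by reordering or swapping operations, which is the flexibility argued for above.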
The Contextual Strategy
This strategy, used later in the experiments, is based on term frequency. Our intuition is based on two facts. First, most candidate patterns contain a few interesting terms to denote the type of relationship. For instance, the sentence “Bored of the Rings is a parody of Lord of the Rings” mainly includes one meaningful term (parody). This means that the verb “is” or the determiner “a” could be replaced by other terms of the same nature. Second, many similar approaches only consider the text between the labels of the two entities, and therefore miss interesting patterns. In contrast, SPIDER takes into account the whole context surrounding the two entities when needed, and identifies the part(s) which should be stored as a pattern.
That is the reason for indexing the most frequent terms from all candidate patterns after a cleaning process³. By applying stemming to these frequent terms and matching them against the WordNet dictionary, we are able to build clusters of concepts based on WordNet relationships such as synonyms, direct hyponyms, related terms, etc. Each cluster is labeled using one of its terms, i.e., the most central one for representing the concept given the Resnik distance between all terms [170]. The main issue is the selection of the relevant cluster(s) for the given type of relationship. Indeed, several clusters which represent distinct concepts could be created. For instance, the two entities “Lord of the Rings” and “Tolkien” may lead to the creation of three clusters: book, fantasy and writer. Therefore, we apply the Resnik distance between the label of each cluster and the type of relationship to select the relevant cluster(s). Finally, we use a POS-tagger for all words that are not frequent. A frequent term may be replaced by any related term from its cluster. This generality enables the merging of similar patterns. Examples of patterns are shown in Figure 8.2.
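The final generalization step of the contextual strategy can be sketched as below, under strong simplifying assumptions. The WordNet clustering and the Resnik-based cluster selection are not reproduced: the cluster assignment (`cluster_of`), the POS lookup (`pos_of`), the stop-word list, the frequency cut-off and the `{label}` pattern syntax are all hypothetical stand-ins chosen for illustration.

```python
from collections import Counter

STOPWORDS = {"is", "was", "a", "the", "of", "by"}  # assumed output of the cleaning step


def frequent_terms(candidates, min_count=2):
    """Index the terms that occur at least min_count times across all candidates."""
    counts = Counter(
        w for c in candidates for w in c.lower().split() if w not in STOPWORDS
    )
    return {w for w, n in counts.items() if n >= min_count}


def generalize(candidate, frequent, cluster_of, pos_of):
    """Replace frequent terms by their cluster label and other words by a POS tag."""
    tokens = []
    for w in candidate.lower().split():
        if w in frequent and w in cluster_of:
            tokens.append("{" + cluster_of[w] + "}")  # any term of the cluster fits
        else:
            tokens.append(pos_of.get(w, "X"))  # "X" marks an unknown tag
    return " ".join(tokens)


candidates = ["is a parody of", "was a parody of", "is a spoof of", "a spoof of"]
cluster_of = {"parody": "parody", "spoof": "parody"}  # hand-made selected cluster
pos_of = {"is": "VERB", "was": "VERB", "a": "DET", "of": "ADP"}  # toy POS lookup

frequent = frequent_terms(candidates)
generalized = {generalize(c, frequent, cluster_of, pos_of) for c in candidates}
```

In this toy run, “is a parody of”, “was a parody of” and “is a spoof of” collapse into the same generic pattern, which is the merging effect described above.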
8.3.4 Selection of Patterns
The last issue deals with the selection or ranking of the generic patterns. To this end, a confidence score denoted conf(p) is computed for each pattern p with Formula 8.1. Our intuition is to exploit all the information which allowed the discovery of the patterns and to compare a pattern with the ones of the same type of relationship.
\[
\mathit{conf}(p) = \frac{\alpha \, \mathit{sup}_p + \beta \, \mathit{occ}_p + \gamma \, \mathit{prov}_p}{\alpha + \beta + \gamma} \tag{8.1}
\]
The support sup_p is defined as the ratio between the number of examples ex_p that this pattern is able to discover and the total number of examples ex_τ discovered by all patterns of the same type of relationship τ. Note that the support cannot be computed at the first iteration.
Similarly, the occurrence occ_p stands for the number of candidate patterns which led to the generation of the pattern p. It is normalized by the total number occ_τ of candidate patterns used to generalize all patterns of the same type of relationship τ.
\[
\mathit{sup}_p = \frac{\mathit{ex}_p}{\mathit{ex}_\tau} \qquad \mathit{occ}_p = \frac{\mathit{occ}_p}{\mathit{occ}_\tau}
\]
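As a worked illustration of Formula 8.1, the sketch below computes the two ratios and the weighted confidence score. The function names, the equal default weights and all numeric values are assumptions made for the example; the provenance value is taken as given, and the handling of the first iteration (where the support is unavailable) is left to the caller since it is not specified here.

```python
def ratio(part, total):
    """Normalize a count by a total, returning 0.0 when the total is zero."""
    return part / total if total else 0.0


def confidence(sup, occ, prov, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted average of support, occurrence and provenance (Formula 8.1)."""
    return (alpha * sup + beta * occ + gamma * prov) / (alpha + beta + gamma)


# Hypothetical counts for one pattern p of relationship type tau.
sup_p = ratio(12, 40)   # 12 of the 40 examples found for tau
occ_p = ratio(5, 20)    # 5 of the 20 candidate patterns used for tau
prov_p = 0.65           # assumed provenance score in [0, 1]

conf_p = confidence(sup_p, occ_p, prov_p)  # approximately 0.4 with equal weights
```

Because the three inputs lie in [0, 1] and the weights are normalized by their sum, the confidence score also lies in [0, 1] for non-negative weights.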
The provenance prov_p refers to the relevance of the documents from which the candidate patterns generalized into a given pattern have been extracted. The relevance is evaluated with three metrics. The relevance score applies tf-idf to the content of the document, and its values are bounded by a maximal value which depends on the query. PageRank⁴ is widely known due to the Google search engine; the PageRank scores of our collection are in the range [0.15, 10]. Finally, SpamScore indicates the probability that a document is spam or not [129]. The idea is to average the scores returned by these three metrics over all documents from which patterns have been derived. We denote by D_p this set of documents for the pattern p, and by d_p^i a document of this set. The following formula computes a score in the range [0, 1] to evaluate the average relevance of this set of documents, and thus the provenance of the pattern:
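\[
\mathit{prov}_p = \frac{\displaystyle \sum_{d_p^i \in D_p} \frac{1}{3} \left( \frac{\mathit{relevance}(d_p^i)}{\max(\mathit{relevance}(D_p))} + \frac{\mathit{spamscore}(d_p^i)}{100} + \frac{\mathit{prank}(d_p^i)}{10} \right)}{|D_p|}
\]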
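The provenance formula can be sketched as follows. The dictionary keys and the sample scores are hypothetical, and the scales are inferred only from the normalizers in the formula (SpamScore divided by 100, PageRank divided by 10, relevance divided by the maximum over the set).

```python
def provenance(docs):
    """Average, over the documents of D_p, of the mean of three normalized scores."""
    if not docs:
        return 0.0
    max_rel = max(d["relevance"] for d in docs)
    total = 0.0
    for d in docs:
        rel = d["relevance"] / max_rel if max_rel else 0.0
        total += (rel + d["spamscore"] / 100 + d["prank"] / 10) / 3
    return total / len(docs)


# Two hypothetical documents from which the candidate patterns of p were extracted.
docs_p = [
    {"relevance": 2.0, "spamscore": 90, "prank": 5.0},
    {"relevance": 1.0, "spamscore": 60, "prank": 1.0},
]
prov_p = provenance(docs_p)  # approximately 0.6 for these sample values
```

Each of the three terms is at most 1 after normalization, so the per-document mean and the final average stay within [0, 1].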
8.4 Relationship Discovery
In the previous step, we have generated patterns for a given type of relationship. Our approach aims at discovering relationships between entities in three different use cases: discovering the type of relationship, searching for an entity, and discovering new examples. Since these use cases roughly tackle the same challenge from different angles, we first describe the different issues related to the exploitation of the patterns.