Keywords Disambiguation Algorithm - Usage-Based Profile Modeling

Algorithm 5.1. Keywords Disambiguation Algorithm

Input: Q = {w0, w1, ..., wn} /*the keyword query Q

Output: Sb_Q = {sb_0, sb_1, ..., sb_n} /*the set of synsets corresponding to Q

1: Create G(S_Q, E) /*G is the graph of Q, S_Q is a combination of synsets corresponding to Q, and E is the set of edges, each weighted by the length of the shortest path between a pair of synsets

2: for all node N in S_Q do /*Initialize the nodes

3: Initialize N with the first synset of the corresponding keyword in Q

4: Mark N as non-final and non-visited

5: end for

6: for all edge e in E do /*Initialize the edges

7: Initialize e with the length of the shortest path between the two first synsets

8: end for

9: Calculate the MST of G

10: while G contains non-final nodes do

11: for all non-visited node N do

12: Mark N as visited

13: Gtemp = G

14: if N is non-final then /*N represents the keyword w and sw is the set of synsets of w

15: Replace the current synset s by s+, the next synset in (sw,<)

16: Update G

17: Calculate the MST of G

18: if MST is not improved then

19: Mark N as final

20: end if

21: Replace s+ by s /*return to the previous synset

22: G = Gtemp /*return to the previous graph

23: end if

24: end for

25: for all N non-final do

26: Replace s by s+

27: Mark N as non-visited

28: end for

29: Update G

30: end while

31: Return the list of final nodes corresponding to Sb_Q

In the next step, only non-final nodes are considered. The process continues until all nodes are marked as final, which means that no further improvement is possible. The synsets of the final nodes approximate the best combination of the query synsets (Algorithm 5.1).
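The loop of Algorithm 5.1 can be sketched in Python as follows. This is a minimal illustration under several assumptions that the text does not state: "improved" is read as a strictly lower total MST weight, a keyword whose candidate list is exhausted is marked final, and the MST weight is computed with a simple Prim's algorithm rather than the linear-time algorithm of [FW94]. The names `mst_weight`, `disambiguate` and `toy_dist`, as well as the synset labels and distances, are hypothetical and stand in for WordNet shortest-path lengths.

```python
def mst_weight(choice, dist):
    """Total MST weight over one chosen synset per keyword (Prim's algorithm)."""
    n = len(choice)
    if n < 2:
        return 0
    in_tree = {0}
    total = 0
    while len(in_tree) < n:
        # Cheapest edge connecting the tree to a keyword outside it.
        w, j = min(
            (dist(choice[i], choice[j]), j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        in_tree.add(j)
        total += w
    return total


def disambiguate(synsets, dist):
    """synsets: one ranked, non-empty list of candidate synsets per keyword."""
    idx = [0] * len(synsets)                # start from the first synset
    final = [len(s) == 1 for s in synsets]  # assumption: nothing left to try

    def current():
        return [s[i] for s, i in zip(synsets, idx)]

    while not all(final):
        best = mst_weight(current(), dist)
        # Try the next synset of each non-final keyword, one at a time.
        for k in range(len(synsets)):
            if final[k]:
                continue
            idx[k] += 1
            if mst_weight(current(), dist) >= best:
                final[k] = True             # no improvement: freeze keyword
            idx[k] -= 1                     # return to the previous synset
        # Advance every keyword that is still non-final.
        for k in range(len(synsets)):
            if not final[k]:
                idx[k] += 1
                if idx[k] == len(synsets[k]) - 1:
                    final[k] = True         # assumption: candidates exhausted
    return current()


# Hypothetical distances standing in for WordNet shortest-path lengths.
TOY = {
    frozenset(("bank.finance", "river.stream")): 8,
    frozenset(("bank.slope", "river.stream")): 2,
}


def toy_dist(a, b):
    return 0 if a == b else TOY[frozenset((a, b))]


query = [["bank.finance", "bank.slope"], ["river.stream"]]
print(disambiguate(query, toy_dist))
```

In this toy query, the second synset of "bank" lowers the MST weight from 8 to 2, so it replaces the first one before the node becomes final.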

We note that in the particular case of a query composed of keywords that are unrelated in WordNet, the query is split into several sub-queries (i.e., sub-graphs) and the disambiguation method is applied separately to each sub-query. For a sub-query consisting of a single keyword, the first synset is automatically selected.
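The splitting into sub-queries amounts to finding the connected components of the keyword graph. The sketch below assumes a hypothetical predicate `related(a, b)` that tells whether two keywords are connected in WordNet; the function name `split_query` and the toy relation are illustrative only.

```python
def split_query(keywords, related):
    """Group keywords into sub-queries (connected components)."""
    groups = []
    unseen = list(keywords)
    while unseen:
        stack = [unseen.pop(0)]
        group = []
        while stack:
            w = stack.pop()
            group.append(w)
            # Pull in every remaining keyword related to w.
            linked = [v for v in unseen if related(w, v)]
            for v in linked:
                unseen.remove(v)
            stack.extend(linked)
        groups.append(group)
    return groups


# Hypothetical relatedness: only "bank" and "river" are connected.
LINKS = {frozenset(("bank", "river"))}


def toy_related(a, b):
    return frozenset((a, b)) in LINKS


print(split_query(["bank", "river", "piano"], toy_related))
```

Each resulting group would then be disambiguated on its own, and a group with a single keyword simply keeps its first synset.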

The complexity of the disambiguation method depends on the number of keywords in the query and on the number of synsets of each keyword. Let $n$ be the number of keywords and $k_1, \dots, k_n$ be the numbers of synsets corresponding respectively to keywords $w_1, \dots, w_n$. The complexity of the MST algorithm is $O(n + m)$ [FW94] ($n$ being the number of nodes and $m$ the number of edges). According to Algorithm 5.1, the MST is recalculated each time a synset is replaced by the next synset. Thus, the MST algorithm is applied at most $\sum_{i=1}^{n} k_i$ times, because the method follows a greedy approach and hence a synset is visited only once. Consequently, the complexity of the disambiguation heuristics is equal to $\sum_{i=1}^{n} k_i \times O(n + m)$. Let $k_{max}$ be the maximum number of synsets that a keyword in Q can have. The disambiguation complexity is then equal to $k_{max} \times n \times O(n + m) = O(n^2 + nm)$, $m$ being the number of edges among the graphs of Q. In the worst case of a complete graph (i.e., $m = \frac{n(n-1)}{2}$), the complexity of the disambiguation is equal to $O(n^3)$, which is polynomial.
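The steps of this bound can be written out explicitly. The derivation below is a sketch that assumes $k_{max}$ is treated as a constant independent of $n$, which the final $O(n^2 + nm)$ expression implies but the text does not state.

```latex
\begin{align*}
\sum_{i=1}^{n} k_i &\le n \cdot k_{max}, \\
\Big(\sum_{i=1}^{n} k_i\Big) \times O(n + m)
  &\subseteq k_{max} \times n \times O(n + m) = O(n^2 + nm)
  \quad (k_{max} \text{ constant}), \\
m = \frac{n(n-1)}{2}
  &\;\Rightarrow\; O\!\left(n^2 + n \cdot \frac{n(n-1)}{2}\right) = O(n^3).
\end{align*}
```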

The proposed method has three main advantages. First, it relies on a reliable source of semantics, namely WordNet, with possible mappings to other resources (cf. sec. 5.2.1.3). Second, it finds the sense of a keyword with satisfactory precision (cf. sec. 5.5.1) by considering its context. Third, it is based on a greedy algorithm, which keeps its complexity polynomial.

Despite these strong points, the method can suffer from some problems. The first problem is related to the size of the query (cf. sec. 5.5.1). The method works with queries of multiple keywords and supposes that the query contains only one subject (i.e., the context). If a query contains several sub-subjects, the method can fail to find the right synset, since it cannot determine the context of the keyword. The method therefore works best with relatively short queries. The second problem is related to the coverage of WordNet. Even though the WordNet database covers most English words, it contains only a few named entities. Fortunately, thanks to the mapping possibility included in WordNet, its coverage can be extended by considering other databases of named entities such as DBpedia (cf. sec. 4.2.3). Such a mapping is, for instance, realized in the YAGO project, which aims to unify WordNet and Wikipedia [SKW07]. The third problem is related to the heuristics. Its principle is to exclude a node from the next step when it does not improve the MST. This heuristic may fail (i.e., its precision can be affected) if the excluded node could have improved the MST when considered together with nodes of later steps. However, we consider this case to be rare since it concerns synsets of less common senses.

5.2.1.3. Discussion: WordNet as a Main Source of Semantics

There are several reasons that motivated us to choose WordNet as a main source of semantics:

1. First, WordNet is one of the richest thesauri, as it comprises more than 200,000 word-synset pairs;

2. Second, it provides many types of relations, which make it possible to connect the synsets to one another. This property is used in particular by the disambiguation method;

3. Third, it contributes to the definition of the structure of the taxonomy thanks to the IS-A relation between synsets;

4. Finally, it is compatible with a large number of dictionaries and other semantic sources, e.g., DBpedia, which provide mappings to the synsets and therefore extend its semantic scope (cf. sec. 4.2.3).

5.2.1.4. Keywords and Semantic Relations

The process of keyword disambiguation makes it possible to specify the meaning of a keyword. However, the construction of the taxonomy also requires determining the relations between keywords. In our context, semantic relations between keywords are more precisely defined as relations between synsets. As we deal with textual queries, a linguistic ontology can provide us with information about synsets and the relations between them. To this end, we use the WordNet structure, which includes several types of semantic relations (about twenty), e.g., hyponymy vs. hypernymy (IS-A), holonymy vs. meronymy (is part of), etc. In the framework of the disambiguation method, all the relations are considered when constructing the graphs of the query. Indeed, using different relations enriches the query graphs, since it allows all possible relations between synsets to emerge. Fig. 5.3 shows the different relations included in WordNet that are related to the synsets of the word “paper”. This figure was created using Visuwords [Log13].

Figure 5.3.: The WordNet Structure

In the remainder of the process of constructing the taxonomy, we use the hypernymy relation (IS-A) as a basic structural characteristic to define a hierarchical order between the keywords (i.e., synsets). Semantically, it corresponds to the relation of being super-ordinate or belonging to a higher rank or class, e.g., “paper” is the hypernym of “wallpaper”. Fig. 5.3 shows the synsets of the word “paper” and the relations with their direct subordinates.

5.2.2. Basic Hypernymy Structure

The aim of this step is to construct a hierarchical structure that semantically relates the synsets corresponding to the keywords extracted from search logs. To this end, we use the hypernymy relation “IS-A”. We choose the IS-A relation for two main reasons: first, it makes it possible to classify the keywords with a high degree of granularity; second, it provides a means to measure the distance between keywords in the taxonomy (cf. sec. 5.2.3).

Such a relation makes it possible to identify, for each synset, the hierarchy of its hypernyms and to relate synsets by merging those hierarchies. Merging different hierarchies of hypernyms gives rise to two possible structures. If we consider simple hypernymy, i.e., a synset accepts only one hypernym, the merging process produces a tree structure. However, if we consider that a synset accepts multiple hypernyms, then the merging process produces a semi-lattice. In our proposal, we call both structures Taxonomy.

The hypernymy structure is generated as follows. In the first step, for each synset we identify the list of its direct hypernyms in WordNet, i.e., the set of hypernyms to which the synset is directly connected by an IS-A relation. The process is then applied recursively to each hypernym until it reaches the last hypernym, whose own hypernym is the root². For example, let us take the term “football”. Suppose that this term matches the synset “football.1” (the integer “1” reflects the rank of the synset). According to WordNet, sport.1 and game.2 are the hypernyms of football.1, diversion.1 is the hypernym of sport.1, and activity.1 is the hypernym of game.2 and diversion.1 (cf. Fig. 5.4).

Figure 5.4.: Hypernymy Structure Construction by Hpaths Merging

Hence, we get the sub-paths “football.1 is-a sport.1 is-a diversion.1 is-a activity.1” and “football.1 is-a game.2 is-a activity.1” (cf. Fig. 5.4). We call such a hierarchy of hypernyms an “Hpath”.
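The recursive construction and merging of Hpaths can be sketched as follows. The hypernym table below is a hand-written stand-in for WordNet that contains only the football.1 example from the text, and the function names `hpaths` and `merge_hpaths` are hypothetical.

```python
# Toy hypernym table reproducing the football.1 example (not real WordNet data access).
HYPERNYMS = {
    "football.1": ["sport.1", "game.2"],
    "sport.1": ["diversion.1"],
    "game.2": ["activity.1"],
    "diversion.1": ["activity.1"],
    "activity.1": [],
}


def hpaths(synset, hypernyms):
    """Return every IS-A path from `synset` up to a synset with no hypernym."""
    parents = hypernyms.get(synset, [])
    if not parents:
        return [[synset]]
    return [[synset] + path for p in parents for path in hpaths(p, hypernyms)]


def merge_hpaths(paths):
    """Merge Hpaths into a single set of (hyponym, hypernym) edges."""
    edges = set()
    for path in paths:
        edges.update(zip(path, path[1:]))
    return edges


paths = hpaths("football.1", HYPERNYMS)
for path in paths:
    print(" is-a ".join(path))
print(sorted(merge_hpaths(paths)))
```

Because football.1 has two hypernyms, the merged edge set here forms a semi-lattice rather than a tree: both paths meet again at activity.1.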

² Note that WordNet does not have a unique global root. For technical purposes, one can assume the existence of a virtual root. Moreover, in recent versions of WordNet (from v2.1) there exists a root for nouns, which is “Entity”.

In document Usage-Driven Unified Model for User Profile and Data Source Profile Extraction (Page 100-105)