III. A Framework for Usage Analysis in Information Retrieval
5.2. Keyword Taxonomy Construction: A Global Semantic Representation
5.2.3. Semantic Distance Function
In order to provide the taxonomy with a means of quantitatively evaluating the semantic distance between two terms (keywords), we first introduce a weight associated with each individual “is-a” link. This weight is defined as a decreasing function of the semantic proximity between the parent and the child (i.e., the higher the weight, the less related the terms). As all relations contained in the taxonomy are of type “is-a”, the nodes go from the most general at the top to the most specific at the bottom. Therefore, two connected terms at the bottom of the taxonomy are more closely related than two connected terms at the top. The weighting function should thus decrease with the level of the terms. This choice is motivated by the fact that relations between concrete, less abstract terms are more relevant, and hence stronger, than relations between general terms.
In addition, a decreasing weighting function makes it possible to characterize the dimension of the domain covered by the query terms: depending on the terms used, a query ranges from general (i.e., large domain dimension) to specific (small domain dimension). We argue that the more general the terms, the larger the distance between them and hence the larger the domain covered by the query. For example, let us consider two queries Q1 and Q2, where Q1 is “Sport diversion”, a general query about sport, and Q2 is “football player X”, a more specific query about a football player. In this example, the domain covered by “Sport” and “diversion” is larger than the one covered by “football player” and a given person “X”. This is why the weight of the relation “Sport is-a diversion” is higher than that of the relation “X is-a football player”. Based on this, we define the weight function W as:
Definition 1. Let x, y be two terms (i.e., synsets) related by the direct relation y “is-a” x. The weight function W is defined on x and y as:
\[
W(x, y) = \frac{1}{l(y)}
\]
where l is the function that returns the level of y.
Fig. 5.6 illustrates the use of this function.
Figure 5.6.: Decreasing Weight Function
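As a minimal sketch of Definition 1, the weight can be computed from the level of the child term alone. The function name `weight` and the two levels used below are hypothetical; they are chosen only to mirror the contrast between a link near the top of the taxonomy (such as “Sport is-a diversion”) and a link near the bottom (such as “X is-a football player”). The sketch assumes that the child of a link is always at level 1 or deeper.

```python
from fractions import Fraction


def weight(child_level: int) -> Fraction:
    """Weight of an "is-a" link whose child term sits at `child_level`."""
    if child_level < 1:
        # Assumption: a child term can never sit at level 0.
        raise ValueError("the child of an is-a link must be at level 1 or deeper")
    return Fraction(1, child_level)


# Hypothetical levels, chosen only for illustration.
w_general = weight(2)   # a link between general terms, near the top
w_specific = weight(6)  # a link between specific terms, near the bottom

print(w_general, w_specific)   # 1/2 1/6
print(w_general > w_specific)  # True
```

Exact fractions are used instead of floats so that sums of weights can later be compared without rounding effects.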
In order to be able to compare every pair of terms of the taxonomy, we introduce a function that depends on the weights of the edges composing the shortest path between the terms. Depending on the type of path, there are two ways to compute the distance.
• If the path is straight (cf. Fig. 5.7), the distance is the sum of the weights of its edges; in case of multiple paths, the shortest one is considered.
• If the path is deviated (cf. Fig. 5.7), the distance is the sum of the distances of the two sub-paths (straight paths) that have a common hypernym as an upper bound and together form the shortest path.
Hence, the function is defined as:
Definition 2. Let x, y be two terms (i.e., synsets) of the taxonomy. The dissimilarity (or distance³) function D is defined on x and y as:
\[
D(x, y) =
\begin{cases}
0, & \text{if } x = y, \\[4pt]
\displaystyle\sum_{i=l(y)+1}^{l(x)} \frac{1}{i}, & \text{if the shortest straight path } x \to y \text{ exists,} \\[4pt]
\displaystyle\sum_{i=l(x)+1}^{l(y)} \frac{1}{i}, & \text{if the shortest straight path } y \to x \text{ exists,} \\[4pt]
\displaystyle\sum_{i=l(c)+1}^{l(x)} \frac{1}{i} + \sum_{i=l(c)+1}^{l(y)} \frac{1}{i}, & \text{if the shortest path } x \to c \leftarrow y \text{ deviated by } c \text{ exists,}
\end{cases}
\]
with l being the level function (it returns the depth of a term; its value increases when going toward the bottom of the taxonomy) and c the common hypernym of the terms x and y.
Figure 5.7.: Distance Measurement
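Definition 2 can be sketched in Python as follows. The toy taxonomy, the helper names (`levels`, `ancestors`, `path_cost`, `distance`), and the convention that the root sits at level 0 are assumptions made for illustration. The sketch also assumes that every “is-a” link joins a term at level i to a hypernym at level i − 1, so that all four cases of the definition reduce to picking the deepest common hypernym c (with c = y or c = x when the path is straight, and c = x = y when the terms coincide).

```python
from fractions import Fraction

# Hypothetical toy taxonomy: term -> list of direct hypernyms ("is-a" parents).
PARENTS = {
    "entity": [],
    "diversion": ["entity"],
    "sport": ["diversion"],
    "football": ["sport"],
    "tennis": ["sport"],
}


def levels(parents):
    """Depth of each term, with the root at level 0 (an assumption)."""
    memo = {}

    def depth(term):
        if term not in memo:
            # Any parent gives the same depth if levels are consistent.
            memo[term] = 0 if not parents[term] else 1 + depth(parents[term][0])
        return memo[term]

    return {term: depth(term) for term in parents}


def ancestors(term, parents):
    """All hypernyms reachable from `term`, including `term` itself."""
    seen, stack = {term}, [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def path_cost(top_level, bottom_level):
    """Sum of 1/i for i = top_level + 1 .. bottom_level (a straight path)."""
    return sum((Fraction(1, i) for i in range(top_level + 1, bottom_level + 1)),
               Fraction(0))


def distance(x, y, parents):
    """Distance D(x, y) through the deepest common hypernym of x and y."""
    lvl = levels(parents)
    common = ancestors(x, parents) & ancestors(y, parents)
    if not common:
        raise ValueError("the terms share no hypernym")
    c = max(common, key=lvl.__getitem__)
    return path_cost(lvl[c], lvl[x]) + path_cost(lvl[c], lvl[y])


print(distance("football", "football", PARENTS))   # 0    (x = y)
print(distance("football", "diversion", PARENTS))  # 5/6  (straight path)
print(distance("football", "tennis", PARENTS))     # 2/3  (deviated by "sport")
```

Because the weight of a link depends only on the level of its child, the cost of a straight path depends only on the levels of its two end points, which is why `path_cost` takes levels rather than terms.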
³ In the rest of this chapter, we call the function D a “distance”, as this term is more illustrative and therefore commonly used in the domain. In mathematical terms, a distance should satisfy the four conditions presented in sec. 3.2. However, as we will see in this section, the function D does not satisfy the triangle inequality; D is therefore a semimetric.
With this definition, we aim to focus on the abstraction level of terms to evaluate their distance. The notion of semantic distance between words has been extensively studied in the literature, especially in the information retrieval domain (cf. sec. 3.2.2 for more details). However, the aim of the proposed semantic distance D is not to measure the absolute semantic relatedness of terms but to provide a means of measuring semantic proximity in the specific context of our taxonomy. Therefore, we aimed to model the semantic variation (the weight function) of the “is-a” relation depending on the abstraction level of the related terms in the taxonomy (cf. Fig. 5.6). This property had never been explicitly studied in earlier works; nevertheless, we could verify its effectiveness with respect to the other functions (cf. sec. 5.5.2). The function presented in Definition 2 satisfies the following conditions:
1. Positivity. ∀x, y ∈ V, D(x, y) ≥ 0 (with V the set of terms)
• In case of a straight path (y → x) or (x → y), the distance is the sum of the positive weights of the edges composing the path.
• In case of a deviated path (x → c ← y), the distance is the sum of the distances of the two straight paths x → c and y → c, each of which is positive by the previous case.
Consequently, D satisfies the positivity condition.
2. Symmetry. ∀x, y ∈ V, D(x, y) = D(y, x)
• In case of a straight path (x → y), D(x, y) is the sum of the positive weights of the edges composing the path. Consequently:
\[
D(x, y) = \frac{1}{l(y)+1} + \dots + \frac{1}{l(x)} = \frac{1}{l(x)} + \dots + \frac{1}{l(y)+1} = D(y, x)
\]
• In case of a straight path (y → x):
\[
D(x, y) = \frac{1}{l(x)+1} + \dots + \frac{1}{l(y)} = \frac{1}{l(y)} + \dots + \frac{1}{l(x)+1} = D(y, x)
\]
• In case of a deviated path (x → c ← y), D(x, y) is the sum of the distances of two straight paths (cf. the previous cases).
Consequently, D is symmetric.
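3. Reflexivity. ∀x, y ∈ V, D(x, y) = 0 iff x = y. There are two cases:
• If x = y, then D(x, y) = 0 (directly from Definition 2).
• If D(x, y) ≠ 0, then x ≠ y. Indeed, D(x, y) ≠ 0 means that at least one of the following holds:
– D(x, y) = 1/l(y) (i.e., y “is-a” x) ⇒ x ≠ y
– D(x, y) = 1/l(x) (i.e., x “is-a” y) ⇒ x ≠ y
– Otherwise, the distance between x and y is a sum of positive weights (positivity) ⇒ x ≠ y
Conversely, two distinct terms are separated by at least one edge, and every edge weight 1/i is strictly positive, so x ≠ y implies D(x, y) > 0.
Consequently, D is reflexive.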
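As a worked instance of the three conditions, consider a straight path x → y with the hypothetical levels l(y) = 1 and l(x) = 3 (values chosen only for illustration):

```latex
\[
D(x, y) = \sum_{i=2}^{3} \frac{1}{i} = \frac{1}{2} + \frac{1}{3} = \frac{5}{6}
        = \frac{1}{3} + \frac{1}{2} = D(y, x),
\qquad
D(x, y) = \frac{5}{6} > 0 \ \text{with } x \neq y,
\qquad
D(x, x) = 0 .
\]
```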
The distance D does not satisfy the triangle inequality (Fig. 5.8 shows a counterexample).
By definition, the triangle inequality is formalized as: ∀x, y, z ∈ V, D(x, z) ≤ D(x, y) + D(y, z). Let us consider the following concepts, represented in Fig. 5.8:
• X, Y, Z: three different players
• C1: a player in a football club
• C2: a player in a national football team
• C: the concept football player
In the example (cf. Fig. 5.8), the shortest path between players X and Z passes through the common concept “football player”, while the shortest path between players X and Y passes through the concept “player in a football club” and the shortest path between players Y and Z passes through the concept “player in a national football team”. According to Definition 2, D(X, Z) = 3, D(X, Y) = 1, and D(Y, Z) = 1. Consequently, D(X, Z) > D(X, Y) + D(Y, Z), which means that D does not satisfy the triangle inequality.
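The counterexample can be checked numerically. The sketch below assumes one assignment of levels under which the stated values follow from Definition 2: C at level 0, C1 and C2 at level 1, and the three players at level 2. These levels and the helper names `straight` and `deviated` are assumptions, since the figure is not reproduced here.

```python
from fractions import Fraction

# Assumed levels for the terms of Fig. 5.8 (not taken from the figure itself).
LEVEL = {"C": 0, "C1": 1, "C2": 1, "X": 2, "Y": 2, "Z": 2}


def straight(top, bottom):
    """Distance along a straight path: sum of 1/i for i = l(top)+1 .. l(bottom)."""
    return sum((Fraction(1, i) for i in range(LEVEL[top] + 1, LEVEL[bottom] + 1)),
               Fraction(0))


def deviated(x, c, y):
    """Distance along the path x -> c <- y through the common hypernym c."""
    return straight(c, x) + straight(c, y)


d_xz = deviated("X", "C", "Z")   # X and Z only share "football player"
d_xy = deviated("X", "C1", "Y")  # X and Y are both club players
d_yz = deviated("Y", "C2", "Z")  # Y and Z are both national-team players

print(d_xz, d_xy, d_yz)          # 3 1 1
print(d_xz <= d_xy + d_yz)       # False: the triangle inequality fails
```

Under these assumed levels, each link from a player to C1 or C2 weighs 1/2 and each link from C1 or C2 to C weighs 1, which gives D(X, Z) = 2 · (1 + 1/2) = 3 and D(X, Y) = D(Y, Z) = 2 · 1/2 = 1.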