Chapter 6 Automatic video annotation based on query clustering and
6.3. Dominant set query clustering using temporal information
During an interactive video search session, several users may search for several topics from the same computer terminal. In most cases, the queries a user submits give a good idea of what she/he is searching for. However, the user usually queries the system with specific keywords rather than a full description of the topic. In addition, the user may change the search topic arbitrarily, which further complicates the situation.
We consider a search session during which users search for topics $\{T_1, T_2, \ldots, T_N\}$ and submit queries $\{q_1, q_2, \ldots, q_M\}$. Since we assume that the users search sequentially (i.e. one topic after the other), the goal is to find the time boundaries of each topic and to define each topic by a textual description. We consider that each topic $T_k$ is described by a set of queries $\{\ldots, q_i, q_{i+1}, \ldots\}$. The timeline of the search session is illustrated in Figure 6.2. Each query takes either text or a shot as input. We denote by $d_{i,j}$ the semantic distance between two queries $q_i$ and $q_j$. The aim is to arrange the queries into groups such that $d_{i,j}$ between queries of the same group (topic) is minimised, while $d_{i,j}$ between queries belonging to different groups (topics) is maximised. This can be considered a clustering problem, in which we want to organise the queries into an unknown number of topics, taking into account both pairwise similarity and the time dimension.
Figure 6.2. Search session and queries
6.3.1. Dominant set clustering
A dominant set, as defined in (Pavan and Pelillo 2003), is a combinatorial concept in graph theory that generalises the notion of a maximal complete subgraph to edge-weighted graphs. It simultaneously emphasises internal homogeneity and external inhomogeneity, and is thus considered a general definition of a cluster. The authors in (Pavan and Pelillo 2003) establish a connection between the dominant set and a quadratic program as follows:
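$$\max f(x) = x^{T} S x, \quad x \in \Delta \qquad (6.1)$$
where $\Delta = \{x \in \mathbb{R}^n : x_i \geq 0 \text{ and } \sum_i x_i = 1\}$ and $S$ is the similarity matrix. Specifically, it is proven that if $D$ is a dominant subset of vertices, then its weighted characteristic vector $x^D$, which is the vector of $\Delta$ defined as:
$$x_i^D = \begin{cases} \dfrac{w_D(i)}{W(D)}, & i \in D \\ 0, & \text{otherwise} \end{cases} \qquad (6.2)$$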
is a strict local solution of (6.1). Conversely, if $x$ is a strict local solution of the above problem, it is proven by (Pavan and Pelillo 2003) that its support $\sigma(x) = \{i \mid x_i > 0\}$ is equivalent to a dominant set of the graph represented by $S$. Then, the replicator equation is used to solve (6.1):
$$x_i(t+1) = x_i(t) \, \frac{(S x(t))_i}{x(t)^{T} S x(t)} \qquad (6.3)$$
Table 6.1. Dominant set clustering algorithm
Input: the similarity matrix $S$
1. Initialise $S_k$, $k = 1$, with $S$
2. Calculate the local solution of (6.1) by (6.3): $x_k$ and $f(x_k)$
3. Get the dominant set: $D_k = \sigma(x_k)$
4. Split out $D_k$ from $S_k$ and get a new similarity (affinity) matrix $S_{k+1}$
5. If $S_{k+1}$ is empty, break; else $S_k = S_{k+1}$ and $k = k + 1$, then go to step 2
Output: $\bigcup_{l=1}^{k} \{D_l, x_l, f(x_l)\}$
The concept of dominant set provides an effective framework for iterative pairwise clustering, which is required in our problem. Considering a set of samples, an undirected edge-weighted graph is built, in which each vertex represents a sample and two vertices are linked by an edge, the weight of which represents their similarity. To cluster the samples into groups, a dominant set of the weighted graph is iteratively found and removed from the graph until the latter is empty. Table 6.1 outlines the algorithm. The dominant set clustering automatically determines the number of the clusters and has low computational cost.
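The iterative procedure of Table 6.1 can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the function names, the barycentre initialisation, the stopping tolerance `tol`, the support threshold `support_eps` and the handling of vertices with no remaining positive similarity are all assumptions made for illustration, and the similarities are assumed to be non-negative.

```python
def replicator_dynamics(S, max_iter=1000, tol=1e-9):
    """Iterate the replicator equation (6.3) from the barycentre of the simplex.

    Returns the final vector x and the objective value f(x) = x^T S x.
    Assumes S is a symmetric list-of-lists with non-negative entries.
    """
    n = len(S)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        Sx = [sum(S[i][j] * x[j] for j in range(n)) for i in range(n)]
        f = sum(x[i] * Sx[i] for i in range(n))
        if f <= 0.0:
            # No positive similarity left: the update is undefined, so stop.
            return x, 0.0
        new_x = [x[i] * Sx[i] / f for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    Sx = [sum(S[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x, sum(x[i] * Sx[i] for i in range(n))


def dominant_set_clustering(S, support_eps=1e-5):
    """Peel off dominant sets one at a time until no vertex remains."""
    remaining = list(range(len(S)))
    clusters = []
    while remaining:
        # Restrict the similarity matrix to the vertices still in the graph.
        sub = [[S[i][j] for j in remaining] for i in remaining]
        x, f = replicator_dynamics(sub)
        if f <= 0.0:
            # Assumption: mutually dissimilar leftovers become singletons.
            clusters.extend([v] for v in remaining)
            break
        # The support of x (entries above a small threshold) is the cluster.
        support = [remaining[k] for k, xk in enumerate(x) if xk > support_eps]
        clusters.append(support)
        remaining = [v for v in remaining if v not in support]
    return clusters


# Toy similarity matrix: queries 0-2 are mutually similar, as are 3-4.
S_example = [
    [0.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 0.0],
]
clusters_example = dominant_set_clustering(S_example)
```

The number of clusters is not passed in anywhere; it emerges from how many dominant sets are peeled off before the graph is empty.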
Once the dominant set clustering algorithm has formed the clusters, each cluster is labelled with the most frequent keywords found in the queries that comprise it.
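As a rough illustration of this labelling step, the following sketch counts keyword frequencies with the standard library. The function name `cluster_label`, the whitespace tokenisation and the parameter `top_k` (how many keywords form a label) are assumptions; the text does not specify them.

```python
from collections import Counter


def cluster_label(queries, top_k=2):
    """Return the top_k most frequent keywords among the queries of one cluster."""
    # Assumption: queries are plain keyword strings split on whitespace.
    counts = Counter(word for q in queries for word in q.lower().split())
    return [word for word, _ in counts.most_common(top_k)]


example_label = cluster_label(
    ["red car", "car race", "racing car street", "street race"], top_k=1
)
```

In practice stop-word removal or stemming might be applied before counting; neither is shown here.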
6.3.2. Query similarity
As explained in the previous section, in order to identify the topic time boundaries, we propose to compare the submitted queries and identify clusters that correspond to search topics. Based on the analysis we performed in Chapter 4 (Vrochidis, et al. 2011), we can distinguish autonomous from dependent queries and assume that a topic change takes place only when an autonomous query is submitted. By definition, autonomous queries do not depend on previous results, while dependent queries do. To simplify the problem, in this clustering approach we compute similarities only between autonomous queries (i.e. textual queries in our case) and assign cluster labels to them. Since the autonomous queries contain textual information, we need a similarity measure between queries submitted as keywords. In addition, we need to incorporate the temporal dimension into the similarity metric.
6.3.2.1. WordNet-based similarity
One of the state-of-the-art techniques for comparing textual information is to use a thesaurus such as WordNet. In this work we applied the WordNet “vector” similarity after experimenting with other WordNet metrics (i.e. lesk and path). Each concept (or word sense) in WordNet is defined by a short gloss. The vector measure uses the text of that gloss as a unique representation of the underlying concept. It creates a co-occurrence matrix from a corpus made up of the WordNet glosses. Each content word used in a WordNet gloss has an associated context vector. Every gloss is represented by a gloss vector that is the average of all the context vectors of the words found in the gloss. Relatedness between concepts is measured by the cosine between a pair of gloss vectors (Pedersen, et al. 2004).
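The gloss-vector idea can be sketched on toy data as follows. This is not the WordNet::Similarity implementation: the helper names and the hand-made two-dimensional context vectors are hypothetical, and a real system would derive the context vectors from the co-occurrence matrix of the WordNet glosses.

```python
import math


def gloss_vector(gloss_words, context_vectors):
    """Average the context vectors of the gloss words that have one."""
    vectors = [context_vectors[w] for w in gloss_words if w in context_vectors]
    if not vectors:
        raise ValueError("no gloss word has a context vector")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical context vectors for three content words.
toy_context = {"vehicle": [1.0, 0.0], "road": [1.0, 1.0], "fruit": [0.0, 1.0]}
gloss_a = gloss_vector(["vehicle", "road"], toy_context)
gloss_b = gloss_vector(["fruit"], toy_context)
relatedness = cosine(gloss_a, gloss_b)
```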
An additional problem in our case is the inability to perform term disambiguation (since the search topics and the context are considered unknown). To overcome this problem, we calculate the maximum similarity between the senses of the two textual queries. Although the lack of this information could lead in many cases to erroneous results, we assume that the temporal information could help in separating irrelevant queries that were submitted at distant moments in time.
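A minimal sketch of this maximum-over-senses strategy is given below. The function name, the sense identifiers and the lookup table of pairwise sense similarities are hypothetical; in practice the sense-level score would come from the WordNet vector measure.

```python
def max_sense_similarity(senses_a, senses_b, sense_similarity):
    """Return the highest similarity over all sense pairs of two queries."""
    return max(
        (sense_similarity(a, b) for a in senses_a for b in senses_b),
        default=0.0,  # assumption: queries without known senses score 0
    )


# Hypothetical sense-level scores, stored once per unordered pair.
toy_scores = {
    frozenset(["bank#1", "river#1"]): 0.7,
    frozenset(["bank#2", "river#1"]): 0.1,
}


def toy_similarity(a, b):
    return toy_scores.get(frozenset([a, b]), 0.0)


best = max_sense_similarity(["bank#1", "bank#2"], ["river#1"], toy_similarity)
```

Taking the maximum is optimistic: it picks the sense pair that makes the two queries look most related, which is why the temporal term is relied on to filter out spurious matches.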
6.3.2.2. Temporally enhanced similarity
The aim of query clustering is to temporally segment the search time into sessions, in which a user searches for a specific topic. In this case, not only the query similarity but also the temporal constraint has to be taken into consideration. For this reason, we incorporate the temporal dimension into the computation of the similarity matrix with a Gaussian kernel. Hence, the similarity $w_{i,j}$ between queries $q_i$, $q_j$ is computed by:
$$w_{i,j} = v_{i,j} \cdot e^{-\alpha |t_i - t_j|} \qquad (6.4)$$
where $v_{i,j}$ is the WordNet similarity between the two queries, $t_i$ and $t_j$ are the moments at which the queries $q_i$, $q_j$ are respectively submitted, $\alpha$ is the decay factor, which reflects the rate at which the similarity decreases as the temporal interval increases, and $w_{i,j}$ are the elements of the final similarity matrix $S$.
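The construction of the temporally enhanced similarity matrix can be sketched as below. The exact form of the kernel exponent is an assumption here (an exponential decay in the absolute time difference, controlled by a single hypothetical parameter `alpha`), as are the function name and the toy values.

```python
import math


def temporal_similarity_matrix(v, times, alpha):
    """Weight each semantic similarity v[i][j] by a temporal decay term.

    Assumed kernel: exp(-alpha * |t_i - t_j|); a larger alpha makes the
    similarity fall off faster as the submission times move apart.
    """
    n = len(times)
    return [
        [v[i][j] * math.exp(-alpha * abs(times[i] - times[j])) for j in range(n)]
        for i in range(n)
    ]


# Two queries with semantic similarity 0.5, submitted 10 time units apart.
v_example = [[1.0, 0.5], [0.5, 1.0]]
w_example = temporal_similarity_matrix(v_example, times=[0.0, 10.0], alpha=0.1)
```

Queries submitted at the same moment keep their semantic similarity unchanged, since the decay term equals 1 when the time difference is zero.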
6.3.2.3. Smoothing process
We apply the clustering approach to our problem under the assumption that the queries falling into one cluster constitute a semantic topic. However, there are cases in which the user submits semantically irrelevant queries during a topic search, or the WordNet similarity does not perform well. Thus, after the standard clustering process, we apply the following smoothing process:
a) if the cluster label of a query does not coincide with the labels of its two temporally adjacent queries, we assume it was initially misclassified;
b) small clusters are merged with the adjacent ones. The minimum number of members for defining a small cluster is selected experimentally.
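The two smoothing rules can be sketched as follows, under several assumptions that the text leaves open: the labels are given in order of submission time, rule (a) relabels a query only when its two neighbours share the same label, a "small cluster" is a temporally contiguous run of fewer than `min_size` queries, and such a run is merged into the preceding run (or the following one if it is first). The function name and `min_size` are hypothetical.

```python
def smooth_labels(labels, min_size=2):
    """Apply smoothing rules (a) and (b) to time-ordered cluster labels."""
    # Rule (a): an isolated label between two agreeing neighbours is replaced.
    out = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            out[i] = labels[i - 1]

    # Collapse the sequence into runs of [label, length].
    runs = []
    for lab in out:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])

    # Rule (b): merge runs shorter than min_size into the preceding run.
    merged = []
    for lab, length in runs:
        if merged and (length < min_size or merged[-1][0] == lab):
            merged[-1][1] += length
        else:
            merged.append([lab, length])
    # A short leading run has no predecessor, so merge it forwards.
    if len(merged) > 1 and merged[0][1] < min_size:
        merged[1][1] += merged[0][1]
        merged.pop(0)

    return [lab for lab, length in merged for _ in range(length)]


smoothed_a = smooth_labels([0, 0, 1, 0, 0, 2, 2, 2])
smoothed_b = smooth_labels([0, 0, 0, 1, 2, 2, 2])
```

After smoothing, each topic occupies a single contiguous time interval whenever the remaining runs carry distinct labels, which is what the topic time boundaries require.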