
Chapter 6 Automatic video annotation based on query clustering and

6.4. Query clustering based on gaze-driven random forests

6.4.3. Gaze-driven random forests

6.4.3.1. Affinity matrix

In the previous clustering method (section 6.3.2) we generated an affinity matrix by considering the WordNet distances of the autonomous textual queries together with the temporal information. As already discussed, in this scenario we do not consider the temporal information. Although the WordNet distances can give an indication of relevance, there are several cases in which this metric fails to provide an acceptable result. In particular, when the context of the query is unknown (as in our case), the inability to disambiguate terms (e.g. to distinguish “jaguar” the car from “jaguar” the animal) further complicates the problem. To this end we propose to enrich this distance with the semantic similarity of the images clicked during each subsession, which comprise the dependent queries.

Let us define the semantic similarity between two subsessions $i$ and $j$. The idea of this comparison is illustrated in Figure 6.6. As described in Chapter 4, each subsession includes one autonomous query and a set of dependent queries. We calculate the semantic similarity between the two autonomous queries using the WordNet similarity as described in section 6.3.2.1. However, the dependent queries take keyframes (i.e. images) as input, and therefore each subsession includes a set of images that were clicked by the user. To calculate a distance between two sets of images we need a metric that represents such a similarity.

One of the most well-known metrics for set comparison is the Jaccard coefficient (Jaccard 1908). This coefficient measures the similarity between sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets. Formally, for two sets $A$ and $B$, the Jaccard similarity coefficient $J(A, B)$ is given by:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{6.9}$$
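As a minimal sketch of equation 6.9, the coefficient can be computed directly with Python sets. The keyframe identifiers are hypothetical, and returning 0 for two empty sets is an assumption of this sketch, since the text does not cover that case.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are given similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical keyframe identifiers clicked in two subsessions.
print(jaccard({"kf1", "kf2", "kf3"}, {"kf2", "kf3", "kf4"}))  # 0.5
```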

Figure 6.6. Comparison of subsessions. At the top, the direct comparison between the textual queries is illustrated. The common images identified in the dependent queries are shown in green circles.

However, when a set comprises images, there are cases in which several images are very similar to each other and can be considered near duplicates. In this context we propose to enhance the Jaccard similarity coefficient and introduce the visually enhanced Jaccard similarity, which takes near duplicates into account. We avoid introducing a similarity coefficient that depends entirely on visual similarity, because this could lead to misleading results: in many cases the visual distance does not correspond to the semantic distance.

Figure 6.7. Sets A and B in a bipartite graph representation. Distances for non-duplicate shots are represented with red dashed edges, while black solid edges indicate distances between near duplicates.

The idea is to identify near-duplicate images between the different sets and consider them identical in order to compute the Jaccard similarity. However, the problem is not that simple, since each image might have more than one near duplicate and a random selection would lead to different results. For instance, let us assume sets $A = \{a_1, a_2\}$ and $B = \{b_1, b_2\}$, where $a_1$ is a near duplicate of both $b_1$ and $b_2$, while $a_2$ is a near duplicate only of $b_1$. A random assignment would result in different similarity coefficients depending on the order in which the images are considered. In this case one assignment could be $a_1 \equiv b_2$ and $a_2 \equiv b_1$, which leads to a Jaccard similarity equal to 1, while another assignment would be only $a_1 \equiv b_1$ ($a_2 \equiv b_1$ cannot be considered, since each image is allowed to have only one near duplicate), which leaves one common image out of three distinct ones and hence a Jaccard similarity of 1/3. Clearly, the most meaningful result is the first, while in the second case important information is neglected.
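The dependence on the chosen assignment can be checked with a short sketch. The labels `a1`, `a2`, `b1`, `b2` are illustrative names for the two images of each set, and `jaccard_after_merge` is a hypothetical helper that renames each matched image of the second set to its partner before applying the plain Jaccard coefficient.

```python
def jaccard_after_merge(set_a, set_b, matching):
    """Rename each matched image of B to its partner in A, then compute Jaccard."""
    rename = {b: a for a, b in matching}
    merged_b = {rename.get(b, b) for b in set_b}
    a = set(set_a)
    return len(a & merged_b) / len(a | merged_b)


A = {"a1", "a2"}
B = {"b1", "b2"}

# a1 is a near duplicate of b1 and b2; a2 is a near duplicate of b1 only.
full = [("a1", "b2"), ("a2", "b1")]   # both images of A find a partner
partial = [("a1", "b1")]              # a1 takes b1, so a2 is left unmatched

print(jaccard_after_merge(A, B, full))     # 1.0
print(jaccard_after_merge(A, B, partial))  # 0.3333333333333333
```

With the partial assignment the updated sets share one image out of three distinct ones, which is why the order of assignment matters.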

Since the members of $A$ are linked only with those of $B$ (i.e. no connections exist between members of the same set), we can represent these connections with a bipartite graph, as shown in Figure 6.7.

Then we model the problem of identifying the maximum number of duplicates as a minimum weight perfect matching problem (or assignment problem) (Burkard, et al. 2009) in a bipartite graph. The minimum cost (weight) perfect matching problem is often described by the following story: there are n jobs to be processed on n machines, and one would like to process exactly one job per machine such that the total cost of processing the jobs is minimised.
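To make the notion of "maximum number of duplicates" concrete, the following sketch stores the near-duplicate links of a bipartite graph as adjacency lists and finds a maximum matching with a simple augmenting-path search. This is a different, simpler technique than the assignment-problem formulation adopted in the text and is shown only for illustration; the node labels are hypothetical.

```python
def max_bipartite_matching(edges):
    """Maximum matching via augmenting paths.

    `edges` maps each node of A to the list of its near duplicates in B.
    """
    match_of_b = {}  # node of B -> node of A currently matched to it

    def try_assign(a, visited):
        for b in edges.get(a, ()):
            if b in visited:
                continue
            visited.add(b)
            # Take b if it is free, or if its current partner can move elsewhere.
            if b not in match_of_b or try_assign(match_of_b[b], visited):
                match_of_b[b] = a
                return True
        return False

    for a in edges:
        try_assign(a, set())
    return sorted((a, b) for b, a in match_of_b.items())


edges = {"a1": ["b1", "b2"], "a2": ["b1"]}
print(max_bipartite_matching(edges))  # [('a1', 'b2'), ('a2', 'b1')]
```

Here `a1` first takes `b1`, and is then moved to `b2` so that `a2` can also be matched, which yields two near-duplicate pairs instead of one.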

To this end we assign to each edge a cost $c_{i,j} = 0$ when the interconnected vertices represent duplicate images and $c_{i,j} = 1$ when the images are not considered duplicates. This is performed by applying a threshold $T$ to the visual distance $d_{i,j}$ between shots $i$ and $j$, as shown below:

$$c_{i,j} = \begin{cases} 0, & d_{i,j} \le T \\ 1, & d_{i,j} > T \end{cases} \tag{6.10}$$
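A minimal sketch of this thresholding step, assuming the visual distances are already available as a nested list with one row per shot of the first set and one column per shot of the second. The distance values and the threshold are made up for illustration.

```python
def binary_costs(distances, threshold):
    """Map a matrix of visual distances to 0/1 costs (0 = near duplicate)."""
    return [[0 if d <= threshold else 1 for d in row] for row in distances]


# Hypothetical distances between 2 shots of A (rows) and 3 shots of B (columns).
d = [[0.10, 0.80, 0.25],
     [0.90, 0.30, 0.70]]
print(binary_costs(d, threshold=0.3))  # [[0, 1, 0], [1, 0, 1]]
```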

Then the problem is considered as a minimum weight matching, in which we want to identify a matching $M$ that minimises the total cost $\sum_{(i,j) \in M} c_{i,j}$. In this case the problem we face is not a standard (balanced) linear assignment problem, since the cardinalities of the two sets are not necessarily equal. However, given that we are only interested in the assignments that involve duplicate shots, we can easily transform it into one, either by removing shots that do not have any near duplicate (i.e. removing any shot $i$ for which $c_{i,j} = 1\ \forall j$) or by introducing dummy shots that satisfy this requirement.
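The second option, introducing dummy shots, can be sketched as padding the cost matrix until it is square. Giving every dummy shot a cost of 1 to all other shots, so that it is never treated as a near duplicate, is this sketch's reading of the requirement above; `pad_to_square` is a hypothetical helper name.

```python
def pad_to_square(cost, dummy_cost=1):
    """Add dummy rows or columns so that the cost matrix becomes square."""
    n_rows = len(cost)
    n_cols = len(cost[0]) if cost else 0
    size = max(n_rows, n_cols)
    # Dummy columns: extend every existing row.
    padded = [row + [dummy_cost] * (size - n_cols) for row in cost]
    # Dummy rows: append full rows of dummy costs.
    padded += [[dummy_cost] * size for _ in range(size - n_rows)]
    return padded


print(pad_to_square([[0, 1, 0], [1, 0, 1]]))
# [[0, 1, 0], [1, 0, 1], [1, 1, 1]]
```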

In order to solve this problem we apply the Hungarian algorithm (Kuhn 1955). Let us assume a matching $M$ between the shots of sets $A$ and $B$. Its incidence vector is $x$, where $x_{i,j} = 1$ if $(i, j)$ belongs to $M$ and 0 otherwise. Then the minimum weight perfect matching problem can be formulated as follows:

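$$\min \sum_{i \in A} \sum_{j \in B} c_{i,j}\, x_{i,j} \tag{6.11}$$

subject to:

$$\sum_{j \in B} x_{i,j} = 1 \quad \forall i \in A, \qquad \sum_{i \in A} x_{i,j} = 1 \quad \forall j \in B, \qquad x_{i,j} \ge 0,\ x_{i,j} \in \mathbb{N},\ i \in A,\ j \in B$$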

Then the Hungarian algorithm solves this problem in two steps: a) it constructs a cost matrix $C$, where $c_{i,j}$ is the cost of matching shots $i$ and $j$ as duplicates, and b) it applies equivalent matrix reductions to obtain the optimal assignment with respect to the cost matrix (Kuhn 1955).
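For very small sets the optimum of the assignment problem can be verified by exhaustive search. The sketch below is a brute-force stand-in, not the Hungarian algorithm: it enumerates every perfect matching of a square cost matrix and keeps the cheapest one, which is only feasible for a handful of shots. In practice a library routine such as SciPy's `scipy.optimize.linear_sum_assignment` would be a natural choice for the same problem.

```python
from itertools import permutations


def min_cost_perfect_matching(cost):
    """Exhaustively search all perfect matchings of a square cost matrix."""
    n = len(cost)
    best_cost, best_perm = None, None
    for perm in permutations(range(n)):
        # perm[i] is the column (shot of B) assigned to row i (shot of A).
        total = sum(cost[i][perm[i]] for i in range(n))
        if best_cost is None or total < best_cost:
            best_cost, best_perm = total, perm
    return best_cost, [(i, j) for i, j in enumerate(best_perm)]


# Row 0 is a near duplicate of both columns, row 1 only of column 0.
cost = [[0, 0],
        [0, 1]]
print(min_cost_perfect_matching(cost))  # (0, [(0, 1), (1, 0)])
```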

Table 6.2. Visually enhanced Jaccard similarity algorithm

Input: the two image sets $A = \{a_i\}$, $B = \{b_j\}$

1. Eliminate any duplicate images separately in $A$ and $B$.

2. Calculate all the visual distances $d_{i,j}$.

3. Transform the problem to a linear (balanced) one by removing shots or introducing dummy shots.

4. Apply the Hungarian algorithm to identify the best matching.

5. Update the two sets $A$ and $B$ to $A'$ and $B'$ respectively after the identification of near duplicates (i.e. in case $a_i$ and $b_j$ are duplicates, replace all $b_j$ with $a_i$).

6. Calculate the Jaccard similarity of the two updated sets $A'$ and $B'$.

Output: $eJ = J(A', B') = \frac{|A' \cap B'|}{|A' \cup B'|}$
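The six steps of Table 6.2 can be put together in a short sketch. Several choices here are assumptions rather than part of the described method: `distance` is a hypothetical user-supplied visual distance function, dummy shots are given cost 1, an empty input set yields similarity 0, and the matching step uses exhaustive search as a stand-in for the Hungarian algorithm, so it only scales to very small sets.

```python
from itertools import permutations


def enhanced_jaccard(images_a, images_b, distance, threshold):
    """Sketch of the visually enhanced Jaccard similarity of Table 6.2."""
    # Step 1: eliminate duplicates inside each set (order kept for indexing).
    a = list(dict.fromkeys(images_a))
    b = list(dict.fromkeys(images_b))
    if not a or not b:
        return 0.0  # Assumption: an empty set yields similarity 0.
    # Step 2: visual distances, thresholded into binary costs.
    cost = [[0 if distance(x, y) <= threshold else 1 for y in b] for x in a]
    # Step 3: pad with dummy shots (cost 1) to obtain a square matrix.
    n = max(len(a), len(b))
    cost = [row + [1] * (n - len(b)) for row in cost]
    cost += [[1] * n for _ in range(n - len(a))]
    # Step 4: brute-force stand-in for the Hungarian algorithm.
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    # Step 5: replace each matched near duplicate in B by its partner in A.
    rename = {b[j]: a[i] for i, j in enumerate(best)
              if i < len(a) and j < len(b) and cost[i][j] == 0}
    set_a = set(a)
    set_b = {rename.get(y, y) for y in b}
    # Step 6: Jaccard similarity of the updated sets.
    return len(set_a & set_b) / len(set_a | set_b)


def toy_distance(x, y):
    """Hypothetical 1-D stand-in for a visual distance between two images."""
    return abs(x - y)


# Two of the three images of B are near duplicates of images of A.
print(enhanced_jaccard([0.0, 1.0], [0.05, 1.1, 5.0], toy_distance, 0.2))
# 0.6666666666666666
```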

It should be noted that the Hungarian algorithm could also be used to solve the problem when non-binary costs have been defined. An alternative would be to define the cost as the visual distance between near duplicates and make the cost infinite between shots that are not near duplicates.
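A minimal sketch of this non-binary alternative, under the assumption that the same distance threshold decides which pairs count as near duplicates. The distances are made up, and the exhaustive search over permutations is again only a small-scale stand-in for the Hungarian algorithm.

```python
import math
from itertools import permutations


def distance_costs(distances, threshold):
    """Keep the distance as cost for near duplicates; forbid all other pairs."""
    return [[d if d <= threshold else math.inf for d in row] for row in distances]


# Hypothetical distances between two shots of A (rows) and two of B (columns).
d = [[0.10, 0.25],
     [0.20, 0.90]]
cost = distance_costs(d, threshold=0.3)
print(cost)  # [[0.1, 0.25], [0.2, inf]]

# The pairing (0 -> 0, 1 -> 1) has infinite cost, so the other one is chosen.
best = min(permutations(range(2)),
           key=lambda p: sum(cost[i][p[i]] for i in range(2)))
print(best)  # (1, 0)
```

When no perfect matching with finite cost exists, every candidate has infinite total cost, so an implementation would need to handle that case explicitly.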


Finally, we propose to compute the Jaccard similarity after we have identified the maximum number of assignments between the images of the different sets based on near duplicates. The overall algorithm for calculating the enhanced Jaccard similarity is presented in Table 6.2.

Assuming that the WordNet similarity between the terms of the textual queries is $v_{i,j}$, as described in section 6.3.2.1, and $eJ_{i,j}$ is the visually enhanced Jaccard similarity, the final similarity $w_{i,j}$ is defined as:

$$w_{i,j} = \begin{cases} v_{i,j} \cdot eJ_{i,j}, & \text{when } v_{i,j}, eJ_{i,j} \neq 0 \\ v_{i,j}, & eJ_{i,j} = 0 \\ eJ_{i,j}, & v_{i,j} = 0 \end{cases} \tag{6.12}$$
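The piecewise rule of equation 6.12 (the product when both similarities are non-zero, otherwise whichever one is non-zero) can be sketched as a small function. The argument names `v` and `ej` stand for the WordNet similarity and the visually enhanced Jaccard similarity of one pair of subsessions; when both are zero the sketch returns 0, which is what either of the last two cases gives.

```python
def final_similarity(v, ej):
    """Combine WordNet similarity v and enhanced Jaccard similarity ej."""
    if v != 0 and ej != 0:
        return v * ej
    if ej == 0:
        return v   # only the textual similarity is informative
    return ej      # v == 0: fall back to the image-based similarity


print(final_similarity(0.8, 0.5))  # 0.4
print(final_similarity(0.8, 0.0))  # 0.8
print(final_similarity(0.0, 0.5))  # 0.5
```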

In document Interactive video retrieval using implicit user feedback. (Page 147-152)