Agglomerative Clustering - Clustering Variable-Length Codewords into Movemes

5.4 Modeling the Dynamics

5.4.4 Clustering Variable-Length Codewords into Movemes

5.4.4.2 Agglomerative Clustering

The Smith-Waterman distance metric just introduced is an excellent tool to determine the similarity of codewords. It can handle pairs which are different in length, and allows for some shifting of descriptors during the match. One drawback, however, is the fact that it does not scale well to the alignment of multiple codewords: although its extension is trivial, the computational complexity grows exponentially in the number of sequences. The problem itself remains an open question and, once again, is of great interest in biology.

Since some clustering techniques, such as k-means or mixture of Gaussians, go through the computation of some form of average element, the alignment of multiple sequences be- comes necessary. We bypass the problem by settling for an alternative clustering algorithm, known as agglomerative clustering, which requires nothing more than the affinity matrix.

Agglomerative clustering groups the data in a hierarchical structure called adendrogram. Initially, each one of the N data points is assumed to be a cluster of its own. Then, the two most similar clusters (really, single data points, up to now) are replaced by a cluster containing them, which has the two clusters as children. This leaves N −1 items to be

6_Generally,_Cdel

<0 since we want to penalize deletions/insertions. Substitutions have score Csub ≥

2Cdel_{, since a deletion on each side effectively amounts to a substitution. The better the match between} two symbols being exchanged, the larger the score that is assigned to it.

1 5 6 26 27 28 5 6 27 28 28 27 26 1 5 6 26 27 28 1 5 6 d = 0 d = 1 d = 2 d = 3 d = ...

Figure 5.15: Agglomerative Clustering. Initially, each one of the six elements is a cluster. A dendrogram is constructed by recursively grouping together the two most similar clusters. Similarity is determined by the average distances of elements in one cluster to elements in the other. Finally, a threshold on maximum intra-cluster distance breaks the links at higher levels, yielding a set of clusters.

clustered. Next, the two most similar clusters are merged and replaced by their union, leaving N −2 clusters. This continues until all elements are in one single cluster. When deciding which pair to merge, several criteria can be used. The average linkage chooses the two sets where the elements in one set have smallest average distance to the elements of the other set. When the average is replaced by the maximum or minimum, we have the complete linkage or single linkage, respectively. Figure 5.15 shows a sample hierarchy obtained by single-linkage clustering of six data points. Pairwise affinity is based on the Euclidean distance.

Once the dendrogram is constructed, clusters can be defined by choosing a threshold on the maximum intra-cluster distance, or other such criteria that eliminates links, starting from the top and descending to some level in the dendrogram. The resulting connectivity

defines what each cluster contains.

5.4.5 Experiments

In this section we show a few results obtained with the algorithm we have presented. Although quantitative results are generally desirable, we provide only a limited, and mostly qualitative, evaluation. This is in part due to the difficulty of representing on paper what is most naturally understood through a video signal.

More importantly, we feel that the process of establishing a ground-truth decomposition into movemes is at best subjective, hence making a quantitative measure of the performance of little interest.

Finally, we observe that our focus has been directed at the discovery of movemes, which is the very first step of the activity/action hierarchy which we have hypothesized. We believe that a quantitative evaluation would be beneficial when performed on an end-to-end system, that can produce descriptions of behaviors which are at the same level of abstraction as those of a typical human being.

Representing the Dynamics :

In this experiment we apply the learning procedure of Section 5.4.1 to 17 video sequences with a total of 4,610 frames. The input to the algorithm is a set of locations in each frame which identify the joints of the body. For each limb we learn a model of its four coordinates. Each coordinate is trained independently by means of a SLDS with five switches. The training is stopped when the likelihood increase is below a small threshold.

Figure 5.16 shows results on sequence 8. Each one of the four plots represents the probability mass of the switch for one of the coordinates within the right arm. These

SLDS Switching Probability Distribution Pr[S_t(X_ha)] 12 34 5 Pr[S_t(X_el)] 12 34 5 Pr[S_t(Y_ha)] 12 34 5 Pr[S_t(Y_el)] 0 50 100 150 200 12 34 5

Figure 5.16: Estimating the Switches. We show experimental results on sequence 8 of 17. Each plot represents the probability mass of the switch (which takes values 1. . .5) for one of the coordinates within the right arm. All observations (including future ones) are used to compute the smooth estimate of the switching variable.

are smoothed estimates since all the observations (including future ones) are used. A glance at the plots (qualitatively) shows that the repetitive nature of the motion is captured by the switches alternating over time.

Moveme Representation of Video Sequence :

In this experiment we show a decomposition in movemes of a video sequence. As before, we train our model for the right arm on a total of 4,610 frames, obtaining a codeword-based representation of the data. The codewords are clustered according to the procedure detailed in Section 5.4.4. We set the number of clusters to 15. This results in 10 visually distinct (to a human) motions. The reason for the lower number is that some of the motions are represented by multiple clusters. By observing the fragment of video contributing to each cluster, we manually assign a textual label to each cluster with the purpose of describing its content.

The vertical axis of Figure 5.17 lists in colors the subset of movemes that appeared in sequence 8. The top half reports the raw trajectories of the four coordinates in time. At the bottom, the height of each bar corresponds to a different (to a human) moveme.

Clustering of Right Arm Evolution into Movemes X_hand X_elbow Y_hand Y_elbow −20 0 20 Movemes Spread Cross Reach Pull Cross Rotate Raise 50 100 150 200 250

Figure 5.17: Decomposition into Movemes. We show a moveme-based representation of the motions in sequence 8. At the top we report the trajectories of the right arm’s four coordinates. The bottom plot shows the movemes over time. Height indicates the identity of the moveme, as perceived by a human. Names for the movemes are pro- vided on the vertical axis. The 10th bar, which represent an “arm crossing”, is (to a human) visually similar to the 2nd, 6th, and 14th (they have the same height). However, its color is different since the model returned a different moveme/cluster for its representation.

Notice how the 2nd, 6th, 10th, and 14th bars have identical height, indicating that a human would consider all four motions as a “crossing of the arm in front of the chest”. Yet, bar number 10 carries a different color, since our algorithm produced multiple movemes (two in this case) representing the same motion.

In document Towards automatic discovery of human movemes (Page 135-139)