4. A Computational Model of Acoustic Packaging
4.2. Related Work
4.2.2. Temporal Visual Segmentation
In computer vision systems, the temporal segmentation of human actions is often defined implicitly by the activation of specialized classes, for example the start and end point of a hand trajectory. Acoustic packaging, in contrast, requires a method that does not rely on detecting classes that depend on a high-level interpretation of the sensory information. This way, acoustic packaging remains consistent with the limited amount of world knowledge that is available to preverbal infants. The related work on event and action segmentation reviewed in Chapter 2 suggests that motion features and the capability to detect visual changes are important cues for action segmentation. These features are also commonly used in systems with a slightly different aim, namely the segmentation of video sequences; on a technical level, however, the problem is comparable. Previous work in this area considers different ranges of motion segmentation, such as detecting scene cuts in movies or segmenting group actions in meeting recordings. In the following, we group the relevant approaches according to their segmentation goal and look at properties such as online processing or the capability of handling multimodal input (see Table 4.1).
Scene Cut Detection
The problem of finding scene cuts in video sequences is often addressed with the goal of summarizing or indexing the video. The idea is to extract a sequence of still images from the video in which each image represents the salient content of a certain video segment. These images are called key frames. Some work focuses on detecting structure in the video that results from video editing, such as scene cuts (Gargi et al., 1998; Janvier et al., 2006). Other work focuses on selecting key frames within shots delimited by scene boundaries (Wolf, 2002). In the latter approach, key frames are selected at the local minima of a motion feature based on optical flow. In other words, discontinuities are detected in the feature stream. While some approaches are capable of online processing (Wolf, 2002), others are designed for offline processing (Janvier et al., 2006). What they have in common is that they all use the visual modality only.
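The idea of selecting key frames at local minima of a motion feature can be illustrated with a minimal sketch. It assumes that a scalar motion value per frame (for example, the mean optical flow magnitude) has already been computed. The function name `select_key_frames` and the example values are hypothetical and are not taken from Wolf (2002).

```python
from typing import List, Sequence


def select_key_frames(motion: Sequence[float]) -> List[int]:
    """Return the indices of local minima in a per-frame motion signal.

    Assumption: `motion` holds one precomputed scalar per frame.
    A frame counts as a minimum if it is strictly lower than its
    predecessor and not higher than its successor, so a flat valley
    yields only its first frame.
    """
    key_frames = []
    for i in range(1, len(motion) - 1):
        if motion[i] < motion[i - 1] and motion[i] <= motion[i + 1]:
            key_frames.append(i)
    return key_frames


# Hypothetical motion values for nine frames.
motion = [0.9, 0.4, 0.1, 0.5, 0.8, 0.3, 0.2, 0.2, 0.7]
print(select_key_frames(motion))  # [2, 6]
```

Because each decision only needs the frame before and the frame after the candidate, such a detector can run online with a delay of one frame.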
Action and Activity Segmentation
In many approaches, developments in action segmentation are motivated by the recognition of predefined classes, for example in Davis and Bobick (1997) and Schuldt et al. (2004). Even if generic features are used, these systems need to be trained on human-labeled data (Hunter, 2009). However, if the goal is to create a system inspired by developmental learning, the categories and the structure of the action cannot be assumed a priori. Following the idea of analyzing video sequences without pre-trained classes, Rui and Anandan (2000) present an approach that is more complex than scene cut detection but has a similar basis. This approach specifically aims at segmenting human actions into key poses. A key pose is understood as the boundary of a video segment that captures important changes in human action. The key poses are detected by searching for temporal discontinuities in features based on optical flow, which are assumed to carry information about the movements of the human in the image. The authors discuss potential applications such as summarizing video sequences, action recognition and segmentation, and selecting key frames in video compression tasks. More recently, Buchsbaum et al. (2011) described a different approach. Here, videos displaying human activity are segmented into spatio-temporal features called visual words, which are subsequently clustered. The authors were able to show that changes in the distribution of these clusters correlate with human boundary judgments. Additionally, their model is able to identify further structure based on the statistical occurrence of visual words in similar contexts, which is especially the case for repeated actions. Systems for finding action structure have also been used to analyze parent-infant interaction. In Nagai and Rohlfing (2009), a visual saliency model is used to detect structural information in parent-infant interaction. With a view to designing developmental capabilities for action learning on robots, Nagai and Rohlfing showed that their model is able to detect the initial and final states of actions and to highlight properties of objects.
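The notion of a change in the distribution of visual words can be made concrete with a small sketch. It is a deliberately simplified illustration and not the model of Buchsbaum et al. (2011): it assumes that each frame has already been assigned a single integer word label, and it compares the word histograms of two adjacent windows using the total variation distance. The function names, the window size, and the label sequence are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def histogram(words: Sequence[int]) -> Dict[int, float]:
    """Relative frequency of each word label in a window."""
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


def total_variation(p: Dict[int, float], q: Dict[int, float]) -> float:
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def distribution_change(words: Sequence[int], window: int) -> List[float]:
    """Score each position t by the distance between the word
    distributions in the `window` frames before t and after t.
    The first score belongs to position t = window."""
    scores = []
    for t in range(window, len(words) - window + 1):
        left = histogram(words[t - window:t])
        right = histogram(words[t:t + window])
        scores.append(total_variation(left, right))
    return scores


# Hypothetical label sequence: words {0, 1} dominate first, then {2, 3}.
words = [0, 0, 1, 0, 0, 1, 2, 2, 3, 2, 2, 3]
window = 3
scores = distribution_change(words, window)
boundary = window + scores.index(max(scores))
print(boundary)  # 6
```

In this toy sequence the score peaks where the set of occurring words changes completely, which is the kind of position that would be proposed as a segment boundary.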
Summary
As outlined above, both groups of approaches, scene cut detection and action segmentation, have in common that they detect discontinuities in features derived from the video sequence. However, as can be seen in Table 4.1, most of the work focuses on one modality exclusively and is rarely online capable. This is especially the case as the complexity of the method increases. Furthermore, most approaches use points in time as the only representation of their segmentation results. Thus, there is no explicit representation of the segments found that could be interpreted further.

Reference                  Segmentation Goal                                               Online  Predefined Classes  Temporal Representation
Buchsbaum et al. (2011)    Human actions from multiple corpora including everyday actions  no      no                  intervals
Davis and Bobick (1997)    Classes of aerobic actions                                      yes     yes                 intervals
Janvier et al. (2006)      Scene cuts in broadcast video                                   no      no                  boundaries
Nagai and Rohlfing (2009)  Initial and final states of a manipulation task                 yes     no                  boundaries
Rui and Anandan (2000)     Key poses of household chores                                   ?       no                  boundaries
Schuldt et al. (2004)      Classes of human actions (e.g. running, boxing)                 ?       yes                 intervals
Wolf (2002)                Key frames in movies                                            yes     no                  boundaries

Table 4.1.: Overview of visual motion segmentation approaches.
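The difference between a boundary representation and an interval representation can be sketched as follows. The example assumes that boundary time stamps in seconds are already available. The `Segment` class and the function `boundaries_to_segments` are hypothetical names introduced only for illustration and do not correspond to any of the reviewed systems.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Segment:
    """An explicit interval that later processing stages can refer to."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def boundaries_to_segments(
    boundaries: Sequence[float], stream_start: float, stream_end: float
) -> List[Segment]:
    """Turn boundary time stamps into consecutive, non-overlapping segments.

    Boundaries outside the open range (stream_start, stream_end) are ignored.
    """
    inner = sorted(b for b in boundaries if stream_start < b < stream_end)
    points = [stream_start] + inner + [stream_end]
    return [Segment(a, b) for a, b in zip(points, points[1:])]


# Hypothetical boundaries (in seconds) within a four-second stream.
segments = boundaries_to_segments([2.5, 1.0], stream_start=0.0, stream_end=4.0)
for segment in segments:
    print(segment.start, segment.end, segment.duration)
```

With explicit segments, properties such as duration or temporal overlap with events in another modality can be attached to each interval, which is not directly possible with isolated points in time.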