Video Summarization via Temporal/Spatial Compression

2.2 Automated Video Summarization

2.2.1 Video Summarization via Temporal/Spatial Compression

2.2.1.1 Time Compressed Video

2.2.1.1.1 Speeding up video One intuitive way of making a compact video surrogate that covers all of the information in the video is simply to speed up the video clip. However, increased playback speed often reduces comprehension because of audio distortion (i.e., a degradation of the speech signal) and a processing overload of short-term memory. Compression of this kind is therefore limited to a maximum compression factor of 1.5-2.5, depending on the particular program genre and speech rate (Heiman et al., 1986), beyond which the speech audio becomes disturbing and incomprehensible.

2.2.1.1.2 Dropping short sequences One practical way to time-compress the audio is to remove redundant information from the speech signal. The sampling method drops short segments from the speech signal at regular intervals. For example, given an original sequence of segments {1,2,3,4,5,6,7,8,9} of 50 milliseconds each, segments {2,4,6,8} can be dropped. By dropping alternate chunks of speech from the original signal, 2x compression can be achieved. Unfortunately, this results in an increase in pitch, making the audio less comprehensible and less enjoyable.
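The sampling method described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the cited work: the function name `drop_alternate_chunks`, its parameters, and the representation of audio as a plain list of samples are all assumptions made for the example.

```python
def drop_alternate_chunks(samples, sample_rate, chunk_ms=50):
    """Keep chunks 1, 3, 5, ... and drop chunks 2, 4, 6, ... (hypothetical helper).

    samples     -- sequence of audio samples (any values)
    sample_rate -- samples per second
    chunk_ms    -- chunk length in milliseconds (50 ms in the text's example)
    """
    chunk_len = int(sample_rate * chunk_ms / 1000)
    if chunk_len < 1:
        raise ValueError("chunk length must cover at least one sample")
    # Cut the signal into consecutive fixed-length chunks.
    chunks = [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]
    # Keep every other chunk, starting with the first one.
    kept = chunks[0::2]
    # Concatenate the kept chunks back into one signal.
    return [s for chunk in kept for s in chunk]


# Nine 50 ms chunks at an assumed 1000 samples per second (450 samples in total).
original = list(range(450))
compressed = drop_alternate_chunks(original, sample_rate=1000)
```

With nine chunks, five are kept and four are dropped, so the compression for this short example is 450/250 = 1.8; the ratio approaches 2x as the signal gets longer.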

A variant of the sampling method is dichotic sampling, where different audio segments are played to each ear. For example, given an original sequence of segments {1,2,3,4,5,6,7,8,9} of 50 ms each, segments {1,3,5,7,9} are played to the left ear, and segments {2,4,6,8} are played to the right ear. Dichotic sampling takes advantage of the auditory system’s ability to integrate information from both ears (Arons, 1997), which increases the intelligibility and comprehension of the compressed audio compared with the standard sampling method (Gerber and Wulfeck, 1977).
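The left/right assignment in dichotic sampling can be illustrated as follows. This is only a sketch of the chunk routing; the function name `dichotic_split` is hypothetical, and details such as how the two channels are aligned in time during playback are not specified in the text and are left out.

```python
def dichotic_split(chunks):
    """Route odd-numbered chunks to the left ear and even-numbered ones to the right.

    chunks -- list of audio chunks in their original order (chunk 1 first)
    Returns a (left, right) pair of chunk lists.
    """
    left = chunks[0::2]   # chunks 1, 3, 5, ...
    right = chunks[1::2]  # chunks 2, 4, 6, ...
    return left, right


# Chunk labels from the example in the text.
left, right = dichotic_split([1, 2, 3, 4, 5, 6, 7, 8, 9])
# left  is [1, 3, 5, 7, 9]
# right is [2, 4, 6, 8]
```

Unlike the standard sampling method, no chunk is discarded here: the segments that would have been dropped are presented to the other ear instead.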

2.2.1.1.3 Pause shortening or removal In addition to speeding up the video and dropping short sequences, removing or shortening pauses can reduce playback time by a further 15%-20% without compromising content (Gan and Donaldson, 1988). Simply removing all pauses in speech results in speech that is “natural, but many people find it exhausting to listen to because the speaker never pauses for breath”, as stated in Neuburg (1978). There are two categories of pauses in speech: juncture pauses, averaging 500-1000 ms, which are under the talker's conscious control and usually occur at major syntactic boundaries; and hesitation pauses, averaging 200-250 ms, which are not under the talker's control (Minifie, 1974). Lass and Leeper (1977) suggested that, when time compressing speech, juncture pauses cannot be removed or shortened without interfering with comprehension. For example, Arons (1997) time compressed speech audio such that pauses are selectively shortened or removed. In particular, pauses shorter than 500 ms are removed, and pauses longer than 500 ms are shortened to 500 ms. With these thresholds, the speech audio is sped up while the listener retains cognitive processing time and a sense of the pace of the utterance.
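The threshold rule attributed to Arons (1997) above can be sketched as a simple pass over a pre-segmented signal. The sketch assumes the audio has already been labelled as alternating speech and pause segments (pause detection itself is not covered); the function name `shorten_pauses` and the `(kind, duration_ms)` representation are illustrative assumptions. The text does not say what happens to a pause of exactly 500 ms; the sketch keeps it unchanged.

```python
def shorten_pauses(segments, threshold_ms=500):
    """Remove pauses shorter than the threshold and cap longer ones at it.

    segments -- list of (kind, duration_ms) tuples, kind being "speech" or "pause"
    Returns a new list of segments with the rule applied.
    """
    result = []
    for kind, duration_ms in segments:
        if kind == "pause":
            if duration_ms < threshold_ms:
                continue  # short pause: drop it entirely
            duration_ms = threshold_ms  # long pause: shorten to the threshold
        result.append((kind, duration_ms))
    return result


# Made-up example durations in milliseconds.
segments = [
    ("speech", 2000),
    ("pause", 220),   # shorter than 500 ms: removed
    ("speech", 1500),
    ("pause", 900),   # longer than 500 ms: shortened to 500 ms
    ("speech", 1000),
]
shortened = shorten_pauses(segments)
total_before = sum(d for _, d in segments)   # 5620 ms
total_after = sum(d for _, d in shortened)   # 5000 ms
```

In this made-up example the playback time drops from 5620 ms to 5000 ms, about 11%; the actual saving depends on how much of the recording consists of pauses.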

However, time compression by speeding up and/or shortening or removing pauses, even when these techniques are used together, can hardly achieve compaction rates of more than 2:1 (Arons, 1997). In many real-life video retrieval or audio/video summarization applications, a compaction rate of 10:1 or more is desirable.

To further reduce the playback time of the audio, skimming techniques can be used. For instance, an audio clip that takes 60 seconds to play at normal speed may take just 30 seconds when time compressed, but only 5 or 10 seconds with more aggressive skimming. The following paragraphs review existing skimming techniques for summarizing videos.

2.2.1.2 Systematic Subsampling Video

Another simple and straightforward method for creating video summaries is to increase the frame rate across the whole video. A computationally expensive way to obtain a two-fold video speed-up is to render the frames at twice the original frame rate. This puts a burden on the client’s CPU, which has to decode twice as many frames in the same amount of time.

On the other hand, fast forward, a common summarization approach used in many video retrieval systems, is performed by taking every Nth frame from the original video and concatenating the selected frames into a summary that is played at normal speed.
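Taking every Nth frame amounts to a strided selection over the frame sequence. The following sketch assumes the video is already available as an in-memory list of decoded frames, which is a simplification; the function name `fast_forward` and the choice to start with the first frame are assumptions for illustration.

```python
def fast_forward(frames, n):
    """Return every n-th frame, starting with the first (hypothetical helper).

    frames -- sequence of decoded video frames, in display order
    n      -- keep one frame out of every n (n >= 1)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return frames[::n]


# Frame indices stand in for real frames; n = 64 is an illustrative value.
summary = fast_forward(list(range(640)), n=64)
# summary holds 10 frames: indices 0, 64, 128, ..., 576
```

Because the summary is played back at the normal frame rate, its duration is roughly 1/N of the original, with no extra decoding load per second of playback.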

Fast forward with audio is equivalent to the time-compressed video by sampling discussed above: every Nth image frame is extracted from the visual stream, and the audio stream is time-compressed at the same compaction rate. This approach cannot decrease the viewing time by more than five-fold without seriously degrading the audio coherence.

For fast forward with no audio, the audio stream is not provided with the visual fast forward. Wildemuth et al. (2003) reported on a study of the use of fast forwards for digital video, and recommended a default fast forward speed of 1:64 of the original video. Although this approach can achieve a much higher compaction rate than fast forward with audio, it can lead to severe coherence degradation and discomfort for the viewer.

Instead of taking every Nth frame of the video, video summarization can be performed by systematic subsampling: extracting fixed-duration excerpts of the original video at fixed intervals. For example, select the first 10 seconds of the video, skip the next 50 seconds, select another 10 seconds, skip another 50 seconds, and so on. The selected 10-second segments can then be joined together to form a video summary and played back to the viewer at the original frame rate. Because it keeps and skips frames at fixed intervals, this subsampling method will likely produce discontinuities at the interval boundaries and exclude essential information from the summary (Wactlar et al., 1996). To improve the quality of summaries based on subsampling techniques, a windowing function or smoothing filter, such as a cross-fade, can be applied at the junctions of the selected segments (Omoigui et al., 1999).
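The keep/skip schedule in the example above can be computed directly from the video duration. This is a minimal sketch: the function name `subsample_intervals` and its parameters are hypothetical, times are in seconds, and the cross-fade at segment junctions is not modelled.

```python
def subsample_intervals(duration_s, keep_s=10, skip_s=50):
    """List the (start, end) excerpts kept by systematic subsampling.

    duration_s -- total length of the original video in seconds
    keep_s     -- length of each kept excerpt (10 s in the text's example)
    skip_s     -- length of each skipped stretch (50 s in the text's example)
    """
    if keep_s <= 0 or skip_s < 0:
        raise ValueError("keep_s must be positive and skip_s non-negative")
    intervals = []
    start = 0
    while start < duration_s:
        # The last excerpt is truncated if the video ends inside it.
        end = min(start + keep_s, duration_s)
        intervals.append((start, end))
        start += keep_s + skip_s
    return intervals


# A 3-minute video: keep 10 s out of every 60 s.
kept = subsample_intervals(180)
# kept is [(0, 10), (60, 70), (120, 130)]
summary_length = sum(end - start for start, end in kept)  # 30 seconds
```

With these example values the summary runs 30 seconds for a 180-second video, a compaction rate of 6:1, regardless of what the skipped stretches contain.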

Although the summaries created by systematic subsampling are likely subject to exclusion of important segments, they are easy and inexpensive to implement. Therefore, subsampling is often adopted as the default or baseline method in evaluating other automated video summarization techniques (Christel et al., 1998).

2.2.1.3 Split-screen Display

Instead of doing time compression and systematic subsampling, some summarization techniques reduce video playback time by displaying multiple video streams at the same time.

[...] presented a summarization approach where the most important and non-redundant shots selected to appear in the summary were dynamically accelerated, optimally grouped into sets of four, and presented simultaneously using a split-screen display, so as to maximize the content included in the summary per time unit. However, the resulting summaries greatly increased the viewers’ cognitive load and did not rate highly with the evaluation campaign assessors in terms of ease of use.

In document Multi-modal surrogates for retrieving and making sense of videos : is synchronization between the multiple modalities optimal? (Page 54-59)