Video Summarisation by Various Modalities

Chapter 2: Literature Review

2. Overview

2.3. Automatic Video Summarisation:

2.3.2. Video Summarisation by Various Modalities

Most of the methods described earlier achieved good results, but a perfect video summary can only be produced by extracting the semantic content of the video. Since those methods are closely tied to the low-level visual features of videos, they are unlikely to reflect the semantic content fully. A different research strand therefore involves other modalities (in addition to visual content) in the summarisation process as potential sources of information.

2.3.2.1. Audio Data

Accordingly, Bhatt et al. (2009) adopted auditory features alone in an attempt to generate dynamic video skims. After partitioning the input audio into one-second segments and removing the DC component from each of them, every segment was further divided into frames of 320 audio samples (20 ms). Each segment was then classified as silence, environmental noise, speech, music, or music with speech. First, silent regions were detected by measuring the short-time energy of each segment, computed as the sum of squares of the signal samples; segments whose short-time energy fell below a predefined threshold were identified as silent. Next, non-silent segments were tested for environmental noise using short-time entropy and the modified autocorrelation peak values. The remaining segments were then separated into speech-only and non-speech sounds (the latter further divided into music only and music with speech) using a number of auditory features, including the low short-time energy ratio, Mel-Frequency Cepstral Coefficients (MFCC) and the variance of log energy. A Gaussian Mixture Model (GMM) and fuzzy decision trees were adopted for training purposes. Finally, based on the category identified for each audio segment, video abstracts appropriate to the particular video genre were generated. However, the proposed algorithm can fail to include much visually and semantically rich video content in the final summary, simply because the corresponding video segments are silent.
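The silence test described above can be sketched as follows. This is a minimal illustration rather than the implementation of Bhatt et al. (2009): the 16 kHz sampling rate is inferred from the stated 320 samples per 20 ms, while the function names, the use of the mean frame energy as the segment score and the threshold value are assumptions made for the example.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumption: implied by 320 samples lasting 20 ms
FRAME_LEN = 320       # samples per frame, as stated in the text


def short_time_energy(segment, frame_len=FRAME_LEN):
    """Return the energy (sum of squared samples) of each full frame."""
    segment = np.asarray(segment, dtype=float)
    segment = segment - segment.mean()  # remove the DC component
    n_frames = len(segment) // frame_len
    frames = segment[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sum(frames ** 2, axis=1)


def is_silent(segment, threshold, frame_len=FRAME_LEN):
    """Flag a segment as silent when its mean frame energy is below threshold.

    Using the mean over frames is an assumption; the source only states that
    segment energy is compared against a predefined threshold.
    """
    return bool(short_time_energy(segment, frame_len).mean() < threshold)


# Two synthetic one-second segments: faint noise and a full-scale 440 Hz tone.
rng = np.random.default_rng(0)
quiet = 0.001 * rng.standard_normal(SAMPLE_RATE)
tone = np.sin(2 * np.pi * 440 * np.arange(SAMPLE_RATE) / SAMPLE_RATE)
labels = [is_silent(s, threshold=0.1) for s in (quiet, tone)]
```

In practice the threshold would have to be tuned to the recording level of the material, which is one reason a purely energy-based test can discard quiet but visually important segments.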

2.3.2.2. Audio-Visual Data

Audio analysis was also the basis for another multimodal technique, in which keyframes were selected through semantic analysis of shots, scenes and frames in a holistic structure (You et al., 2009). A video was first segmented into scenes using audio features, on the assumption that the audio track of a scene remains consistent in its signal characteristics for a prolonged period. The non-silent audio clips were then classified into one of five predefined genres. Next, the scenes were segmented into shots using the luminance histogram. Each audio clip was weighted according to the class to which it belonged, and the average score of all clips in a scene was taken as the semantic audio importance of that scene. All the representative histograms of a scene were then compared in order to classify its shots into two groups, related and unrelated; a shorter scene with more unrelated content was considered better. The size of a face or text, together with the region of the frame it occupied, was used to produce the importance index for that frame. Additionally, the number of occurrences of detected faces or text in a single scene provided the text and face saliency value for that scene. Affective features (pitch, loudness, motion speed and luminance) were measured in each scene and compared with those of the whole sequence to indicate the semantic relevance of that scene to the overall sequence. Shots were further scored semantically using the semantic audio importance and the face and text importance described above. In addition, other factors, including camera motion, object motion and temporal motion coherence, were taken into account to build a semantic shot importance model. However, existing video processing techniques for face and text recognition still suffer from shortcomings in accuracy and scalability, which can directly affect the performance of this approach.
Furthermore, the results produced by this approach are highly dependent on the audio-visual quality and noise level. For instance, a noisy audio environment or cluttered scenes can undermine the accuracy and performance of face recognition systems (Herranz and Martinez, 2008).
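The semantic audio importance of a scene, described above as the average of the class-dependent weights of its audio clips, can be illustrated with a short sketch. The class labels and weight values below are hypothetical: the text states only that there are five classes and that clips are weighted by class, not what the classes or weights are.

```python
from statistics import mean

# Hypothetical labels and weights, chosen only for illustration.
CLASS_WEIGHTS = {
    "class_a": 1.0,
    "class_b": 0.75,
    "class_c": 0.5,
    "class_d": 0.25,
    "class_e": 0.0,
}


def scene_audio_importance(clip_classes, weights=CLASS_WEIGHTS):
    """Average the class weights of all classified clips in one scene."""
    if not clip_classes:
        return 0.0  # assumption: a scene with no classified clips scores zero
    return mean(weights[label] for label in clip_classes)


# A scene containing three classified clips.
score = scene_audio_importance(["class_a", "class_c", "class_c"])
```

Because the score is a plain average, a few misclassified clips in a noisy scene shift it directly, which is consistent with the sensitivity to audio quality noted above.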

In another multimodal technique based on audio-visual features, colour and motion features were analysed together with MFCC features of the audio signal to generate the video abstracts (Jiang et al., 2000). Initially, the entire video was segmented into a number of one-second temporal partitions, and a colour histogram was calculated for each frame. The histograms were then averaged over each segment to produce a reference histogram for that partition. Subsequently, motion features were computed for each frame using the SIFT algorithm, and the Euclidean distance between these features was used to compute a single motion vector for each video segment. For auditory analysis, MFCC features were calculated for each video segment. However, given the vulnerability of this type of feature to noisy conditions, tensor subspace analysis was adopted to extract audio characteristics from one-second audio frames. Afterwards, the dynamic time warping algorithm was used to calculate the similarity between two audio segments. In order to perform segmentation, a dissimilarity matrix was computed in which each element represented the pairwise dissimilarity between two segments. The values calculated for colour, motion and sound in each segment were then normalised and adaptively weighted so that they could be fused into a single value. In the final stage, a Fuzzy C-Means clustering algorithm, alongside a maximum likelihood estimation approach, was employed to cluster the video segments in an optimal manner. Finally, the video segments closest to the cluster centroids were extracted and inserted into the video summary.
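The comparison of audio segments by dynamic time warping and the construction of the pairwise dissimilarity matrix can be sketched as below. This is the textbook form of the algorithm, given under assumptions: each segment is represented as an array of per-frame feature vectors, the local cost is the Euclidean distance, and the function names are illustrative. The tensor subspace analysis used in the reviewed work to obtain the features is not reproduced.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of feature vectors.

    a has shape (n, d) and b has shape (m, d); the local cost is the
    Euclidean distance between frames (an assumption for this sketch).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i, j] = local + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return float(cost[n, m])


def dissimilarity_matrix(segments, distance=dtw_distance):
    """Symmetric matrix whose (i, j) entry compares segment i with segment j."""
    k = len(segments)
    matrix = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            matrix[i, j] = matrix[j, i] = distance(segments[i], segments[j])
    return matrix


# Three toy segments with one-dimensional features per frame.
seg_a = [[0.0], [1.0], [2.0]]
seg_b = [[0.0], [1.0], [1.0], [2.0]]  # seg_a with one frame repeated
seg_c = [[1.0], [2.0], [3.0]]
D = dissimilarity_matrix([seg_a, seg_b, seg_c])
```

The toy data show why time warping is attractive here: a segment that differs from another only by a repeated frame is at distance zero from it, whereas a frame-by-frame comparison would require equal lengths.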

2.3.2.3. Audio-Visual-Textual Data

In another multimodal summarisation technique, the saliency of auditory, visual and textual information was analysed separately and then integrated into a multimodal saliency curve (Evangelopoulos et al., 2009; 2013). For the audio saliency detection task, the primary objective was to build a data-driven, time-dependent function able to vary with the importance of the auditory sensory information. The audio frames were therefore decomposed into a set of equally spaced frequency bands (frequency components), and each band was modelled by an AM-FM signal. Gabor filters were used to perform band-pass filtering, while the Teager-Kaiser energy operator and the energy separation algorithm were adopted to decompose each signal into instantaneous energy, amplitude and frequency signals. However, only the frequency component that dominated the signal spectrum (the one producing the maximum energy response over the time frame) was retained as the dominant modulation component, and it provided the basis for a feature vector comprising instantaneous amplitude, frequency and source energy. Each feature was then normalised over a long-term window to scalar values that sum to one, and the results formed a one-dimensional temporal saliency map. For visual analysis, the frame pixels were treated as voxels whose saliency was analysed on the basis of their intra-feature, inter-scale and inter-feature interactions. Each frame, as a volume, was decomposed into three conspicuity volumes (intensity, colour and orientation), after which each volume was further decomposed into multiple scales representing a Gaussian volume pyramid. Intensities were then calculated from the difference between the RGB value of a point and the average value of the surrounding region, and colour opponent theory was used to generate a colour conspicuity score. Finally, orientation was computed using spatiotemporal steerable filters tuned to respond to a moving stimulus. The outcome was a set of updated multi-scale volumes, and the saliency of each point was the average of all volumes over all features and scales. At the end, a single saliency value for each frame was generated by multiplying the normalised feature scores by the saliency value calculated in the previous step. In order to evaluate the textual content, forced segmentation of the audio stream was performed as a pre-processing stage, using speech transcripts generated by the Sonic ASR (automatic speech recognition) system and phone-based acoustic models. The timestamps in the provided subtitles indicate the rough location of the text in the audio stream and were used to initialise the forced segmentation procedure. The time-aligned transcripts were then analysed using a decision-tree-based probabilistic tagger to carry out part-of-speech (POS) tagging, and scores were assigned to the POS tags in descending order for proper nouns, common nouns, noun phrases, adjectives, verbs and the remaining parts of speech. Each frame could therefore be scored according to its textual saliency. In the last step, the outcomes from the different modalities were integrated to produce a single, composite saliency curve. For this purpose, an intra-model fusion scheme was adopted in which each individual saliency feature was normalised to a value in [0, 1] and weighted based on its variance. The most salient audio and video sub-clips, up to a predefined skimming percentage, were then chosen for inclusion in the final summary.
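As an illustration of the audio stage described above, the discrete Teager-Kaiser energy operator can be written in a few lines. The sketch covers only the operator itself, in its standard discrete form; the Gabor filterbank, the energy separation algorithm and the selection of the dominant band are not reproduced, and the function name is an assumption.

```python
import numpy as np


def teager_kaiser_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The output is two samples shorter than the input because the first and
    last samples have no neighbour on one side.
    """
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]


# For a pure tone A*cos(omega*n + phi) the operator returns the constant
# A^2 * sin(omega)^2, so it tracks both amplitude and frequency.
amplitude, omega = 2.0, 0.3
n = np.arange(200)
tone = amplitude * np.cos(omega * n + 0.1)
psi = teager_kaiser_energy(tone)
expected = amplitude ** 2 * np.sin(omega) ** 2
```

The dependence of the output on both amplitude and frequency is what allows an energy separation step to recover instantaneous amplitude and frequency from the operator's responses.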
The proposed approach can produce some impressive results for a number of video categories. However, its performance degrades when the fluctuation in aural or visual features remains minimal over the course of the video.
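The fusion and selection steps can be sketched as follows, which also makes the limitation just mentioned concrete. The sketch is built on assumptions: min-max scaling is used for the normalisation to [0, 1], the weights are taken to be proportional to each curve's variance (the text says only that weighting is "based on" variance, so the direction is a guess), and the function names are illustrative rather than those of Evangelopoulos et al.

```python
import numpy as np


def minmax_normalise(curve):
    """Scale a saliency curve to [0, 1]; a flat curve maps to all zeros."""
    curve = np.asarray(curve, dtype=float)
    span = curve.max() - curve.min()
    if span == 0:
        return np.zeros_like(curve)
    return (curve - curve.min()) / span


def fuse_saliency(curves):
    """Weighted sum of normalised curves.

    Assumption: weights are proportional to the variance of each curve.
    """
    normed = np.array([minmax_normalise(c) for c in curves])
    variances = normed.var(axis=1)
    if variances.sum() == 0:
        weights = np.full(len(normed), 1.0 / len(normed))
    else:
        weights = variances / variances.sum()
    return weights @ normed


def select_frames(fused, skim_fraction):
    """Indices of the most salient frames, kept in temporal order."""
    k = max(1, int(round(skim_fraction * len(fused))))
    top = np.argsort(fused)[::-1][:k]
    return np.sort(top)


audio = [0.0, 0.0, 1.0, 0.0]
visual = [0.0, 1.0, 1.0, 0.0]
text = [2.0, 2.0, 2.0, 2.0]  # no fluctuation at all
fused = fuse_saliency([audio, visual, text])
chosen = select_frames(fused, skim_fraction=0.5)
```

Under these assumptions a modality whose saliency never fluctuates, such as `text` here, receives zero weight and contributes nothing to the selection; if every modality is flat, all frames tie and the choice becomes arbitrary.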

In document User-centred video abstraction (Page 38-41)