
Chapter 2: Literature Review

2. Overview

2.3. Automatic Video Summarisation:

2.3.3. Domain-specific Video Summarisation

Another category of video summarisation techniques comprises domain-specific methods, which generate summaries for particular video genres by exploiting the features and attributes exclusive to those genres. These methods are frequently employed to summarise sport videos and are mainly based on the fusion of low-level and object-level features in order to identify the most valuable events (Ekin et al., 2003; Zhang and Chang, 2002).

In earlier work, the low-level features of the video were analysed for event-detection purposes, while more recent studies employ ontology-based approaches (Bertini et al., 2005). Accordingly, a formal ontology reasoning approach was proposed to produce semantic abstractions of sport videos (Ouyang and Liu, 2013). In this approach, sport videos are annotated with ontologies in order to build a three-level hierarchical sports abstraction (keyframes, representative shots and video clips). To build the knowledge infrastructure required for semantic analysis, the sports video model was divided into an upper ontology and a domain-specific ontology. While the first represented the general features of basic sport videos, the second was adopted to depict the details of general concepts. An XML scheme was utilised to describe and reason over the video ontology. An interactive keyframe selection technique was adopted to generate static video abstracts. While the semantic information of shots and keyframes was obtained directly from the users’ annotations, the semantic results for the representative scenes were derived by the inference engine. The proposed algorithm requires a great deal of user involvement, which can limit its scalability.

In an approach proposed for summarising documentary movies, the generated summaries were represented as a set of contiguous audio-visual segments that were homogeneous in a cross-media space (Perez-Daniel et al., 2014). Adopting the Data Cube concept (Gray et al., 1997), several partitions of the same data set could be generated by employing various possible combinations of the audio-visual feature space. To describe the visual content, a number of colour-based MPEG-7 features, including the Scalable Colour Descriptor and the Colour Structure Descriptor, were adopted alongside a texture-based feature (a pyramid of blocks described by histograms of oriented gradients). In addition, MFCC and chroma vectors were utilised to represent the auditory information. A consensus clustering algorithm capable of incorporating various combinations of dimensions of the description space was utilised to build such partitions. Consensus clustering is a procedure that merges the agreements among several clusterings of the same data set obtained with different dimensions. The median frame of each cluster was chosen for insertion into the summary. Despite some promising results, the practicality of this algorithm depends on the availability of MPEG-7 data. Furthermore, the presence of aural noise can progressively degrade the quality of the final summary.
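The idea of merging agreements across several clusterings can be illustrated with a co-association matrix, a common way of building a consensus. The sketch below is only an illustration and is not claimed to be the procedure of Perez-Daniel et al. (2014): the function names, the 0.5 agreement threshold and the connected-components grouping are all assumptions, and each partition is assumed to be a list of cluster labels indexed by frame.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of items shares a cluster."""
    n = len(partitions[0])
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            agree = sum(1 for labels in partitions if labels[i] == labels[j])
            matrix[i][j] = agree / len(partitions)
    return matrix


def consensus_clusters(partitions, threshold=0.5):
    """Group items linked by a co-association above the (assumed) threshold."""
    matrix = co_association(partitions)
    unvisited = set(range(len(matrix)))
    clusters = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack, members = [start], [start]
        while stack:
            i = stack.pop()
            linked = [j for j in unvisited if matrix[i][j] > threshold]
            for j in linked:
                unvisited.remove(j)
                members.append(j)
                stack.append(j)
        clusters.append(sorted(members))
    return clusters


def median_member(cluster):
    """Pick the middle item of a sorted cluster as its representative frame."""
    return cluster[len(cluster) // 2]


# Three hypothetical partitions of four frames, e.g. from different feature subsets.
partitions = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
print(consensus_clusters(partitions))
```

Here frames 0 and 1 share a cluster in every partition and frames 2 and 3 in two out of three, so they form the two consensus groups; `median_member` then mirrors the selection of the median frame of each cluster for the summary.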

In contrast to visual methods, there has been an attempt to summarise sport videos using audio features, on the grounds that interesting events lead to changes in the speech excitement level (Otsuka et al., 2006). Accordingly, the percentage of excited speech in each audio segment is calculated alongside its energy level, enabling the system to compute the importance level of each video segment.
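As a rough illustration of how these two audio cues could be combined into a single importance value, consider the following minimal sketch. The linear weighting, the `weight` parameter and the assumption that frame energies are already scaled to [0, 1] are hypothetical and are not taken from Otsuka et al. (2006).

```python
def segment_importance(excited_flags, energies, weight=0.5):
    """Combine the excited-speech ratio and mean energy of one audio segment.

    excited_flags: one bool per audio frame, True if labelled excited speech
    energies: one energy value per audio frame, assumed scaled to [0, 1]
    weight: hypothetical mixing weight between the two cues
    """
    if not excited_flags or len(excited_flags) != len(energies):
        raise ValueError("expected two non-empty sequences of equal length")
    excited_ratio = sum(excited_flags) / len(excited_flags)
    mean_energy = sum(energies) / len(energies)
    return weight * excited_ratio + (1 - weight) * mean_energy


# A calm segment and an excited one, with made-up values.
calm = segment_importance([False, False, True, False], [0.2, 0.1, 0.3, 0.2])
excited = segment_importance([True, True, True, False], [0.8, 0.9, 0.7, 0.6])
print(calm < excited)
```

Segments could then be ranked by this value and the highest-ranked ones kept for the summary.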

Interestingly, in a combined model (Taskiran et al., 2006), both the textual content of movies and their audio characteristics were used for video abstraction. Using a speech recognition system, transcripts of the video are retrieved, and subsequently an inverted word index and a phrase glossary index are created. In this system, audio pauses rather than shot boundaries are used to segment the video. The importance score for each video segment is computed by applying information retrieval techniques. Each video segment is treated as a document, and both the term frequencies within a segment and the distribution of word pairs within it contribute to determining the importance of that segment. However, this type of summarisation technique does not generate satisfactory results when speech signals are noisy (Ngo et al., 2005). Moreover, the proposed algorithm is not applicable to silent videos.
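The notion of treating each segment as a document can be sketched with a standard tf-idf weighting. This is a generic information retrieval score used here as a stand-in: the exact formula of Taskiran et al. (2006), including its word-pair term, is not reproduced, and the function name and tokenised input are assumptions.

```python
import math
from collections import Counter


def score_segments(segments):
    """Return one tf-idf based importance score per transcript segment.

    segments: list of token lists, one list per pause-delimited segment
    """
    n_segments = len(segments)
    # Number of segments in which each term occurs at least once.
    doc_freq = Counter()
    for tokens in segments:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in segments:
        counts = Counter(tokens)
        total = sum(counts.values())
        score = 0.0
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_segments / doc_freq[term])
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical transcript segments after speech recognition and tokenisation.
segments = [["goal", "goal", "the"], ["the", "pass"], ["the"]]
print(score_segments(segments))
```

A term that occurs in every segment, such as "the" above, receives an idf of zero and therefore contributes nothing, so segments dominated by rare, repeated terms score highest.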

As opposed to audio-visual techniques, in a text-based approach sport video events are detected by analysing and aligning webcast text with broadcast video (Xu et al., 2008). After filtering out the stop words and the names of players, probabilistic latent semantic analysis is applied to cluster the webcast text into different categories. The words with the highest number of occurrences in each category are then chosen as keywords to represent the event types. Sentences containing these keywords are treated as text events. In order to synchronise the webcast text with the corresponding event in the video, a conditional random field model is employed to detect the start and end boundaries of the event. However, the proposed algorithm can only function in the presence of webcast data.
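The keyword selection and text event steps can be sketched as follows. The clustering itself is omitted: the sentences of one category are assumed to be given already, and the function names, the whitespace tokenisation and the sample sentences are hypothetical.

```python
from collections import Counter


def top_keywords(category_sentences, stop_words, k=1):
    """Return the k most frequent non-stop words of one text category."""
    counts = Counter(
        word
        for sentence in category_sentences
        for word in sentence.lower().split()
        if word not in stop_words
    )
    return [word for word, _ in counts.most_common(k)]


def text_events(sentences, keywords):
    """Keep the sentences that contain at least one keyword."""
    return [
        sentence
        for sentence in sentences
        if any(word in keywords for word in sentence.lower().split())
    ]


# Made-up webcast sentences; player names are filtered like stop words.
stop_words = {"a", "and", "by", "the", "smith"}
category = ["Smith shoots and scores a goal", "great goal by the striker"]
webcast = category + ["corner kick taken"]

keywords = top_keywords(category, stop_words)
print(keywords, text_events(webcast, keywords))
```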

In contrast to the previous method, a visual-oriented approach was proposed for football video summarisation using an improved algorithm for the detection of replay shots (Eldib et al., 2009). Shot boundary segmentation was carried out by detecting differences in the dominant colour pixel ratios and colour histograms. In the next phase, the shots were fed into the event detection engine, which examined them for logos (TV logos are now commonly shown as a visual effect before slow-motion shots), the score board and the goal mouth, and classified the shots. Finally, a rule-based classifier was used to detect interesting events. Nonetheless, this method can generate high-quality video summaries only in the presence of carefully produced replay shots.
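A minimal sketch of shot boundary detection from these two cues is given below. Frames are assumed to be already reduced to colour histograms (lists of bin counts), and the two thresholds, the index of the dominant colour bin and the rule of combining the cues with a logical OR are illustrative assumptions rather than the settings of the cited work.

```python
def histogram_difference(hist_a, hist_b):
    """Half the L1 distance between two normalised histograms, in [0, 1]."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(hist_a, hist_b)
    )


def dominant_colour_ratio(hist, dominant_bin):
    """Share of pixels falling into the dominant colour bin (e.g. grass)."""
    return hist[dominant_bin] / sum(hist)


def shot_boundaries(histograms, dominant_bin, hist_thresh=0.5, ratio_thresh=0.3):
    """Return indices of frames that start a new shot."""
    boundaries = []
    for i in range(1, len(histograms)):
        prev, cur = histograms[i - 1], histograms[i]
        hist_jump = histogram_difference(prev, cur) > hist_thresh
        ratio_jump = abs(
            dominant_colour_ratio(prev, dominant_bin)
            - dominant_colour_ratio(cur, dominant_bin)
        ) > ratio_thresh
        if hist_jump or ratio_jump:
            boundaries.append(i)
    return boundaries


# Two field-view frames followed by two close-up frames (made-up bin counts).
frames = [[90, 10, 0], [88, 12, 0], [10, 10, 80], [12, 8, 80]]
print(shot_boundaries(frames, dominant_bin=0))
```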

In another domain-specific approach, a summarisation method for basketball games was proposed based on monitoring the temporal changes in the score (Kim et al., 2005). For this purpose, a scoreboard region detection method was used, in which a text area detection algorithm was applied to identify the areas of an image with many vertical strokes. Only the regions which remained static for a second were then chosen as scoreboard candidates. In the next step, a number recognition algorithm was applied to the filtered result in an attempt to determine the score regions. Simultaneously, the video shots were classified into play shots and non-play shots based on the ratio of dominant coloured pixels. Finally, by defining semantic templates for exciting scenes, the importance score of the frames in which the score changed could be computed, and important shots were included in the video digest. Unsurprisingly, the availability of a scoreboard in the original video is a prerequisite for good performance of this algorithm.
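The score monitoring and shot classification steps can be sketched as follows, assuming the number recognition stage has already produced a score reading per sampled frame. The function names, the 0.5 threshold and the assumption that a high dominant colour ratio indicates a play shot are hypothetical.

```python
def score_changes(readings):
    """Return (timestamp, points_added) for every change in the score.

    readings: list of (timestamp_seconds, (home, away)) tuples, where the
    score is None whenever recognition failed on that frame
    """
    events = []
    last = None
    for timestamp, score in readings:
        if score is None:
            continue  # skip frames without a readable scoreboard
        if last is not None and score != last:
            events.append((timestamp, sum(score) - sum(last)))
        last = score
    return events


def is_play_shot(dominant_ratio, threshold=0.5):
    """Label a shot as a play shot when the court colour dominates (assumed)."""
    return dominant_ratio >= threshold


# Made-up readings: a two-point basket at 2 s and a three-pointer at 4 s.
readings = [(0, (0, 0)), (1, None), (2, (2, 0)), (3, (2, 0)), (4, (2, 3))]
print(score_changes(readings))
```

Each detected change could then be matched against a semantic template, for example by weighting larger point gains or changes that occur during play shots more heavily.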

Motion features at different video levels were the basis for a framework proposed to summarise surveillance videos (Sujatha et al., 2014). Initially, the original video was divided into a number of blocks, each containing a non-uniform number of segments. The optical flow was computed at frame level and then propagated to the segment and block levels. The frame motion was derived from the overall motion of the feature points in that particular frame, namely the points where strong derivatives were observed in two orthogonal directions. The motion entropy of each block was then obtained by computing the probability of possible motion in each segment, so that the most salient blocks, those with the highest motion activity, could be extracted for the final summary.
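One plausible reading of the block-level motion entropy is the Shannon entropy of the motion distribution over the segments of a block. The sketch below follows that reading under stated assumptions: the per-segment motion magnitudes are taken as given, and the formula, the function names and the choice of keeping the top-ranked blocks are not confirmed by Sujatha et al. (2014).

```python
import math


def motion_entropy(segment_motions):
    """Shannon entropy (in bits) of the motion distribution within one block.

    segment_motions: non-negative motion magnitude of each segment in the block
    """
    total = sum(segment_motions)
    if total == 0:
        return 0.0  # a block without any motion carries no information
    probabilities = [m / total for m in segment_motions if m > 0]
    return -sum(p * math.log2(p) for p in probabilities)


def select_blocks(blocks, k):
    """Return the indices of the k highest-entropy blocks, in temporal order."""
    ranked = sorted(
        range(len(blocks)),
        key=lambda i: motion_entropy(blocks[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Three made-up blocks: spread-out motion, a single burst, and no motion.
blocks = [[1, 1, 1, 1], [4, 0, 0, 0], [0, 0, 0]]
print(select_blocks(blocks, k=1))
```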

In document User-centred video abstraction (Page 41-44)