2.2 Automated Video Summarization
2.2.3 Text-based Video Summarization using Closed-captions or Transcripts
For news programs, instructional or presentation videos, and teleconferences, where the camera is fixed on the speaker for long periods, a large portion of the important information is contained in the audio stream. Therefore, an intuitive and practical approach to summarizing such videos is to analyze the speech transcripts (Christel et al., 1996). Closed captions are readily available for most broadcast videos, such as news programs. For other video genres, such as presentations and teleconferences, where closed captioning is not available, automatic speech recognition (ASR) techniques can be used to generate the speech transcript.
Agnihotri et al. (2001) presented a system for generating summaries of talk shows from the closed-caption text. The system extracts and analyzes the closed-caption text of talk show videos, uses cue words and domain knowledge of program structure to determine the boundaries of individual guest segments and commercial breaks, and then creates a program summary. The authors tested the system on seventeen hours of closed-caption data and evaluated it in terms of precision and recall. The system produces high-level summary information and a table of contents indexed by topic. Recall for finding the guests in a talk show was 93% (25 of the 27 guests in the talk shows were correctly identified), and no guest was incorrectly identified (precision of 100%).
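The reported figures follow directly from the standard definitions of precision and recall. The short sketch below recomputes them from the counts given above; the function name and argument names are illustrative and are not taken from Agnihotri et al. (2001).

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) computed from raw detection counts."""
    detected = true_positives + false_positives   # everything the system reported
    actual = true_positives + false_negatives     # everything that was really there
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Counts from the text: 25 of 27 guests found, no guest incorrectly identified.
precision, recall = precision_recall(
    true_positives=25, false_positives=0, false_negatives=2
)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

The two missed guests lower recall only; precision stays at 100% because no spurious guest was reported.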
Taskiran et al. (2002) proposed an algorithm, referred to as FREQ in Taskiran et al. (2006), to automatically generate video summaries from video transcripts. The FREQ algorithm generates summaries based on word-frequency, word co-occurrence, and dispersion scores derived from program segments. The video is first divided into segments at long inter-word pauses. The words in each segment are then scored using a method related to tf-idf, and each segment is scored by summing the scores of all the words it contains. The log-likelihood ratio is used to detect significant co-occurring words in the program and thereby identify important phrases. To manage the tradeoff between detail and coverage of the summaries, a measure of similarity dispersion over the whole video program was derived: dispersion is small when the summary segments are clustered in one part of the video and large when they are distributed uniformly across it. In each iteration of the greedy segment-selection algorithm, the segment yielding the greatest increase in the dispersion value of the current summary is added to the summary, until the summarization ratio of 0.1 is reached.
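The pipeline described above can be illustrated with a minimal sketch. It is not the FREQ algorithm itself: the exact weighting, the co-occurrence step, and the dispersion measure are not specified here, so the sketch uses a plain tf-idf weight (treating each segment as a document) and replaces the dispersion-driven selection with a simple greedy pick of the highest-scoring segments. The pause threshold, the interpretation of the summarization ratio as summary duration over video duration, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def segment_by_pauses(words, min_pause=2.0):
    """Split (word, start, end) tuples wherever the inter-word gap is >= min_pause seconds."""
    segments, current = [], []
    for word in words:
        if current and word[1] - current[-1][2] >= min_pause:
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments


def score_segments(segments):
    """Score each segment as the sum of tf-idf-style weights of its words (assumed weighting)."""
    n = len(segments)
    doc_freq = Counter()
    for seg in segments:
        doc_freq.update({w.lower() for w, _, _ in seg})
    scores = []
    for seg in segments:
        tf = Counter(w.lower() for w, _, _ in seg)
        scores.append(sum(c * math.log(n / doc_freq[t]) for t, c in tf.items()))
    return scores


def select_summary(segments, scores, ratio=0.1):
    """Greedily add the highest-scoring segments that still fit the duration budget.

    Stand-in for the dispersion-based selection, which is not reproduced here.
    """
    def duration(seg):
        return seg[-1][2] - seg[0][1]

    budget = ratio * sum(duration(s) for s in segments)
    chosen, used = [], 0.0
    for i in sorted(range(len(segments)), key=lambda i: scores[i], reverse=True):
        if used + duration(segments[i]) <= budget:
            chosen.append(i)
            used += duration(segments[i])
    return sorted(chosen)


# Toy transcript: (word, start time, end time) in seconds; values are made up.
words = [
    ("storm", 0.0, 0.5), ("warning", 0.5, 1.0), ("today", 1.0, 1.5),
    ("and", 4.0, 4.3), ("now", 4.3, 4.6), ("sports", 4.6, 5.0),
    ("storm", 8.0, 8.5), ("damage", 8.5, 9.0), ("today", 9.0, 9.5),
]
segments = segment_by_pauses(words)
scores = score_segments(segments)
# A ratio of 0.3 is used only because the toy transcript is too short for 0.1.
summary = select_summary(segments, scores, ratio=0.3)
print(summary)
```

Words that occur in every segment receive a weight of zero under this weighting, so segments made up of common words contribute little to the summary.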
Taskiran et al. (2006) designed a user study to compare the quality of video summaries generated by FREQ with that of summaries generated by two other algorithms, RAND and DEFT, which do not use word-frequency or dispersion scores. FREQ performed reliably even with transcripts obtained by ASR, which has a high error rate. It was found to be statistically significantly better than RAND and DEFT in terms of the number of correct answers to 10 multiple-choice questions and the number of answers contained in the summaries. The study makes an important contribution by demonstrating the use of video transcripts to generate video summaries, and it suggests that future studies consider generating summaries from modalities beyond the transcript.
Note also that Taskiran et al. (2006) used error-prone ASR transcripts in the FREQ algorithm. If highly accurate transcripts such as closed captions are available, automated transcript-based video summaries can be expected to perform even better. State-of-the-art ASR techniques, however, are not accurate enough to generate closed captions on their own. Martone et al. (2004) proposed an algorithm for automatically generating closed captions using text alignment. The algorithm aligns a video transcript that has no time codes with ASR output that contains a time code for each word. With this technique, if the program transcript is available, highly accurate closed captions can be generated automatically and efficiently, and more effective video summaries can be created from the speech transcript.
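The idea of transferring time codes from ASR output to an accurate but untimed transcript can be sketched as follows. This is only an illustration of text alignment in general, not the algorithm of Martone et al. (2004), whose details are not given here: it uses the matching blocks found by Python's standard difflib.SequenceMatcher as the alignment, and the function name and data layout are assumptions.

```python
from difflib import SequenceMatcher


def align_transcript(transcript_words, asr_words):
    """Attach ASR time codes to transcript words that match the ASR output.

    transcript_words: list of str (accurate text, no time codes)
    asr_words: list of (str, float) pairs (error-prone text, start time in seconds)
    Returns a list of (word, time or None) in transcript order.
    """
    ref = [w.lower() for w in transcript_words]
    hyp = [w.lower() for w, _ in asr_words]
    times = [None] * len(transcript_words)
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    # Each matching block is a run of identical words in both sequences.
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[block.a + k] = asr_words[block.b + k][1]
    return list(zip(transcript_words, times))


# Toy example: the ASR output misrecognizes "storm" as "store".
transcript = ["The", "storm", "reached", "the", "coast", "today"]
asr = [("the", 0.0), ("store", 0.4), ("reached", 0.9),
       ("the", 1.3), ("coast", 1.6), ("today", 2.1)]
aligned = align_transcript(transcript, asr)
for word, time in aligned:
    print(word, time)
```

Words that the recognizer got wrong keep the correct transcript spelling but receive no time code in this sketch; a fuller implementation would presumably interpolate their times from the neighboring matched words.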
Another example of video summarization based on transcripts is the MAGIC (Metadata Automated Generation for Instructional Content) system developed at IBM, which utilizes various content analytics tools to automatically generate metadata for instructional video content (Li et al., 2005). The audiovisual analysis modules recognize semantic sound categories and identify narrators and informative text segments, while the text analysis modules extract the title, keywords, and summary from video transcripts. In particular, the text analysis tools extract a document title, a set of keywords (ranked by frequency and/or from the most specific to the most generic), topic shift boundaries, and a summary description comprising a few important sentences from the video transcripts.