A system is called ‘smart’ if a user perceives the actions and reactions of the system as being smart. Media is therefore managed smartly if a computer system helps a user to perform an extensive set of operations on a large media database quickly, efficiently, and conveniently. Such operations include searching, browsing, manipulating, sharing, and reusing.
The more the computer knows about the media it manages, the smarter it can be. Thus, algorithms that are capable of extracting semantic information automatically from media are an important part of a smart media-management system. As part of this effort, our lab at Intel Corp. (Santa Clara, CA) is focusing on tasks such as reliable shot detection; text localization and text segmentation in images, web pages, and videos; and automatic semantic labeling of images.
Calling the shots
A shot is commonly defined as an uninterrupted recording of an event or locale. Any video sequence consists of one or more shots concatenated by transition effects. Detecting shot boundaries thus means recovering those elementary video units, which in turn provide the basis for nearly all existing video abstraction and high-level video segmentation algorithms. In addition, during video production each transition type is chosen carefully to support the content and context of the video sequence; therefore, automatically recovering the positions and types of all transitions may help the computer deduce high-level semantics. For instance, feature films often use dissolves to convey a passage of time. Dissolves also occur much more often in feature films, documentaries, and biographical and scenic video material than in newscasts, sports, comedies, and other shows. The opposite is true for wipes, in which a line moving across the screen marks the transition from one scene to the next. Automatic detection of transitions and their types can therefore be used to recognize the video genre automatically.
A recent review of the state of the art in automatic shot boundary detection emphasizes algorithms that specialize in detecting specific types of transitions such as hard cuts, fades, and dissolves. In a fade, the scene gradually diminishes to a black screen over a second or more; in a dissolve, the outgoing scene fades not to black but directly into the incoming scene as it becomes clearer. Today's cutting-edge systems can detect hard cuts and fades at high hit rates of 99% and 82% and at low false-alarm rates of 1% and 18%, respectively. Dissolves are more difficult to detect, and the best approaches report hit and false-alarm rates of 75% and 16% on a representative video test set.
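Most hard-cut detectors threshold a frame-to-frame dissimilarity measure; a color-histogram difference is one classic choice. The sketch below is a minimal illustration of that idea, not the algorithm used by the systems surveyed — the bin count and threshold here are purely illustrative.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """3-D RGB histogram of one frame, normalized to sum to 1."""
    hist, _ = np.histogramdd(
        frame.reshape(-1, 3), bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)))
    return hist / hist.sum()

def detect_hard_cuts(frames, threshold=0.5):
    """Flag frame indices where the L1 histogram distance to the
    previous frame (a value in [0, 2]) exceeds `threshold`."""
    cuts = []
    prev = color_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = color_histogram(frame)
        if np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts
```

A fixed global threshold like this works on abrupt cuts but fails on gradual transitions such as fades and dissolves, which is exactly why the specialized detectors described above exist.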
Extracting truly high-level semantics from images and videos in most cases is still an unsolved problem. One of the few exceptions is the extraction of text in complex backgrounds and cluttered scenes. Several researchers have recently developed novel algorithms for detecting, segmenting, and recognizing such text occurrences.2 These extracted text occurrences provide a valuable source of high-level semantics for indexing and retrieval. For instance, text extraction enables users of a video database to query for all movies featuring John Wayne or produced by Steven Spielberg. Or it can be used to jump to news stories about a specific topic since captions in newscasts often provide a condensation of the underlying news story.
Detecting, segmenting, and recognizing text in the nontext parts of web pages is also an important operation. More and more web pages present text as part of images. Existing document-based text segmentation and recognition algorithms cannot extract such text because of its potentially difficult background and the large variety of text colors used. The new algorithms allow users to index the content of image-rich web pages properly. Automatic text segmentation and recognition might also help convert web pages designed for large monitors to the small LCD displays of appliances, since the textual content of the images can be retrieved.
Our latest text segmentation method not only locates text occurrences and segments them into large binary images, but also labels each pixel in an image or video as text or nontext. Thus, our text detection and segmentation methods can be used for object-based video encoding, which is known to achieve much better video quality at a fixed bit rate than existing compression technologies. In most cases, however, the problem of extracting objects automatically remains unsolved; our text localization and segmentation algorithms solve it for text occurrences in videos. Using this technique, a video encoded with multiple video object planes (VOPs) achieved a peak signal-to-noise ratio about 1.5 dB better than the same video encoded as a single MPEG-4 object. Encoding the text lines as rigid foreground objects and the rest of the video separately thus achieved much better visual quality.
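Peak signal-to-noise ratio, the quality measure quoted above, is straightforward to compute from the mean squared error between the original and reconstructed frames. A minimal implementation for 8-bit images (assuming a peak value of 255) might look like this:

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two 8-bit images."""
    mse = np.mean((original.astype(np.float64)
                   - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')          # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because the scale is logarithmic, the 1.5 dB gain reported above corresponds to roughly a 30% reduction in mean squared error at the same bit rate.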
Figure 9.2. A classification scheme for web images enables an algorithm to sort them automatically.
Although much research has been published on extraction of low-level features from images and videos, only recently has the focus shifted to exploiting low-level features to classify images and videos automatically into semantically meaningful and broad categories. Examples of broad and general-purpose semantic classes are outdoor versus indoor scenes and city versus landscape scenes. In one of our media indexing research projects, we crawled about 300,000 images from the web. After browsing carefully through those images, we came up with broad- and general-purpose categories (figure 9.2).
Although it uses only simple, low-level features, such as the overall color diversity in the image, the average noise level in the image, and the distribution of text line positions and sizes, our classification algorithm achieved an accuracy of 97.3% in separating photo-like images from graphical images on a large image database. Within the subset of photo-like images, the algorithm could separate true photos from ray-traced/rendered images with an accuracy of 87.3%, while the subset of graphical images was successfully partitioned into presentation slides and comics with an accuracy of 93.2%. Sample images illustrating the chaos before and the order after classification are shown in figure 9.2. We are now working to increase the number of categories that can be classified automatically and will have to explore how joint classification can be done accurately and efficiently.
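To make the kinds of features involved concrete, here is a toy sketch of two such low-level measurements and a threshold rule. This is not our production classifier: the feature definitions are simplified stand-ins for the color-diversity and noise-level features named above, and the threshold values are purely illustrative.

```python
import numpy as np

def color_diversity(img):
    """Fraction of distinct RGB colors relative to the pixel count.
    Photos tend to use many slightly different colors; graphics
    such as slides and comics reuse a small palette."""
    pixels = img.reshape(-1, 3)
    distinct = len(np.unique(pixels, axis=0))
    return distinct / len(pixels)

def noise_level(img):
    """Mean absolute difference between horizontally adjacent
    pixels -- a crude proxy for sensor noise and fine texture."""
    gray = img.astype(np.float64).mean(axis=2)
    return np.abs(np.diff(gray, axis=1)).mean()

def is_photo_like(img, diversity_thr=0.2, noise_thr=2.0):
    # Illustrative thresholds; a real system would learn them
    # from labeled training images.
    return color_diversity(img) > diversity_thr and noise_level(img) > noise_thr
```

In practice such per-feature thresholds are replaced by a trained classifier over the joint feature vector, but the sketch shows why even these cheap statistics already separate photos from synthetic graphics fairly well.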
Although automatic media content analysis provides the basis of a smart media-management system, efficient methods to browse a media database in a random but directed way are equally important. In one such browsing interface, every 3 s while the main selection is playing, the system queries the whole video database for the shots most similar to the currently visible video sequence. The result of the query is shown as a decorative border around the main video player, and at any time the user can select any of those similar shots as the new main video. In this example, similarity is based on color, but any similarity measure can be applied. For instance, similarity based on the text visible in a video sequence can be a useful criterion for browsing a database of newscasts recorded from a diverse set of broadcast channels.
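The color-based similarity query described above can be sketched as a nearest-neighbor search over per-shot color signatures. The function names, the averaged-histogram signature, and the L1 distance below are illustrative choices, not the exact measure used in the system.

```python
import numpy as np

def shot_signature(frames, bins=4):
    """Average normalized RGB histogram over a shot's frames."""
    hists = []
    for frame in frames:
        h, _ = np.histogramdd(frame.reshape(-1, 3), bins=(bins,) * 3,
                              range=((0, 256),) * 3)
        hists.append(h / h.sum())
    return np.mean(hists, axis=0)

def most_similar_shots(query_sig, database, k=5):
    """Indices of the k database shots whose signatures have the
    smallest L1 distance to the query shot's signature."""
    dists = [np.abs(query_sig - sig).sum() for sig in database]
    return [int(i) for i in np.argsort(dists)[:k]]
```

Because signatures are precomputed once per shot, re-running the query every 3 s reduces to a cheap distance ranking, which is what makes the continuously updating border of similar shots feasible.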
Another equally important task is automatic video abstraction. A video abstract is a sequence of still or moving images (with or without audio). The video abstract is designed to rapidly provide the user with concise information about the content of the video while preserving the essential message of the original. Different abstraction algorithms for edited video (newscasts, feature films) and raw video (home video and raw news footage) have been developed in the past, but even better methods are needed for the future.
Many interesting challenges still await researchers. The SPIE conference Storage and Retrieval for Media Databases (20-26 January, San Jose, CA) was one of the major research meetings on this topic. A new special track covers peer-to-peer media sharing and distributed media searching and indexing.
By Rainer Lienhart, Intel Corp. OEMagazine, July 2001.