Content-based Video Parsing and Querying

Jing Huang and Vera Kettnaker


Overview

Our project is to implement user-friendly tools for video segmentation (parsing) and searching (querying) on top of Rivl, a resolution-independent video language. We first segment a video sequence into shots and condense it into a much shorter skimmed video, composed of representative key frames. We then facilitate video searching by using key frames and moving objects as query keys. We plan to use algorithms that work directly on compressed MPEG data [Yeo95] to make segmentation more efficient; a simplified sketch of the idea appears below. Recently developed techniques such as Color Coherence Vectors [Greg96] and motion tracking will be used for retrieval.
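
As a rough illustration of compressed-domain segmentation, the following sketch (in Python with NumPy; the threshold and the 64-bin quantization are our assumptions) flags a cut wherever the histogram difference between consecutive DC images exceeds a threshold. [Yeo95] uses more refined decision criteria, but the overall structure is similar.

    import numpy as np

    def histogram_difference(frame_a, frame_b, bins=64):
        # L1 distance between the gray-level histograms of two DC images.
        ha, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
        hb, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
        return np.abs(ha - hb).sum()

    def detect_cuts(dc_images, threshold):
        # dc_images: sequence of 2-D arrays, each a DC image extracted
        # from an MPEG stream (roughly a 1:64 subsampled frame).
        # Returns the indices where a shot boundary (cut) is likely.
        return [i for i in range(1, len(dc_images))
                if histogram_difference(dc_images[i - 1], dc_images[i]) > threshold]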

We will use CCV-based features as indexing vectors for retrieving key frames with similar scenes. This makes querying on similar scenery effective and efficient. Extensions to CCVs, such as using color moments and gradient directions, will also be tested for image comparison.
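
To make the CCV idea concrete, here is a minimal sketch of the computation as we understand it from [Greg96] (Python with NumPy and SciPy assumed; the 64-bucket quantization and the coherence threshold tau are parameters we chose, and the slight blurring that [Greg96] applies beforehand is omitted).

    import numpy as np
    from scipy.ndimage import label

    def color_coherence_vector(image, tau=300):
        # image: H x W x 3 uint8 RGB array.
        # Quantize each channel to 4 levels -> 64 color buckets.
        q = (image // 64).astype(np.int32)
        buckets = q[..., 0] * 16 + q[..., 1] * 4 + q[..., 2]
        ccv = np.zeros((64, 2), dtype=np.int64)
        for c in range(64):
            components, n = label(buckets == c)          # connected regions
            sizes = np.bincount(components.ravel())[1:]  # skip background
            ccv[c, 0] = sizes[sizes >= tau].sum()        # coherent pixels
            ccv[c, 1] = sizes[sizes < tau].sum()         # incoherent pixels
        return ccv

    def ccv_distance(v1, v2):
        # L1 distance between two CCVs; smaller means more similar.
        return np.abs(v1 - v2).sum()

Unlike a plain histogram, the coherent/incoherent split penalizes images whose colors match in quantity but not in spatial arrangement.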

While most of the existing work on video searching is based on whole images, we will also try to implement a query engine that allows users to retrieve all video shots in which a particular object appears. Since the general problem is very hard, we will restrict our effort to queries for moving objects. We will segment the moving objects from the background by exploiting the motion information already contained in MPEG-compressed video data.
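
A first cut at this segmentation might look like the following sketch. The use of the median as a global motion estimate and the deviation threshold are our assumptions; a real implementation must also handle intra-coded macroblocks and B-frames.

    import numpy as np

    def foreground_mask(motion_vectors, deviation_thresh=2.0):
        # motion_vectors: H x W x 2 array of per-macroblock (dx, dy)
        # vectors taken from the P-frames of an MPEG stream.
        # The median vector is a simple robust estimate of the global
        # (camera) motion; macroblocks that deviate strongly from it
        # are marked as belonging to moving objects.
        global_motion = np.median(motion_vectors.reshape(-1, 2), axis=0)
        deviation = np.linalg.norm(motion_vectors - global_motion, axis=-1)
        return deviation > deviation_thresh      # boolean foreground mask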

Technical Rationale

The recent proliferation of multimedia resources has created a substantial demand for video searching and browsing. For example, social scientists often have to gather empirical data from huge collections of old news shows. VCR functions such as fast forward and rewind are ill-suited for this purpose because the user can easily miss vital content in the process. This tedious task would be much easier if they could quickly browse a content-preserving summary or directly query for the scenes they are interested in. All higher-level processing of video, however, is severely constrained by hardware bandwidth limitations. Effective and efficient tools for searching and browsing video data are therefore in great demand.

Multimedia information systems in general, such as video on demand (VOD), still have poor interactive interfaces. The attractiveness of VOD would increase if users could access a short summary of a video, which they could search or browse before deciding to watch it. Because of its small size, such a summary would add only negligibly to the network load.

Efficiency, interactive functionality, and search for interesting objects and events pose the major challenges in designing user-friendly interfaces for content-based video searching. Because improving efficiency and building basic interactive interfaces are comparatively easy, most research to date has concentrated on them and has ignored object-based and event-based queries. Users often want to retrieve interesting objects and events directly from a video, but this higher-level information cannot be extracted easily, and little work has been done in this area. We propose to use motion cues to extract moving objects and to let users pose object queries.

Previous Work

There are at least four research groups working on the problem of video parsing (or segmentation). In the following, we describe what each group has done and how our proposal relates and compares to these approaches.

1. Advanced Image Processing Lab, Princeton [Yeo95][Yeung95a][Yeung95b]

The Princeton group has built a system that detects scene breaks, extracts a number of representative frames per shot, clusters them, and represents the result as a rather complicated scene transition graph. All work is done in the compressed domain. For key frame comparison, they use Swain's histogram intersection method [Swain91], computed directly on DC images (i.e., 1:64 subsampling), and a simple luminance projection.
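
For reference, here is a minimal sketch of Swain and Ballard's normalized histogram intersection (Python with NumPy assumed; the binning in the usage comment is our assumption, not the Princeton group's exact implementation).

    import numpy as np

    def histogram_intersection(query_hist, model_hist):
        # Swain & Ballard's normalized intersection: the fraction of
        # the model histogram that is also present in the query.
        # Returns a match score in [0, 1]; identical histograms give 1.
        return np.minimum(query_hist, model_hist).sum() / model_hist.sum()

    # Usage (hypothetical): 64-bin gray-level histograms of two DC images.
    # h1 = np.histogram(dc1, bins=64, range=(0, 256))[0]
    # h2 = np.histogram(dc2, bins=64, range=(0, 256))[0]
    # score = histogram_intersection(h1, h2)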

We are also interested in determining how much quality degrades when only DC images are used. However, we will first develop a reliable method on full images and then explore whether it also works on subsampled images.

2. Group of Zhang et al in Singapore [Zhang93][Zhang95a][Zhang95b]

Zhang et al. have developed various (patented) methods for key frame extraction. They have tried a number of measures to compare key frames: a quadratic similarity measure [Niblack93] on Munsell color-space histograms, dominant colors, color moments [Stricker95], and mean brightness. They have also used different texture measures, correlation between edge maps, and cumulative angular comparisons of shapes.
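
As an illustration of one of these measures, the following sketch computes the first three color moments (mean, standard deviation, skewness) per channel in the spirit of [Stricker95]; the equal weighting in the distance is our simplification of their weighted comparison.

    import numpy as np

    def color_moments(image):
        # First three moments per color channel: mean, standard
        # deviation, and (signed cube root of) skewness.
        # image: H x W x 3 array; returns a 3 x 3 array (channel, moment).
        pixels = image.reshape(-1, 3).astype(np.float64)
        mean = pixels.mean(axis=0)
        centered = pixels - mean
        std = np.sqrt((centered ** 2).mean(axis=0))
        skew = np.cbrt((centered ** 3).mean(axis=0))
        return np.stack([mean, std, skew], axis=1)

    def moments_distance(m1, m2):
        # [Stricker95] weights the moments individually; we use equal
        # weights here for simplicity.
        return np.abs(m1 - m2).sum()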

CCVs are a new technique that combines color with spatial information and has performed much better than histogram comparison in image search engines. We hope that CCVs, or a variant of them, will also yield better results for key frame queries. We will not compute texture measures within the scope of this project.

3. Siemens Corp. Research [Arman93][Arman94]

This group develops fast algorithms for key frame extraction. They work on the DCT coefficients of the I-frames of MPEG-encoded videos, using only a small number of coefficients from a small number of blocks per image. They also subsample along the temporal axis by performing something like a binary search for cuts, sketched below. When a clear decision about a cut cannot be made, they decompress the frames and compute histogram comparisons.
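
The temporal subsampling idea can be sketched as follows (our reconstruction, not Siemens' code); frames_differ stands in for whatever cheap DCT-coefficient comparison is used, and we assume exactly one cut lies in the searched interval.

    def locate_cut(frames_differ, lo, hi):
        # Binary search for a shot boundary between frame indices lo and hi.
        # frames_differ(i, j) returns True when frames i and j appear to
        # belong to different shots (e.g., compared on a few DCT
        # coefficients).  Assumes exactly one cut lies in (lo, hi].
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if frames_differ(lo, mid):
                hi = mid      # the cut lies in the first half
            else:
                lo = mid      # the cut lies in the second half
        return hi             # index of the first frame of the new shot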

No further manipulation of the key frames, such as clustering or querying, is done.

4. Informedia Project [Inform][Wactlar96][Hauptmann96][Smith95]

The Informedia Project is a large project at CMU to build queryable video databases. They make extensive use of audio information as well as text retrieval techniques for segmenting video into shots and for highlighting "relevant" words in their video skims. They also perform a simple camera motion analysis that distinguishes between static scenes, zooms, pans, and scene changes. In addition, they use face recognition and text detection to determine the importance of a frame to users. They have not actually implemented any image query functions, but they plan to use very high-level, model-based methods.
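
A crude version of such a camera motion analysis could look like the following sketch; the thresholds and decision rules are entirely our assumptions, and Informedia's actual analysis is certainly more refined.

    import numpy as np

    def classify_camera_motion(mv, mag_thresh=1.0, spread_frac=0.5):
        # mv: H x W x 2 array of macroblock motion vectors (dx, dy).
        mags = np.linalg.norm(mv, axis=-1)
        if mags.mean() < mag_thresh:
            return "static"                        # hardly any motion
        # Pan: vectors are roughly uniform across the frame.
        mean_vec = mv.reshape(-1, 2).mean(axis=0)
        spread = np.linalg.norm(mv - mean_vec, axis=-1).mean()
        if spread < spread_frac * mags.mean():
            return "pan"
        # Zoom: vectors point consistently toward or away from the center.
        h, w = mv.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w]
        radial = np.stack([xs - w / 2.0, ys - h / 2.0], axis=-1)
        dot = (mv * radial).sum(axis=-1)
        if np.abs(np.sign(dot).mean()) > 0.7:      # mostly one sign
            return "zoom"
        return "other"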

We might be able to use their motion analysis ideas to separate moving objects from the background.

All of these groups use only very simple methods for comparing key frames. We believe we can make significant progress by applying image comparison measures that capture more information than previous methods.

Deliverables

Video Parsing and Key Frame Querying

A small database of digitized video sequences with short shots and recurring scenes and objects (Oct. 18)
A key frame extractor (probably an extension/modification of existing code) (Oct. 22)
A user interface that allows the user to select a query frame and that displays the result by splitting the key frame sequence into related and unrelated key frames (Oct. 22)
Separation of objects from the background (Oct. 30)
A few comparison methods (feature/similarity function pairs) to find frames of the same scene, possibly operating on semi-compressed data (Nov. 10)
Evaluation of these methods for a number of video sequences from a few different genres (Nov. 20)

As time permits: Querying for objects

A user interface that allows the user to select an object
An object tracker (probably an extension/modification of existing code)
A model-based search algorithm for querying objects in key frames
Evaluation of these methods

Special Needs

We will need an MPEG encoder (there is one in the system lab) and some disk space for storing video data.

References

[Arman93] F. Arman, A. Hsu, M. Chiu, "Image processing on compressed data for large video databases", ACM Multimedia 93, pp. 267-272

[Arman94] F. Arman, A. Hsu, M. Chiu, "Image processing on encoded video sequences", Multimedia Systems, 1:211-219, 1994

[Hauptmann96] A.G. Hauptmann, M.A. Smith, "Text, Speech, and Vision for Video Segmentation: The Informedia Project", AAAI Fall Symposium on Computational Models for Integrating Language and Vision, 1995

[Greg96] G. Pass, R. Zabih, "Comparing Images Using Color Coherence Vectors", ACM Multimedia 96, to appear

[Inform] http://www.informedia.cs.cmu.edu

[Koechlin92] O. Koechlin, "L'analyse automatique de l'image: Vers un traitement du contenu", Dossiers de l'Audiovisuel, 45:76-82, 1992

[Niblack93] W. Niblack, R. Barber, et al., "The QBIC project: Querying images by content using color, texture and shape", Storage and Retrieval for Image and Video Databases I, SPIE Vol. 1908, 1993

[Smith95] M.A. Smith, T. Kanade, "Video Skimming for Quick Browsing based on Audio and Image Characterization", Tech. Report CMU-CS-95-186, 1995

[Swain91] M.J. Swain, D.H. Ballard, "Color Indexing", IJCV, 7(1):11-32, 1991

[Stricker95] M. Stricker, M. Orengo, "Similarity of Color Images", Storage and Retrieval for Image and Video Databases III, SPIE Vol. 2420, pp. 381-392, 1995

[Wactlar96] H.D. Wactlar, T. Kanade, M.A. Smith, S.M. Stevens, "Intelligent Access to Digital Video: Informedia Project", IEEE Computer, May 1996

[Yeo95] B.L. Yeo, B. Liu, "Rapid Scene Analysis on Compressed Video", IEEE Transactions on Circuits and Systems for Video Technology, 1995

[Yeung95a] M.M. Yeung, B. Liu, "Efficient Matching and Clustering of Video Shots", ICIP 1995

[Yeung95b] M.M. Yeung, B.L. Yeo, W. Wolf, B. Liu, "Video Browsing using Clustering and Scene Transitions on Compressed Sequences", Multimedia Computing and Networking, SPIE Vol. 2417, 1995

[Zhang93] H.J. Zhang, A. Kankanhalli, S.W. Smoliar, "Automatic Partitioning of Full-Motion Video", Multimedia Systems, 1:10-28, 1993

[Zhang95a] H.J. Zhang, S.W. Smoliar, J.H. Wu, "Content-Based Video Browsing Tools", Multimedia Computing and Networking, SPIE Vol. 2417, pp. 389-398, 1995

[Zhang95b] H.J. Zhang, C.Y. Low, S.W. Smoliar, J.H. Wu, "Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution", ACM Multimedia 1995, pp. 15-24

