Content-based Video Parsing and Querying
Jing Huang
and Vera Kettnaker
Overview
Our project is to implement user-friendly tools for video segmentation
(parsing) and searching (querying) on top of
Rivl, a resolution-independent video language. We first segment a
video sequence into shots and condense it into a much shorter skimmed video, which is
composed of representative key frames. We then facilitate video
searching by using key frames and moving objects as query keys.
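For illustration, the following sketch shows the intended shape of the parsing stage; the two callables, is_cut and pick_key_frame, are placeholders for the detectors discussed in the rest of this proposal, not existing code:

    def build_skim(frames, is_cut, pick_key_frame):
        """Split `frames` into shots wherever the pairwise predicate
        is_cut(prev, curr) fires, then keep one representative frame
        per shot. Both callables are placeholders for components
        described in this proposal."""
        shots, current = [], [frames[0]]
        for prev, curr in zip(frames, frames[1:]):
            if is_cut(prev, curr):
                shots.append(current)   # close the current shot
                current = []
            current.append(curr)
        shots.append(current)
        return [pick_key_frame(shot) for shot in shots]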
We plan to use algorithms that work directly on compressed MPEG data
[Yeo95] to make segmentation more efficient. Recently developed
techniques such as color coherence vectors (CCVs) [Greg96] and motion
tracking will be used for retrieval.
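As a rough sketch of the kind of compressed-domain processing we have in mind (not the exact algorithm of [Yeo95]), the fragment below flags cuts by differencing consecutive DC images; it assumes the DC luminance images have already been extracted from the MPEG stream, and the threshold is an arbitrary placeholder:

    import numpy as np

    def detect_cuts(dc_images, threshold=0.4):
        """Flag shot boundaries by comparing consecutive DC images.

        dc_images: list of HxW luminance arrays (1:64 subsampled frames).
        A cut is declared when the normalized mean pixel difference
        between consecutive DC images exceeds `threshold` (placeholder
        value; would need tuning)."""
        cuts = []
        for i in range(1, len(dc_images)):
            prev = dc_images[i - 1].astype(np.float64)
            curr = dc_images[i].astype(np.float64)
            # Mean absolute difference, normalized to [0, 1] for 8-bit data.
            diff = np.abs(curr - prev).mean() / 255.0
            if diff > threshold:
                cuts.append(i)
        return cuts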
We will use CCV-based features as index vectors for retrieving key frames
that show similar scenes, which should make such queries both effective
and efficient. Extensions to CCVs, such as color moments and gradient
directions, will also be tested for image comparison.
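For concreteness, here is a sketch of CCV computation as we understand it from [Greg96]; the published method also blurs the image slightly and may use a different connectivity, and the quantization and coherence threshold tau are tunable:

    import numpy as np
    from scipy import ndimage

    def color_coherence_vector(image, levels=4, tau=25):
        """Compute a CCV for an RGB image (H x W x 3, uint8).

        Each channel is quantized into `levels` values, giving
        levels**3 color buckets. A pixel is 'coherent' if it lies in
        a connected region of its bucket of size >= tau pixels,
        otherwise 'incoherent'. Returns (coherent, incoherent) counts
        per bucket."""
        n_colors = levels ** 3
        q = (image.astype(np.int32) * levels) // 256
        buckets = (q[..., 0] * levels + q[..., 1]) * levels + q[..., 2]
        alpha = np.zeros(n_colors)   # coherent pixel counts
        beta = np.zeros(n_colors)    # incoherent pixel counts
        for c in range(n_colors):
            mask = buckets == c
            if not mask.any():
                continue
            labeled, n = ndimage.label(mask)          # connected components
            sizes = np.bincount(labeled.ravel())[1:]  # skip background label 0
            alpha[c] = sizes[sizes >= tau].sum()
            beta[c] = sizes[sizes < tau].sum()
        return alpha, beta

    def ccv_distance(ccv1, ccv2):
        """L1 distance between two (coherent, incoherent) CCV pairs."""
        return (np.abs(ccv1[0] - ccv2[0]) + np.abs(ccv1[1] - ccv2[1])).sum()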
While most of the existing work on video searching is based on whole images, we will
also try to implement a query engine that allows users to retrieve all video
shots in which a particular object appears. Since the general
problem is very hard, we will restrict our effort to queries for moving
objects. We will segment moving objects from the background by exploiting the motion
information already contained in MPEG-compressed video data.
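One simple scheme, sketched below under the assumption that per-macroblock motion vectors have been parsed from a P-frame, is to estimate the dominant (camera) motion and flag macroblocks that deviate from it; the threshold is a placeholder:

    import numpy as np

    def moving_object_mask(mv_field, threshold=2.0):
        """Roughly separate moving objects from background using the
        macroblock motion vectors of an MPEG P-frame.

        mv_field: (rows, cols, 2) array of per-macroblock motion vectors.
        The dominant (camera) motion is estimated as the per-component
        median; macroblocks whose vector deviates from it by more than
        `threshold` pixels are marked as likely foreground."""
        camera_motion = np.median(mv_field.reshape(-1, 2), axis=0)
        residual = np.linalg.norm(mv_field - camera_motion, axis=2)
        return residual > threshold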
Technical Rationale
The recent proliferation of multimedia resources has created a
substantial demand for video searching and browsing. For example, social scientists often have to gather empirical data from huge collections of old news shows.
VCR functions such as fast forward and rewind are ill-suited
for this purpose because the user can easily miss vital content in
the process. This tedious task would be much easier if they
could quickly browse through a content-preserving summary or
directly query for the scenes they are interested in.
All higher-level processing of video, however, is severely constrained by
the bandwidth limitations of current hardware. Effective and efficient
tools for searching and browsing video data are therefore greatly in demand.
Multimedia information systems in general, such as video on demand
(VOD), still have poor interactive interfaces. VOD would become more
attractive if users could access a short summary of a video, which they
could search or browse before deciding to watch it. Because such a summary
is small, it would add only negligible network load.
Efficiency, interactive functionality, and search for interesting
objects and events pose the major challenges in designing user-friendly
interfaces for content-based video searching. Because efficiency gains
and interactive interfaces are comparatively easy to achieve, most
research to date has concentrated on them and ignored object-based or
event-based queries. Users often want to retrieve interesting objects
and events directly from a video, but this higher-level information
cannot be extracted easily, and little work has been done in the area.
We propose to use motion cues for extracting moving objects and to let users pose object queries.
Previous Work
There are at least four research groups working on the problem of video parsing
(or segmentation). In the following, we describe what each group has done and how our proposal relates and compares to their approaches.
1. Advanced Image Processing Lab, Princeton [Yeo95][Yeung95a][Yeung95b]
The Princeton group has built a system that detects scene breaks, extracts a number
of representative frames per shot, clusters them, and represents the
result as a rather complicated scene transition graph. All work is
done in the compressed domain. For key frame comparison they use Swain's
histogram intersection method [Swain91], computed directly on DC images
(i.e., 1:64 subsampling), and a simple luminance projection.
We are also interested in determining how much quality degrades when only
the DC images are used. However, we will first develop a reliable method on full
images and then explore whether it also works on subsampled images.
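For reference, Swain's measure itself is straightforward; the sketch below assumes the color histograms have already been computed (from DC images, in the Princeton system):

    import numpy as np

    def histogram_intersection(h1, h2):
        """Swain & Ballard's histogram intersection [Swain91],
        normalized to [0, 1] by the size of the model histogram h2.

        h1, h2: color histograms (any binning) as 1-D arrays.
        1.0 means identical color distributions, 0.0 means disjoint."""
        return np.minimum(h1, h2).sum() / h2.sum()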
2. Group of Zhang et al. in Singapore [Zhang93][Zhang95a][Zhang95b]
Zhang et al. have developed various (patented) methods for key frame
extraction. They have tried a number of measures
for comparing key frames: the quadratic similarity measure of
[Niblack93] on Munsell color-space histograms, dominant colors, color
moments [Stricker95], and mean brightness. They have also used various
texture measures, correlation between edge maps, and cumulative angular
comparison of shapes.
CCVs are a new technique that combines color with spatial information
and performs much better than histogram comparison in image search engines.
We hope that CCVs, or a variant of them, will also yield better results for
key frame queries. We will not compute texture measures within the scope of this project.
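As an example of the kind of compact feature we will compare CCVs against, here is a sketch of the color moments of [Stricker95] as we understand them; Stricker and Orengo compare such vectors with a weighted L1 distance:

    import numpy as np

    def color_moments(image):
        """First three moments (mean, std dev, skewness) per color
        channel, in the spirit of [Stricker95].

        image: H x W x 3 array. Returns a 9-element feature vector."""
        feats = []
        for c in range(3):
            chan = image[..., c].astype(np.float64).ravel()
            mean = chan.mean()
            std = chan.std()
            # Signed cube root of the third central moment.
            third = ((chan - mean) ** 3).mean()
            skew = np.sign(third) * np.abs(third) ** (1 / 3)
            feats.extend([mean, std, skew])
        return np.array(feats)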
3. Siemens Corp. Research [Arman93][Arman94]
This group develops fast algorithms for key frame
extraction. They work on the DCT coefficients of the I-frames of
MPEG-encoded videos, using only a small number of coefficients from a
small number of blocks per image. They also subsample along the
temporal axis by performing something like a binary search for
cuts. When a clear decision about a cut cannot be made, they
decompress the frames and compute histogram comparisons.
They do no further processing of key frames, such as clustering
or querying.
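The temporal subsampling can be illustrated as follows; this is our reading of the idea rather than Siemens' exact algorithm, and it assumes exactly one cut lies in the interval and that a frame-comparison predicate is available:

    def locate_cut(dissimilar, lo, hi):
        """Binary search for a cut between frames lo and hi, assuming
        dissimilar(lo, hi) is True and exactly one cut lies inside.
        dissimilar(i, j) is any thresholded frame-comparison predicate,
        e.g. a DCT-coefficient or histogram distance."""
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if dissimilar(lo, mid):
                hi = mid   # the cut lies in the first half
            else:
                lo = mid   # frames lo..mid match; cut is in the second half
        return hi          # first frame of the new shot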
4. Informedia Project [Inform][Wactlar96][Hauptmann96][Smith95]
The Informedia Project is a large project at CMU to build queryable
video databases. They make extensive use of audio information as well
as text retrieval techniques for segmenting video into shots and
highlighting "relevant" words in their video skims. They
also perform a simple camera motion analysis that distinguishes between
static scenes, zooms, pans, and scene changes. In addition, they use face
recognition and text detection to determine how important a frame is
to users. They have not actually implemented any image query
functions, but plan to use very high-level, model-based methods.
We might be able to use their motion analysis ideas to separate
moving objects from the background.
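Informedia's published descriptions do not give implementation details for this step; the sketch below is only our guess at a minimal classifier of this kind, assuming per-macroblock motion vectors and arbitrary placeholder thresholds:

    import numpy as np

    def classify_camera_motion(mv_field, still_thresh=0.5, zoom_thresh=0.3):
        """Crude static / pan / zoom classifier for one P-frame.

        mv_field: (rows, cols, 2) per-macroblock motion vectors.
        Pans show a consistent global translation; zooms show vectors
        pointing away from (or toward) the frame center."""
        rows, cols, _ = mv_field.shape
        mags = np.linalg.norm(mv_field, axis=2)
        if mags.mean() < still_thresh:
            return "static"
        # Unit vectors from the frame center to each macroblock.
        ys, xs = np.mgrid[0:rows, 0:cols]
        radial = np.dstack([xs - cols / 2.0, ys - rows / 2.0])
        radial /= np.linalg.norm(radial, axis=2, keepdims=True) + 1e-9
        # Average alignment of the motion field with the radial direction.
        divergence = (mv_field * radial).sum(axis=2).mean() / (mags.mean() + 1e-9)
        if abs(divergence) > zoom_thresh:
            return "zoom"
        return "pan"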
All of these groups use only very simple methods for comparing key frames,
and we think we can make significant progress by applying image comparison
measures that capture more information than previous methods.
Deliverables
Video Parsing and Key Frame Querying
A small database of digitized video sequences with short shots and recurring scenes and objects (Oct. 18)
A key frame extractor (probably an extension/modification of existing code) (Oct. 22)
A user interface that allows the user to select a query frame and that displays the result by splitting the key frame sequence into related and unrelated key frames (Oct. 22)
Separation of objects from the background (Oct. 30)
A few comparison methods (feature/similarity function pairs) to find frames of the same scene, possibly operating on semi-compressed data (Nov. 10)
Evaluation of these methods on a number of video sequences from a few different genres (Nov. 20)
As time permits: Querying for objects
A user interface that allows the user to select an object
An object tracker (probably an extension/modification of existing code)
A model-based search algorithm for querying objects in key frames
Evaluation of these methods
Special Needs
We will need an MPEG encoder (there is one in the system lab) and some disk space for storing video data.
References
[Arman93] F. Arman, A. Hsu, M. Chiu, "Image Processing on Compressed Data for Large Video Databases", ACM Multimedia 93, p. 267-272
[Arman94] F. Arman, A. Hsu, M. Chiu, "Image Processing on Encoded Video Sequences", Multimedia Systems, 1:211-219, 1994
[Hauptmann96] A.G. Hauptmann, M.A. Smith, "Text, Speech, and Vision for Video Segmentation: The Informedia Project", AAAI Fall Symposium on Computational Models for Integrating Language and Vision, 1995
[Greg96] G. Pass, R. Zabih, "Comparing Images Using Color Coherence Vectors", ACM Multimedia 96, to appear
[Inform] http://www.informedia.cs.cmu.edu
[Koechlin92] O. Koechlin, "L'analyse automatique de l'image: Vers un traitement du contenu", Dossiers de l'Audiovisuel, 45:76-82, 1992
[Niblack93] W. Niblack, R. Barber, et al., "The QBIC Project: Querying Images by Content Using Color, Texture and Shape", Storage and Retrieval for Image and Video Databases I, 1993, SPIE Vol. 1908
[Smith95] M.A. Smith, T. Kanade, "Video Skimming for Quick Browsing Based on Audio and Image Characterization", Technical Report CMU-CS-95-186, 1995
[Swain91] M.J. Swain, D.H. Ballard, "Color Indexing", IJCV, 7:1, 1991, p. 11-32
[Stricker95] M. Stricker, M. Orengo, "Similarity of Color Images", Storage and Retrieval for Image and Video Databases III, 1995, SPIE Vol. 2420, p. 381-392
[Wactlar96] H.D. Wactlar, T. Kanade, M.A. Smith, S.M. Stevens, "Intelligent Access to Digital Video: Informedia Project", IEEE Computer, May 1996
[Yeo95] B.L. Yeo, B. Liu, "Rapid Scene Analysis on Compressed Video", IEEE Trans. on Circuits and Systems for Video Technology, 1995
[Yeung95a] M.M. Yeung, B. Liu, "Efficient Matching and Clustering of Video Shots", ICIP 1995
[Yeung95b] M.M. Yeung, B. Yeo, W. Wolf, B. Liu, "Video Browsing Using Clustering and Scene Transitions on Compressed Sequences", Multimedia Computing and Networking 1995, SPIE Vol. 2417
[Zhang93] H.J. Zhang, A. Kankanhalli, S.W. Smoliar, "Automatic Partitioning of Full-Motion Video", Multimedia Systems, 1, 1993, p. 10-28
[Zhang95a] H.J. Zhang, S.W. Smoliar, J.H. Wu, "Content-Based Video Browsing Tools", Multimedia Computing and Networking 1995, SPIE Vol. 2417, p. 389-398
[Zhang95b] H.J. Zhang, C.Y. Low, S.W. Smoliar, J.H. Wu, "Video Parsing, Retrieval and Browsing: An Integrated and Content-Based Solution", ACM Multimedia 1995, p. 15-24