Image Mosaics For Lecture Browser

Cornell University
Computer Science Department
Masters of Engineering Project
Spring 1998

 

Submitted by : Sanjeev Topiwala

Advisor : Prof. Brian Smith.

 



Abstract

    The goal of this project is to develop an automated system for generating an image mosaic from the video data captured in the Lecture Browser project. Image mosaics are useful for a variety of tasks in vision and computer graphics, and for applications such as the Lecture Browser. This paper presents a way to reconstruct a mosaic, a representation of the contents of the blackboard/whiteboard, from a video feed that pans over the board. Unlike conventional techniques that use optical flow and affine transformations, we use the Hausdorff fraction to track camera motion. Also, because of a priori knowledge about the scene and the camera motion, the problem can be cast as a graph problem. We use Dalí as the platform for putting the images together, creating new Dalí primitives where needed.

Introduction

    Image mosaics are useful for a variety of tasks in vision and computer graphics; applications include virtual environments, panoramic photography, increased image resolution and video compression. The goal of the Lecture Browser project is to support digitization, storage and on-line browsing of the lectures given in Philips 101. One issue for this application is that when the lecturer uses a blackboard as a teaching aid, its contents are not available in digital form, so a mechanism was needed to generate a representation of the blackboard. The basic idea is to periodically have one camera zoom in, pan over the blackboard in an elevator fashion, and feed this data to mosaicing routines which generate the mosaic from it. Ideally, a camera pans over the region of interest (the blackboard in our application) and a complete description of the scene is captured in a single mosaic image. In practice, however, several issues must be considered when designing such a system.

    First, a method to extract representative images from the video feed is needed. Second, a method to detect the motion between chosen adjacent frames is needed. Third, a method to register any pair of images is needed.

    First, the data is an MPEG stream, and we want to decode it so that only representative frames are selected. The technique we tried was motion-vector analysis of the B- and P-frames of the MPEG stream to determine when, and which, frame to take for the mosaic.

    Second, traditional approaches use optical flow methods, minimization of the sum of squared intensity errors, and computation of affine transformations. These methods are prone to errors due to intensity changes, excessive image noise and object movement. Moreover, in the Lecture Browser project all the data is in MPEG format and the lighting conditions in the lecture hall vary considerably, so a robust technique that does not depend on minor changes in intensity was needed. Instead of optical flow techniques, global edge information is used to compute the motion: the generalized Hausdorff distance is used to compare the edge points of two frames and obtain their relative motion. Also, since we have a priori information about the direction of the camera movement and its approximate distance, the search space can be pruned and results obtained very quickly.

    The third issue was which transform to use to register a pair of images. The candidates were affine, projective and spherical. Here too we tried an approach novel for this application: scaling and translation only. The reason for choosing such a simple model is that the motion of the camera is controlled and, given the distance of the camera from the scene, this model is sufficient to represent the relation between any pair of images. The system has, however, been designed so that any other transform can be plugged in with most of the implementation left untouched.

Background

Frame Selection

The first problem I faced was finding a mechanism to sample frames from the MPEG stream. The images decoded from the stream were "out of focus" while the camera was moving or had just stopped. A reliable way was needed to detect that the camera had come to rest between consecutive translations and to sample a representative "in focus" image at that point.

Various approaches that I tried without much success all had problems. One problem was that the motion vectors had positive as well as negative values. Also, because of the low texture of the objects in the video, testing the values of individual motion vectors did not give a correct indication.

The approach that gave good results was to use the sum of the motion-vector amplitudes for the P-frames, and the sum of both the forward and backward motion-vector amplitudes for the B-frames. The results clearly indicated the motion. Some results obtained from a sample video sequence are shown below:
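This steady-period test can be sketched as follows. The decoding of motion vectors from the MPEG stream is assumed to have happened already; the function and variable names here are illustrative, not the project's actual code.

```python
import math

def motion_energy(vectors):
    """Sum of motion-vector amplitudes for one frame.

    `vectors` is a list of (dx, dy) pairs; for a B-frame, pass the
    forward and backward vectors concatenated together.
    """
    return sum(math.hypot(dx, dy) for dx, dy in vectors)

def steady_frames(frame_vectors, threshold):
    """Indices of frames whose total motion energy falls below `threshold`,
    i.e. candidate "in focus" frames captured while the camera is still."""
    return [i for i, vecs in enumerate(frame_vectors)
            if motion_energy(vecs) < threshold]
```

A frame taken from the middle of each low-energy run would then serve as the representative image for that camera position.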

[Figure: summed motion-vector amplitudes for a sample video sequence]

Although the data seems to show a very regular pattern, it is not reliable in practice, so we finally dropped the idea of using it as the basis for selecting frames from the video stream.

Another solution that looks very promising for this problem is to use an external indicator to signal when the camera starts and stops. Using some signal-generation routines, a pulse can be recorded in the audio stream to mark the instants when the camera begins moving and when it becomes steady. Representative frames can then be chosen from the periods during which the camera is steady.
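One way such pulses could be recovered is a simple amplitude-threshold scan of the audio samples. This is only a sketch under assumed conditions: the pulse level, duration and minimum spacing would depend on the actual signal-generation hardware, and the names here are hypothetical.

```python
def pulse_times(samples, rate, threshold=0.8, min_gap=0.5):
    """Return times (in seconds) where the audio amplitude crosses
    `threshold`, collapsing crossings closer than `min_gap` seconds
    into a single event (one pulse spans several samples)."""
    times = []
    last = -min_gap  # time of the previous accepted pulse
    for i, s in enumerate(samples):
        t = i / rate
        if abs(s) >= threshold and t - last >= min_gap:
            times.append(t)
            last = t
    return times
```

Alternating entries of the returned list would then mark "camera starts moving" and "camera steady" instants.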

Motion Estimation

Traditionally, methods such as optical flow computation are used for motion estimation. This method has disadvantages: it is not robust under varying lighting conditions, and it is slow. We therefore proposed using the Hausdorff distance to find the motion between adjacent images. The generalized Hausdorff-based distance measure is used to measure the resemblance of the edge points of two images. In measuring the resemblance of a whole frame, or of the background of two images, there is no assumption that objects must be stationary. The Hausdorff metric can be used to find the relative translation and scaling of the two images; for more details on the Hausdorff distance, see reference [4].
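The forward Hausdorff fraction can be sketched as below: for a candidate translation, it is the fraction of model edge points that land near some image edge point. This is a simplified illustration (scaling is omitted, and the exhaustive neighbourhood test would be replaced by a distance transform in a real implementation); the names are illustrative.

```python
def hausdorff_fraction(model_edges, image_edges, tx, ty, tol=1):
    """Fraction of model edge points that, after translating by (tx, ty),
    fall within `tol` pixels of some image edge point."""
    image = set(image_edges)
    def near(x, y):
        return any((x + dx, y + dy) in image
                   for dx in range(-tol, tol + 1)
                   for dy in range(-tol, tol + 1))
    hits = sum(1 for (x, y) in model_edges if near(x + tx, y + ty))
    return hits / len(model_edges)

def best_translation(model_edges, image_edges, tx_range, ty_range):
    """Search a pruned translation window (possible because the camera's
    direction and approximate step are known) for the best match."""
    return max(((tx, ty) for tx in tx_range for ty in ty_range),
               key=lambda t: hausdorff_fraction(model_edges, image_edges, *t))
```

The fraction returned for the winning translation is exactly the "confidence level" used later when composing transforms.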

Image Registration

There are many image registration techniques; they differ in the type of transform used to model the motion between adjacent frames. For this project we had certain advantages: the camera's motion was roughly known a priori, and hence so was each frame's approximate position in the final mosaic. We therefore chose a simple transform of just scaling and translation, which proved sufficient for this application.
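A minimal sketch of this scaling-plus-translation model, including the composition needed to chain transforms back to the reference frame (the class and method names are illustrative):

```python
class ScaleTranslate:
    """A point (x, y) maps to (s*x + tx, s*y + ty): uniform scale s
    followed by translation (tx, ty). Sufficient here because the camera
    is far from the board and its motion is controlled; a richer
    (e.g. projective) model could be plugged in behind the same interface."""
    def __init__(self, s=1.0, tx=0.0, ty=0.0):
        self.s, self.tx, self.ty = s, tx, ty

    def apply(self, x, y):
        return self.s * x + self.tx, self.s * y + self.ty

    def compose(self, other):
        """Transform equivalent to applying `other` first, then `self`."""
        return ScaleTranslate(self.s * other.s,
                              self.s * other.tx + self.tx,
                              self.s * other.ty + self.ty)
```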

Implementation

The motion of the camera is controlled by a special Dalí script that produces a serpentine motion, so the motion was known in advance.

[Figure: serpentine camera motion over the blackboard]

Given this, the problem can be seen as a graph problem, where the nodes of the graph are the various images, connected by links according to their relative positions within the mosaic. The mosaic is then viewed as:

[Figure: the mosaic viewed as a graph of frames linked to their right and bottom neighbors]

As the diagram above shows, the frames captured as the camera undergoes its serpentine motion let us abstract the mosaicing problem as a graph. Each frame in the mosaic corresponds to a node of the graph and is linked to its bottom and right neighbors. The problem then boils down to finding the relation between adjacent nodes and registering them all in a common reference coordinate system.

A classical problem with image mosaicing is that most approaches compute transforms between adjacent images and then chain them into a composite transform; any error is accumulated, so by the far end of the mosaic the effective registration error is very high. With our approach, each composite transform involves fewer images between the reference frame and itself, so the registration error is minimized.

The basic algorithm then followed is:

  1. Add all the frames to the graph structure of the mosaic.

  2. Compute the right and bottom transform for the image corresponding to each node. Each transform computed has a "confidence level" which is the amount of match it had with the other image.

  3. Fill in the gaps by interpolating the transforms of surrounding nodes. The interpolation depends upon all the neighbors and picks the transform with the "best confidence level".

  4. Compute the composite transform for each node. Different paths are tried for each node, and the one with the highest composite confidence is chosen as the composite transform for the node. The composite confidence is the product of the individual confidences of the links in the path.

  5. Register all the images in the reference system.

  6. Use alpha blending for removing the artifacts present in the final mosaic.
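Step 4 above can be sketched as a dynamic-programming sweep over the grid of nodes: a node can be reached from its left neighbour's right link or its top neighbour's bottom link, and we keep the path whose product of link confidences is highest (the corresponding transforms would be composed along the same chosen path). The data layout here is illustrative, not the project's actual structures.

```python
def best_composite(confidences, rows, cols):
    """`confidences[(r, c, 'right')]` and `confidences[(r, c, 'bottom')]`
    hold the confidence of node (r, c)'s right and bottom links.
    Returns conf[(r, c)] = the best product of link confidences along any
    monotone path from the reference node (0, 0)."""
    conf = {(0, 0): 1.0}
    for r in range(rows):
        for c in range(cols):
            if (r, c) == (0, 0):
                continue
            candidates = []
            if c > 0:  # arrive via the left neighbour's right link
                candidates.append(conf[(r, c - 1)] * confidences[(r, c - 1, 'right')])
            if r > 0:  # arrive via the top neighbour's bottom link
                candidates.append(conf[(r - 1, c)] * confidences[(r - 1, c, 'bottom')])
            conf[(r, c)] = max(candidates)
    return conf
```

Because the best path to each node is chosen independently, a single bad link only degrades the nodes that have no better route around it.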

 

    Some implementation specific details:

  1. To find the transform between images, sub parts of the image were used as models. For the right transform, a slightly smaller image than the right half of the image was taken as model. For the bottom transform, a slightly smaller image than the bottom half of the image was taken as model.

  2. The confidence level was the forward fraction of the Hausdorff measure. This metric indicates what fraction of the model's pixels matched pixels in the image.

  3. For interpolating right and bottom transforms, the average of either top and bottom or right and left adjacent nodes was chosen depending upon the values of the confidence level. 

  4. All the images were registered with the first image at the left top as the reference frame.
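The alpha blending of step 6 can be sketched as a per-pixel weighted composite; in practice the alpha weights would ramp down toward each frame's borders so that seams between overlapping frames fade out. Images are represented here as plain nested lists of intensities for illustration.

```python
def blend(mosaic, image, x0, y0, alpha):
    """Composite `image` into `mosaic` at offset (x0, y0), weighting each
    pixel by alpha[i][j] in [0, 1]: new = a*image + (1-a)*mosaic."""
    for i, row in enumerate(image):
        for j, px in enumerate(row):
            a = alpha[i][j]
            old = mosaic[y0 + i][x0 + j]
            mosaic[y0 + i][x0 + j] = a * px + (1 - a) * old
```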

 

Results

Here is the result of the above approach. The video sequence was captured in the systems lab, and the mosaic was stitched from it. The sequence shows the serpentine motion of the camera, and the mosaic is the final result. There are 30 frames, one from each steady position of the camera, stitched into the mosaic going from top to bottom and right to left for each row.

Enhancements

  1. Right now, the best of only two paths is taken as the composite path for a node. This could be made more sophisticated, trying various paths and choosing the best one.

  2. The transforms currently used are simple translation and scaling. Ideally a projective transform should be implemented, to better represent the actual camera motion.

  3. Before stitching the final mosaic, a pass to check the consistency of the transforms can be made, resulting in more accurate pixel alignment.

  4. The code is not optimized for performance and could be made considerably faster.

Conclusions

A different approach has been proposed for generating an image mosaic from a video sequence for a specific application. The results are promising, and more work can be done following this approach to generate mosaics for other specific applications.

References

  1. Brown, L. G., "A Survey of Image Registration Techniques", ACM Computing Surveys, 1992.
  2. Davis, James, "Mosaics of Scenes with Moving Objects", Department of Computer Science, Stanford University.
  3. Fai, Leong Kian, "Video Mosaics: Multiple Resolution Preserved", Department of Computer Science, Cornell University.
  4. Huttenlocher, D. P. & Rucklidge, W. J., "A Multi-Resolution Technique for Comparing Images Using the Hausdorff Distance", Department of Computer Science, Cornell University.