February 13th, 1998
Overview
Methodology
Generating the affine transform
Applications
Knowing how the background in a sequence is moving (camera motion) lets us recognize a tracked object more effectively, and tells us how to recover a lost object when our other tracking methods fail or give uncertain results. We currently do this with affine block-motion estimation, which cheaply samples the motion of regions of the image and extrapolates the overall image movement from frame to frame. The block-motion sampling is a pixel-by-pixel comparison of binary images (each pixel is either on or off) generated from the sign of the Laplacian of the Gaussian-smoothed image, which provides dense features for comparison.
We begin our background estimation method with the assumption that camera motion is well approximated by an affine transformation of the x and y coordinates. (An affine transformation is a linear transformation followed by a translation; here it is some combination of stretching/scaling, rotation, and translation that, applied to a point (x,y), produces the point (x',y') where that point would land under the given scaling, rotation, and translation.) This assumption is motivated by the fact that most observed scene motion (camera motion) is some combination of a zoom, a rotation, and a translation.
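As a minimal sketch of the transformation described above, the following builds one from a scale, a rotation angle, and a translation, and applies it to a point. The six-parameter layout (a, b, c, d, e, f) and the function names are assumptions made for illustration, not taken from the original implementation.

```python
import math

def make_affine(scale, angle_rad, tx, ty):
    """Return (a, b, c, d, e, f) with x' = a*x + b*y + c and y' = d*x + e*y + f."""
    cos_t, sin_t = math.cos(angle_rad), math.sin(angle_rad)
    return (scale * cos_t, -scale * sin_t, tx,
            scale * sin_t, scale * cos_t, ty)

def apply_affine(params, x, y):
    """Map the point (x, y) to (x', y') under the given parameters."""
    a, b, c, d, e, f = params
    return (a * x + b * y + c, d * x + e * y + f)

# Hypothetical camera motion: a 10% zoom, no rotation, a shift of (3, -2) pixels.
zoom_and_pan = make_affine(1.1, 0.0, 3.0, -2.0)
print(apply_affine(zoom_and_pan, 10.0, 20.0))
```

Note that x' and y' each depend on their own three parameters, which is what allows the two fits to be treated separately.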
We approach the problem as a least-squares fitting problem in both x and y. By computing motion vectors at uniformly spaced locations across the image, each describing how a pixel in the initial image corresponds to its counterpart in the final image, we can set up a simple minimization problem to determine the best fit of the points. (The best fit is the background motion because we assume that the moving objects are smaller than the area in which they are being tracked.) This problem turns out to be convenient to implement: in the simple case the x and y minimizations are independent, so we can compute both at the same time while generating the vectors. We also don't have to store each of the hundred or so vectors while processing, because weighted sums of their motions are sufficient for the fitting. The method is reasonably efficient to compute (a few frames per second on a Pentium Pro based PC) and keeps us within our goal of quasi-real-time tracking. More robust affine fitting methods exist, but most require significantly more processing power.
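One way to see why the vectors need not be stored: the least-squares normal equations depend only on a handful of running sums. The sketch below is an illustration under assumptions, not the original code. The class name, the per-vector weight (which might be, say, the match quality of a block), and the use of Cramer's rule for the 3x3 solve are all choices made here.

```python
def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def _solve3(m, rhs):
    """Solve the 3x3 system m * u = rhs by Cramer's rule."""
    det = _det3(m)
    if abs(det) < 1e-12:
        raise ValueError("sample positions are degenerate (e.g. collinear)")
    out = []
    for col in range(3):
        replaced = [[rhs[r] if c == col else m[r][c] for c in range(3)]
                    for r in range(3)]
        out.append(_det3(replaced) / det)
    return tuple(out)

class AffineAccumulator:
    """Running sums for a weighted least-squares affine fit (hypothetical helper)."""

    def __init__(self):
        self.m = [[0.0] * 3 for _ in range(3)]  # sum of w * p * p^T, with p = (x, y, 1)
        self.rx = [0.0] * 3                     # sum of w * x' * p
        self.ry = [0.0] * 3                     # sum of w * y' * p

    def add(self, x, y, dx, dy, weight=1.0):
        """Fold in one motion vector (dx, dy) measured at position (x, y)."""
        p = (x, y, 1.0)
        for i in range(3):
            for j in range(3):
                self.m[i][j] += weight * p[i] * p[j]
            self.rx[i] += weight * (x + dx) * p[i]
            self.ry[i] += weight * (y + dy) * p[i]

    def solve(self):
        """Return (a, b, c, d, e, f): x' = a*x + b*y + c, y' = d*x + e*y + f."""
        return _solve3(self.m, self.rx) + _solve3(self.m, self.ry)

# Hypothetical usage: vectors sampled on a coarse grid from a known motion.
true = (1.02, -0.01, 2.0, 0.01, 1.02, -1.0)
acc = AffineAccumulator()
for y in (20, 100, 180):
    for x in (20, 100, 180, 260):
        dx = true[0] * x + true[1] * y + true[2] - x
        dy = true[3] * x + true[4] * y + true[5] - y
        acc.add(x, y, dx, dy)
fitted = acc.solve()
print(fitted)
```

The x and y fits share the same 3x3 matrix of sums and differ only in the right-hand side, so both can be accumulated in a single pass over the vectors.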
Generating the affine transform
So how do we generate the motion vectors?
The bulk of the affine fitting work lies in generating the motion vectors (the difference in a pixel's location from the initial image to the final image). Our approach is to reduce the images to binary form via the sign of the Laplacian of the Gaussian-smoothed image (figure 1) and do a cheap translation-only Hausdorff search on small blocks (~13x13 pixels), taking the matches that have the best fraction of pixels matching.
Figure 1: An image and its representation under the sign of the Laplacian of the Gaussian-smoothed image
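The binarization illustrated in figure 1 might be sketched as below. This is an unoptimized illustration in plain Python. The kernel size, the value of sigma, the border clamping, and the function names are assumptions and are not specified in the text.

```python
import math

def log_kernel(sigma, radius):
    """Sampled Laplacian-of-Gaussian kernel, shifted so that it sums to zero."""
    size = 2 * radius + 1
    k = [[0.0] * size for _ in range(size)]
    for j in range(size):
        for i in range(size):
            r2 = (i - radius) ** 2 + (j - radius) ** 2
            k[j][i] = ((r2 - 2 * sigma ** 2) / sigma ** 4
                       * math.exp(-r2 / (2 * sigma ** 2)))
    mean = sum(map(sum, k)) / (size * size)
    return [[v - mean for v in row] for row in k]

def sign_of_log(image, sigma=1.5, radius=4):
    """Return a 0/1 image that is 1 where the LoG response is positive."""
    h, w = len(image), len(image[0])
    k = log_kernel(sigma, radius)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(-radius, radius + 1):
                yy = min(max(y + j, 0), h - 1)      # clamp at the image border
                for i in range(-radius, radius + 1):
                    xx = min(max(x + i, 0), w - 1)
                    acc += k[j + radius][i + radius] * image[yy][xx]
            out[y][x] = 1 if acc > 0 else 0
    return out

# Hypothetical input: a vertical step edge, dark on the left, bright on the right.
step = [[0] * 8 + [100] * 8 for _ in range(16)]
binary = sign_of_log(step)
print(binary[8])
```

On a step edge the response changes sign across the edge, positive on the dark side and negative on the bright side, which is why the binary image flips exactly where the intensity changes.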
For each block, we crop a small block out of the final image and do a translation fit of it, in both x and y, to the region of the initial image around the location it was cut from (the search range of the translation fit is a run-time parameter based on expected maximum motion), always preferring, among matches of the same quality, the one closest to no translation. Figure 2 shows a typical sample block, and figure 3 shows a simplified search for the best fit.
Figure 2: A sample search block sampled from the image in figure 1 (enlarged)
Figure 3: A graphical representation of the search for the best match of a block onto the initial image
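The block search of figures 2 and 3 could look roughly like the following. This is a sketch under assumptions: it scores each candidate offset by a plain count of agreeing pixels, which stands in for the Hausdorff-style matching mentioned above, and the function name and argument layout are hypothetical.

```python
import random

def match_block(block, image, top, left, search):
    """Find the offset (dx, dy) within +/-search that best aligns block with image.

    block was cut at row top, column left of the other frame.
    Returns (dx, dy, fraction_of_matching_pixels).
    """
    bh, bw = len(block), len(block[0])
    h, w = len(image), len(image[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y0, x0 = top + dy, left + dx
            if y0 < 0 or x0 < 0 or y0 + bh > h or x0 + bw > w:
                continue                       # candidate falls outside the image
            same = sum(block[j][i] == image[y0 + j][x0 + i]
                       for j in range(bh) for i in range(bw))
            # More agreeing pixels wins; ties go to the smaller translation.
            key = (-same, dx * dx + dy * dy)
            if best is None or key < best[0]:
                best = (key, dx, dy, same / (bh * bw))
    if best is None:
        raise ValueError("block does not fit inside the image")
    return best[1:]

# Hypothetical data: a random binary frame, and the same scene moved
# 2 pixels right and 1 pixel down (wrapping around at the borders).
rng = random.Random(0)
initial = [[rng.randint(0, 1) for _ in range(32)] for _ in range(32)]
final = [[initial[(y - 1) % 32][(x - 2) % 32] for x in range(32)]
         for y in range(32)]
block = [row[10:23] for row in final[10:23]]   # 13x13 block cut at (10, 10)
print(match_block(block, initial, top=10, left=10, search=4))
```

In this example the block is found at offset (-2, -1) in the initial frame with every pixel matching, so the scene content moved by (+2, +1) from the initial frame to the final one. Which sign convention is convenient depends on how the vectors are fed into the fit.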
After all the blocks have been matched in the initial image, some simple linear algebra turns the motion vectors into a matrix for the affine transformation in x and y from the initial image to the final image.
This transformation gives us much information. By transforming a point near the tracked object, we learn the motion of the background there, and can subtract it from the object's observed motion to get the object's motion with respect to the background. This knowledge has been very important in recovering an object that leaves the field of view of the camera: by continuing to estimate the background, we can predict where the object should come back into frame if it continues its course and speed. This information also aids us in the generation of new models: by knowing what is background and what isn't, we can allow the form of the model to change and expand while reducing the clutter we might otherwise gather into the model. When tracking motion in non-rigid scenes, camera motion is both necessary and a reality, so effective but cheap estimates of the background are necessary in order to keep track of an object over long sequences.
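The subtraction described above can be illustrated with a short sketch. The parameter layout (a, b, c, d, e, f), with x' = a*x + b*y + c and y' = d*x + e*y + f, and the function names are assumptions made for this example.

```python
def background_motion_at(params, x, y):
    """Displacement that the fitted background transform predicts at (x, y)."""
    a, b, c, d, e, f = params
    return (a * x + b * y + c - x, d * x + e * y + f - y)

def object_motion_relative_to_background(params, obj_before, obj_after):
    """Observed object displacement minus the background displacement there."""
    bx, by = background_motion_at(params, obj_before[0], obj_before[1])
    return (obj_after[0] - obj_before[0] - bx,
            obj_after[1] - obj_before[1] - by)

# Hypothetical case: the camera pans so the background shifts 5 pixels left,
# while the tracked object stays at the same place in the image.
pan_left = (1.0, 0.0, -5.0, 0.0, 1.0, 0.0)
relative = object_motion_relative_to_background(pan_left, (100, 80), (100, 80))
print(relative)
```

Here the object does not move in the image, yet relative to the background it has moved 5 pixels to the right, which is the quantity that matters when predicting where it will go next.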
© 1998 Walter Bell