Walter Bell (email@example.com)
Dan Huttenlocher (firstname.lastname@example.org)
February 14th, 1998
Change Tolerant Model Acquisition
Implementation / API
Tracking and recognizing moving objects remains one of the hard problems in computer vision. The ability to track and retain information about an object over long time periods has numerous applications, from intelligence to traffic control to human-computer interaction. We present an approach for following targets chosen at run time from frame to frame in video streams. It does not rely on the assumption of spatial locality; instead it uses model-based feature recognition that is flexible enough to handle objects that change shape while still recognizing them. Our method makes recognition of objects the primary focus rather than motion, so objects continue to be tracked regardless of their motion (or lack thereof).
We wish to provide a sturdy system for tracking multiple objects over long periods of time with arbitrary camera movement. Our tracking strategy involves heuristics for recovering objects that leave the view frame or become occluded. Because of today's bandwidth problems, we expect to be working with less than the usual 'perfect' vision data: data compressed heavily enough to show MPEG artifacts so that it can be streamed in real time. Our secondary goal is to use methods that can be computed efficiently on today's multi-processor or high-end computers at several frames per second (hence quasi real-time). We wish to provide a solution that can be implemented and used in the foreseeable future; only through actual goal-based use can the challenges that tracking systems must overcome be determined.
We approach tracking as a problem of model recognition: we have a binary representation (see Figure 1) of the target to be tracked, and we use a search based on the Hausdorff distance (for more information, see Hausdorff Based Image Comparison) to look for the object in regions of the image. For the binary representation of the target (a model), we augment the output of the standard Canny edge detector, applied to the Gaussian-smoothed image, with the notion of a model history (for more information on our alterations to the standard Canny operator output, see Change Tolerant Model Acquisition). At each frame, we do a Hausdorff search for each target, using the Canny edges from the current image and the current model. Simultaneously, we do an affine estimation to approximate the net background motion. From the results of these two searches, we have considerable information about the target: we can approximate its motion, and we can separate the background from motion in the region of the target. To handle hazard conditions competently (such as the object becoming occluded, going into a shadow, or leaving the frame, or camera distortion degrading image quality), we retain history data about the target, such as its past motion and size change, characteristic views of the target (snapshots over time that accurately represent the different ways the target has looked), and past match qualities.
Figure 1: A sequence image with the Canny operator edges overlaid (enlarged, left),
and our representation of the model as a binary image (enlarged, right).
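As a rough illustration of the per-frame search described above, the following Python sketch computes a directed Hausdorff distance from model pixels to edge pixels and tries every translation in a small window. The function names, the `fraction` parameter (a ranked, partial form of the distance), and the brute-force translation search are assumptions made for illustration; they are not the tracker's actual implementation.

```python
from typing import Iterable, List, Tuple

Point = Tuple[int, int]


def directed_hausdorff(model: Iterable[Point], edges: Iterable[Point],
                       fraction: float = 1.0) -> float:
    """Ranked directed Hausdorff distance from model points to edge points.

    With fraction=1.0 this is the classic directed distance (the largest
    nearest-neighbour distance). A smaller fraction ignores the worst-matching
    model points, which tolerates partially hidden targets.
    """
    model_list, edge_list = list(model), list(edges)
    if not model_list or not edge_list:
        raise ValueError("model and edges must both be non-empty")
    # Distance from each model point to its nearest edge point, sorted.
    nearest = sorted(
        min(((mx - ex) ** 2 + (my - ey) ** 2) ** 0.5 for ex, ey in edge_list)
        for mx, my in model_list
    )
    rank = min(len(nearest), max(1, int(fraction * len(nearest))))
    return nearest[rank - 1]


def best_translation(model: List[Point], edges: List[Point], radius: int,
                     fraction: float = 1.0) -> Tuple[float, Point]:
    """Try every integer shift within +/- radius; return (distance, shift)."""
    best = None
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            shifted = [(x + dx, y + dy) for x, y in model]
            dist = directed_hausdorff(shifted, edges, fraction)
            if best is None or dist < best[0]:
                best = (dist, (dx, dy))
    return best
```

The exhaustive loop is quadratic in the window size and is only meant to show what is being minimized; a practical search would prune shifts that cannot beat the current best.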
We have found this tracking history to be useful for more than handling hazard conditions: a solid motion tracking system must involve history data, not just a frame-by-frame method of motion comparison. The history state tells us how to decide what should be considered part of the target (e.g. things close to the object and moving at the same speed should be incorporated into it). With information about motion and size, we can also predict where a lost object has gone or where it might reappear, which has been quite successful in recovering targets that leave the frame and reappear later.
An inherent difficulty in our motion tracking goals is that we allow the camera to move arbitrarily (as opposed to assuming a stationary camera), and unpredictable changes in camera motion make a tracking system much harder to develop. We have been using a computationally efficient affine background estimation scheme to give us information about the motion of the camera and scene. For a more complete description of the background estimation scheme, see Estimating the Background.
We generate an affine transformation from the image at time t to the image at time t+dt, which gives us a way of correlating the motion in the two images. This background information lets us synthesize an image at time t+dt from the image at time t and the affine transform, which is our approximation of the net scene motion. The synthesized image has been useful in generating new model information and removing background clutter from our model space, because the actual image at t+dt and the synthesized one can be differenced to remove major image features from the space surrounding targets.
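The synthesize-and-difference step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: images are plain lists of grayscale rows, the affine transform is given as a hypothetical tuple (a, b, c, d, tx, ty) meaning x' = a*x + b*y + tx and y' = c*x + d*y + ty, and nearest-neighbour sampling and a fixed threshold stand in for whatever the real system uses.

```python
from typing import List, Tuple

Image = List[List[int]]
Affine = Tuple[float, float, float, float, float, float]


def warp_affine(image: Image, affine: Affine, fill: int = 0) -> Image:
    """Synthesize the image at t+dt from the image at t and an affine map."""
    a, b, c, d, tx, ty = affine
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine transform is not invertible")
    height, width = len(image), len(image[0])
    out = [[fill] * width for _ in range(height)]
    for y2 in range(height):
        for x2 in range(width):
            # Map each output pixel back to its source location at time t.
            u, v = x2 - tx, y2 - ty
            x = round((d * u - b * v) / det)
            y = round((-c * u + a * v) / det)
            if 0 <= x < width and 0 <= y < height:
                out[y2][x2] = image[y][x]
    return out


def residual_mask(predicted: Image, actual: Image, threshold: int) -> Image:
    """Mark pixels where the actual frame differs from the predicted one."""
    return [
        [1 if abs(p - q) > threshold else 0 for p, q in zip(p_row, q_row)]
        for p_row, q_row in zip(predicted, actual)
    ]
```

Pixels that the background motion explains cancel out in the difference, so the mask keeps mainly what moved independently of the scene.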
In addition to using the affine transform as a tool to clean up our search space, we use it to normalize the coordinate movement of our targets: given a vector of how the background is moving and a vector of how the target is moving, we can difference the two to obtain the motion of the target with respect to the background. This vector lets us predict where a target should be and anticipate hazard conditions (looking ahead in the direction of motion gives us clues about coming obstacles, such as leaving the frame or entering an occlusion), as well as keep track of where the object should be during a hazard condition. When an object enters a hazard condition, we can still estimate the background motion, and we use that together with our previous knowledge of the model's movement to guess where the model will reappear or re-enter the frame.
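The vector bookkeeping in this paragraph can be sketched as follows. As a simplifying assumption, the background motion near the target is treated as a single translation vector per frame (an affine transform would really give a position-dependent vector), and all function names are hypothetical.

```python
from typing import Iterable, Tuple

Vector = Tuple[float, float]


def relative_motion(target_motion: Vector, background_motion: Vector) -> Vector:
    """Motion of the target with respect to the background."""
    return (target_motion[0] - background_motion[0],
            target_motion[1] - background_motion[1])


def predict_position(last_position: Vector, relative: Vector,
                     background_motions: Iterable[Vector]) -> Vector:
    """Dead-reckon a lost target through frames where only the background
    motion can still be measured, assuming its relative motion is unchanged."""
    x, y = last_position
    for bx, by in background_motions:
        x += bx + relative[0]
        y += by + relative[1]
    return (x, y)


def will_leave_frame(position: Vector, image_motion: Vector,
                     frame_size: Tuple[int, int], lookahead: int) -> bool:
    """Look ahead along the image-space motion to anticipate a frame exit."""
    x = position[0] + image_motion[0] * lookahead
    y = position[1] + image_motion[1] * lookahead
    width, height = frame_size
    return not (0 <= x < width and 0 <= y < height)
```

For example, a target moving (5, 1) pixels per frame over a background moving (2, 1) has a relative motion of (3, 0), which is the quantity carried forward while the target is hidden.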
The background estimation has been a key factor in the prolonged tracking of objects. We have found that short-term tracking is possible without a background estimate, but after a period of time, object distortion and hazards become too difficult to cope with effectively without a good estimate of the background.
Change Tolerant Model Acquisition
One of the advantages of using the Hausdorff distance as a matching operator is that it is quite tolerant of changes in shape when matching, but in addition we have needed a way to define the objects being tracked more accurately. Most of the models we have been dealing with have been quite small (smaller than 32x32 pixels), and getting a good representation of an object that small with most sparse, stable feature extractors has been quite difficult. To that end, we have developed a representation of our models that provides us with more information than is normally available.
Straight dilation-based methods of grabbing a new model from the time t+1 image have been relatively effective, but in situations where there are non-object features close to the object (which occurs quite often), the dilation method tends to fail quite badly by slowly incorporating the entire scene into the model. We needed a method of updating our model from frame to frame that was tolerant of changes in the model shape, but not so relaxed that it incorporated non-model pixels into the model. Our approach combines background removal with adding the previous models to the current model match window. We keep the pixels that appear to be stable, along with the new pixels surrounding them; over time the new pixels are either eliminated from the model because they are not stable or incorporated into it. This has been quite effective in keeping our models relatively 'clean' of clutter in the image: a road close to a truck no longer gets pulled into the model pixel by pixel. Our models appear dilated, but this is a result of the history effect in how we grab models. It also has the nice feature of making our search results more definite, because we have more model pixels that can match in the next frame.
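One way to read the update rule above is as a per-pixel stability count. The Python sketch below is only an interpretation: the counter scheme, the 3x3 neighbourhood, and the `stable_after` and `max_count` parameters are all assumptions, and the edge pixels are assumed to be already expressed in the model's coordinate frame with background clutter removed.

```python
from typing import Dict, Set, Tuple

Point = Tuple[int, int]


def neighbourhood(p: Point) -> Set[Point]:
    """The pixel itself plus its eight neighbours."""
    return {(p[0] + dx, p[1] + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}


def model_pixels(history: Dict[Point, int], stable_after: int = 2) -> Set[Point]:
    """Pixels seen often enough to count as part of the model."""
    return {p for p, count in history.items() if count >= stable_after}


def update_model(history: Dict[Point, int], edges: Set[Point],
                 stable_after: int = 2, max_count: int = 4) -> Dict[Point, int]:
    """Return the new stability counts after observing one frame's edges."""
    near_stable: Set[Point] = set()
    for p in model_pixels(history, stable_after):
        near_stable |= neighbourhood(p)
    updated: Dict[Point, int] = {}
    # Known pixels are always re-scored; new pixels may only enter the
    # history if they touch a pixel that is already stable.
    for p in set(history) | (edges & near_stable):
        if p in edges:
            count = min(history.get(p, 0) + 1, max_count)
        else:
            count = history.get(p, 0) - 1
        if count > 0:
            updated[p] = count
    return updated
```

Under this scheme an edge pixel far from the model never enters the history, while a pixel next to the model must persist for several frames before it counts, which mirrors the behaviour described for the road beside the truck.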
Implementation / API
At each frame, there is a significant amount of computation to be done: smoothing and feature extraction, Hausdorff matching for each target (at least one match per model), and affine background estimation. Each of these operations is computationally expensive on its own, so to keep to our original goal of being quasi real-time on today's high-end computers, we have been using parallelism as much as possible (on computers with more than one processor, we can effectively do two or more things at once). We have been working under Windows NT v4.0, using multithreading to draw as much power as we can from our dual Pentium Pro machines, and we have been able to reach around 5 frames per second while tracking one object and performing all the global feature extraction and background estimation.
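The per-frame structure described above, with background estimation running alongside one match per target, might be organized as in the following Python sketch. It only illustrates the task layout: `match_target` and `estimate_background` are hypothetical callables supplied by the caller, and in CPython a thread pool gives real speed-up only when the work releases the interpreter lock (for example inside native code), unlike the native NT threads the text refers to.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Tuple


def process_frame(edges: Any, targets: Dict[str, Any],
                  match_target: Callable[[Any, Any], Any],
                  estimate_background: Callable[[Any], Any],
                  workers: int = 2) -> Tuple[Any, Dict[str, Any]]:
    """Run background estimation and one match per target concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Background estimation does not depend on any match result,
        # so it is submitted first and overlaps with the matching.
        background_job = pool.submit(estimate_background, edges)
        match_jobs = {
            name: pool.submit(match_target, edges, model)
            for name, model in targets.items()
        }
        matches = {name: job.result() for name, job in match_jobs.items()}
        return background_job.result(), matches
```

Feature extraction is assumed to have been done once beforehand, since both the matching and the background estimate consume its output.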
We have an MPEG-compressed sequence available for download that shows the output of the tracker on a car sequence involving multiple hazards, such as occlusion and leaving the frame.
Download from www.cs.cornell.edu
© 1998 walter bell