Detection and Long Term Tracking of Moving Objects in Aerial Video

Walter Bell
Pedro Felzenszwalb
Dan Huttenlocher

March 26th, 1999


System Overview
   Image Server
   Background Motion Estimator
   New Target Detector
   Long Term Tracker



We describe a system for detecting and tracking independently moving objects in aerial video imagery, such as vehicles on a roadway. The system is able to follow multiple objects while maintaining the identity of each object, including when objects are temporarily hidden from view or when they stop moving and then re-start. The system uses a combination of affine image registration and local motion estimation to detect patches of the image that may be moving differently from the background. When such a patch persists for several frames, a model-based tracker is used to follow the corresponding object over longer time periods. The system employs several such model-based trackers so that it can simultaneously track multiple objects. The overall system runs at several frames per second on a small cluster of Pentium-II workstations.

System Overview

We divided the overall tracking system into five sub-systems, or modules, each concerned with a specific, relatively high-level task. These modules are:

(i) an image server;
(ii) a background motion estimator;
(iii) a new target detector;
(iv) a long term tracker;
(v) a coordinator.

The modules have clearly defined boundaries and interfaces with each other, which enables independent development and testing of the different modules. This approach also provides a good deal of flexibility. By keeping the interfaces between the modules abstract and general, we could replace one implementation of a module with another while making only very small changes to the system as a whole. The choice of interfaces between modules was an important design decision, and is considered in the discussion of the individual modules below.

In the overall system structure, all the modules essentially act as slaves to the high-level coordination module. This high-level module handles all policies for the lower level modules, such as deciding when to give up tracking a particular target, or deciding that a new target should be assigned to a long term tracker. The high-level coordinator is where any application built on top of the tracker would interface. In the current system, the "application" is simply the user interface, which displays the results and allows the user to control the tracker. This interface allows the user to perform certain high-level functions, such as marking a location on a roadway and counting the number of vehicles that cross that location.

The hierarchical control structure simplifies communication among modules as well as making the overall system more understandable. With this hierarchical approach one knows how the system will perform on a given task by looking at how the individual pieces would react coupled with what policies are enforced by the coordinator. One of the important aspects of this design is that the coordinator deals only with high-level events; it is concerned with targets and frames rather than pixels and images, which makes reasoning about the system more natural.

Each module of the system is implemented as a separate multi-threaded server process that communicates with the others via asynchronous messages over TCP/IP. This allows the computation of the tracking system to be done on a single computer or on a small cluster of computers without modification. It also allows the actual processing of the tracker to be done on computationally powerful machines in a remote location, with the output displayed on a machine that has little computational power. The sub-system architecture is also important for overall system performance, because it is what enables parallelism to be exploited when multiple processors are available.

The module divisions yield a great deal of performance through asynchrony, since there are relatively few dependencies among the modules on a frame-by-frame basis. For example, most of the sub-systems do not need the background motion estimate for the current frame until late in their processing of that frame. Those modules can therefore do most of their work in parallel with the background estimation. Exploiting these kinds of loose dependencies accounts for much of the system's overall speed. The dependencies impose only a partial ordering on the processing. Thus, rather than constraining the system to be frame synchronized, we simply require the various modules to be within a constant number of frames of each other, called the "frame span" of the system. This enables the system to exploit parallelism, up to the data dependencies among the modules. The asynchrony yields a substantial performance gain because the amount of time that a given module takes to process a single frame varies. By letting modules get ahead when their processing is relatively cheap and give up that lead when their processing is more computationally intensive, the system is not directly constrained by the speed of the slowest process at each frame.
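The frame span rule can be illustrated with a small sketch. This is not the system's actual synchronization code; the class name `FrameSpanGate`, its methods, and the convention that a module may start frame f only when f is at most `frame_span` frames beyond the last frame finished by every module are all assumptions made for illustration.

```python
class FrameSpanGate:
    """Hypothetical helper: keeps all modules within a fixed frame span."""

    def __init__(self, modules, frame_span):
        self.frame_span = frame_span
        # Last frame each module has finished; -1 means none yet.
        self.done = {name: -1 for name in modules}

    def slowest(self):
        """Last frame that every module has finished processing."""
        return min(self.done.values())

    def may_start(self, frame):
        """True if starting `frame` stays within the span of the slowest module."""
        return frame - self.slowest() <= self.frame_span

    def finish(self, module, frame):
        """Record that `module` has finished processing `frame`."""
        self.done[module] = frame


gate = FrameSpanGate(["background", "detector", "tracker"], frame_span=2)
# A fast module may run ahead by up to two frames, then has to wait.
ahead_ok = gate.may_start(1)
blocked = not gate.may_start(2)
# Once every module has finished frame 0, frame 2 becomes available.
for name in ("background", "detector", "tracker"):
    gate.finish(name, 0)
released = gate.may_start(2)
```

A module that is cheap on some frames builds up a lead of at most `frame_span` frames, and spends that lead when a later frame is expensive.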


Image Server

The image server is a process that distributes image frames from a sequence to various clients. The image server was designed to provide a general interface for clients to get sequential images from image streams. By making the client interface to the image server abstract, various types of image sequences can be provided in a manner that is transparent to the client. Grabbing frames from real-time cameras uses the same client interface as reading a sequence of frames from disk. The image server takes a "sequence-oriented" view of image streams; a sequence is an ordered list of images with a particular image that is considered to be the current image. Multiple clients can attach to a specific sequence and request the current frame. The current frame is updated by one of the clients, which is deemed the "master"; only that client can request that the current frame be advanced. In the tracking system, the coordinator acts as the master. When the current frame is advanced, a message is sent to the other clients of the sequence informing them of the frame change. These clients can then request the image if they want to.
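The sequence-oriented view with a single master client might be sketched as follows. This is an in-process illustration only (the real server communicates over TCP/IP); the class `ImageSequence`, its method names, and the use of per-client inbox lists to stand in for notification messages are hypothetical.

```python
class ImageSequence:
    """Hypothetical in-process stand-in for one sequence on the image server."""

    def __init__(self, frames, master):
        self.frames = list(frames)
        self.master = master      # only this client may advance the sequence
        self.current = 0          # index of the current image
        self.inboxes = {}         # client name -> list of frame-change notices

    def attach(self, client):
        """Register a client so it receives frame-change notifications."""
        self.inboxes.setdefault(client, [])

    def current_frame(self):
        """Any attached client may request the current image."""
        return self.frames[self.current]

    def advance(self, client):
        """Advance the current frame; allowed for the master client only."""
        if client != self.master:
            raise PermissionError("only the master may advance the sequence")
        if self.current + 1 >= len(self.frames):
            return False          # end of sequence
        self.current += 1
        # Tell the other clients that the current frame has changed.
        for name, inbox in self.inboxes.items():
            if name != self.master:
                inbox.append(self.current)
        return True


seq = ImageSequence(["frame0", "frame1", "frame2"], master="coordinator")
seq.attach("coordinator")
seq.attach("detector")
seq.advance("coordinator")
```

After the call to `advance`, the detector's inbox holds the new frame index, and the detector can decide whether to request the image itself.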

Background Motion Estimator

Aerial imagery inherently involves significant camera motion. In order to compensate for the background motion at each frame, the background motion estimator computes the affine transformation of the background from one time frame to the next (a linear transformation followed by a translation, which can represent rotation, scaling, shear, and translation). On a given frame, it computes the affine transformation from the previous frame to the current frame, and places that transformation in a circular array of results that covers the frames in the system frame span. Other modules request affine parameters for a particular frame t (that is, the transformation of the background from frame t-1 to frame t), and the background estimator replies with a vector of the six parameters that define an affine transformation.
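As a rough sketch of this interface, the following stores one six-parameter vector per frame in a circular array and applies a stored transformation to a point. The parameter ordering (a, b, c, d, tx, ty), with x' = a*x + b*y + tx and y' = c*x + d*y + ty, is an assumption; the source does not state the ordering. The names `AffineRing` and `apply_affine` are likewise hypothetical.

```python
class AffineRing:
    """Hypothetical circular array of per-frame affine parameters."""

    def __init__(self, frame_span):
        self.slots = [None] * frame_span

    def put(self, frame, params):
        """Store the transformation from frame-1 to frame, reusing old slots."""
        self.slots[frame % len(self.slots)] = (frame, params)

    def get(self, frame):
        """Return the parameters for `frame`, or None if absent or overwritten."""
        entry = self.slots[frame % len(self.slots)]
        if entry is None or entry[0] != frame:
            return None
        return entry[1]


def apply_affine(params, x, y):
    """Map a point from frame t-1 into frame t (assumed parameter order)."""
    a, b, c, d, tx, ty = params
    return (a * x + b * y + tx, c * x + d * y + ty)


ring = AffineRing(frame_span=3)
ring.put(5, (1, 0, 0, 1, 2, -1))        # pure translation by (2, -1)
moved = apply_affine(ring.get(5), 10, 20)
```

Because the array only covers the frame span, a request for a frame that has fallen outside the span returns nothing rather than stale parameters.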

New Target Detector

There are several stages of processing in detecting objects that are moving differently from the background. First, successive pairs of frames are registered using an affine transformation, in order to correct for the background motion. Then the residual motion, after affine registration, is estimated for patches in the image. These patches are found by segmenting individual image frames, estimating the translational motion of each region in the segmentation, and then merging neighboring regions that have the same motion magnitude. Finally, moving objects are detected by aggregating the information about these moving patches across several successive frames, using a local tracking method.

The module for detecting moving objects is the most computationally intensive, and required several tricks to get good performance. First, in order to make sure this module is always processing, at the start of each frame it requests all the results that it will need during that frame, in the hope that those results will arrive before they are needed. In order to increase the system speed it was important to make as much computation run in parallel as possible, but unfortunately most of the work in the detection of new targets is inherently sequential in nature. We got around this problem by overlapping the processing of different frames, within the frame span of the system. The results of the previous frame are not used until the end of the subsequent frame, when they are needed for the short term matching. Therefore we run the beginning part of the computation of one frame overlapped with the ending part of the computation of the previous frame. By arranging the computation so that these two parts take about equal time, we gained a substantial speed improvement by running them in parallel on two threads. The interface of the new target detection module with the coordinator is quite simple. When the coordinator requests the possible new targets, the module returns two lists of bounding boxes: objects that are moving at this time step (instantaneous motion), and objects that have been detected for long term tracking.
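The overlap of the two halves of the computation can be sketched with a two-worker thread pool. The functions `first_half` and `second_half` are placeholders invented for illustration, standing in for registration and segmentation on one hand and short term matching on the other; their return values are dummy data. In CPython, threads only give a real speed-up when the work releases the interpreter lock, so this shows the scheduling pattern rather than the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def first_half(frame):
    """Placeholder for the part that does not need the previous frame's results."""
    return {"frame": frame, "patches": []}


def second_half(patches, previous):
    """Placeholder for short term matching against the previous frame's results."""
    matched = None if previous is None else previous["frame"]
    return {"frame": patches["frame"], "matched_against": matched}


def run_overlapped(frames):
    """Run first_half of frame t+1 concurrently with second_half of frame t."""
    results = []
    previous = None
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = pool.submit(first_half, frames[0]) if frames else None
        for i in range(len(frames)):
            patches = pending.result()
            # Start the next frame's first half before finishing this frame.
            if i + 1 < len(frames):
                pending = pool.submit(first_half, frames[i + 1])
            finishing = pool.submit(second_half, patches, previous)
            previous = finishing.result()
            results.append(previous)
    return results


overlapped = run_overlapped([0, 1, 2])
```

The pattern pays off most when the two halves take about the same time, since neither worker then sits idle waiting for the other.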

Long Term Tracker

The long term tracker uses a simple two-dimensional geometric representation of an object at a given time frame, based on extracting certain intensity edges from images. Given the model of an object at time t, the tracker searches the image at time t+1 for the best location of this model. This best match is used both to determine where the object is in the image, and to construct a new model for time t+1. Within the overall tracking system there are multiple long term trackers, one for each object that is being followed. In this section we describe the operation of a single long term tracker. The use of a simple two-dimensional geometric model enables the long term tracker to successfully follow an object even when it stops moving or when it re-appears after being temporarily hidden from view. A threshold on the match quality is used to determine whether or not the object is present in a given frame. When no match is found, the location of the object is predicted based on its previous motion with respect to the background. If no match is found for several successive frames, then the long term tracker gives up and declares the object to be lost. The number of frames that the tracker waits before giving up depends on how long the object has successfully been followed. Thus the tracker can follow an object that is hidden from view for a substantial time period, without spuriously following objects that were falsely detected (because the latter kinds of objects will not have been successfully followed for very long).
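The give-up policy described above might look like the following sketch. The specific rule in `allowed_misses` (half the number of frames followed, clamped between 2 and 30) is an invented example, since the source only says the waiting time depends on how long the object has been followed. Positions are assumed to be already compensated for background motion, and a higher score is assumed to mean a better match.

```python
def allowed_misses(frames_followed, ratio=0.5, minimum=2, maximum=30):
    """Hypothetical rule: tolerate more misses for longer-followed objects."""
    return max(minimum, min(maximum, int(frames_followed * ratio)))


class TrackState:
    """Hypothetical per-object state kept by one long term tracker."""

    def __init__(self, position):
        self.position = position
        self.velocity = (0, 0)
        self.frames_followed = 0
        self.misses = 0

    def update(self, match, score, threshold):
        """Return 'found', 'predicted' or 'lost' for the current frame."""
        if match is not None and score >= threshold:
            self.velocity = (match[0] - self.position[0],
                             match[1] - self.position[1])
            self.position = match
            self.frames_followed += 1
            self.misses = 0
            return "found"
        self.misses += 1
        if self.misses > allowed_misses(self.frames_followed):
            return "lost"
        # No acceptable match: extrapolate from the last observed motion.
        self.position = (self.position[0] + self.velocity[0],
                         self.position[1] + self.velocity[1])
        return "predicted"


track = TrackState((0, 0))
statuses = [track.update((2, 1), 0.9, 0.5)]
statuses += [track.update(None, 0.0, 0.5) for _ in range(3)]
```

A target followed for only one frame is dropped after a few misses, while one followed for a hundred frames would be carried through a much longer occlusion.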

An object is represented as a bitmap M_t at each time frame t. When a long term tracker is first assigned a new object to track it is given an initial model of that object. This initial model is generally quite crude, as it is extracted automatically based on the motion and edge information from a single image frame. At each time frame, the long term tracker finds the best match of M_t to the image at the next time frame, I_{t+1}, using the generalized Hausdorff measure. When a match is found, M_{t+1} is constructed using edges both from M_t and from the matching area in I_{t+1}. Over the first few frames that the long term tracker follows an object, the model generally gets substantially better. When no good match is found in I_{t+1}, then M_t is simply used as the new model, M_{t+1}.
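To make the matching step concrete, here is a minimal sketch that assumes the generalized Hausdorff measure is the rank-based (partial) directed Hausdorff distance: the distance from each model edge point to its nearest image edge point is computed, and a chosen quantile of those distances is taken instead of the maximum. The brute-force search over integer translations, the point-list representation, and the names `partial_hausdorff` and `best_translation` are assumptions for illustration, not the system's implementation.

```python
import math


def partial_hausdorff(model, image, fraction=0.75):
    """Quantile of the distances from model points to nearest image points."""
    if not model or not image:
        return math.inf
    dists = sorted(
        min(math.hypot(mx - ix, my - iy) for ix, iy in image)
        for mx, my in model
    )
    # Rank k: `fraction` of the model points must lie within the result.
    k = max(1, math.ceil(fraction * len(dists)))
    return dists[k - 1]


def best_translation(model, image, search=2, fraction=0.75):
    """Brute-force search for the translation with the smallest distance."""
    best = None
    for dx in range(-search, search + 1):
        for dy in range(-search, search + 1):
            shifted = [(x + dx, y + dy) for x, y in model]
            d = partial_hausdorff(shifted, image, fraction)
            if best is None or d < best[0]:
                best = (d, (dx, dy))
    return best


# Three model points lie on the object; (9, 9) is a spurious outlier.
model_points = [(0, 0), (1, 0), (2, 0), (9, 9)]
image_points = [(1, 1), (2, 1), (3, 1)]
distance, shift = best_translation(model_points, image_points)
```

Using a quantile rather than the maximum is what lets a crude initial model, with some edge points that do not belong to the object, still match well.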

The multiple long term tracking modules are all just instantiations of the same server. Each instance of this server is concerned with tracking at most one target. Each server is considered to be "idle" or "active" at a given time frame. Idle trackers are assigned new targets to track when they appear, while active ones continually attempt to follow a given target. Our implementation of the long term tracker is quite computationally inexpensive. Several active trackers (5-10) can be run on a single machine without overloading it. The long term trackers interface with the coordinator in a high-level manner. When the coordinator wants an idle tracker to track an object, it passes the bounding box of the object to the long term tracker, which generates an initial model and begins tracking. On each frame the long term tracker returns results to the coordinator. If active, the tracker returns the bounding box of the tracked object in the image, and a status flag indicating whether the model was actually found, or is just predicted to be at that position. Optionally, the coordinator can request a binary representation of the current model. If the tracker is idle, it just returns a status bit indicating that it is alive but idle.


Coordinator

The coordinator implements the overall system policy, aggregating information from the various modules, and displaying the results to the user. The coordinator is only concerned with high-level events such as loss of communication with a module, the results from a specific module on a given frame, the last frame which all the modules have finished processing, etc. To keep the coordinator concerned with only these simple high-level events, the running modules need to intercommunicate directly to pass results. We took a simple view of this issue: the coordinator tells the different running modules where they can find various types of information when the system starts up (e.g., where the sequence of images is, and where the process estimating the background motion is). Once the system is running, the coordinator's primary job is to loosely synchronize the modules by keeping them all within the frame span of the system, and to make high-level decisions about when to start and stop tracking long term targets. The specifics of the individual modules do not need to be known to make policies. Rather, the coordinator operates based on high-level information such as the status of the current targets, the instantaneous moving regions, and the regions to consider as new targets. The coordinator's job is to decide which new regions to track with which trackers, and which targets to stop tracking.
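One possible assignment policy is sketched below: candidate regions that overlap an already tracked bounding box are skipped, and the remaining candidates are handed to idle trackers in order. The source does not describe the actual policy, so the overlap test, the box format (x0, y0, x1, y1), and the function names are assumptions.

```python
def overlaps(a, b):
    """True if two boxes (x0, y0, x1, y1) share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def assign_targets(trackers, candidates):
    """Give new candidate boxes to idle trackers.

    trackers:   dict of tracker name -> current box, or None if idle
    candidates: list of boxes reported as possible new targets
    Returns a dict of tracker name -> newly assigned box.
    """
    tracked = [box for box in trackers.values() if box is not None]
    idle = [name for name, box in trackers.items() if box is None]
    assignments = {}
    for box in candidates:
        if not idle:
            break                      # no free tracker left
        if any(overlaps(box, other) for other in tracked):
            continue                   # already being followed
        assignments[idle.pop(0)] = box
        tracked.append(box)
    return assignments


trackers = {"tracker0": (0, 0, 10, 10), "tracker1": None, "tracker2": None}
candidates = [(5, 5, 15, 15), (20, 20, 30, 30), (22, 22, 28, 28), (40, 40, 50, 50)]
new_assignments = assign_targets(trackers, candidates)
```

Note that the policy works purely on bounding boxes and tracker status, which matches the coordinator's view of targets and frames rather than pixels and images.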




We show an example of the tracking system's results on a multitarget scene. Note that this page is image-laden, and could be slow on modem connections.



