BIB-VERSION:: CS-TR-v2.0
ID:: CORNELLCS//TR93-1400
ENTRY:: 1994-03-17
ORGANIZATION:: Cornell University, Computer Science Department
LANGUAGE:: English
TITLE:: Pictures and Trails: A New Framework for the Computation of Shape and 
        Motion from Perspective Image Sequences
AUTHOR:: Tomasi, Carlo
DATE:: November 1993
PAGES:: 21
ABSTRACT::
This report presents a new framework for the computation of shape and motion 
from a sequence of images taken under perspective projection. The framework 
is based on two abstractions, the picture and trail loci, that 
represent respectively the set of all pictures of the same scene and the set 
of all trails that a point in the world can leave on the image for a given 
camera trajectory. These abstractions lead to a remarkably clean relation 
between perspective and orthography. Furthermore, image motion is described in 
terms of angles between projection rays, thereby eliminating the need to 
model camera rotation and leading to more stable results. A numerically sound, 
global minimization method is developed, based on this framework, for the case 
of a two-dimensional world, but all concepts also hold in three dimensions. 
Experiments show that the method is rather immune to noise but critically 
dependent on camera calibration.
END:: CORNELLCS//TR93-1400
BODY::
Pictures and Trails: a New Framework for
the Computation of Shape and Motion
From Perspective Image Sequences
Carlo Tomasi*
TR 93-1400
November 1993
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
* This research was supported by the National Science Foundation under contract IRI-9201751.
Pictures and Trails: a New Framework for the Computation of
Shape and Motion from Perspective Image Sequences
Carlo Tomasi¹
December 1993
¹ This research was supported by the National Science Foundation under contract IRI-9201751.
Abstract
This report presents a new framework for the computation of shape and motion from a sequence
of images taken under perspective projection. The framework is based on two abstractions, the
picture and trail loci, that represent respectively the set of all pictures of the same scene and the set
of all trails that a point in the world can leave on the image for a given camera trajectory. These
abstractions lead to a remarkably clean relation between perspective and orthography. Furthermore,
image motion is described in terms of angles between projection rays, thereby eliminating the
need to model camera rotation and leading to more stable results. A numerically sound, global
minimization method is developed, based on this framework, for the case of a two-dimensional
world, but all concepts also hold in three dimensions. Experiments show that the method is rather
immune to noise but critically dependent on camera calibration.
Chapter 1
Introduction
An important goal of computer vision is to build reliable systems for the computation of structure
and motion from the images produced by a moving camera. If the world is stationary and if feature
points can be tracked from image to image, the computation of structure and motion becomes
a purely geometric problem. It is, however, a nonlinear and potentially poorly conditioned one.
Conditioning must be addressed by formulating the problem in terms of well-observable parameters
only, using generously redundant data, and paying close attention to the numerical aspects of the
computation. Nonlinearity, on the other hand, must be addressed by a global solution method,
that is, one that does not get caught in local minima.
This report presents a formulation of the problem of computing shape and motion from a
sequence of images of a rigid scene under perspective projection. This formulation addresses all
the issues mentioned above, as highlighted in the following.
No rotation in the model The proposed model of the imaging situation is independent of the
camera rotation around its optical center. This is achieved by describing image changes
through the angles between the projection rays of point features, similarly to what is done
in [TS93]. When the camera rotates, these angles do not change. In contrast, in the tradi-
tional framework, rotation and translation can be mistaken for each other, thereby leading
to poor observability of the translation-rotation pair and the known sensitivity problems of
the standard approaches.
Multiframe and multipoint The new formulation can handle any sufficiently large number of
feature points and camera positions. In fact, the first proposed step is to use the available
images to build the locus of all possible perspective images of the same scene. This locus,
called the picture locus, turns out to be a three-dimensional variety in a space with roughly
as many dimensions as there are features in the set. The images in a specific sequence are
then points on the picture locus.
Global minimization The new approach splits the computation into a linear stage in the space
of all the data and a nonlinear stage in a space with a fixed and small number of dimensions,
representing all possible affine deformations of the scene. In this small space, the global
minimum can be at least approximately identified by dense sampling.
Perspective vs orthography This two-stage partition of the computation was made possible
by a fundamental insight about the picture locus: the subspace tangent to the locus at
the origin is the set of all orthographic images of the same scene. This insight, in a sense,
reduces the problem of shape and motion under perspective to that of shape and motion
under orthography, a link that is interesting per se even besides the computational methods
that it suggests.
Appropriate numerical techniques The two-stage approach allows using the most appropriate
techniques for every part of the computation: linear data fitting in the large space of the input
data, and efficient variable projection methods in the small space of affine deformations.
Incidentally, the first stage yields shape and motion up to two separate affine transformations.
In many applications [WT93] this is sufficient, and the second, more expensive stage that enforces
Euclidean metric can be omitted.
In this report, a flat, two-dimensional world is considered, and this for two reasons. First,
although all the concepts hold also in three dimensions, the extension is technically less than
straightforward, and has not been addressed in detail yet. Second, all the concepts introduced are more
easily visualized in the two-dimensional case, where the picture locus becomes a picture surface.
The next two sections present the main abstractions of the framework: the picture surface,
and the trail surface, a dual concept that will be introduced later on. Section 4 then outlines the
reconstruction method. Experiments are discussed in section 5. First, a series of simulations shows
that the method works well even with substantial uncertainty in the image measurements. Then, an
experiment with a real image sequence gives mixed results, supporting the conjecture that camera
calibration is critical for good results.
Chapter 2
The Picture Surface
The plane at the bottom of figure 2.1 represents the two-dimensional world where both camera
and scene are supposed to live. The camera looks at a set of point features and only records the
tangents of the angles formed by pairs of projection rays. In this two-dimensional case, all the pairs
of features have one reference feature in common, so with P + 1 feature points there are P tangents
per frame (in the figure, P = 3 for visualization purposes). Thus, one feature serves as a landmark
and the image positions of the other features are specified by the angles between their projection
rays and that of the landmark feature. The tangent t of each angle is given by (see Appendix A)
$$ t = \frac{uz - wx}{1 - ux - wz} \tag{2.1} $$
where (x, z) is the position of the feature in the world and
$$ K = (u, w) = \frac{C}{|C|^2} \tag{2.2} $$
is the vector obtained by reflecting the camera coordinates C across the unit circle.
With P + 1 world feature points, an image from reflected camera position K = (u, w) yields a
set of P measurements $t_1, \ldots, t_P$:
$$ t_p = \frac{u z_p - w x_p}{1 - u x_p - w z_p} \tag{2.3} $$
that can be collected into one vector $t = (t_1, \ldots, t_P)$. This vector can be viewed as a point in a
P-dimensional space. As the camera moves, the point t moves within this space. The locus of
all possible points t for a fixed set of world features is a surface, traced by the parameters u, w
and whose P components are given in parametric form by equation (2.3). This surface is called
the picture surface. Notice that the picture surface does not depend on camera position, since it
represents the images of the given features from all possible camera positions.
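As an illustration of equations (2.2) and (2.3), the following minimal Python sketch maps a camera position to a point on the picture surface. The function names `reflect` and `picture_vector`, as well as the scene and camera values, are hypothetical and chosen only for this example.

```python
def reflect(c):
    """Reflect a camera position C = (cx, cz) across the unit circle,
    K = C / |C|^2, as in equation (2.2)."""
    cx, cz = c
    n2 = cx * cx + cz * cz
    return (cx / n2, cz / n2)


def picture_vector(k, features):
    """Tangents t_p of equation (2.3) for a reflected camera K = (u, w).

    `features` lists the P non-reference points (x_p, z_p); the reference
    feature is assumed to sit at the origin and is not included.
    """
    u, w = k
    return [(u * z - w * x) / (1.0 - u * x - w * z) for x, z in features]


# Hypothetical scene (three features plus the reference) and camera position.
features = [(0.4, 0.8), (0.7, 0.1), (0.2, 0.5)]
camera = (-2.0, -2.0)
t = picture_vector(reflect(camera), features)
```

Moving `camera` while keeping `features` fixed moves `t` along the picture surface of that scene.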
As an example, figure 2.2 shows a region of the picture surface for the four features S0 =
(0, 0), S1 = (0.4, 0.8), S2 = (0.7, 0.1), S3 = (0.2, 0.5) of figure 2.3 when the camera moves in
the region defined by the rectangle with vertices K0 = (-1, -1) and K1 = (-1, -0.5) in the
K plane, corresponding to camera positions C on the grid in figure 2.3. This grid is in one-to-
one correspondence with the grid on the picture surface of figure 2.2. Surfaces for more features
cannot be visualized (except by projecting them to subspaces), but are still two-dimensional objects,
because they are traced by two parameters.
The picture surface is univocally related to the positions of the feature points in space: different
scenes yield different surfaces, and different points on the same surface represent different pictures
of the same scene.
[Figure 2.1 labels: picture vector; tangents of the projection ray angles; world features; camera; world plane]
Figure 2.1: The components of a point on the picture surface (a picture vector) are the tangents
of the projection ray angles.
Figure 2.2: The picture surface for the four features in figure 2.3. The patch displayed here
corresponds to the camera positions shown in figure 2.3.
Figure 2.3: When the inverse camera coordinates K defined in equation (2.2) vary in the rectangle
with vertices K0 = (-1, -1) and K1 = (-1, -0.5), the camera positions C move on the grid in this
figure. The four crosses represent four features in the world, with the point at the origin being the
reference feature.
Section 4 shows that the picture surface can be determined by a linear data fitting procedure
from the available image measurements. Unfortunately, the relation between the surface param-
eters, resulting from fitting, and the coordinates of the world features that correspond to this
surface is complicated. The brute-force approach to establishing this relation leads to a nonlinear
constrained minimization problem of difficult solution. To avoid this problem, we now introduce
an important result about the picture surface (see Appendix B for a proof).
Theorem (Orthographic Picture Plane) The plane tangent to the picture surface
at the origin represents all the images of the same world features under orthography, up
to a scale factor.
This theorem is important because any two distinct orthographic images of a given set of features
are the x and z coordinates of the features in the world except only for an affine transformation. In
other words, we just need to pick any two points (not colinear with the origin) on the orthographic
plane to obtain structure up to an affine transformation.
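The theorem can be illustrated numerically. The sketch below, under the assumption of a hypothetical three-feature scene and an arbitrary viewing direction, compares the perspective picture vector with its orthographic counterpart (the numerator of the projection equation) as the reflected camera position K shrinks toward the origin, that is, as the camera recedes to infinity. The names `orthographic_vector` and `relative_gap` are illustrative only.

```python
import math


def picture_vector(k, features):
    """Perspective tangents for reflected camera K = (u, w)."""
    u, w = k
    return [(u * z - w * x) / (1.0 - u * x - w * z) for x, z in features]


def orthographic_vector(k, features):
    """Numerator of the projection equation: the tangent-plane (orthographic) image."""
    u, w = k
    return [u * z - w * x for x, z in features]


# Hypothetical scene and viewing direction, for illustration only.
features = [(0.4, 0.8), (0.7, 0.1), (0.2, 0.5)]
direction = (math.cos(0.3), math.sin(0.3))


def relative_gap(scale):
    """Largest perspective/orthographic difference, relative to |K| = scale."""
    k = (scale * direction[0], scale * direction[1])
    persp = picture_vector(k, features)
    ortho = orthographic_vector(k, features)
    return max(abs(p - o) for p, o in zip(persp, ortho)) / scale


gaps = [relative_gap(s) for s in (1e-1, 1e-2, 1e-3)]
```

The relative gap shrinks roughly in proportion to |K|, which is the behavior expected of a surface and its tangent plane at the origin.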
Thus, we start to see the outline of the shape reconstruction method:
1. find the picture surface by linear data fitting
2. determine the orthographic plane to obtain shape up to an affine transformation
3. replace the results into the original projection equations (2.1) to compute actual shape.
The third step, however, can only be performed once partial motion information has been computed.
To this end, we introduce the concept of a trail surface, and an important duality result that links
shape and motion.
Chapter 3
The Trail Surface and Duality
The picture surface is the set of images obtained by fixing the scene and moving the camera
around. Conversely, one can determine a trail locus by instead fixing a number of camera positions
and collecting the images of a single feature in the world. For each feature in the world, the image
measurements from the given camera positions represent the trail that that feature left in the
images as the camera covered those positions over time. When the world feature is changed, the
trail vector moves on the trail locus.
If the projection equation (2.1) is examined, an important relation of duality can be established
between the picture and trail surfaces. In fact, equation (2.1) is symmetric in structure and motion:
the equation does not change with the replacements
$$ u \leftrightarrow x \tag{3.1} $$
$$ w \leftrightarrow -z \tag{3.2} $$
Because of this symmetry in motion and structure, the surface of figure 2.2 can also be interpreted
as a trail surface. Mathematically, this corresponds to fixing the camera positions in equation (2.1)
rather than the world features, as done in equation (2.3). The image measurements of a point at
S = (x, z) as seen from F camera positions $K_1 = (u_1, w_1), \ldots, K_F = (u_F, w_F)$ are given by
$$ t_f = \frac{u_f z - w_f x}{1 - u_f x - w_f z} \tag{3.3} $$
with respect to the reference feature. These coordinates can be collected into a vector
$t = (t_1, \ldots, t_F)$, a point in an F-dimensional space. The locus of all possible measurement sets from
those F cameras as the point S = (x, z) varies in the world is the trail surface. To obtain the
physical situation corresponding to this reinterpretation of the surface in figure 2.2, apply the re-
placements (3.1) and (3.2). This yields figure 3.1, where now circles represent camera positions and
the grid points are the varying position of a feature in the world.
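The symmetry between structure and motion can be checked directly. In the sketch below, `dual` is a hypothetical helper that swaps u with x and w with -z, and the numerical values are arbitrary.

```python
def tangent(u, w, x, z):
    """Projection equation: t = (u z - w x) / (1 - u x - w z)."""
    return (u * z - w * x) / (1.0 - u * x - w * z)


def dual(u, w, x, z):
    """Exchange motion and structure: u <-> x and w <-> -z."""
    return (x, -z, u, -w)


# Hypothetical reflected camera position (u, w) and world point (x, z).
args = (-0.5, -0.2, 0.4, 0.8)
t_direct = tangent(*args)
t_dual = tangent(*dual(*args))
```

Both calls return the same tangent, and applying `dual` twice restores the original arguments.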
The orthographic-plane theorem holds for the trail surface as well. Because of its importance,
we state it here with its proper, dual terminology.
Theorem (Orthographic Trail Plane) The plane tangent to the trail surface at the
origin represents all the trails from the same camera positions under orthography, up to
a scale factor.
Appendix B proves this result as well.
Figure 3.1: The surface of figure 2.2 can also be interpreted as the trail surface of the situation in
this figure. The reference feature is still at the origin (cross).
Chapter 4
The Reconstruction Method
In this section, we reconstruct shape and motion from images in four steps:
1. determine the parameters of the picture surface by linear fitting;
2. pick two points on the orthographic plane of the picture surface to determine shape up to an
affine transformation;
3. determine motion up to an affine transformation with the same technique;
4. replace these results into equation (2.1) to determine the affine transformations that map
affine motion and shape into their Euclidean counterparts.
Affine shape and motion are computed with only linear operations, while the nonlinear part is all
included in the last step.
4.1 The Parameters of the Picture Surface
We saw in section 2 that the image of a fixed set of points in the world is a point on a surface,
the picture surface. Conversely, the image measurements of any one point in space as seen from a
fixed set of cameras are a point on the trail surface. The picture and trail surfaces are embedded
in highly dimensional spaces: with P + 1 world points and F cameras, the picture surface lives in
a P-dimensional space and the trail surface lives in an F-dimensional space.
Because we know the analytic form of these surfaces (equations (2.3) and (3.3)), determining
their parameters from a set of image measurements is a data fitting problem. The problem becomes
linear if we eliminate motion from equations (2.3) and shape from equation (3.3). We now show
how to do this for the picture surface of equation (2.3), where motion is represented by the camera
positions u, w.
Let us rewrite equation (2.3) for three distinct points numbered p, q, r:
$$ t_p = \frac{u z_p - w x_p}{1 - u x_p - w z_p} \tag{4.1} $$
$$ t_q = \frac{u z_q - w x_q}{1 - u x_q - w z_q} \tag{4.2} $$
$$ t_r = \frac{u z_r - w x_r}{1 - u x_r - w z_r} \tag{4.3} $$
We can solve the first two equations for the parameters u, w, and replace the result into the
third equation. This yields an equation of the third degree in tp, tq, tr. The coefficients of this
equation depend only on shape, since the motion parameters u, w have been eliminated. These
coefficients are easy to determine because they appear linearly in the equation.
To solve equations (4.1) and (4.2) above for u, w we multiply these equations by the denomi-
nators of their right-hand sides; this yields two linear equations in u, w, which can be written in
matrix form as follows:
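$$ \begin{bmatrix} t_p x_p + z_p & t_p z_p - x_p \\ t_q x_q + z_q & t_q z_q - x_q \end{bmatrix} \begin{bmatrix} u \\ w \end{bmatrix} = \begin{bmatrix} t_p \\ t_q \end{bmatrix} $$
This system can be solved by Cramer's rule and replaced into equation (4.3), rewritten as follows:
$$ (t_r x_r + z_r) u + (t_r z_r - x_r) w = t_r $$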
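The Cramer's rule step for equations (4.1) and (4.2) can be sketched as follows. The function name `solve_motion` and the test values are hypothetical; the sketch assumes the positions of points p and q are known, which is the setting of the elimination argument rather than of the final algorithm.

```python
def tangent(k, s):
    """Projection equation for reflected camera K = (u, w) and point S = (x, z)."""
    (u, w), (x, z) = k, s
    return (u * z - w * x) / (1.0 - u * x - w * z)


def solve_motion(tp, tq, sp, sq):
    """Recover (u, w) from two tangents by Cramer's rule.

    Each point contributes one row (t x + z) u + (t z - x) w = t.
    """
    (xp, zp), (xq, zq) = sp, sq
    m11, m12 = tp * xp + zp, tp * zp - xp
    m21, m22 = tq * xq + zq, tq * zq - xq
    det = m11 * m22 - m12 * m21
    if det == 0.0:
        raise ValueError("degenerate configuration: the 2x2 system is singular")
    u = (tp * m22 - m12 * tq) / det
    w = (m11 * tq - m21 * tp) / det
    return u, w


# Hypothetical points and reflected camera position.
sp, sq = (0.4, 0.8), (0.7, 0.1)
k_true = (-0.5, -0.2)
u, w = solve_motion(tangent(k_true, sp), tangent(k_true, sq), sp, sq)
```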
This substitution yields the desired multilinear equation in $t_p, t_q, t_r$:
$$ a_1 t_p t_q t_r + a_2 t_p t_q + a_3 t_p t_r + a_4 t_q t_r + a_5 t_p + a_6 t_q + a_7 t_r = 0 \tag{4.4} $$
where
$$ \begin{aligned}
a_1 &= -x_p(z_q - z_r) + x_q(z_p - z_r) - x_r(z_p - z_q) \\
a_2 &= -x_r(x_p - x_q) - z_r(z_p - z_q) \\
a_3 &= x_q(x_p - x_r) + z_q(z_p - z_r) \\
a_4 &= -x_p(x_q - x_r) - z_p(z_q - z_r) \\
a_5 &= -x_q z_r + z_q x_r \\
a_6 &= x_p z_r - z_p x_r \\
a_7 &= -x_p z_q + z_p x_q,
\end{aligned} \tag{4.5} $$
and the subscripts p, q, r were dropped for simplicity from the coefficients $a_i$.
Because there are only six parameters in the right-hand sides of equations (4.5) but there are
seven coefficients, the coefficients must satisfy some constraint. It is easy to verify that the following
two linear equations hold:
$$ a_1 - a_5 - a_6 - a_7 = 0 \tag{4.6} $$
$$ a_2 + a_3 + a_4 = 0 \tag{4.7} $$
so that equation (4.4) can be rewritten as follows:
$$ a_2 t_p (t_q - t_r) + a_4 t_r (t_q - t_p) + a_5 t_p (1 + t_q t_r) + a_6 t_q (1 + t_p t_r) + a_7 t_r (1 + t_p t_q) = 0. \tag{4.8} $$
Determining the coefficients $a_i$ from a set of measurements over several frames is a linear,
overconstrained minimization problem (Appendix C).
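The relations of this section can be verified numerically. The following sketch computes the seven coefficients of equation (4.5) from a hypothetical scene, then checks the two linear constraints and both forms of the surface equation for one simulated image. All function names and numerical values are assumptions made for illustration.

```python
def tangent(k, s):
    """Projection equation for reflected camera K = (u, w) and point S = (x, z)."""
    (u, w), (x, z) = k, s
    return (u * z - w * x) / (1.0 - u * x - w * z)


def surface_coefficients(sp, sq, sr):
    """Coefficients a1..a7 of equation (4.5) for points p, q, r."""
    (xp, zp), (xq, zq), (xr, zr) = sp, sq, sr
    a1 = -xp * (zq - zr) + xq * (zp - zr) - xr * (zp - zq)
    a2 = -xr * (xp - xq) - zr * (zp - zq)
    a3 = xq * (xp - xr) + zq * (zp - zr)
    a4 = -xp * (xq - xr) - zp * (zq - zr)
    a5 = -xq * zr + zq * xr
    a6 = xp * zr - zp * xr
    a7 = -xp * zq + zp * xq
    return (a1, a2, a3, a4, a5, a6, a7)


def full_residual(a, tp, tq, tr):
    """Left-hand side of the seven-coefficient equation (4.4)."""
    a1, a2, a3, a4, a5, a6, a7 = a
    return (a1 * tp * tq * tr + a2 * tp * tq + a3 * tp * tr + a4 * tq * tr
            + a5 * tp + a6 * tq + a7 * tr)


def reduced_residual(a, tp, tq, tr):
    """Left-hand side of the five-coefficient equation (4.8)."""
    _, a2, _, a4, a5, a6, a7 = a
    return (a2 * tp * (tq - tr) + a4 * tr * (tq - tp)
            + a5 * tp * (1 + tq * tr) + a6 * tq * (1 + tp * tr)
            + a7 * tr * (1 + tp * tq))


# Hypothetical scene and reflected camera position.
sp, sq, sr = (1.0, 0.2), (0.3, 1.1), (0.9, 0.8)
a = surface_coefficients(sp, sq, sr)
k = (-0.5, -0.2)
tp, tq, tr = (tangent(k, s) for s in (sp, sq, sr))
```

For any camera position, both residuals should vanish up to rounding, which is what makes the fit for the coefficients linear.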
4.2 Affine Shape
Although easy to determine, the coefficients of the picture surface are complicated functions of the
point coordinates $x_p, z_p, x_q, z_q, x_r, z_r$. Determining these coordinates from the coefficients directly
from equations (4.5) is a hard nonlinear constrained minimization problem. However, it is trivial
to find the orthographic plane of the picture surface. In fact, the constant term in equation (4.4) is
equal to zero, so the picture surface passes through the origin, as expected, and the orthographic
plane for the picture surface is given simply by the three linear terms in $t_p, t_q, t_r$, that is, by
$a_5, a_6, a_7$. In other words, the equation of the orthographic plane is
$$ a_5 t_p + a_6 t_q + a_7 t_r = 0. \tag{4.9} $$
As discussed in [TK92], any two points on this plane, not colinear with the origin, represent
shape up to an affine transformation. More specifically, let $t^{(1)} = (t_p^{(1)}, t_q^{(1)}, t_r^{(1)})^T$ and
$t^{(2)} = (t_p^{(2)}, t_q^{(2)}, t_r^{(2)})^T$ be two points satisfying equation (4.9). For instance, let¹
$$ (t^{(1)})^T = (1, 0, -a_5/a_7) $$
$$ (t^{(2)})^T = (0, 1, -a_6/a_7) $$
Then the four columns of the 2 x 4 matrix
$$ \begin{bmatrix} 0 & (t^{(1)})^T \\ 0 & (t^{(2)})^T \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & \hat{x}_r \\ 0 & 0 & 1 & \hat{z}_r \end{bmatrix} $$
represent the coordinates of the origin and the three points numbered p, q, r up to an affine trans-
formation. Because the first three columns are the affine system of reference (origin and two unit
points), the only new information in the matrix above is given by the coordinates of the fourth
point, that is, by
$$ \hat{x}_r = -a_5/a_7, \qquad \hat{z}_r = -a_6/a_7. $$
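The meaning of these two ratios can be made concrete with a small sketch. Here the coefficients are computed from a hypothetical known scene (in the actual method they would come from the linear fit), and the result is checked against the defining property of affine coordinates: point r equals the combination of points p and q with these weights. The function name `affine_coordinates` is illustrative.

```python
def affine_coordinates(sp, sq, sr):
    """Affine coordinates of point r in the frame (origin, p, q),
    computed as (-a5/a7, -a6/a7) from the linear-term coefficients."""
    (xp, zp), (xq, zq), (xr, zr) = sp, sq, sr
    a5 = -xq * zr + zq * xr
    a6 = xp * zr - zp * xr
    a7 = -xp * zq + zp * xq
    if a7 == 0.0:
        raise ValueError("a7 = 0: p, q and the origin are colinear")
    return (-a5 / a7, -a6 / a7)


# Hypothetical scene.
sp, sq, sr = (0.4, 0.8), (0.7, 0.1), (0.2, 0.5)
x_hat, z_hat = affine_coordinates(sp, sq, sr)

# Rebuild point r as x_hat * S_p + z_hat * S_q.
rebuilt = (x_hat * sp[0] + z_hat * sq[0], x_hat * sp[1] + z_hat * sq[1])
```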
With more than four points, we repeat the procedure just described once for every value of r
different from p and q, for a total of P - 2 independent problems. This yields a 2 x P matrix $\hat{S}$ of
all the affine coordinates in the same reference system, because the origin and the two landmark
points p and q are always mapped to (0,0), (1,0), (0,1). For instance, with p = 1 and q = 2, we
have
$$ \hat{S} = \begin{bmatrix} 1 & 0 & \hat{x}_3 & \cdots & \hat{x}_P \\ 0 & 1 & \hat{z}_3 & \cdots & \hat{z}_P \end{bmatrix} $$
Because affine coordinates differ from Euclidean coordinates only by an affine transformation, there
must be a 2 x 2 matrix A such that
$$ S = A \hat{S}. $$
We will determine A in section 4.4 below. Before that, however, we need to recover the camera
motion up to an affine transformation.
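A minimal sketch of the assembly of the affine shape matrix follows, with p = 1 and q = 2 as landmarks. It uses a hypothetical known scene to compute the coefficient ratios, and then checks that the matrix whose columns are the Euclidean positions of the two landmarks maps affine shape back to the scene. This particular form of A follows from the landmarks having affine coordinates (1,0) and (0,1); the names `affine_shape` and `recovered` are illustrative.

```python
def affine_coordinates(sp, sq, sr):
    """Affine coordinates (-a5/a7, -a6/a7) of point r in the frame (origin, p, q)."""
    (xp, zp), (xq, zq), (xr, zr) = sp, sq, sr
    a5 = -xq * zr + zq * xr
    a6 = xp * zr - zp * xr
    a7 = -xp * zq + zp * xq
    return (-a5 / a7, -a6 / a7)


def affine_shape(points):
    """Rows of the 2 x P affine shape matrix for points S_1..S_P
    (the reference point at the origin is excluded)."""
    sp, sq = points[0], points[1]
    x_row, z_row = [1.0, 0.0], [0.0, 1.0]
    for sr in points[2:]:
        x_hat, z_hat = affine_coordinates(sp, sq, sr)
        x_row.append(x_hat)
        z_row.append(z_hat)
    return [x_row, z_row]


# Hypothetical scene with S_1 = (1, 0).
points = [(1.0, 0.0), (0.3, 1.1), (0.9, 0.8), (0.2, 0.5)]
s_hat = affine_shape(points)

# A has the Euclidean positions of the two landmarks as its columns.
A = [[points[0][0], points[1][0]],
     [points[0][1], points[1][1]]]
recovered = [(A[0][0] * xh + A[0][1] * zh, A[1][0] * xh + A[1][1] * zh)
             for xh, zh in zip(*s_hat)]
```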
4.3 Affine Motion
Because of the symmetry of equation (2.1) discussed in section 3 and expressed by equations (3.1)
and (3.2), motion can be determined up to an affine transformation by the same procedure used
¹ It is easy to change this choice if $a_7 = 0$.
for shape in sections 4.1 and 4.2. Specifically, if the procedure for computing shape is summarized
by the function
$$ \hat{S} = \varphi(T) $$
where T is the matrix of image measurements, then the F x 2 matrix $\hat{K}$ that has the reflected affine
coordinates (see equation (2.2)) as its rows is simply given by
$$ \hat{K}^T = \varphi(T^T). $$
Duality saved us half of the work. Also, analogously to what happened for shape, if K is an
F x 2 matrix that collects the Euclidean coordinates of all these camera positions, reflected around
the unit circle, there must be a 2 x 2 matrix B such that
$$ K = \hat{K} B^T. $$
4.4 Euclidean Shape and Motion
To summarize, we now have affine shape, $\hat{S}$, and affine motion, $\hat{K}$. These two matrices of coordinates
are expressed in two different coordinate systems, so we need to find two 2 x 2 matrices A and B
that yield the Euclidean coordinates S and K according to the transformations
$$ S = A \hat{S} \tag{4.10} $$
$$ K = \hat{K} B^T. \tag{4.11} $$
Notice that the origin of the coordinate system is fixed at the reference point $(x_0, z_0) = (0, 0)$.
Because the image measurements do not constrain scale and an overall rotation of the reference
system, we can impose the constraint that
$$ (x_1, z_1) = (1, 0). $$
Since $(\hat{x}_1, \hat{z}_1) = (1, 0)$, this constraint yields two of the entries of A:
$$ a_{11} = 1 \quad \text{and} \quad a_{21} = 0. $$
To find B and the remaining entries of A, we replace equations (4.10) and (4.11) into the original
measurement equation (2.1). Ignoring point and camera subscripts, equation (2.1) becomes
$$ t = \frac{(b_{11}\hat{u} + b_{12}\hat{w})\, a_{22}\hat{z} - (b_{21}\hat{u} + b_{22}\hat{w})(\hat{x} + a_{12}\hat{z})}{1 - (b_{11}\hat{u} + b_{12}\hat{w})(\hat{x} + a_{12}\hat{z}) - (b_{21}\hat{u} + b_{22}\hat{w})\, a_{22}\hat{z}} $$
which is separately linear in the two vectors $\alpha = (a_{12}, a_{22})$ and $\beta = (b_{11}, b_{12}, b_{21}, b_{22})$. In [TS93],
we show a method for solving this type of equation, although applied to a different problem.
In conclusion, in the proposed method, a linear stage for affine structure and motion is followed
by a nonlinear stage to determine the Euclidean metric. Because of this, the proposed method can
be seen on one hand as a successor of techniques based on essential matrices pioneered by Longuet-
Higgins [LH81], independently reinvented by Tsai and Huang [TH84] and surveyed in [May93];
and on the other hand it is a successor of the factorization method described in [TK92]. However,
essential matrices work on two frames at a time, thereby either introducing a hard correspondence
problem when the two frames are distant or leading to a poorly conditioned reconstruction when
they are close. The multiframe factorization method, on the other hand, works only under ortho-
graphic projection, which limits its applicability to distant scenes and narrow fields of view. The
current method, in contrast, is multiframe, multifeature, and works for perspective images. In
addition, in contrast to multiframe and multifeature local methods such as [SA89], our method is
global, in that it does not require an initial estimate of either structure or motion.
Chapter 5
Experiments
Figure 5.1 shows the result of a simulation with noisy images. Both true and computed structure and
motion are shown. Noise on the image feature coordinates is Gaussian with a standard deviation of
0.5 pixels for a 512 x 512 image. In the simulation, both features and camera positions are scattered
randomly, each in one quadrant of the plane. The two points at the origin and along the positive
horizontal axis (at (1,0)) are the reference points, and their computed values are therefore exact.
The two plots in figure 5.2 show the structure and motion errors for increasing levels of noise.
Ten features and camera positions are used in all experiments, and each experiment is repeated ten
times with different random samples to produce ensemble averages. Structure errors are measured
as the ratio between the average error per feature and the size of the bounding box of the true
feature positions. A similar measure is used for the camera position errors.
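The error measure described above might be computed as in the following sketch. The text does not specify how the size of the bounding box is defined; the length of its diagonal is used here as an assumption, and the function name `relative_error` is hypothetical.

```python
import math


def relative_error(true_points, estimated_points):
    """Average per-point position error divided by the size of the bounding
    box of the true points (taken here, as an assumption, to be its diagonal)."""
    errors = [math.hypot(xe - xt, ze - zt)
              for (xt, zt), (xe, ze) in zip(true_points, estimated_points)]
    xs = [x for x, _ in true_points]
    zs = [z for _, z in true_points]
    diagonal = math.hypot(max(xs) - min(xs), max(zs) - min(zs))
    return (sum(errors) / len(errors)) / diagonal


# Hypothetical true and estimated positions.
err = relative_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (3.0, 4.5)])
```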
Even with relatively few points and viewing positions, performance is good for subpixel noise
levels. When the standard deviation of noise increases beyond one pixel, performance degrades
sharply but continuously. We point out that in feature tracking the position of features can usually
be determined with an accuracy of 0.1 or so pixels [TK91] for typical 512 by 512 images. From the
plots of figure 5.2 we see that the corresponding structure and motion errors are a fraction of one
percent.
With real images, the results are less satisfactory. The central part of figure 5.3 shows an
epipolar slice (like the ones in [BBM87]) from a sequence of images taken with a Panasonic camera
mounted on a micrometric translation and rotation stage. Figure 5.4 shows the setup from above,
without the camera, which used to be on the platform visible at the bottom.
Features were obtained by detecting sharp intensity transitions in the first row of the epipolar
slice and were tracked by continuity from one row to the next. Features were found on the leftmost
block in figure 5.4, on the block closest to the camera at the center, and on the Crayola box on the
right. Figure 5.5 shows the actual positions of the features (crosses) and of the camera (circles) as
measured in figure 5.4.
No camera calibration was performed, and the nominal focal length of 16 mm was used, con-
verted to pixels based on the manufacturer's specification of the size of the sensor's active area. The
lens was a c-mount lens for surveillance applications, with consequently poor optical properties.
Figure 5.6 shows the reconstruction for the ten camera positions (circles) in the sequence and the
eighteen features tracked in the epipolar slice. The camera motion is fairly accurately recovered, the
overall distance between the camera and the scene is essentially correct, and each of the three feature
groups is approximately of the right shape and size. However, the position of the three groups of
features is considerably distorted with respect to the ideal positions of figure 5.5. The contrast
between these results, with features tracked with about 0.1 pixels accuracy, and the simulations
described in figures 5.1 and 5.2, run under greater positional uncertainty values, seems to support
the conjecture that the camera calibration is crucial. We are working on camera calibration in
order to verify this assertion.
Figure 5.1: True (circles) and computed (crosses) structure and motion with simulated data. Cam-
era positions are in the lower-left quadrant, feature points in the upper-right one.
Figure 5.2: Errors in the computed structure (top) and motion (bottom) for increasing levels of
image feature noise, measured in pixels for a 512 by 512 image. See text for the units of the vertical
axes.
Figure 5.3: An epipolar slice (center) sandwiched between the top of the first and bottom of the
last frame of a 50-frame image sequence.
Figure 5.4: The imaging setup viewed from above (the camera has been removed from the motion
stage at the bottom).
Figure 5.5: The actual positions of the camera (circles) and world features (crosses) as measured
in figure 5.4.
Figure 5.6: The positions of the camera (circles) and world features (crosses) as computed by the
method described in this report.
Chapter 6
Conclusion
This report presented a radically new conceptual framework, as well as a computational procedure,
for the recovery of shape and motion from a sequence of images taken under perspective. While
more and better experiments are obviously necessary, a good case can be made for this new way of
thinking about an old and important problem.
In fact, the picture and trail loci are useful abstractions per se, and the results about their
tangent subspaces (or planes in the two-dimensional case) are one of their primary advantages,
since they establish an unsuspectedly clean and clear relation between perspective and orthogra-
phy. Furthermore, the new, rotation-independent model of the imaging situation, which made this
relation apparent, removes the slack that was caused by the poor distinguishability of rotation and
translation in previous formulations. Finally, the reduction of the nonlinear part of the shape and
motion reconstruction to the small space of affine scene deformations gives a handle on the intrinsic
nonconvexity of this vision task.
Future work on both camera calibration and the extension of the computation to three dimen-
sions will hopefully imprint the seal of practical usefulness on this new framework.
Bibliography
[BBM87] R. C. Bolles, H. H. Baker, and D. H. Marimont. Epipolar-plane image analysis: An
approach to determining structure from motion. International Journal of Computer
Vision, 1(1):7-55, 1987.
[LH81] H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two
projections. Nature, 293:133-135, September 1981.
[May93] S. Maybank. Theory of Reconstruction from Image Motion. Springer-Verlag, Berlin
Heidelberg, 1993.
[SA89] M. E. Spetsakis and J. (Yiannis) Aloimonos. Optimal motion estimation. In Proceedings
of the IEEE Workshop on Visual Motion, pages 229-237, Irvine, California, March 1989.
[TH84] R. Y. Tsai and T. S. Huang. Uniqueness and estimation of three-dimensional motion
parameters of rigid objects with curved surfaces. IEEE Transactions on Pattern Analysis
and Machine Intelligence, PAMI-6(1):13-27, January 1984.
[TK91] C. Tomasi and T. Kanade. Shape and motion from image streams: a factorization method
- 3. detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie
Mellon University, Pittsburgh, PA, April 1991.
[TK92] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a
factorization method. International Journal of Computer Vision, 9(2):137-154, 1992.
[TS93] C. Tomasi and J. Shi. Direction of heading from image deformations. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR93), pages
422-427, New York, NY, June 1993.
[WT93] D. Weinshall and C. Tomasi. Linear and incremental acquisition of invariant shape
models from image sequences. In Proceedings of the Fourth International Conference on
Computer Vision (ICCV93), pages 675-682, Berlin, Germany, May 1993.
Appendix A
The Projection Equation
Figure A.1 shows the landmark point $S_0$, used as the origin of the coordinate reference system,
and a second point $S_1$ with coordinates $(x_1, z_1) = (0, 1)$. This second point establishes both the
orientation and the metric of the coordinate system. Two more points appear in the figure: a
camera center C, which stands for any of the F camera positions $C_1, \ldots, C_F$, and an object point
S, which stands for any of the P object points $S_1, \ldots, S_P$ other than $S_0$. The angle $\alpha$ between the
projection rays of $S_0$ and S can be determined from image measurements and is independent of
the camera rotation.
Let D = S - C be the vector difference between the point position S and the camera position C. Then,
the tangent t of the image measurement $\alpha$ is given by minus the ratio of the projections of D
along the direction orthogonal to C and along the direction of C. If $C = (c_x, c_z)$, the vector
counterclockwise orthogonal to C and with the same magnitude as C is $C^\perp = (-c_z, c_x)$, so that
$$ t = \tan\alpha = -\frac{(C^\perp)^T D}{C^T D} = \frac{(C^\perp)^T S}{|C|^2 - C^T S}. $$
If we let
$$ K = (u, w) = \frac{C}{|C|^2} $$
be the vector obtained by reflecting C across the unit circle, and $K^\perp = (-w, u)$ be its orthogonal
vector, we can also write
$$ t = \frac{(K^\perp)^T S}{1 - K^T S}. $$
In scalar form,
$$ t = \frac{uz - wx}{1 - ux - wz} $$
(equation (2.1) in the main text).
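The derivation can be cross-checked against a direct geometric computation. The sketch below assumes that the angle is signed, counterclockwise positive in the (x, z) plane, and measured from the ray toward S to the ray toward the landmark at the origin; this orientation is an assumption chosen so that the sign agrees with equation (2.1). Function names and numerical values are hypothetical.

```python
import math


def tangent_from_geometry(c, s):
    """Tangent of the signed angle from the ray C->S to the ray C->origin."""
    cx, cz = c
    dx, dz = s[0] - cx, s[1] - cz      # D = S - C, ray toward the feature
    rx, rz = -cx, -cz                  # ray toward the landmark at the origin
    cross = dx * rz - dz * rx          # proportional to the sine of the angle
    dot = dx * rx + dz * rz            # proportional to the cosine of the angle
    return math.tan(math.atan2(cross, dot))


def tangent_from_formula(c, s):
    """Scalar projection equation using K = C / |C|^2."""
    cx, cz = c
    n2 = cx * cx + cz * cz
    u, w = cx / n2, cz / n2
    x, z = s
    return (u * z - w * x) / (1.0 - u * x - w * z)


# Hypothetical camera and feature positions.
c, s = (-2.0, -3.0), (0.4, 0.8)
t_geometry = tangent_from_geometry(c, s)
t_formula = tangent_from_formula(c, s)
```

Rotating the camera about its optical center changes neither ray, so neither value depends on camera rotation.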
Figure A.1: The symbols used in the projection equation.
Appendix B
Proof of the Orthographic Plane
Theorems
We first prove the orthographic plane theorem for the picture surface. The dual theorem, for the
trail surface, could be obtained simply by duality (see section 3). However, because its meaning is
less intuitive than the result for the picture surface, a few remarks are added below.
The equation of the plane tangent to the picture surface at the origin is the numerator of
equation (2.3):
$$ t_p = u z_p - w x_p. $$
But this is also the limit of equation (2.3) when the norm of K = (u, w) tends to zero. From
equation (2.2), when K tends to zero, the norm of the camera position vector C tends to infinity.
Thus, the images taken from very distant cameras are very close to the origin of the picture surface.
We can now think of an infinitesimally small circle on the picture surface and around the origin:
when the radius of this circle shrinks to zero, the projection rays become parallel and orthographic
projection is approached, except that the image coordinates become smaller and smaller. Every
point on the tangent plane differs from some point on that circle only by a scale factor. This
completes the proof for the picture surface. The reasoning for the trail surface is similar: when the
norm of the world point position vector S = (x, z) tends to zero in equation (3.3), we again obtain
the orthographic projection equation
$$ t_f = u_f z - w_f x, $$
and the same reasoning applies. This is a little less intuitive than it is for the picture surface, since
we usually think of orthography as the case when the camera positions go to infinity. However, it
is equivalent to instead keep the camera positions fixed and shrink the world points towards the
origin: the two situations differ only by a scaling factor.
Appendix C
Determining the Picture Surface
Parameters
If we reintroduce the frame subscript f into equation (4.8) and suppose that F > 5 frames have
been collected, then 3F measurements $t_{fp}, t_{fq}, t_{fr}$ are available for points p, q, r. From these, the
following F x 5 matrix M can be formed whose entries $m_{fj}$ are defined by
$$ \begin{aligned}
m_{f1} &= t_{fp}(t_{fq} - t_{fr}) \\
m_{f2} &= t_{fr}(t_{fq} - t_{fp}) \\
m_{f3} &= t_{fp}(1 + t_{fq} t_{fr}) \\
m_{f4} &= t_{fq}(1 + t_{fp} t_{fr}) \\
m_{f5} &= t_{fr}(1 + t_{fp} t_{fq})
\end{aligned} $$
and the vector $a = (a_2, a_4, a_5, a_6, a_7)^T$ is then the best nonzero solution to the linear homogeneous
system
$$ M a = 0, \tag{C.1} $$
while $a_1$ and $a_3$ are determined by equations (4.6) and (4.7). The solution to system (C.1) can be
found by computing the singular value decomposition of M,
$$ M = U \Sigma V^T $$
and letting a be the fifth column of V, that is, the right singular vector corresponding to the smallest
singular value.
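The fitting step can be sketched with NumPy as follows. The function name `fit_surface`, the scene, and the camera grid are hypothetical, and the data are noise-free; the ground-truth coefficients are computed from the known scene only to check the fit, which is defined up to scale and sign.

```python
import numpy as np


def tangent(k, s):
    """Projection equation for reflected camera K = (u, w) and point S = (x, z)."""
    (u, w), (x, z) = k, s
    return (u * z - w * x) / (1.0 - u * x - w * z)


def fit_surface(tp, tq, tr):
    """Unit-norm estimate of (a2, a4, a5, a6, a7) from F tangents per point."""
    tp, tq, tr = (np.asarray(v, dtype=float) for v in (tp, tq, tr))
    M = np.column_stack([
        tp * (tq - tr),
        tr * (tq - tp),
        tp * (1 + tq * tr),
        tq * (1 + tp * tr),
        tr * (1 + tp * tq),
    ])
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(M)
    return vt[-1]


# Hypothetical scene and a grid of twelve reflected camera positions.
sp, sq, sr = (1.0, 0.2), (0.3, 1.1), (0.9, 0.8)
cameras = [(u, w) for u in (-1.0, -0.7, -0.4, -0.1) for w in (-0.9, -0.5, -0.2)]
a_fit = fit_surface([tangent(k, sp) for k in cameras],
                    [tangent(k, sq) for k in cameras],
                    [tangent(k, sr) for k in cameras])

# Ground truth from the known scene, for comparison only.
(xp, zp), (xq, zq), (xr, zr) = sp, sq, sr
a_true = np.array([
    -xr * (xp - xq) - zr * (zp - zq),   # a2
    -xp * (xq - xr) - zp * (zq - zr),   # a4
    -xq * zr + zq * xr,                 # a5
    xp * zr - zp * xr,                  # a6
    -xp * zq + zp * xq,                 # a7
])
a_true /= np.linalg.norm(a_true)
if a_fit @ a_true < 0:
    a_fit = -a_fit  # resolve the sign ambiguity of the singular vector
```

With noisy measurements the same code returns the least-squares solution of the homogeneous system under a unit-norm constraint.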
