
Camera Models and Affine Multiple Views Geometry

Subhashis Banerjee
Dept. Computer Science and Engineering
IIT Delhi
email: suban@cse.iitd.ac.in
May 29, 2001

1 Camera Models

A camera transforms a 3D scene point X = (X, Y, Z)^T into an image point x = (x, y)^T.

1.1 The Projective Camera


The most general mapping from P^3 to P^2 is

\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix}
T_{11} & T_{12} & T_{13} & T_{14} \\
T_{21} & T_{22} & T_{23} & T_{24} \\
T_{31} & T_{32} & T_{33} & T_{34}
\end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix}
\]

where (x_1, x_2, x_3)^T and (X_1, X_2, X_3, X_4)^T are homogeneous coordinates related to x and X by

\[
(x, y) = (x_1/x_3, \; x_2/x_3)
\]
\[
(X, Y, Z) = (X_1/X_4, \; X_2/X_4, \; X_3/X_4)
\]

The transformation matrix T = [T_ij] has 11 degrees of freedom, since only the ratios of the elements T_ij are important (see Zisserman and Mundy).
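A minimal numerical sketch of this mapping (in NumPy; the entries of T below are arbitrary illustrative values) also checks the claim that only the ratios of the elements T_ij matter:

```python
import numpy as np

def project(T, X):
    """Apply a 3x4 projective camera matrix T to a scene point X = (X, Y, Z):
    homogenise, multiply, then dehomogenise via (x, y) = (x1/x3, x2/x3)."""
    x1, x2, x3 = T @ np.append(np.asarray(X, dtype=float), 1.0)
    return np.array([x1 / x3, x2 / x3])

# Arbitrary illustrative camera matrix
T = np.array([[1.0, 0.2, 0.0, 0.5],
              [0.1, 1.0, 0.0, 0.2],
              [0.0, 0.1, 0.5, 1.0]])
X = (2.0, 1.0, 4.0)

# Scaling T leaves the image point unchanged: only ratios of T_ij matter
print(project(T, X), np.allclose(project(T, X), project(3.0 * T, X)))
```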

1.2 The Perspective Camera


A special case of the projective camera is the perspective (or central) projection, which reduces to the familiar pin-hole camera when the leftmost 3 × 3 sub-matrix of T is a rotation matrix with its third row scaled by the inverse focal length 1/f. The simplest form is

\[
T_p =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1/f & 0
\end{pmatrix}
\]

which gives the familiar equations

\[
\begin{pmatrix} x \\ y \end{pmatrix} = \frac{f}{Z} \begin{pmatrix} X \\ Y \end{pmatrix}
\]

Each point is scaled by its individual depth, and all projection rays converge to the
optic center.
1.3 The Affine Camera
The affine camera is a special case of the projective camera and is obtained by constraining the matrix T such that T_31 = T_32 = T_33 = 0, thereby reducing the degrees of freedom from 11 to 8:

\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}
=
\begin{pmatrix}
T_{11} & T_{12} & T_{13} & T_{14} \\
T_{21} & T_{22} & T_{23} & T_{24} \\
0 & 0 & 0 & T_{34}
\end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix}
\]

In terms of image and scene coordinates, the mapping takes the form

x = MX + t

where M is a general 2 × 3 matrix with elements M_ij = T_ij/T_34, while t is a general 2-vector representing the image center.
The affine camera preserves parallelism.

1.4 The Weak-Perspective Camera


The affine camera becomes a weak-perspective camera when the rows of M form a uniformly scaled rotation matrix. The simplest form is

\[
T_{wp} =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & Z_{ave}/f
\end{pmatrix}
\]

yielding

\[
M_{wp} = \frac{f}{Z_{ave}} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\qquad\text{and}\qquad
\begin{pmatrix} x \\ y \end{pmatrix} = \frac{f}{Z_{ave}} \begin{pmatrix} X \\ Y \end{pmatrix}
\]
This is simply the perspective equation with the individual point depths Z_i replaced by an average constant depth Z_ave.

The weak-perspective model is valid when the variation of the depth of the object (∆Z) along the line of sight is small compared to Z_ave and the field of view is small. We see this as follows. Expanding the perspective projection equation in a Taylor series, we obtain

\[
x = \frac{f}{Z_{ave} + \Delta Z} \begin{pmatrix} X \\ Y \end{pmatrix}
  = \frac{f}{Z_{ave}} \left( 1 - \frac{\Delta Z}{Z_{ave}} + \left( \frac{\Delta Z}{Z_{ave}} \right)^2 - \ldots \right) \begin{pmatrix} X \\ Y \end{pmatrix}
\]

When |∆Z| ≪ Z_ave, only the zero-order term remains, giving the weak-perspective projection. The error in image position is then x_err = x_p − x_wp:

\[
x_{err} = -\frac{f}{Z_{ave}} \frac{\Delta Z}{Z_{ave} + \Delta Z} \begin{pmatrix} X \\ Y \end{pmatrix}
\]

showing that a small focal length (f), a small field of view (X/Z_ave and Y/Z_ave) and a small depth variation (∆Z) contribute to the validity of the model.
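A quick numerical check of this error formula (a sketch in NumPy; the values of f, Z_ave, ∆Z and the scene point are arbitrary illustrative choices):

```python
import numpy as np

f, Z_ave, dZ = 1.0, 100.0, 2.0          # illustrative focal length and depths
X, Y = 5.0, 3.0                         # scene point at depth Z_ave + dZ

x_p  = (f / (Z_ave + dZ)) * np.array([X, Y])   # perspective projection
x_wp = (f / Z_ave) * np.array([X, Y])          # weak-perspective projection

# Error predicted by the formula: -(f/Z_ave) * dZ/(Z_ave + dZ) * (X, Y)
x_err = -(f / Z_ave) * (dZ / (Z_ave + dZ)) * np.array([X, Y])
print(np.allclose(x_p - x_wp, x_err))          # True
```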

1.5 The Orthographic Camera

The affine camera reduces to the case of orthographic (parallel) projection when M represents the first two rows of a rotation matrix. The simplest form is

\[
T_{orth} =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\]

yielding

\[
M_{orth} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\qquad\text{and}\qquad
\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} X \\ Y \end{pmatrix}
\]

2 Affine Multiple Views Geometry


• Epipolar Geometry
• Structure determination
– Affine structure
– Euclidean structure

2.1 Affine Epipolar Geometry


When the perspective effects are small, the problem of locating perspective epipolar
lines becomes ill-conditioned. In such cases it is convenient to assume the parallel
projection model of the affine camera which explicitly models the ambiguities.
The affine epipolar constraint can be described in terms of the affine fundamental matrix F as p'^T F p = 0, i.e.,

\[
\begin{pmatrix} x'_i & y'_i & 1 \end{pmatrix}
\begin{pmatrix}
0 & 0 & a \\
0 & 0 & b \\
c & d & e
\end{pmatrix}
\begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix} = 0
\]
Figure 1: 1D image formation with image plane at Z = f. X_p, X_wp and X_orth are the perspective, weak-perspective and orthographic projections respectively.

where p' = (x', y', 1)^T and p = (x, y, 1)^T are homogeneous 3-vectors representing corresponding image points in two views.
(See Shapiro, Zisserman and Brady).
To derive the above, we write M as (B | b), where B is a general (non-singular) 2 × 2 matrix and b is a 2-vector. The projection equation then gives

\[
x_i = B \begin{pmatrix} X_i \\ Y_i \end{pmatrix} + Z_i b + t
\]

Similarly, writing M'A as (B' | b'), where A and D describe the affine motion X'_i = A X_i + D of the scene (Section 2.2), we have for the second view

\[
x'_i = B' \begin{pmatrix} X_i \\ Y_i \end{pmatrix} + Z_i b' + M'D + t'
\]

Eliminating the scene coordinates (X_i, Y_i) gives

\[
x'_i = \Gamma x_i + Z_i d + \epsilon
\]

where Γ = B'B^{-1}, d = b' − B'B^{-1}b and ε = t' − Γt + M'D.

Figure 2: Affine and Perspective Epipolar Geometries

Γ and d are functions only of the camera parameters {M, M'} and the motion transformation A, while ε accounts for the motion of the reference point (centroid) and depends on the translation of the object D and the camera origins t and t'.

This equation shows that the point x'_i associated with x_i lies on a line (the epipolar line) in the second image with offset Γx_i + ε and direction d. The unknown depth Z_i determines how far along this line x'_i lies. Inverting the equation we obtain

\[
x_i = \Gamma^{-1} x'_i - Z_i \Gamma^{-1} d - \Gamma^{-1} \epsilon
\]

The translation-invariant versions of these equations are

\[
\Delta x'_i = \Gamma \Delta x_i + \Delta Z_i d
\]
\[
\Delta x_i = \Gamma^{-1} \Delta x'_i - \Delta Z_i \Gamma^{-1} d
\]

We can eliminate Z_i from the above equations and obtain a single equation in terms of image measurables:

\[
(x'_i - \Gamma x_i - \epsilon) \cdot d^{\perp} = 0
\]

where d = (d_x, d_y) and its perpendicular d^⊥ = (d_y, −d_x). This equation can be written as

\[
a x'_i + b y'_i + c x_i + d y_i + e = 0
\]

where (a, b)^T = d^⊥, (c, d)^T = −Γ^T d^⊥ and e = −ε^T d^⊥. This gives us

\[
\begin{pmatrix} x'_i & y'_i & 1 \end{pmatrix}
\begin{pmatrix}
0 & 0 & a \\
0 & 0 & b \\
c & d & e
\end{pmatrix}
\begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix} = 0
\]
2.1.1 Computation of Affine Epipolar Geometry
Given correspondences in two views, the affine fundamental matrix can be computed using orthogonal regression, by minimizing

\[
\frac{1}{|n|^2} \sum_{i=0}^{n-1} (r_i \cdot n + e)^2
\]

Here ri = (x0i , yi0 , xi , yi )T and n = (a, b, c, d)T . The minimization finds a hyper-plane
that globally minimizes the sum of the squared perpendicular distances between ri
and the hyper-plane.
Defining

\[
v_i = r_i - \bar{r}
\qquad\text{and}\qquad
W = \sum_{i=0}^{n-1} v_i v_i^T
\]

it can be shown that the solution satisfies the eigenvector equation

\[
W n = \lambda n, \qquad |n|^2 = 1
\]

where n is the eigenvector corresponding to the smallest eigenvalue of W; the offset is then e = −r̄ · n.
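A minimal sketch of this orthogonal-regression fit (NumPy; the two cameras and the scene points are synthetic illustrative values, and the offset e is recovered from the centroid as e = −r̄ · n):

```python
import numpy as np

def affine_fundamental(x, y, xp, yp):
    """Fit n = (a, b, c, d) and offset e with a x' + b y' + c x + d y + e = 0
    by orthogonal regression: n is the eigenvector of the scatter matrix W
    of the centred 4-vectors r_i = (x'_i, y'_i, x_i, y_i) with the smallest
    eigenvalue, and e = -r_bar . n."""
    r = np.column_stack([xp, yp, x, y])        # r_i as rows, n x 4
    v = r - r.mean(axis=0)                     # v_i = r_i - r_bar
    W = v.T @ v                                # W = sum_i v_i v_i^T
    eigvals, eigvecs = np.linalg.eigh(W)       # eigenvalues in ascending order
    n = eigvecs[:, 0]                          # smallest-eigenvalue eigenvector
    e = -r.mean(axis=0) @ n                    # offset from the centroid
    return np.append(n, e)                     # (a, b, c, d, e)

# Synthetic noise-free correspondences from two illustrative affine cameras
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (20, 3))                 # 20 scene points
M  = np.array([[1.0, 0.1, 0.3], [0.0, 0.9, 0.2]])   # view-1 camera
Mp = np.array([[0.8, 0.2, 0.1], [0.1, 1.1, 0.4]])   # view-2 camera
x,  y  = (X @ M.T).T
xp, yp = (X @ Mp.T).T
a, b, c, d, e = affine_fundamental(x, y, xp, yp)
residual = a * xp + b * yp + c * x + d * y + e
print(np.max(np.abs(residual)))                # ~0 for noise-free affine data
```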

2.2 Affine Structure


• Consider a set of n 3D world points X_i (i = 0, . . . , n − 1) in affine (non-rigid) motion described by

\[
X'_i = A X_i + D
\]

where X'_i is the new 3D position of the i-th point, A is an arbitrary 3 × 3 matrix and D is a 3-vector representing translation.

• Removing the effects of translation
We register the points with respect to a reference point X_0 to obtain

\[
\Delta X = X - X_0
\qquad\text{and}\qquad
\Delta X' = X' - X'_0 = A \Delta X
\]

• Affine projections
If the affine camera models for the two views are given by the parameters {M, t} and {M', t'} respectively, then

\[
\Delta x = M \Delta X
\qquad\text{and}\qquad
\Delta x' = M' \Delta X' = M' A \Delta X
\]


• Basis and Affine structure
Now consider four non-coplanar scene points X_0, . . . , X_3 with X_0 as the origin. We define three axis vectors E_j = X_j − X_0 for j = 1, . . . , 3. {E_1, E_2, E_3} form a basis for the 3D affine space, and any point can be represented in this basis as

\[
X_i - X_0 = \alpha_i E_1 + \beta_i E_2 + \gamma_i E_3, \qquad i = 1, \ldots, n-1
\]

where (α_i, β_i, γ_i) are the 3D affine coordinates of X_i. We call the 3D affine coordinates the affine structure of the point X_i. It can be shown that the affine structure remains invariant to affine motion with respect to the transformed basis, that is,

\[
\Delta X_i = \alpha_i E_1 + \beta_i E_2 + \gamma_i E_3,
\qquad
\Delta X'_i = \alpha_i E'_1 + \beta_i E'_2 + \gamma_i E'_3
\tag{1}
\]

where E'_j = A E_j.

• Computation of Affine structure
From the above we obtain

\[
\Delta x_i = x_i - x_0 = \alpha_i e_1 + \beta_i e_2 + \gamma_i e_3,
\qquad
\Delta x'_i = x'_i - x'_0 = \alpha_i e'_1 + \beta_i e'_2 + \gamma_i e'_3
\tag{2}
\]

where e_j = M E_j and e'_j = M' E'_j.

Thus, to compute the affine structure, we require two images with at least four points in correspondence, i.e.,

\[
\{x_0, x_1, x_2, x_3\}
\qquad\text{and}\qquad
\{x'_0, x'_1, x'_2, x'_3\}
\]

These correspondences establish the bases {e_1, e_2, e_3} and {e'_1, e'_2, e'_3}, provided no two axes, in either image, are collinear. Each additional point gives four equations in three unknowns

\[
\begin{pmatrix} \Delta x_i \\ \Delta x'_i \end{pmatrix}
=
\begin{pmatrix}
e_1 & e_2 & e_3 \\
e'_1 & e'_2 & e'_3
\end{pmatrix}
\begin{pmatrix} \alpha_i \\ \beta_i \\ \gamma_i \end{pmatrix}
\]

and the affine structure can be computed. The redundancy in the system enables
us to verify whether the affine projection model is valid.
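A minimal sketch of this least-squares computation (NumPy; the cameras, axis vectors and affine coordinates are synthetic, and any motion A is assumed to be folded into the second camera):

```python
import numpy as np

def affine_structure(dx, dxp, basis, basisp):
    """Solve the four equations in three unknowns
        [dx; dx'] = [e1 e2 e3; e1' e2' e3'] (alpha, beta, gamma)^T
    for one additional point, by least squares."""
    A = np.vstack([np.column_stack(basis),     # 2x3 block [e1 e2 e3]
                   np.column_stack(basisp)])   # 2x3 block [e1' e2' e3']
    rhs = np.concatenate([dx, dxp])            # 4-vector of image offsets
    coeffs, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return coeffs                              # (alpha, beta, gamma)

# Synthetic check (illustrative cameras; motion A folded into Mp)
M  = np.array([[1.0, 0.0, 0.5], [0.2, 1.0, 0.1]])
Mp = np.array([[0.9, 0.1, 0.2], [0.0, 1.1, 0.6]])
E1, E2, E3 = np.eye(3)                         # axis vectors E_j
dX = 0.3 * E1 - 0.7 * E2 + 1.2 * E3            # point with known structure
basis  = [M @ Ej for Ej in (E1, E2, E3)]       # e_j  = M E_j
basisp = [Mp @ Ej for Ej in (E1, E2, E3)]      # e'_j = Mp E_j
coeffs = affine_structure(M @ dX, Mp @ dX, basis, basisp)
print(coeffs)                                  # recovers (0.3, -0.7, 1.2)
```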
2.2.1 Tomasi and Kanade factorization
In case n point correspondences (n ≥ 4) over k views (k ≥ 2) are available, we use the factorization procedure of Tomasi and Kanade to obtain the bases and the structure. Their formulation can be written as an extension of the above equation as

\[
\begin{pmatrix}
\Delta x_1 & \Delta x_2 & \ldots & \Delta x_{n-1} \\
\Delta x'_1 & \Delta x'_2 & \ldots & \Delta x'_{n-1} \\
\Delta x''_1 & \Delta x''_2 & \ldots & \Delta x''_{n-1} \\
\vdots & & & \vdots
\end{pmatrix}
=
\begin{pmatrix}
e_1 & e_2 & e_3 \\
e'_1 & e'_2 & e'_3 \\
e''_1 & e''_2 & e''_3 \\
\vdots & &
\end{pmatrix}
\begin{pmatrix}
\alpha_1 & \alpha_2 & \ldots & \alpha_{n-1} \\
\beta_1 & \beta_2 & \ldots & \beta_{n-1} \\
\gamma_1 & \gamma_2 & \ldots & \gamma_{n-1}
\end{pmatrix}
\]

where the left measurement matrix W represents the n point correspondences in k views and has dimensions 2k × (n − 1). The matrices on the right, M̃ (2k × 3) and S̃ (3 × (n − 1)), are called the motion and structure matrices respectively. The matrix S̃ gives the invariant affine structure of the n points in motion, and the i-th pair of rows of M̃, M̃(i), along with the corresponding image center x_0(i), gives the projection parameters {M̃(i), x_0(i)} for the i-th view.

Clearly, in the absence of noise, W must have rank at most 3. Tomasi and Kanade perform a singular value decomposition of W and use the 3 largest singular values to construct M̃ and S̃. If the SVD returns a rank greater than 3, then the affine projection model is invalid, and we use this as a check. The rank-2 case signifies either a planar object (which is not possible for facial images!) or degenerate motion. In such a case, the 3D affine structure cannot be determined and the views are related by 2D affine transformations. The 2D affine structure can then be recovered in only two axes using the same formalism.
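The factorization step can be sketched with NumPy's SVD (synthetic noise-free data; note that M̃ and S̃ are recovered only up to an invertible 3 × 3 matrix Q, since M̃Q and Q⁻¹S̃ reproduce the same W):

```python
import numpy as np

def factorize(W, rank=3):
    """Rank-3 factorization W ~ M_tilde @ S_tilde via SVD, keeping the
    3 largest singular values (Tomasi-Kanade style, up to a 3x3 ambiguity)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M_tilde = U[:, :rank] * s[:rank]     # 2k x 3 motion matrix
    S_tilde = Vt[:rank]                  # 3 x (n-1) structure matrix
    return M_tilde, S_tilde

# Synthetic measurement matrix: k = 2 views of 10 registered points
rng = np.random.default_rng(1)
S = rng.uniform(-1.0, 1.0, (3, 10))      # true affine structure, 3 x (n-1)
cams = rng.uniform(-1.0, 1.0, (4, 3))    # stacked 2x3 cameras for 2 views
W = cams @ S                             # 2k x (n-1), rank 3 without noise
M_tilde, S_tilde = factorize(W)
print(np.allclose(M_tilde @ S_tilde, W)) # True: rank-3 model reproduces W
```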

2.2.2 Image transfer and linear combination of views


Once the affine structure has been computed, it can be used to generate a new view of the object ("transfer") by simply selecting a new spanning set {e''_1, e''_2, e''_3}. No camera calibration is needed. Note that this is the same as choosing a new projection matrix M'':

\[
x''_i = x''_0 + \alpha_i e''_1 + \beta_i e''_2 + \gamma_i e''_3
\]
If the affine structure is not of interest (graphics), it is possible to bypass the affine
coordinates and express the new image coordinates ∆x00 directly in terms of the first
two sets of image coordinates ∆x and ∆x0 . One can write the projection equations
in the first two views as
∆x = G∆X
∆x0 = G0 ∆X
where G and G0 are 2 × 3 matrices with rows {G1 , G2 } and {G01 , G02 } respectively.
The new view can be similarly written as

∆x00 = G00 ∆X

where G00 has rows {G001 , G002 }.


Now, any three rows of {G_1, G_2, G'_1, G'_2} define a linearly independent spanning set for A^3, say {G_1, G_2, G'_1}. So there exist scalars such that

\[
G'' =
\begin{pmatrix} a_1 & a_2 \\ b_1 & b_2 \end{pmatrix} G
+
\begin{pmatrix} a_3 & 0 \\ b_3 & 0 \end{pmatrix} G'
\]

Then, ∆x'' = G''∆X gives

\[
\Delta x'' =
\begin{pmatrix} a_1 & a_2 \\ b_1 & b_2 \end{pmatrix} \Delta x
+
\begin{pmatrix} a_3 & 0 \\ b_3 & 0 \end{pmatrix} \Delta x'
=
\begin{pmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \end{pmatrix}
\begin{pmatrix} \Delta x \\ \Delta y \\ \Delta x' \end{pmatrix}
\]

Thus, if images of an object are obtained using affine cameras, then a novel view can
be expressed as a linear combination of views (this is useful for object recognition).
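This linear-combination property can be checked numerically (a sketch; the three affine cameras and points are synthetic, and the 2 × 3 coefficient matrix is recovered by least squares rather than from known structure):

```python
import numpy as np

rng = np.random.default_rng(2)
dX = rng.uniform(-1.0, 1.0, (3, 12))            # 12 registered scene points
G, Gp, Gpp = rng.uniform(-1.0, 1.0, (3, 2, 3))  # three illustrative 2x3 cameras
dx, dxp, dxpp = G @ dX, Gp @ dX, Gpp @ dX       # the three views (2 x 12 each)

# Use the spanning set {G1, G2, G'1}: stack the measured rows dx, dy, dx'
B = np.vstack([dx, dxp[:1]])                    # 3 x 12 measurement rows
C = dxpp @ np.linalg.pinv(B)                    # 2x3 coefficients [a1 a2 a3; b1 b2 b3]
print(np.allclose(C @ B, dxpp))                 # True: novel view from two views
```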

2.2.3 Change of basis


Given the current spanning sets {e_1, e_2, e_3} and {e'_1, e'_2, e'_3} in the two images, we have

\[
\begin{pmatrix} \Delta x_i \\ \Delta x'_i \end{pmatrix}
=
\begin{pmatrix}
e_1 & e_2 & e_3 \\
e'_1 & e'_2 & e'_3
\end{pmatrix}
\begin{pmatrix} \alpha_i \\ \beta_i \\ \gamma_i \end{pmatrix}
\]

Suppose that we now wish to express the same set of points using alternative spanning sets {h_1, h_2, h_3} and {h'_1, h'_2, h'_3}. The new affine coordinates must obey

\[
\begin{pmatrix} \Delta x_i \\ \Delta x'_i \end{pmatrix}
=
\begin{pmatrix}
h_1 & h_2 & h_3 \\
h'_1 & h'_2 & h'_3
\end{pmatrix}
\begin{pmatrix} \hat{\alpha}_i \\ \hat{\beta}_i \\ \hat{\gamma}_i \end{pmatrix}
\]

2.2.4 Koenderink and Van Doorn


Instead of choosing E_3 = X_3 − X_0, KVD choose E_k = k (i.e., the direction of viewing in the first frame). Since e_k = M E_k = 0, the projection of E_k in the first image is degenerate, reducing it to a single point. Thus, only two basis vectors are chosen in the first image:

\[
\Delta x_i = \alpha_i e_1 + \beta_i e_2
\]

Figure 3: Affine Structure from Motion

In the second image, the third axis vector is no longer degenerate; it is given by e'_k = M'E'_k = M'A E_k. In fact, e'_k lies along the (affine) epipolar direction. If we use e'_1 and e'_2 to predict the position where each point would appear in image 2, as if it lay on the plane spanned by {E_1, E_2}, we get

\[
\hat{x}'_i = x'_0 + \alpha_i e'_1 + \beta_i e'_2
\]

The disparity between this predicted position and the observed position

\[
x'_i = x'_0 + \alpha_i e'_1 + \beta_i e'_2 + \gamma_i e'_k
\]

is solely due to the γ_i component:

\[
x'_i - \hat{x}'_i = \gamma_i e'_k
\]

2.3 Rigid reconstruction


2.3.1 Assumptions
• Rigid transformation (isometry)
• Affine projection

• Metric constructions

2.3.2 Procedure

Figure 4: Euclidean reconstruction

1. Translation in the fronto-parallel plane merely produces a shift in the projections. This can be factored out by bringing the two projections of O into coincidence.

2. Rotation can be decomposed into i) a rotation in the image plane (cyclo-rotation) and ii) a rotation about an axis in the fronto-parallel plane. The projection of the third affine frame vector is the projection of a plane perpendicular to the axis of rotation in the fronto-parallel plane. One can reconstruct this projection in the first view (an affine construction only) and factor out the relative rotation in the two images. This yields the cyclo-rotation.

3. Since the axis of rotation is known in both views, one can find the overall scale difference due to translation in depth. Points on the axis of rotation do not rotate. Consider the projection of all image points onto this axis. If the projections differ in the two views, they must differ by only a constant scale factor. Otherwise, the rigidity assumption is falsified.

4. Now the two views differ only by a rotation about an axis in the fronto-parallel
plane. Define a Euclidean frame (e1 , e2 , e3 ), such that e1,2,3 are unit vectors
with e1 along the axis of rotation and e3 along the line of sight.
Let G1 e1 + G2 e2 denote the depth gradient of a plane in the object. That is,
the depth of a point αe1 + βe2 in the image with respect to the fronto-parallel
plane is αG1 + βG2 . Note that

G1 = tan σ cos τ
G2 = tan σ sin τ

where σ is the slant and τ is the tilt of the plane.


Consider any triangle OXY in the plane. Let the coordinates of X and Y be
(X1 , X2 ) and (Y1 , Y2 ) respectively. Then the third coordinates must be

X3 = G1 X1 + G2 X2
Y3 = G1 Y1 + G2 Y2

For a given turn ρ the rotation can be represented by

\[
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos\rho & -\sin\rho \\
0 & \sin\rho & \cos\rho
\end{pmatrix}
\]

Of the three transformed coordinates, the first is trivially unchanged and the third is not observable. The second coordinate is observable, and the equations are

\[
X_2^1 = X_2^0 \cos\rho - \sin\rho \, (X_1^0 G_1 + X_2^0 G_2)
\]
\[
Y_2^1 = Y_2^0 \cos\rho - \sin\rho \, (Y_1^0 G_1 + Y_2^0 G_2)
\]

where the upper indices label the views and the lower indices label the components.
Because the turn ρ is unknown, we eliminate it from these equations to obtain a single equation in (G_1, G_2). This equation represents a one-parameter family of solutions for the two-view case, the parameter being the unknown turn ρ. The equation is quadratic in (G_1, G_2) with the linear term absent, and represents a hyperbola in the (G_1, G_2) space (please derive it).
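One route to the elimination (a sketch, in the notation above, with c = cos ρ and s = sin ρ):

```latex
% Abbreviate the depth terms:
%   P = X_1^0 G_1 + X_2^0 G_2,  Q = Y_1^0 G_1 + Y_2^0 G_2.
% The two observable equations form a linear system in (c, s):
\[
X_2^1 = c\,X_2^0 - s\,P, \qquad Y_2^1 = c\,Y_2^0 - s\,Q .
\]
% Solving for (c, s) by Cramer's rule and imposing c^2 + s^2 = 1 gives
\[
\left( P\,Y_2^1 - Q\,X_2^1 \right)^2
+ \left( X_2^0\,Y_2^1 - Y_2^0\,X_2^1 \right)^2
= \left( P\,Y_2^0 - Q\,X_2^0 \right)^2 ,
\]
% which is quadratic in (G_1, G_2) through P and Q, with no linear term,
% since P and Q are homogeneous linear in (G_1, G_2).
```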
5. Repeating the steps above between the second and a third view, we obtain a pair of two-view solutions. Each two-view solution represents a one-parameter family of solutions. The one-parameter families for the 0-1 transition and the 1-2 transition are represented by hyperbolic loci in the gradient space. The pair of hyperbolae has either two or four intersections. The case of no intersection occurs only in the non-rigid case. If the motion is rigid, then there has to be at least one solution and hence a pair of them. The intersections represent either one or two pairs of solutions, each pair related through a reflection in the fronto-parallel plane.