Structure-from-Motion

Determining the 3-D structure of the world, and/or the
motion of a camera, using a sequence of images
taken by a moving camera.
Equivalently, we can think of the world as moving and the
camera as fixed.
Like stereo, but the position of the camera isn't
known (and it's more natural to use many images
with little motion between them, not just two with a lot
of motion).
We may or may not assume we know the parameters of the
camera, such as its focal length.

Structure-from-Motion
As with stereo, we can divide the problem:
Correspondence.
Reconstruction.
Again, we'll talk about reconstruction
first.
So for the next few classes we assume that
each image contains some points, and we
know which points match which.
Structure-from-Motion

Movie
Reconstruction
A lot harder than with stereo.
Start with simpler case: scaled
orthographic projection (weak
perspective).
Recall, in this we remove the z coordinate
and scale all x and y coordinates by the same
amount.
Announcements
Quiz on April 22 in class (a week from
Thursday).
Practice problems handed out Thursday.
Reviews at office hours, possibly in
seminar room across the hall from my
office.
Questions about PS6
First: Represent motion
We'll talk about a fixed camera and a moving object.
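Key point:
Points:
\[
P = \begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
\]
Some matrix:
\[
S = \begin{pmatrix}
s_{1,1} & s_{1,2} & s_{1,3} & t_x \\
s_{2,1} & s_{2,2} & s_{2,3} & t_y
\end{pmatrix}
\]
The image:
\[
I = \begin{pmatrix}
u_1 & u_2 & \cdots & u_n \\
v_1 & v_2 & \cdots & v_n
\end{pmatrix}
\]
Then:
\[
I = SP
\]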
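As a small illustration of I = SP, the sketch below builds P from three made-up points and applies a made-up S. The point coordinates and the entries of S are assumptions chosen only for the example; NumPy is used here for convenience.

```python
import numpy as np

# Hypothetical example data: three 3-D points, one per column.
xyz = np.array([[0.0, 1.0, 2.0],
                [0.0, 1.0, 0.0],
                [5.0, 6.0, 7.0]])
P = np.vstack([xyz, np.ones((1, xyz.shape[1]))])  # 4 x n, last row is all ones

# Hypothetical 2 x 4 matrix S: drop z, scale by 2, shift by (3, -1).
S = np.array([[2.0, 0.0, 0.0, 3.0],
              [0.0, 2.0, 0.0, -1.0]])

I = S @ P  # 2 x n: row 0 holds the u coordinates, row 1 the v coordinates
print(I)
```

Each column of I is the image of the corresponding column of P, so the z row of P has no effect with this particular S.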
Structure-from-Motion
S encodes:
Projection: S has only two rows.
Scaling, since S can have a scale factor.
Translation, by tx/s and ty/s.
Rotation:

I = SP
Rotation
\[
\begin{pmatrix}
r_{1,1} & r_{1,2} & r_{1,3} \\
r_{2,1} & r_{2,2} & r_{2,3} \\
r_{3,1} & r_{3,2} & r_{3,3}
\end{pmatrix} P
\]
Represents a 3D rotation of the points in P.
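First, look at 2D rotation (easier)
\[
\begin{pmatrix}
\cos\theta & -\sin\theta \\
\sin\theta & \cos\theta
\end{pmatrix}
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n
\end{pmatrix}
\]
\[
R = \begin{pmatrix}
\cos\theta & -\sin\theta \\
\sin\theta & \cos\theta
\end{pmatrix}
\]
Matrix R acts on points by rotating them.
Also, $RR^T$ = Identity. $R^T$ is also a rotation
matrix, in the opposite direction to R.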
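Simple 3D Rotation
\[
\begin{pmatrix}
\cos\theta & -\sin\theta & 0 \\
\sin\theta & \cos\theta & 0 \\
0 & 0 & 1
\end{pmatrix}
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n
\end{pmatrix}
\]
Rotation about z axis.
Rotates x,y coordinates. Leaves z coordinates fixed.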
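Full 3D Rotation
\[
R =
\begin{pmatrix}
\cos\theta & -\sin\theta & 0 \\
\sin\theta & \cos\theta & 0 \\
0 & 0 & 1
\end{pmatrix}
\begin{pmatrix}
\cos\beta & 0 & \sin\beta \\
0 & 1 & 0 \\
-\sin\beta & 0 & \cos\beta
\end{pmatrix}
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos\alpha & -\sin\alpha \\
0 & \sin\alpha & \cos\alpha
\end{pmatrix}
\]
Any rotation can be expressed as a combination of three
rotations about three axes.
\[
RR^T = \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix}
\]
Rows (and columns) of R are
orthonormal vectors.
R has determinant 1 (not -1).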
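The properties listed for R can be checked numerically. The sketch below composes rotations about the z, y and x axes and tests orthonormality and the determinant. The sign convention (counterclockwise, right-handed) and the three angles are assumptions made for the example; the helper names `rot_x`, `rot_y`, `rot_z` are hypothetical.

```python
import numpy as np

def rot_x(a):
    """Rotation by angle a (radians) about the x axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(b):
    """Rotation by angle b (radians) about the y axis."""
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(t):
    """Rotation by angle t (radians) about the z axis."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

# Arbitrary example angles
R = rot_z(0.3) @ rot_y(-1.1) @ rot_x(2.0)

print(np.allclose(R @ R.T, np.eye(3)))      # rows are orthonormal
print(np.allclose(R.T @ R, np.eye(3)))      # columns are orthonormal
print(np.isclose(np.linalg.det(R), 1.0))    # determinant is +1
print(rot_z(0.3) @ np.array([1.0, 2.0, 3.0]))  # z coordinate stays 3
```

Since $RR^T$ is the identity, `R.T` undoes `R`, which matches the statement that the transpose rotates in the opposite direction.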
Questions?
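Putting it Together
\[
\underbrace{s}_{\text{Scale}}
\underbrace{\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0
\end{pmatrix}}_{\text{Projection}}
\underbrace{\begin{pmatrix}
1 & 0 & 0 & t_x \\
0 & 1 & 0 & t_y \\
0 & 0 & 1 & t_z \\
0 & 0 & 0 & 1
\end{pmatrix}}_{\text{3D Translation}}
\underbrace{\begin{pmatrix}
r_{1,1} & r_{1,2} & r_{1,3} & 0 \\
r_{2,1} & r_{2,2} & r_{2,3} & 0 \\
r_{3,1} & r_{3,2} & r_{3,3} & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}}_{\text{3D Rotation}}
P
=
\begin{pmatrix}
s_{1,1} & s_{1,2} & s_{1,3} & s t_x \\
s_{2,1} & s_{2,2} & s_{2,3} & s t_y
\end{pmatrix} P
\]
where
\[
(s_{1,1}, s_{1,2}, s_{1,3}) \cdot (s_{2,1}, s_{2,2}, s_{2,3}) = 0,
\qquad
\|(s_{1,1}, s_{1,2}, s_{1,3})\| = \|(s_{2,1}, s_{2,2}, s_{2,3})\|
\]
We can just write $s t_x$ as $t_x$ and $s t_y$ as $t_y$.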
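The product of scale, projection, 3D translation and 3D rotation can be multiplied out numerically to see the resulting 2 x 4 matrix and the two constraints on its rows. This is a minimal sketch; the function name `weak_perspective` and the example values of s, R and t are assumptions, not part of the lecture.

```python
import numpy as np

def weak_perspective(s, R, t):
    """Return the 2 x 4 matrix s * Projection * Translation * Rotation.

    s: scalar scale, R: 3 x 3 rotation, t: length-3 translation."""
    proj = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0]])
    T = np.eye(4)
    T[:3, 3] = t            # homogeneous 3D translation
    Rh = np.eye(4)
    Rh[:3, :3] = R          # homogeneous 3D rotation
    return s * (proj @ T @ Rh)

c = np.sqrt(2) / 2
R = np.array([[c, 0, c], [0, 1, 0], [-c, 0, c]])   # 45 degrees about the y axis
M = weak_perspective(2.0, R, np.array([1.0, 2.0, 3.0]))
print(M)

row1, row2 = M[0, :3], M[1, :3]
print(np.isclose(row1 @ row2, 0.0))                              # rows orthogonal
print(np.isclose(np.linalg.norm(row1), np.linalg.norm(row2)))    # equal length (= s)
```

The last column of M is s times (t_x, t_y); t_z is removed by the projection, as expected for this camera model.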
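Affine Structure from Motion
\[
I = \begin{pmatrix}
s_{1,1} & s_{1,2} & s_{1,3} & t_x \\
s_{2,1} & s_{2,2} & s_{2,3} & t_y
\end{pmatrix} P
\]
where, for scaled orthographic projection,
\[
(s_{1,1}, s_{1,2}, s_{1,3}) \cdot (s_{2,1}, s_{2,2}, s_{2,3}) = 0,
\qquad
\|(s_{1,1}, s_{1,2}, s_{1,3})\| = \|(s_{2,1}, s_{2,2}, s_{2,3})\|
\]
Affine structure-from-motion drops these two constraints and treats the entries of S as unconstrained.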
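Affine Structure-from-Motion: Two Frames (1)
\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
\]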
Affine Structure-from-Motion: Two Frames (2)
To make things easy, suppose:
\[
\begin{pmatrix}
x_1 & x_2 & x_3 & x_4 \\
y_1 & y_2 & y_3 & y_4 \\
z_1 & z_2 & z_3 & z_4 \\
1 & 1 & 1 & 1
\end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1
\end{pmatrix}
\]
Affine Structure-from-Motion: Two Frames (3)
\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
\]
Looking at the first four points, we get:
\[
\begin{pmatrix}
u^1_1 & u^1_2 & u^1_3 & u^1_4 \\
v^1_1 & v^1_2 & v^1_3 & v^1_4 \\
u^2_1 & u^2_2 & u^2_3 & u^2_4 \\
v^2_1 & v^2_2 & v^2_3 & v^2_4
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
\begin{pmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1
\end{pmatrix}
\]
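With the four points at (0,0,0), (1,0,0), (0,1,0) and (0,0,1), the motion can be read off the images directly: the image of the first point is the translation column, and subtracting it from the images of the other three points gives the remaining columns. The sketch below illustrates this under that assumption; the function name `motion_from_canonical` and the random test motion are hypothetical.

```python
import numpy as np

# The four "easy" points as homogeneous columns.
CANONICAL = np.array([[0, 1, 0, 0],
                      [0, 0, 1, 0],
                      [0, 0, 0, 1],
                      [1, 1, 1, 1]], dtype=float)

def motion_from_canonical(I4):
    """I4: 4 x 4 images of the canonical points (rows u^1, v^1, u^2, v^2).

    Returns the stacked 4 x 4 two-frame motion matrix."""
    t = I4[:, [0]]                          # image of the origin = translation column
    return np.hstack([I4[:, 1:] - t, t])    # columns s_1, s_2, s_3, then t

rng = np.random.default_rng(0)
S_true = rng.standard_normal((4, 4))        # made-up affine motion for two frames
I4 = S_true @ CANONICAL
S_est = motion_from_canonical(I4)
print(np.allclose(S_est, S_true))
```

Equivalently, one could multiply the image matrix by the inverse of the canonical point matrix; the column subtraction just spells out what that inverse does.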
Affine Structure-from-Motion: Two Frames (5)
\[
\begin{pmatrix}
u^1_k \\ v^1_k \\ u^2_k \\ v^2_k
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
\begin{pmatrix}
x_k \\ y_k \\ z_k \\ 1
\end{pmatrix}
\]
Once we know the motion, we can use the images of
another point to solve for the structure. We have four
linear equations, with three unknowns.
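Four equations in three unknowns can be solved in the least-squares sense. The following sketch does this for one point, assuming the stacked 4 x 4 motion matrix is already known. The function name `point_from_images` and the test values are made up for illustration, and least squares is one reasonable choice rather than necessarily the method used in class.

```python
import numpy as np

def point_from_images(S, image_k):
    """Recover (x, y, z) of one point from its two images.

    S: 4 x 4 stacked two-frame motion; image_k: (u^1, v^1, u^2, v^2).
    Solves S[:, :3] @ p + S[:, 3] = image_k for p by least squares."""
    p, *_ = np.linalg.lstsq(S[:, :3], image_k - S[:, 3], rcond=None)
    return p

rng = np.random.default_rng(4)
S = rng.standard_normal((4, 4))            # hypothetical known motion
p_true = np.array([1.0, -2.0, 0.5])        # hypothetical point
image_k = S[:, :3] @ p_true + S[:, 3]      # its noise-free images
p_est = point_from_images(S, image_k)
print(np.allclose(p_est, p_true))
```

With noisy images the four equations are generally inconsistent, and the least-squares solution is the point whose predicted images are closest to the measured ones.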
Affine Structure-from-Motion: Two Frames (7)
But, what if the first four points aren't so simple?
Then we define A so that:
\[
A
\begin{pmatrix}
x_1 & x_2 & x_3 & x_4 \\
y_1 & y_2 & y_3 & y_4 \\
z_1 & z_2 & z_3 & z_4 \\
1 & 1 & 1 & 1
\end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1
\end{pmatrix}
\]
This is always possible as long as the points
aren't coplanar.
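Such an A can be computed by inverting the 4 x 4 matrix of the four points, which is invertible exactly when the points are not coplanar. The sketch below shows this for four made-up points; the function name `canonicalizing_transform` and the point coordinates are assumptions for the example.

```python
import numpy as np

CANONICAL = np.array([[0, 1, 0, 0],
                      [0, 0, 1, 0],
                      [0, 0, 0, 1],
                      [1, 1, 1, 1]], dtype=float)

def canonicalizing_transform(M):
    """M: 4 x 4, columns are four points in homogeneous coordinates.

    Returns A with A @ M equal to CANONICAL. Inverting M fails or becomes
    numerically unreliable when the four points are (nearly) coplanar."""
    return CANONICAL @ np.linalg.inv(M)

# Hypothetical points (10,0,0), (11,0,0), (10,1,0), (10,0,10)
M = np.array([[10, 11, 10, 10],
              [0, 0, 1, 0],
              [0, 0, 0, 10],
              [1, 1, 1, 1]], dtype=float)
A = canonicalizing_transform(M)
print(A)
print(np.allclose(A @ M, CANONICAL))
```

For these points A is a shift by -10 in x followed by a scaling of z by 1/10, and its last row is (0, 0, 0, 1).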
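Affine Structure-from-Motion: Two Frames (8)
Then, given:
\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
\]
We have:
\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
A^{-1} A
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
\]
And:
\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
A^{-1}
\begin{pmatrix}
0 & 1 & 0 & 0 & \cdots & x'_n \\
0 & 0 & 1 & 0 & \cdots & y'_n \\
0 & 0 & 0 & 1 & \cdots & z'_n \\
1 & 1 & 1 & 1 & \cdots & 1
\end{pmatrix}
\]
where $(x'_i, y'_i, z'_i, 1)^T = A\,(x_i, y_i, z_i, 1)^T$ are the transformed points.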
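Affine Structure-from-Motion: Two Frames (9)
Given:
\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
A^{-1}
\begin{pmatrix}
0 & 1 & 0 & 0 & \cdots & x'_n \\
0 & 0 & 1 & 0 & \cdots & y'_n \\
0 & 0 & 0 & 1 & \cdots & z'_n \\
1 & 1 & 1 & 1 & \cdots & 1
\end{pmatrix}
\]
where $(x'_n, y'_n, z'_n, 1)^T = A\,(x_n, y_n, z_n, 1)^T$.
Then we just pretend that:
\[
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
A^{-1}
\]
is our motion, and solve as before.
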
Affine Structure-from-Motion: Two Frames (10)
This means that we can never determine the exact 3D
structure of the scene. We can only determine it up to
some transformation, A. Since if a structure and motion
explains the points:
\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
\]
So does another of the form:
\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\left[
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
A^{-1}
\right]
\left[
A
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
\right]
\]
Affine Structure-from-Motion: Two Frames (11)
\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
A^{-1} A
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
\]
Note that A has the form:
\[
A = \begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\
a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\
a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \\
0 & 0 & 0 & 1
\end{pmatrix}
\]
A corresponds to translation of the points, plus a linear transformation.
Let's take an explicit example of this. Suppose we have a cube with
vertices at (0,0,0), (10,0,0), (0,0,10), ... Suppose we transform this by
rotating it by 45 degrees about the y axis. Then the transformation matrix
is [sq2 0 sq2 0; 0 1 0 0], where sq2 = sqrt(2)/2. Now suppose instead we had a rectangular box
with corners at (10,0,0), (11,0,0), (10,0,10), ... We can transform this
box into the first cube by applying [10 0 0 -100; 0 1 0 0;
0 0 1 0; 0 0 0 1]. Then we can apply [sq2 0 sq2 0; 0 1 0 0] to the
resulting cube, to generate the same image. Or, we could have combined
these two transformations into one:
[sq2 0 sq2 0; 0 1 0 0] * [10 0 0 -100; 0 1 0 0; 0 0 1 0; 0 0 0 1]
= [10*sq2 0 sq2 -100*sq2; 0 1 0 0]
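The cube example can be checked by multiplying the matrices out. The sketch below uses only the three vertices listed for each shape and takes the rotation entries to be sqrt(2)/2 (the cosine and sine of 45 degrees), which is an assumption about what the shorthand in the text stands for.

```python
import numpy as np

c = np.sqrt(2) / 2                       # cos(45 deg) = sin(45 deg)
S = np.array([[c, 0, c, 0],
              [0, 1, 0, 0]])             # rotate 45 degrees about y, then project
A = np.array([[10, 0, 0, -100],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)

# Homogeneous columns: the three listed vertices of each shape.
cube = np.array([[0, 10, 0],
                 [0, 0, 0],
                 [0, 0, 10],
                 [1, 1, 1]], dtype=float)
box = np.array([[10, 11, 10],
                [0, 0, 0],
                [0, 0, 10],
                [1, 1, 1]], dtype=float)

combined = S @ A                         # one 2 x 4 matrix acting on the box
print(combined)
print(np.allclose(A @ box, cube))                # A maps the box onto the cube
print(np.allclose(S @ cube, combined @ box))     # both explanations give the same image
```

Two different structures (cube and box), each paired with its own motion, produce identical images, which is the ambiguity the example is meant to show.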
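Affine Structure-from-Motion: A lot of frames (1)
\[
\underbrace{\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n \\
\vdots & \vdots & & \vdots \\
u^m_1 & u^m_2 & \cdots & u^m_n \\
v^m_1 & v^m_2 & \cdots & v^m_n
\end{pmatrix}}_{I}
=
\underbrace{\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y \\
\vdots & \vdots & \vdots & \vdots \\
s^m_{1,1} & s^m_{1,2} & s^m_{1,3} & t^m_x \\
s^m_{2,1} & s^m_{2,2} & s^m_{2,3} & t^m_y
\end{pmatrix}}_{S}
\underbrace{\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}}_{P}
\]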
First Step: Solve for Translation (1)
This is trivial, because we can pick a simple
origin.
World origin is arbitrary.
Example: We can assume the first point is at the origin.
Rotation then doesn't affect that point.
All its motion is translation.
Better to pick the center of mass as origin.
Average of all points.
This also averages out the noise.

More explicitly, suppose sum(P) = (0,0,0,n)^T. Then
sum(R*P) = R*sum(P) = R*(0,0,0,n)^T = (0,0,0,n)^T, and
sum(T*R*P) = T*(0,0,0,n)^T = (n*tx, n*ty, n*tz, n)^T. (Or
just look at the 2x4 projection matrix.) So the mean of each
image row is that row's translation entry, tx or ty. If we subtract
it from every entry of the row, the residual is
(s11,s12,s13; s21,s22,s23)*P. I = s part of matrix + t
part of matrix.

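\[
s
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0
\end{pmatrix}
\begin{pmatrix}
1 & 0 & 0 & t_x \\
0 & 1 & 0 & t_y \\
0 & 0 & 1 & t_z \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix}
r_{1,1} & r_{1,2} & r_{1,3} & 0 \\
r_{2,1} & r_{2,2} & r_{2,3} & 0 \\
r_{3,1} & r_{3,2} & r_{3,3} & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
P
\]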
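First Step: Solve for Translation (2)
WLOG:
\[
\sum_{i=1}^{n} \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}
\]
\[
\bar{u}^k = \frac{1}{n} \sum_{i=1}^{n} u^k_i,
\qquad
\bar{v}^k = \frac{1}{n} \sum_{i=1}^{n} v^k_i
\]
\[
\tilde{I} = \begin{pmatrix}
u^1_1 - \bar{u}^1 & u^1_2 - \bar{u}^1 & \cdots & u^1_n - \bar{u}^1 \\
v^1_1 - \bar{v}^1 & v^1_2 - \bar{v}^1 & \cdots & v^1_n - \bar{v}^1 \\
u^2_1 - \bar{u}^2 & u^2_2 - \bar{u}^2 & \cdots & u^2_n - \bar{u}^2 \\
v^2_1 - \bar{v}^2 & v^2_2 - \bar{v}^2 & \cdots & v^2_n - \bar{v}^2 \\
\vdots & \vdots & & \vdots \\
u^m_1 - \bar{u}^m & u^m_2 - \bar{u}^m & \cdots & u^m_n - \bar{u}^m \\
v^m_1 - \bar{v}^m & v^m_2 - \bar{v}^m & \cdots & v^m_n - \bar{v}^m
\end{pmatrix}
\]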
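First Step: Solve for Translation (3)
\[
\begin{pmatrix}
\tilde{u}^1_1 & \tilde{u}^1_2 & \cdots & \tilde{u}^1_n \\
\tilde{v}^1_1 & \tilde{v}^1_2 & \cdots & \tilde{v}^1_n \\
\tilde{u}^2_1 & \tilde{u}^2_2 & \cdots & \tilde{u}^2_n \\
\tilde{v}^2_1 & \tilde{v}^2_2 & \cdots & \tilde{v}^2_n \\
\vdots & \vdots & & \vdots \\
\tilde{u}^m_1 & \tilde{u}^m_2 & \cdots & \tilde{u}^m_n \\
\tilde{v}^m_1 & \tilde{v}^m_2 & \cdots & \tilde{v}^m_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} \\
\vdots & \vdots & \vdots \\
s^m_{1,1} & s^m_{1,2} & s^m_{1,3} \\
s^m_{2,1} & s^m_{2,2} & s^m_{2,3}
\end{pmatrix}
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n
\end{pmatrix}
\]
As if by magic, there's no translation.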
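The translation step amounts to subtracting each row's mean from the image matrix. The sketch below illustrates this on synthetic data whose points have their center of mass at the origin; the function name `remove_translation` and all numeric values are assumptions for the example.

```python
import numpy as np

def remove_translation(I):
    """Subtract from every row of the 2m x n image matrix that row's mean.

    Returns the centered matrix and the column of row means."""
    row_means = I.mean(axis=1, keepdims=True)
    return I - row_means, row_means

rng = np.random.default_rng(3)
P3 = rng.standard_normal((3, 8))             # 8 made-up points, 3 x n
P3 -= P3.mean(axis=1, keepdims=True)         # WLOG: center of mass at the origin
S3 = rng.standard_normal((4, 3))             # two frames, motion without translation
t = rng.standard_normal((4, 1))              # one translation entry per image row
I = S3 @ P3 + t                              # the measured images

I_tilde, t_est = remove_translation(I)
print(np.allclose(t_est, t))                 # row means are the translations
print(np.allclose(I_tilde, S3 @ P3))         # what remains has no translation
```

If the true center of mass is not at the origin, the same subtraction still works; it simply moves the world origin to the centroid, which is allowed because the origin is arbitrary.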
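Rank Theorem
\[
\underbrace{\begin{pmatrix}
\tilde{u}^1_1 & \tilde{u}^1_2 & \cdots & \tilde{u}^1_n \\
\tilde{v}^1_1 & \tilde{v}^1_2 & \cdots & \tilde{v}^1_n \\
\tilde{u}^2_1 & \tilde{u}^2_2 & \cdots & \tilde{u}^2_n \\
\tilde{v}^2_1 & \tilde{v}^2_2 & \cdots & \tilde{v}^2_n \\
\vdots & \vdots & & \vdots \\
\tilde{u}^m_1 & \tilde{u}^m_2 & \cdots & \tilde{u}^m_n \\
\tilde{v}^m_1 & \tilde{v}^m_2 & \cdots & \tilde{v}^m_n
\end{pmatrix}}_{\tilde{I}}
=
\underbrace{\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} \\
\vdots & \vdots & \vdots \\
s^m_{1,1} & s^m_{1,2} & s^m_{1,3} \\
s^m_{2,1} & s^m_{2,2} & s^m_{2,3}
\end{pmatrix}}_{S}
\underbrace{\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n
\end{pmatrix}}_{P}
\]
$\tilde{I}$ has rank 3.
This means there are 3 vectors such that every
row of $\tilde{I}$ is a linear combination of
these vectors.
These vectors are the rows of P.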
Solve for S
SVD is made to do this.
\[
\tilde{I} = UDV
\]
D is diagonal with non-increasing
values.
U and V have orthonormal rows.
Since $\tilde{I}$ has rank 3, only the first three diagonal values of D are nonzero.
Ignoring the values that are 0, we
have U(:,1:3) for S, and
D(1:3,1:3)*V(1:3,:) for P.
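In NumPy the same factorization can be sketched as follows. `np.linalg.svd` returns the singular values as a vector in non-increasing order and returns V already in the orientation used above (so the product is `U @ diag(d) @ Vt`). The function name `factor_rank3` and the synthetic test data are assumptions for illustration.

```python
import numpy as np

def factor_rank3(I_tilde):
    """Split a centered 2m x n image matrix into S_hat (2m x 3) and P_hat (3 x n)."""
    U, d, Vt = np.linalg.svd(I_tilde, full_matrices=False)
    S_hat = U[:, :3]                        # U(:,1:3)
    P_hat = np.diag(d[:3]) @ Vt[:3, :]      # D(1:3,1:3)*V(1:3,:)
    return S_hat, P_hat

# Hypothetical noise-free data: 3 frames (6 rows), 10 points, rank 3 by construction
rng = np.random.default_rng(2)
I_tilde = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 10))

S_hat, P_hat = factor_rank3(I_tilde)
print(S_hat.shape, P_hat.shape)
print(np.allclose(S_hat @ P_hat, I_tilde))  # the two factors reproduce the images
```

The recovered factors generally differ from the matrices used to generate the data by an invertible 3 x 3 matrix, so only their product is compared here.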
Linear Ambiguity (as before)
$\tilde{I}$ = U(:,1:3) * D(1:3,1:3) * V(1:3,:)
$\tilde{I}$ = (U(:,1:3) * A) * (inv(A) * D(1:3,1:3) * V(1:3,:))
Noise
With noise, $\tilde{I}$ has full rank.
The best solution is to estimate the I that's as near to $\tilde{I}$
as possible, with the estimate of I having rank 3.
Our current method does this.
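Keeping only the three largest singular values gives the rank-3 matrix closest to the noisy one in the Frobenius norm (the Eckart-Young theorem), and the remaining error is the root sum of squares of the discarded singular values. The sketch below checks this on made-up data; the sizes, noise level and explicit rank tolerance are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 12                                  # hypothetical: 4 frames, 12 points
clean = rng.standard_normal((2 * m, 3)) @ rng.standard_normal((3, n))   # rank 3
noisy = clean + 0.01 * rng.standard_normal(clean.shape)

U, d, Vt = np.linalg.svd(noisy, full_matrices=False)
approx = U[:, :3] @ np.diag(d[:3]) @ Vt[:3, :]    # keep the 3 largest singular values

rank_noisy = np.linalg.matrix_rank(noisy, tol=1e-8)
rank_approx = np.linalg.matrix_rank(approx, tol=1e-8)
err = np.linalg.norm(noisy - approx)              # Frobenius norm of what was discarded

print(rank_noisy == min(2 * m, n))                # noise makes the matrix full rank
print(rank_approx == 3)
print(np.isclose(err, np.sqrt(np.sum(d[3:] ** 2))))
```

The singular values beyond the third give a quick indication of how much noise is in the measurements.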
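Weak Perspective Motion
\[
\underbrace{\begin{pmatrix}
\tilde{u}^1_1 & \tilde{u}^1_2 & \cdots & \tilde{u}^1_n \\
\tilde{v}^1_1 & \tilde{v}^1_2 & \cdots & \tilde{v}^1_n \\
\tilde{u}^2_1 & \tilde{u}^2_2 & \cdots & \tilde{u}^2_n \\
\tilde{v}^2_1 & \tilde{v}^2_2 & \cdots & \tilde{v}^2_n \\
\vdots & \vdots & & \vdots \\
\tilde{u}^m_1 & \tilde{u}^m_2 & \cdots & \tilde{u}^m_n \\
\tilde{v}^m_1 & \tilde{v}^m_2 & \cdots & \tilde{v}^m_n
\end{pmatrix}}_{\tilde{I}}
=
\underbrace{\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} \\
\vdots & \vdots & \vdots \\
s^m_{1,1} & s^m_{1,2} & s^m_{1,3} \\
s^m_{2,1} & s^m_{2,2} & s^m_{2,3}
\end{pmatrix}}_{S}
\underbrace{\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n
\end{pmatrix}}_{P}
\]
Rows 2k-1 and 2k of S (the two rows for frame k) should
be orthogonal.
All rows should
be unit vectors.
(Push all scale
into P.)
$\tilde{I}$ = (U(:,1:3) * A) * (inv(A) * D(1:3,1:3) * V(1:3,:))
Choose A so
(U(:,1:3) * A)
satisfies these
conditions.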
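One common way to choose such an A (not necessarily the one intended in the lecture) is to note that the conditions are linear in the symmetric matrix Q = A*A^T: for rows a and b of a frame, a Q a^T = 1, b Q b^T = 1 and a Q b^T = 0. The sketch below solves for the six entries of Q by least squares and takes A from a Cholesky factorization. It assumes noise-free data, a common scale in all frames, and enough frames for Q to be determined; rows are indexed from 0 in the code, and the names `quad_coeffs` and `metric_upgrade` are hypothetical.

```python
import numpy as np

def quad_coeffs(a, b):
    """Coefficients of a Q b^T in the unknowns (Q11, Q12, Q13, Q22, Q23, Q33)."""
    return np.array([a[0] * b[0],
                     a[0] * b[1] + a[1] * b[0],
                     a[0] * b[2] + a[2] * b[0],
                     a[1] * b[1],
                     a[1] * b[2] + a[2] * b[1],
                     a[2] * b[2]])

def metric_upgrade(S_hat):
    """Return a 3 x 3 matrix A so that rows of S_hat @ A are unit vectors and
    the two rows of each frame are orthogonal (in the least-squares sense)."""
    rows, rhs = [], []
    for k in range(S_hat.shape[0] // 2):
        a, b = S_hat[2 * k], S_hat[2 * k + 1]      # 0-based rows of frame k
        rows += [quad_coeffs(a, a), quad_coeffs(b, b), quad_coeffs(a, b)]
        rhs += [1.0, 1.0, 0.0]
    q, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    Q = np.array([[q[0], q[1], q[2]],
                  [q[1], q[3], q[4]],
                  [q[2], q[4], q[5]]])
    return np.linalg.cholesky(Q)   # A with A @ A.T == Q; requires Q positive definite

# Synthetic, noise-free check with made-up data: 6 frames, 20 points
rng = np.random.default_rng(0)
m, n = 6, 20
S_true = np.vstack([np.linalg.qr(rng.standard_normal((3, 3)))[0][:2]
                    for _ in range(m)])            # two orthonormal rows per frame
P_true = rng.standard_normal((3, n))
P_true -= P_true.mean(axis=1, keepdims=True)
I_tilde = S_true @ P_true

U, d, Vt = np.linalg.svd(I_tilde, full_matrices=False)
S_hat, P_hat = U[:, :3], np.diag(d[:3]) @ Vt[:3, :]
A = metric_upgrade(S_hat)
S_metric, P_metric = S_hat @ A, np.linalg.inv(A) @ P_hat
print(np.allclose(np.linalg.norm(S_metric, axis=1), 1.0))
```

The result is still only determined up to a rotation of the whole scene, since any rotation applied to A leaves Q unchanged. With noisy data Q may fail to be positive definite, in which case the Cholesky step raises an error and a different strategy is needed.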
Related problems we won't cover
Missing data.
Points with different, known noise.
Multiple moving objects.
Final Messages
Structure-from-motion for points can be
reduced to linear algebra.
Epipolar constraint reemerges.
SVD important.
Rank Theorem says the images a
scene produces aren't complicated
(also important for recognition).
