
Pergamon

Pattern Recognition, Vol. 30, No. 4, pp. 607-625, 1997


© 1997 Pattern Recognition Society. Published by Elsevier Science Ltd
Printed in Great Britain. All rights reserved
0031-3203/97 $17.00+.00

PII: S0031-3203(96)00107-0

AUTOMATIC VIDEO INDEXING VIA OBJECT MOTION ANALYSIS
JONATHAN D. COURTNEY*
Texas Instruments, Incorporated, 8330 LBJ Freeway, M/S 8374, Dallas, Texas 75243, U.S.A.
* E-mail: courtney@csc.ti.com.
(Received 12 June 1996; received for publication 30 July 1996)

Abstract - To assist human analysis of video data, a technique has been developed to perform automatic,
content-based video indexing from object motion. Moving objects are detected in the video sequence using
motion segmentation methods. By tracking individual objects through the segmented data, a symbolic
representation of the video is generated in the form of a directed graph describing the objects and their
movement. This graph is then annotated using a rule-based classification scheme to identify events of interest,
e.g., appearance/disappearance, deposit/removal, entrance/exit, and motion/rest of objects. One may then use an
index into the motion graph instead of the raw data to analyse the semantic content of the video. Application of
this technique to surveillance video analysis is discussed. © 1997 Pattern Recognition Society. Published by
Elsevier Science Ltd.
Video indexing    Object tracking    Motion analysis    Content-based retrieval
1. INTRODUCTION

Advances in multimedia technology, including commercial prospects for video-on-demand and digital library
systems, have generated recent interest in content-based
video analysis. Video data offers users of multimedia
systems a wealth of information; however, it is not as
readily manipulated as other data such as text. Raw video
data has no immediate "handles" by which the multimedia system user may analyse its contents. By annotating video data with symbolic information describing the
semantic content, one may facilitate analysis beyond
simple serial playback.
To assist human analysis of video data, a technique has
been developed to perform automatic, content-based
video indexing from object motion. Moving objects
are detected in the video sequence using motion segmentation methods. By tracking individual objects
through the segmented data, a symbolic representation
of the video is generated in the form of a directed graph
describing the objects and their movement. This graph is
then annotated using a rule-based classification scheme
to identify events of interest, e.g., appearance/disappearance, deposit/removal, entrance/exit, and motion/rest of
objects. One may then use an index into the motion graph
instead of the raw data to analyse the semantic content of
the video.
We have developed a system that demonstrates this
indexing technique in assisted analysis of surveillance
video data. The Automatic Video Indexing (AVI) system
allows the user to select a video sequence of interest, play
it forward or backward and stop at individual frames.
Furthermore, the user may specify queries on video
sequences and "jump" to events of interest to avoid
tedious serial playback. For example, the user may select

* E-mail: courtney@csc.ti.com.
607

Content-based retrieval

a person in a video sequence and specify the query "show


me all objects that this person removed from the scene".
In response, the AVI system assembles a set of video
"clips" highlighting the query results. The user may
select a clip of interest and proceed with further video
analysis using queries or playback as before.
The remainder of this paper is organized as follows:
Section 2 discusses content-based video analysis. Section 3 presents a video indexing technique based on
object motion analysis. Section 4 describes a system
which implements this video indexing technique for
scene monitoring applications. Section 5 presents experimental results using the system. Section 6 concludes the
paper.

2. CONTENT-BASED VIDEO ANALYSIS

Video data poses unique problems for multimedia


information systems that text does not. Textual data is
a symbolic abstraction of the spoken word that is usually
generated and structured by humans. Video, on the other
hand, is a direct recording of visual information. In its
raw and most common form, video data is subject to little
human-imposed structure, and thus has no immediate
"handles" by which the multimedia system user may
analyse its contents.
For example, consider an online movie screenplay
(textual data) and a digitized movie (video and audio
data). If one were analysing the screenplay and interested
in searching for instances of the word "horse" in the text,
various text searching algorithms could be employed to
locate every instance of this symbol as desired. Such
analysis is common in online text databases. If, however,
one were interested in searching for every scene in the
digitized movie where a horse appeared, the task is much
more difficult. Unless a human performs some sort of pre-processing of the video data, there are no symbolic keys on which to search. For a computer to assist in the
search, it must analyse the semantic content of the video
data itself. Without such capabilities, the information
available to the multimedia system user is greatly reduced.
Thus, much research in video analysis focuses on
semantic content-based search and retrieval techniques.
Video indexing refers to the process of identifying important frames or objects in the video data for efficient
playback. An indexed video sequence allows a user not
only to play the sequence in the usual serial fashion, but
also to "jump" to points of interest while it plays. A
common indexing scheme is to employ scene cut detection(1) to determine breakpoints in the video data. Indexing has also been performed based on camera (i.e. viewpoint) motion(2) and object motion.(3,4)
Using breakpoints found via scene cut detection, other researchers have pursued hierarchical segmentation(5-7) to analyse the logical organization of video sequences. In the same way that text is organized into sentences, paragraphs, and chapters, the goal of these techniques is to determine a hierarchical grouping of video sub-sequences. Combining this structural information with content abstractions of segmented sub-sequences(8) provides multimedia system users a top-down view of video data.
The indexing technique described in this paper (the
"AVI technique") performs video indexing based on
object motion analysis. Unlike previous work, it forms
semantically high-level interpretations of object actions
and interactions from the object motion information. This
allows multimedia system users to search for object-motion "events" in the video sequence (such as object
entrance or exit) rather than features related to object
velocity alone (such as "northeast movement").
3. VIDEO INDEXING VIA OBJECT MOTION ANALYSIS

Given a video sequence, the AVI technique analyses


the motion of foreground objects in the data and indexes

the objects to indicate the occurrence of several events of


interest. It outputs a symbolic abstraction of the video
content in the form of an annotated directed graph
containing the indexed objects. This symbolic data
may then be read by a user interface to perform content-based queries on the video data.
The AVI technique processes the video data in three
stages: motion segmentation, object tracking, and motion
analysis. First, motion segmentation methods(9,10) are
used to segment moving foreground objects from the
scene background in each frame. Next, each object is
tracked through successive video frames, resulting in a
graph describing object motion and path intersections.
Then the motion graph is scanned for the occurrence of
several events of interest. This is performed using a rule-based classifier which employs knowledge concerning
object motion and the output of the previous stages to
characterize the activity of the objects recorded in the
graph. For example, a moving object that occludes another object results in a "disappear" event; a moving
object that intersects and then removes a stationary object
results in a "removal" event. An index is then created
which identifies the location of each event in the video
sequence.
Figure 1 depicts the relation between the video data,
motion segmentation information, and the motion graph.
Note that for each frame of the video, the AVI technique
creates a corresponding symbolic "frame" to describe it.
3.1. Terminology and notation

The following is a description of some of the terms and


notation used in the subsequent sections:
A sequence S is an ordered set of N frames, denoted S = {F_0, F_1, ..., F_{N-1}}, where F_n is frame number n in the sequence.
A clip is a 4-tuple C = (S, f, s, l), where S is a sequence with N frames, and f, s, and l are frame numbers such that 0 ≤ f ≤ s ≤ l ≤ N − 1. Here, F_f and F_l are the first and last valid frames in the clip, and F_s is the "start" frame. Thus, a clip specifies a sub-sequence and contains a state variable to indicate a "frame of interest".

Fig. 1. Relation between video data, motion segmentation information, and the symbolic motion graph. [Figure panels: Video Data, Motion Segmentation, Motion Graph.]
A frame F is an image I annotated with a timestamp t. Thus, frame number n is denoted by the pair F_n = (I_n, t_n).

An image I is an r × c array of pixels. The notation I(i, j) indicates the pixel at coordinates (row i, column j). For purposes of this discussion, a pixel is assumed to be an intensity value between 0 and 255.
A timestamp records the date and time that an image
was digitized.
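For concreteness, the frame and clip structures above might be represented as in the following Python sketch; the field names are illustrative choices, not taken from the AVI implementation:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Frame:
    image: np.ndarray      # r x c array of intensity values (0-255)
    timestamp: float       # date and time the image was digitized

@dataclass
class Clip:
    sequence: List[Frame]  # the N frames of the underlying sequence
    f: int                 # first valid frame number (0 <= f <= s <= l <= N-1)
    s: int                 # "start" frame of interest
    l: int                 # last valid frame number
```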

3.2. Motion segmentation


For each frame F_n in the sequence, the motion segmentation stage computes the segmented image C_n as

C_n = ccomps(T_h • k),

where T_h is the binary image resulting from thresholding the absolute difference of images I_n and I_0 at h, T_h • k the morphological close operation(12) on T_h with structuring element k, and the function ccomps() performs connected components analysis,(11) resulting in a unique label for each connected region in image T_h • k. The image T_h is defined as

T_h(i, j) = 1 if |I_n(i, j) − I_0(i, j)| ≥ h, and 0 otherwise,

for all pixels (i, j) in I_n.
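A minimal Python sketch of this segmentation step is shown below, using NumPy and SciPy; the threshold h, the structuring element size, and the minimum region size are assumed parameters, since the paper does not fix their values:

```python
import numpy as np
from scipy import ndimage

def segment_motion(I_n, I_0, h=30, close_size=5, min_area=50):
    """Segment moving foreground regions in frame I_n against reference I_0.

    Returns a labeled image C_n in which each connected foreground region
    carries a unique positive integer label (0 = background).
    """
    # Threshold the absolute difference of the current and reference images.
    T_h = np.abs(I_n.astype(np.int16) - I_0.astype(np.int16)) >= h

    # The morphological close joins small fragments into smoothly shaped regions.
    k = np.ones((close_size, close_size), dtype=bool)
    closed = ndimage.binary_closing(T_h, structure=k)

    # Connected components analysis assigns a unique label to each region.
    C_n, num = ndimage.label(closed)

    # Discard components smaller than the size threshold.
    areas = ndimage.sum(closed, C_n, index=range(1, num + 1))
    for lbl, area in enumerate(areas, start=1):
        if area < min_area:
            C_n[C_n == lbl] = 0
    return C_n
```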


Figure 2 shows an example of this process. Absolute differencing and thresholding [Fig. 2(c) and (d)] detect motion regions in the image. The morphological close operation shown in Fig. 2(e) joins together small regions into smoothly-shaped objects. Connected components analysis assigns each detected object a unique label, as shown in Fig. 2(f). Components smaller than a given size threshold are discarded. The result is C_n, the output of the motion segmentation stage.

Fig. 2. Motion segmentation example. (a) Reference image I_0. (b) Image I_n. (c) Absolute difference |I_n − I_0|. (d) Thresholded image T_h. (e) Result of morphological close operation. (f) Result of connected components analysis.

The motion segmentation technique described here is best suited for video sequences containing object motion within an otherwise static scene, such as in surveillance and scene monitoring applications. Note that the technique uses a "reference image" for processing. This is nominally the first image from the sequence, I_0. For many applications, the assumption of an available reference image is not unreasonable; video capture is simply initiated from a fixed-viewpoint camera when there is limited motion in the scene. Following are some reasons why this assumption may fail in other applications:

1. Sudden lighting changes may render the reference frame invalid. However, techniques such as scene cut detection(1) may be used to detect such occurrences and indicate when a new reference image must be acquired.
2. Gradual lighting changes may cause the reference image to slowly grow "out of date" over long video sequences, particularly in outdoor scenes. Here, more sophisticated techniques involving cumulative differences of successive video frames(13) must be employed.
3. The viewpoint may change due to camera motion. In this case, camera motion compensation(14) must be used to offset the effect of an apparent moving background.
4. An object may be present in the reference frame and move during the sequence. This causes the motion segmentation process to incorrectly detect the background region exposed by the object as if it were a newly-appearing stationary object in the scene.

A straightforward solution to problem 4 is to apply a test to non-moving regions detected by the motion segmentation process based on the following observation: if


the region detected by the segmentation of image In is due


to the motion of an object present in the reference image
(i.e. due to "exposed background"), a high probability
exists that the boundary of the segmented region will
coincide with intensity edges detected in I 0 . If the region
is due to the presence of a foreground object in the
current image, a high probability exists that the region
boundary will coincide with intensity edges in In. The test
is implemented by applying an edge detection operator to
the current and reference images and checking for coincident boundary pixels in the segmented region of
C_n.(9) Figure 3 shows this process. If the test supports
the hypothesis that the region in question is due to
exposed background, the reference image is modified
by replacing the object with its exposed background
region (see Fig. 4).
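The edge-coincidence test could be sketched as follows; the Sobel edge threshold and the helper names are illustrative assumptions, not the AVI implementation:

```python
import numpy as np
from scipy import ndimage

def exposed_background_test(I_0, I_n, region_mask, edge_thresh=100):
    """Decide whether a non-moving segmented region is exposed background.

    Compares how well the region's boundary coincides with intensity edges
    in the reference image I_0 versus the current image I_n.
    """
    def edges(img):
        # Sobel gradient magnitude, thresholded to a binary edge map.
        gx = ndimage.sobel(img.astype(float), axis=1)
        gy = ndimage.sobel(img.astype(float), axis=0)
        return np.hypot(gx, gy) > edge_thresh

    # Boundary pixels of the segmented region (mask minus its erosion).
    boundary = region_mask & ~ndimage.binary_erosion(region_mask)

    ref_hits = np.count_nonzero(edges(I_0) & boundary)
    cur_hits = np.count_nonzero(edges(I_n) & boundary)

    # More coincident pixels in the reference image supports the
    # exposed-background hypothesis (compare Fig. 3(g) with Fig. 3(h)).
    return ref_hits > cur_hits
```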
No motion segmentation technique is perfect. The
following are errors typical of many motion segmentation techniques:
1. True objects will disappear temporarily from the
motion segmentation record. This occurs when there
is insufficient contrast between an object and an
occluded background region, or if an object is
partially occluded by a "background" structure (for
instance, a tree or pillar present in the scene).
2. False objects will appear temporarily in the motion
segmentation record. This is caused by light fluctuations or shadows cast by moving objects.
3. Separate objects will temporarily join together. This
typically occurs when two or more objects are in
close proximity or when one object occludes another
object.
4. Single objects will split into multiple regions. This
occurs when a portion of an object has insufficient
contrast with the background it occludes.
Instead of applying incremental improvements to relieve the shortcomings of motion segmentation, the AVI
technique addresses these problems at a higher level
where information about the semantic content of the
video data is more readily available. The object tracking
and motion analysis stages described in Sections 3.3 and
3.4 employ object trajectory estimates and knowledge
concerning object motion and typical motion segmentation errors to construct a more accurate representation of
the video content.
3.3. Object tracking

The motion segmentation output is processed by the


object tracking stage. Given a segmented image C_n with P uniquely-labeled regions corresponding to foreground objects in the video, the system generates a set of features to represent each region. This set of features is named a "V-object" (video-object), denoted V_n^p, p = 1, ..., P. A V-object contains the label, centroid, bounding box, and shape mask of its corresponding region, as well as object velocity and trajectory information generated by the tracking process.
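A V-object might be represented by a structure such as the following sketch; the field names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Tuple
import numpy as np

@dataclass
class VObject:
    """Features extracted for one labeled region in a segmented frame."""
    label: int                                  # region label in C_n
    centroid: Tuple[float, float]               # (row, col) centroid
    bbox: Tuple[int, int, int, int]             # (top, left, bottom, right)
    mask: np.ndarray                            # boolean shape mask
    frame: int                                  # frame number n
    velocity: Tuple[float, float] = (0.0, 0.0)  # estimated forward velocity
    primary_links: list = field(default_factory=list)    # links to next frame
    secondary_links: list = field(default_factory=list)  # weaker links
```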
V-objects are then tracked through the segmented video sequence. Given segmented images C_n and C_{n+1} with V-objects V_n = {V_n^p; p = 1, ..., P} and V_{n+1} = {V_{n+1}^q; q = 1, ..., Q}, respectively, the motion tracking process "links" V-objects V_n^p and V_{n+1}^q if their position and estimated velocity indicate that they correspond to the same real-world object appearing in frames F_n and F_{n+1}. This is determined using linear prediction of V-object positions and a "mutual nearest neighbor" criterion via the following procedure:
1. For each V-object V_n^p ∈ V_n, predict its position in the next frame using

   μ̂_n^p = μ_n^p + v_n^p · (t_{n+1} − t_n),

   where μ̂_n^p is the predicted centroid of V_n^p in C_{n+1}, μ_n^p the centroid of V_n^p measured in C_n, v_n^p the estimated (forward) velocity of V_n^p, and t_{n+1} and t_n are the timestamps of frames F_{n+1} and F_n, respectively. Initially, the velocity estimate is set to v_n^p = (0, 0).
2. For each V_n^p ∈ V_n, determine the V-object in the next frame with centroid nearest μ̂_n^p. This "nearest neighbor" is denoted N(V_n^p). Thus, N(V_n^p) = V_{n+1}^r such that ||μ̂_n^p − μ_{n+1}^r|| ≤ ||μ̂_n^p − μ_{n+1}^q|| for all q ≠ r.
3. For every pair (V_n^p, N(V_n^p) = V_{n+1}^r) for which no other V-objects in V_n have V_{n+1}^r as a nearest neighbor, estimate v_{n+1}^r, the (forward) velocity of V_{n+1}^r, as

   v_{n+1}^r = (μ_{n+1}^r − μ_n^p) / (t_{n+1} − t_n);   (1)

   otherwise, set v_{n+1}^r = (0, 0).


These steps are performed for each C_n, n = 0, 1, ..., N − 2. Steps 1 and 2 find nearest neighbors in the subsequent frame for each V-object. Step 3 generates velocity estimates for V-objects that can be unambiguously tracked; this information is used in step 1 to predict V-object positions for the next frame.
Next, steps 1-3 are repeated for the reverse sequence, i.e. C_n, n = N − 1, N − 2, ..., 1. This results in a new set of predicted centroids, velocity estimates, and nearest neighbors for each V-object in the reverse direction. Thus, the V-objects are tracked both forward and backward through the sequence. The remaining steps are then performed:
4. V-objects V_n^p and V_{n+1}^q are mutual nearest neighbors if N(V_n^p) = V_{n+1}^q and N(V_{n+1}^q) = V_n^p. (Here, N(V_n^p) is the nearest neighbor of V_n^p in the forward direction, and N(V_{n+1}^q) is the nearest neighbor of V_{n+1}^q in the reverse direction.) For each pair of mutual nearest neighbors (V_n^p, V_{n+1}^q), create a primary link from V_n^p to V_{n+1}^q.
5. For each V_n^p ∈ V_n without a mutual nearest neighbor, create a secondary link from V_n^p to N(V_n^p) if the predicted centroid μ̂_n^p is within ε of N(V_n^p), where ε is some small distance.
6. For each V_{n+1}^q in V_{n+1} without a mutual nearest neighbor, create a secondary link from N(V_{n+1}^q) to V_{n+1}^q if the predicted centroid μ̂_{n+1}^q is within ε of N(V_{n+1}^q).
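The sketch below illustrates the linking procedure for one pair of consecutive frames, folding the forward and reverse passes into a single check; the full procedure runs both passes over the entire sequence and applies step 6 symmetrically. The VObject fields follow the earlier sketch, and eps stands in for the distance ε:

```python
import numpy as np

def link_frame_pair(V_n, V_next, t_n, t_next, eps=15.0):
    """Link V-objects of consecutive frames (simplified steps 1-5)."""
    if not V_n or not V_next:
        return
    dt = t_next - t_n

    def predicted(v):
        # Step 1: linear prediction of the centroid in the next frame.
        return (v.centroid[0] + v.velocity[0] * dt,
                v.centroid[1] + v.velocity[1] * dt)

    def nearest(point, candidates):
        # Step 2: nearest neighbor by Euclidean centroid distance.
        return min(candidates,
                   key=lambda c: np.hypot(point[0] - c.centroid[0],
                                          point[1] - c.centroid[1]))

    forward_nn = {id(v): nearest(predicted(v), V_next) for v in V_n}
    reverse_nn = {id(w): nearest(w.centroid, V_n) for w in V_next}

    for v in V_n:
        w = forward_nn[id(v)]
        if reverse_nn[id(w)] is v:
            # Step 4: mutual nearest neighbors receive a primary link.
            v.primary_links.append(w)
            # Step 3: the match is unambiguous, so estimate w's velocity.
            w.velocity = ((w.centroid[0] - v.centroid[0]) / dt,
                          (w.centroid[1] - v.centroid[1]) / dt)
        else:
            # Step 5: otherwise create a secondary link if the predicted
            # centroid falls within eps of the nearest neighbor.
            p = predicted(v)
            if np.hypot(p[0] - w.centroid[0], p[1] - w.centroid[1]) < eps:
                v.secondary_links.append(w)
```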
Fig. 3. Exposed background detection. (a) Reference image I_0. (b) Image I_n. (c) Region to be tested. (d) Edge image of (a), found using the Sobel(11) operator. (e) Edge image of (b). (f) Edge image of (c), showing boundary pixels. (g) Pixels coincident in (d) and (f). (h) Pixels coincident in (e) and (f). The greater number of coincident pixels in (g) versus (h) supports the hypothesis that the region in question is due to exposed background.

Fig. 4. Reference image modified to account for the exposed background region detected in Fig. 3.

The object tracking procedure uses the mutual nearest neighbor criterion (step 4) to estimate frame-to-frame V-

object trajectories with a high degree of confidence. Pairs


of mutual nearest neighbors are connected using a "primary" link to indicate that they are highly likely to
represent the same real-world object in successive video
frames.
Steps 5-6 associate V-objects that are tracked
with less confidence but display evidence that they might
result from the same real-world object. Thus, these
objects are joined by "secondary" links. These steps
are necessary to account for the "split" and "join"
type motion segmentation errors as described in
Section 3.2.
The object tracking process results in a list of V-objects and connecting links that form a directed graph (digraph) representing the position and trajectory of foreground objects in the video sequence. Thus, the V-objects are the nodes of the graph and the connecting
links are the arcs. This motion graph is the output of the
object tracking stage.


Figure 5 shows a motion graph for a hypothetical


sequence of one-dimensional frames. Here, the system
detects the appearance of an object at A and tracks it to
the V-object at B. Due to an error in motion segmentation, the object splits at D and E, and joins at F. At G, the
object joins with the object tracked from C due to
occlusion. These objects split at H and I. Note that
primary links connect the V-objects that were most
reliably tracked.

3.4. Motion analysis


The motion analysis stage analyses the results of
object tracking and annotates the motion graph with tags
describing several events of interest. This process proceeds in two parts: V-object grouping and V-object
indexing. Figure 6 shows an example motion graph for
a hypothetical sequence of 1-D frames discussed in the
following sections.

Fig. 5. The output of the object tracking stage for a hypothetical sequence of 1-D frames. The vertical lines labeled "Fn" represent frame number n. Primary links are shown as solid arcs; secondary links are shown as dashed arcs.

Fig. 6. An example motion graph for a sequence of 1-D frames.

3.4.1. V-object grouping. First, the motion analysis stage hierarchically groups V-objects into structures representing the paths of objects through the video data. Using graph theory terminology,(15) five groupings are defined for this purpose:

A stem M = {V_i : i = 1, 2, ..., N_M} is a maximal-size, directed path (dipath) of two or more V-objects containing no secondary links, meeting all of the following conditions: outdegree(V_i) = 1 for 1 ≤ i < N_M, indegree(V_i) = 1 for 1 < i ≤ N_M, and either (2) or (3), where μ_i is the centroid of V-object V_i ∈ M.

Thus, a stem represents a simple trajectory of an object through two or more frames. Figure 7 labels V-objects from Fig. 6 belonging to separate stems with the letters "A" through "K".

Stems are used to determine the motion "state" of real-world objects, i.e. whether they are moving or stationary. If equation (2) is true, then the stem is classified as stationary; if equation (3) is true, then the stem is classified as moving. Figure 7 highlights stationary stems B, C, F, and H; the remainder are moving.

A branch B = {V_i : i = 1, 2, ..., N_B} is a maximal-size dipath of two or more V-objects containing no secondary links, for which outdegree(V_i) = 1 for 1 ≤ i < N_B and indegree(V_i) = 1 for 1 < i ≤ N_B. Figure 8 labels V-objects belonging to branches with the letters "L" through "T". A branch represents a highly reliable trajectory estimate of an object through a series of frames.

If a branch consists entirely of a single stationary stem, then it is classified as stationary; otherwise, it is classified as moving. Branches "N" and "Q" in Fig. 8 (highlighted) are stationary; the remainder are moving.

A trail L is a maximal-size dipath of two or more V-objects that contains no secondary links. This grouping represents the object tracking stage's best estimate of an object trajectory using the mutual nearest neighbor criterion. Figure 9 labels V-objects belonging to trails with the letters "U" through "Z".

A trail and the V-objects it contains are classified as stationary if all the branches it contains are stationary, and moving if all the branches it contains are moving. Otherwise, the trail is classified as unknown. Trail W in Fig. 9 is stationary; the remainder are moving.

Fig. 7. Stems. Stationary stems are highlighted.

Fig. 8. Branches. Stationary branches are highlighted.

Fig. 9. Trails.
A track K = {L_1, G_1, ..., L_{N_K−1}, G_{N_K−1}, L_{N_K}} is a dipath of maximal size containing trails {L_i : 1 ≤ i ≤ N_K} and connecting dipaths {G_i : 1 ≤ i < N_K}. For each G_i ∈ K there must exist a dipath

H = {V_i^f, G_i, V_{i+1}^1}

(where V_i^f is the last V-object in L_i, and V_{i+1}^1 is the first V-object in L_{i+1}), such that every V_j ∈ H meets the requirement (4), where μ_i^f is the centroid of V_i^f, v_i^f the forward velocity of V_i^f, (t_j − t_i) the time difference between the frames containing V_j and V_i^f, and μ_j is the centroid of V_j. Thus, equation (4) specifies that the object must maintain a constant velocity through path H.

A track represents the trajectory estimate of an object that may cause or undergo occlusion one or more times in a sequence. The motion analysis stage uses equation (4) to attempt to follow an object through frames where an


occlusion occurs. Figure 10 labels V-objects belonging to tracks with the letters "α", "β", "χ", "δ" and "ε". Note that track δ joins trails X and Y.

A track and the V-objects it contains are classified as stationary if all the trails it contains are stationary, and moving if all the trails it contains are moving. Otherwise, the track is classified as unknown. Track χ in Fig. 10 is stationary; the remaining tracks are moving.

A trace is a maximal-size, connected digraph of V-objects. A trace represents the complete trajectory of an object and all the objects with which it intersects. Thus, the motion graph in Fig. 6 contains two traces: one trace extends from F_2 to F_7; the remaining V-objects form a second trace. Figure 11 labels V-objects on these traces with the numbers "1" and "2".

Note that the preceding groupings are hierarchical, i.e. for every trace E, there exists at least one track K, trail L, branch B, and stem M such that E ⊇ K ⊇ L ⊇ B ⊇ M. Furthermore, every V-object is a member of exactly one trace.

The motion analysis stage scans the motion graph generated by the object tracking stage and groups V-objects into stems, branches, trails, tracks, and traces.

Fig. 10. Tracks. The dipath connecting trails X and Y from Fig. 9 is highlighted.

Fig. 11. Traces.


Thus, these five definitions are used to characterize
object trajectories in various portions of the motion
graph. This information is then used to index the video
according to its object motion content.
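As an illustration of how the groupings drive motion-state classification, the following sketch classifies a stem and propagates the state up to trails and tracks; the displacement threshold h_d is an assumed stand-in for the criteria of equations (2) and (3), and a branch would instead be stationary only when it consists of a single stationary stem:

```python
import numpy as np

def classify_stem(stem, h_d=5.0):
    """Classify a stem (ordered list of VObjects on a simple dipath) as
    'stationary' or 'moving'. The displacement test against h_d is an
    illustrative stand-in for equations (2) and (3)."""
    centroids = np.array([v.centroid for v in stem])
    # Maximum distance of any centroid from the stem's first centroid.
    spread = np.max(np.hypot(centroids[:, 0] - centroids[0, 0],
                             centroids[:, 1] - centroids[0, 1]))
    return "stationary" if spread <= h_d else "moving"

def classify_group(parts, classify):
    """Motion state of a trail or track: stationary (moving) only if all
    of its parts are stationary (moving); otherwise unknown."""
    states = {classify(p) for p in parts}
    return states.pop() if len(states) == 1 else "unknown"
```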
3.4.2. V-object indexing. Eight events of interest are
defined to designate various object-motion events in a
video sequence:
Appearance: An object emerges in the scene.
Disappearance: An object disappears from the scene.
Entrance: A moving object enters in the scene.
Exit: A moving object exits from the scene.
Deposit: An inanimate object is added to the
scene.
Removal: An inanimate object is removed from the
scene.
Motion: An object at rest begins to move.
Rest: A moving object comes to a stop.
These eight events are sufficiently broad for a video
indexing system to assist the analysis of many sequences.
For example, valuable objects such as inventory boxes,
tools, computers, etc., can be monitored for theft (i.e.
removal) in a security monitoring application. Likewise,
the traffic patterns of automobiles can be analysed (e.g.,
entrance/exit and motion/rest), or the shopping patterns
of retail customers recorded (e.g., motion/rest and removal).
After the V-object grouping process is complete, the
motion analysis stage has all the semantic information
necessary to identify these eight events in a video sequence. For each V-object V in the graph, the following
rules are applied to annotate the nodes of the motion
graph with event tags:
1. If V is moving, the first V-object in a track (i.e. the "head"), and indegree(V) > 0, place a tag designating an appearance event at V.
2. If V is stationary, the head of a track, and indegree(V) = 0, place a tag designating an appearance event at V.
3. If V is moving, the last V-object in a track (i.e. the "tail"), and outdegree(V) > 0, place a disappearance event tag at V.
4. If V is stationary, the tail of a track, and outdegree(V) = 0, place a disappearance event tag at V.
5. If V is non-stationary (i.e. moving or unknown), the head of a track, and indegree(V) = 0, place an entrance event tag at V.
6. If V is non-stationary, the tail of a track, and outdegree(V) = 0, place an exit event tag at V.
7. If V is stationary, the head of a track, and indegree(V) = 1, place a deposit event tag at V.
8. If V is stationary, the tail of a track, and outdegree(V) = 1, place a removal event tag at V.

Rules 1-8 use track groupings to annotate the video at the beginning and end of individual object trajectories. Note, however, that rules 7 and 8 only account for the object deposited or removed from the scene; they do not tag the V-object that caused the deposit or remove event

to occur. For this purpose, we define two additional events:

Depositor: A moving object adds an inanimate object to the scene.
Remover: A moving object removes an inanimate object from the scene.

We then apply two more rules:
9. If V is moving and adjacent to a V-object with a deposit event tag, place a depositor event tag at V.
10. If V is moving and adjacent from a V-object with a removal event tag, place a remover event tag at V.
The additional events depositor and remover are used
to provide a distinction between the subject and object of
deposit/removal events. These events are only used when
the actions of a specific moving object must be analysed.
Otherwise, their deposit/removal counterparts are sufficient indication of the occurrence of the event.
Finally, two additional rules are applied to account for
the motion and rest events:
11. If V is the tail of a stationary stem M_i and the head of a moving stem M_j for which |M_i| ≥ h_M and |M_j| ≥ h_M, then place a motion event tag at V. Here, h_M is a lower size limit of stems to consider.
12. If V is the tail of a moving stem M_i and the head of a stationary stem M_j for which |M_i| ≥ h_M and |M_j| ≥ h_M, then place a rest event tag at V.
Table 1 summarizes the conditions under which rules
1-12 apply event tags to V-objects with moving, stationary, and unknown motion states. Figure 12 shows all
the event annotation rules applied to the example motion
graph of Fig. 6.
As the annotation rules are applied to the motion
graph, each identified event is recorded in an index table
for later lookup. This event index takes the form of an
array of lists of V-objects (one list for each event type)
and indexes V-objects in the motion graph according to
their event tags.
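A sketch of how a subset of rules 1-8 might be applied to track heads and tails, while simultaneously building the event index, is given below; the track and V-object attributes used here (vobjects, state, events) and the degree maps are assumptions about the motion-graph representation, not the AVI system's own interfaces:

```python
from collections import defaultdict

def annotate_tracks(tracks, indegree, outdegree):
    """Apply rules 1, 2 and 5-8 to each track and build the event index
    (a mapping from event type to the list of tagged V-objects).

    Each track is assumed to carry an ordered `vobjects` list and a
    motion `state` of 'moving', 'stationary' or 'unknown'.
    """
    index = defaultdict(list)

    def tag(v, event):
        v.events.append(event)          # assumes V-objects carry an event list
        index[event].append(v)

    for track in tracks:
        head, tail = track.vobjects[0], track.vobjects[-1]
        state = track.state

        if state == "moving" and indegree[head] > 0:
            tag(head, "appearance")                        # rule 1
        if state == "stationary" and indegree[head] == 0:
            tag(head, "appearance")                        # rule 2
        if state != "stationary" and indegree[head] == 0:
            tag(head, "entrance")                          # rule 5
        if state != "stationary" and outdegree[tail] == 0:
            tag(tail, "exit")                              # rule 6
        if state == "stationary" and indegree[head] == 1:
            tag(head, "deposit")                           # rule 7
        if state == "stationary" and outdegree[tail] == 1:
            tag(tail, "removal")                           # rule 8
    return index
```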
The output of the motion analysis stage is an annotated
directed graph describing the motion of foreground
objects and an event index indicating events of interest
in the video stream. Thus, the motion analysis stage
generates from the object tracking output a symbolic
abstraction of the actions and interactions of foreground
objects in the video. This approach enables content-based
analysis of video sequences that would otherwise be
impossible.

4. THE AVI SYSTEM

A system has been developed that performs content-based video indexing for assisted analysis of surveillance video data (see Fig. 13). The AVI system processes video sequences using the indexing technique described in Section 3, then stores the output (the video data, motion segmentation information, and indexed motion graph) in a database. A graphical user interface (GUI) allows the user to retrieve a video sequence from the database, play it forward or backward and stop on individual frames. The system also provides a content-based retrieval mechanism by which the AVI system user may specify queries on a video sequence using spatial, temporal, event-, and object-based parameters. Thus, the user can "jump" to important points in the video sequence based on the query specification.

Table 1. Conditions for annotating V-objects with each of the object-motion events

                 V-object motion state
Event            Moving                                     Stationary                           Unknown
Appearance       Head of track; indegree(V) > 0             Head of track; indegree(V) = 0
Disappearance    Tail of track; outdegree(V) > 0            Tail of track; outdegree(V) = 0
Entrance         Head of track; indegree(V) = 0                                                  Head of track; indegree(V) = 0
Exit             Tail of track; outdegree(V) = 0                                                 Tail of track; outdegree(V) = 0
Deposit                                                     Head of track; indegree(V) = 1
Removal                                                     Tail of track; outdegree(V) = 1
(Depositor)      Adjacent to V-object with deposit tag
(Remover)        Adjacent from V-object with removal tag
Motion           Tail of stationary stem; head of moving stem (any motion state)
Rest             Tail of moving stem; head of stationary stem (any motion state)


Fig. 12. Annotation rules applied to Fig. 6.


Fig. 13. A high-level diagram of the AVI system.


Figure 14 shows a picture of the "playback" portion


of the AVI GUI. It provides familiar VCR-like controls
(i.e. forward, reverse, stop, step-forward, step-back), as
well as a system "clipboard" for recording intermediate video analysis results (i.e. video "clips").
For example, the clipboard shown in Fig. 14 contains
three clips, the result of a previous query by the user.


Fig. 14. The AVI system playback interface.

Fig. 15. The AVI system query interface.

The user may select one of these clips, play it forward and
back, and pose a new query using it. The clip(s) resulting
from the new query are then pushed onto the top of the
clipboard stack. The user may also peruse the clipboard

stack using the button-commands "up", "down", and


"pop".
Figure 15 shows the query interface to the AVI system.
Using the "Type" field, the user may specify any combination of spatial, temporal, event-, or object-based queries. The interface provides fields to set parameters for temporal and event-based queries; parameters for spatial and object-based queries may be set inside the video playback window shown in Fig. 14 using the mouse. For example, the user may select a spatial region in the video window and specify the query "show me all objects that were removed from this region of the scene between 8 am and 9 am." After specifying the query type and parameters, the user executes the "Apply" button-command to pose the query to the AVI system. Clips highlighting the query results are then posted to the system clipboard.

4.1. Query-based video retrieval

A query engine retrieves video data from the database in response to queries generated at the AVI system graphical user interface. A query Y takes the form

Y = (C, T, V, R, E),

where C is a video clip, T = (t_i, t_j) specifies a time interval within the clip, V is a V-object within the clip, R a spatial region in the field of view, and E an object-motion event.

The clip C specifies the video sub-sequence to be processed by the query, and the (optional) values of T, V, R, and E define the scope of the query. Using this form, the AVI system user can make such a request as "find any occurrence of this object being removed from this region of the scene between 8 am and 9 am." Thus, the query engine processes Y by finding all the video sub-sequences in C that satisfy T, V, R, and E.

In processing a given query, the query engine retrieves a copy of the motion graph G corresponding to clip C from the video database and performs the following steps:

1. If E is specified in the query, G is truncated to a subgraph including only those V-objects with event tags matching E.
2. If T = (t_i, t_j) is specified, G is further truncated to a subgraph spanning frames F_i to F_j.
3. If V is specified, G is truncated to include only V-objects belonging to the trace containing V.
4. If V belongs to a track, G is truncated to include only V-objects belonging to the track containing V.
5. If R is specified, G is truncated to include only those V-objects whose shape mask intersects the specified spatial region.
6. If E is not specified, G is truncated to include only those V-objects V with indegree(V) = 0, i.e. the source nodes remaining in G.

Step 1 filters the V-objects of the motion graph to match the specified event. This step is facilitated by use of the event index into the motion graph. When E is specified (which is typical), the number of V-objects that must be processed by the following steps is greatly reduced. Step 2 satisfies the temporal query constraints. Steps 3 and 4 satisfy the object-based constraints by restricting the query result to the V-objects on the most reliable path of V in the motion graph. If V belongs to a track, the track is the most reliable path of V; otherwise, the trace containing V is the most reliable path. Step 5 filters V-objects to meet the spatial constraints. Finally, step 6 reduces the result to include only the first occurrence of objects meeting the requirements of V, T, and R. The resulting subgraph G then contains only V-objects satisfying all the constraints of the query.
Fig. 16. Graphical depiction of the query Y = (C, T, V, R, E) applied to Fig. 12.

Fig. 17. Processing of event-based constraints (step 1).

Fig. 18. Processing of temporal constraints (step 2).

Fig. 19. Processing of object-based constraints (steps 3 and 4).

Fig. 20. Processing of spatial constraints (step 5).

Figure 16 is a graphical depiction of a query Y = (C, T, V, R, E) applied to the motion graph shown in Fig. 12. This query is equivalent to the request "show if object V exits the scene in region R during time interval T". Figures 17-20 illustrate the steps performed by the query engine on this sequence. Figure 20 shows the single V-object satisfying the query.

For each V-object V_i satisfying the query, the query engine generates a result, ℛ_i = (C_i, V_i), consisting of a clip, C_i, and a pointer to the V-object. The first and last frames of C_i are set to reflect the time constraint of the query, T, if specified; otherwise, they are set to those of C, the clip specified in the query. The "frame of interest" of C_i is set to the frame containing V_i. These results are sent to the graphical user interface for display.
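A sketch of this result-packaging step, under the clip representation assumed earlier, might look as follows:

```python
def make_results(matches, clip, T=None):
    """Package each matching V-object as a result (clip, V-object) pair.

    `clip` is a (sequence, f, s, l) tuple; the time constraint T, when
    given, overrides the first and last frame numbers.
    """
    sequence, f, s, l = clip
    results = []
    for v in matches:
        first, last = (T if T is not None else (f, l))
        # The "frame of interest" is the frame containing the V-object.
        results.append(((sequence, first, v.frame, last), v))
    return results
```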
4.2. Video analysis example
Figure 21 shows frames from an example video sequence with motion content characteristic of security
monitoring applications. In this sequence, a person enters
the scene, deposits a piece of paper, a briefcase, and a
book, and then exits. He then re-enters the scene, removes the briefcase, and exits again. If a user forms the

query "find all deposit events", the AVI system will


respond with video clips depicting the person depositing
the paper, briefcase, and book. Figure 22 shows the actual
result given by the AVI system in response to this query.
Figure 23 demonstrates how more advanced queries
may be used in video analysis. After receiving the three
clips of Fig. 22 in response to the query "show all deposit
events", the AVI system user is interested in learning
more about fate of the briefcase in the sequence of
Fig. 21. First, the user retrieves the clip highlighting
frame F 78 [shown in Fig. 23(a)] from the clipboard
and applies the query "find entrance events of this
object" to the person shown depositing the briefcase.
The system responds with a single clip showing the
first instance of the person entering the scene, as
shown in Fig. 23(b). The user can play the clip at this
point and observe the person carrying the briefcase into
the room.
Next, the user applies the query "find removal events
(caused by) this object" to the person carrying the
briefcase. The system responds by saying there are no
such events. (Indeed, this is correct because the person
removes no objects until after he leaves and re-enters the



Fig. 21. Frames from an example video sequence. Frame numbers are shown below each image.

room; at that point, the person is defined as a different object.)
The user returns to the original clip of Fig. 23(a) by
popping the clipboard stack twice. Then the user applies

the query "find removal events of this object" to the


briefcase. The system responds with a single clip of
the person removing the briefcase, as shown in
Fig. 23(c).


Fig. 22. Clips from the video sequence of Fig. 21 satisfying the query "find all deposit events". Boxes
highlight the objects contributing to the event.


Fig. 23. Advanced video analysis example. Clips show: (a) the briefcase being deposited, (b) the entrance of
the person who deposits the briefcase, (c) the briefcase being removed, (d) the exit of the person who
removes the briefcase.

Finally, the user specifies the query "find exit events of


this object" to the person removing the briefcase. The system
then responds with a single clip of the person as he leaves
the room (with the briefcase), as shown in Fig. 23(d).

5. EXPERIMENTAL RESULTS

The video indexing technique described in this paper


was tested using the AVI system on three video sequences
containing a total of 900 frames, 18 objects, and 44

events. The sequences were created as mock-ups of


different domains of scene monitoring.
Test Sequence 1 is characteristic of an inventory or
security monitoring application (see Fig. 21). In it, a
person adds and removes various objects from a room
as recorded by an overhead camera. It contains 300
frames captured at approximately 10 frames per
second and five objects generating 10 events. The
sequence contains entrance/exit and deposit/
removal events, as well as two instances of object
occlusion.
Test Sequence 2 is characteristic of a retail customer
monitoring application (see Fig. 24). In it, a customer
stops at a store shelf, examines different products, and
eventually takes one with him. It contains 285 frames
at approximately 10 frames per second and four
objects generating 14 events. This is the most complicated of the test sequences: it contains examples of
all eight events, displays several instances of occlusion, and contains three foreground objects in the
initial frame.
Test Sequence 3 is characteristic of a parking lot traffic
monitoring application (see Fig. 25). In it, cars enter a
parking lot and stop, drivers emerge from their vehicles, and pedestrians walk through the field of view. It
contains 315 frames captured at approximately three
frames per second and nine objects generating 20
events. Before digitization, the sequence was first
recorded to 8 mm tape with consumer-grade equipment and is therefore the most "noisy" of the test
sequences.
The performance of the AVI system was measured by
indexing each of the test sequences and recording its
success or failure at detecting the eight primary object-motion events. Tables 2-4 report event detection results
for the AVI system on the three test sequences. For each
event, the tables report the number actually present in the
sequence, the number found by the AVI system, the Type
I (false negative) errors, and the Type II (false positive)
errors.
Of the 44 total events in the test sequences, the AVI
system displays 10 Type II errors but only one Type I
error. Thus, the system is conservative and tends to find at
least the desired events.

Table 2. Event detection results for Test Sequence 1

Event            Actual    Detected    Type I    Type II
Appearance       0         0           0         0
Disappearance    2         2           0         0
Entrance         2         2           0         0
Exit             2         3           0         1
Deposit          3         3           0         0
Removal          1         1           0         0
Motion           0         0           0         0
Rest             0         0           0         0
Total            10        11          0         1

Table 3. Event detection results for Test Sequence 2

Event            Actual    Detected    Type I    Type II
Appearance       3         3           0
Disappearance    2         2           0
Entrance                               0
Exit                                   0
Deposit                                0
Removal                                0
Motion                                 0
Rest                                   0
Total            14        15          0         1

Table 4. Event detection results for Test Sequence 3

Event            Actual    Detected    Type I    Type II
Appearance       2         3           0         1
Disappearance    0         2           0         2
Entrance         7         8           0         1
Exit             8         9           0         1
Deposit          0         0           0         0
Removal          0         0           0         0
Motion           0                     0
Rest             3                     1
Total            20        27          1         8

The system performed the worst on Test Sequence 3,


where it displayed the only Type I error and eight of the
10 total Type II errors. This is primarily due to three
reasons:
1. Noise in the sequence, including vertical jitter from a
poor frame-sync signal, resulted in very unstable
motion segmentation masks. Thus, stationary objects
appear to move significantly.
2. The method used to track objects through occlusions presently assumes constant object trajectories.
A motion tracking scheme that is more robust in the presence of rapidly changing trajectories will result in fewer false positives for many of the events.(16)
3. No means to track objects through occlusion by fixed
scene objects is presently used. The light pole in the
foreground of the scene temporarily occludes pedestrians who walk behind it, causing pairs of false
entrance/exit events.
However, the system performed very well on Test Sequences 1 and 2 despite multiple simultaneous occlusions and moving shadows; and in all the sequences, the
system is sufficiently robust to accurately respond to a
large number of useful queries (Fig. 26).
6. CONCLUSION

Automatic indexing techniques enable intelligent analysis of video data by creating symbolic "handles" by

which multimedia system users may navigate through video sequences.

Fig. 24. Frames from Test Sequence 2.

The video indexing technique described in this paper abstracts raw video information using motion segmentation, object tracking, and a hierarchical path construction method which enables annotation using several motion-based event tags. Efficient retrieval of

video clips is facilitated by an event index into the abstracted video. Furthermore, a system employing this indexing technique for assisted analysis of surveillance video allows users to "jump" to points of interest in a video sequence via intuitive spatial, temporal, event-, and object-based queries.

Fig. 25. Frames from Test Sequence 3.

Fig. 26. Appearance and exit of an individual pedestrian from Test Sequence 3. Frame F_217 shows the pedestrian emerging from a car; frame F_248 shows the pedestrian walking out of the field of view.

Acknowledgements- Thanks go to Dinesh Nair and Stephen


Perkins for assisting in the design and implementation of the
AVI system.

REFERENCES

1. HongJiang Zhang, Atreyi Kankanhalli and Stephen W. Smoliar, Automatic partitioning of full-motion video, Multimedia Systems 1(1), 10-28 (1993).
2. Akihito Akutsu, Yoshinobu Tonomura, Hideo Hashimoto and Yuji Ohba, Video indexing using motion vectors, in Visual Communications and Image Processing, Proc. SPIE 1818, Petros Maragos, ed., pp. 1522-1530, Boston, Massachusetts (November 1992).
3. Mikihiro Ioka and Masato Kurokawa, A method for retrieving sequences of images on the basis of motion analysis, in Image Storage and Retrieval Systems, Proc. SPIE 1662, pp. 35-46 (1992).
4. Suh-Yin Lee and Huan-Ming Kao, Video indexing: an approach based on moving object and track, in Storage and Retrieval for Image and Video Databases, Proc. SPIE 1908, Wayne Niblack, ed., pp. 25-36, San Jose, California (February 1993).
5. Glorianna Davenport, Thomas Aguierre Smith and Natalio Pincever, Cinematic primitives for multimedia, IEEE Comput. Graphics Appl., 67-74 (July 1991).
6. Masahiro Shibata, A temporal segmentation method for video sequences, in Visual Communications and Image Processing, Proc. SPIE 1818, Petros Maragos, ed., pp. 1194-1205, Boston, Massachusetts (November 1992).
7. Deborah Swanberg, Chiao-Fe Shu and Ramesh Jain, Knowledge guided parsing in video databases, in Storage and Retrieval for Image and Video Databases, Proc. SPIE 1908, Wayne Niblack, ed., pp. 13-24, San Jose, California (February 1993).
8. F. Arman, R. Depommier, A. Hsu and M.-Y. Chiu, Content-based browsing of video sequences, in Proc. ACM Int. Conf. on Multimedia, San Francisco, California (October 1994).
9. Ramesh Jain, W. N. Martin and J. K. Aggarwal, Segmentation through the detection of changes due to motion, Comput. Graphics Image Process. 11, 13-34 (1979).
10. S. Yalamanchili, W. N. Martin and J. K. Aggarwal, Extraction of moving object descriptions via differencing, Comput. Graphics Image Process. 18, 188-201 (1982).
11. Dana H. Ballard and Christopher M. Brown, Computer Vision. Prentice-Hall, Englewood Cliffs, New Jersey (1982).
12. Robert M. Haralick and Linda G. Shapiro, Computer and Robot Vision, Vol. 2. Addison-Wesley, Reading, Massachusetts (1993).
13. Akio Shio and Jack Sklansky, Segmentation of people in motion, in IEEE Workshop on Visual Motion, pp. 325-332, Princeton, New Jersey (October 1991).
14. M. Irani and P. Anandan, A unified approach to moving object detection in 2D and 3D scenes, in Proc. Image Understanding Workshop, pp. 707-718, Palm Springs, California (February 1996).
15. Gary Chartrand and Ortrud R. Oellermann, Applied and Algorithmic Graph Theory. McGraw-Hill, New York (1993).
16. Stephen S. Intille and Aaron F. Bobick, Closed-world tracking, in Proc. Fifth Int. Conf. on Computer Vision, pp. 672-678, Cambridge, Massachusetts (June 1995).

About the Author-JONATHAN D. COURTNEY received the M.S. degree in Computer Science and the B.S.
degree in Computer Engineering and Computer Science from Michigan State University. Mr Courtney is a
Member of the Technical Staff in the Multimedia Systems Branch of Corporate Research and Development at
Texas Instruments. His Master's thesis research, under the direction of Professor Anil K. Jain, concerned mobile
robot localization using multisensor maps. His current research interests include multimedia information
systems and virtual environments for cooperative work. Mr Courtney is a member of the IEEE.
