I. INTRODUCTION
II. RELATED WORK
Fig. 1. Block diagram of the proposed action recognition framework.
called Tensor Canonical Correlation Analysis, where descriptive similarity features between two video volumes are used in a nearest-neighbor classification scheme for recognition [8]. Lui et al., however, studied the underlying geometry of the tensor space occupied by human action sequences and performed factorization on this space to obtain product manifolds [10]. Classification is done by projecting a video, represented as a tensor, onto this space and classifying it using a geodesic distance measure. Unlike the space-time approach, this type of methodology shows much improved performance on datasets with large variations in illumination and scale. However, the classification is done on an entire video sequence and not on partial sequences or sub-sequences.
Apart from the concepts of 3D space-time shapes and dictionary-level features, where the concentration was on extracting local motion patterns, a different methodology to characterize an action sequence is to directly model the video frames with respect to time. Chin et al. performed an analysis on modelling the variation of the human silhouettes (binary images) with respect to time [3]. They explored different manifold learning techniques such as Neural Networks and Generalized Radial Basis Functions, but the learning technique required normalization of sequence length and temporal shifts. Saghafi et al. proposed an embedding technique for action sequences based on the spatio-temporal correlation distance, which is considered an efficient measure for ordered sequence comparisons [15]. Since the comparison is between certain points of two sequences, these key points or poses have to be determined, which requires a start- and end-point normalization between sequences of the same class.
The proposed technique in this paper uses a combination of histogram of flow (HOF) and local binary patterns (LBP) computed from the optical flow vectors as the action feature set, and finds, for each action class, a functional mapping from this feature space to a reduced-dimensional space that spans the inter-frame variations of the features.
III. PROPOSED METHODOLOGY
There are three main parts to the action recognition framework proposed in this paper. The first is motion feature extraction from the optical flow masked by a binary silhouette. The second part is the reduction of the dimensionality of the motion features to a space which spans the inter-frame variations of the corresponding feature set. The third part is finding a suitable functional mapping from the feature space to the reduced-dimensional space for each action class. A block diagram illustrating the action recognition framework is shown in Figure 1.
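The "Eigen space" used in the second part suggests a PCA-style reduction; the sketch below is a minimal illustration under that assumption, with the function names and the target dimension d being ours rather than the paper's.

```python
import numpy as np

def fit_eigenspace(F, d):
    """F: (num_frames, feature_dim) matrix of per-frame motion features.
    Returns the mean and the top-d directions spanning inter-frame variation."""
    mean = F.mean(axis=0)
    # SVD of the centered data yields the eigenvectors of the covariance.
    _, _, Vt = np.linalg.svd(F - mean, full_matrices=False)
    return mean, Vt[:d]

def project(F, mean, basis):
    """Map per-frame features to their low-dimensional Eigen-space trajectory."""
    return (F - mean) @ basis.T     # shape: (num_frames, d)
```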
A. Motion Representation using Histogram of Oriented Flow and Local Binary Flow Patterns
For each pixel in the body-segmented region, optical flow is computed between every two frames in a video sequence using the Lucas-Kanade algorithm, and the magnitude and direction are then computed from the flow vectors $\langle v_x, v_y \rangle$. The magnitude and direction flow images are divided into $K$ blocks, and at each block a histogram of the flow orientations is computed, where the magnitude of a particular pixel is placed in the corresponding orientation bin. On a block-wise level, these histograms provide information about the extent of movement of a part of the body on a local scale as well as the direction of the motion. Thus, these histograms represent the distribution of orientation in a local region, and these local distributions change during the course of an action sequence.
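As a rough illustration of this step, the sketch below computes silhouette-masked flow and block-wise, magnitude-weighted orientation histograms. It is a sketch rather than the authors' implementation: OpenCV's Farneback dense flow stands in for the per-pixel Lucas-Kanade flow used in the paper, and the grid size K and bin count B are illustrative assumptions.

```python
import cv2
import numpy as np

def hof_features(prev_gray, curr_gray, mask, K=4, B=8):
    """mask: binary silhouette (float 0/1, same size as the frames).
    Farneback flow is a stand-in for the paper's Lucas-Kanade flow."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vx, vy = flow[..., 0] * mask, flow[..., 1] * mask
    mag, ang = np.hypot(vx, vy), np.arctan2(vy, vx)   # magnitude, direction
    h, w = mag.shape
    feats = []
    for by in range(K):                               # K x K block grid (assumed)
        for bx in range(K):
            m = mag[by*h//K:(by+1)*h//K, bx*w//K:(bx+1)*w//K].ravel()
            a = ang[by*h//K:(by+1)*h//K, bx*w//K:(bx+1)*w//K].ravel()
            # Each pixel's magnitude is placed in its orientation bin.
            hist, _ = np.histogram(a, bins=B, range=(-np.pi, np.pi), weights=m)
            feats.append(hist)
    return np.concatenate(feats)                      # length K*K*B
```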
To capture the relationship between the flow vectors in a local neighborhood, an encoding which brings out the flow texture is required. Local binary patterns (LBP) are used to bring out this textural information of the flow vectors. The flow vectors form a two-channel image, so the LBP operator is applied separately to both channels, i.e., to the magnitude and to the direction. When computed on the magnitude, uniform magnitude patterns are generated, and when computed on the direction, uniform directional patterns are formed. Both these patterns are computed on the whole image, and so represent the flow texture on a global scale, where these global features also change during the course of the action sequence.
A sampling grid of (P, R) = (16, 2) is used for the LBP computation, where the uniform mapping reduces the number of labels from 65536 to 243 uniform labels, with the last label being assigned to all non-uniform patterns. In short, the LBP coding of the optical flow captures the relationship between the optical flow at a pixel and its corresponding neighborhood, while the HOF (histogram of flow) provides the orientation distribution in a local region. The combination of all these features put together represents our action feature set and is tested in the action recognition framework proposed in this paper. An illustration of the motion feature extraction is shown in Figure 2.
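A minimal sketch of this flow-texture step with the stated parameters: scikit-image's 'nri_uniform' mapping produces exactly the 243 labels mentioned above (242 uniform patterns plus one label collecting all non-uniform patterns). Applying it per channel and histogramming the labels is our assumption about how the global descriptor is assembled.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_flow_features(mag, ang, P=16, R=2):
    """mag, ang: flow magnitude and direction images (2-D float arrays)."""
    feats = []
    for channel in (mag, ang):
        # 'nri_uniform' labels: 242 uniform patterns + 1 non-uniform label = 243.
        codes = local_binary_pattern(channel, P, R, method='nri_uniform')
        hist, _ = np.histogram(codes, bins=243, range=(0, 243))
        feats.append(hist)
    return np.concatenate(feats)    # global flow-texture descriptor
```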
Fig. 2. Illustration of the motion feature extraction.
The general regression estimate of an output point for a test input $x$ is given by

$$\hat{y} = \frac{\sum_{i=1}^{N} y_i \,\mathrm{radbasis}(x - x_i)}{\sum_{i=1}^{N} \mathrm{radbasis}(x - x_i)} \qquad (3)$$
where $(y_i, x_i)$ are the training input/output pairs and $\hat{y}$ is the estimated point for the test input $x$. To get suitable training points in the feature space which correspond to notable transitions in the manifold, k-means clustering is done to obtain $L(m)$ clusters for each action class. The functional mapping for a particular action class $m$ can then be modeled by a general regression equation given as
$$\hat{y} = \frac{\sum_{i=1}^{L(m)} y_{i,m} \exp\!\left(-\dfrac{D_{i,m}^{2}}{2\sigma^{2}}\right)}{\sum_{i=1}^{L(m)} \exp\!\left(-\dfrac{D_{i,m}^{2}}{2\sigma^{2}}\right)}, \qquad D_{i,m}^{2} = (x - \bar{x}_{i,m})^{T}(x - \bar{x}_{i,m}) \qquad (4)$$
where $(y_{i,m}, \bar{x}_{i,m})$ are the $i$th cluster centers in the Eigen space and the input feature space, respectively. The standard deviation $\sigma$ of the radial basis function nodes for each action class in the network is taken as the median Euclidean distance between the corresponding action's cluster centers. The action class is determined by first projecting a consecutive set of $R$ frames onto the Eigen space. These projections, given by $y_r : 1 \le r \le R$, are compared with the projections $\hat{y}(m)_r$ of those frames estimated by each of the GRNN action models, using the Mahalanobis distance. The action model which gives the closest estimates of the projections determines the action class.
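A minimal sketch of this decision rule under the definitions above; Eq. (4) is implemented directly, while the container for the per-class models, the precomputed sigma values, and the covariance inverse used for the Mahalanobis distance are illustrative assumptions.

```python
import numpy as np

def grnn_predict(x, centers_m, y_m, sigma_m):
    """Eq. (4): RBF-weighted average of an action's Eigen-space cluster outputs."""
    d2 = np.sum((centers_m - x) ** 2, axis=1)       # squared distances D_{i,m}^2
    w = np.exp(-d2 / (2.0 * sigma_m ** 2))          # Gaussian radial basis weights
    return (w[:, None] * y_m).sum(axis=0) / w.sum()

def classify(X_frames, Y_frames, models, cov_inv):
    """X_frames: features of R consecutive frames; Y_frames: their Eigen-space
    projections. models: {action m: (centers_m, y_m, sigma_m)} (assumed layout).
    Returns the action whose GRNN estimates are closest in Mahalanobis distance."""
    best, best_cost = None, np.inf
    for m, (centers_m, y_m, sigma_m) in models.items():
        cost = 0.0
        for x, y in zip(X_frames, Y_frames):
            e = y - grnn_predict(x, centers_m, y_m, sigma_m)
            cost += float(e @ cov_inv @ e)          # squared Mahalanobis distance
        if cost < best_cost:
            best, best_cost = m, cost
    return best
```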
IV. EXPERIMENTAL RESULTS
[Figs. 3 and 4: confusion matrix over the ten action classes a1-a10.]
1st (Class) Best    Median to actions
2.5 (Walk)          3.94
1.87 (Walk)         3.64
1.81 (Walk)         3.82
2.89 (Walk)         4.09
2.23 (Walk)         3.82
1.89 (Walk)         3.66
1.88 (Walk)         2.6249
1.88 (Walk)         3.63
2.14 (Walk)         3.88
1.85 (Walk)         3.54
TABLE III.
Test Seq    1st (Class) Best    2nd (Class) Best    Median to actions
Dir 0       1.76 (Walk)         2.34 (Skip)         3.94
Dir 9       1.69 (Walk)         2.31 (Skip)         3.64
Dir 18      1.73 (Walk)         2.26 (Skip)         3.82
Dir 27      1.73 (Walk)         2.32 (Skip)         4.09
Dir 36      1.77 (Walk)         2.32 (Skip)         3.82
Dir 45      1.77 (Walk)         2.20 (Skip)         3.66
Dir 54      1.77 (Walk)         2.11 (Skip)         2.6249
Dir 63      1.96 (Walk)         2.31 (Skip)         3.63
Dir 72      2.29 (Walk)         2.49 (Skip)         3.88
Dir 81      2.69 (Side)         2.80 (Skip)         3.54
and hence the optical flow motion pattern does not have the variation associated with a walk action.
V. CONCLUSIONS
REFERENCES