
2013 IEEE International Conference on Systems, Man, and Cybernetics

Regression based Learning of Human Actions from Video using HOF-LBP Flow Patterns

Binu M Nair and Dr. Vijayan K Asari
UD Vision Lab
Electrical and Computer Engineering, University of Dayton
Dayton, OH, USA
nairb1@udayton.edu, vasari1@udayton.edu
Abstract: A human action recognition framework is proposed which models the motion variations corresponding to a particular class of actions without the need for sequence length normalization. The motion descriptors used in this framework are based on the optical flow vectors computed at every point on the silhouette of the human body. A histogram of flow (HOF) is computed from the optical flow vectors, and these histograms give the motion orientation in a local neighborhood. To get a relationship between the motion vectors at a particular instant, the magnitude and direction of the optical flow vector are coded with local binary patterns (LBP). The concatenation of these histograms (HOF-LBP) is considered as the action feature set to be used in the proposed framework. We illustrate that this motion descriptor is suitable for classifying various human actions when used in conjunction with the proposed action recognition framework, which models the motion variations in time for each class using regression based techniques. The feature vectors extracted from the training set are suitably mapped to a lower dimensional space using Empirical Orthogonal Function Analysis. A regression based technique, the Generalized Regression Neural Network (GRNN), is used to compute the functional mapping from the action feature vectors to their reduced Eigenspace representation for each class, thereby obtaining separate action manifolds. The feature set obtained from a test sequence is compared with each of the action manifolds by comparing the test coefficients with the ones corresponding to the manifold (as estimated by the GRNN), and the class is determined using the Mahalanobis distance.

Keywords: Generalized Regression Neural Networks, Action Recognition, Histogram of Flow, Local Binary Patterns

I. INTRODUCTION

In recent years, a lot of emphasis has been placed on gesture recognition due to its potential in a variety of applications in security and surveillance, such as monitoring behavioral patterns of detected individuals to identify threatening behavior, and in consumer applications implemented on computers and smart phones for touchless interfaces. However, human action recognition remains an open problem with many different approaches, such as extracting 3D space time features, dictionary based approaches for characterizing local features, tracking of body parts across time, etc. The approach presented here is to learn and model suitable action manifolds corresponding to common gestures in a manner which does not require sequence length normalization.
Firstly, we define a suitable motion descriptor based on the combination of the histogram and the local binary patterns

computed from optical flow, and use the combination of these histogram-based features to represent a particular motion instance of the human body during an action sequence. The action recognition framework proposed in the paper defines an action manifold from the motion descriptors of the extracted silhouettes for each action class. In other words, a functional mapping from the descriptor space to the lower dimensional Eigen space is computed and learned for each action through regression-based techniques. The paper is divided into the following sections: Section II briefly reviews some of the recent works in the field, Section III describes the motion descriptor used in the proposed action recognition framework, and Section IV presents the accuracies of individual actions obtained with the Weizmann database and evaluates the algorithm for robustness to deformities as well as to view-point changes.
II. RELATED WORK

Some of the early works in the human action recognition area include the concept of space time shapes, which are concatenated silhouettes over a set of frames, used to extract features corresponding to the variation within the spatio-temporal space. Gorelick et al. [6] modelled the variation within the space time shape using Poisson's equation and extracted space time structures which provide discriminatory features. Wang et al. recognized human activities using a derived form of the Radon transform known as the R-Transform [19], [20]. Nair et al. introduced a 3D shape descriptor which is a combination of a 3D distance transform with the R-Transform; this represents a space time shape at multiple levels and is used as the corresponding action feature set [12].
Action sequences can also be represented as a collection of spatio-temporal words, with each word corresponding to a certain set of space-time interest points which are detected by a set of spatial Gaussian filters and temporal 1D Gabor filters [14]. Here, Niebles et al. compute the probability distributions of the spatio-temporal words corresponding to each class of human action using a probabilistic Latent Semantic Analysis model. Another algorithm which is similar to the former is given by Batra et al., where a dictionary of mid-level features called space time shapelets is created which characterizes the local motion patterns within a space time shape, thereby representing an action sequence as a histogram of the space
time shapelets over the trained dictionary [2].

Fig. 1. Block diagram of human action recognition framework.

A 3D gradient-based shape descriptor representing the local variations was introduced by Klaser et al. [9] and is based on the 2D HOG
descriptor used for human body detection [4], [5]. Here, each
space time shape is divided into cubes where in each cube, the
histogram is computed from the spatial and temporal gradients.
Another approach to gesture recognition is to model
the non-linear dynamics of the human action by tracking
the trajectories of certain points in the body and to capture
certain properties unique to those trajectories. Ali et al. used
the concepts from Chaos Theory to reconstruct the phase
space from each of the trajectories and compute the dynamic
and metric invariants which are then used as action feature
vectors [1]. This method will be affected by partial occlusions, as some trajectories may be missing, which may affect the metrics extracted. Kaaniche et al. used the 2D Histogram of
Gradients as a feature descriptor to track corner points on
the image frame [7]. By matching textured regions using this
feature descriptor and tracking those points using a Kalman filter, local motion descriptors are extracted and used offline to learn a set of gestures. Scovanner et al. used a 3D-SIFT
to represent spatio-temporal words in a bag of words model
representation of action videos [16]. Sun et al. extended the
above methodology which combined local descriptors based
on SIFT features and holistic moment-based features [18]. The local features comprise the 2D SIFT and 3D SIFT features
computed from suitable interest points and the holistic features
are the Zernike moments computed from motion energy images
and motion history images. The approach taken there assumes that the scene is static, as it relies on the difference frame to get suitable interest points.
A different approach for characterizing human action sequences is to consider these sequences as multi-dimensional
arrays called tensors. Kim et al. presented a new framework called Tensor Canonical Correlation Analysis, where descriptive similarity features between two video volumes are used in a nearest neighbor classification scheme for recognition [8]. Lui et al., however, studied the underlying geometry of the tensor space occupied by human action sequences and performed factorization on this space to obtain product manifolds [10]. Classification is done by projecting a video or a tensor onto this space and classifying it using a geodesic distance measure. Unlike the space time approaches, this type of methodology shows much improved performance on datasets with large variations in illumination and scale. However, the classification is done on an entire video sequence and not on partial or sub-sequences.
Apart from the concepts of 3D space time shapes and dictionary level features, where the concentration was on extracting local motion patterns, a different methodology to characterize an action sequence is to directly model the video frame with respect to time. Chin et al. performed an analysis on modelling the variation of the human silhouettes (binary images) with respect to time [3]. They explored different manifold learning techniques such as Neural Networks and Generalized Radial Basis Functions, but the learning technique required normalization of sequence length and temporal shifts. Saghafi et al. proposed an embedding technique for action sequences based on the spatio-temporal correlation distance, which is considered an efficient measure for ordered sequence comparisons [15]. Since the comparison is between certain points of two sequences, these key points or poses have to be determined, which would require a start and end-point normalization between sequences of the same class.
The proposed technique in this paper uses a combination of Histogram of Flow (HOF) and local binary patterns (LBP) computed from the optical flow vectors as the action feature set and finds an underlying function which captures the temporal variance of these features. The motivation behind such a technique is to remove the need for normalization of the start and end points as well as the need for sequence length normalization [13]. The next section describes the proposed action recognition framework.
III. PROPOSED METHODOLOGY

There are three main parts to the action recognition framework proposed in this paper. The first is motion feature extraction from the optical flow masked by a binary silhouette. The second part is the reduction of the dimensionality of the motion features to a space which spans the inter-frame variations of the corresponding feature set. The third part is finding a suitable functional mapping from the feature space to the reduced dimensional space for each action class. A block diagram illustrating the action recognition framework is shown in Figure 1.
A. Motion Representation using Histogram of Oriented Flow and Local Binary Flow Patterns
For each pixel in the segmented body region, optical flow is computed between every two frames in a video sequence using the Lucas-Kanade algorithm, and the magnitude and direction are then computed from the flow vectors (v_x, v_y). The magnitude and direction flow images are divided into K blocks and, for each block, a histogram of the flow orientations is computed where the magnitude of a particular pixel is placed in the corresponding orientation bin. On a block-wise level, these histograms provide information about the extent of movement of a part of the body on a local scale as well as the direction of the motion. Thus, these histograms represent the distribution of orientation in a local region, and these local distributions change during the course of an action sequence. To get a relationship between the flow vectors in a local neighborhood, an encoding which brings out the flow texture is required. Local binary patterns are used to bring out this textural information of the flow vectors. But the flow field is a two-channel image, so the LBP operator is applied to both channels, i.e., separately to the magnitude and the direction. When computed on the magnitude, uniform magnitude patterns are generated, and when computed on the direction, uniform directional patterns are formed. Both these patterns are computed on the whole image and so represent the flow texture on a global scale, where these global features also change during the course of the action sequence.
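As a concrete illustration, a minimal sketch of the block-wise HOF computation is given below; it assumes the dense flow components and the silhouette mask are already available (e.g., from the Lucas-Kanade step), and the grid size and bin count are illustrative choices rather than values taken from the paper.

    import numpy as np

    def hof_descriptor(vx, vy, mask, grid=(4, 4), bins=9):
        """Block-wise histogram of flow (HOF): each pixel votes its flow
        magnitude into the orientation bin of its flow direction."""
        mag = np.hypot(vx, vy) * (mask > 0)          # ignore flow outside the silhouette
        ang = np.mod(np.arctan2(vy, vx), 2 * np.pi)  # direction in [0, 2*pi)
        H, W = mag.shape
        bh, bw = H // grid[0], W // grid[1]
        feats = []
        for r in range(grid[0]):
            for c in range(grid[1]):
                m = mag[r*bh:(r+1)*bh, c*bw:(c+1)*bw].ravel()
                a = ang[r*bh:(r+1)*bh, c*bw:(c+1)*bw].ravel()
                hist, _ = np.histogram(a, bins=bins, range=(0, 2*np.pi), weights=m)
                feats.append(hist / (hist.sum() + 1e-8))   # per-block normalization
        return np.concatenate(feats)                       # K * bins dimensional HOF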
A sampling grid (P, R) = (16, 2) is used for the LBP computation, where the uniform mapping reduces the labels from 65536 to 243 uniform labels, with the last label being assigned to the non-uniform patterns. In short, the LBP coding of the optical flow encodes the relationship between the optical flow at a pixel and its corresponding neighborhood, while the HOF (histogram of flow) provides the orientation distribution in a local region. The combination of all these features put together represents our action feature set and is tested in the action recognition framework proposed in this paper. An illustration of the motion feature extraction is shown in Figure 2.

Fig. 2. Illustration of feature extraction.
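A corresponding sketch of the flow-texture coding is shown below, using scikit-image's LBP implementation as a stand-in; its non rotation-invariant uniform mapping ('nri_uniform') produces exactly the P(P-1)+3 = 243 labels mentioned above for P = 16. The 8-bit quantization of each channel is an implementation choice, not something specified in the paper.

    import numpy as np
    from skimage.feature import local_binary_pattern

    def lbp_flow_patterns(mag, ang, P=16, R=2):
        """Uniform LBP histograms of the flow magnitude and direction images."""
        n_labels = P * (P - 1) + 3                   # 243 labels for P = 16
        hists = []
        for channel in (mag, ang):
            c = channel - channel.min()              # quantize channel to 8 bits
            c = np.uint8(255 * c / (c.max() + 1e-8))
            codes = local_binary_pattern(c, P, R, method='nri_uniform')
            hist, _ = np.histogram(codes, bins=n_labels, range=(0, n_labels))
            hists.append(hist / (hist.sum() + 1e-8))
        # global flow texture: magnitude patterns followed by direction patterns
        return np.concatenate(hists)

The final per-frame action feature is then the concatenation of the block-wise HOF histograms and the two global LBP histograms.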

B. Computation of Reduced Posture Space using PCA


To perform a regression analysis on the extracted action feature set, a set of output variables corresponding to the action features in a frame is needed. The regression of these features is done so as to obtain a function which characterizes the variation of the action features with respect to time. The action feature set corresponds to the regressors, a D-dimensional vector, and we need a set of response variables of dimension d whose variation is due to the change of the regressors over the flow of the action sequence. Choosing time as a response variable would require initialization of the start and end points of an action as well as normalizing them with respect to all the action sequences. Moreover, the same actions performed by different people vary in speed, and thus the time instant of the same action would differ, which would require sequence length normalization. In this methodology, we propose this regression based framework to counter these normalizations, so that the action features extracted at every frame of a sequence can be considered as points on a particular manifold. In short, an action sequence (either partial or complete) is considered as a sample from that particular action model. The goal is then to find the appropriate manifold for a given test sample sequence (partial or complete). The selection of the response variables should be such as to bring out the variations occurring in the regressor variables (the action feature set) due to time (the flow of the action sequence). One method which has been used to treat multi-variate time series data is Principal Component Analysis or Empirical Orthogonal Function Analysis, in which a time series is represented as a linear combination of time-independent orthogonal basis functions with time-varying amplitudes [11]. Essentially, it is just Principal Component Analysis applied to the time series data, with the Eigen vectors forming the orthogonal time invariant functions and the corresponding projections the time variant amplitudes. This kind of analysis has been widely used in atmospheric science to evaluate time varying climatic data such as wind pressure. In accordance with EOF analysis, if there is a time series P(t) = [p_1(t) p_2(t) ... p_M(t)]^T with dimensionality M, observed at times t_1, t_2, t_3, ..., t_N, then
p_m(t_i) = \sum_{k=1}^{M} Y_{km} \, Q_k(t_i)    (1)

where Y_{km} are the time-independent basis functions (EOFs), which are orthogonal in space, and Q_k(t_i) are the time-dependent coefficients, which are orthogonal in time. This analysis forms the basis for our regression modeling, where the response variables for the action feature set will be the corresponding time-dependent coefficients.
So, at this step in the framework, an Eigen space is computed which represents the inter-frame variation of the action feature set. The learning stage involves a collection of all the possible action features at all instances per action, forming an action space denoted by S_D(m) and expressed mathematically as

S_D(m) = \{ x_k : 1 \le k \le K(m) \}    (2)

where K(m) is the number of frames taken over all the training video sequences of action m out of M action classes, and x_k is the corresponding action feature set of dimension D x 1. The reduced action or Eigen space is obtained by applying Principal Component Analysis to get the Empirical Orthogonal Functions (EOF), which are none other than the Eigen vectors v_1, v_2, ..., v_d obtained by SVD decomposition. In other words, the principal components of the matrix X X^T, where X = [x_1 x_2 x_3 ... x_{K(m)}], are computed for the action class m to get the time-independent basis vectors, and the projection of the action feature set X onto these vectors gives a set of time-varying response variables Y = [y_1 y_2 y_3 ... y_{K(m)}], with y_k having dimension d x 1.
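The per-class Eigen space computation can be sketched as follows; it is plain PCA via an SVD of the class's feature matrix, with mean-centering added as a standard assumption that the text does not spell out.

    import numpy as np

    def action_eigenspace(X, d):
        """EOF/PCA for one action class.
        X : D x K matrix whose columns are the features of all training frames
            of the class (the action space S_D(m)).
        Returns the d leading Eigen vectors V (D x d), the time-varying
        coefficients Y (d x K) and the mean feature vector."""
        mean = X.mean(axis=1)
        Xc = X - mean[:, None]
        U, S, _ = np.linalg.svd(Xc, full_matrices=False)  # Eigen vectors of X X^T
        V = U[:, :d]                                      # time-independent EOFs
        Y = V.T @ Xc                                      # time-dependent coefficients
        return V, Y, mean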
C. Modeling of Action Manifolds using Generalized Regression Neural Networks
In this paper, regression based techniques are used to find a functional mapping from the set of regressors (the action feature set x_k) to the response variables. Different regression models can be used, such as a multivariate linear regression model, a polynomial regression model, neural networks, etc. A linear regression model uses least squares or weighted least squares to obtain the set of coefficients which best describes that particular mapping or model. But since the time varying nature of the response variables, which in this case are the amplitudes of the Eigen vectors, is inherently non-linear, any model or mapping which is linear in the set of regressors will not be sufficient and the mapping would be improper. A solution to this is to use a polynomial regression model, which is linear in its weights but non-linear in its basis functions (suitable powers of the feature set variables). The issue with this model is that a large number of basis functions are required due to the high dimensionality of the regressors. A multi-layer neural network is not considered as it takes a very long time to train such a network to model the functional mapping. Moreover, the weights obtained from training are not robust, as they depend heavily on whether the global minimum of the error surface is reached; the weights sometimes get stuck at a local minimum.
The Radial Basis Function Network (RBFN) is suitable for the proposed action recognition framework as it involves a faster, one-pass training scheme which in turn provides fast convergence to the optimal regression surface. A variant of the radial basis function network known as the Generalized Regression Neural Network (GRNN) [17] is used, which differs from the former in the linear layer. However, the number of radial basis nodes is essentially the same as the number of input vectors, which in this algorithm is large. To reduce the large number of nodes, which is memory intensive, k-means clustering is applied on the input feature set and the cluster centers are then used as the radial basis function nodes in the input layer.
The mapping from the action feature space (D x 1) to the Eigen space (d x 1) can be represented as S_D(m) \rightarrow S_d(m), where S_d(m) = \{y_k\}. In this framework, the aim is to model the mapping from the action feature space to the reduced Eigen space for each action m separately using the Generalized Regression Neural Network [17]. The network models an equation of the form
\hat{y} = \frac{\sum_{i=1}^{N} y_i \, \mathrm{radbasis}(\|x - x_i\|)}{\sum_{i=1}^{N} \mathrm{radbasis}(\|x - x_i\|)}    (3)
where (y_i, x_i) are the training input/output pairs and \hat{y} is the estimated point for the test input x. To get suitable training points in the feature space which correspond to notable transitions in the manifold, k-means clustering is done to get L(m) clusters for each action class. So, the functional mapping for a particular action class m can be modeled by a general regression equation given as


\hat{y} = \frac{\sum_{i=1}^{L(m)} y_{i,m} \exp\left(-\frac{D_{i,m}^2}{2\sigma^2}\right)}{\sum_{i=1}^{L(m)} \exp\left(-\frac{D_{i,m}^2}{2\sigma^2}\right)}, \qquad D_{i,m}^2 = (x - \bar{x}_{i,m})^T (x - \bar{x}_{i,m})    (4)
where (y_{i,m}, \bar{x}_{i,m}) are the i-th cluster centers in the Eigen space and the input feature space respectively. The standard deviation \sigma of the radial basis function nodes for each action class is taken as the median Euclidean distance between that action's cluster centers. The action class is determined by first projecting a consecutive set of R frames onto the Eigen space. These projections, given by \{y_r : 1 \le r \le R\}, are compared with the projections \hat{y}(m)_r of those frames estimated by each of the GRNN action models using the Mahalanobis distance. The action model which gives the closest estimates of the projections determines the action class.
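A compact sketch of the per-class GRNN of Eq. (4) and of the classification step is given below. The class and function names are illustrative, and two details are assumptions rather than statements from the paper: empty k-means clusters are not handled, and the covariance used in the Mahalanobis distance is taken from the class's training projections.

    import numpy as np
    from scipy.cluster.vq import kmeans2
    from scipy.spatial.distance import cdist

    class GRNNActionModel:
        """One action manifold: GRNN over k-means cluster centers (Eq. 4)."""

        def __init__(self, X, Y, n_clusters=10):
            # X: K x D action features of one class, Y: K x d Eigen-space projections
            self.cx, labels = kmeans2(X.astype(float), n_clusters, minit='points')
            # output-space cluster centers: mean projection of each cluster
            self.cy = np.array([Y[labels == i].mean(axis=0) for i in range(n_clusters)])
            # sigma: median Euclidean distance between this class's cluster centers
            d = cdist(self.cx, self.cx)
            self.sigma = np.median(d[np.triu_indices_from(d, k=1)])
            # covariance of the training projections (assumption, see text)
            self.icov = np.linalg.pinv(np.cov(Y.T))

        def predict(self, x):
            D2 = ((self.cx - x) ** 2).sum(axis=1)
            w = np.exp(-D2 / (2.0 * self.sigma ** 2))
            return (w[:, None] * self.cy).sum(axis=0) / (w.sum() + 1e-12)

    def classify(frame_feats, eigenspaces, models):
        """Assign R consecutive frames to the action model whose GRNN estimates
        are closest, in the Mahalanobis sense, to the frames' own projections.
        eigenspaces: per-class (V, mean) pairs from the earlier PCA sketch."""
        best_m, best_d = None, np.inf
        for m, ((V, mean), grnn) in enumerate(zip(eigenspaces, models)):
            dists = []
            for x in frame_feats:                       # x: D-dimensional feature
                y = V.T @ (x - mean)                    # test projection
                diff = y - grnn.predict(x)              # GRNN manifold estimate
                dists.append(np.sqrt(diff @ grnn.icov @ diff))
            if np.mean(dists) < best_d:
                best_m, best_d = m, float(np.mean(dists))
        return best_m, best_d

Given the outputs of the earlier Eigen space sketch, a class model would be built as GRNNActionModel(X_m.T, Y_m.T, n_clusters=10), i.e., with frames as rows.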
IV. EXPERIMENTAL RESULTS

The action recognition framework, together with the feature set computed from the histogram and local binary patterns of the optical flow vectors, is tested on the widely available Weizmann action dataset. This dataset contains videos captured at around 50 frames/sec and covers 10 human actions: a1-bend, a2-jump in place, a3-jumping jack, a4-jump forward, a5-run, a6-sideways movement, a7-waving single hand, a8-skipping, a9-waving two hands and a10-walking.
A. Accuracy of algorithm
For evaluating the algorithm, a leave-9-out strategy is used for testing, where out of the 9 action sequences, 8 are treated as the validation set for the GRNN model: these sequences are used in the Eigen space computation but not in the training of the GRNN action models. The remaining sequence is taken as the test data, which is neither used in the Eigen space computation nor in the training of the GRNN models. The accuracies obtained with this strategy are given in the form of a confusion matrix shown in Table I. The accuracies are reported on partial sequences of 15 frames extracted from the test sequence. The partial sequences of a corresponding test sequence contain an overlap of over 10 frames. Experiments have been conducted using different numbers of clusters K, but not much variation in accuracy occurs, and so we use a fixed number of clusters (K = 10) in our analysis.
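For concreteness, the partial-sequence windowing can be sketched as below; a 15-frame window with a 10-frame overlap (a 5-frame stride) is assumed here, which is one reading of the overlap stated above.

    def partial_sequences(seq_feats, length=15, overlap=10):
        """Split a test sequence's per-frame features into overlapping windows."""
        step = length - overlap
        return [seq_feats[s:s + length]
                for s in range(0, len(seq_feats) - length + 1, step)]

    # each window is then classified independently, e.g.
    # labels = [classify(w, eigenspaces, models)[0] for w in partial_sequences(feats)]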
TABLE I. CONFUSION MATRIX FOR WEIZMANN DATASET (rows: true class, entries in %)

        a1    a2    a3    a4    a5    a6    a7    a8    a9   a10
a1     100
a2       3    75    22
a3            100
a4                   88
a5                         93
a6                               78          21
a7                                    100
a8                                           100
a9                                                  99
a10                                                       100

Fig. 3. Test for Robustness dataset.

Fig. 4. Test for View Invariance dataset.

The action models corresponding to a1 (bend), a3 (jumping jack), a7 (waving one hand), a8 (skip), a9 (waving two hands) and a10 (walk) show very good accuracies with the current framework. Some actions such as a2 (jump in place) and a6 (sideways) have an accuracy of around 75%. The remaining instances got misclassified as a3 (jumping jack) in the case of a2 and as a8 (skip) in the case of a6. This may be due to the similar nature of movement between the two action classes, arising from common instances of the same body posture. One of the major advantages of this algorithm is that only a rough silhouette mask is required, which eliminates the need for a very good background segmentation technique; a coarse segmentation suffices. Moreover, this framework can work for both complete sequences and partial sequences. However, the robustness of the action feature set will depend on how well the optical flow vectors match between consecutive frames. We run the RANSAC algorithm on each block to remove any mismatches on every frame of the action sequence before doing any analysis.
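The paper does not detail this RANSAC step; the sketch below shows one plausible reading, in which a dominant translation is hypothesized for each block and flow vectors that disagree with it beyond a tolerance are zeroed out.

    import numpy as np

    def ransac_block_filter(vx, vy, n_iter=50, tol=1.0, seed=None):
        """Keep only the flow vectors of a block that agree with the block's
        dominant translation (a simple consensus model, assumed here)."""
        rng = np.random.default_rng(seed)
        v = np.stack([vx.ravel(), vy.ravel()], axis=1)
        best = np.zeros(len(v), dtype=bool)
        for _ in range(n_iter):
            cand = v[rng.integers(len(v))]                  # random hypothesis
            inliers = np.linalg.norm(v - cand, axis=1) < tol
            if inliers.sum() > best.sum():
                best = inliers
        keep = best.reshape(vx.shape)
        return vx * keep, vy * keep                         # zero out mismatches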

B. Test for Robustness

The framework is also tested for robustness and view invariance using the same database. The test for robustness consists of the walk action sequence under varying conditions of occlusion and body shape changes. The robustness sequences are illustrated in Figure 3 and the corresponding classification is shown in Table II.
TABLE II. TEST FOR ROBUSTNESS/DEFORMITY

Test Seq            1st Best (Class)    2nd Best (Class)    Median to actions
Swing Bag           2.5 (Walk)          3.09 (Skip)         3.94
Carry Briefcase     1.87 (Walk)         2.17 (Skip)         3.64
Walk with dog       1.81 (Walk)         2.34 (Skip)         3.82
Knees Up            2.89 (Walk)         3.27 (Side)         4.09
Limping Man         2.23 (Walk)         2.93 (Skip)         3.82
Sleepwalking        1.89 (Walk)         2.13 (Skip)         3.66
Occluded Legs       1.88 (Walk)         2.59 (Skip)         2.6249
Normal Walk         1.88 (Walk)         2.62 (Skip)         3.63
Occluded by pole    2.14 (Walk)         2.94 (Skip)         3.88
Walk in Skirt       1.85 (Walk)         2.15 (Skip)         3.54

In all variants of the walk action sequence, the algorithm correctly classified the walk sequence, with some variants showing a much closer distance to the correct walk action manifold than to the other action manifolds. The median distance to the first best estimated class differs considerably from the median distance to the second best and to the other action manifolds. In other words, the proposed algorithm is not affected by deformities in the human body shape and is robust to such changes.

C. Test for View Invariance

The algorithm is also tested using the same database for the same action, walk, but with different view points. Similar to the test for robustness, the first best and second best action classes for the view-varying walk sequences, illustrated in Figure 4, are estimated and summarized in Table III. The algorithm classifies most of the walk action sequences with varying viewpoint accurately, with minimum median distance to the correct walk action manifold, except for the view point where the person in the action sequence is oriented at 81 degrees (Dir 81). The misclassification for the view-point direction of 81 degrees is due to the optical flow not capturing enough motion in the body as the person is moving towards the camera. In other words, with a single view camera, the motion in the local regions does not exhibit much translation when the person moves towards the camera, and hence the optical flow motion pattern does not have the variation associated with a walk action.


TABLE III. TEST FOR VIEW INVARIANCE

Test Seq    1st Best (Class)    2nd Best (Class)    Median to actions
Dir 0       1.76 (Walk)         2.34 (Skip)         3.94
Dir 9       1.69 (Walk)         2.31 (Skip)         3.64
Dir 18      1.73 (Walk)         2.26 (Skip)         3.82
Dir 27      1.73 (Walk)         2.32 (Skip)         4.09
Dir 36      1.77 (Walk)         2.32 (Skip)         3.82
Dir 45      1.77 (Walk)         2.20 (Skip)         3.66
Dir 54      1.77 (Walk)         2.11 (Skip)         2.6249
Dir 63      1.96 (Walk)         2.31 (Skip)         3.63
Dir 72      2.29 (Walk)         2.49 (Skip)         3.88
Dir 81      2.69 (Side)         2.80 (Skip)         3.54



V. CONCLUSIONS

In this paper, a framework for recognizing actions is presented which works on partial video sequences. Its main features are the removal of the start and end point normalization for action sequences and the invariance to the speed of the action being performed. The start and end points of a sequence need not be considered when modeling the mapping from the feature space to the Eigen space, due to the inherent comparison of manifolds using Generalized Regression Neural Networks. Moreover, as the results show, the algorithm (the framework and the motion descriptors) is robust with respect to deformities in the shape and is also invariant to the view point of the person in the action sequence. In future, a more detailed study will be done on the motion feature obtained by the computation of HOF and LBP from the optical flow, and on the correlation between these motion features and the flow vectors. Also, the effect of a mismatch in the optical flow vectors, and how the Eigen space gets affected by it, will be researched.


REFERENCES

[1] S. Ali, A. Basharat, and M. Shah. Chaotic invariants for human action recognition. In IEEE 11th International Conference on Computer Vision, ICCV 2007, pages 1-8, Oct. 2007.
[2] D. Batra, T. Chen, and R. Sukthankar. Space-time shapelets for action recognition. In IEEE Workshop on Motion and Video Computing, WMVC 2008, pages 1-6, Jan. 2008.
[3] T.-J. Chin, L. Wang, K. Schindler, and D. Suter. Extrapolating learned manifolds for human activity recognition. In IEEE International Conference on Image Processing, ICIP 2007, volume 1, pages 381-384, October 2007.
[4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, volume 1, pages 886-893, June 2005.
[5] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In European Conference on Computer Vision, May 2006.
[6] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247-2253, December 2007.
[7] M. B. Kaaniche and F. Bremond. Tracking HOG descriptors for gesture recognition. In Proceedings of the 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 09, pages 140-145, 2009.
[8] T.-K. Kim, S.-F. Wong, and R. Cipolla. Tensor canonical correlation analysis for action classification. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pages 1-8, June 2007.
[9] A. Klaser, M. Marszalek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. In Proceedings of the British Machine Vision Conference (BMVC 2008), pages 995-1004, September 2008.
[10] Y. M. Lui, J. Beveridge, and M. Kirby. Action classification on product manifolds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2010, pages 833-839, June 2010.
[11] A. Monahan, J. Fyfe, M. Ambaum, D. Stephenson, and G. North. Empirical orthogonal functions: The medium is the message. Journal of Climate, 22:6501, 2009.
[12] B. M. Nair and V. K. Asari. Action recognition based on multi-level representation of 3D shape. In VISAPP, pages 378-386. SciTePress, 2011.
[13] B. M. Nair and V. K. Asari. Time invariant gesture recognition by modelling body posture space. In Proceedings of the 25th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems: Advanced Research in Applied Artificial Intelligence, IEA/AIE 2012, pages 124-133. Springer-Verlag, 2012.
[14] J. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. In British Machine Vision Conference, BMVC 2006, 2006.
[15] B. Saghafi and D. Rajan. Human action recognition using pose-based discriminant embedding. Signal Processing: Image Communication, 27(1):96-111, 2012.
[16] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In Proceedings of the International Conference on Multimedia (MultiMedia 07), pages 357-360, September 2007.
[17] D. Specht. A general regression neural network. IEEE Transactions on Neural Networks, 2(6):568-576, Nov. 1991.
[18] X. Sun, M. Chen, and A. Hauptmann. Action recognition via local descriptors and holistic features. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009, pages 58-65, June 2009.
[19] S. Tabbone, L. Wendling, and J. Salmon. A new shape descriptor defined on the Radon transform. Computer Vision and Image Understanding, 102:42-51, 2006.
[20] Y. Wang, K. Huang, and T. Tan. Human activity recognition based on R transform. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pages 1-8, June 2007.
