
CS-772A

PROBABILISTIC MACHINE LEARNING


INDIAN INSTITUTE OF TECHNOLOGY, KANPUR
20th APRIL, 2016

TEMPORAL GESTURE RECOGNITION USING HMM AND DYNAMIC TIME WARPING
INSTRUCTOR: Prof. Piyush Rai

By
Abhinav Jain - 13022
Rishab Jain - 13567
Peshal Agarwal - 13472

Abstract
The work presented in this report examines machine learning algorithms through the lens of gesture recognition using image classification techniques. Gesture recognition is not a novel problem; it has been actively studied for the past couple of decades and continues to motivate researchers to explore further. Although image classification has already been addressed with a very large number of algorithms, effective solutions are still being developed. Our work primarily focuses on studying these algorithms and building a comparative model that serves as a practical tool for keeping up with advances in the field of image classification. This elementary but effective approach aims to develop a deeper understanding of the requirements and challenges of analysing these algorithms in real-time and creative problem domains.
The primary contributions of this work include (1) a survey of contemporary algorithms, including the Hidden Markov Model and Dynamic Time Warping; (2) implementation of variants of HMM and Support Vector Machines for temporal and static gesture recognition; and (3) a comparison of the aforementioned algorithms through the survey and the results obtained after implementing a few of them.

Objective
Our aim is to classify a sequence of gestures into their corresponding classes. For this purpose we use a Hidden Markov Model with Gaussian outputs. We have also reviewed other algorithms available for gesture recognition, such as the 2D-HMM and Dynamic Time Warping.

Dynamic Time Warping:


Dynamic Time Warping is an algorithm that can compute the similarity between two time-series, even if the lengths
of the time-series do not match.

How DTW works:


Given two one-dimensional time-series x = {x_1, x_2, ..., x_|x|}^T and y = {y_1, y_2, ..., y_|y|}^T, with respective lengths |x| and |y|, construct a warping path w = {w_1, w_2, ..., w_|w|}^T so that |w|, the length of w, satisfies:

max{|x|, |y|} <= |w| < |x| + |y|

where the kth element of w is given by:

w_k = (i, j)

i.e. a pair of indices matching the ith point of x with the jth point of y. A number of constraints are placed on the warping path:

- The warping path must be continuous, and must start at w_1 = (1, 1) and end at w_|w| = (|x|, |y|).
- The warping path must exhibit monotonic behaviour, i.e. the warping path cannot move backwards.

The warping path that needs to be found is the one that minimizes the normalised total warping cost:

DTW(x, y) = min_w [ (1/|w|) * sum_{k=1}^{|w|} DIST(w_k) ]

where DIST(.) is the distance between point i in time-series x and point j in time-series y, as given by w_k.
The minimum total warping path can be found by using dynamic programming to fill a two-dimensional (|x| by |y|)
cost matrix C. Each cell in the cost matrix represents the accumulated minimum warping cost so far in the warping
between the time-series x and y up to the position of that cell. The value of the cell C(i, j) is given by:

C(i, j) = DIST(x_i, y_j) + min{ C(i-1, j-1), C(i-1, j), C(i, j-1) }

which is the distance between point i in the time-series x and point j in the time-series y, plus the minimum accumulated distance from the three previous cells that neighbour that cell.
When the cost matrix has been filled, the minimum possible warping path can easily be found by navigating through the cost matrix in reverse order, starting at C(|x|, |y|), until cell C(1, 1) has been reached. At each step, the cells to the left of, above, and diagonal to the current cell are searched for the minimum value; the path moves to that cell, and the three-cell search is repeated until C(1, 1) has been reached. The resulting warping path gives the minimum normalised total warping distance between x and y.
However, in applications such as gesture recognition, where the data have multiple dimensions, the common approach is to take the sum of the distance errors between each dimension of an N-dimensional template and the new N-dimensional time-series. The total distance across all N dimensions is then used to fill the warping cost matrix C.
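To make the procedure concrete, the following is a minimal Python sketch of the cost-matrix computation. The function name dtw_distance, the per-step squared Euclidean distance, and the normalisation by |x| + |y| (an upper bound on the warping-path length) are illustrative assumptions, not the implementation used in this report.

import numpy as np

def dtw_distance(x, y):
    """Normalised DTW distance between two N-dimensional time-series.
    x, y: arrays of shape (len_x, N) and (len_y, N)."""
    len_x, len_y = len(x), len(y)
    C = np.full((len_x + 1, len_y + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, len_x + 1):
        for j in range(1, len_y + 1):
            # Distance between the two points, summed over all N dimensions.
            dist = np.sum((x[i - 1] - y[j - 1]) ** 2)
            # Add the minimum accumulated cost of the three neighbouring cells.
            C[i, j] = dist + min(C[i - 1, j - 1], C[i - 1, j], C[i, j - 1])
    # Normalise by |x| + |y|, a simple proxy for the warping-path length.
    return C[len_x, len_y] / (len_x + len_y)

A template-versus-example comparison would then call dtw_distance(template, example) and keep the smallest value.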

Training and Classification:

Figure 1: Model training for DTW

In order for ND-DTW to be used as a real-time recognition algorithm, a template must first be created for each gesture that needs to be classified. This is done by recording M_g training examples for each of the G gestures to be recognised. After the training data has been recorded, each of the G templates can be found by computing the distance between each of the M_g training examples for the gth gesture and searching for the training example that gives the minimum normalised total warping distance when matched against the other M_g - 1 training examples in that class.
The gth template, here denoted T_g, is therefore given by:

T_g = arg min over X_i of sum_{j=1}^{M_g} 1{i != j} ND-DTW(X_i, X_j)

where 1{.} is the indicator function that gives 1 when i != j and 0 otherwise, and X_i and X_j are the ith and jth N-dimensional training examples for the gth gesture, each in the form X = {x_1, x_2, ..., x_N}^T. ND-DTW(X_i, X_j) denotes the normalised total warping distance between X_i and X_j, computed with the per-point distance summed over all N dimensions as described above.
After the templates have been created for each gesture in the database, an unknown N-dimensional time-series X can be classified by computing the normalised total warping distance between X and each of the G templates. The classification index c, representing the gth gesture, is then given by finding the template that gives the minimum normalised total warping distance:

c = arg min_{1 <= g <= G} ND-DTW(X, T_g)
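A minimal sketch of both the template search and this nearest-template decision rule is shown below. It assumes the dtw_distance helper from the earlier sketch is in scope; the function names find_template and classify are illustrative.

import numpy as np
# Assumes dtw_distance() from the earlier DTW sketch is available.

def find_template(examples):
    """Pick the training example with the minimum summed ND-DTW distance
    to the other examples of the same gesture class."""
    costs = []
    for i, xi in enumerate(examples):
        costs.append(sum(dtw_distance(xi, xj)
                         for j, xj in enumerate(examples) if i != j))
    return examples[int(np.argmin(costs))]

def classify(series, templates):
    """Return the index of the template closest to the unknown series."""
    dists = [dtw_distance(series, t) for t in templates]
    return int(np.argmin(dists))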

Classification using HMM


To classify a sequence into one of K classes, we train K HMMs, one per class, and then compute the log-likelihood that each model assigns to the test sequence; if the ith model is the most likely, we declare the class of the sequence to be class i. We trained 3 HMM models for the classification of 3 kinds of gestures:

Figure 2: HMM for model training
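The following is a minimal sketch of this one-model-per-class scheme, assuming the Python hmmlearn package as a stand-in for the MATLAB HMM toolbox referenced in this report; the function names and the diagonal-covariance choice are illustrative.

import numpy as np
from hmmlearn import hmm  # stand-in library for the HMM toolbox used in the report

def train_class_models(train_data, n_states=5, n_iter=20):
    """train_data: dict mapping class label -> list of (T_i, D) observation arrays."""
    models = {}
    for label, sequences in train_data.items():
        X = np.vstack(sequences)              # concatenate sequences for fitting
        lengths = [len(s) for s in sequences]  # per-sequence lengths
        m = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=n_iter)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify_sequence(models, test_seq):
    """Assign the test sequence to the class whose HMM gives the highest log-likelihood."""
    scores = {label: m.score(test_seq) for label, m in models.items()}
    return max(scores, key=scores.get)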

Hidden Markov Model:


An HMM is usually characterized by the following:

- N, the number of states in the model. We denote the individual states as S = {S_1, ..., S_N} and the state at time t as q_t.
- M, the number of distinct observation symbols per state, i.e. the discrete alphabet size. We denote the individual symbols as V = {v_1, ..., v_M}.
- The state transition probability distribution A = {a_ij}, where a_ij = P[q_{t+1} = S_j | q_t = S_i], 1 <= i, j <= N.
- The observation symbol probability distribution in state j, B = {b_j(k)}, where b_j(k) = P[v_k at t | q_t = S_j], 1 <= j <= N, 1 <= k <= M.
- The initial state distribution π = {π_i}, where π_i = P[q_1 = S_i], 1 <= i <= N.

The compact notation for the parameters to be estimated is λ = (A, B, π).


Assumptions: the observations are statistically independent, and a first-order Markov chain is used.
Consider the forward variable α_t(i) and the backward variable β_t(i), which can be calculated using the forward and backward procedures:

α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ)
β_t(i) = P(O_{t+1} O_{t+2} ... O_T | q_t = S_i, λ)
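As an illustration, a minimal sketch of the forward procedure for a discrete-observation HMM is given below; the matrix layout (A as an N x N array, B as an N x M array) and variable names are assumptions for the example, and no rescaling is used, so it is only suitable for short sequences.

import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: alpha[t, i] = P(O_1..O_t, q_t = S_i | lambda).
    A: (N, N) transition matrix, B: (N, M) emission matrix,
    pi: (N,) initial distribution, obs: sequence of symbol indices."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialisation
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction step
    return alpha, alpha[-1].sum()                     # alphas and P(O | lambda)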

There is no known analytical way to choose the model parameters λ = (A, B, π) optimally. We can, however, choose λ such that P(O | λ) is locally maximized using an iterative procedure such as the Baum-Welch method (or, equivalently, EM (expectation-maximization)).

We define γ_t(i) = P(q_t = S_i | O, λ) and ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ), which in terms of α_t and β_t are:

γ_t(i) = α_t(i) β_t(i) / Σ_{j=1}^{N} α_t(j) β_t(j)
ξ_t(i, j) = α_t(i) a_ij b_j(O_{t+1}) β_{t+1}(j) / P(O | λ)

Using γ_t(i) and ξ_t(i, j), the model parameters can be re-estimated as follows:

π_i = γ_1(i)
a_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i)
b_j(k) = Σ_{t : O_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)

The standard HMM, however, assumes a discrete alphabet as input, whereas the input data for gesture recognition consists of continuous feature vectors rather than discrete symbols. We can resolve this using one of the following methods:

1. Vector Quantization:
Each continuous observation vector is mapped to a discrete codebook index. VQ partitions the training vectors into M (the size of the codebook) disjoint sets, each of which is represented by a single vector. These representative vectors can be obtained using k-means clustering.
After the k-means algorithm has converged, the clustered training data can be used to create the codebook. To quantise a new sample, compute the distance (typically Euclidean) between the new sample and each of the k cluster centres, and set the sample's quantisation value to the ID of the cluster centre with the minimum distance. Once the IDs of the new samples have been obtained, the discrete HMM can be applied, as sketched below.
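A minimal sketch of this quantisation step is shown below, assuming scikit-learn's KMeans for the clustering; the codebook size M = 16 and the function names are illustrative choices.

import numpy as np
from sklearn.cluster import KMeans  # illustrative choice for the k-means step

def build_codebook(training_vectors, M=16):
    """Partition the continuous training vectors into M clusters;
    the cluster centres form the codebook."""
    return KMeans(n_clusters=M, n_init=10).fit(training_vectors)

def quantise(codebook, samples):
    """Map each continuous vector to the ID of its nearest cluster centre
    (Euclidean distance), yielding the discrete symbols fed to the HMM."""
    return codebook.predict(samples)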

2. Gaussian Hidden Markov Model:


For continuous feature vectors, the emission probabilities are modified as follows:

b_j(O) = Σ_{m=1}^{M} c_jm N(O; μ_jm, U_jm),  1 <= j <= N

Here O represents the observation vector, c_jm is the mixture coefficient of the mth mixture component in state j, and N(.; μ_jm, U_jm) is a Gaussian density with mean vector μ_jm and covariance matrix U_jm. The updates for these quantities are:

c_jm = Σ_{t=1}^{T} γ_t(j, m) / Σ_{t=1}^{T} Σ_{m=1}^{M} γ_t(j, m)
μ_jm = Σ_{t=1}^{T} γ_t(j, m) O_t / Σ_{t=1}^{T} γ_t(j, m)
U_jm = Σ_{t=1}^{T} γ_t(j, m) (O_t - μ_jm)(O_t - μ_jm)^T / Σ_{t=1}^{T} γ_t(j, m)

where γ_t(j, m) is the probability of being in state j at time t with the mth mixture component accounting for O_t. A sketch of evaluating this mixture emission density is given below.
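This short sketch evaluates b_j(O) for a single state j; the use of scipy's multivariate_normal and the parameter layout are assumptions made for the example.

import numpy as np
from scipy.stats import multivariate_normal

def mixture_emission(obs, c_j, means_j, covs_j):
    """b_j(O): weighted sum of Gaussian densities for state j.
    c_j: (M,) mixture coefficients, means_j: (M, D) means, covs_j: (M, D, D) covariances."""
    return sum(c * multivariate_normal(mean=mu, cov=U).pdf(obs)
               for c, mu, U in zip(c_j, means_j, covs_j))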

3. Two-dimensional HMM:


This is an extension of the 1D HMM which is not actually a full 2D model but a pseudo-counterpart of it, since not all states are connected. Each row is an independent Markov chain. The basic assumption is that there exists a set of super states that are Markovian, and within each super state there is a set of simple Markovian states.
The transition between super states is modelled as a first-order Markov chain, and each super state is used to represent an entire row of the image; a simple Markov chain is then used to generate observations within the column. The following notation can be used to work with the 2D-HMM:
- N^2, the number of states of the model, with Q = {q_1, ..., q_{N^2}} the set of states
- K = k_1 x k_2, the number of data streams
- M, the number of symbols
- The transition probabilities of the underlying Markov chain, A = {a_ijl}, 1 <= i, j <= N, 1 <= l <= N^2
- The observation probabilities, B = {b_ijm}, 1 <= i, j <= N, 1 <= m <= M
- The initial probabilities, Π = {π_ijk}, 1 <= i, j <= N, 1 <= k <= K

Figure 3: Two-dimensional ergodic HMM (right). Two-dimensional HMM in subsequent time steps (s1 - s4 - observations) (left)

Parameter Initialization
For the number of hidden states, we tried using values ranging from 2 up to about 8. It should be noted that
adding states always improves the achievable likelihood. This is because, in the limit, perfect agreement can be
reached between model and data by allowing for a different state at every time step. The likelihood measure does not
in any way penalize for increasing the number of states. We heavily biased the model toward remaining in the same
state from one time step to the next. Once this condition was imposed, transitions became relatively rare and the

number of states which were profitably used became relatively stable at 5. Any additional states added above this
number became likely only rarely, if at all. After this initial exploration, all subsequent models were created with five
hidden states.
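The following is a minimal sketch of this state-count exploration, assuming the Python hmmlearn package as a stand-in for the MATLAB HMM toolbox referenced in this report; the diagonal-covariance choice and the candidate range 2-8 mirror the text, while variable names are illustrative.

import numpy as np
from hmmlearn import hmm  # stand-in library for the HMM toolbox used in the report

def explore_state_counts(X, lengths, state_range=range(2, 9)):
    """Fit one model per candidate state count and report the training log-likelihood.
    X: concatenated training sequences (sum_i T_i, D); lengths: list of sequence lengths.
    Note that the likelihood never penalises extra states, so it keeps increasing."""
    for n in state_range:
        model = hmm.GaussianHMM(n_components=n, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        print(n, model.score(X, lengths))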
The remaining decisions concerned the initial values to be used for the state transition matrix P, the output likelihood matrix Q, and the initial state vector. In particular, it was necessary to initialize the P matrix such that the chance of remaining in the same state from one time step to the next was very high. We obtained good results when the diagonal values of this matrix were initialized to values distributed around 0.99. This had the effect of starting the optimization algorithm in a region of the search space where state transitions occur only rarely.

Figure 4: Transition matrix supporting the chance of remaining in the state for a model with 5 states
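A minimal sketch of such an initialization is shown below. It places a fixed 0.99 on the diagonal (the report distributes values around 0.99) and spreads the remaining probability mass evenly over the off-diagonal entries, which is an assumption made for the example.

import numpy as np

def init_self_biased_transmat(n_states=5, stay_prob=0.99):
    """Transition matrix whose diagonal is ~0.99, biasing the model toward
    remaining in the same state; off-diagonal mass is spread evenly so rows sum to 1."""
    off = (1.0 - stay_prob) / (n_states - 1)
    P = np.full((n_states, n_states), off)
    np.fill_diagonal(P, stay_prob)
    return P

# Example: the 5-state matrix described above.
print(init_self_biased_transmat())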

If the P values are instead distributed uniformly, the resulting models change state at nearly every time step. This
extreme sensitivity in turn makes comparison between models impossible.
Unlike the P matrix, we found that the initial values of the Q and other parameters had little effect on the
HMMs produced. The hidden states have no intrinsic meaning, so there is no reason to believe that any one state will
be more likely than another; therefore, initial state vector can reasonably be initialized uniformly. Similarly, there is
no a priori relationship between the output symbols and any particular state. This means that the Q matrix can also
be safely initialized with uniformly distributed values.
In the case of the Gaussian HMM we have two additional parameters, namely the mean vectors and the covariance matrices. Initial estimates for the parameters of the mixture of Gaussians were obtained either using k-means or using 'rnd', a random method that chooses centres randomly from the data and computes the covariance matrix (which can be diagonal, spherical or full) from the covariance of the data.
For the continuous models, we found it preferable to use diagonal covariance matrices with several mixtures (i.e. a higher M) rather than fewer mixtures with full covariance matrices. The reason is simple: it is difficult to perform reliable re-estimation of the off-diagonal components of the covariance matrix from the necessarily limited training data.

Results
HMM models were trained for the classification of 3 kinds of gestures using batch training. Each model was trained using the same prior, and the maximum number of iterations allowed was 20.
As a result, the log-likelihood increased in each iteration, with the models converging around the 15th iteration.

Figure 5: Log-likelihood vs number of iterations for three different HMM models trained for three classes of gestures

We classified a sequence into one of the 3 classes by training 3 HMMs, one per class, and then computing the log-likelihood that each model assigned to the test sequence; if the ith model was the most likely, the sequence was declared to belong to class i.

Figure 6: HMM model implemented on test data with sequence of images belonging to two different classes

Inferences
Recognition of Multivariate Temporal Gestures:
Machine learning algorithms, including Hidden Markov Models and Dynamic Time Warping, have been presented in this report for the purpose of temporal gesture classification. Although the HMM classifies a sequence of images correctly, it does so with low confidence. Recent research has shown that the ND-DTW algorithm achieves excellent classification results on gestures that were pre-segmented, as well as on a continuous stream of data that also contained null gestures.

BIBLIOGRAPHY

Nicholas Edward Gillian, Gesture Recognition for Musician Computer Interaction.
https://en.wikipedia.org/wiki/Scale-invariant_feature_transform#Comparison_of_SIFT_features_with_other_local_features
Janusz Bobulski, Comparison of the Effectiveness of 1D and 2D HMM in the Pattern Recognition.
Lawrence R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.
HMM toolbox: https://www.cs.ubc.ca/~murphyk/Software/HMM/hmm_usage.html
