




Minor Project

Submitted in partial fulfillment of the requirements for the degree of Bachelor of Technology in Computer Engineering

By Mr. Deval Chaudhari (08BCE010) Mr. Vipul Chauhan (08BCE016)

Guide Prof. Pooja Shah


This is to certify that the Minor Project entitled Gesture Recognition System submitted by Mr. Deval Chaudhari (08BCE010) & Mr. Vipul Chauhan (08BCE016), towards the partial fulfillment of the requirements for the degree of Bachelor of Technology in Computer Engineering of Nirma University of Science and Technology, Ahmedabad, is the record of work carried out by them under my supervision and guidance. In my opinion, the submitted work has reached the level required for being accepted for examination. The results embodied in this minor project, to the best of my knowledge, haven't been submitted to any other university or institution for the award of any degree or diploma.

Dr. Sanjay Garg Professor and Section Head, CSE Department, Institute of Technology, Nirma University, Ahmedabad

Prof D J Patel Professor and Head, CSE Department, Institute of Technology, Nirma University, Ahmedabad

Prof. Pooja Shah Guide and Assistant Professor, Dept. of Computer Science, Institute of Technology, Nirma University, Ahmedabad


We would like to thank our Faculty Guide, Prof. Pooja Shah, Assistant Professor, Department of Computer Engineering, Institute of Technology, Nirma University, Ahmedabad, for her valuable guidance and continual encouragement throughout the Minor Project. We are heartily thankful to her for her timely suggestions and for the clarity she brought to the concepts of the topic, which helped us a lot during this study. We would also like to thank our laboratory faculty, Dr. Sanjay Garg, for his guidance throughout the semester.

We are also thankful to Prof. D J Patel, Head, Department of CE, IT and MCA, Institute of Technology, Nirma University, Ahmedabad, for his continual kind words of encouragement and motivation throughout the Minor Project. We are also thankful to Dr. K Kotecha, Director, Institute of Technology, for his kind support in all respects during our study.

We are thankful to all faculty members of the Department of Computer Engineering, Nirma University, Ahmedabad, for their special attention and suggestions towards the project work.

Mr. Deval Chaudhari (08BCE010) Mr. Vipul Chauhan (08BCE016)


The primary goal of gesture recognition research is to create a system which can identify specific human gestures and use them to convey information or for device control. The report covers the ways in which a gesture can be recognized: instrumented gloves and vision based tracking. We then discuss two types of approaches. The first is the statistical model approach, under which we discuss template based methods, feature extraction, color segmentation models, and Principal Component Analysis. The second approach is learning algorithms, under which we discuss neural networks and the Hidden Markov Model. Of all these methods, we have implemented the 2-D correlation method, which is discussed in detail. The discussion includes the MATLAB model that implements the algorithm and the Java JAR file which performs the action based on gesture detection. The GUI of the application developed is also shown. The report ends with the issues we faced during implementation of our algorithm and the open issues in the field of gesture recognition.





With the development of information technology in our society, we can expect that computer systems will, to a larger extent, be embedded into our environment. These environments will impose needs for new types of human-computer interaction, with interfaces that are natural and easy to use.

The user interface (UI) of the personal computer has evolved from a text-based command line to a graphical interface with keyboard and mouse inputs. However, these devices are inconvenient and unnatural for many tasks. The use of hand gestures provides an attractive alternative to these cumbersome interface devices for human-computer interaction (HCI). Users generally use hand gestures to express their feelings and notify their thoughts. In particular, visual interpretation of hand gestures can help in achieving the ease and naturalness desired for HCI. Vision has the potential of carrying a wealth of information in a nonintrusive manner and at a low cost; therefore it constitutes a very attractive sensing modality for developing hand gesture recognition. Recent research in computer vision has established the importance of gesture recognition systems for the purpose of human-computer interaction.

The primary goal of gesture recognition research is to create a system which can identify specific human gestures and use them to convey information or for device control. A gesture may be defined as a physical movement of the hands, arms, face, and body with the intent to convey information or meaning. Gesture recognition, then, consists not only of the tracking of hand movement, but also the interpretation of that movement as semantically meaningful commands.


Gestures can be static (the user assumes a certain pose or configuration) or dynamic (with pre-stroke, stroke, and post-stroke phases). Some gestures also have both static and dynamic elements, as in sign languages. Again, the automatic recognition of natural continuous gestures requires their temporal segmentation. Often one needs to specify the start and end points of a gesture in terms of the frames of movement, both in time and in space. Sometimes a gesture is also affected by the context of preceding as well as following gestures. Moreover, gestures are often language and culture-specific.



Two approaches are commonly used to interpret gestures for human-computer interaction. They are:
1. Instrumented Gloves
2. Vision Based Tracking

1.2.1. Instrumented Gloves

This method employs sensors (mechanical or optical) attached to a glove that transduces finger flexions into electrical signals for determining the hand posture. This approach forces the user to carry a load of cables which are connected to the computer and hinders the ease and naturalness of the user interaction.

1.2.2. Vision based Tracking

Computer vision based techniques are non-invasive and based on the way human beings perceive information about their surroundings. Although it is difficult to design a vision based interface for generic usage, it is feasible to design such an interface for a controlled environment.

In this report we have focused only on computer vision based hand gesture recognition and discussed different approaches for the same.




Since gesture recognition can be used in many areas, in this section we present an overview of some of the application domains that employ gesture interactions.

Virtual Reality: Gestures for virtual and augmented reality applications have experienced one of the greatest levels of uptake in computing. Virtual reality interactions use gestures to enable realistic manipulations of virtual objects using one's hands, for 3D display interactions or for 2D displays that simulate 3D interactions.

Robotics and Telepresence: Telepresence and telerobotic applications are typically situated within the domain of space exploration and military-based research projects. The gestures used to interact with and control robots are similar to fully-immersed virtual reality interactions; however, the worlds are often real, presenting the operator with video feed from cameras located on the robot. Here, gestures can control a robot's hand and arm movements to reach for and manipulate actual objects, as well as its movement through the world.

Desktop and Tablet PC Applications: In desktop computing applications, gestures can provide an alternative interaction to the mouse and keyboard. Many gestures for desktop computing tasks involve manipulating graphics, or annotating and editing documents using pen-based gestures.

Games: Gestures have also been used for computer games. Freeman et al. tracked a player's hand or body position to control the movement and orientation of interactive game objects such as cars. Konrad et al. used gestures to control the movement of avatars in a virtual world, and the PlayStation 2 introduced the EyeToy, a camera that tracks hand movements for interactive games.


Sign Language: Sign language is an important case of communicative gestures. Since sign languages are highly structural, they are very suitable as test beds for vision algorithms. At the same time, they can also be a good way to help the disabled to interact with computers. Sign language for the deaf (e.g. American Sign Language) is an example that has received significant attention in the gesture literature.



The trackball, the joystick, and the mouse are extremely successful devices for hand-based computer input. Yet all require that the user hold some hardware, which can be awkward. Furthermore, none accommodates the richness of expression of a hand gesture.

Devices such as the Data Glove can be worn to sense hand and finger positions. While this captures the richness of a hand's gesture, it requires a special glove. We seek a vision-based method which will be free of gloves and wires.

Relying on visual markings on the hands, previous researchers have recognized sign language and pointing gestures. However, these methods require the placement of markers on the hands. The marking-free systems of some researchers can recognize specific finger or pointing events, but not general gestures. Employing special hardware or offline learning, several researchers have developed successful systems to recognize general hand gestures. Blake and Isard have developed a fast contour-based tracker which they applied to hands, but its discrimination of different hand poses is limited.

The real-time hand gesture recognition systems we are aware of require special hardware or lengthy training analysis. We are looking for a real-time hand gesture recognition system that does not require special hardware or lengthy training analysis but is natural and easy to use and implement.


This report deals with the hand gesture recognition techniques used in the literature, since most human-computer interaction is done through the hand. Chapter 2 deals with the various algorithmic techniques that have been used over the years for this purpose. In Chapter 3 we present the technique proposed by us and the algorithm that we have adopted.


Various approaches in the literature have been classified by the author into two major categories:
1. Statistical Models
2. Learning Models


2.1. Statistical Models:

This category of methods deals with the extraction of features, in the form of mathematical quantities, from the data captured through a webcam. The category is further divided into sub-categories:

2.1.1. Template Based:

In this approach the data obtained is compared against some reference data and, using thresholds, the data is categorized into one of the gestures available in the reference data. The Template Matching block finds the best match of a template within an input image. The block computes match metric values by shifting a template over a region of interest or the entire image, and then finds the best match location. A basic method of template matching uses a convolution mask (template) tailored to a specific feature of the search image which we want to detect. This technique can be easily performed on grey images or edge images. The convolution output will be highest at places where the image structure matches the mask structure, where large image values get multiplied by large mask values.

This method is normally implemented by first picking out a part of the search image to use as a template. We will call the search image S(x, y), where (x, y) represent the coordinates of each pixel in the search image, and the template T(xt, yt), where (xt, yt) represent the coordinates of each pixel in the template. We then simply move the center (or the origin) of the template T(xt, yt) over each (x, y) point in the search image and calculate the sum of products between the coefficients in S(x, y) and T(xt, yt) over the whole area spanned by the template. As all possible positions of the template with respect to the search image are considered, the position with the highest score is the best position. This method is sometimes referred to as 'Linear Spatial Filtering' and the template is called a filter mask.

For example, one way to handle translation problems on images using template matching is to compare the intensities of the pixels, using the SAD (sum of absolute differences) measure. A pixel in the search image with coordinates (xs, ys) has intensity Is(xs, ys) and a pixel in the template with coordinates (xt, yt) has intensity It(xt, yt). Thus the absolute difference in the pixel intensities is defined as Diff(xs, ys, xt, yt) = | Is(xs, ys) - It(xt, yt) |.

The SAD measure, obtained by looping through the pixels in the search image while translating the origin of the template to every pixel, can be written as:

SAD(x, y) = sum over i = 0 .. Trows-1 and j = 0 .. Tcols-1 of | S(x+i, y+j) - T(i, j) |

where Srows and Scols denote the rows and the columns of the search image, and Trows and Tcols denote the rows and the columns of the template image, respectively. In this method the lowest SAD score gives the estimate of the best position of the template within the search image. The method is simple to implement and understand, but it is one of the slowest methods.
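The SAD search described above can be sketched in a few lines of Python (an illustrative sketch only; the report's own implementation is in MATLAB, and the function name here is our own):

```python
def sad_match(search, template):
    """Slide the template over the search image and return the
    (row, col) offset with the lowest sum of absolute differences."""
    s_rows, s_cols = len(search), len(search[0])
    t_rows, t_cols = len(template), len(template[0])
    best_pos, best_sad = None, float("inf")
    for x in range(s_rows - t_rows + 1):
        for y in range(s_cols - t_cols + 1):
            # Accumulate |S(x+i, y+j) - T(i, j)| over the template area.
            sad = 0
            for i in range(t_rows):
                for j in range(t_cols):
                    sad += abs(search[x + i][y + j] - template[i][j])
            if sad < best_sad:
                best_sad, best_pos = sad, (x, y)
    return best_pos, best_sad
```

A perfect match gives a SAD of zero at the template's true location, which is why the lowest score is taken as the best position.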



Fig 2.1 Template matching

This is a simple approach with little calibration, but it suffers from noise and doesn't work with overlapping gestures.




2.1.2. Feature Extraction:

This approach deals with the extraction of low level information from the data and combines that information to produce high level semantic feature information which can be used to classify gestures/postures. The methods in this sub-category usually deal with capturing changes and measuring certain qualities during those changes. The collection of these values is used to label a posture, which can subsequently be extended to a gesture.

Feature extraction is the process of generating a set of descriptors or characteristic attributes from a binary image. The descriptors are stored as a feature vector which can be used for the recognition process as well as for machine training purposes in an adaptive recognition engine (i.e. an artificial neural network). The main features experimented with in the prototype are object crossing, projection histogram and scalar region descriptors.

Object crossing works by considering significant pixel changes across a cross section. It is a good feature extraction technique where skeletonization of an object can be omitted, because the width of the objects will not affect its output. The type of cross section used for object crossing may be horizontal, vertical or diagonal at any arbitrary angle, depending on the implementation of the technique. The locations of the cross sections are distributed uniformly across the object, and their number depends on how detailed the extracted features are to be. Implementing a larger number of cross sections implies more information extraction from the objects; however, this costs more computational power as well as memory, due to the larger amount of data to be extracted and stored. Here only two types of object crossing are implemented: object crossing in horizontal and vertical cross sections.

Projection histogram works by taking into account the shape of the object from a direction, by calculating the distance of the first significant pixel detected to its respective origin across a cross section.
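As a rough illustration, the object-crossing and projection-histogram features described above can be computed from a binary image as follows (a Python sketch over 0/1 row-lists; this is our own illustration, not the prototype's actual code):

```python
def row_crossings(img):
    """Horizontal object crossing: count 0->1 transitions along each
    row of a binary image, i.e. how often the cross section enters
    the object."""
    counts = []
    for row in img:
        c, prev = 0, 0
        for v in row:
            if v == 1 and prev == 0:
                c += 1
            prev = v
        counts.append(c)
    return counts

def top_projection(img):
    """Top projection histogram: for each column, the distance from
    the top edge to the first foreground pixel (image height if the
    column contains no foreground pixel)."""
    rows, cols = len(img), len(img[0])
    hist = []
    for c in range(cols):
        d = rows
        for r in range(rows):
            if img[r][c] == 1:
                d = r
                break
        hist.append(d)
    return hist
```

Concatenating such per-row and per-column values yields the feature vector that is later fed to the classifier.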

Types of projection histogram vary in terms of the direction of projection (i.e. top, left, right, bottom, and any arbitrary diagonal direction). Scalar region descriptors are the simplest yet significant features to describe an object. In this prototype, four main region descriptors are used: width, height, area ratio and width-to-height ratio. Based on the ROI defined on a clipped palm from the segmentation module, the width feature is the horizontal length of the ROI, and the height is the vertical length of the ROI. Area ratio is the proportion of foreground pixels captured over the overall area of the ROI defined. Usually, a feature that can derive other features will not be used as a separate individual feature, for instance the width and height features, which can derive the width/height ratio. This is to avoid redundant information which may cause inefficiency and possible conflicts in the classifier.

Basically, the objective of feature extraction is to reduce the dimensionality of the raw image while retaining as many descriptive features as possible. This method, however, suffers from heavy computational cost; also, there must be a specific sequence of postures framing a gesture, else this method fails.

2.1.3. Color Segmentation Models:

The following figure shows the Color Segmentation demo model:

Fig 2.2 Color Segmentation Demo Model

To create an accurate color model for the demo, many images containing skin color samples were processed to compute the mean (m) and covariance (C) of the Cb and Cr color channels. Using this color model, the Color Segmentation/Color Classifier subsystem classifies each pixel as either skin or non-skin by computing the square of the Mahalanobis distance and comparing it to a threshold. The equation for the Mahalanobis distance is shown below:



Squared Distance(Cb, Cr) = (x - m)' * inv(C) * (x - m), where x = [Cb; Cr]

The result of this process is a binary image, where pixel values equal to 1 indicate potential skin color locations. The Color Segmentation/Filtering subsystem filters and performs morphological operations on each binary image, which creates the refined binary images shown in the Skin Region window. The Color Segmentation/Region Filtering subsystem uses the Blob Analysis block and the Extract Face and Hand subsystem to determine the location of the person's face and hand in each binary image. The Display Results/Mark Image subsystem uses this location information to draw bounding boxes around these regions.
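The per-pixel classification step can be sketched as follows (a Python sketch of the squared-Mahalanobis-distance test; the mean, covariance and threshold values in the usage below are made-up placeholders, not the ones computed in the demo model):

```python
def mahalanobis_sq(x, m, C):
    """Squared Mahalanobis distance (x - m)' * inv(C) * (x - m) for a
    2-D chrominance sample x = [Cb, Cr], mean m and 2x2 covariance C."""
    dx, dy = x[0] - m[0], x[1] - m[1]
    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = [[ C[1][1] / det, -C[0][1] / det],
           [-C[1][0] / det,  C[0][0] / det]]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

def is_skin(x, m, C, threshold):
    """Classify a pixel as skin if its squared distance is below threshold."""
    return mahalanobis_sq(x, m, C) < threshold
```

A pixel whose chrominance equals the skin mean has distance 0; the threshold trades off false positives against missed skin pixels.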

Fig 2.3 Skin color segmentation

This method works well if the image does not contain any other skin-colored object; if it does, the wrong object will be tracked.

2.1.4. Principal Component Analysis:

In this section we study hand gesture recognition through Principal Component Analysis, but we need some mathematical background to understand the method. This method is also called PCA or eigenfaces. It is a useful statistical technique that has found application in different fields (such as face recognition and image compression), and it is a common technique for finding patterns in data of high dimension. Before describing the method, we first introduce the mathematical concepts that are used in PCA.




1. Notation: In statistics, we generally use samples of a population to realize the measurements. For the notation, we will use the symbol X to refer to the entire sample and the symbol Xi to indicate a specific data point of the sample.

2. Standard deviation: The standard deviation s measures how spread out the data are about the mean; it is the square root of the average squared deviation of the data points from the sample mean.

3. Variance: Variance is another measure of the spread of data in a set; it is simply the square of the standard deviation.

4. Covariance: Covariance measures how much two dimensions vary together, and can be expressed as cov(X, Y) = sum over i of (Xi - mean(X))(Yi - mean(Y)), divided by (n - 1).

5. Eigenvectors: The eigenvectors of a linear operator are non-zero vectors which, when operated on by the operator, result in a scalar multiple of themselves. The scalar is then called the eigenvalue associated with the eigenvector.



6. Eigenvalue: Each eigenvector is associated with an eigenvalue. The eigenvalue gives some information about the importance of the eigenvector. The eigenvalues are really important in the PCA method, because they make it possible to apply a threshold to filter out the non-significant eigenvectors, so that we keep just the principal ones.

Main steps of the method:

First of all, we had to create the data set. The aim is to choose a good number of pictures and a good resolution, in order to have the best recognition with the smallest database. The next step is to subtract the mean from each of the data dimensions; the mean subtracted is simply the average across each dimension. Step three is to calculate the covariance matrix of the database. If A is the matrix of mean-subtracted images, the covariance matrix C is defined by C = A * A'. We could not calculate this covariance matrix directly, because it was too huge, so we had to find a way to obtain the principal eigenvectors without calculating it. The method consists in choosing a smaller surrogate matrix, A' * A, whose eigenvectors, once multiplied by A, give the principal eigenvectors of C. The idea is that when we have 12 points in a huge space, the number of meaningful eigenvectors will be less than the dimension: it will be the number of points minus 1. So in our case we will have 11 meaningful eigenvectors; the remaining eigenvectors will have an eigenvalue around zero. We then calculated the eigenvectors and the eigenvalues of the covariance matrix, which gave us the principal orientations of the data. With the help of MATLAB we did it easily.

After that, we chose the good components and formed the feature vector. This is the principal step: we chose the principal (most important) eigenvectors with which we could express our data with the lowest information loss. We also chose a precise number of eigenvectors to have the least calculation time but the best recognition. Here, the theory says that we will normally have 11 meaningful eigenvectors. The final step is to make a new data set (which we will call the eigen set). This made it possible to write the last script, which compares the different pictures and ranks them in order of resemblance. To compare the different pictures, we had to express each image of the data set in terms of these principal eigenvectors. The last thing to do is to compare them, by calculating the Euclidean distance between the coefficients attached to each eigenvector.

This method suffers from the drawback that there should be variance in at least one direction. If variance is uniformly distributed in the data, PCA will not yield the relevant principal vectors; also, if there is noise, PCA will consider it a significant bias too. Besides, this method suffers from scaling in hand size and position, which can be taken care of by normalization. Even then, this method is user dependent.
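The small-matrix trick described in the steps above can be sketched as follows (an illustrative Python/NumPy sketch; the report's implementation is in MATLAB, and the function names here are our own):

```python
import numpy as np

def pca_small_trick(A, k):
    """A holds one mean-centred image per column (d x n, with d >> n).
    Instead of eigendecomposing the huge d x d matrix A A', take the
    eigenvectors v of the small n x n matrix A' A; then A v is an
    eigenvector of A A' with the same eigenvalue."""
    small = A.T @ A                      # n x n surrogate matrix
    vals, vecs = np.linalg.eigh(small)   # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]   # keep the k largest
    eig_imgs = A @ vecs[:, order]        # map back to image space
    eig_imgs /= np.linalg.norm(eig_imgs, axis=0)  # unit length
    return eig_imgs

def project(A, basis):
    """Coefficients of each (centred) image in the reduced eigenbasis;
    images are then compared by Euclidean distance between coefficients."""
    return basis.T @ A
```

With n images, only n - 1 eigenvalues are meaningfully non-zero, matching the "number of points minus 1" observation in the text.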


2.2. Learning Algorithms:

These are machine learning algorithms that deal with learning a gesture based on data manipulation and weight assignment. The popular techniques in this sub-category are:

2.2.1. Neural Networks:

Neural networks have received much attention for their successes in pattern recognition. Gesture recognition is no exception to this, and several systems have been reported in the literature. Informally, the reason for their popularity is that once the network has been configured, it forms appropriate internal representer and decider systems based on training examples. Also, because the representation is distributed across the network as a series of interdependent weights instead of a conventional local data structure, the decider has certain advantageous properties: recognition in the presence of noise or incompleteness, and pattern generalization. Generalization plays a crucial role in the system's performance, since most gestures will not be reproduced with perfect accuracy even by the same user, and when a range of users are allowed to use the system the variation becomes even greater. Other useful properties of this approach include performing calibration automatically and the ability to classify raw sensor data, as with template matching.

However, neural networks have very serious drawbacks. Often thousands of labeled examples are needed to train the network for accurate recognition, and this training phase must be repeated from the start if a gesture is added or removed. Neural networks can over-learn if given too many examples and discard originally learnt patterns. It may also happen that one bad example sends the patterns learnt in the wrong direction, and this, or other factors such as orthogonality of the training vectors, may prevent the network from converging at all. They tend to consume large amounts of processing power, especially in the training phase. The biggest problem is to understand what the network has actually learnt, given the ad-hoc manner in which it would have been configured: there is no formal basis for constructing neural networks, and the topology, unit activation functions, learning strategy and learning rate all must be determined by trial and error. Despite these drawbacks, somewhat successful gesture recognition has been performed with neural networks, notably by Fels.

2.2.2. Hidden Markov Model:

A time-domain process demonstrates a Markov property if the conditional probability density of the current event, given all present and past events, depends only on the jth most recent event. If the current event depends solely on the most recent past event, then the process is termed a first-order Markov process. This is a useful assumption to make when considering the positions and orientations of the hands of a gesturer through time. The HMM is a doubly stochastic process governed by:



1. an underlying Markov chain with a finite number of states, and
2. a set of random functions, each associated with one state.

In discrete time instants, the process is in one of the states and generates an observation symbol according to the random function corresponding to the current state. Each transition between the states has a pair of probabilities, defined as follows:
1. Transition probability, which provides the probability for undergoing the transition;
2. Output probability, which defines the conditional probability of emitting an output symbol from a finite alphabet when given a state.

The HMM is rich in mathematical structure and has been found to efficiently model spatio-temporal information in a natural way. The model is termed hidden because all that can be seen is a sequence of observations. It also involves elegant and efficient algorithms, such as Baum-Welch and Viterbi, for evaluation, learning, and decoding. An HMM is expressed as λ = (A, B, π) and is described as follows:
- A set of observation strings O = {O1, . . . , OT}, over t = 1, . . . , T.
- A set of N states {s1, . . . , sN}.
- A set of k discrete observation symbols {v1, . . . , vk}.
- A state-transition matrix A = {aij}, where aij is the transition probability from state si at time t to state sj at time t + 1: aij = Prob(sj at t + 1 | si at t), for 1 ≤ i, j ≤ N.
- An observation symbol probability matrix B = {bjk}, where bjk is the probability of generating symbol vk from state sj.
- An initial probability distribution for the states π = {πj}, j = 1, 2, . . . , N, where πj = Prob(sj at t = 1).



The generalized topology of an HMM is a fully connected structure, known as an ergodic model, where any state can be reached from any other state. When employed in dynamic gesture recognition, the state index transits only from left to right with time, as depicted in Fig. 2.4. The start state s1 and final state sN, for N = 5, are indicated on the figure. Here, the state-transition coefficients aij = 0 if j < i, and the aij sum to 1 over j = 1, . . . , N. The Viterbi algorithm is used for evaluating a set of HMMs and decoding, by considering only the maximum path at each time step instead of all paths.

Fig 2.4 Five-state left-to-right HMM for GR

The global structure of the HMM is constructed by parallel connections of the individual HMMs (λ1, λ2, . . . , λM), whereby insertion (or deletion) of a new (or existing) HMM is easily accomplished. Here, λ corresponds to the HMM model constructed for each gesture, where M is the total number of gestures being recognized.

HMMs have been applied to hand and face recognition. Usually, a 2-D projection is taken from the 3-D model of the hand or face, and a set of input features is extracted experimentally. The spatial component of the dynamic gesture is typically neglected, while the temporal component (having a start state, end state, and a set of observation sequences) is mapped through an HMM classifier with appropriate boundary conditions. A set of data is employed to train the classifier, and the test data are used for prediction verification.

Given an observation sequence, the following are the key issues in HMM use:
1. Evaluation: determining the probability that the observed sequence was generated by the model (Forward-Backward algorithm);
2. Training or estimation: adjusting the model to maximize the probabilities (Baum-Welch algorithm);
3. Decoding: recovering the state sequence (Viterbi algorithm).

For each gesture a separate HMM is trained, and recognition of a gesture is based on which HMM generates the maximum probability. This method suffers from the training time involved and from its complex working nature, as the results are unpredictable because of the hidden nature of the model. For gesture recognition, the Bakis (left-to-right) HMM is commonly used.
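The Viterbi decoding step described above, which keeps only the maximum-probability path at each time step, can be sketched as follows (a minimal Python sketch for a discrete-observation HMM; the toy two-state model in the test is invented for illustration):

```python
def viterbi(obs, A, B, pi):
    """Most likely state path for an observation sequence.
    A[i][j]: transition probability from state i to state j,
    B[j][k]: probability of emitting symbol k from state j,
    pi[j]:  initial probability of state j."""
    N = len(pi)
    # delta[j]: probability of the best path ending in state j so far.
    delta = [pi[j] * B[j][obs[0]] for j in range(N)]
    psi = []  # back-pointers, one list per time step
    for t in range(1, len(obs)):
        new_delta, back = [], []
        for j in range(N):
            # Keep only the best predecessor of state j.
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][obs[t]])
        delta, psi = new_delta, psi + [back]
    # Backtrack from the best final state.
    path = [max(range(N), key=lambda j: delta[j])]
    for back in reversed(psi):
        path.append(back[path[-1]])
    path.reverse()
    return path, max(delta)
```

In a recognizer, each gesture's HMM is evaluated this way and the model producing the maximum probability wins.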


We have proposed a real-time hand gesture recognition system that does not require special hardware or lengthy training analysis but is natural and easy to use and implement. It is based on the 2-D correlation algorithm for object tracking. This system consists of two main parts, which are strongly related to each other: the Gesture Recognition Model and User Interface Controlling. These two parts can also be used separately in other projects. In the next section we discuss the 2-D correlation algorithm.


One approach to identifying a pattern within an image uses cross correlation of the image with a suitable mask. Where the mask and the pattern being sought are similar, the cross correlation will be high. The mask is itself an image which needs to have the same functional appearance as the pattern to be found. Cross correlation is a standard method of estimating the degree to which two series are correlated. Consider two series x(i) and y(i), where i = 0, 1, 2, . . . , N-1. The cross correlation r at delay d is defined as

r(d) = sum over i of [ (x(i) - mx) * (y(i-d) - my) ] / ( sqrt(sum over i of (x(i) - mx)^2) * sqrt(sum over i of (y(i-d) - my)^2) )

Where mx and my are the means of the corresponding series. If the above is computed for all delays d=0,1,2,...N-1 then it results in a cross correlation series of twice the length as the original series.

There is the issue of what to do when the index into the series is less than 0 or greater than or equal to the number of points (i-d < 0 or i-d >= N). The most common approaches are to either ignore these points or assume that the series x and y are zero for i < 0 and i >= N. In many signal processing applications the series is assumed to be circular, in which case the out-of-range indexes are "wrapped" back within range, i.e. x(-1) = x(N-1), x(N+5) = x(5), etc. The range of delays d, and thus the length of the cross correlation series, can be less than N; for example, the aim may be to test correlation at short delays only. The denominator in the expression above serves to normalise the correlation coefficients such that -1 <= r(d) <= 1, the bounds indicating maximum correlation and 0 indicating no correlation. A high negative correlation indicates a high correlation, but of the inverse of one of the series.
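The delay-based cross correlation described above can be sketched as follows (a Python sketch that ignores out-of-range indices, which is one of the handling options mentioned; the function name is our own):

```python
def xcorr(x, y, d):
    """Normalised cross correlation r(d) of two equal-length series at
    delay d. Indices with i - d outside the series are ignored; the
    denominator bounds the result between -1 and 1."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((x[i] - mx) * (y[i - d] - my)
              for i in range(n) if 0 <= i - d < n)
    den = (sum((v - mx) ** 2 for v in x) ** 0.5
           * sum((v - my) ** 2 for v in y) ** 0.5)
    return num / den
```

A series correlated with itself at zero delay gives r = 1; correlating it with its negation gives r = -1, the "high negative correlation" case above.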

Consider the image below in black and the mask shown in red. The mask is centered at every pixel in the image and the cross correlation is calculated; this forms a 2-D array of correlation coefficients.

Fig 3.1 2-D Correlation

The unnormalised correlation coefficient at position (i,j) on the image is given by

    c(i,j) = Σ_m Σ_n [M(m,n) - mM] [I(i+m, j+n) - mI]

where mM is the mean of the mask's pixels and mI is the mean of the image pixels under (covered by) the mask.
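The coefficient above can be evaluated by a straightforward brute-force scan. The sketch below (class and method names XCorr2D and bestMatch are ours; for simplicity the mask is anchored at its top-left corner rather than centered) returns the position where the coefficient is largest:

```java
// Brute-force 2-D correlation: for every valid mask position (i,j),
// compute c(i,j) = sum over (m,n) of (M(m,n) - meanM) * (I(i+m,j+n) - meanI),
// where meanI is the mean of the image pixels currently under the mask.
// Returns {row, col} of the top-left corner of the best-matching window.
public class XCorr2D {

    public static int[] bestMatch(double[][] img, double[][] mask) {
        int mh = mask.length, mw = mask[0].length;
        double meanM = mean(mask, 0, 0, mh, mw);
        double best = Double.NEGATIVE_INFINITY;
        int bi = 0, bj = 0;
        for (int i = 0; i + mh <= img.length; i++) {
            for (int j = 0; j + mw <= img[0].length; j++) {
                double meanI = mean(img, i, j, mh, mw);
                double c = 0;
                for (int m = 0; m < mh; m++)
                    for (int n = 0; n < mw; n++)
                        c += (mask[m][n] - meanM) * (img[i + m][j + n] - meanI);
                if (c > best) { best = c; bi = i; bj = j; }
            }
        }
        return new int[]{bi, bj};
    }

    static double mean(double[][] a, int r, int c, int h, int w) {
        double s = 0;
        for (int m = 0; m < h; m++)
            for (int n = 0; n < w; n++) s += a[r + m][c + n];
        return s / (h * w);
    }
}
```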





This system consists of two main parts, which are strongly related to each other: the Gesture Recognition Model and User Interface Controlling. These two parts can also be used separately in other projects.

3.2.1. Gesture Recognition Model

The Gesture Recognition Model is the first part of the project. It tracks hand movements through the web camera, locates the hand in the recognized movements, and interprets these hand movements as gestures to control devices. It is the most important part of our project and consists of pattern classification and image processing algorithms.

The Gesture Recognition Model is shown in Figure 3.2. The Read Binary File block takes the video through the webcam and stores it in a binary file. The Image From File block holds the image (mask) to be matched. The 2-D XCORR block locates this mask within the real-time video, and the resulting coordinates are passed to the function which moves the cursor accordingly.

The grs.jar file contains the Java code shown below Figure 3.2. It uses the Robot class, which provides methods such as mouseMove, keyPress and keyRelease. The mouseMove method moves the mouse pointer to the coordinates passed to it. In our model, we identify the hand gesture, represent it by x and y coordinates, and pass these coordinates to grs.jar as command-line arguments. The mouse thus moves according to our hand movement in real time.



Fig 3.2 2-D Correlation Model

package grs;

import java.awt.AWTException;
import java.awt.Robot;
import java.awt.event.InputEvent;
import java.awt.event.KeyEvent;

public class Main {

    public static void main(String[] args) throws Exception {
        int x, y;
        Robot robot = new Robot();
        x = Integer.parseInt(args[0]);
        y = Integer.parseInt(args[1]);
        robot.mouseMove(x, y); // set the mouse x, y position
    }
}



The MATLAB function which calls the external grs.jar file is:

function y = fcn(u)
a = 12*u(2);
b = 8*u(1);
coder.extrinsic('display');
coder.extrinsic('sprintf');
str = sprintf('(%d,%d) ', a, b);
display(str);
%coder.extrinsic('dos');
%stg = sprintf('GRS.jar %d %d', a, b);
%dos(stg);
y = 1;
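The commented-out dos(...) call in the MATLAB function builds the command line "GRS.jar x y" and hands it to the shell. The same invocation can be sketched from the Java side as follows (LaunchGrs is a hypothetical helper of ours, assuming grs.jar is reachable from the working directory):

```java
import java.io.IOException;

// Build and launch the command line that the MATLAB dos(...) call would
// produce: java -jar grs.jar <x> <y>, with the cursor coordinates as
// command-line arguments.
public class LaunchGrs {

    public static String[] command(int x, int y) {
        return new String[]{"java", "-jar", "grs.jar",
                            Integer.toString(x), Integer.toString(y)};
    }

    public static void launch(int x, int y) throws IOException {
        new ProcessBuilder(command(x, y)).inheritIO().start();
    }
}
```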



The User Interface Controller is the last part of the project. It provides the following functionality: starting video acquisition from the webcam, stopping video acquisition, and invoking the gesture recognition model, either with a prerecorded video file as input or tracking the hand in real time.

Fig 3.3 User Interface Control





The process is extremely time consuming: the 2-D cross correlation function needs to be computed at every point in the image, and each evaluation of the cross correlation function is itself an O(N²) operation. Ideally the mask should be chosen as small as practicable.
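To make the cost concrete: a brute-force pass over a W x H image with a w x h mask performs roughly (W - w + 1)(H - h + 1) · w · h multiplies. The helper below (CorrCost, our own illustration, not project code) computes this count:

```java
// Rough multiply count for naive 2-D correlation: the coefficient is
// evaluated at every valid mask position, and each evaluation touches
// every mask pixel, giving (W - w + 1) * (H - h + 1) * w * h multiplies
// for a W x H image and w x h mask.
public class CorrCost {

    public static long multiplies(int W, int H, int w, int h) {
        return (long) (W - w + 1) * (H - h + 1) * w * h;
    }
}
```

For a 640x480 frame and a 32x32 mask this is already about 2.8 x 10^8 multiplies per frame, which is why keeping the mask small matters so much for real-time tracking.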

In many image identification processes the mask may need to be rotated and/or scaled at each position.

This process is very similar to 2-D filtering, except that in that case the image is replaced by an appropriately scaled version of the correlation surface.


Contents

Certificate
Acknowledgement
Abstract
Contents
List of Figures

Chapter 1  Introduction
  1.1 Motivation
  1.2 Enabling Technology
      1.2.1 Instrumented Gloves
      1.2.2 Vision Based Tracking
  1.3 Application Domains
  1.4 Related Works

Chapter 2  Approach and Methods
  2.1 Statistical Models
      2.1.1 Template Based
      2.1.2 Feature Extraction
      2.1.3 Color Segmentation Models
      2.1.4 Principal Component Analysis
  2.2 Learning Algorithms
      2.2.1 Neural Networks
      2.2.2 Hidden Markov Model

Chapter 3  Proposed Approach
  3.1 2-D Correlation
  3.2 Implementation
      3.2.1 Gesture Recognition Model
      3.2.2 User Interface Controller
  3.3 Issues in Our Approach

Chapter 4  Open Issues

Chapter 5  Summary and Conclusion
  5.1 Summary
  5.2 Conclusion

References
Appendix A  List of Useful Websites
Appendix B  List of Papers Referred

List of Figures

2.1 Template Matching
2.2 Color Segmentation Demo Model
2.3 Skin Color Segmentation
2.4 Five-State Left-to-Right HMM for GR
3.1 2-D Correlation
3.2 2-D Correlation Model
3.3 User Interface Control



Vision-based hand gesture recognition is still an important area of research because the available algorithms are relatively primitive compared with mammalian vision. A main problem hampering most approaches is that they rely on several underlying assumptions that may be suitable in a controlled lab setting but do not generalize to arbitrary settings. Common assumptions include high-contrast stationary backgrounds and controlled ambient lighting conditions. Also, recognition results presented in the literature are based on each author's own collection of data, making comparison of approaches impossible and raising suspicion about their general applicability. Moreover, most of the methods have a limited feature set. The latest trend in hand gesture recognition is the use of AI to train classifiers, but the training process usually requires a large amount of data, and choosing features that characterize the object being detected is a time-consuming task.

Another problem that remains open is recognizing the temporal start and end points of meaningful gestures in the continuous motion of the hand(s). This problem is sometimes referred to as gesture spotting or temporal gesture segmentation. Reducing the training time and developing a cost-effective, real-time gesture recognition system that is robust to environmental and lighting conditions and does not require any extra hardware poses a grand yet exciting research challenge.




In summary, a review of vision-based hand gesture recognition methods has been presented. Considering the relative infancy of research related to vision-based gesture recognition, remarkable progress has been made. To continue this momentum, it is clear that further research in the areas of feature extraction, classification methods and gesture representation is required to realize the ultimate goal of humans interfacing with machines on their own natural terms.



The algorithmic details of the hand tracker were presented, followed by a discussion of the issues related to each. This project presented a vision-based hand tracking system that does not require any special markers or gloves and can operate in real time on a commodity PC with low-cost cameras. Specifically, the system can track the position of the ring finger of each hand. The motivation for this hand tracker was a desktop-based two-handed interaction system in which a user can move the cursor in real time using natural hand motions.


Appendix A

List of Useful Websites

1. 3D Depth Sensing Prototype System for Gesture Control
2. Matlab, mathworks/recorded_webinar

Appendix B

List of Papers Referred

1. G. R. S. Murthy and R. S. Jadon, "A Review of Vision Based Hand Gestures Recognition", International Journal of Information Technology and Knowledge Management, July-December 2009, Vol. 2, No. 2, pp. 405-410.
2. Sushmita Mitra, "Gesture Recognition: A Survey", IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 37, No. 3, May 2007.
3. Prateem Chakraborty, Prashant Sarawgi, Ankit Mehrotra, Gaurav Agarwal and Ratika Pradhan, "Hand Gesture Recognition: A Comparative Study", Proceedings of the International MultiConference of Engineers and Computer Scientists 2008, Vol. I, IMECS 2008, 19-21 March 2008, Hong Kong.
4. Pragati Garg, Naveen Aggarwal and Sanjeev Sofat, "Vision Based Hand Gesture Recognition", World Academy of Science, Engineering and Technology, Vol. 49, 2009.