
4th Year Project Report

Hand Gesture Recognition Using


Computer Vision

Ray Lockton
Balliol College
Oxford University
Supervisor: Dr. A.W. Fitzgibbon
Department of Engineering Science

Figure 1: Successful recognition of a series of gestures



1 Contents
1 CONTENTS
2 INTRODUCTION
2.1 REPORT OVERVIEW
2.2 PROJECT SUMMARY
2.3 EXISTING SYSTEMS
3 DETECTION
3.1 CHOICE OF SENSORS
3.2 HARDWARE SETUP
3.3 CHOICE OF VISUAL DATA FORMAT
3.4 COLOUR CALIBRATION
3.5 METHOD OF COLOUR DETECTION
3.6 CONCLUSION
4 REFINEMENT
4.1 ANALYSIS OF DISTORTION
4.2 REMOVAL OF SKIN PIXELS DETECTED AS WRIST BAND PIXELS
4.3 REMOVAL OF SKIN PIXELS DETECTED FROM FOREARM
4.4 CONCLUSION
5 RECOGNITION
5.1 CHOICE OF RECOGNITION STRATEGY
5.2 SELECTION OF TEST GESTURE SET
5.3 ANALYSIS OF RECOGNITION PROBLEM
5.4 RECOGNITION METHOD 1: AREA METRIC
5.5 RECOGNITION METHOD 2: RADIAL LENGTH SIGNATURE
5.6 RECOGNITION METHOD 3: TEMPLATE MATCHING IN THE CANONICAL FRAME
5.7 REFINEMENT OF THE CANONICAL FRAME
5.8 REFINEMENT OF THE TRAINING DATA
5.9 METHOD OF DIFFERENTIATION (IN CANONICAL FRAME)
5.10 REFINEMENT OF TEMPLATE SCORE METHOD (NO QUANTIZATION)
5.11 CONCLUSION
6 APPLICATION: GESTURE DRIVEN INTERFACE
6.1 SETUP
6.2 DEMONSTRATION
7 CONCLUSION
7.1 PROJECT GOALS
7.2 FURTHER WORK
8 REFERENCES
9 APPENDIX
9.1 APPENDIX A - GLOSSARY
9.2 APPENDIX B - ENTIRE GESTURE SET
9.3 APPENDIX C - ALGORITHMS


2 Introduction

This project will design and build a man-machine interface using a video camera to interpret
the American one-handed sign language alphabet and number gestures (plus others for
additional keyboard and mouse control).
The keyboard and mouse are currently the main interfaces between man and computer.
In other areas where 3D information is required, such as computer games, robotics and
design, other mechanical devices such as roller-balls, joysticks and data-gloves are used.
Humans communicate mainly by vision and sound; a man-machine interface would therefore be more intuitive if it made greater use of visual and audio recognition. A further advantage is that the user can not only communicate from a distance but also needs no physical contact with the computer. Unlike audio commands, a visual system is also usable in noisy environments or in situations where sound would cause a disturbance.
The visual system chosen was the recognition of hand gestures. The amount of computation required to process hand gestures is much greater than that for the mechanical devices; however, standard desktop computers are now fast enough to make hand gesture recognition using computer vision a viable proposition.

A gesture recognition system could be used in any of the following areas:

• Man-machine interface: using hand gestures to control the computer mouse and/or
keyboard functions. An example of this, which has been implemented in this project,
controls various keyboard and mouse functions using gestures alone.

• 3D animation: Rapid and simple conversion of hand movements into 3D computer space for the purposes of computer animation.

• Visualisation: Just as objects can be visually examined by rotating them with the
hand, so it would be advantageous if virtual 3D objects (displayed on the computer
screen) could be manipulated by rotating the hand in space [Bretzner & Lindeberg,
1998].

• Computer games: Using the hand to interact with computer games would be more
natural for many applications.

• Control of mechanical systems (such as robotics): Using the hand to remotely control
a manipulator.


2.1 Report Overview


Chapter 3 onwards describes the various design options, rationale and conclusions. The structure of the write-up and the completed project architecture are outlined below.

• The system will use a single colour camera mounted above a neutral coloured desk surface next to the computer (see Figure 2). The output of the camera will be displayed on the monitor. The user will be required to wear a white wrist band and will interact with the system by gesturing in the view of the camera. Shape and position information about the hand will be gathered using detection of skin and wrist band colour. The detection will be illustrated by a colour change on the display. The design of the detection process will be covered in Chapter 3.

• The shape information will then be refined using spatial knowledge of the hand and wrist band. This will be discussed in Chapter 4.

Figure 2 Picture of system in use (note wrist band and neutral coloured background)

• The refined shape information will then be compared with a set of predefined training
data (in the form of templates) to recognise which gesture is being signed. In
particular, the contribution of this project is a novel way of speeding up the
comparison process. A label corresponding to the recognised gesture will be
displayed on the monitor screen. Figure 1 (front cover) shows the successful
recognition of a series of gestures. The design process for the recognition will be
discussed in Chapter 5.

• Chapter 6 describes an application of the system: a gesture driven windows interface.

• Finally, Chapter 7 describes how the project has achieved the goals set and further
work that could be carried out.

2.2 Project Summary


In order to detect hand gestures, data about the hand will have to be collected. A decision has
to be made as to the nature and source of the data. Two possible technologies to provide this
information are:

• A glove with sensors attached that measure the position of the finger joints.

• An optical method.


An optical method has been chosen, since this is more practical (many modern computers come with a camera attached), more cost effective and has no moving parts, so it is less likely to be damaged through use.
The first step in any recognition system is collection of relevant data. In this case the raw
image information will have to be processed to differentiate the skin of the hand (and various
markers) from the background. Chapter 3 deals with this step.

Once the data has been collected it is then possible to use prior information about the hand
(for example, the fingers are always separated from the wrist by the palm) to refine the data
and remove as much noise as possible. This step is important because as the number of
gestures to be distinguished increases the data collected has to be more and more accurate and
noise free in order to permit recognition. Chapter 4 deals with this step.

The next step will be to take the refined data and determine what gesture it represents. Any
recognition system will have to simplify the data to allow calculation in a reasonable amount
of time (the target recognition rate for a set of 36 gestures is 25 frames per second). Obvious
ways to simplify the data include translating, rotating and scaling the hand so that it is always
presented with the same position, orientation and effective hand-camera distance to the
recognition system. Chapter 5 deals with this step.

2.3 Existing Systems


A simplification used in this project, which was not found in any recognition methods
researched, is the use of a wrist band to remove several degrees of freedom. This enabled
three new recognition methods to be devised. The recognition frame rate achieved is
comparable to most of the systems in existence (after allowance for processor speed) but the
number of different gestures recognised and the recognition accuracy are amongst the best
found. Figure 3 shows several of the existing gesture recognition systems along with
recognition statistics and method.

Paper; primary method of recognition; number of gestures recognised; background; additional markers (such as wrist band); number of training images; accuracy; frame rate:

[Bauer & Hienz, 2000]: Hidden Markov Models; 97 gestures; general background; multi-coloured gloves; 7 hours of signing; 91.7% accuracy; frame rate not stated.

[Starner, Weaver & Pentland, 1998]: Hidden Markov Models; 40 gestures; general background; no markers; 400 training sentences; 97.6% accuracy; 10 fps.

[Bowden & Sarhadi, 2000]: Linear approximation to non-linear point distribution models; 26 gestures; blue screen background; no markers; 7441 training images; accuracy not stated; frame rate not stated.

[Davis & Shah, 1994]: Finite state machine / model matching; 7 gestures; static background; markers on glove; 10 training sequences of 200 frames each; ≈98% accuracy; 10 fps.

This project: Fast template matching; 46 gestures; static background; wrist band; 100 training examples per gesture; 99.1% accuracy; 15 fps.
Figure 3 Table showing existing gesture recognition systems found during research.


3 Detection

In order to recognise hand gestures it is first necessary to collect information about the hand
from raw data provided by any sensors used. This section deals with the selection of suitable
sensors and compares various methods of returning only the data that pertains to the hand.

3.1 Choice of sensors


Since the hand is by nature a three dimensional object the first optical data collection method
considered was a stereographic multiple camera system. Alternatively, using prior
information about the anatomy of the hand it would be possible to garner the same gesture
information using either a single camera or multiple two dimensional views provided by
several cameras. These three options are considered below:

Stereographic system: The stereographic system would provide pixellated depth information
for any point in the fields of view of the cameras. This would provide a great deal of
information about the hand. Features that would otherwise be hard to distinguish using a 2D
system, such as a finger against a background of skin, would be differentiable since the finger
would be closer to the camera than the background. However, the 3D data would require a great
deal of processor time to calculate and reliable real-time stereo algorithms are not easily
obtained or implemented.

Multiple two dimensional view system: This system would provide less information than the
stereographic system and if the number of cameras used was not great, would also use less
processor time. With this system two or more 2D views of the same hand, provided by
separate cameras, could be combined after gesture recognition. Although each view would
suffer from similar problems to that of the “finger” example above, the combined views of
enough cameras would reveal sufficient data to approximate any gesture.

Single camera system: This system would provide considerably less information about the
hand. Some features (such as the finger against a background of skin in the example above)
would be very hard to distinguish since no depth information would be recoverable.
Essentially only “silhouette” information (see Glossary) could be accurately extracted. The
silhouette data would be relatively noise free (given a background sufficiently distinguishable
from the hand) and would require considerably less processor time to compute than either
multiple camera system.

It is possible to detect a large subset of gestures using silhouette information alone and the
single camera system is less noisy, expensive and processor hungry. Although the system
exhibits more ambiguity than either of the other systems, this disadvantage is more than
outweighed by the advantages mentioned above. Therefore, it was decided to use the single
camera system.


3.2 Hardware setup


The output of the camera system chosen in Section 3.1 comprises a 2D array of RGB pixels
provided at regular time intervals. In order to detect silhouette information it will be necessary
to differentiate skin from background pixels. It is also likely that other markers will be needed
to provide extra information about the hand (such as hand yaw- see Glossary) and the marker
pixels will also have to be differentiated from the background (and skin pixels). To make this
process as achievable as possible it is essential that the hardware setup is chosen carefully.
The various options are discussed below.

Lighting: The task of differentiating the skin pixels from those of the background and markers
is made considerably easier by a careful choice of lighting. If the lighting is constant across
the view of the camera then the effects of self-shadowing can be reduced to a minimum (see
Figure 4). The intensity should also be set to provide sufficient light for the CCD in the
camera.


Figure 4 The effect of self shadowing (A) and cast shadowing (B). The top three images were lit
by a single light source situated off to the left. A self-shadowing effect can be seen on all three,
especially marked on the right image where the hand is angled away from the source. The bottom
three images are more uniformly lit, with little self-shadowing. Cast shadows do not affect the
skin for any of the images and therefore should not degrade detection. Note how an increase of
illumination in the bottom three images results in a greater contrast between skin and
background.

However, since this system is intended to be used by the consumer it would be a disadvantage
if special lighting equipment was required. It was decided to attempt to extract the hand and
marker information using standard room lighting (in this case a 100 watt bulb and shade
mounted on the ceiling). This would permit the system to be used in a non-specialist
environment.

Camera orientation: It is important to carefully choose the direction in which the camera
points to permit an easy choice of background. The two realistic options are to point the
camera towards a wall or towards the floor (or desktop). However since the lighting was a
single overhead bulb, light intensity would be higher and shadowing effects least if the
camera was pointed downwards.


Background: In order to maximise differentiation it is important that the colour of the background differs as much as possible from that of the skin. The floor colour in the project room was a dull brown. It was decided that this colour would suffice initially.

3.3 Choice of visual data format


An important trade-off when implementing a computer vision system is to select whether to
differentiate objects using colour or black and white and, if colour, to decide what colour
space to use (red, green, blue or hue, saturation, luminosity). For the purposes of this project,
the detection of skin and marker pixels is required, so the colour space chosen should best
facilitate this.

Colour or black and white: The camera and video card available permitted the detection of
colour information. Although using intensity alone (black and white) reduces the amount of
data to analyse and therefore decreases processor load, it also makes differentiating skin and
markers from the background much harder (since black and white data exhibits less variation
than colour data). Therefore it was decided to use colour differentiation.

RGB or HSL: The raw data provided by the video card was in the RGB (red, green, blue)
format. However, since the detection system relies on changes in colour (or hue), it could be
an advantage to use HSL (hue, saturation, luminosity- see Glossary) to permit the separation
of the hue from luminosity (light level). To test this the maximum and minimum HSL pixel
colour values of a small test area of skin were manually calculated. These HSL ranges were
then used to detect skin pixels in a subsequent frame (detection was indicated by a change of
pixel colour to white). The test was carried out three times using either hue, saturation or
luminosity colour ranges to detect the skin pixels. Next, histograms were drawn of the number
of skin pixels of each value of hue, saturation and luminosity within the test area. Histograms
were also drawn for an equal sized area of non-skin pixels. The results are shown in Figure 5:

Figure 5 Results of detection using individual ranges of hue (left), saturation (centre) and
luminosity (right) as well as histograms showing the number of pixels detected for each value of
skin (top) and background (bottom). Images and graphs show that hue is a poor variable to use
to detect skin as the range of values for skin hue and background hue demonstrate significant
overlap (although this may have been due to the choice of hue of the background). Saturation is
slightly better and luminosity is the best variable. However, a combination of saturation and
luminosity would provide the best skin detection in this case.


The histogram test was repeated using the RGB colour space. The results are shown in Figure
6.

Figure 6 Histograms showing the number of pixels detected for each value of red (left), green
(centre) and blue (right) colour components for skin pixels (top) and background pixels (bottom).
The ranges for each of the colour components are well separated. This, combined with the fact
that using the RGB colour space is considerably quicker than using HSL, suggests that RGB is
the best colour space to use.

Figure 7 shows recognition using red, green and blue colour ranges in combination:

Figure 7 Skin detection using red, green and blue colour ranges in combination. Detection is
adequate and frame rate over twice as fast as the HSL option.

Hue, when compared with saturation and luminosity, is surprisingly bad at skin differentiation
(with the chosen background) and thus HSL shows no significant advantage over RGB.
Moreover, since conversion of the colour data from RGB to HSL took considerable processor
time it was decided to use RGB.

3.4 Colour calibration


It is likely that the detection system will be subjected to varying lighting conditions (for
example, due to time of day or position of camera relative to light sources). Therefore it is
likely that an occasional recalibration will have to be performed. The various calibration
techniques are discussed below:


3.4.1 Initial Calibration


The method of skin and marker detection selected above (Section 3.3) involves checking the
RGB values of the pixels to see if they fall within red, green and blue ranges (these ranges are
different for skin and marker). The choice of how to calculate these ranges is an important
one. Not only does the calibration have to result in the detection of all hand and marker pixels
at varying light levels, but also the detection of erroneous background pixels has to be
reduced to a minimum.
In order to automatically calculate the colour ranges, an area of the screen was
demarcated for calibration. It was then a simple matter to position the hand or marker (in this
case a wrist band) within this area and then scan it to find the maximum and minimum RGB
values of the ranges (see Figure 8).

A formal description of the initial calibration method is as follows: The image is a 2D array of
pixels:

*I ( x, y) =  gr((xx,, yy)) 


 b( x, y ) 
 
* * *
The calibration area is a set of 2D points:
J = {x1 xn }
* where x i = (x, y )

*
The colour ranges can then be defined for this area:
*
r = max r ( x ) r = min r ( x )
max
**
x∈J
*
* min *
x∈ J
*

max

max
* *
x∈J
*
g = max g ( x ) g = min g ( x )
*

b = max b( x ) b = min b( x )
* *
min

min
*
x∈ J

*
*

*
x∈J x∈ J

A formal description of skin detection is then as follows:

The skin pixels are those pixels (r, g, b) such that:

((r ≥ rmin) & (r ≤ rmax)) &
((g ≥ gmin) & (g ≤ gmax)) &
((b ≥ bmin) & (b ≤ bmax))

Call this predicate S(r, g, b).

The set of all skin pixel locations is then:

L = { x | S(r(x), g(x), b(x)) = 1 }

Using this method, skin pixels were detected at a rate of 15 fps on a 600 MHz laptop (see Figure 9).
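The report does not reproduce the code for this step. The following Python/NumPy sketch (Python is not the language of the original project) illustrates the range calibration and the bounding-box test S(r, g, b) described above; the function names and the example calibration region are illustrative assumptions.

```python
import numpy as np

def calibrate_ranges(frame, region):
    """Scan a calibration region (y0:y1, x0:x1) of an RGB frame and
    return the per-channel minimum and maximum values found there."""
    y0, y1, x0, x1 = region
    patch = frame[y0:y1, x0:x1, :].reshape(-1, 3)
    return patch.min(axis=0), patch.max(axis=0)

def detect_skin(frame, lo, hi):
    """Return a boolean mask of pixels whose R, G and B values all lie
    within the calibrated ranges (the cuboid / bounding-box test)."""
    return np.all((frame >= lo) & (frame <= hi), axis=2)

# frame: H x W x 3 uint8 RGB image; the region is a hypothetical area
# chosen so that it contains only skin pixels.
# lo, hi = calibrate_ranges(frame, (120, 180, 200, 260))
# mask = detect_skin(frame, lo, hi)   # True where S(r, g, b) holds
```

The same two functions can be calibrated and applied separately for the wrist band colour ranges.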





Figure 8 Image A shows the colour calibration areas for wrist band (green) and skin (orange).
Calibration is performed by positioning the wrist band under the green calibration area and the
hand under the orange calibration area (image B shows a partially positioned hand). The
calibration algorithm then reads the colour values of both areas and calculates the ranges by
repeatedly updating maximum and minimum RGB values for each pixel. Images C and D show
the pixel colour values for the skin and wrist band areas. The colour ranges calculated for each
colour component are indicated by double headed arrows.

Figure 9 After calibration, the skin and wrist band colour ranges are used for detection in all subsequent frames. A detected frame is shown here, with skin pixel detection indicated in white and wrist band pixel detection indicated in red.


It was decided to use the calibration routine, discussed in Section 3.4.1, to find the initial
values. However, the ranges returned by this method were less than perfect for the reasons
below:

• The calibration was carried out using a single frame; hence pixel colour variations
over time, due to camera noise, would not be accounted for.

• To ensure that the sampled area only contained skin pixels, by necessity it had to be
smaller than the hand itself. The extremities of the hand (where, due to self-
shadowing, the colour variation is the greatest) were therefore not included in the
calibration.

3.4.2 Improving calibration


In order to improve the calibration, four further methods were considered:

1. Multiple frame calibration: If the calibration was repeated over several frames and
the overall maximum and minimum colour values calculated, then the variation over
time due to camera noise would be included in those ranges and its effect thus
negated. The method would require the hand to be held stationary during the
calibration process.

The routine was thus modified to perform the calibration over 10 frames instead of
one. Figure 10 shows the results.

Figure 10 Results of multiple frame calibration. Stage A is the result of the initial calibration. Stage B is the result of calibration over 10 frames. There is no discernible difference in the skin fit.

Calibration of several frames does little to improve skin detection. Therefore this
method was not retained.

2. Region-Growing: A second method would be to query pixels close to the detected skin pixels found using the initial method. If the colour components of these fell just outside the calibration ranges then the ranges could be increased to include them. This process could then be repeated a number of times until the skin detection was adequate. Figure 11 shows how the process works (simplified).



Figure 11 A simplified illustration of how region-growing works. Image A shows the initial
captured hand. Image B shows the result of initial calibration, detected pixels are shown in white.
For simplicity’s sake the pixels that fall within the initial colour ranges have been drawn as a
square. In practice, all pixels within the ranges will have been identified (these pixels would be
scattered throughout the hand area). Next, any pixels in the neighbourhood of those already
detected are scanned (the area within the black box of image C). If their colour values lie just
outside the current colour ranges, the ranges are increased to include them. The result is shown
in image D (again simplified). Although the pixels between the index and middle fingers fell
within the boundary, their values did not fall close to the ranges, so they were ignored. The
process is then repeated (images E and F) until, in theory, the ranges are such that all skin pixels
are detected.

A program was written to repeat the region-growing process a number of times on a single frame. The results are shown in Figure 12.



Figure 12 Results of region-growing. Stage A is the result of the initial calibration. Stage B is the
result of 50 repetitions of the region-growing algorithm (the fit is better still but a single
erroneous pixel, circled and arrowed, has been detected in the background). Stage C is the result
of 100 repetitions. The background noise is growing even though the shadowed areas of the hand
are still not detected adequately. Finally, by Stage D with 200 repetitions there is a considerable
amount of background noise.

The results show that performing the region-growing process a small number of times results in slightly better detection, but the process becomes noisy if the number of repetitions is too high (>100). It was decided to keep this method but restrict its growth to a maximum of 50 repetitions (a sketch of this range-growing step is given after this list).

3. Background Subtraction: With this method an image of the background is stored. This information is then subtracted from any subsequent frames. In theory this would
negate the background leaving only the hand and marker information, making the
detection process much easier. However, although performing background
subtraction with a black and white system worked well, doing the same with colour
proved much more difficult as a simple subtraction of the colour components made
the remaining hand and marker colour information uneven over the frame. This
method also made the system considerably slower and was adversely influenced by
the automatic aperture adjust of the camera. As the current system worked adequately
it was decided not to proceed with this calibration step.

4. Removal of persistent aberrant pixels: Although it is a valid design choice to select a background that differs greatly in hue both from that of the skin and that of the wrist band, it is possible that imperfections in the background colour or camera function could result in aberrant pixels falling within the calibrated ranges and therefore being repeatedly misinterpreted as skin or wrist band pixels. It would be possible to scan the image when the hand is not in the frame and store any (aberrant) pixels detected. Simply ignoring these pixels, however, would affect the recognition depending on where the hand was in the frame. It would therefore be necessary to choose the correct value for each aberrant pixel based on the values of those surrounding it (if all surrounding pixels are skin then detect it as skin, otherwise as background). However, neither the camera nor the background exhibited such pixels when a hand was in frame, so it was decided not to proceed in programming this calibration step.
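Of the four methods above, the region-growing refinement (method 2) was the one retained. A minimal Python/NumPy sketch of one way to implement it is shown below; the original project was not written in Python, and the `grow_ranges` name, the per-channel `margin` and the iteration cap are illustrative assumptions.

```python
import numpy as np

def grow_ranges(frame, lo, hi, iterations=50, margin=8):
    """Region-growing refinement of the calibrated colour ranges.
    Pixels adjacent to already-detected skin whose colours lie just
    outside the current ranges (within `margin` per channel) widen the
    ranges; the process is repeated a fixed number of times."""
    lo = lo.astype(np.int16)
    hi = hi.astype(np.int16)
    for _ in range(iterations):
        mask = np.all((frame >= lo) & (frame <= hi), axis=2)
        # 4-connected dilation of the current detection mask
        neighbours = (np.roll(mask, 1, 0) | np.roll(mask, -1, 0) |
                      np.roll(mask, 1, 1) | np.roll(mask, -1, 1))
        candidates = neighbours & ~mask
        # keep only candidates whose colour lies just outside the ranges
        near = np.all((frame >= lo - margin) & (frame <= hi + margin), axis=2)
        grow = candidates & near
        if not grow.any():
            break
        vals = frame[grow].astype(np.int16)
        lo = np.minimum(lo, vals.min(axis=0))
        hi = np.maximum(hi, vals.max(axis=0))
    return lo, hi
```

Capping `iterations` at 50 mirrors the restriction adopted above to stop the ranges growing into the background.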

3.5 Method of colour detection


Until now a simple RGB bounding box has been used in the classification of the skin and
marker pixels. However, if a plot is drawn of the detected skin pixels (see Figure 13) it can be
seen that they lie not within a cuboid (the principle used by the current detection system) but
within an ellipsoid.

Figure 13 Plots of different combinations of skin pixel colour values (green) and background
pixel colour values (red). The skin pixels are well separated from the background pixels in all
three colour components but lie within an ellipsoid as opposed to a cuboid. The values are well
enough separated, however, for a cuboid colour range system to work adequately.

In order to improve accuracy it would be necessary to check if the colour components of the
skin and wrist band pixels fell within this ellipsoid. However, this was considered
computationally intensive and given that the current cuboid system works adequately it was
not implemented.
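Had the ellipsoid test been pursued, one way to realise it would have been to fit a mean and covariance to the calibration pixels and threshold the Mahalanobis distance, as in the Python/NumPy sketch below. This is an illustrative assumption, not part of the implemented system; the names and the threshold are hypothetical.

```python
import numpy as np

def fit_ellipsoid(samples):
    """Fit an ellipsoid to calibration pixels (N x 3 RGB samples) as the
    mean and inverse covariance of the colour distribution."""
    mean = samples.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(samples, rowvar=False))
    return mean, cov_inv

def inside_ellipsoid(frame, mean, cov_inv, threshold=3.0):
    """True where a pixel's Mahalanobis distance from the calibration
    mean is below the threshold, i.e. it lies inside the ellipsoid."""
    diff = frame.reshape(-1, 3) - mean
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return (d2 < threshold ** 2).reshape(frame.shape[:2])
```

The per-pixel matrix product is the extra cost that made this approach unattractive compared with the cuboid test.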

3.6 Conclusion
This chapter has described the choice and setup of hardware and the methods of calibration
and detection in order to detect as many of the skin and marker pixels within the frame as possible. The hardware chosen was a single colour camera pointing down towards a desk (or
floor) surface of a constant colour with no special lighting. Calibration is performed by
scanning the RGB colour values of pixels within a preset area of the frame and improved
using a limited amount of region-growing. Detection is performed by comparing each RGB
pixel value with ranges found during calibration. Figure 12 Stage B shows the successful
detection of the majority of the hand area.


4 Refinement

Using the methods discussed in the previous chapter it is possible to detect the majority of the
skin and band pixels in the frame whilst detecting very few aberrant pixels in the background.
However, some complications were noticed which could reduce the accuracy of recognition at
a later stage. These are:

1. Image distortion: If the camera’s visual axis is not perpendicular to the floor plane, a
given gesture would appear different depending on the position and yaw of the hand
(a given length in one area of the frame would appear longer or shorter in another
area of the frame). This is termed projective distortion. Also, if the camera lens is of
poor quality then the straight sides of a true square in the frame would appear curved.
This is termed radial distortion.

2. Skin pixels detected as wrist band pixels: If the wrist band colour ranges are increased
sufficiently for all pixels to be detected then areas of skin that are more reflective
(such as the knuckles) start to be incorrectly identified as band pixels. This is
disadvantageous as it leads to inaccurate recognition information.

3. Skin pixels of the arm being detected: Any skin pixels above the wrist band will also
be detected as skin. It would be preferable if these pixels could be ignored, as they
play no part in the gesture. Wearing a long sleeve top helps solve the problem but
forearm pixels are still detected between the wrist band and the sleeve (which has a
tendency to move up and down the arm as different gestures are made, leading to
variations in the amount of skin detected).

It was decided to reduce the effects of these complications as much as possible.

4.1 Analysis of distortion


Tests were devised to check for the presence of both radial and projective distortion. These
are discussed below.

4.1.1 Radial distortion


In order to assess whether radial distortion was present, a rectangular piece of card was placed
in the frame. It was then a simple matter to check the edges of the frame against the edges of
the card (see Figure 14).


Figure 14 A4 card placed in the frame. If the camera had significant radial distortion, the
straight edges of the paper would appear as curves. This is not the case so radial distortion is not
significant.

The straight sides of the paper are imaged not as curves but as straight lines; therefore radial distortion is not present.

4.1.2 Projective distortion


To check for projective distortion a strip of paper was placed in the frame at various positions.
By measuring its length (in pixels) at each location, any vertical or horizontal distortion could
be found (see Figure 15).

A: Strip length 101 pixels B: Strip length 102 pixels C: Strip length 99 pixels

D: Strip length 98 pixels E: Strip length 97 pixels F: Strip length 95 pixels


Figure 15 Paper strip placed in the frame at different positions (with superimposed lines to aid
measurement). From the measured strip lengths it can be seen that there is only a small amount
of projective distortion present. Overall, there is only 6% deviation in apparent strip length
anywhere in the frame, therefore it was considered unnecessary to correct for projective
distortion.


There is slight image distortion present but its effect is limited to only 6%, and it was therefore not considered serious enough to attempt to remove (removal would involve transforming a distorted rectangle to a regular one, which would be processor intensive).

4.2 Removal of skin pixels detected as wrist band pixels
Although some skin pixels were incorrectly detected as wrist band pixels when the wrist band
colour ranges were increased, no wrist band pixels were incorrectly detected as skin. It was a
simple matter, therefore, to permit pixels to be detected as wrist band only if they had not
previously been detected as skin. This reduced the number of aberrant wrist band pixels
considerably.

4.3 Removal of skin pixels detected from forearm


As there is no difference between the colour ranges of a skin pixel of the hand and a skin
pixel of the forearm, position information will have to be used to remove forearm skin pixels.

4.3.1 Centroid calculation


By averaging the position of the pixels detected it is possible to calculate the centroid of both
the hand and the wrist band.

A formal description of centroid calculation is as follows. From before, the set of all skin pixel locations was defined as:

L = { x | S(r(x), g(x), b(x)) = 1 }

Denote the number of elements of L by |L|. This gives the hand centroid as:

chand = (1 / |L|) Σ x, the sum taken over x ∈ L

The wrist band centroid is calculated in the same way:

cband = (1 / |Lband|) Σ x, the sum taken over x ∈ Lband
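A minimal sketch of this calculation from a boolean detection mask, in Python/NumPy with illustrative names (the project itself was not implemented in Python):

```python
import numpy as np

def centroid(mask):
    """Centroid (x, y) of the True pixels in a boolean detection mask."""
    ys, xs = np.nonzero(mask)
    return xs.mean(), ys.mean()

# c_hand = centroid(skin_mask)   # hand centroid from the skin mask
# c_band = centroid(band_mask)   # wrist band centroid from the band mask
```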

Figure 16 shows an original image and the image with the detected skin pixels, wrist band
pixels and centroids visible.


Figure 16 Original image before skin and wrist band pixel detection (A) and after (B). Detected
skin pixels are shown in blue and wrist band pixels in red. Centroids are displayed as black dots.

Notice how even with priority given to skin pixels over wrist band pixels, a number of wrist
band pixels are erroneously detected near the knuckles (where skin has not been detected due
to the higher reflectivity of those areas).

4.3.2 Localising the wrist band


It was considered that if the distances and angles of the edges of the wrist band relative to the hand centroid could be found, the forearm skin pixels could be removed by comparing their own distances and angles against these values.

The edges of the wrist band can be found by scanning lines parallel to the line joining the two
centroids.

Define the vector joining the two centroids as:

cdif = (xdif, ydif) = chand − cband

The yaw angle of the hand is therefore:

θhand = tan⁻¹( ydif / xdif )

The edges of the band are then found as follows. For each point p1(s1) along the line

p1(s1) = cband + s1 ( cos(θhand + π/2), sin(θhand + π/2) ),   where −50 ≤ s1 ≤ 50,

count the number of wrist band pixels n(s1) along the line

p2(s1, s2) = p1 + s2 ( cos θhand, sin θhand ),   where −50 ≤ s2 ≤ 50.

The two points defining the edges of the band, bleft = (xleft, yleft) and bright = (xright, yright), are then equal to p1(s1) where n(s1) falls below a certain threshold.
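A Python/NumPy sketch of this scan-line search is given below for illustration. The project was not implemented in Python; the ±50 pixel range and the count threshold follow the description above, while the function name, the outward walk from the band centroid and the image-bounds checks are assumptions of this sketch.

```python
import numpy as np

def find_band_edges(band_mask, c_band, theta_hand, half_width=50, threshold=4):
    """Scan lines parallel to the centroid line at perpendicular offsets s1,
    count band pixels n(s1) along each, and take the band edges as the first
    offsets on either side where the count drops below the threshold."""
    h, w = band_mask.shape
    perp = np.array([np.cos(theta_hand + np.pi / 2),
                     np.sin(theta_hand + np.pi / 2)])
    along = np.array([np.cos(theta_hand), np.sin(theta_hand)])
    counts = {}
    for s1 in range(-half_width, half_width + 1):
        p1 = np.asarray(c_band, dtype=float) + s1 * perp
        n = 0
        for s2 in range(-half_width, half_width + 1):
            x, y = np.round(p1 + s2 * along).astype(int)
            if 0 <= x < w and 0 <= y < h and band_mask[y, x]:
                n += 1
        counts[s1] = n
    # walk outwards from the centre until the count falls below the threshold
    left = next((s for s in range(0, -half_width - 1, -1)
                 if counts[s] < threshold), -half_width)
    right = next((s for s in range(0, half_width + 1)
                  if counts[s] < threshold), half_width)
    b_left = np.asarray(c_band, dtype=float) + left * perp
    b_right = np.asarray(c_band, dtype=float) + right * perp
    return b_left, b_right
```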


Figure 17 shows a number of the lines scanned (reduced for clarity) along with a graph
showing the thresholds used in the program to detect the band edges.

(Graph: number of wrist band pixels detected against distance, in pixels, along the line perpendicular to the line joining the centroids.)

Figure 17 The left image shows the lines scanned to detect the edges of the wrist band. The
number of wrist band pixels detected along each line is counted. The edges have been detected
when the number falls below a certain threshold. The graph on the right shows the number of
pixels detected along each of the lines with the detected edges marked in red.

Using these thresholds it is then possible to use only those wrist band pixels that lie within the band's width. This removes any remaining erroneous wrist band pixels detected near the knuckles.

The radius of the band is:

rband = max( |bleft − cband| , |bright − cband| )

Any band pixels further than rband from cband can then be disqualified.

The wrist band centroid can then be recalculated.

Figure 18 shows the wrist band pixels that have passed this radius test and the recalculated
centroid (passed pixels shown in yellow, radius indicated by black circle).

Figure 18 Radius test applied to wrist band pixels. Any pixels that are further from the wrist
band centroid than the band radius (black circle) previously calculated can be ignored (pixels
that pass shown in yellow, those that fail in red)


4.3.3 Removing skin pixels of the forearm


Finally, using the angle and distance from the hand centroid to the wrist band edges it is
possible to differentiate the skin pixels of the forearm and remove them.

The minimum distance between the hand centroid and the edges of the band is:

rhand = min( |bleft − chand| , |bright − chand| )

The maximum and minimum angles of the band (θband max and θband min) relative to chand are:

θband max = max( tan⁻¹( yleft / xleft ) , tan⁻¹( yright / xright ) )
θband min = min( tan⁻¹( yleft / xleft ) , tan⁻¹( yright / xright ) )

Any hand pixels further than rhand from chand and lying between θband min and θband max can then be disqualified (a case statement deals with the situation that occurs when the band angles lie either side of 0 radians).
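A sketch of this distance-and-angle test in Python/NumPy is shown below (illustrative names; the wrap-around case at 0 radians, handled by a case statement in the project, is omitted here for brevity).

```python
import numpy as np

def remove_forearm(skin_mask, c_hand, b_left, b_right):
    """Disqualify skin pixels that are further from the hand centroid than
    the nearer band edge and whose angle lies between the two band-edge
    angles (i.e. forearm pixels). Wrap-around at 0 radians is not handled."""
    ys, xs = np.nonzero(skin_mask)
    dx, dy = xs - c_hand[0], ys - c_hand[1]
    r = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx)

    dl = np.asarray(b_left, dtype=float) - np.asarray(c_hand, dtype=float)
    dr = np.asarray(b_right, dtype=float) - np.asarray(c_hand, dtype=float)
    r_hand = min(np.hypot(dl[0], dl[1]), np.hypot(dr[0], dr[1]))
    t_lo = min(np.arctan2(dl[1], dl[0]), np.arctan2(dr[1], dr[0]))
    t_hi = max(np.arctan2(dl[1], dl[0]), np.arctan2(dr[1], dr[0]))

    forearm = (r > r_hand) & (theta >= t_lo) & (theta <= t_hi)
    cleaned = skin_mask.copy()
    cleaned[ys[forearm], xs[forearm]] = False
    return cleaned
```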

Figure 19 shows the angle and distance criterion being applied, with skin pixels that fail
highlighted in green.

Figure 19 Distance and angle criterion applied to skin pixels. The two straight black lines show
the angle in which the radius criterion is applied. The curved black line shows the radius beyond
which skin pixels are disqualified. In this example failed skin pixels are shown in green.

Finally the hand centroid can be recalculated. This is shown in Figure 20.


Figure 20 Image showing recalculated hand and wrist band centroids. Invalid wrist band pixels
have been ignored (passed pixels shown in yellow, failed pixels in red) and skin pixels up the
forearm have also been ignored.

4.4 Conclusion
This chapter has described several techniques to improve the hand detection. A combination of pixel position and priority-based information was used to remove erroneously detected pixels. Figure 21 shows that the process was very successful.

Figure 21 Detected pixels before and after refinement. The detected wrist band pixels are shown
in red. Notice how after refinement the erroneous wrist band pixels detected on the knuckles
have been ignored, with a corresponding shift in wrist band centroid. The detected skin pixels are
shown in blue. All of the hand pixels are detected except those in areas of higher reflectivity (near
the knuckles) which naturally show up as white. Notice how after refinement all skin pixels
detected up the forearm have been ignored, with a corresponding shift in hand centroid.


5 Recognition

In the previous two chapters, methods were devised to obtain accurate information about the
position of skin and wrist band pixels. This information can then be used to calculate the hand
and wrist band centroids with subsequent data pertaining to hand rotation and scaling. The
next step is to use all of this information to recognise the gesture within the frame.

5.1 Choice of recognition strategy


Two methods present themselves by which a given gesture could be recognised from two
dimensional “silhouette” information:

Direct method based on geometry: Knowing that the hand is made up of bones of fixed width
connected by joints which can only flex in certain directions and by limited angles it would be
possible to calculate the silhouettes for a large number of hand gestures. Thus, it would be
possible to take the silhouette information provided by the detection method and find the most
likely gesture that corresponds to it by direct comparison. The advantages of this method are
that it would require very little training and would be easy to extend to any number of
gestures as required. However, the model for calculating the silhouette for any given gesture
would be hard to construct and in order to attain a high degree of accuracy it would be
necessary to model the effect of all light sources in the room on the shadows cast on the hand
by itself.

Learning method: With this method the gesture set to be recognised would be “taught” to the
system beforehand. Any given gesture could then be compared with the stored gestures and a
match score calculated. The highest scoring gesture could then be displayed if its score was
greater than some match quality threshold. The advantage of this system is that no prior
information is required about the lighting conditions or the geometry of the hand for the
system to work, as this information would be encoded into the system during training. The
system would be faster than the above method if the gesture set was kept small. The
disadvantage with this system is that each gesture would need to be trained at least once and
for any degree of accuracy, several times. The gesture set is also likely to be user specific.

It was decided to proceed with the learning method for reasons of computation speed and ease
of implementation.

5.2 Selection of test gesture set


In order to test any comparison metric devised it is important to have a constant set of easily reproducible gestures. It is also important that the gestures are not deliberately chosen to be as dissimilar as possible, so that the system is tested robustly. Sign language gestures are an excellent test, but sign language normally involves both hands, with one hand regularly occluding the other. This is outside the project remit. However, there is an American one-
handed sign language alphabet, which, with slight modification, can be used (see Appendix
B).

5.3 Analysis of recognition problem


In order for any comparison method to work, it is essential to remove as many degrees of
freedom as possible in order to make the comparison realistic. For instance, if a given gesture
has to be taught for every position in the frame, every hand yaw angle and for various
distances from the camera, then the comparison task becomes impossibly large. However, the
inclusion of a wrist band in detection helps simplify the process by removing these degrees of
freedom. The angle between the centroids of the wrist band and the hand designates the yaw
of the hand, so this degree of freedom can be removed. The distance between the centroids
allows the hand to be scaled to a constant size so the hand-to-camera distance degree of
freedom can be removed. Finally, since the centre of the hand is indicated by the hand
centroid the hand position degree of freedom can also be removed by centring detection about
this point. The only degree of freedom that cannot be removed is the roll angle of the hand
(see Glossary). However it could be argued that if the roll angle is changed (wrist is rotated)
then this represents a different gesture and should be detected as such.
Three recognition methods will be considered within this chapter. The first, developed
mainly to design the comparison architecture, is based on gesture skin area. The second uses
the amount of skin under a series of radials emanating from the hand centroid to generate a
“signature” for each gesture. The third is based upon matching templates generated during
training with a given test mask in the canonical frame. The three methods are discussed
below.

5.4 Recognition method 1: Area metric


A very simple comparison metric would be hand area, which would have the advantage of not
being affected by the yaw of the hand. However, the area of any given gesture is unlikely to
be unique within the test set. Nevertheless it was decided to proceed with the analysis of this
method in order to focus the attention on the comparison architecture of any future system
and the testing methodology. See Appendix C Section 1 for a formal description of this
method.
In order to test this method a program was devised to measure the area of a given gesture
(after scaling to keep the hand centroid to wrist band centroid distance constant). Several
examples from the one-handed sign language were presented and the average areas of each
calculated and stored. A test gesture was then presented to the system and the differences in
area between it and those previously stored calculated.
The recognition results for the sign language letter ‘c’ are shown in Figure 22 (compared
with letters ‘a’ through to ‘i’):


(Chart: area difference between the test gesture 'c' and pairs of trained examples of the letters 'a' to 'i'.)

Figure 22 Comparison of test letter 'c' with pairs of trained examples from 'a' through to 'i'. Although the score is low for the letter 'c', the scores for several of the other gestures are also low. Any of the gestures below the broken line could be misinterpreted as the letter 'c'. This suggests, as predicted, that area is not a good comparison metric to use (although the letters 'a', 'e', 'g' and 'i' are well differentiated from 'c').

As predicted, area is not a good comparison metric as several other trained gestures (‘b’, ‘d’
and ‘h’) also exhibited a similar area to the test letter ‘c’.

5.5 Recognition method 2: Radial length signature


A simple method to assess the gesture would be to measure the distance from the hand
centroid to the edges of the hand along a number of radials equally spaced around a circle.
This would provide information on the general “shape” of the gesture that could be easily
rotated to account for hand yaw (since any radial could be used as datum). Figure 23 shows a
gesture with example radials (simplified).


Figure 23 Example gesture with radials marked. The black radial lengths can easily be measured
(length in pixels shown). However, the red radials present a problem in that they either cross
between fingers or palm and finger.

However, a problem (as shown in Figure 23) is how to measure when the radial crosses a gap
between fingers or between the palm and a finger. To remedy this it was decided to count the
total number of skin pixels along a given radial. This is shown in Figure 24.

Figure 24 One of the problem radials with outlined solution. If only the skin pixels along any
given radial are counted then the sum is the “effective” length of that radial. In this case the
radial length is 46 + 21 = 67.

All of the radial measurements could then be scaled so that the longest radial was of constant length. By doing this, any alteration in the hand-to-camera distance would not affect the radial length signature generated. See Appendix C Section 2 for a formal description of the radial length calculation.
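For illustration, a Python/NumPy sketch of the radial skin-count signature is given below. The project itself was not written in Python; the number of radials, the maximum radial length and the function name are assumptions of this sketch, and the radials are spaced equally around the circle starting from the hand yaw direction.

```python
import numpy as np

def radial_signature(skin_mask, c_hand, theta_hand, n_radials=200, length=120):
    """Count the skin pixels under each of n_radials radials emanating from
    the hand centroid, starting from the hand yaw direction, then scale so
    that the longest radial has a constant value."""
    h, w = skin_mask.shape
    counts = np.zeros(n_radials)
    for i in range(n_radials):
        angle = theta_hand + 2 * np.pi * i / n_radials
        for s in range(1, length):
            x = int(round(c_hand[0] + s * np.cos(angle)))
            y = int(round(c_hand[1] + s * np.sin(angle)))
            if 0 <= x < w and 0 <= y < h and skin_mask[y, x]:
                counts[i] += 1
    return counts / max(counts.max(), 1)   # scale invariance
```

Counting skin pixels rather than measuring the distance to the hand edge gives the "effective" radial length described in Figure 24.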

5.5.1 Evaluation of radial length metric


To evaluate this method a program was written to calculate the radial length signature of a
given gesture and display it in the form of a histogram. Figure 25 shows the skin count of the
radials from 0 to 2π radians for an open hand gesture in several different yaw angles and
distances from the camera.


Figure 25 Open hand gesture in several different positions and yaw angles. The histogram for
each gesture is largely the same shape but shifted dependent on the yaw of the hand.

The measurement is not affected by hand-to-camera distance. The measurement is affected by the yaw of the hand, but this only shifts the readings to the left or right and does not affect their shape. Figure 26, however, shows that the measurements are considerably different for different gestures.

Figure 26 Images showing the histogram for two different gestures. The two histograms are
sufficiently different to permit differentiation.

5.5.2 Removing the hand yaw degree of freedom


In order to counter the shifting effect of hand yaw, a wrist marker was used. The angle
between the centroid of this marker and the centroid of the hand was then used as the initial radial direction. This, along with the maximum radial length scaling, makes the system robust
against changes in hand position, yaw and distance from camera. Figure 27 shows the same
open hand gesture (as in Figure 25) in a variety of positions and yaw angles.

Figure 27 The same open hand gesture as before in a variety of different positions and yaw
angles, but with hand yaw independence. The histograms for all the gestures are similar so it
should be possible to recognise this gesture from a set of different gestures.

The radial measurements are very similar no matter how the hand is positioned.

5.5.3 Comparison of radial signatures


Now that an invariant signature exists for each gesture it is possible to compare the signature
of a test gesture with those of a set of trained gestures. A match score for each trained gesture
was then calculated by adding up the differences between corresponding radial lengths. The
trained signature with the smallest difference could then be presented as the match. See
Appendix C Section 3 for a formal description of the radial signature comparison.
A program was written to display an image of the trained gesture with the best score at
the top left of the image window. Figure 28 shows the successful recognition of several
gestures.


Figure 28 Successful recognition of several different gestures. Gesture recognised is shown at the
top left of the frame. The gestures are recognised correctly even though the yaw of the test hand
is different from that taught.
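The comparison itself can be sketched as a sum of absolute differences between corresponding radial counts, with the best-scoring trained gesture accepted only if its score lies below a rejection threshold; the Python sketch below uses illustrative names and is not the project's own code.

```python
import numpy as np

def best_match(test_sig, trained_sigs, reject_above=None):
    """Compare a test radial signature against a dictionary of trained
    signatures and return the label with the smallest summed difference,
    or None if that score exceeds the rejection threshold."""
    scores = {label: np.abs(test_sig - sig).sum()
              for label, sig in trained_sigs.items()}
    label = min(scores, key=scores.get)
    if reject_above is not None and scores[label] > reject_above:
        return None   # treated as a transition / blank gesture
    return label
```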

5.5.4 Improving radial distribution


During tests it was noticed that the quality of recognition depended on the number of radials
used (in the example in Figure 28 only 100 radials were used where previously the number
was 200). It was also noticed that most of the significant data was concentrated around the
fingers, thus it would be more efficient to group radials in these areas. Figure 29 shows the
radials in their original grouping and after reorganisation.


Figure 29 The left image shows 100 radials in their original pattern. However, this pattern does not give the necessary concentration bias towards the fingers. The image on the right shows 200 radials reorganised so that most lie over the fingers (150 over the fingers and 50 elsewhere).

5.5.5 Re-evaluation of radial length metric


Using this improved system the sign language letters ‘a’ through to ‘o’ were taught to the
system. This enabled a very limited sign language “word processor” to be made (see Figure
30).

Figure 30 Successful implementation of a simple sign language "word processor". Clicking a button whilst gesturing in the frame added the highest scoring gesture to the output window.

The graph in Figure 31 shows that the radial length metric is considerably better than the area
metric at differentiating this series of gestures. However, ‘c’ and ‘i’ have very similar low
scores even though the signs are physically different.


(Chart: total of the differences in the number of pixels along radials between the test gesture 'c' and pairs of trained examples of the letters 'a' to 'i'.)

Figure 31 Comparison of test letter 'c' with trained examples from 'a' through to 'i'. The score is low for the letter 'c' and high for most of the other gestures. However, one example of the letter 'i' also gets a good comparison score even though the gesture corresponding to the letter 'i' is dissimilar to that of the letter 'c'. Nevertheless, the range of scores is considerably better than that of the area recognition method discussed earlier.

5.5.6 Analysis of data provided by system


To examine why the scores were so similar for the physically different gestures ‘c’ and ‘i’
(see Figure 31), the recognition program was altered so that only a single pixel was displayed
along a given radial at a distance proportional to the number of pixels detected (along that
radial). This provided a good illustration of the information presented to the recognition
process (see Figure 32).

Figure 32 On the left is the original image and on the right is a representation of the data
provided by the radial length recognition system. The amount of information provided about
individual fingers is dependent on the angle of the radial covering that finger which means that
gestures involving the poorly represented fingers will not be well differentiated.


Due to the organisation of the radials, the amount of information provided about individual
fingers is dependent on the relative angle of the radial and the long axis of the finger (the
shallower the angle the more information is provided). This is obviously an inadequate
situation as gestures involving the parts of the hand that are not well covered would be hard to
differentiate.

5.5.7 Test of system using American sign-language gestures


The effects of the problem highlighted in Section 5.5.6 are further illustrated by the
recognition statistics in Figure 33, for a considerably larger gesture set involving all the sign
language letters and numbers as well as five mouse commands (left click (lc), right click (rc),
open hand (op), closed hand (cl), double click (dc)- see Figure 50) and space (sp). The test
procedure involved signing all of the gestures as well as transition gestures interleaved
between them. For a perfect score the system would not only have to correctly recognise all
the gestures but also provide a “blank” return for the transition gestures. A false positive is
where the system returns a gesture label even though the input was a transition gesture. A
false negative is where the system returns a “blank” even though the input was a valid
gesture.

Gesture Recognised Gesture Recognised


T T E E
H H D D
E E SP SP
SP SP O O
1 R V V
2 K E E
3 3 R U
4 - SP SP
5 5 T T
SP SP H H
Q Q E E
U U SP SP
I I 6 6
C C 7 7
K K 8 8
SP SP 9 9
B B 0 J
R R SP SP
O O L L
W U A A
N T Z Z
SP SP Y Y
F F SP SP
O O D D
X X O O
E E G G
S N S -
SP SP OP OP
J J CL CL
U U LC LC
M M RC RC
P P DC DC

Correct 55/64
Incorrect 9/64
False positives 33/64
False negatives 2/64
Figure 33 Results from a test of the radial length recognition method. Several of the test gestures
were incorrectly recognised. There were also a number of false positives and two false negatives
(the number of false positives and negatives is dependent on a threshold above which a score is
considered to have been caused by a valid gesture).


5.6 Recognition method 3: Template matching in the canonical frame

In this section an alternative recognition strategy is discussed which involves first
transforming the hand into the canonical frame and then performing a comparison of the
“test” and “taught” transformed data.
Using the hand yaw and scaling information it is possible to transform the entire hand
into a frame where it always has the same yaw angle and scaling (this is called the canonical
frame). For each skin pixel the distance and angle to the hand centroid is calculated. The
distance can then be scaled by the hand-centroid-to-wrist-band-centroid distance and the angle
rotated by the angle of the line joining the centroids. The translated pixel can then be placed in
the canonical frame using another point as reference, say the centre of the screen. See
Appendix C Section 4 for a pseudocode description of this process.
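
As an illustrative sketch only (not the project code), the forward transformation might look as follows in Python; the function and argument names, the anchor point and the target radius of 100 are assumptions, and the sign of the rotation follows the pseudocode in Appendix C Section 4:

import math

def to_canonical(skin_pixels, hand_centroid, band_centroid,
                 anchor=(160, 120), target_radius=100.0):
    # Sketch of the pixel "push" into the canonical frame.
    hx, hy = hand_centroid
    bx, by = band_centroid
    scale_factor = math.hypot(hx - bx, hy - by)   # hand-to-band centroid distance
    shift_angle = math.atan2(hy - by, hx - bx)    # yaw of the line joining the centroids

    canonical_points = []
    for (x, y) in skin_pixels:
        r = math.hypot(x - hx, y - hy)
        theta = math.atan2(y - hy, x - hx)
        r_scaled = r * target_radius / scale_factor   # normalise hand size
        theta_rotated = theta + shift_angle           # remove hand yaw
        canonical_points.append((anchor[0] + r_scaled * math.cos(theta_rotated),
                                 anchor[1] + r_scaled * math.sin(theta_rotated)))
    return canonical_points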

5.6.1 Evaluation of the transformation into the canonical frame

A program was written to perform the transformation. The results are shown in Figure 34.

Figure 34 On the left is the original image and on the right is the image after transformation into
the canonical frame. However, after scaling up from the original frame, gaps appear between the
pixels which would make the recognition comparison unreliable.

The problem is that scaling up from the original frame to the canonical frame results in gaps
between pixels. This would be disadvantageous in recognition as a specific pixel in the
trained set may not match up with a corresponding pixel in the test gesture and as such would
not score.

5.6.2 Modification of the transformation method


A solution to the problem highlighted in Section 5.6.1 would be to change the algorithm from
using a pixel “push” from the original frame to the canonical frame to using a pixel “pull”. With this
method the distance and angle between every pixel in the canonical frame and some anchor
point (such as the centre of the screen) is calculated. The inverse scaling and angle rotation is
then performed and the corresponding pixel in the original frame, relative to the hand
centroid, queried. If this pixel is skin then the pixel in the canonical frame is coloured blue. If
it is not skin it is coloured black. A disadvantage is that any given pixel in the original frame
may be queried several times, reducing efficiency. See Appendix C Section 5 for a
pseudocode description of the pixel “pull” from the canonical frame.
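
A rough sketch of the pixel “pull” just described is given below; the skin predicate, the 320×240 frame size and the other names are assumptions made for illustration:

import math

def pull_into_canonical(is_skin, hand_centroid, scale_factor, shift_angle,
                        width=320, height=240, anchor=(160, 120), target_radius=100.0):
    # Every pixel of the canonical frame queries the corresponding original-frame pixel.
    mask = [[False] * width for _ in range(height)]
    for yc in range(height):
        for xc in range(width):
            r = math.hypot(xc - anchor[0], yc - anchor[1])
            theta = math.atan2(yc - anchor[1], xc - anchor[0])
            r_orig = r * scale_factor / target_radius   # inverse scaling
            theta_orig = theta - shift_angle            # inverse rotation
            x = int(round(hand_centroid[0] + r_orig * math.cos(theta_orig)))
            y = int(round(hand_centroid[1] + r_orig * math.sin(theta_orig)))
            if 0 <= x < width and 0 <= y < height and is_skin(x, y):
                mask[yc][xc] = True                     # shown as blue in the report
    return mask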


5.6.3 Evaluation of the modified transformation


A program was written to perform the modified transformation. Figure 35 illustrates how a
given gesture in two different positions in the original frame looks very similar in the
canonical frame. Notice that shadowing still affects the gesture similarity.

Figure 35 The left two images show two different examples of the same gesture at different
positions and rotations. The right two images show the corresponding images in the canonical
frame. Performing a pixel “pull” rather than a pixel “push” means that the problem of gaps
between pixels no longer occurs. The two gestures look similar in the canonical frame, most of the
differences being caused by shadowing.

5.6.4 Analysis of methods of representation of training data


The question is now how to compare training data with a test gesture in the canonical frame.
Unlike the radial length metric the amount of data to be compared for each gesture is large
(>40,000 pixels). Therefore, although it would be possible to directly compare the canonical
frame information of a test gesture with all of those trained, this process would be inefficient
and slow. It is evident that some pixels are better at differentiating a given set of gestures than
others (pixels near the wrist band are likely to be skin for every gesture in the set, while
those far from it are never skin). It is also the case that some pixels are not reliable in identifying a
given gesture (such as pixels near the edge of the hand or those intermittently affected by
shadowing). To address this problem a program was written to take a number of example
images of a given gesture and compare every pixel over the set. The value for the amount of
variation of each pixel was then calculated and displayed by a colour from blue (small amount
of variation) to red (large amount of variation). These images were termed “jitter maps”. See
Appendix C Section 6 for a pseudocode description of the creation of these jitter maps.
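
As a minimal sketch (assuming the training examples are available as a stack of binary masks, and taking the variation of a pixel to be largest when it is skin in exactly half the examples, as in the glossary), a jitter map might be computed as follows:

import numpy as np

def jitter_map(masks):
    # masks: assumed (n, height, width) array of 0/1 values, one mask per example.
    masks = np.asarray(masks, dtype=float)
    skin_fraction = masks.mean(axis=0)                   # proportion of examples that are skin
    variation = 1.0 - np.abs(2.0 * skin_fraction - 1.0)  # 0 = always the same, 1 = skin in half
    return variation  # mapped to a blue (low) to red (high) colour ramp for display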


Figure 36 shows the jitter maps for the one handed sign language letters m, n and l (40
examples of each gesture were used).

Figure 36 Jitter maps for the letters m, n and l respectively (40 examples of each gesture were
used). The most variation (most red) occurs near the edges of the hand. Greater influence should
therefore be given to the bluer pixels for the purposes of recognition.

As expected, the largest amount of variation occurs near the edges of the hand. Therefore, in
the recognition of these gestures, greater weight should be given to the bluer pixels. It would
also be advantageous to combine the information given by maps such as those in Figure 36 to
find the pixels that best differentiate them. In order to facilitate this a program was first
written to create a map where the value of each pixel is dictated by the proportion of the
training set in which the corresponding pixel was skin. These images were termed “skin
concentration maps” (SCMs). See Appendix C Section 7 for a pseudocode description of the
creation of these skin concentration maps.
A simple subtraction of the SCMs for two sets of gestures could then be performed to
find the pixels that best differentiate the two (the best pixels being those that are mostly
background on one set and mostly skin on the other). See Appendix C Section 8 for a
pseudocode description of the creation of a skin concentration difference map.
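
A minimal sketch of both maps, again assuming the training examples are available as a stack of binary masks, is given below:

import numpy as np

def skin_concentration_map(masks):
    # Fraction of the training examples in which each pixel was skin.
    return np.asarray(masks, dtype=float).mean(axis=0)

def scm_difference(scm_a, scm_b):
    # Pixels near 1.0 are mostly skin in one gesture and mostly background
    # in the other, so they differentiate the two gestures best.
    return np.abs(scm_a - scm_b)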
Figure 37 shows the skin concentration maps for the letters m and n and the result of the
subtraction of the two.


Figure 37 The top two images are skin concentration maps for the letters m and n respectively.
As expected the skin is most concentrated at the centre of the hand (blue areas) and least
concentrated near the edges (red areas). The bottom image is the result of an image subtraction
of the top two. The best pixel areas to differentiate these two gestures lie just beyond the knuckles
of the letter n and in the shadowed area of the letter m (coloured red).

The best pixels to differentiate the letters m and n (coloured red) lie just beyond the knuckles
of the letter n and in the shadowed area of the letter m.

Both jitter and skin concentration maps are a compact way of representing the large amount
of data created during training. However, skin concentration maps proved more useful for the
purposes of gesture comparison and so were chosen.

5.6.5 Evaluation of template matching in the canonical frame recognition method

Now that a skin concentration map could be formed for any gesture trained, a method had to
be found to compare a test gesture mask with each of them. Fundamentally, a trained and test
gesture are a good match if all the areas of skin and background match up. However, a skin
concentration map has no “skin” or “background” but rather a value between these two limits.
Therefore, in order to evaluate this recognition method a program was written to quantize the
skin concentration maps so that all areas above a certain threshold were considered “skin”, all
those below a second threshold considered “background” and all other pixels ignored. A
direct skin to skin and background to background comparison then became possible. See
Appendix C Section 9 for a pseudocode description of the creation of the quantized skin
concentration maps. Figure 38 shows an example skin concentration map before and after
quantization.

Figure 38 An example SCM of the letter ‘e’ before and after quantization (left and right
respectively). Any areas below a certain “cold” threshold are considered “skin” (coloured blue),
all those above another “hot” threshold considered “background” (coloured red). All other areas
are ignored (coloured white).

A score was then calculated by comparing the test gesture mask with each quantized skin
concentration map (QSCM). A point was awarded if the test mask skin pixel coincided with a
“skin” pixel of the QSCM and a point subtracted if the test mask skin pixel coincided with a
“background” pixel. Similarly a point was awarded if the test mask background coincided
with the “background” of the QSCM and vice versa. See Appendix C Section 10 for a
pseudocode description of the comparison of a test gesture mask and set of QSCMs. Figure 39
shows the comparison of the QSCM for the letter ‘e’ (Figure 38 right) with example masks of
the letters ‘c’ and ‘e’.


Figure 39 The comparison of the QSCM for the letter ‘e’ (Figure 38 right) with example masks of
the letters ‘c’ and ‘e’. Areas that achieve positive scores (background to background or skin to
skin match) are shown in green and those with negative scores (background to skin or skin to
background) are shown in yellow. The mask for the letter ‘e’ has many more areas of positive
score and fewer areas of negative score than the mask for the letter ‘c’.
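
The quantization and the point-scoring comparison just described might be sketched as follows; the threshold values of 0.8 and 0.2 are illustrative, not the values used in the project:

import numpy as np

def quantize_scm(scm, t_skin=0.8, t_background=0.2):
    # 2 = mostly skin, 0 = mostly background, 1 = ignored.
    qscm = np.ones_like(scm, dtype=int)
    qscm[scm >= t_skin] = 2
    qscm[scm <= t_background] = 0
    return qscm

def qscm_score(test_mask, qscm):
    # +1 for skin-to-"skin" or background-to-"background" agreement, -1 for a
    # mismatch; pixels quantized to 1 contribute nothing.
    test = np.asarray(test_mask, dtype=bool)
    agree = np.sum((test & (qscm == 2)) | (~test & (qscm == 0)))
    disagree = np.sum((test & (qscm == 0)) | (~test & (qscm == 2)))
    return int(agree) - int(disagree)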

The graph in Figure 40 shows the scores of a test gesture ‘c’ compared with the QSCMs of
gestures from ‘a’ through to ‘i’.


[Chart: comparison of QSCM match score for gesture ‘c’ with trained letters ‘a’ through to ‘i’. Vertical axis: QSCM match score; horizontal axis: two trained examples of each letter ‘a’ through to ‘i’.]

Figure 40 Comparison of test letter 'c' with trained examples from 'a' through to 'i'. The
examples of the letter ‘c’ achieve the top two comparison scores and none of the others achieve
similar scores except the letter ‘d’ which, although close, is still a minimum 1,400 points different.
This suggests that the template matching in the canonical frame recognition method is better
than both the area and radial length recognition methods.

Both examples of the letter ‘c’ stored matched better to the test gesture than any of the others.

Based on the results obtained for the three metrics it was decided to use the template matching
in the canonical frame recognition method as it was the only method that provided sufficient
information to differentiate the similar gestures reliably and because it was the easiest to adapt
to using multiple training examples of each gesture.

5.7 Refinement of the canonical frame


In order to make the differentiation of a large number of gestures accurate, it is essential that
the canonical frame is as invariant as possible to movements of a gesture in the original
frame. The current system uses the hand centroid as an anchor in the original frame and the
hand centroid to wrist band centroid distance as a scaling factor. However, although the
centroids are calculated using the average of a large number of pixels they are not as robust as
other methods considered below.

5.7.1 Scaling using average radial distance


With this method the scaling factor is obtained using the average distance from the hand
centroid to every skin pixel detected. This is more robust than the hand centroid to wrist band
centroid distance scaling factor as it does not involve the use of the wrist band centroid
(which is less robust as it is calculated using a smaller number of pixels). See Appendix C
Section 11 for a pseudocode description of scaling using the average radial distance.
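
A one-line sketch of this scaling factor, assuming the skin pixels and hand centroid are available as coordinate pairs:

import math

def average_radial_distance(skin_pixels, hand_centroid):
    # Mean distance from the hand centroid to every detected skin pixel.
    hx, hy = hand_centroid
    return sum(math.hypot(x - hx, y - hy) for (x, y) in skin_pixels) / len(skin_pixels)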

5.7.2 Shifting the hand in the canonical frame


This method translates the hand in the canonical frame based upon simple rules (e.g. shift up
until there are at least 40 skin pixels in the uppermost row). Once again this method makes the
canonical frame method more robust as it reduces the reliance on the hand centroid as an
anchor point. Several rules were considered, but the one that produced the best results
involved shifting the image in the canonical frame to the right until the wrist band was just off
the edge of the screen. This was performed by scanning columns of the canonical frame from
the right until the number of wrist band pixels detected fell to zero. The positioning in the y-
direction was calculated using the hand centroid as before. Figure 41 shows a gesture in the
canonical frame before and after translation.

Figure 41 Images showing the canonical frame before (left) and after (right) the x-axis shift. The y-axis position of the hand is dictated by the hand centroid as before.
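
A sketch of this shifting rule is given below; both masks are assumed to be lists of rows of booleans of the same size, and the exact scanning details are an interpretation of the description above rather than the project code:

def shift_band_off_right_edge(canonical_mask, band_mask):
    height, width = len(canonical_mask), len(canonical_mask[0])
    # Scan columns from the right: step over the wrist-band region, then stop
    # at the first column in which the number of band pixels falls to zero.
    col = width - 1
    while col >= 0 and not any(band_mask[y][col] for y in range(height)):
        col -= 1
    while col >= 0 and any(band_mask[y][col] for y in range(height)):
        col -= 1
    shift = (width - 1) - col
    # Shift every row to the right, padding the left with background.
    return [[False] * shift + row[:width - shift] for row in canonical_mask]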

It was decided to use both these methods.

5.8 Refinement of the training data


It was noticed that some gestures from the one handed sign language set exhibited more
variation than others. This was primarily the gestures where the fingers of the hand cast
shadows on the palm (such as the letters ‘e’ and ‘f’). The shadows cast varied greatly with a
small change of hand roll angle, causing different areas of the palm not to be detected within
the set. It was considered that these gestures would be at a disadvantage relative to those with
less variation, as the skin concentration map would have more ‘red’ areas, which carry less
weight in the comparison. For example, an extreme case would be a gesture that has no pixels
common to any of the teaching frames. The skin concentration map for this gesture would
therefore be entirely red. Any comparison method should give these high-variation pixels less
weight, so for this extreme example none of the pixels would contribute a high score even if
the test gesture was an example of a taught gesture.
A solution to this problem would be to cluster the training set for this gesture into several
different exemplars (or “sub-groups”), all of which would share the same gesture label. The
exemplars could be formed using the most similar gestures from the main group. This would
then guarantee that the amount of variation within any of the exemplars would be kept low
and therefore solve the problem. A simplified example of the clustering process is shown in
Figure 42.


A: Input    B: Output SCM without clustering    C: Output SCMs with clustering
Figure 42 A simplified example of how clustering improves recognition. In this case several examples
of each of three valid representations of the letter ‘c’ have been taught to the system. An example of
each of the three representations is shown (column A). The resultant SCM (column B) has a large
amount of ‘red’ area. Any comparison method should give these areas less weight, so this gesture
would be at a disadvantage relative to those with less variation. Column C shows the SCMs produced
after clustering. The three types of gesture input have been split into three separate SCMs, each with
much less ‘red’ area.

A greedy algorithm was devised to take the first gesture image in the training group and
compare it pixel by pixel with all other members of the group. See Appendix C Section 12 for
a pseudocode description of the comparison.
Any gesture images whose compared difference (in pixels) fell below a set threshold,
t_max, were then added to a sub-group and removed from the main group. Once all the gesture
images in the main group had been compared, the new first member of the main group could
be compared with all the remaining images, and so on. A threshold was also set to define the
minimum number of gesture images permitted in an exemplar. In the event that the number of
images in an exemplar fell below this threshold, the first member of the main group was
simply removed entirely, on the grounds that if it was so dissimilar from all the rest then it must
be an outlier and as such could be safely removed without greatly affecting recognition
quality. The process continued until no gesture images remained in the main group. See
Appendix C Section 13 for a pseudocode description of the clustering process.
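
A sketch of the greedy clustering, assuming each training image is a flattened 0/1 mask and using the report's values of t_max = 2500 and a minimum of four images per exemplar:

def cluster_gesture_masks(masks, t_max=2500, min_size=4):
    def pixel_difference(a, b):
        return sum(1 for pa, pb in zip(a, b) if pa != pb)

    remaining = list(masks)
    exemplars = []
    while remaining:
        seed = remaining[0]
        group = [m for m in remaining if pixel_difference(seed, m) <= t_max]
        if len(group) >= min_size:
            exemplars.append(group)                # keep this exemplar
            remaining = [m for m in remaining if m not in group]
        else:
            remaining = remaining[1:]              # discard the dissimilar seed as an outlier
    return exemplars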
Figure 43 shows the result of running the algorithm on sets of 100 gesture images of the
sign language letters ‘a’ through to ‘e’. The value of t_max in this case was 2500 pixels
different and a minimum of four gesture images were allowed in an exemplar.


Sign language letter   Number of exemplars   Number of gesture images in each exemplar
A                      6                     48, 12, 11, 5, 14, 5 (5 outliers)
B                      3                     81, 12, 6 (1 outlier)
C                      3                     41, 53, 6 (0 outliers)
D                      3                     73, 15, 12 (0 outliers)
E                      13                    11, 12, 10, 7, 12, 8, 6, 5, 7, 5, 4, 4, 4 (5 outliers)
Figure 43 Table showing the result of applying the segmentation algorithm to sets of 100 gesture
images of the sign language letters ‘a’ through to ‘e’. The gestures with the greatest amount of
shadowing are ‘a’ (due to the fingers resting against the palm) and ‘e’ (due to the suspended
fingers above the palm). Notice also how each of these gestures has five outliers. However, this is
only 5% of the total number of gesture images in the set so was not considered too large. The
gestures with no shadowing (‘c’ and ‘d’) are still clustered into more than one exemplar. This is
due to the range of positions the fingers can occupy and still present a valid version of this
gesture.

All of the training gesture image sets are clustered into at least three exemplars. As expected,
the gestures with the largest number of exemplars are those with the most shadowing (letters
‘a’ and ‘e’). Those with no shadowing (‘c’ and ‘d’) are also clustered into a small number of
exemplars as they involve a range of possible finger positions that still present a valid gesture.
A problem with clustering the training gesture image sets in this way is that it increases the
number of SCMs that need to be compared per frame in order to recognise a test gesture. For
instance, with no clustering, a set of 24 gestures would produce 24 SCMs to compare per
frame. If clustering produces 10 exemplars per gesture, then the number of SCMs increases to
240, with a consequent decrease in recognition frame rate. The choice of how much clustering
to perform is a trade-off between speed (less clustering) and accuracy (more clustering) and
should be chosen depending on the application. A compromise between the two was chosen
here.

5.9 Method of differentiation (in canonical frame)


As mentioned in Section 5.6.4, it is important for the comparison of the stored and test
canonical frames to be efficient. Three methods were considered:

5.9.1 Tree method with quantization


In Section 5.6.4 a method was discussed whereby a series of images of a given gesture can be
combined to form a skin concentration map (SCM). By subtracting two SCMs it is possible to
score each pixel on how effective it is at differentiating one gesture from the other (see Figure
37). This method cannot be easily extended to more than two gestures. However, if a set of
skin concentration maps are quantized into three values, say two for mostly skin, zero for
mostly background and one if neither, then the equivalent pixel in each of the maps can be
examined and that pixel added to a list if the quantized values over all the maps consisted
entirely of twos and zeros. The same pixel of a test gesture can then be queried. If it is skin,
then that would suggest that it is one of the gestures with mostly skin in that position, if not,
then one with mostly background. See Figure 44 for a simplified example of this process and
see Appendix C Section 14 for a pseudocode description.


Figure 44 Simplified example of how the pixels that split the set can be found. The four tables on
the left represent skin concentration maps. After quantization, the value of each pixel in the
quantized skin concentration map is either ‘0’, ‘1’ or ‘2’. The pixels that are either ‘0’ or ‘2’
across the set can then be found.

Although the process of quantization means that there is no strict guarantee provided by the
analysis of each individual pixel, the combined influence of the many pixels in the list
provides a better estimate.
With the tree method a group of pixels that split the set of exemplars roughly in two is
found. The greater the number of pixels the better the accuracy of the decision, so a
compromise has to be found between splitting the set into two halves and finding enough
pixels to accurately do so. See Appendix C Section 15 for a formal description of this
compromise.
Once the set is split the two subsets can be stored in the left and right branch of a tree
structure. The same process (of finding pixels that split the set in two) can be applied to both
subsets. The process continues until all subsets consist of a single gesture.
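
A sketch of the search for pixels that split a set of exemplars, assuming each QSCM is stored as a flat list of 0/1/2 values keyed by its exemplar label:

def splitting_pixels(qscms):
    labels = list(qscms)
    n_pixels = len(next(iter(qscms.values())))
    splits = []
    for i in range(n_pixels):
        values = {label: qscms[label][i] for label in labels}
        # Only "polarised" pixels (no value of 1 anywhere in the set) are usable.
        if all(v in (0, 2) for v in values.values()):
            skin_side = {l for l, v in values.items() if v == 2}
            background_side = set(labels) - skin_side
            if skin_side and background_side:      # the pixel must actually split the set
                splits.append((i, skin_side, background_side))
    return splits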
A program was written to perform the quantization and then scan all the pixels from all
the QSCMs for those that split the set roughly in two. Priority was given to finding sufficient
pixels so if on a given pass insufficient were found then the process was repeated but with
less emphasis on splitting the set exactly in two. After each split the location and value of all
the qualifying pixels was stored and a node of a tree structure filled. Both reduced sets of
gestures were then passed back into the splitting algorithm. The process was repeated until all
the bottom nodes of the tree consisted of a single gesture. See Appendix C Section 16 for a
pseudocode description of filling the tree structure.
Figure 45 shows the output of the algorithm for a set of five gestures from the one
handed sign language set (letters l, b, o, n and m).


[Figure: input gestures L, B, O, N, M; the sets of pixels found that split the set at each level, and the tree structure filled.]
Figure 45 An example of how the tree method works. At each level of the tree the number of skin
pixels under the green and yellow masks is counted. If the number under the green mask is larger
than that under the yellow mask the green branch is chosen. Alternatively the yellow branch is
chosen. The process is repeated until the bottom of the tree is reached.


The advantage of this system is that after the tree structure is filled, only a small number of
pixels need be analysed before the descent to the next tree level. As, at each stage, the number
of possible exemplars is split roughly in two, this method is very quick to execute. The
disadvantage of this method is that at the levels of the tree near the root, when the number of
exemplars is large, the number of pixels that split the set (even to split off a single exemplar)
is very small. During testing it was found that for a set of just 16 exemplars only 200 pixels
could be found to split off a single exemplar at the first level of the tree, greatly increasing the
possibility of error at this level. Another problem is that the tree can only be traversed
downwards: once it is decided to travel down one side of the tree, the exemplars represented
on the other side cannot be compared, even if they would provide a better match at a later
stage. For example, if the probability of correct branch traversal at each node is 98% or 0.98
(which corresponds to a 2% probability of failure) and the tree has 10 levels (all of which
must be traversed correctly), then the probability of success at the bottom is 0.98^10 ≈ 0.82
(which corresponds to a failure probability of 18%). This was reflected in the fact that, for a
set of more than eight different exemplars, the correct one was rarely recognised.

5.9.2 Template score method with quantization


With this method the quantization of the SCMs is performed as with the previous method. In
order to recognise the test gesture a score is calculated for each QSCM by looking at each
pixel in turn. Every pixel is scored as follows (see Appendix C Section 17 for a pseudocode
description):

• If the test gesture pixel is skin then a point is awarded to each of the QSCMs if the
value of that pixel is mostly skin.

• If the test gesture pixel is skin then a point is subtracted from each of the QSCMs if
the value of that pixel is mostly background.

• If the test gesture pixel is background then a point is awarded to each of the QSCMs
if the value of that pixel is mostly background.

• If the test gesture pixel is background then a point is subtracted from each of the
QSCMs if the value of that pixel is mostly skin.

• Otherwise no change is performed.

The final score for each QSCM can then be calculated by dividing the total score by the
maximum score possible (equal to the number of pixels over the template which are either
mostly skin or mostly background).
An advantage of this system is that each exemplar is judged separately so unlike the tree
method errors do not accumulate. A disadvantage is that a very large number of pixels have to
be examined for each of the QSCMs for a match to be made. Also, if a given training gesture
has a large amount of variation then there will be a large number of pixels which are neither
mostly skin nor mostly background in the QSCM (equivalent to a large amount of white area in
Figure 38 right), leaving large areas where no score can be awarded and increasing the
possibility that two exemplars will be difficult to differentiate.
To test the system, the same training and test gesture sets that were used with the radial
length metric were fed to the system. Figure 46 shows the results:


Gesture Recognised Gesture Recognised


T T E E
H H D D
E E SP SP
SP SP O O
1 1 V V
2 2 E E
3 3 R R
4 4 SP SP
5 5 T T
SP SP H H
Q Q E E
U U SP SP
I I 6 6
C C 7 7
K K 8 8
SP SP 9 9
B B 0 0
R U SP SP
O O L L
W W A A
N N Z Z
SP SP Y Y
F F SP SP
O O D D
X X O O
E E G G
S S S S
SP SP OP OP
J J CL CL
U U LC LC
M M RC RC
P P DC DC

Correct 63/64
Incorrect 1/64
False positives 60/64
False negatives 0/64
Figure 46 Results from a test of the template score method with quantization. All but one of the
test gestures was correctly identified and there were no false negatives. However, there were a
considerable number of false positives. This is due to the fact that the recognition score for a
couple of the gestures was low even though the correct gesture obtained the highest score. This
meant that the recognition threshold had to be set low and as such a number of intermediary
frames were incorrectly recognised as gestures.

5.9.3 Template score method with no quantization


With this method no quantization is performed. Instead, the amount of skin present over the
set of images within the exemplar is represented by a floating point number between –0.5 and
0.5 for each pixel (-0.5 representing all background over the set and 0.5 representing all skin).
The score is then calculated as follows (see Appendix C Section 18 for a pseudocode
description):

• Add this floating point number when the corresponding test gesture pixel is skin

• Subtract this floating point number when the corresponding test gesture pixel is
background

Pixels that have a large amount of variation do not affect the score by a significant amount as
their value is close to zero.
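
A sketch of this score, assuming the per-pixel skin fractions have already been shifted into the −0.5 to 0.5 range:

import numpy as np

def float_template_score(test_mask, shifted_scm):
    # shifted_scm holds skin fraction minus 0.5 per pixel; add it where the test
    # pixel is skin and subtract it where the test pixel is background.
    test = np.asarray(test_mask, dtype=bool)
    return float(np.sum(np.where(test, shifted_scm, -shifted_scm)))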
The advantage of this method is that no pixels are ignored, so even exemplars with a
large amount of gesture image variation are fully considered. A disadvantage is that many
pixels have to be considered for each SCM (as with the quantization method). This method
will also be slower than the previous method as many floating point calculations have to be
performed (rather than integer ones).
Once again the system was tested using the same gesture sets as before. The results are
shown in Figure 47:

Gesture Recognised Gesture Recognised


T T E E
H H D D
E E SP SP
SP SP O O
1 1 V V
2 2 E E
3 3 R R
4 4 SP SP
5 5 T T
SP SP H H
Q Q E E
U U SP SP
I I 6 6
C C 7 7
K K 8 8
SP SP 9 9
B B 0 0
R R SP SP
O O L L
W W A A
N N Z Z
SP SP Y Y
F F SP SP
O O D D
X X O O
E E G G
S S S S
SP SP OP OP
J J CL CL
U U LC LC
M M RC RC
P P DC DC

Correct 64/64
Incorrect 0/64
False positives 48/64
False negatives 0/64
Figure 47 Results from a test of the template score method with no quantization. All of the test
gestures were correctly identified this time and once again there were no false negatives. There
were a considerable number of false positives. This is for the same reason as with the previous
figure.

From looking at the results of each of the recognition methods it was clear that the method
with the best recognition score was the template score method with no quantization. Therefore
this method was chosen.

5.10 Refinement of template score method (no quantization)

Although the template score method correctly recognised all of the gestures in the test set it
seemed unnecessary to query such a large number of pixels to make the decision. Therefore
two methods were considered to perform the same task with the same accuracy but more
efficiently:


5.10.1 Removal of pixels that perform the same function


With the method described above, all of the pixels that are skin for any of the trained gesture
images are queried (roughly 38,000 pixels for each of the 300 exemplars that result after
clustering). It is likely, however, that many of these pixels perform largely the same job and
as such any duplicates need not be queried at all. A simple example of this is a training set
with only two gestures, say A and B. After the skin concentration maps are processed two
types of pixel will result, those that are mostly skin for A and mostly background for B and
those that are mostly skin for B and mostly background for A. However, in order to correctly
identify which gesture is presented it is only necessary to look at a single pixel from one of
the groups, preferably one which is always skin for one of the gestures and always
background for the other. It was decided to create an algorithm to find any duplicate pixels
and ignore them.
In order to make the process simpler it was decided to first quantize the pixel values into
three groups, ‘1’ if the skin concentration fell above a certain threshold, ‘0’ if it fell below
another threshold and ‘X’ otherwise. An identification string could then be generated for each
of the pixels across all of the groups of exemplars (one character in the string per exemplar
and one string per pixel). See Appendix C Section 19 for a pseudocode description of the
creation of these strings.
A procedure was written to compare each pixel string with all the others. A variable
containing the number of bits different was only incremented if a ‘1’ in one string matched
with a ‘0’ in the other or vice-versa. In other words the value ‘X’ was taken to mean either a
‘1’ or a ‘0’. If the number of “bits” different in the string fell below a certain threshold then
the two pixels were considered identical and as such one of the pixels could be discarded. If
this was the case then the procedure returned the string with the most ‘X’s so that this one
could be discarded (as this pixel contained less information). See Appendix C Section 20 for a
pseudocode description of the comparison.
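
A sketch of the duplicate-removal pass, assuming pixel_strings maps each pixel index to its identification string over the exemplars ('1', '0' or 'X'); the threshold of zero differing bits is illustrative:

def remove_duplicate_pixels(pixel_strings, max_bits_different=0):
    def bits_different(a, b):
        # 'X' is treated as matching either value.
        return sum(1 for ca, cb in zip(a, b) if 'X' not in (ca, cb) and ca != cb)

    kept = {}
    for index, string in pixel_strings.items():
        duplicate_of = None
        for kept_index, kept_string in kept.items():
            if bits_different(string, kept_string) <= max_bits_different:
                duplicate_of = kept_index
                break
        if duplicate_of is None:
            kept[index] = string                   # no duplicate found, keep the pixel
        elif string.count('X') < kept[duplicate_of].count('X'):
            del kept[duplicate_of]                 # keep the more informative of the two
            kept[index] = string
    return kept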
The procedure was run several times on the set of exemplars with different threshold
values. As the upper and lower thresholds were moved closer and closer to “all skin” and “all
background” respectively the identification strings contained more and more ‘X’s and as such
more and more pixels were identified as duplicates and therefore discarded. Similarly, as the
threshold on the number of differing bits allowed was increased, more and more pixel
identification strings were considered duplicates and were also discarded. Eventually so many
pixels were discarded that some of the gestures were no longer recognised correctly. The
widest thresholds that still permitted all of the gestures to be correctly recognised were
chosen. This reduced the number of queried pixels from 34,788 to 1,199 with a corresponding
30-fold increase of recognition speed. Figure 48 shows the pixels queried in order to identify
one of the exemplars for the letter ‘a’ before and after the duplicates were removed.


Figure 48 Images showing the pixels queried in order to detect one of the exemplars for the letter
‘a’ before and after removal of duplicates. The remaining pixels are mostly evenly spread over the
recognition area. Notice how the pixels near the wrist band are less concentrated, as the pixels in
this area are skin for almost all the trained gestures.

The results show that, after removal of duplicate pixels, the remaining pixels are evenly
spread over the recognition area except for the area near the wrist band, where a larger number
of duplicates exist. This is because most of the pixels near the wrist band are skin for all of the
trained gestures.

5.10.2 Sorting the pixels


If each pixel could be given a score based on how much information it provides about which
test gesture is being presented then it would be possible to sort them by this value. The
advantage of having a sorted pixel set is that the pixels with the worst scores need not be
queried at all, as they provide little extra information. This would make the system more
efficient. Once again, to make the problem simpler it was decided to use the quantized
information from the previous method. The best pixels would be those that have the fewest
‘X’s (as this does not give us any extra information) but also those which have similar
numbers of ‘1’s and ‘0’s. The reason for this is that, given a test gesture, repeated applications
of pixels such as these, most rapidly cut down the possible number of exemplars that match.
Therefore, the score for each pixel was calculated as follows:
abs (no _ of _ ones − no _ of _ zeros ) + no _ of _ Xs
It was then a simple matter to sort the pixels using this score, the pixels with the lowest score
being placed first.
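
Using the same identification strings, the sorting score might be sketched as:

def pixel_score(id_string):
    # Lower scores are better: balanced numbers of '1's and '0's and few 'X's.
    ones, zeros, xs = id_string.count('1'), id_string.count('0'), id_string.count('X')
    return abs(ones - zeros) + xs

# Usage sketch: query the most informative pixels first.
# ordered = sorted(pixel_strings, key=lambda i: pixel_score(pixel_strings[i]))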
After sorting, the lowest percentage of the pixels that still permitted all the gestures to be
detected was found by repeated tests. Using a combination of the two methods the number of
pixels queried was reduced from 34,788 to 1,026 (85.5% of 1,199). This corresponds to a
change in system frame rate from 0.5fps to 12.5fps (near real-time).

After application of both these methods all of the test gestures were still correctly recognised.
Therefore it was decided to use both. The results from the application of the set of test
gestures from before are shown in Figure 49:


Gesture Recognised Gesture Recognised


T T E E
H H D D
E E SP SP
SP SP O O
1 1 V V
2 2 E E
3 3 R R
4 4 SP SP
5 5 T T
SP SP H H
Q Q E E
U U SP SP
I I 6 6
C C 7 7
K K 8 8
SP SP 9 9
B B 0 0
R R SP SP
O O L L
W W A A
N N Z Z
SP SP Y Y
F F SP SP
O O D D
X X O O
E E G G
S S S S
SP SP OP OP
J J CL CL
U U LC LC
M M RC RC
P P DC DC

Correct 64/64
Incorrect 0/64
False positives 43/64
False negatives 0/64
Figure 49 Results of a test of the template score method with no quantization after sorting and
removal of duplicate pixels. All of the test gestures were correctly identified and there were no
false negatives. There were a considerable number of false positives, for the same reason as before.

5.11 Conclusion
In this section, three methods of recognition have been discussed. Firstly, area comparison
was considered. Although this was considered an unsuitable metric it was used in order to
focus the attention on the comparison architecture of any future system and the testing
methodology. The second method involved the comparison of radial length signatures. This
was more suitable, but it was found that the amount of information provided about individual
fingers was dependent on the relative angle of the radial and the long axis of the finger,
making some gestures hard to differentiate. Finally, template matching in the canonical frame
was considered and chosen as it provided the best results. Various refinements were then
made to increase recognition speed. Using the methods chosen a set of 42 gestures were all
correctly recognised at a frame rate of 12.5fps.


6 Application: Gesture driven interface

As a demonstration of the capabilities of the system, a standard Microsoft Windows computer was modified so that the only input device necessary was the hand.

6.1 Setup
The system was set up as in Figure 2. The template score (with no quantization) recognition
method was modified so that the recognised gesture generated mouse and keyboard events, as
shown in Figure 50.

Gesture Label Event Gesture Label Event


A Press key ‘A’ X Press key ‘X’
B Press key ‘B’ Y Press key ‘Y’
C Press key ‘C’ Z Press key ‘Z’
D Press key ‘D’ 0 Press key ‘0’
E Press key ‘E’ 1 Press key ‘1’
F Press key ‘F’ 2 Press key ‘2’
G Press key ‘G’ 3 Press key ‘3’
H Press key ‘H’ 4 Press key ‘4’
I Press key ‘I’ 5 Press key ‘5’
J Press key ‘J’ 6 Press key ‘6’
K Press key ‘K’ 7 Press key ‘7’
L Press key ‘L’ 8 Press key ‘8’
M Press key ‘M’ 9 Press key ‘9’
N Press key ‘N’ CA Press caps-lock key
O Press key ‘O’ RE Press return key
P Press key ‘P’ DO Press key ‘.’
Q Press key ‘Q’ SP Press spacebar
R Press key ‘R’ BS Press backspace key
S Press key ‘S’ LC Left mouse click
T Press key ‘T’ RC Right mouse click
U Press key ‘U’ DC Left double mouse click
V Press key ‘V’ OP Move mouse pointer
relative to hand centroid
position.
W Press key ‘W’ CL Left mouse button hold and
move mouse pointer relative
to hand centroid position.
Figure 50 Table showing the gesture labels and corresponding mouse or keyboard event.


In order to ignore transition movements of the hand, an event was only queued if five
identical contiguous gestures were recognised. Thereafter, further events were only processed
if the gesture changed (therefore, to type two identical letters a brief gesture change would
need to be interleaved).
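
A sketch of this queuing rule, assuming recognised_labels is the per-frame sequence of recognised gesture labels (None for a blank or transition frame):

def queue_events(recognised_labels, required=5):
    events = []
    run_label, run_length = None, 0
    for label in recognised_labels:
        run_length = run_length + 1 if label == run_label else 1
        run_label = label
        # Fire at most one event per run of identical gestures, so repeating a
        # letter needs a brief gesture change in between.
        if label is not None and run_length == required:
            events.append(label)   # translate into the key press or mouse event here
    return events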

6.2 Demonstration
To demonstrate the system in use, the following sequence of actions was performed using
the hand alone:
• The explorer icon on the task bar was clicked in order to restore it.

• The floppy drive was selected.


• A right click brought up a menu and a new text document was created.


• This document was renamed “my demo.txt”.

• A right click brought up a menu and a new folder was created.


• This folder was renamed “demo folder”.

• The text document was then dragged into the folder.

• The folder was double clicked to open it.


• The text document was then double clicked to edit it.


• The following text was then typed into the document:


“This is a demo of my 4th year project.
I CAN TURN CAPS LOCK ON and off.
I can also use the space and backspace keys.
Finally… I can control the mouse.
ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890.”
• The document was then closed and the changes saved.

• Finally, the folder was closed and dragged to the top left of the directory window.

During the demonstration six letter errors were made, two of which were due to operator
error.

An AVI movie file of a similar sequence is available at:


http://users.ox.ac.uk/~ball0622/index_files/demo.avi
(Recognition frame rate in the video example is slightly reduced due to the effect of the
screen capture software.)


7 Conclusion

7.1 Project Goals


The goal of this project was to create a system to recognise a set of 36 gestures at a rate of
25fps. The developed template matching in the canonical frame system accurately recognised
a set of 46 gestures at a rate of 12.5fps on a 600 MHz system. It was considered that a modern
computer system would therefore allow the project goals to be exceeded. Furthermore, the
system performance is amongst the best reported in existing literature.

7.2 Further Work


Collection of additional gesture information: The final system developed recognised gestures
using silhouette information alone. Although this was sufficient for the number of trained
gestures, the accuracy would doubtless suffer if the number of gestures were increased. In
order to remedy this, extra information about the test gesture would have to be gathered, such
as edge information.

Combination of area, radial length and template matching in the canonical frame: It was
noticed that each of the different recognition metrics demonstrated different benefits. For
instance, the area metric differentiated ‘a’ and ‘c’ well, the radial metric differentiated ‘b’ and
‘c’ well and template matching in the canonical frame differentiated ‘d’ and ‘c’ well.
Therefore a weighted combination of all three metrics would result in the highest accuracy.

Removal of wrist band: The system relies on the user wearing a coloured wrist band to
remove various degrees of freedom, making recognition, via comparison, possible. It would
be advantageous if this were not the case. There are methods (see Section 2.3) that could be
used to perform the recognition without a wrist band, but they would be unlikely to be as
accurate.

Using temporal coherence to improve recognition accuracy: English written text has
temporal coherence in that each letter has a probability of being followed by a given letter.
For instance, the letter ‘q’ is often followed by the letter ‘u’ but rarely any other letter. These
probabilities could be used to improve recognition accuracy by combining the list of top
scoring exemplars with the probability of each following the preceding letter. The same
process could also be used to permit standard American one-handed sign language to be used
(where the letters ‘O’, ‘V’ and ‘W’ are the same as the numbers ‘0’, ‘2’ and ‘6’ respectively-
see Appendix B) instead of the modified version.

Increase of the number of recognised gestures: For the purposes of a man-machine interface a
relatively small set of gestures (≈100) would be sufficient and is therefore within the bounds
of the final system developed. However, if detection of hand gestures for computer animation
is required (for instance), then the number of trained gestures would need to be in the
thousands. A system which relies on both training and comparison of all gestures used would
not be sufficient for this task. Further work, therefore, could involve the implementation of a
gesture recognition system which does not require training. An example of this is the direct
method based on hand geometry considered in Section 5.1.

Multi-stage gestures: It would be possible to represent a much larger number of labels if each
label consisted of two or more gestures combined with hand position changes. For instance,
the “wave hello” label could correspond to the open hand gesture with an alternating increase
and decrease of hand yaw angle and the “thumbs-up” label could correspond to the letter ‘m’
followed by the space gesture.

Two-handed sign language: It would be possible, using two different coloured gloves and two
different coloured wrist bands, to detect the gesture signed by both hands whilst both are in
the frame. A method would have to be devised to detect a gesture (or range of gestures) that is
represented by a partially occluded hand. This method would be considerably harder to
implement. It is important to note, however, that although the gesture of both hands could be
recognised this would not permit the recognition of the full American sign language as this
involves recognising many other features including facial expression and arm position.


8 References
[Bauer & Hienz, 2000] Relevant feature for video-based continuous sign language
recognition. Department of Technical Computer Science, Aachen University of Technology,
Aachen, Germany, 2000.

[Bowden & Sarhadi, 2000] Building temporal models for gesture recognition. In proceedings
British Machine Vision Conference, 2000, pages 32-41.

[Bretzner & Lindeberg, 1998] Use your hand as a 3-D mouse or relative orientation from
extended sequences of sparse point and line correspondences using the affine trifocal tensor.
In proceedings 5th European Conference on Computer Vision, 1998, pages 141-157.

[Davis & Shah, 1994] Visual gesture recognition. In proceedings IEEE Visual Image Signal
Process, 1994, vol.141, No.2, pages 101-106.

[Starner, Weaver & Pentland, 1998] Real-time American sign language recognition using a
desk- and wearable computer-based video. In proceedings IEEE transactions on Pattern
Analysis and Machine Intelligence, 1998, pages 1371-1375.


9 Appendix
9.1 Appendix A- Glossary
Hand roll The rotation of the hand about an axis defined by the wrist. The
following three images show the same gesture with increasing
roll.

Hand yaw The rotation of the hand about an axis defined by the camera
view direction. The following three images show the same
gesture with increasing yaw.

HSL Colour space defined by hue, saturation and luminosity. Also called HSV (hue, saturation and intensity value).
Jitter map A map created using a number of examples of the same gesture.
The colour of each pixel in the map is defined by the amount of
variation exhibited by the corresponding pixel across all of the
examples (the greatest variation is where the pixel is skin for half
of the examples and background for the other half).
Silhouette information Detection of all skin within the hand without any feature
detection (the same information that would be contained in a
silhouette of the hand).
Skin concentration map A map created using a number of examples of the same gesture.
The colour of each pixel in the map is defined by the amount the
corresponding pixel across all of the examples was skin (the
greatest skin concentration is where the pixel is skin for all of the
examples).


9.2 Appendix B- Entire Gesture Set


The letters and number gestures are based on the American one-handed sign language. Letters
‘J’ and ‘Z’ were modified as they were moving gestures. Numbers ‘0’, ‘2’ and ‘6’ were
modified as they were identical to the letters ‘O’, ‘V’ and ‘W’ respectively.

A B C D E F G

H I J K L M N

O P Q R S T U

V W X Y Z 1 2

3 4 5 6 7 8 9

0 DO RE BS SP CA LC

RC DC OP CL


9.3 Appendix C- Algorithms


C.1 Area of gesture detection method
A formal description of the area of gesture detection method is as follows:
The detected set of pixels from before is L
The area of a given gesture can therefore be calculated thus:
    a = Σ_{x ∈ L} 1
A training sequence of n gestures can then be given and manually labelled. We denote a
single (gesture, label) pair by:
    (a_i, l_i)    e.g. (a_1, ‘A’), (a_2, ‘B’)
Define this training set as:
    G = {(a_i, l_i)}, i = 1..n
Given a test image with signature a_new, choose the label l_{i_min} where
    i_min = arg min_{i=1..n} ‖a_new − a_i‖²

C.2 Radial length calculation


A formal description of radial length calculation is as follows:
Examine a typical radial at angle θ.
The score for that radial is:
    radscore(θ) = Σ_{x ∈ R_θ} S(x)
where S() is the skin pixel predicate defined earlier and where
    R_θ = { (x, y) : (x, y) = (c_x, c_y) + r (cos θ, sin θ), for all r > 0 }
The signature for a given gesture g could then be calculated as:
    g = [ radscore(θ) ÷ max_α radscore(α) ],  0 ≤ θ < 2π

C.3 Radial signature comparison


A formal description of the radial signature comparison is as follows:
From before, the signature for a given gesture g could be calculated as:
    g = [ radscore(θ) ÷ max_α radscore(α) ],  0 ≤ θ < 2π
A training sequence of n gestures can then be given and manually labelled. We denote a
single (gesture, label) pair by:
    (g_i, l_i)    e.g. (g_1, ‘A’), (g_2, ‘B’)
Define this training set as:
    G = {(g_i, l_i)}, i = 1..n
Given a test image with signature g_new, choose the label l_{i_min} where
    i_min = arg min_{i=1..n} ‖g_new − g_i‖²

C.4 Transformation into the canonical frame


A pseudocode description of the process of transformation into the canonical frame is as
follows:
Define the new hand and band centroids after refinement as c*_hand and c*_band.
The vector joining the two centroids is:
    v_dif = (x_dif, y_dif) = c*_hand − c*_band
The radius scaling factor and angle shift to be used in canonicalisation can then be defined as:
    r_canonicalscalefactor = |v_dif|
    θ_canonicalshift = tan⁻¹(y_dif ÷ x_dif)
Define the anchor of the canonical frame as x*_canonicalanchor, say (160, 120).
The set of all remaining skin pixel locations after refinement is L.
For each x* ∈ L:
    v_pixel = (x_pixel, y_pixel) = x* − c*_hand
    r_pixel = |v_pixel|
    θ_pixel = tan⁻¹(y_pixel ÷ x_pixel)
The transformation into the canonical frame then proceeds as follows:
    Pixel distance scaling: r_scaledpixel = r_pixel × (100 ÷ r_canonicalscalefactor)
    Pixel angle rotation: θ_scaledpixel = (θ_pixel + θ_canonicalshift) mod 2π
The equivalent pixel in the canonical frame is then:
    x*_canonical = x*_canonicalanchor + r_scaledpixel (cos θ_scaledpixel, sin θ_scaledpixel)

C.5 Pixel “pull” from the canonical frame


A pseudocode description of the pixel “pull” from the canonical frame is as follows:
For all pixels x*_canonical within the canonical frame:
    v_canonical = (x_canonical, y_canonical) = x*_canonical − x*_canonicalanchor
    r_canonical = |v_canonical|
    θ_canonical = tan⁻¹(y_canonical ÷ x_canonical)
The pixel “pull” from the original frame then proceeds as follows:
    Inverse pixel distance scaling: r_invscaledpixel = r_canonical ÷ (100 ÷ r_canonicalscalefactor)
    Inverse pixel angle rotation: θ_invscaledpixel = (θ_canonical − θ_canonicalshift) mod 2π
The equivalent pixel in the original frame is then:
    x* = c*_hand + r_invscaledpixel (cos θ_invscaledpixel, sin θ_invscaledpixel)
If x* ∈ L then mark the pixel in the canonical frame (x*_canonical) as skin, otherwise mark it as background.

C.6 Creation of jitter maps


A pseudocode description of the process by which the jitter maps are created is as follows:
Each of the n images is defined as a mask (0 for background, 1 for skin): M_{j,i}, j = 0..n
Define the number of skin pixels across the set as: n_skin
Define the number of background pixels across the set as: n_background
Define an array to store the variation (or jitter) of each pixel: V_i
For each pixel i:
    n_skin = 0
    n_background = 0
    For each image j:
        If M_{j,i} is skin then increment n_skin else increment n_background
    The variation (0–1) for pixel i is then:
        If n_background < n then V_i = abs(n_skin − n_background) ÷ n else V_i = −1

The jitter map can then be generated by colouring each pixel:
    Black if V_i = −1
    else
        Blue if V_i = 0
        Red if V_i = 1
        and colours in between

C.7 Creation of skin concentration maps (SCM)


A pseudocode description of the process by which the skin concentration maps are created is
as follows:
Define an array to store the skin concentration of each pixel: C_i
For each pixel i:
    n_skin = 0
    n_background = 0
    For each image j:
        If M_{j,i} is skin then increment n_skin else increment n_background
    The skin concentration (0–1) for pixel i is then:
        If n_background < n then C_i = n_skin ÷ n else C_i = −1

The skin concentration map can then be generated by colouring each pixel:
    Black if C_i = −1
    else
        Blue if C_i = 1
        Red if C_i = 0
        and colours in between

C.8 Creation of skin concentration difference map


A pseudocode description of the process by which the skin concentration difference map is
created as follows:
The two skin concentration maps are stored in the form of arrays, CA_i and CB_i
Define an array to store the difference of each pixel: D_i
For each pixel i:
    D_i = abs(CA_i − CB_i)

The skin concentration difference map can then be generated by colouring each pixel:
    Black if D_i = 0
    Red if D_i = 1
    and colours in between

C.9 Creation of quantized skin concentration map (QSCM)


A pseudocode description of the creation of the quantized skin concentration maps (QSCM) is
as follows:
Define the upper skin concentration threshold as t_U (say 0.8)
Define the lower skin concentration threshold as t_L (say 0.2)
Define a quantized map Q_i based upon a skin concentration map C_i using the following rule:
For each pixel i:
    Q_i = 2 if C_i ≥ t_U
    Q_i = 0 if C_i ≤ t_L
    Q_i = 1 otherwise

C.10 Comparison of a test gesture mask and set of QSCMs


A pseudocode description of the comparison of a test gesture mask and set of QSCMs is as
follows:
Given a set of n quantized skin concentration maps Q_{j,i}, j = 0..n, that have been manually
labelled, we can denote a single (gesture, label) pair by:
    (Q_j, l_j)    e.g. (Q_1, ‘A’), (Q_2, ‘B’)
Define this training set as:
    G = {(Q_j, l_j)}, j = 0..n
Given a test image with mask M_i, calculate the score for each concentration map thus:
Define an array of scores s_j where s_j = 0 for j = 0..n
For each QSCM j:
    For each pixel i:
        s_j = s_j + 1 if (M_i = 1) and (Q_{j,i} = 2)
        s_j = s_j + 1 if (M_i = 0) and (Q_{j,i} = 0)
        s_j = s_j − 1 if (M_i = 1) and (Q_{j,i} = 0)
        s_j = s_j − 1 if (M_i = 0) and (Q_{j,i} = 2)
        s_j unchanged otherwise

C.11 Scaling the hand using the average radial distance


A pseudocode description of scaling using the average radial distance is as follows:
The set of all remaining skin pixel locations after refinement is L
Define the total radius as r_tot = 0
For each x* ∈ L:
    v_pixel = (x_pixel, y_pixel) = x* − c*_hand
    r_tot = r_tot + |v_pixel|
The average radius is then defined as r_tot ÷ |L|

C.12 Comparison of two examples of a gesture


A pseudocode description of the comparison process between two examples of a single
gesture (A and B) is as follows:
Define the number of pixels different as n_different
Each of the two examples is defined as a mask (0 for background, 1 for skin): MA and MB,
each with 320 × 240 = 76,800 pixels.
For each pixel i of MA:
    If MA_i ≠ MB_i then increment n_different
The maximum difference threshold can be defined as t_max (say, 2500 pixels)
The two masks are then sufficiently similar for clustering if n_different ≤ t_max

C.13 Process by which a set of gesture images is clustered


A pseudocode description of the process by which the set of gesture images is clustered is as
follows:
Each of the n gesture images is defined as a mask (0 for background, 1 for skin): M_j, j = 0..n
Place each of the masks within an initial set SInit = {M_0, M_1, ..., M_n}
Define a set of m exemplars S_l = {{ }_0, { }_1, ..., { }_m}
Define the minimum number of masks permitted in an exemplar as t_min (say four)

Perform the clustering as follows:

l = 0
For each mask j = 0 to j = (n − 2):
    For each mask k = (j + 1) to k = (n − 1):
        If M_j is sufficiently similar to M_k (see algorithm above) then
            Remove M_k from SInit and add to S_l
    If the number of elements in S_l ≥ t_min then
        Remove M_j from SInit and add to S_l
        Increment l
    Else
        Remove all elements from S_l and replace in SInit

C.14 Finding the pixels that split the set of exemplars


A pseudocode description of the process by which pixels are found to split the set of possible
exemplars is as follows:
Define the number of “ones” across the set as: n_ones
Define a set containing the “twos” exemplar labels: S_Twos = { }
Define a set containing the “zeros” exemplar labels: S_Zeros = { }
Define a set that contains:
• The location of each “polarised” pixel (all “zeros” and “twos”)
• A set containing the “twos” exemplar labels for that pixel
• A set containing the “zeros” exemplar labels for that pixel
        S_Polarised = {(x, y, { }, { })}
Given the set of n quantized skin concentration maps Q_{j,i}, j = 1..n, from before
For each pixel i:
        n_ones = 0
        S_Twos = { }
        S_Zeros = { }
        For each QSCM j:
                If Q_{j,i} = 0 then add exemplar label j to S_Zeros
                If Q_{j,i} = 1 then increment n_ones
                If Q_{j,i} = 2 then add exemplar label j to S_Twos
        If n_ones = 0 then
                Add the location of pixel i, the set S_Zeros and the set S_Twos to
                S_Polarised

Now take a pixel k of a test mask M_k:
If pixel k is skin then that suggests the mask is an example of one of the S_Twos exemplars
If pixel k is not skin then that suggests the mask is an example of one of the S_Zeros
exemplars
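
A minimal sketch of the search for polarised pixels, assuming the quantized maps of the
exemplars are stacked into a single NumPy array Q_stack of shape (n, height, width):

    import numpy as np

    def find_polarised_pixels(Q_stack):
        """Return (y, x, zeros_labels, twos_labels) for every pixel that is never quantized to 1."""
        polarised = []
        no_ones = ~np.any(Q_stack == 1, axis=0)          # pixels with n_ones = 0
        for y, x in zip(*np.nonzero(no_ones)):
            column = Q_stack[:, y, x]
            zeros = set(np.nonzero(column == 0)[0])      # exemplar labels giving "0" here
            twos = set(np.nonzero(column == 2)[0])       # exemplar labels giving "2" here
            polarised.append((y, x, zeros, twos))
        return polarised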

C.15 The compromise between splitting the set into two halves
and finding enough pixels to accurately do so
A formal description of this compromise is as follows:
S_Polarised can be scanned to find the sets of pixels for which:
        S_Zeros and S_Twos are identical
or
        S_Zeros and S_Twos are identically opposite (because such pixels split the set of
        exemplars in the same way, only with the polarity reversed)

A compromise then has to be found between finding a large set of pixels and a set that splits
the set of exemplars as accurately in two as possible (a set for which S_Zeros and S_Twos are
roughly of the same size).
Store the eventual pixels decided upon in the set S_Split
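
One possible way of realising this compromise is sketched below; the exact weighting is not
fixed here, so preferring balance first and pixel count second is purely an assumption:

    from collections import defaultdict

    def choose_split(polarised):
        """Group polarised pixels by the split they induce and pick one group as S_Split."""
        groups = defaultdict(list)
        for y, x, zeros, twos in polarised:
            # Pixels inducing the same split, or the identically opposite one, share a key.
            key = frozenset((frozenset(zeros), frozenset(twos)))
            groups[key].append((y, x))

        def balance(key):
            a, b = (len(s) for s in key)
            return min(a, b) / max(a, b)             # 1.0 corresponds to an exact half/half split

        best = max(groups, key=lambda k: (balance(k), len(groups[k])))
        return groups[best], best                    # the pixel set S_Split and its label split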

C.16 Filling the tree structure


A pseudocode description of filling the tree structure is as follows:
Define a stack with two procedures, push() to add an element to the top of the stack and pop()
to remove the topmost stack element.
Define a pointer p to point to a given tree node.
Take the set S_Polarised and find the best compromise between splitting the set exactly in
two and finding sufficient pixels to do so, giving a set of pixels S_Split and two sets of
exemplar labels, S_Zeros and S_Twos.
The root of the tree is simply S_Split. The left branch of each node deals with the exemplars
within the S_Zeros set and the right branch with the S_Twos set.
First set p to point to the root of the tree.
Filling the tree then proceeds as follows:
Start: Fill node p with S_Split.
        If S_Zeros contains more than one exemplar label:
                If S_Twos contains more than one exemplar label:
                        Push the S_Twos exemplar labels and the right child of p onto the stack
                Else
                        Fill the right child of p with the single exemplar label in S_Twos.
                Set p to point to the left child of p. Repeat the operation to find the pixels
                that split the set of exemplars, but with the reduced set of exemplars labelled
                within S_Zeros. Goto Start.
        Else
                Fill the left child of p with the single exemplar label in S_Zeros.
                If S_Twos contains more than one exemplar label:
                        Set p to point to the right child of p. Repeat the operation to find the
                        pixels that split the set of exemplars, but with the reduced set of
                        exemplars labelled within S_Twos. Goto Start.
                Else
                        Fill the right child of p with the single exemplar label in S_Twos.
                        If the stack contains any elements then
                                Pop an element off the stack. Set p to point to the node popped.
                                Repeat the operation to find the pixels that split the set of
                                exemplars, but with the reduced set of exemplars labelled within
                                the element popped off the stack. Goto Start.
                        Else
                                Finished!
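
A minimal sketch of this construction, written recursively rather than with the explicit stack
used above (find_split is assumed to be a helper that applies the procedures of C.14 and C.15
to a reduced set of exemplar labels and returns the chosen pixels together with the S_Zeros
and S_Twos label sets):

    class Node:
        def __init__(self, split_pixels=None, label=None):
            self.split_pixels = split_pixels   # S_Split stored at an internal node
            self.label = label                 # exemplar label stored at a leaf
            self.left = None                   # subtree for the S_Zeros labels
            self.right = None                  # subtree for the S_Twos labels

    def build_tree(exemplar_labels, find_split):
        """Recursively build the decision tree over a set of exemplar labels."""
        if len(exemplar_labels) == 1:
            return Node(label=next(iter(exemplar_labels)))
        split_pixels, zeros, twos = find_split(exemplar_labels)
        node = Node(split_pixels=split_pixels)
        node.left = build_tree(zeros, find_split)
        node.right = build_tree(twos, find_split)
        return node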

C.17 Template scoring method (with quantization)


A pseudocode description of the template scoring method (as shown before) is as follows:
Given a test image with mask M_i, calculate the score for each of the n quantized skin
concentration maps thus:
Define an array of scores s_j and an array smax_j, each initialised to zero
For each template j:
        For each pixel i:
                s_j = s_j + 1   if (M_i = 1) and (Q_{j,i} = 2)
                s_j = s_j + 1   if (M_i = 0) and (Q_{j,i} = 0)
                s_j = s_j - 1   if (M_i = 1) and (Q_{j,i} = 0)
                s_j = s_j - 1   if (M_i = 0) and (Q_{j,i} = 2)
                s_j unchanged   otherwise
                If Q_{j,i} = 2 then increment smax_j

Recognition of the top scoring gesture is then performed by choosing the label l_jmax where:
        jmax = argmax over j = 1..n of ( s_j / smax_j )
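
A minimal sketch of this normalised template scoring, assuming mask and each template Q are
NumPy arrays of the same shape and the templates carry their gesture labels:

    import numpy as np

    def recognise_quantized(mask, templates):
        """templates: list of (Q, label) pairs with Q in {0, 1, 2}; returns the best label."""
        best_label, best_ratio = None, -np.inf
        for Q, label in templates:
            score  = np.sum((mask == 1) & (Q == 2))   # skin seen where skin was expected
            score += np.sum((mask == 0) & (Q == 0))   # background seen where expected
            score -= np.sum((mask == 1) & (Q == 0))   # skin seen where background was expected
            score -= np.sum((mask == 0) & (Q == 2))   # background seen where skin was expected
            s_max = np.count_nonzero(Q == 2)          # the smax_j count defined above
            ratio = score / s_max if s_max else -np.inf
            if ratio > best_ratio:
                best_label, best_ratio = label, ratio
        return best_label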

C.18 Template scoring method (with no quantization)


A pseudocode description of the scoring method with no quantization is as follows:
Given a set of n skin concentration maps (0-1) C_{j,i}, j = 1..n, which have been manually
labelled, we can denote a single (gesture, label) pair by (C_j, l_j), e.g. (C_1, 'A'), (C_2, 'B').

Define this training set as:
        G = {(C_j, l_j)}, j = 1..n

Given a test image with mask M_i, calculate the score for each concentration map thus:
Define an array of scores s_j
For each SCM j:
        For each pixel i:
                If M_i = 1 then
                        s_j = s_j + (C_{j,i} - 0.5)
                Else
                        s_j = s_j - (C_{j,i} - 0.5)

Then choose the label l_jmax where:
        jmax = argmax over j = 1..n of s_j
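
A minimal sketch of the unquantized scoring, assuming the raw concentration maps are stacked
into one NumPy array of shape (n, height, width) with a parallel list of labels:

    import numpy as np

    def recognise_unquantized(mask, C_stack, labels):
        """Score a binary mask against every skin concentration map and return the best label."""
        signs = np.where(mask == 1, 1.0, -1.0)               # +1 on skin pixels, -1 on background
        scores = ((C_stack - 0.5) * signs).sum(axis=(1, 2))  # one score per concentration map
        return labels[int(np.argmax(scores))]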

C.19 Creation of the pixel identification strings


A pseudocode description of the creation of the pixel identification strings is as follows:
Define the identification strings as an array of characters ID_{j,i}
Define the upper skin concentration threshold as t_U (say 0.8)
Define the lower skin concentration threshold as t_L (say 0.2)
Given the set of n skin concentration maps (0-1) C_{j,i}, j = 1..n
Create the identification strings as follows:
For each pixel i:
        For each SCM j:
                ID_{j,i} = '1'   if C_{j,i} ≥ t_U
                ID_{j,i} = '0'   if C_{j,i} ≤ t_L
                ID_{j,i} = 'X'   otherwise
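
A minimal sketch of building the strings, one string per pixel with one character per
concentration map (array layout and threshold defaults as in the earlier sketches):

    def identification_strings(C_stack, t_upper=0.8, t_lower=0.2):
        """Return a dict mapping (y, x) to a string of '1'/'0'/'X' characters, one per SCM."""
        n, height, width = C_stack.shape
        strings = {}
        for y in range(height):
            for x in range(width):
                chars = []
                for j in range(n):
                    c = C_stack[j, y, x]
                    chars.append('1' if c >= t_upper else '0' if c <= t_lower else 'X')
                strings[(y, x)] = ''.join(chars)
        return strings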

C.20 String comparison process


A pseudocode description of the string comparison process is as follows:
Define the number of bits different as n_bits
Define the maximum number of bits different below which two strings are considered “equal”
as t_max (say 2)
Define the number of ‘X’s in string A as n_XsA
Define the number of ‘X’s in string B as n_XsB
Given two identification strings IDA_j and IDB_j the comparison is as follows:
For each SCM j:
        If IDA_j = ‘X’ then increment n_XsA
        If IDB_j = ‘X’ then increment n_XsB
        If ((IDA_j = ‘1’) and (IDB_j = ‘0’)) or ((IDA_j = ‘0’) and (IDB_j = ‘1’)) then increment n_bits
If n_bits ≤ t_max then
        If n_XsA ≤ n_XsB then
                Strings are “equal”, so use the pixel corresponding to string A
        Else
                Strings are “equal”, so use the pixel corresponding to string B
Else
        Strings are not equal, so do not discard either pixel
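
A minimal sketch of this comparison; t_max defaults to the example value suggested above:

    def compare_id_strings(id_a, id_b, t_max=2):
        """Return 'A' or 'B' (keep only that pixel; the strings are "equal") or None (keep both)."""
        n_bits = sum(1 for a, b in zip(id_a, id_b)
                     if (a, b) in (('1', '0'), ('0', '1')))
        if n_bits > t_max:
            return None                                # strings differ, so keep both pixels
        n_xs_a = id_a.count('X')
        n_xs_b = id_b.count('X')
        return 'A' if n_xs_a <= n_xs_b else 'B'        # prefer the more decisive string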
