Ray Lockton
Balliol College
Oxford University
Supervisor: Dr. A.W. Fitzgibbon
Department of Engineering Science
1 Contents
1 CONTENTS
2 INTRODUCTION
2.1 REPORT OVERVIEW
2.2 PROJECT SUMMARY
2.3 EXISTING SYSTEMS
3 DETECTION
3.1 CHOICE OF SENSORS
3.2 HARDWARE SETUP
3.3 CHOICE OF VISUAL DATA FORMAT
3.4 COLOUR CALIBRATION
3.5 METHOD OF COLOUR DETECTION
3.6 CONCLUSION
4 REFINEMENT
4.1 ANALYSIS OF DISTORTION
4.2 REMOVAL OF SKIN PIXELS DETECTED AS WRIST BAND PIXELS
4.3 REMOVAL OF SKIN PIXELS DETECTED FROM FOREARM
4.4 CONCLUSION
5 RECOGNITION
5.1 CHOICE OF RECOGNITION STRATEGY
5.2 SELECTION OF TEST GESTURE SET
5.3 ANALYSIS OF RECOGNITION PROBLEM
5.4 RECOGNITION METHOD 1: AREA METRIC
5.5 RECOGNITION METHOD 2: RADIAL LENGTH SIGNATURE
5.6 RECOGNITION METHOD 3: TEMPLATE MATCHING IN THE CANONICAL FRAME
5.7 REFINEMENT OF THE CANONICAL FRAME
5.8 REFINEMENT OF THE TRAINING DATA
5.9 METHOD OF DIFFERENTIATION (IN CANONICAL FRAME)
5.10 REFINEMENT OF TEMPLATE SCORE METHOD (NO QUANTIZATION)
5.11 CONCLUSION
6 APPLICATION: GESTURE DRIVEN INTERFACE
6.1 SETUP
6.2 DEMONSTRATION
7 CONCLUSION
7.1 PROJECT GOALS
7.2 FURTHER WORK
8 REFERENCES
9 APPENDIX
9.1 APPENDIX A: GLOSSARY
9.2 APPENDIX B: ENTIRE GESTURE SET
9.3 APPENDIX C: ALGORITHMS
2 Introduction
This project will design and build a man-machine interface using a video camera to interpret
the American one-handed sign language alphabet and number gestures (plus others for
additional keyboard and mouse control).
The keyboard and mouse are currently the main interfaces between man and computer.
In other areas where 3D information is required, such as computer games, robotics and
design, other mechanical devices such as roller-balls, joysticks and data-gloves are used.
Humans communicate mainly by vision and sound; a man-machine interface would therefore be more intuitive if it made greater use of vision and audio recognition. Another advantage is that the user not only can communicate from a distance, but need have no physical contact with the computer. Moreover, unlike audio commands, a visual system would be preferable in noisy environments or in situations where sound would cause a disturbance.
The visual system chosen was the recognition of hand gestures. The amount of computation required to process hand gestures is much greater than that for the mechanical devices; however, standard desktop computers are now fast enough to make this project, hand gesture recognition using computer vision, a viable proposition.
Examples of possible applications of such a system include:
• Man-machine interface: using hand gestures to control the computer mouse and/or keyboard functions. An example of this, implemented in this project, controls various keyboard and mouse functions using gestures alone.
• Visualisation: Just as objects can be visually examined by rotating them with the
hand, so it would be advantageous if virtual 3D objects (displayed on the computer
screen) could be manipulated by rotating the hand in space [Bretzner & Lindeberg,
1998].
• Computer games: Using the hand to interact with computer games would be more
natural for many applications.
• Control of mechanical systems (such as robotics): Using the hand to remotely control
a manipulator.
• The refined shape information will then be compared with a set of predefined training
data (in the form of templates) to recognise which gesture is being signed. In
particular, the contribution of this project is a novel way of speeding up the
comparison process. A label corresponding to the recognised gesture will be
displayed on the monitor screen. Figure 1 (front cover) shows the successful
recognition of a series of gestures. The design process for the recognition will be
discussed in Chapter 5.
• Finally, Chapter 7 describes how the project has achieved the goals set and further
work that could be carried out.
There are two main ways of collecting the data needed to recognise hand gestures:
• A glove with sensors attached that measure the position of the finger joints.
• An optical method.
An optical method was chosen, since it is more practical (many modern computers come with a camera attached), more cost-effective, and has no moving parts, making it less likely to be damaged through use.
The first step in any recognition system is collection of relevant data. In this case the raw
image information will have to be processed to differentiate the skin of the hand (and various
markers) from the background. Chapter 3 deals with this step.
Once the data has been collected it is then possible to use prior information about the hand
(for example, the fingers are always separated from the wrist by the palm) to refine the data
and remove as much noise as possible. This step is important because as the number of
gestures to be distinguished increases the data collected has to be more and more accurate and
noise free in order to permit recognition. Chapter 4 deals with this step.
The next step will be to take the refined data and determine what gesture it represents. Any
recognition system will have to simplify the data to allow calculation in a reasonable amount
of time (the target recognition rate for a set of 36 gestures is 25 frames per second). Obvious
ways to simplify the data include translating, rotating and scaling the hand so that it is always
presented with the same position, orientation and effective hand-camera distance to the
recognition system. Chapter 5 deals with this step.
Figure 3 Table showing existing gesture recognition systems found during research.
3 Detection
In order to recognise hand gestures it is first necessary to collect information about the hand
from raw data provided by any sensors used. This section deals with the selection of suitable
sensors and compares various methods of returning only the data that pertains to the hand.
Stereographic system: The stereographic system would provide pixellated depth information
for any point in the fields of view of the cameras. This would provide a great deal of
information about the hand. Features that would otherwise be hard to distinguish using a 2D
system, such as a finger against a background of skin, would be differentiable since the finger
would be closer to the camera than the background. However, the 3D data would require a great
deal of processor time to calculate and reliable real-time stereo algorithms are not easily
obtained or implemented.
Multiple two dimensional view system: This system would provide less information than the
stereographic system and, if the number of cameras used were not great, would also use less
processor time. With this system two or more 2D views of the same hand, provided by
separate cameras, could be combined after gesture recognition. Although each view would
suffer from similar problems to that of the “finger” example above, the combined views of
enough cameras would reveal sufficient data to approximate any gesture.
Single camera system: This system would provide considerably less information about the
hand. Some features (such as the finger against a background of skin in the example above)
would be very hard to distinguish since no depth information would be recoverable.
Essentially only “silhouette” information (see Glossary) could be accurately extracted. The
silhouette data would be relatively noise free (given a background sufficiently distinguishable
from the hand) and would require considerably less processor time to compute than either
multiple camera system.
It is possible to detect a large subset of gestures using silhouette information alone and the
single camera system is less noisy, expensive and processor hungry. Although the system
exhibits more ambiguity than either of the other systems, this disadvantage is more than
outweighed by the advantages mentioned above. Therefore, it was decided to use the single
camera system.
Lighting: The task of differentiating the skin pixels from those of the background and markers
is made considerably easier by a careful choice of lighting. If the lighting is constant across
the view of the camera then the effects of self-shadowing can be reduced to a minimum (see
Figure 4). The intensity should also be set to provide sufficient light for the CCD in the
camera.
Figure 4 The effect of self shadowing (A) and cast shadowing (B). The top three images were lit
by a single light source situated off to the left. A self-shadowing effect can be seen on all three,
especially marked on the right image where the hand is angled away from the source. The bottom
three images are more uniformly lit, with little self-shadowing. Cast shadows do not affect the
skin for any of the images and therefore should not degrade detection. Note how an increase of
illumination in the bottom three images results in a greater contrast between skin and
background.
However, since this system is intended to be used by the consumer, it would be a disadvantage if special lighting equipment were required. It was decided to attempt to extract the hand and
marker information using standard room lighting (in this case a 100 watt bulb and shade
mounted on the ceiling). This would permit the system to be used in a non-specialist
environment.
Camera orientation: It is important to carefully choose the direction in which the camera
points to permit an easy choice of background. The two realistic options are to point the
camera towards a wall or towards the floor (or desktop). However since the lighting was a
single overhead bulb, light intensity would be higher and shadowing effects least if the
camera was pointed downwards.
Colour or black and white: The camera and video card available permitted the detection of
colour information. Although using intensity alone (black and white) reduces the amount of
data to analyse and therefore decreases processor load it also makes differentiating skin and
markers from the background much harder (since black and white data exhibits less variation
than colour data). Therefore it was decided to use colour differentiation.
RGB or HSL: The raw data provided by the video card was in the RGB (red, green, blue)
format. However, since the detection system relies on changes in colour (or hue), it could be
an advantage to use HSL (hue, saturation, luminosity; see Glossary) to permit the separation
of the hue from luminosity (light level). To test this the maximum and minimum HSL pixel
colour values of a small test area of skin were manually calculated. These HSL ranges were
then used to detect skin pixels in a subsequent frame (detection was indicated by a change of
pixel colour to white). The test was carried out three times using either hue, saturation or
luminosity colour ranges to detect the skin pixels. Next, histograms were drawn of the number
of skin pixels of each value of hue, saturation and luminosity within the test area. Histograms
were also drawn for an equal sized area of non-skin pixels. The results are shown in Figure 5:
Figure 5 Results of detection using individual ranges of hue (left), saturation (centre) and
luminosity (right) as well as histograms showing the number of pixels detected for each value of
skin (top) and background (bottom). Images and graphs show that hue is a poor variable to use
to detect skin as the range of values for skin hue and background hue demonstrate significant
overlap (although this may have been due to the choice of hue of the background). Saturation is
slightly better and luminosity is the best variable. However, a combination of saturation and
luminosity would provide the best skin detection in this case.
The histogram test was repeated using the RGB colour space. The results are shown in Figure
6.
Figure 6 Histograms showing the number of pixels detected for each value of red (left), green
(centre) and blue (right) colour components for skin pixels (top) and background pixels (bottom).
The ranges for each of the colour components are well separated. This, combined with the fact
that using the RGB colour space is considerably quicker than using HSL suggests that RGB is
the best colour space to use.
Figure 7 shows recognition using red, green and blue colour ranges in combination:
Figure 7 Skin detection using red, green and blue colour ranges in combination. Detection is
adequate and frame rate over twice as fast as the HSL option.
Hue, when compared with saturation and luminosity, is surprisingly bad at skin differentiation
(with the chosen background) and thus HSL shows no significant advantage over RGB.
Moreover, since conversion of the colour data from RGB to HSL took considerable processor
time it was decided to use RGB.
A formal description of the initial calibration method is as follows. The image is a 2D array of pixels, each with colour components r(x), g(x) and b(x), where x denotes the pixel location. Let J denote the set of pixel locations within the preset calibration area. The colour ranges for this area are then defined as:

r_max = max_{x∈J} r(x)    r_min = min_{x∈J} r(x)
g_max = max_{x∈J} g(x)    g_min = min_{x∈J} g(x)
b_max = max_{x∈J} b(x)    b_min = min_{x∈J} b(x)
Using this method skin pixels were detected at a rate of 15 fps on a 600 MHz laptop (see Figure 9).
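As a concrete illustration of this calibration and detection loop, here is a minimal sketch in Python with NumPy. The report does not include source code, so the function names and the calibration-box parameterisation are assumptions:

```python
import numpy as np

def calibrate_ranges(frame, box):
    """Scan a preset calibration area J and record the per-channel
    minima and maxima. frame: H x W x 3 uint8 RGB; box: (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    region = frame[y0:y1, x0:x1].reshape(-1, 3).astype(int)
    return region.min(axis=0), region.max(axis=0)

def detect(frame, lo, hi):
    """Boolean mask of pixels whose R, G and B values all fall inside
    the calibrated cuboid [lo, hi]."""
    return np.all((frame >= lo) & (frame <= hi), axis=2)
```

The multiple frame variant discussed below simply folds each new frame's extrema into the running ranges, e.g. lo = np.minimum(lo, new_lo) and hi = np.maximum(hi, new_hi) over 10 frames.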
Figure 9 After calibration of skin and wrist band pixels, colour ranges are used to detect all
subsequent frames. A detected frame is shown here with skin pixel detection indicated in white
and wrist band pixel detection indicated in red.
It was decided to use the calibration routine, discussed in Section 3.4.1, to find the initial
values. However, the ranges returned by this method were less than perfect for the reasons
below:
• The calibration was carried out using a single frame; hence pixel colour variations
over time, due to camera noise, would not be accounted for.
• To ensure that the sampled area only contained skin pixels, by necessity it had to be
smaller than the hand itself. The extremities of the hand (where, due to self-
shadowing, the colour variation is the greatest) were therefore not included in the
calibration.
1. Multiple frame calibration: If the calibration was repeated over several frames and
the overall maximum and minimum colour values calculated, then the variation over
time due to camera noise would be included in those ranges and its effect thus
negated. The method would require the hand to be held stationary during the
calibration process.
The routine was thus modified to perform the calibration over 10 frames instead of
one. Figure 10 shows the results.
Figure 10 Results of multiple frame calibration. Stage A is the result of the initial calibration.
Stage B is the result of calibration over 10 frames. There is no discernible difference in the skin
fit.
Calibration of several frames does little to improve skin detection. Therefore this
method was not retained.
Figure 11 A simplified illustration of how region-growing works. Image A shows the initial
captured hand. Image B shows the result of initial calibration, detected pixels are shown in white.
For simplicity’s sake the pixels that fall within the initial colour ranges have been drawn as a
square. In practice, all pixels within the ranges will have been identified (these pixels would be
scattered throughout the hand area). Next, any pixels in the neighbourhood of those already
detected are scanned (the area within the black box of image C). If their colour values lie just
outside the current colour ranges, the ranges are increased to include them. The result is shown
in image D (again simplified). Although the pixels between the index and middle fingers fell
within the boundary, their values did not fall close to the ranges, so they were ignored. The
process is then repeated (images E and F) until, in theory, the ranges are such that all skin pixels
are detected.
Figure 12 Results of region-growing. Stage A is the result of the initial calibration. Stage B is the
result of 50 repetitions of the region-growing algorithm (the fit is better still but a single
erroneous pixel, circled and arrowed, has been detected in the background). Stage C is the result
of 100 repetitions. The background noise is growing even though the shadowed areas of the hand
are still not detected adequately. Finally, by Stage D with 200 repetitions there is a considerable
amount of background noise.
The results show that performing the region-growing process a small number of times
results in slightly better detection but the process becomes noisy if the number of
repetitions is too high (>100). It was decided to keep this method but restrict its
growth to a maximum of 50 repetitions.
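A sketch of this bounded region-growing, following the description in Figure 11. The tolerance tol and the 4-neighbourhood are assumptions; the report does not state the exact values used:

```python
import numpy as np

def grow_ranges(frame, lo, hi, tol=8, max_reps=50):
    """Widen the RGB cuboid (lo, hi): neighbours of already-detected
    pixels whose colours lie just outside the current ranges (within
    `tol`) are absorbed; growth stops after `max_reps` repetitions."""
    lo, hi = lo.astype(int), hi.astype(int)
    for _ in range(max_reps):
        mask = np.all((frame >= lo) & (frame <= hi), axis=2)
        # 4-neighbourhood of the currently detected region
        nb = np.zeros_like(mask)
        nb[1:, :] |= mask[:-1, :]
        nb[:-1, :] |= mask[1:, :]
        nb[:, 1:] |= mask[:, :-1]
        nb[:, :-1] |= mask[:, 1:]
        cand = frame[nb & ~mask].astype(int)
        # keep only candidates that lie *just* outside the current ranges
        near = np.all((cand >= lo - tol) & (cand <= hi + tol), axis=1)
        if not near.any():
            break
        lo = np.minimum(lo, cand[near].min(axis=0))
        hi = np.maximum(hi, cand[near].max(axis=0))
    return lo, hi
```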
A further calibration step considered was to capture an image when the hand is not in the frame and store any (aberrant) pixels detected.
Ignoring these pixels would affect the recognition depending on where the hand was
in the frame. It would therefore be necessary to choose the correct value for the
aberrant pixel based on the value of those surrounding it (if all surrounding pixels are
skin then detect as skin, else background). However, neither the camera nor the
background exhibited such pixels when a hand was in frame so it was decided not to
proceed in programming this calibration step.
Figure 13 Plots of different combinations of skin pixel colour values (green) and background
pixel colour values (red). The skin pixels are well separated from the background pixels in all
three colour components but lie within an ellipsoid as opposed to a cuboid. The values are well
enough separated, however, for a cuboid colour range system to work adequately.
In order to improve accuracy it would be necessary to check if the colour components of the
skin and wrist band pixels fell within this ellipsoid. However, this was considered
computationally intensive and given that the current cuboid system works adequately it was
not implemented.
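For reference, the ellipsoid test that was considered but not implemented could look like the following sketch, here realised with a Mahalanobis distance. This is one plausible realization only: the report does not specify how the ellipsoid would be fitted, and the threshold r2 is an assumption.

```python
import numpy as np

def fit_ellipsoid(samples):
    """Fit a mean and inverse covariance to calibration samples (N x 3 RGB)."""
    mu = samples.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(samples, rowvar=False))
    return mu, cov_inv

def inside_ellipsoid(pixels, mu, cov_inv, r2=9.0):
    """True where (x - mu)^T C^-1 (x - mu) <= r2, i.e. inside the
    ellipsoid; r2 = 9 corresponds to three standard deviations."""
    d = pixels - mu
    return np.einsum('...i,ij,...j', d, cov_inv, d) <= r2
```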
3.6 Conclusion
This chapter has described the choice and setup of hardware and the methods of calibration
and detection in order to detect as many of the skin and marker pixels within the frame as
possible. The hardware chosen was a single colour camera pointing down towards a desk (or
floor) surface of a constant colour with no special lighting. Calibration is performed by
scanning the RGB colour values of pixels within a preset area of the frame and improved
using a limited amount of region-growing. Detection is performed by comparing each RGB
pixel value with ranges found during calibration. Figure 12 Stage B shows the successful
detection of the majority of the hand area.
4 Refinement
Using the methods discussed in the previous chapter it is possible to detect the majority of the
skin and band pixels in the frame whilst detecting very few aberrant pixels in the background.
However, some complications were noticed which could reduce the accuracy of recognition at
a later stage. These are:
1. Image distortion: If the camera’s visual axis is not perpendicular to the floor plane, a
given gesture would appear different depending on the position and yaw of the hand
(a given length in one area of the frame would appear longer or shorter in another
area of the frame). This is termed projective distortion. Also, if the camera lens is of
poor quality then the straight sides of a true square in the frame would appear curved.
This is termed radial distortion.
2. Skin pixels detected as wrist band pixels: If the wrist band colour ranges are increased
sufficiently for all pixels to be detected then areas of skin that are more reflective
(such as the knuckles) start to be incorrectly identified as band pixels. This is
disadvantageous as it leads to inaccurate recognition information.
3. Skin pixels of the arm being detected: Any skin pixels above the wrist band will also
be detected as skin. It would be preferable if these pixels could be ignored, as they
play no part in the gesture. Wearing a long sleeve top helps solve the problem but
forearm pixels are still detected between the wrist band and the sleeve (which has a
tendency to move up and down the arm as different gestures are made, leading to
variations in the amount of skin detected).
Figure 14 A4 card placed in the frame. If the camera had significant radial distortion, the
straight edges of the paper would appear as curves. This is not the case so radial distortion is not
significant.
The straight sides of the paper are imaged not as curves, but as straight lines, therefore radial
distortion is not present.
[Figure 15: strip lengths measured at three positions in the frame. A: 101 pixels; B: 102 pixels; C: 99 pixels.]
There is slight image distortion present, but its effect is limited to only 6% and it was therefore not considered serious enough to remove (removal would involve transforming a distorted rectangle to a regular one, which would be processor intensive).
A formal description of centroid calculation is as follows. From before, the set of all skin pixel locations was defined as:

L = { x | S(r(x), g(x), b(x)) = 1 }

Denote the number of elements of L by |L|. This gives the hand centroid as:

c_hand = (1 / |L|) Σ_{x∈L} x

The wrist band centroid is calculated in the same way:

c_band = (1 / |L_band|) Σ_{x∈L_band} x
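In NumPy these centroids reduce to a mean over the detected masks; a minimal sketch (the mask names are assumptions):

```python
import numpy as np

def centroid(mask):
    """Mean (x, y) location of all True pixels in a boolean detection mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

c_hand = centroid(skin_mask)   # skin_mask, band_mask: H x W boolean arrays
c_band = centroid(band_mask)
```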
Figure 16 shows an original image and the image with the detected skin pixels, wrist band
pixels and centroids visible.
Figure 16 Original image before skin and wrist band pixel detection (A) and after (B). Detected
skin pixels are shown in blue and wrist band pixels in red. Centroids are displayed as black dots.
Notice how even with priority given to skin pixels over wrist band pixels, a number of wrist
band pixels are erroneously detected near the knuckles (where skin has not been detected due
to the higher reflectivity of those areas).
The edges of the wrist band can be found by scanning lines parallel to the line joining the two
centroids.
Define the vector joining the two centroids as:

c_dif = (x_dif, y_dif) = c_hand − c_band

The yaw angle of the hand is therefore:

θ_hand = tan⁻¹(y_dif / x_dif)

The edges of the band are then found as follows. For each point p_1(s_1) along the line

p_1(s_1) = c_band + s_1 ( cos(θ_hand + π/2), sin(θ_hand + π/2) ),    −50 ≤ s_1 ≤ 50

count the number of wrist band pixels n(s_1) along the line

p_2(s_1, s_2) = p_1(s_1) + s_2 ( cos θ_hand, sin θ_hand ),    −50 ≤ s_2 ≤ 50

The two points defining the edges of the band, b_left = (x_left, y_left) and b_right = (x_right, y_right), are then equal to p_1(s_1) where n(s_1) falls below a certain threshold.
Figure 17 shows a number of the lines scanned (reduced for clarity) along with a graph
showing the thresholds used in the program to detect the band edges.
Figure 17 The left image shows the lines scanned to detect the edges of the wrist band. The
number of wrist band pixels detected along each line is counted. The edges have been detected
when the number falls below a certain threshold. The graph on the right shows the number of
pixels detected along each of the lines with the detected edges marked in red.
Using these thresholds it is then possible to utilize only those wrist band pixels that are within
the band’s width. This removes any remaining erroneous wrist band pixels detected near the
knuckles.
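A sketch of this scan, following the geometry of the formulas above. The count threshold and variable names are assumptions:

```python
import numpy as np

def band_edges(band_mask, c_band, theta, half=50, thresh=5):
    """Count wrist band pixels n(s1) along lines parallel to the
    centroid-joining line, as in Figure 17; the outermost s1 values
    whose counts stay above `thresh` bracket the band edges."""
    h, w = band_mask.shape
    perp = np.array([np.cos(theta + np.pi / 2), np.sin(theta + np.pi / 2)])
    along = np.array([np.cos(theta), np.sin(theta)])
    hits = []
    for s1 in range(-half, half + 1):
        p1 = c_band + s1 * perp
        n = 0
        for s2 in range(-half, half + 1):
            x, y = (p1 + s2 * along).astype(int)
            if 0 <= x < w and 0 <= y < h and band_mask[y, x]:
                n += 1
        if n >= thresh:
            hits.append(s1)
    if not hits:
        return None, None          # no band found in the scanned window
    b_left = c_band + hits[0] * perp
    b_right = c_band + hits[-1] * perp
    return b_left, b_right
```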
The radius of the band is:

r_band = max( |b_left − c_band|, |b_right − c_band| )

Any band pixels further than r_band from c_band can then be disqualified.
Figure 18 shows the wrist band pixels that have passed this radius test and the recalculated
centroid (passed pixels shown in yellow, radius indicated by black circle).
Figure 18 Radius test applied to wrist band pixels. Any pixels that are further from the wrist
band centroid than the band radius (black circle) previously calculated can be ignored (pixels
that pass shown in yellow, those that fail in red)
The minimum distance between the hand centroid and the edges of the band is:

r_hand = min( |b_left − c_hand|, |b_right − c_hand| )

The maximum and minimum angles of the band (θ_band_max and θ_band_min) relative to c_hand are:

θ_band_max = max( tan⁻¹(y_left / x_left), tan⁻¹(y_right / x_right) )
θ_band_min = min( tan⁻¹(y_left / x_left), tan⁻¹(y_right / x_right) )

Any hand pixels further than r_hand from c_hand and lying between θ_band_max and θ_band_min can then be disqualified (a case statement deals with the situation that occurs when the band angles lie either side of 0 radians).
Figure 19 shows the angle and distance criterion being applied, with skin pixels that fail
highlighted in green.
Figure 19 Distance and angle criterion applied to skin pixels. The two straight black lines show
the angle in which the radius criterion is applied. The curved black line shows the radius beyond
which skin pixels are disqualified. In this example failed skin pixels are shown in green.
Finally the hand centroid can be recalculated. This is shown in Figure 20.
Figure 20 Image showing recalculated hand and wrist band centroids. Invalid wrist band pixels
have been ignored (passed pixels shown in yellow, failed pixels in red) and skin pixels up the
forearm have also been ignored.
4.4 Conclusion
This chapter has described several techniques to improve the hand detection. A combination
of pixel position and priority based information was used to remove any erroneous detected
pixels. Figure 21 shows that the process was very successful.
Figure 21 Detected pixels before and after refinement. The detected wrist band pixels are shown
in red. Notice how after refinement the erroneous wrist band pixels detected on the knuckles
have been ignored, with a corresponding shift in wrist band centroid. The detected skin pixels are
shown in blue. All of the hand pixels are detected except those in areas of higher reflectivity (near
the knuckles) which naturally show up as white. Notice how after refinement all skin pixels
detected up the forearm have been ignored; with a corresponding shift in hand centroid.
5 Recognition
In the previous two chapters, methods were devised to obtain accurate information about the
position of skin and wrist band pixels. This information can then be used to calculate the hand
and wrist band centroids with subsequent data pertaining to hand rotation and scaling. The
next step is to use all of this information to recognise the gesture within the frame.
Direct method based on geometry: Knowing that the hand is made up of bones of fixed width
connected by joints which can only flex in certain directions and by limited angles it would be
possible to calculate the silhouettes for a large number of hand gestures. Thus, it would be
possible to take the silhouette information provided by the detection method and find the most
likely gesture that corresponds to it by direct comparison. The advantages of this method are
that it would require very little training and would be easy to extend to any number of
gestures as required. However, the model for calculating the silhouette for any given gesture
would be hard to construct and in order to attain a high degree of accuracy it would be
necessary to model the effect of all light sources in the room on the shadows cast on the hand
by itself.
Learning method: With this method the gesture set to be recognised would be “taught” to the
system beforehand. Any given gesture could then be compared with the stored gestures and a
match score calculated. The highest scoring gesture could then be displayed if its score was
greater than some match quality threshold. The advantage of this system is that no prior
information is required about the lighting conditions or the geometry of the hand for the
system to work, as this information would be encoded into the system during training. The
system would be faster than the above method if the gesture set was kept small. The
disadvantage with this system is that each gesture would need to be trained at least once and
for any degree of accuracy, several times. The gesture set is also likely to be user specific.
It was decided to proceed with the learning method for reasons of computation speed and ease
of implementation.
The British sign language alphabet requires both hands, one often occluding the other. This is outside the project remit. However, there is an American one-
handed sign language alphabet, which, with slight modification, can be used (see Appendix
B).
[Bar chart: comparison of area for gesture 'c' with trained letters 'a' through to 'i'; the vertical axis shows the area difference (0 to 8000), the horizontal axis shows two trained examples of each letter.]
Figure 22 Comparison of test letter 'c' with pairs of trained examples from 'a' through to 'i'.
Although the score is low for the letter ‘c’, the scores for several of the other gestures are also low.
Any of the gestures below the broken line could be misinterpreted as the letter ‘c’. This suggests,
as predicted, that area is not a good comparison metric to use (although the letters ‘a’, ‘e’, ‘g’
and ‘i’ are well differentiated from ‘c’).
As predicted, area is not a good comparison metric as several other trained gestures (‘b’, ‘d’
and ‘h’) also exhibited a similar area to the test letter ‘c’.
Figure 23 Example gesture with radials marked. The black radial lengths can easily be measured
(length in pixels shown). However, the red radials present a problem in that they either cross
between fingers or palm and finger.
However, a problem (as shown in Figure 23) is how to measure when the radial crosses a gap
between fingers or between the palm and a finger. To remedy this it was decided to count the
total number of skin pixels along a given radial. This is shown in Figure 24.
Figure 24 One of the problem radials with outlined solution. If only the skin pixels along any
given radial are counted then the sum is the “effective” length of that radial. In this case the
radial length is 46 + 21 = 67.
All of the radial measurements could then be scaled so that the longest radial was of constant
length. By doing this, any alteration in the hand camera distance would not affect the radial
length signature generated. See Appendix C Section 2 for a formal description of the radial
length calculation.
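A sketch of the effective radial length calculation. The number of radials, the maximum radius and the one-pixel sampling step are assumptions:

```python
import numpy as np

def radial_signature(skin_mask, c_hand, n_radials=100, max_r=200):
    """Count skin pixels along each radial from the hand centroid
    (the 'effective' length of Figure 24), then scale so that the
    longest radial has constant length."""
    h, w = skin_mask.shape
    sig = np.zeros(n_radials)
    angles = np.linspace(0.0, 2.0 * np.pi, n_radials, endpoint=False)
    for i, ang in enumerate(angles):
        d = np.array([np.cos(ang), np.sin(ang)])
        for r in range(max_r):
            x, y = (c_hand + r * d).astype(int)
            if 0 <= x < w and 0 <= y < h and skin_mask[y, x]:
                sig[i] += 1            # gaps between fingers add nothing
    return sig / sig.max()             # longest radial scaled to 1.0
```

Yaw independence (Figure 27) follows by offsetting the starting angle by the hand yaw θ_hand, so that radial 0 always points the same way relative to the hand.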
Figure 25 Open hand gesture in several different positions and yaw angles. The histogram for
each gesture is largely the same shape but shifted dependent on the yaw of the hand.
Figure 26 Images showing the histogram for two different gestures. The two histograms are
sufficiently different to permit differentiation.
radial direction. This, along with the maximum radial length scaling, makes the system robust
against changes in hand position, yaw and distance from camera. Figure 27 shows the same
open hand gesture (as in Figure 25) in a variety of positions and yaw angles.
Figure 27 The same open hand gesture as before in a variety of different positions and yaw
angles, but with hand yaw independence. The histograms for all the gestures are similar so it
should be possible to recognise this gesture from a set of different gestures.
The radial measurements are very similar no matter how the hand is positioned.
Figure 28 Successful recognition of several different gestures. Gesture recognised is shown at the
top left of the frame. The gestures are recognised correctly even though the yaw of the test hand
is different from that taught.
Figure 29 The left image shows 100 radials in their original pattern. However, this pattern does
not give the necessary concentration bias towards the fingers. The image on the right shows 200
radials reorganised so that twice as many lie over the fingers as the rest of the hand (150 over the
fingers and 50 elsewhere).
The graph in Figure 31 shows that the radial length metric is considerably better than the area
metric at differentiating this series of gestures. However, ‘c’ and ‘i’ have very similar low
scores even though the signs are physically different.
[Bar chart: total of differences in number of pixels along radials (0 to 4000) for test letter 'c' against two trained examples of each letter 'a' through to 'i'.]
Figure 31 Comparison of test letter 'c' with trained examples from 'a' through to 'i'. The score (total difference) is low for the letter ‘c’ and high for most of the other gestures. However, one example of the letter ‘i’ also achieves a good (low) comparison score even though the gesture corresponding to the letter ‘i’ is dissimilar to that of the letter ‘c’. Nevertheless, the range of scores is considerably better than that of the area recognition method discussed earlier.
Figure 32 On the left is the original image and on the right is a representation of the data
provided by the radial length recognition system. The amount of information provided about
individual fingers is dependent on the angle of the radial covering that finger which means that
gestures involving the poorly represented fingers will not be well differentiated.
Due to the organisation of the radials, the amount of information provided about individual
fingers is dependent on the relative angle of the radial and the long axis of the finger (the
shallower the angle the more information is provided). This is obviously an inadequate
situation as gestures involving the parts of the hand that are not well covered would be hard to
differentiate.
Correct 55/64
Incorrect 9/64
False positives 33/64
False negatives 2/64
Figure 33 Results from a test of the radial length recognition method. Several of the test gestures
were incorrectly recognised. There were also a number of false positives and two false negatives
(the number of false positives and negatives is dependent on a threshold above which a score is
considered to have been caused by a valid gesture).
Figure 34 On the left is the original image and on the right is the image after transformation into
the canonical frame. However, after scaling up from the original frame, gaps appear between the
pixels which would make the recognition comparison unreliable.
The problem is that scaling up from the original frame to the canonical frame results in gaps
between pixels. This would be disadvantageous in recognition as a specific pixel in the
trained set may not match up with a corresponding pixel in the test gesture and as such would
not score.
Figure 35 The left two images show two different examples of the same gesture at different
positions and rotations. The right two images show the corresponding images in the canonical
frame. Performing a pixel “pull” rather than a pixel “push” means that the problem of gaps
between pixels no longer occurs. The two gestures look similar in the canonical frame, most of the
differences being caused by shadowing.
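A sketch of the pixel "pull" into the canonical frame: each destination pixel looks up its source pixel through the inverse rotation, scaling and translation, so no gaps can appear. The frame size, scale convention and names are assumptions:

```python
import numpy as np

def to_canonical(skin_mask, c_hand, theta, scale, size=128):
    """Pixel 'pull': for every pixel of the size x size canonical frame,
    apply the inverse transform and sample the source mask, rather than
    'pushing' source pixels into the canonical frame."""
    h, w = skin_mask.shape
    out = np.zeros((size, size), dtype=bool)
    cos_t, sin_t = np.cos(-theta), np.sin(-theta)
    for v in range(size):
        for u in range(size):
            du = (u - size / 2) / scale     # destination offset, unscaled
            dv = (v - size / 2) / scale
            x = int(c_hand[0] + du * cos_t - dv * sin_t)
            y = int(c_hand[1] + du * sin_t + dv * cos_t)
            if 0 <= x < w and 0 <= y < h:
                out[v, u] = skin_mask[y, x]
    return out
```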
Figure 36 shows the jitter maps for the one handed sign language letters m, n and l (40
examples of each gesture were used).
Figure 36 Jitter maps for the letters m, n and l respectively (40 examples of each gesture were
used). The most variation (most red) occurs near the edges of the hand. Greater influence should
therefore be given to the bluer pixels for the purposes of recognition.
As expected, the largest amount of variation occurs near the edges of the hand. Therefore, in
the recognition of these gestures, greater weight should be given to the bluer pixels. It would
also be advantageous to combine the information given by maps such as those in Figure 36 to
find the pixels that best differentiate them. In order to facilitate this a program was first
written to create a map where the value of each pixel is dictated by the proportion that the
corresponding pixels across the training set were skin. These images were termed “skin
concentration maps” (SCMs). See Appendix C Section 7 for a pseudocode description of the
creation of these skin concentration maps.
A simple subtraction of the SCMs for two sets of gestures could then be performed to
find the pixels that best differentiate the two (the best pixels being those that are mostly
background on one set and mostly skin on the other). See Appendix C Section 8 for a
pseudocode description of the creation of a skin concentration difference map.
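Both maps are simple per-pixel statistics over the canonical-frame training masks; a sketch (the array shapes are assumptions):

```python
import numpy as np

def skin_concentration_map(masks):
    """SCM: the fraction of training masks in which each canonical-frame
    pixel was skin. `masks` is an N x H x W boolean array."""
    return masks.mean(axis=0)

def scm_difference(scm_a, scm_b):
    """Pixels near +1 or -1 are mostly skin in one gesture and mostly
    background in the other: the best pixels for telling the two apart."""
    return scm_a - scm_b
```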
Figure 37 shows the skin concentration maps for the letters m and n and the result of the
subtraction of the two.
Figure 37 The top two images are skin concentration maps for the letters m and n respectively.
As expected the skin is most concentrated at the centre of the hand (blue areas) and least
concentrated near the edges (red areas). The bottom image is the result of an image subtraction
of the top two. The best pixel areas to differentiate these two gestures lie just beyond the knuckles
of the letter n and in the shadowed area of the letter m (coloured red).
The best pixels to differentiate the letters m and n (coloured red) lie just beyond the knuckles
of the letter n and in the shadowed area of the letter m.
Both jitter and skin concentration maps are a compact way of representing the large amount
of data created during training. However, skin concentration maps proved more useful for the
purposes of gesture comparison and so were chosen.
See Appendix C Section 9 for a pseudocode description of the creation of the quantized skin
concentration maps. Figure 38 shows an example skin concentration map before and after
quantization.
Figure 38 An example SCM of the letter ‘e’ before and after quantization (left and right
respectively). Any areas below a certain “cold” threshold are considered “skin” (coloured blue),
all those above another “hot” threshold considered “background” (coloured red). All other areas
are ignored (coloured white).
A score was then calculated by comparing the test gesture mask with each quantized skin
concentration map (QSCM). A point was awarded if the test mask skin pixel coincided with a
“skin” pixel of the QSCM and a point subtracted if the test mask skin pixel coincided with a
“background” pixel. Similarly a point was awarded if the test mask background coincided
with the “background” of the QSCM and vice versa. See Appendix C Section 10 for a
pseudocode description of the comparison of a test gesture mask and set of QSCMs. Figure 39
shows the comparison of the QSCM for the letter ‘e’ (Figure 38 right) with example masks of
the letters ‘c’ and ‘e’.
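A sketch of the quantization and scoring. The thresholds are assumptions, and the SCM here stores the fraction of skin as in the earlier sketch, so high values mean skin:

```python
import numpy as np

SKIN, IGNORE, BACKGROUND = 0, 1, 2   # matches the '0'/'1'/'2' of Figure 44

def quantize(scm, lo=0.2, hi=0.8):
    """QSCM: pixels that were skin in at least `hi` of the training set
    become SKIN, at most `lo` become BACKGROUND, the rest are ignored."""
    q = np.full(scm.shape, IGNORE, dtype=np.uint8)
    q[scm >= hi] = SKIN
    q[scm <= lo] = BACKGROUND
    return q

def qscm_score(test_mask, q):
    """+1 for each skin/skin or background/background agreement,
    -1 for each disagreement; IGNORE pixels contribute nothing."""
    skin, back = (q == SKIN), (q == BACKGROUND)
    agree = (test_mask & skin) | (~test_mask & back)
    disagree = (test_mask & back) | (~test_mask & skin)
    return int(agree.sum()) - int(disagree.sum())
```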
Figure 39 The comparison of the QSCM for the letter ‘e’ (Figure 38 right) with example masks of
the letters ‘c’ and ‘e’. Areas that achieve positive scores (background to background or skin to
skin match) are shown in green and those with negative scores (background to skin or skin to
background) are shown in yellow. The mask for the letter ‘e’ has many more areas of positive
score and fewer areas of negative score than the mask for the letter ‘c’.
The graph in Figure 40 shows the scores of a test gesture ‘c’ compared with the QSCMs of
gestures from ‘a’ through to ‘i’.
[Bar chart: QSCM match score (0 to 30000) for test letter 'c' against two trained examples of each letter 'a' through to 'i'.]
Figure 40 Comparison of test letter 'c' with trained examples from 'a' through to 'i'. The
examples of the letter ‘c’ achieve the top two comparison scores and none of the others achieve
similar scores except the letter ‘d’ which, although close, is still a minimum of 1,400 points different.
This suggests that the template matching in the canonical frame recognition method is better
than both the area and radial length recognition methods.
Both stored examples of the letter ‘c’ matched the test gesture better than any of the others.
Based on the results obtained for the three metrics it was decided to use the template matching
in the canonical frame recognition method as it was the only method that provided sufficient
information to differentiate the similar gestures reliably and because it was the easiest to adapt
to using multiple training examples of each gesture.
Positioning the hand in the canonical frame relative to the wrist band makes the canonical frame method more robust as it reduces the reliance on the hand centroid as an
anchor point. Several rules were considered, but the one that produced the best results
involved shifting the image in the canonical frame to the right until the wrist band was just off
the edge of the screen. This was performed by scanning columns of the canonical frame from
the right until the number of wrist band pixels detected fell to zero. The positioning in the y-
direction was calculated using the hand centroid as before. Figure 41 shows a gesture in the
canonical frame before and after translation.
Figure 41 Images showing the canonical frame before (left) and after (right) x-axis shift. The y-
axis position of the hand is dictated by the hand centroid as before.
A greedy algorithm was devised to take the first gesture image in the training group and
compare it pixel by pixel with all other members of the group. See Appendix C Section 12 for
a pseudocode description of the comparison.
Any gesture images whose compared difference (in pixels) fell below a set threshold,
t max , were then added to a sub-group and removed from the main group. Once all the gesture
images in the main group had been compared, the new first member of the main group could be compared with all the remaining images, and so on. A threshold was also set to define the
minimum number of gesture images permitted in an exemplar. In the event that the number of
images in an exemplar fell below this threshold the first member of the main group was
simply removed entirely with the logic that if it was so dissimilar from all the rest then it must
be an outlier and as such could be safely removed without greatly affecting recognition
quality. The process continued until no gesture images remained in the main group. See
Appendix C Section 13 for a pseudocode description of the clustering process.
Figure 43 shows the result of running the algorithm on sets of 100 gesture images of the
sign language letters ‘a’ through to ‘e’. The value of t max in this case was 2500 pixels
different and a minimum of four gesture images were allowed in an exemplar.
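A sketch of this greedy clustering. The pixel-difference measure follows Appendix C Section 12, and the handling of too-small clusters follows the outlier rule described above:

```python
import numpy as np

def pixel_diff(a, b):
    """Number of canonical-frame pixels on which two masks disagree."""
    return int(np.count_nonzero(a != b))

def cluster_exemplars(images, t_max=2500, min_size=4):
    """Greedily group training masks into exemplars; a seed whose
    cluster would be smaller than `min_size` is treated as an outlier
    and discarded."""
    group = list(images)
    exemplars = []
    while group:
        seed = group.pop(0)
        matched = [g for g in group if pixel_diff(seed, g) < t_max]
        if 1 + len(matched) >= min_size:
            exemplars.append([seed] + matched)
            group = [g for g in group if pixel_diff(seed, g) >= t_max]
        # else: seed is an outlier; drop it and continue with the rest
    return exemplars
```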
All of the training gesture image sets are clustered into at least three exemplars. As expected,
the gestures with the largest number of exemplars are those with the most shadowing (letters
‘a’ and ‘e’). Even those with no shadowing (‘c’ and ‘d’) cluster into more than one exemplar, as they involve a range of possible finger positions that still present a valid gesture.
A problem with clustering the training gesture image sets in this way is that it increases the
number of SCMs that need to be compared per frame in order to recognise a test gesture. For
instance, with no clustering, a set of 24 gestures would produce 24 SCMs to compare per
frame. If clustering produces 10 exemplars per gesture, then the number of SCMs increases to
240, with subsequent decrease in recognition frame rate. The choice of how much clustering
to perform is a trade-off between speed (less clustering) and accuracy (more clustering) and
should be chosen depending on the application. A compromise between the two was chosen
here.
Figure 44 Simplified example of how the pixels that split the set can be found. The four tables on
the left represent skin concentration maps. After quantization, the value of each pixel in the
quantized skin concentration map is either ‘0’, ‘1’ or ‘2’. The pixels that are either ‘0’ or ‘2’
across the set can then be found.
Although the process of quantization means that there is no strict guarantee provided by the
analysis of each individual pixel, the combined influence of the many pixels in the list
provides a better estimate.
With the tree method a group of pixels that split the set of exemplars roughly in two is
found. The greater the number of pixels the better the accuracy of the decision, so a
compromise has to be found between splitting the set into two halves and finding enough
pixels to accurately do so. See Appendix C Section 15 for a formal description of this
compromise.
Once the set is split the two subsets can be stored in the left and right branch of a tree
structure. The same process (of finding pixels that split the set in two) can be applied to both
subsets. The process continues until all subsets consist of a single gesture.
A program was written to perform the quantization and then scan all the pixels from all
the QSCMs for those that split the set roughly in two. Priority was given to finding sufficient
pixels so if on a given pass insufficient were found then the process was repeated but with
less emphasis on splitting the set exactly in two. After each split the location and value of all
the qualifying pixels was stored and a node of a tree structure filled. Both reduced sets of
gestures were then passed back into the splitting algorithm. The process was repeated until all
the bottom nodes of the tree consisted of a single gesture. See Appendix C Section 16 for a
pseudocode description of filling the tree structure.
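A simplified sketch of filling the tree. The selection of splitting pixels is reduced here to "all decided pixels that induce the same partition as the most balanced one"; the relaxation passes and the minimum pixel count of the full algorithm are omitted:

```python
import numpy as np

SKIN, IGNORE, BACKGROUND = 0, 1, 2   # pixel values in a QSCM (Figure 44)

def build_tree(qscms, labels):
    """Recursively split a set of QSCMs on pixels that are decided
    (skin or background) for every exemplar and divide the set roughly
    in two; leaves hold a single gesture."""
    if len(labels) <= 1:
        return {'leaf': labels}
    q = np.stack(qscms)                      # N x H x W quantized maps
    skin = (q == SKIN)
    decided = (q != IGNORE).all(axis=0)      # '0' or '2' across the set
    if not decided.any():
        return {'leaf': labels}
    # pick the decided pixel whose skin share is closest to one half
    share = skin.mean(axis=0)
    score = np.where(decided, np.abs(share - 0.5), np.inf)
    py, px = np.unravel_index(np.argmin(score), score.shape)
    pattern = skin[:, py, px]                # which exemplars are skin there
    left = [i for i, p in enumerate(pattern) if p]
    right = [i for i, p in enumerate(pattern) if not p]
    if not left or not right:
        return {'leaf': labels}
    # keep every decided pixel that induces this same partition
    same = decided & np.all(skin == pattern[:, None, None], axis=0)
    return {'pixels': np.argwhere(same),
            'left': build_tree([qscms[i] for i in left],
                               [labels[i] for i in left]),
            'right': build_tree([qscms[i] for i in right],
                                [labels[i] for i in right])}
```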
Figure 45 shows the output of the algorithm for a set of five gestures from the one
handed sign language set (letters l, b, o, n and m).
[Diagram: tree built for the input gestures L, B, O, N and M; at each node a set of pixels is found that splits the remaining gestures, filling the tree structure level by level.]
Figure 45 An example of how the tree method works. At each level of the tree the number of skin
pixels under the green and yellow masks is counted. If the number under the green mask is larger
than that under the yellow mask the green branch is chosen. Alternatively the yellow branch is
chosen. The process is repeated until the bottom of the tree is reached.
The advantage of this system is that after the tree structure is filled, only a small number of
pixels need be analysed before the descent to the next tree level. As, at each stage, the number
of possible exemplars is split roughly in two, this method is very quick to execute. The
disadvantage of this method is that at the levels of the tree near the root, when the number of
exemplars is large, the number of pixels that split the set (even to split off a single exemplar)
is very small. During testing it was found that for a set of just 16 exemplars only 200 pixels
could be found to split off a single exemplar at the first level of the tree, greatly increasing the
possibility of error at this level. Another problem is that the tree can only be traversed downwards: once it is decided to travel down one side of the tree, the exemplars represented on the other side cannot be compared, even if they would provide a better match at a later stage. For example, if the probability of correct branch traversal at each node is 0.98 (a 2% probability of failure) and the tree has 10 levels (all of which must be traversed correctly), then the probability of success at the bottom is 0.98^10 ≈ 0.82 (a failure probability of 18%). This was reflected in the fact that, for a set of more than eight different exemplars, the correct one was rarely recognised.
The alternative considered was a flat comparison of the test gesture mask against every QSCM in turn, scored as follows:
• If the test gesture pixel is skin then a point is awarded to each of the QSCMs if the value of that pixel is mostly skin.
• If the test gesture pixel is skin then a point is subtracted from each of the QSCMs if
the value of that pixel is mostly background.
• If the test gesture pixel is background then a point is awarded to each of the QSCMs
if the value of that pixel is mostly background.
• If the test gesture pixel is background then a point is subtracted from each of the
QSCMs if the value of that pixel is mostly skin.
The final score for each QSCM can then be calculated by dividing the total score by the
maximum score possible (equal to the number of pixels over the template which are either
mostly skin or mostly background).
An advantage of this system is that each exemplar is judged separately so unlike the tree
method errors do not accumulate. A disadvantage is that a very large number of pixels have to
be examined for each of the QSCMs for a match to be made. Also, if a given training gesture
has a large amount of variation then there will be a large number of pixels which are neither
mostly skin nor mostly background in the QSCM (equivalent to a large amount of white area in Figure 38 right), leaving large areas where no score can be awarded and increasing the possibility that two exemplars will be difficult to differentiate.
To test the system, the same training and test gesture sets that were used with the radial
length metric were fed to the system. Figure 46 shows the results:
Correct 63/64
Incorrect 1/64
False positives 60/64
False negatives 0/64
Figure 46 Results from a test of the template score method with quantization. All but one of the
test gestures was correctly identified and there were no false negatives. However, there were a
considerable number of false positives. This is due to the fact that the recognition score for a
couple of the gestures was low even though the correct gesture obtained the highest score. This
meant that the recognition threshold had to be set low and as such a number of intermediary
frames were incorrectly recognised as gestures.
An alternative is to keep the floating point skin concentration value of each SCM pixel rather than quantizing it, and then:
• Add this floating point number when the corresponding test gesture pixel is skin
• Subtract this floating point number when the corresponding test gesture pixel is
background
Pixels that have a large amount of variation do not affect the score by a significant amount as
their value is close to zero.
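A sketch of this score, assuming the floating point number for each pixel is its skin concentration mapped into [-1, +1]. This mapping is one plausible realization; the report does not give the exact one used:

```python
import numpy as np

def float_score(test_mask, scm):
    """Template score without quantization: w is +1 where the training
    set was always skin, -1 where always background, and near 0 for
    high-variation pixels, which therefore barely affect the score."""
    w = 2.0 * scm - 1.0                    # scm: fraction skin per pixel
    return float(w[test_mask].sum() - w[~test_mask].sum())
```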
The advantage of this method is that no pixels are ignored, so even exemplars with a
large amount of gesture image variation are fully considered. A disadvantage is that many
pixels have to be considered for each SCM (as with the quantization method). This method
will also be slower than the previous method as many floating point calculations have to be
performed (rather than integer ones).
Once again the system was tested using the same gesture sets as before. The results are
shown in Figure 47:
Correct 64/64
Incorrect 0/64
False positives 48/64
False negatives 0/64
Figure 47 Results from a test of the template score method with no quantization. All of the test
gestures were correctly identified this time and once again there were no false negatives. There
were a considerable number of false positives. This is for the same reason as with the previous
figure.
From looking at the results of each of the recognition methods it was clear that the method
with the best recognition score was the template score method with no quantization. Therefore
this method was chosen.
Figure 48 Images showing the pixels queried in order to detect one of the exemplars for the letter
‘a’ before and after removal of duplicates. The duplicate pixels are mostly evenly spread over the
recognition area. Notice how the pixels near the wrist band are less concentrated, as the pixels in
this area are skin for almost all the trained gestures.
The results show that, after removal of duplicate pixels, the remaining pixels are evenly
spread over the recognition area except for the area near the wrist band where a larger number
of duplicates exist. This is because most of the pixels near the wrist band are skin for all of the
trained gestures.
After application of both these methods all of the test gestures were still correctly recognised.
Therefore it was decided to use both. The results from applying the set of test gestures from before are shown in Figure 49:
Correct 64/64
Incorrect 0/64
False positives 43/64
False negatives 0/64
Figure 49 Results of a test of the template score method with no quantization after sorting and
removal of duplicate pixels. All of the test gestures were correctly identified and there were no
false negatives. There were a considerable number of false positives. This is for the same reason
as before.
5.11 Conclusion
In this section, three methods of recognition have been discussed. Firstly, area comparison was considered; although an unsuitable metric in itself, it was used to establish the comparison architecture and testing methodology for the later systems. The second method involved the comparison of radial length signatures. This
was more suitable, but it was found that the amount of information provided about individual
fingers was dependent on the relative angle of the radial and the long axis of the finger,
making some gestures hard to differentiate. Finally, template matching in the canonical frame
was considered and chosen as it provided the best results. Various refinements were then
made to increase recognition speed. Using the methods chosen a set of 42 gestures were all
correctly recognised at a frame rate of 12.5fps.
6 Application: Gesture Driven Interface
6.1 Setup
The system was set up as in Figure 2. The template score (with no quantization) recognition
method was modified so that the recognised gesture generated mouse and keyboard events, as
shown in Figure 50.
In order to ignore transition movements of the hand, an event was only queued if five
identical contiguous gestures were recognised. Thereafter, further events were only processed
if the gesture changed (therefore, to type two identical letters a brief gesture change would
need to be interleaved).
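A sketch of this queueing rule (assuming the recogniser yields one label per processed frame; names are illustrative):

```python
def queue_events(recognised_labels, run_length=5):
    # Yield an event label only after `run_length` identical contiguous
    # recognitions, once per sustained run, so that transition movements
    # of the hand are ignored. Repeating a letter requires a brief run of
    # a different gesture in between, as described above.
    current, count = None, 0
    for label in recognised_labels:
        count = count + 1 if label == current else 1
        current = label
        if count == run_length:
            yield label  # generate the mouse/keyboard event here
```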
6.2 Demonstration
To demonstrate the system in use, the following sequence of actions was performed using the hand alone:
• The explorer icon on the task bar was clicked in order to restore it.
• Finally, the folder was closed and dragged to the top left of the directory window.
During the demonstration six letter errors were made, two of which were due to operator
error.
7 Conclusion
Combination of area, radial length and template matching in the canonical frame: It was noticed that each of the recognition metrics demonstrated different benefits. For instance, the area metric differentiated ‘a’ and ‘c’ well, the radial metric differentiated ‘b’ and ‘c’ well, and template matching in the canonical frame differentiated ‘d’ and ‘c’ well. A weighted combination of all three metrics should therefore achieve higher accuracy than any single metric.
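As a sketch of such a combination (the weights here are placeholders and would need tuning on a validation set):

```python
def combined_score(area_score, radial_score, template_score,
                   weights=(0.2, 0.3, 0.5)):
    # Hypothetical linear blend of the three metrics; each score is assumed
    # to be normalised to a comparable range before blending.
    w_area, w_radial, w_template = weights
    return (w_area * area_score
            + w_radial * radial_score
            + w_template * template_score)
```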
Removal of wrist band: The system relies on the user wearing a coloured wrist band to
remove various degrees of freedom, making recognition, via comparison, possible. It would
be advantageous if this were not the case. There are methods (see Section 2.3) that could be
used to perform the recognition without a wrist band, but they would be unlikely to be as
accurate.
Using temporal coherence to improve recognition accuracy: English written text has
temporal coherence in that each letter has a probability of being followed by a given letter.
For instance, the letter ‘q’ is often followed by the letter ‘u’ but rarely any other letter. These
probabilities could be used to improve recognition accuracy by combining the list of top
scoring exemplars with the probability of each following the preceding letter. The same
process could also be used to permit standard American one-handed sign language to be used
(where the letters ‘O’, ‘V’ and ‘W’ are the same as the numbers ‘0’, ‘2’ and ‘6’ respectively;
see Appendix B) instead of the modified version.
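A sketch of how this rescoring might look (the bigram table and the weighting `alpha` are assumptions, not part of the implemented system):

```python
import math

def rescore_with_bigrams(scores, prev_letter, bigram_prob, alpha=1.0):
    # scores: {letter: template score}; bigram_prob[(prev, cur)] is an
    # estimate of P(cur | prev) from English text. The log-probability is
    # blended with the template score; unseen pairs are smoothed.
    def combined(letter):
        p = bigram_prob.get((prev_letter, letter), 1e-6)
        return scores[letter] + alpha * math.log(p)
    return max(scores, key=combined)
```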
Increase of the number of recognised gestures: For the purposes of a man-machine interface a
relatively small set of gestures (≈100) would be sufficient and is therefore within the bounds
of the final system developed. However, if detection of hand gestures for computer animation
is required (for instance), then the number of trained gestures would need to be in the
thousands. A system which relies on both training and comparison of all gestures used would
not be sufficient for this task. Further work, therefore, could involve the implementation of a
gesture recognition system which does not require training. An example of this is the direct
method based on hand geometry considered in Section 5.1.
Multi-stage gestures: It would be possible to represent a much larger number of labels if each
label consisted of two or more gestures combined with hand position changes. For instance,
the “wave hello” label could correspond to the open hand gesture with an alternating increase
and decrease of hand yaw angle and the “thumbs-up” label could correspond to the letter ‘m’
followed by the space gesture.
Two-handed sign language: It would be possible, using two different coloured gloves and two
different coloured wrist bands, to detect the gesture signed by both hands whilst both are in
the frame. A method would have to be devised to detect a gesture (or range of gestures) that is
represented by a partially occluded hand. This method would be considerably harder to
implement. It is important to note, however, that although the gesture of both hands could be
recognised this would not permit the recognition of the full American sign language as this
involves recognising many other features including facial expression and arm position.
8 Appendix
8.1 Appendix A - Glossary
Hand roll: The rotation of the hand about an axis defined by the wrist. The following three images show the same gesture with increasing roll.
Hand yaw: The rotation of the hand about an axis defined by the camera view direction. The following three images show the same gesture with increasing yaw.
8.2 Appendix B - The gesture set
[Figure: images of the trained gestures, one per label: A-Z, 1-9, 0, DO, RE, BS, SP, CA, LC, RC, DC, OP and CL.]
8.3 Appendix C - Formal algorithm descriptions
Given a test image with signature $a_{new}$, choose the label $l_{i_{min}}$ where
$$i_{min} = \arg\min_{i=1..n} \left\| a_{new} - a_i \right\|_2^2$$
$$G = \{(g_i, l_i)\}_{i=1}^{n}$$
Given a test image with signature $g_{new}$, choose the label $l_{i_{min}}$ where
$$i_{min} = \arg\min_{i=1..n} \left\| g_{new} - g_i \right\|_2^2$$
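A direct transcription of this nearest-neighbour rule (illustrative sketch):

```python
import numpy as np

def nearest_label(g_new, exemplars):
    # exemplars: list of (g_i, l_i) pairs, each g_i a 1-D array of the same
    # length as g_new; return the label of the closest signature under the
    # squared Euclidean distance.
    distances = [np.sum((g_new - g_i) ** 2) for g_i, _ in exemplars]
    return exemplars[int(np.argmin(distances))][1]
```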
The radius scaling factor and angle shift to be used in canonicalisation can then be defined as
$$r_{canonicalscalefactor} = \left\| \vec{v}_{dif} \right\|, \qquad \theta_{canonicalshift} = \tan^{-1}\left( \frac{y_{dif}}{x_{dif}} \right)$$
Define the anchor of the canonical frame as $\vec{x}_{canonicalanchor}$, say $(160, 120)$.
The set of all remaining skin pixel locations after refinement is $L$.
For each $\vec{x} \in L$:
$$\vec{v}_{pixel} = (x_{pixel}, y_{pixel}) = \vec{x} - \vec{c}_{hand}$$
$$r_{pixel} = \left\| \vec{v}_{pixel} \right\|, \qquad \theta_{pixel} = \tan^{-1}\left( \frac{y_{pixel}}{x_{pixel}} \right)$$
The transformation into the canonical frame then proceeds as follows:
Pixel distance scaling: $r_{scaledpixel} = r_{pixel} \times \frac{100}{r_{canonicalscalefactor}}$
Pixel angle rotation: $\theta_{scaledpixel} = (\theta_{pixel} + \theta_{canonicalshift}) \bmod 2\pi$
The equivalent pixel in the canonical frame is then:
$$\vec{x}_{canonical} = \vec{x}_{canonicalanchor} + r_{scaledpixel} \begin{pmatrix} \cos\theta_{scaledpixel} \\ \sin\theta_{scaledpixel} \end{pmatrix}$$
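A sketch of this forward transform (atan2 replaces $\tan^{-1}(y/x)$ to handle all quadrants; names follow the definitions above):

```python
import math

def to_canonical(x, c_hand, scale_factor, angle_shift, anchor=(160, 120)):
    # Map an original-frame skin pixel x into the canonical frame using the
    # radius scaling factor and angle shift defined above.
    vx, vy = x[0] - c_hand[0], x[1] - c_hand[1]
    r = math.hypot(vx, vy) * (100.0 / scale_factor)
    theta = (math.atan2(vy, vx) + angle_shift) % (2.0 * math.pi)
    return (anchor[0] + r * math.cos(theta),
            anchor[1] + r * math.sin(theta))
```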
For all pixels $\vec{x}_{canonical}$ within the canonical frame:
$$\vec{v}_{canonical} = (x_{canonical}, y_{canonical}) = \vec{x}_{canonical} - \vec{x}_{canonicalanchor}$$
$$r_{canonical} = \left\| \vec{v}_{canonical} \right\|, \qquad \theta_{canonical} = \tan^{-1}\left( \frac{y_{canonical}}{x_{canonical}} \right)$$
The pixel "pull" from the original frame then proceeds as follows:
Inverse pixel distance scaling: $r_{invscaledpixel} = r_{canonical} \div \frac{100}{r_{canonicalscalefactor}}$
Inverse pixel angle rotation: $\theta_{invscaledpixel} = (\theta_{canonical} - \theta_{canonicalshift}) \bmod 2\pi$
The equivalent pixel in the original frame is then:
$$\vec{x} = \vec{c}_{hand} + r_{invscaledpixel} \begin{pmatrix} \cos\theta_{invscaledpixel} \\ \sin\theta_{invscaledpixel} \end{pmatrix}$$
If $\vec{x} \in L$ then mark the pixel in the canonical frame ($\vec{x}_{canonical}$) as skin, otherwise mark it as background.
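A sketch of this pull, assuming $L$ is stored as a set of integer pixel coordinates:

```python
import math

def pull_pixel(x_canonical, skin_pixels, c_hand, scale_factor, angle_shift,
               anchor=(160, 120)):
    # Map a canonical-frame location back into the original frame and test
    # it against the refined skin pixel set L (here `skin_pixels`).
    vx, vy = x_canonical[0] - anchor[0], x_canonical[1] - anchor[1]
    r = math.hypot(vx, vy) / (100.0 / scale_factor)
    theta = (math.atan2(vy, vx) - angle_shift) % (2.0 * math.pi)
    x = (round(c_hand[0] + r * math.cos(theta)),
         round(c_hand[1] + r * math.sin(theta)))
    return x in skin_pixels  # True: mark the canonical pixel as skin
```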
The skin concentration map can then be generated by colouring each pixel:
Black if $C_i = -1$; else
Blue if $C_i = 1$,
Red if $C_i = 0$,
and colours in between.
The skin concentration difference map can then be generated by colouring each pixel:
Black if $D_i = 0$,
Red if $D_i = 1$,
and colours in between.
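A sketch of the colouring, assuming 8-bit RGB output and linear blending:

```python
def scm_colour(c):
    # Black marks pixels with c = -1; otherwise blend linearly from red
    # (never skin, c = 0) to blue (always skin, c = 1).
    if c == -1:
        return (0, 0, 0)
    return (int(255 * (1.0 - c)), 0, int(255 * c))

def scdm_colour(d):
    # Blend from black (no difference, d = 0) to red (full difference, d = 1).
    return (int(255 * d), 0, 0)
```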
Given a test image with mask $M_i$, calculate the score for each concentration map thus:
Define an array of scores $s_j$ where $s_j = 0$ for $j = 0..n$.
For each QSCM $j$:
For each pixel $i$:
$$s_j = s_j + \begin{cases} -1 & (M_i = 1) \wedge (Q_{j,i} = 0) \\ -1 & (M_i = 0) \wedge (Q_{j,i} = 2) \\ 0 & \text{otherwise} \end{cases}$$
Define the total radius as $r_{tot}$, initially zero.
For each $\vec{x} \in L$:
$$\vec{v}_{pixel} = (x_{pixel}, y_{pixel}) = \vec{x} - \vec{c}_{hand}$$
$$r_{tot} = r_{tot} + \left\| \vec{v}_{pixel} \right\|$$
The average radius is then defined as $r_{tot} / |L|$.
For each mask $j = 0$ to $j = (n-2)$:
For each mask $k = (j+1)$ to $k = (n-1)$:
If $M_j$ is sufficiently similar to $M_k$ (see algorithm above) then
C.15 The compromise between splitting the set into two halves and finding enough pixels to accurately do so
A formal description of this compromise is as follows:
SPolarised can be scanned to find the sets of pixels for which:
SZeros and STwos are identical, or
SZeros and STwos are exactly opposite (because such a pixel splits the set in the same way).
A compromise then has to be found between finding a large set of pixels and a set that splits the set as accurately in two as possible (a set for which SZeros and STwos are roughly of the same size).
Store the pixels eventually decided upon in the set SSplit.
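One hypothetical way of scoring candidate pixel sets for this compromise (both the functional form and the balance term are assumptions):

```python
def split_quality(n_pixels, n_zeros, n_twos):
    # Hypothetical score: prefer a large agreeing pixel set (n_pixels) and
    # a split of the exemplar set into halves of similar size.
    if n_zeros == 0 or n_twos == 0:
        return 0.0
    balance = min(n_zeros, n_twos) / max(n_zeros, n_twos)  # 1.0 = even split
    return n_pixels * balance
```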
$$s_j = s_j + \begin{cases} +1 & (M_i = 1) \wedge (Q_{j,i} = 2) \\ +1 & (M_i = 0) \wedge (Q_{j,i} = 0) \\ -1 & (M_i = 1) \wedge (Q_{j,i} = 0) \\ -1 & (M_i = 0) \wedge (Q_{j,i} = 2) \\ 0 & \text{otherwise} \end{cases}$$
If $Q_{j,i} = 2$ then increment $smax_j$.
Recognition of the top scoring gesture is then performed by choosing the label $l_{j_{max}}$ where:
$$j_{max} = \arg\max_{j=1..n} \frac{s_j}{smax_j}$$
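A sketch of this scoring and normalisation over quantized skin concentration maps, assuming each $Q_j$ stores per-pixel values in {0, 1, 2}, with 2 meaning always skin and 0 meaning never skin across the exemplars:

```python
import numpy as np

def qscm_recognise(test_mask, qscms):
    # test_mask: 2-D array of 0/1; qscms: list of 2-D arrays over {0, 1, 2}.
    # Returns the index of the top-scoring QSCM and the normalised scores.
    scores = []
    for q in qscms:
        skin, background = (test_mask == 1), (test_mask == 0)
        s = (np.sum(skin & (q == 2)) + np.sum(background & (q == 0))
             - np.sum(skin & (q == 0)) - np.sum(background & (q == 2)))
        smax = np.sum(q == 2)  # smax_j counts pixels with Q = 2
        scores.append(s / smax if smax else 0.0)
    return int(np.argmax(scores)), scores
```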
Given a test image with mask $M_i$, calculate the score for each concentration map thus:
Define an array of scores $s_j$.
For each SCM $j$:
For each pixel $i$:
If $M_i = 1$ then $s_j = s_j + (C_{j,i} - 0.5)$,
else $s_j = s_j - (C_{j,i} - 0.5)$.