Abstract

We present a method to recognize hand gestures, based on a pattern recognition technique developed by McConnell [16] employing histograms of local orientation. We use the orientation histogram as a feature vector for gesture classification and interpolation.

This method is simple and fast to compute, and offers some robustness to scene illumination changes. We have implemented a real-time version, which can distinguish a small vocabulary of about 10 different hand gestures. All the computation occurs on a workstation; special hardware is used only to digitize the image. A user can operate a computer graphic crane under hand gesture control, or play a game. We discuss limitations of this method.

For moving or "dynamic gestures", the histogram of the spatio-temporal gradients of image intensity forms the analogous feature vector and may be useful for dynamic gesture recognition.
Reprinted from: IEEE Intl. Wkshp. on Automatic Face and Gesture Recognition, Zurich, June,
1995.
This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories of Cambridge, Massachusetts; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories. All rights reserved.
Copyright © Mitsubishi Electric Research Laboratories, 1994
201 Broadway, Cambridge, Massachusetts 02139
1. First printing, TR94-03, May 1994
2. Second printing, TR94-03a, December 1994
William T. Freeman and Michal Roth
Mitsubishi Electric Research Labs
201 Broadway
Cambridge, MA 02139 USA
e-mail: {freeman, roth}@merl.com
1 Introduction

Computer recognition of hand gestures may provide a more natural human-computer interface, allowing people to point, or rotate a CAD model by rotating their hands. Interactive computer games would be enhanced if the computer could understand players' hand gestures. Gesture recognition may even be useful to control household appliances.

We distinguish two categories of gestures: static and dynamic. A static gesture is a particular hand configuration and pose, represented by a single image. A dynamic gesture is a moving gesture, represented by a sequence of images. We focus on the recognition of static gestures, although our method generalizes in a natural way to dynamic gestures.

For the broadest possible application, a gesture recognition algorithm should be fast to compute. Here, we apply a simple pattern recognition method to hand gesture recognition, resulting in a fast, usable hand gesture recognition system.

1.1 Related Work

The trackball, the joystick, and the mouse are extremely successful devices for hand-based computer input. Yet all require that the user hold some hardware, which can be awkward. Furthermore, none accommodates the richness of expression of a hand gesture.

Devices such as the Dataglove [1] can be worn which sense hand and finger positions [9]. While this captures the richness of a hand's gesture, it requires the special glove. We seek a visually based method which will be free of gloves and wires.

Relying on visual markings on the hands, previous researchers have recognized sign language and pointing gestures [24, 5, 8]. However, these methods require the placement of markers on the hands. The marking-free systems of [12, 21] can recognize specific finger or pointing events, but not general gestures. Employing special hardware or off-line learning, several researchers have developed successful systems to recognize general hand gestures [22, 14, 6, 7, 20]. Blake and Isard [4] have developed a fast contour-based tracker which they applied to hands, but the discrimination of different hand poses is limited. The real-time hand gesture recognition systems we are aware of require special hardware or lengthy training analysis.

2 Our Approach

We seek a simple and fast algorithm, which works in real-time on a workstation. We want the recognition to be relatively robust to changes in lighting.

A high-level approach might employ models of the hand, fingers, and joints, and perhaps fit such a model to the visual data. This approach offers power and robustness, but at the expense of speed. A low-level approach, such as was taken by [6], would process data at a level not much higher than that of pixel intensities. This approach would not have the power to make inferences about occluded data. However, it could be simple and fast. We chose this approach.

We use a pattern recognition system of the form of Fig. 1. Some transformation, T, converts an image or image sequence into a feature vector, which we then compare with the feature vectors of a training set of gestures. We will use a Euclidean distance metric.

[Figure 1 about here: block diagram, image -> transformation T -> feature vector -> compare with training set.]

Figure 1: Outline of the recognition system. We apply some transformation T to the image data to form a feature vector which represents that particular gesture. To classify the gesture, we compare the feature vector with the feature vectors from a previously generated training set. For dynamic gesture recognition, the input would be a sequence of images.

We seek the simplest possible transformation T which allows recognition of hand gestures. To motivate the algorithm, let us first examine a transformation which is too simple, and then fix it. Suppose we did nothing as our transformation, T, and used the image itself as our feature vector. We would sum the squares of image pixel differences to measure distances between gestures. Figure 2 illustrates a problem with this scheme, and suggests a solution. (a) and (b) show the same hand gesture under two different lighting conditions, illustrating that pixel intensities can be sensitive to changes of scene lighting. A pixel-by-pixel difference of the images (a) and (b) would show a large distance between these identical gestures. However, others have observed [3] that local orientation measurements are less sensitive to lighting changes. (c) and (d) show the local direction of dominant orientation (as computed in [10]) for the images (a) and (b). In this representation, the two gestures look quite similar.

Figure 2: Showing the robustness of local orientation to lighting changes. Pixel intensities are sensitive to lighting change. (a) and (b) show the same hand gesture illuminated under two different lighting conditions. The pixel intensities change significantly as the lighting changes. Maps of local orientation, (c) and (d), are more stable. (The orientation maps were computed using steerable filters [10]. Orientation bars below a contrast threshold are suppressed.)
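To illustrate this lighting robustness concretely, the following sketch (not the authors' code, which used the steerable filters of [10]) estimates a local orientation map from plain image gradients. Scaling the image brightness changes the gradient magnitude but leaves the gradient direction unchanged, which is why the orientation maps in Fig. 2 (c) and (d) look alike:

```python
import numpy as np

def orientation_map(image, contrast_frac=0.1):
    """Estimate local orientation from simple x/y image gradients.

    Returns (angle, contrast, mask): angle is the gradient direction in
    radians, contrast is the gradient magnitude, and mask marks pixels
    whose contrast exceeds a fraction of the maximum (low-contrast
    orientation estimates are unreliable and are suppressed, as in Fig. 2).
    """
    img = image.astype(float)
    dy, dx = np.gradient(img)            # simple derivative estimates
    angle = np.arctan2(dy, dx)           # local orientation, -pi..pi
    contrast = np.sqrt(dx**2 + dy**2)    # gradient magnitude
    mask = contrast > contrast_frac * contrast.max()
    return angle, contrast, mask
```

Multiplying the image by a constant (a crude model of a lighting change) rescales `contrast` but leaves `angle` untouched, so an orientation-based feature vector is far less sensitive to illumination than raw pixel values.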
Next, we need to enforce translation invariance. The same hand at different positions in the image should produce the same feature vector. A simple way to enforce this is to form a histogram of the local orientations. This treats each orientation element the same, independent of location.

With one minor modification, this orientation histogram will be our feature vector to represent hand gestures. The orientation analysis gives robustness to lighting changes; the histogramming gives translational invariance. As we will see, such a histogram can be calculated quickly. This simple method will work if examples of the same gesture map to similar orientation histograms, and different gestures map to substantially different histograms. (We show some cases where this doesn't hold in Fig. 6.) McConnell [16] proposed this pattern recognition method, although he used a more complicated histogram comparison scheme than the squared error measure we use here.

Figure 3 shows an example histogram and our additional modification. When analyzing local orientation, we set a contrast threshold below which we assume the orientation measurement is inaccurate. For the image, (a), all points would be below that threshold except for the region near the horizontal line, where the pixels all have the same orientation (b). We then blur the histogram in the angular domain, (c). This allows for a gradual fall-off in the distance between orientation histograms of image features of gradually differing orientation. (d) shows the same data plotted in polar coordinates, which allows convenient comparison with the original image.

One has a choice of representing orientation in the angle or twice-angle representations [13]. A representation by angle would treat a given edge and a contrast-reversed version as having opposite orientation. A double angle representation maps these into the same orientation, and so would map a hand on dark and light backgrounds to approximately the same feature vector. A (single) angle representation may allow more gestures to be distinguished. For our work, we used an angle representation, to allow more differentiation between gestures.

The resulting algorithm is simple and fast to compute. The representation for each gesture is small, and comparisons between feature vectors can be very fast.

Our approach relates to other examples of image analysis through orientation analysis. In [3], Bichsel analyzed faces, using local orientation to achieve some lighting invariance. Gorkani and Picard [18] used orientation histograms to compute dominant texture orientations. Nelson [17] used orientation patterns for visual homing. This work is also in the same spirit as texture analysis schemes which analyze the outputs of ensembles of oriented filters at differing orientations [2, 15].

We have implemented this algorithm on an HP-735 workstation with a RasterOps digitizing board. We took various steps to achieve real-time speed. The 640 x 480 digitized image is averaged and subsampled to a resolution of 106 x 80. We use black and white video. We measure the gradient direction and local contrast using two simple 2-tap x and y derivative filters. With the above steps, the total processing time is 100 msec per frame.

If dx and dy are the outputs of the x and y derivative operators, then the gradient direction is arctan(dx, dy), and the contrast is sqrt(dx^2 + dy^2). We divide orientation into 36 bins and use a 1 4 6 4 1 filter to blur the orientation histogram. We set the contrast threshold as some amount, k, times the mean image contrast. Values of k between 1.2 and 2.7 worked well.

[Figure 3 about here: an orientation histogram example. (a) image; (b) raw histogram (frequency of occurrence vs. orientation angle); (c) blurred histogram; (d) polar plot.]

3 Operation

Figure 4 illustrates operation. There is first a training phase. The user first indicates the hand positions for the desired vocabulary of gestures, such as the commands for "up", "down", "right", "left" and "stop" in this example, (a). We show only 3 commands in the figure, but typically more are used. The user may show several gesture examples corresponding to a single command. The computer stores the orientation histograms corresponding to each image, (b).

In the run phase, the user repeats the gesture for a desired command (c). The computer forms the orientation histogram of the new image (d), and compares it with each of the orientation histograms from the training phase. In our implementation described here, we selected the command corresponding to the closest feature vector. With this system we have made a computer graphic crane, (g), which the user may control in real-time (see [23] for a related system, not visually controlled). Rows (e) and (f) show the same gestures made under a different lighting condition. The system can still identify the gestures properly.

Figure 5 shows a measure of performance for the gestures shown in Fig. 4. Each matrix indicates the distance between the feature vectors from each gesture of the test set and each gesture of the training set. Darker intensities correspond to closer distances. The gestures of the test set made under the same lighting conditions are unambiguously classified by the orientation histogram feature vectors. Even under the different lighting condition of Fig. 4 (e), each gesture is still properly classified, although the discrimination is less clear-cut.
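The feature-vector computation of Section 2 (2-tap derivative filters, a contrast threshold of k times the mean contrast, 36 orientation bins, blurring with a 1 4 6 4 1 filter) can be sketched as follows. This is a reconstruction from the text, not the authors' implementation, and it works on a single image array rather than live video:

```python
import numpy as np

def orientation_histogram(image, n_bins=36, k=1.2):
    """Orientation-histogram feature vector, after Sec. 2 of the paper.

    Gradient direction and contrast come from 2-tap x and y derivatives;
    orientations whose contrast falls below k times the mean contrast are
    discarded; the histogram is blurred with a 1 4 6 4 1 filter, applied
    circularly since orientation wraps around.
    """
    img = image.astype(float)
    dx = img[:, 1:] - img[:, :-1]          # 2-tap x derivative
    dy = img[1:, :] - img[:-1, :]          # 2-tap y derivative
    dx, dy = dx[:-1, :], dy[:, :-1]        # crop to a common shape
    angle = np.arctan2(dy, dx)             # gradient direction, -pi..pi
    contrast = np.sqrt(dx**2 + dy**2)
    keep = contrast > k * contrast.mean()  # contrast threshold
    bins = ((angle[keep] + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins, minlength=n_bins).astype(float)
    # blur in the angular domain with a normalized 1 4 6 4 1 kernel
    kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    return sum(w * np.roll(hist, s) for w, s in zip(kernel, range(-2, 3)))
```

The resulting 36-element vector is small, and comparing two such vectors by Euclidean distance is a handful of arithmetic operations, consistent with the real-time budget of 100 msec per frame reported above.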
[Figure 4 about here: rows of hand images and their orientation histograms for the training set (a)-(b), a test set under the same lighting (c)-(d), and a test set under different lighting (e)-(f); (g) shows the computer graphic crane. Figure 5 about here: grey-level matrices of feature vector distances between the training set and each test set.]

Figure 4: Subset of vocabulary of gestures used to control the computer graphic crane. (a) shows the training set of gestures for the commands up, down and right. (c) shows a test set of the same gestures, under the same lighting conditions. (e) is a test set, made under different lighting conditions. (b), (d), and (f) are the corresponding orientation histograms. Note that the shapes look approximately the same as for the same hand positions made under different lighting conditions, (b). An extension of this vocabulary of commands can control in real-time a computer graphic crane, (g).

Figure 5: The two matrices display, in grey level, the distances between training and test sets such as those shown in Fig. 4 (black means close). The test gestures, made under the same lighting conditions, are well classified according to the training gestures, as indicated by the diagonal dark line in the matrix of feature vector distances, (a). For the case of test gestures made under different lighting conditions, (b), a nearest neighbor classifier still gives the correct answer, indicating some degree of lighting invariance. Some of the test feature vectors have moved closer to training feature vectors of different categories; the lighting invariance is not perfect.
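The nearest-neighbor classification and the distance matrices of Fig. 5 can be sketched as below. This is an illustrative reconstruction (function names are ours); the paper specifies only that the command with the closest feature vector, under a Euclidean metric, is selected:

```python
import numpy as np

def distance_matrix(train, test):
    """Euclidean distances between every test and training feature vector.

    train, test: arrays of shape (n_gestures, n_bins). Entry [i, j] is the
    distance from test gesture i to training gesture j; displayed as grey
    levels in Fig. 5 (dark = close).
    """
    diff = test[:, None, :] - train[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classify(train, labels, feature):
    """Pick the command whose training histogram is closest (Sec. 3)."""
    d = np.linalg.norm(train - feature, axis=1)
    return labels[int(np.argmin(d))]
```

When the test set matches the training conditions, the smallest entry in each row of the distance matrix falls on the diagonal, which is exactly the dark diagonal line visible in Fig. 5 (a).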
Our system has two real-time demonstrations of gesture classification: control of the computer graphic crane of Fig. 4, and the game of "scissors/paper/stone", where the computer analyzes the user's hand gesture to decide the winner of each round.
3.1 Other uses

One can interpolate between orientation histogram values. For example, one can train the system at several different hand orientations. Then interpolation [11] ...

[Remainder truncated; surviving fragments, with plot residue removed:] "... by image intensities changing over time and space." "From our experience watching many people use the ..." "... the user is not satisfied with the gesture classification."
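The truncated Section 3.1 describes interpolating between orientation histograms trained at several hand orientations. One plausible realization, offered purely as a sketch (the paper's interpolation scheme follows [11] and may differ), blends the pose parameters of the two nearest training examples, weighted inversely by their distance to the observed histogram:

```python
import numpy as np

def interpolate_pose(train_hists, train_params, feature):
    """Estimate a continuous parameter (e.g. hand angle) from a histogram.

    train_hists: training orientation histograms, shape (n, n_bins).
    train_params: the pose parameter recorded for each training example.
    Blends the parameters of the two closest training histograms, each
    weighted by the distance to the *other* one (inverse-distance style),
    so an observation midway between two poses yields the midpoint.
    """
    d = np.linalg.norm(np.asarray(train_hists, dtype=float) - feature, axis=1)
    i, j = np.argsort(d)[:2]            # two closest training examples
    wi, wj = d[j], d[i]                 # swap: nearer example gets more weight
    if wi + wj == 0:                    # exact match with a training pose
        return train_params[i]
    return (wi * train_params[i] + wj * train_params[j]) / (wi + wj)
```

Trained at, say, 0 and 90 degrees, a histogram equidistant from both training vectors is mapped to 45 degrees; a histogram identical to a training example returns that example's parameter exactly.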