
Multimodal Speech-Gesture Interface for Handfree Painting on a Virtual Paper Using Partial Recurrent Neural Networks as Gesture Recognizer


Andrea Corradini, Philip R. Cohen
Oregon Graduate Institute of Science and Technology
Center for Human-Computer Communication
20000 N.W. Walker Rd., Beaverton, OR 97006
andrea@cse.ogi.edu

ABSTRACT
We describe a pointing and speech alternative to the current paint programs based on traditional devices like mouse, pen or keyboard. We used a simple magnetic field tracker-based pointing system as input device for a painting system to provide a convenient means for the user to specify paint locations on any virtual paper. The virtual paper itself is determined by the operator as a limited plane surface in the three-dimensional space. Drawing occurs with natural human pointing by using the hand to define a line in space, and considering its possible intersection point with this plane. The recognition of pointing gestures occurs by means of a partial recurrent artificial neural network. Gestures along with several vocal commands are utilized to act on the current painting in conformity with a predefined grammar.

Keywords
User-centered Interface, Painting Tool, Pointing Gesture, Speech Recognition, Communication Agent, Multimodal System, Augmented and Virtual Reality, Partial Recurrent Artificial Neural Network.

1. INTRODUCTION
The natural combination of a variety of modalities such as speech, gesture, gaze, and facial expression makes human-human communication easy, flexible and powerful. Similarly, when interacting with computer systems, people seem to prefer a combination of several modes to a single one alone [12, 19]. Despite the strong efforts and deep investigations in the last decade, human-computer interaction (HCI) is in its childhood and therefore its ultimate goal, aiming at building natural perceptual user interfaces, remains a challenging problem.
Two concurrent factors produce awkwardness. First, current HCI systems make use of both rigid rules and syntax over the individual modalities involved in the dialogue. Second, speech and gesture recognition, gaze tracking, and other channels are isolated because we do not understand how to integrate them to maximize their joint benefit [20,21,25,30]. While the first issue is intrinsically difficult (everyone claims to know what a gesture is, but nobody can tell you precisely), progress is being made in combining different modalities into a unified system. Such a multimodal system, allowing interactions that more resemble everyday communication, becomes more attractive to users.

1.1 Related Work
Like speech, gestures vary both from instance to instance for a given human being and among individuals. Beside this temporal variability, gestures vary even spatially, making them more difficult to deal with. For the recognition of those single modalities, only few systems make use of connectionist models [3,7,17,27], for they are not considered well suited to completely address the problem of time alignment and segmentation. However, some neural architectures [10,14,29] have been put forward and successfully exploited to partially solve problems involving the generation, learning or recognition of sequences of patterns.
Recently, several research groups have more thoroughly addressed the issue of combining verbal and nonverbal behavior. In this context, most of such multimodal systems have been quite successful in combining speech and gesture [4,6,26,28] but, to our knowledge, none exploits artificial neural networks.
One of the first such systems is Put-That-There [4], which uses speech recognition and allows for simple deictic reference to visible entities. A text editor featuring a multimodal interface that allows users to manipulate text using a combination of speech and pen-based gestures has been presented in [28]. Quickset [6], along with a novel integration strategy, offers mutual compensation between pen and voice modalities.
Among gestures, pointing is a compelling input modality that has led to friendlier interfaces (such as the mouse-enabled GUI) in the past. Unfortunately, few 3D systems that integrate speech and deictic gesture have been built to detect when a person is pointing without special hardware support and to provide the necessary information to determine the direction of pointing. Most of those systems have been implemented by applying computer vision techniques to observe and track finger and hand motion. The hand gesture-based pointing interface detailed in [24] was proposed to track the position of the fingertip, mapping the location at which the user points directly into 2D cursor movement on the screen. Fukumoto et al. [11] report a glove-free camera-based system providing pointing input for applications requiring computer control from a distance (such as a slide presentation aid).



Further stereo-camera techniques for the detection of real-time pointing gestures and the estimation of the direction of pointing have been exploited in [5, 8, 13]. More recently, [26] describes a bimodal speech/gesture interface, which is integrated in a 3D visual environment for computing in molecular biology. The interface lets researchers interact with 3D graphical objects in a virtual environment using spoken words and simple hand gestures.
In our system, we make use of the Flock of Birds (FOB) [1], a six-degree-of-freedom tracker device based on magnetic fields, to estimate the pointing direction. In an initialization phase, the user is required to set the target coordinates in the 3D space that bound his painting region. With natural human pointing behavior, the hand is used to define a line in space, roughly passing through the base and the tip of the index finger. This line does not usually lie in the target plane, but may intersect it at some point.
We recognize pointing gestures by means of a hybrid partial recurrent artificial neural network (RNN) consisting of a Jordan network [14] and a static network with buffered input to handle the temporal structure of the movement underlying the gesture. Concurrently, several speech commands can be issued asynchronously. They are recognized using Dragon 4.0, a commercial speech engine. Speech along with gestures is then used to put the system into various modes to affect the appearance of the current painting. Depending on the spoken command, we solve for the intersection point and use it either to directly render ink or to draw a graphical object (e.g. circle, rectangle, or line) at this position in the plane. Since the speech and tracking modules are implemented on different machines, we employed our agent architecture to allow the different modules to exchange messages and information.
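As a high-level illustration of this pipeline, the following sketch shows how the pieces could be wired together (hypothetical names throughout; the tracker, detector, recognizer, fusion and canvas objects stand in for the modules described above and are not the authors' actual interfaces):

    # Hypothetical glue loop: FOB reports -> motion detector -> RNN pointing
    # classifier -> fusion with asynchronous speech commands -> rendering.
    def run_painting_loop(tracker, motion_detector, recognizer, speech_queue, fusion, canvas):
        """Consume tracker reports and drive the painting canvas (illustrative only)."""
        for report in tracker.reports():            # position + quaternion reports
            event = motion_detector.update(report)  # 'start', 'stop', or None
            if event == "start":
                recognizer.begin_segment()
            recognizer.feed(report)
            if event == "stop" and recognizer.classify() == "pointing":
                hit = canvas.intersect(report)      # ray/plane intersection (Section 3.1)
                command = fusion.await_speech(speech_queue, timeout_s=4.0)
                if command is not None and hit is not None:
                    canvas.apply(command, hit)      # e.g. render ink or place a shape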
2. DEICTIC GESTURES
As for HCI, currently no comprehensive classification of natural gestures exists that would help in establishing a methodology for gesture understanding. However, there is a general agreement in defining the class of deictic or pointing gestures [9,15,18]. The term deictic is used in reference to gestures or words that draw attention to a physical point or area in the course of a conversation.
Among natural human gestures, pointing gestures are the easiest to identify and interpret. There are three body parts which can conventionally be used to point: the hands, the head, and the eyes. Here we are concerned only with manual pointing.
In Western society, there are two distinct forms of manual pointing which regularly co-occur with deictic words (like "this", "that", "those" etc.): the one-finger-pointing to identify a single or a group of objects, a place or a direction, and the flat-hand-pointing to describe paths or spatial evolutions of roads or ranges of hills. Some researchers [23] argue that pointing has iconic properties and represents a prelinguistic and visually perceivable event. In fact, in face-to-face communication deictic speech never occurs without an accompanying pointing gesture. In a shared visual context, any verbal deictic expression like "there" is unspecified without a parallel pointing gesture¹. These multiple modes may seem redundant until we consider a pointing gesture as a complement to speech, which helps form semantic units. We can easily realize that by speaking with children. They compensate for their limited vocabulary by pointing more, probably because they cannot convey as much information about an object/location by speaking as they could by directing their interlocutor to perceive it with his own eyes.

2.1 An Empirical Study
As described in the previous section, pointing is an intentional behavior that aims at directing the listener's visual attention to either an object or a direction. It is controlled both by the pointer's eyes and by muscular sense (proprioception).
In the real world, our pointing actions are not coupled with "cursors", yet our interlocutors can often discern the intended referents, processing the pointing action and the deictic language together.
We conducted an empirical experiment to investigate how precise pointing is when no visual feedback is available. We invited four subjects to point at a target spot on the wall using a laser pointer. They did this task from six different distances away from the wall (equally distributed from 0.5m to 3m), ten times for each. Each time the subject attempted to point at the target with the beam turned off. Once a subject was convinced that he had directed the laser pointer toward the bull's-eye, we turned on the laser and determined the point the user was really aiming at. We measured the distance between the bull's-eye the user was aiming for and the actual spot he indicated with the laser pointer. We then computed the overall error for each distance as the average distance between desired and actual points on the wall over all the trials for that distance from the wall.
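A minimal sketch of this aggregation, assuming the trial data are available as (distance, aimed point, hit point) records with wall coordinates in centimeters; the names and data layout are illustrative, not the analysis code actually used:

    import math
    from collections import defaultdict

    def mean_error_per_distance(trials):
        """trials: list of (distance_m, (x_aim, y_aim), (x_hit, y_hit)) tuples.
        Returns {distance_m: mean Euclidean error in cm over all trials at that distance}."""
        errors = defaultdict(list)
        for distance, aimed, hit in trials:
            errors[distance].append(math.dist(aimed, hit))  # distance on the wall plane
        return {d: sum(v) / len(v) for d, v in errors.items()}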
The subjects were requested to perform this experiment twice and in two different ways: in a "natural" way and in an "improved" way. For the natural way, we asked the persons involved in this experiment to naturally point at the target, while in the improved way we specifically asked the person to try to achieve the best result (some people put the laser pointer right in front of the eye and closed the other, others put it right in front of the nose, etc.). The outcome of the experiment is shown in Figure 1.

¹ This does not happen in sentences that are used for referencing places, objects or events the interlocutors have clear in their minds because of the dialogue context. E.g., in "Have you been to Italy? Yes, I have been there twice" or "I watched Nuovo Cinema Paradiso on TV yesterday. Didn't that film win an Oscar in 1989?", deictic words are not accompanied by pointing gestures. Neither are they in sentences like "There shall come a time", or "They all know that Lara is cute", or "The house that she built is huge", where they are used as conjunction or pronoun.



As expected, we can see how with increasing distance the error increases as well. In addition, when the user pointed at the given spot from a distance of 1 meter, the error decreased from 9.08 to 3.89 centimeters from the "natural" to the "improved" way.

Figure 1: target pointing precision (pointing inaccuracy: average error versus distance from the wall, 0.5m to 3m, for improved and natural pointing).

In light of this experiment, reference resolution of deictic gestures without verbal language is an issue. In particular, when small objects are placed close together, reference resolution via deictic gesture can be impossible without the help of spoken specification. In addition, direct mapping between the 3D user input (the hand movement) and user intention (pointing on the target plane) can be carefully performed only with visual feedback (information on the current position). In the next section, we describe the system that has been built according to these considerations.

3. THE PAINTING SYSTEM
3.1 Estimating the pointing direction
For the whole system to work, the user is required to wear a hand glove on whose top we put one of the FOB's sensors. The FOB is a six-degree-of-freedom tracker device based on magnetic fields which we exploit to track the position and orientation of the user's hand with respect to the coordinate system determined by the FOB's transmitter. The hand's position is given by the position vector reported by the sensor at a frequency of approximately 50Hz. For the orientation, we put the sensor almost at the back of the index finger with its relative x-coordinate axis directed toward the index fingertip. In this way, using the quaternion values reported by the sensor, we can apply mathematical transformations within quaternion algebra to determine the unit vector X which unambiguously defines the direction of the sensor and therefore that of pointing (Figure 2).
The sensor's position, along with the vector X, is then used to determine the equation of the imaginary line passing through that position and having direction X. When the system is started for the first time, the user has to choose the region he wants to paint in. This is accomplished by letting the user choose three of the vertices of the future rectangular painting region. These points are chosen by pointing at them. However, since this procedure is to be done in the 3D space, the user has to aim at each of the vertices from two different positions. The two different vectors triangulate to select a point as vertex. In 3D space, two lines will generally not have an intersection. In such cases, we will use the point of minimum distance from both lines.
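The two geometric operations just described can be sketched as follows, under assumed conventions (quaternion components ordered w, x, y, z and the sensor's local +x axis taken as the pointing axis); the function names are illustrative only:

    import numpy as np

    def pointing_direction(q):
        """Rotate the sensor's local +x axis by the unit quaternion q = (w, x, y, z)."""
        w, x, y, z = q
        # First column of the rotation matrix associated with q.
        return np.array([1 - 2*(y*y + z*z), 2*(x*y + w*z), 2*(x*z - w*y)])

    def closest_point_between_rays(p1, d1, p2, d2):
        """Midpoint of the shortest segment between lines p1 + s*d1 and p2 + t*d2.
        Used as the vertex when the two pointing lines do not actually intersect."""
        d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
        w0 = p1 - p2
        a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
        d, e = d1 @ w0, d2 @ w0
        denom = a*c - b*b
        if abs(denom) < 1e-9:            # nearly parallel lines: no reliable vertex
            return None
        s = (b*e - c*d) / denom
        t = (a*e - b*d) / denom
        return (p1 + s*d1 + p2 + t*d2) / 2.0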
With natural human pointing behavior, the hand is used to define a line in space, roughly passing through the base and the tip of the index finger. Normally, this line does not lie in the target plane but may intersect it at some point. It is this point that we aim to recover.
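Recovering that point is a standard ray-plane intersection. A minimal sketch, assuming the virtual paper is given by one of its chosen vertices and its normal vector (names are illustrative):

    import numpy as np

    def intersect_ray_with_plane(origin, direction, plane_point, plane_normal):
        """Intersect the pointing ray origin + t*direction (t >= 0) with the target plane."""
        denom = np.dot(plane_normal, direction)
        if abs(denom) < 1e-9:                 # ray parallel to the virtual paper
            return None
        t = np.dot(plane_normal, plane_point - origin) / denom
        if t < 0:                             # plane lies behind the hand
            return None
        return origin + t * direction         # 3D paint location on the plane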
For this reason, when the region selected in the 3D space is neither a wall screen nor a general surface on which the input can be directly output (a tablet, the computer's monitor, etc.), the system can be properly used only when the magnetic sensor is aligned and used together with a light pointer. However, in this situation we also implemented a rendering module to draw the actual painting on the screen regardless of the target plane chosen in the 3D space.

Figure 2: selecting a graphic tablet as target region for painting enables direct visual feedback. The frame of reference of the sensor is shown on the left. On the right an exemplary painting is shown as it appears on the tablet.

3.2 Motion Detection for Segmentation
In order to describe the motion detector in detail, we first need to give some definitions.
We consider the FOB data stream static anytime the sensor attached to the user's hand remains stationary for at least five consecutive FOB reports. In this case, we also refer to the user as being in the resting position. In a similar way, we say the user is moving and we consider the data stream dynamic whenever the incoming reports change in their spatial location at least five times in a row.
Static and dynamic data streams are defined in such a way that they are mutually exclusive, but not exhaustive. In other words, if one definition is satisfied, that implies that the other is not. However, the converse is not the case: if one definition is not satisfied, this does not imply that the other is. Such non-complementarity makes the motion detector module robust against noisy data.



For real-time performance purposes, the FOB's data are currently downsampled to 10Hz so that both static and dynamic conditions need be fulfilled over a time range of approximately half a second.
The motion detector is in charge of reporting to the gesture recognizer anytime a transition from static into dynamic, or from dynamic into static, occurs. If the transition is from static into dynamic, the motion detector forwards the input stream to the RNN for the classification to start. The RNN is queried for the classification result if the following conditions hold simultaneously:
a) the opposite transition from dynamic into static is detected
b) the imaginary line described by the sensor in the space intersects the target region (the virtual paper)
c) the time elapsed between the beginning of the classification and the current time is less than a given threshold chosen according to the maximum duration of the gestures used for the RNN training
The motion detector provides the recognizer with explicit start and end points for classification. Therefore, the RNN needs no segmentation to identify these characteristics of the movement. With the use of such a motion detector, gestures need not start with the hand in a given user-defined position, either.
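A compact sketch of such a motion detector is given below; the five-report window follows the text (roughly half a second at the downsampled 10Hz rate), while the stillness tolerance eps is an assumed parameter, not a value taken from the paper:

    import numpy as np

    class MotionDetector:
        """Emit 'start' on a static->dynamic transition and 'stop' on the opposite one."""

        def __init__(self, window=5, eps=1e-3):
            self.window, self.eps = window, eps
            self.last = None
            self.deltas = []           # magnitudes of the most recent displacements
            self.moving = False

        def update(self, position):
            position = np.asarray(position, dtype=float)
            if self.last is not None:
                self.deltas.append(np.linalg.norm(position - self.last))
                self.deltas = self.deltas[-self.window:]
            self.last = position
            if len(self.deltas) < self.window:
                return None
            if not self.moving and all(d > self.eps for d in self.deltas):
                self.moving = True
                return "start"         # forward the stream to the RNN from here on
            if self.moving and all(d <= self.eps for d in self.deltas):
                self.moving = False
                return "stop"          # query the RNN if the line hits the virtual paper
            return None                # mixed windows: neither definition is satisfied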
3.3 Gesture Recognition by means of RNN
We use artificial neural networks to capture the spatially nonlinear local dependencies and handle the temporal structure of a gesture.
The most direct way to perform sequence recognition by static artificial neural networks is to turn the largest possible part of the temporal pattern sequence into an input buffer on the input layer of a network. In such a network, a part of the input sequence is presented simultaneously to the network by feeding the signal into the input buffer and then shifting it at various time intervals. The buffer must be chosen in advance and has to be large enough both to contain the longest possible sequence and to maintain as much context-dependent information as possible, but a large buffer means a large number of parameters and implicitly a large quantity of required training data for successful learning and generalization.
Partial recurrent artificial neural networks (RNNs) are a compromise between the simplicity of these feedforward nets and the complexity of recurrent models. RNNs are currently the most successful architectures for sequence recognition and reproduction by connectionist methods. They are feedforward models with an additional set of fixed feedback connections to encode the information from the most recent past. That recurrence does not complicate the training algorithm since its value is fixed and hence not trainable.
We deploy a partial recurrent network in which the most recent output patterns, which intrinsically depend on the input sequence, are fed back into the input layer. In addition, we furnish the input layer with a time window to accommodate a part of the input.

Figure 3: Example of RNN for Gesture Recognition.

The resulting network (Figure 3) is a hybrid combination of a Jordan network [14] and a static network with buffered input. The input layer is divided into two parts: the context unit and the buffered input units, which currently contain 5 time steps (i.e. half a second). The context unit holds a copy of the output layer and of itself at the previous time step. The recurrence's weight from the output layer is kept fixed at 1. Indicating with μ the strength of the self-connection, the updating rule at time instant t for the context unit C(t) can be expressed in mathematical form as a function of the network's output O(t) as:

C(t) = μ·C(t−1) + O(t−1) = Σ_{τ=0}^{t−1} μ^τ·O(t−1−τ)

It turns out that the context unit accumulates the past output values of the network in a manner depending on the choice of the parameter μ in the range [0,1]. Close to one, it extends the memory further back into the past but causes loss of sensitivity to detail. This recurrence permits the network to remember some aspects of the most recent past, giving the network a kind of short-term memory. At a given time, the network output depends as well on the current input as on an aggregate of past values.
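In code, the context-unit recurrence amounts to a single line per time step; the sketch below (with an assumed parameter name mu, not taken from an actual implementation) also shows the equivalent accumulation over all past outputs:

    def update_context(context, previous_output, mu):
        """One step of the Jordan-style context unit: C(t) = mu*C(t-1) + O(t-1)."""
        return mu * context + previous_output

    def context_closed_form(outputs, mu):
        """Equivalent accumulation over all past outputs O(0)..O(t-1),
        i.e. the sum of mu**tau * O(t-1-tau) for tau = 0..t-1."""
        context = 0.0
        for o in outputs:          # oldest to newest
            context = mu * context + o
        return context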
As input values for the RNN, we use the signs of the differences between the spatial components of the current and previous sensor location. Thus, the input is three-dimensional, translation invariant, and its components can assume only one value among {-1, 0, 1}.
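This encoding is a componentwise sign of the displacement between consecutive sensor positions, as in the following minimal sketch (numpy's sign already yields values in {-1, 0, 1}):

    import numpy as np

    def encode_motion(previous_position, current_position):
        """Map a pair of consecutive 3D sensor positions to the RNN input vector:
        the componentwise sign of the displacement, each entry in {-1, 0, 1}."""
        delta = np.asarray(current_position, float) - np.asarray(previous_position, float)
        return np.sign(delta).astype(int)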
During the training, the desired output is always associated with the center of the buffered input window. As target, we take the sample following that center within the training sequence. An additional output neuron is always presented the value 1 when the sequence is a deictic gesture, the value 0 otherwise. Preclassified patterns are not necessary. The input vectors constituting the sequence are stepped through the context window time step by time step. The weight adjustment is carried out by backpropagation with a least mean square error function.


Both hidden and output neurons compute the sigmoid activation function. The additional output neuron is checked anytime a classification result is required.
We tested the recognizer on four sequences, two of a person performing pointing gestures toward a given virtual paper, and two of the same person gesticulating during a monologue without deictic gestures. Each sequence lasts 10 minutes and is sampled at 10Hz. One sequence from each class was used for training and one for testing. The recognition rate was up to 89% for pointing gestures, and up to 76% for non-pointing gestures. While only a few sequences from the deictic gesture data set were misrecognized (false negatives), many more movements from the non-pointing gesture data set were misrecognized as pointing gestures (false positives). This is not surprising since long-lasting gestures, which occur frequently during a monologue/conversation, are very likely to contain segment patterns that are very similar to deictic gestures. Due to the nature of the training data, the performed test looks only at the boundary conditions (false positives/negatives). We plan to collect and transcribe data from users during a conversational event where both deictic and non-deictic gestures occur. Testing the system with this more natural data will permit us to assess more precisely the performance of the recognizer.
3.4 The Speech Agent
We make use of Dragon 4.0, a Microsoft SAPI 4.0 compliant speech engine. This speech recognition engine captures an audio stream and produces a list of text interpretations (with associated probabilities of correct recognition) of that speech audio. These text interpretations are limited by a grammar that is supplied to the speech engine upon startup.
The following grammar specifies the possible self-explanatory sentences:
1: <Sentence> = <answer> | <color> | <double> | <single>
2: <answer> = no | yes
3: <color> = green | red | blue | yellow | white | magenta | cyan
4: <double> = draw on | draw off | zoom in | zoom out | cursor on | cursor off | line begin | paste | select end | select begin | line end | copy | circle end | circle begin | rectangle end | rectangle begin
5: <single> = exit | help | undo | switch to foreground | save | free buffer | switch to background | send to background | cancel | restart | delete | load
Here, <single> and <double> refer to the sets of commands which need to be issued without and with an accompanying pointing gesture, respectively.
The user uses voice commands to put the system into various modes that remain in effect until he changes them. Speech commands can be entered at any time and are recognized in continuous mode.

3.5 The Fusion Agent
The Fusion Agent is a finite state automaton that is in charge of two major functions, i.e., the rendering and the temporal fusion of speech and gesture information.
The rendering is implemented with OpenGL on an SGI machine utilizing the Virtual Reality Peripheral Network (VRPN) [2] driver for the FOB.
The fusion is based on a time-out variable. Once a pointing gesture is recognized, a valid spoken command must be entered within a given time (currently 4 seconds, as speech usually follows gestures [22]) or another pointing gesture must occur. Eventually, the Fusion Agent either takes the action (such as changing the drawing color, selecting the first point of a line, etc.) associated with the speech command or issues an acoustic warning signal.
The modal nature of the state machine ensures consistent command sequences (e.g., "line begin" can only be followed by "undo", "cancel" or "line end"). Depending on the performed action, the system may undergo a state change.
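The time-out based fusion just described can be sketched as a small state machine; the 4-second window and the gesture-then-speech ordering follow the text, while the class and method names are assumptions:

    import time

    class FusionAgent:
        """Sketch of time-out based fusion: a pointing hit stays pending for a
        few seconds and is consumed by the next valid spoken command."""

        TIMEOUT_S = 4.0                      # speech usually follows the gesture [22]

        def __init__(self):
            self.pending_hit = None
            self.pending_since = 0.0

        def on_pointing(self, hit):
            # A new pointing gesture replaces any still-pending one.
            self.pending_hit, self.pending_since = hit, time.monotonic()

        def on_speech(self, command, needs_gesture):
            if not needs_gesture:            # <single> commands: act immediately
                return ("execute", command, None)
            expired = time.monotonic() - self.pending_since > self.TIMEOUT_S
            if self.pending_hit is None or expired:
                return ("warn", None, None)  # acoustic warning signal
            hit, self.pending_hit = self.pending_hit, None
            return ("execute", command, hit) # e.g. "line begin" at the hit point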

3.6 Agent Architecture
The modules implemented for tracking, pointing and painting, and speech command recognition need to communicate with each other. Agents communicate by passing Prolog-type ASCII strings (Horn clauses) via TCP/IP.

Figure 4: agent communication within the entire system.

The central agent is the facilitator. Agents can inform the facilitator of their interest in messages which match (logically unify) with a certain expression. Thereafter, when the facilitator receives a matching message from some other agent, it will pass it along to the interested agent.



Since ASCII strings and TCP/IP are common across various platforms, agents can be used as software components that can communicate across platforms.
In this case, the Speech Agent is running on a Windows platform. The best off-the-shelf speech recognition engines available to us (currently, Dragon) are on the Windows platform. On the other hand, the Flock of Birds and the VRPN server are set up for Unix. Therefore, it makes sense to tie them together with the agent architecture (Figure 4).
Communication is straightforward. The Speech Agent produces messages of the type parse_speech(Message), which the facilitator forwards to the Fusion Agent. The latter, with some simple parsing, can then extract the speech recognition alternate interpretations and their associated probabilities from the message strings. The command associated with the highest probability value above an experimental threshold (currently 0.85) is chosen.
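A minimal sketch of the receiving side is shown below; the exact shape of the parse_speech(...) string is not given in the text, so the alt(command, probability) format assumed here is purely illustrative:

    import re

    def best_interpretation(message, threshold=0.85):
        """Extract (command, probability) pairs from a Prolog-style string such as
        parse_speech([alt('line begin', 0.91), alt('draw on', 0.40)]) and return the
        most probable command above the threshold, or None. Message format is assumed."""
        alts = re.findall(r"alt\('([^']*)',\s*([0-9.]+)\)", message)
        best = max(((cmd, float(p)) for cmd, p in alts), key=lambda a: a[1], default=None)
        if best is None or best[1] < threshold:
            return None
        return best[0]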
4. Conclusions and Future Work
The presented system represents a real-time application of drawing in space on a two-dimensional limited rectangular surface. This is a first step toward a 3D multimodal speech and gesture system for computer-aided design and cooperative tasks. A system might perhaps recognize from the user's input some 3D objects from an iconic library and refine the user's drawings accordingly. We anticipate expanding the use of speech to operate with 3D objects. Since the fusion component is an agent, we are going to make it a module in the entire QuickSet Adaptive Agent Architecture [16], to further use it as a sort of virtual mouse for the QuickSet [6] user interface. Possible alternative applications for this system range from hand cursor control by pointing to target selection in virtual environments.

5. ACKNOWLEDGMENTS
This research is supported by the Office of Naval Research, Grants N00014-99-1-0377 and N00014-99-1-0380. Thanks to Rachel Coulston for help editing and Richard M. Wesson for programming support.
6. REFERENCES
[1] http://www.ascension-tech.com
[2] Taylor R.M., VRPN: A Device-Independent, Network-Transparent VR Peripheral System, Proceedings of the ACM Symposium on Virtual Reality Software and Technology, 2001.
[3] Boehm K., Broll W., Sokolewicz M., Dynamic Gesture Recognition using Neural Networks; A Fundament for Advanced Interaction Construction, SPIE Conf. Electronic Imaging Science & Technology, 1994.
[4] Bolt R.A., Put-That-There: voice and gesture at the graphics interface, Computer Graphics, Vol. 14, No. 3, 262-270, 1980.
[5] Cipolla R., Hadfield P.A., Hollinghurst N.J., Uncalibrated Stereo Vision with Pointing for a Man-Machine Interface, in Proc. of the IAPR Workshop on Machine Vision Applications, 163-166, 1994.
[6] Cohen P.R., et al., Quickset: Multimodal interactions for distributed applications, Proc. of the 5th Int'l Multimedia Conf., 31-40, 1997.
[7] Corradini A., Gross H.-M., Camera-based Gesture Recognition for Robot Control, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, Vol. IV, 133-138, 2000.
[8] Crowley J.L., Berard F., and Coutaz J., Finger Tracking as an Input Device for Augmented Reality, in Proc. of the Int'l Workshop on Automatic Face and Gesture Recognition, 195-200, 1995.
[9] Efron D., Gesture, Race and Culture, Mouton and Co., 1972.
[10] Elman J.L., Finding Structure in Time, Cognitive Science, 14:179-211, 1990.
[11] Fukumoto M., Mase K., and Suenaga Y., Realtime detection of pointing actions for a glove-free interface, in Proceedings of the IAPR Workshop on Machine Vision Applications, 473-476, 1992.
[12] Hauptmann A.G., and McAvinney P., Gesture with speech for graphics manipulation, International Journal of Man-Machine Studies, Vol. 38, 231-249, February 1993.
[13] Jojic N., et al., Detection and Estimation of Pointing Gestures in Dense Disparity Maps, in Proceedings of the International Conference on Automatic Face and Gesture Recognition, 468-474, 2000.
[14] Jordan M., Serial Order: A Parallel Distributed Processing Approach, Advances in Connectionist Theory, Lawrence Erlbaum, 1989.
[15] Kendon A., The Biological Foundations of Gestures: Motor and Semiotic Aspects, Lawrence Erlbaum Associates, 1986.
[16] Kumar S., Cohen P.R., Levesque H.J., The Adaptive Agent Architecture: Achieving Fault-Tolerance Using Persistent Broker Teams, Proc. 4th Int'l Conf. on Multi-Agent Systems, 159-166, 2000.
[17] Lippmann R.P., Review of Neural Networks for Speech Recognition, Neural Computation, 1:1-38, 1989.
[18] McNeill D., Hand and Mind: what gestures reveal about thought, The University of Chicago Press, 1992.
[19] Oviatt S.L., Multimodal interfaces for dynamic interactive maps, in Proceedings of the Conference on Human Factors in Computing Systems: CHI, 95-102, 1996.
[20] Oviatt S.L., Cohen P.R., Multimodal interfaces that process what comes naturally, Communications of the ACM, 43(3):45-53, 2000.
[21] Oviatt S.L., et al., Designing the user interface for multimodal speech and gesture applications: State-of-the-art systems and research directions, Human-Computer Interaction, 15(4):263-322, 2000.
[22] Oviatt S., De Angeli A., Kuhn K., Integration and Synchronization of Input Modes during Multimodal Human-Computer Interaction, Proceedings of CHI '97, 415-422, 1997.
[23] Place U.T., The Role of the Hand in the Evolution of Language, Psycoloquy, Vol. 11, No. 7, 2000, http://www.cogsci.soton.ac.uk
[24] Quek F., Mysliwiec T.A., Zhao M., FingerMouse: A Freehand Computer Pointing Interface, in Proc. of the Int'l Conf. on Automatic Face and Gesture Recognition, 372-377, 1995.
[25] Quek F., et al., Gesture and Speech Multimodal Conversational Interaction, Tech. Rep. VISLab-01-01, University of Illinois, 2001.
[26] Sharma R., et al., Speech/Gesture Interface to a Visual-Computing Environment, IEEE Computer Graphics and Applications, 20(2):29-37, 2000.
[27] Tank D.W., Hopfield J.J., Concentrating Information in Time: Analog Neural Networks with Applications to Speech Recognition, Proc. of the 1st Int'l Conf. on Neural Networks, Vol. IV, 455-468, 1987.
[28] Vo M.T., Waibel A., A Multimodal human-computer interface: combination of gesture and speech recognition, InterCHI, 1993.
[29] Waibel A., et al., Phoneme Recognition Using Time-Delay Neural Networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(12):1888-1898, 1989.
[30] Wu L., Oviatt S., Cohen P.R., Multimodal Integration - A Statistical View, IEEE Transactions on Multimedia, 1(4):334-341, 2000.

