ABSTRACT
We describe a pointing and speech alternative to current paint programs based on traditional devices like mouse, pen or keyboard. We used a simple magnetic field tracker-based pointing system as input device for a painting system to provide a convenient means for the user to specify paint locations on any virtual paper. The virtual paper itself is determined by the operator as a limited plane surface in three-dimensional space. Drawing occurs with natural human pointing, by using the hand to define a line in space and considering its possible intersection point with this plane. The recognition of pointing gestures occurs by means of a partial recurrent artificial neural network. Gestures, along with several vocal commands, are utilized to act on the current painting in conformity with a predefined grammar.
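To make the intersection mechanism concrete, the sketch below computes the point where the pointing line pierces the virtual paper. It is only an illustration under our own assumptions (NumPy vectors, a hypothetical ray_plane_intersection helper), not the system's actual code.

    import numpy as np

    def ray_plane_intersection(origin, direction, plane_point, plane_normal,
                               eps=1e-9):
        """Intersect the pointing ray origin + t*direction (t >= 0) with the
        virtual-paper plane given by one of its points and its normal.
        Returns the 3D intersection point, or None if the ray is parallel
        to the plane or points away from it."""
        direction = direction / np.linalg.norm(direction)
        denom = np.dot(plane_normal, direction)
        if abs(denom) < eps:   # ray (nearly) parallel to the paper
            return None
        t = np.dot(plane_normal, plane_point - origin) / denom
        if t < 0:              # paper lies behind the pointing hand
            return None
        return origin + t * direction

    # Illustrative usage: the paper is the plane z = 1 and the user points
    # roughly along +z from the origin; the hit lands at (0.1, 0, 1).
    hit = ray_plane_intersection(np.array([0.0, 0.0, 0.0]),
                                 np.array([0.1, 0.0, 1.0]),
                                 np.array([0.0, 0.0, 1.0]),
                                 np.array([0.0, 0.0, 1.0]))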
Keywords
User-centered Interface, Painting Tool, Pointing Gesture, Speech Recognition, Communication Agent, Multimodal System, Augmented and Virtual Reality, Partial Recurrent Artificial Neural Network.
1. INTRODUCTION
The natural combination of a variety of modalities such as speech, gesture, gaze, and facial expression makes human-human communication easy, flexible and powerful. Similarly, when interacting with computer systems, people seem to prefer a combination of several modes to a single one alone [12,19]. Despite the strong efforts and deep investigations of the last decade, human-computer interaction (HCI) is in its childhood, and therefore its ultimate goal, aiming at building natural perceptual user interfaces, remains a challenging problem.
Two concurrent factors produce awkwardness. First, current HCI systems impose rigid rules and syntax on the individual modalities involved in the dialogue. Second, speech and gesture recognition, gaze tracking, and other channels are isolated because we do not understand how to integrate them to maximize their joint benefit [20,21,25,30]. While the first issue is intrinsically difficult (everyone claims to know what a gesture is, but nobody can tell you precisely), progress is being made in combining different modalities into a unified system. Such a multimodal system, allowing interactions that more closely resemble everyday communication, becomes more attractive to users.

1.1 Related Work
Like speech, gestures vary both from instance to instance for a given human being and among individuals. Beside this temporal variability, gestures vary even spatially, which makes them more difficult to deal with. For the recognition of these single modalities, only few systems make use of connectionist models [3,7,17,27], for they are not considered well suited to completely address the problems of time alignment and segmentation. However, some neural architectures [10,14,29] have been put forward and successfully exploited to partially solve problems involving the generation, learning or recognition of sequences of patterns.
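The recognizer mentioned in the abstract belongs to this family of partial recurrent architectures. As an illustration in the spirit of Jordan's serial-order networks [14], the sketch below implements a minimal forward pass in which context units feed a decayed copy of the previous output back into the hidden layer; the layer sizes, sigmoid nonlinearity, and decay constant are illustrative assumptions, not the paper's actual configuration.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class JordanNetwork:
        """Partial recurrent network: the hidden layer sees the current
        input plus context units holding a decayed copy of the previous
        output, giving the net a short-term memory over input sequences."""
        def __init__(self, n_in, n_hidden, n_out, decay=0.5, seed=0):
            rng = np.random.default_rng(seed)
            self.W_ih = rng.normal(scale=0.1, size=(n_hidden, n_in))
            self.W_ch = rng.normal(scale=0.1, size=(n_hidden, n_out))
            self.W_ho = rng.normal(scale=0.1, size=(n_out, n_hidden))
            self.decay = decay
            self.context = np.zeros(n_out)

        def step(self, x):
            h = sigmoid(self.W_ih @ x + self.W_ch @ self.context)
            y = sigmoid(self.W_ho @ h)
            # Context units mix their previous state with the new output.
            self.context = self.decay * self.context + y
            return y

    # Feed a sequence of (synthetic) tracker feature vectors frame by frame.
    net = JordanNetwork(n_in=6, n_hidden=12, n_out=2)
    for frame in np.random.default_rng(1).normal(size=(30, 6)):
        scores = net.step(frame)   # e.g. pointing vs. non-pointing evidence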
Recently, several research groups have more thoroughly addressed the issue of combining verbal and nonverbal behavior. In this context, most such multimodal systems have been quite successful in combining speech and gesture [4,6,26,28] but, to our knowledge, none exploits artificial neural networks.

One of the first such systems is Put-That-There [4], which uses speech recognition and allows for simple deictic reference to visible entities. A text editor featuring a multimodal interface that allows users to manipulate text using a combination of speech and pen-based gestures has been presented in [28]. Quickset [6], along with a novel integration strategy, offers mutual compensation between the pen and voice modalities.

Among gestures, pointing is a compelling input modality that has led to friendlier interfaces (such as the mouse-enabled GUI) in the past. Unfortunately, few 3D systems that integrate speech and deictic gesture have been built to detect when a person is pointing without special hardware support and to provide the necessary information to determine the direction of pointing. Most of those systems have been implemented by applying computer vision techniques to observe and track finger and hand motion. The hand gesture-based pointing interface detailed in [24] tracks the position of the fingertip with which the user points and maps it directly into 2D cursor movement on the screen. Fukumoto et al. [11] report a glove-free, camera-based system providing pointing input for applications requiring computer control from a distance (such as a slide presentation aid). Further stereo-camera [...]
[...] Some researchers [23] argue that pointing has iconic properties and represents spatial evolutions of roads, ranges of hills. [...]

[...] "Oscar in 1989?", deictic words are not accompanied by pointing gestures. Neither are they in sentences like "There shall come a time", "They all know that Lam is cute", or "The house that she built is huge", where they are used as conjunction or pronoun.
[...] explanatory sentences:

1: <Sentence> = <answer> | <color> | <double> | <single>
2: <answer> = no | yes
3: <color> = green | red | blue | yellow | white | magenta | cyan
4: <double> = draw on | draw off | zoom in | zoom out | cursor on | cursor off | line begin | paste | select end | select begin | line end | copy | circle end | circle begin | rectangle end | rectangle begin
5: <single> = exit | help | undo | switch to foreground | save | free buffer | switch to background | send to background | cancel | restart | delete | load

Here, <single> and <double> refer to the sets of commands which need to be issued without and with an accompanying pointing gesture, respectively.
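Operationally, the grammar reduces to a lookup that maps each recognized utterance to its category and indicates whether a pointing gesture must accompany it. The following table-driven sketch transcribes the productions above; the data-structure and function names are our own, not the system's.

    # Command sets transcribed from the grammar above.
    ANSWER = {"no", "yes"}
    COLOR = {"green", "red", "blue", "yellow", "white", "magenta", "cyan"}
    DOUBLE = {"draw on", "draw off", "zoom in", "zoom out", "cursor on",
              "cursor off", "line begin", "paste", "select end",
              "select begin", "line end", "copy", "circle end",
              "circle begin", "rectangle end", "rectangle begin"}
    SINGLE = {"exit", "help", "undo", "switch to foreground", "save",
              "free buffer", "switch to background", "send to background",
              "cancel", "restart", "delete", "load"}

    def classify(utterance):
        """Return (category, needs_gesture) for a recognized utterance,
        or (None, False) if it falls outside the grammar."""
        u = utterance.strip().lower()
        for name, commands, needs_gesture in (("answer", ANSWER, False),
                                              ("color", COLOR, False),
                                              ("double", DOUBLE, True),
                                              ("single", SINGLE, False)):
            if u in commands:
                return name, needs_gesture
        return None, False

    assert classify("zoom in") == ("double", True)   # needs a pointing gesture
    assert classify("undo") == ("single", False)     # stands on its own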
Figure 4: agent communication within the entire system.

The central agent is the facilitator. Agents can inform the facilitator of their interest in messages which match (logically unify) with a certain expression. Thereafter, when the facilitator receives a matching message from some other agent, it will pass it along to the interested agent. Since ASCII strings and TCP/IP are common across various [...]

Communication is straightforward. The Speech Agent produces messages of the type parse-speech(Message), which the facilitator forwards to the Fusion Agent. The latter, with some simple parsing, can then extract the speech recognizer's alternate interpretations and their associated probabilities from the message strings. The command associated with the highest probability value above an experimental threshold (currently 0.85) is chosen.
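A compressed sketch of this message flow follows. The facilitator below matches subscriptions by message head rather than performing full logical unification, and apart from the parse-speech message type and the 0.85 threshold quoted above, every name is an illustrative assumption.

    class Facilitator:
        """Forwards each posted message to the agents that registered
        interest in its head; real unification is approximated here by
        exact matching on the message head."""
        def __init__(self):
            self.subscriptions = {}   # message head -> list of callbacks

        def subscribe(self, head, callback):
            self.subscriptions.setdefault(head, []).append(callback)

        def post(self, head, payload):
            for callback in self.subscriptions.get(head, []):
                callback(payload)

    THRESHOLD = 0.85  # experimental threshold quoted in the text

    def fusion_agent(interpretations):
        """Choose the interpretation with the highest probability, provided
        it exceeds the threshold; otherwise reject the utterance."""
        command, prob = max(interpretations, key=lambda cp: cp[1])
        return command if prob >= THRESHOLD else None

    facilitator = Facilitator()
    facilitator.subscribe("parse-speech",
                          lambda msg: print("fused:", fusion_agent(msg)))
    # The Speech Agent posts alternate interpretations with probabilities.
    facilitator.post("parse-speech", [("zoom in", 0.91), ("zoom out", 0.42)])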
4. Conclusions and Future Work
The presented system represents a real-time application of drawing in space on a two-dimensional limited rectangular surface. This is a first step toward a 3D multimodal speech and gesture system for computer-aided design and cooperative tasks. A system might perhaps recognize from the user's input some 3D objects from an iconic library and refine the user's drawings accordingly. We anticipate expanding the use of speech to operate with 3D objects. Since the fusion component is an agent, we are going to make it a module in the entire QuickSet Adaptive Agent Architecture [16], to further use it as a sort of virtual mouse for the QuickSet [6] user interface. Possible alternative applications for this system range from hand cursor control by pointing to target selection in virtual environments.

5. ACKNOWLEDGMENTS
This research is supported by the Office of Naval Research, Grants N00014-99-1-0377 and N00014-99-1-0380. Thanks to Rachel Coulston for help editing and Richard M. Wesson for programming support.
6. REFERENCES
[1] http://www.ascension-tech.com
[2] Taylor R.M., VRPN: A Device-Independent, Network-Transparent VR Peripheral System, Proceedings of the ACM Symposium on Virtual Reality Software and Technology, 2001.
[3] Boehm K., Broll W., Sokolewicz M., Dynamic Gesture Recognition using Neural Networks; A Fundament for Advanced Interaction Construction, SPIE Conf. Electronic Imaging Science & Technology, 1994.
[4] Bolt R.A., Put-That-There: voice and gesture at the graphics interface, Computer Graphics, Vol. 14, No. 3, 262-270, 1980.
[5] Cipolla R., Hadfield P.A., Hollinghurst N.J., Uncalibrated Stereo Vision with Pointing for a Man-Machine Interface, Proc. of the IAPR Workshop on Machine Vision Applications, 163-166, 1994.
[6] Cohen P.R., et al., QuickSet: Multimodal interactions for distributed applications, Proc. of the 5th Int'l Multimedia Conference, 31-40, 1997.
[7] Corradini A., Gross H.-M., Camera-based Gesture Recognition for Robot Control, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, Vol. IV, 133-138, 2000.
[12] Hauptmann A.G., McAvinney P., Gesture with speech for graphics manipulation, International Journal of Man-Machine Studies, Vol. 38, 231-249, February 1993.
[13] Jojic N., et al., Detection and Estimation of Pointing Gestures in Dense Disparity Maps, Proceedings of International Conference on Automatic Face and Gesture Recognition, 468-474, 2000.
[14] Jordan M., Serial Order: A Parallel Distributed Processing Approach, Advances in Connectionist Theory, Lawrence Erlbaum, 1989.
[15] Kendon A., The Biological Foundations of Gestures: Motor and Semiotic Aspects, Lawrence Erlbaum Associates, 1986.
[16] Kumar S., Cohen P.R., Levesque H.J., The Adaptive Agent Architecture: Achieving Fault-Tolerance Using Persistent Broker Teams, Proc. 4th Int'l Conf. on Multi-Agent Systems, 159-166, 2000.
[17] Lippmann R.P., Review of Neural Networks for Speech Recognition, Neural Computation, 1:1-38, 1989.
[18] McNeill D., Hand and Mind: what gestures reveal about thought, The University of Chicago Press, 1992.
[19] Oviatt S.L., Multimodal interfaces for dynamic interactive maps, Proceedings of Conference on Human Factors in Computing Systems: CHI, 95-102, 1996.
[20] Oviatt S.L., Cohen P.R., Multimodal interfaces that process what comes naturally, Communications of the ACM, 43(3):45-53, 2000.
[21] Oviatt S.L., et al., Designing the user interface for multimodal speech and gesture applications: State-of-the-art systems and research directions, Human Computer Interaction, 15(4):263-322, 2000.
[22] Oviatt S., De Angeli A., Kuhn K., Integration and Synchronization of Input Modes during Multimodal HCI, Proceedings of CHI '97, 415-422, 1997.
[23] Place U.T., The Role of the Hand in the Evolution of Language, Psycoloquy, Vol. 11, No. 7, 2000, http://www.cogsci.soton.ac.uk
[24] Quek F., Mysliwiec T.A., Zhao M., FingerMouse: A Freehand Computer Pointing Interface, Proc. of Int'l Conf. on Automatic Face and Gesture Recognition, 372-377, 1995.
[25] Quek F., et al., Gesture and Speech Multimodal Conversational Interaction, Tech. Rep. VISLab-01-01, University of Illinois, 2001.
[26] Sharma R., et al., Speech/Gesture Interface to Visual-Computing Environment, IEEE Computer Graphics and Applications, 20(2):29-37, 2000.
[27] Tank D.W., Hopfield J.J., Concentrating Information in Time: Analog Neural Networks with Applications to Speech Recognition, Proc. of the 1st Int'l Conf. on Neural Nets, Vol. IV, 455-468, 1987.
[28] Vo M.T., Waibel A., A multimodal human-computer interface: combination of gesture and speech recognition, InterCHI, 1993.
[29] Waibel A., et al., Phoneme Recognition Using Time-Delay Neural Networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(12):1888-1898, 1989.
[30] Wu L., Oviatt S., Cohen P.R., Multimodal Integration - A Statistical View, IEEE Transactions on Multimedia, 1(4):334-341, 2000.