
R&D

INNOVATION GALLERY

Multimodal interfaces

March 2005

Multimodal interfaces: movement and voice to control machines

The progress made these past few years in the fields of movement and voice recognition enables us today to think of new ways to control the machines around us. Combining these two dimensions in fact makes it possible to eliminate contact, and thereby to develop dialogue interfaces that are simpler and more natural. France Telecom researchers are therefore working on new tools for tomorrow that could simplify our relationship with machines and offer rich new applications, especially in areas such as co-working and video games.

What is it?
To act on our environment by means of an interface or a tool, we must make a movement: press a switch or a remote-control button, turn a steering wheel, use a keyboard or a mouse. We must, in other words, be in physical contact with an object. France Telecom researchers' work in the field of voice and especially movement recognition will soon make it possible to eliminate touch and to interact with our environment in a simpler way. The idea is therefore to use voice and hands to control a machine. Since we use both from early childhood, there is nothing to be learnt. Besides, voice and hands are tools that we use every day: they are always available, never cumbersome, and we don't need to share them, because we all have our own! The user can for instance use his hands instead of the mouse to control his computer. By adding voice to the movements, he can control a variety of more complex actions. These two modes are therefore complementary. What France Telecom's R&D teams have in mind is to take the best of each in order to simplify man-machine relationships. Just as our vocabulary is rich, our bodies have great plasticity, if one thinks of the vast number of movements a person makes in the course of a day. Although they are progressing, artificial-sight technologies are not yet able to recognise the multitude of movements carried out in different environments and contexts.
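The complementarity of the two modes, hands for designating things and voice for richer commands, can be sketched as a simple fusion rule in the spirit of the classic "put that there" interaction. The sketch below is illustrative Python, not France Telecom's implementation; the event types, field names and the 1.5-second pairing window are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PointingEvent:
    target: str       # object identified under the pointing gesture
    timestamp: float  # seconds

@dataclass
class SpeechEvent:
    verb: str         # recognised command word, e.g. "close"
    timestamp: float

def fuse(pointing: PointingEvent, speech: SpeechEvent,
         max_gap: float = 1.5) -> Optional[str]:
    """Pair a pointing gesture with a spoken verb if they occur close in time."""
    if abs(pointing.timestamp - speech.timestamp) <= max_gap:
        return f"{speech.verb}({pointing.target})"
    return None

# The gesture designates the object; the voice supplies the action:
command = fuse(PointingEvent("window_3", 10.2), SpeechEvent("close", 10.8))
```

A real system would of course have to deal with overlapping events and recognition uncertainty; the time-window pairing shown here is only the simplest possible fusion strategy.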

On the other hand, the pointing gesture is already well recognised by artificial-sight technologies. It is very effective for spatial designation or navigation tasks. This could lead to more natural man-machine interfaces, in which control of the environment is simplified: objects can be pointed at and moved, while voice commands carry out more complex actions.

Co-working
Interacting through voice and movement, from a distance and without touching anything, with large screens and big interaction volumes facilitates group work in collaborative spaces. In fact, besides touch, remote interaction also frees the user from the dimensions of the interface. In front of a large display, several people can share the same view and therefore interact easily around the same application, wherever they are in the room. With an electronic projection screen, life-size representations are possible that create a feeling of presence and immersion. Interaction between people also energises co-working, especially if they all have the same tools. By making use of the possibilities offered by broadband telecommunication networks, these collaborative virtual environments can be shared by several remote teams. In this case, artificial-sight technologies can capture each participant's movements and project them on the screen in the form of virtual clones, or avatars.
Copyright France Télécom - 2005

New game possibilities
Artificial-sight technologies also open up new possibilities in the field of video games. By eliminating the joystick, keyboard and mouse, the player can experience new sensations by becoming more immersed in the virtual game environment. He can for instance use his own body movements to animate a game character. This could even add a choreographic or athletic dimension to the performances usually required.


How does it work?


Artificial-sight technologies can give a computer the ability to see and interpret a scene by means of a camera. This makes it possible to act from a distance and has the advantage of letting the user interact freely. To do so, it is necessary to analyse his movements. Since hands represent the natural way in which we act on our environment, they are studied with particular care. A stereo camera simultaneously follows the spatial positions of the two hands and the face. The head-hand axis can thereby be determined, or more specifically the eye-finger axis, which corresponds to the direction in which the person is looking. By taking the face as the orientation point for the body, the hands are detected and tracked when the user moves them towards the screen. The user can naturally point to objects with one hand and select them with the other. For navigation in a virtual 3D space, the second hand can also control a third axis for zooming in. The user does not need to learn anything new or make specific settings.

Reproducing movements
Seeing that there are hundreds of thousands of movements that we use fluently, it seems difficult to develop a machine capable of identifying them all and reacting appropriately. However, artificial sight does make it possible to reproduce movements without trying to analyse them. This technology can animate a virtual character, a kind of avatar, with the user's own movements, reproduced identically. In video games, this functionality is particularly useful for animating characters.

Powerful statistical tools
Movement processing and analysis rely on computer programs designed for statistical-learning problems, such as neural networks or hidden Markov models. The same statistical tools are also used in voice-recognition systems.

Recognising voice
France Telecom's R&D has for many years been working on voice-recognition technologies that can identify words and phrases.
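Going back to the eye-finger axis determined above: at its core this is a ray-casting problem, a line cast from the eye through the fingertip and intersected with the screen. A minimal geometric sketch, assuming (purely for this example) that the stereo camera returns 3D positions in metres and that the screen lies in the plane z = 0:

```python
def pointed_screen_position(eye, finger):
    """Intersect the eye-finger ray with the screen plane z = 0.

    eye, finger: (x, y, z) positions from the stereo camera, in metres,
    with the screen assumed to lie in the z = 0 plane.
    Returns the (x, y) point designated on the screen, or None.
    """
    dx, dy, dz = (f - e for f, e in zip(finger, eye))
    if dz == 0:
        return None            # ray runs parallel to the screen
    t = -eye[2] / dz           # ray parameter where z reaches 0
    if t <= 0:
        return None            # user is pointing away from the screen
    return (eye[0] + t * dx, eye[1] + t * dy)

# Eye 1.5 m from the screen, fingertip 1.0 m from it:
xy = pointed_screen_position((0.0, 0.0, 1.5), (0.1, -0.05, 1.0))
```

The real difficulty, of course, lies not in this geometry but in robustly locating the face and hands in the camera images in the first place.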
To cope with different users, our researchers are endeavouring to develop robust tools that are independent of the speaker, in other words tools that can accept and understand different voice timbres and accents, even in a noisy environment. The principle behind voice-recognition technologies is to compare the received sound signal to references. These references can be isolated words or word strings. The word models can be built either directly from examples of these words, or by stringing together the smallest sound units, called phonemes. In order to recognise word sequences, one may either detect only the keywords (word spotting), using so-called filler models to account for those parts of the signal that do not correspond to the keywords, or specify all the sequences that the user might say.
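Building word models by stringing phonemes together, and falling back to a filler when nothing matches well, can be illustrated in miniature. In the sketch below, plain edit distance between phoneme strings stands in for the acoustic score that a real HMM-based recogniser would compute; the vocabulary and the distance threshold are invented for the example.

```python
from functools import lru_cache

WORD_MODELS = {                      # word -> string of phonemes
    "call": ("k", "ao", "l"),
    "stop": ("s", "t", "aa", "p"),
}

def edit_distance(a, b):
    """Minimum number of phoneme edits turning sequence a into b."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j
        if j == 0:
            return i
        return min(d(i - 1, j) + 1,                       # deletion
                   d(i, j - 1) + 1,                       # insertion
                   d(i - 1, j - 1) + (a[i - 1] != b[j - 1]))  # substitution
    return d(len(a), len(b))

def recognise(phonemes, max_dist=1):
    """Return the closest word model, or 'filler' if none is close enough."""
    best = min(WORD_MODELS, key=lambda w: edit_distance(phonemes, WORD_MODELS[w]))
    if edit_distance(phonemes, WORD_MODELS[best]) <= max_dist:
        return best
    return "filler"
```

The "filler" outcome plays the role of the filler models mentioned above: anything that does not resemble a keyword closely enough is absorbed rather than forced onto the nearest word.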



The second approach is for the moment the one that France Telecom has chosen for multimodal interfaces. The long-term objective of this research is of course the recognition of continuous speech, which would allow the user to express himself freely. In this case, the machine must detect not only the words but also the sentence structures. The grammar and parts of speech (subject, verb, adjectives, object, etc.) should be taken into account by the machine to fulfil the user's demands as closely as possible.
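Specifying all the sequences the user might say amounts to enumerating a finite command grammar up front. A minimal sketch, with a vocabulary invented for the example:

```python
import itertools

VERBS = ["open", "close", "move"]
OBJECTS = ["the window", "the document"]

# Every sentence the recogniser will accept, generated from the grammar.
ALLOWED = {f"{verb} {obj}" for verb, obj in itertools.product(VERBS, OBJECTS)}

def accept(utterance: str) -> bool:
    """Accept only utterances that the command grammar can generate."""
    return utterance.strip().lower() in ALLOWED
```

This is why the approach scales poorly towards free expression: every verb or object added multiplies the set of sentences, whereas continuous-speech recognition would do away with the enumeration altogether.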

France Telecom researchers' know-how in this field led to the founding, in August 2000, of the start-up Telisma, which develops voice-recognition software for mass-market applications.

When?
Researchers in the field of artificial sight and movement recognition have only just started, and a lot of progress remains to be made. However, applications based on movement reproduction, to animate a video-game character for instance, have almost reached technological maturity. Speech-and-movement interfaces are currently being developed by France Telecom researchers, for instance as part of a multidisciplinary project on multimodality. As far as co-working is concerned, the researchers are taking part in more general programmes on telepresence and immersion. These projects could become reality in only a few years' time. The TELIM (telepresence and immersion) project, launched in January 2005, brings together all work related to collaborative interfaces and interpersonal communications. Its objectives are to study uses, including disruptive uses, in these fields, drawing on theoretical studies and prototypes developed within France Telecom's R&D Division. For instance, the augmented-reality project SPIN-3D is a collaborative platform for visualising virtual objects in three dimensions. In the same context, a project for steering an avatar (or synthetic agent) with the finger and eye is currently under development. Also in the field of artificial sight, a project for surveillance through activity analysis, intended for people who are elderly or suffering from a handicap or memory loss, forms part of a much longer-term endeavour. Movement-recognition technologies in fact still have a long way to go towards fine analysis of movements and making analogies between them. This very forward-looking domain could however give rise to a new tool for remote personal assistance.


