
Proceedings of the 2004 IEEE International Conference on Robotics & Automation
New Orleans, LA, April 2004

Face tracking and hand gesture recognition for human-robot interaction


L. Brèthes*, P. Menezes*†, F. Lerasle* and J. Hayet*
*LAAS-CNRS, 7 av. du Colonel Roche - 31077 Toulouse Cedex 4 - France
†ISR/DEEC - Univ. Coimbra, Polo II - 3040 290 Coimbra - Portugal
{lbrethes, pmenezes, lerasle, jbhayet}@laas.fr
Abstract - The interaction between man and machines has become an important topic for the robotics community, as it can generalise the use of robots. For an active H/R interaction scheme, the robot needs to detect human faces in its vicinity and then interpret canonical gestures of the tracked person, assuming this interlocutor has beforehand been identified. In this context, we depict functions suitable to detect and recognise faces in a video stream, and then focus on face and hand tracking functions. An efficient colour segmentation based on a watershed applied to the skin-like coloured pixels is proposed. A new measurement model is proposed to take into account both shape and colour cues in the particle filter used to track face or hand silhouettes in the video stream. An extension of the basic Condensation algorithm is proposed to achieve recognition of the current hand posture and automatic switching between multiple templates in the tracking loop. Results of tracking and recognition are illustrated in the paper and show the robustness of the process in cluttered environments and under various lighting conditions. The limits of the method and future works are also discussed.
I. INTRODUCTION AND FRAMEWORK

Man-machine interaction has become an important topic in the robotics community. In this context, advanced robots must integrate capabilities to detect humans' presence in their vicinity and to interpret their motion. This permits the robot to anticipate and take countermeasures against any possible collision or passage blockage. Here, persons are just considered as "passers-by" and no direct interaction is intended. For an active interaction, the robot must also be able to interpret gestures performed by the tracked person. This person should have been beforehand identified among all the possible robot tutors, and only then granted the right to interact with the robot. This requires a face recognition step, which in this work is performed based on indexation in a database of frontal upright face images to identify the tracked person. When the person is recognised, his/her hand location and posture are used as symbols to communicate with the robot. The visual functions of our system are depicted in figure 1 and detailed in the paper. The face tracking loop can be used both for passive and active H/R interaction purposes. In the latter case, the face recognition module takes as input the enclosing region of the tracked face until the recognition is achieved. In fact, out-of-plane rotated faces (even if limited) are known to be difficult to handle and cause the recognition to fail, as the perceived faces are distorted in comparison to frontal face situations.

Fig. 1. Dedicated functions of our active H/R interaction scheme

Human limbs (especially the hands) are highly deformable articulated objects with many degrees of freedom and can, through different postures and motions, be used to express information. General tracking and accurate 3D pose estimation require elaborate 3D articulated rigid [14] or even deformable ([10], [6]) models, with difficult initialisation and time-consuming updating procedures. The aim here is to track a number of well-defined hand postures that represent a limited set of commands that the users can give to the robot. To use a simple view-based shape representation, face and hand are therefore represented by coarse 2D rigid models, e.g. their silhouette contours, which are modelled by splines [8]. These models, although simplistic, permit a reduction of the complexity of the involved computations and remain discriminatory enough to track a set of known hand postures in complex scenes, as will be shown later. Examples of these models are presented in figure 2. The purpose of this paper is to show how a real-time system for face/hand tracking and hand posture recognition can be constructed by combining shape and colour cues. This is done by using colour segmentation and particle filtering, which enables face detection and simultaneous tracking and posture recognition of the hand.


Fig. 2. Templates for face or hand (its four configurations)


One reason for choosing a particle filter as the tracking engine comes from its capability to work with the non-Gaussian noise models required to represent cluttered environments. The colour segmentation process is described in section II. Considerations about face detection and frontal upright face recognition are described in section III. Section IV depicts the well-known particle filtering formalism and shows how to combine shape and colour cues in a new measurement model dedicated to this tracker. Applications of face/hand tracking and hand posture recognition in an active H/R interaction context are also presented and illustrated. Finally, section V summarises this approach and opens the discussion of future works.
II. COLOUR SEGMENTATION

Fig. 3. An example: (a) original image, (b) map of the probability of skin colour at every pixel, (c) first segmentation


A. Algorithm overview

The goal is to achieve an unsupervised segmentation of regions corresponding to the skin in the scene. This is performed in a two-stage process where chrominance and luminance information are used sequentially. Several colour spaces have been proposed for skin detection ([9], [13]), while the I1I2I3 space is frequently chosen because of its good performance in class separability [13]. The requirements comprise a method well adapted to changing environmental illumination and complex backgrounds. The algorithm that was developed to fulfil these requirements consists in the following four phases:
1) Skin-like pixel classification based on a histogram over the (I2, I3) space.
2) Chrominance based segmentation on the labelled pixels.
3) Luminance based segmentation on the resulting regions of step #2. This is useful to segment regions with similar colours but different luminance values (like hand and sleeve in example 6.(b)).
4) Small region removal.
For completeness, we review hereafter the first three phases.

B. Skin pixel classification

A training phase is performed where skin regions are selected manually, separating them interactively from the background, and two-dimensional histograms over the (I2, I3) components are accumulated for the skin regions, H_skin, and for the background, H_bg. These histograms are then summed up and normalised. The probability p(skin|C) of any image pixel with colour C being skin colour is given by the Bayes rule, which reduces here to the simple ratio of the two histograms [16]:

$$p(\mathrm{skin}\,|\,C) = \frac{H_{\mathrm{skin}}(C)}{H_{\mathrm{skin}}(C) + H_{\mathrm{bg}}(C)} \quad (1)$$

Figure 3 shows an example of skin pixel labelling.
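To make this rule concrete, here is a minimal NumPy sketch of the histogram-ratio classifier. The (I2, I3) conversion follows Ohta's definitions; the histogram resolution, the value ranges and the assumption that H_skin and H_bg were accumulated during the supervised training phase are our choices, not the paper's.

```python
import numpy as np

def ohta_i2i3(rgb):
    """Convert an RGB image (H x W x 3, float) to the (I2, I3)
    chrominance components of Ohta's I1I2I3 space [13]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    i2 = (r - b) / 2.0
    i3 = (2.0 * g - r - b) / 4.0
    return i2, i3

def skin_probability(rgb, h_skin, h_bg, bins, rng):
    """p(skin | C) as the ratio of the skin histogram to the total
    (skin + background) histogram over (I2, I3), following [16]."""
    i2, i3 = ohta_i2i3(rgb)
    # Quantise chrominance values into histogram bin indices.
    u = np.clip(np.digitize(i2, np.linspace(*rng[0], bins + 1)) - 1, 0, bins - 1)
    v = np.clip(np.digitize(i3, np.linspace(*rng[1], bins + 1)) - 1, 0, bins - 1)
    total = h_skin + h_bg
    ratio = np.where(total > 0, h_skin / np.maximum(total, 1e-9), 0.0)
    return ratio[u, v]   # per-pixel skin probability map
```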

C. Chrominance based segmentation

Our algorithm starts by clustering the (I2, I3) colour space, applying a watershed based technique to the histogram of the previously skin-labelled pixels. This chrominance histogram is beforehand smoothed using a morphological dilatation (figure 4.(b)). The 4 x 4 structuring element depends on the standard deviations of the I2 and I3 components, typically sigma_I2 = 5.3 and sigma_I3 = 2.5 for our image database.

Fig. 4. Histogram of figure 3: (a) histogram over the (I2, I3) space, (b) histogram after dilatation, (c) histogram clustering

The watershed algorithm [2] is based on morphology and is not limited to the statistical information of the histogram. Moreover, it is computationally efficient, since the skin-like pixels fall into a very small region of the colour space. The number of final clusters equals the number of markers used to initialise the watershed. Choosing the local maxima of the colour histogram normally yields a large number of markers. Like in [1], a more restrictive criterion is used to retain only significant local maxima:

$$\mathrm{NormalisedContrast} = \frac{\mathrm{Contrast}}{\mathrm{Height}} \quad (2)$$


Normalising the contrast makes the criterion independent of the number of pixels. Figure 5 shows an example where the contrast and height of each maximum are plotted. Maxima whose normalised contrast is higher than a given threshold (10%) are selected as markers (shaded circles in the figure). These markers are then used to create clusters on the histogram using the watershed algorithm (figure 4.(c)). Finally, these clusters are used to segment the image.
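The following sketch illustrates this marker selection and histogram clustering, assuming SciPy and scikit-image are available. The local-neighbourhood approximation of a peak's contrast is our assumption; the paper does not spell out its exact morphological definition.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def histogram_markers(hist, contrast_ratio=0.10):
    """Select significant local maxima of the smoothed (I2, I3) histogram
    as watershed markers, keeping peaks whose normalised contrast
    (contrast / height) exceeds the 10% threshold of the text."""
    # Local maxima of the histogram surface.
    peaks = (hist == ndi.maximum_filter(hist, size=3)) & (hist > 0)
    markers = np.zeros_like(hist, dtype=np.int32)
    label = 0
    for (u, v) in zip(*np.nonzero(peaks)):
        height = hist[u, v]
        # Contrast approximated as the drop to the local neighbourhood
        # minimum; the paper's exact contrast definition may differ.
        neigh = hist[max(u - 2, 0):u + 3, max(v - 2, 0):v + 3]
        contrast = height - neigh.min()
        if contrast / height > contrast_ratio:
            label += 1
            markers[u, v] = label
    return markers

def cluster_histogram(hist):
    """Flood the inverted histogram from the selected markers so that each
    bin of the chrominance space is assigned to one colour cluster."""
    return watershed(-hist, histogram_markers(hist))
```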


Fig. 5. Height and contrast for histogram peaks

D. Luminance based segmentation


The resulting partition is re-segmented using the luminance information. The procedure is similar to the one used for the chrominance histogram. The luminance histogram is obtained for each region resulting from the previous chrominance segmentation. Finally, a watershed refinement is performed based on the histogram of the I1 component.

E. Some examples of colour segmentation

Figure 6 shows the segmentation results for three images. We note that the skin regions (hand and/or face) are correctly segmented even against close-to-skin coloured backgrounds. In example 6-(b), the luminance segmentation allows the separation of the hand from the sleeve. In example 6-(c), small skin-like coloured regions are removed during step #4 of the algorithm. The total cost of the colour segmentation process is about 50 ms on a PIII 1.7 GHz laptop for an image size of 320 x 240.

Fig. 6. Three examples of colour segmentation



III. FACE DETECTION AND TUTOR RECOGNITION

A. Face detection


Hjelmås [5] classifies the different techniques used in face detection as feature-based and image-based. Face detection by explicit modelling of facial features is often troubled by the unpredictability of face appearance and/or environmental conditions. Image-based approaches generally treat face detection as a pattern recognition issue via a training procedure which classifies examples into face and non-face classes, like [11]. The method used for face detection is based on a boosted cascade of Haar-like features, as shown in figure 7(a). Each feature value is the sum of the pixels lying inside the white rectangles subtracted from the sum of the pixels in the dark rectangles. These features are able to detect the relative darkness between the eyes and the nose/cheek or the nose bridge, as illustrated in figure 7(b). They are scaled independently in the vertical and horizontal directions in order to generate an over-complete set of features. A cascade of classifiers is a degenerated decision tree where, at each stage, a classifier is trained to detect almost all frontal faces while rejecting a certain fraction of non-face patterns. In this way, background regions are quickly discarded while promising face-like regions receive more processing. The discrete AdaBoost algorithm [3] is designed to learn, at each stage, the feature set which best separates the positive and negative training examples. Finally, skin pixel classification (section II-B) is applied to each remaining region. A percentage threshold on the skin-labelled pixels allows the removal of spurious non-face regions. Figure 8 shows some examples where the subwindows that encompass the detected faces are marked. In this implementation, faces are detected at a rate of 15 frames per second and the error rate (miss + false alarm) is about 24%, as shown in table I. The image dataset includes faces under a very wide range of conditions, including illumination, scale, pose and camera variations. Detection errors are essentially due to face patterns which are below a limit size (24 x 24 pixels) or have low contrast in the images.

Fig. 7. (a) Haar-like features, (b) overlay on a training face (from [11])
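A boosted cascade of this kind is available off the shelf in OpenCV; the sketch below combines it with the skin-ratio filter of section II-B. The cascade file, the 0.5 skin threshold and the 30% minimum skin fraction are assumptions; the paper only states that a percentage threshold is used, without giving its value.

```python
import cv2
import numpy as np

# Stock frontal-face cascade shipped with OpenCV; the 24 x 24 minimum
# size matches the limit mentioned in the text.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(bgr, skin_prob_map, min_skin_fraction=0.3):
    """Run the boosted cascade, then discard detections whose fraction of
    skin-labelled pixels is too low (threshold values are assumptions)."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                     minNeighbors=3, minSize=(24, 24))
    kept = []
    for (x, y, w, h) in faces:
        roi = skin_prob_map[y:y + h, x:x + w]
        if (roi > 0.5).mean() >= min_skin_fraction:
            kept.append((x, y, w, h))
    return kept
```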

Fig. 8. Two examples of face detection

TABLE I
FACE DETECTION RATES IN VIDEO STREAM

Real face detection rate (470 frames): 76%
Miss rate: 23.1%
False alarm rate: 0.9%

The extracted subwindows are then used to assign an initial estimate to the face tracker for passive interaction purposes, or are first proposed to the face recognition process for active interaction purposes.

B. Face recognition

Once face detection is achieved, a simple recognition scheme is applied to be able to identify the tutor. The strategy we use is a classical two-fold one, consisting in (1) learning the appearance of the tutor face so that we have a concise


representation of it, and (2) classifying newly detected faces according to the different classes we have learnt. This implies a first stage of normalisation so that the extracted sub-windows W_i have the same square size, d x d. We call these normalised windows. The key point here is the choice of a Harris interest points based representation, as in [4]. Such a modelling allows both data compacity and robustness to common image transforms.

1) Learning: Learning the tutor appearance is done in a completely supervised way: a set of n_t sub-windows {W_i}, 1 <= i <= n_t, is assigned to the tutor class. Modelling the appearance of the face is done through a PCA approach, consisting in building the W matrix:

$$\mathbf{W} = \begin{pmatrix} \bar{W}_1(1) & \bar{W}_1(2) & \cdots & \bar{W}_1(d) \\ \vdots & & & \vdots \\ \bar{W}_{n_t}(1) & \bar{W}_{n_t}(2) & \cdots & \bar{W}_{n_t}(d) \end{pmatrix}$$

where the $\bar{W}_i$ are the centered W_i. The average of the W_i is noted $\bar{W}$. The two first eigenvectors of $\mathbf{W}^T\mathbf{W}$, together with $\bar{W}$, allow us to sum up the information contained in the whole database. Our strategy consists in using a small number of key images to model the tutor appearance. These images are the images from the database that are closest to, respectively, the average image $\bar{W}$ and to the directions from $\bar{W}$ along the two first eigenvectors. For a given tutor, we call them R_i for i = 1, 2, 3 (figure 9).

Fig. 9. Some examples among the 100 subwindows for a given tutor and the associated three eigenimages

Then, the tutor appearance model can be taken as $\{\Pi(R_i)\}_{1 \le i \le 3}$, where $\Pi(A)$ is the set of interest points extracted from A, together with the local differential invariant vectors at each point [15].

2) Recognition: Once the models have been synthesised, comparing a newly detected sub-window I to the tutor database is done by (1) computing the reduced sub-window representation seen before and (2) computing a similarity measure to the different classes. The first stage is identical to the learning phase: a set $\Pi(I)$ of interest points is extracted from the normalised sub-window. To compare it with those of the tutor database, noted $R_i^h$, where h indexes the tutors and i in [1;3], the following distance is used:

$$d(I, R_i^h) = \max\left(h_K^{d_v}(\Pi(I), \Pi(R_i^h)),\; h_K^{d_v}(\Pi(R_i^h), \Pi(I))\right)$$

where

$$h_K^{d_v}(A, B) = \underset{a \in A}{K^{th}}\ \min_{b \in B}\ \delta(a, b), \qquad \delta(a, b) = d_v(a, b)\,\|a - b\|$$

Here, $h_K^{d_v}$ is a modified Hausdorff distance designed to take into account both the spatial vicinity of points, through the Euclidean norm, and their local similarity, through $d_v$, the Mahalanobis distance between local differential vectors. More about this metric can be found in [4]. The distance to class h is finally defined as the minimum of $d(I, R_i^h)$ over i. It is premature to present statistics of recognition rates on a large tutor database because some experiments are still being performed. Nevertheless, initial experiments on a small database (three tutors, each represented by 100 subwindows) show that the recognition rate is about 90%.
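A direct transcription of this distance could look as follows. The interest points and their local differential-invariant vectors are assumed to be given (e.g. by a Harris detector as in [4], [15]), and the inverse covariance used by the Mahalanobis term is a precomputed input; none of these names come from the paper.

```python
import numpy as np

def delta(a, b, fa, fb, cov_inv):
    """delta(a, b) = d_v(a, b) * ||a - b||: Mahalanobis distance between
    the local differential-invariant vectors, weighted by spatial distance."""
    diff = fa - fb
    d_v = np.sqrt(diff @ cov_inv @ diff)
    return d_v * np.linalg.norm(a - b)

def partial_hausdorff(A, B, FA, FB, cov_inv, K):
    """h_K(A, B): K-th ranked value, over points a of A, of min_b delta(a, b)."""
    mins = [min(delta(a, b, fa, fb, cov_inv) for b, fb in zip(B, FB))
            for a, fa in zip(A, FA)]
    return np.sort(mins)[K - 1]

def face_distance(I_pts, I_feat, R_pts, R_feat, cov_inv, K):
    """Symmetric distance d(I, R): max of the two directed h_K values."""
    return max(partial_hausdorff(I_pts, I_feat, R_pts, R_feat, cov_inv, K),
               partial_hausdorff(R_pts, R_feat, I_pts, I_feat, cov_inv, K))
```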

IV. TRACKING USING PARTICLE FILTERS

A. Condensation formalism


The Condensation algorithm is described in [8]. The filter's output at a given step t is an approximation of an entire probability distribution of likely target positions (namely over the state vector x), represented as a weighted sample set {s_t^(i), i = 1, ..., N} with weights pi_t^(i). The iterative process applied to the sample sets is shown in figure 10. The state vector is x = [x, y, theta, s]^T for, respectively, the position, orientation and scale of the target in the image.
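One Condensation iteration over this state vector can be sketched as below, a minimal version assuming NumPy. The likelihood callback stands for the measurement model of section IV-B, and the first-order auto-regressive dynamics matches the prediction step described next.

```python
import numpy as np

def condensation_step(samples, weights, likelihood, A, B, rng):
    """One iteration of Condensation [8]: resample according to the
    weights, predict with the auto-regressive dynamics
    x_t = A x_{t-1} + B w_t, then re-weight with the measurement model."""
    n = len(samples)
    # 1) Select: resample particle indices proportionally to their weights.
    idx = rng.choice(n, size=n, p=weights / weights.sum())
    resampled = samples[idx]
    # 2) Predict: propagate each particle through the dynamics plus noise.
    noise = rng.standard_normal(samples.shape)
    predicted = resampled @ A.T + noise @ B.T
    # 3) Measure: evaluate p(z_t | x_t^(i)) for each particle.
    new_weights = np.array([likelihood(x) for x in predicted])
    return predicted, new_weights / new_weights.sum()

# Example: rng = np.random.default_rng(); samples has shape (N, 4).
```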


Fig. 10. One time step of the CONDENSATION algorithm. Blob centres represent sample values and sizes depict sample weights (from [7])

During the prediction step, the target dynamics are depicted by an auto-regressive model:

$$x_t = A\,x_{t-1} + B\,w_t$$


where the index t is relative to the t-th image of the sequence and the w_t term represents the process noise. The observation step consists in re-weighting the particles by evaluating the measurement model p(z_t | x_t^(i)).

B. Measurement model using colour segmentation


This model p(z_t | x_t^(i)) links the predicted system state with the measured output. Classically [8], the likelihood function for re-weighting each sample depends on the sum of the squared distances between template points and the corresponding closest edge points in the image. To reduce the involved computations, we adopt a new measurement model based on an image colour gradient and a Distance Transform (DT) image. The colour gradient can be estimated in various ways: either on each channel separately



and then combining their outcomes, or as a vector, to make full use of the colour information. We follow this last, principled way, computing the 2 x 2 matrix G from the per-channel Jacobian J of the image:

$$J = \begin{pmatrix} \partial I_1/\partial x & \partial I_1/\partial y \\ \partial I_2/\partial x & \partial I_2/\partial y \\ \partial I_3/\partial x & \partial I_3/\partial y \end{pmatrix}, \qquad G = J^T J$$

According to [17], the gradient direction is deduced from the eigenvector associated with the larger eigenvalue lambda of the matrix G, while the corresponding amplitude is given by lambda. Figure 11.(a) shows an example of a colour gradient image.
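A compact reconstruction of this multi-channel gradient is given below. The Sobel kernels used to estimate the per-channel derivatives are an implementation choice on our part, not something the paper specifies.

```python
import cv2
import numpy as np

def colour_gradient(bgr):
    """Multi-channel gradient following Di Zenzo [17]: the direction is the
    dominant eigenvector of G = J^T J and the amplitude its eigenvalue."""
    img = bgr.astype(np.float32)
    gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)  # per-channel d/dx
    gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)  # per-channel d/dy
    # Entries of the 2 x 2 tensor G at every pixel.
    gxx = (gx * gx).sum(axis=2)
    gyy = (gy * gy).sum(axis=2)
    gxy = (gx * gy).sum(axis=2)
    # Closed-form largest eigenvalue of [[gxx, gxy], [gxy, gyy]].
    tr = gxx + gyy
    root = np.sqrt((gxx - gyy) ** 2 + 4 * gxy ** 2)
    lam = 0.5 * (tr + root)                        # gradient amplitude
    theta = 0.5 * np.arctan2(2 * gxy, gxx - gyy)   # gradient direction
    return lam, theta
```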

Fig. 11. Example of figure 3: (a) colour gradient [17], (b) DT image, (c) DT image after masking

The distance transformation [6] converts a binary image consisting of edge and non-edge pixels into an image where all non-edge pixels have a value corresponding to the distance to the nearest edge pixel (figure 11.(b)). The advantage of matching a template with the DT image rather than with the edge image is that the resulting similarity measure is a smoother function of the template pose parameters. This allows more variability between a given template and an object of interest in the image. The DT image is here computed from the previous colour gradient image. Colour gradients which are not included in the ROIs issued from the colour segmentation process are given a penalty in the DT image (figure 11.(c)). This makes the model p(z_t | x_t^(i)) relevant even if the skin coloured regions are only partially extracted, or not detected at all. The model p(z_t | x_t^(i)) associated with sample i is given by equation (3):

$$p(z_t \mid x_t^{(i)}) \propto \exp\left(-\frac{1}{2\sigma^2 M}\sum_{j=1}^{M} d^{(i)}(j)^2\right) \quad (3)$$

The index j refers to the M template points uniformly distributed along the spline, while d^(i)(j) refers to the pixel value of the DT image which lies under the "on" point j of the template. The lower this distance is, the better the match between the image and the template at this location.
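The following sketch evaluates such a DT-based likelihood for one particle. The masking penalty of figure 11(c) is omitted, and the sigma value is an assumption, since the paper does not give the normalisation constants of equation (3).

```python
import cv2
import numpy as np

def dt_likelihood(edge_map, template_pts, sigma=10.0):
    """Measurement model in the spirit of eq. (3): mean squared DT value
    under the M template points, exponentiated (sigma is an assumption)."""
    # distanceTransform gives, at every non-zero pixel, the distance to the
    # nearest zero pixel; zeros are placed at the edge locations.
    dt = cv2.distanceTransform((edge_map == 0).astype(np.uint8),
                               cv2.DIST_L2, 3)
    h, w = dt.shape
    xs = np.clip(template_pts[:, 0].astype(int), 0, w - 1)
    ys = np.clip(template_pts[:, 1].astype(int), 0, h - 1)
    d = dt[ys, xs]
    return np.exp(-np.mean(d ** 2) / (2.0 * sigma ** 2))
```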

C. Face tracker implementation and results

The face tracker has been implemented with the OpenCV libraries on a PIII 1.7 GHz laptop running Linux. Although no special care was taken in terms of code optimisation, the face tracker runs at about 10 Hz. The method performs quite well in the presence of some background clutter, as can be seen in figure 12. It should be noted that the inclusion of colour segmentation solves some limitations of the simple contour-based approach [8]; there is an increase in performance, notably for cluttered backgrounds. It should be recalled that in preliminary works [12] we introduced a criterion that includes optical flow information to filter outliers due to the background. However, this did not solve the case where the tracked target is a moving person whose shirt is highly textured.

Fig. 12. Face tracking in video stream: images 1, 60, 90, 150, 180, 210

D. Hand tracking and posture recognition

1) Hand detection for tracker initialisation: Skin coloured areas are segmented in order to form potential hand areas (section II). Most of the false alarms correspond to face skin regions (figure 6.(a)). In order to discriminate between faces and hands, we propose two heuristics to make the differentiation in the tracking loop, as sketched below. First, the orientation is deduced from the central moments mu_pq of each area:

$$\theta = \frac{1}{2}\tan^{-1}\left(\frac{2\mu_{11}}{\mu_{20}-\mu_{02}}\right)$$

The head can only tilt in the range [-40 deg, 40 deg] from the vertical. Secondly, the height to width ratio of human faces falls within the range of the golden ratio plus or minus a tolerance [5].
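These two heuristics translate directly into a few lines of OpenCV; the tolerance on the golden ratio is an assumption, as the paper leaves its value unspecified.

```python
import cv2
import numpy as np

def region_orientation_deg(mask):
    """Orientation from central moments:
    theta = 0.5 * atan(2 * mu11 / (mu20 - mu02))."""
    m = cv2.moments(mask.astype(np.uint8), binaryImage=True)
    theta = 0.5 * np.arctan2(2.0 * m["mu11"], m["mu20"] - m["mu02"])
    return np.degrees(theta)

def looks_like_face(mask, w, h, golden=1.618, tol=0.25):
    """Face heuristics from the text: near-vertical orientation (within
    [-40, 40] degrees) and height/width close to the golden ratio
    (the tolerance value is an assumption)."""
    return (abs(region_orientation_deg(mask)) <= 40.0
            and abs(h / float(w) - golden) <= tol)
```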

Using these heuristics, it is possible to remove the potential face skin region(s). Some improvements based on Haar-like features have recently been introduced to make our hand detection process more selective. Finally, the remaining regions are used to assign initial estimates to the hand tracker.

2) Hand posture recognition in the tracking loop: For tracking and recognition of hands, the state vector becomes x = [x, y, theta, s, l]^T, where l in {1, ..., 4} is a discrete variable labelling the posture. We consider the extended state X = (x, l), where x is the vector of continuous parameters specifying the target pose (section IV-A). With l_{t-1} = j and l_t = i, the process density can be written explicitly as [7]:

$$p(X_t \mid X_{t-1}) = T_{ij}(x_t, x_{t-1})\; p(x_t \mid x_{t-1})$$

where p(x_t | x_{t-1}) is a density specifying the continuous dynamics of a particle, and T_ij(x_t, x_{t-1}) is a transition matrix which is independent of x_{t-1}. This matrix represents the probabilities of switching from a given posture to another one, according to a given language for example. This Bayesian mixed-state framework is called the mixed-state Condensation tracker, and the algorithm was first proposed by Isard et al. in [7] to deal with automatic switching between multiple motion models. Finally, the current posture l_t is deduced from a MAP estimator based on the sum of the weights of all particles s_t^(i) with the same discrete variable l at frame t:

$$\hat{l}_t = \arg\max_{l}\ \sum_{i\,:\,l_t^{(i)} = l} \pi_t^{(i)}$$
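One mixed-state prediction step can be sketched as follows; the transition matrix T is assumed row-stochastic (row j giving the switching probabilities from posture j), and the continuous dynamics reuses the auto-regressive model of section IV-A.

```python
import numpy as np

def sample_mixed_state(x_prev, l_prev, T, A, B, rng):
    """One mixed-state prediction [7]: draw the new posture label l_t from
    row l_prev of the transition matrix T, then apply the continuous
    auto-regressive dynamics to the pose parameters."""
    l_new = rng.choice(T.shape[1], p=T[l_prev])            # discrete switch
    x_new = A @ x_prev + B @ rng.standard_normal(x_prev.shape)
    return x_new, l_new
```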



Then, the estimate of the pose parameters is found from the weighted mean of that discrete sample set:

$$\hat{x}_t = \frac{\sum_{i\,:\,l_t^{(i)} = \hat{l}_t} \pi_t^{(i)}\, x_t^{(i)}}{\sum_{i\,:\,l_t^{(i)} = \hat{l}_t} \pi_t^{(i)}}$$
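Both estimators reduce to a few lines over the weighted particle set; this sketch assumes the weights have already been normalised by the Condensation step.

```python
import numpy as np

def map_posture_and_pose(xs, ls, weights):
    """MAP posture: the label whose particles carry the most total weight;
    the pose is the weighted mean of the particles of that label."""
    labels = np.unique(ls)
    totals = np.array([weights[ls == l].sum() for l in labels])
    l_map = labels[np.argmax(totals)]
    sel = ls == l_map
    pose = (weights[sel, None] * xs[sel]).sum(axis=0) / weights[sel].sum()
    return l_map, pose
```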

3) Hand tracker implementation and results: Figure 13 shows a few snapshots from a sequence of 400 frames, where the hand moves in front of a cluttered background while its posture changes.


Fig. 13. Hand fitting in video stream: images 1, 30, 60, 90, 150, 210, 250, 350

Table II shows the results of a quantitative comparison with and without colour segmentation against a heavily cluttered, office-like background. With colour segmentation, the system performs better, rarely misclassifying the postures.

TABLE II
RECOGNITION RATE FOR CLUTTERED BACKGROUND

              without colour   with colour
config. #1         45%            99.5%
config. #2         41%            91%
config. #3         44%            98%
config. #4         40%            89%
In table III, the results of automatic hand tracking and posture recognition are compared with a determined ground truth. While the hand pose is correctly determined in most frames without using colour information, the posture is often misclassified. After adding colour segmentation, a substantial improvement was seen in the estimation of both pose and posture.

TABLE III
PERFORMANCE OF THE HAND TRACKER WITH AND WITHOUT COLOUR SEGMENTATION

                                without colour   with colour
correct position                     86%            99.5%
correct position and posture         50%            95%

V. CONCLUSION AND FUTURE WORKS

It is very difficult to perform tracking using only shape information. Introducing colour segmentation makes the tracker more robust in cluttered environments and under various lighting conditions. The originality of this segmentation method consists in applying sequentially a watershed algorithm based first on the chromaticity and then on the intensity of the skin coloured pixels. Using a measurement model based on a DT image, face tracking is achieved using a particle filter, while hand states are simultaneously recognised and tracked using a particle filter. Other visual functions suitable to detect and recognise faces are also depicted. For a richer interaction, a direct extension of our tracker will be to consider multiple canonical motion models as classifiers for gesture recognition. Isard et al. [7] investigate such a tracker, but their purpose is to follow the drawing action of a hand holding a pen, switching state according to the hand's motion. Furthermore, we want to adapt our tracker to be able to track multiple users simultaneously. Applying several independent single-face trackers is not an adequate solution because the trackers can coalesce when some targets pass close to others. This multiple target tracking also applies to two-handed gestures, which are of great interest.

REFERENCES

[1] A. Albiol, L. Torres, and E. Delp. An Unsupervised Color Image Segmentation Algorithm for Face Detection Applications. In Int. Conf. on Image Processing (ICIP'01), pages 7-10, October 2001.
[2] S. Beucher and F. Meyer. Mathematical Morphology in Image Processing, chapter 12. Marcel Dekker Inc., 1993.
[3] Y. Freund and R.E. Schapire. Experiments with a new Boosting Algorithm. In Int. Conf. on Machine Learning (ICML'96), pages 148-156, San Francisco, 1996.
[4] J.B. Hayet, F. Lerasle, and M. Devy. Visual Landmarks Detection and Recognition for Mobile Robot Navigation. In Int. Conf. on Computer Vision and Pattern Recognition (CVPR'03), pages 313-318, Madison, 2003.
[5] E. Hjelmås. Face Detection: a Survey. Int. Journal of Computer Vision and Image Understanding (CVIU'01), 83(3):236-274, 2001.
[6] D.P. Huttenlocher, J.J. Noh, and W.J. Rucklidge. Tracking Non-rigid Objects in Complex Scenes. In Int. Conf. on Computer Vision (ICCV'93), volume 1, pages 93-101, Berlin, May 1993.
[7] M.A. Isard and A. Blake. A Mixed-state Condensation Tracker with Automatic Model-switching. In Int. Conf. on Computer Vision (ICCV'98), pages 107-112, Bombay, 1998.
[8] M.A. Isard and A. Blake. Visual Tracking by Stochastic Propagation of Conditional Density. In European Conf. on Computer Vision (ECCV'96), pages 343-356, Cambridge, April 1996.
[9] M.J. Jones and J.M. Rehg. Statistical Color Models with Application to Skin Detection. In Int. Conf. on Computer Vision and Pattern Recognition (CVPR'99), pages 274-280, 1999.
[10] I.A. Kakadiaris and D. Metaxas. Model-Based Estimation of 3D Human Motion with Occlusion Based on Active Multi-Viewpoint Selection. In Int. Conf. on Computer Vision and Pattern Recognition (CVPR'96), pages 81-87, San Francisco, June 1996.
[11] R. Lienhart and J. Maydt. An Extended Set of Haar-like Features for Rapid Object Detection. In Int. Conf. on Image Processing (ICIP'01), pages 73-76, Thessaloniki, 2001.
[12] P. Menezes, L. Brèthes, F. Lerasle, P. Danès, and J. Dias. Visual tracking of silhouettes for human-robot interaction. In Int. Conf. on Advanced Robotics (ICAR'03), volume 2, pages 313-320, Coimbra, 2003.
[13] Y. Ohta, T. Kanade, and T. Sakai. Color Information for Region Segmentation. Computer Graphics and Image Processing (CGIP'80), 13(3):222-241, 1980.
[14] K. Rohr. Towards Model-based Recognition of Human Movements in Image Sequences. Computer Vision, Graphics and Image Processing (CVGIP'94), 59(1):94-115, January 1994.
[15] C. Schmid, R. Mohr, and C. Bauckhage. Comparing and Evaluating Interest Points. In Int. Conf. on Computer Vision (ICCV'98), pages 313-320, Bombay, 1998.
[16] K. Schwerdt and J.L. Crowley. Robust Face Tracking using Color. In Int. Conf. on Face and Gesture Recognition (FGR'00), pages 90-95, Grenoble, March 2000.
[17] S. Di Zenzo. A Note on the Gradient of a Multi-Image. Int. Journal of Computer Graphics and Image Processing (CVGIP'86), 33:116-125, 1986.

