
Computer Vision and Image Understanding 114 (2010) 641-651


Vision and RFID data fusion for tracking people in crowds by a mobile robot
T. Germa a,b,*, F. Lerasle a,b, N. Ouadah a,c, V. Cadenat a,b
a CNRS, LAAS, 7, Avenue du Colonel Roche, F-31077 Toulouse, France
b Université de Toulouse, UPS, INSA, INP, ISAE, LAAS-CNRS, F-31077 Toulouse, France
c CDTA/ENP, Cité 20 août 1956, Baba Hassen, Alger, Algeria
Article info

Article history:
Received 13 January 2009
Accepted 5 January 2010
Available online 22 January 2010
Keywords:
Radio frequency ID
Multimodal data fusion
Particle filtering
Person tracking
Person following
Multi-sensor fusion
Human visual servoing

Abstract
In this paper, we address the problem of realizing a human following task in a crowded environment. We consider an active perception system, consisting of a camera mounted on a pan-tilt unit and a 360° RFID detection system, both embedded on a mobile robot. To perform such a task, it is necessary to efficiently track humans in crowds. In a first step, we have dealt with this problem using the particle filtering framework because it enables the fusion of heterogeneous data, which improves the tracking robustness. In a second step, we have considered the problem of controlling the robot motion to make the robot follow the person of interest. To this aim, we have designed a multi-sensor-based control strategy based on the tracker outputs and on the RFID data. Finally, we have implemented the tracker and the control strategy on our robot. The obtained experimental results highlight the relevance of the developed perceptual functions. Possible extensions of this work are discussed at the end of the article.
© 2010 Elsevier Inc. All rights reserved.

1. Introduction
Giving a mobile robot the ability to automatically follow a person appears to be a key issue to make it efficiently interact with humans. Numerous applications would benefit from such a capability. Service robotics is obviously one of these applications, as it requires interactive robots [16] able to follow a person to provide continual assistance in office buildings, museums, hospital environments, or even in shopping centers. Service robots clearly need to move in ways that are socially suitable for people. Such a robot has to localize its user, to discriminate him/her from other passers-by and to be able to follow him/her across complex human-centered environments. In this context, tracking a given person in crowds from a mobile platform appears to be fundamental. However, numerous difficulties arise: moving cameras with a limited field of view, cluttered background, illumination variations, hard real-time constraints, and so on.
The literature offers many tools to go beyond these difficulties. Our paper focuses on the particle filtering framework as it easily enables fusing heterogeneous data from embedded sensors. Despite their sporadicity, dedicated person detectors and their hardware counterparts are very discriminant when present.

* Corresponding author. Address: CNRS, LAAS, 7, Avenue du Colonel Roche, F-31077 Toulouse, France.
E-mail addresses: tgerma@laas.fr (T. Germa), lerasle@laas.fr (F. Lerasle), nouadah@laas.fr (N. Ouadah), cadenat@laas.fr (V. Cadenat).
1077-3142/$ - see front matter © 2010 Elsevier Inc. All rights reserved.
doi:10.1016/j.cviu.2010.01.008

The paper is organized as follows. Section 2 gives an overview of the related work within our robotic context and introduces our contributions. Section 3 describes our omnidirectional RFID prototype. This sensor detects the user wearing an RFID tag and is very discriminant when a detection occurs. Section 4 recalls some PF basics and details our new importance function for multimodal person tracking. The control strategy developed to achieve a person following task in a crowded environment is detailed in Section 5, while Section 6 presents the mobile robot which has been used for our tests and the obtained results. Finally, Section 7 summarizes our contributions and discusses future extensions.

2. Overview and related work


Particle filters (PF) [5], through different schemes, are currently investigated for person tracking in both the robotics and vision communities. Besides the well-known CONDENSATION scheme, the fairly seldom exploited ICONDENSATION [26] variant steers sampling towards state space regions of high likelihood by incorporating both the dynamics and the measurements in the importance function. PF represent the posterior distribution by a set of samples, or particles, with associated importance weights. This weighted particle set is first drawn from an importance function and the initial probability distribution of the state vector, and is then updated over time taking into account the measurement models. Some approaches, e.g. [34], show that intermittent and discriminant cues based on person detection and recognition functionalities


must be considered in the importance function in order to: (i) automatically re-initialize the tracker on the targeted person when failures occur and (ii) simplify the data association problem in populated settings [9].

Primarily, embedded detectors are generally restricted to stationary robots in order to (only) segment moving people from the background [40,46]. Some works [10,17,33] consider foreground segmentation based on disparity maps given a stereoscopic head [32], but they generally require significant CPU resources. Other techniques assume that people coarsely face the robot. In those cases, face detection [6,23,42] can be applied to successfully (re)-initialize the tracker after temporary occlusions, out-of-field-of-view situations, or target losses. These multi-view face detectors have received an increasing interest due to their computational efficiency. Such detectors have been extended to full or upper human body detection [2,39,47]. Some complementary approaches combine person detection and recognition [18,46] in order to distinguish the targeted person from the others. Nevertheless, despite many advances, a major problem (sensitivity to pose and illumination) still exists, and a complete, on-board, reliable vision-based solution that can be used in general conditions is not currently available. Clearly, using an on-board monocular system to sense humans is very challenging compared to static and deported systems. Thus, face detection and skin color detection are only available when the person faces towards the robot, and the robot can hardly follow behind or even walk next to the person. Full or upper human body detectors based on supervised learning are inappropriate to cover the whole range of person distances (from 0.5 to 5 m) and orientations¹ encountered when sensing from a mobile robot. Consequently, recent trends lead to methods based on multimodal sensor fusion. The idea is generally to use the video stream as the primary sensor and the other sensor streams as secondary ones.
Beyond visible spectrum vision, thermal vision allows overcoming some of the aforementioned limitations, since humans have a distinctive thermal profile with respect to non-living objects. Moreover, their appearance does not depend anymore on lighting conditions. Yet, up to now, there are very few published works on using both thermal and visible cameras on mobile robots to detect/track humans (see a survey on thermal vision in [22]). We can here mention the well-known PF proposed by Cielniak et al. in [12] which uses thermal vision for human detection and color images for capturing the appearance. Unfortunately, in crowds, sensing with thermal cameras leads to an abundance of additional hot spots. It is then impossible to identify a given person as all humans (and also all living objects) stick out as white regions on a black background.
Some other multimodal systems devoted to person tracking utilize audio and visual sensors [8,7,33,34]. In crowds, the data association problem can be settled by speaker identification [28,45]. Nevertheless, sensing people with audio cues during the robot's or the customer's movement is questionable. Indeed, the variability generated by the speaker, the recording conditions, the background noise (especially in crowded environments), and the inherent intermittence of the voice stream (as humans do not babble all the time) are the main difficulties which have to be overcome. Therefore, speaker identification appears to be a challenging problem and still remains an open issue for further research.
Using laser range finders for person tracking is also frequent in the robotics community. In contrast to cameras, lasers provide accurate depth information, require little processing and are insensitive to ambient lighting. The classical strategy consists in extracting legs from a 2D laser scan at a fixed height. To this end, two particular types of features are intensively studied: motion [14,36] and geometry [4,6,29,39,47] features. Many multi-sensor fusion systems integrate the data provided by a laser range finder and a perspective [6,14,39] or omnidirectional [29,47] camera. However, systems involving laser scans suffer from several drawbacks. Leg detection in a 2D scan does not provide robust features for discriminating the different persons in the robot vicinity, and the detector fails when one leg occludes the other.

¹ The person can walk towards, away from, or past the robot, side-by-side, etc.
Recent person tracking approaches have focused on indoor positioning systems based on wireless networking indoor infrastructure and on ultrasound, infrared [37], or radio frequency badges worn on humans' clothes [3,11,21,27,35]. Radio frequency (RF) signals are widely used as they: (i) can penetrate through most building materials, (ii) have an excellent range in indoor environments, and (iii) suffer less interference from other frequency components. Moreover, RFID tags are preferred to accelerometers for aesthetic and ergonomic reasons [24,38,43]. Common applications involving RFID technologies [3,11,27,31,37] assume stationary readers distributed throughout the settings, namely ubiquitous sensors. Only Schulz et al. [37] considered multimodal people tracking from a network of RF sensors and laser range finders placed throughout an environment. Our approach privileges on-board perceptual resources (monocular color vision and RF reader) in order to limit the hardware installation cost and therefore the indoor setting support. We can here mention the approach proposed in [21] which considers an on-board RF device for people detection. However, the detection range was limited to 180° and no multimodal data fusion was done.

Fig. 1. RF multiplexing prototype to address eight antennas.


Fig. 2. Occurrence frequencies of the angle θ_tag given one (a), two (b) or three (c) detections.

RFID sensors enjoy the nice property of providing explicit information about the person's identity, even if the location information is relatively coarse. Our multimodal person tracker combines the accuracy and information richness of active color vision with the identification certainty of RFID. This tracker, which has not been addressed in the literature, is expected to be more resilient to occlusions than vision-only systems, since it benefits from a coarse estimate of the person's location in addition to the knowledge of his/her appearance. Furthermore, the ID-sensor can act as a reliable stimulus that triggers the vision system. Finally, when several people lie in the camera field of view,² this multimodal sensor data fusion will help in distinguishing the targeted person from the others.

The contribution of the paper is threefold. The first contribution is the customization of an off-the-shelf RFID system to make it able to detect tags in a 360° field of view, by multiplexing eight antennas. We have embedded this system on our mobile robot Rackham to detect passive RFID-tagged persons. This omnidirectional ID-sensor, unaffected by lighting conditions or human appearance, appears as an ideal complement to trigger a PTU-mounted perspective camera. The second contribution concerns particle sampling within the ICONDENSATION scheme. We propose a genuine importance function based on probabilistic saliency maps issued from visual and RF person detection and identification, as well as a rejection sampling mechanism to (re)-position samples on the desired person during tracking. This particle sampling strategy, which is unique in the literature, should make our multi-sensor-based tracker much more resilient to occlusions, data association ambiguities, and target losses than vision-only systems. The last contribution concerns a multi-sensor-based control making the mobile robot reliably follow a person in real time in a more difficult setting than previous works [10,19,29].

3. Person detection and identification based on RFID


3.1. Device description
The device consists of: (i) a CAEN RFID³ A941 multiprotocol off-the-shelf reader which works at 870 MHz, with a programmable emitting RF power from 100 to 1200 mW; (ii) eight directive antennas to detect the passive tags worn on the customer's clothes; (iii) a prototype circuit to sequentially initialize each antenna (Fig. 1). With a single antenna, only a tag angle relative to the antenna plane can be estimated. With our eight antennas, the tag can be detected all around the robot at any distance between 0.5 m (i.e. approximately the robot's radius) and 5 m. Given the placement of the antennas and their own fields of view, the robot neighborhood is divided into 24 areas (Fig. 3a), depending on the number of antennas simultaneously detecting the RFID tag.

² In this case, there are multiple observations in the image plane.
³ See http://www.caen.it/rfid/.
To determine the observation model of the whole antenna set, statistics are performed by counting frequencies depending on the number of antennas (three at a maximum, Fig. 3a) that detect the tag. The resulting normalized histograms are shown in Fig. 2, where the x-axis represents the azimuthal angle θ_tag. Similar histograms can be observed for the distance d_tag.⁴ The resulting sensor model makes the simplifying assumption that both the azimuth and distance histograms can be approximated by Gaussians, respectively defined by (μ_θtag, σ_θtag) and (μ_dtag, σ_dtag), where μ and σ are the mean and standard deviation. Afterwards, we project these probabilities for the current tag position onto a saliency map of the floor. The size of the saliency map is 300 × 300 pixels; thus each pixel represents an area of about 7 cm². Each pixel probability is calculated given the 8-antenna set outputs to approximate the RFID tag position (Fig. 3). The three rightmost plots in Fig. 3 respectively show the saliency maps for detection by one, two or three antennas. Given this observation model, evaluations allow characterizing the ID-sensor performance.
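For illustration, the sketch below builds such a floor saliency map from the Gaussian azimuth/range model. It is a minimal numpy example; the grid resolution and the σ values used in the call are illustrative placeholders, not the calibrated parameters of our antenna set.

```python
import numpy as np

def rfid_saliency_map(mu_theta, sigma_theta, mu_d, sigma_d,
                      grid_size=300, cell_m=0.027):
    """Floor saliency map for one tag reading, with the robot at the grid center.

    mu_theta/sigma_theta : mean/std of the azimuth model (rad)
    mu_d/sigma_d         : mean/std of the range model (m)
    All numeric defaults are illustrative placeholders, not calibrated values.
    """
    half = grid_size // 2
    xs = (np.arange(grid_size) - half) * cell_m          # metric coordinates of the cells
    X, Y = np.meshgrid(xs, xs)                           # floor grid, robot at (0, 0)
    theta = np.arctan2(Y, X)                             # azimuth of each cell
    dist = np.hypot(X, Y)                                # range of each cell
    # wrap the angular error into [-pi, pi] before evaluating the Gaussian
    dtheta = (theta - mu_theta + np.pi) % (2 * np.pi) - np.pi
    p = np.exp(-0.5 * (dtheta / sigma_theta) ** 2) \
        * np.exp(-0.5 * ((dist - mu_d) / sigma_d) ** 2)
    return p / p.sum()                                   # normalized saliency map

# Example: tag seen roughly ahead of the robot at about 2 m
smap = rfid_saliency_map(mu_theta=0.0, sigma_theta=np.deg2rad(20), mu_d=2.0, sigma_d=0.8)
```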
3.2. Evaluations from feasibility study
The RF system has been mounted on our mobile robot Rackham (Section 6) and evaluated in the presence of people. We have proceeded in the following way. We have generated statistics by counting frequencies on an 81 m² area around the robot. Obstacles have been added one by one during the test runs. Their positions have been randomly chosen and uniformly distributed in this area. The corresponding ground-truth is based on the ratio between the occluding zones induced by the obstacles (assuming an average person width of 40 cm) and the total area.

Given such various crowdedness situations, the RFID tag has been moved around the robot, assuming no self-occlusion by the person wearing the tag during this evaluation. We have repeated this sequence for different distances and we have counted, for every point of a discrete grid, whether the tag worn by a fixed person is detected or not, depending on the crowdedness. Comparisons between experimental and theoretical detection rates are shown in Fig. 4 (see the box-and-whisker plots). The x-axis and y-axis respectively denote the number of occluding persons (that is, the crowdedness) and the detection rate. The box plots and the thick stretches inside indicate the degree of dispersion (for 50% of the trials) and the median of the trials.
⁴ They are not presented here to save space, but they are available on request.


Fig. 3. Azimuthal field of view of the eight antennas (a) and saliency maps for tag detection by one (b), two (c) and three (d) antennas.


Fig. 4. Detection rate versus crowdedness in the robot surrounding.

Our experimental curves are rather close to the theoretical ones. As the system is disturbed by the occlusions, the number of false-negative readings logically increases with the number of obstacles. Nevertheless, the detection rate remains satisfactory, even for overcrowded scenes (e.g. 70% on average for seven persons standing around the robot). Furthermore, very few false-positive readings (reflections, detections with the wrong antennas, ...) are observed in practice.⁵
4. Person tracking using vision and RFID

Algorithm 1. Generic particle filtering algorithm (SIR)

Require: [{x_{k-1}^(i), w_{k-1}^(i)}]_{i=1..N}, z_k
1:  if k = 0 then
2:      Draw x_0^(1), ..., x_0^(i), ..., x_0^(N) i.i.d. according to p(x_0), and set w_0^(i) = 1/N
3:  end if
4:  if k >= 1 then  [{x_{k-1}^(i), w_{k-1}^(i)}]_{i=1..N} being a particle description of p(x_{k-1} | z_{1:k-1})]
5:      for i = 1, ..., N do
6:          "Propagate" the particle x_{k-1}^(i) by independently sampling x_k^(i) ~ q(x_k | x_{k-1}^(i), z_k)
7:          Update the weight w_k^(i) associated to x_k^(i) according to
                w_k^(i) ∝ w_{k-1}^(i) [p(z_k | x_k^(i)) p(x_k^(i) | x_{k-1}^(i))] / q(x_k^(i) | x_{k-1}^(i), z_k),
8:          prior to a normalization step so that Σ_i w_k^(i) = 1
9:      end for
10:     Compute the conditional mean of any function of x_k, e.g. the MMSE estimate E_{p(x_k|z_{1:k})}[x_k], from the approximation Σ_{i=1}^N w_k^(i) δ(x_k - x_k^(i)) of the posterior p(x_k | z_{1:k})
11:     At any time or depending on an efficiency criterion, resample the description [{x_k^(i), w_k^(i)}]_{i=1..N} of p(x_k | z_{1:k}) into the equivalent evenly weighted particle set [{x_k^(s(i)), 1/N}]_{i=1..N}, by sampling in {1, ..., N} the indexes s(1), ..., s(N) according to P(s(i) = j) = w_k^(j); set x_k^(i) and w_k^(i) to x_k^(s(i)) and 1/N
12: end if

4.1. Basics on particle filters and data fusion


Particle lters (PF) aim at recursively approximating the posterior probability density function (pdf) pxk jz1:k of the state vector xk
at time k conditioned on the set of measurements z1:k z1 ; . . . ; zk . A
linear point-mass combination

pxk jz1:k 

N
X
i1



i
i
wk d xk  xk ;

N
X

wk 1;

i1

is determined where d is the Dirac distribution. It expresses the


i
selection of a value or particle xk with probability or
i
weight wk ; i 1; . . . ; N. An approximation of the conditional
expectation of any function of xk , such as the MMSE6 estimate
Epxk jz1:k xk , then follows.
⁵ Passive tags induce few signal reflections, contrary to their active counterparts.
⁶ For Minimum Mean Square Error estimate.

The Sampling Importance Resampling (SIR) algorithm, shown in Algorithm 1, is fully described by the prior p(x_0), the dynamics pdf p(x_k | x_{k-1}) and the observation pdf p(z_k | x_k). After initialization with an independent identically distributed (i.i.d.) sequence drawn from p(x_0), the particles stochastically evolve, being sampled from an importance function q(x_k | x_{k-1}^(i), z_k). They are then suitably weighted to guarantee the consistency of the approximation (1). To this end, step 7 assigns each particle x_k^(i) a weight w_k^(i) involving its likelihood p(z_k | x_k^(i)) w.r.t. the measurement z_k as well as the values of the dynamics pdf and of the importance function at x_k^(i). In order to limit the well-known degeneracy phenomenon [5], step 11 inserts a resampling stage introduced by Gordon et al. [20], so that the particles associated with high weights are duplicated while the others collapse, and the resulting sequence x_k^(s(1)), ..., x_k^(s(N)) is i.i.d. according to (1).
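As an illustration, here is a minimal numpy sketch of one SIR iteration, specialized to the bootstrap case where the importance function is the dynamics (so that the weight update of step 7 reduces to the likelihood). The propagate and likelihood arguments are placeholders to be supplied by the application; this is a sketch under those assumptions, not our implementation.

```python
import numpy as np

def sir_step(particles, weights, propagate, likelihood, rng):
    """One SIR iteration (Algorithm 1 with q = dynamics, i.e. bootstrap PF).

    particles : (N, d) array of states x_{k-1}^(i)
    weights   : (N,) normalized weights w_{k-1}^(i)
    propagate : function drawing x_k ~ p(x_k | x_{k-1}) for an (N, d) array
    likelihood: function returning p(z_k | x_k) for an (N, d) array
    """
    particles = propagate(particles, rng)            # step 6: sample from the dynamics
    weights = weights * likelihood(particles)        # step 7: weight by the likelihood
    weights /= weights.sum()                         # step 8: normalization
    estimate = weights @ particles                   # step 10: MMSE estimate
    # step 11: resampling (done here systematically, at every iteration)
    idx = rng.choice(len(weights), size=len(weights), p=weights)
    return particles[idx], np.full(len(weights), 1.0 / len(weights)), estimate
```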


The CONDENSATION (for "Conditional Density Propagation") algorithm [25] is the instance of the SIR algorithm such that the particles are drawn according to the system dynamics, viz. when q(x_k | x_{k-1}^(i), z_k) = p(x_k | x_{k-1}^(i)). In visual tracking, the original algorithm [25] defines the particles' likelihoods from contour primitives, but other visual cues have also been exploited [34]. On this point, resampling may lead to a loss of diversity in the state space exploration. The importance function must thus be carefully defined. As CONDENSATION draws the particles x_k^(i) from the system dynamics but blindly w.r.t. the measurement z_k, many of them may be assigned a low likelihood p(z_k | x_k^(i)) and thus a low weight in step 7, which significantly worsens the overall filter performance.

An alternative, henceforth labeled "Measurement-based SIR" (MSIR), merely consists in sampling the particles (or just some of their entries) at time k according to an importance function π(x_k | z_k) defined from the current image. The first MSIR strategy was ICONDENSATION [26], which guided the state space exploration by a color blob detector. Other visual detection functionalities can be used as well, e.g. face detection/recognition (see below), or any other intermittent primitive which, despite its sporadicity, is very discriminant when present [34]. Thus, the classical importance function π based on a single detector can be extended to consider the outputs from L detection modules, i.e.

  π(x_k^(i) | z_k^1, ..., z_k^L) = Σ_{l=1}^L κ_l π(x_k^(i) | z_k^l),  with Σ_l κ_l = 1.   (2)

In an MSIR scheme, if a particle x_k^(i) drawn exclusively from the image (namely from π) is inconsistent with its predecessor x_{k-1}^(i) from the point of view of the state dynamics, the update formula leads to a small weight w_k^(i). One solution to this problem, as proposed in the genuine ICONDENSATION algorithm, consists in also sampling some particles from the dynamics and some w.r.t. the prior, so that

  q(x_k^(i) | x_{k-1}^(i), z_k) = α π(x_k^(i) | z_k) + β p(x_k^(i) | x_{k-1}^(i)) + (1 - α - β) p_0(x_k^(i)),   (3)

with α, β ∈ [0, 1].
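A minimal sketch of the corresponding sampling step is given below; the sampler callbacks and the values of α and β are illustrative placeholders, not the tuned ones.

```python
import numpy as np

def sample_from_mixture(x_prev, detections, prior_sampler, dyn_sampler,
                        det_sampler, alpha=0.4, beta=0.4, rng=None):
    """Draw one particle from the mixture proposal q of Eq. (3):
    with probability alpha from the detector-based importance function pi,
    with probability beta from the dynamics, otherwise from the prior."""
    rng = rng or np.random.default_rng()
    u = rng.random()
    if u < alpha and len(detections) > 0:
        return det_sampler(detections, rng)     # measurement-driven sample (pi)
    if u < alpha + beta:
        return dyn_sampler(x_prev, rng)         # dynamics-driven sample
    return prior_sampler(rng)                   # prior sample (re-initialization)
```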

Besides the importance function, the measurement function involves visual cues which must be persistent but are more prone to ambiguity in cluttered scenes. An alternative is to consider multi-cue fusion in the weighting stage. Given L measurement sources z_k^1, ..., z_k^L, and assuming the latter are mutually independent conditioned on the state, the unified measurement function can then be factorized as follows:

  p(z_k^1, ..., z_k^L | x_k^(i)) ∝ Π_{l=1}^L p(z_k^l | x_k^(i)).   (4)

4.2. Tracking implementation

The aim is to fit the template relative to the RFID-tagged person all along the video stream through the estimation of his/her image coordinates (u, v) and of the scale factor s of his/her head. All these parameters are accounted for in the state vector x_k related to the k-th frame. With regard to the dynamics p(x_k | x_{k-1}), the image motions of humans are difficult to characterize over time. This weak knowledge is modeled by defining the state vector as x_k = (u_k, v_k, s_k)^T and assuming that its entries evolve according to mutually independent random walk models, viz. p(x_k | x_{k-1}) = N(x_k; x_{k-1}, Σ), where N(.; μ, Σ) is a Gaussian distribution with mean μ and covariance Σ = diag(σ_u², σ_v², σ_s²).

In both the importance sampling and weight update steps, fusing multiple cues enables the tracker to better benefit from distinct information and decreases its sensitivity to temporary failures in some of the measurement processes. The underlying unified likelihood in the weighting stage is more or less conventional. It is computed thanks to (4) by means of several measurement functions, according to persistent visual cues, namely: (i) edges to model the silhouette [25] and (ii) multiple color distributions to represent the person's appearance (both head and torso) [34]. Despite its simplicity, our measurement function is inexpensive while still providing some level of person discrimination from the clothes' appearance. Our importance function, on the other hand, is unique in the literature and is detailed below.

4.3. Importance function based on visual and RF cues

Given Eq. (2), three functions π(x_k | z_k^c), π(x_k | z_k^s) and π(x_k | z_k^r), respectively based on the skin probability image, the face detector and the RF identification, are considered.

The importance function π(x_k | z_k^c) at location x_k = (u, v) is described by π(x | z^c) = h(c_z(x)), where c_z(x) is the color of the pixel located at x in the input image z^c, and h is the 3D normalized histogram used for backprojection [30], indexed by the R, G, B channels, which represents the a priori learnt color distribution of the skin.

The function π(x_k | z_k^s) relies on a probabilistic image based on the well-known face detector pioneered by Viola et al. [41], and improved in [42,44], which covers a range of ±45° of out-of-plane rotation. Let N_B be the number of detected faces {F_j}, j = 1, ..., N_B, and p_i = (u_i, v_i), i = 1, ..., N_B, the centroid coordinates of each such region. The face recognition technique, detailed in [18], involves two steps during the learning stage. The first one is composed of a PCA-based computation and a multi-class SVM⁷ learning, while the second one uses a genetic algorithm for free-parameter optimization based on NSGA-II. Finally, our on-line decision rule defines a posterior probability P(C_t | F, z) of labeling face F_j as C_t so that:

  P(C_t | F, z) = 0 for all t, and P(C_∅ | F, z) = 1, when L_t < τ for all t,
  P(C_t | F, z) = L_t / Σ_p L_p, and P(C_∅ | F, z) = 0, otherwise,

where C_∅ refers to the void class, τ is one of the free parameters of the system and C_t refers to the face basis of the RFID-tagged person. The function π at location x = (u, v) is deduced using a weighted Gaussian mixture proposal.⁸ Its expression is given hereafter:

  π(x | z^s) ∝ Σ_{j=1}^{N_B} P(C | F_j, z) N(x; p_j, diag(σ_{u_j}², σ_{v_j}²)),

where P(C | F_j, z) is the face ID probability of each detected face F_j given the beforehand learnt face of the tracked person. Given the RF outputs, the function π(x_k^(i) | z_k^r) expresses as follows:

  π(x_k^(i) | z_k^r) = N(θ_{x_k^(i)}; μ_θtag, σ_θtag),

where θ_{x_k^(i)} is the azimuthal position of the particle x_k^(i) in the robot frame, deduced from its horizontal position in the image and the camera pan angle. μ_θtag and σ_θtag, described in Section 3, are respectively the mean and the standard deviation of the estimated position of the sole targeted tag associated to the user in the robot frame, depending on the antenna outputs.

The particle sampling is done using the importance function q given in Eq. (3) and requires a rejection sampling process. This process constitutes an alternative when q is not analytically modeled. The principle is described in Algorithm 2, with g an instrumental distribution which makes the sampling easier, under the restriction that q < Mg, where M > 1 is an appropriate bound on q/g.

⁷ For Support Vector Machine.
⁸ Indexes k and i are omitted for the sake of clarity and space.
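For illustration, one possible way of assembling these three cues into a per-pixel importance map, in the spirit of Eq. (2), is sketched below. The mixture weights kappa, the histogram resolution and the theta_of_column helper (mapping an image column to an azimuth in the robot frame) are assumptions made for this example, not the values or interfaces of our implementation.

```python
import numpy as np

def importance_map(image_rgb, skin_hist, faces, face_probs,
                   theta_of_column, mu_theta, sigma_theta,
                   kappa=(0.4, 0.4, 0.2)):
    """Per-pixel importance combining skin color, face detections and RF azimuth.

    skin_hist      : normalized RGB histogram (bins x bins x bins) of skin color
    faces          : list of (u, v, sigma_u, sigma_v) detected face regions
    face_probs     : target-ID probability for each detection
    theta_of_column: maps image columns to azimuths in the robot frame (assumed helper)
    kappa          : mixture weights (illustrative, must sum to 1)
    """
    h, w, _ = image_rgb.shape
    bins = skin_hist.shape[0]
    # (i) skin-color cue: histogram backprojection
    idx = (image_rgb.astype(int) * bins) // 256
    p_skin = skin_hist[idx[..., 0], idx[..., 1], idx[..., 2]]
    # (ii) face cue: Gaussian mixture centered on the recognized faces
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    p_face = np.zeros((h, w))
    for (uj, vj, su, sv), pj in zip(faces, face_probs):
        p_face += pj * np.exp(-0.5 * (((u - uj) / su) ** 2 + ((v - vj) / sv) ** 2))
    # (iii) RF cue: Gaussian on the azimuth of each image column
    theta = theta_of_column(np.arange(w))
    p_rf = np.exp(-0.5 * ((theta - mu_theta) / sigma_theta) ** 2)[None, :].repeat(h, 0)
    maps = [m / m.sum() if m.sum() > 0 else m for m in (p_skin, p_face, p_rf)]
    return sum(k * m for k, m in zip(kappa, maps))
```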








Fig. 5. (a) Original image, (b) skin probability image π(x_k | z_k^c), (c) face detection π(x_k | z_k^s), (d) azimuthal angle from RFID detection π(x_k | z_k^r), (e) unified importance function (3) (without the dynamics), (f) accepted particles (yellow dots) after rejection sampling. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Algorithm 2. Rejection sampling algorithm

1: draw x_k^(i) according to M g(x_k)
2: r <- q(x_k^(i) | x_{k-1}^(i), z_k) / (M g(x_k^(i)))
3: draw u according to U_[0,1]
4: if u <= r then
5:     accept x_k^(i)
6: else
7:     reject it
8: end if
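A compact sketch of this accept/reject loop is given below; the max_tries guard is an addition for the example and not part of Algorithm 2.

```python
import numpy as np

def rejection_sample(q_density, g_sampler, g_density, M, rng, max_tries=1000):
    """Draw one sample distributed according to q using an instrumental
    density g with q < M * g (M > 1), as in Algorithm 2."""
    for _ in range(max_tries):
        x = g_sampler(rng)                      # step 1: draw from g
        r = q_density(x) / (M * g_density(x))   # step 2: acceptance ratio
        if rng.random() <= r:                   # steps 3-5: accept with probability r
            return x
    raise RuntimeError("no sample accepted; check the bound M")
```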

Fig. 5 shows an illustration of the rejection sampling algorithm for a given image. Our importance function (3), combined with rejection sampling, ensures that the particles will be placed in the relevant areas of the state space, i.e. concentrated on the tracked person or on potential candidate areas.
5. A sensor-based control law for person following task
Now, we address the problem of making the robot follow the tagged person. To this aim, we use the data provided by both the tracker and the RFID system. We first briefly present the considered robotic system and the chosen control strategy, before detailing the different designed control laws.
5.1. Modeling the problem: the robot and the control strategy
Our robot Rackham, depicted in Section 6, consists of a nonholonomic mobile base equipped with an RFID system and with a camera mounted on a pan/tilt unit (PTU). Four control inputs can then be used to act on our robot: the linear and angular mobile base velocities (v_r, ω_r) and the pan/tilt unit angular velocities (ω_p, ω_t). Our goal is to compute these four velocities so that the robot can efficiently and safely achieve the person following. Different control strategies are available in the literature. In our case, where the camera and the RFID tag are used to detect and track the user, it appears rather natural to consider visual servoing techniques [15,13]. These techniques allow controlling a large panel of robotic systems using image data provided by one (or several) cameras. However, although they are applied in a wide range of applications, the literature reports only few works which address the problem of human-based multi-sensor servoing.

Here, we focus on this problem and our idea is to use both RFID and tracker data to build our control laws. We have chosen to separately design the necessary controllers so as to decouple at best the different degrees of freedom of the camera. Analyzing the robot structure shows that the two control inputs ω_r and ω_p have the same effect on the features' movement in the image. Although this property can be used to perform additional objectives, such as obstacle detection and avoidance, here we have simply chosen to fix ω_p to zero, so that we control the features' horizontal position in the image using a unique controller. Moreover, using ω_r instead of ω_p allows orienting the whole robotic system (and not only the camera) towards the targeted person, improving the task execution.
5.2. Control design
Thus, we aim at designing three controllers (v_r, ω_r, ω_t) to orientate the camera and to move the robot, so that the person to be followed is always kept in its line of sight and at a suitable social distance from the vehicle. To this aim, we use the data provided by the tracker, namely the coordinates (u_gc, v_gc) of the head gravity center in the image and its associated scale s_gc, which coarsely characterizes the H/R distance. From these data, we define the error function E_ptv to be decreased to zero:

  E_ptv = (E_u, E_v, E_s)^T = (u_gc - u_i, v_gc - v_i, s_gc - s_i)^T,

where u_i and v_i represent the image center coordinates, and s_i the predefined scale corresponding to an acceptable social distance denoted by d_follow. E_u represents the abscissa error in the image and can then be regulated to zero by acting on the robot angular speed ω_r. E_v corresponds to the ordinate error in the image and can be decreased to zero thanks to the PTU tilt velocity ω_t, while E_s is the scale error, regulated to zero using the robot linear velocity v_r. We design three PID controllers as follows:

  ω_r = K_pp E_u + K_ip ∫ E_u dt + K_dp dE_u/dt,
  ω_t = K_pt E_v + K_it ∫ E_v dt + K_dt dE_v/dt,
  v_r = K_pv E_s + K_iv ∫ E_s dt + K_dv dE_s/dt.

The control gains (K_pp, K_ip, K_dp) (respectively (K_pt, K_it, K_dt) and (K_pv, K_iv, K_dv)) are experimentally tuned to ensure the system stability.
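For illustration, the three PID controllers above can be sketched as follows; the gains and the sampling period are illustrative placeholders, not the experimentally tuned values.

```python
class PID:
    """Discrete PID controller; gains are illustrative, not the tuned values."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, None

    def __call__(self, err):
        self.integral += err * self.dt
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

# One controller per degree of freedom
dt = 1.0 / 6.0                       # ~6 Hz, the reported average framerate
pid_wr = PID(1.0, 0.1, 0.05, dt)     # robot angular velocity  <- abscissa error E_u
pid_wt = PID(1.0, 0.1, 0.05, dt)     # PTU tilt velocity       <- ordinate error E_v
pid_vr = PID(1.0, 0.1, 0.05, dt)     # robot linear velocity   <- scale error E_s

def control(ugc, vgc, sgc, ui, vi, si):
    """Compute (v_r, w_r, w_t) from the tracker output and the image references."""
    return pid_vr(sgc - si), pid_wr(ugc - ui), pid_wt(vgc - vi)
```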
However, these control laws can be used only when the target lies in the image. When the latter is lost, they cannot be applied anymore and we use the RFID information, namely the distance d_tag and the orientation θ_tag, to control the robot. The idea is then to make the camera turn until the robot faces the tag, so that the tracker can retrieve the tagged person in the camera field of view. The corresponding robot behavior is shown in Fig. 6. To this aim, we simply impose a constant value ω_r0 for the robot angular velocity ω_r. The PTU speed ω_t is controlled so that the corresponding angle is brought back to its reference position, that is the position reached after each initialization of the PTU. We finally impose a linear velocity whose value depends on d_tag and on θ_tag. In this way, we try⁹ to keep satisfying the constraint on the social distance d_follow, despite the visual information loss. The robot is then kept in the closest possible neighborhood of the user to ease the visual signal recovery. When the person is detected anew, the control strategy switches back to the three vision-based controllers given above. Note that the control law smoothness is preserved when the switch between the vision-based controllers and the RFID-based controller occurs. Indeed, the linear velocity is progressively modified to reach the new desired value. As for ω_r and ω_t, the continuity is not explicitly handled because the robot and the PTU are sufficiently rapid systems.

⁹ The distance d_tag provided by the RFID system is rather inaccurate.
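A schematic of this switching logic is sketched below. The constant ω_r0, the recovery gains, the 2 m following distance and the velocity ramp bound are illustrative assumptions, not the values used on Rackham.

```python
import math

def follow_step(vision_ok, E_u, E_v, E_s, d_tag, theta_tag, ptu_angle,
                pid_vr, pid_wr, pid_wt, prev_vr,
                w_r0=0.3, k_t=0.5, k_d=0.3, d_follow=2.0, dv_max=0.05):
    """One control cycle: vision-based PIDs when the target is in the image,
    RFID-based recovery otherwise. All numeric constants are illustrative."""
    if vision_ok:
        v_r, w_r, w_t = pid_vr(E_s), pid_wr(E_u), pid_wt(E_v)
    else:
        w_r = w_r0                                            # constant turn towards the tag
        w_t = -k_t * ptu_angle                                # bring the PTU back to its reference
        v_r = k_d * (d_tag - d_follow) * math.cos(theta_tag)  # coarse range regulation
    # progressively modify the linear velocity to keep the control law smooth at switches
    v_r = prev_vr + max(-dv_max, min(dv_max, v_r - prev_vr))
    return v_r, w_r, w_t
```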


Fig. 6. RFID-based robot behavior.

Fig. 8. A typical run without human obstacles: the solid red/dashed blue curves represent the robot/tagged person positions, while the thin purple lines show the direction of the camera. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 7. Rackham.

6. Integration and live experiments
6.1. Rackham description and software architecture
Rackham is an iRobot B21r mobile platform. Its standard equipment has been extended with one digital camera mounted on a Directed Perception pan-tilt unit, one ELO touch-screen, a pair of loudspeakers, an optical fiber gyroscope, wireless Ethernet and the RFID system previously described in Section 3 (Fig. 7). All these devices enable Rackham to act as a service robot in utilitarian public areas. It embeds robust human-robot interaction abilities and efficient basic navigation skills.

We have developed three software modules, namely ICU (which stands for "I see you"), RFID and Visuserv, which respectively encapsulate human recognition/tracking, RFID localization, and visual servoing. These modules have been implemented within the LAAS architecture [1] using a C/C++ interfacing scheme. The OpenCV library¹⁰ is used for low-level feature extraction, e.g. edge or face detection. The entire system operates at an average framerate of 6 Hz.

¹⁰ See http://sourceforge.net/projects/opencvlibrary/.

6.2. Experiments and discussion

Experiments were conducted in our crowded robotic hall (4 × 5 m²). The goal is to control Rackham to follow a tagged person while respecting his/her personal space. The task requirements are: (i) the user must always be centered in the image and (ii) a distance d_follow = 2 m must be maintained between the tagged person and the robot. Up to now, the robot abilities of obstacle avoidance are rather coarse. Thus, we have chosen to set the robot maximum driving and turning speeds at reduced values (respectively 0.4 m/s and 0.6 rad/s) compatible with most targeted users' velocity.¹¹ Numerous series of 10 runs have been carried out. The scenario is as follows: a non-expert person enters the hall, picks up an RFID tag on Rackham, then moves in the hall without paying attention to the robot. The system performances are evaluated on the whole experiment set and measured by:

- The visual contact rate (VCR), defined as the ratio of the frames where the targeted person was in the field of view over the total number of frames. This indicator indirectly measures the tracker robustness to artifacts such as occlusions and sporadic target losses due to the crowds.
- The following error (FE), defined by E_follow = |r - d_follow|, where r is the current range to the tagged person. This error measures the robot capability to follow the target at the desired distance d_follow.

Fig. 8 shows a typical run where the user is alone with the robot. During this 4-step run, either the sole vision system or its multimodal counterpart is considered. In such nominal conditions, Fig. 9 shows snapshots of the video streams with superimposed tracking outputs as well as the current RFID saliency map.

¹¹ If the user outpaces the robot and is lost, the vehicle is stopped for safety's sake.


Fig. 9. Snapshots of a trial. Notice that the pan-tilt unit azimuthal position is given by the red arch on the RFID map. The blue and green squares respectively depict the face detection (person gazing at the camera) and the MMSE estimate, while the yellow dots represent the particles before the resampling step. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 10. Synchronization of the data flow outputs between the different modules.

Fig. 10 shows the signals issued from the modules ICU, RFID and Visuserv, namely: (i) the two flags ICU and RFID, which are respectively set to 1 when the tagged user is detected either in the image or in the RFID area; (ii) the angle θ_tag and the distance d_tag; and (iii) the three control inputs (v_r, ω_r, ω_t) computed by the Visuserv module and sent to the robot.

Fig. 11. A typical run with human obstacles: the solid red/dashed blue curves represent the robot/tagged person positions, while the thin purple lines show the direction of the camera. The black arrows represent the passers-by paths. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

After the mission initialization (a), the tracker focuses on the tagged person thanks to the video stream (b). The four steps of the scenario are then executed. Between points #1 and #2, the contact with the target is maintained thanks to the vision system. The control law is then computed using the visual data provided by the tracker. The target is centered in the image while the robot adjusts its relative distance thanks to the visual template scale. Between points #2 and #4 (c), the target disappears from the camera field of view, which induces a visual tracker failure. The control law is then computed using the RFID data (d_tag, θ_tag) to make the robot face the targeted person and converge towards him/her until d_tag reaches a close neighborhood of d_follow. The person following task is thus still executed despite the visual target loss, while the RFID system triggers the camera in order to recover the target in the field of view (d). Face detection/recognition then allows re-initializing the visual tracker (e) while the person goes back to #1 (f).

During this path, the robot trajectory crosses the target's one. As expected, to preserve the social distance, the robot moves backward. In nominal conditions, the task is successfully performed. The average following error in this set of runs was 0.08 m, without real impact of the RFID system.
In a second step, we have progressively increased the number of people in the robot vicinity to disturb the person following task by sporadically occluding the tagged person. Fig. 11 shows a run including typical situations that may happen during tracking. Other people cross the path between Rackham and the tagged person, walk together with him and then cross the path again, or even walk right behind Rackham for a long time, making the robot stop and the tracker re-initialize.
Fig. 12 shows snapshots of this run, while the entire video is available at the URL http://www.laas.fr/~tgerma/CVIU. In almost all cases, the multimodal tracker was able to cope with such adverse conditions while the vision-only system failed to reset on the correct person. The vision-only system succeeded in 12% of the missions, while more than 85% of them were successfully performed using the multimodal counterpart. Table 1 shows the associated visual contact rates when increasing the number of passers-by. The average visual contact rate remains almost constant for both systems, but these results highlight the multimodal tracker efficiency. In fact, the RFID system allows keeping the target in the visual field of view for more than μ = 85% of the duration of the video stream despite the presence of more than four passers-by. The high value of the standard deviation (noted σ) is mainly due to the random motions of the passers-by, as they were asked to walk freely around the robot.
Fig. 12. Snapshots of a run in crowds. The first line shows the current human-robot situation.

Table 1
Visual contact rate when considering 1-4 passers-by (mean ± standard deviation).

Sensor system    | Number of passers-by                                      | Total
                 | 1           | 2           | 3           | 4 and more  |
Vision only      | 0.21 ± 0.11 | 0.22 ± 0.02 | 0.18 ± 0.05 | 0.22 ± 0.06 | 0.21 ± 0.04
Vision + RFID    | 0.94 ± 0.08 | 0.85 ± 0.14 | 0.94 ± 0.13 | 0.83 ± 0.19 | 0.86 ± 0.14


During these missions, the average following error was E_follow = 0.10 m.
7. Conclusion
Tracking provides important capabilities for human-robot interaction and assistance of humans in utilitarian populated spaces. The paper exhibits three contributions. A first contribution concerns the customization of an off-the-shelf RFID system to detect tags within a 360° field of view and to coarsely estimate their distance thanks to the multiplexing of eight antennas. A second contribution concerns the development of a multimodal person tracker which combines the accuracy advantages of monocular active vision with the identification certainty of such an RFID-based sensor. Our technique uses the ICONDENSATION scheme, heterogeneous data-driven proposals and a rejection sampling mechanism to (re)-concentrate the particles on the right person during the sampling step. To the best of our knowledge, such multimodal data fusion within PF is unique in the robotics and vision literature. A third contribution concerns the embeddability of this multimodal person tracker on a mobile robot and its coupling to the robot control to follow a tagged person in crowds. The live experiments have demonstrated that the person following task inherits the advantages of both sensor types, thereby being able to robustly track people and identify them.

Several directions are currently studied regarding the whole system. First, we will design more compact antennas, as embeddability is essential for autonomous robots. Second, the obstacle detection and avoidance problem will be addressed through an enhanced multi-sensor-based control strategy. Further investigations will also concern the algorithm extension to multiple persons, as several RFID tags can be detected at the same time by the reader. The robot will then be able to interact with multiple humans simultaneously, to track and avoid passers-by, and even to interpret human-human interaction in its vicinity.
Acknowledgments
The authors are very grateful to Lo Bernard and Antoine Roguez for their involvement in this work, which was partially conducted within the EU STREP Project CommRob funded by the European Commission Division FP6 under Contract FP6-045441.
References

[1] R. Alami, R. Chatila, S. Fleury, F. Ingrand, An architecture for autonomy, International Journal of Robotic Research (IJRR'98) 17 (4) (1998) 315-337.
[2] M. Andriluka, S. Roth, B. Schiele, People-tracking by detection and people detection by tracking, in: International Conference on Computer Vision and Pattern Recognition (CVPR'08), Anchorage, USA, June 2008.
[3] M. Anne, J. Crowley, V. Devin, G. Privat, Localisation intra-bâtiment multi-technologies: RFID, wifi et vision, in: National Conference on Mobility and Ubiquity Computing (UbiMob'05), Grenoble, France, June 2005, pp. 29-35.
[4] K. Arras, O. Mozos, W. Burgard, Using boosted features for detection of people in 2D range scans, in: International Conference on Robotics and Automation (ICRA'07), Roma, Italy, April 2007.
[5] S. Arulampalam, S. Maskell, N. Gordon, T. Clapp, A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking, Transactions on Signal Processing 50 (2) (2002) 174-188.
[6] N. Bellotto, H. Hu, Vision and laser data fusion for tracking people with a mobile robot, in: International Conference on Robotics and Biomimetics (ICRB'06), Kunming, China, December 2006.
[7] H.J. Bohme, T. Wilhelm, J. Key, C. Schauer, C. Schroter, H.M. Gross, T. Hempel, An approach to multi-modal human-machine interaction for intelligent service robots, Robotics and Autonomous Systems 44 (1) (2003) 83-96.
[8] M. Bregonzio, M. Taj, A. Cavallaro, Multimodal particle filtering tracking using appearance, motion and audio likelihoods, in: International Conference on Image Processing (ICIP'07), San Antonio, USA, October 2007.
[9] L. Bréthès, F. Lerasle, P. Danès, M. Fontmarty, Particle filtering strategies for data fusion dedicated to visual tracking from a mobile robot, Machine Vision and Applications (MVA'08) (2008), doi:10.1007/s00138-008-0174-7.
[10] D. Calisi, L. Iocchi, R. Leone, Person following through appearance models and stereo vision using a mobile robot, in: International Conference on Computer Vision Theory and Applications (VISAPP'07), Barcelona, Spain, March 2007.
[11] B. Castano, M. Rodriguez, An artificial intelligence and RFID system for people detection and orientation in big surfaces, in: International Multi-Conference on Engineering and Technological Innovation (IMETI'08), Orlando, USA, June 2008.
[12] G. Cielniak, A. Lilienthal, T. Duckett, Improved data association and occlusion handling for vision-based people tracking by mobile robots, in: International Conference on Intelligent Robots and Systems (IROS'07), San Diego, USA, November 2007.
[13] P.I. Corke, Visual Control of Robots: High Performance Visual Servoing, Research Studies Press Ltd., 1996.
[14] J. Cui, H. Zha, H. Zhao, R. Shibasaki, Multimodal tracking of people using laser scanners and video camera, Image and Vision Computing (IVC'08) 26 (2) (2008) 240-252.
[15] B. Espiau, F. Chaumette, P. Rives, A new approach to visual servoing in robotics, IEEE Transactions on Robotics and Automation 8 (3) (1992) 313-326.
[16] T. Fong, I. Nourbakhsh, K. Dautenhahn, A survey of socially interactive robots, Robotics and Autonomous Systems (RAS'03) 42 (2003) 143-166.
[17] D.M. Gavrila, Multi-cue pedestrian detection and tracking from a moving vehicle, International Journal of Computer Vision (IJCV'07) 73 (1) (2008) 41-59.
[18] T. Germa, F. Lerasle, T. Simon, Video-based face recognition and tracking from a robot companion, International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI'09) 23 (March) (2009) 591-616.
[19] R. Gockley, J. Forlizzi, R. Simmons, Natural person-following behavior for social robots, in: International Conference on Human Robot Interaction (HRI'07), Washington, USA, March 2007, pp. 17-24.
[20] N.J. Gordon, D.J. Salmond, A.F.M. Smith, Novel approach to nonlinear/non-Gaussian Bayesian state estimation, Radar and Signal Processing, IEE Proceedings F, 140 (2) (1993) 107-113.
[21] D. Hahnel, W. Burgard, D. Fox, K. Fishkin, M. Philipose, Mapping and localization with RFID technology, in: International Conference on Robotics and Automation (ICRA'04), April 2004, pp. 1015-1020.
[22] R. Hammoud, J. Davis, Advances in vision algorithms and systems beyond the visible spectrum, Computer Vision and Image Understanding (CVIU'07) 106 (2) (2007) 145-147.
[23] C. Huang, H. Ai, Y. Li, S. Lao, High-performance rotation invariant multi-view face detection, Transactions on Pattern Analysis and Machine Intelligence (PAMI'07) 29 (4) (2007) 671-686.
[24] T. Ikeda, H. Ishiguro, T. Nishimura, People tracking by cross modal association of vision and acceleration sensors, in: International Conference on Intelligent Robots and Systems (IROS'07), San Diego, USA, November 2007, pp. 4147-4151.
[25] M. Isard, A. Blake, CONDENSATION - conditional density propagation for visual tracking, International Journal on Computer Vision 29 (1) (1998) 5-28.
[26] M. Isard, A. Blake, I-CONDENSATION: unifying low-level and high-level tracking in a stochastic framework, in: European Conference on Computer Vision (ECCV'98), Freiburg, Germany, June 1998, pp. 893-908.
[27] T. Kanda, M. Shiomi, L. Perrin, T. Nomura, H. Ishiguro, N. Hagita, Analysis of people trajectories with ubiquitous sensors in a science museum, in: International Conference on Robotics and Automation (ICRA'07), Roma, Italy, April 2007, pp. 4846-4853.
[28] B. Kar, S. Bhatia, P. Dutta, Audio-visual biometric based speaker identification, in: International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'07), Sivakasi, India, December 2007, pp. 94-98.
[29] M. Kobilarov, G. Sukhatme, J. Hyams, P. Batavia, People tracking and following with mobile robot using an omnidirectional camera and laser, in: International Conference on Robotics and Automation (ICRA'06), Orlando, USA, May 2006, pp. 557-562.
[30] J. Lee, W. Lee, D. Jeong, Object tracking method using back-projection of multiple color histogram models, in: International Symposium on Circuits and Systems (ISCAS'03), June 2003.
[31] T. Mori, Y. Suemasu, H. Noguchi, T. Sato, Multiple people tracking by integrating distributed floor pressure sensors and RFID system, in: International Conference on Systems, Man and Cybernetics, The Hague, Netherlands, October 2004, pp. 5271-5278.
[32] Rafael Muñoz Salinas, Miguel García-Silvente, Rafael Medina-Carnicer, Adaptive multi-modal stereo people tracking without background modelling, Journal of Visual Communication and Image Representation 19 (2) (2008) 75-91.
[33] K. Nickel, T. Gehrig, H. Ekenel, R. Stiefelhagen, J. McDonough, A joint particle filter for audio-visual speaker tracking, in: International Conference on Multimodal Interfaces (ICMI'05), Trento, Italy, 2005, pp. 61-68.
[34] P. Pérez, J. Vermaak, A. Blake, Data fusion for visual tracking with particles, Proceedings of the IEEE 92 (3) (2004) 495-513.
[35] S.S. Takahashi, J. Wong, M. Miyamae, A ZigBee-based sensor node for tracking people's locations, in: ACM International Conference, Sydney, Australia, May 2008, pp. 34-38.
[36] D. Schulz, W. Burgard, D. Fox, A. Cremers, Tracking multiple moving targets with a mobile robot using particle filters and statistical data association, in: International Conference on Robotics and Automation (ICRA'01), Seoul, Korea, May 2001.
[37] D. Schulz, D. Fox, J. Hightower, People tracking with anonymous and ID-sensors using Rao-Blackwellised particle filters, in: International Joint Conference on Artificial Intelligence (IJCAI'03), Acapulco, Mexico, August 2003.
[38] J. Smith, K. Fishkin, B. Jiang, A. Mamishev, RFID-based techniques for human-activity detection, Communications of the ACM 48 (9) (2005) 39-44.
[39] L. Spinello, R. Triebel, R. Siegwart, Multimodal detection and tracking of pedestrians in urban environments with explicit ground plane extraction, in: AAAI Conference on Artificial Intelligence (AAAI'08), Chicago, USA, July 2008, pp. 1409-1414.
[40] Y. Tsai, H. Shih, C. Huang, Multiple human objects tracking in crowded scenes, in: International Conference on Pattern Recognition (ICPR'06), Hong Kong, August 2006, pp. 51-54.
[41] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: International Conference on Computer Vision and Pattern Recognition (CVPR'01), 2001.
[42] P. Viola, M. Jones, Fast multi-view face detection, in: International Conference on Computer Vision and Pattern Recognition (CVPR'03), Madison, USA, June 2003.
[43] H. Wang, H. Lenz, A. Szabo, J. Bamberger, U. Hanebeck, WLAN-based pedestrian tracking using particle filters and low-cost MEMS sensors, in: Workshop on Positioning, Navigation and Communication (WPNC'07), Hannover, Germany, March 2007.
[44] Yan Wang, Yanghua Liu, Linmi Tao, Guangyou Xu, Real-time multi-view face detection and pose estimation in video stream, in: 18th International Conference on Pattern Recognition (ICPR 2006), vol. 4, 2006, pp. 354-357.
[45] L. Ying, S. Narayanan, C. Kuo, Adaptive speaker identification with audiovisual cues for movie content analysis, Pattern Recognition Letters 25 (7) (2004) 776-791.
[46] W. Zajdel, Z. Zivkovic, B. Kröse, Keeping track of humans: have I seen this person before?, in: International Conference on Robotics and Automation (ICRA'05), Barcelona, Spain, April 2005, pp. 2093-2098.
[47] Z. Zivkovic, B. Kröse, Part based people detection using 2D range data and images, in: International Conference on Robotics and Automation (ICRA'07), Roma, Italy, April 2007.
