Vision and RFID data fusion for tracking people in crowds by a mobile robot
T. Germa a,b,*, F. Lerasle a,b, N. Ouadah a,c, V. Cadenat a,b
Article info
Article history:
Received 13 January 2009
Accepted 5 January 2010
Available online 22 January 2010
Keywords:
Radio frequency ID
Multimodal data fusion
Particle filtering
Person tracking
Person following
Multi-sensor fusion
Human visual servoing
a b s t r a c t
In this paper, we address the problem of realizing a human-following task in a crowded environment. We consider an active perception system, consisting of a camera mounted on a pan-tilt unit and a 360° RFID detection system, both embedded on a mobile robot. To perform such a task, it is necessary to efficiently track humans in crowds. In a first step, we have dealt with this problem using the particle filtering framework, because it enables the fusion of heterogeneous data, which improves the tracking robustness. In a second step, we have considered the problem of controlling the robot motion to make the robot follow the person of interest. To this aim, we have designed a multi-sensor-based control strategy based on the tracker outputs and on the RFID data. Finally, we have implemented the tracker and the control strategy on our robot. The obtained experimental results highlight the relevance of the developed perceptual functions. Possible extensions of this work are discussed at the end of the article.
© 2010 Elsevier Inc. All rights reserved.
1. Introduction
Giving a mobile robot the ability to automatically follow a person appears to be a key issue in making it efficiently interact with humans. Numerous applications would benefit from such a capability. Service robotics is obviously one of them, as it requires interactive robots [16] able to follow a person to provide continual assistance in office buildings, museums, hospital environments, or even in shopping centers. Service robots clearly need to move in ways that are socially suitable for people. Such a robot has to localize its user, to discriminate him/her from other passers-by, and to be able to follow him/her across complex human-centered environments. In this context, tracking a given person in crowds from a mobile platform appears to be fundamental. However, numerous difficulties arise: moving cameras with a limited field of view, cluttered backgrounds, illumination variations, hard real-time constraints, and so on.
The literature offers many tools to overcome these difficulties. Our paper focuses on the particle filtering framework, as it easily enables the fusion of heterogeneous data from embedded sensors. Despite their sporadicity, dedicated person detectors and their hardware counterparts are very discriminant when present.
2 The person can walk towards, away from, or past the robot, side by side, etc.
Fig. 2. Occurrence frequencies of the angle θ_tag given one (a), two (b) or three (c) detections.
RFID sensors have the attractive property of providing explicit information about the person's identity, even if the location information is relatively coarse. Our multimodal person tracker combines the accuracy and information richness of active color vision with the identification certainty of RFID. This tracker, which has not been addressed in the literature, is expected to be more resilient to occlusions than vision-only systems, since it benefits from a coarse estimate of the person's location in addition to the knowledge of his/her appearance. Furthermore, the ID-sensor can act as a reliable stimulus that triggers the vision system. Finally, when several people lie in the camera field of view,2 this multimodal sensor data fusion helps distinguish the targeted person from the others.
The contribution of this paper is threefold. The first contribution is the customization of an off-the-shelf RFID system to make it able to detect tags in a 360° field of view, by multiplexing eight antennas. We have embedded this system on our mobile robot Rackham to detect passive RFID-tagged persons. This omnidirectional ID-sensor, unaffected by lighting conditions or humans' appearance, appears as an ideal complement to trigger a PTU-mounted perspective camera. The second contribution concerns particle sampling within the ICONDENSATION scheme. We propose a genuine importance function based on probabilistic saliency maps derived from visual and RF person detection and identification, as well as a rejection sampling mechanism that (re)positions samples on the desired person during tracking. This particle sampling strategy, which is unique in the literature, should make our multi-sensor-based tracker much more resilient to occlusions, data association errors, and target losses than vision-only systems. The last contribution concerns a multi-sensor-based control making the mobile robot reliably follow a person in real time, in a more difficult setting than previous works [10,19,29].
Tags are detected all around the robot at any distance between 0.5 m (i.e. approximately the robot's radius) and 5 m. Given the placement of the antennas and their own fields of view, the robot neighborhood is divided into 24 areas (Fig. 3a), depending on the number of antennas simultaneously detecting the RFID tag.
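As an illustration, the subset of antennas that simultaneously report the tag already narrows its azimuth. A minimal sketch, under the assumption of eight antennas with boresights 45° apart (the names and the circular-mean estimator below are ours, not the paper's):

```python
import numpy as np

# Hypothetical layout: 8 multiplexed antennas at 45 degree intervals.
ANTENNA_AZIMUTHS = np.radians(np.arange(8) * 45.0)  # boresights (rad)

def coarse_tag_azimuth(detections):
    """Coarse tag azimuth from the set of antenna indices (0..7) that
    simultaneously detect it; 1 to 3 adjacent antennas may fire."""
    if not detections:
        return None  # tag not detected anywhere
    # Circular mean of the boresights of the firing antennas.
    angles = ANTENNA_AZIMUTHS[list(detections)]
    return np.arctan2(np.sin(angles).mean(), np.cos(angles).mean())

# Example: antennas 0 and 1 fire -> tag roughly at 22.5 degrees.
print(np.degrees(coarse_tag_azimuth({0, 1})))
```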
To determine the observation model of the whole antenna set, statistics are gathered by counting detection frequencies depending on the number of antennas (three at a maximum, Fig. 3a) that detect the tag. The resulting normalized histograms are shown in Fig. 2, where the x-axis represents the azimuthal angle θ_tag. Similar histograms can be observed for the distance d_tag.4 The resulting sensor model makes the simplifying assumption that both the azimuth and distance histograms can be approximated by Gaussians, respectively defined by (μ_{θ_tag}, σ_{θ_tag}) and (μ_{d_tag}, σ_{d_tag}), where μ and σ are the mean and standard deviation. Afterwards, we project these probabilities for the current tag position onto a saliency map of the floor. The size of the saliency map is 300 × 300 pixels; the area of each pixel thus represents 7 cm². Each pixel probability is computed from the 8-antenna set outputs to approximate the RFID tag position (Fig. 3). The three rightmost plots in Fig. 3 respectively show the saliency maps for detection by one, two or three antennas. Given this observation model, evaluations allow us to characterize the ID-sensor performance.
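A minimal sketch of how such a floor saliency map could be rasterized from the fitted Gaussians, assuming independent azimuth and distance models and the 300 × 300 grid with 7 cm² cells quoted above (function and parameter names are ours):

```python
import numpy as np

def rfid_saliency_map(mu_theta, sigma_theta, mu_d, sigma_d,
                      size=300, cell_area_cm2=7.0):
    """Floor saliency map for the tag position, assuming the azimuth and
    distance histograms are well approximated by independent Gaussians
    (mu_theta, sigma_theta) and (mu_d, sigma_d), as in the text."""
    cell = np.sqrt(cell_area_cm2) / 100.0          # cell side in metres
    half = size * cell / 2.0
    xs = np.linspace(-half, half, size)
    X, Y = np.meshgrid(xs, xs)                     # robot at the map centre
    theta = np.arctan2(Y, X)                       # azimuth of each cell
    d = np.hypot(X, Y)                             # range of each cell
    # Wrapped angular error, then independent Gaussian likelihoods.
    dtheta = np.arctan2(np.sin(theta - mu_theta), np.cos(theta - mu_theta))
    p = (np.exp(-0.5 * (dtheta / sigma_theta) ** 2) *
         np.exp(-0.5 * ((d - mu_d) / sigma_d) ** 2))
    return p / p.sum()                             # normalized probability map
```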
3.2. Evaluations from feasibility study
The RF system has been mounted on our mobile robot Rackham (Section 6) and evaluated in the presence of people. We have proceeded in the following way. We have generated statistics by counting detection frequencies over an 81 m² area around the robot. Obstacles have been added one by one during the test runs; their positions have been randomly chosen and uniformly distributed over this area. The corresponding ground truth is based on the ratio between the occluding zones induced by obstacles (assuming an average person width of 40 cm) and the total area.
Given such various crowdedness situations, the RFID tag has been moved around the robot, assuming no self-occlusion by the person wearing the tag during this evaluation. We have repeated this sequence for different distances and counted, for every point of a discrete grid, whether the tag worn by a fixed person is detected or not, depending on the crowdedness. Comparisons between experimental and theoretical detection rates are shown in Fig. 4 (see the box-and-whisker plots).
The x-axis and y-axis respectively denote the number of occluding persons (that is, the crowdedness) and the detection rate. The box plots and the thick stretches inside indicate the degree of dispersion (for 50% of the trials) and the median of the trials.
4
They are not presented here to save space, but they are available on request.
Fig. 3. Azimuthal field of view of the eight antennas (a) and saliency maps for tag detection by 1 (b), 2 (c) and 3 (d) antennas.
Our experimental curves turn out to be rather close to the theoretical ones. As the system is disturbed by occlusions, the number of false-negative readings logically increases with the number of obstacles. Nevertheless, the detection rate remains satisfactory, even for overcrowded scenes (e.g. 70% on average for seven persons standing around the robot). Furthermore, very few false-positive readings (reflections, detections by the wrong antennas, etc.) are observed in practice.5
4. Person tracking using vision and RFID
Particle filters approximate the posterior p(x_k | z_{1:k}) of the state x_k given the measurement sequence z_{1:k} by a set of N weighted particles {x_k^{(i)}, w_k^{(i)}}, i.e.

\[ p(x_k \mid z_{1:k}) \approx \sum_{i=1}^{N} w_k^{(i)}\, \delta(x_k - x_k^{(i)}), \qquad \sum_{i=1}^{N} w_k^{(i)} = 1. \tag{1} \]

Algorithm 1. Generic particle filtering algorithm (SIR).

1: if k = 0 then
2:   Draw x_0^{(1)}, ..., x_0^{(i)}, ..., x_0^{(N)} i.i.d. according to p(x_0), and set w_0^{(i)} = 1/N
3: end if
4: if k ≥ 1 then [{x_{k-1}^{(i)}, w_{k-1}^{(i)}}_{i=1}^{N} being a particle description of p(x_{k-1} | z_{1:k-1})]
5:   for i = 1, ..., N do
6:     "Propagate" the particle by sampling x_k^{(i)} ~ q(x_k | x_{k-1}^{(i)}, z_k)
7:     Update the weight: w_k^{(i)} ∝ w_{k-1}^{(i)} p(z_k | x_k^{(i)}) p(x_k^{(i)} | x_{k-1}^{(i)}) / q(x_k^{(i)} | x_{k-1}^{(i)}, z_k),
8:     prior to a normalization step so that Σ_i w_k^{(i)} = 1
9:   end for
10:  Compute the conditional mean of any function of x_k, e.g. the MMSE estimate E_{p(x_k|z_{1:k})}[x_k], from the approximation Σ_{i=1}^{N} w_k^{(i)} δ(x_k − x_k^{(i)}) of the posterior p(x_k | z_{1:k})
11:  At any time, resample the particle set by drawing indices s^{(i)} with P(s^{(i)} = j) = w_k^{(j)}, and setting x_k^{(i)} := x_k^{(s^{(i)})} and w_k^{(i)} := 1/N
12: end if
5 Passive tags induce few signal reflections, contrary to their active counterparts.
For Minimum Mean Square Error estimate.
The Sampling Importance Resampling (SIR) algorithm, shown in Algorithm 1, is fully described by the prior p(x_0), the dynamics pdf p(x_k | x_{k-1}) and the observation pdf p(z_k | x_k). After initialization from an independent identically distributed (i.i.d.) sequence drawn from p(x_0), the particles stochastically evolve, being sampled from an importance function q(x_k | x_{k-1}^{(i)}, z_k). They are then suitably weighted to guarantee the consistency of the approximation (1).
The resampling step draws indices s^{(i)} according to P(s^{(i)} = j) = w_k^{(j)}, so that the resampled set x_k^{(s^{(1)})}, ..., x_k^{(s^{(N)})} is i.i.d. according to the approximated posterior.
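For concreteness, one SIR iteration (steps 5-11 of Algorithm 1) might look as follows; the `propose`, `trans_pdf`, `lik_pdf` and `prop_pdf` callables are hypothetical placeholders for the importance sampler, the dynamics, the likelihood and the importance density:

```python
import numpy as np

def sir_step(particles, weights, z, propose, trans_pdf, lik_pdf, prop_pdf):
    """One SIR iteration (steps 5-11 of Algorithm 1). `particles` is an
    (N, dim) array; the callables stand in for q's sampler, the dynamics
    p(x_k | x_{k-1}), the likelihood p(z_k | x_k) and the importance
    density q(x_k | x_{k-1}, z_k)."""
    N = len(particles)
    new = np.array([propose(x, z) for x in particles])              # step 6
    w = np.array([wp * lik_pdf(z, x) * trans_pdf(x, xp) / prop_pdf(x, xp, z)
                  for x, xp, wp in zip(new, particles, weights)])   # step 7
    w /= w.sum()                                                    # step 8
    mmse = np.average(new, axis=0, weights=w)                       # step 10
    idx = np.random.choice(N, size=N, p=w)                          # step 11
    return new[idx], np.full(N, 1.0 / N), mmse
```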
\[ p(x_k^{(i)} \mid z_k^{1}, \ldots, z_k^{L}) = \sum_{l=1}^{L} \kappa_l\, p(x_k^{(i)} \mid z_k^{l}), \qquad \text{with} \quad \sum_{l=1}^{L} \kappa_l = 1. \tag{2} \]
\[ q(x_k^{(i)} \mid x_{k-1}^{(i)}, z_k) = \alpha\, p(x_k^{(i)} \mid z_k) + \beta\, p(x_k^{(i)} \mid x_{k-1}^{(i)}) + (1 - \alpha - \beta)\, p_0(x_k^{(i)}), \tag{3} \]

with α, β ∈ [0, 1]. The face recognition probabilities P(C_t | F, z) entering the detector-driven term are thresholded as

\[ \begin{cases} \forall t,\; P(C_t \mid F, z) = 0 \ \text{and} \ P(C_\emptyset \mid F, z) = 1 & \text{when } L_t < \tau \ \text{for all } t, \\ \forall t,\; P(C_t \mid F, z) = \dfrac{L_t}{\sum_p L_p} \ \text{and} \ P(C_\emptyset \mid F, z) = 0 & \text{otherwise.} \end{cases} \]

Besides the importance function, the measurement function involves visual cues, which must be persistent but are more prone to ambiguity in cluttered scenes. An alternative is to consider multi-cue fusion in the weighting stage. Given L measurement sources z_k^1, ..., z_k^L, and assuming the latter are mutually independent conditioned on the state, the unified measurement function can then be factorized as follows:
\[ p(z_k^{1}, \ldots, z_k^{L} \mid x_k^{(i)}) \propto \prod_{l=1}^{L} p(z_k^{l} \mid x_k^{(i)}). \tag{4} \]
The likelihood in the weighting stage is more or less conventional. It is computed thanks to (4) by means of several measurement functions based on persistent visual cues, namely: (i) edges to model the silhouette [25] and (ii) multiple color distributions to represent the person's appearance (both head and torso) [34]. Despite its simplicity, our measurement function is inexpensive while still providing some level of person discrimination based on clothing appearance. Our importance function, in contrast, is unique in the literature and is detailed below.
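A minimal sketch of the factorized likelihood (4), with the per-cue measurement functions (e.g. edge- and color-based) passed in as hypothetical callables:

```python
def unified_likelihood(x, z, cue_likelihoods):
    """Factorized measurement function of Eq. (4): the particle weight is
    the product of per-cue likelihoods p(z^l | x), the cues being assumed
    mutually independent conditioned on the state x."""
    p = 1.0
    for lik in cue_likelihoods:  # e.g. [edge_lik, head_color_lik, torso_color_lik]
        p *= lik(z, x)
    return p
```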
\[ p(x \mid z^{s}) \propto \sum_{j=1}^{N_B} P(C \mid F_j, z)\; \mathcal{N}\!\big(x;\, p_j,\, \mathrm{diag}(\sigma_{u_j}^2, \sigma_{v_j}^2)\big), \]

while the RFID detections contribute through the azimuth model of Section 3:

\[ p(x_k^{(i)} \mid z_k^{r}) \propto \mathcal{N}\!\big(\theta_{x_k^{(i)}};\, \mu_{\theta_{tag}}, \sigma_{\theta_{tag}}\big). \]
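Sampling from the mixture importance function (3) then amounts to choosing, per particle, between the detector-driven proposal (built from the face and RFID saliency maps above), the dynamics, and the prior. A sketch under these assumptions (the sampler names are ours):

```python
import numpy as np

def sample_importance(x_prev, alpha, beta, sample_detector, sample_dynamics,
                      sample_prior, rng=np.random):
    """Draw one particle from the mixture importance function (3):
    with probability alpha from the detector-driven proposal p(x_k | z_k),
    with probability beta from the dynamics p(x_k | x_{k-1}), and
    otherwise from the prior p0."""
    u = rng.uniform()
    if u < alpha:
        return sample_detector()        # image/RFID detection proposal
    if u < alpha + beta:
        return sample_dynamics(x_prev)  # prediction from the dynamics
    return sample_prior()               # diffuse reinitialization
```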
Fig. 5. (a) Original image, (b) skin probability image p(x_k | z_k^c), (c) face detection p(x_k | z_k^s), (d) azimuthal angle from RFID detection p(x_k | z_k^r), (e) unified importance function (3) (without dynamics), (f) accepted particles (yellow dots) after rejection sampling. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
1: Draw x_k^{(i)} ~ g(x_k)
2: Draw u ~ U([0, 1])
3: Evaluate the ratio q(x_k^{(i)} | x_{k-1}^{(i)}, z_k) / (M g(x_k^{(i)}))
4: if u ≤ q(x_k^{(i)} | x_{k-1}^{(i)}, z_k) / (M g(x_k^{(i)})) then
5:   accept x_k^{(i)}
6: else
7:   reject it
8: end if
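A compact rendering of this accept/reject loop, assuming an instrumental density g with q ≤ M g everywhere (the function names are ours):

```python
import numpy as np

def rejection_sample(g_sample, g_pdf, q_pdf, M, rng=np.random):
    """Rejection sampling as sketched above: propose x ~ g and accept it
    with probability q(x) / (M g(x)), assuming q <= M g. Returns the
    first accepted sample."""
    while True:
        x = g_sample()                    # draw from the instrumental pdf g
        u = rng.uniform()
        if u * M * g_pdf(x) <= q_pdf(x):  # accept with prob q(x)/(M g(x))
            return x                      # particle (re)positioned on target
```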
Using ω_r instead of ω_p allows the whole robotic system (and not only the camera) to be oriented towards the targeted person, improving the task execution.
5.2. Control design
Thus, we aim at designing three controllers (v_r, ω_r, ω_t) to orient the camera and to move the robot, so that the person to be followed is always kept in its line of sight and at a suitable social distance from the vehicle. To this aim, we use the data provided by the tracker, namely the coordinates (u_gc, v_gc) of the head gravity center in the image and its associated scale s_gc, which coarsely characterizes the human/robot distance. From these data, we define the error function E_ptv to be regulated to zero:

\[ E_{ptv} = (E_u \;\; E_v \;\; E_s)^T = (u_{gc} - u_i \quad v_{gc} - v_i \quad s_{gc} - s_i)^T, \]

where u_i and v_i represent the image center coordinates and s_i the predefined scale corresponding to an acceptable social distance denoted by d_follow. E_u represents the abscissa error in the image and can be regulated to zero by acting on the robot angular speed ω_r; E_v corresponds to the ordinate error in the image and can be decreased to zero thanks to the PTU tilt velocity ω_t; and E_s is the scale error, regulated to zero using the robot linear velocity v_r. We design three PID controllers as follows:
\[ \begin{cases} \omega_r = K_{pp} E_u + K_{ip} \int E_u \, dt + K_{dp} \dfrac{dE_u}{dt}, \\[4pt] \omega_t = K_{pt} E_v + K_{it} \int E_v \, dt + K_{dt} \dfrac{dE_v}{dt}, \\[4pt] v_r = K_{pv} E_s + K_{iv} \int E_s \, dt + K_{dv} \dfrac{dE_s}{dt}. \end{cases} \]
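A minimal discrete-time rendering of these three controllers; the gain values below are placeholders, not the tuned gains of the paper, and the 6 Hz sampling period matches the framerate reported in Section 6:

```python
class PID:
    """Discrete PID controller, one per error component (E_u, E_v, E_s);
    the gains mirror the K_p*, K_i*, K_d* of the equations above."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = None

    def update(self, err):
        self.integral += err * self.dt
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

# Hypothetical gains, for illustration only.
omega_r = PID(kp=0.8, ki=0.05, kd=0.1, dt=1/6)   # abscissa error E_u -> robot rotation
omega_t = PID(kp=0.8, ki=0.05, kd=0.1, dt=1/6)   # ordinate error E_v -> PTU tilt
v_r     = PID(kp=0.5, ki=0.02, kd=0.05, dt=1/6)  # scale error E_s -> linear speed
```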
Fig. 8. A typical run without human obstacles: the solid red/dashed blue curves represent the robot/tagged person positions, while the thin purple lines show the direction of the camera. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 7. Rackham.
Velocity continuity is not explicitly handled because the robot and the PTU are sufficiently fast systems.
6. Integration and live experiments
6.1. Rackham description and software architecture
Rackham is an iRobot B21r mobile platform. Its standard equipment has been extended with a digital camera mounted on a Directed Perception pan-tilt unit, an ELO touch screen, a pair of loudspeakers, an optical fiber gyroscope, wireless Ethernet and the RFID system previously described in Section 3 (Fig. 7). All these devices enable Rackham to act as a service robot in utilitarian public areas. It embeds robust human-robot interaction abilities and efficient basic navigation skills.
We have developed three software modules, namely ICU (which stands for "I see you"), RFID and Visuserv, which respectively encapsulate human recognition/tracking, RFID localization, and visual servoing. These modules have been implemented within the LAAS architecture [1] using a C/C++ interfacing scheme. The OpenCV library10 is used for low-level feature extraction, e.g. edge or face detection. The entire system operates at an average framerate of 6 Hz.
10
See http://sourceforge.net/projects/opencvlibrary/.
The visual contact rate (VCR) is defined as the ratio of the frames where the targeted person is in the field of view to the total number of frames. This indicator indirectly measures the tracker's robustness to artifacts such as occlusions and sporadic target losses due to crowds.
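Computing the VCR from per-frame visibility flags is then immediate (a sketch; the flag array is assumed to come from ground-truth annotation):

```python
def visual_contact_rate(in_view_flags):
    """VCR: fraction of frames where the tracked person lies in the camera
    field of view; in_view_flags holds one boolean per processed frame."""
    return sum(in_view_flags) / len(in_view_flags)
```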
11
If the user outpaces the robot and is lost, the vehicle is stopped for safety's sake.
Fig. 9. Snapshots of a trial. Notice that the pan-tilt unit azimuthal position is given by the red arc on the RFID map. The blue and green squares respectively depict the face detection (person gazing at the camera) and the MMSE estimate, while the yellow dots represent the particles before the resampling step. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 10. Synchronization of the data flow outputs between the different modules (ICU and RFID, including θ_tag and d_tag), with snapshots of the video streams with superimposed tracking outputs as well as the current RFID saliency map.
Fig. 11. A typical run with human obstacles: the solid red/dashed blue curves represent the robot/tagged person positions, while the thin purple lines show the direction of the camera. The black arrows represent the passers-by's paths. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 12. Snapshots of a run in crowds. The first line shows the current human-robot situation.
Table 1
Visual contact rate when considering 14 passers-by (mean VCR, with standard deviation in parentheses).

Sensor system    Number of passers-by
                 1            2            3            4 and more   Total
Vision only      0.21 (0.11)  0.22 (0.02)  0.18 (0.05)  0.22 (0.06)  0.21 (0.04)
Vision + RFID    0.94 (0.08)  0.85 (0.14)  0.94 (0.13)  0.83 (0.19)  0.86 (0.14)
[10] D. Calisi, L. Iocchi, R. Leone, Person following through appearance models and stereo vision using a mobile robot, in: International Conference on Computer Vision Theory and Applications (VISAPP'07), Barcelona, Spain, March 2007.
[11] B. Castano, M. Rodriguez, An artificial intelligence and RFID system for people detection and orientation in big surfaces, in: International Multi-Conference on Engineering and Technological Innovation (IMETI'08), Orlando, USA, June 2008.
[12] G. Cielniak, A. Lilienthal, T. Duckett, Improved data association and occlusion handling for vision-based people tracking by mobile robots, in: International Conference on Intelligent Robots and Systems (IROS'07), San Diego, USA, November 2007.
[13] P.I. Corke, Visual Control of Robots: High Performance Visual Servoing, Research Studies Press Ltd., 1996.
[14] J. Cui, H. Zha, H. Zhao, R. Shibasaki, Multimodal tracking of people using laser scanners and video camera, Image and Vision Computing (IVC'08) 26 (2) (2008) 240–252.
[15] B. Espiau, F. Chaumette, P. Rives, A new approach to visual servoing in robotics, IEEE Transactions on Robotics and Automation 8 (3) (1992) 313–326.
[16] T. Fong, I. Nourbakhsh, K. Dautenhahn, A survey of socially interactive robots, Robotics and Autonomous Systems (RAS'03) 42 (2003) 143–166.
[17] D.M. Gavrila, Multi-cue pedestrian detection and tracking from a moving vehicle, International Journal of Computer Vision (IJCV'07) 73 (1) (2007) 41–59.
[18] T. Germa, F. Lerasle, T. Simon, Video-based face recognition and tracking from a robot companion, International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI'09) 23 (March) (2009) 591–616.
[19] R. Gockley, J. Forlizzi, R. Simmons, Natural person-following behavior for social robots, in: International Conference on Human Robot Interaction (HRI'07), Washington, USA, March 2007, pp. 17–24.
[20] N.J. Gordon, D.J. Salmond, A.F.M. Smith, Novel approach to nonlinear/non-Gaussian Bayesian state estimation, IEE Proceedings F: Radar and Signal Processing 140 (2) (1993) 107–113.
[21] D. Hahnel, W. Burgard, D. Fox, K. Fishkin, M. Philipose, Mapping and localization with RFID technology, in: International Conference on Robotics and Automation (ICRA'04), April 2004, pp. 1015–1020.
[22] R. Hammoud, J. Davis, Advances in vision algorithms and systems beyond the visible spectrum, Computer Vision and Image Understanding (CVIU'07) 106 (2) (2007) 145–147.
[23] C. Huang, H. Ai, Y. Li, S. Lao, High-performance rotation invariant multi-view face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI'07) 29 (4) (2007) 671–686.
[24] T. Ikeda, H. Ishiguro, T. Nishimura, People tracking by cross modal association of vision and acceleration sensors, in: International Conference on Intelligent Robots and Systems (IROS'07), San Diego, USA, November 2007, pp. 4147–4151.
[25] M. Isard, A. Blake, CONDENSATION: conditional density propagation for visual tracking, International Journal on Computer Vision 29 (1) (1998) 5–28.
[26] M. Isard, A. Blake, ICONDENSATION: unifying low-level and high-level tracking in a stochastic framework, in: European Conference on Computer Vision (ECCV'98), Freiburg, Germany, June 1998, pp. 893–908.
[27] T. Kanda, M. Shiomi, L. Perrin, T. Nomura, H. Ishiguro, N. Hagita, Analysis of people trajectories with ubiquitous sensors in a science museum, in: International Conference on Robotics and Automation (ICRA'07), Roma, Italy, April 2007, pp. 4846–4853.
[28] B. Kar, S. Bhatia, P. Dutta, Audio-visual biometric based speaker identification, in: International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'07), Sivakasi, India, December 2007, pp. 94–98.
[29] M. Kobilarov, G. Sukhatme, J. Hyams, P. Batavia, People tracking and following with mobile robot using an omnidirectional camera and laser, in: International Conference on Robotics and Automation (ICRA'06), Orlando, USA, May 2006, pp. 557–562.
[30] J. Lee, W. Lee, D. Jeong, Object tracking method using back-projection of multiple color histogram models, in: International Symposium on Circuits and Systems (ISCAS'03), June 2003.
[31] T. Mori, Y. Suemasu, H. Noguchi, T. Sato, Multiple people tracking by integrating distributed floor pressure sensors and RFID system, in: International Conference on Systems, Man and Cybernetics, The Hague, Netherlands, October 2004, pp. 5271–5278.
[32] R. Muñoz-Salinas, M. García-Silvente, R. Medina-Carnicer, Adaptive multi-modal stereo people tracking without background modelling, Journal of Visual Communication and Image Representation 19 (2) (2008) 75–91.
[33] K. Nickel, T. Gehrig, H. Ekenel, R. Stiefelhagen, J. McDonough, A joint particle filter for audio-visual speaker tracking, in: International Conference on Multimodal Interfaces (ICMI'05), Trento, Italy, 2005, pp. 61–68.
[34] P. Pérez, J. Vermaak, A. Blake, Data fusion for visual tracking with particles, Proceedings of the IEEE 92 (3) (2004) 495–513.
[35] S.S. Takahashi, J. Wong, M. Miyamae, A ZigBee-based sensor node for tracking people's locations, in: ACM International Conference, Sydney, Australia, May 2008, pp. 34–38.
[36] D. Schulz, W. Burgard, D. Fox, A. Cremers, Tracking multiple moving targets with a mobile robot using particle filters and statistical data association, in: International Conference on Robotics and Automation (ICRA'01), Seoul, Korea, May 2001.
[37] D. Schulz, D. Fox, J. Hightower, People tracking with anonymous and ID-sensors using Rao-Blackwellised particle filters, in: International Joint Conference on Artificial Intelligence (IJCAI'03), Acapulco, Mexico, August 2003.
[38] J. Smith, K. Fishkin, B. Jiang, A. Mamishev, RFID-based techniques for human-activity detection, Communications of the ACM 48 (9) (2005) 39–44.
[39] L. Spinello, R. Triebel, R. Siegwart, Multimodal detection and tracking of pedestrians in urban environments with explicit ground plane extraction, in: AAAI Conference on Artificial Intelligence (AAAI'08), Chicago, USA, July 2008, pp. 1409–1414.
[40] Y. Tsai, H. Shih, C. Huang, Multiple human objects tracking in crowded scenes, in: International Conference on Pattern Recognition (ICPR'06), Hong Kong, August 2006, pp. 51–54.
[41] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: International Conference on Computer Vision and Pattern Recognition (CVPR'01), 2001.
[42] P. Viola, M. Jones, Fast multi-view face detection, in: International Conference on Computer Vision and Pattern Recognition (CVPR'03), Madison, USA, June 2003.