G.J. Burghouts, P. van Slingerland, R.J.M. ten Hove, R.J.M. den Hollander and K. Schutte
1. Introduction
Early detection of threats is crucial to guarantee safety and security in a wide variety of circumstances. Particular cases that are studied in this paper are theft of cargo from a truck by stealing or inspecting the truck, or attacking the truck driver at a parking lot [4]. These threats are complex events, because they involve sequences of human activities [3]. For instance, first the thief waits a while, loiters, approaches the truck while other people are away, inspects its exterior and breaks it open. Threats also involve particular walking trajectories, and interactions with other persons (collaborator, pedestrians, or truck driver) and objects (truck). We characterize the interactions by threat stages, such as the approaching of the driver or truck, or waiting, or close proximity. In this paper, we aim to recognize the high-level concept of threats, by models that use these various features, of which some are low-level (e.g., track, motion patterns, appearance), whereas other features are intermediate level (stages, activities, kinematics). The hierarchy of features is shown in Figure 1, where at the high level two methods are presented to perform the threat detection: a rule engine that captures world knowledge about threats, and an SVM that [...]

Figure 1. Threat detection by a rule engine and an SVM method, using low-level and intermediate level features.

2. Low-Level Features
This section discusses the low-level features of our hierarchical system: tracks of persons in the scene, STIP features to capture motion patterns within the bounding boxes of these tracks, and the appearance-based (colour) likelihood that a bounding box corresponds to the driver.
The tracks are associations of the object bounding boxes in a sequence of frames. The bounding boxes result from object detection by a combination of motion detection and change detection. Heuristics are used to handle common detection issues such as fragmented detections and merged detections. For details, we refer to [7]. We used the tracks provided by University of Reading for the PETS 2014 workshop.
[...] need to capture a person’s detailed motion patterns as well. For this purpose we use Space-Time Interest Point (STIP) features [8], which are known to be distinctive for action recognition [5] and a valuable addition to track- and object-based features [6]. We compute the STIPs for each frame, and for each bounding box of each track we aggregate the STIPs within that box. For details, we refer to [4].
In general, the threat level will depend on whether the truck driver is involved in the action. For example, there is no threat when the driver inspects his truck, but there potentially is a threat when other people approach and touch the truck. It is therefore useful to discern the driver from other persons in the scene. Since the driver actors wore yellow vests during the experiments (see left image of Figure 2), we describe here a low-level algorithm that detects whether persons are wearing a yellow vest as opposed to differently coloured clothing. The algorithm returns for each bounding box a driver confidence, determined by the presence of a yellow blob at the expected position. The colour yellow is observed in RGB-space as high $R$ and $G$ components and a low $B$ component. As a first step, we measure yellowness of a pixel by the quantity $\min(R, G) - B$ and subsequently transform it to the range [0, 1], see middle image of Figure 2. This yields a colour weighted image, where the weights are boosted by taking into account the size of the blob, see right image of Figure 2. For boxes related to the driver, the yellow blob is expected in the upper half of the box, and it is expected to be sufficiently large. In a second step, a model transforms the vertical position and the relative size of the box to a confidence value that this box is a driver, see Figure 3. The mapping provides some robustness to deal with variations of the yellow blobs, e.g., it may become smaller when bounding boxes are oversized or the yellow vests are not in frontal view. For actual truck protection systems we expect driver identification to be performed by other means, such as the positioning capabilities of his smartphone.

Figure 3. The vertical position and relative size of the yellow blob (horizontal axes: height (%) and area (%)) and the corresponding driver confidence (vertical axis: weight).

3. Intermediate-Level Features
Using the low-level features of the previous section as input, this section discusses the intermediate-level features of our hierarchical system. STIPs are used to estimate activities like walking and falling. Tracks are used to obtain features related to kinematics and localization. Kinematics and localization are combined into new intermediate level features called threat stages related to theft, inspection or entering of the truck, or attacks to the driver.

3.1. Activities and Kinematics
First, the STIPs are used to detect activities. Within each scenario, the main activities observed are: walk, run, loiter, turn, open, enter, exit, hit/push/attack, fall, stand up, pickup,
bend, and give. To detect these activities, we use a Random Forest based bag-of-features approach [5] to transform the low-level STIP features into visual words and represent each (one-second) track fragment as a frequency count of the words, which is classified by an SVM. The classifier serves as the detector for each activity. For each one-second fragment of the track, we obtain a posterior probability for each of the activities, which is used as a feature for threat detection (Section 4). For details we refer to [5, 6].
Second, the tracks are used to derive kinematic and localization information about the person. Kinematics and localization play an important role in the interpretation of the scene [9]. Using the low-level tracks as input, we compute the speed, orientation and travelled distance in both image and real-world coordinates. Furthermore, we obtain the distance and direction with respect to the truck (as well as specific truck parts: doors, cargo, and screen), the driver (i.e. the track with the highest driver confidence level), the bicycle house, and the road.

3.2. Threat stages
Threats follow sequences of particular stages. For instance, a typical stealing threat involves the following stages: a criminal loiters near the truck (wait). When the driver is out of sight, he walks towards the truck (approach), until he is sufficiently close to the truck to touch and steal from it (at(tack)), after which he moves away from the truck (leave). Similarly, a typical attacking scenario reads: a criminal loiters near the driver (wait). When the driver is close, he runs towards the driver (approach), until he is sufficiently close to the driver to hit him (at(tack)), after which he moves away from the driver (leave).
These two typical examples motivate the distinction between a threat with respect to the driver and a threat with respect to the truck. Furthermore, they motivate the use of four stages: wait, approach, at, and leave. This gives rise to eight stages in total. Assuming that a track does not correspond to the driver, its stage with respect to the driver/truck is determined from three lower-level features: the distance to the driver/truck (categorized as “at”, “near”, or “far from”), the speed (categorized as “loiter”, “walk”, or “run”), and the direction with respect to the driver/truck (categorized as “towards”, “past”, or “away from”).¹ The relation between the stages and these three features is indicated in Table 1.
¹ The categories are defined manually, e.g. loiter: 0-3 km/h, walk: 3-6 km/h, etc.
Finally, we introduce two additional stages. The first is the “driver” stage, which overrides the aforementioned stages if the driver likelihood exceeds the threshold level (0.1). The second is a stage labelled “normal”, which is assigned if none of the other stages is assigned (or when considering a threat with respect to the driver and no driver (stage) is present at that time): tracks in this stage are typically people far away or walking by. We assign one stage to any frame of any track for both the attack-driver and the steal-from-truck cases.

stage      distance   speed      direction
wait       near       loiter     -
approach   near       walk/run   towards
at         at         -          -
leave      near       walk/run   away

Table 1. Relation between threat stages and the lower-level features “distance” (to the driver/truck), “speed”, and “direction” (with respect to the driver/truck).

4. High-Level Threat Detection
This section discusses two methods for high-level threat detection: an SVM and a rule-based method. For both methods, three threat classes are modelled: normal, threat with respect to the driver (e.g., attack) and a threat with respect to the truck (e.g., inspect). We summarize the SVM method below; for details we refer to [7]. We focus on the rule-based method, because it is novel. Both methods are based on the rationale to classify a track as normal vs. threat.

4.1. SVM-based detection
The trajectory, kinematics, presence in zones, activities and threat stages yield a feature vector per frame for each track. The threat stages are encoded for each stage segment (see previous section) by a number from 0 to 5, ordered by degree of threat likelihood. The activities are encoded for each second of a track and they are represented by 12 values between 0 and 1, where each value is the probability that the particular activity is present. The trajectory and kinematic features are stored per frame of a track. At each frame of a track, all above features are concatenated into a feature vector. Each track is encoded by a bag-of-features model. This model quantizes each feature vector of the track into a visual word, by means of a Random Forest. The forest is obtained by training using the threat labels that were annotated for each track in the dataset. We expect the Random Forest part of the bag-of-words model to provide feature selection capabilities, allowing us to provide a rather large feature vector. Together with the three threat class labels, a threat-SVM model is obtained. We use the $\chi^2$ kernel, which showed better performance than the radial basis function. For each track, the SVM is applied to obtain posterior probabilities for each of the three classes.

4.2. Rule-based threat detection
The rule-based system estimates the threat level of a track based on a combination of threat stages and STIP activities.
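The four steps of this combination are detailed in the remainder of this section. As a compact companion to them, the following Python sketch assigns a stage to a frame according to Table 1 and then applies steps 2 to 4 to a single tracklet. It is only an illustration under stated assumptions, not the implementation used in the experiments: the function names, the dictionary-based inputs and the activity key `hit_push_attack` are hypothetical; the distance, speed and direction categories are assumed to be given already; the largest activity confidence of step 4 is read as the largest confidence among the selected activities; and ties between stage fractions are resolved by dictionary order.

```python
from collections import Counter

# Stage weights of the initial threat level (step 2); normal and driver weigh 0.
STAGE_WEIGHTS = {"wait": 1 / 6, "approach": 1 / 3, "at": 1 / 2,
                 "leave": 1 / 6, "normal": 0.0, "driver": 0.0}


def assign_stage(distance, speed, direction, driver_confidence):
    """Stage of one frame following Table 1 (inputs are category strings)."""
    if driver_confidence > 0.1:      # the driver stage overrides the others
        return "driver"
    if distance == "at":
        return "at"
    if distance == "near":
        if speed == "loiter":
            return "wait"
        if speed in ("walk", "run") and direction == "towards":
            return "approach"
        if speed in ("walk", "run") and direction == "away":
            return "leave"
    return "normal"                  # none of the other stages applies


def tracklet_threat(stages, activity_conf, target="truck"):
    """Threat level of one tracklet (steps 2 to 4).

    stages: non-empty list of per-frame stage labels of the tracklet.
    activity_conf: dict mapping an activity name to its confidence in [0, 1].
    target: "truck" or "driver".
    """
    counts = Counter(stages)
    x = {s: counts[s] / len(stages) for s in STAGE_WEIGHTS}
    # Step 2: initial threat level as a weighted sum of the stage fractions.
    threat = sum(STAGE_WEIGHTS[s] * x[s] for s in STAGE_WEIGHTS)
    dominant = max(x, key=x.get)
    # Step 3: driver-dominated or non-human tracklets are non-threatening.
    if dominant == "driver" or max(activity_conf.values(), default=0.0) <= 0.01:
        threat = 0.0
    # Step 4: STIP-confirmed threats; for the driver the dominant stage is ignored.
    if target == "truck":
        selected, applies = ("enter", "open", "give"), dominant == "at"
    else:
        selected, applies = ("hit_push_attack",), True
    c_max = max(activity_conf.get(a, 0.0) for a in selected)
    if applies and c_max > 0.1:
        threat = min(0.5 + 0.5 * c_max, 1.0)
    return threat


def track_threat(tracklets, target="truck"):
    """Threat level of a track: the largest level among its tracklets."""
    return max(tracklet_threat(s, c, target) for s, c in tracklets)
```

For a tracklet that lies entirely in the “at” stage with an “open” confidence of 0.6, this sketch yields an initial level of 0.5 in step 2 and 0.5 + 0.5 × 0.6 = 0.8 after step 4. Segmenting a track into tracklets (step 1) is left out on purpose.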
This system (including the selection of features and threshold values) was trained and optimized for a different subset of sequences (i.e., for the final demo of the ARENA project [2]) than those studied in this paper.
For threats with respect to the truck, the method applies the following four main steps:
Step 1 (segmentation of a track): Recall that each frame of the track is assigned one of six stages. Based on these stages, the track is segmented into shorter tracklets, such that frames with the same stage are grouped together. To avoid too much fragmentation, tracklets shorter than 0.5 seconds are merged again with adjacent tracklets, until all tracklets have a duration of at least 0.5 seconds.
Step 2 (initial threat level): For each tracklet, we compute the fraction $x_s$ of frames in stage $s \in \{\text{wait}, \text{approach}, \text{at}, \text{leave}, \text{normal}, \text{driver}\}$. The sum of these fractions is equal to 1 as each frame is in exactly one stage. An initial threat level can then be defined as: $T = \frac{1}{6} x_{\text{wait}} + \frac{1}{3} x_{\text{approach}} + \frac{1}{2} x_{\text{at}} + \frac{1}{6} x_{\text{leave}}$. These heuristically set weights are motivated by the idea that approaching the truck is more threatening than waiting near it; being at the truck is more threatening than approaching it; leaving the truck is suspicious but also actually too late to raise an alarm, which should not come at the cost of false positives; and normal/driver behaviour is not considered threatening at all. Furthermore, the weights are chosen such that the stage features by themselves don’t raise alarms larger than $T = 0.5$ (higher values can be reached in step 4 though).
Step 3 (exceptions for normal behaviour): If the dominant stage of the tracklet is driver, i.e. if $x_{\text{driver}}$ is the largest of the computed fractions $x_s$, then the (entire) tracklet is assumed to be the driver. At the same time, if none of the thirteen STIP activities show a confidence larger than 0.01, the tracklet is assumed to be something other than a human being (but e.g. a door or lamppost accidentally tracked). In both cases, the tracklet is considered non-threatening and we reset $T = 0$.
Step 4 (STIP-confirmed threats): If the dominant stage of the tracklet is “at” (the truck), we analyse the STIP activities “enter”, “open” and “give”. If the confidence of one or more of these selected activities exceeds 0.1, and $c_{\max}$ is the largest activity confidence observed for the tracklet, then the threat level is set to $T = \min\{0.5 + 0.5\, c_{\max}, 1\}$.
Finally, the threat level of the original track is the largest threat level observed for its tracklets. For threats with respect to the driver, we proceed as above but then using stages relative to the driver. Furthermore, in step 4, we analyse the STIP activity “hit push attack” (rather than “enter”, “open” and “give”), regardless of the dominant stage. The latter is to avoid false negatives for attacks to the driver, as these often appear as a single bounding box for multiple fighting people, and especially during an attack the driver’s distinguishing vest is not always clearly visible.
An example of an outcome of the rule-based system is given in Figure 4, for a track of 46 seconds which shows theft from the truck. First, the person is waiting, before approaching the truck, then waiting again. Second, he stands next to the truck, where he shows behaviour of entering and opening, while staying in close proximity of the truck. All these threat stages are identified by the method.

Figure 4. Rule-based detection of stealing from truck for sequence 14 01. Only tracklets with a threat level larger than 15% are displayed. The most relevant stage or STIP activity is indicated between brackets.

For the rule-based system, the STIP activity “hit push attack” is by far the most important feature for threats with respect to the driver. For the truck, the distance to the truck as well as the STIP activities “enter”, “open” or “give” are important. The stage features, in which the driver likelihood plays an important role, are helpful to make a pre-selection of the tracklets, indicating which should be analysed in more detail or not. This generic approach may be useful to enhance the computational efficiency for more extensive applications (where the number of tracks is large).

5. Threat Detection Results
This section compares the performance of the SVM and the rule-based method (Section 4).

5.1. Experimental setup
Our dataset is the ARENA dataset that is provided on the PETS 2014 workshop website [1]. We consider all videos for the 24 sequences and the 4 cameras. Out of the 24 sequences, 2 sequences are normal (without threats or abnormalities), 6 sequences contain some abnormalities, 6 sequences involve potentially criminal behaviour and 10
sequences involve criminal behaviour. We have used the [...] each track a label by manual annotation, which serves as [...]

[Figure: ROC plot titled “attack to person: 13 targets out of 1070 total”; vertical axis: TPs, horizontal axis: FPs.]

Figure 6 displays the ROCs with respect to the truck. Interestingly, here the SVM method performs better for lower false alarm rates than the rule-based method. This finding for truck threats is contrary to the outcome of the driver threats. Two properties of the truck threats are different than the driver threats: more positive samples for training (52 vs. 13), and a less clear type of threat. Where the driver threat has a clear characteristic of attacking the driver, for which there is an obvious feature, the truck threat is posed in various forms, i.e., inspecting the truck, fiddling with the exterior, opening the door and breaking it open. This can happen in many different ways, so it is harder to make a good description with rules. On the other hand, with several samples for each of the threat variations, the SVM method is able to learn an adequate threat model. A complicating [...]

Figure 6. ROCs for threats with respect to the truck.

6. Conclusions
We have found that the SVM and the rule-based method each have their strengths and weaknesses for threat detection. The SVM method requires sufficient training samples; we found that 13 samples for the driver threats were not sufficient. When an event such as a threat to the driver can be characterised well by distinguishing intermediate-level features, a rule-based system is most suitable. With
the rule-based method, very good results were obtained for detecting the threats to the driver. The SVM performed badly for this type of threat. However, for handling uncertainty in the lower-level features and for interpreting variations of threats, such as threats to the truck, the SVM is to be favoured when sufficient training samples are available. For threats to the truck, we found that 52 threat samples resulted in a much better performance than could be achieved with rules. For threats that appear in many variations, rules are not adequate. We learned that intermediate-level features such as activities and the distance to the driver and the truck are important elements to take into account. Future work will focus on more complex temporal and spatial relations and practical criteria to switch dynamically between the SVM and the rule-based approach. Future research, and evaluation measures, should emphasize the need for early threat detection, as current reaction times are really short. To increase the performance of the SVM method in an adaptive manner, a user feedback loop could be implemented as an extension, to obtain more threat samples, which has been shown to be an important criterion to take into account.

Acknowledgements
We are grateful to the University of Reading for providing the tracks. This work was supported in part by the EU FP7 project ARENA (Architecture for the recognition of threats to mobile assets using networks of multiple affordable sensors), within the security research program.

References
[1] ARENA dataset. http://www.cvg.rdg.ac.uk/PETS2014/a.html.
[2] ARENA project, EU FP7 security. http://www.arena-fp7.eu.
[3] J. Aggarwal and M. Ryoo. Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3):16, 2011.
[4] M. Andersson, J. Patino, G. Burghouts, A. Flizikowski, M. Evans, D. Gustafsson, H. Petersson, K. Schutte, and J. Ferryman. Activity recognition and localization on a truck parking lot. In AVSS, pages 263–269. IEEE, 2013.
[5] G. Burghouts and K. Schutte. Spatio-temporal layout of human actions for improved bag-of-words action detection. Pattern Recognition Letters, 2013.
[6] G. Burghouts, K. Schutte, H. Bouma, and R. den Hollander. Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos. Machine Vision and Applications, 25(1):85–98, 2013.
[7] G. Burghouts, K. Schutte, R.-M. ten Hove, S. van den Broek, J. Baan, O. Rajadell, J. van Huis, J. van Rest, P. Hanckmann, H. Bouma, G. Sanroma, M. Evans, and F. J. Instantaneous threat detection based on a semantic representation of activities, zones and trajectories. Signal, Image and Video Processing (SIVP), 2014.
[8] I. Laptev. On space-time interest points. Int. J. Comput. Vision, 64(2-3):107–123, Sept. 2005.
[9] G. Sanromà, L. Patino, G. Burghouts, K. Schutte, and J. Ferryman. A unified approach to the recognition of complex actions from sequences of zone-crossings. Image and Vision Computing, 2014.