
2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)

Complex Threat Detection: Learning vs. Rules, using a Hierarchy of Features

G.J. Burghouts, P. van Slingerland, R.J.M. ten Hove, R.J.M. den Hollander and K. Schutte

TNO, P.O. Box 96864
2509 JG The Hague, The Netherlands
gertjan.burghouts@tno.nl

978-1-4799-4871-0/14/$31.00 ©2014 IEEE

Abstract

Theft of cargo from a truck or attacks against the driver are threats hindering the day-to-day operations of trucking companies. In this work we consider a system that uses surveillance cameras mounted on the truck to provide an early warning for such evolving threats. Low-level processing involves tracking people and calculating motion features. Intermediate-level processing provides kinematics and localisation, activity descriptions and threat stage estimates. At the high level, we compare threat detection performed with a statistically trained SVM-based classifier against a rule-based system. Results are promising, and show that the best system depends on the scenario.

1. Introduction
Early detection of threats is crucial to guarantee safety and security in a wide variety of circumstances. Particular cases studied in this paper are theft of cargo from a truck (stealing from or inspecting the truck) and attacks on the truck driver at a parking lot [4]. These threats are complex events, because they involve sequences of human activities [3]. For instance, first the thief waits a while, loiters, approaches the truck while other people are away, inspects its exterior and breaks it open. Threats also involve particular walking trajectories, and interactions with other persons (collaborator, pedestrians, or truck driver) and objects (truck). We characterize the interactions by threat stages, such as the approaching of the driver or truck, or waiting, or close proximity. In this paper, we aim to recognize the high-level concept of threats, by models that use these various features, of which some are low-level (e.g., track, motion patterns, appearance), whereas other features are intermediate-level (stages, activities, kinematics). The hierarchy of features is shown in Figure 1, where at the high level two methods are presented to perform the threat detection: a rule engine that captures world knowledge about threats, and an SVM that learns a threat model from labelled examples.

Figure 1. Threat detection by a rule engine and an SVM method, using low-level and intermediate-level features.

This paper builds on our recent work [7], which introduced the hierarchical setup and compared it to existing methods and approaches. The novelty of the current paper is that we add a low-level feature (driver detection based on his/her appearance), an intermediate-level feature (threat stages) and an alternative threat detection method (rule engine). In the experiments we compare the rule engine to the SVM method; both were submitted to the complex event detection task of the PETS workshop. In this paper we evaluate how well both methods are able to detect and localize two different types of threats in the PETS dataset [1], respectively to the driver (attack) and the truck (theft).
The paper is organized as follows. We follow the structure of the hierarchy of our method (Figure 1). Section 2 describes the low-level features, and Section 3 the intermediate-level features. Section 4 describes the threat detection methods, which are evaluated and compared in Section 5. Section 6 concludes with findings.

Figure 2. Example recording of driver (left), the quantity MIN(R,G)-B (middle), and the weighted result of this image (right).
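To make the quantity named in the Figure 2 caption concrete, the following is a minimal Python sketch of a per-pixel yellowness value, min(R, G) - B, mapped to [0, 1]. The paper does not state how the mapping to [0, 1] is done; clipping negative values to zero and dividing by 255 is an assumption made here, as are the names `yellowness` and `yellowness_map` and the 8-bit (R, G, B) pixel layout. The blob-size weighting of the right panel is not reproduced.

```python
from typing import List, Sequence, Tuple

# A pixel is an (R, G, B) triple; 8-bit channels (0..255) are an assumption.
Pixel = Tuple[int, int, int]


def yellowness(pixel: Pixel) -> float:
    """Yellowness min(R, G) - B of one pixel, mapped to the range [0, 1]."""
    r, g, b = pixel
    raw = min(r, g) - b  # between -255 and 255 for 8-bit input
    # Assumed mapping to [0, 1]: clip negatives to zero, then divide by 255.
    return max(raw, 0) / 255.0


def yellowness_map(image: Sequence[Sequence[Pixel]]) -> List[List[float]]:
    """Apply the per-pixel measure to an image given as rows of pixels."""
    return [[yellowness(p) for p in row] for row in image]
```

Pure yellow (255, 255, 0) receives the maximum value under this mapping, while white, blue and black pixels all receive zero, which is the behaviour the measure is meant to have.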

2. Low-Level Features
This section discusses the low-level features of our hierarchical system: tracks of persons in the scene, STIP features to capture motion patterns within the bounding boxes of these tracks, and the appearance-based (colour) likelihood that a bounding box corresponds to the driver.
The tracks are associations of the object bounding boxes in a sequence of frames. The bounding boxes result from object detection by a combination of motion detection and change detection. Heuristics are used to handle common detection issues such as fragmented detections and merged detections. For details, we refer to [7]. We used the tracks provided by University of Reading for the PETS 2014 workshop.
Next to the positional features obtained by tracking, we need to capture a person’s detailed motion patterns as well. For this purpose we use Space-Time Interest Point (STIP) features [8], which are known to be distinctive for action recognition [5] and a valuable addition to track- and object-based features [6]. We compute the STIPs for each frame, and for each bounding box of each track we aggregate the STIPs within that box. For details, we refer to [4].
In general, the threat level will depend on whether the truck driver is involved in the action. For example, there is no threat when the driver inspects his truck, but there potentially is a threat when other people approach and touch the truck. It is therefore useful to discern the driver from other persons in the scene. Since the driver actors wore yellow vests during the experiments (see left image of Figure 2), we describe here a low-level algorithm that detects whether persons are wearing a yellow vest as opposed to differently coloured clothing. The algorithm returns for each bounding box a driver confidence, determined by the presence of a yellow blob at the expected position. The colour yellow is observed in RGB-space as high $R$ and $G$ components and a low $B$ component. As a first step, we measure the yellowness of a pixel by the quantity $\min(R, G) - B$ and subsequently transform it to the range $[0, 1]$, see middle image of Figure 2. This yields a colour-weighted image, where the weights are boosted by taking into account the size of the blob, see right image of Figure 2. For boxes related to the driver, the yellow blob is expected in the upper half of the box, and it is expected to be sufficiently large. In a second step, a model transforms the vertical position and the relative size of the box to a confidence value that this box is a driver, see Figure 3. The mapping provides some robustness to deal with variations of the yellow blobs, e.g., a blob may become smaller when bounding boxes are oversized or the yellow vests are not in frontal view. For actual truck protection systems we expect driver identification to be performed by other means, such as using the positioning capabilities of the driver's smartphone.

Figure 3. The vertical position and relative size of the yellow blob (horizontal axes: height (%) and area (%)) and the corresponding driver confidence (vertical axis: weight).

3. Intermediate-Level Features
Using the low-level features of the previous section as input, this section discusses the intermediate-level features of our hierarchical system. STIPs are used to estimate activities like walking and falling. Tracks are used to obtain features related to kinematics and localization. Kinematics and localization are combined into new intermediate-level features called threat stages related to theft, inspection or entering of the truck, or attacks to the driver.

3.1. Activities and Kinematics
First, the STIPs are used to detect activities. Within each scenario, the main activities observed are: walk, run, loiter, turn, open, enter, exit, hit/push/attack, fall, stand up, pickup,
bend, and give. To detect these activities, we use a Random Forest based bag-of-features approach [5] to transform the low-level STIP features into visual words and represent each (one-second) track fragment as a frequency count of the words, which is classified by an SVM. The classifier serves as the detector for each activity. For each one-second fragment of the track, we obtain a posterior probability for each of the activities, which is used as a feature for threat detection (Section 4). For details we refer to [5, 6].
Second, the tracks are used to derive kinematic and localization information about the person. Kinematics and localization play an important role in the interpretation of the scene [9]. Using the low-level tracks as input, we compute the speed, orientation and travelled distance in both image and real-world coordinates. Furthermore, we obtain the distance and direction with respect to the truck (as well as specific truck parts: doors, cargo, and screen), the driver (i.e. the track with the highest driver confidence level), the bicycle house, and the road.

3.2. Threat stages
Threats follow sequences of particular stages. For instance, a typical stealing threat involves the following stages: a criminal loiters near the truck (wait). When the driver is out of sight, he walks towards the truck (approach), until he is sufficiently close to the truck to touch and steal from it (at(tack)), after which he moves away from the truck (leave). Similarly, a typical attacking scenario reads: a criminal loiters near the driver (wait). When the driver is close, he runs towards the driver (approach), until he is sufficiently close to the driver to hit him (at(tack)), after which he moves away from the driver (leave).
These two typical examples motivate the distinction between a threat with respect to the driver and a threat with respect to the truck. Furthermore, they motivate the use of four stages: wait, approach, at, and leave. This gives rise to eight stages in total. Assuming that a track does not correspond to the driver, its stage with respect to the driver/truck is determined from three lower-level features: the distance to the driver/truck (categorized as “at”, “near”, or “far from”), the speed (categorized as “loiter”, “walk”, or “run”), and the direction with respect to the driver/truck (categorized as “towards”, “past”, or “away from”). The categories are defined manually, e.g. loiter: 0-3 km/h, walk: 3-6 km/h, etc. The relation between the stages and these three features is indicated in Table 1.
Finally, we introduce two additional stages. The first is the “driver” stage, which overrides the aforementioned stages if the driver likelihood exceeds the threshold level (0.1). The second is a stage labelled “normal”, which is assigned if none of the other stages is assigned (or when considering a threat with respect to the driver and no driver (stage) is present at that time): tracks in this stage are typically people far away or walking by. We assign one stage to any frame of any track for both the attack-driver and the steal-from-truck cases.

stage      distance   speed      direction
wait       near       loiter     -
approach   near       walk/run   towards
at         at         -          -
leave      near       walk/run   away

Table 1. Relation between threat stages and the lower-level features “distance” (to the driver/truck), “speed”, and “direction” (with respect to the driver/truck).

4. High-Level Threat Detection
This section discusses two methods for high-level threat detection: an SVM and a rule-based method. For both methods, three threat classes are modelled: normal, threat with respect to the driver (e.g., attack) and threat with respect to the truck (e.g., inspect). We summarize the SVM method below; for details we refer to [7]. We focus on the rule-based method, because it is novel. Both methods classify a track as normal vs. threat.

4.1. SVM-based detection
The trajectory, kinematics, presence in zones, activities and threat stages yield a feature vector per frame for each track. The threat stages are encoded for each stage segment (see previous section) by a number from 0 to 5, ordered by degree of threat likelihood. The activities are encoded for each second of a track and they are represented by 12 values between 0 and 1, where each value is the probability that the particular activity is present. The trajectory and kinematic features are stored per frame of a track. At each frame of a track, all above features are concatenated into a feature vector. Each track is encoded by a bag-of-features model. This model quantizes each feature vector of the track into a visual word, by means of a Random Forest. The forest is obtained by training using the threat labels that were annotated for each track in the dataset. We expect the Random Forest part of the bag-of-features model to provide feature selection capabilities, allowing us to provide a rather large feature vector. Together with the three threat class labels, a threat-SVM model is obtained. We use the $\chi^2$ kernel, which showed better performance than the radial basis function. For each track, the SVM is applied to obtain posterior probabilities for each of the three classes.

4.2. Rule-based threat detection
The rule-based system estimates the threat level of a track based on a combination of threat stages and STIP activities.
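As a rough preview of how stage fractions and STIP activity confidences could combine into a threat level, the sketch below follows Steps 2 to 4 of this section for the truck case. It is an illustration under stated assumptions, not the authors' implementation: the function and constant names are hypothetical, activity confidences are assumed to arrive as a dictionary keyed by activity name, ties for the dominant stage are broken by the order of `STAGES`, and `c_max` is taken over the selected activities only (the text leaves open whether it ranges over all activities).

```python
from typing import Dict, Sequence, Tuple

STAGES = ("wait", "approach", "at", "leave", "normal", "driver")

# Heuristic stage weights of Step 2; "normal" and "driver" contribute nothing.
STAGE_WEIGHTS = {"wait": 1 / 6, "approach": 1 / 3, "at": 1 / 2,
                 "leave": 1 / 6, "normal": 0.0, "driver": 0.0}

# Activities that confirm a threat to the truck in Step 4.
TRUCK_ACTIVITIES = ("enter", "open", "give")


def stage_fractions(frame_stages: Sequence[str]) -> Dict[str, float]:
    """Fraction x_s of the (non-empty) tracklet's frames spent in each stage s."""
    n = len(frame_stages)
    return {s: sum(1 for f in frame_stages if f == s) / n for s in STAGES}


def tracklet_threat(frame_stages: Sequence[str],
                    activity_conf: Dict[str, float]) -> float:
    """Threat level T of one tracklet with respect to the truck."""
    x = stage_fractions(frame_stages)
    # Dominant stage; ties go to the earliest entry in STAGES (assumption).
    dominant = max(STAGES, key=lambda s: x[s])

    # Step 3: the driver, or something that shows no human activity, is no threat.
    looks_human = any(c > 0.01 for c in activity_conf.values())
    if dominant == "driver" or not looks_human:
        return 0.0

    # Step 2: initial threat level as a weighted sum of stage fractions.
    threat = sum(STAGE_WEIGHTS[s] * x[s] for s in STAGES)

    # Step 4: a confident "enter", "open" or "give" at the truck raises the level.
    if dominant == "at":
        c_max = max(activity_conf.get(a, 0.0) for a in TRUCK_ACTIVITIES)
        if c_max > 0.1:
            threat = min(0.5 + 0.5 * c_max, 1.0)
    return threat


def track_threat(tracklets: Sequence[Tuple[Sequence[str], Dict[str, float]]]) -> float:
    """Threat level of a track: the largest level over its tracklets."""
    return max(tracklet_threat(stages, conf) for stages, conf in tracklets)
```

In this sketch a tracklet that only waits near the truck stays at 1/6, whereas a tracklet at the truck with a confident "enter" activity moves above 0.5, which mirrors the intent that stage features alone never exceed 0.5.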
This system (including the selection of features and threshold values) was trained and optimized for a different subset of sequences (i.e., for the final demo of the ARENA project [2]) than those studied in this paper.
For threats with respect to the truck, the method applies the following four main steps:
Step 1 (segmentation of a track): Recall that each frame of the track is assigned one of six stages. Based on these stages, the track is segmented into shorter tracklets, such that frames with the same stage are grouped together. To avoid too much fragmentation, tracklets shorter than 0.5 seconds are merged again with adjacent tracklets, until all tracklets have a duration of at least 0.5 seconds.
Step 2 (initial threat level): For each tracklet, we compute the fraction $x_s$ of frames in stage $s$ = (wait, approach, at, leave, normal, driver). The sum of these fractions is equal to 1 as each frame is in exactly one stage. An initial threat level can then be defined as: $T = \frac{1}{6} x_{\text{wait}} + \frac{1}{3} x_{\text{approach}} + \frac{1}{2} x_{\text{at}} + \frac{1}{6} x_{\text{leave}}$. These heuristically set weights are motivated by the idea that approaching the truck is more threatening than waiting near it; being at the truck is more threatening than approaching it; leaving the truck is suspicious, but by then it is too late to raise an alarm, so it should not come at the cost of false positives; and normal/driver behaviour is not considered threatening at all. Furthermore, the weights are chosen such that the stage features by themselves do not raise alarms larger than $T = 0.5$ (higher values can be reached in step 4 though).
Step 3 (exceptions for normal behaviour): If the dominant stage of the tracklet is driver, i.e. if $x_{\text{driver}}$ is the largest of the computed fractions $x_s$, then the (entire) tracklet is assumed to be the driver. At the same time, if none of the thirteen STIP activities show a confidence larger than 0.01, the tracklet is assumed to be something other than a human being (e.g., a door or lamppost that was accidentally tracked). In both cases, the tracklet is considered non-threatening and we reset $T = 0$.
Step 4 (STIP-confirmed threats): If the dominant stage of the tracklet is “at” (the truck), we analyse the STIP activities “enter”, “open” and “give”. If the confidence of one or more of these selected activities exceeds 0.1, and $c_{\max}$ is the largest activity confidence observed for the tracklet, then the threat level is set to $T = \min\{0.5 + 0.5\, c_{\max},\ 1\}$.
Finally, the threat level of the original track is the largest threat level observed for its tracklets. For threats with respect to the driver, we proceed as above but then using stages relative to the driver. Furthermore, in step 4, we analyse the STIP activity “hit push attack” (rather than “enter”, “open” and “give”), regardless of the dominant stage. The latter is to avoid false negatives for attacks to the driver, as these often appear as a single bounding box for multiple fighting people, and especially during an attack the driver’s distinguishing vest is not always clearly visible.
An example of an outcome of the rule-based system is given in Figure 4, for a track of 46 seconds which shows theft from the truck. First, the person is waiting, before approaching the truck, then waiting again. Second, he stands next to the truck, where he shows behaviour of entering and opening, while staying in close proximity of the truck. All these threat stages are identified by the method.

Figure 4. Rule-based detection of stealing from truck for sequence 14 01. Only tracklets with a threat level larger than 15% are displayed. The most relevant stage or STIP activity is indicated between brackets.

For the rule-based system, the STIP activity “hit push attack” is by far the most important feature for threats with respect to the driver. For the truck, the distance to the truck as well as the STIP activities “enter”, “open” or “give” are important. The stage features, in which the driver likelihood plays an important role, are helpful for pre-selecting which tracklets should be analysed in more detail. This generic approach may be useful to enhance the computational efficiency for more extensive applications (where the number of tracks is large).

5. Threat Detection Results
This section compares the performance of the SVM and the rule-based method (Section 4).

5.1. Experimental setup
Our dataset is the ARENA dataset that is provided on the PETS 2014 workshop website [1]. We consider all videos for the 24 sequences and the 4 cameras. Out of the 24 sequences, 2 sequences are normal (without threats or abnormalities), 6 sequences contain some abnormalities, 6 sequences involve potentially criminal behaviour and 10
sequences involve criminal behaviour. We have used the tracks that have been computed by the University of Reading (see Section 2), which were made available for the benchmark. The dataset involves two types of threats: threats with respect to the driver and threats with respect to the truck. We consider three threat class labels: normal, threat to driver, and threat to truck. We have given each track a label by manual annotation, which serves as our ground truth. This experimental setup leads to a combined assessment of the localization (where) and classification (threat type vs. normal) of the two threat detection methods. We use a leave-one-sequence-out evaluation. This is a sound setup to evaluate the ability of the methods to generalize, since each sequence involves a different scenario, different actors and clothing, and a different variation of a threat or normal situation.

Figure 5. ROCs for threats with respect to the driver. (Plot title: “attack to person: 13 targets out of 1070 total”; curves: SVM and rule engine; horizontal axis: FPs; vertical axis: TPs.)
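The leave-one-sequence-out protocol of Section 5.1 can be sketched as a generator of train/test splits in which all tracks of one sequence are held out together. This is a minimal illustration rather than the authors' code; the function name `leave_one_sequence_out` and the convention of passing one sequence identifier per track are assumptions.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_one_sequence_out(
    sequence_ids: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_sequence, train_indices, test_indices) per sequence.

    sequence_ids[i] is the identifier of the sequence that track i belongs to.
    """
    by_sequence: Dict[Hashable, List[int]] = defaultdict(list)
    for index, seq_id in enumerate(sequence_ids):
        by_sequence[seq_id].append(index)

    for held_out, test in by_sequence.items():
        # Train on every track that does not belong to the held-out sequence.
        train = [i for seq_id, indices in by_sequence.items()
                 if seq_id != held_out for i in indices]
        yield held_out, train, list(test)
```

Grouping by sequence rather than by track keeps tracks of the same scenario, actors and clothing out of the training set whenever that sequence is being tested.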
5.2. Threats with respect to the driver
Figure 5 displays the ROCs of both methods for threats with respect to the driver. The rule-based method performs better when few alarms are allowed. Out of 13 threats, 9 are detected with just 6 false alarms. In contrast, at this same operating point, the SVM method has 160 false alarms. The reason that the rule-based method performs so well is that the attacks to the driver can be described by clear rules and features. The true positives mostly involve tracks where the driver is hit, pushed, or attacked. Therefore, the STIP activities “hit”, “push” and “attack” are very distinctive features to detect such threats. Because the rule-based system emphasizes these features, it performs better than the SVM method, which requires a lot of samples to learn a good threat model from the many different features. Few training samples are available for these threats: only 13. The SVM reaches a higher recall, but at the cost of many false alarms, which is not a realistic operating point in practice.

5.3. Threats with respect to the truck
Figure 6 displays the ROCs with respect to the truck. Interestingly, here the SVM method performs better for lower false alarm rates than the rule-based method. This finding for truck threats is contrary to the outcome of the driver threats. Two properties of the truck threats differ from the driver threats: more positive samples for training (52 vs. 13), and a less clear type of threat. Where the driver threat has a clear characteristic of attacking the driver, for which there is an obvious feature, the truck threat is posed in various forms, i.e., inspecting the truck, fiddling with the exterior, opening the door and breaking it open. This can happen in many different ways, so it is harder to make a good description with rules. On the other hand, with several samples for each of the threat variations, the SVM method is able to learn an adequate threat model. A complicating factor for the rule-based method is that there are different distracting factors around the truck, such as flapping sails, which give rise to false positives. Using our current set of features, it is not trivial to establish a simple and robust rule to exclude these distractors, whereas the SVM automatically learns about this in an implicit manner.

Figure 6. ROCs for threats with respect to the truck. (Plot title: “stealing from vehicle: 52 targets out of 1070 total”; curves: SVM and rule engine; horizontal axis: FPs; vertical axis: TPs.)

6. Conclusions
We have found that the SVM and the rule-based method each have their strengths and weaknesses for threat detection. The SVM method requires sufficient training samples; we found that 13 samples for the driver threats were not sufficient. When an event such as a threat to the driver can be characterised well by distinguishing intermediate-level features, a rule-based system is most suitable. With
the rule-based method, very good results were obtained for detecting the threats to the driver. The SVM performed badly for this type of threat. However, for handling uncertainty in the lower-level features and for interpreting variations of threats, such as threats to the truck, the SVM is to be favoured when sufficient training samples are available. For threats to the truck, we found that 52 threat samples resulted in a much better performance than could be achieved with rules. For threats that appear in many variations, rules are not adequate. We learned that intermediate-level features such as activities and the distance to the driver and the truck are important elements to take into account. Future work will focus on more complex temporal and spatial relations and practical criteria to switch dynamically between the SVM and the rule-based approach. Future research, and evaluation measures, should emphasize the need for early threat detection, as current reaction times are very short. To increase the performance of the SVM method in an adaptive manner, a user feedback loop could be implemented as an extension, to obtain more threat samples, which has been shown to be an important criterion to take into account.

Acknowledgements
We are grateful to the University of Reading for providing the tracks. This work was supported in part by the EU FP7 project ARENA (Architecture for the recognition of threats to mobile assets using networks of multiple affordable sensors), within the security research program.

References
[1] Arena dataset. http://www.cvg.rdg.ac.uk/PETS2014/a.html.
[2] Arena project, eu fp7 security. http://www.arena-fp7.eu.
[3] J. Aggarwal and M. Ryoo. Human activity analysis: A review. ACM Computing Surveys (CSUR), 43(3):16, 2011.
[4] M. Andersson, J. Patino, G. Burghouts, A. Flizikowski, M. Evans, D. Gustafsson, H. Petersson, K. Schutte, and J. Ferryman. Activity recognition and localization on a truck parking lot. In AVSS, pages 263–269. IEEE, 2013.
[5] G. Burghouts and K. Schutte. Spatio-temporal layout of human actions for improved bag-of-words action detection. Pattern Recognition Letters, 2013.
[6] G. Burghouts, K. Schutte, H. Bouma, and R. den Hollander. Selection of negative samples and two-stage combination of multiple features for action detection in thousands of videos. Machine Vision and Applications, 25(1):85–98, 2013.
[7] G. Burghouts, K. Schutte, R.-M. ten Hove, S. van den Broek, J. Baan, O. Rajadell, J. van Huis, J. van Rest, P. Hanckmann, H. Bouma, G. Sanroma, M. Evans, and F. J. Instantaneous threat detection based on a semantic representation of activities, zones and trajectories. Signal, Image and Video Processing (SIVP), 2014.
[8] I. Laptev. On space-time interest points. Int. J. Comput. Vision, 64(2-3):107–123, Sept. 2005.
[9] G. Sanromà, L. Patinob, G. Burghouts, K. Schutte, and J. Ferryman. A unified approach to the recognition of complex actions from sequences of zone-crossings. Image and Vision Computing, 2014.
