
Tracking people across crowded scenes.

Christopher Harris
Artificial Intelligence
2010/11
The candidate confirms that the work submitted is their own and the appropriate credit has been given
where reference has been made to the work of others.
I understand that failure to attribute material which is obtained from another source may be considered
as plagiarism.
(Signature of student)
Summary
This project investigates solutions to the tracking problem, specifically how to make a tracker
more robust and able to withstand occlusion. Additional detection responses, in the form of pedestrian
head detections, are associated with the targets and used to weight the model in the right direction,
serving to robustly guide targets when their bodies are occluded.
Contents
1 Introduction
  1.1 Inspiration
  1.2 Aim
  1.3 Objectives
  1.4 Minimum Requirements
2 Background Research
  2.1 Introduction
    2.1.1 Motivation
    2.1.2 Description of problem
    2.1.3 Resources
  2.2 Tracking Methods
    2.2.1 Background Segmentation
    2.2.2 Optical Flow
    2.2.3 Tracking by Detection
  2.3 Object Detectors
    2.3.1 Operation
    2.3.2 Features
    2.3.3 Classification
    2.3.4 Learning
    2.3.5 Extensions
  2.4 Tracking by Detection framework
    2.4.1 Data association
    2.4.2 Occlusion and Missing Detections/Detection errors
  2.5 State estimation models
    2.5.1 Kalman Filter
    2.5.2 Particle Filter
  2.6 On-line appearance models
  2.7 Evaluation
    2.7.1 Data Sets
3 Preparation of Solution
  3.1 Existing Design
    3.1.1 Detection-Tracker Association
    3.1.2 Occlusion handling
      3.1.2.1 Appearance
      3.1.2.2 Confidence density
    3.1.3 Particle Filtering
      3.1.3.1 Prediction Model
      3.1.3.2 Update Model
  3.2 Implementation Design Choices
    3.2.1 Person Detector
    3.2.2 Person Head Detections
    3.2.3 Contribute to initializing trackers
      3.2.3.1 Weight particles
      3.2.3.2 Associate head detections
  3.3 Evaluation tools
    3.3.1 Ground-Truthing GUI
    3.3.2 Track evaluator
  3.4 Software Design
  3.5 Schedule
4 Delivery of Solution
  4.1 Existing framework
  4.2 Proposed framework
  4.3 Problems Encountered
  4.4 Changes to implementation
  4.5 Revised Schedule
  4.6 Evaluation Tools
    4.6.1 Ground Truther GUI
    4.6.2 Automatic Evaluation
  4.7 Evidence
    4.7.1 Project Requirements
      4.7.1.1 1 - Associate tracks together
      4.7.1.2 2 - Deal with false detections and missing detections
      4.7.1.3 3 - Handle mild Occlusions
5 Evaluation
  5.1 Tools
  5.2 Failures
  5.3 Overall Performance
Bibliography
A Personal Reflection
B Appendix B
  B.1 Existing implementation code
C Code
Chapter 1
Introduction
1.1 Inspiration
Studying Artificial Intelligence, I seek to mimic and exceed human intellect and reasoning. People
tracking is a task we perform well on a daily basis for any number of reasons: keeping an eye on a
pedestrian while driving to avoid a collision, identifying potentially abnormal and suspect behaviour,
or even basic behaviour like not colliding with a fellow passer-by. Software implementations of
people tracking are useful for many similar reasons, with many applications in automatic security
systems and threat detection. As Alfred N. Whitehead put it, "Civilization advances by extending
the number of important operations that we can do without thinking about them" [27].
Computer vision in general is a core part of AI, and the inner workings of a tracking solution go
beyond computer vision alone: they include complex statistical estimation processes and general
machine learning techniques applicable to many areas of AI and computing, making this project a
worthy investigation.
1.2 Aim
The aim of this project is to produce robust tracking software, capable of tracking multiple pedestrians
in crowded video footage.
1.3 Objectives
The objectives of the project are to:
Produce a system for tracking multiple pedestrians
Develop and evaluate different methods of dealing with occlusions
1.4 Minimum Requirements
The minimum requirements of this project are to:
Associate input detections into tracked trajectories
Deal with false detections and missing detections
Handle mild occlusion.
The possible extensions are:
Optimize any slow parts of code in faster C++
Chapter 2
Background Research
2.1 Introduction
2.1.1 Motivation
With advances in technology, cameras used to record people have become widespread, from
the multitude of CCTV cameras set up by the police, to private surveillance systems used in
supermarkets and shopping malls, to sports broadcasts, all producing vast quantities of video data.
It therefore becomes convenient to have automatic methods of processing this data without human
interaction, particularly since a human can only monitor so much at one time, and since software
can derive information from empirical data that a human cannot.
Many cameras are set up with the primary intention of monitoring people's behaviour. In order
for software to extract any information about the behaviour of people in a scene, it is useful to first
be able to reliably track pedestrians and their trajectories.
2.1.2 Description of problem
While humans are generally adept at tracking a pedestrian around a scene, it remains a complex
problem for a computer to solve. Firstly, the wide variance in pedestrian appearance, from clothes,
colour and size to the articulation of body parts, makes identification challenging. Secondly,
pedestrian movement does not follow a specific trajectory: people may change direction at any time,
of their own accord or when influenced by obstacles and other people. This hard-to-predict
movement presents a particular challenge when a pedestrian is occluded, either by the background
or by another person; there need to be methods for keeping track of which pedestrian is which and
where each is likely to be.
Constraining the problem to off-line processing reduces these difficulties slightly, since the software
can make accurate leaps across missing data: it can see, for instance, when and where a pedestrian
reappears. Likewise, the problem is aided by adding more data in the form of stereo cameras or
multiple views of a scene, making occlusion from all angles of view unlikely. Implementations that
run on-line must have other ways of dealing with these issues. This project attempts to solve the
tracking problem in its purest, least constrained version: on-line tracking of multiple pedestrians
with a single camera in a crowded scene.
2.1.3 Resources
The problem of tracking pedestrians is an active area of computer vision research, with groups
producing regular papers and bodies of work on the topic. The University of Reading holds yearly
workshops titled PETS: Performance Evaluation of Tracking and Surveillance [9], out of which
come papers on tracking solutions and annual datasets to challenge tracking implementations.
The video analysis and content extraction (VACE) program and the classification of events,
activities, and relationships (CLEAR) workshops offer evaluation statistics to measure precision,
accuracy and robustness.
There are also a number of publicly and privately funded projects, such as SUBITO: Surveillance
of Unattended Baggage and the Identification and Tracking of the Owner, which focuses on threat
detection in the form of people who leave luggage unattended. All these active investigations
indicate that the problem is far from completely solved and there is still room for improvement.
Many reputable peer-reviewed venues and groups involved in computer vision, such as PAMI,
CVPR and USC, regularly publish papers about pedestrian tracking; their papers and conferences
serve as a good introduction to tracking and to the current state of the research area. Other
resources, namely Google Scholar and CiteSeer, were also useful in expanding the diversity of
sources available.
2.2 Tracking Methods
Many different approaches have been taken to tracking pedestrians. Detailed below is a range of
the methods used, with more detail given to the most relevant and successful ones.
2.2.1 Background Segmentation
A form of background segmentation seems the simplest method of tackling the problem of tracking
people. Frame differencing subtracts successive images from each other, revealing pixels that have
changed colour and are therefore likely to represent motion. In other methods, some threshold
function is applied against a model of the background, returning foreground objects. Both of these
methods contain inherent problems. Frame differencing fails when the foreground is a similar colour
to the background, and can only be used under controlled lighting, or unwanted differences will show
up. Background modelling in its simplest, non-adaptive form requires manual initialization to capture
a static background image, making it susceptible to errors in the background accumulating over time,
for instance due to changes in the camera's orientation and illumination. This makes it unsuitable for
long-term unsupervised tracking.
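As a minimal illustration of frame differencing, the sketch below thresholds the absolute difference between two consecutive frames; the file names and the threshold value of 25 are assumptions for illustration, not part of any implementation discussed here.

% Frame-differencing sketch (Matlab, Image Processing Toolbox).
% File names and threshold value are illustrative assumptions.
prev = rgb2gray(imread('frame_001.jpg'));
curr = rgb2gray(imread('frame_002.jpg'));
d = imabsdiff(curr, prev);     % absolute per-pixel change
motion_mask = d > 25;          % changed pixels are treated as motion
imshow(motion_mask);

Any pixel whose intensity changes by more than the threshold is marked as foreground, which is exactly why the method fails when foreground and background share similar colours.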
Adaptive models of the background cope better with these slight variations by adapting to
background changes, forming a representation of the mean background image over time. Stauffer
and Grimson use an adaptive background model to successfully segment moving foreground in a
simple scene and recover patterns of activity from the resulting trails [25]. Their application of
Gaussian mixture models forms a history of background pixel values over time, used to classify
each pixel as background or not.
Figure 2.1: Hard to distinguish targets in close proximity using background segmentation.
Background segmentation techniques constrain the solution to scenes where a background image is
in view. When dealing with many moving objects in a busy scene the method begins to break down:
many targets overlap, and when the majority of the foreground is moving there is no background
left to model. Background modelling also makes it difficult to identify individual targets moving in
close proximity (see Figure 2.1, reprinted from [17]). These problems are recognised by Stauffer
with the statement that "the tracking system has the most difficulty with scenes containing high
occurrences of objects that visually overlap" [25]. Background modelling also presents the challenge
of identifying stationary targets, as they could be classed as background.
Although widely used in the literature for specific applications, these problems bias this project
towards a different tracking technique, one more robust to multiple targets.
2.2.2 Optical Flow
Lucas-Kanade tracking is a type of optical flow, introduced in [15], whereby feature points are
computed and then compared from one frame to the next. An implementation is used in [29]: they
use an object detector to identify a bounding box, localising the patch in the image from which to
take and track feature points. After feature points have been found in the next frame, the bounding
box displacement and scale changes are estimated from the resulting motion field. To make their
implementation more adaptive they generate new feature points to track every frame. However
adaptive this method may be, trying to track a class with as wide a variance in appearance as
pedestrians limits the effectiveness of such fine-grained, pixel-level tracking, which is why they only
use it for short-term tracking as part of a larger tracking framework.
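A minimal sketch of this style of tracking, assuming Matlab's Computer Vision Toolbox is available; the bounding box, file names and parameter values are illustrative assumptions, not taken from [29].

% Lucas-Kanade (KLT) point tracking inside a detector bounding box.
frame1 = rgb2gray(imread('frame_001.jpg'));
frame2 = rgb2gray(imread('frame_002.jpg'));
bbox = [100 50 40 80];                          % assumed detector output [x y w h]
corners = detectMinEigenFeatures(frame1, 'ROI', bbox);
tracker = vision.PointTracker('MaxBidirectionalError', 2);
initialize(tracker, corners.Location, frame1);
[pts, valid] = step(tracker, frame2);           % match points into the next frame
shift = median(pts(valid,:) - corners.Location(valid,:), 1);  % estimated box displacement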
2.2.3 Tracking by Detection
Tracking by detection arises from recent advances in object detection methods, detailed in [8].
Detector output from an image provides the tracker with a good estimate of where a target may be,
enabling automatic initialization of new tracker targets as well as reinforcement of current trackers
following targets in subsequent frames. The resulting detections provide a good base on which to
build a tracking framework.
2.3 Object Detectors
The literature surrounding object detectors is extensive and varied, so described here is a selection of
approaches used in relevant recent academic literature for pedestrian detection; a wider survey can
be found in [8].
Since pedestrians vary so widely in lighting, pose and clothing, it is important to use robust detection
methods, insensitive to these intra-class variations but sensitive to the variations that distinguish
pedestrians from everything else.
2.3.1 Operation
Many object detectors operate using a sliding window, iterating over the image and extracting the
features found inside windows of different scales and positions. This allows features of varying size
and position to be localized within the image. The features obtained from each window are fed into a
classifier that has been trained to detect the desired object; the classifier in turn outputs a score of the
similarity between the learned features and the features presented in the window, a probability that
the window contains the object. This method gives good results by adapting to features of many
scales, but is computationally expensive because of the number of windows that must be searched.
The results from sliding window detectors generally contain many overlapping positive windows,
so an NMS (non-maximum suppression) algorithm is usually performed to smooth out these
overlapping windows and find the peaks, in the hope that these correspond to single objects.
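A minimal greedy NMS sketch is given below; the [x y w h] box format and the overlap threshold are assumptions for illustration, not the exact scheme of any detector discussed here.

% Greedy non-maximum suppression: repeatedly keep the highest-scoring
% box and discard the boxes that overlap it too much.
function keep = nms(boxes, scores, thr)
    [~, order] = sort(scores, 'descend');
    keep = [];
    while ~isempty(order)
        i = order(1);
        keep(end+1) = i;  %#ok<AGROW>
        rest = order(2:end);
        ovl = arrayfun(@(j) box_overlap(boxes(i,:), boxes(j,:)), rest);
        order = rest(ovl < thr);      % suppress heavily overlapping boxes
    end
end

function o = box_overlap(a, b)
    % intersection over union of two [x y w h] boxes
    ix = max(0, min(a(1)+a(3), b(1)+b(3)) - max(a(1), b(1)));
    iy = max(0, min(a(2)+a(4), b(2)+b(4)) - max(a(2), b(2)));
    inter = ix * iy;
    o = inter / (a(3)*a(4) + b(3)*b(4) - inter);
end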
2.3.2 Features
The features used largely determine the robustness of the detector. With pedestrian appearance
varying so widely due to lighting, body articulation and clothing colour, it is important to select
features that depend as little as possible on these variations. For this reason colour information is
rarely used for detecting classes of objects; shape cues are generally much more robust. HOG
(Histograms of Oriented Gradients) detectors are a popular method of human detection. They use
information about the orientation of edges to describe the shape of an object; using edge intensities
makes the features largely invariant to translations and differences in illumination, naturally making
them suited to pedestrian detection. See Figure 2.2, reprinted from [6].
Figure 2.2: (f) The HOG representation of a pedestrian.
Haar wavelets are another feature that has been used in pedestrian detection; they represent the
difference in average intensity between specific local regions. See Figure 2.3, reprinted from [21].
Figure 2.3: Haar representation of the pedestrian class.
Papageorgiou and Poggio detail the use of Haar wavelets for their ability to capture visually
plausible features of the shape and interior structure of objects that are invariant to certain
transformations [21].
In Histograms of Oriented Gradients for Human Detection, Dalal and Triggs show convincing
results that HOG outperforms other feature descriptors for the case of human detection [6]. Other
surveys agree, making it clear that HOG detector performance can match more complex detectors
and outperform many others [7].
2.3.3 Classication
The aforementioned features are used to vote for an object by means of classification. Many types
of classifier exist, with most detectors favouring a linear SVM (Support Vector Machine) [6]. Maji
et al. [16] present a slightly altered, more efficient version of intersection kernel support vector
machines which they report outperforms linear SVMs for an increase in evaluation runtime.
2.3.4 Learning
In one of the most famous object detection papers, Viola and Jones [26] extract simple features
from an image: sums of the pixels which lie within specific rectangular regions. Since the number
of combinations of these simple features is so vast, Viola and Jones introduce a method to aid
learning called boosting. The boosting algorithm enables strong, precise classifiers to be built from
a combination of weak, imprecise classifiers. Large publicly available datasets with annotated
pedestrians exist that can be used to train such classifiers.
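As a standard statement of the idea (textbook AdaBoost notation, not quoted from [26]), the boosted strong classifier is a weighted vote over T weak classifiers:

H(x) = \operatorname{sign}\Bigl(\sum_{t=1}^{T} \alpha_t \, h_t(x)\Bigr)

where each weak classifier h_t(x) typically thresholds a single feature, and the weight \alpha_t grows with that weak classifier's accuracy on training examples reweighted to emphasise previous mistakes.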
2.3.5 Extensions
An interesting adaptation of a HOG detector, described in fastHOG (Prisacariu and Reid), can
achieve a speed-up of over 67x [22] compared to the standard approach. HOG, along with ISM
detectors, operates on a sliding window principle, sequentially calculating feature scores for all
possible window positions and scales in an image. This exhaustive search can benefit dramatically
from a parallelised approach where the architecture is available. Taking advantage of the parallel
nature of the GPU, made accessible by Nvidia CUDA [19], the algorithm distributes the window
processing across the GPU's many cores, each core assigned a subset of windows, effectively
checking multiple windows at once before returning the results to the CPU and speeding up the
overall processing time.
Other implementations train detectors on multiple parts of a whole object and use the combination
of these part detections to detect objects, allowing some robustness against occlusion, as detailed by
Wu and Nevatia [28].
Other person detectors, outside the scope of this problem domain, combine extra information using
stereo vision, object motion and other descriptors to further aid detection.
2.4 Tracking by Detection framework
Even the best object detectors, however, often fail in certain circumstances. Partial or total
occlusion, noise, or similar-looking background objects all present difficulties and can result in false
positives and missing detections. Since the final output from person detectors cannot be taken as
completely reliable, it must be subject to additional reasoning to form part of a robust tracking
solution.
The output of person detectors will contain good detections of people along with false alarms and
gaps in the data; this requires solutions to both the data association and state estimation problems.
2.4.1 Data association
The final output from an object detector is just a bounding box or ellipse giving the size and
location of the detection. To use these detections for tracking one or more people, it is necessary to
associate which detections correspond to the same person across multiple frames, forming the path
followed by that person.
In this problem of data association there is a distinction to be made with regard to how much
detection data is available.
Some methods take a global optimization approach and use all of the precomputed detections to
find the best-fitting path an object takes. In Pedestrian tracking by associating tracklets using
detection residuals [23], a two-stage approach is taken to multi-object tracking: track segments are
first generated from high-confidence detections, then the low-confidence detections are used to link
the segments together. Gavrila and Munder [11] use the long-established Hungarian algorithm [14]
over a cost matrix of predicted tracks and measurements; the Hungarian algorithm is optimal but
computationally expensive.
Since in these scenarios all the possible data is available, they often produce good results, but they
inherently constrain the tracking problem to off-line use.
Other methods are characterized by a more Markovian approach, using only data available at the
present moment rather than searching through the past or future, allowing the problem to be tackled
on-line and enabling real-time application.
In [3], detections are associated with the target being tracked by greedily associating scores from
a probability density function. The probability that a detection should be associated with a target is
computed by evaluating their differences in distance, size and appearance. Each possible detection-
to-tracker combination is scored, and all scores above a threshold are iterated through using a
greedy algorithm to decide the assignment. Greedy algorithms are a simple, effective solution to
assignment problems and work well for all but the most complex cases.
2.4.2 Occlusion and Missing Detections/Detection errors
Even the best object detectors are subject to errors. False detections may occur where the
background contains an object with features resembling those the detector was trained for, such as
a stop sign with steep gradient changes like those a pedestrian produces. Detections may also be
missed due to unfamiliar poses or body articulation. This shows that the output of object detectors
alone is not accurate enough to solve the tracking problem. To compensate for these errors, models
of the tracker's movement and on-line appearance models are used to guide the tracker when no
detections are available.
In [5], the robust occlusion handling comes partly from taking into account lower confidence scores
from the detector. Both HOG [6] and ISM output single detections after finding local maxima and
reducing the set of detections to high-confidence ones. This prevents many false positives from
background clutter, but can also throw away potentially useful data, as in the case where only part
of a person is visible. To make sure the detector confidence is reliable before it is used, explicit
inter-object occlusion reasoning is performed, whereby the detector confidence density around an
image location is considered likely to be foreground if there is a tracker with an associated
detection nearby. This rationale is used to weight the confidence density's influence on the
observation model.
2.5 State estimation models
When dynamic data is noisy and has discrepancies, such as missing detections and imprecise
locations from a person detector, it is important to model the target well enough to estimate its
movement and propagate it accordingly, making the tracker as robust to noise and missing data as
possible.
Two models are necessary to make inferences about a dynamic system [1]: a model of the evolution
of the state over time, and a model relating the noisy measurements to the current state.
To estimate the state of a dynamic system, the posterior probability density function of the state
(the likelihood of each state at the new time step) must be calculated based on all data received so
far.
When certain conditions hold on the distribution of the target's state, there are optimal solutions to
the problem, such as the Kalman filter; otherwise the solution must be approximated, for instance
by a particle filtering approach.
2.5.1 Kalman Filter
If the state distribution is Gaussian and the system linear, the Kalman filter is the optimal way of
representing the state, since it models the posterior density by assuming it is Gaussian at every time
step.
Kalman filters provide highly efficient state estimates but are restricted to Gaussian distributions.
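For reference, a standard statement of the linear-Gaussian model the Kalman filter assumes (textbook notation, not taken from this project's code), with state x_t, measurement z_t, transition matrix F and measurement matrix H:

x_t = F\,x_{t-1} + w_t, \quad w_t \sim \mathcal{N}(0, Q)
z_t = H\,x_t + v_t, \quad v_t \sim \mathcal{N}(0, R)

The filter alternates a prediction step using the first equation with an update step correcting the prediction against the measurement z_t.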
2.5.2 Particle Filter
Particle filters, introduced with the Condensation algorithm [12], are the backbone of many recent
pedestrian trackers. They represent the uncertainty in a state by a set of weighted particles, allowing
multiple hypotheses to be represented at the same time until the correct one emerges, which makes
them suitable for multi-modal state distributions. A particle filter contains two models: one for
prediction and one to evaluate the likelihood of a state.
(Particle filters are explained in more detail as part of the chosen implementation.)
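Concretely, in the standard formulation (general notation, not this project's code), the weighted particle set \{x_t^{(i)}, w_t^{(i)}\}_{i=1}^{N} approximates the posterior as

p(x_t \mid z_{1:t}) \approx \sum_{i=1}^{N} w_t^{(i)}\, \delta\bigl(x_t - x_t^{(i)}\bigr), \qquad \sum_{i} w_t^{(i)} = 1,

so several modes of the distribution survive simply as clusters of particles with appreciable weight.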
2.6 On-line appearance models
Aside from the pedestrian detector, trained off-line on a general model of a person consisting
mostly of shape cues, local classifiers are built to represent the specific appearance of the particular
pedestrian being tracked, often using colour and texture information. This helps pedestrians to be
disambiguated from each other and tracks to recover correctly from inter-occlusions without
swapping identities.
In Online learning of robust object detectors during unstable tracking [29], LBP (local binary
patterns) [20] are used as features to train the on-line classifier.
Kalal et al. present a novel on-line classifier made robust by learning from mistakes [13]: a method
for self-assessing the tracker's reliability based on the trajectory, identifying its mistakes and
learning from them.
2.7 Evaluation
A number of groups concerned with the problem of people tracking have devised evaluation criteria
to standardize the metrics, allowing results to be compared and exchanged easily. For example, the
ClearMOT [2] evaluation criteria define error metrics (see Figure 2.4), such as (a) when a tracker
exceeds a certain threshold of distance from a pedestrian, and (b) when a tracker confuses
pedestrians and switches target.
Figure 2.4: A sample of the ClearMOT evaluation criteria for pedestrian tracking.
These metrics can be useful but rely on the descriptiveness of the ground truth data; simple location
annotations cannot convey any information about target switches.
2.7.1 Data Sets
A number of publicly available datasets are published for tracking evaluation. PETS publish yearly
datasets [10] which are commonly used and therefore make good benchmarks for comparison.
Chapter 3
Preparation of Solution
This section outlines the design decisions taken. After thorough research and background reading,
it was decided that the following implementation would best tackle the problem and satisfy the
project's minimum requirements.
3.1 Existing Design
A large amount of work on tracking already exists, often combining a complex chain of different
methods. To save reinventing the wheel and to keep the specific focus of this project, an existing
implementation was chosen as a base on which to improve. Writing a complex program such as this
from scratch would be difficult given the time constraints, and using an existing implementation
also benefits reliability, since the code has already been subject to testing and evaluation. A Matlab
re-implementation of Breitenstein's successful paper [4] was produced by Jan Sochman of the
University of Leeds [24], utilising many of the state-of-the-art tracking techniques detailed in the
background research.
The paper, and thus the code, uses a tracking-by-detection framework employing a particle filter
to model the state uncertainty of the targets. The observation model guides the particles based on
the output detections, an on-line trained appearance model, and the additional continuous
overlapping outputs from the detector before NMS.
3.1.1 Detection-Tracker Association
To solve the data association problem of matching detections to trackers, a matching score S is
calculated for each tracker-detection pair. The score combines the distance in position between
detection d and tracker tr, evaluated under the normal distribution p_N; what the paper defines as a
gating function g(tr, d) accounting for velocity; and a score c_tr(d) from a classifier trained on tr
and evaluated on d. Schematically (reconstructed from the paper's description):

S(tr, d) = g(tr, d) \cdot c_{tr}(d) \cdot p_N\bigl(\mathrm{pos}(d) - \mathrm{pos}(tr)\bigr)
Once the matching scores have been calculated, a greedy algorithm iterates over the score matrix,
choosing the best matching pairs.
assoc = [];
while 1
    % pick the highest remaining tracker-detection score
    [score idx] = max(S(:));
    if score < params.assoc_thr
        break;
    end
    [r c] = ind2sub(size(S), idx);
    assoc(end+1, :) = [r, c];
    % zero out the matched row and column so neither is matched again
    S(:, c) = 0;
    S(r, :) = 0;
end
Figure 3.1: Greedy algorithm extracting the best matches from the score matrix.
The greedy algorithm gives results similar to the optimal Hungarian algorithm, but it is simpler and
therefore has a lower computational cost, helping the efficiency of the program.
3.1.2 Occlusion handling
Targets becoming occluded is arguably the biggest problem faced when tracking. Breitenstein's
implementation uses two notable techniques to help guide the tracker when normal detection fails.
3.1.2.1 Appearance
The paper describes a sound evaluation performed to identify the most discriminative features for
on-line classifier use. They test colour histogram features in RGB (red-green-blue), HS
(hue-saturation), RGI (red-green-intensity) and Lab space, along with the texture features LBP
(local binary patterns) and Haar wavelets, concluding that the combination of RGI and LBP
outperforms any other combination tested.
Thus, the implementation trains a boosted classifier for each target at run time, using a combination
of colour histogram features from RGI and texture features from LBP. Features are sampled from
the detector output for an associated tracker where no other trackers overlap, forming a positive
training set. Using AdaBoost, these positive training examples are combined with a negative
training set from nearby targets to adapt the classifier and select the most discriminative feature set.
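As a sketch of what an RGI colour histogram feature might look like (the 8-bin quantisation and the function name are illustrative assumptions; the project's actual feature code differs):

% Build a normalised RGI (red-green-intensity) histogram from an image patch.
function h = rgi_hist(patch)
    % patch: uint8 HxWx3 region sampled from inside a detection
    p = double(patch);
    I = sum(p, 3) + eps;                 % per-pixel intensity
    r = p(:,:,1) ./ I;                   % normalised red chromaticity
    g = p(:,:,2) ./ I;                   % normalised green chromaticity
    i = I / (3*255);                     % intensity scaled to [0,1]
    nb = 8;                              % bins per channel (assumed)
    q = @(x) accumarray(min(floor(x(:)*nb)+1, nb), 1, [nb 1]);
    h = [q(r); q(g); q(i)];
    h = h / sum(h);                      % normalise to a distribution
end

Chromaticity normalisation is what gives such a feature some robustness to illumination changes, complementing the purely textural LBP features.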
3.1.2.2 Condence density
The paper's main contribution is the idea of utilising confidence density: the detector output before
NMS is applied, so that no potentially useful data is hidden. If there is no associated pedestrian
detection, the low-confidence detections are searched and evaluated for a match. This is done only
subject to inter-object occlusion reasoning: the low-score detections are trusted only if there is
another tracker with an associated detection nearby (suggesting that the person in question is being
occluded and therefore hidden from the pedestrian detector).
3.1.3 Particle Filtering
Like most recent on-line trackers, the backbone of this tracking framework is a particle filter
initialised for each target to estimate its state, where the state is defined as the target's location and
velocity. Representing uncertainty in the state by a weighted set of particles allows multi-modal
distributions to be represented, which bodes well for pedestrian tracking, where multiple hypotheses
need to be considered until the best one can be determined. As detailed in the Background
Research, two models are used in particle filter operation.
3.1.3.1 Prediction Model
Referred to in the paper as the Motion Model, this is used to propagate the particles at every time
step (every image) according to a constant-velocity motion model.
(x, y)_t = (x, y)_{t-1} + (u, v)_{t-1}\,\Delta t + \epsilon_{(x,y)} \quad (3.1)

(u, v)_t = (u, v)_{t-1} + \epsilon_{(u,v)} \quad (3.2)
Figure 3.2: Motion Model
function X = motion_predict(oldX, track_length)
    eta_pos = 1;
    eta_vel = 3;
    % propagate position by velocity, with Gaussian process noise
    X.pos = oldX.pos + oldX.vel + normrnd(0, eta_pos, [1, 2]);
    % velocity noise shrinks the longer the target has been tracked
    X.vel = oldX.vel + normrnd(0, eta_vel / track_length, [1 2]);
return
Figure 3.3: Matlab implementation of the motion model (Equations 3.1 and 3.2).
The motion model adds process noise (\epsilon_{(x,y)} and \epsilon_{(u,v)}, implemented as eta_pos
and eta_vel), adding variance to the distribution that decreases as the target is tracked successfully.
This makes the distribution more varied at first, when less is known about the target, and more
refined as the model improves.
3.1.3.2 Update Model
Termed the Observation Model in the paper, this is used to weight each particle belonging to a
tracker, scoring its likelihood of representing the target. The algorithm combines information from
three different sources to form a particle's weight.
1. Detection term, which calculates the distance between the particle p and the associated
detection d.
dist = norm(Xcentre - drcentre);   % distance from particle centre to detection centre
w = w + params.beta * normpdf(dist, 0, params.sigma_det);
Figure 3.4: Detection term adding to a particles weight. (snippet from appearancemodel.m, Appendix B)
When an associated detection is found, this term has the most influence on a particle's weight, as it
is the most reliable source, robustly guiding the tracker.
2. Detector confidence term. This term calculates the detector confidence density at the particle
position.
idxs = find(sizes == fit_size);
row = conf_maps.dr(idxs, 1);
col = conf_maps.dr(idxs, 2);

% find the confidence-density response nearest to the particle position
[tmp nearest_idx] = min(sqrt(sum((repmat(X.pos, length(row), 1) - [row col]).^2, 2)));

conf = conf_maps.ds(idxs(nearest_idx));
Figure 3.5: Detector confidence term adding to a particle's weight. (snippet from appearancemodel.m,
Appendix B)
This code finds the closest confidence-density detection and scores it.
3. On-line classifier term, which evaluates the target's on-line trained appearance classifier
(Section 3.1.2.1) at the particle position.
% add the on-line appearance classifier's matching score to the weight
w = w + params.eta * classif_eval(classif, img, ih, X.pos, track_size, params);
Figure 3.6: On-line classifier term adding to a particle's weight. (snippet from appearancemodel.m,
Appendix B)
A classifier built up from positive and negative training features is evaluated at each particle to give
a matching score, which is added to the particle's weight.
3.2 Implementation Design Choices
Building on the implementation described, this project aims to further improve the tracking pipeline
and increase the robustness of the tracker over occlusions. After comparing Breitenstein's method
with the background research and studying the various datasets, the following changes were
decided upon.
3.2.1 Person Detector
Originally, detections came from the person detector described by Maji, Berg and Malik [16]
(labelled the Malik detector below). The detector uses multi-scale histogram of oriented edge
energy features, which they claim are similar to HOG but with a simpler design and lower
dimensionality. They also claim their method outperforms the original HOG detector of Dalal and
Triggs [6]. After evaluating the accuracy of different detectors, these claims did not stand up to
scrutiny and a different detector was chosen (see Evaluation). The background research shows
HOG detectors to be the leading state-of-the-art person detectors, so it was decided to substitute
the existing detector with a GPU implementation of a HOG detector, fastHOG, by Prisacariu and
Reid [22]. Utilising the parallel nature of the GPU, they cite speed-ups of over 67x compared to
regular HOG implementations. Running both detectors on the same two datasets gives the
processing times in Figure 3.7.
                                     Total processing time    Average time per image
Dataset (portion)           Size     fastHOG    Malik         fastHOG    Malik
SUBITO (ESLANG\TLC41)       800      17m28s     179m29s       1.3s       13.5s
PETS06 (S1-T1-C\4)          3020     67m56s     902m57s       1.4s       17.9s
Figure 3.7: Processing time results for fastHOG and Malik detectors
The change to a CUDA-based detector puts a small constraint on what type of hardware the
program may be run on, but nothing very specialised is needed: only a desktop PC with a recent
Nvidia graphics card, a fairly ubiquitous platform, making the impact on the system's portability
negligible.
The detector uses the same classification data that Dalal used in [6] to classify pedestrians, from
the INRIA people dataset, a reliable and varied selection of pedestrians, making it perfectly suitable
for this task. The change of detector is also in accordance with the project extension of speeding up
slow areas by implementing them in a faster compiled language such as C++: the chosen detector
replaces the previous Matlab-based detector, the slower bottleneck, with a parallelised C++
implementation.
3.2.2 Person Head Detections
In the interest of making the tracker more resilient to partial occlusions, and drawing inspiration
from part-based frameworks such as that of Wu and Nevatia [28], it was decided to augment the
existing person detection results with extra part detections. In crowds and dense scenes, a person's
full body can be partially occluded for extended periods of time, limiting the effectiveness of
full-person detectors; in many crowded scenes the only discernible part of a pedestrian in view is
their head (see Figure 3.8). It is to this end that it was decided to add results from a pedestrian head
detector.
Figure 3.8: SUBITO dataset: Three pedestrians with inter-occlusion, heads still visible.
Although head detections provide extra detection information, the human head has a more generic
shape, so head detections are more prone to false positives. For this reason it was chosen not to
merge detection results as in [28], but to treat head detections as a source of information secondary
to the person detections. For the same reason, initialisation of trackers will not be based solely on
head scores. Overall, this extra detection information is incorporated in two ways.
3.2.3 Contribute to initializing trackers
Although the head detection results alone are deemed not reliable enough to initialize trackers by
themselves, this extra information can still be put to good use. In the current implementation,
trackers are initialized by detections scoring over a threshold. In the proposed change, person
detections and head detections are associated in a similar way to trackers and detections: if a person
detection has a correctly associated head detection, their scores are combined and compared against
the threshold, as in the sketch below.
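A minimal sketch of the combined initialisation test; the variable names, the helper functions (has_associated_head, head_score, init_tracker) and the simple additive combination are hypothetical, chosen for illustration rather than taken from the project code.

% Combine person and head detection scores before the initialisation test.
init_score = person_det.score;
if has_associated_head(person_det, head_dets)        % hypothetical helper
    init_score = init_score + head_score(person_det, head_dets);
end
if init_score > params.init_thr
    trackers{end+1} = init_tracker(person_det);      % spawn a new tracker
end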
3.2.3.1 Weight particles
In the current implementation, person detection results are the main influence on the particle
weights, guiding the tracker's state. This project adds associated head detections to robustly guide
the tracker in the event that no person detections are available.
3.2.3.2 Associate head detections
Head detections occur at a different position and scale to trackers and person detections, so to
associate them it is necessary to form a centre point representing the head's projection onto the
middle of a person. This point can then be used to calculate the distance from a tracker or person
detection.
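A minimal sketch of such a projection; the proportions (a body roughly six head-heights tall) are an assumed heuristic for illustration, not the project's calibrated values. Note that +y points downwards in Matlab's image coordinates.

% Project a head bounding box [x y w h] to an approximate body-centre point.
function c = head_to_centre(head_bbox)
    head_c = [head_bbox(1) + head_bbox(3)/2, head_bbox(2) + head_bbox(4)/2];
    body_height = 6 * head_bbox(4);          % assume body ~6 head-heights tall
    c = head_c + [0, body_height/2 - head_bbox(4)/2];   % shift down to torso centre
end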
3.3 Evaluation tools
The impact of changes to the implementation can be observed visually by watching the tracker
output and checking that trackers and people match up, which is useful for identifying the source of
a tracking failure. However, to form a sound conclusion about the accuracy of tracking, results
must be backed up empirically; to this end, several tools were built to aid quantitative evaluation.
3.3.1 Ground-Truthing GUI
Few datasets geared towards pedestrian tracking come annotated with ground-truth data, and even
some of those that do use obscure formats. This presents a problem for the evaluation of trackers,
since there are no true positions of people to evaluate against quantitatively; without these ground
truths there is no way of automatically comparing the effectiveness of trackers. ViPER is a tool
supposedly capable of annotating images for this purpose, but it has a complicated XML format
and would not work when tested. It is for this reason that, as part of this project, an interface was
built to mark and save annotations defining where people are.
3.3.2 Track evaluator
Given ground-truth coordinates of pedestrian locations and tracker output coordinates, an automatic
evaluation script performs a frame-by-frame analysis, counting the instances of correctly tracked
people, erroneous trackers, and people left untracked. Having these quantitative evaluation metrics
helps satisfy the project objective of evaluating different ways of dealing with occlusions.
3.4 Software Design
The software design methodology followed for this project is, for the most part, a form of Rapid
Application Development through prototyping [18]. This method allows for a flexible design
process and is complemented by a high-level language like Matlab, which is itself suited to
prototyping. Prototyping ideas against the initial requirements, observing the results and refining
them gets something running quickly, and the knowledge gained furthers the development. Starting
from a complex framework, it is important to start small, see how it works, and then improve;
attempting to plan all the required changes and execute them in one go would likely fail, since
starting with an already complex implementation makes large-scale integration a challenge.
3.5 Schedule
The duration of the project is 15 weeks. The following schedule was set.
4 weeks  Background reading, research into the topic.
1 week   Gain familiarity with existing implementation.
1 week   Change the detector.
2 weeks  Make evaluation tools.
3 weeks  Make suggested implementation changes.
1 week   Evaluate changes.
3 weeks  Write up project.
Chapter 4
Delivery of Solution
4.1 Existing framework
Figure 4.1: Existing implementation pipeline.
4.2 Proposed framework
Figure 4.2: Proposed implementation pipeline.
4.3 Problems Encountered
Software engineering problems occurred early on that set back the progress of the project. The
fastHOG detector required a number of dependencies that were not installed on the School of
Computing machines; it was also designed to run on a 32-bit platform, whereas these machines
were all running 64-bit software. The compilation and linking of dependencies, combined with
cross-compiling for 64-bit, took substantial time to get to grips with. Another source of frustration
proved to be the varying coordinate systems, Matlab putting the origin in the top left, as opposed to
the more conventional bottom left.
4.4 Changes to implementation
The only change made from the initial proposal was in the use of head detections as secondary to
the pedestrian detections. Initially it was decided that particles would only have their weights
influenced by head detections when pedestrian detections were not available. After carefully
scoring the association of head detections to trackers and discarding any erroneous ones, the
resulting head detections proved just as robust and accurate as the pedestrian detections, so, in
order not to waste data, they were used to influence the weights every time they were associated.
4.5 Revised Schedule
As previously mentioned, problems with the detector delayed the implementation of the project,
and the schedule needed to be revised.
4 weeks  Background reading, research into the topic.
1 week   Gain familiarity with existing implementation.
3 weeks  Change the detector.
2 weeks  Make evaluation tools.
2 weeks  Make suggested implementation changes.
1 week   Evaluate changes.
2 weeks  Write up project.
4.6 Evaluation Tools
4.6.1 Ground Truther GUI
Below is a screenshot of the Matlab ground truther.
Figure 4.3: Screenshot of the Matlab ground-truther GUI with the previous frame's annotations.
Since there are often large numbers of images to ground truth, everything that could be automated
was, requiring the least possible input from the user. As shown by the code below, to save manually
initialising new bounding boxes each frame, and assuming a pedestrian present in one frame will
not have vanished by the next, the program reloads the previous bounding boxes so that all the user
has to do is move them into their changed positions. Hotkeys for creating new bounding boxes and
deleting old ones were also made. When the figure is closed, the annotations are automatically
saved and the next frame opened.
% load the previous frame's rectangles
for ti = 1:4:length(pos)
    h = imrect(gca, [pos(ti), pos(ti+1), pos(ti+2), pos(ti+3)]);
    hlist{length(hlist)+1} = h;
end
% wait for the user to close the window, then save the positions
waitfor(fighandle)
for ti = 1:4:length(pos)
    fprintf(fid, '%s, %4.0f, %4.0f, %4.0f, %4.0f\n', img_list{t}, ...
        pos(ti), pos(ti+1), pos(ti+2), pos(ti+3));
end
fclose(fid);
% next image
end

% hotkey handlers: 'p' creates a new rectangle, 'k' hides the selected one
if evnt.Character == 'p'
    h = imrect;
    hlist{length(hlist)+1} = h;
end
if evnt.Character == 'k'
    set(get(gco, 'Parent'), 'Visible', 'off');
end
Figure 4.4: Snippets of ground-truther code.
4.6.2 Automatic Evaluation
A greedy algorithm was written that takes into account the size difference and percentage overlap of
the bounding boxes being compared, returning a list of those that match and a count of those that do
not. Helper scripts parse either lists of detections or lists of trackers into this greedy algorithm to
compare against the ground truths and aggregate the results; a sketch of the idea is given below.
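A minimal sketch of greedy ground-truth matching by percentage overlap; the [x y w h] box format, threshold and the use of intersection-over-union alone are simplifying assumptions (the real script also weighs size difference).

% Greedily match ground-truth boxes to detections by overlap.
function [matches, n_missed] = match_gt(gts, dets, thr)
    % gts, dets: N-by-4 [x y w h] boxes; thr: minimum overlap, e.g. 0.5
    matches = zeros(0, 2);
    used = false(size(dets, 1), 1);
    for i = 1:size(gts, 1)
        best = 0; bj = 0;
        for j = find(~used)'
            a = gts(i,:); b = dets(j,:);
            ix = max(0, min(a(1)+a(3), b(1)+b(3)) - max(a(1), b(1)));
            iy = max(0, min(a(2)+a(4), b(2)+b(4)) - max(a(2), b(2)));
            o = ix*iy / (a(3)*a(4) + b(3)*b(4) - ix*iy);  % intersection over union
            if o > best, best = o; bj = j; end
        end
        if best > thr
            matches(end+1, :) = [i bj];  %#ok<AGROW>
            used(bj) = true;             % each detection matches at most once
        end
    end
    n_missed = size(gts, 1) - size(matches, 1);   % ground truths left unmatched
end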
4.7 Evidence
4.7.1 Project Requirements
4.7.1.1 1 - Associate tracks together.
The software writes all the tracker positions at every frame step to a .mat file, assigning each tracker
a unique ID. Scripts were created to play back this data and watch the associated targets play out, or
to watch one specific target only. The output tracks can serve as the basis for any behaviour
classification or extraction algorithms.
4.7.1.2 2 - Deal with false detections and missing detections
False detections are dealt with well and rarely affect trackers, since the association scores guard
against them and they never form trackers; the change to a more accurate person detector further
helps this case.
Missing detections are dealt with well by the addition of the head detection data, and further
improved by the more accurate detector. See Figure 4.5.
Figure 4.5: SUBITO dataset: pedestrian in non-standard articulation, missed by the pedestrian
detector but found by the head detector.
4.7.1.3 3 - Handle mild Occlusions
Shown in Figure 4.6 below is an example of tracking output in which head detections provide
robustness to trackers. Red boxes represent trackers, yellow boxes associated pedestrian detections,
and blue boxes associated head detections. Focusing on the tracker with ID 2, it is too occluded by
the other two people to be picked up by the pedestrian detector, yet the head detection succeeds in
identifying it, stabilising the tracker's position and keeping an accurate representation of the target's
state under this heavy occlusion (more than 50% occluded), exceeding the requirement.
Figure 4.6: SUBITO dataset: Pedestrian inter-occlusion, head det correctly associated (blue).
Chapter 5
Evaluation
5.1 Tools
Datasets were manually ground-truthed by annotating images with bounding boxes indicating where
each pedestrian is. As this is a repetitive task for a human to perform over many hundreds of images,
every 4th frame was annotated. Utilising the evaluation tools constructed as part of the
implementation, a dataset was manually annotated with pedestrian positions. Using these ground
truths, the evaluator calculates metrics for correctly tracked targets, falsely tracked frames, and
pedestrians without trackers. These are the best metrics possible with simple annotation; evaluating
switched identities would require pedestrian ID information in the ground truth.
Below, the output of the two detectors is analysed by invoking the Matlab script
comparedets2('data/eslangtest2/gt','data/eslangtest2/images'), with the fastHOG and
Malik detections residing in data/eslangtest2/cdets and data/eslangtest2/jdets respectively.
                           Detector
                           fastHOG    Malik
Detection Overlaps GT      721        633
False Detections           97         282
Missing Detections         65         153
Figure 5.1: Evaluation metrics produced by matlab evaluation script comparing SUBITO ground truths to
fastHOG and Malik detector outputs.
Ideally, multiple datasets would have been ground-truthed to test the performance of the object
detectors over a wide variety of scenarios, but due to time restrictions and the length of time it takes
to manually ground truth a dataset, only a sample of the SUBITO dataset was used.
                          Evaluation Criteria
Implementation            Detection Overlaps GT    False Detections    Missing Detections
Original w/o conf         693                      500                 93
fastHOG w/o conf          720                      59                  60
Original + conf           721                      454                 65
fastHOG + conf            721                      53                  65
fastHOG + head            724                      38                  61
fastHOG + head + conf     726                      59                  60
Figure 5.2: Quantitative evaluation on the ground-truthed SUBITO dataset for all implementations.
While this table of results is only representative of a small sample of a single dataset, the initial
results look promising: the changed detector results in far fewer false detections and therefore fewer
false initialisations of trackers. The combination of the fastHOG detector and the addition of head
detections makes for very strong detections with few false positives. The results show that adding
the confidence density to fastHOG + head detections lets it find a couple more detections, but at the
cost of far more false detections.
5.2 Failures
The most identifiable point of failure comes from the tracker's movement after no detections have
been associated for a while, as in the case of heavy occlusion. When this happens, the particles try
to guide the tracker in the estimated direction the pedestrian was moving, eventually causing the
tracker to drift away from the target's true location. Under heavy occlusion, neither detections nor
the appearance model can tell whether the target is there, and as a result the tracker drifts.
Another source of failure results from the detector itself, a caveat of tracking by detection: the
tracking framework cannot be expected to track a target that is never detected in the first place, as
when body articulation is abnormal or heavy occlusion obscures a person from view for a large part
of their path.
5.3 Overall Performance
Based on the speed-up gained and the strong results shown in Figure 5.2, I conclude this was a
successful first look into the problem.
Mild to heavy occlusions are handled well by both the appearance model and the additional head
detections, and tracker initialisation very rarely produces trackers for non-targets. The biggest
problem noticed is the tracker drifting far enough off the target that it is no longer associated,
causing a new tracker to be initialized while the old one persists as an error until it dies out.
With further time, the project would investigate different sampling methods and alterations to the
particle filter, so that a large variance can still be modelled without the models drifting as easily.
Bibliography
[1] M.S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online
nonlinear/non-Gaussian Bayesian tracking. Signal Processing, IEEE Transactions on, 50(2):174-188, 2002.
[2] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR
MOT metrics. Journal on Image and Video Processing, 2008:1, 2008.
[3] M.D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Online multi-person
tracking-by-detection from a single, uncalibrated camera. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2010.
[4] M.D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Online multi-person
tracking-by-detection from a single, uncalibrated camera. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2010.
[5] M.D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool. Robust tracking-by-
detection using a detector confidence particle filter. In Computer Vision, 2009 IEEE 12th International
Conference on, pages 1515-1522. IEEE, 2009.
[6] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Cordelia
Schmid, Stefano Soatto, and Carlo Tomasi, editors, International Conference on Computer Vision &
Pattern Recognition, volume 2, pages 886-893, INRIA Rhone-Alpes, June 2005.
[7] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. 2009.
[8] M. Enzweiler and D.M. Gavrila. Monocular pedestrian detection: Survey and experiments. IEEE
Transactions on Pattern Analysis and Machine Intelligence, pages 2179-2195, 2009.
[9] J. Ferryman and A. Shahrokni. An overview of the PETS 2009 challenge. 2009.
[10] J. Ferryman and A. Shahrokni. PETS2009: Dataset and challenge. In Performance Evaluation of
Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE International Workshop on, pages 1-6.
IEEE, 2009.
[11] D.M. Gavrila and S. Munder. Multi-cue pedestrian detection and tracking from a moving vehicle.
International Journal of Computer Vision, 73(1):41-59, 2007.
[12] M. Isard and A. Blake. Condensation - conditional density propagation for visual tracking.
International Journal of Computer Vision, 29(1):5-28, 1998.
[13] Z. Kalal, K. Mikolajczyk, and J. Matas. Forward-backward error: Automatic detection of tracking
failures. In 2010 International Conference on Pattern Recognition, pages 2756-2759. IEEE, 2010.
[14] H.W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics
Quarterly, 2(1-2):83-97, 1955.
[15] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application
to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence -
Volume 2, pages 674-679, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc.
[16] Subhransu Maji, Alexander C. Berg, and Jitendra Malik. Classification using intersection kernel
support vector machines is efficient. IEEE Computer Vision and Pattern Recognition, 2008.
[17] A. Mittal and N. Paragios. Motion-based background subtraction using adaptive kernel density
estimation. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004
IEEE Computer Society Conference on, volume 2, pages II-302. IEEE, 2004.
[18] J.D. Naumann and A.M. Jenkins. Prototyping: the new paradigm for systems development. MIS
Quarterly, pages 29-44, 1982.
[19] C. Nvidia. Compute Unified Device Architecture Programming Guide. NVIDIA: Santa Clara,
CA, 83:129, 2007.
[20] T. Ojala. Multiresolution gray-scale and rotation invariant texture classification with local binary
patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 971-987, 2002.
[21] C. Papageorgiou and T. Poggio. A trainable system for object detection. International Journal of
Computer Vision, 38(1):15-33, 2000.
[22] Victor Prisacariu and Ian Reid. fastHOG - a real-time GPU implementation of HOG. Technical
report.
[23] V.K. Singh, B. Wu, and R. Nevatia. Pedestrian tracking by associating tracklets using detection
residuals. 2008.
[24] Jan Sochman. Matlab re-implementation of Online multi-person tracking-by-detection from a
single, uncalibrated camera.
[25] C. Stauffer and W.E.L. Grimson. Learning patterns of activity using real-time tracking. Pattern
Analysis and Machine Intelligence, IEEE Transactions on, 22(8):747-757, 2000.
[26] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. 2001.
[27] A.N. Whitehead. An Introduction to Mathematics. Forgotten Books, 1924.
[28] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by
Bayesian combination of edgelet part detectors. In Computer Vision, 2005. ICCV 2005. Tenth IEEE
International Conference on, volume 1, pages 90-97. IEEE, 2005.
[29] Zdenek Kalal, Jiri Matas, and Krystian Mikolajczyk. Online learning of robust object detectors
during unstable tracking. 3rd On-line Learning for Computer Vision Workshop, 2009.
Appendix A
Personal Reection
Overall I believe the production of a solution to my problem has gone reasonably well, but there are many things that could be improved upon if a similar project were to be undertaken again. As advice for future students, I stress the importance of becoming familiar with your environment before starting any real programming; in C++, dependency and linking issues prevented the detector from working for far too long. Also experiment with your debugging facilities: it was not until half way through my project that I learned MATLAB supports conditional breakpoints, and that once you can make a script pause at a certain point in its execution, you can explore all of the variables in scope at that point, giving great insight into and understanding of the program flow. Had I known this at the beginning, debugging would have been much faster and I might have had time to incorporate more advanced techniques.
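As an illustration, a conditional breakpoint can be set from the MATLAB command window with dbstop; the sketch below uses a hypothetical file name, line number and condition:

% Pause tracker.m at line 42, but only once the frame counter
% exceeds 100 (file, line and condition are examples only).
dbstop in tracker.m at 42 if frame > 100

% While paused, any variable in scope can be inspected,
% e.g. size(particles) or imshow(img); resume with dbcont.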
A mistake I also made was refining the implementation without keeping a log of results along the way. Every time a new iteration of experimentation is run, I would advise writing some brief notes and taking screen captures if relevant. These will drastically help the write-up and evaluation later on, as well as allowing you to come back after breaks and pick up where you left off.
The use of LaTeX also comes highly recommended: in exchange for getting used to a few commands and some mark-up, you gain the freedom of hardly having to worry about formatting at all, leaving you more time to spend on content rather than presentation.
Appendix B
B.1 Existing implementation code
The existing tracking-by-detection implementation that served as the starting point for this project was provided by [24].
Appendix C
Code
function [pos] = groundtruth(imdir)
% interpolation spacing
inspace = 4;
ftype = '*.jpg';

d = dir(fullfile(imdir, ftype));
img_list = {d.name};
global hlist;
global hcount;
global pos;
pos = [];

for t = 1:inspace:length(img_list)
    hlist = [];
    % get current frame data
    img = imread(fullfile(imdir, img_list{t}));
    fid = fopen(strcat(fullfile(imdir, img_list{t}), '.gt'), 'w');
    % key release callback
    fighandle = figure('KeyReleaseFcn', @cb); imshow(img);
    set(fighandle, 'CloseRequestFcn', @my_closefcn)
    % load the previous frame's rectangles
    for ti = 1:4:length(pos)
        h = imrect(gca, [pos(ti), pos(ti+1), pos(ti+2), pos(ti+3)]);
        hlist{length(hlist)+1} = h;
    end
    % wait for the user to close the window, then save positions
    waitfor(fighandle)
    for ti = 1:4:length(pos)
        fprintf(fid, '%s, %4.0f, %4.0f, %4.0f, %4.0f\n', img_list{t}, ...
            pos(ti), pos(ti+1), pos(ti+2), pos(ti+3));
    end
    fclose(fid);
    % next image
end

function cb(src, evnt)
    % P for Paint: draw a rectangle bounding box
    if evnt.Character == 'p'
        h = imrect;
        hlist{length(hlist)+1} = h;
    end
    if evnt.Character == 'k'
        % K for Kill: delete the selected box
        set(get(gco, 'Parent'), 'Visible', 'off');
    end
end

function my_closefcn(src, evnt)
    % collect the positions of all boxes still visible
    pos = [];
    for i = 1:length(hlist)
        if isa(hlist{i}, 'handle') && ...
                strcmp(get(hlist{i}, 'Visible'), 'on')
            posit = getPosition(hlist{i});
            pos = [pos posit];
        end
    end
    delete(fighandle)
end
end
function frame = vistracks(tracksf, imgdir, showdets, watch)

if strcmp('', imgdir)
    imgdir = ...
        '/media/6e9668e62f3a4b17a4317bbe83c1997f/FYP/data/eslangtest2/';
end

figure(109);
if ~exist('watch', 'var')
    watch = -1;   % default: show every track
end
if ~exist('showdets', 'var')
    showdets = 0; % default: do not overlay raw detections
end

tracks = load(tracksf);
tracks = tracks.tracks;

d = dir(fullfile(imgdir, '*.jpeg'));
imglist = {d.name};
llist = length(tracks);

trackindex = 1;
for i = 1:length(imglist)
    figure(109);

    hold off;
    imshow(fullfile(imgdir, imglist{i}));
    title(num2str(tracks(trackindex,6)));
    hold on;
    while trackindex < llist+1 && tracks(trackindex,6) == i
        if watch < 0 || watch == tracks(trackindex,1)
            rectangle('Position', [tracks(trackindex,2) ...
                tracks(trackindex,4) ...
                tracks(trackindex,3)-tracks(trackindex,2) ...
                tracks(trackindex,5)-tracks(trackindex,4)], 'EdgeColor', 'r');
        end
        trackindex = trackindex + 1;
        if showdets == 1
            showdet(imgdir, imglist{i})
        end
    end

    drawnow;
end

% rect = get(ha,'Position');
% frame = getframe(ha,rect);
% close(ha);
end

function showdet(imgdir, imagefn)
cdfile = fullfile(imgdir, 'cdets', strcat(imagefn, '.dets'));
fn = fopen(cdfile);
cdts = textscan(fn, '%n %n %n %n %n %n %n %n', 'delimiter', ...
    '\n', 'CollectOutput', 1);
cdets = cdts{1}(:,3:6);
fclose(fn);

for i = 1:length(cdets(:,1))
    de = cdets(i,:);
    rectangle('Position', [de(3), de(4), de(1), de(2)], 'EdgeColor', 'y');
end
end
Figure C.1: Tracks playback script.
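For reference, the playback script might be invoked as follows; the file paths are hypothetical, showdets = 1 overlays the raw detections in yellow, and watch = -1 displays every track:

% Play back all tracks over a sequence, overlaying the detections
% ('results/tracks.mat' and 'data/eslangtest2/' are example paths).
vistracks('results/tracks.mat', 'data/eslangtest2/', 1, -1);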
function [overlap falsedet missingdet] = comparebb(gt, dets)

overlaps = [];
falsep = [];
gtoverlap = zeros(length(gt(:,1)));

fileoverlaps = [];

if length(dets) > 0
    areaintsec = rectint(gt, dets);
    % areaintsec(i,j) is the intersection area of the rectangles specified
    % by the ith row of gt and the jth row of dets

    % find the area of the gt and det boxes
    areagt = abs(gt(:,3)) .* abs(gt(:,4));
    areadt = abs(dets(:,3)) .* abs(dets(:,4));

    % turn the intersection areas into percentages: what proportion of
    % each GT box is covered by each detection
    areapct = [];
    for p = 1:length(areaintsec(:,1))
        areapct = [areapct (areaintsec(p,:) ./ areagt(p))];
    end
    % reshape so that rows index detections and columns index GT boxes
    areapct = reshape(areapct, length(areaintsec(1,:)), length(areaintsec(:,1)));

    % greedily match the best-overlapping pair while any overlap exceeds 40%
    while max(max(areapct) > 0.4) == 1
        [r,c] = find(areapct == max(max(areapct(:))));
        % check the detection isn't engulfing the ground truth simply
        % because it is big: gt size / det size must not fall below 0.35
        correctforoverscale = areagt(c(1)) / areadt(r(1));
        if correctforoverscale > 0.35
            % if it's valid, add one to the positive detections
            fileoverlaps = [fileoverlaps 1];
            % and prevent that GT box from being matched again
            areapct(:,c(1)) = -1;
        else
            % if it was too big, prevent this pair from being used again
            % and continue the loop over the other detections
            areapct(r(1),c(1)) = -1;
        end
    end
else
    areaintsec = [];
end

[dsn dsm] = size(dets);
[gsn gsm] = size(gt);
overlap = sum(fileoverlaps);
falsedet = dsn - overlap;
missingdet = gsn - overlap;
end
Figure C.2: Greedy bounding box evaluator.
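A small worked example of the evaluator, assuming both inputs are n-by-4 matrices of [x y width height] rectangles as expected by rectint (the coordinates are illustrative only):

% Two hand-labelled pedestrians and two detector responses.
gt   = [100 50 40 90;
        300 60 35 85];
dets = [102 48 42 95;    % closely overlaps the first GT box
        500 10 30 60];   % spurious response
[overlap, falsedet, missingdet] = comparebb(gt, dets);
% overlap = 1, falsedet = 1, missingdet = 1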
function [pos] = groundtruth(imdir)
% interpolation spacing
inspace = 4;
ftype = '*.jpg';

d = dir(fullfile(imdir, ftype));
img_list = {d.name};

global hlist;
global hcount;
global pos;
pos = [];

for t = 1:inspace:length(img_list)

    hlist = [];

    % get current frame data
    img = imread(fullfile(imdir, img_list{t}));
    fid = fopen(strcat(fullfile(imdir, img_list{t}), '.gt'), 'w');

    % key release callback
    fighandle = figure('KeyReleaseFcn', @cb); imshow(img);
    set(fighandle, 'CloseRequestFcn', @my_closefcn)

    % load the previous frame's rectangles
    for ti = 1:4:length(pos)
        h = imrect(gca, [pos(ti), pos(ti+1), pos(ti+2), pos(ti+3)]);
        hlist{length(hlist)+1} = h;
    end

    % wait for the user to close the window, then save positions
    waitfor(fighandle)

    for ti = 1:4:length(pos)
        fprintf(fid, '%s, %4.0f, %4.0f, %4.0f, %4.0f\n', img_list{t}, ...
            pos(ti), pos(ti+1), pos(ti+2), pos(ti+3));
    end
    fclose(fid);
    % next image
end

function cb(src, evnt)
    % P for Paint: draw a rectangle bounding box
    if evnt.Character == 'p'
        h = imrect;
        hlist{length(hlist)+1} = h;
    end

    if evnt.Character == 'u'
        % U for Update: update the list of positions
        pos = [];
        for i = 1:length(hlist)
            if isa(hlist{i}, 'handle') && ...
                    strcmp(get(hlist{i}, 'Visible'), 'on')
                posit = getPosition(hlist{i});
                pos = [pos posit];
            end
        end
        a = pos   % echo the current positions to the console
    end

    if evnt.Character == 'k'
        % K for Kill: delete the selected box
        set(get(gco, 'Parent'), 'Visible', 'off');
    end

    if evnt.Character == ' '
        % space closes the current frame
        close(fighandle);
    end
end

function my_closefcn(src, evnt)
    % collect the positions of all boxes still visible
    pos = [];
    for i = 1:length(hlist)
        if isa(hlist{i}, 'handle') && ...
                strcmp(get(hlist{i}, 'Visible'), 'on')
            posit = getPosition(hlist{i});
            pos = [pos posit];
        end
    end
    delete(fighandle)
end

end
Figure C.3: Ground truther GUI.
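The ground-truthing tool might be launched as below (the directory is an example). While a frame is open, 'p' paints a new bounding box, 'k' kills the selected box, 'u' refreshes the stored positions, and the space bar closes the frame so the boxes are written to the .gt file:

% Label every 4th frame of a sequence (example path).
pos = groundtruth('data/eslangtest2/');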
function [w confs] = appearance_model(X, track_id, img, ih, track_size, ...
    pet_dets, hed_dets, ...
    assoc, headassoc, trackers_dist, classif, params)

w = 1e-20;
confs = zeros(1, 3);

Xcentre = X.pos + [track_size, track_size/2];

% associated body detection term
if ~isempty(assoc)
    assoc_det_idx = assoc(find(assoc(:, 1) == track_id), 2);
    if ~isempty(assoc_det_idx)
        drcentre = pet_dets.dr(assoc_det_idx, 1:2) + ...
            [pet_dets.sizes(assoc_det_idx) pet_dets.sizes(assoc_det_idx)/2];
        dist = norm(Xcentre - drcentre);

        confs(1) = params.beta * normpdf(dist, 0, params.sigma_det);
        w = w + confs(1);
    end
else
    assoc_det_idx = [];
end

% associated head detection term
if ~isempty(headassoc)
    assoc_det_idx = headassoc(find(headassoc(:, 1) == track_id), 2);
    if ~isempty(assoc_det_idx)
        drcentre = hed_dets.dr(assoc_det_idx, 1:2) + ...
            [hed_dets.dr(assoc_det_idx,4)/2 + (params.hcscale*track_size) ...
             hed_dets.dr(assoc_det_idx,3)/2];
        dist = norm(Xcentre - drcentre);

        confs(2) = params.hbeta * normpdf(dist, 0, params.sigma_det);
        w = w + confs(2);
    end
else
    assoc_det_idx = [];
end

% online classifier term
confs(3) = params.eta * classif_eval(classif, img, ih, X.pos, ...
    track_size, params);
w = w + confs(3);
Figure C.4: Head-detection appearance model (appearencemodel.m).
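Reading the weight off the code above, each particle is scored by (in the notation of the parameters in the listing)

w = \beta \, \mathcal{N}(d_b; 0, \sigma_{det}) + \beta_h \, \mathcal{N}(d_h; 0, \sigma_{det}) + \eta \, c(\mathbf{x})

where d_b and d_h are the distances from the particle centre to the associated body and head detections, c(\mathbf{x}) is the on-line classifier confidence at the particle position, and a detection term only contributes when the corresponding association exists.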