
Driving Scene Recognition
A Thesis
Submitted in partial fulfillment of
the requirements for the degree of
Master of Technology
by
Vivek Barsopiya
(Roll No. 163190012)
Industrial Engineering and Operations Research
Indian Institute of Technology Bombay
Mumbai 400076 (India)
Acceptance Certificate
Industrial Engineering and Operations Research
Indian Institute of Technology, Bombay
The thesis entitled “Driving Scene Recognition” submitted by Vivek Barsopiya (Roll No.
163190012) may be accepted for being evaluated.
Date: 2 July 2018
Approval Sheet
This thesis entitled “Driving Scene Recognition” by Vivek Barsopiya is approved for the
degree of Master of Technology.
Examiners
Supervisor (s)
Chairman
Date:
Place:
Declaration
I declare that this written submission represents my ideas in my own words and where
others’ ideas or words have been included, I have adequately cited and referenced the
original sources. I declare that I have properly and accurately acknowledged all sources
used in the production of this report. I also declare that I have adhered to all principles of
academic honesty and integrity and have not misrepresented or fabricated or falsified any
idea/data/fact/source in my submission. I understand that any violation of the above will
be a cause for disciplinary action by the Institute and can also evoke penal action from
the sources which have thus not been properly cited or from whom proper permission has
not been taken when needed.
Vivek Barsopiya
(Roll No. 163190012)
Date: 2 July 2018
Acknowledgements
I would like to extend my thanks to the many people who so generously contributed to the
work presented in this thesis. Special mention goes to my enthusiastic supervisors,
Prof. Manjesh Hanawal, Prof. Narayan Rangaraj and Dr. Tilak Raj Singh, for
giving me the opportunity to work under their guidance. Their direction, motivation,
affectionate guidance and support have been the source of inspiration in bringing the thesis
to this shape. I thank all the other faculty members of Industrial Engineering and Operations
Research, who made me realize the virtue of learning through sustained hard work.
I would also like to thank all those whose names I have missed but who have contributed in
any form to the building up of this thesis.
Vivek Barsopiya
Indian Institute of Technology Bombay
2 July 2018
Abstract
Vehicles are equipped with RADAR and monocular camera sensors. These sensors can be
used to improve driving safety and the user's experience by detecting driving situations in
order to alert the user or change the car's parameters. We developed real-time and offline
models trained on a dataset with scenarios limited to Lead Car takes Left Turn, Lead Car
takes Right Turn, and Other. A combination of C3D, 2D CNN and LSTM layers gave us a
frame-wise test accuracy of 98.5% and 96.5% in offline and real-time scene recognition,
with kappa of 94% and 88% respectively.
Table of Contents
Acknowledgements
Abstract
List of Figures
List of Tables
1 Introduction
1.1 Problem
1.2 Motivation
1.3 Objective
1.3.1 Driving Scene Recognition
1.4 Thesis Outcome
1.5 Thesis Outline
2 Literature Review
2.1 Model categorization
2.1.1 Time Distributed 2D CNN + Recurrent Layer
2.1.2 CNN Slow Fusion
2.1.3 3D CNN
2.2 Action prediction Tasks
2.2.1 Action Recognition
2.2.2 Action Detection
2.2.3 Action Localization
2.3 Supervision categories
2.3.1 Video Level Annotation
2.3.2 Frame Level Bounding Box Annotation
2.3.3 Frame Level Bounding Box + Supervised Attention Maps
3 Different Approaches & Limitations
3.1 Ensemble Of Models + Algorithm
3.2 3D CNN
3.3 Time Distributed 2D CNN + Dense LSTM
4 Experimentation & Results
4.1 GTA 5 Experiments
4.1.1 GTA 5 Dataset
4.1.2 3D CNN model
4.1.3 Xception + ConvLSTM
4.1.4 Xception + BiDirectional ConvLSTM
4.2 RealRoadScene Dataset
4.2.1 3D CNN + Xception + BiDirectional ConvLSTM
4.2.2 3D CNN + Xception + UniDirectional ConvLSTM
5 Conclusion
6 Future Work
6.1 Hierarchical Classification
6.2 Translation Variant CNN layers
6.3 Guided Attention Inference Network
List of Figures
1.1 Characteristics of ultrasonic, camera and RADAR sensors. Adapted from [1]
1.2 Characteristics of LiDAR sensor. Adapted from [2]
2.1 Red block is a feature cube of a frame from the CNN, green block is an LSTM cell, blue block is the output. Adapted from [3]
2.2 Saliency maps used to localize activity. Adapted from [7]
2.3 a) Applying 2D convolution on an image results in an image. b) Applying 2D convolution on a video volume (multiple frames as multiple channels) also results in an image. c) Applying 3D convolution on a video volume results in another volume, preserving temporal information of the input signal. Adapted from [8]
2.4 Saliency maps used to localize activity. Adapted from [4]
2.5 Video Classification and Recognition when frame level labels are missing. Adapted from [11]
2.6 Video Detection and Localization when frame level labels and bounding box information are present. Adapted from [11]
2.7 Guided attention to correct disappointing CAM maps and learn correct correlations. Adapted from [5]
3.1 Snapshot of Vision Based Collision system
3.2 (Left) Input 3D tensor (Right) Forward pass in ConvLSTM. Adapted from [6]
4.1 (Left) A frame with label Left (Right) Mirrored frame with label Right
4.2 Train Model and input output setup
4.5 C3D + Xception + BiDirectional network
4.6 C3D + Xception + BiDirectional network
4.7 CAM Maps visualizing ConvLSTM's class attention for Xception with Bidirectional convolution model
4.8 C3D + Xception + UniDirectional network
4.9 CAM Maps visualizing ConvLSTM's class attention for Xception with Bidirectional convolution model
6.1 Yellow blocks are end classes
6.2 Each class attribute is projected to the visual space. In the visual space each class is represented by a Gaussian distribution
6.3 SCNN implementation. Adapted from [16]
List of Tables
4.1 RealRoadScene Data action Class Distribution
4.2 Augmented RealRoadScene Data action class Distribution
4.3 Model Prediction Accuracy test dataset Simple
Chapter 1
Introduction
1.1 Problem
Advanced driver-assistance systems (ADAS) are systems developed to automate, adapt
and enhance vehicle systems for safety and better driving. ADAS reduces road fatalities
by minimizing human error. A low-cost ADAS working only on a RADAR feed senses
nothing but depth information, which limits its assistance capabilities. However, one can
improve the capabilities of the ADAS by adding a monocular camera, so that the
environment is sensed with both visual and depth information.
1.2 Motivation
ADAS systems use sensing devices such as RADAR, LiDAR, ultrasound and mono/stereo
cameras to sense the environment and either warn the user or take control of the car. A
fusion of RADAR, camera and ultrasonic sensors is more robust and informative than LiDAR
while having a lower cost. Figures 1.1 and 1.2 compare the price and performance of
these sensors. LiDAR is the most accurate sensor, whereas the other sensors are noisy.
Cameras are the densest and most informative sensors and closely resemble the human
vision system. This motivates replacing LiDAR with a cheaper and arguably better camera
sensor.
Figure 1.1: Characteristics of ultrasonic, camera and RADAR sensors. Adapted from [1]
Figure 1.2: Characteristics of LiDAR sensor. Adapted from [2]
1.3 Objective
Autonomous driving involves several tasks that are commonly solved using a LiDAR sensor.
We began with the Driving Scene Recognition task using a monocular camera. The objective
of this thesis is to provide a generalized solution for Driving Scene Recognition using a
grayscale monocular camera.
1.3.1 Driving Scene Recognition
The task is to recognize the relevant scenarios that may occur while driving, in order to
either alert the driver or apply the emergency brakes. A few examples of scenarios that may
occur while driving (see footnote 1):
1. Lead Car turns Left
2. Lead Car turns Right
3. Lead Car Blocks the path
4. Pedestrian Blocks the path
5. Animal Blocks the path
6. Lead Car collides with Living Thing
7. Lead Car collides with Non-Living Thing
8. None of the above (Clear)
1.4 Thesis Outcome
I have given a theoretical framework for solving the Scene Recognition problem for
different ground truth annotations and for the kind of prediction task required. I trained
offline and real-time Driving Scene Recognition models with very high accuracy using a
fusion of models, and created a web server for users to interact with the system.
1. Theoretical Framework
(a) Discussed different loss functions and their usage depending on the task and the
ground truth annotation available.
(b) Discussed the characteristics of the layers available for spatiotemporal datasets,
along with each layer's usage and limitations.
2. Quantifiable Outcomes
(a) Trained Offline and Real-time Driving Scene Recognition with 98.5
(b) Deployed this model on a web server for users to interact with and to assist them in
generating ground truth annotations.
Footnote 1: We restricted ourselves to the scenarios "Lead Car turns Left", "Lead Car turns
Right" and "None of the above (Clear)" because of the unavailability of a labeled dataset.
However, our models are built for the generic action recognition task, which makes them
capable of classifying new classes.
1.5 Thesis Outline
Chapter 1: Introduction
We explain the problem, motivation, objectives and outcomes.
Chapter 2: Literature Review
We discuss spatiotemporal feature extraction layers, namely 3D CNN, ConvLSTM and
Time Distributed 2D CNN. We also discuss three primitive video learning tasks for
action prediction, namely Action Recognition, Action Detection and Action
Localization. Finally, we discuss the types of ground truth annotation, along with the
work of a few authors on recognition, detection and localization tasks for each
kind of ground truth.
Chapter 3: Layer/Model Characteristics and Limitations
We characterize layers and their limitations based on spatiotemporal features, trainable
parameters, temporal footprint and attention mechanism.
Chapter 4: Experimentation and Results
We model a network by fusing several models to overcome the limitations of the
individual approaches.
Chapter 2
Literature Review
The Driving Scene Recognition problem can be considered as Action Recognition
and Localization in the temporal and spatial domains. There is an enormous body of research
on extracting spatiotemporal features. Based on an extensive literature survey of
models that use only video feed and flow information, the literature can mainly be grouped
by the action prediction task and by the models used. Sections 2.1 and 2.2 discuss
spatiotemporal models and the tasks involved in action prediction. Section 2.3 explains a few
papers, mainly focusing on self-supervised models and their loss functions.
2.1 Model categorization
Video is a combination of spatial and temporal signals, so we need to extract spatiotemporal
features for video analytics tasks. There are a few network templates / DNN blocks that
researchers use in several innovative ways to improve scores on common public
datasets. I have listed a few of them below.
2.1.1 Time Distributed 2D CNN + Recurrent Layer

There are many interesting properties that one can get from combining convolutional
neural networks (CNN) and recurrent neural networks (RNN). The combination makes
use of the best of both the spatial and temporal worlds: CNNs are good at
dealing with spatially related data, while RNNs are good at temporal signals. A CNN-RNN
pair is therefore useful for sequence-to-sequence problems in one of the following
input-to-output setups.
Figure 2.1: Red block is a feature cube of a frame from the CNN, green block is an LSTM
cell, blue block is the output. Adapted from [3]
One to Many : Maps a single input to an output sequence. In applications such as
automatic image captioning, the system needs to describe the content of an image
(single input) as a sequence of words.
Many to One : In this case we have a video (sequence of images) as input
and we need to classify the video into a single static class category. A particular
application is human action recognition on trimmed/untrimmed clips: from
a sequence of video frames we need to identify a human action such as walking,
sitting, jumping or running. Such a system could also identify ongoing
crimes, such as riots (which requires a low temporal footprint), just from
video feeds in real time. It maps a sequence of spatial inputs to a single static output
class.
Many to Many (fixed length) : We want to label the individual frames of a video
input. The task is to learn a sequence of labels for a sequence of images (video). This
is applicable when an untrimmed video clip consists of an actor performing multiple
actions such as running, walking, and then sitting, so that labels are assigned to
subclips or frames. It maps a sequence of spatial inputs to an equal-length sequence of
spatial/non-spatial static output classes.
In all the cases the CNN acts as the trainable feature detector for the spatial signal.
It learns powerful convolutional features which operate on a static spatial input (frame),
while the RNN receives a sequence of such high-level representations to generate a
description of the content or map to some static class of outputs. In image captioning /
action detection there is normally an attention mechanism built in to allow the system to
attend to certain image parts while generating the captions. Instead of using a typical
vanilla RNN, one can use long short-term memory (LSTM), gated recurrent unit (GRU) or
ConvLSTM networks to eliminate the issue of long-term dependencies. This makes
CNN-LSTM or CNN-GRU pairs extremely powerful in dealing with spatiotemporal
signals.
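The difference between the many-to-one and many-to-many setups can be illustrated with a minimal sketch. The scalar recurrent cell, its weights `w_x` and `w_h`, and the per-frame feature values are all hypothetical stand-ins for the CNN features and the LSTM; only the input/output pattern matters here.

```python
import math


def rnn_step(x, h, w_x=0.5, w_h=0.8):
    """One step of a toy scalar recurrent cell (hypothetical weights)."""
    return math.tanh(w_x * x + w_h * h)


def run_sequence(frame_features, mode):
    """Run the toy cell over per-frame features.

    mode="many_to_one"  returns a single output for the whole clip.
    mode="many_to_many" returns one output per input frame.
    """
    h = 0.0
    outputs = []
    for x in frame_features:
        h = rnn_step(x, h)
        outputs.append(h)
    if mode == "many_to_one":
        return outputs[-1]
    if mode == "many_to_many":
        return outputs
    raise ValueError("unknown mode: " + mode)


# Stand-in for CNN features: one scalar per frame.
features = [0.1, 0.4, -0.2, 0.9]
clip_output = run_sequence(features, "many_to_one")
frame_outputs = run_sequence(features, "many_to_many")
```

In the many-to-one case only the last state is used for classification, whereas the many-to-many case keeps every intermediate state so that each frame can receive its own label.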
2.1.2 CNN Slow Fusion
A. Karpathy et al. [7] use a modified 2D CNN layer to extract spatiotemporal features
with the Slow Fusion model. The Slow Fusion model slowly fuses temporal information
throughout the network, such that higher layers get access to progressively more global
information in both the spatial and temporal dimensions. This is implemented by extending
the connectivity of all convolution layers in time and carrying out temporal convolutions
in addition to spatial convolutions to compute activations. As seen in Figure 2.2, the first
convolution layer is extended to apply every filter with a temporal extent of 4 frames on an
input clip of 10 frames through valid convolution with stride 2, and produces 4 responses
in time. The second and third layers iterate this process with filters of temporal
extent T = 2 and stride 2. Thus, the third convolution layer has access to information
across all 10 input frames.
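The temporal arithmetic of this configuration can be checked with a short sketch. The helper names are hypothetical; the layer settings (extent 4 then 2 then 2, stride 2, 10 input frames) are the ones stated above, and the receptive field is computed with the usual recurrence for stacked strided convolutions.

```python
def temporal_output_length(num_frames, extent, stride):
    """Number of responses in time for a 'valid' temporal convolution."""
    if num_frames < extent:
        raise ValueError("clip is shorter than the temporal extent")
    return (num_frames - extent) // stride + 1


# (temporal extent, stride) for the three convolution layers described above.
layers = [(4, 2), (2, 2), (2, 2)]

length = 10            # input clip length in frames
receptive_field = 1    # input frames seen by one response
jump = 1               # distance in input frames between adjacent responses
lengths = []
for extent, stride in layers:
    length = temporal_output_length(length, extent, stride)
    receptive_field += (extent - 1) * jump
    jump *= stride
    lengths.append(length)
```

With these settings `lengths` is `[4, 2, 1]` and `receptive_field` is 10, which agrees with the statement that the third layer sees all 10 input frames.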
Figure 2.2: Saliency maps used to localize activity. Adapted from [7]
2.1.3 3D CNN
Compared to a 2D ConvNet, a 3D ConvNet [8] models temporal information better
owing to its 3D convolution and 3D pooling operations. In 3D ConvNets,
convolution and pooling operations are performed spatiotemporally, while in 2D ConvNets
they are done only spatially. Figure 2.3 illustrates the difference: 2D convolution
applied on an image outputs an image, and 2D convolution applied on multiple images
(treating them as different channels) also results in an image. Hence, 2D ConvNets lose
the temporal information of the input signal right after every convolution operation. Only
3D convolution and Slow Fusion preserve the temporal information of the input signal,
resulting in an output volume. The same holds for 2D and 3D pooling.
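A minimal pure-Python sketch of this difference follows. It uses nested lists instead of a tensor library, unit stride, valid padding and a single filter; the function names and the toy video are assumptions made for illustration only.

```python
def conv2d_frames_as_channels(video, kernel):
    """2D convolution that treats the T frames as input channels.

    video:  nested list [T][H][W]
    kernel: nested list [T][kh][kw]
    Returns a single image [H'][W']; the time axis is summed away.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kh, kw = len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            s = 0.0
            for t in range(T):
                for a in range(kh):
                    for b in range(kw):
                        s += video[t][i + a][j + b] * kernel[t][a][b]
            row.append(s)
        out.append(row)
    return out


def conv3d(video, kernel):
    """3D convolution: the [kt][kh][kw] kernel also slides along time.

    Returns a volume [T'][H'][W'], so temporal ordering is preserved.
    """
    kt = len(kernel)
    return [conv2d_frames_as_channels(video[t:t + kt], kernel)
            for t in range(len(video) - kt + 1)]


# Toy video: 8 frames of 6x6 pixels, all ones.
video = [[[1.0] * 6 for _ in range(6)] for _ in range(8)]
kernel_2d = [[[1.0] * 3 for _ in range(3)] for _ in range(8)]  # spans all 8 frames
kernel_3d = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]  # spans 3 frames

image = conv2d_frames_as_channels(video, kernel_2d)  # shape 4 x 4
volume = conv3d(video, kernel_3d)                    # shape 6 x 4 x 4
```

The 2D variant collapses the 8 frames into one 4 x 4 image, while the 3D variant returns 6 time steps of 4 x 4 responses.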
Figure 2.3: a) Applying 2D convolution on an image results in an image. b) Applying 2D
convolution on a video volume (multiple frames as multiple channels) also results in an
image. c) Applying 3D convolution on a video volume results in another volume,
preserving temporal information of the input signal. Adapted from [8]
2.2 Action prediction Tasks
A sample can have multiple targets. For example, an image can be segmented, captioned or
classified, and one can also perform object detection or VQA tasks on the same image.
Selecting the right prediction task and the right annotation is therefore very important.
The tasks related to action prediction in videos can mainly be classified by the kind of
annotation they require.
2.2.1 Action Recognition
For this task the dataset should have a single foreground actor performing a single
action throughout the clip. For example, while driving only a single lead car (foreground)
can exist, and the task is simply to recognize the action it performs.
2.2.2 Action Detection
In this case, we need the model to annotate each actor by drawing a labeled bounding
box around it.
2.2.3 Action Localization
The task here is to classify the untrimmed video and also to detect the relevant portions
of the video (frames) where the activity of interest takes place. For example, while driving,
the relevant activities can be the lead car turning either left or right. Frames with
irrelevant activities may show no car in the lead, a car blocking the path, a pedestrian
blocking the path, an animal blocking the path, etc.
A lead car can perform a series of relevant actions, such as slowing down, signaling with
the tail light indicator and turning left. In this case recognizing actions per frame/subclip
is a good option. This can be implemented using a many-to-many sequence model with a pseudo
class "Others" assigned to the frames where all other actions take place.
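As a sketch of how per-frame targets with the pseudo class could be built for such a many-to-many model, the helper below fills every unannotated frame with "Others". The interval format `(start, end, action)` with an exclusive end and the example annotations are assumptions made for illustration, not the annotation format used in this thesis.

```python
def frame_labels(num_frames, annotated_intervals, default="Others"):
    """Build one label per frame from (start, end, action) intervals.

    Intervals are half-open: frames start .. end-1 receive the action.
    Frames not covered by any interval keep the pseudo class.
    """
    labels = [default] * num_frames
    for start, end, action in annotated_intervals:
        for f in range(max(start, 0), min(end, num_frames)):
            labels[f] = action
    return labels


# Hypothetical annotation of a 10-frame clip.
intervals = [
    (2, 4, "Slowing down"),
    (4, 7, "Left indicator"),
    (7, 9, "Turn left"),
]
labels = frame_labels(10, intervals)
```

The resulting list has the same length as the clip, so it can serve directly as the target sequence of an equal-length many-to-many model.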
2.3 Supervision categories
Sometimes we do not have richly annotated data. For example, one may want to do action
localization or action detection but have only video class labels. This section discusses
some self-supervised architectures for action localization and action detection.
2.3.1 video level annotation
Here the training video dataset is annotated with just a class label, e.g. a
trimmed/untrimmed surveillance video clip labeled as accident / no accident. Using the
following models, the analytics tasks below can still be addressed. I have listed a few
works in activity Recognition, Detection and Localization that use only video level
annotation at train time.
Action Recognition
Activity Recognition is a simple task here: since all the required annotation is available,
a fully supervised many-to-one sequential model can solve it.
Action Detection
Action Detection is a slightly more complex task, since bounding box ground truth is not
available. The model has to learn to spatially localize the activity in a self-supervised
way. Zhenyang Li et al. [4] generate bounding box information during inference using an
attention model. Figure 2.4 shows how the heat maps (attention maps) of a highly accurate
model enable us to diagnose where the model is looking. These attention maps are used to
draw bounding boxes.
Figure 2.4: Saliency maps used to localize activity. Adapted from [4]
Action Localization
The model has to identify the frames that represent the desired activity in a
self-supervised way. Waqas Sultani et al. [11] use multiple-instance learning (MIL) for
anomaly localization in long surveillance videos. MIL is a variation on supervised
learning. Instead of receiving a set of instances which are individually labeled, the
learner receives a set of labeled bags, each containing many instances. In the simple case
of multiple-instance binary classification, a bag is labeled negative if all the instances
in it are negative. On the other hand, a bag is labeled positive if there is at least one
instance in it which is positive. From a collection of labeled bags, the learner tries to
induce a concept that will label individual instances correctly. The authors segment a long
video into 32 smaller segments and extract C3D features at the fc6 layer for each segment.
Each snippet, a 4096-dimensional feature labeled as 0/1 (no anomaly/anomaly), is passed to a
Dense network with an MIL loss function. Figure 2.5 explains the learning setup.
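The bag labeling rule and a ranking-style MIL objective can be sketched as follows. The hinge form below, which compares the highest-scoring segment of an anomalous video with the highest-scoring segment of a normal video, is my reading of the loss in [11] and should be treated as an assumption; any additional regularization terms are omitted, and the scores are made-up values.

```python
def bag_label(instance_labels):
    """A bag is positive (1) if at least one instance is positive, else 0."""
    return int(any(instance_labels))


def mil_ranking_loss(pos_bag_scores, neg_bag_scores, margin=1.0):
    """Hinge ranking loss between the top-scoring segments of two bags.

    pos_bag_scores: per-segment anomaly scores of a video labeled anomalous
    neg_bag_scores: per-segment anomaly scores of a video labeled normal
    The loss is zero once the top positive segment outscores the top
    negative segment by at least `margin`.
    """
    return max(0.0, margin - max(pos_bag_scores) + max(neg_bag_scores))


# Hypothetical scores for a few segments of each video.
anomalous_video = [0.1, 0.2, 0.9, 0.3]
normal_video = [0.2, 0.3, 0.1, 0.2]
loss = mil_ranking_loss(anomalous_video, normal_video)
```

Only video-level labels are needed to form the two bags, yet minimizing such a loss pushes individual segment scores apart, which is what allows anomalies to be localized in time.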
Figure 2.5: Video Classification and Recognition when frame level labels are missing.
Adapted from [11]
2.3.2 Frame Level Bounding Box Annotation
In this dataset, each action in the video is annotated by bounding boxes at every
frame. All of the tasks, such as Recognition and Localization, are easy with this richly
annotated data. for a video with m Rui Hou et al. [12] solve Action Detection when
multiple actions occur simultaneously by porting Faster R-CNN to the video domain:
ROI is replaced by TOI, ROI pooling by TOI pooling, and 2D CNN by 3D CNN.
Figure 2.6: Video Detection and Localization when frame level labels and bounding box
information are present. Adapted from [11]
2.3.3 Frame Level Bounding Box + Supervised Attention Maps

Neural networks work by finding correlations between features and target
classes. A causal relation between two events exists if the occurrence of the first causes
the other. The first event is called the cause and the second event is called the effect. A
correlation between two variables does not imply causation. On the other hand, if there
is a causal relationship between two variables, they must be correlated. Sometimes, depending
on the dataset, a neural network finds a non-causal correlation between class labels and
features, which may give good validation results but disappointing outcomes in real-world
tests. For example, the Boat class in Pascal VOC 2012 contains images of boats on a lake or
the sea. Trained neural networks exploit this correlation in the dataset and start
predicting Boat just by looking at the water inside the images. The validation set also
contains boat and water occurring together, so one gets high validation accuracy. This
problem can be visualized using saliency maps to understand the network's reasoning behind
its prediction. The problem lies with the dataset rather than the model, and it can be
solved by changing either the dataset or the loss function in such a way that the
non-causal correlation cannot be learnt.
a. Recollect a dataset where this non-causal correlation is absent
Using saliency maps, we can understand the network's reasoning and collect a new
dataset where the incorrect/unwanted features cannot be learnt.
b. Force the network to extract proper features
Kunpeng Li et al. [5] modified the loss function in a fully differentiable network to
force the model to learn proper spatial features using supervised attention maps. The
modified loss function requires human-annotated saliency maps.
Figure 2.7: Guided attention to correct disappointing CAM maps and learn correct
correlations. Adapted from [5]
Chapter 3
Different Approaches & Limitations
In this chapter we discuss the limitations of different approaches. We first tried
algorithms because of the small dataset. In section 3.1, we discuss how these algorithms
work on an ensemble of pretrained networks that extract information such as scene objects,
their depths and the ego-lane. We quickly realized that scalability in detecting new
scenarios is a very big limitation. In sections 3.2 and 3.3, we explore deep learning
techniques for video tasks. There we discuss different networks and filters (CNN
and LSTM layers) and their limitations with respect to our Driving Scene Recognition
problem and dataset. We used saliency maps and adversarial test samples collected from GTA 5
to understand the features learnt by the network.
3.1 Ensemble Of Models + Algorithm

We built a collision warning system using a monocular camera. In our vision-based
collision warning system we replaced the RADAR with a Depth Estimation Network, since
RADAR plays an important role in determining obstacles. The system produced many false
positives, since vehicles that are close to the ego vehicle but not on its driving path are
also flagged by the collision warning system. So, we also tried to fuse an ego-lane
segmentation model to identify whether a vehicle is directly in our path and thereby reduce
the false warning rate. The lane segmentation algorithm fails when lane markings are not
found, which is usually the case at intersections. Since the lane detection system has a
high failure rate, we stopped further development of these algorithms.
A demo of a deliverable on the Vision Based Collision Warning System (whether
a vehicle is close enough and in the direct path) can be seen at
https://youtu.be/zWDouBy7DVU. The video has 4 video streams, which are explained in
Figure 3.1.
Figure 3.1: Snapshot of Vision Based Collision system
Top Left : The actual warning system, which warns the driver of a collision.
Top Right : Internal working of the algorithm, which uses the Depth Estimation and Object
Detection streams to warn the user. Please refer to Algorithm 1 for exact implementation
details.
Bottom Left : SSD Object Detector [9] trained on the COCO dataset, detecting vehicles in
the frame.
Bottom Right : MonoDepth [10] is a depth estimation network trained on the CamVid dataset
to estimate depth in the scene using a monocular camera feed. Whiter color represents
closer objects.
The pseudocode for the Vision Based Collision Warning System is given below.
video ← VideoCapture(fileName)
while True do
    frame ← video.readFrame()                      // Read frame from input video
    objects ← SSD.detectObjects(frame)             // Detect all the objects in frame
    depthMap ← Monodepth.estimateDepth(frame)      // Estimate nearness of the objects in frame
    for object in objects do
        if object.class = vehicle then
            croppedDepthMap ← depthMap.crop(object.boundingBox)
            if mean(croppedDepthMap) ≥ threshold then    // Nearness of a vehicle object greater than threshold
                DrawRedWarningBox(object.boundingBox)    // Raise warning
            end
        end
    end
end
Algorithm 1: Vision Based Collision Warning System
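The per-frame decision logic of Algorithm 1 can be sketched in plain Python. The detector and depth network are replaced here by hand-written stand-ins: the detection dictionaries, the `(x0, y0, x1, y1)` box format, the nearness map values and the threshold are all hypothetical and are not the outputs of the real SSD or MonoDepth models.

```python
from statistics import mean


def crop(depth_map, box):
    """Cut the region (x0, y0, x1, y1) out of a row-major nearness map."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in depth_map[y0:y1]]


def collision_warnings(detections, depth_map, threshold):
    """Return the boxes of vehicles whose mean nearness reaches the threshold.

    detections: list of {"class": str, "box": (x0, y0, x1, y1)}
    depth_map:  nested list where larger values mean closer objects
    """
    warnings = []
    for det in detections:
        if det["class"] != "vehicle":
            continue
        values = [v for row in crop(depth_map, det["box"]) for v in row]
        if values and mean(values) >= threshold:
            warnings.append(det["box"])
    return warnings


# Toy frame: near objects on the left, far objects on the right.
depth_map = [[0.9, 0.9, 0.5, 0.5, 0.1, 0.1] for _ in range(4)]
detections = [
    {"class": "vehicle", "box": (0, 0, 2, 2)},  # near vehicle
    {"class": "vehicle", "box": (4, 0, 6, 2)},  # far vehicle
    {"class": "person", "box": (0, 2, 2, 4)},   # near, but not a vehicle
]
warnings = collision_warnings(detections, depth_map, threshold=0.7)
```

Note that the rule depends only on nearness, which is consistent with the false positives described above for vehicles that are close but not on the driving path.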
Limitations
a. Generalizing to Multiple Scenarios
For every new relevant scenario that may come up in future, one has to update the
algorithm to recognize it. Moreover, many complex scenarios, such as automatically
recognizing road accidents, cannot be solved by modifying the algorithm.
b. Dependence on Lane Segmentation
This approach is, for now, highly dependent on accurate lane segmentation. Many of the
scenarios we are interested in happen at intersections, where lane markings are either
absent or highly complex, which challenges lane segmentation.
3.2 3D CNN
A 3D CNN can extract spatiotemporal features such as oscillating colors, moving edges,
etc. A many-to-one 3D CNN model such as C3D [8], consisting of 800 million training
parameters, is memory and parameter intensive. It requires a very large dataset to prevent
overfitting.
Limitations
a. Small Temporal Footprint
These models have giant feature cubes (batch size, #frames, height, width, channels).
To keep the memory consumption low we need to restrict the number of frames. To the best
of my knowledge, a many-to-one 3D CNN + Dense model has not been used in the literature to
extract long spatiotemporal features.
b. Small Resolution
With smaller resolution (112*112) frames, extracting tail light information is difficult,
since such small visual details are lost at low resolution.
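The memory pressure behind the first limitation can be estimated with simple arithmetic. The sketch below counts the bytes of a single activation tensor of shape (batch size, #frames, height, width, channels) stored as 32-bit floats; the concrete sizes are illustrative values chosen for the example, not measurements from our models.

```python
def feature_cube_bytes(batch_size, frames, height, width, channels,
                       bytes_per_value=4):
    """Memory of one activation tensor, assuming 32-bit floats by default."""
    return batch_size * frames * height * width * channels * bytes_per_value


# Illustrative sizes: 8 clips of 16 frames at 112x112 with 64 feature channels.
short_clip = feature_cube_bytes(8, 16, 112, 112, 64)
long_clip = feature_cube_bytes(8, 32, 112, 112, 64)

short_clip_mib = short_clip / 2 ** 20
long_clip_mib = long_clip / 2 ** 20
```

The cost grows linearly with the number of frames, so doubling the temporal footprint doubles the memory of every such activation, before gradients and the remaining layers are even counted.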
3.3 Time Distributed 2D CNN + Dense LSTM

Using an ImageNet-trained 2D CNN model to extract generic spatial features from the last
convolution layers for each individual frame, and feeding this sequence of features to a
recurrent model (many to many) to predict labels for individual frames, reduces the number
of training parameters significantly. We saw a major improvement with this model. However,
the model was not able to learn blinking tail light information, since the lower 2D
convolution layers are incapable of extracting spatiotemporal features such as oscillating
color (tail light blinking).
Limitations
a. Dense Gates
Vanilla LSTM uses Dense networks as gates, which have significantly more parameters
than convolution networks; this limits the amount of information that can be passed to
the LSTM block.
b. Inefficient use of Spatial Information
Vanilla LSTM flattens the convolution output at the input-to-state Dense gate.
Vectorizing the feature cube loses the information stored in its spatial ordering.
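The parameter gap between dense and convolutional gates can be made concrete with a small count. The sketch assumes a hypothetical 7x7x512 feature cube per frame, 256 hidden units or filters, a 3x3 kernel and four gates with biases; peephole weights are left out of both counts, and none of these sizes are taken from the models trained in this thesis.

```python
def dense_lstm_params(input_size, hidden_size):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_size + hidden_size) * hidden_size + hidden_size)


def conv_lstm_params(in_channels, filters, kernel_size):
    """Four gates, each a k x k convolution over input and hidden channels."""
    weights = kernel_size * kernel_size * (in_channels + filters) * filters
    return 4 * (weights + filters)


# Hypothetical per-frame feature cube: 7 x 7 spatial grid, 512 channels.
height, width, channels = 7, 7, 512

# A dense LSTM must first flatten the cube into a vector of 25088 values.
dense_params = dense_lstm_params(height * width * channels, 256)

# A ConvLSTM keeps the grid and shares one 3x3 kernel across all positions.
conv_params = conv_lstm_params(channels, 256, 3)
```

Under these assumptions the dense gates need roughly 26 million parameters against about 7 million for the convolutional gates, and the gap widens as the spatial grid grows because the convolutional count does not depend on height or width.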
ConvLSTM: Overcome Dense LSTM limitation
Xingjian Shi et al. [6] used a convolutional variant of LSTM to predict a target which is
itself a sequence of heat maps, forecasting rain density given the current rain density
maps. They replaced the Dense input-to-state and state-to-state gates of the LSTM with
convolution networks. Moreover, the 1D hidden state and cell state are replaced by 3D
tensors, thus retaining spatiotemporal information. The equations below describe the gate
operations.
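\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o) \\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
Here * denotes the convolution operator and \circ the Hadamard (element-wise) product.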
Figure 3.2: (Left) Input 3D tensor (Right) Forward pass in ConvLSTM. Adapted from [6]
Chapter 4
Experimentation & Results
This chapter discusses experiments and results for models built from 3D
CNN, LSTM and Time Distributed 2D CNN layers in various arrangements to overcome
the individual limitations discussed in the last chapter. We tried our models on two
datasets, namely GTA 5 and RealRoadScene. We generated a synthetic dataset by playing
GTA 5 [13] and used the weights of the models trained on GTA 5 to train further on the
RealRoadScene dataset.
4.1 GTA 5 Experiments
4.1.1 GTA 5 Dataset
The dataset consists of trimmed video clips of a car driven in third-person view
with either the left or the right indicator on, with the associated labels "Lead car turns
left" and "Lead car turns right" respectively, as seen in figure 4.1. Thus we have a task of
action recognition on trimmed clips. We tried a many-to-one C3D-like model, a many-to-many
ConvLSTM model and other hybrid models combining 3D CNN and ConvLSTM in many-to-many
settings.
Figure 4.1: (Left) A frame with label Left. (Right) Mirrored frame with label Right
4.1.2 3D CNN model
We fine-tuned the C3D [8] model, replaced the dense layers with Global Average
Pooling and froze the convolution weights. C3D takes as input a 16-frame video of
112×112 height and width and predicts a label for the snippet, as seen in figure 4.2;
this low input resolution makes small headlights less distinguishable. We also tried a
smaller 3D CNN model trained from scratch with a larger input size, and the results
improved significantly on the GTA validation set. C3D extracts the left and right blinking
tail light features well on cars with large tail lights, but for the rest it looks more at
the orientation to predict the scene (https://youtu.be/yCGg2HMvlvA). A major limitation
of the 3D CNN model is prediction latency, because the model waits for about 1 second
(6-10 frames at 6 FPS) before making a correct prediction. A video of the delay in
detection can be seen at https://youtu.be/XA_W5zzDTbM?t=23
Figure 4.2: Train Model and input output setup
4.1.3 Xception + ConvLSTM

The training setup is slightly different here, as seen in figure 4.4. We feed each
individual frame to a 2D CNN model to extract good spatial features; the resulting sequence
of per-frame features is then fed to LSTM layers to extract good spatiotemporal features.
Finally, a time distributed dense softmax classification is performed on the outputs
(hidden states) of the LSTM layer.
4.1.4 Xception + Bi-Directional ConvLSTM

In bidirectional models, the network looks at the input sequence in both forward and
reverse order, which can potentially improve the prediction because forecasting the future
is not required.
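The difference between forward-only and bidirectional processing can be sketched with a toy recurrence. The decaying running sum below is a deliberately simple stand-in for an LSTM cell (an assumption made for brevity, not the layer used in the experiments); the point is only that the backward state at time t already contains information from later frames, which is why such a model needs the whole clip before it can predict.

```python
def scan(inputs, decay=0.5):
    """Toy recurrent pass: each state mixes the previous state with the current input."""
    state, states = 0.0, []
    for x in inputs:
        state = decay * state + x
        states.append(state)
    return states


def bidirectional_scan(inputs, decay=0.5):
    """Pair the forward state with the backward state for every time step."""
    forward = scan(inputs, decay)
    backward = scan(inputs[::-1], decay)[::-1]
    return list(zip(forward, backward))


# A single "blink" at t = 2 in a four-frame sequence.
signal = [0.0, 0.0, 1.0, 0.0]
for t, (fwd, bwd) in enumerate(bidirectional_scan(signal)):
    print(f"t={t}  forward={fwd:.2f}  backward={bwd:.2f}")
```

At t = 0 and t = 1 the forward state is still zero, while the backward state already reflects the upcoming blink.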
4.2 RealRoadScene Dataset
We collected 88 untrimmed 20-second video clips by driving on real roads with a camera
mounted on the hood. Each untrimmed clip contains a single action belonging to one of the
following classes:
1. Lead car takes Left
2. Lead car takes Right
3. Lead car Stops
Since each video clip has only a single actor (a single lead car) performing an action
somewhere in the video, an action recognition model with action localization in the time
domain is suitable and sufficient. We added a new pseudo class 'Clear', which means neither
a Left nor a Right turn. We created CNN + recurrent networks similar to those discussed in
the GTA 5 experiments. Since Stop clips are very few in terms of count and frames, and are
also difficult to label, we did not consider Stop clips and performed multi-class
classification with the classes Left, Right and Clear. We augmented the dataset by mirroring
each video, so a mirrored Left turn is annotated as Right and vice versa. Tables 4.1 and 4.2
summarize the dataset.
Scenario    Count
Left        61 Clips
Right       7 Clips
Stop        20 Clips
Table 4.1: RealRoadScene Data action Class Distribution

Scenario    Count
Left        68 Clips
Right       68 Clips
Stop        40 Clips
Table 4.2: Augmented RealRoadScene Data action class Distribution
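The mirroring augmentation used to build the augmented dataset can be sketched as follows. Frames are represented here as nested lists of pixel values, and the function name and label strings are assumptions for illustration; the actual pipeline is not shown in this thesis.

```python
def mirror_clip(frames, label):
    """Flip every frame left to right and swap the Left/Right label.

    Labels without a mirrored counterpart (for example Stop or Clear) are kept.
    """
    swap = {"Left": "Right", "Right": "Left"}
    mirrored = [[row[::-1] for row in frame] for frame in frames]
    return mirrored, swap.get(label, label)


# A tiny two-frame clip with 2x3 frames (hypothetical pixel values).
clip = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [0, 1, 2]],
]

mirrored_clip, mirrored_label = mirror_clip(clip, "Left")
print(mirrored_label)
print(mirrored_clip[0])
```

Mirroring every clip makes the Left and Right classes equally frequent by construction, since each Left clip contributes a Right clip and vice versa.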
I annotated these videos manually. Below we report the models and their performance,
and discuss the problems with the models. We selected the best model using the Kappa
statistic on the validation set and report its misclassification rate.
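For reference, Cohen's kappa compares the observed agreement between labels and predictions with the agreement expected by chance from the class frequencies. The sketch below is a minimal standard-library implementation; the toy labels are hypothetical and only illustrate why kappa is more informative than accuracy when one class (such as Clear) dominates.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement).

    Undefined (division by zero) when the expected agreement is exactly 1.
    """
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical frame labels: mostly Clear, with one Left and one Right frame.
labels = ["Clear"] * 8 + ["Left", "Right"]
always_clear = ["Clear"] * 10

accuracy = sum(t == p for t, p in zip(labels, always_clear)) / len(labels)
print("accuracy:", accuracy)
print("kappa:   ", cohen_kappa(labels, always_clear))
```

A model that always predicts Clear reaches high accuracy on this toy data, yet its kappa shows no agreement beyond chance.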
4.2.1 3D CNN + Xception + Bi-Directional ConvLSTM

3D CNNs can extract low-level spatiotemporal features such as blinking lights, moving
edges, etc., but cannot extract high-level features such as the lead car because of their
small temporal footprint. Since 3D CNNs are parameter intensive, using a large 3D CNN model
is not possible on a tiny dataset. So we used a combination of a 2D CNN model and a tiny
3D CNN model to extract high-level spatial features and low-level spatiotemporal features
respectively. Adding an LSTM on top of these 2D CNN and 3D CNN models creates a
sequence-to-sequence model with a large temporal footprint. The final model combining the
3 components, which can be seen in figure 4.8, has the capacity to model both high- and
low-level spatiotemporal features.
Figure 4.5: C3D + Xception + Bi-Directional network
Figure 4.6: C3D + Xception + Bi-Directional network
Figure 4.7: CAM maps visualizing ConvLSTM's class attention for Xception with
Bidirectional convolution model
4.2.2 3D CNN + Xception + Uni-Directional ConvLSTM

We removed the backward LSTM to make the model online.
Figure 4.8: C3D + Xception + Uni-Directional network
Figure 4.9: CAM maps visualizing ConvLSTM's class attention for Xception with
Bidirectional convolution model
Model Architecture                                  Frame Classification Accuracy   Kappa
3D CNN + Xception + Uni-Dir ConvLSTM (Real time)    96.54%                          88.1%
3D CNN + Xception + Bi-Dir ConvLSTM (Offline)       98.53%                          94.70%
Table 4.3: Model prediction accuracy on the test dataset (Simple)
Chapter 5
Conclusion
I worked on multiple model architectures, starting from simple video-level annotation,
moving to deeper frame-level annotation and finally to the deepest pixel-level annotations
(see footnote 1). We obtain 98.5% frame classification accuracy on the test dataset: 1,415
frames out of 84,964 are incorrectly classified in 31 videos. After analyzing the saliency
maps and the train dataset, we found the following issues:
1. The train dataset does not have many trucks in the background while the test dataset
does, which causes a small amount of misclassification at frames with trucks in the
scene.
2. The model has no good understanding of the ego-lane concept. Many times oncoming
traffic is taken into consideration, leading to an incorrect prediction.
The train and test sets have very few complicated samples and frames. We need a new
dataset with more classes and more difficult samples. Nvidia's work on the PX2 let a network
learn self driving from a large amount of human driving data, without supervising it with
lane detection.
1 Pixel-level annotation is used for the guided attention map. The code is implemented, but
since an annotated dataset is not available, I was not able to include results here.
Chapter 6
Future Work
6.1 Hierarchical Classification
Until now we have used flat classification, by which we mean the standard multi-class
classification problem. However, this problem can naturally be cast as a hierarchical
classification problem, where the classes to be predicted are organized into a class
hierarchy, typically a tree or a DAG. Please see figure 6.1 for the hierarchy in our
problem. We can exploit the structure in the labels to address class imbalance and improve
overall accuracy at the cost of extra computation.
Figure 6.1: Yellow blocks are end classes
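One simple way to use a tree-shaped label hierarchy is to predict a conditional distribution at every internal node and multiply the conditionals along each root-to-leaf path. The sketch below illustrates that idea only; the two-level hierarchy and the probabilities are hypothetical stand-ins (the hierarchy of figure 6.1 is not reproduced here), and this is not the HiNet architecture.

```python
def leaf_probabilities(hierarchy, conditionals, node="root", prob=1.0):
    """Multiply conditional probabilities down the tree to get one value per leaf."""
    children = hierarchy.get(node)
    if not children:
        return {node: prob}
    leaves = {}
    for child in children:
        child_prob = prob * conditionals[node][child]
        leaves.update(leaf_probabilities(hierarchy, conditionals, child, child_prob))
    return leaves


# Hypothetical hierarchy: first decide Clear vs Action, then the kind of action.
hierarchy = {
    "root": ["Clear", "Action"],
    "Action": ["Left", "Right", "Stop"],
}

# Hypothetical per-node softmax outputs for one frame.
conditionals = {
    "root": {"Clear": 0.75, "Action": 0.25},
    "Action": {"Left": 0.5, "Right": 0.25, "Stop": 0.25},
}

leaves = leaf_probabilities(hierarchy, conditionals)
print(leaves)
```

With this factorization the rare action classes only compete with each other at the second level, which is one way the label structure can ease class imbalance.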
This problem can be solved in two ways.
Discriminative Approach
Using an architecture like HiNet [15], we can use a DAG in which a leaf class
can have multiple parents.
Generative Approach
If we consider the following scenarios,
1. Lead Car turns Left
2. Lead Car turns Right
3. Lead Car Blocks the path
4. Pedestrian Blocks the path
5. Animal Blocks the path
6. Lead Car collides with Living Thing
7. Lead Car collides with Non-Living Thing
8. None of the above (Clear)
we can see that the classes are related to each other and that there exists an embedding
for the labels. Ashish Mishra et al. [14] used a generative approach for zero-shot and
few-shot learning to model a mapping between the class distribution (samples of a class are
assumed to come from some distribution) and the attribute distribution (relations within
labels). This setup is useful when we have no samples of some class but still want to
recognize that class.
Figure 6.2: Each class attribute is projected to the visual space. In the visual space,
each class is represented by a Gaussian distribution
6.2 Translation Variant CNN layers
Sometimes the class label depends on the position of an object in the frame. Driving Scene
Recognition requires a translation variant model, since the position of the car relative to
the camera has a causal relation with the target classes. We tested this hypothesis by using
only translation equivariant layers (2D CNN, 3D CNN, ConvLSTM) followed by translation
invariant Global Average Pooling and training the model on RealRoadScene: our test dataset
kappa dropped by 20% to 74%, which implies that the relative position of the object plays a
significant role in Driving Scene Recognition. One can always vectorize the feature cube
(flatten is a translation equivariant layer) instead of applying Global Average Pooling
before passing it to the dense layers, in order to make the model translation variant, and
this is what we did for now. But we can improve the overall performance of the network using
translation variant convolution operations. For example, Xingang Pan et al. [16] use Spatial
CNN for lane detection, a CNN operation that slices the feature cube width-wise and
height-wise and performs convolution operations slice by slice on the feature cube itself.
Figure 6.3: SCNN implementation. Taken from [16]
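The pooling argument in this section can be checked on a toy feature map. The sketch below compares Global Average Pooling with flattening when a single activation moves sideways; the map values are hypothetical, and a circular shift is assumed so that the activation does not leave the map.

```python
def global_average_pool(fmap):
    """Average all activations of a 2D feature map into one number."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


def flatten(fmap):
    """Vectorize a 2D feature map row by row, keeping positional order."""
    return [v for row in fmap for v in row]


def shift_right(fmap, steps=1):
    """Circularly shift every row to the right by the given number of columns."""
    return [row[-steps:] + row[:-steps] for row in fmap]


# One strong activation on the left side of the map (hypothetical "lead car").
fmap = [
    [0.0, 0.0, 0.0, 0.0],
    [9.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
shifted = shift_right(fmap, steps=2)

print("GAP before / after shift:", global_average_pool(fmap), global_average_pool(shifted))
print("flatten before:", flatten(fmap))
print("flatten after: ", flatten(shifted))
```

The pooled value is identical before and after the shift, so a classifier on top of it cannot tell where the object is, whereas the flattened vector changes and lets a dense layer learn position-dependent weights.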
6.3 Guided Attention Inference Network
The model is not able to focus on the driving lane. Using a supervised attention map for
lane segmentation, as discussed in section 2.3.3, can improve performance without an
increase in computation cost at inference time. These supervised attention maps can also
be used to force the model's attention towards the tail lights of the lead car.
Bibliography
[1] Lex Fridman, "MIT 6.S094: Deep Learning for Self-Driving Cars, Lecture 2 slide 43",
https://selfdrivingcars.mit.edu
[2] Lex Fridman, "MIT 6.S094: Deep Learning for Self-Driving Cars, Lecture 2 slide 44",
https://selfdrivingcars.mit.edu
[3] Andrej Karpathy, blog, "The Unreasonable Effectiveness of Recurrent Neural Networks",
http://karpathy.github.io/2015/05/21/rnn-effectiveness
[4] Zhenyang Li, Efstratios Gavves, Mihir Jain, Cees G. M. Snoek, "VideoLSTM Convolves,
Attends and Flows for Action Recognition", Arxiv paper, https://arxiv.org/abs/1607.01794v1
[5] Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, Yun Fu, "Tell Me Where to Look:
Guided Attention Inference Network", Arxiv paper, https://arxiv.org/abs/1802.10171v1
[6] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, Wang-chun Woo,
"Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting",
Arxiv paper, https://arxiv.org/abs/1506.04214v2
[7] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar,
Li Fei-Fei, "Large-scale Video Classification with Convolutional Neural Networks",
CVPR 2014, https://cs.stanford.edu/people/karpathy/deepvideo/
[8] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri, "Learning
Spatiotemporal Features with 3D Convolutional Networks", ICCV 2015,
http://vlg.cs.dartmouth.edu/c3d/
[9] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed,
Cheng-Yang Fu, Alexander C. Berg, "SSD: Single Shot MultiBox Detector", Arxiv paper,
https://arxiv.org/abs/1512.02325v5
[10] Clément Godard, Oisin Mac Aodha, Gabriel J. Brostow, "Unsupervised Monocular Depth
Estimation with Left-Right Consistency", Arxiv paper, https://arxiv.org/abs/1609.03677
[11] Waqas Sultani, Chen Chen, Mubarak Shah, "Real-world Anomaly Detection in
Surveillance Videos", Arxiv paper, https://arxiv.org/abs/1801.04264v2
[12] Rui Hou, Chen Chen, Mubarak Shah, "Tube Convolutional Neural Network (T-CNN) for
Action Detection in Videos", Arxiv paper, https://arxiv.org/abs/1703.10664
[13] Rockstar Games, "Grand Theft Auto 5", https://www.rockstargames.com/V/
[14] Ashish Mishra, Vinay Kumar Verma, M Shiva Krishna Reddy, Arulkumar S, "A Generative
Approach to Zero-Shot and Few-Shot Action Recognition", Arxiv paper,
https://arxiv.org/abs/1801.09086
[15] Zhenzhou Wu, Sean Saito, "HINET: Hierarchical Classification with Neural Network",
Arxiv paper, https://arxiv.org/abs/1705.11105
[16] Xingang Pan, Jianping Shi, Ping Luo, Xiaogang Wang, Xiaoou Tang, "Spatial As Deep:
Spatial CNN for Traffic Scene Understanding", Arxiv paper,
https://arxiv.org/abs/1712.06080
