
Advanced Topics in Autonomous Driving using Deep Learning

Presenter: Nasim Souly


Electronics Research Laboratory https://aid-driving.eu/
www.vwerl.com/
New challenges in autonomous driving
• AVs have to gain people's trust
• AVs need a better social understanding of pedestrians

Intent prediction and communication

• Models need to be
• Fast -> low latency to react quickly
• Small -> to run on embedded systems
Autonomous driving
• Benefits of autonomy in driving
• Save lives (~1 million people die every year, largely due to human error: DUI, distraction, …)
• Increased mobility
• Labor-cost savings
• Reduced carbon dioxide (CO2) emissions through optimized driving
• Personalized transportation
Autonomous Driving
SEDRIC is the Volkswagen Group's first concept car and the prototype of an autonomous vehicle

DARPA Grand Challenge


‘Stanley’

Audi A8 Traffic Jam Pilot

Intelligent System

Figure from Artificial Intelligence: A Modern Approach, Global Edition, by Russell and Norvig

Utilizing smart functional modules for driving

Sensor data (e.g. images) -> Intelligent module (e.g. deep networks, planning) -> Actions (e.g. accelerate, brake)
Why deep learning?

• All about the features

• Driving, too, depends on certain features


• Features can determine what action to take next
Why deep learning?
• Feature extraction
• Representation learning
• Scalable
Deep Neural Networks
• Universal function approximators (can learn special-purpose functions)
• Breakthrough enablers:
• GPUs
• Large datasets (e.g. ImageNet)
• Research (algorithms)
• Infrastructure (AWS, …)
• Software frameworks (Caffe, TensorFlow, PyTorch, …)
Common perception models
• Object recognition, object detection, semantic segmentation
• Cars need these models to perceive the surroundings
Need - Rise of Machines

• AVs have to gain people's trust
• AVs need a better social understanding of pedestrians

Predict intent <-> Communicate intent

Intent prediction and communication

Predict pedestrian intentions

Predict intent | Communicate intent

• Recognition of pedestrians' movements and prediction of their intent

• Need to recognize each pedestrian's activity over time (in a video context)

• Sensor-agnostic representation of human movements


street smart (noun): the experience and knowledge necessary to deal with the potential difficulties or dangers of life in an urban environment.
Modular AV Architecture

Sensors -> Perception -> Interpretation & Prediction -> Path Planning
• Perception: keypoint (skeleton) estimation
• Interpretation & Prediction: gesture/activity recognition, intent prediction
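As a toy illustration of how these modules chain together, here is a minimal Python sketch; all function bodies are hypothetical placeholders standing in for the real perception, interpretation, and planning modules:

```python
# Toy sketch of the modular pipeline: Sensors -> Perception (keypoints) ->
# Interpretation & Prediction (activity, intent) -> Path Planning.
# Every function below is a placeholder, not the real module.
from typing import List

def perceive(frame) -> List[dict]:
    """Detect pedestrians and estimate their keypoints / skeletons (placeholder)."""
    return [{"id": 0, "keypoints": []}]

def interpret(tracks: List[dict]) -> List[dict]:
    """Recognize activity over time and predict crossing intent (placeholder)."""
    return [{**t, "intent": "crossing", "confidence": 0.7} for t in tracks]

def plan(intents: List[dict]) -> str:
    """Pick a driving action given the predicted intents (placeholder)."""
    return "slow_down" if any(i["intent"] == "crossing" for i in intents) else "keep_lane"

frame = None                                   # stand-in for a camera frame
print(plan(interpret(perceive(frame))))        # -> "slow_down"
```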

Keypoints Detection

Figure from OpenPose [Cao, Z 2017]

Approach: Top-down
• Top-down
• Detect humans first
• Run a pose estimation model on each detected person
• Slightly better accuracy, at the cost of higher latency in crowded scenes

Figures from [Cao, Z 2017]
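The detect-then-estimate pattern above can be sketched with an off-the-shelf top-down model. A minimal sketch, assuming torchvision's Keypoint R-CNN (recent torchvision API) as a stand-in for the pose network rather than the OpenPose model from the slides; the confidence threshold is an arbitrary choice:

```python
# Rough sketch of the top-down pipeline: detect people, then read out
# per-person keypoints. Keypoint R-CNN is an assumed stand-in model.
import torch
import torchvision

model = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = torch.rand(3, 480, 640)       # stand-in for a camera image (C, H, W), values in [0, 1]
with torch.no_grad():
    out = model([frame])[0]           # one result dict per input image

keep = out["scores"] > 0.8            # arbitrary confidence threshold
for box, kpts in zip(out["boxes"][keep], out["keypoints"][keep]):
    # kpts has shape (17, 3): 17 COCO keypoints as (x, y, visibility)
    print(box.tolist(), kpts.shape)
```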

Approach: Bottom-up
• Bottom-up
• Find all joints/keypoints in the image
• Post-process the joints to group them into individual people
• Real-time capable for crowded scenes

Figures from CMU-Pose [Cao, Z. 2017]

Multi-Person Pose Estimation using Part Affinity Fields [Cao, Zhe 2017]

Approach
• Jointly learning part detection and part association

Pose methods comparison
• For real-time use, one of the best approaches is OpenPose
• Post-processing (part association) is needed; it scales quadratically with the number of people
• Inference time is dominated by the CNN
• Even for a large number of people (~20), the post-processing time is insignificant.

• Some other bottom-up approaches


• Deep(er) Cut [Insafutdinov]
• PersonLab [Papandreou, G 2018]

San Francisco

San Francisco

Munich
(blurred/anonymized image)

OpenPose Demo
Compression
• Modular build-ups
• Several CPUs and GPUs
• No space left for the customer
Compression
• Compress deep neural networks to make faster predictions on embedded hardware and reduce the memory footprint

Example: object detection only, at 10 fps

Data Collection (multiple cameras, LiDAR, CAN) -> Data Processing -> Perception -> Path Planning
• Perception tasks: detection of cars, pedestrians, bikes; corner case detection; semantic segmentation
Deep Network Compression
• Memory-efficient architectures
• Rethinking the design of the architecture (SqueezeNet)

• Platform-dependent compression (e.g. TensorRT from Nvidia)
• Does not support custom layers
• Needs specific hardware

• Generic network compression
• Can be applied to different architectures
• Does not depend on a particular platform
Convolutional Neural Network
Network structure and process:
• Read sensor data
• Feature extraction and fusion
• Link the parts together
Unstructured pruning [Han2015]

• Conv layer: 512 x 512 x 3 x 3 ≈ 2.4M parameters

• FC layer: feeding the 512 7x7 feature maps into the first 4096-node FC layer accounts for 512 x 7 x 7 x 4096 ≈ 102.8M parameters alone
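These counts are easy to reproduce, and magnitude-based unstructured pruning in the spirit of [Han2015] amounts to zeroing the smallest weights. A minimal sketch (the 90% sparsity target is an illustrative choice, not a value from the talk):

```python
# Parameter counts from the slide, plus magnitude-based unstructured pruning
# in the spirit of Han et al. 2015. The sparsity level is illustrative.
import torch
import torch.nn as nn

conv = nn.Conv2d(512, 512, kernel_size=3)       # 512*512*3*3 ≈ 2.4M weights
fc = nn.Linear(512 * 7 * 7, 4096)               # 512*7*7*4096 ≈ 102.8M weights
print(conv.weight.numel(), fc.weight.numel())   # 2,359,296 and 102,760,448

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

with torch.no_grad():
    fc.weight.copy_(magnitude_prune(fc.weight))
print((fc.weight == 0).float().mean())          # ≈ 0.90 of the weights are now zero
```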
Low rank decompositions 2D [Xue2013]

• Replace a weight matrix by the product of two smaller weight matrices
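A minimal sketch of this idea, assuming an SVD split of a single fully connected layer; the layer size and the rank are illustrative choices, not values from the talk:

```python
# SVD-based low-rank decomposition of a fully connected layer,
# in the spirit of Xue et al. The rank of 256 is illustrative.
import torch
import torch.nn as nn

fc = nn.Linear(4096, 4096, bias=False)            # 16.8M parameters
rank = 256

# W ≈ (U * S) @ Vt, keeping only the top `rank` singular values
U, S, Vt = torch.linalg.svd(fc.weight.data, full_matrices=False)
A = U[:, :rank] * S[:rank]                         # (4096, 256)
B = Vt[:rank, :]                                   # (256, 4096)

# Replace the single layer with two smaller ones: x -> B x -> A (B x)
low_rank = nn.Sequential(nn.Linear(4096, rank, bias=False),
                         nn.Linear(rank, 4096, bias=False))
low_rank[0].weight.data = B
low_rank[1].weight.data = A

params_before = fc.weight.numel()                  # 16,777,216
params_after = A.numel() + B.numel()               # 2,097,152 (~8x smaller)
print(params_before, params_after)
```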


Comparison of compression methods for fully connected layers

Approach                          | Size reduction | Speed       | Accuracy
Unstructured pruning [Han2015]    | 100x smaller   | Similar     | Comparable
Low-rank decompositions [Xue2013] | 40x smaller    | 2.4x faster | Comparable
Structured Fisher pruning [Molchanov2017, Theis2018]

[Figure: feature tensor (height x width x channels) with per-channel importance scores, e.g. 1.4, 0.8, 2.7]
• Remove channel slices ranked by importance
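A simplified sketch of structured channel pruning with a gradient-based importance score, in the spirit of the Fisher/Taylor criteria [Molchanov2017, Theis2018]; the importance estimate, layer sizes, and the number of kept channels are illustrative assumptions, not the exact method used in the talk:

```python
# Simplified structured pruning: score each output channel of conv1 by a
# gradient-based importance and physically remove the weakest channel slices.
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 64, 3, padding=1)
conv2 = nn.Conv2d(64, 64, 3, padding=1)

x = torch.rand(8, 3, 32, 32)
act = conv1(x)
act.retain_grad()
loss = conv2(act).pow(2).mean()                       # stand-in for a task loss
loss.backward()

# Per-channel importance ≈ sum of (activation * gradient)^2 over batch and space
importance = (act * act.grad).pow(2).sum(dim=(0, 2, 3))    # shape (64,)

# Keep the 32 most important output channels of conv1 (and matching inputs of conv2)
keep = importance.argsort(descending=True)[:32]
pruned_conv1 = nn.Conv2d(3, 32, 3, padding=1)
pruned_conv1.weight.data = conv1.weight.data[keep]
pruned_conv1.bias.data = conv1.bias.data[keep]
pruned_conv2 = nn.Conv2d(32, 64, 3, padding=1)
pruned_conv2.weight.data = conv2.weight.data[:, keep]
pruned_conv2.bias.data = conv2.bias.data
```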
Benchmark
Multiple pruning rounds: compress the model by [5, 10, …, 50] %
• Pruning rounds (10, 100, 1); each round is one complete loop (prune + retrain + evaluate)
• Filters to prune in each round (93, 9, 930)
• Pruning batch size (32)
• Number of pruning steps (4)
Different pruning strategies
0. Baseline (no retraining)
1. Retrain after each pruning round: 10 rounds with 93 channels each (10x)
2. Retrain after each pruning round: 100 rounds with 9 channels each (100x)
3. Retrain once at the end: 1 round with 930 channels (1x)
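The loop behind strategies 1-3 can be sketched as follows; `model`, `train_loader`, `evaluate`, and `prune_channels` are hypothetical placeholders, and only the round/channel counts mirror the slide:

```python
# Sketch of the prune -> retrain -> evaluate loop behind strategies 1-3.
# All callables passed in are placeholders for the real benchmark code.
import torch

def compression_loop(model, train_loader, evaluate, prune_channels,
                     rounds=10, channels_per_round=93,
                     retrain_epochs=1, accuracy_floor=0.98):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for r in range(rounds):
        prune_channels(model, n=channels_per_round)    # drop the n least important channels
        # Only retrain when accuracy falls below the floor (cf. the 10-round setting)
        if evaluate(model) < accuracy_floor:
            for _ in range(retrain_epochs):
                for images, labels in train_loader:
                    optimizer.zero_grad()
                    loss_fn(model(images), labels).backward()
                    optimizer.step()
        print(f"round {r + 1}: accuracy {evaluate(model):.3f}")
    return model
```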
Results: Structured Fisher pruning (10 rounds, no retraining)

[Plots: loss (abs) and accuracy (%) vs. % compression; run fulleval/baseline_prune_10x93]
Results: Structured Fisher pruning (10 rounds, with retraining)

[Plot: loss (abs) vs. % compression]
Only retrained if accuracy < 98%
Results: Structured Fisher pruning (1 round, with retraining)

Metric   | Baseline model              | Compressed model (before retrain) | Compressed model (after retrain)
Size     | 3,726,801 parameters, 45 MB | 1,725,092 parameters, 21 MB       | 1,725,092 parameters, 21 MB
Accuracy |                             | dropped 28%                       | improved 13%
Loss     | 0.097977                    | 0.352622                          | 0.086778

Compressed to 50%; retraining: 40 epochs, ~18 hours

Summary
• Autonomous driving has come a long way
• Deep learning shows promising results for autonomous driving (and it scales)
• Common perception models are not enough -> more complex analysis is needed
• Typical networks are too big for cars -> generic compression reduces model size and inference time
References
• Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. "Realtime multi-person 2D pose estimation using part affinity fields." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7291-7299.
• Han, S., Pool, J., Tran, J., and Dally, W. "Learning both weights and connections for efficient neural network." NIPS, 2015.
• Xue, J., et al. "Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
• Theis, L., et al. "Faster gaze prediction with dense networks and Fisher pruning." arXiv preprint arXiv:1801.05787, 2018.
• Insafutdinov, E., et al. "DeeperCut: A deeper, stronger, and faster multi-person pose estimation model." European Conference on Computer Vision (ECCV), Springer, 2016.
• Papandreou, G., et al. "PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model." Proceedings of the European Conference on Computer Vision (ECCV), 2018.
