Lecture 6: CNNs for Detection,
Tracking, and Segmentation
Object Detection
Bohyung Han
Computer Vision Lab.
bhhan@postech.ac.kr
Region‐based CNN (RCNN)
• Motivation
  ‐ The sliding window approach is not feasible for object detection with convolutional neural networks.
  ‐ We need a faster method to identify object candidates.
• Finding object proposals
  ‐ Pipeline: input image → extract region proposals → compute CNN features → classification
  ‐ Any proposal method (e.g., selective search, EdgeBoxes), any CNN architecture, softmax or SVM classifier
• Object detection
  ‐ Independent evaluation of each proposal
  ‐ Bounding box regression improves detection accuracy.
  ‐ Mean average precision (mAP): 53.7% with bounding box regression on the VOC 2010 test set

Selective Search
• Greedy hierarchical superpixel segmentation
• Diversification of superpixel construction and merging
  ‐ Using a variety of color spaces
  ‐ Using different similarity measures
  ‐ Varying starting regions

[Girshick14] R. Girshick, J. Donahue, T. Darrell, J. Malik: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014
[Uijlings13] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders: Selective Search for Object Recognition, IJCV 2013
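Since each proposal is evaluated independently and detection metrics such as mAP rest on box overlap, an intersection-over-union (IoU) helper makes the evaluation concrete. A minimal sketch, assuming corner-format `(x1, y1, x2, y2)` boxes (the layout is an assumption, not taken from the slides):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a threshold (0.5 in the PASCAL VOC protocol).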
Bounding Box Regression
• Learning a transformation of a bounding box
  ‐ Region proposal: $P = (P_x, P_y, P_w, P_h)$
  ‐ Ground truth: $G = (G_x, G_y, G_w, G_h)$
  ‐ Transformation targets:
    $t_x = (G_x - P_x)/P_w$, $t_y = (G_y - P_y)/P_h$, $t_w = \log(G_w/P_w)$, $t_h = \log(G_h/P_h)$
  ‐ Predicted box from learned offsets $d_\ast(P)$:
    $\hat{G}_x = P_w d_x(P) + P_x$, $\hat{G}_y = P_h d_y(P) + P_y$, $\hat{G}_w = P_w \exp(d_w(P))$, $\hat{G}_h = P_h \exp(d_h(P))$
  ‐ Each $d_\ast(P) = \mathbf{w}_\ast^\top \phi_5(P)$ is a linear function of the CNN pool5 feature $\phi_5(P)$, learned by ridge regression:
    $\mathbf{w}_\ast = \arg\min_{\hat{\mathbf{w}}_\ast} \sum_i \big(t_\ast^i - \hat{\mathbf{w}}_\ast^\top \phi_5(P^i)\big)^2 + \lambda \lVert \hat{\mathbf{w}}_\ast \rVert^2$

Detection Results
• VOC 2010 test set
• Feature analysis on the VOC 2007 test set
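A minimal numeric sketch of the RCNN bounding-box regression transform, assuming a `(center_x, center_y, width, height)` box parameterization:

```python
import math

def bbox_targets(P, G):
    """Regression targets (t_x, t_y, t_w, t_h) mapping proposal P onto
    ground truth G; boxes are (center_x, center_y, width, height)."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return ((Gx - Px) / Pw, (Gy - Py) / Ph,
            math.log(Gw / Pw), math.log(Gh / Ph))

def apply_regression(P, d):
    """Apply predicted offsets d = (d_x, d_y, d_w, d_h) to proposal P,
    inverting the target transformation (exp undoes the log)."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    return (Pw * dx + Px, Ph * dy + Py,
            Pw * math.exp(dw), Ph * math.exp(dh))
```

By construction, applying the targets computed for a proposal recovers the ground-truth box exactly; at test time the offsets come from the learned regressor instead.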
Fast RCNN
• A fast version of RCNN
  ‐ 9x faster in training and 213x faster in testing than RCNN
  ‐ A single feature computation and ROI pooling using object proposals
  ‐ Bounding box regression inside the network
  ‐ Single‐stage training using a multi‐task loss

[Girshick15] R. Girshick: Fast R‐CNN, ICCV 2015

Faster RCNN
• Fast RCNN + RPN (Region Proposal Network)
  ‐ Proposal computation moved into the network
  ‐ Marginal cost of proposals: 10ms

[Ren15] S. Ren, K. He, R. Girshick, J. Sun: Faster R‐CNN: Towards Real‐Time Object Detection with Region Proposal Networks, NIPS 2015
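ROI pooling warps each proposal's region of the shared feature map to a fixed spatial size so the subsequent layers see uniform inputs. A naive single-channel sketch; integer coordinates and the small output grid are simplifying assumptions, not Fast RCNN's actual layer:

```python
import numpy as np

def roi_pool(feature, roi, out_size=2):
    """Naive RoI max pooling: crop the region of interest from a 2D
    feature map and max-pool it onto a fixed out_size x out_size grid."""
    x1, y1, x2, y2 = roi
    region = feature[y1:y2, x1:x2]
    h, w = region.shape
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Each output cell covers roughly h/out_size x w/out_size inputs;
            # guarantee at least one element per bin.
            ys = slice(i * h // out_size,
                       max((i + 1) * h // out_size, i * h // out_size + 1))
            xs = slice(j * w // out_size,
                       max((j + 1) * w // out_size, j * w // out_size + 1))
            out[i, j] = region[ys, xs].max()
    return out
```

Because the pooling operates on the shared feature map, the expensive convolutional computation is done once per image rather than once per proposal.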
Object Detection Performance
• The RCNN family achieves the state‐of‐the‐art performance in object detection!
• Pascal VOC 2007 object detection mAP (%)

Faster RCNN with ResNet
Visual Tracking with
Convolutional Neural Networks
Visual Tracking
• MDNet (Multi‐Domain Network)
  ‐ The winner of the Visual Object Tracking (VOT) Challenge 2015

Main Idea
• Training shared features and domain‐specific classifiers jointly
  ‐ Multi‐domain learning: separating shared layers from domain‐specific layers
  ‐ A shared feature representation is learned across all training domains (Domain 1, Domain 2, ...), with one domain‐specific classifier per domain.
• Transfer to a new domain

[Nam15] Hyeonseob Nam, Bohyung Han: Learning Multi‐Domain Convolutional Neural Networks for Visual Tracking, CVPR 2016
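The shared/domain-specific split can be sketched as below; `shared` and the per-domain heads are stand-in callables for illustration, not MDNet's actual layers:

```python
class MultiDomainNet:
    """Sketch of multi-domain learning: one shared feature extractor,
    one binary domain-specific classifier (head) per training domain."""

    def __init__(self, shared, heads):
        self.shared = shared  # input -> shared feature
        self.heads = heads    # domain id -> (feature -> positive score)

    def score(self, x, domain):
        # Every domain reuses the same shared features.
        return self.heads[domain](self.shared(x))

    def add_domain(self, domain, head):
        # Transfer to a new domain: keep the shared layers, attach a fresh head.
        self.heads[domain] = head
```

During joint training only the head of the current domain is updated together with the shared layers; at tracking time a new head is attached for the new sequence.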
Online Tracking using MDNet Features
• $f(x)$: positive score of a candidate state $x$
• Target state: $x^\ast = \arg\max_x f(x)$

Online Tracking: Overview
• Transfer the shared features to the new sequence and fine‐tune the domain‐specific layers.
• Estimate the target state in each frame (Frame 2, 3, ...) and repeat for the next frame.
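One tracking step can be sketched as below, under the assumption of Gaussian candidate sampling around the previous state; the parameter values are illustrative, not the slides' settings:

```python
import math
import random

def track_frame(prev, f, n=256, sigma=(10.0, 10.0, 0.1), seed=0):
    """Sample candidate states around the previous target state
    (x, y, scale) and return x* = argmax_x f(x)."""
    rng = random.Random(seed)
    cands = [(rng.gauss(prev[0], sigma[0]),       # translate in x
              rng.gauss(prev[1], sigma[1]),       # translate in y
              prev[2] * math.exp(rng.gauss(0.0, sigma[2])))  # rescale
             for _ in range(n)]
    # The tracker picks the candidate with the highest positive score.
    return max(cands, key=f)
```

Here `f` stands for the network's positive-score output on an image patch cropped at the candidate state.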
Online Network Update
• Long‐term update
  ‐ Performed at regular intervals
  ‐ Uses long‐term training samples
  ‐ For robustness
• Short‐term update
  ‐ Performed at abrupt appearance changes ($f(x^\ast) < 0.5$)
  ‐ Uses short‐term training samples
  ‐ For adaptiveness

Hard Negative Mining
• Provide a "hard" minibatch in each training iteration.
  ‐ Randomly draw samples from the pool of positive samples.
  ‐ From the pool of negative samples, randomly draw a large set of candidates and select the few ($\ll$ the number drawn) with the highest positive scores.
  ‐ Train the CNN on this minibatch.
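The minibatch construction above can be sketched as follows; the pool sizes and draw counts are illustrative assumptions:

```python
import random

def hard_minibatch(neg_pool, pos_pool, score, n_draw=256, n_hard=32,
                   n_pos=8, seed=0):
    """Assemble one training minibatch: random positives plus 'hard'
    negatives, i.e. the negatives with the highest positive scores
    among a random draw from the negative pool."""
    rng = random.Random(seed)
    pos = rng.sample(pos_pool, min(n_pos, len(pos_pool)))
    drawn = rng.sample(neg_pool, min(n_draw, len(neg_pool)))
    # Hard negatives: the drawn negatives the classifier is most wrong about.
    hard = sorted(drawn, key=score, reverse=True)[:n_hard]
    return pos, hard
```

Because `score` is the current network's positive score, the selected negatives get harder as training proceeds and the easy ones are filtered out.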
Hard Negative Mining (continued)
• Examples of positive and negative samples in the 1st, 5th, and 30th minibatches over the training iterations

Bounding Box Regression
• Improve the localization quality.
  ‐ As in DPM [Felzenszwalb et al. PAMI'10] and R‐CNN [Girshick et al. CVPR'14]
  ‐ In the first frame, train a bounding box regression model from the ground truth and positive samples.
  ‐ In subsequent frames, adjust the tracking result by bounding box regression.
Results on OTB100 [Wu15] and VOT2015
• Protocol
  ‐ MDNet is trained with 58 sequences from {VOT'13, '14, '15}, excluding those in OTB100.
  ‐ Distance precision and overlap success rate by One‐Pass Evaluation (OPE)
Semantic Segmentation by
Fully Convolutional Network
Semantic Segmentation using CNN
• Image classification: given a query image, predict a single class label.
• Semantic segmentation: given an input image, obtain a pixel‐wise segmentation mask using a deep Convolutional Neural Network (CNN).

Fully Convolutional Network (FCN)
• Interpreting fully connected layers as convolution layers
  ‐ Each fully connected layer is identical to a convolution layer with a large spatial filter that covers the entire input field.
  ‐ e.g., fc6 applied to a 7x7x512 pool5 output is a 7x7 convolution with 4096 output channels, and fc7 is a 1x1 convolution with 4096 channels, producing a 1x1x4096 output.
  ‐ On a larger input whose pool5 output is 22x22x512, the same convolutionalized fc6/fc7 layers produce a 16x16x4096 spatial output instead of a single vector.
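The fc-to-convolution equivalence can be checked numerically: sliding the fc weight matrix over the feature map as a kernel reproduces the fc output exactly at a matching input size, and yields a spatial map on a larger input. A small sketch with toy dimensions:

```python
import numpy as np

def fc_as_conv(W, feat, kh, kw):
    """Slide a fully connected weight matrix W of shape (out_dim, C*kh*kw)
    over a feature map feat of shape (C, H, W) as a kh x kw convolution."""
    C, H, Wd = feat.shape
    oh, ow = H - kh + 1, Wd - kw + 1
    out = np.zeros((W.shape[0], oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Each spatial position applies the fc layer to one patch.
            patch = feat[:, i:i + kh, j:j + kw].ravel()
            out[:, i, j] = W @ patch
    return out
```

When `feat` is exactly `kh x kw`, the single output position equals `W @ feat.ravel()`, i.e. the original fully connected layer; a larger input simply produces a grid of such outputs, which is what lets the FCN emit a score map.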
FCN for Semantic Segmentation
• Network architecture [Long15]
  ‐ End‐to‐end CNN architecture for semantic segmentation
  ‐ Interprets fully connected layers as convolution layers
  ‐ A 500x500x3 input produces a coarse 16x16x21 output score map, which a deconvolution layer upsamples to the input resolution.
  ‐ Fine‐tuning the convolutional layers of the network with segmentation ground truth
• How does this deconvolution work?

Deconvolution Filter
• Bilinear interpolation filter
  ‐ The same filter for every class
  ‐ No filter learning: the deconvolution layer is fixed.

[Long15] J. Long, E. Shelhamer, T. Darrell: Fully Convolutional Networks for Semantic Segmentation, CVPR 2015
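The fixed filter can be constructed directly; the following is the standard bilinear-kernel construction used to build non-learned upsampling layers:

```python
import numpy as np

def bilinear_kernel(factor):
    """2D bilinear interpolation filter for upsampling by `factor`.
    The weight of each tap falls off linearly with its distance from
    the kernel center, in both dimensions."""
    size = 2 * factor - factor % 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return ((1 - abs(rows - center) / factor) *
            (1 - abs(cols - center) / factor))
```

The kernel weights sum to `factor**2`, so upsampling a constant map reproduces the same constant; the same filter is applied to every class channel.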
Skip Architecture
• Ensemble of three different scales
  ‐ Combining complementary features: deeper layers are more semantic, shallower layers are more detailed.

Limitations of FCN‐based Semantic Segmentation
• Coarse output score map
  ‐ A single bilinear filter must handle the variations of all object classes.
  ‐ Difficult to capture the detailed structure of objects in an image
• Fixed‐size receptive field
  ‐ Unable to handle multiple scales
  ‐ Difficult to delineate objects that are too small or too large compared to the receptive field size
• Noisy predictions due to the skip architecture
  ‐ Trade‐off between details and noise
  ‐ Minor quantitative performance improvement
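The fixed receptive field follows mechanically from the layer stack; a small helper, assuming a simple `(kernel, stride)` description per layer:

```python
def receptive_field(layers):
    """Receptive field size (in input pixels) of a stack of layers,
    each given as (kernel_size, stride). Illustrates why an FCN sees
    one fixed-size window regardless of the object's actual scale."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) input steps
        jump *= s             # striding stretches subsequent contributions
    return rf
```

Because this size is baked into the architecture, objects much smaller or larger than it are hard to delineate, which is the limitation listed above.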
Results and Limitations