Lecture 6: CNNs for Detection, Tracking, and Segmentation: Region‐Based CNN (RCNN), Selective Search

CSED703R: Deep Learning for Visual Recognition (2016S)
Lecture 6: CNNs for Detection,
Tracking, and Segmentation
Object Detection
Bohyung Han
Computer Vision Lab.
bhhan@postech.ac.kr
Region‐based CNN (RCNN)
• Pipeline: Input image → Extract region proposals → Compute CNN features → Classification
 Region proposals: any proposal method (e.g., selective search, edgebox)
 CNN features: any architecture
 Classification: softmax, SVM
• Object detection
 Independent evaluation of each proposal
 Bounding box regression improves detection accuracy.
 Mean average precision (mAP): 53.7% with bounding box regression on the VOC 2010 test set
[Girshick14] R. Girshick, J. Donahue, T. Darrell, J. Malik: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014

Selective Search
• Motivation
 The sliding window approach is not feasible for object detection with convolutional neural networks.
 We need a faster method to identify object candidates.
• Finding object proposals
 Greedy hierarchical superpixel segmentation
 Diversification of superpixel construction and merge
   • Using a variety of color spaces
   • Using different similarity measures
   • Varying starting regions
[Uijlings13] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders: Selective Search for Object Recognition. IJCV 2013
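The greedy hierarchical grouping behind selective search can be sketched as follows. This is a minimal illustration, not the implementation of [Uijlings13]: the region representation (a dict with `size`, `hist`, `box`), the histogram-intersection similarity, and all function names are assumptions made for the example, and the initial superpixels are taken as given.

```python
import numpy as np

def hist_intersection(h1, h2):
    """Similarity of two normalised histograms (1.0 means identical)."""
    return float(np.minimum(h1, h2).sum())

def merge(r1, r2):
    """Merge two regions: union bounding box, size-weighted histogram."""
    size = r1["size"] + r2["size"]
    x0, y0, x1, y1 = r1["box"]
    a0, b0, a1, b1 = r2["box"]
    return {
        "size": size,
        "hist": (r1["size"] * r1["hist"] + r2["size"] * r2["hist"]) / size,
        "box": (min(x0, a0), min(y0, b0), max(x1, a1), max(y1, b1)),
    }

def hierarchical_grouping(regions, neighbours):
    """regions: dict id -> region; neighbours: set of frozenset({i, j}) pairs.
    Returns the bounding boxes of all regions seen while merging (the proposals)."""
    regions = dict(regions)
    neighbours = set(neighbours)
    proposals = [r["box"] for r in regions.values()]
    next_id = max(regions) + 1

    def pair_similarity(pair):
        a, b = tuple(pair)
        return hist_intersection(regions[a]["hist"], regions[b]["hist"])

    while neighbours:
        # Greedy step: merge the most similar pair of adjacent regions.
        best = max(neighbours, key=pair_similarity)
        i, j = tuple(best)
        regions[next_id] = merge(regions[i], regions[j])
        proposals.append(regions[next_id]["box"])
        # Neighbours of i or j become neighbours of the merged region.
        updated = set()
        for pair in neighbours:
            if pair == best:
                continue
            if i in pair or j in pair:
                (other,) = pair - {i, j}
                updated.add(frozenset({other, next_id}))
            else:
                updated.add(pair)
        neighbours = updated
        del regions[i], regions[j]
        next_id += 1
    return proposals
```

Diversification, as listed on the slide, would correspond to running this procedure several times with different color spaces, similarity measures, and starting regions, and pooling the resulting boxes.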
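Bounding Box Regression
• Learning a transformation of bounding box (notation as in [Girshick14])
 Region proposal: $P = (P_x, P_y, P_w, P_h)$
 Ground‐truth: $G = (G_x, G_y, G_w, G_h)$
 Transformation: $d_x(P), d_y(P), d_w(P), d_h(P)$
$$\hat{G}_x = P_w d_x(P) + P_x, \qquad \hat{G}_y = P_h d_y(P) + P_y$$
$$\hat{G}_w = P_w \exp(d_w(P)), \qquad \hat{G}_h = P_h \exp(d_h(P))$$
 Each $d_\star(P) = \mathbf{w}_\star^\top \phi_5(P)$ is linear in the CNN pool5 feature $\phi_5(P)$.
$$\mathbf{w}_\star = \operatorname{argmin}_{\hat{\mathbf{w}}_\star} \sum_i \left( t_\star^i - \hat{\mathbf{w}}_\star^\top \phi_5(P^i) \right)^2 + \lambda \lVert \hat{\mathbf{w}}_\star \rVert^2$$
 Regression targets: $t_x = (G_x - P_x)/P_w$, $t_y = (G_y - P_y)/P_h$, $t_w = \log(G_w/P_w)$, $t_h = \log(G_h/P_h)$

Detection Results
• VOC 2010 test set
• Feature analysis on VOC 2007 test set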
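The transformation on the Bounding Box Regression slide can be checked with a small sketch. It assumes the R-CNN parameterisation of [Girshick14], with boxes given as (center x, center y, width, height); the function names are illustrative only.

```python
import numpy as np

def regression_targets(P, G):
    """Targets (t_x, t_y, t_w, t_h) that map proposal P onto ground truth G.
    Boxes are (center_x, center_y, width, height)."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return np.array([(Gx - Px) / Pw,
                     (Gy - Py) / Ph,
                     np.log(Gw / Pw),
                     np.log(Gh / Ph)])

def apply_transformation(P, d):
    """Apply a predicted transformation d = (d_x, d_y, d_w, d_h) to proposal P."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    return np.array([Pw * dx + Px,       # shift the center, scaled by the box size
                     Ph * dy + Py,
                     Pw * np.exp(dw),    # rescale width and height in log space
                     Ph * np.exp(dh)])
```

Applying the targets of a proposal to that same proposal recovers the ground-truth box, which is why they serve as the regression targets.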
Fast RCNN
• Fast version of RCNN
 9x faster in training and 213x faster in testing than RCNN
 A single feature computation and ROI pooling using object proposals
 Bounding box regression into network
 Single stage training using multi‐task loss
[Girshick15] R. Girshick: Fast R‐CNN, ICCV 2015

Faster RCNN
• Fast RCNN + RPN
 Proposal computation into network
 Marginal cost of proposals: 10ms
[Ren15] S. Ren, K. He, R. Girshick, J. Sun: Faster R‐CNN: Towards Real‐Time Object Detection with Region Proposal Networks. NIPS 2015
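ROI pooling, as used in Fast RCNN, turns a proposal of any size into a fixed-size feature by max-pooling over a grid laid on the shared feature map. The sketch below is a simplified illustration: the bin-edge rounding, the inclusive ROI convention, and the name `roi_max_pool` are assumptions, not the exact scheme of [Girshick15].

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    """feature_map: array of shape (C, H, W), computed once per image.
    roi: (x0, y0, x1, y1), inclusive, in feature-map coordinates.
    Returns an array of shape (C, out_h, out_w) regardless of the ROI size."""
    x0, y0, x1, y1 = roi
    region = feature_map[:, y0:y1 + 1, x0:x1 + 1]
    C, h, w = region.shape
    out_h, out_w = out_size
    # Bin edges that split the ROI into an out_h x out_w grid of cells.
    ys = np.linspace(0, h, out_h + 1)
    xs = np.linspace(0, w, out_w + 1)
    pooled = np.empty((C, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            ya, yb = int(np.floor(ys[i])), int(np.ceil(ys[i + 1]))
            xa, xb = int(np.floor(xs[j])), int(np.ceil(xs[j + 1]))
            pooled[:, i, j] = region[:, ya:yb, xa:xb].max(axis=(1, 2))
    return pooled
```

Because every proposal is pooled from the same feature map, the convolutional layers run once per image rather than once per proposal.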
Object Detection Performance
• The RCNN family achieves state‐of‐the‐art performance in object detection.
[Figure: Pascal VOC 2007 Object Detection mAP (%)]

Faster RCNN with ResNet
[Figures: detection results]

Visual Tracking with Convolutional Neural Networks
Main Idea
• Training shared features and domain‐specific classifiers jointly.
[Figure: a shared feature representation feeds domain‐specific classifiers for Domain 1, Domain 2, Domain 3 and Domain 4; the shared features are then transferred to a new domain]

Visual Tracking
• MDNet (Multi‐Domain Network)
 Multi‐domain learning
 Separating shared and domain‐specific layers
• The Winner of Visual Object Tracking Challenge 2015
[Nam15] Hyeonseob Nam, Bohyung Han: Learning Multi‐Domain Convolutional Neural Networks for Visual Tracking, CVPR 2016
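Multi‐Domain Learning
• Domains are trained in turn, one per iteration: Iteration #nK+1, Iteration #nK+2, ..., Iteration #nK
[Figure: the active domain‐specific branch changes with the iteration]

Online Tracking using MDNet Features
[Figure: shared features are transferred to a new sequence]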
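The round-robin schedule of multi-domain learning can be sketched with a toy model. Everything here is an assumption made for illustration: the shared layers are reduced to a single linear map, each domain-specific classifier is a logistic unit, and the names and hyperparameters are hypothetical. Only the scheduling idea (iteration #nK+k updates the shared weights and branch k) comes from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_multi_domain(domains, feat_dim=4, cycles=50, lr=0.1, seed=0):
    """domains: list of (X, y) pairs, one per training sequence (domain).
    X has shape (N, D); y holds 0/1 labels (background / target)."""
    rng = np.random.default_rng(seed)
    K = len(domains)
    D = domains[0][0].shape[1]
    W_shared = rng.normal(scale=0.1, size=(D, feat_dim))                  # shared layers
    branches = [rng.normal(scale=0.1, size=feat_dim) for _ in range(K)]   # one classifier per domain
    updates = [0] * K
    for it in range(cycles * K):
        k = it % K                          # only domain k is active in this iteration
        X, y = domains[k]
        feats = X @ W_shared
        p = sigmoid(feats @ branches[k])
        err = (p - y) / len(y)              # gradient of mean cross-entropy w.r.t. the logits
        grad_branch = feats.T @ err
        grad_shared = X.T @ np.outer(err, branches[k])
        branches[k] -= lr * grad_branch     # the active branch is updated
        W_shared -= lr * grad_shared        # shared weights are updated in every iteration
        updates[k] += 1
    return W_shared, branches, updates
```

After training, only `W_shared` would be carried over to a new sequence; the branches are discarded and a fresh classifier is trained for the new domain.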
Online Tracking using MDNet Features
[Figure: shared features are transferred to a new sequence and fine‐tuned; candidates are then scored in Frame 2]
 $f^+(\mathbf{x})$: positive score
 $\mathbf{x}^* = \operatorname{argmax}_{\mathbf{x}} f^+(\mathbf{x})$

Online Tracking: Overview
• Draw target candidates → Find the optimal state → Collect training samples → Update the CNN if needed
• Repeat for the next frame
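The first two steps of the overview (draw target candidates, find the optimal state) can be sketched as below. The state layout (center x, center y, scale), the Gaussian sampling, the default `sigma`, and the `positive_score` callable standing in for the network output are all assumptions for illustration.

```python
import numpy as np

def draw_candidates(prev_state, n=256, sigma=(5.0, 5.0, 0.05), seed=None):
    """Sample n candidate states (cx, cy, scale) around the previous target state."""
    rng = np.random.default_rng(seed)
    return np.asarray(prev_state, dtype=float) + rng.normal(size=(n, 3)) * np.asarray(sigma)

def find_optimal_state(candidates, positive_score):
    """Return the candidate with the highest positive score, and that score."""
    scores = np.array([positive_score(x) for x in candidates])
    best = int(np.argmax(scores))
    return candidates[best], scores[best]
```

In a full tracker the returned score would also decide whether training samples are collected and whether the network is updated in this frame.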
Online Network Update
• Long‐Term Update
 Performed at regular intervals
 Using long‐term training samples
 For Robustness
• Short‐Term Update
 Performed at abrupt appearance changes ($f^+(\mathbf{x}^*) < 0.5$)
 Using short‐term training samples
 For Adaptiveness
[Figure: $f^+(\mathbf{x}^*)$ per frame: 1, 0.82, 0.91, 0.86, 0.93, 0.94, 0.85, 0.73, 0.78, 0.66, 0.38, 0.53, 0.47, 0.62, 0.83, 0.88, with long‐term and short‐term updates marked]

Hard Negative Mining
• Provide a “hard” minibatch in each training iteration.
[Figure: samples are drawn randomly from the pool of positive samples; samples are drawn randomly from the pool of negative samples and the far fewer ones with the highest scores are selected; together they form a minibatch for training the CNN]
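The two update rules can be illustrated on the score sequence from the Online Network Update slide. The sketch assumes the scores belong to consecutive frames starting at frame 1 and that the long-term interval is 10 frames; both are assumptions, since the figure's axis did not survive, and `update_schedule` is a hypothetical helper.

```python
# Per-frame positive scores of the estimated target, as listed on the slide.
scores = [1, 0.82, 0.91, 0.86, 0.93, 0.94, 0.85, 0.73,
          0.78, 0.66, 0.38, 0.53, 0.47, 0.62, 0.83, 0.88]

def update_schedule(scores, long_term_interval=10, threshold=0.5):
    """Return (frame, kind) for every frame in which the network is updated."""
    events = []
    for frame, s in enumerate(scores, start=1):
        if s < threshold:
            # Abrupt appearance change: adapt quickly with recent samples.
            events.append((frame, "short-term"))
        elif frame % long_term_interval == 0:
            # Regular interval: update with samples collected over a long period.
            events.append((frame, "long-term"))
    return events

print(update_schedule(scores))
# [(10, 'long-term'), (11, 'short-term'), (13, 'short-term')]
```

Under these assumptions the scores 0.38 and 0.47 trigger short-term updates, while frame 10 receives the regular long-term update.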
Hard Negative Mining
[Figure: positive and negative samples in the 1st, 5th and 30th minibatch as training iterations proceed]

Bounding Box Regression
• Improve the localization quality.
 DPM [Felzenszwalb et al. PAMI’10], R‐CNN [Girshick et al. CVPR’14]
• Frame 1: train a bounding box regression model from the ground‐truth and positive samples.
• Subsequent frames: adjust the tracking result by bounding box regression.
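Building a "hard" minibatch, as described on the Hard Negative Mining slides, can be sketched as follows. The pool layout (one sample per row), the default batch sizes, and the `positive_score` callable standing in for the network are assumptions for illustration.

```python
import numpy as np

def hard_negative_minibatch(pos_pool, neg_pool, positive_score,
                            n_pos=32, n_neg_draw=1024, n_neg_keep=96, seed=None):
    """Draw positives at random; draw many negatives at random and keep
    only those that the current network scores highest (the hard ones)."""
    rng = np.random.default_rng(seed)
    pos_idx = rng.choice(len(pos_pool), size=min(n_pos, len(pos_pool)), replace=False)
    neg_idx = rng.choice(len(neg_pool), size=min(n_neg_draw, len(neg_pool)), replace=False)
    drawn = neg_pool[neg_idx]
    scores = np.array([positive_score(x) for x in drawn])
    hardest = np.argsort(scores)[::-1][:n_neg_keep]   # highest positive score first
    return pos_pool[pos_idx], drawn[hardest]
```

Since the scores are recomputed with the current network in each iteration, the selected negatives become progressively harder as training proceeds.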
Results on OTB100 [Wu15]
• Protocol
 MDNet is trained with 58 sequences from {VOT’13,’14,’15}, excluding those in {OTB100}.
 Distance precision and overlap success rate by One‐Pass‐Evaluation (OPE)
[Wu15] Y. Wu, J. Lim, M.‐H. Yang: Object Tracking Benchmark. TPAMI 2015

Results on VOT2015
[Figure: ground‐truth vs. ours; 15 repetitions]
Semantic Segmentation by Fully Convolutional Network

Semantic Segmentation
• Segmenting images based on their semantic notion
Semantic Segmentation using CNN
• Image classification
[Figure: query image → class label]
• Semantic segmentation
 Given an input image, obtain a pixel‐wise segmentation mask using a deep Convolutional Neural Network (CNN)
[Figure: query image → segmentation mask]

Fully Convolutional Network (FCN)
• Interpreting fully connected layers as convolution layers
 Each fully connected layer is identical to a convolution layer with a large spatial filter that covers the entire input field.
[Figure: layer sizes in three settings]
 Fully connected layers: pool5 (7 x 7 x 512) → fc6 (4096) → fc7 (4096)
 Convolution layers: pool5 (7 x 7 x 512) → fc6 (1 x 1 x 4096) → fc7 (1 x 1 x 4096)
 For the larger input field: pool5 (22 x 22 x 512) → fc6 (16 x 16 x 4096) → fc7 (16 x 16 x 4096)
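The equivalence between a fully connected layer and a convolution layer whose filter covers the entire input field can be verified numerically. The sketch uses small channel counts to stay fast; the function names are illustrative, and the "convolution" is the cross-correlation that CNN frameworks compute.

```python
import numpy as np

def fc_layer(x, W, b):
    """Fully connected layer. x: (C, H, W_in) input; W: (out, C*H*W_in); b: (out,)."""
    return W @ x.reshape(-1) + b

def fc_as_conv(x, W, b, kh, kw):
    """Reshape the FC weights into (out, C, kh, kw) filters and slide them
    over x with stride 1 and no padding."""
    C, H, Wd = x.shape
    filters = W.reshape(W.shape[0], C, kh, kw)
    out_h, out_w = H - kh + 1, Wd - kw + 1
    y = np.empty((W.shape[0], out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i:i + kh, j:j + kw]
            y[:, i, j] = np.tensordot(filters, patch, axes=([1, 2, 3], [0, 1, 2])) + b
    return y
```

On a 7 x 7 input the convolution yields a 1 x 1 map equal to the fully connected output; on a 22 x 22 input the same weights yield a 16 x 16 map (22 - 7 + 1 = 16), matching the sizes on the slide.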
FCN for Semantic Segmentation
• Network architecture [Long15]
• End‐to‐end CNN architecture for semantic segmentation
• Interpret fully connected layers as convolutional layers
[Figure: 500x500x3 input → convolutional layers (pretrained on ImageNet, fine‐tuned for segmentation) → 16x16x21 score map → deconvolution (fixed) → segmentation]
[Long15] J. Long, E. Shelhamer, and T. Darrell, Fully Convolutional Networks for Semantic Segmentation. CVPR 2015

Deconvolution Filter
• Bilinear interpolation filter
 Same filter for every class
 No filter learning!
• How does this deconvolution work?
 The deconvolution layer is fixed.
 Fine‐tuning the convolutional layers of the network with segmentation ground‐truth.
[Figure: 64x64 bilinear interpolation]
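A fixed bilinear deconvolution filter of the kind described on the Deconvolution Filter slide can be constructed as below. This is a sketch under assumptions: the kernel formula is the common bilinear-upsampling construction, the 64x64 size is taken from the slide, and an upsampling stride of 32 is assumed for it. The function names are illustrative.

```python
import numpy as np

def bilinear_kernel(size):
    """2-D bilinear interpolation kernel of shape (size, size)."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    w = 1 - np.abs(np.arange(size) - center) / factor   # 1-D triangle weights
    return np.outer(w, w)

def deconv_weights(num_classes, size):
    """Weights of shape (num_classes, num_classes, size, size): each class
    channel is upsampled by the same kernel and classes do not mix."""
    weights = np.zeros((num_classes, num_classes, size, size))
    idx = np.arange(num_classes)
    weights[idx, idx] = bilinear_kernel(size)
    return weights

# 21 classes and a 64x64 kernel, as on the slides.
weights = deconv_weights(21, 64)
```

Because these weights are set once and never trained, fine-tuning only changes the convolutional layers beneath the deconvolution.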
Skip Architecture
• Ensemble of three different scales
 Combining complementary features
[Figure: feature maps ranging from more detailed to more semantic]

Limitations of FCN‐based Semantic Segmentation
• Coarse output score map
 A single bilinear filter has to handle the variations in all kinds of object classes.
 Difficult to capture the detailed structure of objects in the image
• Fixed size receptive field
 Unable to handle multiple scales
 Difficult to delineate objects that are much smaller or larger than the receptive field
• Noisy predictions due to skip architecture
 Trade‐off between detail and noise
 Minor quantitative performance improvement
Results and Limitations
[Figures: Input image, GT, FCN‐32s, FCN‐16s, FCN‐8s]
