
Object Detection and Dense Captioning
You Only Look Once: Unified, Real-Time Object
Detection. Joseph Redmon, Santosh Divvala,
Ross Girshick, Ali Farhadi, CVPR 2016
DenseCap: Fully Convolutional Localization
Networks for Dense Captioning. Justin Johnson,
Andrej Karpathy, Li Fei-Fei, CVPR 2016

Dana Berman and Guy Leibovitz


Faster R-CNN
- Region Proposal Network (RPN):
  - Anchor boxes (x_a, y_a, w_a, h_a)
  - Predicts k offset tuples (t_x, t_y, t_w, t_h), decoded as
    x = x_a + t_x w_a
    w = w_a exp(t_w)
    (and analogously for y and h)
- ROI pooling, classifier, and bbox regression
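To make the offset parameterization concrete, here is a minimal NumPy sketch (ours, not from the paper) that decodes predicted offsets into boxes:

```python
import numpy as np

def decode_boxes(anchors, offsets):
    """Decode RPN offsets (tx, ty, tw, th) into boxes (x, y, w, h).

    anchors: (N, 4) array of (xa, ya, wa, ha) anchor boxes
    offsets: (N, 4) array of predicted (tx, ty, tw, th)
    """
    xa, ya, wa, ha = anchors.T
    tx, ty, tw, th = offsets.T
    x = xa + tx * wa          # shift center by a fraction of anchor width
    y = ya + ty * ha          # shift center by a fraction of anchor height
    w = wa * np.exp(tw)       # scale width multiplicatively
    h = ha * np.exp(th)       # scale height multiplicatively
    return np.stack([x, y, w, h], axis=1)
```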
Faster R-CNN - Limitations
- Training:
  - NIPS 2015: alternating optimization
  - arXiv 2016: end-to-end (approximately)
- Not real-time: 0.2 sec/image
YOLO
YOLO - Overview
- You only look once
- Trainable end-to-end
YOLO - Method
Input image
YOLO - Method
We split the image into a grid.
YOLO - Method
Each cell predicts boxes and confidences: P(object)
YOLO - Method
Each cell also predicts a class probability.
YOLO - Method
The class probability is conditional: P(class|object)
[Figure: per-cell class probability map, with classes such as bicycle, car, dog, dining table]
YOLO - Method
Then we combine the box and class predictions.
YOLO - Method
Non-Maximal Suppression
Finally, we apply NMS and threshold the detections.
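NMS greedily keeps the highest-scoring box and discards remaining boxes that overlap it too much. A minimal NumPy sketch (illustrative, not the paper's implementation):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximal suppression.

    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    Returns indices of kept boxes, highest score first.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap the kept box above the threshold
        order = order[1:][iou <= iou_thresh]
    return keep
```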
YOLO - Method
The output size is fixed.
Each cell predicts:
- B bounding boxes. For each bounding box:
  - 4 coordinates (x, y, w, h)
  - 1 confidence value P(object)
- N class probabilities P(class|object)


YOLO - Method
For Pascal VOC:
- 7 x 7 grid
- B = 2 bounding boxes / cell
- N = 20 classes
7 x 7 x (2 x 5 + 20) = 7 x 7 x 30 tensor = 1470 outputs
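A minimal sketch (ours) of how this fixed-size output can be reshaped and combined into class-specific confidences P(class|object) * P(object):

```python
import numpy as np

S, B, N = 7, 2, 20          # grid size, boxes per cell, classes (Pascal VOC)
output = np.random.rand(S * S * (B * 5 + N))   # stand-in for the 1470 network outputs

cells = output.reshape(S, S, B * 5 + N)
boxes = cells[..., :B * 5].reshape(S, S, B, 5)  # (x, y, w, h, P(object)) per box
class_probs = cells[..., B * 5:]                # P(class|object) per cell

# class-specific confidence = P(class|object) * P(object), broadcast over boxes
conf = boxes[..., 4:5] * class_probs[:, :, None, :]   # shape (S, S, B, N)
print(output.size)  # 1470
```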
YOLO - Method
Neural Network
YOLO - Method
Inspired by Inception from GoogLeNet
YOLO - Method
Inception module (CVPR 2015):
YOLO - Method
One neural network is trained to be the whole
detection pipeline
YOLO - Method
Training:
- Pre-train conv. layers on ImageNet, using low-res input (1 week)
- For detection: add layers, increase image resolution
- Normalize bounding box coordinates to [0, 1]
- Data augmentation: random scale, translation, exposure and saturation (see the sketch below)
- Loss function: L2
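To make the augmentation bullet concrete, a toy NumPy sketch (ours; the jitter ranges are illustrative, not the paper's exact values, and scale/saturation are omitted for brevity):

```python
import numpy as np

def augment(img, rng):
    """Toy YOLO-style augmentation: random exposure and translation.

    img: H x W x 3 float array in [0, 1].
    """
    # random exposure: scale brightness by a random factor
    img = np.clip(img * rng.uniform(0.7, 1.3), 0, 1)
    # random translation: shift by up to 10% of height/width, zero-padding the border
    h, w = img.shape[:2]
    dy = rng.integers(-h // 10, h // 10 + 1)
    dx = rng.integers(-w // 10, w // 10 + 1)
    out = np.zeros_like(img)
    ys, xs = slice(max(dy, 0), h + min(dy, 0)), slice(max(dx, 0), w + min(dx, 0))
    out[ys, xs] = img[slice(max(-dy, 0), h + min(-dy, 0)),
                      slice(max(-dx, 0), w + min(-dx, 0))]
    return out
```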
YOLO - Method
Loss function:
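For reference, the sum-squared error loss from the YOLO paper, with λ_coord = 5 and λ_noobj = 0.5:

\[
\begin{aligned}
\mathcal{L} ={}& \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left( C_i - \hat{C}_i \right)^2
+ \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\]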
YOLO - Framework
Darknet - Open source neural networks in C
http://pjreddie.com/darknet/
YOLO - Results
YOLO works across a variety of natural images:
YOLO - Results
It also generalizes well to new domains, such as art:
YOLO - Results
Quantitative detection and localization results:
YOLO - Results
Limitations:
- Small objects
- Unusual aspect ratios
- Multiple objects per grid cell

Beyond YOLO
SSD: Single Shot MultiBox Detector
- ECCV 2016
- More accurate than Faster R-CNN
- FPS_YOLO > FPS_SSD > FPS_FasterRCNN
YOLO9000: Better, Faster, Stronger
- arXiv, 25 Dec 2016
- 9000 object classes
Lessons from SSD and YOLO9000
- Multi-scale feature maps
- Predict anchor box offsets:
  - Normalized
  - h = h_a exp(t_h)
  - Aspect ratios
- Data augmentation (scale, brightness, etc.)

Dense Captioning
Background: Computer Vision Tasks
Detection and Captioning
Background: Visual Genome Dataset
[Figure: an image densely annotated with region captions, e.g. "Two men playing frisbee", "A red flying frisbee", "Wooden privacy fence", "The legs of a man", "The ground is made of stone", "A boy wearing jeans", "An athletic shoe on a foot", "A red tricycle"]
108,077 images
5,408,689 regions + captions
Krishna et al., "Visual Genome", 2016
DenseCap - Overview
Justin Johnson*, Andrej Karpathy*, Li Fei-Fei (Stanford University)
Fully Convolutional Localization and Captioning Architecture
[Figure: Image (3 x W x H) → CNN → conv features (C x W x H) → Localization Layer: conv → region proposals (4k x W x H) and region scores (k x W x H) → best proposals (B x 4) → grid generator → sampling grid (B x X x Y x 2) → bilinear sampler → region features (B x 512 x 7 x 7) → Recognition Network → region codes (B x D) → LSTM → captions, e.g. "Striped gray cat", "Cats watching TV"]
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Ren et al., NIPS 2015
Spatial Transformer Networks, Jaderberg et al., NIPS 2015
Spatial Transformer Networks
[Figure: two examples of applying the parameterised sampling grid to an image U, producing the output V. (a) The sampling grid is the regular grid G = T_I(G), where I is the identity transformation parameters. (b) The sampling grid is the result of warping the regular grid with an affine transformation T_θ(G).]
Spatial Transformer Networks
[Figure: architecture of a spatial transformer module. The input feature map U is passed to a localisation net, which regresses the transformation parameters θ. A grid generator transforms the regular grid G over V into the sampling grid T_θ(G), which the sampler applies to U, producing the warped output feature map V. The combination of the localisation network and sampling mechanism defines a spatial transformer.]

For simplicity of exposition, assume for the moment that T_θ is a 2D affine transformation A_θ. In this affine case, the pointwise transformation is

\[ \begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \]

where (x_i^t, y_i^t) are the target coordinates of the regular grid in the output feature map, (x_i^s, y_i^s) are the source coordinates in the input feature map that define the sample points, and A_θ is the affine transformation matrix. Height- and width-normalised coordinates are used, such that -1 ≤ x_i^t, y_i^t ≤ 1 within the spatial bounds of the output, and -1 ≤ x_i^s, y_i^s ≤ 1 within the spatial bounds of the input.

The general sampling operation is

\[ V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y) \]

With an integer sampling kernel this reduces to

\[ V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \delta(\lfloor x_i^s + 0.5 \rfloor - m) \, \delta(\lfloor y_i^s + 0.5 \rfloor - n) \]
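A NumPy sketch (ours, not the paper's code) of the grid generator for the affine case, mapping normalised target coordinates to source coordinates:

```python
import numpy as np

def affine_grid(theta, H_out, W_out):
    """Generate source sampling coordinates for a 2D affine transform.

    theta: (2, 3) affine matrix A_theta; coordinates are normalised to [-1, 1].
    Returns an (H_out, W_out, 2) array of source (x^s, y^s) per target pixel.
    """
    ys, xs = np.meshgrid(np.linspace(-1, 1, H_out),
                         np.linspace(-1, 1, W_out), indexing="ij")
    ones = np.ones_like(xs)
    grid = np.stack([xs, ys, ones], axis=-1)   # homogeneous target coords (x^t, y^t, 1)
    return grid @ theta.T                      # (H_out, W_out, 2) source coords
```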
Spatial Transformer Networks
Here ⌊x + 0.5⌋ rounds x to the nearest integer and δ(·) is the Kronecker delta function. This sampling kernel equates to just copying the value at the nearest pixel to (x_i^s, y_i^s) to the output location (x_i^t, y_i^t). Alternatively, a bilinear sampling kernel can be used, giving

\[ V_i^c = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|) \]

To allow backpropagation of the loss through this sampling mechanism, we can define the gradients with respect to U and G. For bilinear sampling, the partial derivatives are

\[ \frac{\partial V_i^c}{\partial U_{nm}^c} = \sum_n^H \sum_m^W \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|) \]

\[ \frac{\partial V_i^c}{\partial x_i^s} = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1 - |y_i^s - n|) \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases} \]

and similarly for ∂V_i^c/∂y_i^s. This gives a (sub-)differentiable sampling mechanism, allowing loss gradients to flow back to both the input feature map and the sampling grid coordinates.
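A NumPy sketch (ours) of bilinear sampling, using the closed form of the equation above, i.e. interpolating the four neighbouring pixels:

```python
import numpy as np

def bilinear_sample(U, grid):
    """Sample feature map U at the source coordinates from the grid generator.

    U: (H, W) single-channel input; grid: (H_out, W_out, 2) of (x^s, y^s) in [-1, 1].
    Implements V_i = sum_nm U_nm * max(0, 1-|x^s-m|) * max(0, 1-|y^s-n|).
    """
    H, W = U.shape
    # map normalised coordinates to pixel coordinates
    x = (grid[..., 0] + 1) * (W - 1) / 2
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0   # fractional distances = bilinear weights
    return ((1 - wy) * (1 - wx) * U[y0, x0] + (1 - wy) * wx * U[y0, x0 + 1]
            + wy * (1 - wx) * U[y0 + 1, x0] + wy * wx * U[y0 + 1, x0 + 1])
```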
Dense Captioning Architecture - Losses
[Figure: Convolutional Network → Localization Layer → Recognition Network → Recurrent Network]
Joint training: minimize five losses
1. Box regression (position) in the localization layer
2. Box classification (confidence) in the localization layer
3. Box regression (position) in the recognition network
4. Box classification (confidence) in the recognition network
5. Captioning
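A sketch of what minimizing five losses jointly can look like in training code; the weights and names here are illustrative, not the authors' values:

```python
def total_loss(l):
    """Weighted sum of the five DenseCap-style losses.

    l: dict with the five loss components; weights are illustrative only.
    """
    weights = {
        "rpn_box": 1.0,     # 1. box regression in the localization layer
        "rpn_cls": 1.0,     # 2. box classification in the localization layer
        "final_box": 1.0,   # 3. box regression in the recognition network
        "final_cls": 1.0,   # 4. box classification in the recognition network
        "caption": 1.0,     # 5. captioning (per-word cross-entropy)
    }
    return sum(weights[k] * l[k] for k in weights)
```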
Dense Captioning: Prior Work
[Figure: region proposals → crop → Convolutional Network → Recurrent Network, generating captions token by token, e.g. "START man throwing disc" → "man throwing disc END", "START red frisbee" → "red frisbee END", "START gray stone ground" → "gray stone ground END"]
Karpathy and Fei-Fei, CVPR 2015
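As a toy illustration (ours) of the START → ... → END decoding loop in the figure, with a stand-in step function instead of a trained LSTM:

```python
import numpy as np

# Toy vocabulary; a real model would condition each step on region features.
vocab = ["START", "man", "throwing", "disc", "END"]
rng = np.random.default_rng(0)

def rnn_step(token_id, state):
    """Stand-in for one LSTM step: returns next-token logits and new state."""
    logits = rng.normal(size=len(vocab))
    logits[min(token_id + 1, len(vocab) - 1)] += 10.0  # bias toward a fixed caption
    return logits, state

token, state, caption = vocab.index("START"), None, []
while True:
    logits, state = rnn_step(token, state)
    token = int(np.argmax(logits))          # greedy decoding
    if vocab[token] == "END" or len(caption) > 10:
        break
    caption.append(vocab[token])
print(" ".join(caption))  # expected: "man throwing disc"
```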


Results
[Figure: example dense captions on images, e.g. "man wearing a blue shirt sitting on a chair", "black computer monitor", "wall is white", "people are in the background", "red and brown chair", "silver handle on the wall", "black bag on the floor", "man with black hair", "computer monitor on a desk"]
Additional Application - Finding Regions Given Description
[Figure: query "head of a giraffe"; each proposed region is scored against the description, e.g. 0.9 on the matching regions and 0.1-0.4 elsewhere]
Additional Application - Finding Regions Given Description
[Figure: query "front wheel of a bus"]
