Dense Captioning
You Only Look Once: Unified, Real-Time Object Detection. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, CVPR 2016.
DenseCap: Fully Convolutional Localization Networks for Dense Captioning. Justin Johnson, Andrej Karpathy, Li Fei-Fei, CVPR 2016.
x = x_a + t_x w_a
w = w_a exp(t_w)
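These anchor-relative parameterisations can be decoded with a few lines of NumPy (a minimal sketch; `apply_box_deltas` and the centre-size box layout are assumptions, not from the slides):

```python
import numpy as np

def apply_box_deltas(anchors, deltas):
    """Decode predicted offsets (t_x, t_y, t_w, t_h) relative to anchor boxes.

    anchors: (N, 4) array of (x_a, y_a, w_a, h_a) centre-size boxes.
    deltas:  (N, 4) array of predicted (t_x, t_y, t_w, t_h).
    """
    x_a, y_a, w_a, h_a = anchors.T
    t_x, t_y, t_w, t_h = deltas.T
    x = x_a + t_x * w_a        # shift the centre by a fraction of the anchor width
    y = y_a + t_y * h_a
    w = w_a * np.exp(t_w)      # scale width multiplicatively
    h = h_a * np.exp(t_h)
    return np.stack([x, y, w, h], axis=1)
```

With all-zero deltas the decoded boxes equal the anchors, so the network only has to learn small corrections.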
[Figure: example detections labeled bicycle, car, dog, dining table]
YOLO - Method
Then we combine the box and class predictions.
YOLO - Method
Non-Maximal Suppression
Finally we do NMS and threshold detections.
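The NMS step can be sketched as a greedy loop in NumPy (corner-format boxes; the 0.5 IoU threshold is illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximal suppression on (x1, y1, x2, y2) boxes."""
    order = scores.argsort()[::-1]  # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # intersection of the kept box with every remaining box
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop boxes that overlap too much
    return keep
```

Thresholding the detection confidences happens before this loop; NMS only removes duplicates of already-confident detections.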
YOLO - Method
The output size is fixed. Each cell predicts:
- B bounding boxes. For each bounding box:
  - 4 coordinates (x, y, w, h)
  - 1 confidence value
- A number of class probabilities

For PASCAL VOC:
- 448 x 448 input resolution
- Normalize bounding box coordinates to [0, 1]
- h = h_a exp(t_h)
- Aspect ratios
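The fixed-size output can be made concrete with a small NumPy sketch using YOLO's PASCAL VOC configuration (S = 7 grid, B = 2 boxes, C = 20 classes, giving a 7 x 7 x 30 tensor); `decode_cell` is an illustrative name, not from the paper:

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (YOLO's VOC setup)

# The network output is a fixed-size S x S x (B*5 + C) tensor.
out = np.random.rand(S, S, B * 5 + C)

def decode_cell(cell, row, col):
    """Decode one grid cell into whole-image boxes in [0, 1] coordinates."""
    boxes = []
    for b in range(B):
        x, y, w, h, conf = cell[b * 5 : b * 5 + 5]
        # (x, y) are offsets within the cell; (w, h) are relative to the image
        boxes.append(((col + x) / S, (row + y) / S, w, h, conf))
    class_probs = cell[B * 5 :]  # C conditional class probabilities
    return boxes, class_probs
```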
[Figure: example image with region captions: "Wooden privacy fence", "The legs of a man", "The ground is made of stone", "A boy wearing jeans", "An athletic shoe on a foot", "A red tricycle"]
Visual Genome: 108,077 images, 5,408,689 regions + captions.
Krishna et al., "Visual Genome", 2016.
Justin Johnson*, Andrej Karpathy*, Li Fei-Fei (Stanford University)
Overview
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Ren et al., NIPS 2015
Spatial Transformer Networks, Jaderberg et al., NIPS 2015
Spatial Transformer Networks

[Figure: the architecture of a spatial transformer module. The input feature map U is passed to a localisation net, which regresses the transformation parameters. The grid generator produces the sampling grid T_theta(G), which the sampler applies to U, producing the warped output feature map V. The combination of the localisation network and sampling mechanism defines a spatial transformer.]

[Figure: two examples of applying the parameterised sampling grid to an image U, producing the output V. (a) The sampling grid is the regular grid G = T_I(G), where I is the identity transformation parameters. (b) The sampling grid is the result of warping the regular grid with an affine transformation T_theta(G).]

... a differentiable attention mechanism, while [14] use a differentiable attention mechanism utilising Gaussian kernels in a generative model. The work by Girshick et al. [11] uses a region proposal algorithm as a form of attention, and [7] show that it is possible to regress salient regions with a CNN. The framework presented here can be seen as a generalisation of differentiable attention to any spatial transformation.

For clarity of exposition, assume for the moment that $\mathcal{T}_\theta$ is a 2D affine transformation $A_\theta$; other transformations are discussed below. In this affine case, the pointwise transformation is

$$
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= \mathcal{T}_\theta(G_i)
= A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
\begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
$$

where $(x_i^t, y_i^t)$ are the target coordinates of the regular grid in the output feature map, $(x_i^s, y_i^s)$ are the source coordinates in the input feature map that define the sample points, and $A_\theta$ is the affine transformation matrix. We use height and width normalised coordinates, such that $-1 \le x_i^t, y_i^t \le 1$ when within the spatial bounds of the output, and $-1 \le x_i^s, y_i^s \le 1$ when within the spatial bounds of the input.

Applying a sampling kernel $k$ to the input feature map at these points gives the output

$$
V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m;\, \Phi_x)\, k(y_i^s - n;\, \Phi_y)
$$
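The grid generator step can be sketched in NumPy: build the regular target grid in normalised [-1, 1] coordinates and push every point through the 2x3 affine matrix (a minimal sketch; `affine_grid` is a hypothetical helper name):

```python
import numpy as np

def affine_grid(theta, H, W):
    """Map every target grid point (x_t, y_t) in [-1, 1] through the
    2x3 affine matrix theta to source coordinates (x_s, y_s)."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    G = np.stack([xs, ys, np.ones_like(xs)], axis=-1)  # (H, W, 3) homogeneous coords
    return G @ theta.T                                  # (H, W, 2) source coords

# Identity parameters reproduce the regular grid, case (a) in the figure.
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
```

A scaling matrix such as diag(0.5, 0.5) shrinks the grid toward the centre of the input, which corresponds to the sampler zooming in.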
Spatial Transformer Networks

Here $\lfloor x + 0.5 \rfloor$ rounds $x$ to the nearest integer and $\delta(\cdot)$ is the Kronecker delta function. This sampling kernel equates to just copying the value at the nearest pixel to $(x_i^s, y_i^s)$ to the output location $(x_i^t, y_i^t)$. Alternatively, a bilinear sampling kernel can be used, giving
$$
V_i^c = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|) \tag{5}
$$
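Equation (5) translates almost directly into code; a minimal single-channel sketch that only visits the at-most-four pixels with nonzero weight:

```python
import numpy as np

def bilinear_sample(U, points):
    """Sample a single-channel map U (H, W) at real-valued (x_s, y_s) points."""
    H, W = U.shape
    out = np.zeros(len(points))
    for i, (xs, ys) in enumerate(points):
        # only pixels with |x_s - m| < 1 and |y_s - n| < 1 have nonzero weight
        for n in range(max(0, int(np.floor(ys))), min(H, int(np.floor(ys)) + 2)):
            for m in range(max(0, int(np.floor(xs))), min(W, int(np.floor(xs)) + 2)):
                out[i] += U[n, m] * max(0, 1 - abs(xs - m)) * max(0, 1 - abs(ys - n))
    return out
```

At integer coordinates this reduces to an exact pixel copy, matching the nearest-pixel kernel.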
To allow backpropagation of the loss through this sampling mechanism we can define the gradients with respect to $U$ and $G$. For bilinear sampling (5) the partial derivatives are
$$
\frac{\partial V_i^c}{\partial U_{nm}^c} = \sum_n^H \sum_m^W \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|) \tag{6}
$$

$$
\frac{\partial V_i^c}{\partial x_i^s} = \sum_n^H \sum_m^W U_{nm}^c \max(0, 1 - |y_i^s - n|)
\begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases} \tag{7}
$$

and similarly to (7) for $\frac{\partial V_i^c}{\partial y_i^s}$.

This gives us a (sub-)differentiable sampling mechanism, allowing loss gradients to flow back to both the input feature map and the sampling grid coordinates.
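The derivative in (7) can be sanity-checked numerically against a central finite difference of (5); a minimal single-channel sketch:

```python
import numpy as np

def V(U, xs, ys):
    """Bilinear sample of single-channel U at (xs, ys), as in eq. (5)."""
    H, W = U.shape
    return sum(U[n, m] * max(0, 1 - abs(xs - m)) * max(0, 1 - abs(ys - n))
               for n in range(H) for m in range(W))

def dV_dxs(U, xs, ys):
    """Analytic partial derivative from eq. (7)."""
    H, W = U.shape
    g = 0.0
    for n in range(H):
        for m in range(W):
            if abs(m - xs) >= 1:
                continue                      # the zero branch of the case split
            sign = 1.0 if m >= xs else -1.0   # the +1 / -1 branches
            g += U[n, m] * max(0, 1 - abs(ys - n)) * sign
    return g

U = np.arange(16.0).reshape(4, 4)
xs, ys, eps = 1.3, 2.6, 1e-5
numeric = (V(U, xs + eps, ys) - V(U, xs - eps, ys)) / (2 * eps)
```

Because (5) is piecewise linear in the sample coordinates, the analytic and finite-difference gradients agree exactly away from the integer-coordinate kinks; at the kinks only a subgradient exists, hence "(sub-)differentiable".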
Dense Captioning Architecture

Convolutional Network -> Localization Layer -> Recognition Network -> Recurrent Network

Joint training: minimize five losses.
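The joint objective can be sketched as a sum of the five per-module terms (a hedged sketch: the key names are hypothetical, and a real implementation would weight the terms):

```python
def densecap_loss(terms):
    """Sum the five DenseCap training losses.

    terms: dict with hypothetical keys for the five losses:
      two from the localization layer (region score + box regression),
      two from the recognition network (final score + box regression),
      and the captioning loss from the recurrent network.
    """
    return (terms["loc_score"] + terms["loc_box"]
            + terms["final_score"] + terms["final_box"]
            + terms["caption"])
```

Training end to end means gradients from the captioning loss also reach the localization layer, which is what the differentiable sampling machinery above makes possible.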
[Figure: example dense captions: "people are in the background", "red and brown chair", "silver handle on the wall", "black bag on the floor", "man with black hair", "computer monitor on a desk"]
Additional Application - Finding Regions Given Description

Finding regions given descriptions: query "head of a giraffe"

[Figure: candidate regions scored against the query, e.g. 0.1, 0.1, 0.2, 0.4, 0.9, 0.9]
Finding regions given descriptions: query "front wheel of a bus"
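Retrieval then reduces to ranking regions by a match score against the query. DenseCap scores each region by the query's likelihood under that region's language model; as a simple stand-in, this sketch ranks hypothetical region embedding vectors by cosine similarity to a query embedding:

```python
import numpy as np

def rank_regions(region_feats, query_vec):
    """Return region indices sorted from best to worst match.

    region_feats: (N, D) array of per-region feature vectors (hypothetical).
    query_vec:    (D,) embedding of the query description (hypothetical).
    """
    sims = region_feats @ query_vec
    sims = sims / (np.linalg.norm(region_feats, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return np.argsort(-sims)  # best-matching region first
```

The highest-ranked regions correspond to the high-scoring boxes (e.g. the 0.9s) in the figure.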