
3D Person Tracking in Retail Stores

Russell Kaplan and Michael Yu


Stanford University
450 Serra Mall, Stanford, CA 94305
{rjkaplan, myu3}@stanford.edu

Abstract

In this project, we sought to track the movement of multiple people in 3D given security footage that is representative of what would be available in retail stores, without modifying existing camera deployments. More specifically, this involves using a single, possibly fisheye-distorted view to track people in 3D space and model where they are in a room or store. This involved bridging various papers across the computer vision literature, looking at radial distortion resolution for images from an uncalibrated camera; calibration techniques from single-view metrology, affine approximation, and a simplified special-case mathematical model for object detections on a ground plane; and deep region-based convolutional networks for 2D person detection.

We define an approach that uses this single security camera view to track people on a ground plane, relying on some assumptions about the geometry of the space but no additional hardware, which gives our work an advantage over existing companies and processes that rely on more sophisticated sensors or sensor networks for person tracking. The assumptions we make are realistic in the retail store context, and our quantitative results are compelling: significantly more accurate than WiFi-based tracking solutions that are already being deployed [9]. Our approach can be deployed in most retail spaces without any hardware modifications to existing security setups. We are excited by the real-world applicability of our work.

1. Introduction

As we see much of retail moving online from brick-and-mortar, opportunities to analyze consumer shopping behavior are rapidly growing and becoming commercialized. Recommendation engines, product positioning on webpages, and sales funnels are all relentlessly A/B tested and optimized to seduce the consumer into clicking "Buy." Much as the abundance of data makes optimization and iteration easy in the e-commerce space, brick-and-mortar retail stores are looking to obtain and harness similar data to shelve products, auction shelf space, and strategically place discounts and information.

There are numerous current approaches to obtaining real-time consumer data in retail stores. Many of these involve tracking people's movement throughout the store with Bluetooth or WiFi tracking, or with photogates. All of these require significant hardware and deployment costs, which has hindered their scalability.

Computer vision is a promising tool to address this business need. As we have seen significant progress in the field of late (notably from convolutional neural networks), cameras are an attractive option to track users in a store and obtain data on customer retail behavior. Cameras are a particularly attractive way to obtain this data because most retail stores already have the necessary hardware in place for the purposes of security.

In this project, we track the 3D positions of multiple people throughout a store in real time, given a camera source and certain assumptions about the store layout (namely, that only one floor is visible to the camera and that people's feet are visible in the image frames). We show a live 3D bounding box whose coordinates are relative to the world frame that tracks each person as they move, with error levels within 55cm in outdoor settings and 30cm indoors, much better than the existing WiFi- and Bluetooth-based tracking solutions [9]. To do this, we tie together numerous concepts from computer vision in a single approach: we design an entirely new pipeline for this business need by integrating several existing techniques. For each subproblem, we tried various options, picked the best, and identified optimizations for this particular case where applicable; for instance, we found that affine approximation of tiled floor grids worked far better than vanishing points given the configuration of many security cameras with respect to the floor.

In this paper, we begin by examining the problem statement and related work, both work that we read to gain background knowledge and work whose approaches to distortion, calibration, and object detection we learned from and built on. We then dive into
our approaches to each of these three problems, and how they tie together into an end-to-end approach that could be deployed in retail stores. Finally, we use a larger-scale dataset to obtain quantitative metrics with which to evaluate the success of our approach, and we leave space for future work, such as integrating Extended Kalman Filters to enforce temporal consistency.

2. Problem Statement

Our objective is to accurately predict the 3D position of a person based on their location in a security camera image. If the person's true location (we use the location of their feet) is given by $(x^*, y^*, z^*)$, and we estimate the location $(x', y', z')$ for them in 3D space, then we are trying to minimize

$$d = \sqrt{(x^* - x')^2 + (y^* - y')^2 + (z^* - z')^2}$$

for each person in each image. Since we use people's feet, we can constrain $z' = 0$, enabling us to use a single view to predict position; this creates errors when people jump, but that is not typical behavior.

We use two coordinate systems in this paper. The primary system is standard: x spans the width of the image, increasing to the right, y spans the height, increasing downwards, and the origin (0, 0) is in the upper left hand corner of the image. We also use a separate coordinate system when undistorting images, where (0, 0) is at the optical center of the image, and x and y increase right and downwards respectively. This coordinate system is necessary to model radial distortion parameters by expressing points in polar coordinates about the optical center.

3. Related Work

There is a significant amount of work that has been done in the space of retail analytics via camera, but this work has been done almost exclusively by startups which protect their methods as intellectual property. These include Prism Skylabs, Brickstream, and RetailNext. These are all dependent on custom hardware or sensors to augment the surveillance feed.

On the technical front, we had to integrate work from various frontiers in computer vision, since solving this business problem required us to solve a number of technical problems. One was correction of barrel distortion; here we leaned heavily on Sing Bing Kang's work in Semiautomatic Methods for Recovering Radial Distortion Parameters from a Single Image, in which he defined an algorithm in which a user draws snakes on a distorted image, each approximately corresponding to a projected straight line in space [8]. In his paper, he outlines how these snakes can be used consistently with a model of radial image distortion to solve for the radial distortion parameters and thus undistort the image. For a given snake, the algorithm fits it to the line of best fit, rotates this line to be horizontal, and estimates constant distortion parameters that fit all of these snakes/lines.

Another clear problem was recovering both intrinsic and extrinsic camera parameters from a single view. To do this, we used the affine calibration approximation taught in class and covered in R. Hartley and A. Zisserman's textbook, Multiple View Geometry in Computer Vision [7]. In particular, we used calibration from a checkerboard with the direct linear transformation algorithm, with tiled floors as our checkerboard. We had also tried single view metrology with three sets of parallel lines, but this left us estimating extrinsics.

Finally, we had the problem of object detection, to find people in our image frame. Cutting-edge research in object detection suggests that deep convolutional nets are the best way to do this. Scalable Object Detection using Deep Neural Networks by Erhan et al. at Google demonstrated that convolutional neural nets are very powerful for finding regions of interest, while also having an effective recognition path that categorizes the object of interest. The two steps take a while though, and are not necessarily fast enough to build real-time bounding boxes on video; Ren et al.'s Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks takes this work a step further by folding the localization and recognition paths into the same convolutional neural network, training the weights with the localization and recognition cost functions alternately [10]. We use this work directly as a component of our solution.

Finally, after proving our concept on some footage via YouTube, we were able to find a much more expansive dataset with ground truths, A New Dataset for People Tracking and Reidentification, via the Video Surveillance Online Repository [11]. This dataset was also pre-calibrated and undistorted for us, via the methodology outlined in Cooperative Object Tracking with Multiple PTZ Cameras, presented by Everts, Jones, and Sebe [3].

4. Technical Approach

4.1. Distortion Correction

While there are a number of approaches to fixing this issue, such as un-distorting the image with projections of area, or computing radial distortion coefficients, most methods depend on knowing the intrinsics of the camera pre-distortion. Sing Bing Kang's work, however, suggests a method to manually pick points on a line and accordingly fit distortion parameters, as referred to above [8]. In particular, it tries to fit all points that should be collinear (as indicated by the user) so that they are, while moving those
points as little as possible, and moving points only radially, by adjusting the radius parameters.

Radial distortion, of which barrel distortion is a type, can be modeled by imposing a polar coordinate system on the image. From the center of the image, each pixel has an angle and a distance from the optical center. Changes in this distance create radial distortion. The distortion at a point, which we call $\Delta r$, is the change in distance from the optical center relative to the undistorted distance. We model this distortion with the equation

$$\Delta r = \sum_{i=1} C_{2i+1}\, r^{2i+1}, \quad \text{where } r = \sqrt{x^2 + y^2}.$$

Then, the approach is to find values of $C$ such that all points we manually constrained to be collinear become collinear, while also minimizing the distance we move them. This is a common radial distortion correction algorithm, and we found that Photoshop actually provides a very effective implementation which can be used to adjust entire videos, imposing the same distortion parameters on each frame. We went this route to manually undistort video, rather than implementing the polar geometry and parameter solver from scratch.
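To make the procedure concrete, below is a minimal NumPy/SciPy sketch of what such a solver could look like, assuming hand-labeled snakes (lists of pixel coordinates that should be collinear) and a known optical center. The two-term truncation of the series, the optimizer choice, and the function names are our own illustrative assumptions, not the exact routine of [8] or of Photoshop's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def undistort_points(pts, coeffs, center):
    # pts: (N, 2) pixel coordinates; center: optical center (cx, cy).
    # Radial model: delta_r = c3 * r**3 + c5 * r**5, applied outward from the center.
    v = pts - center
    r = np.linalg.norm(v, axis=1, keepdims=True)
    delta = sum(c * r ** (2 * i + 3) for i, c in enumerate(coeffs))
    scale = (r + delta) / np.maximum(r, 1e-9)
    return center + v * scale

def collinearity_error(coeffs, snakes, center):
    # Sum of squared perpendicular distances of each undistorted snake
    # to its own best-fit line (the line is fit via SVD of the centered points).
    err = 0.0
    for snake in snakes:
        p = undistort_points(np.asarray(snake, float), coeffs, center)
        p0 = p - p.mean(axis=0)
        _, _, vt = np.linalg.svd(p0, full_matrices=False)
        normal = vt[-1]                      # direction orthogonal to the fitted line
        err += np.sum((p0 @ normal) ** 2)
    return err

def fit_radial_distortion(snakes, center, n_coeffs=2):
    res = minimize(collinearity_error, np.zeros(n_coeffs),
                   args=(snakes, center), method="Nelder-Mead")
    return res.x  # estimated [C3, C5]
```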
4.2. Calibration

Being able to map pixel coordinates to world coordinates is a central component of understanding shoppers' 3D locations from images. Thus it is necessary to find a robust camera calibration for any given video feed. We considered two approaches to solving this problem for our unlabeled retail store data. For our 3DPeS data, calibration parameters were included with the dataset; those parameters were given in a third type of calibration formulation, which we also explain below. Because retail cameras only need to be calibrated once, it is practical to do these calibrations by hand in a real-world context; thus we did not invest time in automating the calibration process.

4.2.1 Single View Metrology

Given that we are constrained to one camera, in an environment with multiple sets of mutually orthogonal lines, it is natural to first try an approach based on single-view metrology to calibrate the camera. In our retail video, we labeled three sets of mutually orthogonal lines by selecting two points on each of 6 lines. For each pair of parallel lines, we found the corresponding vanishing point $v_i$, $i \in \{1, 2, 3\}$, by computing the intersection of the lines in image coordinates.

Figure 1. Sets of mutually orthogonal lines used for single view metrology calibration in the retail camera frame. The line intersections give us three vanishing points, which we use in the equations below to solve for K.

We then considered the matrix $\omega$, the projection of the absolute conic $\Omega_\infty$ into image coordinates. By assuming a camera with square pixels and zero skew (a reasonable assumption for a retail security video that has been corrected for barrel distortion), we can constrain $\omega$ to:

$$\omega = \begin{bmatrix} \omega_1 & 0 & \omega_2 \\ 0 & \omega_1 & \omega_3 \\ \omega_2 & \omega_3 & \omega_4 \end{bmatrix} \quad (1)$$

This matrix has four unknowns, but it is only known up to scale. This means there are effectively three unknowns if we set one of the unknown variables to 1 and scale the rest accordingly. As a result, we can solve for the matrix $\omega$ by using our three vanishing points and exploiting the fact that, because they are mutually orthogonal, $v_i^\top \omega v_j = 0$ for each $v_i, v_j$ with $i \neq j$. Thus we have three scalar equations in three unknowns:

$$v_1^\top \omega v_2 = 0 \quad (2)$$
$$v_1^\top \omega v_3 = 0 \quad (3)$$
$$v_2^\top \omega v_3 = 0 \quad (4)$$

It is known that $\omega = (KK^\top)^{-1}$, where $K$ is the 3x3 matrix of camera intrinsics. So we can find $K$ using the Cholesky decomposition of $\omega$. We did this with our retail video and found the camera intrinsics. Unfortunately, after this process the extrinsics $[R|T]$ are still unknown, so we could not recover the entire camera matrix $P = K[R|T]$. Setting the camera to be the origin in world coordinates is not helpful, because even though it resolves the $[R|T]$ parameters (they would simply be $[I|0]$), we still need to know where the ground plane is in world coordinates to resolve the projective ambiguity of mapping a pixel to a world point. We tried estimating $[R|T]$ by hand through trial and error, but the results were very unreliable.
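As an illustration of the procedure above, the following NumPy sketch solves the linear system (2)-(4) for $\omega$ and recovers K via a Cholesky factorization. The function name and the sign-fixing convention are our own assumptions; the inputs are the three vanishing points as homogeneous 3-vectors.

```python
import numpy as np

def K_from_vanishing_points(v1, v2, v3):
    """Recover intrinsics K (square pixels, zero skew) from three mutually
    orthogonal vanishing points given as homogeneous 3-vectors."""
    vs = [np.asarray(v, float) for v in (v1, v2, v3)]
    # omega = [[w1, 0, w2], [0, w1, w3], [w2, w3, w4]], known up to scale.
    # Each orthogonal pair gives v_i^T omega v_j = 0, linear in (w1, w2, w3, w4).
    rows = []
    for a, b in [(0, 1), (0, 2), (1, 2)]:
        vi, vj = vs[a], vs[b]
        rows.append([vi[0] * vj[0] + vi[1] * vj[1],   # coefficient of w1
                     vi[0] * vj[2] + vi[2] * vj[0],   # coefficient of w2
                     vi[1] * vj[2] + vi[2] * vj[1],   # coefficient of w3
                     vi[2] * vj[2]])                  # coefficient of w4
    _, _, vt = np.linalg.svd(np.array(rows))
    w1, w2, w3, w4 = vt[-1]                           # null-space solution
    omega = np.array([[w1, 0, w2], [0, w1, w3], [w2, w3, w4]])
    if omega[0, 0] < 0:                               # fix the overall sign so omega is PD
        omega = -omega
    C = np.linalg.cholesky(omega)                     # omega = C C^T
    K = np.linalg.inv(C).T                            # then K K^T = omega^{-1}
    return K / K[2, 2]
```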
4.2.2 Affine Calibration

Our problems with a calibration based on single view metrology could be resolved by finding point correspondences and solving for the camera matrix directly. For our retail video, we labeled 15 points by hand in the scene. We place the origin at the bottom left corner of the bottom-leftmost tile that is fully visible, we let each tile be 1 × 1 in width and height in world coordinates, and we say that all tiles lie on the ground plane z = 0. We model the camera matrix P as affine, which is a desirable approximation even though the true camera matrix is projective, because the lines in the scene are nearly parallel and solving for fewer unknowns is preferred with only 15 point correspondences. That is, we let:

$$P = \begin{bmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\ 0 & 0 & 0 & 1 \end{bmatrix} \quad (5)$$

Then, we use our n = 15 points to solve the following over-constrained system of 2n equations:

$$Ax = b \quad (6)$$

where the world coordinates of point i are $(x_i, y_i, z_i)$, the image coordinates are $(u_i, v_i)$, and:

$$A = \begin{bmatrix}
x_1 & y_1 & z_1 & 1 & 0 & 0 & 0 & 0 \\
x_2 & y_2 & z_2 & 1 & 0 & 0 & 0 & 0 \\
\vdots & & & & & & & \\
x_n & y_n & z_n & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & x_1 & y_1 & z_1 & 1 \\
0 & 0 & 0 & 0 & x_2 & y_2 & z_2 & 1 \\
\vdots & & & & & & & \\
0 & 0 & 0 & 0 & x_n & y_n & z_n & 1
\end{bmatrix} \quad (7)$$

$$b = \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \\ v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \quad (8)$$

and x is a column vector of the eight unknowns in P, arranged in order with the unknowns from the first row of P before the unknowns from the second. We solve the system of equations in the standard way: we rearrange so that the right hand side is 0, giving A an additional column (and x an additional row) to preserve the constraints imposed in the original equation by the values in b; we then take the SVD of this augmented left-hand matrix and use the last column of the third output of the SVD as the parameters of P.
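A minimal NumPy sketch of this affine DLT step is shown below, assuming hand-labeled correspondences; it builds A and b as in equations (6)-(8) and solves the homogeneous form [A | −b] via SVD. The function name is our own.

```python
import numpy as np

def fit_affine_camera(world_pts, image_pts):
    """world_pts: (n, 3) world coordinates (x, y, z); image_pts: (n, 2) pixels (u, v).
    Returns the 3x4 affine camera matrix P with last row [0, 0, 0, 1]."""
    n = len(world_pts)
    A = np.zeros((2 * n, 8))
    b = np.zeros(2 * n)
    for i, ((x, y, z), (u, v)) in enumerate(zip(world_pts, image_pts)):
        A[i, :4] = [x, y, z, 1.0]          # rows for the u equations
        A[n + i, 4:] = [x, y, z, 1.0]      # rows for the v equations
        b[i], b[n + i] = u, v
    # Homogeneous form: [A | -b] [x; 1] = 0, solved by the last right-singular vector.
    M = np.hstack([A, -b[:, None]])
    _, _, vt = np.linalg.svd(M)
    sol = vt[-1]
    sol = sol / sol[-1]                    # enforce the trailing 1
    return np.vstack([sol[:4], sol[4:8], [0.0, 0.0, 0.0, 1.0]])
```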
4.2.3 PTZ Calibration for Ground Plane Object Detection

Affine calibration worked well for our retail video data. But we used a different approach when working with the 3DPeS dataset, because that dataset already included parameters for a different type of calibration. Due to the difficult position of the cameras in the dataset, the publisher used a simpler type of calibration [3] designed specifically for Pan, Tilt, and Zoom (PTZ) cameras, which are commonly used in surveillance. The methodology is fully described in the source paper; we briefly summarize it here for convenience.

The calibration assumes that objects are only detected along a ground plane of Z = 0. Let U, V, H be the displacement of the camera coordinate system relative to the world; $\Delta i = i - i_0$ and $\Delta j = j - j_0$ be the pixel positions relative to the image's optical center $(i_0, j_0)$; $\alpha_x^f$ and $\alpha_y^f$ be the horizontal and vertical scales between the image and the image plane; $t$ be the tilt angle of the camera; and $p' = p + p_0$ be the pan angle after the camera is aligned with the world coordinate system. An object's world coordinates X, Y are then given as:

$$\begin{bmatrix} X \\ Y \end{bmatrix} = \frac{H}{\alpha_y^f \Delta i \sin t + \cos t}\; R \begin{bmatrix} \alpha_x^f \Delta j \\ \alpha_y^f \Delta i \\ -1 \end{bmatrix} + \begin{bmatrix} U \\ V \end{bmatrix} \quad (9)$$

where

$$R = \begin{bmatrix} \cos p' & \sin p' \cos t & \sin p' \sin t \\ \sin p' & -\cos p' \cos t & -\cos p' \sin t \end{bmatrix} \quad (10)$$
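The following sketch implements equations (9) and (10) directly. The parameter names and the dictionary packaging of the calibration values are our own conventions, not the format of the 3DPeS calibration files.

```python
import numpy as np

def ptz_ground_point(di, dj, cam):
    """Map pixel offsets (di, dj) from the optical center to world (X, Y) on the
    Z = 0 ground plane, following the PTZ model of equations (9) and (10).
    `cam` is a dict with keys U, V, H, alpha_x, alpha_y, tilt and pan."""
    ax, ay = cam["alpha_x"], cam["alpha_y"]
    t, p = cam["tilt"], cam["pan"]
    R = np.array([[np.cos(p),  np.sin(p) * np.cos(t),  np.sin(p) * np.sin(t)],
                  [np.sin(p), -np.cos(p) * np.cos(t), -np.cos(p) * np.sin(t)]])
    ray = np.array([ax * dj, ay * di, -1.0])
    scale = cam["H"] / (ay * di * np.sin(t) + np.cos(t))
    X, Y = scale * (R @ ray) + np.array([cam["U"], cam["V"]])
    return X, Y
```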
4.3. Person Detection

Figure 2. Our ConvNet can detect multiple pedestrians with high confidence, especially in clear environments such as this one. This image is from Camera 3 in the 3DPeS dataset.

Person detection in 2D images is a well-studied problem [12] [1] [6] with several existing solutions offering various tradeoffs between speed, accuracy and simplicity. There are essentially two parts to the problem: generating regions of interest (RoIs) where a person might be, and classifying those regions to determine whether each region does indeed contain a person.

Recently, approaches that utilize deep ConvNets have been shown to perform exceptionally well at object detection, and person detection specifically. For this part of our problem, we use the deep ConvNet object detection architecture proposed by Ren et al. known as Faster R-CNN [10]. Faster R-CNN is an improvement on Fast R-CNN [4], which is itself an improvement on the original R-CNN architecture [5]. Faster R-CNN works as a single, unified ConvNet that uses shared convolutional layers to output feature maps that then get sent to a Region Proposal Network (RPN) and a classifier head. The network is trained end-to-end with backpropagation and stochastic gradient descent, with a multi-task loss function. The full details can be found in [10].

Figure 3. The Faster R-CNN architecture. (Image from [10].)

We use a pretrained version of Faster R-CNN that we modified to only output person detections (the original version outputs detections of 20 types of objects).
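As an illustrative stand-in for this step (not the exact pretrained network used above), the sketch below shows the same filtering-to-person-detections idea using torchvision's pretrained Faster R-CNN, where the person class has COCO label 1.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Illustrative stand-in: load a pretrained Faster R-CNN and keep only "person"
# detections above a confidence threshold.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

def detect_people(frame_rgb, score_thresh=0.8):
    """frame_rgb: HxWx3 uint8 array. Returns a list of (x1, y1, x2, y2) person boxes."""
    with torch.no_grad():
        out = model([to_tensor(frame_rgb)])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)
    return out["boxes"][keep].cpu().numpy().tolist()
```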
4.4. Putting it Together

Given a neural network that can detect 2D bounding boxes around people and a calibrated camera, how do we put the system together? Our approach takes as input a stream of video data. For each frame in the stream, we run our person detector over the image and get as output a set of bounding boxes. In the general case, the mapping between the pixels defining the bounding box and the world coordinate system is ambiguous, thanks to the ambiguity of the 3D-to-2D transformation of the camera. But because we know these bounding boxes are people, we can make the assumption that each person's feet rest on the ground plane (Z = 0). This is a reasonable assumption in nearly all retail environments; the only test videos we encountered in which this is not the case are when the camera watches over an escalator or can see multiple floors at once.

Once we have bounding boxes for each person, we take the bottom center pixel of each box (call its image coordinates $(c_x, c_y)$) and find the 3D coordinates associated with that pixel, assuming it lies on the Z = 0 plane. In the affine calibration case, this means finding the intersection of the Z = 0 plane with the ray from the camera along which any 3D point would project to $(c_x, c_y)$. By construction this intersection must resolve to a unique point.

In the PTZ calibration case, the ground plane assumption is built into the calibration model, and so nothing else must be done besides converting $(c_x, c_y)$ to offsets from the optical center and plugging the results into equation (9).

In both cases, once we have obtained the world coordinates of each person, we plot 3D voxels representing each person in a 3D graphing environment modeled after the room, to visualize the positions of the people in 3D.
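For the affine-calibration case, the ray-plane intersection reduces to a 2×2 linear solve; a minimal sketch (with our own function name), using the affine P from equation (5), is:

```python
import numpy as np

def ground_point_from_box(P, box):
    """P: 3x4 affine camera matrix (last row [0, 0, 0, 1]).
    box: (x1, y1, x2, y2) person detection. Uses the bottom-center pixel and
    returns the world (X, Y) on the Z = 0 ground plane."""
    x1, y1, x2, y2 = box
    u, v = (x1 + x2) / 2.0, y2          # bottom-center of the bounding box
    # With Z = 0 and an affine P: u = a11*X + a12*Y + a14, v = a21*X + a22*Y + a24,
    # a 2x2 linear system in (X, Y).
    A = np.array([[P[0, 0], P[0, 1]],
                  [P[1, 0], P[1, 1]]])
    b = np.array([u - P[0, 3], v - P[1, 3]])
    X, Y = np.linalg.solve(A, b)
    return X, Y
```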
5. Experimental Setup and Results

We performed experiments on two data sources for this project. The first was a video clip of a retail security system demo on YouTube [2]. This clip was useful for us to understand real-world surveillance video conditions. For example, before looking for real surveillance video online we had not considered the fact that we might encounter barrel distortion; upon encountering that problem we realized a practical implementation would need to correct for it (which we now do). It was also helpful as a qualitative assessment tool of our system's performance. Unfortunately, YouTube videos don't have ground truth labels, so we could not evaluate our results quantitatively with this data source.

The lack of truth labels led us to search for other data sources. We found that the 3DPeS Video Surveillance dataset was quite useful in this regard. This dataset contains outside surveillance footage, where there are often fewer occlusions and people are farther away than in the in-store environment, so it is not exactly representative of a retail system. Nonetheless the dataset offered numerous advantages, including ground-truth labels and pre-computed calibration parameters. It gave us the chance to address the same fundamental task, predicting where people are in 3D given a single 2D camera, in a cleaner environment.

5.1. Environment

We performed all tests on a late-2013 MacBook Pro with 16GB RAM and a 2.3GHz processor. Due to lack of hardware, we ran all code on the CPU, even though the ConvNet runs much faster on the GPU. This resulted in an average execution time of 3.19s per frame, a roughly 15x slowdown in prediction speed compared to the results reported by [10] on better hardware. The time spent outside of our ConvNet's forward pass was negligible. From these numbers it is clear that a real-world deployment of our work should have a dedicated GPU.

5.2. Results and Error Analysis: Retail Clip Experiments

The goal of our experiments on the retail clip data was to verify qualitatively that our approach was sound, and to produce for each frame a 3D visualization of the scene geometry with people accurately tracked throughout. In our affine calibration step, we hand-labeled 15 point correspondences, shown in figure 5. The root-mean-square error (RMSE) of the calibration matrix we found, on the data used to create it, was 32.7084 pixels, less than the width of one tile almost everywhere in the frame. Ad-hoc measurements of the final 3D voxel outputs showed they were generally within two thirds of a tile of the true position of each person when the bounding box was correct, or roughly 20cm. We did not analyze this rigorously, as we performed most of our quantitative analysis on the second dataset.

Figure 4. Example bounding box predictions for the retail video data. These were generated without doing image distortion correction, although in our final implementation we were sure to correct distortion first if it was present.

Figure 5. The point correspondences we labeled for the retail camera video clip, vantage point 2. It is important that some of the points are off the ground plane, or the calibration would be degenerate.

A common failure mode of our solution on the retail clip data was occlusions. Occlusions cause problems in two ways. The first is that they sometimes prevent our ConvNet from finding a person in the frame. Even if the ConvNet does find a bounding box, however, occlusions can still cause problems if the feet of the person are not visible in the image. This is because our pipeline assumes that the bottom of the bounding box is where a person's feet are, and thus where the ground plane is. When that assumption is violated (e.g. because the bounding box ends at the person's waist), the output is noticeably inaccurate.

Another typical failure occurred when people were in rapid motion. In the video clip, there is a point at which the two women sprint out of the store. For most of these frames, the system loses track of them because no bounding boxes are predicted. We hypothesize two reasons for this failure. One is that the rapid and blurry stills of a human sprinting do not look very much like a typical person, and these types of images are likely underrepresented in the dataset on which our ConvNet was trained. The second is that, due to the underlying architecture of the ConvNet, it has a receptive field size of 228 pixels. This is suitable for most purposes, but when the people in this video clip are sprinting with arms extended on both sides, their width in the image easily exceeds 400 pixels. This makes it nearly impossible for the ConvNet to detect the entire bounding box.

5.3. Results and Error Analysis: 3DPeS Dataset Experiments

We also evaluate our pipeline on the 3D People Surveillance Dataset provided by [11]. In general our people detection ConvNet works much more reliably on this dataset because of the reduced occlusions, better lighting and higher definition of the images. In our run of the pipeline on a live stream of 17 frames from the same camera, we detect 51 of 53 total person bounding boxes when the person is more than halfway in the scene (i.e. not mostly cut off by an edge of the image). Across all frames we tested, the root-mean-square error of our position predictions in world coordinates was 554 millimeters. This is about 2x higher
than our ad-hoc estimate of our performance on the retail dataset, due mostly to the vastly greater field of view of the camera we used in this dataset. (This is an outdoor camera which overlooks more than 200 square meters of space, much more than can be seen by the indoor camera, so being off by the same number of pixels will translate to a much larger increase in RMSE.)

Figure 6. Bounding boxes found for a sample frame and the corresponding 3D scene model that we generated. In the 3D model, the origin is marked by the blue plus sign below the left voxel. It corresponds to the tile in the frame found right below the "i" in "DixonSecurity.com". We can see here that the model is rather accurate considering the low number of calibration points, the original fisheye distortion, and partial occlusions in the scene.

Figure 7. Our prediction errors in millimeters in the world frame for each person in each image we evaluated, shown collectively. Each point is the difference between the predicted x, y of a person in world coordinates and the true x, y of the person.

Figure 8. The same graph as before but with the outlier (Y offset > 3000) removed. Note the different scales of the X and Y axes.

We can glean several interesting insights from the prediction error graph. For example, we see that in general there is more error along the X axis than the Y axis, but most of the Y axis error that does occur is in the same direction: consistently slightly positive. This is because we use the bottom of the bounding box as the intersection point of the person with the ground, when in reality the ground truth label for the person's position in 3D considers the center of the person overall. (Imagine a circle on the ground around the person's feet. The centerpoint of this circle is the ground truth x, y label. It will consistently be slightly offset from a point at the edge of one foot, which is what we get with the bounding box method.)

The X axis error is also a result of our bounding-box-to-intersection-point methodology. As people walk they swing their arms and stride their legs. The bounding box produced by our ConvNet will generally capture all of these extremities, so any time they are not displaced from the person's
center symmetrically, the bottom center of the bounding box will not be an accurate representation of where the feet intersect the ground. Finally, the bounding boxes are in general imperfect, and random noise is surely a factor as well.

6. Conclusions and Future Work

We've successfully developed and outlined an end-to-end approach to turning raw security footage into a 3D model of customer movement throughout a retail store. This involves a one-time calibration of distortion and camera parameters, and then the use of Faster R-CNN to find people in the frame. The aforementioned parameters are used to relate their location in the frame to their real location inside the store. While scaling difficulties arise in the once-per-deployment cost of manually determining the camera's distortion, intrinsic, and extrinsic parameters, this method seems to be accurate enough to effectively provide data to retail environments.

With typical error in the range of 20cm or so in real space in indoor settings, this could very plausibly be used to track the location of shoppers in a retail space; information such as aisle choice, for instance, is easily determined at this level of granularity. Back-projecting the person's location into 3D space is very important for these businesses, and it's exciting that a simple and practical assumption about position (that feet are on the ground) is so effective.

There is, however, a major obstacle to usage of this approach in practice: occlusion of the feet. It is not uncommon for shelves or other objects to block the feet of subjects, making it impossible for our existing algorithm to guess their position. We had to carefully select datasets because of this limitation, but real retail stores will not be able to do this, instead having to work with whatever their camera sees. One promising way to handle this would be to use temporal consistency (i.e. relate similar bounding boxes across time frames) to estimate foot position now based on foot position in previous frames. This could be done with Extended Kalman Filters, which allow us to integrate a physical model of the world alongside noisy measurement data (the person detector) to produce an output that is overall more robust. We could also use a Faster R-CNN architecture ConvNet trained specifically to look for feet, and when feet are not detected in a bounding box (due to occlusions) we could instead use the position of the face and extrapolate downward based on assumptions about human proportions. The problem of consistent offsets in one direction, caused by using the edge-of-feet point from the bounding box vs. the between-the-feet ground truth point, can be resolved with the simple addition of a mean error vector $(x_\Delta, y_\Delta)$ that can be learned from training data. Overall, we think this is a compelling first step towards real-time 3D person detection in retail, and that the remaining obstacles are surmountable.

References

[1] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. Computer Vision – ECCV, 2006.
[2] T. Dixon. Fight caught on CCTV security camera. https://www.youtube.com/watch?v=Kla8W8IIAtk.
[3] I. Everts, G. Jones, and N. Sebe. Cooperative object tracking with multiple PTZ cameras. Image Analysis and Processing, 2007.
[4] R. Girshick. Fast R-CNN. IEEE International Conference on Computer Vision (ICCV), 2015.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, 2013.
[6] I. Haritaoglu, D. Harwood, and L. S. Davis. W4S: A real-time system for detecting and tracking people in 2 1/2D. Computer Vision – ECCV, 1998.
[7] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[8] S. B. Kang. Semiautomatic methods for recovering radial distortion parameters from a single image. Technical Report CRL, 1997.
[9] F. Manzella and I. T. Teije. The truth about in-store analytics: Examining Wi-Fi, Bluetooth, and video in retail. 2014.
[10] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv, 2015.
[11] Video Surveillance Online Repository. 3DPeS: A new dataset for people tracking and reidentification. http://imagelab.ing.unimore.it/visor/3dpes.asp.
[12] Z. Zivkovic and B. Krose. Part based people detection using 2D range data and images. IEEE/RSJ International Conference on Intelligent Robots and Systems, 2007.
AUGMENTED REALITY IN LIVE VIDEO STREAMS USING POST-ITS

FINAL REPORT
JUNE 6TH, 2016

BOJIONG NI, JORIS VAN MENS


bojiong@stanford.edu, jorisvm@stanford.edu

Abstract

In this paper, we set the goal of using generic Post-it notes as fiducial markers for real-time augmented reality applications. We compare detection using SIFT, Template Matching and a custom-designed "Color-Shape" method, and find that the latter approach delivers high quality results, allowing for robust and accurate projective pose estimation at low processing time.

1. Introduction

Augmented reality is an exciting experience where the virtual world meets the real. In a typical setup, virtual imagery is projected into live, real world video streams. Popular implementations of augmented reality often use fiducial markers that are specifically designed to be easily detected in video streams. In our paper, we aim to create a robust augmented reality experience that works with a colored Post-it note as fiducial marker, and which can run sufficiently fast on a single thread on a general processor.

With this approach, we hope to bring the augmented reality experience to larger audiences. A specific example would be the classroom, where teachers presenting to their students on a generic laptop could enrich their teaching with virtual experiences using only a Post-it note and our software application.

2. Review of previous work

On the specific problem of locating Post-it notes in real time, we were not able to find any literature (nor on Post-it note detection in general). There has however been extensive research focused on designing and finding fiducial markers that are easy to detect in video frames, using a variety of methods[9, 16].

Template matching is one such commonly used method for detecting markers in augmented reality. In Template Matching approaches, the marker (or template) tends to be carefully designed to enhance recognition and camera estimation[12].

Another frequently used method for object detection and tracking is SIFT. Previous work shows that it delivers strong results for a wide range of detection goals, yet it is a computationally expensive approach[2].

Literature on color detection has often focused on the problem of skin color detection[13]. Detection in Red Green Blue (RGB) space is common, as is detection in Hue Saturation Value (HSV) space. It is noted that a potential downside of the latter space is that it can cause hue discontinuities. In addition, the value (brightness) dimension does not relate well to human perception of brightness. An alternative Tint Saturation Lightness space is similar in nature. A YCrCb space, where Y relates to brightness and Cr and Cb relate to the color hue, is also found to exhibit favorable color detection properties for various purposes[3].

A typical, fast approach to filtering color is to define an explicit cuboid bounding box in the chosen three-dimensional color space[5]. This method can be augmented by normalizing the image for brightness[6]. The boundaries can be found empirically by taking samples of the object in different scenes. Elliptical boundaries have also been tested[7], as have various probability-based models[17].

For edge detection, Canny edge detection is often regarded as one of the most accurate methods, with decent performance, although various other edge detection algorithms, such as Laplacian of Gaussian, may be less computationally expensive[1, 10].

For line detection, typical methods are the Radon and Hough transforms. The two methods are similar in nature, where the Hough transform can be considered a discretized version of the more general Radon transform[19]. Various speed-optimized Hough transforms have also been proposed[18, 14, 4].

2.1. Contribution of our work. We provide a comparison of several methods to solve the problem
of robust, accurate and fast Post-it detection for augmented reality, which had not yet been done in previous literature.

In addition, we show that it is indeed possible to use generic Post-it notes as fiducial markers for augmented reality, robustly obtaining full pose estimation at high accuracy and high speed. We do so using a combinatory approach of various low-level algorithms (color filtering, noise reduction, edge detection, line detection and several logical elements) specifically tailored to the use case, providing results superior to general methods such as SIFT and Template Matching.

3. Technical approach

Using Post-its for augmented reality imposes two important constraints. First, the Post-it is an object with very few distinguishing features, implying some general feature detection methods might not work well. Second, the speed requirement implies a further restriction on the methods available to use, and creates a focus on minimizing execution time. We aim to consistently render 30 frames per second, which implies the full algorithm must take less than 33 milliseconds to execute on our hardware. We will test a manual "Color-Shape" approach, a SIFT approach and a Template Matching approach and compare their applicability to solving our problem.¹

¹Code can be found at github.com/Bojiong/cs231a

3.1. Experimental setup. The hardware we use for recording and processing is a MacBook Pro 2013. The sticky notes we use are the original Post-it brand, in various color variants. For testing accuracy and speed we wave a Post-it in circular motion at 2 feet distance from the webcam and capture 100 pose estimations. We test our methods in rooms well lit by either daylight or artificial light.

3.2. Color-Shape Approach. In the color-shape approach, we make use of a number of properties of the colored Post-it under projective transformation. These assumptions are as follows:
(1) It has 4 exact edges and vertices
(2) In the absence of radial distortion, the edges are straight
(3) Due to its square shape and small size compared to the camera distance for expected scenes, the opposing edges will be of similar size (near-affine transformation)
(4) It is a solid, non-porous object
(5) For bright colored variants, the saturation level is high compared to most surrounding scenes
(6) For various color variants (e.g. pink), the hue is uncommon in most scenes

The Color-Shape method aims to make use of all of these properties to achieve an optimal solution.

While creating this method and choosing parameters, we aim to optimize several factors. First, we aim for a high detection rate, which we define as the percentage of frames that return valid vertices (as opposed to frames that return no vertices). Second, we aim for high accuracy, defined as the percentage of detected vertices that are accurate. Third, we aim for speed, as measured in milliseconds of execution, while also keeping the standard deviation in mind. A high standard deviation can result in video stutter even when mean speed is low, as a single frame with high processing time will halt the video until processing is completed.

We use a combination of color masking, binary noise reduction, edge detection, line detection and various logic steps to estimate the Post-it's location and calculate the transformation matrix. We rely on Python Opencv3.0.0 implementations of the mentioned algorithms, and NumPy for other image-wide calculations, given both run optimized machine code to provide optimal performance. Figure 1 shows the main elements of the Color-Shape pipeline.

Figure 1. Color-Shape pipeline

3.2.1. HSV filter. We create a mask on the image by filtering for specific hue, saturation and value ranges as such:

$$\text{HSV mask} = \begin{cases} 1 & \text{if } h_{min} < H < h_{max},\ s_{min} < S < s_{max},\ v_{min} < V < v_{max} \\ 0 & \text{otherwise} \end{cases}$$

To find relevant $h_{min}$, $h_{max}$, $s_{min}$, $s_{max}$, $v_{min}$ and $v_{max}$ values, we sampled Post-it camera captures under various lighting conditions (figure 2), and captured their mean HSV values (figure 3).

Figure 2. Post-it color samples under various lighting conditions

Figure 3. Mean hue-saturation (left) and hue-value (right) distribution of samples. Green markers correspond to pale yellow Post-its, blue to bright yellow and pink to pink.

We take minimum and maximum values for every dimension as below (example for hue H), and apply the mask:

$$h_{min} = \min(H_{samples}) - \text{stdev}(H_{samples})$$
$$h_{max} = \max(H_{samples}) + \text{stdev}(H_{samples})$$

3.2.2. Erode-Dilate. We use erosion and subsequently dilation to remove small patches (noise and false positives) from our mask while keeping larger patches intact. For erosion, we move a pixel kernel K (a 10x10 matrix of ones) over the mask M (which holds all non-zero pixels). We then keep only those pixels p for which all surrounding pixels covered by the kernel at that pixel ($K_p$) are also part of the mask:

$$M \ominus K = \{p \mid K_p \subseteq M\}$$

This returns pixels surrounded by patches of ones, while removing any smaller patches for which at least one pixel covered by the kernel was zero. For dilation the process is similar, but reversed: for any mask pixel for which at least one of the surrounding pixels within the kernel is one, we return one. This effectively "grows" single pixels into patches of 10x10, undoing the "shrinking" caused by the erosion. The final effect can be seen in figure 4.

Figure 4. Effect of erosion and subsequent dilation (right) on noisy mask (left)
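A minimal OpenCV sketch of the HSV filter and the erode-dilate cleanup, with the HSV bounds and kernel size passed in as parameters, is shown below; the function name and packaging of the two steps are our own.

```python
import cv2
import numpy as np

def postit_mask(frame_bgr, hsv_lo, hsv_hi, kernel_size=10):
    """Binary mask of Post-it-colored pixels, cleaned with erosion then dilation.
    hsv_lo / hsv_hi are (h, s, v) bounds found from the sampled captures."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(hsv_lo), np.array(hsv_hi))   # the cuboid HSV filter
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    mask = cv2.erode(mask, kernel)    # drop small noisy patches
    mask = cv2.dilate(mask, kernel)   # grow the surviving patches back
    return mask
```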

3.2.3. Canny edge detector. We apply the Canny edge detection algorithm to find edges in the mask. The edge detector first applies a Gaussian filter to the mask:

$$M_{ij} = \frac{1}{2\pi\sigma^2}\exp\!\left(-\frac{(i - (k+1))^2 + (j - (k+1))^2}{2\sigma^2}\right)$$

The size of the filter is given by $(2k + 1)$ in both the x and y direction. Subsequently, it finds the gradients throughout the image as such:

$$S = \sqrt{S_x^2 + S_y^2}$$
$$\Theta = \text{atan2}(S_y, S_x)$$

where S represents the size of the gradient and Θ the angle. As a next step, non-maximum suppression is applied to remove multiple signals for the
same line ("thin the edges"). After this, a thresholding mechanism is applied to find the most likely true edges and remove ones more likely caused by noise.

Given the relatively simple mask provided by the previous steps, the edge detector provides good results as expected. Experimentation with the threshold values within reasonable bounds caused no significant difference in the results.

3.2.4. Iterative Hough transform & line erase. On the edge-detected output, we apply a Hough transform to find lines. The Hough transform translates Euclidean x-y coordinates into curves in the polar space, representing lines with different distances from the origin (r) and angles (θ). Cells in Hough space with votes above a certain threshold are accepted as lines and converted back into x-y coordinate space, where points at extreme x-y coordinate values on the specified line are used as endpoint estimates. Figure 5 shows the initial output.

Figure 5. All Hough lines found

One problem with the Hough transform is that it will fit multiple lines on a single Post-it edge. As a robust method for finding the most promising lines corresponding to the unique Post-it edges, while removing duplicate Hough matches for a single edge, we use only the highest-voted line from the Hough transform. We subsequently erase all pixels on the edge image corresponding to this line with a 3-pixel boundary radius. On the new edge image, we re-apply the Hough transform. We iterate through this method 4 times. The result on the edge image after 2 iterations can be seen in figure 6. As an alternative to the iterative approach, we have also tested an approach that aims to filter out multiple line detections per Post-it edge by filtering lines with similar rho and theta values. While this gave decent results (see the experiments section), we found it to be less robust than our iterative erase-line approach.

Figure 6. Initial edge image (left) and edge image after 2 Hough and Erase Line iterations (right)
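A sketch of the iterative Hough-and-erase loop is shown below; it folds in the Canny step for self-containment, uses the parameters chosen in the experiments section, and the helper at the end solves the 2×2 intersection system described in the next subsection. Function names are our own.

```python
import cv2
import numpy as np

def four_postit_lines(mask, n_lines=4, rho=1, theta=np.pi / 45, thresh=15):
    """Iteratively take the strongest Hough line from the edge image, erase its
    pixels (~3-pixel radius), re-run, and return up to n_lines (rho, theta) pairs."""
    edges = cv2.Canny(mask, 50, 150)
    lines = []
    for _ in range(n_lines):
        found = cv2.HoughLines(edges, rho, theta, thresh)
        if found is None:
            break
        r, th = found[0][0]                          # strongest line first
        lines.append((r, th))
        # erase pixels near this line so the next iteration finds a different edge
        a, b = np.cos(th), np.sin(th)
        x0, y0 = a * r, b * r
        p1 = (int(x0 + 2000 * (-b)), int(y0 + 2000 * a))
        p2 = (int(x0 - 2000 * (-b)), int(y0 - 2000 * a))
        cv2.line(edges, p1, p2, 0, thickness=7)
    return lines

def intersection(l1, l2):
    """Intersection of two (rho, theta) lines, as in the 2x2 system of 3.2.5."""
    A = np.array([[np.cos(l1[1]), np.sin(l1[1])],
                  [np.cos(l2[1]), np.sin(l2[1])]])
    b = np.array([l1[0], l2[0]])
    return np.linalg.solve(A, b)   # (x, y); raises if the lines are parallel
```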
3.2.5. Intersection finder. To find intersections, we use the lines' polar coordinates and solve the following equation:

$$\begin{bmatrix} \cos(\theta_1) & \sin(\theta_1) \\ \cos(\theta_2) & \sin(\theta_2) \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} \rho_1 \\ \rho_2 \end{bmatrix}$$

3.3. Nearest group filter. Given the detected intersections, we find all intersections that lie within (or just outside of) the image. Of those, we filter for the 4 intersections in the closest group by using only the 4 intersections with minimum distance to the full group's geometric center (filtering out intersections at the largest distance from the rest).

3.3.1. Validity decision. Given the resulting intersections, we perform several validations to verify whether or not the intersections correspond to a valid Post-it transformation. Specifically:
(1) There must be exactly 4 vertices
(2) There cannot be 3 vertices on a single line
(3) Opposing edges must be of similar length (near-affine transformation)

For the third rule, we apply a minimum-to-maximum line length range as such:

$$\mu = (l_1 + l_2)/2$$
$$l_{min} = \mu(1 - \delta)$$
$$l_{max} = \mu(1 + \delta)$$

3.3.2. Find perspective. Subsequently, we find the projection matrix M that allows us to project pixels of the overlay image into the position of the Post-it: $P_{trans} = M \cdot P_{overlay}$. The projective matrix M has 8 degrees of freedom, and every matching pair of points gives us 2 equations, so we can find the matrix using 4 matching points. For the $P_{overlay}$ coordinates (x & y), we use the 4 vertices of the square image we want to overlay. For the $P_{trans}$ coordinates (x & y), we use the 4 vertices of the Post-it in the video frame. We can now solve the linear system (using Direct Linear Transformation):

$$\begin{bmatrix} t_1 x_1' & t_2 x_2' & t_3 x_3' & t_4 x_4' \\ t_1 y_1' & t_2 y_2' & t_3 y_3' & t_4 y_4' \\ t_1 & t_2 & t_3 & t_4 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \begin{bmatrix} x_1 & x_2 & x_3 & x_4 \\ y_1 & y_2 & y_3 & y_4 \\ 1 & 1 & 1 & 1 \end{bmatrix} \quad (1)$$

where M is only defined up to scale, allowing us to normalize to $h_{33} = 1$.
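In practice, with exactly four corner correspondences this 8-DoF system can be solved directly with OpenCV; a minimal sketch (our own wrapper around the same linear system) is:

```python
import cv2
import numpy as np

def overlay_homography(overlay_size, postit_corners):
    """Solve for the 3x3 projective matrix M that maps the overlay image corners
    onto the four detected Post-it corners (system (1), normalized so h33 = 1)."""
    w, h = overlay_size
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])   # overlay image corners
    dst = np.float32(postit_corners)                     # detected Post-it corners
    return cv2.getPerspectiveTransform(src, dst)

# Usage sketch: warped = cv2.warpPerspective(overlay_img, M, (frame_w, frame_h))
```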
Here we will find the closest two matching points for
3.3.3. Output. Typical frames are shown in figure 7. each feature in reference. Each of the two matching
The top frame shows a noisy environment (with a points will have a distance value indicating how close
similar-colored object in background) and the bot- it is to the corresponding feature in the reference im-
tom frame shows occlusion of one corner. The left age. To eliminate false positives, we will only accept
images show all Hough lines (thin) with the 4 chosen a match if the closest distance is 70% or less than that
Hough lines plotted thick. The quadrilateral corners of the second closest point. The rationale behind this
are identified by white circles, and a projected ani- is that for a true match, the second closest matching
mation (globe) is shown on the original input on the point’s distance will be much larger than the closest
right.2 (true) one’s. When there is a false positive, various
closely matching points may have similar distances.
We used the Python Opencv3.0.0 2D feature li-
brary to extract features from images and find the
closest neighbors. We have experimented with both
colored images and gray scale images for detection.
For the reference picture, we tested with plain Post-
its, Post-its with patterns drawn on them, and a rec-
tangle directly drawn from an array. We will compare
the results in the experiments section.

3.4.2. Find transformation matrix. After the key


points are matched, we estimated the transformation
of the Post-it in video from the original feature lo-
cation and the matched feature location. Since we
know that SIFT is affine invariant, we can assume
the matched points are the source and destination of
an affine transformation. This transformation matrix
Figure 7. Debug images with ap- will later be used for transforming the animation to
plied mask, Hough lines and de- be overlay on the Post-it. Let (x, y) be the key point
tected vertices (left) and origi- in the original template, (x0 , y 0 ) be the matching key
nal captures with projected overlay point in the video frame. We have the following rela-
(right) tion:  0   
tx h11 h12 h13 x
ty 0  = h21 h22 h23  y 
3.4. SIFT approach. Scale-invariant feature trans-
form (or SIFT)[8] uses techniques of Difference of t 0 0 1 1
Gaussians, scale-space pyramid and orientation as- The transformation matrix has 6 unknowns (de-
signments to ensure the features are scale and ro- fined up to scale) and each matched key points will
tation invariant. It also resamples the local image give one two equation, we need at least 3 matching
orientation planes in order to achieve full affine in- points pairs to solve the problem. In case the system
variance. is over ranked (more than 3 pairs of independent key
2A video example can be found at
youtu.be/f16gHGPc3wE
5 of 10
Augmented Reality in Live Video Streams using Post-its

points are found) or rank deficient, we can use least squares fitting to find the estimate.
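A compact sketch of this matching-and-estimation step with the OpenCV Python API is shown below (SIFT lived in cv2.xfeatures2d in OpenCV 3.0; the wrapper assumes a newer build where it is in the main module, and the function name is ours):

```python
import cv2
import numpy as np

def match_postit_sift(reference_gray, frame_gray, ratio=0.7):
    """SIFT keypoints + KNN matching with the 70% distance-ratio test, then a
    least-squares affine estimate from the matched points (needs >= 3 matches)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(reference_gray, None)
    kp2, des2 = sift.detectAndCompute(frame_gray, None)
    matcher = cv2.BFMatcher()
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < ratio * n.distance]          # keep unambiguous matches
    if len(good) < 3:
        return None
    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])
    A, _ = cv2.estimateAffine2D(src, dst)                # 2x3 affine, robust LS fit
    return A
```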
3.5. Template Matching approach. In Template Matching, a small image is used as a template to see if a matching version can be detected in the larger image. This approach takes the template as the convolution mask and performs a convolution with the search image, sliding a window of the same size as the template over the search image. We then compare the pixel intensity difference between the search image in the window and the corresponding pixel in the template, and sum the difference over the window. The window with the lowest difference sum gives the best match.

3.5.1. Find the best matching window. Let us give a formal definition[15]. Suppose coordinates $(x_s, y_s)$ in the search image have intensity $I_s(x_s, y_s)$ and coordinates $(x_t, y_t)$ in the template have intensity $I_t(x_t, y_t)$. Define the absolute difference in the pixel intensities as $\text{Diff}(x_s, y_s, x_t, y_t) = |I_s(x_s, y_s) - I_t(x_t, y_t)|$.

Define the Sum of Absolute Differences measure as

$$SAD(x, y) = \sum_{i=0}^{T_{rows}} \sum_{j=0}^{T_{cols}} \text{Diff}(x + i, y + j, i, j)$$

We loop over the entire search image and calculate the corresponding SAD(x, y). The pixel with the lowest SAD is the best match.

Note that Template Matching is not scale invariant, i.e. we do not know how big the object (Post-it) is in the search image (video frame). We therefore resize and rotate the original template to create a series of templates with different sizes and rotations. Then we perform a search with all those templates in the search frame and return the best match.

In order to reduce noise and improve speed, we transform the image from RGB to gray scale and perform Canny edge detection so as to work only on the edges of the image (see figure 8). We use the Opencv Python library for Template Matching. In the Opencv library, the sum of differences can be calculated in different ways, and we will compare results.

Figure 8. Edge image for Template Matching

3.5.2. Find the rotation and scaling matrix. In this case, we can only find the rotation and scaling of the matched window instead of the full perspective transformation. As we iterate through different scales s and rotation angles θ, the corresponding transformation matrix is:

$$P' = sR(\theta)P + p$$

where p is the location of the upper left corner of the matched window box.
i=0 j=0
3.6. Overlay. Once we have the applicable homography from either of the above methods, we apply it to consecutive frames of an animation to overlay the animation onto the video (as described in the Color-Shape section). We remove pixels with zero alpha values to allow for basic transparency (allowing us to overlay e.g. a spherical globe instead of only rectangular pictures).

4. Experiments and Results

4.1. Experiments using Color-Shape.

4.1.1. Hough transform parameters. To find the best rho, theta and threshold values for the Hough transform, we have done various experiments as shown in table 1.

Threshold        15     30     15     15     15     15
Rho              1      1      2      5      1      1
Theta            π/45   π/45   π/45   π/45   π/90   π/22.5
Speed (µ, ms)    4      1.8    3.6    3.6    6.1    3.7
Speed (σ, ms)    3.2    1.2    2.4    2.8    5.3    2.5
Accuracy         High   High   Mid    Mid    High   Mid
Detection rate   High   Low    High   Mid    High   High

Table 1. Hough transform parameter results

Per these results, we have chosen 1, π/45 and 15 as our respective optimal parameters.

4.1.2. Speed of Color-Shape. An analysis of execution time for the various elements of the Color-Shape method can be seen in figure 9. Total execution time is well within our target range (below 33 ms) at a 12 ms mean.

We find that the iterative Hough transform takes the most computation time. This triggered us to also design a non-iterative method for finding the four most promising lines using only one Hough transform. This alternative method filters lines with similar theta and rho (as described in the Hough transform section of the technical approach). While the method saved 3 ms of mean execution time, we found it to be less robust than the iterative approach (0.91 accuracy vs. 0.98 baseline). Given that total execution time is already low, we decided to trade execution time for higher robustness.

Figure 9. Color-Shape element-wise execution time

4.2. Experiments using SIFT.

4.2.1. Colored reference with colored video frame. In this approach, we take a picture of the Post-it, detect the features using SIFT and try to match the key points in the video frame.

Figure 10. Colored image with colored reference. (a) False positive; (b) Key points match.

We can see from figure 10 that the colored Post-it picture contains too many features and key points, making the matching process slow and noisy. The Post-it is detected inaccurately.

4.2.2. Gray scale reference and colored video frame. In this experiment, we use a gray scale plain Post-it as reference. Different from the above, the key points in the reference image are greatly reduced and the matching is more accurate. However, the detection is always at the corner of the Post-it and the results still occasionally contain false positives. The result can be seen in Figure 11.

Figure 11. Colored image with gray Post-it. (a) Match at corner; (b) False positive; (c) Key point matching.

4.2.3. Binary rectangle and colored video frame. In this experiment, we use a rectangle drawn from an array as the reference image, and use colored video frames for detection.

Figure 12. Colored image with rectangle. (a) Match at center; (b) Match at corner; (c) False positive; (d) Key point matching.

The value within the rectangle is 255 and 0 outside. This time, the features and key points from the reference image are far fewer. From the results in Figure

12, we can see that it can successfully find matches between the reference rectangle and the video frame. However, we still see some false positives.

4.2.4. Post-it with pattern. In this approach, instead of using a plain Post-it, we use a Post-it with a pattern drawn on it. For simplicity, we use gray scale for both reference and video frames. Figure 13 shows a result. Here the matching key points are within the Post-it instead of at the corners as in the previous cases. The lack of features for plain Post-its gives few interesting matching key points within the center of the Post-it. A patterned Post-it, however, has richer content and gradient variation within the area of the Post-it, making it easier to match local features inside.

Figure 13. Post-it with pattern. (a) False positive; (b) Key point match.

4.2.5. KNN distance ratio effect. We experiment with two different schemes for KNN matching. The first is to find the nearest neighbor for key points and directly take it as a match. The second is the method described above where we only accept a match if the closest distance is within 70% of the second closest distance. We can see from figure 14 that when the reference is simple, e.g. a binary rectangle, the approach without the distance ratio gives more matches within the Post-it. We can see from the image that there are two key points within the Post-it (and the binary rectangle). Those two key points do not have much difference in terms of color intensity or texture. The gradients around those two key points are very similar. Thus they might have similar distance to the same key point within the reference. This is an example where the distance ratio can eliminate false negatives.

However, when the reference image has richer features, i.e. with the patterned Post-it, the first approach generates many more false positives than the second approach.

Figure 14. Distance Ratio Effect. (a), (b) No distance ratio; (c), (d) With distance ratio.

4.2.6. Speed of SIFT. Our experiments show that our SIFT method takes on average 59 ms per frame to detect the location of the Post-It, which is much higher than our target maximum of 33 ms. This result is expected as SIFT is a computationally heavy feature detector.

4.2.7. Accuracy of SIFT. From the above analysis, we found that SIFT works best for patterned Post-its. Both the plain Post-it and the digital rectangle template have too few interesting feature key points, making the perspective transformation between key points rank deficient. For the SIFT approach, a patterned Post-It gives the best results for our use case. We manually reviewed 100 video frames to measure how accurate the algorithm is in finding matches. The result shows that 94 out of the 100 frames had the Post-it correctly detected. For the other 6, it either identified no match or identified a point outside of the Post-it area (false positive).

4.3. Experiments using Template Matching. For the Template Matching method, we rotate and resize the template and do a matching process for each resized and rotated template. This causes latency to increase as we try finer granularity of size and angle rotation. Table 2 shows how processing time varies depending on the parameters. For processing times greater than 70 ms (i.e. surpassing our maximum goal of 33 ms), we experience significant delay in video rendering.
Augmented Reality in Live Video Streams using Post-its

#rotation #sizes processing time (ms)


1 1 6 We reviewed 100 video frames and found 92 to have
2 2 18 the correct bounding box for the best matching re-
2 3 23 gion. For the rest, the boxes also includes a part
3 5 70 (< 50%) of the Post-it, but the majority of the Post-
5 10 195 it fell outside of the box.
Table 2. Template Matching Speeds
4.3.2. Post-It with Pattern. We reviewed 100 video
frames and found 94 to have the correct bounding
box for the best matching region. Some of the false
Similar to the SIFT case, we tested with both plain matches still have part (< 50%) of the Post-it region
Post-Its and Post-Its with patterns on it. In order in the box, while some completely missed the Post-it.
to maintain low latency, we restricted the number of Figure 16 shows one frame where we use Template
sizes to 2 and number of rotations to 3. It turns out Matching with a patterned Post-it.
Template Matching works well on both the patterned
and plain Post-its with such rotation and scale iter-
ation. Here, if the box includes > 50% of the region
of the Post-it, we categorize it as a correct match.
In both figure 15 and figure 16, the three images in
the first row (from left to right) are the original tem-
plate, the template after Canny edge detection and
the actual resized/rotated template giving the best
match.
The three images in the second row (from left to
right) are the video frame with best matched region
(the rectangle enclosed region), the Canny edge de-
tection of the video frame and the visualization of the
convolution between template and the best matched Figure 16. Post-it with pattern
region.
One drawback of the Template Matching approach
is that we cannot estimate the full perspective trans-
4.4. Comparison of methods. Comparing the
formation. We can only estimate similarity: we get
most important metrics for the 3 approaches, we find
scaling and rotation by iterating through different
the manual Color-Shape approach performs much
scales and rotations of the template, and translation
better than both SIFT and Template Matching. Ta-
from the box center location of the best matched re-
ble 3 shows the high-level comparative overview.
gion.
Template
4.3.1. Plain Post-It. Figure 15 shows one frame SIFT Color-Shape
Matching
where we use Template Matching for plain Post-it Time (ms) 59 23 12
detection. 0.94
(pattern)
Accuracy 0.94 0.98
0.92
(plain)
2 scale
Notes Pattern Bright color
3 rotation
Table 3. Comparison of Methods

In addition to processing time and detection accu-


racy, the Color-Shape method provides a more accu-
rate perspective transformation matrix that we use
to transform the animation to overlay on the Post-
Figure 15. Plain Post-it it. For SIFT, the affine transformation can be rank
9 of 10
Augmented Reality in Live Video Streams using Post-its

deficient when insufficient key points are matched. [2] Daniel Wagner, Gerhard Reitmayr, Alessandro Mulloni,
For Template Matching, we can estimate a similar- Tom Drummond, Dieter Schmalstieg Real-time detection
ity matrix but cannot estimate the full perspective and tracking for augmented reality on mobile phones IEEE
Transactions on Visualization and Computer Graphics
transformation. (Volume:16 , Issue: 3), 2009
SIFT and Template Matching are designed for gen- [3] Charles A. Poynton Frequently Asked
eral, rich pattern detection, while the Color-Shape Questions about Colour Available at
method we use is designed specifically for our Post-it https://engineering.purdue.edu/ bouman/info/Color-
use case. SIFT and Template Matching can be used FAQ.pdf
[4] Fernandes, Leandro AF, and Manuel M. Oliveira. Real-
for a much broader range of use cases, whereas the time line detection through an improved Hough transform
Color-Shape method is limited in scope to Post-it de- voting scheme Pattern Recognition 41.1, 2008
tection (and highly similar use cases). [5] Fleck, Margaret M., David A. Forsyth, and Chris Bregler.
Finding naked people Computer VisionECCV’96. Springer
5. Conclusions Berlin Heidelberg, 1996
[6] Fleyeh, Hasan. Color detection and segmentation for road
We find that typical methods such as SIFT and and traffic signs Cybernetics and Intelligent Systems, 2004
Template Matching do not sufficiently meet our goals, IEEE Conference on. Vol. 2. IEEE, 2004
as they do not provide accurate pose finding for the [7] Lee, Jae Y., and Suk I. Yoo An elliptical boundary model
for skin color detection Proc. of the 2002 International Con-
plain-faced Post-it, and require too much computa-
ference on Imaging Science, Systems, and Technology, 2002
tion time to operate. [8] Lowe, David G. Object recognition from local scale-
We find the manually designed Color-Shape ap- invariant features Proceedings of the International
proach, exploiting the saturated color and square Conference on Computer Vision. pp. 11501157.
shape properties of the Post-it, to work well and meet doi:10.1109/ICCV.1999.790410, 1999
[9] M. Fiala Designing Highly Reliable Fiducial Markers IEEE
our goals. The full Color-Shape method takes an av-
Transactions on Pattern Analysis and Machine Intelligence,
erage of 12 milliseconds to execute (with 4.5 millisec- vol. 32, no. 7
onds standard deviation) on a 3-year old MacBook [10] Maini, R. et al. Study and Comparison of Various Image
Pro laptop, indicating the maximum 33 milliseconds Edge Detection Techniques International Journal of Image
goal should be achievable on a wide range of laptops. Processing (IJIP), Volume (3) : Issue (1), 2009
[11] Matas, J. et al. Robust Detection of Lines Using the Pro-
We achieve high accuracy projective pose estimation
gressive Probabilistic Hough Transform CVIU 78 1, pp 119-
with a reasonable robustness to noise and lighting 137, 2000
variation. [12] Nipat Thiengtham and Yingyos Sriboonruang Improve
Template Matching Method in Mobile Augmented Reality
6. Future Work for Thai Alphabet Learning International Journal of Smart
Home Vol. 6, No. 3, July, 2012
A number of improvements could be considered for [13] P. Kakumanu, S. Makrogiannis, and N. Bourbakis A sur-
future work: vey of skin-color modeling and detection methods Pattern
• Interest region: the Color-Shape approach Recogn. 40, 3, March 2007
[14] Palmer, Phil L., Josef Kittler, and Maria Petrou An opti-
could be made significantly faster by search-
mizing line finder using a Hough transform algorithm Com-
ing only in an interest region derived from puter Vision and Image Understanding 67.1, 1997
the previous location of the Post-it [15] Roberto, B. Template Matching techniques in computer
• Multiple Post-its: the Color-Shape approach vision: theory and practice 2009
could be generalized to find multiple Post-its [16] S. Garrido-Jurado, R. Muoz-Salinas, F.J. Madrid-Cuevas,
M.J. Marn-Jimnez Automatic generation and detection of
in one image
highly reliable fiducial markers under occlusion Pattern
• Dynamic color filter: the HSV color filter Recognition, Volume 47, Issue 6, June 2014
could be made more accurate by dynamically [17] Schumeyer, Richard P., and Kenneth E. Barner Color-
adapting parameters to the scene based classifier for region identification in video Photonics
• Wearables support: the source code could be West’98 Electronic Imaging. International Society for Op-
tics and Photonics, 1998
ported to relevant augmented reality hard-
[18] Singh, Chandan, and Nitin Bhatia A Fast Decision Tech-
ware such as smartglasses nique for Hierarchical Hough Transform for Line Detection
arXiv preprint arXiv:1007.0547, 2010
References [19] Van Ginkel, Michael, CL Luengo Hendriks, and Lucas J.
[1] Canny, J. A Computational Approach to Edge Detection van Vliet. A short introduction to the Radon and Hough
IEEE Trans. on Pattern Analysis and Machine Intelligence, transforms and how they relate to each other Delft Univer-
8(6), pp. 679-698, 1986. sity of Technology, 2004

10 of 10
Augmenting Videos with 3D Objects

Andrei Bajenov Darshan Kapashi Sagar Chordia


abajenov@stanford.edu darshank@stanford.edu sagarc14@stanford.edu

Abstract for camera tracking and 3D reconstruction. There is also


quite a bit of research in 3D reconstruction from a set of
We propose a way to automatically augment a video of a 2D images.
static scene with a 3D object. We use SFM algorithms to es-
timate the position of the camera. In order to properly sup- We have not been able to find a product that specifically
port occlusions, we use a novel approach to generate dense takes a video and augments it while supporting occlusions
depth maps for each frame of the video. We use a combi- (although a number of technologies like Wikitude support
nation of semi-global block matching, image segmentation projecting objects based on camera positions). This was our
using the watershed algorithm, and planar interpolation to main motivation for building this system.
remove noise and sharpen edges. The final result is a video
augmented with a 3D object that is properly occluded by 2. Problem Statement
objects in front of it.
We propose a system that puts a 3D object in a video of
a static scene while supporting occlusions. We break up the
problem into two components:
1. Introduction
The idea of augmenting videos has been around for a 1. Estimate camera matrices for each frame of the video,
while, and we see it everywhere today. One prominent so that we can project an object back into the scene.
example is CGI in movies, which augments reality with
2. Obtain sharp and accurate depth maps for each frame,
computer generated objects and makes the viewer believe
to deal with occlusions.
that these objects are part of the environment.
Below is an example of a scene that we used for testing
The process to do this is generally difficult and requires our system.
a lot of specialized software and equipment. In this paper,
we describe a system that, given an object mesh and a
video, allows anyone to place this object seamlessly into
the video without any other external inputs.

There are a number of interesting applications to this.


For example, it could be used for seeing how a piece of
furniture would look in a room or how a new house would
look in a particular location. If integrated with a smart-
phone’s camera, it could also be used when interacting with
an environment. For example it could provide navigational
pointers, highlight parts of an environment, or even project Figure 1: An example of a static scene
another person into an environment.

There are a growing number of technologies that are 3. Previous Work


being built to support this. Project Tango [1], by Google, is
3.1. Estimating Camera Positions
building a phone that has a built-in depth sensor to make
3D reconstruction and camera tracking easier. Wikitude [2] There has been a lot of research done in the area of Struc-
is an example of a piece of software that is designed for ture from Motion (SFM), and there are a number of existing
the purpose of augmenting smartphone videos. It allows libraries that implement SFM algorithms, including:

1
• Theia SFM [3] Specifically, the Middlebury website [13] contains a lot
of submissions and evaluations of many stereo-matching
• Bundler [11] algorithms. We ended up pursuing this approach the most
• Visual SFM [10] because the research here showed the most promising
results.
• OpenCV SFM [9]
To solve our problem, however, we do not need a full 3D
We were looking for a few things from the libraries: reconstruction of the scene. We just need an approximate
• Camera position estimation depth map that has good accuracy around the object bound-
aries. What we found while using just stereo-matching al-
• Camera parameter estimation gorithms was that they were prone to noise. To overcome
this problem and achieve the desired results, we propose a
• Reliable and accurate sparse 3D reconstruction
novel approach that uses a combination of image segmenta-
For this project, it was not our goal to try to improve or tion techniques, stereo-matching, and planar interpolation.
optimize any of these libraries. We tried a few of them and
picked the one that was easiest to use. In our case it was the 4. Technical Details
Theia SFM library.
Below we describe our solution. We talk about how we
3.2. Estimating Depth Map calibrate our camera and run SFM. We then describe our
method for getting accurate depth maps. Lastly, we describe
A critical part to solving our problem was obtaining ac-
how we project 3D objects back into the scene.
curate and dense depth maps for each frame of the video.
There were a number of techniques that we considered: 4.1. Sparse 3D Reconstruction and Camera Matrix
Estimation
• Reconstructing 3D objects using volumetric stereo and
using these reconstructions to obtain depth maps [12] As per standard practice in the camera model used in
computer vision, there are 2 parameters:
• Using a combination of the original images and the
sparse 3D points obtained from SFM to approximate • Intrinsic matrix K: A 3x3 matrix which incorporates
the 3D surface positions (using segmentation and pla- the focal length and camera center coordinates.
nar reconstruction).
• Extrinsic matrix [R T]: A 3x4 matrix which maps
• Using a combination of SFM and stereo matching al- world coordinates to camera coordinates. R denotes
gorithms to obtain a dense 3D reconstruction of the rotation and T denotes translation.
scene. [16]
The camera transformation is given by a matrix,
While researching volumetric stereo, we found that
it was genereally used to get a 3D reconstruction of a M = K[R T ]
single object within a scene. For our purposes, we needed
information about the full scene. To extend this algorithm It transforms a point in homogeneous world coordinates
to work on a full scene, we would have needed very reliable to homogeneous image coordinates.
image segmentation algorithms that were determenistic
between frames. We were not able to find anything that The point correspondence problem is defined as follows:
looked promising in this space, so we abandoned this idea. Given n images, find points in the images which correspond
to the same 3D point. There are several well known
We also considered using the sparse 3D points obtained algorithms which work reasonably well in practice, for
from SFM to approximate a dense 3D reconstruction. We example, SIFT, SURF and DAISY.
though about partitioning the original image into uniform
segments. We would then approximate each segment as a The Structure from motion (SFM) problem is defined
plane in 3D and use the sparse 3D points to estimate these as follows: Given m images and n point correspondences,
planes. Unfortunately what we found was that we did not find m camera matrices (M) and n 3D points. Solving the
have enough points in each image segment to do a planar SFM problem for a set of images will give us a sparse 3D
reconstruction, so we could not use this approach by itself. reconstruction of the scene.

Lastly we found a lot of research about using stereo- To get the intrinsic parameters for our camera we tried a
matching algorithms to aid in dense 3D reconstruction [17]. few different approaches:

2
• Computing K using single view metrology with 3 van- 4.2. Depth Map Estimation
ishing points derived from 3 pairs of mutually orthog-
A sparse reconstruction is not enough to get a full depth
onal lines in 3D.
map for each frame.
• Using a checkerboard image to calibrate using
Below we propose a novel approach of using a combi-
OpenCV routines.
nation of stereo-matching, image segmentation (using the
watershed algorithm), and planar interpolation to get dense
• Allowing Structure From Motion (SFM) algorithms to
3D depth maps for each frame.
self-calibrate (which is possible given enough view-
points of a static scene)
4.2.1 Terminology
Each of these approaches gave us a similar K, so we
decided to go with the self-calibration method since it is
automatic and we have plenty of views.

We tried a couple different SFM libraries. The SFM


library bundled with OpenCV didn’t give good results. The
Theia [3] library was easier to work with and was able to
give us fairly accurate sparse reconstructions of a 3D scene,
figures 2, 3.

Figure 4: Stereo setup

1. Disparity map. Disparity refers to the difference in im-


age location of an object seen by the left and right cam-
eras, resulting from different positions of two cameras
Figure 2: Theia’s camera positions and a sparse 3D recon- as seen in figure 4. A disparity map is a mapping for
struction each pixel in the image to the disparity value for that
pixel. The value represents the distance between the
locations of a point in the left and right rectified stereo
images. It indicates the relative distance to the camera.
A higher value means it is closer to the camera.

2. Depth map. A mapping for each pixel in the image to


the depth value for that pixel. It indicates how far a
point is from the camera. A higher value means it is
further from the camera.

Figure 3: Side view of the 3D reconstruction

At this point, we have the camera intrinsic and extrinsic


matrices as well as a small set of 3D points which can be
used to get a sparse reconstruction of the scene. Figure 5: Image rectification

3
3. Image rectification. A transformation to project two
images onto a single image plane as seen in figure 5.
After rectification, all epipolar lines are parallel in the
horizontal axis. All corresponding points have identi-
cal vertical coordinates.

4.2.2 Semi-global block matching


Figure 7: Weighted Least Squares filtering on an SGBM
After obtaining camera matrices for every frame of the disparity map
image, we use that information to perform stereo matching
between pairs of frames. We picked pairs of frames with a
good baseline distance between them (in our case around
2.0), and performed stereo matching on rectified versions 4.2.3 From disparity maps to 3D points
of these frames.
Disparity maps alone don’t help, since they don’t tell us the
We tried different stereo-matching algorithms and exact depth of objects in each frame. To convert between
picked Semi-global block matching (SGBM) since it was disparity maps and depth maps, we first need to reproject
readily available in OpenCV and had good performance on the disparity values to 3D.
the Middlebury dataset.
p = (x, y) is a point in the disparity map. The matrix
SGBM aims to minimize a global energy function E Q incorporates the transform between the left and right
for the disparity image D, based on the idea of pixel-wise cameras which is obtained during image rectification.
matching of mutual information and approximating a We can get the homogeneous coordinate in 3D using this
global 2D smoothness constraint by combining many 1D equation.
constraints.
[X Y Z W ]T = Q ∗ [x y disparity(x, y) 1]T
It takes as input 2 rectified images, taken from the cam-
era on the left and from the camera on the right. It also takes And finally get a mapping from 2D to 3D.
the camera matrix K. It produces disparity maps for the left
and right images. Figure 6 shows an example disparity map. 3d image(x, y) = (X/W, Y /W, Z/W )

Figure 8 is the dense 3D reconstruction obtained from


the above equation.

Figure 6: SGBM Disparity Map

After obtaining a disparity map, we do a first pass to


remove noise. We found a technique called Weighted Least
Squares filter [9]. The result is in figure 7.
Figure 8: 3D reconstruction from a disparity map

4
4.2.4 From 3D points to a depth map 4.2.5 Missing depth maps
We don’t have a disparity map for every frame. Frames
So far, we are able to obtain 3D points from a disparity where the camera is moving forward, for example, don’t
map. These points are not in world coordinates though, have a good corresponding stereo frame. Rectification
so we need to transform them to world coordinates before between such views introduces too much distortion.
generating depth maps.
As such, we need to be able to reconstruct a depth map
Let x be the point in world coordinates. Let p be the for any frame, using a depth map generated from some other
point in the original image. We rectify the image for stereo frame. Fig 10 is an example of a depth map viewed from a
matching. Rectification is a homographic transform. Let H different camera:
be the inverse of this transform. K, R and T are camera
parameters.

The point x maps to p using the camera transform.

p = KRx + KT

The point pr is the point in the rectified image, which


corresponds to the point p in the original image. xr is the
3D point in rectified coordinates.
Figure 10: Depth map from another camera location - more
pr = Kr Rr xr + Kr Tr occlusions

We get p by applying the rectification transform H on pr , There are far more missing depths, which are mostly
there due to occlusions. To help fill in the rest of this depth
map, we use a novel technique which combines image
p = Hpr segmentation and planar reconstruction, as described in
subsections below.
Using these equations, we can derive the equation for
point x in original world coordinates.
4.2.6 Image segmentation
x = R−1 K −1 HKr Rr xr + R−1 K −1 HKr Tr − R−1 T
We use the marker controlled watershed algorithm for
image segmentation. The watershed algorithm is based
We reproject these points back into the original frame on the concept of flooding the image from its minima
and compute the depth for each pixel. Figure 9 is a depth and preventing the merging of water coming from dif-
map for the frame from which we generated the disparity ferent sources. This partitions the image into 2 parts:
map. the catchment basins and the watershed lines. This
approach results in over-segmentation, so we use a vari-
ant which is based on starting to flood from a set of markers.

To find the set of markers, we apply the following set of


transformations to the image. In the process of thresholding
an image, we set the value for each pixel to either 0 or 1
based on a threshold. We use adaptive thresholding to the
image. It considers local variations in intensity and makes
pixels white and black. This is significantly better than
using a global threshold because the lighting in the scene is
not uniform. Then, the transformed image has several small
holes. We use morphological opening to fill in these holes
Figure 9: Example depth map obtained from a disparity and have a much smaller set of bigger segments. Next,
map we apply a distance transform, followed by a thresholding

5
transform which gives us a candidate set of markers. Using use this set of points to estimate a plane using SVD decom-
this, we can apply the watershed algorithm to segment the position for the linear system Ax = t. This gives us a plane.
images.

This method has several parameters that can be tuned to ax + by + cz + d = 0


get a segmentation of desired quality. This includes

• Window size of adaptive threshold This plane gives us the depth for every point on it,
irrespective of whether we had a depth for it previously
• Kernel for the morphological opening from the stereo matching algorithms. This is how we fill
• Threshold for the distance transform up holes in the depth map. We can now trace a ray which
starts from the camera and hits the approximate plane. We
Figure 11 is an example of a segmented image. can compute the length of this line segment and this is the
depth of this image point.

For a point p = (x, y) in the image plane, we can


compute the point of intersection between the plane and the
ray and compute the depth as

depth = −d/(a ∗ (py − cy )/fy + b ∗ (px − cx )/fx + c)

For certain segments, because of measurement noise,


Figure 11: Image segmented using the watershed algorithm poor segmentation, or non-flat surfaces, it is possible to end
up with a bad estimate of the plane. We found a simple
heuristic to prune these bad planes, ||Ax−t|| > threshold.
Note that when tuning parameters, our goal is to make This helps to reduce noise in this approach for reconstruc-
sure that segments don’t spill between objects. We achieve tion.
this by tuning the parameters to generate small segments.

4.2.7 Planar interpolation With the above, we get an depth map that looks some-
thing like:
We have now divided the image into several segments. We
assume that each segment is part of a plane. We have the
depth map for each 2D image point, which means we have
a 3D point corresponding to each 2D point. The camera
matrix K has the following structure:

 
fx 0 cx
 0 fy cy 
0 0 1
For each 2D point, we can compute the corresponding
3D point. z is the depth of this point from the depth map.

p = (z ∗ (py − cy )/fy , z ∗ (px − cx )/fx , z) Figure 12: Planar interpolation of image segments

For each image segment, we collect all the 2D points


for which we know a depth (z is not equal to 0). We get
corresponding 3D points. For n points in a segment, we
construct a nx3 matrix A where each row is a 3D point For areas where the planes couldn’t be reconstructed, we
p = (x, y, z). The matrix t is a nx1 matrix of −1. We can fill those with the original depths, to get:

6
4.3. Augmenting Video with 3D Objects

The last step of our system is augmenting the video


with a 3D object. At this point we have estimated camera
matrices and depth maps. The location of the object
within the scene is determined manually, i.e. we take 3D
coordinates in the object and translate them such that it
is placed behind one of the boxes in the scene. In a real
application, you can imagine a user interface which lets
you drag and drop the object in the scene. We do not solve
Figure 13: Filling in missing segments with original depths this problem here and focus on the mathematical aspects.
The object (a bird) is a 3D object. It looks like a 2D blob
because we did not add shading to it.
4.2.8 Combining depth maps from multiple views
The 3D object is given as a mesh of triangles. An easy
So far we’ve dealt with depth maps obtained from a single
way to augment the frames of the video with this object is
pair of rectified images. We observed that using a single
to apply the camera transform on each of the vertices and
pair of images doesn’t give a full depth map when viewed
fill in triangles with the object texture. It gets slightly more
from different camera angles. Fortunately, in a video
complex when we want to handle occlusions.
sequence, we have many pairs of such images that can help
improve results and fill in missing depths.
Since we want to be as accurate as possible, we make
To find a depth map for the current frame, we take sure that the triangles which make up the object are small
a number of nearby frames for which we have obtained enough. If they aren’t, we can split each triangle into
high-quality depth maps. We deem a depth map to be high 3 smaller triangles using the centroid and the current
quality if it was obtained from a disparity map with a good vertices. A triangle is visible if all 3 vertices of the triangle
baseline distance which is not too small and not too large. are visible. With small enough triangles, this is a good
We then re-project the depth map from those frames into approximation.
the current view. What we end up with is multiple depths
per pixel. For each vertex v = [x y z], which is a point in 3d world
coordinates, we transform it into camera coordinates point
To pick the desired depth, we sort the depths and pick the p = (x, y, z) using the camera transform.
first value for which the value between it and the next value (x, y, z) = [R T ] v
is no higher than 15 percent. This is a rudimentary way to
pick the smallest z-value cluster. We pick the smallest z-
value cluster to eliminate noise and ignore occluded objects. The depth of the point is the z coordinate in camera co-
ordinates (zp ) and the depth in the frame of the video is
as computed using stereo correspondence and planar recon-
With the above approach we obtain a depth map as seen
struction (zi ). If the 3d object point is at a greater depth,
in Fig 14. Notice that there is less noise than in previous
zi < zp , it is hidden in the image, otherwise zi > zp , it is
depth maps, and more pixels have a depth value.
visible.

Figure 14: Depth map generated by merging depth maps


from multiple viewpoints Figure 15: Bird is completely visible in this view

7
where δd is the disparity error tolerance.

There are various parameters which can be tweaked


while evaluating stereo matching in the Middlebury
framework. We use default values for most parameters
except one parameter. eval bad thresh which controls
thresholding to decide whether a pixel is a bad match or
not was increased from 1.0 to 5.0. We found 1.0f was too
strict and > 90% pixels were marked as bad pixels in most
images. But a value of 5.0 gives good results for most
Figure 16: Bird is half hidden behind the box in this view images.

5. Evaluation Figure 17a shows an image in the Middlebury evaluation


dataset. Figure 17b shows the ground-truth disparity map of
As discussed in section 4.1, we use existing algorithms figure 17a. Disparity map of figure 17a is computed using
and libraries to estimate camera parameters in each image Semi-Global Block Matching (SGBM) algorithm and is
of the video. Our main contribution lies in effectively shown in figure 17c. Figure 17d shows image-segmentation
connecting various computer vision algorithms to augment on the original image figure 17a. Figure 17e shows refined
a video. The novelty in this paper is about refining depth disparity maps obtained by combining figure 17c and plane
maps using image segmentation and approximating a fitting on figure 17d. Computed disparity maps figure 17c
reconstruction using planes. Hence, we will skip evaluation and figure 17e are compared to the ground truth disparity
for camera calibration and focus on our method of refining map figure 17b and the above two metrics are computed.
depth maps using planar reconstruction. We also compare other algorithms from the Middlebury
evaluation framework to our methodology.
For easy evaluation of individual components of stereo
matching algorithms, the computer vision group from We can control the strictness of plane fitting to refine
Middlebury have designed a stand-alone flexible C++ disparity maps by norm thresh. When the error of fitting
implementation. [13] They also provide a collection of a plane to points of a given image segment is greater than
datasets and benchmark it against various state-of-art norm thresh then we don’t refine the disparity map and
algorithms. The evaluation framework is flexible and use the original disparity map. In figure 17e we can see
supports easy additions of new algorithms. We integrate how disparity maps are affected as norm thresh value
our methodology and compare the results with other is increased. Figure 17e is bad compared to figure 17d in
stereo matching algorithms already implemented in the terms of both metrics and hence it is critical to tune this
framework. parameter correctly.

We describe the quality metrics we use for evaluating the Sometimes refined disparity maps may not be better than
performance of various stereo correspondence algorithms the original disparity map because of bad image segmen-
and the techniques we used for acquiring our image data tation. Figure 18b is a better approximation to figure 18a
sets and ground truth estimates. [18] compared to figure 18c. In figure 18c, plane fitting on bad
image segmentation results in a weird disparity map. So
1. RMS (root-mean-squared) error, measured in disparity it’s important to control the quality of image segmentation
units, between the computed disparity map dC (x, y) to ensure refined disparity maps are better.
and the ground truth map dT (x, y), i.e.,
  12 We compute disparity maps for 5 test images in the Mid-
1 X 2 dlebury evaluation dataset using our method as well as pre-
R= (|dC (x, y) − dT (x, y)| )
N defined algorithms in Middlebury. In the following table
(x,y)
we report mean values of RMS error and percentage bad
where N is the total number of pixels. pixels. normal-SGBM refers to our SGBM implementa-
tion of disparity map. planar-SGBM refers to the filtered
2. Percentage of bad matching pixels, disparity map generated by fitting planes using image seg-
mentation. [18] describes the other algorithms used for en-
1 X coding. As seen in the table, planar-SGBM performs better
B= (|dC (x, y) − dT (x, y)| > δd )
N than normal-SGBM in both metrics we defined earlier.
(x,y)

8
(a) Image to be evaluated from Middlebury dataset (b) Ground truth disparity map

(c) Disparity map with Semi-Global Block Matching


(d) Image segmentation using watershed transform
(SGBM)

(e) Filtered disparity map combining SGBM and plane (f) Filtered disparity map combining SGBM and ag-
fitting on image segmentation gressive plane fitting on image segmentation

Figure 17: StereoMatch Evaluation using middlebury dataset

(c) Refined disparity map using image


(a) True disparity map of sawtooth (b) Disparity map by SGBM
segmentation

Figure 18: Another test image from Middlebury evaluation dataset

9
Algorithm Mean RMS error Mean Bad pixel ratio 7. Conclusion
SSD09bt05 1.559039 0.024049 In this paper, we proposed a system that takes a 3D
object mesh and a video, and augments that video with the
SSD09t20 1.714662 0.030914 object. The system is able to estimate camera positions and
generate depth maps for each frame (to support occlusions).
SADmf09bt05 1.733064 0.031358
SADmf09t02 2.4118 0.051564 We used the Theia SFM library to estimate camera
positions, and proposed a novel method to estimate depths
SAD09t02 2.565821 0.058292
in each frame. To estimate depths, we used a combination
SADmf09t01 3.019650 0.084793 of image segmentation techniques (watershed algorithm),
stereo matching (SGBM), and planar interpolation. When
planar-SGBM 3.177850 0.071215
compared to stereo matching alone, the combination of
normal-SGBM 3.200869 0.072295 these technique allowed us to improve depth map accuracy
while at the same time significantly reducing noise and
SAD09t01 3.717277 0.135821 improving sharpness around object boundaries.

6. Future Work We were able to successfully project a 3D object back


into a video. Our algorithm was fully autonomous, and
There are still a few things that we would have liked to
did not require anything other than specifying the object
try to improve our results.
position.
• Implement state-of-the-art algorithms for stereo-
matching and see how they perform with and without You can find our code by going to:
planar interpolation. We only had time to add planar https://bitbucket.org/bajenov1/cs231a/
interpolation on top of SGBM, but there are better al-
gorithms out there. You can find our final augmented video here:
https://www.youtube.com/watch?v=X37SP4Dihhg
• Look into using the DAISY descriptor instead of
stereo-matching for dense 3D reconstruction. See: ”A
Fast Local Descriptor for Dense Matching” by Tola et.
al. [5]

• Try different image segmentation approaches to re-


move spilling out of objects. We started looking into
using the canny edge detector to seed the watershed
algorithm. [14]

• Exploit the fact that we have a video instead of a set of


photos to run SFM in real-time.

• Exploit the fact that we need a depth map of only a


small region where the 3D object is projected into the
scene (to make our algorithm real-time).

• Use a better rectification algorithm, as described in ”A


simple and efficient rectification method for general
motion” [19]

• [17] describes how to add better depth map merging


techniques - ”Metric 3D Surface Reconstruction from
Uncalibrated Image Sequences”.

• Project more interesting geometry into the scene. Ei-


ther by integrating with a ray-tracing library or using
OpenGL and taking advantage of its built-in shaders.

10
References [14] Canny, John. ”A computational approach to edge
detection.” Pattern Analysis and Machine Intelligence,
[1] Lee, J. C., and R. Dugan. ”Google project tango.” IEEE Transactions on 6 (1986): 679-698.
[2] Perry, Simon. ”Wikitude: Android app with [15] Haris, Kostas, et al. ”Hybrid image segmentation
augmented reality: Mind blowing.” digital-lifestyles. using watersheds and fast region merging.” Image
info 23.10 (2008). Processing, IEEE Transactions on 7.12 (1998):
[3] Sweeney, Christopher, Tobias Hollerer, and Matthew 1684-1699.
Turk. ”Theia: A Fast and Scalable [16] Pollefeys, Marc, Reinhard Koch, and Luc Van Gool.
Structure-from-Motion Library.” Proceedings of the ”A simple and efficient rectification method for
23rd Annual ACM Conference on Multimedia general motion.” Computer Vision, 1999. The
Conference. ACM, 2015. Proceedings of the Seventh IEEE International
[4] Ravimal Bandara, Image Segmentation using Conference on. Vol. 1. IEEE, 1999.
Unsupervised Watershed Algorithm with an [17] Pollefeys, Marc, et al. ”Metric 3D surface
Over-segmentation Reduction Technique. reconstruction from uncalibrated image sequences.”
[5] Tola, Engin, Vincent Lepetit, and Pascal Fua. ”A fast 3D Structure from Multiple Images of Large-Scale
local descriptor for dense matching.” Computer Vision Environments. Springer Berlin Heidelberg, 1998.
and Pattern Recognition, 2008. CVPR 2008. IEEE 139-154.
Conference on. IEEE, 2008. [18] Scharstein, Daniel, and Richard Szeliski. ”A
[6] Mur-Artal, Raul, J. M. M. Montiel, and Juan D. taxonomy and evaluation of dense two-frame stereo
Tardos. ”ORB-SLAM: a versatile and accurate correspondence algorithms.” International journal of
monocular SLAM system.” Robotics, IEEE computer vision 47.1-3 (2002): 7-42.
Transactions on 31.5 (2015): 1147-1163. [19] Pollefeys, Marc, Reinhard Koch, and Luc Van Gool.
[7] Furukawa, Yasutaka, and Jean Ponce. ”Accurate, ”A simple and efficient rectification method for
dense, and robust multiview stereopsis.” Pattern general motion.” Computer Vision, 1999. The
Analysis and Machine Intelligence, IEEE Transactions Proceedings of the Seventh IEEE International
on 32.8 (2010): 1362-1376. Conference on. Vol. 1. IEEE, 1999.

[8] Kundu, Abhijit, et al. ”Joint semantic segmentation


and 3d reconstruction from monocular video.”
Computer VisionECCV 2014. Springer International
Publishing, 2014. 703-718.

[9] Bradski, Gary, and Adrian Kaehler. Learning OpenCV:


Computer vision with the OpenCV library. ” O’Reilly
Media, Inc.”, 2008.

[10] Wu, Changchang. ”VisualSFM: A visual structure


from motion system.” (2011).

[11] Snavely, Noah. ”Bundler: Structure from motion


(SFM) for unordered image collections.” Available
online: phototour. cs. washington.
edu/bundler/(accessed on 12 July 2013) (2010).

[12] Eisert, Peter. ”Reconstruction of Volumetric 3D


Models.” 3D Videocommunication: Algorithms,
Concepts and Real-Time Systems in Human Centred
Communication (2005): 133-150.

[13] Scharstein, Damiel, and R. Szeliski. ”Middlebury


stereo datasets.” 2014-04-06]. http://vision,
middlebury, edu/stereo/data (2006).

11
Classroom Data Collection and Analysis using Computer Vision

Jiang Han
Department of Electrical Engineering
Stanford University

Abstract clear as it should be.


So far most camera system on mobile phone has embed-
This project aims to extract different information like ded face detection algorithm, but they have not put all three
faces, gender and emotion distribution from human beings topics including face detection, gender analysis and emo-
in images or video stream. Based on those collected data we tion analysis all together. Thus implementing such a demo
may be able to obtain some useful feedback or information, system will be very exciting to me.
which can be valuable guidance on how we can improve
Good thing is there has already been a lot researches on
the class education quality. In this project, a few computer
the above three areas. As a classical topic in computer vi-
vision topics were touched like face detection, gender clas-
sion, a lot of approaches have been proposed for face detec-
sification and emotion analysis. Technical details tested in-
tion. Authors from [1] [2] concluded there are four types
clude feature extraction strategies like Bag of Words, HOG,
of approaches for face detection: knowledge based method,
LBP, key-point detection, feature reduction etc. Machine
feature based method, template matching approach and ap-
learning classifier tested including Nave Bayes, KNN, Ran-
pearance based method. In particular, an outstanding face
dom Forest and SVM. Classifier parameters are tuned to
detection algorithm was proposed by Viola and Jones [3].
achieve the best accuracy. The system is able to achieve
Viola-Jones algorithm applys adaboosting and cascading
accuracy of 0.8673 for gender classification and 0.5089 for
classifiers and has advantages like fast speed, suitable for
emotion analysis (0.6073 when we prune particular class).
scaling, which makes it as the embedded face detection al-
Real life image and video stream tests also verified the va-
gorithm by toolbox from OpenCV and Matlab. For gen-
lidity and robustness of the system.
der classification, authors from [4] also roughly divide it as
feature-based and appearance-based approaches. Later on,
Lian etc. [5] used LBP feature with SVM and were able to
1. Introduction achieve very high accuracy. Also, Baluja etc. [6] applied
Motivation of this project came from my personal expe- Adaboost classifier and able to achieve more than 93% ac-
rience. One time when I was taking a class at Stanford CE- curacy. Similar to gender classification, emotion analysis is
MEX Auditorium, I often saw a TA came to the second floor also a classifying problem, but with multiple classes instead
and count how many students presented at the class. Atten- of two, which increases difficulty in accuracy. Authors from
dance was not strictly required for this class, so this data is [7] presents an approach using means of active appearance
only used by the instructor to have a better understanding of model to do both gender and emotion classification. SVM
current instruction status. At that time, I thought it will be was used to label four emotion types (happy, angry, sad and
great if we can do this by simply taking a photo. neutral). Y Kim etc. [8] applied deep learning techniques to
When I take CS231A this year, this old idea came to me overcome the linear feature extraction limitation in emotion
so I immediately decided to do some related work. Instead detection, which is able to boost performance. In all, gender
of only counting number of students, I decided to get more classification and emotion analysis have been very hot top-
interesting information like gender and emotion. By de- ics in researches from both feature extraction and learning
tecting faces and doing gender classification, we can have a model optimization aspects [9].
rough number of students attendance and their gender dis- In this report, Section 2 shows problem statement, which
tribution. Emotion analysis maybe more useful in analyzing briefly describes the system framework. Section 3 is tech-
the class quality. Later on from Section 4, it is shown that nical content, which introduces data set used, evaluation
we are able to get the emotion distribution across time with metric, and shows different approaches tried related feature
a video stream. If some emotion like “Surprise” appears a and classifier selection. Multiple numerical simulation re-
lot in the distribution, the class instructions may not be as sults are also shown in Section 3. Section 4 is experimen-

1
Gender Male Female
Training 9,993 10,992
Testing 3,040 2,967

Table 1. Data set for gender classification

Emotion Angry Disgust Fear Happy


Training 3,995 436 4,097 7,215
Testing 958 111 1,024 1,774
Emotion Sad Surprise Neutral
Training 4,830 3,171 4,965
Testing 1,247 831 1,233

Table 2. Data set for emotion analysis


Figure 1. Framework of system.

tal setup and results, which analyzes on class F1 score, also


shows the system performance on real life images and video
stream. Finally, conclusions and future work are shown in
Figure 2. Samples from gender data set (Male: left three images.
Section 5. Female: right three images).

2. Problem Statement
Main framework of system design is shown in Fig.1,
which includes four modules to process the input image.
Image is transformed to gray scale image at the first be- Figure 3. Samples from emotion data set (left to right: Angry, Dis-
ginning since color information is not that important in this gust, Fear, Happy, Sad, Surprise, Neutral).
classification problem. Then face detection is applied for
the image to locate all the human faces positions. Inside
each face box, gender classification and emotion analysis design a system with good trade-off between performance
engine works to generate corresponding labels. As Fig.1 and complexity is one of the project target.
shows, Module-3 and module-4 shares very similar inside
core, which includes: 3. Technical Content
• Image rescale: The subimage inside face box needs to 3.1. Data set and initial analysis
be rescaled for three reasons: (1) This is necessary and
The used data set for gender classification and emotion
will make things much easier to generate consistent
analysis is shown in Table 1 and 2.
feature dimension later on. (2) The source image data
The data set for gender classification is extracted partly
set for training and testing of gender analysis was dif-
from Image of Group (IoG) data set [10], the original data
ferent in scale size. (3) Face box derived from Module-
set includes more properties on each person like: face po-
1 may be different in sizes due to different face size.
sition, eye position, age, gender, pose. In this project, we
• Feature extraction: After rescaling, this step generates only care about the age property. Thus I split the data
features with consistent dimension. Feature extraction into four folders: “Male Training”, “Female Training”,
can use algorithms like Histogram of Gradient (HOG), “Male Testing” and “Female Testing”. From Table I we
Local Binary Pattern (LBP), Bag of Words (BoW), etc. can see that both male and female image number was
roughly balanced to guarantee the best model training
• Model training/classification: Based on the feature
result. With this split, we can use Matlab command
vectors output from feature extraction, we are able
imageSet easily load corresponding images. And the to-
to train the classifier with input image training data.
tal number of images used for both training and testing is
Training step may take long time due to data size. But
26,992. One thing I notice is that source image for genders
once the model is trained, we are able to use it to do
is not scaled to the same size. This is one of the reasons why
classification on the testing image directly.
we add “Image Rescale” before the feature extraction step
From Fig.1 we can see that each module may have multi- to make the input image 48 by 48 gray image. Fig.2 shows
ple algorithm candidates to implement, while being able to six gender sample images, which includes three male and

2
3.3. Face Detection
For face detection, initially I was using the similar
method from our problem set 3 with HOG + SVM + sliding
window. I also tried the following dynamic boxing method
to adjust window size to fit the face scaling.
size = minSize + 2N × step (3)
Here, size value is with constraint of size ≤ maxSize.
N is the N-th time of window expanding, step is a value
controls the window expanding speed. The advantage of
Figure 4. Bar chart of emotion data set. this strategy is that from minSize to maxSize, we at most
need N = dlog2 maxSize−minSize
step e times of expanding.
three female. Also we note that this data set includes peo- And we are giving smaller expanding speed when the win-
ple from different races and different age. dow size is still small. Hence, instead of doing linear time
Furthermore, Table 2 shows the data set for emotion of expanding and apply SVM at each position, here we have
analysis. The data set is from ICML [11], which has 7 cate- cost at O(log) level, eventually we choose the face window
gories of emotions including: Angry, Disgust, Fear, Happy, size with the biggest prediction score (only if this score is
Sad, Surprise and Neutral (both for training and testing). bigger than SVM threshold).
Fig.3 shows samples of the seven types of emotions. We However the test shows that sliding window scheme
notice that this data set includes emotions from different is quite slow because we need to run SVM multiple
gender and ages, which is good to train a robust model. times. Considering face detection is not the main part
However, emotion of human beings is very complicated and of this project, I turned to use the MATLAB embedded
vague. For example the “Surprise” image of Fig.3 may also vision.CascadeObjectDetector() to do the face detec-
be treated as “Angry” in reality. Furthermore, Fig.4 shows tion, which is using Viola Jones object detection framework.
the bar chart of different emotion types image number (in- Viola Jones algorithm is much faster and good at detecting
cluding both training and testing). Notice that this data set scaled faces [3].
is mostly balanced, except that “Disgust” type has signifi- 3.4. Gender Classification and Emotion Analysis
cantly lower number than other categories, which explains
why “Disgust” class has the lowest F1 score in later testing. I put gender classification and emotion analysis in the
same subsection since from Fig.1, we see that the two mod-
3.2. Evaluation metric ules share very similar inside blocks. Therefore in this sub-
section, I’m going to introduce the following three blocks:
To evaluate the performance of classification, accuracy
image rescale, feature extraction and training/classification.
(ACC) [12] is sued as the metric, which defines as:

ACC = (T P + T N )/(P + N ) (1) 3.4.1 Image rescale

Here, P and N are the total number of positive and negative


samples. T P and T N are the true positive and true negative
samples number. Therefore, ACC value of 100% indicates
that the system is able to predict labels exactly same with
the ground truth values. In this project, ACC is used to
evaluate different feature extraction and machine learning
algorithms.
In addition, I also used precision, recall and F1 [13] to do
the analysis on each type of the prediction classes, which is
shown in Section 4.
2 · precision · recall Figure 5. Face rescale example on testing image (not all faces are
F1 = (2) shown here).
precision + recall

Here, precision is defined as true positive (tp) over tp plus Image scale is necessary for both training and classifi-
false positive (fp). recall is defined as tp over tp plus false cation. For the training set, each image of emotion anal-
negative (fn). ysis was originally given as 48 by 48 gray-scale images, which can be used directly. But the gender classification data came as RGB images of varying sizes, so rescaling is necessary to bring the training images to a fixed size and gray scale. This makes it much easier, in the feature extraction step, to obtain feature vectors of consistent dimension.
For the classification step, there may be multiple faces marked in the original image, each with a different box size. We therefore convert to gray scale and rescale to 48 by 48 before applying the classifier. Fig. 5 shows an example of the rescaling inside a detected box.

3.4.2 Feature extraction (BoW)

Note: in order to save space, the table ACC values in the feature extraction parts (3.4.2 to 3.4.5) are based on testing gender classification with an SVM. The emotion analysis data gives similar conclusions.

For feature extraction, I started with bag of words (BoW). Matlab provides embedded functions such as bagOfFeatures, trainImageCategoryClassifier, and imageCategoryClassifier. The default feature vector is based on SURF; I also tried dense SIFT features. Based on the descriptors extracted from each image, K-means is applied to the feature space over the entire training set. Here, K defines the vocabulary size for the histogram. The image feature is then a histogram describing the nearest-cluster-center distribution for every image.
Table 3 shows the ACC of gender classification with both SURF and dense SIFT features. Compared with the SVM ACC reported later, this performance is even slightly worse than Naive Bayes. This seems reasonable, since the test objects are all faces with the same structures (eyes, nose, mouth, etc.), and clustering those features may lose details. The scenario of gender classification and emotion analysis is different from the scenarios where bag of words is most often used (for example, object classification of cups, ships, etc.). In addition, BoW gave very slow training, with around 3 million feature vectors to be clustered. Thus BoW was not selected after testing.

Feature         SURF      Dense SIFT
BoW Test ACC    0.7137    0.7326

Table 3. BoW ACC performance of gender classification on SURF and dense SIFT features (vocabulary size is 300).

3.4.3 Feature extraction (HOG and LBP)

HOG and LBP were tested after BoW. HOG is a very well-known feature descriptor in computer vision [14], which accumulates local gradient information. There are several parameters we can tune for HOG, such as "Cell Size", "Block Size", "Block Overlap", and "Number of Bins". In my tests, I kept the default values of "Block Overlap" and "Number of Bins", since those are the typical settings. "Cell Size" and "Block Size" are the more important parameters, as they control the feature vector size and the testing performance. "Cell Size" defines the box over which the histogram of gradients is calculated: a smaller cell size gives a better chance of catching small-scale details, while a larger cell size captures large-scale spatial information. "Block Size" defines the number of cells inside a block; a smaller block size may reduce the influence of illumination changes on the HOG features [15].
In addition to HOG, LBP is another feature I found very useful; its performance is no worse than HOG. The principle of LBP is different from HOG: instead of using gradient information, LBP compares each pixel's value with its neighbors and constructs a histogram from the binary comparison results. LBP can also easily be extended to a rotation-invariant version [16].
Table 4 and Table 5 show the HOG and LBP testing ACC with different feature dimensions for gender classification (the emotion analysis data shows a similar trend and is not listed here). By setting the cell/block size we obtain different feature dimensions. Unsurprisingly, a higher feature dimension gives better ACC, but it can also slow down the system significantly due to the increased complexity of the machine learning models.

Paras           Cell:8,8 Block:2,2    Cell:4,4 Block:2,2
Feature Size    900                   4356
HOG ACC         0.8369                0.8500
Paras           Cell:8,8 Block:3,3    Cell:16,16 Block:2,2
Feature Size    1296                  144
HOG ACC         0.8337                0.7718

Table 4. Gender classification HOG ACC performance with various cell/block settings.

Paras    Cell:8,8    Cell:12,12    Cell:16,16    Cell:24,24
Size     2124        944           531           236
ACC      0.8638      0.8390        0.8274        0.7864

Table 5. Gender classification LBP ACC performance with various cell settings.
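As a rough illustration of the descriptors discussed above, the sketch below computes HOG and a simplified cell-wise LBP histogram for a 48 by 48 gray-scale face in Python with scikit-image. The project itself uses Matlab (extractHOGFeatures and its LBP counterpart); the HOG setting below reproduces the 900-dimensional Cell:8,8 / Block:2,2 feature, but the simplified 8-neighbor uniform LBP here will not match the 944-dimensional Matlab LBP feature exactly, so treat the parameters as illustrative.

import numpy as np
from skimage.feature import hog, local_binary_pattern

def hog_feature(face48):
    # face48: 48x48 gray-scale face; Cell 8x8, Block 2x2, 9 bins -> 900-d vector
    return hog(face48, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

def lbp_feature(face48, cell=12, n_points=8, radius=1):
    # Uniform LBP codes, histogrammed per cell and concatenated.
    codes = local_binary_pattern(face48, n_points, radius, method='uniform')
    n_bins = n_points + 2                      # number of uniform patterns
    hists = []
    for r in range(0, face48.shape[0], cell):
        for c in range(0, face48.shape[1], cell):
            block = codes[r:r + cell, c:c + cell]
            h, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            hists.append(h / max(h.sum(), 1))  # normalize each cell histogram
    return np.concatenate(hists)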
3.4.4 HOG and LBP feature combination

To get the best trade-off between performance and speed, and based on the research finding that combining HOG and LBP features can improve detection performance [17], I joined the HOG feature (Cell: 8,8, Block: 2,2, dimension 900) with the LBP feature (Cell: 12,12, dimension 944) to boost ACC. Table 6 shows the combined-feature result.

Paras          HOG       LBP       Joined
Feature size   900       944       1844
Test ACC       0.8369    0.8390    0.8673

Table 6. LBP and HOG feature combination result.

Table 6 shows the ACC performance of the HOG and LBP combination. The original feature dimensions for HOG and LBP were both around 900, with ACC between 0.83 and 0.84. By joining HOG and LBP, we get a significant ACC boost of about 0.03. Even though the feature dimension is doubled after combination, this performance is still higher than HOG or LBP alone at a similar size. Because LBP and HOG use different principles to construct features, the combination provides a diversity gain.

3.4.5 Feature dimension reduction

Since feature size matters for system speed, a reasonable pruning or feature dimension reduction is very helpful, especially in a real-time system, where we would rather lose a little performance to give users a smoother experience.
My way of reducing the feature dimension was to extract HOG/LBP features only from areas around key points. Initially I tried several well-known key-point detection methods:
Harris: detects corners using the Harris-Stephens algorithm.
SURF: detects blobs using Speeded-Up Robust Features.
MSER: detects regions using Maximally Stable Extremal Regions.

Figure 6. Key-point detection result (each person, left to right: Harris, SURF, MSER, MSER Region).

Fig. 6 shows key-point detection results using Harris, SURF, and MSER. These detection methods return a different number of points for different images. Since we are not using BoW, we need to construct a consistent feature dimension. I tried different ways to do this, such as running K-means or selecting the strongest K points out of the N key points. However, the testing results showed a significant ACC loss compared to the sliding-window scheme.
Therefore, instead of using corner or blob key-point detectors, which mostly return different physical positions in each image, I used a fixed key-point feature extraction that places a fixed number of key points on the face. It turns out this method reduces the feature dimension with only a small performance loss.
Specifically, we use the Matlab CascadeObjectDetector system object to detect the nose on the face. If the detector returns the nose position successfully, we select K points spaced evenly around a circle (with a predefined radius) centered on the nose. If the nose is not detected (CascadeObjectDetector may fail on the 48 by 48 low-resolution image), we simply use the center of the image as the circle center. K can be chosen to get the best trade-off between ACC performance and complexity.

Figure 7. Circle key-point detection (circle center at the nose or the image center, K = 5 and K = 10).

Fig. 7 shows the results of circle-based key-point detection. The Matlab CascadeObjectDetector returns the nose position for the two left images, but fails for the two images on the right (in which case we use the image center directly). Detections for K = 5 and K = 10 are shown. After this step, we extract HOG/LBP features only around those key points.

K              5       10       15       20
Feature size   20%     40%      60%      80%
ACC Loss       5.6%    2.68%    1.41%    1.16%

Table 7. Key-point based feature reduction performance.

Table 7 shows the testing results for different K values. With only 40% of the feature dimension, we lose only 2.68% of the ACC performance. For real-time systems, users may well be willing to sacrifice this 2.68% ACC for a smoother experience.
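The circle key-point scheme above can be sketched as follows in Python. The nose detector itself (CascadeObjectDetector in the project's Matlab code) is not reproduced here; the function simply accepts a nose center when one is available and falls back to the image center otherwise, and the radius and patch size are illustrative values rather than the project's settings.

import numpy as np

def circle_keypoints(img_shape, k=10, radius=12, center=None):
    # K points evenly spaced on a circle around the nose (or the image center).
    h, w = img_shape[:2]
    cy, cx = center if center is not None else (h // 2, w // 2)
    angles = 2 * np.pi * np.arange(k) / k
    ys = np.clip(np.round(cy + radius * np.sin(angles)).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + radius * np.cos(angles)).astype(int), 0, w - 1)
    return list(zip(ys, xs))

def patches_around(face48, points, half=4):
    # Pad so that patches near the border keep a constant size; HOG/LBP is then
    # computed only on these small patches instead of the whole face.
    padded = np.pad(face48, half, mode='edge')
    return [padded[y:y + 2 * half + 1, x:x + 2 * half + 1] for y, x in points]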
3.4.6 Model Training and Classification

After the feature extraction method is selected, we can test different machine learning classifiers. In this project, I tried several models: Naive Bayes (NB), K Nearest Neighbors (KNN), Random Forest, and Support Vector Machine (SVM). Note that, to keep the system fast, I used the same features for gender classification and emotion analysis; however, we still need to choose the most suitable learner for each task.
Naive Bayes: NB uses Bayes' rule to calculate the probability of each class, and naively assumes independence between features.
K Nearest Neighbors: KNN takes the majority label of the K nearest neighbors. The KNN distance can be calculated with different metrics, such as Euclidean or Hamming distance (the K value can be tuned).
Random Forest: an ensemble model based on decision trees, where each decision tree trains and predicts based on attribute splits, with labels at the leaf nodes (the number of trees can be tuned).
Support Vector Machine: a well-known method that splits the samples while maximizing the minimum margin. Matlab also provides the ClassificationECOC classifier to support multi-class classification with SVM (the C value, which controls overfitting, can be tuned, and different kernels can be tried).
To select the best model, we need to run and tune each of the classifiers. Note that random guessing would give 50% ACC for gender classification and 14.28% ACC for emotion analysis (7 classes in total).

              Gender Classification    Emotion Analysis
Test ACC      0.7495                   0.3589

Table 8. Naive Bayes ACC performance.

Neighbor Number   1         5         10        20
Gender-ACC        0.7446    0.8017    0.8062    0.8190
Emotion-ACC       0.5306    0.4820    0.4727    0.4583

Table 9. KNN ACC performance.

Tree Number    20        60        100       300
Gender-ACC     0.7696    0.8022    0.8102    0.8235
Emotion-ACC    0.4388    0.4859    0.4965    0.5074

Table 10. Random Forest ACC performance.

              C=0.0008    C=0.01    RBF       Gaussian
Gender-ACC    0.8673      0.8608    0.4939    0.4950
Emotion-ACC   0.5089      0.5022    0.2064    0.2017

Table 11. SVM ACC performance.

Pruned     0         1         2         3
SVM-ACC    0.5672    0.5205    0.5911    0.4752
Pruned     4         5         6
SVM-ACC    0.6073    0.5207    0.5558

Table 12. SVM ACC with class selection.

Classifier         NB      KNN      Random Forest    SVM
Gender Time (s)    1.71    63.3     57.1             170.8
Emotion Time (s)   4.12    103.2    231.2            457.2

Table 13. Time cost for data set training.

Table 8 shows the NB ACC testing results for both gender classification and emotion analysis. The advantage of NB is that it runs extremely fast (it is the fastest of all the models), but its ACC is poor: only 0.7495 for gender and 0.3589 for emotion. Note that gender has higher ACC because it has only two labels, while emotion has 7 labels.
Table 9 shows the KNN results for different K values. For gender classification, K = 20 gives an ACC of 0.8190, but the boost from K = 10 to K = 20 is relatively small, meaning that neighbors ranked 10 to 20 contribute limited additional influence. For emotion analysis, however, ACC is best at K = 1 and drops significantly as K increases, which means that, for the emotion data, neighbors beyond the first introduce more noise than positive contribution. Compared with NB, KNN boosts gender ACC from 0.7495 to 0.8190 and emotion ACC from 0.3589 to 0.5306.
Table 10 shows the random forest results for different tree counts. Gender ACC increases from 0.7696 all the way to 0.8235 at 300 trees, and emotion ACC increases from 0.4388 to 0.5074. From 100 to 300 trees the improvement is relatively small, which means the model has almost converged. Random forests are known to resist over-fitting, so the ACC should converge as the number of trees grows; a reasonable tree count should be chosen as a trade-off between performance and complexity.
Table 11 shows the ACC performance of the SVM. The SVM gives less improvement on emotion analysis than on gender classification, mainly because emotion analysis is not a binary problem like gender. The value of C does not seem to influence ACC much. I also tried RBF and Gaussian kernels; in both the gender and the emotion problem they perform very badly and are clearly not good kernel choices. Considering that there is some overlap and similarity between different emotions, and that some emotion types may have a negative influence (i.e., cause false positives for other emotions), I also tested pruning class labels. Table 12 shows the SVM ACC when pruning different emotions; with a specific prune we can boost ACC above 0.60.
Table 13 shows the training time of each model, which indicates the relation NB < KNN < Random Forest < SVM. However, a longer training time does not necessarily imply a longer test time.
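The model comparison above was done in Matlab; as a point of reference, the sketch below shows an equivalent comparison in scikit-learn, assuming X_train/X_test are the joined HOG+LBP feature matrices and y_train/y_test the corresponding labels. The hyper-parameter values mirror the settings reported in the tables but are otherwise illustrative, not the exact code used in the project.

from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def compare_models(X_train, y_train, X_test, y_test):
    models = {
        'NB': GaussianNB(),
        'KNN': KNeighborsClassifier(n_neighbors=20),
        'RandomForest': RandomForestClassifier(n_estimators=300),
        'SVM': LinearSVC(C=0.01),   # linear kernel; RBF performed poorly here
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print('%-12s test ACC = %.4f' % (name, acc))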
4. Experimental Setup and Results

Note that in Section 3 I already showed the majority of the numerical testing results (different feature performance, different classifier performance, etc.) in multiple tables. In this section, we show some additional experimental results.
The simulation tool used for this project was mainly Matlab; the total number of .m files is around 20. I also used Java and Python for some data and image parsing. There are several main functions to test the BoW features, data processing, gender detection, emotion analysis, etc., plus other helper functions for feature extraction, classification, and so on. The VLFeat toolbox was also used in order to test the dense SIFT features.
For convenience, I parsed and split the images into the following folder structure:
emotion-train/test
  0-Angry
  1-Disgust
  2-Fear
  3-Happy
  4-Sad
  5-Surprise
  6-Neutral
gender-train/test
  Male
  Female
Here, each class for emotion analysis and gender classification has its own folder, which makes it very easy to load images using the Matlab imageSet command.

4.1. Class F1 score analysis

In Section 3, we already showed the majority of the numerical results, including tests of feature extraction, feature reduction, and the different classifier models. In this part, I show how robust the system is on each class. Instead of accuracy, precision, recall, and F1 score are used (definitions can be found in Section 3.2).

Gender       Male      Female
Precision    0.8745    0.8602
Recall       0.8615    0.8733
F1           0.8679    0.8667

Table 14. Precision, recall and F1 of gender classification.

Emotion      Angry     Disgust   Fear      Happy
Precision    0.4004    0.9091    0.3730    0.6789
Recall       0.3779    0.1802    0.2754    0.7627
F1           0.3888    0.3008    0.3169    0.7183
Emotion      Sad       Surprise  Neutral
Precision    0.3719    0.6982    0.4405
Recall       0.3841    0.6041    0.5345
F1           0.3779    0.6477    0.4830

Table 15. Precision, recall and F1 of emotion analysis.

Table 14 shows the precision, recall, and F1 of gender classification. The male and female classes have very close performance on all three metrics, so we can conclude that the system has no gender bias and performs similarly well on male and female faces.
Table 15 shows the precision, recall, and F1 of emotion analysis. Unlike Table 14, each class here is highly biased. "Happy" and "Surprise" have the highest F1 scores, indicating that these two emotions will perform best in real-life testing. Some emotions have relatively low F1, such as "Angry", "Fear", and "Sad", but this is understandable since those emotions are inherently somewhat vague; as shown later in Sections 4.2 and 4.3, we can tolerate some overlap among them. In addition, "Disgust" has the lowest F1 score, which is also reasonable since we have much less training data for the "Disgust" emotion (as shown in Fig. 4 and Table 2). Note that it also has very high precision (0.9091) and very low recall (0.1802), which means it is difficult for the system to retrieve "Disgust" from a test image, but once it is marked, it is correct with 90.91% probability.
However, these numerical results were measured on the testing data set, whose resolution is intentionally low and whose facial expressions are quite complex. My impression when testing on real-life images or video streams is that the system works far better than the numerical performance on the test set suggests (shown in Sections 4.2 and 4.3).
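The per-class metrics reported in Tables 14 and 15 were computed in Matlab; an equivalent scikit-learn sketch is shown below, where y_true and y_pred are hypothetical label vectors for the test set.

from sklearn.metrics import precision_recall_fscore_support

def per_class_metrics(y_true, y_pred, class_names):
    prec, rec, f1, support = precision_recall_fscore_support(
        y_true, y_pred, labels=list(range(len(class_names))), zero_division=0)
    for i, name in enumerate(class_names):
        print('%-10s precision=%.4f recall=%.4f F1=%.4f (n=%d)'
              % (name, prec[i], rec[i], f1[i], support[i]))

# Example: the seven emotion classes used in this project.
emotions = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']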
4.2. Real Life Image Test

The numerical ACC results from Section 3 should already be enough to verify the correctness of the system, but to give a more direct view of the performance, this subsection shows some test results on images and videos.

Figure 8. Emotion image-1 testing result.

Figure 9. Emotion image-2 testing result.

In order to test the emotion classification results, I found several images (Fig. 8 and Fig. 9) [18] with human faces showing various expressions; the images also include people of different races and ages. Note that the five faces of Fig. 8 and of Fig. 9 were each passed to the system inside a single image. Hence, referring back to Fig. 1, the process is:

• The face detection module circles out the five face box areas.
• The image inside each face box is rescaled, and feature extraction generates a feature vector of consistent dimension for each face.
• The gender classifier is run on the provided feature.
• The emotion classifier is run on the provided feature.
• The face box, gender, and emotion label are added to the image.

From Fig. 8 and Fig. 9 we can see very good test results on both gender and emotion classification. For gender classification, only one of the 10 faces was in error. For emotion analysis, five different emotion labels were marked: Angry, Fear, Surprise, Happy, and Neutral. Almost all of the emotion classification results look reasonable to me. However, human facial expression is very complicated; it is sometimes vague even for people to judge another's emotion from a facial expression. For example, the third face in Fig. 9 could be interpreted as either "Surprise" or "Happy" by different people.

Figure 10. Classroom image testing result.

In addition, Fig. 10 shows an example result on an image of classroom students (original image from [19]). Fig. 10 involves scaling and rotation: different students have different face sizes in the image, depending on their distance to the camera in 3D coordinates. This can be handled by the Viola-Jones detection algorithm, although we do lose some facial detail at smaller face scales. Another issue is rotation: unlike Fig. 8/9, where everyone looks directly at the camera, none of the people in Fig. 10 is looking at the camera, and several faces are rotated. However, some training data also comes with rotation, so we are still able to make correct predictions in Fig. 10. Almost all of the emotion predictions in Fig. 10 seem reasonable to me, except one person marked "Sad", which may be due to the scaling of the face. Gender classification has only one error in the image, for a person whose gender features seem somewhat ambiguous.
I also tested the system on many of my personal images; my impression is that the system performs much better than the ACC values shown in the tables of Section 3. For gender classification we get quite high accuracy here, because people in real-life images mostly have clearer features than the training data. For emotion analysis, as shown in Section 4.1, the system has much higher F1 scores on emotions like "Happy" and "Surprise", which are the more common emotions in real life.

4.3. Video Stream Test

In addition to images, I also tested the system on a video stream to see how well it handles continuous expression changes. Unlike a static image, video is a more flexible way to test gender and emotion classification, since we can show different expressions and observe how the system handles the changes.

Figure 11. Emotion samples from video.

We can use Matlab's vision.VideoFileReader to get each video frame. We can also use a sparse sampling rate: instead of classifying every frame, classify every Nth frame. This loses some temporal resolution of the classification but definitely speeds up the system.
After the training of Section 3.4.6, the classifiers are ready to use (they can be saved in .mat format). Thus, in real system testing we can load the classifiers into memory directly; no training is needed.
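A sketch of this video pipeline is shown below, using OpenCV in place of Matlab's vision.VideoFileReader. Here extract_features, gender_clf, and emotion_clf are stand-ins for the project's feature extractor and the pre-trained classifiers loaded from disk, and N controls the sparse sampling rate; none of this is the project's actual code.

import cv2

def classify_video(path, extract_features, gender_clf, emotion_clf, N=5):
    face_det = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    cap = cv2.VideoCapture(path)
    results, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % N == 0:                 # classify every Nth frame only
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x, y, w, h) in face_det.detectMultiScale(gray, 1.1, 5):
                face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))
                feat = extract_features(face).reshape(1, -1)
                results.append((frame_idx,
                                gender_clf.predict(feat)[0],
                                emotion_clf.predict(feat)[0]))
        frame_idx += 1
    cap.release()
    return results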
I recorded a video with a total length of 94 frames, lasting 15 seconds. Fig. 11 shows 6 sample frames from the video stream, each with a different emotion type. The results look quite reasonable to me. One interesting thing we notice is that the face box excludes the mouth in some situations (such as "Fear" and "Surprise"), yet we still get the correct prediction, which indicates the system is robust. Again, emotion prediction itself is somewhat subjective and vague in real life; the major difference between the "Angry" frame and the "Sad" frame seems to be the mouth feature, but both predictions are reasonable to me.

Figure 12. Emotion distribution during video sampling time.

Figure 13. Gender distribution during video sampling time.

Instead of focusing on a single image, a more interesting and useful analysis is to do this over the entire video stream. Fig. 12 shows the emotion distribution over the 94 frames of this 15-second video, from which we can clearly see the proportion of each emotion type. In this demo video, I was "Angry" for 6% of the time, never "Disgust", "Fear" for 25%, "Happy" for 15%, "Sad" for 10%, "Surprise" for 28%, and "Neutral" for 16%. The emotion "Disgust" never appears in this video stream because its training data was significantly smaller than the others (based on Table 2 and Figure 4). Also, as noted in Section 4.1, "Disgust" has a very low recall, which means it is relatively difficult to recognize this emotion from an image, although once it is recognized it will mostly be correct (based on its high precision from Section 4.1). The system detects 6 types of emotions in this short video because I was deliberately changing my expression frequently. In reality, this kind of emotion distribution could be useful, for example, to evaluate the quality of a class.
Fig. 13 shows the gender prediction distribution over the video, from which we can see that most of the time (93%) the system makes the correct gender prediction. The gender prediction engine seems quite accurate; it should also be quite robust, since I made a lot of exaggerated facial expressions during this video. Otherwise we could expect even higher accuracy.

5. Conclusion and Future Work

In this project, we touched on topics such as face detection, gender classification, and emotion analysis. Different feature extraction methods, including BoW, LBP, and HOG, were tested. Key-point-detection-based feature dimension reduction was used to reduce complexity. Multiple classifiers (NB, KNN, Random Forest, and SVM) were tested, and the parameters of each model were tuned for the best performance. Numerical results indicate that we can reach 0.8673 accuracy on gender classification and 0.5089 on emotion analysis (0.6073 when pruning a particular class). Further analysis was done on precision, recall, and F1 for each class. Testing on real-life images and a video stream also demonstrates the validity of the system.
Personally speaking, this was a very exciting project, which made me familiar with different vision algorithms and how to connect them with machine learning tools. Being able to develop a demo system that can be used immediately was very interesting, and I had a lot of fun testing it on my own pictures and photos.
The ACC of emotion analysis is one part that can be improved in future work; the current emotion analysis also has biased performance across emotion types. Introducing deep learning should be very helpful for this multiclass problem.
Also, to be better used in real-life scenarios such as class quality analysis, more information could be collected, such as human pose, age information, and person recognition. A good model that uses the collected information to generate an overall summary score (such as a group analysis) would also be very interesting.
Code link: Follow this link. The code link was also submitted through the Google Form.

References
[1] Yang M H, Kriegman D J, Ahuja N. Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(1): 34-58.
[2] Zhang C, Zhang Z. A survey of recent advances in face detection. 2010.
[3] Viola P, Jones M. Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition (CVPR 2001), IEEE, 2001, 1: I-511 - I-518.
[4] Mäkinen E, Raisamo R. An experimental comparison of gender classification methods. Pattern Recognition Letters, 2008, 29(10): 1544-1556.
[5] Lian H C, Lu B L. Multi-view gender classification using local binary patterns and support vector machines. In: Advances in Neural Networks - ISNN 2006. Springer Berlin Heidelberg, 2006: 202-209.
[6] Baluja S, Rowley H A. Boosting sex identification performance. International Journal of Computer Vision, 2007, 71(1): 111-119.
[7] Saatci Y, Town C. Cascaded classification of gender and facial expression using active appearance models. In: Automatic Face and Gesture Recognition (FGR 2006), IEEE, 2006: 393-398.
[8] Kim Y, Lee H, Provost E M. Deep learning for robust feature generation in audiovisual emotion recognition. In: Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2013: 3687-3691.
[9] Fasel B, Luettin J. Automatic facial expression analysis: a survey. Pattern Recognition, 2003, 36(1): 259-275.
[10] Gallagher A, Chen T. Understanding images of groups of people. In: Computer Vision and Pattern Recognition (CVPR 2009), IEEE, 2009: 256-263.
[11] https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data
[12] https://en.wikipedia.org/wiki/Accuracy_and_precision
[13] https://en.wikipedia.org/wiki/F1_score
[14] Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition (CVPR 2005), IEEE, 2005, 1: 886-893.
[15] http://www.mathworks.com/help/vision/ref/extracthogfeatures.html
[16] Ahonen T, Hadid A, Pietikainen M. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(12): 2037-2041.
[17] Wang X, Han T X, Yan S. An HOG-LBP human detector with partial occlusion handling. In: IEEE 12th International Conference on Computer Vision, 2009: 32-39.
[18] http://www.shutterstock.com/s/emotions/search.html
[19] https://jonmatthewlewis.wordpress.com/2012/02/24/posture-and-gestures-in-the-classroom-and-on-the-date/
Computer Vision for Food Cost Classification

Dash Bodington
Stanford University
dashb@stanford.edu

Abstract

This project uses various computer vision methods for the novel task of restaurant cost prediction based on food images. It uses the Yelp dataset, a dataset of 200,000 images tagged by business, which is significantly pared down to roughly 15,000 relevant images tagged as food and labeled by the cost rating of the business they come from. These food images are partitioned into training and testing datasets, and the test dataset has a uniform sample distribution over the classes: cheap ($-$$ on Yelp) and expensive ($$$-$$$$ on Yelp).
Once the dataset is defined, several techniques are trained and applied to make class predictions on the test dataset. The prediction model is broken into feature extraction and classification, and significant exploration was done to come up with optimal features and classifiers for the specific task. The primary goal of this project was to achieve the highest possible classification accuracy on the reserved test dataset.
The four main feature extractors used were: simple preprocessing (crop and resize images to uniform dimensions), color histograms (a normalized distribution of color intensity over all pixels), SIFT bag of words (normalized distributions of common SIFT feature descriptors over each image), and deep neural network features (using the trained Alexnet to extract image features). Each feature extractor was tested with each of the following classification methods: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naive Bayes, and a custom neural network with softmax loss.
The best results were achieved from the combination of Alexnet features and color histograms, independently compressed using PCA and classified with a shallow neural network, which achieved 70% accuracy on the test set, while the other methods achieved from 50% to 68% accuracy.

1. Introduction

Since various deep learning methods have shown very impressive results in classification, detection, and localization, there has been a great deal of focus on understanding images beyond their still content, mostly in the form of generating descriptive sentences, using semantic analysis to understand language [6, 7] to generate these descriptions. While generated descriptions are easy for a human to evaluate and judge for perceived intelligence, they are only a subset of what can be understood beyond image content through computer vision.
This project aims to tackle a subset of beyond-content image understanding through food images and restaurant ratings from the Yelp dataset. The dataset contains 200,000 images tagged with descriptors and the businesses they come from, and businesses are given a cost rating from $ to $$$$. Because of the small number of restaurants in the two most expensive classes, these cost ratings are compressed into 1-2 $ and 3-4 $, hereafter referred to as cheap or $ and expensive or $$. By filtering images by the 'food' tag and assigning the business cost rating to each image, a dataset containing food photos and an estimated cost, or fanciness, ranking is created. Many computer vision approaches can be used in an attempt to understand these food images beyond their straightforward content. The simple goal of this project is to train algorithms on the training data and maximize accuracy on the test dataset.
Of the possible approaches to this problem, this project focuses on predictive models divided into feature extractors and classifiers, and explores several options and combinations between the two model parts. The four primary feature extractors used were: simple preprocessing (crop and resize images to uniform dimensions), color histograms (a normalized distribution of color intensity over all pixels), SIFT bag of words (normalized distributions of common SIFT feature descriptors over each image), and deep neural network features (using the trained Alexnet to extract image features). Each feature extractor was tested with each of the following four classification methods: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naive Bayes, and a custom neural network.
Very little work closely similar to this project has been done, but the project relies heavily on more general, existing computer vision tools and research. The remainder of this report discusses the dataset, models, results, and conclusions of this project in detail.
2. Implementation

2.1. Experimental Setup

The implementation of this project was completely computational, and involved dataset extraction, data preprocessing, feature extraction, and classification. All code for this project was written in Python 2.7; OpenCV was used for most image processing [2], TensorFlow [1] was used to write all neural networks, and scikit-learn provided implementations of some other classifiers [5].
The project was run on a fast desktop computer. All neural network models were run on an Nvidia GTX 780 GPU, and the remainder of the processing was done on a 4.4 GHz quad-core CPU. Even with reasonably fast hardware, feature extraction was very time consuming, so it was often done only once, with the features cached to be used on demand.

2.2. Dataset Extraction

Though this project uses a nicely formatted and labeled dataset, the subset required for training and testing comes from several extraction steps. The entire Yelp dataset contains 200,000 color images from businesses' Yelp pages. Images can be uploaded by either businesses or customers, and are of varying quality and size. The images are tagged by Yelp's own computer vision algorithms and can be corrected by users, as there are sometimes errors in tagging.
Initially, the image database is analyzed, and all images tagged as 'food' are extracted along with the 'id' of the business they come from. All images from the same business are grouped and then labeled with the consolidated cost label (mapped from Yelp's $ - $$$$ to the binary $ or $$ for this project) of the business. If a business has not been labeled with a cost rating, which happens infrequently, the corresponding images are discarded. Next, the business ids are binned by their attached cost ratings, each bin is shuffled, and a predetermined train/test ratio (0.7 in this case) is used to assign each business id to the training set or the test set. This binning by business and cost rating ensures a similar distribution of data in the test and training datasets, and avoids placing images of the same item from the same business into both the training and test datasets, which could be considered mixing the two. In total, there are 11,530 images available for training and 4,412 images available for testing, though these counts decrease depending on the desired training and testing distributions.
In the test dataset, images in each class are shuffled, and images are discarded from the class with more images until exactly 50% of the images in the test set are from each class. Enforcing this 50% distribution cuts the dataset approximately in half because of the relative rarity of expensive restaurants.
After the training and test datasets are fully defined, each image is cropped from the center to the largest square area possible and is resized to a variable size, depending on the feature extractor used (sizes range from 64x64 to 227x227).
With this definition of the training and test datasets, there are several further processing steps which are sometimes used on the training set to improve training and performance:

• For validation or cross validation, which is only used when tuning feature or classifier parameters or training neural networks, 20% of the training dataset is randomly designated as the validation set. All models except neural networks are trained on the whole training dataset before testing.

• Depending on the classifier and loss function being used, the training set will usually have images discarded in the same fashion as the test dataset, to even the distribution of images across classes.
2.3. Feature Extraction

Several feature extractors were used in this project as inputs to the various classification systems.

2.3.1 Images as Features

In some cases, the preprocessed (scaled and cropped) images were used as features themselves. This method is most appropriate as input to a convolutional neural network classifier, which essentially defines its own features internally, but it can also be used with other classifiers with varying success. These features use images of size 128x128 or 64x64, which are vectorized unless the classifier is a convnet.

2.3.2 Color Histogram Features

As an initial step beyond images as features, it was thought that the colors in an image could give an indication of the cost of the food. This feature is a length-30, normalized (sum 1) vector containing a 10-bin histogram for each color channel. These features use images of size 128x128 or 64x64 as inputs.
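A minimal sketch of this 30-dimensional color-histogram feature is shown below (a 10-bin histogram per channel, concatenated and normalized to sum to 1). `img` is assumed to be an HxWx3 uint8 array, e.g. a 128x128 crop.

import numpy as np

def color_histogram(img, bins=10):
    hists = [np.histogram(img[:, :, c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    feat = np.concatenate(hists).astype(np.float64)
    return feat / feat.sum()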
2.3.3 SIFT Bag of Words Features

SIFT bag of words features consist of frequency distributions of common feature descriptors. During training, a set of SIFT descriptors is extracted from all training images, and N (range: 20-100) 'words' are extracted with the K-means algorithm [2, 4]. Then, during training and testing, each image's feature vector is calculated as a normalized N-length histogram of the words, where each feature descriptor from the image is assigned to one word with the nearest-neighbor algorithm. These features use images of size 128x128 as inputs.

2.3.4 Alexnet Features

Alexnet is a pretrained convolutional neural network which previously achieved state-of-the-art performance in the ImageNet large-scale image classification challenge [3]. Originally, the network's output was a 1000-class softmax layer, but because the network learns features that are very useful for other tasks, swapping the final layers of the network for custom-trained layers is common practice, especially for those working without the computational power or large data volume required to train a model with similar performance from scratch. For this project, 'Alexnet features' are the output of one of the final layers of the pretrained network.
With Alexnet, multiple feature sets were created from multiple layers of the convnet: fc8, fc7, and fc6 (the final fully-connected layers of the network). Because these features are effectively sparse (we are feeding the network a very small subset of the images it was trained to classify) and very large, PCA is often used to reduce the feature dimensionality before training. This network requires image inputs of size 227x227, which are also mean-subtracted.
Using the GPU allows a significant speedup in feature generation, over 70x faster than CPU computation on the hardware used for this project, but it is not enough to train a network like Alexnet from scratch with limited time and data.

2.4. Classifiers

Because of the many feature inputs used in this project, classifiers with different properties and strengths are used to increase the likelihood of good performance with each feature set. Feature vector lengths range from 30 (color histogram) to 12,288 (cropped and rescaled images), so models which might overfit in some cases may perform well with fewer features.

2.4.1 K-Nearest Neighbor

The K-Nearest Neighbor classifier archives the entire training feature set and, at prediction time, calculates the Euclidean distance from the input feature vector and makes a prediction based on a majority vote over the labels of the K closest training examples (K = 20 for this project). While storage-inefficient, it is one of the simplest classifiers and would perform well if images in the training and testing datasets were similar enough, but it fails to generalize otherwise.

2.4.2 Naive Bayes

Naive Bayes classification assumes feature independence and a Gaussian probability distribution of the features to make a maximum a posteriori estimate of the class,

$$\hat{y} = \arg\max_y P(y) \prod_i P(x_i \mid y),$$

and usually performs well on small training datasets because it has few parameters.

2.4.3 Support Vector Machine (SVM)

The SVM is a linear classifier which defines a class-dividing hyperplane $f(x) = B_0 + B^T x$, found by minimizing $\frac{1}{2}\|B\|^2$ subject to $y(B_0 + B^T x) \geq 1$ on the training set. Generally, SVMs have the advantage that they are less likely to overfit than other methods because of their built-in regularization.

2.4.4 Neural Networks

While neural networks are the most general classifier used in this project, they are also the most difficult to tune, and training networks with many neurons requires a great deal of data. Neural networks are a layered structure of interconnected linear and nonlinear operations whose parameters can be learned with various gradient descent methods. In this project, the classification networks presented have zero or one hidden layers, and all layers are fully connected. Each fully connected hidden layer (when present) is followed by a Rectified Linear Unit (ReLU), and the final layer is a softmax layer which takes two inputs and computes pseudo-probabilities for the input on each class according to

$$\hat{p}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}.$$

Though neural networks have been responsible for many recent state-of-the-art results in computer vision, they are among the most difficult models to manage on small datasets. Because of this, L2 regularization is sometimes added to the iterative minimization of the cross-entropy loss

$$\mathrm{Loss} = -\sum_i p(x_i) \log(q(x_i)) + \lambda R(\mathrm{Weights}),$$

where q contains the outputs of the softmax layer and p is the one-hot label vector for the training sample x. Other training tricks, such as dropout and projection of sparse feature vectors into lower dimensions with PCA, are also used to increase robustness and decrease overfitting. Batch gradient descent with momentum was used to train networks for this project because it provided reliably converging results, especially when changing the class distribution (and size) of the training set.
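As a small worked illustration of the softmax and regularized cross-entropy loss above, the numpy sketch below computes both for a batch (the project itself implements this in TensorFlow). `scores` are the two raw outputs of the final layer per sample, `labels` are one-hot rows, and `weights` is the list of weight matrices; the regularization strength is an illustrative value.

import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_loss(scores, labels, weights, lam=1e-4):
    q = softmax(scores)
    data_loss = -np.mean(np.sum(labels * np.log(q + 1e-12), axis=1))
    reg_loss = lam * sum(np.sum(W * W) for W in weights)    # L2 regularization
    return data_loss + reg_loss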
Figure 1. Example of the classification neural network. A ReLU is applied to the hidden layer, and a softmax normalization is applied to the output.

3. Results

Though estimating the restaurant cost class from food images is a difficult problem, reasonably good results are tabulated and discussed below.

3.1. Accuracy

                   K-NN    NB      SVM     NN
Images             0.56    0.60    0.50
Color Histogram    0.57    0.63    0.54    0.56
SIFT + BoW         0.59    0.62    0.60    0.60
Alexnet            0.62    0.65    0.66    0.68
Multifeature       0.55    0.65    0.68    0.70

Table 1. Multifeature methods (Alexnet fc8 and color histograms) with a neural network classifier perform the best on the test set.

3.2. Example Classifications

Though accuracy results are the goal, actual examples provide a more intuitive look at classification. We can see that desserts appeared frequently in the $$ class, which is not surprising considering that they are often considered a luxury item, while $-classified foods are more everyday items.

Figure 2. These three images of cheesecake, chocolate dessert, and steak are the images with the highest estimated probability of being expensive by the neural network.

Figure 3. These three images of pizza, a taco, and a sandwich are the images with the lowest estimated probability of being expensive by the neural network.

3.2.1 Details of Best Model

The best performing model, 'multifeature,' which achieved an accuracy of 70% on the test dataset, combined Alexnet features, color histogram features, and a neural network classifier.
For this model, 1000-dimensional Alexnet features were extracted from the 8th fully connected layer of the network (the layer closest to the original softmax layer) and compressed to 100 dimensions and whitened with PCA. Then, whitened color histogram features were concatenated with the Alexnet features, leaving feature vectors of length 130 for the classifier.
The classifier in this model was a fully connected neural network with a hidden layer of size 50 (see Figure 1). The model was trained for 10,000 iterations on batches of 200 images, reporting validation error every 500 iterations. After full training, the model from the iteration with the highest validation accuracy was used for testing.
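A sketch of this 'multifeature' pipeline is shown below, substituting a scikit-learn MLP for the project's TensorFlow classifier. The hypothetical arrays fc8_train/fc8_test hold the 1000-d Alexnet fc8 features and hist_train/hist_test the 30-d color histograms; the training details from the paper (batching, validation checkpoints, momentum, whitening of the histograms) are not reproduced here.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def multifeature_model(fc8_train, hist_train, y_train, fc8_test, hist_test):
    pca = PCA(n_components=100, whiten=True).fit(fc8_train)
    X_train = np.hstack([pca.transform(fc8_train), hist_train])  # 130-d vectors
    X_test = np.hstack([pca.transform(fc8_test), hist_test])
    clf = MLPClassifier(hidden_layer_sizes=(50,), activation='relu',
                        max_iter=500)
    clf.fit(X_train, y_train)
    return clf.predict(X_test)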
3.2.2 Unsuccessful Models

In addition to the previously presented models, many other, less successful attempts were made to solve this task. Some simply performed poorly, were not unique, or were computationally unfeasible.

• Convolutional neural networks were implemented, but failed to train or generalize well. It is thought that, with the large image size required (greater than 128x128) to resolve objects in detail in these images, the training dataset was not large enough to train the large filters also required. At small image sizes (64x64-128x128) with smaller filters, models did not generalize, likely due to the lack of recognizable detail at such a pixelized scale.

• Other, faster feature extractors were also tested for the bag of words method, such as SURF and BRIEF, but since SIFT features were the best performing, only they were presented.

• Linear Discriminant Analysis was used for classification as well, but it performed similarly to the SVM, so the decision was made to focus on fewer, more unique classifiers.

• Many attempts were made to train models on imbalanced datasets in order to have more training data; however, with the imbalanced nature of the data (75% of the images in the $ class), many models simply learned to predict $ consistently, since this led to higher accuracy than varied predictions.

• To combat the problem above, various loss functions and update methods were tried, such as weighting each misclassification by (1 - the prior probability) and only performing updates for misclassified images. Ultimately, training on an even dataset was the most successful.

4. Conclusion

High performance in abstract computer vision tasks like this one is very difficult to achieve, which was a lesson learned during this project. Though an accuracy of 70%, significantly above random (50%), was achieved, higher performance would certainly be necessary before using a prediction system like this in an unsupervised application.

4.1. Challenges

The greatest challenge in achieving good performance on this task seemed to be working with unbalanced or small datasets. Though nearly 15,000 food images were available, the rarity of expensive restaurants made training difficult, and discarding images to even out the datasets led to a significantly smaller volume for training, though results improved. Beyond data problems, there may also be issues with the labeling of the dataset. Yelp uses its own neural network to tag images, so some images tagged as food (such as salt and pepper shakers, and a hotel key) were not food. Though the frequency of this was small, it may have had an effect on performance. Additionally, the $ rating system is not perfect for this task: a Yelp $$$ restaurant in an expensive location could serve dishes similar to a Yelp $$ restaurant elsewhere, so there is certainly some blurring of the data across the class boundary, unlike most standard classification tasks.

4.2. Future Work

Moving forward, it is possible that discarding one of the middle classes, $$ or $$$, from the Yelp dataset would create a set that is better divided between classes. If a dataset of photos and menu prices were available, it would be interesting to combine this cost-category recognition with dish recognition to allow prediction of actual dish prices.
70% accuracy is satisfactory given the time and data constraints of the project, and every effort was made to raise this accuracy as high as possible. With more time and data, it would be interesting to train end-to-end convolutional neural network models, or to focus more on fine-tuning state-of-the-art models such as Alexnet or Google's Inception. As the state of the art in computer vision progresses, it is likely that tools more suited to this problem will arise, but for now, the best result of 70% has been achieved with two neural networks in combination with hand-written features.

4.3. Replicating Results

Code for this project can be found at the author's Github (github.com/DashBodington/cs231aProject), and the Yelp dataset is available from Yelp. The code is written in Python 2.7 and requires TensorFlow, OpenCV with the feature extractors in opencv_contrib, NumPy/SciPy, and scikit-learn. The GPU implementation of TensorFlow is strongly recommended, and some CPU feature extraction methods may take several hours on the full dataset.

References
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[4] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[7] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
Using Computer Vision to Solve Jigsaw Puzzles

Travis V. Allen
Stanford University
CS231A, Spring 2016
tallen1@stanford.edu

Abstract

Many in the computer vision community have written papers on solving the "problem" of jigsaw puzzles for decades. Many of the earlier papers cite the potential extension of this work to more serious endeavors, such as reconstructing archaeological artifacts or fitting together scanned objects, while others simply do it because it is an interesting application of computer vision concepts. This author falls in the latter camp. Several methods for constructing jigsaw puzzles from images of the pieces were considered from a theoretical standpoint before the computing power and the high-resolution image capturing devices necessary to employ these methods could be fully realized. More recently, many algorithms and methods tend toward disregarding piece shape as a discriminant entirely by using "square pieces," relying instead on the underlying image properties to find a solution. The jigsaw puzzle solver described in this paper falls somewhere in between these two extremes. The author describes the creation of an "automatic" jigsaw puzzle solving program that relies on multiple concepts from computer vision, as well as past work in the area, to assemble puzzles from a single image of the disassembled pieces. While the program is currently specifically tailored to solve rectangular puzzles with "canonical" puzzle pieces, concepts learned from this work can be used in concert with other computer vision advances to make the puzzle solver more robust to varying piece and puzzle shape. The puzzle solver created for this purpose is fairly unique in that it uses a picture of the disassembled pieces as input, no reference to the original puzzle image, and is implemented using the Matlab Image Processing Toolbox. The solver created for this project was successfully used on five separate puzzles with different rectangular bounds and dissimilar puzzle images. These results are similar to those of others who have created similar puzzle solvers in the past. Ultimately, the author hopes that this work could lay the groundwork for a smart phone application.

1. Introduction

This project focuses on the creation of an end-to-end "automatic" jigsaw puzzle solving program that assembles a jigsaw puzzle using an image of the disassembled puzzle pieces. The solver does not use the picture of the assembled puzzle for matching purposes. Rather, the solver created for this project attempts to solve the lost-the-box conundrum by displaying the assembled puzzle to the user without the need for the reference image. Needless to say, this program was created with an eye toward a potential smart phone application in the future. It should be noted that the puzzle solver is created entirely in Matlab and heavily utilizes functions found in the Matlab Image Processing Toolbox.
Solving jigsaw puzzles using computer vision (CV) is an attractive problem, as it benefits directly from many of the advances in the field while still proving to be both challenging and intellectually stimulating. Due to time constraints, manpower limitations, and the fact that I had little to no prior experience with many of these concepts or the Matlab Image Processing Toolbox when beginning this project, several assumptions about the problem were made up front in order to make it simple enough to solve in the time given. The primary assumptions and limitations are as follows:

1. The pieces in the source image do not overlap, nor do they touch.

2. The source image is captured in a "top-down" manner with minimal perspective distortion of the pieces.

3. The pieces in the source image comprise one entire puzzle solution (you cannot mix and match pieces from other puzzles).

4. The final puzzle is rectangular in shape and all pieces fit neatly into a grid.

5. The pieces of the puzzle are standard or "canonical" in shape; this means they are square with 4 distinct, sharp corners and four resulting sides.
6. All intersections of pieces in the puzzle will be at the corners of the puzzle pieces, and all internal intersections will be at the corners of four pieces.

7. Each side is characterized by having a "head," a "hole," or by being "flat."

In the following paper, I will discuss the previous work that has been done in this area in Section 2, emphasizing the papers that most influenced the methodology I followed for my puzzle solver. I will then describe my technical approach to the problem and how my puzzle solver works in Section 3. In Section 4 I will show some of the results obtained with the puzzle solver so far and discuss both the experimentation that has been conducted and areas where more experimentation could occur. Finally, I will wrap things up with a conclusion in Section 5.

2. Related Work

As one can imagine, a problem like solving jigsaw puzzles attracts a good number of people in the computer vision community, and indeed it has. The problem has been considered for decades, going back to H. Freeman and L. Gardner [2] in 1964, who first looked at how to solve puzzles with shape alone. Then there is H. Wolfson et al. [4], who describe the very matching methodology that I use; they were able to assemble a 104-piece puzzle using this method in 1988. While I do not solve a 104-piece puzzle, their solution requires individual pictures of each piece, though they do solve it with shape alone. One area where their method differs from mine is that it can handle two intermixed puzzles, whereas mine can only handle the pieces from one puzzle at present. D. Goldberg et al. [3] expanded on Wolfson's work and developed an even more global approach to solving jigsaw puzzles; their method allowed for the solution of puzzles that did not necessarily intersect at corners. My inspiration for color matching across boundaries comes from D. Kosiba et al. [5], who proposed methods for using color in addition to shape in 1994.
Some insight into how to use Matlab to help solve this problem came from detailed student papers found with a Google search. A. Mahdi [8] from the University of Amsterdam and N. Kumbla [6] from the University of Maryland both attempt the problem using methods similar to what I end up using, though neither arrives at a fully satisfactory solution, and both rely on high-resolution and highly controlled inputs. Finally, my inspiration for creating a smart phone application comes from L. Liang and Z. Liu [7] from Stanford, who do not use the same matching methodology (they use the actual image of the fully constructed puzzle and SURF/RANSAC), but who do try to implement their solution in a real-time smart phone application. This is a possibility for my puzzle solver in the future.
Lastly, a former student of CS231A, Jordan Davidson [1], did his project in this very area, though in a slightly different vein. He looked at a genetic algorithm that could solve large "jigsaw" puzzles with square pieces, using the information in the pieces to determine whether it had found the correct match. While the algorithm is interesting and probably applicable on some level to my puzzle solver, it was not quite what I was looking to do for this project. Jordan's work appears to be in an area that is growing, with others attempting to solve larger puzzles of this kind. Personally, I wanted to solve puzzles as they are seen and manipulated in real life. Most of that area of research has other applications and was not quite what I was looking to do, though, as I said, some of the algorithms definitely apply to my ultimate solver, and future iterations may look at something like the algorithm explored in Jordan's paper.

3. Puzzle Solver Technical Approach

Creating a program to construct a puzzle using an image of the pieces requires a number of steps, each of which can be executed in a number of ways. In this section, I will describe the methods I used in my final code, but I will also discuss alternatives that were either attempted with suboptimal results or that were not used but could be in future iterations.

3.1. Image Capture and Segmentation

In order to capture the pieces to be assembled into a final puzzle, I used a fairly high resolution camera, a Canon Rebel T4i DSLR with 18.0 megapixel resolution. The pieces were placed face up on an easily segmentable background (i.e., a "green screen"), with great care taken to ensure they were not overlapping (see Figure 1 for an example using the Wookie puzzle). The picture was taken from a "top-dead-center" position looking straight down onto the pieces in order to reduce perspective distortion. Lighting was kept as neutral as possible, with consideration given to sources of glare and to possible disproportionate lighting of some pieces over others. With more time, the ability to compensate for off-axis image capture of the pieces (i.e., rectification to the ground plane) could be built into the code, though that was not explored for the current incarnation of the solver.
As discussed earlier, with an eye toward an eventual smart phone application, the first step I take is to significantly reduce the resolution of the input image from the original, in order to shrink the memory burden and make the resulting image more comparable to one that might be obtained with a smart phone. Once I have resized the image (960x1440 was the main resolution used for the test cases), I use a Gaussian filter (with σ = 1) to blend the edges prior to the actual segmentation.
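The solver itself is written in Matlab with the Image Processing Toolbox; as a rough Python/OpenCV analogue of the preprocessing and green-screen segmentation in this subsection, consider the sketch below. The HSV thresholds for "green", the working resolution tuple, and the minimum piece area are illustrative values, not those used in the project.

import cv2
import numpy as np

def segment_pieces(image_bgr, target=(1440, 960), min_area=2000):
    # target is (width, height) for the downscaled working image.
    small = cv2.resize(image_bgr, target)
    smooth = cv2.GaussianBlur(small, (0, 0), sigmaX=1.0)      # sigma = 1
    hsv = cv2.cvtColor(smooth, cv2.COLOR_BGR2HSV)
    green = cv2.inRange(hsv, (35, 60, 40), (85, 255, 255))    # green background
    mask = cv2.bitwise_not(green)                              # pieces = not green
    # Connected components play the role of Matlab's bwconncomp/regionprops.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    pieces = []
    for i in range(1, n):                                      # label 0 is background
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            pieces.append(labels == i)                         # per-piece binary mask
    return smooth, pieces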
Figure 1. Input Image for the Wookie Puzzle

Figure 2. Polar Plot of Boundary for One Puzzle Piece


Since segmentation can be done in a multitude of ways, the approach I use is to put the pieces on a solid green screen background and segment based on color. This was done for ease and efficiency, since segmentation is not the main focus of the project. However, as with the additional adaptation of the code to varying image capture angles, it would be entirely possible to segment the image using other means, such as the mean shift or normalized-cut methods used in class. These other methods have the potential to make the code more robust to varying backgrounds, such as wood table tops, or to off-axis image capture. These methods were considered but not completed, since the puzzle assembly was considered more important to the overall program.

The primary result of the segmentation is a binary mask that we use to obtain the location and extent of the individual pieces using the bwconncomp and regionprops Matlab functions. There are provisions in the code for cleaning up somewhat imperfect masks that may have resulted from specks of dirt blocking the green background, or from cases where the green background does not fill the entire image and there is a segment along an edge where the true background (i.e. the floor or table) peeks through.

3.2. Puzzle Piece Characterization

Once the pieces have been segmented from the background and a binary mask created, the individual pieces are broken down further and many traits are extracted that will be used in later matching.

The first step is to define the corners. As per the assumptions about the puzzle pieces used, each piece has four well-defined corners with an angle approximating 90 degrees. Several approaches for extracting either the corners or the sides were attempted before ultimately landing on the method used in the puzzle solver. One of the methods involved using the Hough transform to find the predominant direction of the points along the border, while another, found in [5], involved converting the border's XY-coordinates to polar coordinates and then using this information to find the predominant direction of the points, taking the average of the intersection, and then taking the closest point on the boundary. While these methods may have eventually been adapted for my purposes, I had issues getting them to work consistently and instead landed on a method based on the one presented in [6], where I convert the border points to polar coordinates and use the Matlab function imregionalmax to find peaks in the data (other Matlab functions were tried, but did not return as favorable results). See Figure 2 for a visualization of what is returned for an individual piece. Unfortunately, since the border is somewhat noisy, there are many more peaks than actual corners or heads. Thus I wrote a script that steps through the peaks, consolidates the multiples to the true peak in the θ region, and tries to discriminate whether the peak is a corner or a head and, if a head, eliminates the peak. The script then uses the remaining peaks to determine the corners, which are two sets of peaks separated by 180 degrees that alternate.

Once the four corners of the piece are found, the sides of the piece are simply taken to be the points along the boundary between each set of two corners. The next most significant task is to assign each side of each piece as a head (+1), a hole (-1), or flat (0); see Figure 3 for a visual. This can be done in many ways. The original method I used would take the points along the edge in XY-coordinates, re-align them such that the two corners each lie on the x-axis, and then take the integral. If there was significant area either above or below the axis created by the two corner points, then the side could be declared either a head or a hole; otherwise it was considered flat. However, there were issues with this method that were brought to light when experimenting with one of the puzzles. Specifically, there was difficulty setting a single pixel-integral threshold for all of the sides when the piece was more rectangular than square. One
solution to this problem was to divide the resulting integral by the length of the side, but even that did not completely solve the issue. So instead I developed a new method that uses the "height" of the midpoint of the side relative to the x-axis as defined by the rotated side. If the height is beyond a certain threshold, the side is considered a head or a hole; otherwise it is considered a flat side.

Figure 3. Example Piece with Labeled Sides and Example Color Patches

Once the sides of the piece are determined and the information stored, the next step is to gather color information about the piece that will be used in the matching process. By using the mask created for the individual piece, I can use Matlab functions to back out intensity levels for individual color channels for the entire piece. At first I grabbed a lot of information, and much of it is still in the code. This includes average, minimum, and maximum intensity levels for each of the RGB color channels as well as each of the HSV color channels. After rather exhaustive experimentation with the matching algorithm, however, most of this information proved to be hit or miss when it came to true color coordination across pieces. Since this is information about the entire piece, it is primarily used to find regional likeness in the final puzzle image. After doing some research, one method for finding how similar two colors are in the spectrum is by calculating the ∆E, which requires the color values to be in CIE's L*a*b* color space. So, ultimately, the color values for each piece are translated to this color space and the average "intensity" in each of those channels is found and stored.

Not only do I store color information about the entire piece, but also color information along each edge. Using a method similar to [5], I identify small patches of pixels (I found three 2x2 patches to work well) distributed evenly along the edge, grab the average L*a*b* color information in each of the patches, and store it for the matching process (see Figure 3 for an example). Of course, I only do this along the edges with holes and heads – it is unnecessary along the flat edges. As with the piece information before, I also gather the HSV values for each of the patches, but no longer use them in the matching process after receiving mixed results.

3.3. Puzzle Assembly

When discussing the puzzle assembly, I have decided to break the process down first and foremost into "global" versus "local" assembly, and then break the "global" assembly down further into two distinct areas of the puzzle: the border pieces and the inner pieces. This is a method that is discussed and used in most of the papers I reviewed, specifically in [4] and [3], and it seems like a reasonable approach to the problem. It also seems like a logical approach given how one might go about solving a jigsaw puzzle in the physical world. As such, I will be breaking this section into subsections along those lines, beginning with the discussion of how two pieces are matched.

3.3.1 Local Assembly – Piece-to-Piece Matching

I am beginning the discussion of how the puzzle is assembled by describing the method by which the pieces are matched to one another. While the algorithm works slightly differently depending on whether we are assembling the border or the inner pieces, the specific match parameters remain the same. I will describe the general case of two pieces being matched together and then discuss the specific nuances for each of the two global situations.

What this algorithm is ultimately trying to find is a "match" between the edge of one piece and the edge of another piece. A "match" is defined as two pieces that "fit" together. Ideally, the two pieces that "fit" the best would also be the "correct match," so one would think that if the head of one piece is similar to the hole of another piece, they would fit and we could move on. However, when trying to do this process with little to no human input, finding the correct match is not as easy as finding two shapes that are simply the inverse of one another, especially when one throws in measurement noise due to imperfect segmentation and image distortion. So instead, I try to use additional information about each piece in order to compute a "score" for a potential match.

After extensive experimentation, the factors that two neighboring pieces appear to have most in common are side length, color along the edge, and overall piece color. Additionally, I looked at the difference between the overlaid
curves along each edge to determine an average overlap (or gap) – obviously, the smaller the overlap or gap, the better the match. With these factors in mind, here is how the matching algorithm works.

Once the algorithm determines the two pieces and the side of each piece to be matched, it first checks to make sure one is a head and one is a hole. If not, the match is discarded. It then compares the side lengths and the integrals along each edge. If the difference in side lengths is significant (beyond approximately 10 pixels for the standard resolution image), the match is discarded. The integral of the curve along each edge is also compared; if the integrals are not approximately equal and opposite, the match is discarded (the threshold for this varies based on image resolution – it turns out this is not the best method for weeding out candidate matches, so it is not the most strict check). Next, the overlap is calculated by overlaying the two edges in XY-coordinate space and taking the difference (using pdist). The result is the overlap, or gap, between the two pieces. If the average, minimum, or maximum overlap is beyond specific thresholds based on image resolution, then the match is thrown out. Once we have looked at the basic shape discriminants for determining whether a match is likely correct, we then look at the color discriminants and determine the ∆E. We do this both from a regional perspective (piece to piece) and from an edge perspective (using the patches along the edge). ∆E is essentially the "distance" between two colors in the color spectrum and is determined using the following formulas:

    ∆L* = L*_1,avg − L*_2,avg                      (1)
    ∆a* = a*_1,avg − a*_2,avg                      (2)
    ∆b* = b*_1,avg − b*_2,avg                      (3)
    ∆E  = sqrt((∆L*)^2 + (∆a*)^2 + (∆b*)^2)        (4)

The ∆E is found for each pair of patches along the edge, and the average of those patch values is used. Obviously, the lower the ∆E, the closer the colors are in the spectrum.

Once we have computed and captured these shape and color comparisons and have weeded out obviously bad matches, we then compute a match "score." This score is found using experimentally determined weights that are multiplied by the four key matching criteria: side length difference, overlap difference, ∆E along the edge, and ∆E between the pieces. Since these values tend to vary widely between matches and puzzles, the standard range of values found for "correct" matches on one of the test puzzles was used to develop a set of weights that somewhat normalizes each parameter, so that no single criterion is favored too much more than another. For most correct matches, when multiplied by the weights, each value should be no greater than 100. This means that, theoretically, a correct match could have a total score of up to approximately 400. However, in reality, most correct matches have low scores in at least two of the categories, so a score threshold of about 280 (again, experimentally determined) can be set in order to weed out incorrect matches.

Once the local matching algorithm has pared down to a final set of scored matches, it returns these matches to the global algorithm in score-priority order. The primary nuance that differs between local matching for the border versus the inner pieces is the orientation of the piece. As I will soon discuss, the border pieces are aligned according to the flat edge, so the local matching only considers one potential edge for each piece. And, because the border is being matched in the absence of the rest of the puzzle, the local matching algorithm does not have to consider any additional sides from other pieces that may come into play. Not so for the inner pieces. For the inner portion of the puzzle, a piece is being matched to a slot in a grid, and that slot has neighboring sides. As will soon be discussed, the local matching algorithm will always have at least two sides for each internal piece, but could have upwards of three or four to consider, depending on the location of the slot and what pieces have been matched so far. In the inner case, the local matching algorithm has to ensure that the heads and holes all line up first, as before, but then determines all of the matching metrics per side and takes the average over the number of sides. The major difference here is that a single piece could fit into a slot in multiple ways, so piece orientation must be accounted for and the piece must be rotated into all valid configurations before the algorithm returns a set of matches. A single piece could potentially have four possible "matches" to a single slot, depending on hole/head orientation. The match thresholding and scoring are the same across pieces and edges in the inner piece matching as in the border matching, the only true difference being that the scores are averaged across each piece/edge to which the piece is matched in the inner case (which is unnecessary in the border case).

3.3.2 Global Assembly – the Border Pieces

It is logical to begin the global assembly with the border because the border pieces are distinct – they each have at least one flat side. Since the piece matching algorithm is not perfect and does not always return the "correct" match as the "best" match, a so-called "greedy" algorithm that simply places the "best" match between two pieces in the next available slot will not necessarily result in a coherent solution (i.e. one might get something that is non-rectangular or even nonsensical). In order to provide for this possibility while not resorting to a "brute-force" approach that runs through every possible combination of pieces, I decided to use a so-called "branch-and-bound" algorithm, normally used in the solution of the Traveling Salesman Problem (TSP). In the general description of the TSP, a salesman
needs to do business in a number of cities spread out over a region with defined distances between each. The salesman wants to find the shortest overall route that visits every city exactly once and returns to his starting point – thus it is a distance minimization problem.

Much like the TSP, each match made between two pieces along the puzzle border is given a score. Once a solution is found, the total of all of the match scores that make up that border solution should reflect how good the solution is. Ideally, the smallest overall score will be the best and correct solution. However, that turns out not always to be the case, as will be discussed.

Since there are many ways of reaching a solution, we need a way of capturing a large number of possible solutions and then finding the best one of those potential solutions. One way of doing this is the branch-and-bound method. This algorithm will be discussed shortly, but first I will describe the greater methodology for how the border is constructed.

The general construction of the border begins with a corner piece. We orient the piece such that the counterclockwise-most flat side is "down," with the other flat side to the "left" and a head or hole to the "right." Matching is then done to the "right" in a sequential manner. The local matching algorithm, then, receives a left piece and a list of possible right pieces, all with the flat side down. It returns a list of potential right pieces. We then choose one of the right pieces to be the new left piece and continue the process until we run out of possible right pieces. As we progress around the border, when we hit another corner, we rotate the entire puzzle and continue as if the flat side of each piece is "down."

Now, since the potential number of solutions is (n − 1)!, where n is the number of border pieces, we want to constrain the number of solutions found through whatever means necessary. The local matching algorithm does a good job of weeding out very poor matches, but will still return multiple potential matches that could lead to a nonsensical full solution. In order to combat this, we also place a set of side-length constraints on the puzzle such that the border solution must have side lengths corresponding to a rectangular puzzle. If we have gone too long without a corner piece, or if we get sides defined by corner pieces that do not equal one another, the solution is thrown out and we search for a new one.

After all of this preamble, I am now going to describe the basic branch-and-bound algorithm that is used for the border matching problem. The algorithm can be described in the following manner (a simplified code sketch is given at the end of this subsection):

1. Choose a starting piece (usually a corner) and call it piece A.

2. Use piece A to find a set of potential matches – these can be considered "children" or "branches."

3. Since the local matching algorithm returns a rank-ordered list of potential matches, start with the first potential match as piece B.

4. Next, remove piece B from the list of pieces remaining to be matched and make piece B the new piece A.

5. Use the new piece A to find more potential matches.

6. This process continues until one of the following occurs:

   (a) We run out of pieces remaining – in this case we have found a solution or "leaf." We store this set of matches and their scores as a solution, back up, and see if we can find more solutions.

   (b) We run into a border constraint that is not satisfied – in this case we back up and see if another match does satisfy the border constraint before moving on.

   (c) We do not get any potential matches for the current piece A – in this case we need to back up and see if we can find another path using a different piece from an older set of potential matches.

The algorithm either runs until it has exhausted the search space and found all possible solutions based on both the side constraints and the local matching thresholds, or until it has obtained the number of solutions requested of it. As can probably be surmised from the basic description above, one can visualize this approach as a tree with the first piece at the root and branches extending upward for each potential match. If we are able to make it all the way up a branch to a leaf, then we have found a solution to the problem. If we get stuck on a branch and cannot expand, we come back down the tree until we find another path that looks fruitful. In this way we can reduce the total number of solutions tried to well below that which would be found through simple brute force.

As one can see, because the problem is being solved in a nonlinear fashion, the number of solutions that might be found before the "correct" solution is highly dependent on several factors, not least of which are the first piece chosen and how well the first few pieces match. If incorrect matches are made early in the process, it can take a long time (and a lot of matches) before the correct solution is found. And even when the correct solution is found, it may not be the "best" solution as per the scoring system. Ideally the "best" and "correct" would be the same, but that is not always the case.
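To make the search concrete, below is a minimal Python sketch of the branch-and-bound border search described in the numbered steps above. It is a simplification of the report's Matlab implementation: local_matches and is_corner are assumed helper functions (the former returning candidate pieces with their scores, best first), and the rectangular side-length constraint is reduced to a check that opposite sides between corner pieces are equal in length.

    def solve_border(pieces, local_matches, is_corner, max_solutions=2000):
        """Depth-first branch-and-bound over border orderings.

        pieces        : list of border piece ids
        local_matches : fn(piece_a, remaining) -> [(piece_b, score), ...] best first
        is_corner     : fn(piece)  -> True if the piece has two flat sides
        """
        solutions = []

        def violates_constraints(path):
            # Partial borders that cannot form a rectangle are pruned: opposite
            # sides (runs of pieces between corner pieces) must be equal in length.
            corners = [i for i, p in enumerate(path) if is_corner(p)]
            sides = [b - a for a, b in zip(corners, corners[1:])]
            if len(sides) >= 3 and sides[0] != sides[2]:
                return True
            if len(sides) >= 4 and sides[1] != sides[3]:
                return True
            return False

        def search(path, remaining, total_score):
            if len(solutions) >= max_solutions:
                return                                   # enough solutions collected
            if not remaining:
                solutions.append((total_score, list(path)))   # reached a "leaf"
                return
            for nxt, score in local_matches(path[-1], remaining):
                path.append(nxt)
                if not violates_constraints(path):
                    search(path, remaining - {nxt}, total_score + score)
                path.pop()                               # back up, try another branch

        start = next(p for p in pieces if is_corner(p))
        search([start], set(pieces) - {start}, 0.0)
        return sorted(solutions, key=lambda s: s[0])     # lowest total score first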

3.3.3 Global Assembly – the Inner Pieces

Once a border solution is found, it is passed to the global assembly for the inner pieces. The global assembly algorithm assembles the border pieces into a grid and then grabs the "upper left" open slot as the new "piece A." This algorithm also creates a grid with relative orientations for each piece. When the pieces were first characterized, they were each oriented with "side 1" being "up." Once they are placed in the final puzzle grid, we create a second grid with the same dimensions that provides the relative 90-degree rotation from "up" for each piece (0-3). Because we know we have the border completed, we can be assured that the first slot will have at least two pieces along its edges. The more pieces along an edge, the more accurate and discriminating the score should be. The inner piece matching algorithm then uses a similar branch-and-bound algorithm as in the border case to find potential matches for this first slot. It then removes the piece from those remaining and moves across the puzzle, filling in all available slots from left to right and then top down (like reading a book). Unlike in the border case, however, the potential matches could involve the same piece, just oriented differently. For this algorithm, orientation is very important. As with the border assembly algorithm, once all of the pieces are placed, that solution is stored and we then back up and see if we can find more, until either we run out of solutions or we have found the number of solutions desired. Ideally, the solution with the lowest total score (i.e. the "best" solution) will also be the "correct" solution.

One last note about the global assembly algorithms: while it might make sense to keep the two algorithms separate during development and while trying to understand where each breaks down, the ideal case would be to combine them in order to weed out border solutions that do not lead to full puzzle solutions. This was thought about, though not implemented in the final code. Had there been more time, this would have helped to bring down the total number of end puzzle solutions. As it was, in the time allowed, I was simply able to get both of these algorithms working well enough to tweak the various variables to see how best to find matches. The next step would be to link these two algorithms and throw out border solutions that have no potential inner solutions based on all of the criteria discussed above.

3.4. Final Image Construction

Once we have assembled all of the pieces into a grid with their relative orientations, we have the solution to the puzzle. The next step is displaying that solution to a user. Ideally, I would like to display a completed and fully stitched-together puzzle image using the puzzle pieces as segmented from the original image. While this is most likely possible using the Matlab Image Processing Toolbox, it was not completed in a satisfactory manner by the end of this project. Instead, I use the functions vision.AlphaBlender and step, along with the piece masks and the cropped segmented pieces from the original image, to create a quasi-final image "grid" that shows the extracted pieces oriented per the solution. It is not ideal, but it at least shows how the final pieces should be laid out and arranged. For reference, see Figure 4 for an example of the final solution.

4. Experimentation and Results

I ran my puzzle solver in Matlab on both a home PC with 4-year-old hardware (6 GB RAM, Intel i5 processor, AMD graphics card) and a 13-inch MacBook Air, 2015 model, with 4 GB memory and Intel graphics, with little difficulty. It takes about a minute or so to run one of the test puzzles (it might take more than a minute for the 24-piece puzzles) from image segmentation through to final construction. I carried a good amount of information in memory throughout the process, since I was doing a lot of experimentation and wanted the ability to plug and play various modules for both fine-tuning and debugging. This could be pared down for a future implementation.

While the puzzle solver created for this project cannot be used on every puzzle (per the limitations noted earlier), for those it could be used on, I was able to experiment to find limitations and weaknesses. I also used this experimentation to find the best criteria for matching.

For this project I tested the puzzle solver on six Star Wars themed children's puzzles that I bought at Target. An overview of the results using the final matching parameters is in Table 1.

Table 1. Test Puzzle Results

    Puzzle Name     Total Pieces   Border Pieces   Border Soln   Inner Soln
    Wookie          12             10              1st           1st
    Storm Troops    24             16              58th          5th
    Droids          12             10              1st           1st
    Speeder         16             12              1st           1st
    Rey Finn        12             10              1st           1st
    Kylo Ren        24             16              N/A           N/A

As one can see, most of the puzzles had 12 to 16 pieces, except for two that had 24 pieces. For all of the puzzles that had fewer than 24 pieces (four of the puzzles), the correct border solution was the best border solution returned, by score (hence the "1st" in the Border Soln column). Then, using the correct border solution as the lead-in to the inner puzzle algorithm, those same puzzles found the correct solution to be the one with the best score.

For the Storm Trooper puzzle, there were two primary
factors that led to it not doing as well. First, there are more pieces and therefore more potential matches. Still, if the matches were registering scores that reflected the true "correctness" of the match, then one would expect the overall score of the completed border to rank better than 58th. And even when we fed the correct border solution to the inner puzzle algorithm, the correct solution was only 5th best by score. However, to put this into perspective, the global border solution algorithm returned in excess of 2000 potential border solutions for the Storm Trooper puzzle (of a mathematically possible 15!, approximately 1.3 trillion, solutions with brute force), of which the correct one was 58th by score – which is not all that bad. Additionally, the Storm Trooper puzzle was by far and away the most homogeneous in terms of color of all the puzzles. It was very difficult to discriminate matches based on color using the methods I described before, especially because of the way the storm trooper line discontinuities happen to match up along the borders of the pieces, making cross-border color matching very difficult indeed. And finally, the pieces were fairly square, so all four sides were very even and comparable in length. If they were instead more elongated, with one pair of sides longer than the other, the side-length discriminant would have knocked down potential matches.

This was a case where the experimentally determined match scoring algorithm broke down. While it worked very well in the other test cases, one can see quite clearly that other methods would need to be pursued in order to get the Storm Trooper puzzle to be solved correctly.

The other puzzle in the table that was looked at as a test case but does not have a rank for a solution is the Kylo Ren puzzle. This puzzle highlighted the need for a better corner-finding or edge-finding process. While the code I developed to find the corners worked quite well and repeatably on the other puzzles, the pieces of the Kylo Ren puzzle were extremely elongated and many of the "heads" were so small as to be mistaken for corners. Needless to say, the automatic corner-finding was not able to find the corners, and without them the rest of the algorithm simply does not work as is.

When experimentally determining the criteria to use for the piece matching, individual matches were observed with special attention paid to the values for correct matches. The Wookie puzzle was used as the baseline case, and the values derived from this puzzle were applied to the others with general success (except for the Storm Trooper puzzle). Here were the typical values for correct matches and the final weights applied:

• Side Distance Difference: 0-8 pixels (wt = 12.5)

• Overlap Average Difference: 0-14 pixels (wt = 7.0)

• ∆E Patches: 7-45 (wt = 2.9)

• ∆E Pieces: 4-35 (wt = 3.2)

While it would logically seem, and in most cases it is actually true, that matching the color patches across the boundaries should be one of the best ways to discriminate in order to find a true match, due to the variability in how the puzzle pieces are carved up this was not always the case. For instance, one piece was carved almost perfectly along the Wookie's nose, which is dark, and just on the other side of the edge there was a bright background. The ∆E in this case was fairly large even though the match is a correct one. In fact, this was also a case where the puzzle pieces had a small ∆E between them, but the patch difference was much higher – not an expected result. And while one might begin to think that maybe color is too volatile and should not be considered at all, since the perceived variability in the shape is much smaller among true matches per my list above, that is not entirely accurate. Due to noise and distortion, the actual length of the sides and the measured overlap are not exact. And while these differences are always small for the correct match, they are also small for many other matches. These criteria are best for weeding out those pieces whose shape is not even close to correct. They can also help when the head of one piece is bent in one direction while the hole of the other piece is expecting it to be bent in another – then the overlap will suffer. For the most part, however, the size helps get you close. Unfortunately, many of the pieces have differences that fall within the acceptable ranges above. That is why color is then used to help with the ranking of those potential matches. And in most cases, the color does help. There are just a few cases in every puzzle where there are large transitions in either the puzzle region or just along the border that cause the match scoring to return some interesting values.

Additional parameters that were used early on for color scoring were RGB and HSV channel averages. While in some cases there was clear correlation, in many others there did not appear to be any correlation whatsoever. Color variance was also considered, though it was disregarded because, after some thought, I could not see how it would return a marked improvement. The extreme variability in the color scoring led to a rethinking of how the colors were being compared and the eventual use of ∆E.

Additional methods for finding border matches that were considered but could not be implemented before this report were:

• Segmentation along the border (such as mean shift) – if we could determine that a certain number of segments along one border coincide with a certain number of segments along another, then maybe we could find a potential match.

• Find lines along a border that break at the border, then look for the continuation of these lines on the other side. This would have something to do with the flow of pixels – it seems difficult to implement, though it would
really help with the Storm Trooper puzzle.
• Grab features within the head piece or along the edge
and build a Bag of Words model. Then try to find
matches on the other side of the piece around the hole.
This has potential, but could be computationally expensive.
While the matching algorithm used isn’t perfect, it worked
for the test cases considered. And while others may have
been able to solve larger puzzles [4][3] with their algo-
rithms, my code proved to be fairly robust and efficient at
solving the puzzles provided. Since there appear to be no
standardized “jigsaw puzzle metrics” against which to com-
pare my puzzle solver for puzzles with irregular shapes, I
cannot say exactly how my puzzle solver compares to oth-
ers that have been developed. However, it is one of the few
that I’ve seen that takes a raw image of all of the pieces
at once and produces a fully constructed solution. Most of
the puzzle solvers found in official papers and in student
submissions cited earlier produce only partial or theoretical
solutions, or solutions that require even greater initial con-
straints than my own (i.e. the pieces have to lie in a grid at the outset or each piece has to be scanned individually with a high resolution scanner). Still others rely on the original image to find the location of the pieces in the final image, which is not the problem I set out to solve. One example using the Wookie puzzle can be seen with the original image in Figure 1 and the final solution as found by my puzzle solver in Figure 4.

Figure 4. Solution Created by the Automatic Puzzle Solver

5. Conclusion
I have created an end-to-end “automatic” jigsaw puzzle
solver that uses Matlab and the Matlab Image Processing
Toolbox to piece together a jigsaw puzzle using only an
image of the pieces. I used this puzzle solver on six test
puzzles and proved that it works on five of them quite reli-
ably, but also found where there were weaknesses in the cur-
rent implementation. Certain design decisions were made
early on that simplified the problem such that I could com-
plete the entire project by the deadline. Unfortunately, this
also meant it was hard to go back and try a completely new
method once I had begun going down a certain path.
I learned a lot over the course of this project. I learned
about how to think about a 3-dimensional, physical world
problem in terms of a 2-dimensional perception of that
problem. I learned how to think about manipulating ev-
ery ounce of information I could glean from a single photograph to help the computer "think" like a human and make matches that would result in a correct solution. I learned about many functions inherent within Matlab, especially the Matlab Image Processing Toolbox. As I developed the program, I learned new tools and tricks that, had I known them earlier, I may have approached certain parts of the project differently. This knowledge will certainly be helpful in the future and could be applied to improving the puzzle solver.

Figure 5. The "Truth" – A Picture of the Assembled Wookie Puzzle

Several of the papers I read where people have attempted
this problem in the past did not make much sense to me un-
til I went and attempted it myself. I had believed it would
be easier to extract the corners of the four-sided pieces, and
therefore decided to go with canonical piece jigsaw puzzles.
However, this then meant I was fairly limited in the types
and numbers of puzzles to which my program could ap-
ply. It also meant that extraction of this information was ab-
solutely essential to everything my program did afterward.
Some of the other methods, like the use of fiducial points
as in [3], may have proven more difficult at first, but could
have paid dividends in scalability.
If my end goal was to create a program that begins to explore the possibility of an automatic jigsaw-puzzle-solver smart phone application, which was the original idea, then I believe I have made a great stride in that direction. However, my code is not yet robust
enough to the kinds of inputs a smart phone might provide,
nor is it efficient enough in both memory allocation and
processor requirements to be feasible for that application.
Many changes would have to be made before I can get to
that end goal, which is something I realized about halfway
through the project. While I believe I have created a solid
and workable solution within the constraints of the problem
as I originally set forth, I see many areas where it could be
improved for future incarnations. All in all, I did what I set
out to do, I learned a lot, and I enjoyed the process.

References
[1] J. Davidson. A genetic algorithm-based solver for very large
jigsaw puzzles: Final report.
[2] H. Freeman and L. Garder. Apictorial jigsaw puzzles: The
computer solution of a problem in pattern recognition. IEEE
Transactions on Electronic Computers, EC-13(2):118–127,
April 1964.
[3] D. Goldberg, C. Malon, and M. Bern. A global approach to
automatic solution of jigsaw puzzles. Comput. Geom. Theory
Appl., 28(2-3):165–174, June 2004.
[4] H. Wolfson, E. Schonberg, A. Kalvin, and Y. Lamdan. Solving jigsaw puzzles
by computer. Annals of Operations Research, 12:51–64, 1988.
[5] D. A. Kosiba, P. M. Devaux, S. Balasubramanian, T. L. Gandhi, and K. Kasturi.
An automatic jigsaw puzzle solver. In Proceedings of the 12th IAPR International
Conference on Pattern Recognition, Vol. 1 – Conference A: Computer Vision &
Image Processing, volume 1, pages 616–618, Oct 1994.
[6] N. Kumbla. An automatic jigsaw puzzle solver.
[7] L. Liang and Z. Liu. A jigsaw puzzle solving guide on mobile
devices.
[8] A. Mahdi. Solving jigsaw puzzles using computer vision.

Database-Backed Scene Completion

Alex Alifimoff
aja2015@cs.stanford.edu

Author's note: I liberally use the first person plural pronoun, "we", in this paper, as I am used to
working on group projects. Rest assured, I am the only author of this project.

Introduction

Ever have an almost-perfect photo? Maybe it’s that photo of the beach that your dweeb
uncle stepped in front of, or that wedding photo ruined by the donut truck driving in the
background. Scene completion is the task of taking a photo and replacing a particular
region of that photo with an aesthetically sensible alternative. In this work we demonstrate
our implementation of a method originally introduced by Hayes & Efros, which produces
interesting scene completions by utilizing a very large database of images.

Previous Work

There are many different approaches to the problem of scene completion. One possible
approach is to use multiple images (either from multiple cameras, multiple pictures, or
video) to determine exactly what kind of information was in the masked part of the image,
and then adjust that information appropriately and place it back into the original image.
[2, 3]
A second common approach is to utilize information from the image itself to attempt to
guess what kinds of missing information should be used to fill the masked part of the image.
[3] The majority of these approaches involve utilizing nearby textures and other patterns
from the input image to fill the scene.
This project follows the methodology outlined by Hayes and Efros [1], who differ from previ-
ous approaches in that they try to complete the scene by finding plausible matching textures
from other photographs. In particular, the implemented system searches thousands to (ide-
ally) millions of photographs to find globally matching scenes, and then utilizes texture
patterns from those scenes to fill the missing hole in the input image.

Key Improvements

The significant improvement of the Hayes & Efros system over previous image completion
software is the ability to produce ”novel” scene completions through the use of the large
image database that is searched to find global scene matches.
Additionally, the Hayes & Efros system does not place overly stringent restrictions on which
pixels must be used from the source image and which pixels must be completed. All of the
masked pixels must be replaced, but through the use of a novel application of min-cost
graph cutting, the system may decide to replace more pixels in the original image if it
makes for a better fit. This allows interesting completions that simply aren’t possible with
more stringent restrictions. We discuss this in depth in the following sections.

Figure 1: An example input image with corresponding mask (black region)

Technical Approach

The input to the algorithm is an image and a corresponding mask, which indicates which
part of the image needs to be filled.
Generally, we then follow three steps:

1. Semantic Scene Matching. Quickly identify possible images that we could use to fill
the hole in our scene.
2. Local Context Matching. Identify the local context and search all of the remaining
scene matches to find the best local matches.
3. Blending. Perform graph cut and Poisson blending to merge the two images.
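Sketched as code, the three steps compose into a single driver like the one below. This is purely illustrative: scene_match, patch_search and blend are placeholders for the stage implementations discussed in the following sections, and for simplicity only the single best completion is returned rather than several candidates.

    def complete_scene(image, mask, database, scene_match, patch_search, blend):
        """Wire the three stages together (illustrative only)."""
        candidates = scene_match(image, mask, database)        # 1. semantic scene matching
        best_patch, best_score = None, float("inf")
        for scene in candidates:                               # 2. local context matching
            patch, score = patch_search(image, mask, scene)
            if score < best_score:
                best_patch, best_score = patch, score
        return blend(image, mask, best_patch)                  # 3. graph cut + Poisson blending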

Semantic Scene Matching

The first part of the method involves finding images which represent similar scenes to the
image being filled. However, since there are potentially millions of images to search, any
comparison must be done extremely quickly. To implement this part of the algorithm,
we rely upon a scene descriptor called GIST. GIST descriptors build a low-dimensional
representation of the scene which is designed to capture a handful of perceptual dimensions.
The authors of the original GIST paper describe these dimensions as ”naturalness, openness,
roughness, expansion, ruggedness” [4, 5].
GIST descriptors are pre-generated for every image in the database. Once a masked image
is provided, the GIST descriptor for the masked image is calculated. I use GIST descriptors
with 5 oriented edge-responses at 4 scales aggregated to a 4x4 spatial resolution. These
descriptors are slightly smaller than the original ones in Hayes and Efros, but the slight
reduction did not impact performance while slightly improving the time it took to build
the GIST descriptors for the database. We augment each GIST descriptor with a color
histogram with 512 dimensions to capture color information.
The search is then performed using the weighted combination of the GIST and color descriptors
with an l2-distance metric. Each database descriptor is compared to the descriptor of the masked
input image, and the best 100 images are kept for local context matching.
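A minimal sketch of this search, assuming the GIST descriptors and 512-bin color histograms have already been precomputed and stacked into arrays; the 0.5 color weight is a placeholder, not the value used in this project.

    import numpy as np

    def find_scene_matches(query_gist, query_color, db_gist, db_color,
                           color_weight=0.5, k=100):
        """Rank database images by weighted GIST + color-histogram l2 distance."""
        d_gist = np.sum((db_gist - query_gist) ** 2, axis=1)    # (N,) squared distances
        d_color = np.sum((db_color - query_color) ** 2, axis=1)
        dist = d_gist + color_weight * d_color
        return np.argsort(dist)[:k]                             # indices of the k best scenes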

Figure 2: Best GIST matches for leftmost image

Figure 3: A local context mask and an example of a possible local context

Computational Limitations

One of the main difficulties in pursuing this project was the computational resources nec-
essary to implement it in the same manner as the original authors. The original paper
utilized a network of 15 computers to examine millions of images simultaneously. Since
this computing power wasn’t available to me, I downloaded a pre-filtered subset of closely
matching images from Hayes and Efros’ project site to augment the thousands of images
I downloaded independently. This allowed me to get high quality matches to ensure the
rest of the scene completion pipeline worked appropriately. The graphics in this report
were generated from my own database of 200,000 images and the additional images from
the original project site. My database primarily consisted of photos downloaded from the
image sharing website, Flickr, that were tagged ”outdoors”. I restricted the category for the
purpose of getting quality completions for images within the same category.
It took about 12 hours to generate all of the GIST descriptors for the small dataset with
liberal use of multiprocessing. However, this computation only needs to be performed
once. Performing a single nearest-neighbor search of the dataset takes approximately five
minutes.

Local Context Matching

The next step is to find appropriate patches in semantically similar images to use as the
scene completion content. The first step of this process is to determine exactly what the
local context of the input image is. To do this, I first dilate the mask, which produces a second,
slightly larger mask. The image is then cropped to the width and height of the dilated mask.
Subtracting away the original mask leaves a ring of pixels that corresponds to the local context
of the input image: the known pixels immediately surrounding the hole. We will use this context
region to find the "optimal" patch in each matching image to use in our scene completion.
Figure 3 illustrates this process. On the left we show an example dilated mask, where the red
region corresponds to the area of local context we will consider. Then, as we slide this template
across a matching image, we consider patches like the one on the right.
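A minimal sketch of this local-context extraction, assuming a binary uint8 hole mask; the dilation radius below is an illustrative choice rather than the value used here.

    import cv2
    import numpy as np

    def local_context(image, mask, dilation_px=40):
        """Extract the local-context ring around the hole.

        mask : uint8 array, 255 inside the hole, 0 elsewhere.
        Returns the cropped image, cropped hole mask, and the context mask
        (known pixels just outside the hole).
        """
        kernel = np.ones((2 * dilation_px + 1, 2 * dilation_px + 1), np.uint8)
        dilated = cv2.dilate(mask, kernel)

        # Crop everything to the bounding box of the dilated mask.
        ys, xs = np.nonzero(dilated)
        y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
        img_c = image[y0:y1, x0:x1]
        hole_c = mask[y0:y1, x0:x1]
        # Context = dilated region minus the hole itself.
        context_c = cv2.subtract(dilated[y0:y1, x0:x1], hole_c)
        return img_c, hole_c, context_c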

We take each patch and compute a HOG descriptor and a color histogram. We utilize these
as texture and color features and compare them to the local context of our source image.
We use sum-of-squared distances of this feature set to select which patch to use from a
particular image.
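A sketch of the patch search under those features is below. The HOG parameters, the stride, and the decision to compute HOG over the full patch (rather than only over the context ring) are simplifying assumptions for illustration, not values from this project.

    import cv2
    import numpy as np
    from skimage.feature import hog

    def patch_features(patch, context_mask):
        """HOG texture descriptor plus a color histogram over the context pixels."""
        gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
        texture = hog(gray, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        hist = cv2.calcHist([patch], [0, 1, 2], context_mask, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256]).ravel()
        return texture, hist / (hist.sum() + 1e-8)

    def best_patch(scene, query_feats, context_mask, stride=16):
        """Slide the context template over a matching scene; return the
        top-left corner of the lowest-SSD patch."""
        q_tex, q_hist = query_feats
        h, w = context_mask.shape
        best, best_pos = np.inf, (0, 0)
        for y in range(0, scene.shape[0] - h, stride):
            for x in range(0, scene.shape[1] - w, stride):
                tex, hist = patch_features(scene[y:y + h, x:x + w], context_mask)
                score = np.sum((tex - q_tex) ** 2) + np.sum((hist - q_hist) ** 2)
                if score < best:
                    best, best_pos = score, (y, x)
        return best_pos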

Blending

There are two main parts to the blending step. Given a mask and a dilated mask, we have
to use all of the pixels from the patch image for the area of the image covered by the mask.
However, for the dilated mask, we have a decision to make. One particular innovation of
Hayes and Efros’ method is actually choosing to remove more of the original image than the
mask requires.
To determine which part of the original image to keep and which part to patch, we use a
min-cost graph-cut algorithm. We assign each pixel a label, ”original” or ”patch”. We call
the set of all labels L, and we minimize:
    C(L) = Σ_p C_unary(p, L(p)) + Σ_{p,q} C_pair(p, q, L(p), L(q))        (1)

We define the unary cost functions as follows. For any pixel in the region removed by the original
mask, C_unary(p, original) >> 0 (any very large number) and we set C_unary(p, patch) = 0. For any
pixel that is not covered by the mask or the dilated mask, C_unary(p, patch) >> 0 and
C_unary(p, original) = 0. The intuition here is that pixels in the former category must come from
the patch, while pixels in the latter category must come from the original picture. For all of the
rest of the pixels (the ring between the mask and the dilated mask), we define

    C_unary(p, patch) = (k · ||f(p) − f′(p)||)^3        (2)

where f(p) is the location of pixel p and f′(p) is the location of the pixel in the original mask
nearest to p. Intuitively, the farther a pixel is from the hole, the more heavily we penalize taking
it from the patch rather than from the original image. Like Hayes and Efros, we use k = 0.002.
The remaining part of the graph-cut algorithm is determining how to define C_pair. In our
implementation, each pixel is connected to its four-way neighbors, so for pixels that are not
adjacent we have zero cost. For adjacent pixels p and q that are assigned different labels (i.e. the
seam passes between them), we use:

    C_pair(p, q, L(p), L(q)) = ||h(p_patch) − h(p_original)|| + ||h(q_patch) − h(q_original)||        (3)

where h is a function returning the vectorized (RGB) value of a pixel in the indicated image. The
intuition here is that we want to minimize the gradient of the image difference along the seam, as
opposed to the intensity difference.
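The cost construction can be sketched as follows. Solving the resulting labeling problem is left to an off-the-shelf min-cut/max-flow solver (for instance the PyMaxflow package), which is an implementation choice assumed here rather than one stated in this report.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    LARGE = 1e9   # effectively infinite unary cost

    def build_costs(hole_mask, dilated_mask, original, patch, k=0.002):
        """Unary and pairwise costs for the seam graph cut (Eqs. 1-3).

        hole_mask, dilated_mask : boolean arrays (True inside the region)
        original, patch         : float RGB images of the cropped region
        """
        h, w = hole_mask.shape
        ring = dilated_mask & ~hole_mask
        dist_to_hole = distance_transform_edt(~hole_mask)   # distance of each pixel to the hole

        # unary[..., 0] = cost of label "original", unary[..., 1] = cost of "patch".
        unary = np.zeros((h, w, 2))
        unary[hole_mask, 0] = LARGE                          # hole pixels must come from the patch
        unary[~dilated_mask, 1] = LARGE                      # pixels outside the dilation stay original
        unary[ring, 1] = (k * dist_to_hole[ring]) ** 3       # Eq. (2)

        # Pairwise (seam) costs between 4-connected neighbours, Eq. (3).
        diff = np.linalg.norm(patch - original, axis=2)
        pair_h = diff[:, :-1] + diff[:, 1:]                  # horizontal neighbour pairs
        pair_v = diff[:-1, :] + diff[1:, :]                  # vertical neighbour pairs
        return unary, pair_h, pair_v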
Figure 4 shows the patch region before and after the graph cut is applied. Generally the cut causes
a small expansion in the size of the mask.
Finally, Poisson blending is applied to the image and its patch to seamlessly blend the two images.
This ensures that slight differences in color do not ruin the completion attempt. The Poisson solver
is allowed to run on the entire domain of the image and not just the local region it is attempting
to patch.
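As an aside, an off-the-shelf approximation of this final step is OpenCV's seamlessClone. Unlike the solver described above, it blends only within the supplied mask region rather than over the entire image domain, so the sketch below is a stand-in rather than a faithful reimplementation of this project's blending step.

    import cv2
    import numpy as np

    def blend(original, patch, final_mask):
        """Poisson-blend the pixels selected by the graph cut into the original image."""
        ys, xs = np.nonzero(final_mask)                      # final_mask: 255 where "patch" won
        center = (int(xs.mean()), int(ys.mean()))
        return cv2.seamlessClone(patch, original, final_mask, center, cv2.NORMAL_CLONE)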

Results
Here are a number of possible good completions from various input masks and input images.
Generally, when I used the pre-seeded database of gist matches compiled from the Hayes &
Efros site, I got reasonable performance. Additionally, when I used images that were ”out-
doorsy” (this was the image category I primarily downloaded from Flickr), I got reasonable

Figure 4: The patch before and after applying graph-cut

Table 1: Runtime comparison to original implementation

    Phase                      Avg. Runtime    StdDev Runtime    Hayes & Efros
    Semantic Scene Matching    5.2 minutes     0.2 minutes       50 minutes
    Local Context Matching     22.3 minutes    1.2 minutes       20 minutes
    Composition                2 minutes       0.1 minutes       4 minutes

completions. However, when trying to complete images that didn’t have particularly good
GIST descriptor matches in the dataset, the matches could be comically bad.
One of the main take-aways from this project is that this method highly relies upon having
a large dataset available to search for completions. Hayes and Efros required a significant
amount of computation power to search their dataset of 2 million images, and even they
largely restricted the semantic categories in which they downloaded images. Utilizing this
method as a production system for image completion would only be reasonable for companies
that have significant computation power and access to many images, like a search engine
provider or photo-sharing website.
Runtime is another issue of concern with this particular algorithm. As discussed previously,
Hayes and Efros required fifteen CPUs to process a single image in five minutes. On a single
CPU, their algorithm took 74 minutes to run. The average runtime of my implementation across a
sample set of 100 photographs is shown in Table 1. For this particular experiment, we chose to use
the 200 best matching scenes for local context matching (the same number as Hayes and Efros,
chosen for comparison purposes; for smaller databases this number should probably be reduced, as
many of the matches beyond the 20th were quite bad). My implementation was comparable to
Hayes and Efros, despite being implemented in Python.
Quantitatively evaluating the performance of the algorithm in regards to how effectively
it completes images is difficult, as there is no good metric for evaluating the ”realness” of
photographs without doing human evaluation. This evaluation is done by Hayes and Efros,
but sadly I did not have the time or the access to resources to adequately conduct human
trials.

Areas of Improvement

There are numerous situations in which this system fails. We generally classify these errors
into three categories.

1. The first category is failures of scene matching. These are situations in which the
   GIST descriptor identifies scenes that just don't belong together (e.g. filling a mask
   in a tropical scene using an image from the snow).
2. The second category is blending issues. These errors typically occur when a sub-
   optimal image patch is chosen that contains superfluous artifacts, or when the
   graph-cut algorithm chooses to include something it should not.

Figure 5: Mask and possible completions for grassy/forest scene

Figure 6: Mask and possible completions for ocean scene

Figure 7: An example of a high-level semantic issue. Notice the partial rabbit in the grass.

Figure 8: An example of a blending issue

3. The final category of errors includes issues with high-level semantics. These are
   situations in which partial objects are included in the patch, such as part of a rabbit
   filling in a grassy scene because there is otherwise a good local context match. Since
   the algorithm has no notion of objects, this happens quite often.

We include some illustrative examples of each error category.

Final Thoughts

In general, I was quite happy with the output of the system. There was at least one reasonable
completion for the vast majority of input images, provided they fell into the same semantic
category as the images in my database. Largely, this algorithm is effective with lots of data, but
not effective for solving the scene completion problem on a resource-limited budget.

Acknowledgments
I would like to thank Silvio Savarese for an awesome class, and the entire teaching staff for
making a really strong effort to improve the class throughout the quarter.

References
[1] Hayes, J. and Efros, A. Scene Completion Using Millions of Photographs. SIGGRAPH, 2007.
http://graphics.cs.cmu.edu/projects/scene-completion/scene-completion.pdf

Figure 9: An example of a scene matching issue. The forest does not belong in the city!

[2] Irani, M., Anandan, P., and Hsu, S. 1995. Mosaic based representations of video sequences and
their applications.
[3] Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., Salesin, D.,
and Cohen, M. 2004. Interactive digital photomontage. ACM Trans. Graph. 23, 3, 294–302.
[4] Oliva, A., and Torralba, A. 2006. Building the gist of a scene: The role of global image features
in recognition. In Visual Perception, Progress in Brain Research, vol. 155.
[5] Oliva, A., and Torralba, A. 2001. Modeling the shape of the scene: a holistic repre-
sentation of the spatial envelope. In International Journal of Computer Vision, vol 42 (3).
https://people.csail.mit.edu/torralba/code/spatialenvelope/

Deep Drone: Object Detection and Tracking for
Smart Drones on Embedded System

Song Han, William Shen, Zuozhen Liu


Stanford University

Abstract

In recent years, drones have been widely adopted for aerial photography at much lower costs. However, capturing high quality pictures or videos using even the most advanced drones requires precise manual control and is very error-prone. We propose Deep Drone, an embedded system framework, to power drones with vision: letting the drone do automatic detection and tracking. In this project, we implemented the vision component, which is an integration of advanced detection and tracking algorithms. We implemented our system on multiple hardware platforms, including both a desktop GPU (NVIDIA GTX980) and embedded GPUs (NVIDIA Tegra K1 and NVIDIA Tegra X1), and evaluated frame rate, power consumption and accuracy on several videos captured by the drone. Our system achieved real-time performance at 71 frames per second (fps) for tracking and 1.6 fps for detection on the NVIDIA TX1. The video demo of our detection and tracking algorithm has been uploaded to YouTube: https://youtu.be/UTx2-5a488s.

1. Introduction

Modern drones are equipped with cameras and are very promising for a variety of commercial uses such as aerial photography, surveillance, etc. In order to massively deploy drones and further reduce their costs, it is necessary to power drones with smart computer vision and auto-pilot. In the application of aerial photography, object detection and tracking are essential to capturing key objects in a scene. Object detection and tracking are classic problems in computer vision. However, there are more challenges with drones due to top-down view angles and real-time constraints. An additional challenge is the tight weight and area constraint of embedded hardware, which limits the drones' ability to run computation-intensive algorithms, such as deep learning, with limited hardware resources.

Deep Drone is a framework that intends to tackle both problems while running on embedded systems that can be mounted onto drones. For this project, we present our vision system pipeline and its performance along with accuracy on multiple hardware platforms, and we analyze the trade-off between accuracy and frame rate.

2. Related work

Deep neural networks, object detection and object tracking are the three major components of our work. We first present an overview of past work, and then describe our improvements.

Deep neural networks are the state-of-the-art technique in computer vision tasks, including image classification, detection, and segmentation. AlexNet [11] is the classic network proposed in 2012; it has 8 layers and 60 million connections, won the ImageNet contest in 2012, and spawned many later improvements. After that, VGGNet [16] was proposed, which has 16 layers and 130 million parameters. Both AlexNet and VGGNet have bulky fully connected layers, which results in huge model sizes. GoogleNet [17] is a more compact CNN that consists mostly of conv layers; it uses the inception module, which has 1x1, 3x3 and 5x5 convolution kernels at different scales, and it has multiple loss layers to prevent the vanishing gradient problem. ResNet [6] was proposed in 2015 and greatly improved image recognition accuracy; it adds bypass layers to let the network learn the residual rather than the absolute value. SqueezeNet [9] was proposed recently and is aggressively optimized for model size. It has 50x fewer connections and half the computation of AlexNet, but higher accuracy. After Deep Compression [5, 4], the model is only 470KB and fits well into the last level cache.

Fast R-CNN [3] is a fast region-based convolutional network method for object detection. In this algorithm, an input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.

Faster R-CNN [15] made further improvements on Fast R-CNN by introducing a Region Proposal Network (RPN)
that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounding boxes and scores at each position. RPNs are trained end-to-end to generate high-quality region proposals. Because the region proposal network is fused with the detection network and the whole model can be trained end to end, the network is faster than Fast R-CNN.

The YOLO detector [13] is a new approach to object detection. Prior work on object detection re-purposes classifiers to perform detection. Instead, YOLO frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. The unified architecture is extremely fast but not as accurate as Faster R-CNN.

KCF [7, 8] is the kernelized correlation filter used for tracking. It exploits the fact that, under some conditions, the resulting data and kernel matrices become circulant. Their diagonalization by the DFT provides a general blueprint for creating fast algorithms that deal with translations, reducing both storage and computation by several orders of magnitude and yielding state-of-the-art trackers that run at 70 frames per second on an NVIDIA TK1 and are very simple to implement.

MDNet [12] is the state-of-the-art visual tracker based on a CNN trained on a large set of tracking sequences, and the winning tracker of the VOT2015 Challenge. The network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify the target in each domain. The network is trained with respect to each domain iteratively to obtain generic target representations in the shared layers. Online tracking is performed by evaluating candidate windows randomly sampled around the previous target state.

However, the drawback of MDNet is that it needs to run a CNN to extract image features, making it very slow. Considering the frame rate required by real-time tracking, we used the KCF algorithm for tracking, which achieves 70 frames per second on our hardware: an NVIDIA Tegra K1.

3. Contribution

For detection we use Faster R-CNN, with a relatively shallow and small network to extract image features and to perform detection. The architecture is shown in Fig. 1. Even with our small network architecture, the detection frame rate is still low on the TK1 mobile GPU. To compensate for the slow speed of detection, we use the cheap KCF tracker, although it is less accurate than MDNet, to track the bounding box returned by the detection pipeline. Thus we have the accurate but slow Faster R-CNN for detection, and the less accurate but very fast KCF for tracking. Detection is only called when the confidence of the tracker falls below a certain threshold, which happens infrequently. This architecture makes the pipeline accurate, robust and fast.

Accuracy is not our sole target in this project. We provide a thorough evaluation with respect to accuracy, power consumption, speed, and area of the different hardware platforms running the detection and tracking algorithms. Balancing these hardware constraints, rather than only optimizing for mAP, is our top priority.

4. System Architecture

The software architecture of our vision system consists of two components. The first component is a detection algorithm running a Convolutional Neural Network (CNN), and the second is a tracking algorithm using HOG features and KCF. These two algorithms are seamlessly integrated to ensure smooth and real-time performance. The detection algorithm, e.g. Faster R-CNN, is expensive to compute, since CNN-based detection requires many GOPs per frame, and it is only called to initialize a bounding box for the key object in the scene. The tracking algorithm, e.g. KCF, is relatively inexpensive to compute and can run at a high frame rate to track the bounding box provided by the detection algorithm. The main algorithm loop is shown in the pseudo code below, and we discuss the detection and tracking details in the next two subsections.

Algorithm 1 Detection and Tracking Pipeline for Deep Drone
  boxFound ← false
  while true do
      f ← new frame
      while boxFound == false do
          detection(f)                ▷ Invoke detection algorithm
          if Box is detected then
              boxFound ← true
          end if
      end while
Fast and Faster R-CNN originally used VGGNet for fea- tracking(f) . Invoke tracking algorithm
ture extraction. It is accurate but slow. Drones have lim- if Tracking is lost then
ited hardware resource both in memory and in computation boxF ound ← false
power, so we need to have smaller network. In order to run end if
the CNN fast enough on embedded device, we didn’t use end while
those off-the-shelf network architectures. Instead, we used

2
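To make the handoff in Algorithm 1 concrete, the following is a minimal Python sketch of the detect-then-track loop. It is only an illustration of the control flow, not our actual implementation: OpenCV's KCF tracker (cv2.TrackerKCF_create) stands in for our C++ tracker, and detect() is a hypothetical wrapper around the Faster R-CNN person detector.

    import cv2

    def run_pipeline(video_source, detect):
        # detect(frame) is a hypothetical stand-in for the CNN detector;
        # it returns an (x, y, w, h) box for the key object, or None.
        cap = cv2.VideoCapture(video_source)
        tracker = None
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if tracker is None:
                box = detect(frame)                # slow but accurate: only to (re)initialize
                if box is None:
                    continue
                tracker = cv2.TrackerKCF_create()
                tracker.init(frame, tuple(box))
            else:
                ok, box = tracker.update(frame)    # fast KCF update on every frame
                if not ok:                         # tracker lost: fall back to detection
                    tracker = None
        cap.release()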
4.1. Detection

Drones are mainly used to take pictures of people, so we focus on detecting people as a first step. We make the further assumption in this project that there is only one person of interest to track, so the person with the highest detection score is our target. We used two detectors for people detection, Faster R-CNN and the Yolo detector, and analyze both in the sections below.

4.1.1 Using Faster R-CNN

We used a 7-layer convolutional neural network within Faster R-CNN [15] for people detection. The framework takes raw image frames from a video stream and outputs bounding boxes and target classes for detected objects.

We used an in-house model trained on the KITTI [2] dataset. KITTI contains a rich amount of training samples that include objects such as cars, pedestrians, and cyclists, and generalizes easily to our task. In this project our detection target is people, so we modified the script to detect people only. The detailed architecture is shown in Figure 1.

We measured the accuracy (mAP) and speed of our in-house model and compared it with the baseline, as shown in Table 1. Our in-house model has slightly worse accuracy than the baseline, but is 12x faster.

Table 1. Accuracy and runtime for our detection network (runtime measured on a GTX 980 GPU)
    Model           mAP      Runtime
    Baseline [15]   65.9%    2s
    Ours            62.0%    0.17s

4.1.2 Using the Yolo Detector

We also tried the Yolo detector [14], because it is easier to compile for mobile. It is also a faster alternative, at the cost of worse accuracy. In our experiments, we found the Yolo detector unable to handle cases where the person is small and far away, as shown in Figure 3, so we did not use this method.

4.2. Tracking

For tracking we chose the KCF algorithm over other state-of-the-art tracking algorithms like MDNet [12], SRDCF [1], or EBT [18], despite their better accuracy and performance in the VOT 2015 challenge [10]. The reason behind this decision is that we want real-time tracking for our drone, so that we can give consistent control commands to it, while MDNet, SRDCF, and EBT all run at under 5 frames per second.

4.2.1 KCF

KCF [7][8] is a more old-school tracking algorithm than MDNet, but it is faster and more succinct. The algorithm uses the Discrete Fourier Transform to diagonalize the data matrix, which is then used to train a discriminative classifier through linear regression and the kernel trick; this approach is called the Kernelized Correlation Filter (KCF).

We found that KCF runs very fast on video: it takes on average around 8.8 milliseconds per frame on a Macbook Pro CPU.

The downside of KCF is the requirement that the video be continuous. If the video fades to black, the object moves very fast, or a jump cut occurs, KCF has a hard time recovering. When this happens, the peak value of the detection score from running the Gaussian kernel on the correlation filter suffers a significant drop; the tracker also returns a negative bounding-box value if it cannot find any match. We leverage this behavior to combine KCF with Faster R-CNN or another detection algorithm: when KCF is not confident or fails entirely, it calls the detection algorithm in the hope of recovering.

Table 2. Speed of detection and tracking on different hardware platforms
    Hardware Platform      GTX 980    TX1      TK1
    Power                  150W       10W      7W
    Detection              0.17s      0.6s     1.6s
    Tracking               5.5ms      14ms     14ms
    Tracking Frames/Sec    182fps     71fps    71fps

4.3. Hardware Platform

Drones are a special hardware platform with limited space and weight, so computation power is not a free lunch: we cannot afford to use desktop GPUs for the detection computation, although they are much faster. To compare hardware platforms, we measured the computation time for detection and tracking, and we show the power consumption of these platforms in the same table (Table 2). The GTX 980 is roughly 10x faster than the TK1 but consumes 20x more power. The TX1 takes roughly the same power as the TK1 but is 3x faster. In conclusion, the TX1 would be the ideal hardware platform for the drone's detection.

The TX1 appears ideal with respect to speed and power consumption, but its form factor limited our ability to put it on the drone. As shown in Figure 5, the TX1 development board is much larger and heavier than the TK1 board, making it very hard to mount on the drone. However, the TK1 is smaller, and DJI provides the Manifold box to hold the TK1, which makes it very easy to use; we managed to put our algorithms on the drone.
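The per-frame latencies and frame rates in Table 2 come down to simple wall-clock timing of the detection and tracking calls on each platform. A hedged sketch of such a timing harness is shown below; benchmark() and the detect/track callables are placeholders for illustration, not part of our code base.

    import time

    def benchmark(fn, frames):
        # fn is a hypothetical detection or tracking callable taking one frame.
        start = time.perf_counter()
        for f in frames:
            fn(f)
        elapsed = time.perf_counter() - start
        per_frame = elapsed / len(frames)
        return per_frame, 1.0 / per_frame   # seconds per frame, frames per second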
Figure 1. The CNN architecture that we used for detection.

In order to deal with the large form factor of the TX1 development board, we bought a small carrier board for the TX1, shown in Figure 6. This carrier board is as small as the heat sink: we can unplug the TX1 central board from the full development board and plug only the central board, which contains the actual TX1 chip, into this carrier board. This makes the size even smaller than the TK1. However, there is no free lunch: the interface, and especially the power supply, is not compatible with the DJI drone, and we have not yet had a chance to connect the carrier board to the drone, which could be future work. Since the TK1 is fully working, if not optimized for speed, the TX1 carrier board is not on the critical path of our project.

Figure 5. The NVIDIA TX1 and TK1 development boards we used.
Figure 6. In order to reduce the size of the TX1, we bought a small carrier board.

5. Implementation and Experiments

5.1. Detection

Faster R-CNN builds on top of Caffe, a deep learning framework that requires multiple dependencies, and has a number of customized region pooling layers. We spent great effort installing CUDA and Faster R-CNN onto all of our desktop and embedded platforms. During installation, we ran into the following issues.

1. We flashed our TK1 with the latest L4T (Linux for Tegra); however, the TK1 does not support CUDA versions higher than 6.5, so we only installed the CUDA 6.5 dev kit on the TK1. However, the latest Caffe is not backward compatible with cuDNN v3 and earlier, and if we revert Caffe to an earlier branch, it does not support the new layers required by Faster R-CNN. So we turned off the cuDNN switch in the Caffe installation to bypass this problem.

2. Some of the Faster R-CNN libraries were written in Python and compiled into native C++ code using Cython. When installing these libraries onto the embedded systems, i.e. the TK1 and TX1, we ran into a compilation error on gpu_nms.cpp, a GPU implementation of non-maximum suppression. The root cause was identified as a compiler incompatibility on the embedded systems. We eventually found a workaround by manually modifying source code in the generated cpp file and successfully passed compilation.

Finally, we trained a 7-layer CNN Caffe model using a GTX 980 and ported the model onto all platforms to evaluate testing performance using both offline and online video streams.

Figure 2. Faster R-CNN performs really well at detecting people from a drone's perspective, even when the object is far away and twisted (last sub-image).
Figure 3. The Yolo detector does not perform as well as Faster R-CNN: when the target person is small, the detector fails.
Figure 4. The KCF tracker performs very well on videos that do not have jump cuts. It fails if the video fades to black; however, this is not a problem, since detection is called in that scenario.

5.2. Tracking

For tracking, the original KCF algorithm is implemented in Matlab. However, given the hardware limitations of the TK1 (storage space and computing power), we implemented the algorithm in C++. As in the original model, the tracker interface provides an init method that takes a frame and the starting bounding box and learns a model using linear regression on sample patches of the frame (with the help of cyclic shifts). Upon receiving a new frame, the tracker's update function is called: it first evaluates patches of the same size; if the resulting detection score does not reach a threshold, patches at different scales are evaluated. We used re-scaling factors of 110% and 90%. Finally, we return the patch with the highest score (the new positive and negative patches are used here for online learning).
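As an illustration of the multi-scale update just described, the sketch below evaluates the same-size patch first and falls back to the 110% and 90% re-scaled patches when the score is below threshold. score_fn is a hypothetical stand-in for the correlation-filter response peak computed by our C++ tracker; this is a sketch of the search logic only, not the tracker itself.

    def update_with_scales(score_fn, frame, box, threshold):
        # box is (x, y, w, h); score_fn(frame, box) returns the filter's peak response.
        x, y, w, h = box
        score = score_fn(frame, box)
        if score >= threshold:                  # the same-size patch is confident enough
            return box, score
        best_box, best_score = box, score
        cx, cy = x + w / 2.0, y + h / 2.0
        for s in (1.1, 0.9):                    # the 110% / 90% re-scaling factors
            sw, sh = w * s, h * s
            candidate = (cx - sw / 2.0, cy - sh / 2.0, sw, sh)
            s_score = score_fn(frame, candidate)
            if s_score > best_score:
                best_box, best_score = candidate, s_score
        return best_box, best_score             # return the patch with the highest score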
5.3. Handshake between detection and tracking

Since the detection algorithm is implemented in Python, we use Python.h to realize a C++/Python binding between the two algorithms. The initialized detection algorithm (the neural network loaded under Caffe) is stored in the C++ main program as a PyObject. Upon receiving a new frame, the C++ main program converts the frame (stored as a byte array) to a Python ndarray object and passes the resulting array to a call to the detection method. It then parses the result and determines whether the detection score meets a threshold (i.e., how confident we are that the object is a person). An interesting bug arises when importing a Python module under sudo (which is needed for activating and using the DJI drone's camera); this is because some of the Python packages are installed with root read/write permission only. We worked around this problem with sudo su and adjusted the privileges for using the DJI live camera.

5.4. Interacting with the DJI camera library

We use two sets of DJI libraries. They contain only interface code to talk to the camera; we did not use any DJI code for vision algorithms. First, the camera input module is provided by djicam.h, which leverages libdcam to read from the built-in camera on the drone. The library only provides three simple functions (manifold_cam_init, manifold_cam_read, and manifold_cam_exit), which means we need to manipulate all data from raw pixel arrays. We initialize the camera with TRANSFER_MODE (transferring the image input to the controller and mobile app) and GET_BUFFER_MODE (storing the video input in a local buffer byte array in NV12 format; more on format conversion in a later subsection). We use CAM_NON_BLOCK mode (i.e., not waiting for the camera to fully initialize) to ensure that we can give constant control to the drone even if the camera is not set up. We sleep the program and wait for the camera to exit at the end when we return. The specifics of the library can be found here: https://github.com/dji-sdk/Manifold-Cam/blob/master/djicam.h

5.5. Offline Detection and Tracking

We first tested our detection and tracking module on our own offline video recordings from different perspectives (a DJI Inspire recording of Song Han playing nunchaku and a GoPro recording of Song Han snowboarding) with the TK1 board mounted on the drone. For detection using Faster R-CNN, each frame takes 1.6 seconds on average (compared to 0.6s on the TX1, which is small enough to mount on the drone but not yet supported by DJI). For tracking under KCF, each frame takes only 14ms (71 fps). Running our detection and tracking module on the two videos exposed several interesting problems. First, the tightness of the detection bounding box affects how well the tracking algorithm performs. If the bounding box is not tight enough, it may incorporate an irrelevant subject (in the nunchaku case, a big chunk of shadow, Figure 2, sub-image 4), and tracking will then mistakenly latch onto that irrelevant subject. Second, detection does not work well when the person appears as a bulkier figure. In the snowboarding video, when the viewpoint is to the side of the person, the helmet, face mask, and bulky clothing make it hard to detect the person. We address these two problems by adjusting the confidence threshold for the detection score and by retraining the neural network with images from different angles and viewpoints.

5.6. Online detection and tracking on the DJI live camera

We then adapted the detection and tracking module to work with the DJI M100's live camera. As mentioned in an earlier section, the camera reads in data as a raw byte array in NV12 format. This format has three components for each pixel: a luma component (the brightness) Y and two chrominance (color) components U and V. Since the detection and tracking algorithms are all based on RGB pixel values, we do the following conversion to obtain an RGB representation of the frame (clamping restricts the value to the 0-255 range for RGB):

    R = clamp(Y + 1.4075(V − 128))
    G = clamp(Y − 0.3455(U − 128) − 0.7169(V − 128))
    B = clamp(Y + 1.779(U − 128))

After obtaining the frame encoded in RGB format, we use a similar approach as for offline video and detect and track frame by frame. A challenge that surfaces in this step is that, since we are doing detection on a live stream, the subject may be moving rapidly while we are detecting: the object may already have moved out of the bounding box by the time detection finishes analyzing a frame from ∆t seconds ago. We addressed this issue by initializing the tracker with the frame from ∆t seconds ago instead of the current frame. The tracker then trains its positive and negative patches on the correct bounding area. This solves the problem unless the object has deformed too much within the ∆t time frame.
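The NV12-to-RGB conversion above can be written directly in NumPy. The sketch below applies the same coefficients and clamping; the NV12 unpacking (a full-resolution Y plane followed by an interleaved half-resolution UV plane) is our assumption about the buffer layout and is shown only for illustration.

    import numpy as np

    def nv12_to_rgb(buf, width, height):
        # buf: raw NV12 byte array from the camera.
        data = np.frombuffer(buf, dtype=np.uint8)
        y = data[:width * height].reshape(height, width).astype(np.float32)
        uv = data[width * height:].reshape(height // 2, width // 2, 2).astype(np.float32)
        # Upsample U and V to full resolution by pixel replication.
        u = uv[:, :, 0].repeat(2, axis=0).repeat(2, axis=1)
        v = uv[:, :, 1].repeat(2, axis=0).repeat(2, axis=1)
        r = y + 1.4075 * (v - 128)
        g = y - 0.3455 * (u - 128) - 0.7169 * (v - 128)
        b = y + 1.779 * (u - 128)
        rgb = np.stack([r, g, b], axis=-1)
        return np.clip(rgb, 0, 255).astype(np.uint8)   # clamp to the 0-255 range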
5.7. Controlling the Drone

We control the camera using the OnBoard SDK provided by DJI. We first need to send activation data to the CoreAPI driver. Since DJI recently upgraded their drone operating system and their OnBoard SDK was not updated accordingly, this step caused tremendous trouble: we had to contact the DJI engineer who wrote the OnBoard SDK to get the new version of the encryption key (a magic number) to activate successfully. We can then gain control of the camera by sending GimbalAngleData to it. A GimbalAngleData contains three important fields corresponding to the three spatial degrees of freedom of the camera: yaw, roll, and pitch.

6. Conclusion

We present Deep Drone, a detection and tracking system running in real time on embedded hardware, which powers drones with vision. We presented our software architecture, which combines an accurate but slow detection algorithm with a less accurate but fast tracking algorithm to make the system both fast and accurate. We also compared the runtime, power consumption, and size of different hardware platforms, and discussed implementation issues and the corresponding solutions when dealing with this embedded hardware.

Acknowledgment

We thank Amber Garage for equipment support.

References

[1] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 4310–4318, 2015.
[2] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
[3] R. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.
[4] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[5] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[7] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In Computer Vision–ECCV 2012, pages 702–715. Springer, 2012.
[8] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 37(3):583–596, 2015.
[9] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv preprint arXiv:1602.07360, 2016.
[10] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder. The visual object tracking VOT2015 challenge results. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 1–23, 2015.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[12] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. arXiv preprint arXiv:1510.07945, 2015.
[13] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. arXiv preprint arXiv:1506.02640, 2015.
[14] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.
[15] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
[16] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[18] G. Zhu, F. Porikli, and H. Li. Tracking randomly moving objects on edge box proposals. arXiv preprint arXiv:1507.08085, 2015.
Eigenfaces and Fisherfaces – A comparison of face detection techniques

Pradyumna Desale                    Angelica Perez
SCPD, NVIDIA                        Stanford University
pdesale@nvidia.com                  pereza77@stanford.edu

Abstract

In this project we compare different subspace-based techniques of face recognition. Face recognition is considered a relatively mature problem with decades of research behind it, and there is a lot of interest because face recognition, in addition to having numerous practical applications such as access control, mug shot searching, security monitoring, and surveillance systems, is a fundamental human behavior that is essential for effective communication and interaction among people.

In the literature, the face recognition problem is defined as: given static (still) or video images of a scene, identify or verify one or more persons in the scene by comparison with faces stored in a database. We focus on the classification problem, which is a superset of the identification problem.

1. Motivation

Face recognition is a mature problem, and even though computers cannot pick out suspects from thousands of people, NCIS-style, the ability of computers to differentiate among a small number of family members and friends is considered better than that of humans. Face recognition has additional applications, including human-computer interaction (HCI), identity verification, access control, etc.

Feature-based face recognition methods rely on processing the input image to identify and extract distinctive facial features such as the eyes, mouth, and nose, and the geometric relationships among the facial points, thus reducing the input facial image to a vector of geometric features. Standard statistical pattern recognition techniques are then employed to match faces using these measurements.

The distinct disadvantage of feature-based techniques is that, since the extraction of feature points precedes training and classification, the implementer has to make an arbitrary decision about which features are important; that is why we evaluate statistical methods of face recognition in this work.

We use various methods in our two-stage face recognition systems: PCA (Principal Component Analysis), 2D PCA, and LDA (Linear Discriminant Analysis) for feature extraction, and SVM (Support Vector Machines) for classification.

Specifically, we compare the accuracy of face identification for the different statistical methods + SVM in the presence of the following variations:
a. Training size variation.
b. Variations in number of principal components.
c. Presence of noise in the image – Gaussian, salt-and-pepper, speckle noise.
d. Variations in facial expressions of subjects.
e. Variations in angle and structural changes to the face, such as beard or glasses.
f. Variations in illumination.

Accuracy can be measured for identification or for classification of a person against the training set. We use the latter as our metric of accuracy.

2. Technical solution

We describe PCA, 2D PCA, LDA, and SVM briefly in this section. The mathematics behind all these techniques is well established, so we do not include derivations; the references can be reviewed for mathematical proofs.
2.1. PCA

Sirovich and Kirby were the first to utilize Principal Component Analysis (PCA) to economically represent face images. They argued that any face image can be reconstructed approximately as a weighted sum of a small collection of images that define a facial basis (eigenimages), plus a mean image of the face. Turk and Pentland presented the well-known Eigenfaces method for face recognition.

Suppose there are M training face images for each of K subjects. Let each face image A(x, y) be a 2-dimensional N-by-N array of pixel values. The image may also be represented as a vector of dimension N^2. Let us denote each face image of the training set as f_ij, and the corresponding average of all training images as g. The principal components are the eigenvectors w_i of the covariance matrix of the mean-subtracted images (f_ij − g). The principal components transform an input image into a lower-dimensional feature vector

    y_k = w_i^T f_ij

2.2. 2D PCA

In the PCA-based face recognition technique, the 2D face image matrices must first be transformed into 1D image vectors. The resulting image vectors usually lead to a high-dimensional image vector space, where it is difficult to evaluate the covariance matrix accurately due to its large size and the relatively small number of training samples. As opposed to conventional PCA, 2DPCA is based on 2D matrices rather than 1D vectors.

The image covariance (scatter) matrix is computed from the M training images as

    G = (1/M) * sum_{j=1}^{M} (A_j − Ā)^T (A_j − Ā)

The eigenvectors of the scatter matrix G corresponding to the d largest eigenvalues are the 2D principal components of the images A_j. If we denote the optimal projection vectors as {X_1, X_2, …, X_d}, the corresponding features of image A_j are computed as Y_k = A_j X_k. The matrix of features B = [Y_1, Y_2, …, Y_d] is called the feature matrix.

2.3. Segmented PCA

The PCA-based face recognition method is not very effective under conditions of varying pose and illumination, since it considers the global information of each face image and represents it with a set of principal components. Under variations of pose and illumination, the statistical features vary considerably from the weight vectors of images with normal pose and illumination; hence it is difficult to identify them correctly. On the other hand, if the face images are divided into smaller regions and the weight vectors are computed for each of these regions, the weights will be more representative of the local information of the face. When there is a variation in pose or illumination, only some of the face regions will vary and the rest will remain the same as in a normal image. Hence the weights of the face regions not affected by varying pose and illumination will closely match the weights of the same individual's face regions under normal conditions. We implemented segmented PCA variations of the 1D and 2D PCA methods that assign equal weight to all the sub-images.

2.4. LDA: Fisher's Linear Discriminant Analysis

PCA methods reduce the dimension of the input data by a linear projection that maximizes the scatter of all projected samples. Fisher's Linear Discriminant (FLD) shapes the scatter with the aim of making it more suitable for classification. Computing the transform matrix results in maximization of the ratio of the between-class scatter to the within-class scatter.

In choosing the projection which maximizes total scatter, PCA retains some of the unwanted variation due to lighting and facial expression. The variations between images of the same face due to illumination and viewing direction are almost always larger than image variations due to a change in face identity. Thus, while PCA projections are optimal for reconstruction from a low-dimensional basis, they may not be optimal from a discrimination standpoint.

In this project we implement a variant of LDA called D-LDA. The basic premise behind the D-LDA approaches is that the information residing in (or close to) the null space of the within-class scatter matrix is more significant for discriminant tasks than the information outside (or far away from) the null space. Generally, the null space of a matrix is determined by its zero eigenvalues. However, due to insufficient training samples, it is very difficult to identify the true null eigenvalues. As a result, high variance is often introduced in the estimation of the zero (or very small) eigenvalues of the within-class scatter matrix. Note that the eigenvectors corresponding to these eigenvalues are considered to be the most significant feature bases in the D-LDA approaches.
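For reference, below is a small NumPy sketch of the 2DPCA feature extraction described in Section 2.2: it computes the image scatter matrix G, keeps the d leading eigenvectors, and forms the feature matrix B = [Y_1 … Y_d] for each image. It is an illustrative re-implementation under the definitions above, not the code used in our experiments.

    import numpy as np

    def twodpca_features(images, d):
        # images: M grayscale face images, shape (M, H, W); d: number of components.
        A = np.asarray(images, dtype=np.float64)
        centered = A - A.mean(axis=0)
        # Image scatter matrix G = (1/M) * sum_j (A_j - Abar)^T (A_j - Abar), shape (W, W).
        G = np.einsum('mhw,mhv->wv', centered, centered) / A.shape[0]
        eigvals, eigvecs = np.linalg.eigh(G)
        X = eigvecs[:, np.argsort(eigvals)[::-1][:d]]   # top-d projection vectors X_1..X_d
        B = A @ X                                       # per-image feature matrices Y_k = A_j X_k
        return X, B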
2.5. Support Vector Machines

The goal of SVM classifiers is to find a hyperplane that separates the largest fraction of a labeled data set {(x^(i), y^(i)); x^(i) ∈ R^N; y^(i) ∈ {−1, 1}; i = 1, 2, …, N}. The most important requirement for the classifier is that it maximize the distance, or margin, between each class and the hyperplane. In most real applications the data cannot be linearly classified; to deal with this problem, we transform the data into a higher-dimensional feature space and assume that the data in this space can be linearly classified.

The discriminant hyperplane is defined as

    y(x) = sum_{i=1}^{N} α_i y^(i) K(x^(i), x) + b

where K is the kernel function for the SVM. In this paper we use a radial basis function (RBF) kernel for SVM classification.

2.6. Algorithm description

A PCA or LDA method is used to identify features of the training images. To apply SVM for classification, we use a one-against-all decomposition to transform the multi-class problem into two-class problems. The training set D = {(x^(i), y^(i)); x^(i) ∈ R^N; y^(i) ∈ {1, 2, 3, …, K}} is transformed into a series of sets D_k = {(x^(i), y_k^(i)); x^(i) ∈ R^N; y_k^(i) ∈ {k, 0}}. We use the MATLAB function svmtrain to compute the discriminant function of the SVM, and in the classification phase we use the MATLAB function svmclassify to identify the class of a test probe image x.
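Purely as an illustration of this one-against-all RBF-SVM step, a rough scikit-learn equivalent of the svmtrain/svmclassify workflow might look like the following. The sklearn-based sketch is an assumption for illustration; our experiments used the MATLAB functions above.

    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    def train_one_vs_all(features, labels):
        # features: (num_images, num_features) PCA/2DPCA/LDA feature vectors;
        # labels: subject class for each training image.
        clf = OneVsRestClassifier(SVC(kernel='rbf', C=1.0, gamma='scale'))
        return clf.fit(features, labels)

    # classification phase: predicted_classes = clf.predict(test_features)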
3. Experimental Setup

We use two databases for our experiments:
- AT&T / ORL face database
- Yale face database

The AT&T Face database, sometimes also known as the ORL Database of Faces, contains ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).

The AT&T Face database is good for initial tests, but it is a fairly easy database: the Eigenfaces method already has a 90+% recognition rate, so we did not expect to see considerable improvements with other algorithms. The Yale Face database A is a more appropriate dataset for our experiments, because the recognition problem is harder. The database consists of 15 people (14 male, 1 female), each with 11 grayscale images sized 320 × 243 pixels. There are changes in the lighting conditions (center light, left light, right light), facial expressions (happy, normal, sad, sleepy, surprised, wink), and glasses (glasses, no glasses).

We select the first M images of each subject for computing the features of that subject's class. These features are then used to train the SVM with an RBF kernel, and the images not used for training are used for classification testing.

Graph 1: PCA method accuracy vs. number of training images.

4. Results

In this section we present and discuss a comparison of each of the previously mentioned feature extraction techniques with SVM classifiers.
4.1. Effect of training size variations and number of principal components

Training size significantly affects the accuracy of all methods, but there is a ceiling on the highest level of accuracy obtained from any of the methods. On the ORL database of faces, the 2D PCA and LDA methods achieve 90% accuracy within 3 training images and 95% accuracy with 4 training images, as shown in the graphs below.

Graphs 2, 3, 4: Graph 2: 2D PCA against number of principal components for various training sets from ORL. Graph 3: Segmented PCA against number of principal components for various training sets, ORL. Graph 4: LDA against number of principal components for various training sets, ORL.

Graphs 5, 6, 7: Graph 5: 2D PCA against number of principal components for various training sets from Yale. Graph 6: Segmented PCA against number of principal components for various training sets, Yale. Graph 7: LDA against number of principal components for various training sets, Yale.

It is hard to compare the accuracy of the 2DPCA and LDA methods with just the ORL database, but the Yale database (graphs above) clearly shows that the LDA method is superior to 2D PCA when in-class variation is large.

Statistical learning methods, both PCA- and LDA-based, seem to perform very well with small sample and test sizes but often suffer from the so-called "small sample size" (SSS) problem if the number of test samples is very large.
Looking closely at the inaccurate face recognition in the Yale database gives us insight into this problem. When the first 4 images of each subject are picked for training, only subjects 15 and 3 have dark glasses, while the other subjects do not. When test images are run against such a training set, all the probe images of subjects with glasses are categorized into either class 3 or class 15. Since not all subject images with glasses were used for training, our training model is heavily biased towards two classes. That said, increasing the number of training images improves the accuracy and we get around the bias.

4.2. Variations in facial expression of subjects, structural changes to the faces, and variations in pose

The Yale database also includes facial expressions of subjects, and Graphs 5, 6, 7 show that the LDA method is more immune to variations in facial expressions, though the PCA method's accuracy is not substantially lower than LDA's. The LDA method is definitely superior when in-class variation is large. Overall, we are very surprised by how resilient both sets of algorithms are to significant changes in facial poses.

4.3. Presence of noise in the probe image

Noise and distortions in face images can seriously affect the performance of face recognition systems. Analog and digital image capture have come a long way, and very good quality photos are possible even with a cell phone camera, but a biometric system needs to be resilient to tampering, so we now explore the noise immunity of the different algorithms. Noise in the probe image degrades the performance of all algorithms substantially. In this section we only evaluate the 2D PCA and LDA based methods, since 1D PCA has lower accuracy compared to 2DPCA, and the segmented PCA methods are essentially just PCA methods, so the effect of noise on their performance is easily studied by understanding the effect of noise on the PCA methods alone. We also restrict the description of results to the Yale database; the ORL database was also studied, and we found that the effects of noise had little dependency on the database.

4.3.1. Gaussian noise

Gaussian noise is the most common noise occurring in everyday life. We use zero-mean Gaussian noise of different variances in the experiment.

We also use the Wiener filter, which is an MSE-optimal stationary linear filter for suppressing the degradation caused by additive noise and blurring. Fourier transforms are unable to recover components for which the Fourier transform of the point spread function is 0; this means they are unable to undo blurring caused by band-limiting of the Fourier transform. We can see in Graphs 8 and 9 that some of the face recognition accuracy is recovered when Wiener filters are used to correct the additive noise. Wiener-filtered, noise-suppressed images look like the examples below.

Graphs 8, 9:
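A minimal sketch of the noise-and-restoration step for this experiment is shown below, assuming SciPy's Wiener filter as the Wiener implementation. The variance value and the 5x5 window are illustrative placeholders, not the exact settings we swept.

    import numpy as np
    from scipy.signal import wiener

    def gaussian_noise_then_wiener(image, variance, window=(5, 5)):
        # Add zero-mean Gaussian noise of the given variance, then suppress it
        # with a Wiener filter before extracting features from the probe image.
        noisy = image.astype(np.float64) + np.random.normal(0.0, np.sqrt(variance), image.shape)
        noisy = np.clip(noisy, 0, 255)
        return noisy, wiener(noisy, mysize=window)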
4.3.2. Speckle Noise

This granular noise occurs in ultrasound, radar, and X-ray images, and in images obtained from magnetic resonance. The multiplicative, signal-dependent noise is generated by constructive and destructive interference of the detected signals; wave interference is the reason multiplicative noise occurs in the scanned image. Speckle noise is image dependent, so it is considered hard to find a mathematical model describing its removal, especially if we expect randomness in the input data. We had identified Lee's filter as the method to counter speckle noise, but speckle noise does not greatly affect recognition performance, so we prioritized the study of Gaussian and S&P noise over correcting speckle noise.

Graphs 10, 11: Additive speckle noise effect on accuracy for PCA and LDA.

4.3.3. Salt & Pepper Noise

Salt & pepper noise is perceived as a random occurrence of black and white pixels in a digital image. It can be caused by incorrect data transmission or by damage to already received data. In CCD and CMOS sensors or LCD displays, salt & pepper noise can be caused by permanently turned-on or turned-off pixels; the remaining pixels are unchanged. Usually the intensity (frequency of occurrence) of this noise is quantified as a percentage of incorrect pixels. Median filtering (a specific case of order-statistic filtering) is an effective method for eliminating salt & pepper noise from digital images. We can see that almost all of the algorithms' performance is recovered by the use of median filters.

Graphs 12, 13: Additive S&P noise effect on accuracy for PCA and LDA, and corrected with a median filter.
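Similarly, the salt & pepper corruption and median-filter correction can be sketched as follows; the corruption fraction and filter size are illustrative, and scipy.ndimage.median_filter stands in for whatever median implementation is used.

    import numpy as np
    from scipy.ndimage import median_filter

    def salt_and_pepper(image, fraction):
        # Corrupt a given fraction of pixels with salt (255) or pepper (0) values.
        out = image.copy()
        mask = np.random.rand(*image.shape) < fraction
        out[mask] = np.where(np.random.rand(mask.sum()) < 0.5, 0, 255)
        return out

    # Median filtering recovers most of the recognition accuracy:
    # cleaned = median_filter(noisy_probe, size=3)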
https://pereza77@bitbucket.org/pereza77/cs231a_final_project_face_recognition.git

5. Summary

We examined different subspace methods of face recognition in this project. The two-stage recognition systems use PCA or LDA for feature extraction, followed by SVM for classification. All methods are significantly influenced by the settings of parameters related to the algorithm used (i.e., PCA, LDA, or SVM).

For methods working in ideal conditions, both PCA and LDA achieve greater than 90% accuracy within three training images.

This project dealt with a "closed" image set, so we did not have to deal with issues like detecting people who are not in the training set. On the other hand, our two test databases contain images of the same subjects that often differ in facial expression, hairstyle, the presence of a beard, or the wearing of glasses, and that were taken in different sessions after longer time periods. We also presented recognition results for noisy images and compared them to results for non-distorted images, with correction for two types of noise.

We started this project with the intent of implementing face recognition algorithms with SVM, and we definitely succeeded in that goal. Computer vision and analytics systems perform far better when combined with deep learning models such as CNNs; combining deep learning and multivariate wide learning with improved feature descriptor models can enable the extraction of more information from facial images, such as facial expression. During the research for this project, we came across papers that study separating a true smile from a fake smile, using PCA for feature extraction and CNNs for training and classification. We would like to build our fundamentals in CNNs and machine learning, and combine the power of the statistical models used in this project with deep learning methods to create more interesting projects, like machine recognition of facial features.
References
[1] M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," in Computer Vision and Pattern Recognition, 1991. Proceedings CVPR '91, IEEE Computer Society Conference on, 1991, pp. 586-591.
[3] Y. Jian, et al., "Two-dimensional PCA: a new
approach to appearance-based face representation and
recognition," Pattern Analysis and Machine
Intelligence, IEEE Transactions on, vol. 26, pp. 131-
137, 2004.
[4] A new LDA-based face recognition system which can
solve the small sample size problem. Pattern
Recognition 33, 1713–1726.
[5] Belhumeur, P. N., Hespanha, J., and Kriegman, D.
Eigenfaces vs. fisherfaces: Recognition using class
specific linear projection. IEEE Transactions on
Pattern Analysis and Machine Intelligence 19, 7
(1997), 711–720.
[6] B. Moghaddam and Y. Ming-Hsuan, "Gender
classification with support vector machines," in
Automatic Face and Gesture Recognition, 2000.
Proceedings. Fourth IEEE International Conference
on, 2000, pp. 306-311
[7] P. Phillips, et al., "The FERET evaluation
methodology for face-recognition algorithms," Pattern
Analysis and Machine Intelligence, IEEE
Transactions on, vol. 22, pp. 1090-1104, 2002.
[8] P. Viola and M. J. Jones, "Robust real-time face
detection," International Journal of Computer Vision,
vol. 57, pp. 137-154, 2004.
[9] T. H. Le and L. Bui, "Face Recognition Based on
SVM and 2DPCA," International Journal of Signal
Processing, Image Processing and Pattern
Recognition Vol. 4, No. 3, September, 2011
[10] MATLAB.com
[11] Brunelli, R., and Poggio, T. Face recognition through
geometrical features. In European Conference on
Computer Vision (ECCV) (1992), pp. 792–800
[12] Kanade, T. Picture processing system by computer
complex and recognition of human faces. PhD thesis,
Kyoto University, November 1973.
[13] H. Yu and J. Yang, "A Direct LDA Algorithm for
High-dimensional Data with Application to Face
Recognition," Pattern Recognition, Vol.34, pp.2067-
2070, 2001.
[14] F. Song, D. Zhang, J. Wang, H. Liu, and Q. Tao, "A
parameterized direct LDA and its application to face
recognition," Neurocomputing, Vol.71, pp.191-196,
2007.
[15] ORL face database -
http://www.uk.research.att.com/facedatabase.html
[16] Yale face database -
http://cvc.yale.edu/projects/yalefaces/yalefaces.html
Emotion AI, Real-Time Emotion Detection using CNN

Tanner Gilligan                     Baris Akis
M.S. Computer Science               B.S. Computer Science
Stanford University                 Stanford University
tanner12@stanford.edu               bakis@stanford.edu

Abstract

In this paper, we describe a Convolutional Neural Network (CNN) approach to real-time emotion detection. We utilize data from the Extended Cohn-Kanade dataset, the Japanese Female Facial Expression dataset, and our own custom images to train this model, and apply pre-processing steps to improve performance. We re-train a LeNet and an AlexNet implementation, both of which perform with above 97% accuracy. Qualitative analysis of real-time images shows that these models perform reasonably well at classifying facial expressions, but not as well as the quantitative results would indicate.

1 Introduction

The ability to confidently detect human emotions can have a wide array of impactful applications, and therefore emotion recognition has been a core area of research in computer vision. We wanted to focus on the issue of emotion recognition and build a real-time emotion detection system.

When we began to work on emotion detection, we quickly realized that there is an innate problem: all data sets are based on "acted" emotions instead of "real" emotions. Many of these data sets, such as CK+ ([Lucey et al., 2010]) and JAFFE ([Lyons et al., 1998]), are collections of actors who demonstrated core emotions in front of a camera. Therefore the field isn't detecting real emotions, but rather detecting the emotion that the subject is acting or the observer is perceiving. This problem was also very obvious while testing our model, as we saw confidence scores increase as the subject portrayed very exaggerated facial expressions that would be labeled "fake" by a human.

When we discussed possible applications of a successful emotion recognition tool, one application we thought of is to use emotion labels and prediction scores, combined with the social science research on emotion led by Paul Ekman [Ekman, 1992], to predict emotion intensities. As indicated in Frijda et al. [1992], emotion intensity prediction is a really hard problem and a very valuable insight for the field of psychology. The main reason we didn't pursue emotion intensity prediction is that there were no existing data sets or research that could serve as ground truth.

Therefore we concentrated on building a successful emotion recognition model that can work in real time. In this project we built a model that uses Convolutional Neural Networks to classify faces as one of seven core emotions—anger, contempt, disgust, fear, sadness, happiness, surprise—or neutral [Darwin et al., 1998].

2 Background

2.1 Previous Works

The Emotion Recognition in the Wild (EmotiW) Challenge is the leading academic challenge on emotion recognition and labeling. We concentrated on the winning papers of the 2014 and 2015 challenges. It is important to highlight that the papers demonstrating the best results for the 2015 image-based Static Facial Expression Recognition sub-challenge used CNNs. Yu and Zhang [2015] propose a CNN architecture specialized for emotion recognition performance; they propose two novel constrained optimization frameworks to automatically learn the network ensemble weights by minimizing the loss of the ensembled network output responses. Kim et al. [2016] took a different approach by creating a committee of multiple deep CNNs. They also created a hierarchical architecture of the committee with an exponentially-weighted decision-making process.

There were also a wide variety of other papers that suggested alternative methods to CNNs, but
didn't perform as well. Most of these papers use support vector machines (SVM) or largest margin nearest neighbor (LMNN) for classification; the main difference between them is the feature descriptors. [Dhall et al., 2011] used a system that extracts pyramid of histogram of gradients (PHOG) and local phase quantization (LPQ) features to encode shape and appearance information. [Yao et al., 2015] used AU (Action Unit) aware features generated after finding pairwise patches that are significant for discriminating emotion categories. Yao's main insight was that previous research groups neglected to explore the significance of the latent relations among changing features resulting from facial muscle motions. This approach delivers results better than the winning team of 2014 but falls short compared to the 2015 winners' results. [Shan et al., 2009] concentrated on person-independent facial expression recognition and used Local Binary Pattern (LBP) descriptors. Many of these feature-based algorithms aimed to mimic Ekman's suggestions for human emotion in [Ekman et al., 2013].

2.2 Improvements on Previous Works

Since many of the most successful emotion recognition applications used CNNs, we also decided to use CNNs as our model for the emotion recognition problem. We already started seeing over 90% train and test accuracy using only the CK+ data set. Similar to many of the papers discussed below, we added the JAFFE data set, and this increased our accuracies. Unlike most other papers in this area, we also created data samples of our own and added these to our data set as well. Surprisingly, this increased accuracies even further. We were able to use these additional samples because we were using CNNs, whose accuracy does not depend on very specific features like AU intensities.

It was hard to find accurate benchmarks, since many of the papers were data-set dependent. Therefore we concentrated on research that also trained models on the CK+ data set, and ideally added JAFFE as well. In these papers, we saw results ranging between 45% and 97%, which is in line with our final results of around 97% test-set accuracy.

Chew et al. [2011] used face tracking and constrained local models (CLM) with the CK+ data set and had testing accuracy ranging from 45.9% (sad) to 93.6% (surprise). Jeni et al. [2013] also used the CK+ data set and were able to reach 86% average accuracy on continuous AU (Action Unit) intensity estimation. Velusamy et al. [2011] used the most discriminative AUs for each emotion to predict emotions and reached 97% with the CK+ data set but only 87.5% with JAFFE; since this approach was very dependent on discriminative physical expressions of emotion, it didn't do as well on the JAFFE database. Islam and Loo [2014] utilized displacement of points on the face between neutral and expressive emotions and were able to correctly classify 83% (fear) to 97% (happiness) of the emotions. Again, this approach was able to successfully classify emotions like happiness and surprise because large displacements in the mouth region created discriminative results.

Our final results are very promising, since they yield accuracies that are less dependent on the actors or emotions being portrayed.

3 Approach

3.1 Overview

In order to accomplish our task of developing a real-time emotion detection system, we had to get several components working independently. We also had to conduct research to better understand the basis for our problem, and how we could improve our results once we got things working. Below is a summary of our general approach:

• Build Dataset: We collected labeled facial-expression data sets from multiple sources and processed their labels and images into a common format. We then introduced custom images of ourselves and a friend to further enrich the data set.

• Pre-process Images: We ran facial-detection software to extract the face in each image. We then re-scaled the croppings and manually eliminated poor images. As a pre-processing step for the CNN, we also applied a Gaussian filter to the images and subtracted the mean image of the training set from each image. In order to get more out of our limited training data, we also augmented the images to include reflections and rotations of each image, with the hope that this would improve robustness.
• Construct CNN: We utilized pre-trained versions of AlexNet and LeNet in Caffe on AWS, where we re-trained the first and last layers. We also had to experiment with various learning-rate methods and parameters in order to obtain a non-divergent model.

• Develop Real-time Interface: OpenCV allowed us to get images from our laptop's webcam. We then extracted the face as before, pre-processed the image for the CNN, and sent it to AWS. On the server, a script would run the image through the CNN and get a prediction, and the results would be pulled back locally.

3.2 Implementation

3.2.1 Dataset Development

CK+ Dataset  The first step in developing our emotion-detection system was to acquire data with which to train our classifier. We sought the largest data set we could find, and selected the Extended Cohn-Kanade (CK+) data set. This data set is composed of over 100 individuals portraying 7 different labeled emotions: anger (1), contempt (2), disgust (3), fear (4), happy (5), sad (6), and surprise (7). In addition, we introduced an additional class 0 to represent a neutral expression. One feature we really liked about this data set is that, for each person displaying an emotion, the directory contains 10 to 30 images demonstrating that individual's progression from a neutral expression to the target emotion. This is good because it allows us to have multiple degrees of intensity for each emotion represented in our data set, as opposed to only the most extreme examples. We originally elected to take the first two images of each sequence and label them as neutral, and the last three and label them as the target emotion. We found that this greatly limited our training set size, however, as we were left with fewer than 1000 training images. To combat this, we looked more closely at the images and decided to take the last third of each sequence as the target emotion, as opposed to just the last three images.

Excluding Contempt  Upon testing this 8-class classifier, we found that it tended to over-predict "contempt". This manifested in quantitatively lower recall and precision scores for the contempt class, as well as qualitatively worse predictions when we fed it live images. We conducted further research and found that many papers on emotion detection ignore the contempt class, as they say it is merely a combination of fear and disgust. Taking this into account, we dropped all instances of contempt from our data set and re-split it. Thankfully, contempt was the smallest class in terms of image count, so we didn't lose a substantial part of our data set.

JAFFE Data Set  After eliminating contempt, we again tested our model qualitatively and quantitatively. We found that we were doing very well quantitatively—our precision, recall, and accuracy were all well over 90% on both test and train sets—but our qualitative results were still rather poor. Since the network was doing very well on the data it was given but was not generalizing well, we decided to find additional data sources. One of the research papers we investigated combined CK+ with the Japanese Female Facial Expression (JAFFE) data set and was able to achieve improved results. Unfortunately, that data set only contains around 250 images, but it was still able to boost the model's performance by a few percent.

Custom Images  Since the real-time interface was being tested solely by us, we also decided to add ourselves to the data set to see if it would improve qualitative results. We found a friend to help us, and the three of us proceeded to take an additional 20 images for each class, further increasing our data set size. After this final inclusion, our data set sizes were:

    Set     Size
    Train   2104
    Val     300
    Test    601

Including the images from the JAFFE data set and the ones we custom-made, we were able to again boost our quantitative results by a small margin, and our qualitative results also noticeably improved.

3.2.2 Data Pre-processing

Given the non-homogeneity of the data set, we had to pre-process the data into a common format. We first converted all images to grayscale. We then utilized OpenCV to detect faces within each image, which returned a set of bounding boxes to examine. In cases where no bounding box was found, we set that image aside and ran it again using different detection parameters until we were able to successfully detect the face. In cases where multiple bounding boxes were returned, we analyzed the sizes and locations of the boxes and selected the one that was largest and/or most central in the image. Using this approach, we were able to extract the facial component of every image in our data set. Once extracted, we re-scaled the images to a common 250-by-250 size.
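As an illustration of this face-extraction step, a minimal OpenCV sketch is shown below. The Haar cascade, the detectMultiScale parameters, and the largest-box tie-break are assumptions of the sketch (our pipeline also considers how central a box is), not the exact detector settings we used.

    import cv2

    # OpenCV's bundled frontal-face Haar cascade, used here as a placeholder detector.
    cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

    def extract_face(image_gray, size=250):
        boxes = cascade.detectMultiScale(image_gray, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) == 0:
            return None                    # caller retries with different parameters
        # Prefer the largest bounding box when several are returned.
        x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])
        return cv2.resize(image_gray[y:y + h, x:x + w], (size, size))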
In order to help the CNN perform better, we also applied statistical pre-processing to the images. The first step was applying a small 5-by-5 Gaussian filter over the images, which is meant to smooth out noise while still preserving each image's edges and distinctive features. The second step was subtracting the training set's mean image from every image. This is beneficial because the distribution of pixel values becomes centered at 0, and it is common practice for training data fed to any machine-learning model.

Since we only have 2104 distinct training images, and Convolutional Neural Networks tend to perform better with more data, we sought ways to enrich this data set. To do this, we augmented each image in two ways. First, we mirrored the image across the Y-axis, which produced a similar but not identical training point. In addition, we introduced slight rotations of 10 degrees in either direction for each image, which helped to boost our training-set size and improve robustness.

3.2.3 CNN Construction

In order to develop our Convolutional Neural Network, we decided to utilize pre-trained models. We believed that this would lead to better results for our project, since these pre-trained networks are much deeper than anything we could develop ourselves, and would thus have much better feature-detection power. In researching existing networks, we couldn't find any that dealt directly with facial detection or recognition, so we chose to use networks with varying initial applications.

The first network we used was LeNet, which was trained on the MNIST data set. The MNIST data set is composed of hand-written digits, and the objective of the model is to classify each image as a digit. The second network we looked at was AlexNet, which was developed and trained for the ImageNet Challenge. This challenge seeks to classify images into one of 1000 categories, ranging from animals to beverages. Even though neither of these networks deals directly with faces, our hope is that the lower-level features learned by these networks can be transferred from these data-rich environments to our data-poor environment.

Since LeNet and AlexNet were trained with different intentions than our own, we needed to tweak the networks slightly. First, since our input images were neither color nor of the 227-by-227 dimension utilized by these networks, we had to change the input data layer and retrain the first convolutional layer to account for this. Second, since we are only predicting 7 classes rather than the 1000 originally used, we needed to retrain the final softmax layer. Because we have far fewer training images than these networks originally had, we also had to experiment with different learning-rate hyper-parameters in order to induce convergence, as the original hyper-parameters often diverged. We ultimately settled on a "fixed" learning rate policy, with a base learning rate of 0.001, a momentum of 0.9, and a weight decay of 0.0005. Note that even though a "fixed" learning rate policy is used, the SGD solver of Caffe still uses the momentum and weight decay to steadily reduce weight updates over time.

3.2.4 Real-Time Interface

In order to create the real-time interface, we needed to gather local images and run them through the CNN on AWS. To accomplish this, we utilized OpenCV to extract images from our laptop's webcam. The images were then pre-processed in the same manner as our data set: converted to grayscale, the facial component extracted, and re-scaled to 250-by-250. We chose to use 250-by-250 images because AWS would only allow us to send at most 65KB in a single file, so 250-by-250 images fit within this constraint.

On AWS, we had a server script that keeps our trained model loaded in memory and waits for an incoming file. When a file is received, a Gaussian filter is applied to the image and the mean image is subtracted, just as with the rest of the data set. The image is then augmented via mirroring and rotation, and the set of images is fed into the neural network. A prediction is produced for each image, and we select our prediction to be the most common class label among the images. If there is a tie, we select the class with the highest sum of class scores among the tied classes.
nal to us that it was done. We had our first script Confusion matrix
write the results to a file and have a second socket 149 0 2 1 0 0 1 0
that listens the file. We had another local script 1 61 0 0 0 0 1 0
that called the second AWS server after sending 4 0 5 0 0 0 0 0
images were done to retrieve the results and this 3 0 0 51 0 0 0 0
combinations of four scripts and two sockets cre- 0 0 0 0 27 0 0 0
ated a close to real-time interface. 1 0 0 0 0 82 0 0
0 0 0 0 0 0 33 0
4 Experiments and Results 1 0 0 0 0 0 0 80

4.1 Result Metrics Precision


.94 1.0 .71 .98 1.0 1.0 .97 1.0
For our project, we have two metrics we use to
evaluate the performance of our model. The first Recall
is the obvious quantitative results, such as pre- .97 .98 .56 .94 1.0 .99 1.0 .99
cision, recall, and accuracy, in which we exam-
ine the statistic success of our model in predicted
our labeled data set. The second success measure Dataset Acc.
we use is how well it is able to classify our live- Train 0.988
streamed images. Since there are no labels for Val 0.972
these images, only us knowing which expression Test 0.972
we are trying to portray, it is difficult to quantita-
tively examine these results without have to hand- From the above results, specifically the pre-
label a second data set (which we do in some in- cision and recall of class 2 (contempt), we can
stances). Furthermore, since neither of us are ac- see that this class is clearly performing the worst.
tors, our expressions could also be poor portrayals Upon inspecting the same statistics on both our
of the target emotions, but this is also a more real- training and validation set, we find similar results.
istic application of the system. In the below anal- In addition, our qualitative analysis also indicates
ysis, we describe our results with respect to both that the contempt class was causing issues, as most
of these metrics. images we sent it to predict were classified as con-
tempt, even though it is the minority class. On
4.2 8-class Prediction a sample of 20 images we sent to be predicted,
In our first round of experimentation, we used 4 were classified correctly, 3 were classified in-
the full 8-class CK+ data set including contempt. correctly, but not as contempt, and the remaining
Here, the classes are: 13 were all labeled as contempt. This revelation
is what drove us to investigate literature further,
and found that most emotion detection researchers
0: Neutral
tend to discard the contempt class. We followed
1: Anger
this example, and retrained our model excluding
2: Contempt
this class
3: Disgust
4: Fear
4.3 7-class Prediction
5: Happiness
6: Sadness After retraining out model, we achieved improved
7: Surprise results. Due to the fact that contempt made up
such a small portion of the training data, the
We used only LeNet for our initial exploration, changes in accuracy and precision aren’t very
and after tuning hyperparameters, achieved the much, but they are substantial when considering
following results on our test set: that only about 1% of the data set was altered.
Upon extracting our summary statistics, from the
model, and using the same class labels as above
(dropping contempt), we obtained the following
results:

Page 5 of 9
Confusion matrix it includes images for the remaining 6 emotions.
128 0 0 1 0 0 1 The inclusion of this data set resulted in the fol-
1 51 0 0 0 0 0 lowing:
3 0 57 0 0 0 0
0 0 0 34 0 0 0 Precision
0 0 0 0 78 0 0 0.98 0.95 0.88 0.97 0.99 0.75 1.0
0 0 0 0 0 32 0
Recall
1 0 0 0 0 0 100
0.97 0.98 0.92 0.91 0.93 1.0 0.92
Precision
0.96 1.0 1.0 0.97 1.0 1.0 0.99
Recall Dataset Acc.
0.98 0.98 0.95 1.0 1.0 1.0 0.99 Train .979
Val 0.969
Dataset Acc. Test 0.945
Train 1.0
Val 0.975 We can see that the JAFFE data set noticeably
Test 0.986 reduced our accuracies across all three splits. In
addition, the precision of class 5 (sadness) took a
It is difficult to directly compare the results from big hit, dropping from 1.0 to 0.75. In an attempt
the 7-class and 8-class case since the data was to better understand why this occurred, we looked
re-segmented when removing contempt. Despite at some of the JAFFE images that were labeled
this, one can clearly see that the precision, recall, as sad. We found that some of these images were
and accuracy for every class increased between rather poor or subtle examples of a sad expression,
the two runs. This indicates that excluding con- and could easily be confused with neutral by just
tempt not only improve performance by not mis- looking at them. Below is an example of an image
classifying contempt images, but also preventing that is labeled as sad, but was incorrectly classified
other images from being confused with contempt, as neutral by our model:
and were thus classified correctly. Our qualitative
results also improved as a result of this change,
and we were able to get more correct predictions.
In a sample of 20 webcam images we sent to the
network, 8 were classified correctly, and we had
100% accuracy on surprise images. An example
of a correctly classified image is below:

We decided to leave these bad images in the


data set because, even though they are poor exam-
ples of the target emotion, they are still valid ex-
pressions of them. By excluding them, we would
be hand-picking our data set to only train on ex-
aggerative expressions that would likely never be
present in the real world. Instead, we elected
4.4 JAFFE Predictions to simply add more data to our data set in the
Once we removed the contempt class from our hopes that it would both disambiguate some of the
data set, we were able to add the Japanese Female JAFFE images, as well as improve our overall per-
Facial Expression (JAFFE) data set as well, since formance.

Page 6 of 9
4.5 Custom Images network, the results were much better than with
In the final iteration of constructing our data set, our previous models. When portraying extreme
we added a total of 420 images equally split expressions of surprise or sadness, the model cor-
among the 7 classes. In addition to providing addi- rectly classified them with near perfect accuracy.
tional, unique images to the model, it also helped When making expressions that were less exagger-
to balance the class distribution, which is an is- ative, the model was not able to classify the im-
sue we had previously been unable to address. By ages very well, as one might expect. This is be-
adding these images into our data set, we were cause the key features that differentiate the classes
able to achieve results similar to the 7-class model are not readily apparent, so the model can not pre-
in nearly every category, which is significant given dict as well. One class that the model qualitatively
we retain the JAFFE data set which previously de- performs very poorly at is happiness. In our nu-
creased our performance. In addition, this was the merous attempts to elicit a prediction of happiness
first instance where AlexNet outperformed LeNet, from our model, we nearly always failed. Interest-
so below we show AlexNet’s results: ingly, when a friend of ours attempted to do the
same, she was able to consistently get predictions
Confusion matrix of happiness when we couldn’t. We are unsure
136 0 0 0 0 0 0
why this would occur, but it could be caused by
0 74 0 0 0 0 0
an underlying artifact of our data set, such as a
0 1 59 0 0 0 0
woman’s facial features being more highly associ-
1 1 1 40 0 0 1 ated with a happy expression.
0 0 0 0 117 0 0
In an attempt to quantify our relatively qualita-
0 0 0 1 1 54 0
tive results, we tried to classify 50 live-stream im-
1 0 0 1 0 0 112 ages. Of those we sent, the network correctly clas-
Precision sified 28, typically being those with the most ex-
0.99 0.97 0.98 0.95 0.99 1.0 0.99 aggerative expressions. As previously discussed,
Recall surprise and sadness performed the best, while
1.0 1.0 0.98 0.91 1.0 0.96 0.98 happiness performed the worst. In addition to
looking at the predicted class, we also analyzed
the class scores output for incorrectly classified
Dataset Acc. image. In comparing the class scores produced
Train 1.0 by our earlier models to those produced our fi-
Val 0.99 nal model, we noted a respectable increase in the
Test 0.985 score for the correct class. In every case we ex-
amine, the final model produced class scores such
that the correct class was either the second or third
AlexNet Loss
maximal score, while our previous models had no
such guarantee. This shows that even though we
couldn’t correctly classify the images, our predic-
tions were at least closer to being correct. In sum-
mary, we were able to achieve improved results on
our life-streamed images, but not nearly as well as
our quantitative results would indicate.

5 Conclusion
In this paper, a CNN-based emotion detection
model is proposed that utilizes facial-detection
software and cloud computing to accomplish its
In addition to quantitative improvements in our task. The final model resulted in accuracies com-
model, we also experienced qualitative improve- parable to the state-of-the-art papers in the field,
ments results as well. When observing the live- reaching as high as 98.5% accuracy on our cus-
stream of predictions being returned to us by our tom data set, and 97.2% on the original CK+ data

Page 7 of 9
set. Our code base can be fount at https:// Gesture Recognition and Workshops (FG 2011),
github.com/barisakis/cs231a_eai. In 2011 IEEE International Conference on, pages
addition, our model also exhibits more balanced 878–883. IEEE, 2011.
accuracy results across the emotion spectrum. Paul Ekman. An argument for basic emotions.
Lastly the proposed model still worked signifi- Cognition & emotion, 6(3-4):169–200, 1992.
cantly well with non-actor subjects, especially for
physically expressive emotions like sadness, hap- Paul Ekman, Wallace V Friesen, and Phoebe
piness and surprise. Ellsworth. Emotion in the human face: Guide-
One future area of work is to create a user in- lines for research and an integration of findings.
terface where users can iteratively train the model Elsevier, 2013.
through correcting false labels. This way the Nico H Frijda, Andrew Ortony, Joep Sonnemans,
model can also learn more from real world users and Gerald L Clore. The complexity of inten-
who express various emotions in different ways. sity: Issues concerning the structure of emotion
In addition, including a layer in the network that intensity. 1992.
accounts for class imbalance could provide addi- Md Nazrul Islam and Chu Kiong Loo. Geo-
tion improvements over our results. We attempted metric feature-based facial emotion recognition
implement latter of these, but were unable to get it using two-stage fuzzy reasoning model. In
working. Neural Information Processing, pages 344–351.
Another area of interest to explore is predicting Springer, 2014.
on a continuous scale the intensity of emotions be-
L. A. Jeni, J. M. Girard, J. F. Cohn, and F. De La
ing portrayed. We believe that we already have a
Torre. Continuous au intensity estimation us-
reliable recognition algorithm, so by incorporat-
ing localized, sparse facial feature space. In
ing knowledge from the social sciences on emo-
Automatic Face and Gesture Recognition (FG),
tion, a more powerful predictor could be built. In
2013 10th IEEE International Conference and
order develop and train such a predictor, however,
Workshops on, pages 1–7, April 2013. doi:
one would need an annotated data set with which
10.1109/FG.2013.6553808.
to work. One possible means for creating such a
data set would be to aggregate people’s opinions Bo-Kyeong Kim, Jihyeon Roh, Suh-Yeon Dong,
of an image using a service like Amazon Mechan- and Soo-Young Lee. Hierarchical committee of
ical Turk, and then average the responses together deep convolutional neural networks for robust
to produce an intensity measure for each emotion. facial expression recognition. Journal on Mul-
Even though many state of art algorithms are timodal User Interfaces, pages 1–17, 2016.
very good at detecting facial expressions, these Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Ja-
images are often exaggerative and un-realistic. As son Saragih, Zara Ambadar, and Iain Matthews.
such, we believe that there is need for better data The extended cohn-kanade dataset (ck+): A
sets aimed at understand ’real’ emotions. complete dataset for action unit and emotion-
specified expression. In Computer Vision
References
and Pattern Recognition Workshops (CVPRW),
S. W. Chew, P. Lucey, S. Lucey, J. Saragih, J. F. 2010 IEEE Computer Society Conference on,
Cohn, and S. Sridharan. Person-independent fa- pages 94–101. IEEE, 2010.
cial expression detection using constrained lo- Michael Lyons, Shota Akamatsu, Miyuki Ka-
cal models. In Automatic Face Gesture Recog- machi, and Jiro Gyoba. Coding facial expres-
nition and Workshops (FG 2011), 2011 IEEE sions with gabor wavelets. In Automatic Face
International Conference on, pages 915–920, and Gesture Recognition, 1998. Proceedings.
March 2011. doi: 10.1109/FG.2011.5771373. Third IEEE International Conference on, pages
Charles Darwin, Paul Ekman, and Phillip Prodger. 200–205. IEEE, 1998.
The expression of the emotions in man and ani- Caifeng Shan, Shaogang Gong, and Peter W
mals. Oxford University Press, USA, 1998. McOwan. Facial expression recognition based
Abhinav Dhall, Akshay Asthana, Roland Goecke, on local binary patterns: A comprehensive
and Tom Gedeon. Emotion recognition using study. Image and Vision Computing, 27(6):803–
phog and lpq features. In Automatic Face & 816, 2009.

Page 8 of 9
S. Velusamy, H. Kannan, B. Anand, A. Sharma,
and B. Navathe. A method to infer emotions
from facial action units. In 2011 IEEE In-
ternational Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 2028–
2031, May 2011. doi: 10.1109/ICASSP.2011.
5946910.
Anbang Yao, Junchao Shao, Ningning Ma, and
Yurong Chen. Capturing au-aware facial fea-
tures and their latent relations for emotion
recognition in the wild. In Proceedings of
the 2015 ACM on International Conference on
Multimodal Interaction, pages 451–458. ACM,
2015.
Zhiding Yu and Cha Zhang. Image based static fa-
cial expression recognition with multiple deep
network learning. In Proceedings of the 2015
ACM on International Conference on Multi-
modal Interaction, pages 435–442. ACM, 2015.

Page 9 of 9
End-to-end learning of motion, appearance and interaction cues for multi-target
tracking

Amir Sadeghian Khashayar Khosravi Alexandre Robicquet


Stanford University Stanford University Stanford University
amirabs@stanford.edu khosravi@stanford.edu arobicqu@stanford.edu

Abstract crowded scenes. We propose an online unified deep neural


network tracker that jointly learn to reason on a strong
The task of Multi-Object Tracking (MOT) largely con- appearance model, strong individual motion model, and
sists of locating multiple objects at each time frame, and object interactions (dynamic scene knowledge). In this
matching their identities in different frames yielding to a project we will not study the interaction model and only
set of object trajectories in a video frame. There are sev- focus on appearance and motion model.
eral cues used for representing the individuals in a crowded
scene. We demonstrate that in general, fusing appearance, Our strong appearance model is a Siamese convolutional
motion, and interaction cues together can enhance the level neural network (CNN) that is able to find occlusions and
of performance on the MOT task. In this paper we combined similarity of objects in different time frames in addition to
appearance, motion and interaction cues in one deep uni- object bounding box prediction in next time frame. We also
fied framework. An important contribution of this work is use two seperate Long Short-Term Memory (LSTM) model
a generic scalable MOT method that can fuse rich features for our motion prior and interactions model that tracks
from different dynamic or static models. the motion and trajectory of objects for longer forecasting
period (suited in presence of long-term occlusion). These
models extract appearance cues, motion priors, and inter-
1. Introduction active forces which are critical parts of the MOT problem.
We then integrate these parts into a coherent system using
Multiple Object Tracking (MOT), or Multiple Target a high-level LSTM that is responsible to reason jointly
Tracking (MTT), plays an important role in computer on different extracted cues. We show our model is able
vision and is a crucial problem in scene understanding. to fuse and use different data modalities and get a better
The objective of MOT is to produce trajectories of objects performance. The magic is this scalability, one can add
as they move around the image plane. MOT covers a another cue component (e.g. exclusion model) to the model
wide range of application such as pedestrians on the street and finetune the model to reason jointly on the new cue and
[26, 19] sports analysis (e.g. sport players in the court previous cues.
[14, 17, 24], bio tracking (birds [15], ants [9], fishes
[20, 21, 5], cells [16, 12], and etc), robot navigation, or
autonomous driving.). In crowded environments occlu- 2. Related Work
sions, noisy detections (false alarms, missing detections,
non-accurate bounding), and appearance variability (Same In recent years tracking has been successfully extended
person, different appearance or different people, same to scenarios with multiple objects [18, 11, 8, 23]. Differ-
appearance) are very common. As a result, multi object ent from single object tracking approaches which have been
tracking has become challenging task in computer vision. constructing a sophisticated appearance model to track sin-
gle object in different frames, multiple object tracking does
Recent works have proven that tracking objects jointly not mainly focus on appearance model. Although appear-
and taking into consideration their interaction in addition ance is an important cue but in crowded scenes relying only
to their appearance can give much better results in crowded on appearance can lead to a less accurate MOT system. To
scenes. The focus of this paper is to marry the concepts of this end, different works have been improving only the ap-
appearance model, object motion, and object interactions pearance model [6, 3], some works have been combining
to obtain a robust and scalable tracker than works in the dynamics and interaction between targets with the tar-

1
get appearance. 3. Multi Object Tracking Framework
As shown in Figure 1, MOT involves three primary com-
2.1. Appearance model ponents. Our model includes modeling of appearance, mo-
tion, and interaction. These components will be described
Technically, appearance model is closely related to vi- in more details.
sual representation features of objects. Depending on how
MOT
precise and rich the visual features are, they are grouped Components
into three sets of single cue, multiple cues, and deep cue.
Because of efficiency and simplicity single cue appearance
model is widely used in MOTs. Many of single cue mod-
els are based on raw pixel template representation for sim- Appearance Motion Interaction
plicity [25, 2, 22, 19], while color histogram is the most
popular representation for appearance modeling in MOT
Figure 1. MOT components
approaches [4, 11, 28]. Other single cue approaches are
using covariance matrix representation, Pixel comparison
representation, or SIFT like features. The multi cues ap- 3.1. Appearance
proaches combines different kinds of cues to make a more
rebust appearance model. The final appearance cue used in In this section, we now describe the appearance model
tracking is the deep visual reperesentation of objects. These that we integrate into our framework for multi-object track-
high-level features are extracted by deep neural networks ing. As we recall, our problem is fundamentally based on
mostly convolutional neural networks trained for a specific addressing the challenge of data association: that is, given a
task [7]. Our model shares some characteristics with [7], set of targets Tt at time step t, and a set of candidate detec-
but differs in two crucial ways: first, we are learning to han- tions Dt+1 at timestep t + 1, we would like to compute all
dle occlusion and solve the re-identification task in addition of the valid pairings that exist between members of Tt and
to David’s work that is bounding box regression only. We Dt+1 .
output the similarity score (same object or not) and bound- The idea underlying our appearance model is that we can
ing box. Second, there are differences in the overall archi- compute the similarity score between a target and candi-
tecture, e.g. the number of fully connected layers on top date detection based on purely visual cues. More specif-
of two networks for fusing, loss function, inputs and out- ically, we can treat this problem as a specific instance of
puts and hence the training and testing procedure is different re-identification, where the goal is to take pairs of bound-
since we want to address re-identification as well as bound- ing boxes and determine if their content corresponds to the
ing box to help tracking. same person. We thus desire our appearance model to rec-
ognize the subtle similarities between input pairs, as well as
be robust to occlusions and other visual disturbances.
2.2. Motion model To approach this problem, we construct a Siamese Con-
volutional Neural Network (CNN), whose structure is de-
Object motion model describes how an object moves. picted in Figure 2. Let BBi and BBj represent the two
Motion cue is very important for multiple object tracking bounding boxes we wish to compare – in our case, BBi
since knowing the potential position of objects in the fu- might be a target bounding box at frame t, and BBj would
ture frames will reducing search space and help the appear- be a candidate detection at frame t+1. We first crop the im-
ance model on better detectation of similar objects. Popular ages containing BBi and BBj to contain only the bound-
motion models used in multiple object tracking are divided ing boxes themselves, while also ensuring that we include
into linear motion models and Non-linear motion models. some amount of the surrounding image context. The net-
As the name ”linear motion” indicates objects following the work then accepts the raw content within each bounding
linear motion model move with constant velocity. This sim- box and passes it through its layers until it finally produces
ple motion model is the is the most popular model in MOT a 500-dimensional feature vector for each of the two inputs.
[3]. There are many cases that linear motion models can not Let φi and φj thus be the final hidden activations ex-
deal with, in this cases non-linear motion models are pro- tracted by our network for bounding boxes BBi and BBj .
posed to produce a more accurate motion model for objects In order to compute the similarity, we then simply con-
[27]. We present a new Long Short-Term Memory (LSTM) catenate the two vectors to get a 1000-dimensional vector
model which jointly reasons based on the past movements φ = φi ||φj , and pass this as input to a final fully-connected
of an object and predicts the future trajectorys of that object layer. We lastly apply a Softmax classifier, which outputs
[1]. the probabilities for the positive and negative classes, where

2
task of data association, in which we can match members
of Tt and Dt+1 based on which detections are closest to the
motion prior’s next predicted location for each target.
To thus incorporate this information, we construct
a Long Short-Term Memory (LSTM) network over the
3D velocities of each target. More concretely, let
(xi0 , y0i , z0i ), (xi1 , y1i , z1i ), . . . (xit , yti , zti ) represent the 3d tra-
jectory of the i-th target from the timestep 0 through
timestep t. Assuming a point (xit+1 , yt+1 i i
, zt+1 ), we want to
see whether this point belongs to the trajectory of i-th target.
Figure 2. Our appearance model
Let use define the velocity of target i at the j-th timestep i to
be vji = (vxij , vyji , vzji ) = (xij −xij−1 , yji −yj−1 i
, zji −zj−1
i
).
positive indicates that the inputs match, and negative indi- This can be done by assigning an score to this point and see-
cates otherwise. ing whether it is large enough or not. For this purpose, we
The actual network structure we use for this challenge train our LSTM to accept as inputs the velocities of a single
consists of the 16-layer VGG net, which won the ImageNet target for timesteps 1, . . . , t and produces H-dimensional
2014 localization challenge. In our case, we begin with outputs. We also pass the t + 1 velocity vector (which
the pre-trained weights of this network, but remove the last we wish to determine whether it corresponds to a true tra-
fully-connected layer so that the network now outputs a jectory or not) from a fully-connector layer that brings it
500-dimensional vector. to H-dimensional vector space. The last LSTM output is
We then fine-tune this network by training the overall then concatenated with this vector and the result is passed
network on positive and negative samples extracted from to another fully connector layer which brings the 2H di-
our training sequences. For positive pairs, we use instances mensional vector to the space of k features. Finally, another
of the same target that occur in different frames. For neg- fully connector layer, reduces the dimension to 2 which will
ative examples, we use pairs of different targets that may be used as the 0/1 classification problem during the train-
span across all frames. ing.
We trained this model on MOT3D dataset which con-
tains 2 scenes with more than 950 frames that contain more
than 5500 objects. We extracted more than 100k of pos-
itive and negative samples. We did the training on one
scene and validated on the other scene. The result was
84 percent accuracy on the binary classification problem of
positive/negative pairs. We used CUHK03 dataset [13] as
the sanity check for our prediction. This dataset contains
13164 images of 1360 pedestrians and contains 150k pairs.
FPNN method which got rank 1 of identification MAP rate
were able to achieve 19.89 percent accuracy. Our method
achieves 18.61 percent of accuracy and outperforms several
other methods such as LDM, KISSME, SDALF.

3.2. Motion
The second component of our overall framework is the
inclusion of an independent motion prior for each target.
The intuition is that the previous movements for a particular
target can strongly influence what position a target is likely
to be at during a future time frame. Figure 3. Our 3D motion prior model
Additionally, a nuanced motion prior can help our model
when tracking objects that are occluded or lost, since it pro- Note that training occurs from scratch, and weights are
vides a heuristic as to where these objects might generally shared across all targets. Once we train the network, then
be located. Thus, formulating a sophisticated model for the given a query target i at timestep t0 , the LSTM will output
motion prior of a target will be valuable in achieving robust a predicted velocity vti0 +1 . We can then simply add the ve-
performance during tracking. locity to the query target’s position at t0 in order to compute
We can therefore use this information to aid us in the the motion prior’s predicted position for frame t0 + 1. That

3
is, probabilities of whether the t and d correspond to the same
object.
(xit0 +1 , yti0 +1 , zti0 +1 ) = (xit0 +vxit0 +1 , yti0 +vyti0 +1 , zti0 +zti0 +1 ) The inputs to the LSTM are feature vectors that we ex-
tract from our individual models. Let φA represent the hid-
We therefore obtain the predicted position from the motion den activations extracted from our appearance model before
prior, and can use this to filter out candidate detections that the final fully connected layer of the network, where we in-
are not sufficiently close to the prior. put the bounding boxes surrounding target i and detection
For training this model, we used MOT3D dataset, which d. Let φM j be the hidden state of the Motion Prior LSTM
only consists of true trajectories. We considered trajecto- extracted at timestep j, and likewise let φIj be the hidden
ries of length t + 1 = 7 and we assumed H = 128. For feature vector of the Interaction model extracted at timestep
each true trajectory, we changed the last element of it by a j. Then, the input to our integrator is given by
randomly chosen object among all other objects that exist at
the same frame. By doing this we were able to reach to the φj = φA ||φM j ||φIj
same number of invalid trajectories as the valid trajectories
(it is not good to have unbalanced distributions for train- where we thus concatenate the individual feature vectors
ing). After training this model, we were able to achieve the output by the modules. Therefore, when we set up the
accuracy of 95 percent for the 0/1 classification problem. model we use these features as inputs to the LSTM and
train it to output either a positive or negative label for each
3.3. Integration timestep (indicating whether there is a valid match) using a
standard Softmax classifier and cross-entropy loss.
Given these three components of our framework for
An important point to note is that we train this LSTM
Multi-object Tracking, we now describe the method by
without fine-tuning the weights of the individual compo-
which we integrate these parts into a coherent system. To
nents of the framework, which are each in fact trained sep-
recall, we have identified appearance cues, motion priors,
arately. The overall model, composed by the previous com-
and interactive forces as critical parts of the MOT problem.
ponents is illustrated by figure 5 and the output of the model
We believe a sophisticated framework should merge these
is a similarity score which is used as a weight for the edges
pieces together in an elegant way. You can find the graphi-
of matching graph for matching the detections between time
cal model of our approach in figure 4. Each human has an
frames.
appearance edge and motion edge, and between every pairs
of humans there is an interaction edge. Similarity Score

H Occluded
0.23
A . . . ]
[
0.95
Appearance Edges
A M I
Human Node
h1 Interaction Edges Motion Interaction
0.12
0.23
Motion Edges A
Appearance
Feature Feature Feature
False alarms
Extractor Extractor Extractor T
T+1

M
A hn
Figure 5. Our overall model

h2 M For training this model, we used the pretrained compo-


nents described in previous sections and fine-tune the whole
M model end to end using MOT3D dataset.

4. Experiments
Figure 4. The graphical model of our approach
In this section, we now describe our various experiments
Our overarching model is a Long Short-Term Memory and results, and then later peform a qualitative analysis on
network which we construct over the already pre-trained model’s performance.
appearance, motion, and interaction modules. This LSTM
4.1. Baselines
is trained to perform the task of data association: once
again, suppose we are at timestep t and wish to deter- We first discuss the various baselines that we use to es-
mine whether target i is matched to a detection d found tablish a standard for comparison against our more nuanced
in timestep t + 1. We then train the LSTM to output the model.

4
• Markov Decision Process Tracker people into the 3D world. It consists of two publicly avail-
able datasets: a crowded town center, and the well known
In [23], authors demonstrated success on 2D Multi-
PETS2009 dataset.
object tracking by formulating the tracking problem as
a Markov Decision Process (MDP). They represented 4.3. Results
every target as being in either an active, tracked,
lost, or inactive states, and learned the appropriate The accuracy and results of each component of our sys-
transition probabilities and rewards based on extracted tem is described at each of the experimental sections. Here
features. In order to evaluate this method on the 3D we see the final results of the tracker in table 6 for results of
challenge, we project the bottom-midpoint of the our tracker compared to other baselines on MOT3D chal-
predicted 2D bounding boxes to the ground plane lenge. The last 3 rows are our cross validation on MOT
(using the provided calibration parameters given in the challenge training set.
data sequences). Tracker MOTA (H) MOTP (H) MT (H) ML (L)
DBN (State of art) - 1st 51.1 61.0 28.7% 17.9%
KalmanSFM (Baseline) - 5th 25.0 53.6 6.7% 14.6%

• MDP Tracker with Linear Motion Prior Yu’s 3D 45.5 61.0 6% 3%


Appearance only (Cross validation) 38.1 54.1 15% 20%
Though the MDP described above can obtain reason- LSTM only (Cross validation) 28.9 48.3 9% 28%
Appearance and LSTM (Cross validation) 40.3 57.1 16% 19%
able results on the problem of multi-object tracking, Appearance and LSTM (MOT Challenge) 28.3 51.7 29.1% 21.8%
we additionally incorporate a simple linear 3D motion
prior into the feature vectors associated with each state
in the MDP. More specifically, we use the normalized Figure 6. Primary Results on MOT3D challenge
distance between a candidate detection and the motion
prior prediction as a feature in that state.
5. Conclusion
This paper proposes a deep neural network designed for
• MDP Tracker with LSTM Motion Prior multi object tracking. Quantitative results show that the
As a final baseline experiment, we realize that incorpo- tracking performance is superior to the baseline tracking
rating a simple linear motion prior may be too simplis- methods. Our tracker can also be fine-tuned for various
tic of an approach to accurately model the movements applications by providing more training videos of certain
of a target. A more reasonable method is to use an types of objects. Overall, our real-time neural network
LSTM similar to our own motion model to output the tracker opens up many possibilities for different applica-
predicted 3D coordinates for every target, and then use tions and extensions, allowing us to learn from several cues
these values in the feature vectors as described above. used for representing the individuals in a crowded scene.
We demonstrate that in general, fusing appearance and mo-
We report the results using the proposed method on the tion cues together can enhance the level of performance on
3D MOT 2015 Benchmark which includes the PETS09- the MOT task. We show our model is able to fuse and use
S2L21 and the AVG-TownCentre2 sequences. The sensi- different data modalities and get a better performance. One
tivity of the method to the omission of single variables is of the main advantage of our tracker to others is the scala-
evaluated on the PETS09-S2L1 dataset (available for train- bility, one can add another cue component (e.g. exclusion
ing in the 3D MOT 2015 Benchmark). The corresponding model) to the model and finetune the model to reason jointly
results of an evaluation in 3D image space (correct detection on the new cue and previous cues.
requires at least 50% intersection-over-union score with the
reference) and in 3D world coordinates (correct detection References
requires at most 1m offset in position) are reported in fol- [1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei,
lowing section, respectively. and S. Savarese. Social lstm: Human trajectory prediction in
crowded spaces.
4.2. Datasets [2] S. Ali and M. Shah. Floor fields for tracking in high density
crowd scenes. In Computer Vision–ECCV 2008, pages 1–14.
We test our tracking framework on the Multiple Ob- Springer, 2008.
ject Tracking Benchmark [10] for people tracking. The [3] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier,
MOT Benchmark collects widely used video sequences in and L. Van Gool. Robust tracking-by-detection using a de-
the MOT community and some new challenging sequences. tector confidence particle filter. In Computer Vision, 2009
We evaluate the proposed algorithm on MOT3D challenge IEEE 12th International Conference on, pages 1515–1522.
which provides the 3D coordinate of position of the feet of IEEE, 2009.

5
[4] W. Choi and S. Savarese. Multiple target tracking in world [19] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You’ll
coordinate with single, minimally calibrated camera. In never walk alone: Modeling social behavior for multi-target
Computer Vision–ECCV 2010, pages 553–567. Springer, tracking. In Computer Vision, 2009 IEEE 12th International
2010. Conference on, pages 261–268. IEEE, 2009.
[5] E. Fontaine, A. H. Barr, and J. W. Burdick. Model-based [20] C. Spampinato, Y.-H. Chen-Burger, G. Nadarajan, and R. B.
tracking of multiple worms and fish. In ICCV Workshop on Fisher. Detecting, tracking and counting fish in low quality
Dynamical Vision. Citeseer, 2007. unconstrained underwater videos. VISAPP (2), 2008:514–
[6] H. Grabner, M. Grabner, and H. Bischof. Real-time tracking 519, 2008.
via on-line boosting. In BMVC, volume 1, page 6, 2006. [21] C. Spampinato, S. Palazzo, D. Giordano, I. Kavasidis, F.-P.
[7] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 Lin, and Y.-T. Lin. Covariance based fish tracking in real-
FPS with deep regression networks. CoRR, abs/1604.01802, life underwater environment. In VISAPP (2), pages 409–414,
2016. 2012.
[22] Z. Wu, A. Thangali, S. Sclaroff, and M. Betke. Coupling
[8] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by
detection and data association for multiple object tracking.
hierarchical association of detection responses. In Computer
In Computer Vision and Pattern Recognition (CVPR), 2012
Vision–ECCV 2008, pages 788–801. Springer, 2008.
IEEE Conference on, pages 1948–1955. IEEE, 2012.
[9] Z. Khan, T. Balch, and F. Dellaert. An mcmc-based particle
[23] Y. Xiang, A. Alahi, and S. Savarese. Learning to track: On-
filter for tracking multiple interacting targets. In Computer
line multi-object tracking by decision making. In Interna-
Vision-ECCV 2004, pages 279–290. Springer, 2004.
tional Conference on Computer Vision (ICCV), pages 4705–
[10] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler. 4713, 2015.
MOTChallenge 2015: Towards a benchmark for multi-
[24] J. Xing, H. Ai, L. Liu, and S. Lao. Multiple player tracking in
target tracking. arXiv:1504.01942 [cs], Apr. 2015. arXiv:
sports video: A dual-mode two-way bayesian inference ap-
1504.01942.
proach with progressive observation modeling. Image Pro-
[11] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool. Cou- cessing, IEEE Transactions on, 20(6):1652–1667, 2011.
pled object detection and tracking from static cameras and [25] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg. Who
moving vehicles. Pattern Analysis and Machine Intelligence, are you with and where are you going? In Computer Vision
IEEE Transactions on, 30(10):1683–1698, 2008. and Pattern Recognition (CVPR), 2011 IEEE Conference on,
[12] K. Li, E. D. Miller, M. Chen, T. Kanade, L. E. Weiss, and pages 1345–1352. IEEE, 2011.
P. G. Campbell. Cell population tracking and lineage con- [26] B. Yang, C. Huang, and R. Nevatia. Learning affinities and
struction with spatiotemporal context. Medical image anal- dependencies for multi-target tracking using a crf model.
ysis, 12(5):546–566, 2008. In Computer Vision and Pattern Recognition (CVPR), 2011
[13] W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter IEEE Conference on, pages 1233–1240. IEEE, 2011.
pairing neural network for person re-identification. In Pro- [27] B. Yang and R. Nevatia. Multi-target tracking by online
ceedings of the IEEE Conference on Computer Vision and learning of non-linear motion patterns and robust appear-
Pattern Recognition, pages 152–159, 2014. ance models. In Computer Vision and Pattern Recogni-
[14] C.-W. Lu, C.-Y. Lin, C.-Y. Hsu, M.-F. Weng, L.-W. Kang, tion (CVPR), 2012 IEEE Conference on, pages 1918–1925.
and H.-Y. M. Liao. Identification and tracking of players IEEE, 2012.
in sport videos. In Proceedings of the Fifth International [28] A. R. Zamir, A. Dehghan, and M. Shah. Gmcp-tracker:
Conference on Internet Multimedia Computing and Service, Global multi-object tracking using generalized minimum
pages 113–116. ACM, 2013. clique graphs. In Computer Vision–ECCV 2012, pages 343–
[15] W. Luo, T.-K. Kim, B. Stenger, X. Zhao, and R. Cipolla. Bi- 356. Springer, 2012.
label propagation for generic multiple object tracking. In
Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 1290–1297, 2014.
[16] E. Meijering, O. Dzyubachyk, I. Smal, and W. A. van Cap-
pellen. Tracking in cell and developmental biology. In Semi-
nars in cell & developmental biology, volume 20, pages 894–
902. Elsevier, 2009.
[17] P. Nillius, J. Sullivan, and S. Carlsson. Multi-target tracking-
linking identities using bayesian network inference. In Com-
puter Vision and Pattern Recognition, 2006 IEEE Computer
Society Conference on, volume 2, pages 2187–2194. IEEE,
2006.
[18] K. Okuma, A. Taleghani, N. De Freitas, J. J. Little, and D. G.
Lowe. A boosted particle filter: Multitarget detection and
tracking. In Computer Vision-ECCV 2004, pages 28–39.
Springer, 2004.

6
Near-Eye Display Gaze Tracking via Convolutional Neural Networks

Robert Konrad Shikhar Shrestha Paroma Varma


rkkonrad@stanford.edu shikhars@stanford.edu paroma@stanford.edu

Abstract area, and might be subject to occlusions and only partial


views of the pupil and iris. Currently, one company, SMI,
Virtual reality and augmented reality systems are cur- offers to hack a an Oculus DK2 for a large cost (roughly ten
rently entering the market and attempt to mimic as many thousand dollars). We seek a more elegant and cost effective
naturally occurring stimuli in order to give a sense of im- solution for providing the benefits of gaze tracking for near
mersion. Many aspects such as orientation and positional eye displays.
tracking have already been reliably implemented, but an im- Gaze tracking not only provides a new intuitive interface
portant missing piece is eye tracking. VR and AR systems to near-eye displays, but also allows for important tech-
equipped with eye tracking could provide a more natural niques increasing immersion and comfort in foveated ren-
interface with the virtual environment, as well as opening dering and gaze contingent focus.
the possibility for foveated rendering and gaze-contingent
focus. In this work, we approach an eye tracking solu-
tion specifically designed for near-eye displays via con- Applications Depth of field, the range of distances
volutional neural networks, that is robust against light- around the plane of focus that still appear sharp, is a cue
ing changes and occlusions that might be introduced when that humans use to detect depth of objects in a scene. Given
placing a camera inside of a near-eye display. We create a user’s gaze position on the screen, a depth of field can be
a new dense eye tracking dataset specifically intended for rendered into a scene, creating an adaptive depth of field
neural networks to train on. We present the dataset as well system. As the gaze moves around the scene, different ob-
as report initial results using this method. jects come into focus and other objects fall out of focus, cre-
ating a more realistic experience than an all-in-focus image
that current VR and AR displays present. Such as system
1. Introduction has been shown to reduce image fusion times in stereo sys-
tem in [8], as well as increasing subjective viewing experi-
Immersive visual and experiential computing systems ences [4, 9]. In a first person environment, rendering a depth
are entering the consumer market and have the potential of field effect showed an increased sense of immersion [5].
to profoundly impact our society. Applications of these Because the area of real-time depth of field rendering is well
systems range from entertainment, education, collaborative studied [3, 18, 17], an adaptive depth of focus system could
work, simulation and training to telesurgery, phobia treat- be immediately created with the implementation of a robust
ment, and basic vision research. In every immersive experi- gaze tracking system suitable for near eye displays.
ence, the primary interface between the user and the digital A second potential use of gaze tracking in near-eye
world is the near-eye display. Thus, developing near-eye displays is gaze contingent focus, which improves com-
display systems that provide a high-quality user experience fort of VR and AR displays by reducing the vergence-
is of the utmost importance. Many characteristics of near- accommodation conflict. When we look at objects our eyes
eye displays that define the quality of an experience, such as perform two tasks simultaneously when looking at an ob-
resolution, refresh rate, contrast, field of view, orientation ject: vergence and accommodation. Vergence refers to the
and positional tracking, have been significantly improved oculumotor cue created by the muscles in our eyes rotat-
over the last years. However, an important missing inter- ing our eyeballs so that they verge to the point we look at.
face is gaze tracking. Accommodation is the oculomotor cue created by the mus-
Although the area of gaze tracking has been studied for cles in our eyes bending the lenses in our eyes such that
decades, they have been studied in the context of tracking the object of interest comes into focus. In normal, real-
a user’s gaze on a screen placed a distance away from the world, conditions our eyes verge and accommodate to the
eyes. Many such techniques are not suitable for near-eye same distance. However, in near-eye stereoscopic display,
displays where the imaging of the eye is restricted to a small the user the user is able to verge at any distance in the

1
scene but is forced to focus to a fixed distance. This dis-
tance is a function of the distance between the lenses and
the display in the head-mounted display, as well as the fo-
cal length of the lenses themselves. This forces a mismatch
between vergence (able to verge anywhere) and accommo-
dation (only able to accommodate to one distance), known
as the vergence accommodation conflict. When exposed to
such a conflict for extended periods of time, users develop
Figure 1. Table of existing gaze tracking datasets, which are
symptoms of headache, eye strain, and, in extreme cases,
mostly tailored towards model assumed methods.
nausea[13].A system capable of determining the distance
to which the user is verged at, through gaze tracking, can
either use focus-tunable optics or an actuated display to re- where (sx , sy ) are the screen coordinates, and (x, y) are
duce the vergence accommodation conflict by changing the the pupil-glint vector components. A calibration procedure
distance to which users focus to, as explained in [7]. is performed to estimate the unknown variables a0 , a1 , ...b5
via least squares analysis by asking a user to looking at (at
2. Related Work least) 9 calibration targets. The accuracy of the best IR-
based methods fall somewhere between 1◦ to 1.5◦ accuracy.
A variety of remote eye gaze tracking (REGT) algo-
rithms have been reported in literature over the past couple
View-Based REGT In view-based REGT, only intensity
of decades. For our purposes, the general body of knowl-
images from traditional cameras are used without any ad-
edge can be divided into two categories: ones which assume
ditional hardware. These techniques rely more on image
a model of the eye and ones which learns an eye tracking
processing techniques to extract features from the eyes di-
model.
rectly, which can then be mapped to 2D pixel coordinates.
2.1. Model Assumed REGT Tan et. al [16] uses an image as a point in high dimensional
space and through an appearance-manifold technique is able
Methods assuming a model of the eye generally extract to achieve a reported accuracy of 0.38◦ . Zhu and Yang [19]
features from an image of the eye and map them to a point proposed a method for feature extraction from intensity im-
on the display. This type of work generally either uses in- ages and using a linear mapping function are able to achieve
tensity images captured from a traditional camera as seen in a reported accuracy of 1.4◦ .
[19], or uses illumination from an infrared light source and
captures the eye with an IR camera. 2.2. Model Learned REGT
Model learned REGT techniques use some sort of ma-
Infrared-Based REGT IR illumination creates two ef- chine learning algorithm to learn an eye tracking model
fects on the eye. Firstly, it creates the so-called bright-eye from training data consisting of input/output pairs, i.e. 2D
effect, similar to red-eye in photography, which results from coordinates of points on the screen and images of the eyes.
the light “lecting off of the retina. The second effect, a glint In the work by Baluja and Pomerleau [2], an artificial
on the surface of the eye, is caused by light reflecting off the neural network (ANN) was trained to model the relation-
corneal surface, creating a small bright point in the image. ship between 15x40 pixel images of a single eye and their
This glint is often used as a reference point, because if we corresponding 2D coordinates of the observed point on the
assume that the eye is spherical, it does not move as the eye screen. In their calibration procedure the user was told to
rotates in its socket. look at a cursor moving on the screen along a path made of
two thousand positions. They reported a best accuracy of
After grabbing an image of the eye, the glint and pupil an
1.5◦ .
be extracted via image processing algorithms described in
Similar work by Xu et. al. [12] was presented, but in-
[6]. A glint pupil vector can be calculated, and mapped to a
stead of using raw image values as inputs, they segmented
2-D position on the screen via some mapping function. Al-
out the eye and performed histogram equalization in order
though many have been proposed, [20, 19], the most com-
to boost the contrast between eye features. They used three
monly used function is the 2nd order polynomial defined in
thousand points for calibration and reported an accuracy of
[10], defined as :
around 1.5◦ .
2.3. Our Approach
sx = a0 + a1 x + a2 y + a3 xy + a4 x2 + a5 y 2
A key motivation behind this work is that gaze tracking
sy = b0 + b1 x + b2 y + b3 xy + b4 x2 + b5 y 2 systems lack the required robustness for commercial ap-

2
Figure 2. Image of setup, sample captures, and path of point on screen. The left image depicts our setup comprising of an LCD monitor,
webcam, and chin rest to keep the head roughly stable. The upper right images show samples images from what the webcam captures. The
bottom right image displays a part of the path that the moving target follows during its trajectory.

plications. High variability in the appearance of the pupil Instead of using one of the above datasets, we decided
and lack of a clear view cause accuracy issues with detec- to create our own dataset suitable for neural network train-
tor based tracking systems. Deep learning performs well ing. Our strongest criteria was to have a large number of
on problems with high intra-class variability. The intuition calibration points densely sampling the entire screen, with
behind this work was to use a CNN model instead of the corresponding images of the eye. With such training data,
parametric calibration since it would probably be more ro- we would expect the CNN to learn the fine differences be-
bust to variations in skin/eye color, HMD fixation etc. tween points on the screen.
In particular the key contributions of this work are: In our setup, as seen in Figure 2, we placed a user with
his/her head resting on a chin rest 51 cm away from a 1080p
• Introduction and implementation of an end-to-end
24 inch monitor. A webcam was placed very close to the
CNN based approach to gaze tracking
chin rest imitating what a camera placed inside of a near-
• Creation of a new gaze tracking dataset with a near-eye eye display would see.
camera covering five different subjects and dense po- Asking a user to fixate on a series of targets is infeasi-
sition sampling of the screen based on smooth pursuit ble for the large number of calibration points we wanted
to collect. Instead we used the fact that humans are able
• Performance evaluation of the chosen method, conclu- to track moving objects well, up to some angular velocity.
sions and suggestions for improvement that can be in- This is the notion of smooth pursuit. We moved a point
corporated in future work about a screen at 7.5◦ /s in a winding pattern from left to
right, top to bottom, as seen in Figure 2 and were able to col-
3. Dataset lect 7316 calibration points during a single 4 minute sitting.
Many eye-gaze tracking datasets exist [14], as seen in At this angular velocity the eye is able to smoothly track the
Figure 1, however they are mostly tailored for model as- point as it moves about without sacades (which occur when
sumed REGT systems. The majority of the datasets use the eye attempts to ’catch up’ when a point is moving very
target based calibration with large spacing between targets. quickly). We chose this particular angular velocity based on
Neural networks are not able to train on such few, and [11], which introduced the concept of a pursuit based cali-
sparsely, sampled points and learn a good relationship be- bration and found that points moving about between 3.9◦ /s
tween image data and pixel coordinates. The few datasets and 7.6◦ /s resulted in best accuracy.
that used continuous targets also allowed for continuous In order to achieve a smoothly moving point, we dis-
head movement, which is not a good representation for a played four points per angular degree, giving us a total of
near-eye display where the display is strapped to the user’s 7316 points. We captured webcam video frames at approxi-
head (with the camera rigidly fixed inside of it). mately 30 fps, which roughly corresponded to one frame per

3
1/4 angular degree point shift. Because our goal was around 5. CNN Implementation
1◦ of accuracy, we binned the calibration points into 1◦ bins.
For example, points displayed between 0.5◦ and 1.5◦ would 5.1. Caffe
be considered the same as 1◦ . We found that with a 4x re- The CNN implementation was performed in both Caffe
duction in classes, and a corresponding 4x increase in points and Tensorflow for comparison. Caffe has LeNet in the
per class, the CNN was able to better learn. model repository implemented for MNIST digit classifica-
The cropped and downsampled captured dataset can be tion. The .prototxt file was reconfigured to point to the
found at [1] lmdb file for our dataset. The image files and binary with
class labels are converted to the Caffe compatible lmdb file
4. CNN Learning Approach and then training is performed. The figure below shows the
setup workflow for the Caffe implementation.
In this work we explore the use of Convolutional Neural
Networks (CNN) for gaze tracking in an end-to-end fash-
ion. Traditional approaches like the one mentioned in the
previous section rely on hand-engineered feature detectors
and then use a parametric model to track the gaze direction
of a user.
Recently CNNs have outperformed traditional feature
engineering-based computer vision methods in a variety of
tasks. This work explores their use for gaze tracking. The
key benefit is that this approach is fully data driven. We
train the CNN model to take images of the users eye (taken
from a camera very close to their face) as input and estimate
the gaze direction in terms of x and y pixel coordinates on
the screen.

4.1. Description of CNN


CNNs are a class of artificial neural network algorithms
with multiple hidden layers that are built using convolu-
tional layers. In this work we treat the gaze tracking prob-
lem as a multi-class classification problem, where each
class is a specific point on the screen, and use a simple CNN
(i.e. LeNet) to learn the mapping from images to gaze posi- Figure 4. Caffe Model Architecture
tion.
LeNet consists of two convolutional layers and two pool- As the results were not very promising and training was
ing layers followed by a fully connected layer at the end. suspending abruptly, we later focused on the TensorFlow
For our implementation, we modify the fully connected implementation to complete the project.
(FC) layer to the desired number of output classes for the
gaze tracking problem. 5.2. TensorFlow
Tensorflow also consists of the LeNet model as an ex-
ample trained for MNIST. The example was reconfigured
by changing the number of classes to train on, size of con-
volutional kernels etc. to work with the CAVE and captured
gaze tracking dataset. The figure below shows a visualiza-
tion of the TensorFlow model.
Some of the parameters that were explored were the size
of the convolutional kernels and number of convolutional
layers. Unfortunately, due to limited computing power, it
remained difficult to add too many layers to the network.

6. Data Organization
Figure 3. LeNet Architecture Due to the large size of the data captured and the limited
computational power available, a few steps were taken to

4
Figure 6. Augmented Images

4. Save images and respective labels

The size 28x28 matched the dimensions of the MNIST


dataset. More importantly, we found that downsampling to
a slightly large size, like 50x50, did not necessarily lead to a
better classification accuracy. Similar results were found in
[] where a larger image did not mean better neural network
performance.
Cropping out the eye region was done differently for the
CAVE and captured dataset. For the CAVE dataset, a very
Figure 5. TensorFlow Model Architecture basic estimation of the bounding box around both eyes was
formed. The same bounding box was applied to all images
for a particular pose. This was an error-prone methodology
make the data manageable while not compromising the data
but did better than using eye recognition software, which of-
available for the CNN.
ten failed on subjects with glasses. For the captured dataset,
6.1. Dataset Augmentation we manually selected the region around the eyes per sub-
ject and then applied the same crop region to the rest of the
The dataset consisting of cropped images of the eye and images for that particular subject. This manual processing
their respective pixel coordinates for the gaze was aug- was also not perfect, since subjects moved slightly between
mented before training the CNN. The augmentation con- frames.
sisted of adding different types of noise (Poisson, speckle, The labels were created using one-hot vectors - the
Gaussian etc.) to make the model more robust to variations length of the vector was equal to the number of classes per
in the image during inference. The augmentation extends experiment and the index of the class the image belonged
the dataset which is very useful as many samples are needed to was a 1 while the rest of the entries were 0. The images
to fully train the CNN model. The augmented dataset is then and the label vector pairs were converted to Tensors/lmdb
spit into training, test and cross-validation sets depending format for ease of processing.
on the requirements.
6.2. Image Preprocessing 7. Results
The pipeline to feed both the CAVE and the captured The CNN was initially tested with the CAVE dataset to
dataset were as follows: ensure the model learned something. Next, the same model
was applied to the captured dataset, where the images were
1. Convert Images to Grayscale significantly better (they were focused just on the eyes) but
2. Downsample Images to 28x28 (for efficiency) the number of classes was significantly greater (21 classes
for the CAVE dataset and 1829 classes for the captured
3. Crop out eye region from each image dataset).

5
Figure 7. CAVE Dataset Downsampled Images

Figure 9. CAVE Classification Accuracy with 21 Classes

Figure 8. Captured Dataset Downsampled Images

7.1. CAVE Dataset


Our initial gaze tracking implementation was based on
the CAVE (Columbia Gaze Data Set) [15] dataset. The
dataset has 5,880 images collected for 56 subjects for 21
gaze directions for 5 distinct head poses. To simulate the
near-eye display conditions we only use one head pose
i.e. the subject directly facing the screen. Even though
the dataset was not built for fine-grained tracking and the
captured images are highly inconsistent, the results were Figure 10. CAVE Classification Accuracy with 3 Classes
promising.
The problem was framed as a gaze localization problem
where the screen is divided into 21 (7X3) grids which form learn as well as it would be expected to. Out of the 28x28
the 21 output classes for the CNN model. We show results pixels, a very small fraction actually comprises of the eyes
obtained on the CAVE dataset with 3 output classes (one of and the pupils. In the results, we look at training accuracy
the three horizontal bands on the screen) and with 21 out- which can be interpreted as the inverse of the loss functions
put classes. The graphs show how the classification accu- and shows the progress of the model as it learns a better
racy (measured as the fraction of gazes correctly classified) model. The trained model is then tested on a batch from the
changed across iterations of training the CNN. Each train- test set to get an accuracy which represents the percentage
ing batch was 100 images and the CNN was trained for 500 of correctly classified images. For the 3 and 21 class case,
iterations. The figures 9 and 10 show the training accuracy the final test accuracy was 0.893 and 0.236.
increasing across iterations. The red dotted line shows the
7.2. Captured Dataset
baseline classification accuracy - number 1of classes .
While the accuracy for the 21 class case isn’t significant, Since the number of classes with the real dataset was
the model does learn a mapping and performs better than significantly large, the classification accuracy is not a good
a random guess over the output classes. From the CAVE measure of how well the model does. If the true and esti-
dataset images, it is easy to see why the model does not mated classes are next to each other, that is a misclassifi-

6
Figure 11. Correctly and Incorrectly Classified Images from
CAVE Dataset
Figure 13. Captured Average Angular Error

cation and measured the same as if the true and estimated


classes were on opposite corners of the viewing screen. 8. Conclusion and Future Work
Since our goal is to track the user’s gaze as accurately as
We explored an end-to-end deep learning approach for
possible, we also look at angular error i.e. actual difference
gaze tracking in a highly constrained environment (camera
in angle between the inferred gaze and the actual gaze of the
very close to the eye) and obtained promising results with a
user. Looking at this particular metric, we can see that the
vanilla implementation. This demonstrates the viability of
gaze can be localized to a few degrees in accuracy but there
this approach and also highlights some of its shortcomings.
is scope for improvement. Looking at confusion matrix or
CNN needs a lot of data to be properly trained and lack
classification accuracy is not appropriate for this problem
of publicly available datasets for fine gaze tracking was a
as it does not capture the degree of misclassification or how
major impediment. We implemented our own data capture
far the inferred gaze is from the actual gaze direction.
setup and created an appropriate dataset for this purpose but
The testing classification accuracy was 0.003. Even
given the severe time constraints were only able to gather
though this is ridiculously low, the testing angular error is
data for 5 subjects. This dataset will be extended in the
only 6.7 degrees. And in the gaze tracking sense, the angu-
future for more subjects and made publicly available for fu-
lar degree matters much more than the classification accu-
ture experiments.
racy.
The model used for this work was the simplest instanti-
ation of a CNN and the eye images had to be significantly
downsampled even with the simple model to facilitate quick
training. This is one of the limitations of using CNNs and
prevents quick iterations or cross-validation.
We believe a higher quality input image of the users eye
that also has support features (fixed in the image irrespec-
tive of the subject) will help improve the gaze tracking ac-
curacy. This is also the situation that will be useful when
doing tracking in a near-eye display. Given that the HMD is
fixed to the user’s head, the camera should see a consistent
images across different users. The model can also be more
complex but due to lack of training data, it did not make
sense to increase model complexity in this work.
To conclude, the work shows promise in using a deep
learning approach for gaze tracking and has potential to
outperform the feature based methods based on parametric
models. However, further exploration is needed to achieve
Figure 12. Captured Classification Accuracy with 1829 Classes state of the art results.

7
References [16] K.-H. Tan, D. J. Kriegman, and N. Ahuja. Appearance-based
eye gaze estimation. In Proceedings of the Sixth IEEE Work-
[1] Captured Dataset. https://www.dropbox.com/s/ shop on Applications of Computer Vision, WACV ’02, pages
mxbis1osiedclkd/data.zip?dl=0. 191–, Washington, DC, USA, 2002. IEEE Computer Society.
[2] S. Baluja and D. Pomerleau. Non-intrusive gaze tracking [17] S. Xu, X. Mei, W. Dong, X. Sun, X. Shen, and X. Zhang.
using artificial neural networks. Technical report, Pittsburgh, Depth of field rendering via adaptive recursive filtering. In
PA, USA, 1994. SIGGRAPH Asia 2014 Technical Briefs, SA ’14, pages 16:1–
[3] B. A. Barsky and T. J. Kosloff. Algorithms for rendering 16:4, New York, NY, USA, 2014. ACM.
depth of field effects in computer graphics. pages 999–1010, [18] T. Zhou, J. X. Chen, and M. Pullen. Accurate depth of
2008. field simulation in real time. Computer Graphics Forum,
[4] S. Hillaire, A. Lecuyer, R. Cozot, and G. Casiez. Using an 26(1):15–23, 2007.
eye-tracking system to improve camera motions and depth- [19] J. Zhu and J. Yang. Subpixel eye gaze tracking. pages 124–
of-field blur effects in virtual environments. In Proc. IEEE 129, May 2002.
VR, pages 47–50, 2008. [20] Z. Zhu and Q. Ji. Novel eye gaze tracking techniques under
[5] S. Hillaire, A. Lcuyer, R. Cozot, and G. Casiez. Depth- natural head movement. IEEE Transactions on Biomedical
of-field blur effects for first-person navigation in virtual en- Engineering, 54(12):2246–2260, Dec 2007.
vironments. IEEE Computer Graphics and Applications,
28(6):47–55, Nov 2008.
[6] T. E. Hutchinson, K. P. White, W. N. Martin, K. C. Reichert,
and L. A. Frey. Human-computer interaction using eye-gaze
input. IEEE Transactions on Systems, Man, and Cybernetics,
19(6):1527–1534, Nov 1989.
[7] R. Konrad, E. Cooper, and G. Wetzstein. Novel optical con-
figurations for virtual reality: Evaluating user preference and
performance with focus-tunable and monovision near-eye
displays. Proceedings of the ACM Conference on Human
Factors in Computing Systems (CHI16), 2016.
[8] G. Maiello, M. Chessa, F. Solari, and P. J. Bex. Simulated
disparity and peripheral blur interact during binocular fusion.
Journal of Vision, 14(8), 2014.
[9] M. Mauderer, S. Conte, M. A. Nacenta, and D. Vishwanath.
Depth perception with gaze-contingent depth of field. ACM
SIGCHI, 2014.
[10] C. H. Morimoto and M. R. M. Mimica. Eye gaze tracking
techniques for interactive applications. Comput. Vis. Image
Underst., 98(1):4–24, Apr. 2005.
[11] K. Pfeuffer, M. Vidal, J. Turner, A. Bulling, and
H. Gellersen. Pursuit calibration: Making gaze calibration
less tedious and more flexible. In Proceedings of the 26th
Annual ACM Symposium on User Interface Software and
Technology, UIST ’13, pages 261–270, New York, NY, USA,
2013. ACM.
[12] L. qun Xu, D. Machin, P. Sheppard, M. Heath, and I. I. Re. A
novel approach to real-time non-intrusive gaze finding, 1998.
[13] T. Shibata, T. Kawai, K. Ohta, M. Otsuki, N. Miyake,
Y. Yoshihara, and T. Iwasaki. Stereoscopic 3-D display with
optical correction for the reduction of the discrepancy be-
tween accommodation and convergence. SID, 13(8):665–
671, 2005.
[14] B. Smith, Q. Yin, S. Feiner, and S. Nayar. Gaze Locking:
Passive Eye Contact Detection for HumanObject Interaction.
In ACM Symposium on User Interface Software and Technol-
ogy (UIST), pages 271–280, Oct 2013.
[15] B. Smith, Q. Yin, S. Feiner, and S. Nayar. Gaze Locking:
Passive Eye Contact Detection for HumanObject Interaction.
In ACM Symposium on User Interface Software and Technol-
ogy (UIST), pages 271–280, Oct 2013.

8
Fine-grained Flower Classification

Leo Lou
qibinlou@stanford.edu

Abstract 2. Related Work


There are several noticeable work done on fine-grained
In this paper, we tackle the task of fine-grained flower flower classification.
classification. Based on the earlier work of Maria et al.[6], In [6], Maria l investigated on using multiple kernel
we incorporate both traditional features like HOG, SIFT, learning for flower images acquired under fairly uncon-
HSV with CNN feature to better describe inter-class differ- trolled image situations the images are mainly downloaded
ences between flowers species. We extensively experiment from the web and vary considerably in scale, resolution,
on different ways to utilize these features to generate a high lighting, clutter, quality, etc. They combined different im-
performance model, such as feature normalizing, feature age features(HSV+SIFT+HOG) to describe a flower image
scaling, feature fusing and multiple kernel learning. and to their model. Their work is known as 102 Oxford
Here we use the well-known 102 Oxford flowers bench- flowers benchmark. Anelis et al. [3] designed a better seg-
mark dataset to do our experiment. Our approach has mentation algorithm to identify potential flower body re-
beaten the ground truth work by Maria [6] and the another gions and applied feature extraction on that. They also
CNN based work by [7], though the latter one contains an- developed a much larger dataset than Oxford 102 flower
other approach which yield better result than ours(but takes dataset with 578 flower species and 250,000 images, which
much longer time to do experiments). contributed to 4% classification performance improvement
compared to Oxford 102 flower benchmark dataset. There
are other ways to describe flower images. For example, [8]
used generic features extracted from a convolutional neu-
1. Introduction ral network previously used to perform general object clas-
sification. They experimented the CNN features on plant
Nowadays image based classification systems are classification task together with an extremely randomized
achieving better and better performance as a result of large forest.
datasets and neural networks. In this paper, instead of focus-
ing on the task of classifying as many as different objects, 3. Segmentation
we investigate the problem of recognizing a large number
of classes within one category, in our case, flowers. Such Most flowers are recognizable only by their body. Per-
task is called fine-grained classification. forming a segmentation to separate the flower body from the
There has been progress in expanding the set of fine- background can help reduce the noise in feature extraction
grained domains we have data for, which now includes e.g. process.
birds, aircraft, cars, flowers, leaves, and dogs. In this paper, Several papers have proposed methods explicitly for the
we focus on the flowers fine-grained classification task. For automatic segmentation of a flower image into flower as
human beings, we can use different features of flowers to foreground, and the rest as background. Here We use
distinguish different species;for example, we can use color, the segmentation scheme proposed by Nilsback and Zis-
shape, size, and smell information to help us make a better serman [5]. Figure 1 shows several example segmentaions
decision. But for computers, the only information they can from this method. The problem with this schema is that
get is from the input image, which requires us to well design there are some over-segmented images, where there is no
visual features to describe the flowers. foreground at all. See Figure 2.

In this paper, We first discuss several key components of 4. Classification


our classification pipeline, including flowers segmentation
in section 3, flowers representing features in section 4, and There are several key features that we human use to dif-
experiment details in section 5. ferentiate various kinds of flowers. First of all, color is use-
ful when discriminating a sunflower from a rose. But to
differentiate a buttercup from a dandelion, shape would be
much more useful, but color would not. Smell is also useful
in some cases, though totally useless in our flowers images
classification case. In this section, we discuss several fea-
tures to represent a flower image, and then discuss pros and
cons of multi-kernels SVM classifier and neural network
classifier.

4.1. Features
Color: Colour is described by taken the HSV values of
the pixels. The HSV space is chosen because it is less
sensitive to variations in illumination and should be able
to cope better with pictures of flowers taken in different
weather conditions and at different time of the day. The
HSV values for each pixel in an image are clustered using
k-means. Given a set of cluster centres (visual words)
wic , i = 1, 2, ..., Vc , each pixel in the image I is then
assigned to the nearest cluster centre, and the frequency
of assignments recorded in a a Vc dimensional normalized
frequency histogram n(wc |I).

SIFT: SIFT descriptors are computed at points on a


regular grid with spacing M pixels over the foreground
flower region. At each grid point the descriptors are
computed over circular support patches with radii R
pixels. Only the grey value is used (not color), and the
resulting SIFT descriptor is a 128 vector. To cope with
empty patches, all SIFT descriptors with L2 norm below
Figure 1. Example segmentations from [5] a threshold (200) are zeroed. Note, we use rotationally
invariant features. The SIFT features describe both the
texture and the local shape of the flower (e.g. fine petal
structures (such as a sunflower) vs spikes (such as a globe
thistle). We obtain n(wf |I) through vector quantization in
the same way as for the color features.

Histogram of Gradients: HOG features, are similar


to SIFT features, except that they use an overlapping local
contrast normalization between cells in a grid. However,
instead of being applied to local regions (of radius R in
the case of SIFT), the HOG is applied here over the entire
flower region (and it is not made rotation invariant). In this
manner it captures the more global spatial distribution of
the flower, such as the overall arrangement of petals. The
segmentation is used to guide the computation of the HOG
features. We find the smallest bounding box enclosing the
foreground segmentation and compute the HOG feature
for the region inside the bounding box. We then obtain
n(wh|I)
Figure 2. Bad segmentation examples
CNN Feature: Convolutional neural networks (CNNs)
were proposed in 1989 by LeCun et al. [4]. Recently, it has
shown its power with the availability of large datasets, per-
formance improvement of GPUs and efficient algorithms. [9] to extract SIFT features. It’s a little bit troublesome to
It has been shown that combining CNN features with a run VLFeat library on Octave because it’s still in experi-
simple classifier such as SVM can outperform classical ments. The Overfeat network feature extractor can be in-
approaches for various classification and detection tasks, stalled from Github source [2] or the pre-build version.
which also require hand-crafted features.
5.3. Procedure
4.2. Classifier In our experiments, we combine the training set and
Currently we adopt multiple kernels SVM as our clas- the validation set in the dataset since we skip fine tuning
sifier to train our final model. It is reasonable to have a the parameters for different features experiments. Instead,
weighted linear combination of SVM kernels, each corre- we directly reuse the reported optimal parameters from [6].
sponding to each feature. The weights vary for different Therefore, the optimum numbers of words is 1000 for HSV,
flower species, some of which might have a high weight on 3000 for SIFT, 1500 for HOG. These optimum numbers
Color feature while some might have a high weight on SIFT might not really be the optimal ones for our experiments
feature. For individual feature, we simply use one-vs-other and other reproducing experiments because of libraries and
SVM classifier to get their corresponding model. implementation differences. We train our classifiers on
the combined training data(both the training and validation
5. Experiment sets) and test our classifiers on the testing sets.
For features like HSV, HOG and SIFT, we use KMeans
5.1. Dataset to cluster the features to get our visual words. We use Mini-
BatchKMeans in scikit-learn toolbox to overcome the bot-
Oxford 102 flowers dataset is a well established dataset
tleneck that some feature sets are too large to fit into the
for subcategory recognition proposed by Nilsback et al. [6].
entire memory. MiniBatchKMeans can cluster the centers
The dataset contains 102 species of flowers and a total of
batch by batch on our split feature sets.
8189 images, each category containing between 40 and 200
images. It has established protocols for training and testing,
5.4. Result
which we have adopted in our work too.
We report our experiments result in Table 1. It can be
seen that combining all the features contributes to a far bet-
ter performance than using single feature. CNN feature out-
performs all the other features. Though we observe that
CNN features extracted from segmented flowers images are
less descriptive than from images with background. There
are probably two reasons for this; one is that our segmenta-
tion algorithm is still not perfect that it filters some impor-
tant parts of the flowers images when segmenting; another
is that background information actually helps to better clas-
sify a flower image.

Features mAP
HSV 42.3%
HOG 49.1%
SIFT 53.0%
Figure 3. The distribution of numbers of images over 102 classes
CNN w/o segmentation 73.9%
CNN w/ segmentation 54.1%
HSV+HOG+SIFT+CNN 84.0%
5.2. Setup
Our experiment is based on Python environment and var- Table 1. Classification performance on the test set. mAP refers
ious Python packages, numpy, scipy, scikit-learn, etc. The to classification performance averaged over all classes(not over all
complete package list can be obtained from the import sec- images)
tions of our scripts. It’s worth mentioning that we use
OpenCV 2.4.8 [1] for Python package to process our images
and extract different features. Due to some unknown issues, Table 2 shows the comparison result between our work
our OpenCV version doesn’t support to extract SIFT fea- and others. Note that in [7], the best performance is
tures. Instead, we use the open sourced Octave and VLFeat achieved by further augment the training set by adding
Method mAP [8] N. Sünderhauf, C. McCool, B. Upcroft, and T. Perez. Fine-
Nilsback et al. [6] 76.3% grained plant classification using convolutional neural net-
Anelia et al. [3] 80.66% works for feature extraction. In CLEF (Working Notes), pages
Ours 84.0% 756–762, 2014.
Ali et al. [7] 86.8% [9] A. Vedaldi and B. Fulkerson. VLFeat: An Open and Portable
Library of Computer Vision Algorithms, 2008.

Table 2. Comparison between our work and others

cropped and rotated samples and doing component wise


power transform, which makes the training model invariant
to scale and rotation. We don’t implement this step because
of priorities and time, but we believe it will also benefit our
model and increase our final performance.

6. Summery and future work


So far we have researched and experimented on avail-
able image features and segmentation algorithms for fine-
grained flower classification task. We observe that multi-
ple features empower the classifier to train a better model
and achieve a better classification accurate on test sets. For
the fine-grained flowers classification task, the learning of
different weights for different classes enables us to use an
optimum feature combination for each classification.
We also realize there are flaws in current pipeline, e.g.
computation time cost is not good enough to be in the re-
altime scale. It poses a barrier to build an usable flowers
classification service for users. Future work should include
using current method to train on a larger dataset and fine
tuning the neural network which we use to extract CNN fea-
tures on flowers dataset.

References
[1] OpenCV. https://github.com/itseez/opencv.
[2] Overfeat. https://github.com/sermanet/OverFeat.
[3] A. Angelova, S. Zhu, and Y. Lin. Image segmentation for
large-scale subcategory flower recognition. In Applications
of Computer Vision (WACV), 2013 IEEE Workshop on, pages
39–45. IEEE, 2013.
[4] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E.
Howard, W. Hubbard, and L. D. Jackel. Backpropagation ap-
plied to handwritten zip code recognition. Neural computa-
tion, 1(4):541–551, 1989.
[5] M.-E. Nilsback and A. Zisserman. Delving into the whorl of
flower segmentation. In BMVC, pages 1–10, 2007.
[6] M.-E. Nilsback and A. Zisserman. Automated flower classi-
fication over a large number of classes. In Computer Vision,
Graphics & Image Processing, 2008. ICVGIP’08. Sixth In-
dian Conference on, pages 722–729. IEEE, 2008.
[7] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn
features off-the-shelf: an astounding baseline for recognition.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition Workshops, pages 806–813, 2014.
Gradient-learned Models for Stereo Matching
CS231A Project Final Report

Leonid Keselman
Stanford University
leonidk@cs.stanford.edu

Abstract

In this project, we are exploring the application of ma-


chine learning to solving the classical stereoscopic corre-
spondence problem. We present a re-implementation of sev-
eral state-of-the-art stereo correspondence methods. Addi-
tionally, we present new methods, replacing one of the state-
of-the-art methods for stereo with a proposed technique Figure 1. An example of a stereo left-right pair from the Middle-
based on machine learning methods. These new methods bury 2014 dataset [18]. The motorcycle scene will be consistently
out-perform existing heuristic baselines significantly. used through this paper as a visual example result .

1. Introduction ported there.


The interest in this problem has important practical ap-
Stereoscopic correspondence is a classical problem in
plication in autonomous vehicles and commercial applica-
computer vision, stretching back decades. In its simplest
tions. For example, the KITTI dataset was formed to test
form, one is given a calibrated, rectified image pair where
if low-cost stereoscopic depth cameras to replace high-cost
differences between the pair exist only along the image
LIDAR depth sensors for autonomous vehicle research. In
width. The task is to return a dense set of corresponding
a different field, commercial depth sensors such as the orig-
matches. An example image pair is shown in figure 1.
inal Microsoft Kinect and Intel RealSense R200 use stereo-
While this task may seem straightforward, the primary
scopic correspondence to resolve depth for tracking peo-
challenge comes from challenging, photo-inconsistent parts
ple and indoor reconstruction problems. As such, improved
of the image. Parts of the image will often contain ambigu-
methods for stereoscopic correspondence have wide appli-
ous regions, a lack-of-texture, and a photo-metric mismatch
cation and use.
for a variety of reasons (specular reflections, angle of view,
etc.).
This field is well studied, and there exist many stan- 2. Related Work
dard datasets, such as Middlebury [18] and KITTI [6].
Both of these datasets contain many rectified left-right pairs, 2.1. Previous Work
along with corresponding ground truth matches. The former
dataset contains high resolution images from largely indoor In the field of stereo matching, one of the recent innova-
scenes, and comes from using a method of dense structured tions in the past few years was the use of convolutional neu-
light correspondence method. In contrast, the KITTI dataset ral networks in improving the quality of matching results
contains much lower resolution images, and consists of out- [21]. At the time of it’s announcement at CVPR 2015, it
door data gathered from a vehicle perspective. Additionally, was the top performer in both standard datasets. Even at the
the KITTI dataset’s annotations come from a LIDAR tech- present day, all better performing methods on the Middle-
nique, and are generally sparse. While KITTI seems more bury leaderboard use the Matching Cost CNN (MC-CNN)
popular based on the size of their leader-board, the dense costs as a core building block. Their primary contribution is
annotations available in Middlebury [18] are useful for the to train a convolutional neural network (CNN) to replace the
problem tackled in this paper, and our results are only re- block-matching step of a stereo algorithm. That is, instead

1
Region Neighborhood Outlier
Matching Aggregation Propagation Removal

State of-the-art MC-CNN Cross- Semi-global Learned


Aggregation matching Confidence

Figure 2. The architecture and algorithm flow for state-of-the-art


methods in stereo. They first use the MC-CNN cost function [21],
then combine those results with cross-based aggregation [22], and
share them with neighbors using semi-global matching [7]. Meth-
ods in dashed lines are heuristic methods, while the ones with solid Figure 3. The matching architecture of [21], the current state-of-
lines use a machine-learned method. the-art in stereo matching.

of using a sum of absolute differences cost such as


2.2. Key Contributions
XX
Cost(source, target) = |sourceij − targetij | 1. A fast, flexible implementation of stereo matching
i j We present a new, from-scratch implementation of
state-of-the-art methods in stereo matching, including
or a robust non-parametric cost function such as Census
Census [20], semiglobal matching [7], cost-volume fil-
[20]
tering [10]. Along with standard methods for hole fill-
R(P ) = ⊗ζ(P, Pij )
ing, like those used in MC-CNN [21], and many outlier
Cost(source, target) = popcnt(sourceij ⊕ targetij ) removal methods [16]. The implementation is cross-
platform, C++, multi-threaded, and uses no libraries
the authors of [21] learn a network to compute a
except those for loading and saving images. It is fast,
Cost(source, target) metric based on the ground truth
and produces competitive RMS error results on stan-
available from stereoscopic correspondence datasets. An
dard datasets. See section 3.1.
example of their network architecture is shown in figure 3.
However, in order to obtain their final result, they use 2. A machine-learned method for correlation selec-
a combination of algorithms to select an optimum match. tion We’ve implemented and tested several semiglobal
Namely, they use a combination of their cost metric [21], matching replacement architectures, trained them on
cross-based aggregation [22], and semi-global matching the Middlebury training data, and demonstrated that
[7]. This flow is shown in figure 2. We hypothesize that they perform significantly better on out-of-bag exam-
a short-coming of this state-of-the-art method is that two ples than semiglobal matching. See section 4.
of the techniques used in the algorithm flow make use of a
heuristic method for completing a certain task. We hope to 3. Baseline Implementation
build on the success of the MC-CNN method and use gradi-
ent based learning [12] to replace other components of the 3.1. C++ Baseline
stereo algorithm. The value in picking this specific classifi- First, we implemented stereo matching baselines using
cation algorithm is that it has the ability for us to eventually current, non-machine learned methods. The stereo algo-
design an end-to-end gradient-learned system that trains an rithms described below were implemented from scratch, in
MC-CNN along with our proposed system. The goal for C++, with no external libraries outside of image loading.
the project is first implement these baseline algorithm meth- The performance of our baseline is later summarized in ta-
ods, and then begin to to test and design algorithms and ble 1.
methods to replace one of the two heuristic algorithms in the An example left-right pair is show in 1. We’re using
traditional stereoscopic pipeline, namely semiglobal match- quarter-sized training images from the latest Middlebury
ing [7] or cross-based aggregation [22]. In this report, we dataset [18]. These are roughly 750x500 pixels in resolu-
only present methods for replacing semiglobal matching, tion. The results from our algorithm are compared to the
but not yet cross-based aggregation. ground truth visually in figure 4 and quantitatively in table
1. An elaboration of the different papers and methods im-
plemented for each section is described below. The code
is all C++11, and compiles on Visual Studio 2013 and gcc,
with no external libraries. Threading is implemented via
OpenMP [15] to parallelize cost computation across all pix-

2
Figure 4. The image on the left is ground truth for the motorcycle scene in 1. The image on the right is the results of our semi-global
matching pipeline with naive hole filling as described in section 3.1. For visual comparison, occluded and missing ground truth pixels from
both images are masked out.

els in a given scanline. This cost accumulation is the pri- 3.1.3 Propagation
mary computational bottleneck in the system so paralleliz-
In order to perform propagation across the image, we’ve im-
ing that component is enough to provide sufficient scaling
plemented semi-global matching [7], in full, as described in
across processor cores.
the original paper. We chose to perform 5-path propagation
for each pixel, as it represents a row causal filter on the im-
3.1.1 Cost Computation age, using a pixel’s left, right, top, top-left, and top-right
neighbors. This produces an answer that satisfies the cost
As a baseline method of cost computation, we’ve imple- function of Hirchmuller [7]
mented both standard sum of absolute differences, and the X
robust Census metric [20]. Census was recently tested and E(D) = (C(p, D(P )) (1)
p
shown to be the best performing stereo cost metric [8]. The
weighted sum of absolute differences and Census was addi- X
+ P1 · 1{D(p) − D(q)} = 1
tionally state-of-the-art for Middlebury until a year or two
q
ago [13]. The state of the art in this space is MC-CNN X
method [21], which implemented a CNN algorithm to re- + P2 · 1{D(p) − D(q)} > 1
place traditional methods. However, since our project fo- q
cuses on implementing neural networks in other parts of the Additionally, we added naive hole filling by propagating
stereo pipeline, re-implementing this cost metric is not a pixels from left-to-right, in order to fill occluded regions.
high priority. This is a naive metric, but is a large part of the hole filling
Specifically, we implemented Census with 7x7 windows, used in the state of the art work [21].
which allows us to exploit a sparse census transform [5],
and fit the result for every pixel into 32-bits. This enables 4. Learning Propagation
efficient performance with the use of a single popcnt in-
struction on modern machines. Most papers in the KITTI dataset build on top of the
successful method of semi-global matching [7], which is
an algorithm for propagating successful matches from pix-
3.1.2 Region Selection els to their neighbors. The goal of this part of the project
was to replacing this function with either a standard neural
For our region selection baseline, we’ve implemented both network, or recurrent neural network. Depending on one’s
box correlation windows and weighting with a non-linear perspective on what operation semiglobal matching is per-
smoothing algorithm such as the bilateral filter [19]. This forming, there is a wide array of neural network architec-
was inspired recent unpublished ECCV 16 submissions tures that may be amenable to replace it. An overview of
on the Middlebury leader-board, which claim to replace the formulations is shown in figure 5.
the popular cross-based [22] with a smooth affinity mask The first and most straightforward view of what the en-
method like a bilaterial filter, as first shown in [10]. ergy function, as stated in equation 1, is that it regularizes a

3
4.1. 1D Smoothing
One-dimensional Smoothing Two-dimensional Smoothing
One straightforward view of semi-global matching is
Probability of correct disparity Column-wise probability simply as regularization function on top of a pixel’s cor-
(70d) disparity image (750x70)d relation curve. A correlation curve is the set of matching
costs for a single pixel and it’s candidates. If this input is
Classifier Classifier negated, and fed a softmax activation function, as used to
train many neural networks, it treats the values as unnor-
malized log probabilities, and selects the maximum (which
Raw costs per disparity (70d) Disparity Image (750x70)d
would be the candidate with lowest matching cost).
Raw values (750x1)d
efyi
Li = − log( P fi )
je
Figure 5. An example of two different ways to formulate semi-
global matching as a classification task. The one on the left is
Our original implementations for this method were all
explored in section 4.1, while the one on the right is explored in
straightforward multi-layer perceptions (MLP), using a one,
4.2.
two, or three layer neural network to produce a smarter
minimum selection algorithm. However, no matter the loss
function, shape, dimensions, regularization, or initialization
function, we were unable to get any MLP to converge. That
single pixel’s correlation curve into a more intelligent one.
is, using a 0-layer neural network (the input itself) was bet-
This view is fairly simple, doesn’t incorporate any neigh-
ter than any learned transformation to that shape and size.
borhood information, but in our testing was the most suc-
cessful model. This is elaborated in section 4.1. Instead, we found success by using a one dimensional
convolutional neural network as shown in figure 6. We sus-
A second view of what semiglobal matching does in pect a CNN was able to handle this task better, as one bank
practice is that is regularizes an entire scanline at a time, of convolutions could learn an identity transform, while oth-
performing scanline optimization and producing a robust ers could learn feature detectors that incorporated interest-
match for an entire set of correlation curves at once. This ing feedback into that identity transform. In comparison, a
was the view we took when building models in section 4.2. randomly initialized fully connected network may struggle
to learn a largely identity transform with minor modifica-
A third view of what semiglobal matching does is it tions. We implemented the neural network on top of Keras
serves as a way of remembering good matches, and propa- [4] and TensorFlow [1]. We additionally learned several
gating their information to their neighbors. This is straight- non-gradient based classifier baselines such as SVMs and
forward and almost certainly what semiglobal matching random forests using scikit-learn [17].
does. This would require a pixel recurrent neural network
such as that in [14]. In our limited time and testing, we 4.2. 2D Smoothing
were unable to get any of these architectures to converge
and hence have excluded them from this paper. However, As shown in figure 5, there is an alternative concept of
our primary focus was on building a bidirection RNN with how semiglobal propagation. This one incorporates pixel
GRU [3] activations. In practice, small pixel patches didn’t neighborhoods, and seems a more natural fit for the energy
converge while large patches were not able to fit into the function presented in equation 1. For this formulation of
memory of the machines we had available for training. a neural network, the correlation curves of an entire scan-
line are reshaped into an image in disparity cost space no-
For testing and training, we gather a subset of the Mid- tation is described in figure 7. We then create a model us-
dlebury images [18], and split into into a random training ing a two-dimensional convolutional neural network [12] on
and testing set with a 80%-20% split. The unseen samples top of these disparity cost space images The top level is a
are then used for evaluation. Middlebury provides 15 im- column-wise softmax classifier of the same size as the input
ages for training and 15 for evaluation. For the classifiers in dimensions. In order to implement this in TensorFlow [1],
section 4.1, this results in roughly 500,000 annotations per we first pass in a single disparity image as a single batch.
image (using quarter sized images), and 500,000 tests of the We run our convolutional architecture over this model, and
network. While for the classifiers in section 4.2, this results then reshape the output into pixel-many ”batches” for each
in 500,000 annotations computed over about 1,000 runs of of which we have a label. This allows the built-in softmax
the network (since it computes 500 outputs at the same and cross-entropy loss formulations to work out-of-the-box
time). See below for details of how this is implemented. with no hand-made loops.

4
Model RMS Error Runtime
Census 28.92 1.2s
SGBM 28.12 3.1s
softmax
SGBM + BF 32.8 5.8s
70d linear projection OpenCV SGBM 38.00 0.9s
MC-CNN (acct) 27.5 150s
7x1 Average Pool Table 1. A summary table of numerical results on the training Mo-
torcycle image. The error metric is root-mean-squared error in
disparity space, and the run-times are on a quad-core i7 desktop.
The first three lines are baseline implementations implemented by
64, 5x1 convolutions, stride 1 us, while the last two are standard algorithms available on the
dataset website [18]. The MC-CNN results were run on a GPU
[21] (which were on an GPU).
32, 9x1 convolutions, stride 5

quality of all pixels predicted by the classifier. A sum-


16, 9x1 convolutions, stride 2 mary table is shown in table 1. We show that our baseline
implementation is on the same order of magnitude as the
SSE-optimized, hand-tuned implementation of semiglobal
8, 9x1 convolutions, stride 1 matching available from OpenCV [2]. We believe that both
the performance and accurate matching is a function of us
Input: 70d correlation curve using the robust and fast ADCensus [13] [20] weighted
cost function. Since the primary focus of this project as
to simply provide a flexible baseline for quickly generating
data for the machine learned methods in section 4, we did
not spend much time micro-optimizing or tuning algorithm
Figure 6. An architectural view of our most successful machine
learned method, a 1D CNN for predicting better minimums in cor-
hyper-parameters. However, if one wished to tune this al-
relation curves. gorithm there are dozens of knobs, including the relative
weighting of absolute differences and Census, the regular-
ization strengths of P1 and P2 from semiglobal matching,
𝐼𝐶𝑜𝑠𝑡 (𝑤, 𝑑) and the weights used in the bilateral filter.
Disparites

Type equation here.

5.2. Learned Propagation methods


Image Width
In the scope of testing the various propagation classifiers,
we adopt two different evaluation metrics. The first is the
Figure 7. A brief visual diagram of a Cost Image for single scan-
traditional training/test split used in machine learning meth-
line of stereo matching. Each pixel contains the cost of matching ods. The other is the RMS error metric used for stereo algo-
for that value, at that image pixel. Across the entire image, there rithm evaluation. A result comparing standard methods and
exists a cost volume across all scan-lines in a stereo pair, our pro- our proposed classifiers on test data is shown in figure 8.
posed architectures only deal with a known, discrete number of We see that the one-dimensional CNN as presented in
cost images. section 4.1 and shown in figure 6 outperforms the cur-
rent standard methods for smoothing matches. That is,
when fed with the ADCensus correlation curves generated
5. Experiments by our matching algorithm, the neural network generates
5.1. Baseline Method predictions that are much more accurate than the heuristic
semiglobal matching method used in state-of-the-art papers
We tested our baseline C++ implementation of modern such as MC-CNN [21]. This result is even true when we
stereo matching for both time and quality of output. Specif- take the network’s predictions back to the matching algo-
ically, we focused on simply a single Middlebury training rithm and use it to generate a full correspondence image.
image (the Motorcycle) to validate that our results were Even though the neural network (currently) lacks the ability
within expectation for a stereo matching baseline. Our two to make subpixel accurate guesses, it generates lower RMS
key metrics were runtime and root-mean-squared error for error than standard methods like Census and semiglobal
all the dense, all-pixel label ground truth. This is just one matching, which do have subpixel matching built into the
of the metrics for Middlebury, but is one that measures the baseline.

5
Test Accuracy RMS Error

91.2
84.0
74.4
70.4

(a) Census and Semiglobal Results

11.6 9.4 9.3 9.1

ADCensus ADCensus+SGM Random Forest Conv1D NN

Figure 8. Numerical results on the out-of-bag testing data across


the a subset of the Middlebury [18] images.
(b) Random Forest and 1D CNN Results
Figure 9. A qualitative example using the presented classifiers. It
Model Out-of-bag Accuracy can seen that the original cost method (Census) is able to resolve
Census 67.7% certain parts of the scene. On the other hand, semiglobal propaga-
SGBM 69.2% tion is able to in-paint the image and generate a smooth disparity
1D CNN Training 76.4% image. On the other hand, the errors made by the two classifier
1D CNN Test 75.8% models, although having better accuracy and RMS error than the
2D CNN Training 58.5% heuristic methods, sometimes generate what looks like completely
2D CNN Test 55.4% erroneous results.
Table 2. An summary table of numerical results when testing on a
large batch of Middlebury testing images

6. Conclusion
Additionally, as can be seen in table 2, the 1D CNN
model is not yet exhibiting overfitting on out-of-bag sam- We have presented a new method for taking stereoscopic
ples, and might benefit from additional training time. It correlation costs and smoothing them into a more refined
can also be seen that our best 2D CNN architecture dras- estimate. This method is gradient-trainable, and outper-
tically underperforms even the standard baselines. While forms the semiglobal matching [7] heuristic technique used
there may be some more optimal 2D CNN architecture than in state-of-the-art methods such as MC-CNN [21]. This
the one we tried, our poor initial results made us moved to- leads support to the hypothesis proposed in the introduc-
wards trying to build an RNN method instead. However, tion, which is that continuing to replace components of the
we did not have enough time to finish designing and train- stereo matching pipeline with machine-learned models is a
ing our RNN models for replacing semiglobal matching. way to improve their performance. Since the models pre-
sented here were done with ADCensus costs [13] and not
Another interesting experimental result is the qualitative MC-CNN costs [21], and we did not have enough time to
performance of the classifier models. As shown in figure train on the full Middlebury dataset [18], we don’t present a
9, the classification-based models sometimes generate com- new state-of-the-art for stereoscopic correspondence. How-
pletely erroneous results for parts of the image. While Cen- ever, we believe that these results suggest that one may be
sus will fail to generate a result sometimes, and semiglobal possible by simply running the proposed techniques with
matching learns a smooth transformation. In contrast, while MC-CNN on the full dataset.
the classification models have lower error, they sometimes
predict very non-smooth results, as the classifier is run per In addition, we’ve created a new, simply, fast and cross-
pixel. This is suggestive that a classifier, such as an RNN, platform stereo correspondence implementation. We’ve
that accounts for neighborhood information may perform shown it to be about as fast as the one in OpenCV, and to
even better. Also, while we did not combined semiglobal produce results that are notably more accurate. We hope
matching with our 1D CNN, it is possible to use the nor- this can be used as a base for others to experiment with
malized probabilities from the neural network together with other stereoscopic correspondence ideas without having to
semiglobal matching to overcome this lack of smoothness dive into complicated OpenCV SSE code or deal with slow
and achieve perhaps an even better result. MATLAB implementations.

6
J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker,
V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. War-
den, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. Tensor-
Flow: Large-scale machine learning on heterogeneous sys-
tems, 2015. Software available from tensorflow.org.
[2] G. Bradski. Opencv library. Dr. Dobb’s Journal of Software
Tools, 2000.
[3] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau,
Figure 10. An example of a pixelwise RNN from [14], a gradient- F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase
learned method for propagating information across images. representations using rnn encoder-decoder for statistical ma-
chine translation. arXiv preprint arXiv:1406.1078, 2014.
[4] F. Chollet. Keras. https://github.com/fchollet/
keras, 2015.
[5] W. S. Fife and J. K. Archibald. Improved census transforms
for resource-optimized stereo vision. Circuits and Systems
for Video Technology, IEEE Transactions on, 23(1):60–73,
2013.
[6] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for au-
tonomous driving? the kitti vision benchmark suite. In
Conference on Computer Vision and Pattern Recognition
(CVPR), 2012.
Figure 11. An example of a spatial transformer for region selection [7] H. Hirschmüller. Accurate and efficient stereo processing by
[11], a gradient-learned method for region selection. semi-global matching and mutual information. In Computer
Vision and Pattern Recognition, 2005. CVPR 2005. IEEE
Computer Society Conference on, volume 2, pages 807–814.
7. Next Steps IEEE, 2005.
[8] H. Hirschmüller and D. Scharstein. Evaluation of stereo
To continue this theme of research, we wish to explore
matching costs on images with radiometric differences. Pat-
additional architectures for stereo correspondence algo-
tern Analysis and Machine Intelligence, IEEE Transactions
rithms that are trained with error gradients. While the one- on, 31(9):1582–1599, 2009.
dimensional CNN presented here works well, it isn’t able [9] S. Hochreiter and J. Schmidhuber. Long short-term memory.
to capture the neighborhood information that semiglobal Neural computation, 9(8):1735–1780, 1997.
matching can. To incorporate neighborhood information, [10] A. Hosni, M. Bleyer, C. Rhemann, M. Gelautz, and
we’d like to explore recurrent neural network models, which C. Rother. Real-time local stereo matching using guided im-
we began to design but were unable to get running in time age filtering. In Multimedia and Expo (ICME), 2011 IEEE
for this project submission. By coupling our 1D-CNN ar- International Conference on, pages 1–6. IEEE, 2011.
chitecture with either a spatial transformer networks front- [11] M. Jaderberg, K. Simonyan, A. Zisserman, and
end [11], or a recurrent neural network backend [3] [9] , we K. Kavukcuoglu. Spatial transformer networks. CoRR,
might produce a new state-of-the-art algorithm for the clas- abs/1506.02025, 2015.
sic stereo problem. Examples of these models are shown in [12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-
figures 10 and 11. based learning applied to document recognition. Proceed-
ings of the IEEE, 86(11):2278–2324, 1998.
8. Code [13] X. Mei, X. Sun, M. Zhou, S. Jiao, H. Wang, and X. Zhang.
On building an accurate stereo matching system on graph-
Code is made available at https://github.com/ ics hardware. In Computer Vision Workshops (ICCV Work-
leonidk/centest. Running the stereo matching algo- shops), 2011 IEEE International Conference on, pages 467–
rithm is straightforward and documented in the README, 474. IEEE, 2011.
but running the learning algorithms (found in the learning/ [14] A. V. D. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel
recurrent neural networks. CoRR, abs/1601.06759, 2016.
folder) varies depending on the method.
[15] OpenMP Architecture Review Board. OpenMP application
program interface version 3.0, May 2008.
References
[16] M.-G. Park and K.-J. Yoon. Leveraging stereo matching with
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, learning-based confidence measures. In Computer Vision
C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghe- and Pattern Recognition (CVPR), 2015 IEEE Conference on,
mawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, pages 101–109. IEEE, 2015.
R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, [17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,

7
V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Ma-
chine learning in Python. Journal of Machine Learning Re-
search, 12:2825–2830, 2011.
[18] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl,
N. Nešić, X. Wang, and P. Westling. High-resolution stereo
datasets with subpixel-accurate ground truth. In Pattern
Recognition, pages 31–42. Springer, 2014.
[19] C. Tomasi and R. Manduchi. Bilateral filtering for gray and
color images. In Computer Vision, 1998. Sixth International
Conference on, pages 839–846. IEEE, 1998.
[20] R. Zabih and J. Woodfill. Non-parametric local transforms
for computing visual correspondence. In Computer Vision - ECCV '94, pages 151–158. Springer, 1994.
[21] J. Zbontar and Y. LeCun. Stereo matching by training a con-
volutional neural network to compare image patches. CoRR,
abs/1510.05970, 2015.
[22] K. Zhang, J. Lu, and G. Lafruit. Cross-based local
stereo matching using orthogonal integral images. Circuits
and Systems for Video Technology, IEEE Transactions on,
19(7):1073–1079, 2009.

Human Pose Estimation for Multiple Frames

Marianna Neubauer Hanna Winter Lili Yang


Stanford University Stanford University Stanford University
mhneub@stanford.edu hannawii@stanford.edu yangl369@stanford.edu

Abstract

Human pose estimation is a well studied topic in vision. However, most modern techniques for human pose estimation on multiple, consecutive frames, or motion capture, require 3D depth data, which is not always readily available. Prior work using single view 2D data, on the other hand, has been limited to pose estimation in single frames. This raises some interesting questions. Can human pose estimation in multiple frames be achieved using 2D single frame techniques, thereby discarding the expensive reliance on 3D data? Can these 2D pose estimation models be improved upon by taking advantage of the data similarities across multiple consecutive images? In this paper, we endeavor to answer these questions. We take Yang et al.'s [1] single frame pose estimation model using a flexible mixture of parts and apply it in a multi-frame context. We demonstrate that we can achieve improvements on the original method by taking advantage of the inherent data similarities between consecutive frames. We achieve speed improvements by restricting Yang et al.'s model to search locally in intermediate frames and, under certain circumstances, accuracy improvements by running a second, corrective pass using SVMs trained for instance recognition.

1. Introduction

Human pose estimation has become an extremely important problem in computer vision. Quality solutions to this problem have the potential to impact many different aspects of vision such as activity recognition and motion capture. Additionally, success in these areas can be applied to gaming, human-computer interaction, athletics, communication, and health care. Despite huge progress in motion capture, as exemplified by the Xbox Kinect, the current solutions used in gaming require extensive hardware, making it impossible for such technology to be used in daily human-computer interactions [2]. We hope to improve motion capture to work with simple RGB single view cameras, allowing this technology to exist on everyday phones and computers.

State of the art models for human pose estimation that are implemented for single static RGB images also have some minimal but noticeable accuracy shortcomings [1]. Currently, when used on video frame sequences, these models do not utilize the additional information provided by surrounding frames. Operating under the assumption that human poses change minimally between frames, we improve the accuracy of Yang et al.'s [1] efficient and flexible model for human detection and human pose estimation in single static images. We take into account the SIFT features of other frames in the same video clip by training SVMs on these features. We can improve the output of Yang's model by testing the SVMs on parts of the images and adjusting the original body parts to reflect the scores calculated by the trained SVMs. The result is a notable increase in accuracy over the imperfect Yang pose estimation.

After discussing related work and the implications of our method in Section 2, we further describe our process, resulting algorithm, and our evaluation process in detail in Section 3. Finally, we analyze our testing data and experimental results for our various methods and hyperparameters in Section 4.

2. Background

2.1. Review of Previous Work

Human pose estimation is a well studied subject, both in video (multiple frames) and in images (single frames). Currently, most modern techniques for pose estimation in video rely on 3D depth data. A well known example of this is the Xbox Kinect [2], which uses pose estimation to determine the gamer's motion. 3D depth data has many advantages over 2D image data, not the least of which is the additional dimension of information. However, 3D data can only be captured using specialized, and often expensive, equipment and is not nearly as ubiquitous as 2D videos.

Recent work in pose estimation on 2D image data
features a wide range of techniques and approaches, among them Yang's [1], Agarwal's [3], Dantone's [4], and Toshev's [5]. These methodologies are similar in that they focus on pose estimation on single images. We focus primarily on Yang's [1] method of pose estimation using a flexible mixture-of-parts. Yang's method has the advantage of producing relatively good results on full body images across a variety of poses and background contexts, while still retaining a significant speed advantage over certain other approaches, such as Toshev's [5] pose estimation using convolutional neural networks. A relatively fast algorithm is of particular significance when we consider pose estimation in the multi-frame context.

2.2. Our Method

Previous methods for pose estimation in the multi-frame realm rely on 3D depth data. Our method uses only RGB single view image data to accurately locate 26 different body parts. Additionally, our SVMs are trained specifically on information from a given video clip, resulting in a more accurate classification of small, specific body parts. Because deep learning would not be feasible in this context, as neural networks take too long to train and require an extremely large amount of training data, we believe our method is the best learning-based technique to improve pose estimation in the multi-frame context.

3. Technical Details

3.1. Overview of Methodology

Utilizing the available source code, we improve Yang et al.'s Image Parse model algorithm [1] on a variety of image sequences of human motion gathered from YouTube. As an initial attempt, we implemented a HOG feature search algorithm where we compute HOG features for each frame and find the location of body parts by searching for features similar to those calculated for the body part in the prior frame. We found that although this method dramatically speeds up the process, the results are worsened. Then, we implemented an SVM correction method where we train an SVM for each body part for each video clip. We improve the original Yang output by testing the SVMs on parts of the image and adjusting the Yang output based on the scoring results. Expanding upon this method, we integrated hard negative mining [6] for computing logical negatives for each SVM. Additionally, we added a double pass with another SVM trained to classify a sub-image as a body part or background. Finally, in order to measure the accuracy of our computed bounding boxes, we manually annotate ground truth bounding boxes on the same image sequences.

3.2. Yang Algorithm Speedup

The original implementation of Yang's mixture of parts algorithm runs in 30 seconds on a typical clip from our test set (see Section 4.1). Since we are testing on upwards of 2000 images, this is unacceptably slow. Also, in a multi-frame video with multiple people, the highest scoring bounding boxes often migrate from person to person. To remedy these issues we reduced the space in which the mixture of parts algorithm searches for the bounding boxes.

For the first frame of the video clip we run the full Yang algorithm. For the second frame, we crop the image to the bounding box containing the entire person plus a little extra, the size of a body part bounding box, on the top, bottom, and sides. We then run the full Yang algorithm on the cropped image. We store the pyramid level that is used for the bounding boxes on the second image. For the third frame and all subsequent frames, we crop the image using the same method used to crop the second image, and we search only within the pyramid levels above, at, and below the previously stored pyramid level.

Cropping the image ensures the bounding boxes do not migrate to another person and speeds up the search for the bounding boxes. Reducing the pyramid levels also results in a significant speedup. Instead of 30 seconds, the algorithm runs in about 0.1-0.4 seconds per frame. This speedup made our SVM correction method, described in Section 3.4, feasible because it allowed us to run Yang on all the frames of a given video clip in a reasonable amount of time. This was necessary to obtain enough training data for the SVMs.
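To make the cropping and pyramid-level restriction concrete, the following is a minimal sketch of the per-frame loop described above. It assumes a hypothetical run_yang(image, levels=None) callable that wraps Yang et al.'s detector and returns the person bounding box as (x0, y0, x1, y1) together with the pyramid level it selected; the padding and the level window mirror this section's description, but the function name and its signature are placeholders rather than part of the released Image Parse code.

    def track_clip(frames, run_yang, part_box_size):
        """Full detection on frame 1, then detection on padded crops of later frames.

        frames: list of H x W x 3 numpy arrays.
        run_yang: placeholder callable wrapping Yang et al.'s model; returns
                  (person_box, pyramid_level) and accepts an optional list of
                  pyramid levels to search.
        part_box_size: padding added on each side, roughly one body-part box.
        """
        boxes = []
        person_box, level = run_yang(frames[0])          # full search on frame 1
        boxes.append(person_box)
        for t, frame in enumerate(frames[1:], start=1):
            x0, y0, x1, y1 = person_box
            h, w = frame.shape[:2]
            # Pad the previous person box by about one body-part box on all sides.
            cx0 = max(0, int(x0) - part_box_size)
            cy0 = max(0, int(y0) - part_box_size)
            cx1 = min(w, int(x1) + part_box_size)
            cy1 = min(h, int(y1) + part_box_size)
            crop = frame[cy0:cy1, cx0:cx1]
            # Frame 2: full pyramid on the crop; later frames: only the levels
            # adjacent to the level stored from frame 2.
            levels = None if t == 1 else [level - 1, level, level + 1]
            (bx0, by0, bx1, by1), found_level = run_yang(crop, levels=levels)
            if t == 1:
                level = found_level
            # Map the detection back into full-image coordinates.
            person_box = (bx0 + cx0, by0 + cy0, bx1 + cx0, by1 + cy0)
            boxes.append(person_box)
        return boxes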
3.3. Interpolation with HOGs

The HOG interpolation method relies on the assumption that a person's pose can change only so much between consecutive frames. Therefore, given the bounding boxes for body parts in one frame, we are assured that the associated bounding boxes in the subsequent frame may be found in the same vicinity and would retain similar features.

Our implementation uses Yang's model to select bounding boxes for the first frame of the target sequence. In each subsequent frame, for each bounding box, we run a sliding window search in the local vicinity of its location in the prior frame to select candidate bounding boxes. We then select the candidate with the closest match in HOG features to the associated bounding box in the prior frame.

By running Yang's relatively expensive procedure only on the first frame, we are able to achieve significant speed improvements over a full run of Yang's model across all frames. However, this methodology has two disadvantages. Firstly,
any pose estimation errors made by Yang in the first frame are propagated into the subsequent frames. Secondly, the quality of the interpolation degrades the farther removed we are from the initial frame. The key weakness of interpolation with HOGs is that it takes into account the output of Yang's model for only a single frame. In subsequent investigations, we focus instead on producing accuracy improvements using SVMs trained on the output across all frames.
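The HOG matching step itself reduces to comparing descriptors of candidate windows against the descriptor of the previous frame's box. The snippet below is a small sketch of that local search using scikit-image's hog function; the window stride, the search radius, and the grayscale conversion are our own illustrative choices rather than values taken from this paper.

    import numpy as np
    from skimage.color import rgb2gray
    from skimage.feature import hog

    def hog_descriptor(patch):
        # Descriptor for one candidate window (grayscale HOG).
        return hog(rgb2gray(patch), orientations=9,
                   pixels_per_cell=(8, 8), cells_per_block=(2, 2))

    def track_box(prev_frame, next_frame, box, radius=16, stride=4):
        """Slide a window of the same size around the previous location and
        keep the candidate whose HOG descriptor is closest to the old one."""
        x0, y0, x1, y1 = box
        bw, bh = x1 - x0, y1 - y0
        target = hog_descriptor(prev_frame[y0:y1, x0:x1])
        h, w = next_frame.shape[:2]
        best, best_dist = box, np.inf
        for dy in range(-radius, radius + 1, stride):
            for dx in range(-radius, radius + 1, stride):
                nx0, ny0 = x0 + dx, y0 + dy
                if nx0 < 0 or ny0 < 0 or nx0 + bw > w or ny0 + bh > h:
                    continue
                cand = hog_descriptor(next_frame[ny0:ny0 + bh, nx0:nx0 + bw])
                dist = np.linalg.norm(cand - target)
                if dist < best_dist:
                    best, best_dist = (nx0, ny0, nx0 + bw, ny0 + bh), dist
        return best

Because every candidate window has the same size as the original box, the descriptors are directly comparable and no resizing is needed.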
3.4. SVM Correction

Considering only a single human in each of the image sequences, we notice that various features, such as the color of their skin or clothing, do not change over frames. Using this observation, we train video clip specific SVMs to improve the output from Yang's model [1]. From Yang, there are 26 bounding boxes indicating the locations of 26 body parts for each frame. We split up the frames into sub-images defined by each bounding box, as seen in Figure 1, and treat each of these sub-images as training data for the SVMs. Additionally, for each frame, we compute negative examples by randomly selecting bounding boxes within a certain area around the human and then discarding those that overlap with any of the calculated body part bounding boxes. We then repeat this process until enough negative examples are found (Figure 2).

Figure 1: A visualization of segmenting the frames from the bounding boxes calculated by Yang [1] to create the training data used to train the 26 different SVMs.

Figure 2: The process of finding negative examples in each frame. The leftmost image shows the boundary around the person in which random bounding boxes are found. The center image shows these boxes. Then, all the boxes that overlap with any of the body parts are filtered out, and the resulting bounding boxes that will become negative examples are displayed in the rightmost image. This process is repeated until a sufficient amount of negative examples are found.
Utilizing the VLFeat library [7], we compute cluster centers from the combined training data by calculating the SIFT features for each training example and using k-means clustering to find centers for all the SIFT features. Note that SIFT features were computed using the RGB information, as the colors are important features for training. We found that a larger number of centers, such as 100, produced better results. For all of the training data, we create Bag of Words feature vectors. For each pyramid depth p, we break the training example into a p × p grid of sub-images and take the SIFT feature vector for each section. After finding the closest cluster center to each SIFT feature vector, we create a histogram of this distribution and concatenate all sub-image histograms together to form our Bag of Words. The Bag of Words features are then used to train the 26 SVMs. For a given SVM for body part a, all features for the 25 other body parts and for the negative sub-images are treated as negative examples.

In order to improve the original output from Yang's model [1], we test the SVMs on every 10th frame using a sliding window. As shown in Figure 3, for a given frame and a given body part a, we initialize a score for the SVM associated with a on the original calculation from Yang. Then, we start sliding a window of the same size as the original, computing a score at every position with the SVM for a. The window position with the maximum score becomes the corrected bounding box.

Figure 3: The sliding window method. The image on the left displays the original Yang output for the left hand. The middle image shows the sliding window starting from the top left and moving across and down. A score is calculated for each position. The image on the right indicates the corrected body part, which is the position of the sliding window that resulted in the best score.
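As a concrete illustration of this correction pass, the sketch below builds the bag-of-words features over a spatial grid and rescores sliding windows around the original detection. It uses OpenCV's SIFT and scikit-learn's KMeans and LinearSVC as stand-ins for the VLFeat pipeline described above, and the helper names (bow_feature, correct_part), the grid depth, and the search parameters are our own illustrative choices, not values from the paper.

    import numpy as np
    import cv2
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    sift = cv2.SIFT_create()

    def sift_descriptors(patch):
        # Stand-in for the paper's VLFeat RGB dense SIFT.
        gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
        _, desc = sift.detectAndCompute(gray, None)
        return desc if desc is not None else np.zeros((1, 128), np.float32)

    def bow_feature(patch, kmeans, depth=2):
        # Histogram of visual words over a depth x depth grid, concatenated.
        k, (h, w) = kmeans.n_clusters, patch.shape[:2]
        hists = []
        for i in range(depth):
            for j in range(depth):
                sub = patch[i * h // depth:(i + 1) * h // depth,
                            j * w // depth:(j + 1) * w // depth]
                words = kmeans.predict(sift_descriptors(sub))
                hists.append(np.bincount(words, minlength=k))
        v = np.concatenate(hists).astype(float)
        return v / max(v.sum(), 1.0)

    def correct_part(frame, box, svm, kmeans, radius=12, stride=4):
        # Slide a window of the original size around Yang's box and keep the
        # position with the highest SVM score for this body part.
        x0, y0, x1, y1 = box
        bw, bh = x1 - x0, y1 - y0
        best = box
        best_score = svm.decision_function([bow_feature(frame[y0:y1, x0:x1], kmeans)])[0]
        for dy in range(-radius, radius + 1, stride):
            for dx in range(-radius, radius + 1, stride):
                nx0, ny0 = x0 + dx, y0 + dy
                if (nx0 < 0 or ny0 < 0 or
                        ny0 + bh > frame.shape[0] or nx0 + bw > frame.shape[1]):
                    continue
                f = bow_feature(frame[ny0:ny0 + bh, nx0:nx0 + bw], kmeans)
                s = svm.decision_function([f])[0]
                if s > best_score:
                    best, best_score = (nx0, ny0, nx0 + bw, ny0 + bh), s
        return best

    # Vocabulary and per-part classifiers would be fit once per clip, e.g.:
    # kmeans = KMeans(n_clusters=100).fit(all_sift_descriptors)
    # svm_a  = LinearSVC().fit(bow_features, labels)   # one SVM per body part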

3.4.1 Double-Pass SVM

After our initial results, we noticed that if Yang's model mistakenly placed enough bounding boxes on parts of the background, our SVM would do the same. We improve our method by using an additional, background distinguishing SVM. We train this SVM on the same feature set as the 26 body part SVMs, but using as positives all body part bounding boxes and as negatives all background bounding boxes. During the sliding window stage, this SVM is used to filter candidate bounding boxes. Only bounding boxes that are classified as non-background are kept and subsequently scored by the corresponding body part SVM.
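The filtering step can be expressed as a small wrapper around the sliding-window scorer sketched earlier: a window is scored by the part-specific SVM only if the shared background classifier accepts it. The function below is a minimal sketch under that assumption; background_svm and part_svm are assumed to be trained scikit-learn classifiers over the same bag-of-words features, with label 0 standing for background.

    def score_window(feature, part_svm, background_svm):
        """Return the part score for one candidate window, or None if the
        background classifier rejects it (double-pass filtering)."""
        if background_svm.predict([feature])[0] == 0:   # 0 = background (assumed)
            return None
        return part_svm.decision_function([feature])[0]

In the sliding-window loop, candidates whose score is None are simply skipped, so the correction can never move a part box onto a window that looks like background.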
3.4.2 Hard Negative Mining

To further improve our method, we take advantage of the hard negative mining method [6]. In this addition to our SVM correction technique, we train our original 26 SVMs without any negative examples aside from other body parts. Then, using these SVMs, we test them on the randomly selected negatives collected by our previous method. We do this over a series of iterations where, in each iteration, we collect new negative examples, test these negative examples on all 26 SVMs, take the maximum score, and then keep a maximum of 30 negative examples per video frame that have a positive score. Our iterations stop once we have kept a sufficient number of negative examples. Using this technique, we are able to collect the most "confusing" negatives to train our SVMs on. We then recompute the cluster centers and Bag of Words features including the negative examples and re-train all 26 SVMs. The correction step using the sliding window technique remains the same.
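A compact sketch of that mining loop is given below. It assumes sample_random_negatives(frame, boxes) and bow_feature(...) helpers like the ones sketched earlier, and the per-frame cap of 30 follows the description above; the stopping target and the round limit are illustrative assumptions.

    import numpy as np

    def mine_hard_negatives(frames, part_boxes, svms, kmeans,
                            sample_random_negatives, bow_feature,
                            per_frame_cap=30, target=2000, max_rounds=10):
        """Keep only randomly sampled background boxes that at least one
        part SVM scores positively ("confusing" negatives)."""
        hard = []
        for _ in range(max_rounds):                    # guard against too few hits
            for frame, boxes in zip(frames, part_boxes):
                kept = 0
                for box in sample_random_negatives(frame, boxes):
                    x0, y0, x1, y1 = box
                    f = bow_feature(frame[y0:y1, x0:x1], kmeans)
                    score = max(svm.decision_function([f])[0] for svm in svms.values())
                    if score > 0:                      # fooled at least one part SVM
                        hard.append(f)
                        kept += 1
                        if kept >= per_frame_cap:
                            break
            if len(hard) >= target:
                break
        return np.array(hard)

After mining, the vocabulary and all 26 SVMs are re-fit with these hard negatives added to the negative set, exactly as described above.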
3.5. Evaluation

To evaluate the performance of our algorithm, we measure how many body parts are correctly localized by comparing the pixel positions of the computed bounding boxes and the manually annotated ground truth bounding boxes. The Image Parse model outputs the four corners of a square bounding box, while the manual annotation only stores the centroid of a bounding box. We measure the intersection over union (IOU) of the computed bounding box and the ground truth. We assume the size of the bounding box for the ground truth is the same as the size of the computed bounding boxes. A bounding box is labeled "correct" if its IOU is above a certain threshold.

To aggregate this data for a single video clip, we count the number of frames in which a body part is correctly localized and divide that by the total number of frames. This number is the average precision (AP) of the algorithm for that body part in that video clip.

To evaluate the performance of our algorithms, we also compute an AP vs. overlap threshold curve (AOC), similar to the AP curve described in [8]. A robust algorithm should generate a curve that maintains high AP for all overlap thresholds; however, some drop off is expected. If there is a drop off, it should occur at high overlap thresholds.

Different regions of the body have drastically different performances. In general, arms and legs perform more poorly than head and torso in Yang's algorithm. Therefore, we also look at the average raw IOU for each region of the body for each clip to see if the relative performance between different algorithms depends on the body region. We defined seven regions: head, left torso, left arm, left leg, right torso, right arm, and right leg.
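The metric reduces to a per-box IOU test followed by an average over the annotated frames. The helper below is a small sketch of that computation; building the ground-truth box around the clicked centroid using the predicted box size matches the assumption stated above, while the default threshold is illustrative.

    def iou(box_a, box_b):
        # Boxes are (x0, y0, x1, y1); returns intersection over union.
        ax0, ay0, ax1, ay1 = box_a
        bx0, by0, bx1, by1 = box_b
        iw = max(0, min(ax1, bx1) - max(ax0, bx0))
        ih = max(0, min(ay1, by1) - max(ay0, by0))
        inter = iw * ih
        union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
        return inter / union if union > 0 else 0.0

    def part_ap(pred_boxes, gt_centroids, threshold=0.5):
        """Fraction of annotated frames where a part is localized with IOU
        above the threshold; ground-truth boxes share the predicted size."""
        correct = 0
        for (x0, y0, x1, y1), (cx, cy) in zip(pred_boxes, gt_centroids):
            w, h = x1 - x0, y1 - y0
            gt = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
            if iou((x0, y0, x1, y1), gt) > threshold:
                correct += 1
        return correct / max(len(pred_boxes), 1)

Sweeping the threshold and re-evaluating part_ap produces the AP vs. overlap threshold curves reported in the results.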
4. Experiments

4.1. Dataset

Yang's model [1] is pre-trained on the Image Parse dataset [9]. For testing, we require a dataset containing human full-body footage because the model is trained on images containing full-body poses.

To capture a variety of poses, we pulled video footage from YouTube containing varied subject matter [10], [11], [12], [13], such as people walking, dancing, and playing sports. We cut these videos such that each clip contains a single camera view and the full body of the subject. We preprocess the clips to obtain image sequences of the frames. Each frame is downsized using bicubic interpolation to be about 256x256 pixels while maintaining the original aspect ratio. The downsizing is done to match the approximate size of the testing images used in [1].

The ground truths associated with our dataset were made by manually clicking the points of all 26 different body parts for every 10th frame. Each click is the centroid of a bounding box for a given body part. For evaluation, we believe comparing every 10th frame with the ground truth values is sufficient to determine accuracy.

4.2. Results

4.2.1 HOG interpolation

The HOG interpolation failed to provide accurate bounding boxes for subsequent frames because of drift. Any pose estimation errors made by Yang in the first frame are propagated into the subsequent frames, and the quality of the interpolation degrades the farther removed we are from the initial frame. Figure 4 shows the decrease in average IOU with increasing frame number. In general, the average IOU
overlap with the ground truth over all frames in all clips is significantly lower than in the original Yang output (Figure 5).

All clips performed worse under HOG interpolation except for Walking Clip 3 (the diamond in Figure 6). A histogram of the average IOU for each body region reinforces that finding (Figure 7). This is likely not because the HOGs performed well, but instead because the Yang output performed poorly for that particular clip. Note that the left arm in Figure 8 is not properly localized by the Yang output, but the HOGs have some overlap with the ground truth. Also note that the right arm has better localization in the HOGs interpolation than in the Yang output.

Figure 4: Average IOU over all video clips. Each thin solid line represents a clip. There are 12 clips ranging from 51 to 121 frames. The black dotted line is an average IOU over all clips.

Figure 5: Average IOU over all clips for each body region of the Yang output (blue) and HOGs interpolation (yellow).

Figure 6: AP vs. Overlap Threshold Curve of the original Yang output (red) and the HOGs interpolation output (blue). Lines with corresponding symbols indicate corresponding clips. For example, the triangle symbol is the Yang and HOG evaluation for Walking Clip 1.

Figure 7: Average IOU in Walking Clip 3 for each body region of the Yang output (blue) and HOGs interpolation (yellow).

4.2.2 One Pass SVM with Randomly Selected Negatives

Our single pass SVM has a pyramid depth of 5 and 100 cluster centers because those parameters produced consistently good results. We trained and tested the SVM on 5 clips and found that it improved the performance of two of the clips, decreased performance in two of the clips, and did not change the performance in one of the clips (see Figure 9). Specifically, the SVM improved Beyonce Clip 1 and MLB Clip 1, made Dog Walking Clip 2 and Walking Clip 1 worse, and left Dog Walking Clip 1 the same. The improvement in Beyonce Clip 1 was very
significant (the asterisk in Figure 9). Figure 10 shows that the original Yang output placed the bounding boxes too far right and the SVM correction shifted them back to the center. The SVM also fixed one of the bounding boxes in the left (pink) arm.

Averaging the IOU over all of the clips (Figure 11) reveals that the SVM did slightly worse for all body regions except for the head, left torso, and left arm.

Figure 8: Frame 41 of Walking Clip 3. (a) Yang Output; (b) HOGs; (c) Ground Truth.

Figure 9: AP vs. Overlap Threshold Curve of the original Yang output (red), the single pass SVM correction (blue), and the double pass SVM correction (green). Lines with corresponding symbols indicate corresponding clips. Diamond: Beyonce Clip 1, asterisk: Beyonce Clip 6, x: Dog Walking Clip 1, triangle: Dog Walking Clip 2, square: MLB Clip 1, caret: Walking Clip 1.

Figure 10: Frame 91 of Beyonce Clip 1. (a) Yang Output; (b) Single Pass SVM; (c) Ground Truth.

Figure 11: Average IOU over all clips for each body region of the Yang output (blue), single pass SVM correction (green) and double pass SVM correction (yellow).

4.2.3 Double Pass SVM with Hard Negatives

The double pass with hard negative mining improves the performance over the single pass SVM in Beyonce Clip 6, Dog Walking Clip 2, and Walking Clip 1 (Figure 9). However, in Dog Walking Clip 2 and Walking Clip 1, it still performed worse than the original Yang output. The double pass SVM performed significantly better than the original Yang output in Beyonce Clip 1 and Beyonce Clip 6. For MLB Clip 6, the SVM corrections have higher average precision at lower and middle thresholds, while the original Yang output has a higher average precision at the highest thresholds. In Dog Walking Clip 1 the performance of all three methods is similar.

In general, both the single and double pass SVM, when averaged over all the clips, resulted in more accurate bounding boxes than the original Yang output (Figure ??). The single pass SVM performs the best for the left torso and right leg, while the double pass SVM performs the
best with the head, left arm, left leg, and right torso. This indicates that the extra background SVM pass and the hard negative mining did improve the performance of the SVM correction overall, especially since the arms from incorrect Yang outputs tend to include background sub-images.

Figure 12: Example frames from various different clips displaying the SVM correction using hard negative mining and a double pass. The top row is the original Yang result and the bottom row is the result after our SVM correction. (a) MLB Clip 1; (b) Dog Walking Clip 1; (c) Beyonce Clip 1; (d) Walking Clip 1.

Figure 12 shows example frames where the double pass SVM corrects errors in the original Yang output. For example, the right arm for Walking Clip 1 in the Yang output bounds the background, while in the SVM correction it bounds the right arm. For MLB Clip 1 and Beyonce Clip 1 the arms move closer to the body in the SVM correction, except for one bounding box. The solitary bounding box remains far away because the true arm is outside of the search space defined by our correction algorithm. In Beyonce Clip 1 the left and right legs alternate, probably because the SIFT features are very similar between the left and right legs. There is also a right arm bounding box on the left leg because, please, Beyonce's legs basically look like arms anyway.

5. Future Work

If given the time, we could make several modifications to our SVM. Firstly, we did not tune all of the parameters of the SVM across all of the clips to find the best overall set of parameters. We also noticed that while some values worked well for some clips, they worked less well for others. More investigation in this area could produce interesting insights.

Secondly, in the implementation of the hard negatives for the SVM, we arbitrarily set a threshold to decide whether to include the negative example in our set of negatives for the final SVM. For some clips this threshold was too high and it was difficult to collect enough negative examples in a reasonable time. In the future we could vary the threshold and create another AP curve or ROC curve based on that threshold to determine its effect on the performance of the SVM.

The current implementation of the SVM is impractically slow. The most time is spent computing the Bag of Words feature vectors in various parts of our algorithm, including the hard negative mining loop and the sliding window correction section. Therefore, we believe this to be the bottleneck of our method. Thus, parallelizing this computation, such that all frames or even all body parts in each frame are computed in tandem, could provide a significant speedup.

6. Conclusion

It is certainly true that human pose estimation is a challenging subject with many avenues of research yet to be explored. We have made a small effort by introducing a method that utilizes the similarities among video frames to improve a single image pose estimation model when used in a multi-frame context. The improvement was particularly marked on the clips where the original Yang algorithm performed the most poorly - and arguably where improvement was most necessary.

More importantly, we have highlighted areas where more research is possible and laid the groundwork for future avenues of investigation.

References

[1] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In IEEE Conf. on Computer Vision and Pattern Recognition
(CVPR), pages 1385–1392, Washington, DC, USA,
2011. IEEE.
[2] B. Bonnechère, B. Jansen, P. Salvia, H. Bouzahouene, L. Omelina, J. Cornelis, M. Rooze, and S. Van
Sint Jan. What are the current limits of the kinect sen-
sor? In 9th International Conf. on Disability, Virtual
Reality and Associated Technologies, pages 287–294,
Laval, France, 2012.
[3] A. Agarwal and B. Triggs. 3D human pose from silhou-
ettes by relevance vector regression. In IEEE Conf.
on Computer Vision and Pattern Recognition (CVPR),
volume 2, pages II–882–II–888 Vol.2, June 2004.
[4] M. Dantone, J. Gall, C. Leistner, and L. van Gool. Hu-
man pose estimation using body parts dependent joint
regressors. In IEEE Conf. on Computer Vision and
Pattern Recognition (CVPR), pages 3041–3048, Port-
land, OR, USA, June 2013. IEEE.
[5] A. Toshev and C. Szegedy. DeepPose: Human
pose estimation via deep neural networks. CoRR,
abs/1312.4659, 2013.
[6] Andrea Vedaldi. Object category detection practical.
http://www.robots.ox.ac.uk/~vgg/practicals/category-detection.
[7] A. Vedaldi and B. Fulkerson. VLFeat: An open
and portable library of computer vision algorithms.
http://www.vlfeat.org/, 2008.
[8] M. Everingham, L. Van Gool, C. K. I. Williams,
J. Winn, and A. Zisserman. The pascal visual object
classes (voc) challenge. International Journal of Com-
puter Vision, 88(2):303–338, June 2010.
[9] D. Ramanan. Learning to parse images of articulated
bodies. In Advances in Neural Information Processing
Systems 19, Proceedings of the Twentieth Annual Con-
ference on Neural Information Processing Systems,
Vancouver, British Columbia, Canada, December 4-7,
2006, pages 1129–1136, 2006.
[10] beyonceVEVO. Beyonce - sin-
gle ladies (put a ring on it).
https://www.youtube.com/watch?v=4m1EFMoRFvY.
[11] Barcroft TV. Dog whisperer: Trainer
walks pack of dogs without a leash.
https://www.youtube.com/watch?v=Cbtkoo3zAyI.
[12] Cesar Bess. Mlb top plays april 2015.
https://www.youtube.com/watch?v=mpe9w-CHsoE.
[13] BigDawsVlogs. Walking next to people extras.
https://www.youtube.com/watch?v=776niN4-A58.

Indoor Scene Segmentation using Conditional Random Fields

Colin Wei and Helen Jiang


Stanford University
{colinwei, helennn}@stanford.edu

Abstract

Indoor scene segmentation is a problem that has become very popular in the field of computer vision, with applications that include robotics, medical imaging, home remodeling, and video surveillance. This problem proves even more difficult when the scene is cluttered. Our project aims to explore ways to improve indoor scene segmentation algorithms by examining and evaluating a popular method.

We focus on evaluating the robustness of the algorithm for indoor scene segmentation described in [16] by Silberman et al., which uses SIFT features and conditional random fields to produce segmentations. In our project, we re-implement their method and compare performance by modifying specific sections.

We find that the neural network used in [16] is not robust to an increasing number of classes, but the CRF model is, in the sense that as we increase the number of classes, the CRF becomes increasingly important for producing an accurate segmentation. Furthermore, we also demonstrate that changing the algorithm we use to generate superpixel segmentations increases the classification accuracy of the entire pipeline.

1. Introduction

In our paper, we explore the semantic segmentation of indoor scenes, even cluttered ones. The main goal is that, given an RGB or RGB-D image of a cluttered indoor scene, we output a properly labeled image with each individual pixel corresponding to an object class, such as a television, chair, or table. Although semantically segmenting a scene is an easy task for humans, automatic segmentation using machines proves to be a challenging problem.

The solution we implement is modeled after the segmentation algorithm described in [16], which utilizes neural networks trained on SIFT features along with conditional random fields in order to produce a semantic segmentation.

The goal of our paper is to thoroughly evaluate the algorithm described in [16] in order to understand its success cases and failure cases. We hope that in doing so, we can gain intuition on directions for future work. We follow and re-implement the technical details of Silberman et al.'s method of indoor semantic segmentation. Furthermore, we test modifications to the method.

2. Related Work and Contributions

2.1. Literature Review

There is a large body of work on using conditional random fields (CRFs) to produce semantic segmentations of images. In [11], He et al. present an approach using multi-scale conditional random fields for image segmentation. They leverage 3 different probabilistic models: a classifier relying only on local information, a conditional random field relying on hidden regional variables to model interactions between object classes, and a CRF that incorporates global label information. Their model relied on a complicated training loop. Subsequent works such as [15, 18] use pairwise potentials to model the interactions between neighboring pixels when producing semantic segmentations. In [16, 14], Silberman et al. follow the same CRF framework and also introduce the NYU dataset, a densely labeled dataset of indoor scenes. Finally, in [8], Chen et al. introduce a state-of-the-art segmentation pipeline which utilizes a deep convolutional neural network and a fully-connected CRF to produce accurate segmentations.

2.2. Our Contribution

We reimplement the semantic segmentation approach in [16], and run the following main experiments:

1. We evaluate the ability of the neural network used in [16] to learn 100 object classes. In [16], 13 object classes are used. This experiment allows us to measure the robustness of their neural network.

2. We evaluate the performance of their CRF pipeline using different superpixel algorithms to create initial low-level segmentations. We also qualitatively analyze the performance of the CRF on images in the test set and note potential areas for improvement.
3. We evaluate the performance of their CRF pipeline in the 100 object class setting, and show that the CRF model is robust to an increasing number of object classes.

Figure 1: Example data from the NYU dataset: Left = RGB image; Middle = raw depth image; Right = ground truth class labels created by Amazon Mechanical Turk.

3. Technical Details

3.1. Dataset

We use the SUN RGB-D dataset [17], which contains the RGB-D images from the NYU depth dataset [14], which is the one that Silberman et al. created and used, the Berkeley B3DO dataset [12], and the SUN3D dataset [21].

In our project, we use 1449 images from only the NYU v2 dataset [14], although we use the labels provided by the SUN dataset [17]. The NYU dataset includes the raw RGB images, the raw depth images, and the labeled images, as shown by example in Figure 1. We chose to use the SUN RGB-D version of the NYU images because the SUN RGB-D dataset is around 10000 images total, allowing us to extend our work to settings with more data in the future. The SUN RGB-D dataset provides object class labels for each individual pixel of every image. Although the dataset provides depth information for each image, we only use RGB information for all of our implementations.

3.2. Segmentation Pipeline

To produce a semantic segmentation of an image, we take the following steps, following the method of Silberman et al.:

1. Extract SIFT features {f_ij} from a dense grid on the image using a sliding window. As in [16], we use a grid with a stride of 10, where each sliding window is of size 40 by 40. Although Silberman et al. concatenate SIFT features from three different scales, we only use one scale for our project in the interest of saving time and computational power.

2. Use a neural network to produce class probabilities P(· | f_i) for each feature location i in the image. As in [16], our neural network has a single hidden layer of size 1000.

3. Use a low-level segmentation algorithm to segment the image into superpixels. We experiment with Felzenszwalb segmentation, the approach used in [16], as well as quickshift and SLIC segmentation. All points i in a superpixel are assigned class probabilities P_i that are equal to the average of the class probabilities for all descriptors corresponding to grid points in the superpixel. If no grid points fall inside a superpixel, we assign the superpixel uniform class probabilities.

4. Model pixel labels as a conditional random field. The energy of the CRF is defined as follows (a small sketch of evaluating this energy is given after this list):

   E(y) = \sum_{i \in I} \phi_i(y_i) + \sum_{i \in I} \sum_{j \in N(i)} \psi_{ij}(y_i, y_j)

   The summations are taken over all pixels in the image. The first summation models unary potentials for each class, while the second summation models pairwise interactions between neighboring pixels.

   Silberman et al. model \phi_i as the negative log of the product of P_i(y_i) and a location prior on the class y_i. We did not implement location priors and instead set \phi_i(y_i) = -\log P_i(y_i).

   Finally, we set the pairwise potentials

   \psi_{ij}(y_i, y_j) = \mathbb{1}(y_i \neq y_j) \, \eta \, e^{-\alpha \| I_i - I_j \|_2^2}

   where I_i and I_j are the i-th and j-th RGB color channels and \eta, \alpha are hyperparameters. This potential function mirrors the one used in [10]. Although Silberman et al. use a different pairwise potential function, we find that this potential function is easier to tune.

5. Minimize the energy function of the CRF. In [16], Silberman et al. use the scheme provided by [4]. We experiment with both Boykov et al.'s α-expansion algorithm in [4] and simulated annealing [2].
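To make step 4 concrete, the following is a minimal numpy sketch, under our assumptions, of evaluating this energy for a candidate labeling on a 4-connected grid: the unary term is the negative log probability from the network (already averaged over each superpixel), and the pairwise term is the contrast-sensitive Potts penalty written above. It only illustrates the objective that the α-expansion or annealing step then minimizes; it is not the optimizer itself, and the default hyperparameter values are placeholders.

    import numpy as np

    def crf_energy(labels, log_probs, image, eta=1.0, alpha=0.05):
        """labels:    H x W integer class assignment y.
        log_probs: H x W x C array of log P_i(c) (superpixel-averaged).
        image:     H x W x 3 float RGB image used for the contrast term."""
        h, w = labels.shape
        # Unary term: sum of -log P_i(y_i) over all pixels.
        unary = -log_probs[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()

        def pairwise(a_lab, b_lab, a_rgb, b_rgb):
            # Potts penalty weighted by the color similarity of the two pixels.
            diff = np.sum((a_rgb - b_rgb) ** 2, axis=-1)
            return np.sum((a_lab != b_lab) * eta * np.exp(-alpha * diff))

        # 4-neighborhood: horizontal and vertical neighbor pairs.
        pair = pairwise(labels[:, :-1], labels[:, 1:], image[:, :-1], image[:, 1:])
        pair += pairwise(labels[:-1, :], labels[1:, :], image[:-1, :], image[1:, :])
        return unary + pair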

3.3. Low-level Segmentation Methods

In our semantic segmentation framework, we utilize three different unsupervised algorithms for the purpose of producing low-level segmentations:

3.3.1 Felzenszwalb [9]

This algorithm is used by Silberman et al. and preserves detail in low-variability regions of the image rather than high-variability regions. The algorithm proceeds by using a graph-based representation of the image to find region boundaries, sequentially combining different regions based on a similarity score across regions.

3.3.2 Quickshift [20]

Quickshift is a mode-seeking clustering algorithm that builds off of variants of mean-shift. The algorithm provides
a tree where images are connected to their nearest neigh- Class Name # Training Descriptors # Test Index
bors with larger pixel values. By cutting branches of the Bed 31791 39161 0
tree whose distance cross a max threshold, we obtain clus- Bookshelf 58853 38289 1
ters for the image. Cabinet 99732 80484 2
Ceiling 14025 11884 3
3.3.3 SLIC (Simple Linear Iterative Clustering) [1] Floor 108333 80595 4
Picture 24932 25235 5
SLIC is a clustering method that essentially applies the k- Sofa 34744 47957 6
means algorithm on a different feature space, which takes Table 29964 28882 7
into account pixel location and color intensity. After con- Television 7126 12905 8
structing this feature space, SLIC uses Lloyd’s algorithm to Wall 373894 311308 9
output cluster assignments. Window 30695 28160 10
3.4. CRF Optimization The total number of training and test examples for each
class for our 11 class model.
Although the minimization of the energy function of a
CRF is an NP-hard problem, the α-expansion algorithm [4]
and simulated annealing provide general methods for per- the probability distribution is near uniform, to low values
forming this optimization [2]. near 0, where the distribution concentrates on labelings that
minimize the energy function.
3.4.1 α-expansion Algorithm
3.5. Our Implementation Details
This algorithm by [4] formulates the problem of minimiz-
ing the energy function of a CRF as a min-cut problem. We implement everything using Python. To extract SIFT
Although this problem is still NP-hard in the worst case, features, we use OpenCV’s Python wrapper [5]. To train
Boykov et. al provide an approximation algorithm that finds the neural network over the dense grid of SIFT features,
a local minimum with respect to α-expansion moves, which we use the Python package scikit-neuralnetwork [7]. We
consists of moves where pixels change their labels to α for use the package scikit-image to run the superpixel seg-
a given label α. Given a labeling y and label α, we can mentation algorithms [19]. Finally, for the graph-cut op-
compute timization algorithm of [4], we use gco-python, Andreas
Mueller’s Python wrappers for the gco optimization pack-
min E(y 0 ) age [4, 13, 3]. We implemented simulated annealing from
y 0 ∈Nα (y)
scratch using Cython to optimize for speed. We did not
where Nα (y) is the set of labels within one α-expansion of use any code from [16], as we implemented all of the pre-
y and E(y 0 ) is the energy of the labeling y 0 , in polynomial processing steps ourselves.
time using a reduction to standard min-cut. The algorithm
proceeds by iteratively performing this local minimization 4. Experimental Setup and Results
procedure. 4.1. Neural Network
We train and test a neural network for classifying image
3.4.2 Simulated Annealing
patches. We partition all images from the NYU dataset into
Simulated annealing is a black-box combinatorial optimiza- the train/test splits provided in [14]. For each training im-
tion problem that is guaranteed to converge to the optimal age, we extract SIFT features from the dense grid described
solution, though it may be very slow in practice [2]. The in Section 3.2. Likewise, we construct a test set by extract-
algorithm proceeds by sampling local changes to a label as- ing SIFT features from each image in the test set using the
signment (i.e. the configuration of a single pixel at location same dense grid. After consolidating training features, we
i) with conditional probability equal to randomly partition the SIFT descriptors into 12 total subsets
so we do not have to load the entire training set into mem-
exp − T1 E(y new )

new ory. One training loop consists of a pass over all 12 training
P (yi |y) = P 1 0

y 0 ∈Ni (y) exp − T E(y )
subsets, starting with a learning rate of 0.1 and multiplying
by a decay rate of 0.75 with each subsequent subset. Every
where Ni (y) is the set of labelings obtained by changing other hyperparameter for our neural network is set to the
only pixel i, and T is a changing temperature parameter. scikit-neuralnetwork default.
The idea is to gradually reduce T from high values, where Because the SUN RGB-D dataset contains 894 different
Table 2: Train/Test Accuracy on 11 Class Set
% Train Correct % Test Correct
62.06% 49.96%
Accuracy on 11 class training and test sets. Computed for
all grid points for each train/test image.

Table 3: Train/Test Accuracy on 100 Class Set


% Train Correct % Test Correct
25.89% 13.37%
Accuracy on 100 class training and test sets. Computed for
all grid points for each train/test image.
(a) Confusion matrix for training.

object class labels, a number that is too big for a neural net-
work with a single hidden layer to accurately classify, we
instead train on subsets of the classes. In total, we train
2 different neural networks. Following [16], we handpick
11 object classes to train on, shown in Table 1. Silberman
et. al use 13 classes - the classes that we use, in addition
to the “blind” and “background” classes. We do not use
these classes, however, because they do not appear in the
SUN RGB-D labels. We also choose the 100 most common
labels, and we train a neural network to classify between
these classes. For reference, our code includes a pickle file
containing the names of these 100 classes. The 11 hand-
picked classes form a large subset of these 100 most com- (b) Confusion matrix for testing.
mon classes. Figure 2: Confusion matrix for 11 class dataset. Indices
From Table 1, we can see that class distributions between correspond to indices in Table 1.
train and test are pretty similar, but class distributions are
both very skewed. This also holds for the 100 classes set.
Because of this skewed distribution, it is possible to train seen during training.
a neural network that achieves high test accuracy but only Judging from the confusion matrices in Figure 2, it
learns a few classes properly. To remedy this issue, we bal- seems that the 11 class neural network performs worst on
ance the training distribution by capping each class at 5000 classes that are both relatively scarce and similar in appear-
examples per split in the 11 class case and 1000 examples ance to other classes. For example, sofas and tables are
per split in the 100 class case. We are unclear on how Sil- commonly classified as floors. These sofa and table classes
berman et. al work around this problem. are very scarce compared to floors and exhibit similar prop-
In Table 2 and Figure 2, we show the accuracy results erties such as a large flat surface, leading to this incorrect
and confusion matrices for the 11 class dataset. In Table 3 classification. We have already mitigated many of these in-
and Figure 3, we show the accuracy results and confusion correct classifications by trying to balance classes during
matrices for the 100 class dataset. From the discrepancy be- training, but we are unsure how to improve this further with-
tween the training and testing accuracy for both datasets, it out switching to a deeper network architecture. Since Sil-
is clear that our models overfit, even though we use a sub- berman et. al do not provide their neural network results
stantial amount of training data given the size of networks in [16], we cannot perform a direct comparison. However,
we train. There are two main reasons why this could hap- their CRF with only unary potentials achieves a 40.9% pixel
pen. First, descriptors from the same image corresponding accuracy on 13 classes, which implies that our results on 2
to nearby points possess some redundancy, which means the fewer classes are on a comparable performance level.
effective number of training samples is smaller than the ac- We cannot make any comparison to [16] on the 100 class
tual number. Second, intra-class discrepancy is very high case because they only provide results for 13 classes. How-
between different indoor scenes. Since the train/test splits ever, the confusion matrix in Figure 3 shows that the sin-
in [14] are arranged so that not a single scene is in both train gle layer neural network is not robust enough for the 100
and test, test images could present variations of a class not class case. Many of the classes that are very incorrectly
Table 4: Superpixel Algorithm and 11 Class CRF Perfor-
mance
Superpixel Alg. Unary Acc. CRF Acc.
Felzenszwalb 48.74% 49.95%
Quickshift 48.20% 51.21%
SLIC 49.93% 51.35%
Accuracy of pixel-level labels for segmentation of test im-
ages on 11 classes. We only consider pixels that fall in one
of the 11 classes. Unary accuracy is computed from the
segmentation given by minimizing the unary terms of the
energy function. CRF accuracy is computed from consider-
ing pairwise terms too.

(a) Confusion matrix for training.

(a) Original image

(b) Felzenszwalb, quickshift, SLIC superpixels


Figure 4: Example superpixel segmentation.

(b) Confusion matrix for testing.


Figure 3: Confusion matrix for 100 class dataset. Indices 4.2. Superpixel Algorithms and CRF Performance
are available in the code.

For the 11 class case, we analyze the performance of


the entire segmentation pipeline described in Section 3.2,
classified in the test confusion matrix are also very uncom- varying the algorithm we use to create initial superpixel
mon; for example the class ”backpack” (index 8) appears segmentations. For the scikit-image implementation of the
only 539 times throughout the entire training set, and the Felzenszwalb algorithm, we set the scale parameter to 100
model rarely outputs that label. An interesting question is based on qualitative evaluation of a few training images.
whether the model fails because SIFT descriptors cannot For quickshift and SLIC, we use the default hyperparame-
capture enough information about all 100 classes or because ters from the implementation of scikit-image.
the network architecture itself is flawed. In future work, we
could investigate this by training the single layer network Table 4 shows the accuracy results for our different trials
on the raw image patches and observing whether this im- over the entire test set. To produce the results in Table 4, we
proves results, which would imply SIFT as the problem. If use Boykov et. al’s α-expansion algorithm for minimizing
the results do not improve, a deeper network may be neces- the CRF energy for each trial [4, 13, 3]. We use 5 iterations
sary. for the α-expansion algorithm.
Figure 5: The images on the left show segmentation results Surprisingly, we find that performing Felzenszwalb seg-
for different superpixel initializations in Figure 4. On the mentation to create superpixels, the method that Silberman
right, the truth map is show. Black means the model clas- et. al use in [16], actually results in the worst performance
sified the pixel correct, white means the model classified out of all the superpixel segmentation methods that we try.
incorrect, and gray means the pixel does not belong to any The accuracy rate of the conditional random field, 49.95%,
of the 11 classes. is more or less the same as the test accuracy of the 11 class
set, and the accuracy of the unary model with Felzenszwalb
is worse. We should note that the two test sets are differ-
ent - the test set in Section 4.1 uses only the subset of pixel
locations along the dense grid, while our test set here uses
all pixels. Even so, it seems that first performing the super-
(a) Felzenszwalb unary pixel segmentation with Felzenszwalb and quickshift actu-
ally hurts the performance of the classifier. This is because,
as seen from Figure 5, both Felzenszwalb and quickshift
create tiny clusters that do not contain any grid points (see
the tiny yellow clusters in Figures 5a and 5c) and are thus
assigned uniform superpixel probabilities. SLIC works well
even restricted to unary potentials because it produces larger
(b) Felzenszwalb CRF
clusters, ensuring that a grid point falls in each cluster.
In the CRF setting with pairwise potentials, quickshift
and SLIC both provide better performance than Felzen-
szwalb segmentation. This is because, as seen in 4b,
quickshift and SLIC provide more balanced cluster sizes,
whereas Felzenszwalb segmentation can often produce very
(c) Quickshift unary large superpixels. Since all pixels within a superpixel
share the same class probabilities, if these large superpixels
have the wrong class probabilities, performance will suffer.
Meanwhile, for quickshift and SLIC superpixels which are
smaller, as long as a few superpixels have the correct class
probabilities, the CRF will be able to fix the labels for adja-
cent superpixels too.
(d) Quickshift CRF For all three superpixel algorithms, the CRF is able to
correct some of the mistakes seen in the unary version of
the model. For example, it can fix nearly all of the inaccura-
cies introduced by superpixel clusters that are too small, as
seen in the smoothness of the CRF segmentations in Figure
5. Furthermore, the CRF is able to produce minor improve-
(e) SLIC unary ments in fixing some small wrongly classified regions. The
potential for improvement is limited in three ways:

1. The accuracy of the neural network is too low. If the


neural network is too confident in the wrong labels,
then it is hard for the CRF to change as desired. In fu-
ture work, we can examine whether using a CRF with
(f) SLIC CRF more powerful neural network models results in more
performance gains solely from the CRF.

2. The spatial transition potentials do not provide a good


enough model of the transitions between classes. As
an example, this occurs in Figure 5f. The CRF is able
to correctly label more parts of the floor near the right
side of the bed. As seen in Figure 4b, a large chunk of
the patch of the floor incorrectly classified in the SLIC
unary belongs to the same superpixel. Since the CRF
Table 5: 100 Class CRF Performance
Superpixel Alg. Unary Acc. CRF Acc.
Felzenszwalb 15.16% 15.78%
SLIC 14.77% 15.67% Original/SLIC Original/SLIC
Accuracy of pixel-level labels for segmentation of test im-
ages on 100 classes. We only consider pixels that fall in one
of the 100 classes. Unary accuracy is computed from the
segmentation given by minimizing the unary terms of the
SLIC unary SLIC unary
energy function. CRF accuracy is computed from consider-
ing pairwise terms too.

is able to fix half of this superpixel, neural network in- SLIC CRF SLIC CRF
accuracy is not the only problem - therefore, transition Figure 6: Sample truth maps and segmentations for the 100
potentials must be an issue too. class case.

3. The CRF can fix labels for pixels only if labels for We provide our accuracy results in Table 5. Surprisingly,
nearby pixels in the same object are correct. This is whereas the superpixel segmentations hurt our unary poten-
because the CRF model is based on pairwise interac- tial performance in the 11 class case, they actually improve
tions between neighboring pixels. In future work, we our performance in the 100 class case over train and test ac-
could attempt to fix this issue by using a fully con- curacy given in Table 3. We are unsure why this is the case.
nected CRF model as in [8], which allows the model In addition, Felzenszwalb actually provides better perfor-
to account for global feature interactions. mance for unary potentials now. This discrepancy might
be due to the fact that in the 100 class case, larger clusters
Another way to potentially address the first and second lim-
might result in significant improvement for some test im-
itations is to fine-tune the neural network probabilities and
ages because averaging class probabilities of large clusters
learn the pairwise interaction potentials by directly train-
reduces noise, and there is more noise in the 100 class case
ing a CRF model instead of handcrafting the pairwise po-
as opposed to the 11 class case.
tentials. We can do this by modeling the pairwise poten-
The increase in accuracy percentage between the unary
tials as the result of some convolution kernel applied to
and CRF models is an interesting result that demonstrates
the local image patch followed by some nonlinearity. We
the robustness of the CRF model. As the number of classes
could formulate a training objective that maximizes the log-
increases, the CRF actually seems to make a bigger impact
likelihood of the true labels, optimize it using contrastive di-
on the final segmentation. Although the actual increase in
vergence [6], and perform back-propagation into the unary
accuracy is lower in the 100 class case than the 11 class
and pairwise potentials to learn these functions. We suspect
case, the increase is higher in proportion because many
that this approach could mitigate the first and second limita-
fewer pixels get labeled correctly in the 100 class case.
tions by providing an energy function that is optimized for
From the examples shown in Figure 6, we can also qual-
the desired task, producing correct pixel labels. Because of
itatively observe the increased impact of the CRF on seg-
lack of computational power, we leave this idea for future
mentation quality. In Figure 6, we show example segmenta-
work.
tions using the SLIC superpixel algorithm; we choose to an-
alyze SLIC because SLIC and Felzenszwalb exhibit similar
4.3. Varying the Number of Classes
CRF performance on the 100 class set, while SLIC is clearly
We also explore the robustness of the approach in [16] better on the 11 class set. Whereas the truth maps in Fig-
to a varying number of classes. To do so, we run the seg- ure 5 do not change much between the unary and pairwise
mentation pipeline in Section 3.2 the test set using the 100 cases, the truth maps shown in Figure 6 exhibit significant
most common class labels. We run our experiments using changes between the two cases. Furthermore, the segmenta-
Felzenszwalb segmentation and SLIC in order to generate tions obtained using the CRF contain far fewer clusters than
superpixels; we forgo testing quickshift because it provided the segmentations obtained only using the unary potentials.
similar performance to SLIC in our 11 class test set and the We can explain the increased impact of the CRF as fol-
segmentation algorithm takes too long to run in conjunction lows: since there are a larger number of classes the network
with performing α-expansion optimization on 100 classes, is less certain about its classification choices and therefore
which already requires significantly more time than the 11 assigns more uniform class probabilities. As a result, the
class case. pairwise potential term is larger in magnitude than the unary
Table 6: Optimization Algorithm Comparison
Optimization Alg. % Acc on Random Sample
Simulated Annealing 51.55%
Boykov et. al 53.01%
Unary 51.56%
Accuracy of pixel-level labels for segmentation on 11
classes using different optimization scheme. In the “Unary”
scheme, we ignore pairwise potentials and take the argmax
class probability for each pixel. For all optimization meth-
ods, we use SLIC to obtain superpixels.

term, which means the decisions resulting from pairwise


potentials are weighted more heavily. Thus, we can con-
clude that while the neural network model is not robust to an
increase in the number of classes, the CRF is quite robust.
Thus, perhaps we can expect further improvements by us-
ing better CRF potentials as described at the end of Section
4.2. One interesting question for future work is whether the
CRF would still make as big of an impact in the case where
we have a better neural network model for classifying the
grid points.
(a) Simulated annealing with SLIC

4.4. Simulated Annealing

In an attempt to determine whether the optimization al-


gorithm used for the CRF is optimal, we implement simu-
lated annealing, a black box algorithm useful for Markov
Random Fields [2]. We anneal from a temperature of 1
to 0.01, using 15 total linearly spaced temperatures. Each
annealing step, we perform a Gibbs sampling sweep over
every single pixel in the image, sampling new class labels
with probability proportional to exp − T1 E(y) , where T


is the current temperature and E(y) is the energy of CRF


on labels y. We initialize our labels using the minimum en-
ergy configuration for unary potentials - although we tried
different initialization methods, we found that this worked
best empirically. Because we sweep every single pixel in
the image every Gibbs sampling step, our implementation
is extremely slow, even when we use Cython. Therefore,
we only run our segmentation pipeline on a random sample
of 69 test images, using only the 11 class model. For fair-
ness, we compare to the α-expansion algorithm in [4] and
unary potentials using the same random sample. We show
results in Table 6.

Table 6: Optimization Algorithm Comparison. Accuracy of pixel-level labels for segmentation on 11 classes using different optimization schemes. In the "Unary" scheme, we ignore pairwise potentials and take the argmax class probability for each pixel. For all optimization methods, we use SLIC to obtain superpixels.

Optimization Alg.       % Acc on Random Sample
Simulated Annealing     51.55%
Boykov et al.           53.01%
Unary                   51.56%
Figure 7: Example segmentation using simulated annealing: (a) simulated annealing with SLIC; (b) unary with SLIC.
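For reference, the sweep described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' Cython implementation; the local_energy helper is a hypothetical stand-in for the CRF's unary-plus-pairwise energy of a single pixel with all other labels fixed.

import numpy as np

def gibbs_anneal(labels, local_energy, n_classes, temps=np.linspace(1.0, 0.01, 15)):
    # labels: 2D integer array of current class assignments
    # local_energy(labels, i, j, c): energy of assigning class c to pixel (i, j)
    rng = np.random.default_rng(0)
    for T in temps:                              # temperature schedule 1 -> 0.01
        for i in range(labels.shape[0]):         # one full Gibbs sweep per temperature
            for j in range(labels.shape[1]):
                energies = np.array([local_energy(labels, i, j, c)
                                     for c in range(n_classes)])
                # sample proportional to exp(-E/T); subtract the min for stability
                probs = np.exp(-(energies - energies.min()) / T)
                probs /= probs.sum()
                labels[i, j] = rng.choice(n_classes, p=probs)
    return labels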
Unfortunately, simulated annealing does not seem to optimize the CRF energy at all. As shown in our example, which uses the same initial image as Figure 4 and Figure 5, our simulated annealing implementation does very little to change the initial assignment. All examples in our test set appear similar to this. This suggests that some of our hyperparameters are not set correctly. For example, our initial temperature might be too high, resulting in the Gibbs sampler getting "stuck". Furthermore, we might not have run a sufficient number of iterations of simulated annealing. Given more time, we could try to optimize these hyperparameters to obtain better results. However, we also conclude that given the large number of hyperparameters to optimize and the high computational cost, simulated annealing is not worth the effort compared to cut-based segmentation.

5. Conclusion and Future Work

In this paper, we implement Silberman et al.'s image segmentation pipeline [16] with the final goal of exploring the strengths and weaknesses of their algorithm. We show that the neural network model based on SIFT features works sufficiently well for a small number of classes, but does not work for classification tasks on a larger number of object classes. We also analyze and explain the performance of different superpixel algorithms in place of Felzenszwalb's segmentation algorithm, and we find that SLIC is optimal in terms of both speed and final segmentation quality. Furthermore, we experiment with a larger number of object classes in our test set, and we show that the CRF framework is very robust to an increasing number of classes, even if the neural network model is not. Finally, we compare the performance of Boykov's segmentation algorithm to simulated annealing, and we find that Boykov's algorithm works much better for our problem.

Our hope is that our experimentation provides interesting directions for future research, and we believe that our analysis of the CRF's performance does this. In Section 4.2, we discuss limitations of the CRF and argue that directly learning some CRF potentials from data is a promising direction for further work. In Section 4.3, we show that the CRF becomes more important as the number of classes increases. Another interesting direction for future work is to see if this still holds for an even larger number of classes and also for differing CRF connectivities.

Our code can be found at: https://github.com/cwein3/im-seg

References
[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. 3
[2] D. Bertsimas and J. Tsitsiklis. Simulated annealing. Statist. Sci., 8(1):10-15, 1993. 2, 3, 8
[3] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124-1137, Sept. 2004. 3, 5
[4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell., 23(11):1222-1239, Nov. 2001. 2, 3, 5, 8
[5] G. Bradski. OpenCV. Dr. Dobb's Journal of Software Tools, 2000. 3
[6] M. A. Carreira-Perpinan and G. E. Hinton. On contrastive divergence learning. 7
[7] A. Champandard and S. Samothrakis. sknn: Deep neural networks without the learning cliff. 2015. 3
[8] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. CoRR, abs/1412.7062, 2014. 1, 7
[9] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. Int. J. Comput. Vision, 59(2):167-181, Sept. 2004. 2
[10] L. Grady. Random walks for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 28(11):1768-1783, Nov. 2006. 2
[11] X. He, R. S. Zemel, and M. A. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In Computer Vision and Pattern Recognition (CVPR), volume 2, pages II-695-II-702, June 2004. 1
[12] A. Janoch, S. Karayev, Y. Jia, J. T. Barron, M. Fritz, K. Saenko, and T. Darrell. A category-level 3-D object dataset: Putting the Kinect to work. In ICCV Workshop on Consumer Depth Cameras for Computer Vision, 2011. 2
[13] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? In Proceedings of the 7th European Conference on Computer Vision, Part III, ECCV '02, pages 65-81, London, UK, 2002. Springer-Verlag. 3, 5
[14] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012. 1, 2, 3, 4
[15] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Computer Vision - ECCV 2006, Part I, pages 1-15. Springer Berlin Heidelberg, 2006. 1
[16] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In Proceedings of the International Conference on Computer Vision, Workshop on 3D Representation and Recognition, 2011. 1, 2, 3, 4, 6, 7, 9
[17] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. 2
[18] B. Triggs and J. J. Verbeek. Scene segmentation with CRFs learned from partially labeled images. In Advances in Neural Information Processing Systems 20, pages 1553-1560. Curran Associates, Inc., 2008. 1
[19] S. van der Walt, J. L. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. D. Warner, N. Yager, E. Gouillart, T. Yu, and the scikit-image contributors. scikit-image: image processing in Python. PeerJ, 2:e453, 2014. 3
[20] A. Vedaldi and S. Soatto. Quick shift and kernel methods for mode seeking. 2008. 2
[21] J. Xiao, A. Owens, and A. Torralba. SUN3D: A database of big spaces reconstructed using SfM and object labels. In ICCV, 2013. 2
Understanding 3D Space-Semantics
Iro Armeni
Figure 1: Frameworks for 3D Parsing of Large-Scale Indoor Point Clouds into their Space-Semantics. Exploring two different network architectures: (1) A fully 3D CNN receives as an input a 3D voxelized sliding cube with binary occupancy and performs a per voxel multi-class classification into 5 semantic labels. (2) A fully 3D CNN receives as an input a voxelized enclosed space with binary occupancy and performs a per voxel multi-class classification into 10 semantic labels. The result in both cases is a class prediction per output voxel.
Abstract

Point clouds are comprehensive and detailed 3D representations for indoor spaces, however they contain no high-level information of the depicted area. In contrast to previous methods that have mostly focused on more traditional pipelines, I propose a Fully 3D Convolutional Neural Network for the semantic parsing of such data. In this paper I explore different network architectures and perform per voxel multi-class classification of the input into its semantics. Two main inputs are explored: (a) a 3D sliding cube on the large-scale point cloud and (b) a consistently aligned and normalized enclosed space of the point cloud (i.e. rooms, hallways, etc.). Both inputs are voxelized with values equal to their binary occupancy. The network outputs a class prediction per voxel. I provide experiments and results on the different architectures and used the Stanford Large-Scale 3D Indoor Dataset for their evaluation.

1. Introduction

We spend 90% of our time indoors [3] and for systems to operate in this environment (e.g. assist us in daily activities) they need to have an understanding of it. 3D depth sensors are quickly becoming standard practice and have enabled the representation of our surroundings as whole scenes in an ever increasing number of point clouds. Although such data is becoming more and more available, it is not directly useful since it does not contain high-level information of the depicted elements, such as space-semantics. Such an understanding would be beneficial to many applications related to augmented reality, robotics, graphics, the construction industry, etc.

Previous work has focused on extracting this information from point cloud data by following semantic parsing approaches (e.g. segmentation or object detection). However the great majority of them resorts to more traditional pipelines [26], where hand-engineered features are extracted and fed into an off-the-shelf classifier such as Support Vector Machines (SVMs). In the past few years these methods are gradually being abandoned in favor of Deep Learning ones that produce superior results by learning features and classifiers in a joint manner. Convolutional Neural Networks (CNNs) in particular have made great progress and are widely used, especially in the case of 2D data. Nevertheless, the area of CNNs on 3D data, especially for the task of parsing, is significantly less explored than its 2D counter-
parts [18].

Most work in the context of 2.5D and 3D using ConvNets is targeting other applications like depth from RGB [7], camera registration [13], and human action recognition [11]. This has limited the amount of produced knowledge and available implementations, pretrained models and training data for 3D related tasks. Alternative approaches to the problem of detecting 3D space semantics in large-scale indoor point clouds could be formed by posing the problem as 2D or 2.5D. Although these could benefit from pretrained models, existing architectures and other 2D or 2.5D datasets for training (e.g. [27] and [22] respectively), they would not take advantage of the rich spatial information provided in 3D point clouds, which can help disambiguate problematic cases. It has been shown that 3D parsing methods can perform better than their 2.5D counterparts [5].

I propose instead a framework for the task of parsing 3D point clouds of large scale indoor areas into their space-semantics using an end-to-end 3D CNN approach. At a higher level, the network receives as an input a voxelized 3D portion of a large-scale point cloud¹ and, through a series of fully 3D convolutional layers, it performs multi-class classification on the voxel level and outputs the predicted class for each voxel. The network classifies each input voxel into 10 semantic labels² related to structural and building elements, clutter and empty space.

¹ The scale of the point cloud can range from a whole building to a floor, or any large portion of the former.
² Due to memory restrictions some of the presented experiments use either 5 or 10 labels. For more details see Section 4.3.

2. Related Work

Traditional Approaches: Semantic RGB-D and 3D segmentation have been the topic of a large number of papers and have led to a considerable leap in this area during the past few years. For instance [29, 24, 22] propose a RGB-D segmentation method using a set of heuristics for leveraging 3D geometric priors. [21] developed a search-classify based method for segmentation and modeling of indoor spaces. These are different from the proposed method as they mostly address the problem at a small scale. A few methods attempted using multiple depth views [28, 9], but they remain limited to small scale. Unlike approaches such as [26], [15], my method learns to extract features and classify voxels from the raw volumetric data. Vote3D [31] proposes an effective voting scheme to the sliding window approach on 3D data to address their sparse nature.

2.5D Convolutional Neural Networks: A subsequent extension to RGB-D data followed the success of 2D CNNs ([17], [30], [4], [10]). However, most work handles the depth data as an additional channel and hence it does not make full use of the geometric information inherent in the 3D data. [8] proposes an encoding that makes better use of the 3D information in the depth, but remains 2D-centric. The presented work differs from these in that I employ a fully volumetric representation, resulting in a richer and more discriminative representation of the environment.

3D Convolutional Neural Networks: 3D convolutions have been successfully used in video analysis ([11], [12]) where time acts as the third dimension. Although on an algorithmic level such work is similar to the proposed one, the data is of very different nature. In the RGB-D domain, [16] uses an unsupervised volumetric feature learning approach as part of a pipeline to detect indoor objects. [32] proposes a generative 3D convolutional model of shape and applies it to RGB-D object recognition, among other tasks. VoxNet [20] presents a 3D CNN architecture that can be applied to create fast object class detectors for 3D point cloud and RGB-D data. This work has similarities, however among the differences: it uses a different input representation, it is not performing voxel-to-voxel classification and, since the task is detection, it uses fully connected layers.

3. Method

The proposed method receives as an input a voxelized 3D portion of the point cloud and through a series of 3D convolutions results in a class label prediction for each voxel. I gradually explored 3 different approaches:

• 3D Sliding Window: In this network, the input is a voxelized 3D cube of constant size with binary occupancy that is slid over the large-scale point cloud. When fed to the network, it passes through a series of 3D Fully Convolutional layers which result in a per-voxel multi-class classification (see Figure 1-Left).

• Adding Context: The previous approach does not provide any context about the content of the sliding cube in relation to the rest of the point cloud. However, context can strongly influence inference. To this end, I provide the global position of the sliding cube in the point cloud as a second input to the network, following a similar approach to [6] (see Figure 1-Right).

• Enclosed Spaces: The use of a sliding cube with constant size cannot account for the different sizes that elements in the point cloud appear with. Although for elements that belong to the category of things (e.g. chairs or tables) one can learn a dictionary of shapes, for elements that can be categorized as stuff (e.g. walls or ceiling) it is harder to identify repetitive shape and size patterns. To address this issue I explored an approach similar to [5] to take advantage of the repetitive layout configuration that indoor enclosed spaces present (e.g. elements are placed in a consistent way inside a room with respect to the entrance location). The semantics in such spaces remain intact (e.g. the wall, ceiling and
other elements appear in the data as a whole). These spaces are voxelized, consistently aligned and normalized in a unit cube before they are fed into a 3D Fully Convolutional Neural network. The task is again that of per voxel classification (see Figure 1-Right).

In the remainder of this section I will explain the main network architecture and offer details about each of the above-mentioned approaches.

Figure 1. 3D Convolutional Neural Network Architecture. Left: 3D Sliding Window or Enclosed Space. Right: Adding Contextual Information.

3.1. Input

The input to the method is a voxelized cube that represents the underlying spatial data with binary occupancy.

3D Sliding Window: Here the input is a 3D sliding cube of size 10x10x10 voxels. The stride of the cube on the point cloud in all 3 dimensions is 10 voxels, which means that there is no overlap between spatially consecutive cubes³. The size of the input is heuristically defined and takes into consideration the voxel resolution and space-structure. Specifically, I targeted to capture the representation of walls and room borders in point clouds as an empty space in between two surfaces. To reflect that in the voxelized space we chose a voxel resolution of 5x5x5cm, since the standard minimum wall width is 7-10cm. As a result, a sliding cube of 10x10x10 voxels corresponds to 50x50x50cm in space, which can encompass both the minimum wall width and a gap between rooms larger than the standard wall size (either due to noise or occlusion of the wall surfaces by e.g. a bookcase in highly cluttered scenes).

³ The size of the stride, as well as other network design decisions, has been largely driven by memory limitations. An overlapping stride would allow to infer the class label of each voxel not only based on its neighbors in one cube location, but also by taking into account a larger area around it.
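As an illustration of this input construction, the following NumPy sketch cuts non-overlapping 10x10x10 blocks out of a binary occupancy grid. The function and variable names are illustrative and not taken from the paper's code.

import numpy as np

def extract_sliding_cubes(occupancy, cube=10):
    # occupancy: 3D 0/1 array (e.g. voxelized at 5x5x5cm)
    # returns (num_cubes, cube, cube, cube) blocks plus each block's grid origin
    dx, dy, dz = (np.array(occupancy.shape) // cube) * cube   # crop to a multiple of the cube size
    grid = occupancy[:dx, :dy, :dz]
    cubes, origins = [], []
    for x in range(0, dx, cube):            # stride equal to the cube size -> no overlap
        for y in range(0, dy, cube):
            for z in range(0, dz, cube):
                cubes.append(grid[x:x + cube, y:y + cube, z:z + cube])
                origins.append((x, y, z))
    return np.stack(cubes), np.array(origins)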
Adding Context: In this approach a second input to the voxelized sliding cube is fed to the network, which is the global location of the cube with respect to the whole point cloud. This is represented by its x, y, z coordinates from a defined starting point (one of the point cloud's corners) and forms a vector of size 3.

Enclosed Spaces: As mentioned above, the input here is an enclosed space. One can segment the point cloud into such spaces with a variety of different approaches ([5], [23]). Once these spaces are identified, they are projected to a canonical reference coordinate system. In this reference system all spaces are systematically aligned with respect to their entrance location and subsequently normalized in a unit cube. Before they are fed to the network, they are voxelized with binary occupancy values. Due to memory restrictions, the resolution of the voxelization was selected as 0.2x0.2x0.2, thus forming an input of size 50x50x50 voxels.

3.2. Fully Convolutional 3D Neural Network

The input is then fed into a 3D Fully Convolutional Neural Network. The choice of not including any Fully Connected layer is based on the task at hand; since we are performing a voxel to voxel operation, retaining the spatial information throughout the network is considered essential. The network comprises the following repetitive unit: a 3D Convolutional Layer (3D Conv) followed by a Leaky Rectified Linear Unit (ReLU) [19] (apart from the last layer). The choice of Leaky ReLU over e.g. ReLU is
made to avoid saturated neurons. Mathematically, we have y = xi if xi ≥ 0, else y = xi/ai, where ai is a fixed parameter in the range (1, ∞). I followed the original paper's configuration and set ai to 100. For this experiment I used 6 layers⁴, the details of which are tabulated in Table 1.

⁴ The number of layers and filters in the network was a direct result of the memory limitations.

Table 1. Details of the 3D Fully Convolutional Neural Network

Approach                3D Sliding Cube        Enclosed Space
Input
  Size                  N x 10x10x10           N x 50x50x50
  Number of Channels    1                      1
  Stride                10x10x10               -
3D Conv1
  Number of Filters     32                     32
  Filter Size           5x5x5                  5x5x5
  Stride                1x1x1                  1x1x1
  Padding               2x2x2                  2x2x2
  Output Size           N x 32x10x10x10        N x 32x50x50x50
3D Conv2 and 3D Conv3
  Number of Filters     64                     64
  Filter Size           5x5x5                  5x5x5
  Stride                1x1x1                  1x1x1
  Padding               2x2x2                  2x2x2
  Output Size           N x 64x10x10x10        N x 64x50x50x50
3D Conv4 and 3D Conv5
  Number of Filters     128                    128
  Filter Size           5x5x5                  5x5x5
  Stride                1x1x1                  1x1x1
  Padding               2x2x2                  2x2x2
  Output Size           N x 128x10x10x10       N x 128x50x50x50
3D Conv6
  Number of Filters     5                      10
  Filter Size           5x5x5                  5x5x5
  Stride                1x1x1                  1x1x1
  Padding               2x2x2                  2x2x2
  Output Size           N x 5x10x10x10         N x 10x50x50x50
Output
  Size                  N x 5x10x10x10         N x 10x50x50x50

Adding Context: In this approach we use a double input to the network and follow a similar architecture to the one proposed in [6]. The voxelized cube passes first through 2 3D Convolutional layers to encode spatial information and then gets concatenated with the global position. The concatenated vector passes first through a Fully Connected layer and subsequently through layers of deconvolution to acquire again its spatial configuration. For more details see Figure 1.

3.3. Multi-class Voxel Classification

At the end of the final 3D Convolutional Layer the network performs multi-class Softmax classification and will predict the scores of each class label per voxel: fj(z) = e^(zj) / Σk e^(zk), where z is each voxel, j is the class evaluated and k represents all classes.
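As a concrete reference, the two element-wise operations used in Sections 3.2 and 3.3 can be written in NumPy as below. This is only a sketch; the actual network was built in Theano.

import numpy as np

def leaky_relu(x, a=100.0):
    # y = x if x >= 0, else y = x / a, with a = 100 as in the configuration above
    return np.where(x >= 0, x, x / a)

def voxel_softmax(scores):
    # scores: array of shape (num_classes, D, H, W) holding per-voxel class scores z
    # returns f_j(z) = exp(z_j) / sum_k exp(z_k) for every voxel
    shifted = scores - scores.max(axis=0, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

# predicted label per voxel: argmax over the class axis
# labels = voxel_softmax(scores).argmax(axis=0)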
4. Experiments

4.1. Dataset

For the evaluation I used the Stanford Large-Scale 3D Indoor Dataset [5], which comprises six large indoor parts in three buildings of mainly educational and office use (see Figure 2). The entire point clouds are automatically generated without any manual intervention as the output of the Matterport camera ([1]). Each area covers approximately 965, 1100, 450, 870, 1700 and 935 square meters (a total of 6020 square meters). Conference rooms, personal offices, auditoriums, restrooms, open spaces, lobbies, stairways and hallways are commonly found. The areas show diverse properties in architectural style and appearance. The dataset has been annotated for 12 semantic elements which pertain to the categories of structural building elements (ceiling, floor, wall, beam, column, window and door) and commonly found furniture (table, chair, sofa, bookcase and board). A clutter class exists as well for all other elements. The dataset was split into training, validation and testing as follows: 4 areas for training, one for validation and one for testing. In this way I ensure that the network sees areas from different buildings during training and testing. The same data split was used in all approaches.

Figure 2. Stanford Large-Scale 3D Indoor Dataset [5]: I split the dataset into training (4 areas), validation (1 area) and testing sets (1 area). The raw point clouds are shown in the first row and the voxelized ground truth ones in the second row.

4.1.1 Preprocessing

3D Sliding Cube: The raw colored point clouds were voxelized in a grid of 5x5x5cm (for more details on the choice of grid resolution see Section 3.1). After voxelization I assigned binary occupancy to all voxels: 0 if the voxel is empty, 1 if occupied. The ground truth labels were generated by finding the mode of all point labels per voxel. Due to memory restrictions, the number of classes was limited to 5: walls, floor, ceiling, other and empty space.

Enclosed Spaces: Similarly, the aligned point clouds that correspond to each enclosed space were normalized in a 0.2x0.2x0.2 grid. The generated voxels were then populated with binary occupancy values. The ground truth labels were generated as in the previous case. Due to memory restrictions, the number of classes was limited to 10: walls, floor, ceiling, door, beam, column, chair, table, other, and empty space.
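A minimal NumPy sketch of this preprocessing step (binary occupancy plus majority-vote ground truth per voxel) is given below. The names are illustrative, and the label id for empty space is assumed to be 0.

import numpy as np

def voxelize(points, labels, resolution=0.05, empty_label=0, n_classes=5):
    # points: (N, 3) xyz coordinates in meters; labels: (N,) integer class per point
    idx = np.floor((points - points.min(axis=0)) / resolution).astype(int)
    dims = idx.max(axis=0) + 1
    occupancy = np.zeros(dims, dtype=np.uint8)
    counts = np.zeros(tuple(dims) + (n_classes,), dtype=np.int32)  # per-voxel label histogram
    for (i, j, k), lab in zip(idx, labels):
        occupancy[i, j, k] = 1
        counts[i, j, k, lab] += 1
    gt = np.full(dims, empty_label, dtype=np.int64)
    occupied = occupancy.astype(bool)
    gt[occupied] = counts[occupied].argmax(axis=-1)   # mode of the point labels in each voxel
    return occupancy, gt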
4.2. Implementation

I performed three different experiments, one for each input approach. The implementation details are:

• I implemented the framework in Python3 using the deep learning Python library Theano [2].

• All data preprocessing steps were implemented in Python3 as well.

• 3D Sliding Cube: After generating the input, I noticed that the amount of sliding cubes that contained only empty voxels was greatly larger than the sliding cubes that contained at least one voxel of the rest of the classes. To counterpoise the skewness of the distribution I removed part of the empty sliding cubes.

• I shuffled the data before training since without it the learning process was getting compromised. The network was receiving sequential inputs of similar classes, in the first case due to the sliding nature of the input and the semantic consistency of the configuration of spaces, and in the second due to spaces with similar functions.

• I used the Adam [14] adaptive learning rate method, with parameters: 0.9, 0.99, and 1e-08.

• The size of the batch per iteration was limited to 500 sliding cubes for the first and second approaches (sliding cube) and 4 for the third (enclosed space) due to memory restrictions.

• I used as metric the mean accuracy of prediction per voxel (a minimal version of this metric is sketched after this list).
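The sketch below shows the evaluation metric, assuming pred and gt are integer arrays of per-voxel labels with the same shape; the per-class breakdown is an optional addition, not part of the reported numbers.

import numpy as np

def mean_voxel_accuracy(pred, gt):
    # fraction of voxels whose predicted class matches the ground truth
    return float(np.mean(pred == gt))

def per_class_accuracy(pred, gt, n_classes):
    # accuracy restricted to the voxels of each ground-truth class
    return [float(np.mean(pred[gt == c] == c)) if np.any(gt == c) else float('nan')
            for c in range(n_classes)]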
4.3. Results

3D Sliding Cube: The initial idea towards this problem was to use a 3D sliding window approach. The main motivation behind it was the fact that the input size to the network and the size of the voxelization grid could remain constant no matter the size of the point cloud (buildings have different sizes). Previous experience with this framework in traditional pipelines has been shown successful. However, although substantial effort was put to tune the hyperparameters, the network did not learn. I identify four main factors as the principal reasons: (a) memory limitations did not allow to explore a number of hyperparameters such as using all available classes in the dataset, or different sliding cube sizes; in both cases the resulting matrices were too large and the GPU would fail to handle them; (b) limiting the number of classes forced placing a great number of elements under the other class, which as a result created a class with low discriminative power due to the resulting amorphous shape and geometry, but highly represented in the dataset due to the number of voxels falling in this category; (c) using a generic constant size of the cube did not permit capturing the geometry of other elements; and (d) there was a lack of context regarding the content of the voxelized input with respect to the rest of the point cloud. An example of the training loss can be seen in Figure 3.

Figure 3. Training Loss of the 3D Sliding Cube Approach.

Adding Context: Following the previous failed attempt to learn space-semantics, my next step was to add global context as a second input to the network. Following the architecture described above, the network continued not to be able to learn. Once again memory limitations restricted the number of layers, number of filters, and other network parameters. The training loss of this network is marked with a green line in Figure 4. Although the results are not as expected (see Figure 5), it did perform better than the previous network, which demonstrates that the global information was helpful, however not powerful enough to solve the ill-posed problem the sliding window approach created. I also experimented with the same architecture as that proposed in [6], however the results did not differ.

Enclosed Spaces: For the enclosed spaces approach I experimented with 5 and 10 semantic classes. In this case the network was able to learn (see Figure 4, lines blue and red). Due to the smaller amount of training data than in the other experiments, I reduced the number of iterations to avoid over-fitting. In both cases the network performance shows similarities, although the behavior when using 10 classes appears slightly more stable (Figure 5). The mean accuracy of the networks is also presented in Table 2.

Figure 4. Training Loss.

Figure 5. Testing Accuracy.

Table 2. Mean testing accuracy of different approaches
Approach         Sliding Cube + Context    Enclosed Space
Mean Accuracy    0.05                      0.60

5. Conclusion

During this project I faced a lot of challenges related to the size of the data and the memory limitations. These factors hindered the process to a great degree. From the experiments it became obvious that information about the context enabled the network to learn, from adding the global location of the sliding cube to providing whole space semantics. The final results of the enclosed space approach did boost the accuracy, however there are aspects of it that require improvement. First of all, it assumes that one is able to identify rooms and consistently align them in a canonical reference coordinate system. Second, by normalizing spaces of different sizes into a unit cube, the objects in the scene undergo distortion. Although this step allows converting stuff into things, since now their dimensions are more consistent among different spaces, it compromises objects with consistent dimensions to arbitrary ones, which makes the learning process more difficult. Third, the grid resolution of the voxelization that was used was quite coarse, especially in the case of larger spaces.

A more sophisticated method needs to be used in order to achieve an end-to-end 3D Fully Convolutional approach for the problem of understanding space semantics, e.g. by combining segmentation, object proposal [25] and object detection approaches. This would allow addressing both amorphous and non-amorphous objects. Another interesting aspect would be to incorporate RGB information and observe how it affects the performance. However, this would essentially mean adding 3 more channels to each input and thus going back to the issue of memory limitations.

References
[1] Matterport 3D models of interior spaces. http://matterport.com/. Accessed: 2016-13-03.
[2] Theano 0.7 documentation. http://deeplearning.net/software/theano/. Accessed: 2016-13-03.
[3] U.S. Environmental Protection Agency and the United States Consumer Product Safety Commission.
[4] L. A. Alexandre. 3D object recognition using convolutional neural networks with transfer learning between input channels. In Intelligent Autonomous Systems 13, pages 889-898. Springer, 2016.
[5] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2016.
[6] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538-1546, 2015.
[7] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366-2374, 2014.
[8] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In Computer Vision - ECCV 2014, pages 345-360. Springer, 2014.
[9] A. Hermans, G. Floros, and B. Leibe. Dense 3D semantic mapping of indoor scenes from RGB-D images. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 2631-2638. IEEE, 2014.
[10] N. Höft, H. Schulz, and S. Behnke. Fast semantic segmentation of RGB-D scenes with GPU-accelerated deep neural networks. In KI 2014: Advances in Artificial Intelligence, pages 80-85. Springer, 2014.
[11] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):221-231, 2013.
[12] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725-1732, 2014.
[13] A. Kendall and R. Cipolla. Modelling uncertainty in deep learning for camera relocalization. In Proceedings of the International Conference on Robotics and Automation (ICRA), 2016.
[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[15] H. S. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic labeling of 3D point clouds for indoor scenes. In NIPS, pages 244-252, 2011.
[16] K. Lai, L. Bo, and D. Fox. Unsupervised feature learning for 3D scene labeling. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 3050-3057. IEEE, 2014.
[17] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. IJRR, 2015.
[18] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR, 2015.
[19] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 1, 2013.
[20] D. Maturana and S. Scherer. VoxNet: A 3D convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, September 2015.
[21] L. Nan, K. Xie, and A. Sharf. A search-classify approach for cluttered indoor scene understanding. ACM Transactions on Graphics (TOG), 31(6):137, 2012.
[22] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
[23] S. Ochmann, R. Vock, R. Wessel, M. Tamke, and R. Klein. Automatic generation of structural building descriptions from 3D point cloud scans. In GRAPP 2014 - International Conference on Computer Graphics Theory and Applications. SCITEPRESS, Jan. 2014.
[24] J. Papon, A. Abramov, M. Schoeler, and F. Worgotter. Voxel cloud connectivity segmentation - supervoxels for point clouds. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2027-2034. IEEE, 2013.
[25] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[26] X. Ren, L. Bo, and D. Fox. RGB-(D) scene labeling: Features and algorithms. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2759-2766. IEEE, 2012.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[28] T. Shao, W. Xu, K. Zhou, J. Wang, D. Li, and B. Guo. An interactive approach to semantic modeling of indoor scenes with an RGBD camera. ACM Transactions on Graphics (TOG), 31(6):136, 2012.
[29] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 601-608. IEEE, 2011.
[30] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng. Convolutional-recursive deep learning for 3D object classification. In Advances in Neural Information Processing Systems, pages 665-673, 2012.
[31] D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. In Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.
[32] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912-1920, 2015.
Adding Shot Chart Data to NBA Scenes

Neerav Dixit
Stanford University
450 Serra Mall, Stanford, CA 94305
ndixit@stanford.edu

Abstract

Shot charts have become a popular way to visualize the shooting efficiency of NBA players and teams from different parts of the court. Despite their utility, it is difficult to present shot charts as an effective visual aid in a TV broadcast. Here, a technique is demonstrated to superimpose shot charts onto NBA broadcast images, enhancing the TV viewing experience for NBA games.

Previous work has discussed the processing of basketball broadcast images to identify the court outline and perform camera calibration [1]. In this work, court identification and camera calibration methods are implemented and shown to be robust in the presence of most typical court occlusions. Following camera calibration, the desired shot chart coloring is projected onto the broadcast image of the playing court. The proposed method works well across a wide range of courts and performs accurate calibration and projection. The shot chart coloring is superimposed onto the court only, improving aesthetic presentation and visual depth perception by excluding players, referees, and fans blocking the court in the image.

1. Introduction

The tracking and analysis of data in the National Basketball Association (NBA) has seen a marked increase in recent years. Data and analytics have altered teams' understanding of the sport, resulting in changes to coaching and decision making in the NBA. As data changes the way the game is played, it becomes more important to be able to accurately and intuitively present informative data to fans.

One of the more commonly cited statistics in basketball is a player's or team's shooting percentage. This gives some idea of offensive efficiency and is often used to compare players. However, a much more insightful view can be gained when looking at shooting percentages from different parts of the court, rather than looking at a single number to represent the whole.

A shot chart (Figure 1) provides an effective way to do this visually. On the chart, the court is divided into 14 regions. Each of these regions is assigned a color – either red, yellow, or green – according to the shooting percentage compared to the NBA average in that region. This provides an effective visualization of specific locations on the court from which certain players excel or struggle shooting the ball. Shot charts have become widespread enough that they are made easily available to fans on NBA.com/stats.

Figure 1: A shot chart obtained from NBA.com/stats.

While shot charts are useful visual tools, it is difficult to use them to convey meaningful information during a TV broadcast of an NBA game. Viewers see the court from a broadcast camera angle, but a shot chart is only displayed as a separate image shown from an overhead view of the court, such as in Figure 1. Projecting a shot chart onto the court in a broadcast image could add the context needed to help a shot chart easily convey the desired information to the viewer. While adding color to the court may be exces-
sive during live gameplay, a projected shot chart could help commentators provide salient insight about a player's or team's shot selection quality during a replay.

This work presents an algorithm to project shot chart coloring onto a broadcast image. Given a broadcast image, the pixels corresponding to the different regions in the shot chart must be identified, and pixels corresponding to players, referees, or fans obstructing the court should be excluded from this classification. Then, the court can be colored like the appropriate region of the shot chart. The key (the colored rectangle near the hoop) is excluded from the shot chart projection in this work. A broadcast image and shot chart are assumed to be given; there is no attempt to identify which player is shooting the ball. The broadcast images used are typical views of half-court sets on the left or right side of the court.

2. Problem Statement

In order to project the desired shot chart onto the broadcast image, camera calibration must be performed to obtain a mapping between the image coordinates and the position on the court. As shown in Figure 2a, each pixel in the image is assigned a (u, v) coordinate pair. Each position on the court has an (x, y) coordinate pair, with the origin located according to Figure 2b for the left side of the court or Figure 2c for the right side of the court.

Figure 2: (a) Broadcast image of the left side of the court, with image coordinates (u, v); (b) left side court geometry and (c) right side court geometry, showing the court coordinates (x, y) used in each case.

As illustrated by Hu et al. [1], the camera matrix to map from homogeneous court coordinates p to the homogeneous image coordinates p' can be expressed as a 3x3 homography matrix H:

p' = (u, v, 1)^T,   p = (x, y, 1)^T   (1)

p' = Hp   (2)

Camera calibration using at least 4 known points on the court can be used to find the matrix H. Once H is known, the (x, y) coordinate on the court for each pixel in the court mask can be obtained using p = H^(-1) p'. Each (x, y) position maps to one of the regions in the shot chart, allowing the appropriate color to be assigned to each court mask pixel. Finally, an image of the colored court mask is combined with the original image to yield the image of the projected shot chart.
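The mapping in (1)-(2) and its inverse are straightforward to express in code. The NumPy sketch below (illustrative names, not the author's Matlab code) maps a court point into the image with H and a pixel back onto the court with H^(-1).

import numpy as np

def court_to_image(H, x, y):
    # p' = H p with p = (x, y, 1); divide by the third coordinate to get pixels
    u, v, w = H @ np.array([x, y, 1.0])
    return u / w, v / w

def image_to_court(H, u, v):
    # p = H^(-1) p' with p' = (u, v, 1)
    x, y, w = np.linalg.solve(H, np.array([u, v, 1.0]))
    return x / w, y / w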
3. Technical Content

The basic algorithm and corresponding sample images are shown in Figure 3. The steps shown in Figure 3 are detailed in the following sections.

Figure 3: Flow chart of the algorithm including images generated at each stage.

3.1. Court Mask Identification

Previous work has shown that detecting the dominant colors in an image of a basketball court can segment out the court from the rest of the image [1]. Since the entire court, excluding the key, is of a relatively homogeneous color, identifying which pixels contribute to the most common colors in the image identifies the pixels on the court fairly selectively. In this work, a Hough voting scheme was used to identify these pixels. The Hough space was 3-dimensional, correspond-
ing to the HSV colorspace of the pixels. Each dimension was divided into n bins (n = 3 or n = 4) for a total of n^3 bins in the Hough space. Each pixel in the bottom 75% of the image voted for the bin corresponding to its HSV values. The top 25% of the image was discarded because the court typically does not extend to that part of the image. Bins corresponding to low value pixels were discarded since the court is not dark and black is a common background color that can result in coherent votes for non-court colors. The highest remaining m bins (m = 1 or m = 2) were retained, and pixels contributing to these bins were identified as the court mask. The values of n and m were varied depending on which court was shown in the image. Because of some differences in the court designs of the 30 NBA teams, different values of n and m gave the best results for the different courts.

Using this Hough voting scheme, some isolated pixels would end up as false positives or false negatives. Since the court is mostly continuous in the image, a median filter was applied to the court mask output from the Hough voting scheme in order to eliminate isolated pixels. The size of the median filter was varied for images of different courts to give the best results. After thresholding the output of the median filter, the final court mask, shown in Figure 3b, was obtained. This court mask does not include players or referees, and it successfully separates the court from most of the occlusions on the court.
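One possible realization of this dominant-color voting is a 3D histogram over HSV bins followed by a median filter, as sketched below with OpenCV and SciPy. The parameter values shown (n = 4, m = 2, filter size 9) are illustrative placeholders rather than the exact values used in this work.

import cv2
import numpy as np
from scipy.ndimage import median_filter

def court_mask(bgr, n=4, m=2, filter_size=9):
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.int32)
    h = hsv.shape[0]
    widths = np.array([180 // n + 1, 256 // n, 256 // n])   # per-channel bin widths
    bins = hsv // widths
    flat = bins[..., 0] * n * n + bins[..., 1] * n + bins[..., 2]
    # votes come only from the bottom 75% of the image (the crowd occupies the top)
    votes = np.bincount(flat[int(0.25 * h):].ravel(), minlength=n ** 3)
    votes[np.arange(n ** 3) % n == 0] = 0                   # drop the lowest-value (darkest) bins
    top_bins = np.argsort(votes)[-m:]                        # keep the m most-voted bins
    mask = np.isin(flat, top_bins)
    mask[:int(0.25 * h)] = False
    return median_filter(mask.astype(np.uint8), size=filter_size) > 0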
3.2. Key Point Identification

Following identification of the court mask, this mask was used to identify key points on the court to be used for camera calibration. Without prior knowl-
edge of the scene, it cannot be known whether any particular point on the court is occluded. To make the identification of key points for camera calibration more robust to variations between broadcast images, key points are selected at the intersection of lines on the court, as shown in Figure 4. To find these points, the lines are identified in the image, and the point where they intersect is taken as the key point. As long as enough of the relevant court lines are unobstructed in the image, key points can be accurately located. The 3x3 camera calibration matrix H has 8 independent unknowns, so at least 4 points are required for camera calibration. In this work the 5 points shown in Figure 4 were used for camera calibration.

Figure 4: Key points used for camera calibration for images of the left (red) and right (blue) sides of the court.

The court lines that intersect at the key point locations in Figure 4 are all along the perimeter of the court mask shown in Figure 3b. Therefore, potential points that lie on these lines can be identified by finding the edges of the court mask. First, the side of the court shown in the image is determined by counting whether there are more court mask pixels near the left or right edge of the image. Then, the extent of the court in the image is determined by finding the minimum and maximum rows and columns in the image that contained a significant number of court mask pixels. Using this information, the appropriate places in the image to search for the 5 lines of interest are estimated.

Figure 3c shows the points identified as possibly belonging to the lines of interest on the court. Some of these points clearly do not lie on the desired court lines because the edge of the court mask is misshapen due to the presence of players on the court. However, the proper lines on the court can still be determined by fitting the selected points using the Hough transform, so long as enough of the points fall along the desired line of interest. These detected lines are superimposed on the image in Figure 3d. Once the court lines in the image have been found, their intersection can be used to locate the key points in the image. These key points are shown in green in Figure 3d.
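For reference, once two court lines have been detected in (rho, theta) form (the representation returned by standard Hough line detectors such as cv2.HoughLines), their intersection, and hence a candidate key point, can be computed by solving a 2x2 linear system. This is a generic sketch, not the author's implementation.

import numpy as np

def intersect_hough_lines(line1, line2):
    # each line is (rho, theta) with x*cos(theta) + y*sin(theta) = rho
    (r1, t1), (r2, t2) = line1, line2
    A = np.array([[np.cos(t1), np.sin(t1)],
                  [np.cos(t2), np.sin(t2)]])
    b = np.array([r1, r2])
    if abs(np.linalg.det(A)) < 1e-8:      # (near-)parallel lines have no stable intersection
        return None
    u, v = np.linalg.solve(A, b)
    return u, v                           # image coordinates of the candidate key point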
3.3. Camera Calibration

Once the key points have been identified, the 3x3 camera matrix H can be identified using camera calibration. If we express H in terms of its rows as

H = [ h1 ; h2 ; h3 ]   (3)

then the image coordinates (ui, vi) for key point i can be expressed in terms of the homogeneous court coordinates pi defined in (1) as follows:

ui = (h1 · pi) / (h3 · pi),   vi = (h2 · pi) / (h3 · pi)   (4)

From (4), the following steps can be performed to obtain a homogeneous system to be solved for the camera matrix H using the 5 correspondence points:

h1 · pi - ui (h3 · pi) = 0,   h2 · pi - vi (h3 · pi) = 0   (5)

h := [ h1^T ; h2^T ; h3^T ]   (6)

P := [ p1^T   0^T   -u1 p1^T ;
       0^T   p1^T   -v1 p1^T ;
       ...
       p5^T   0^T   -u5 p5^T ;
       0^T   p5^T   -v5 p5^T ]   (7)

P h = 0   (8)

The matrix P defined in (7) is known from the court coordinates pi of the key points in Figure 4, as well as the corresponding image coordinates (ui, vi) found using the key point identification algorithm. The set of court coordinates pi are referenced using the coordinate system in Figure 2b or 2c for images of the left or right side of the court, respectively. These coordinate systems allow the different shot chart regions to be treated identically for images of the left and right of the court following camera calibration. The vector h defined in (6) is unknown and can be solved for from the homogeneous system in (8) using SVD. Once h is known, it can be rearranged into the camera matrix H.
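The construction of P in (7) and the SVD solution of (8) can be written compactly. The sketch below assumes court_pts and image_pts hold the correspondences as (x, y) and (u, v) pairs; it is a generic direct-linear-transform solver, not the Matlab code used in this work.

import numpy as np

def calibrate_homography(court_pts, image_pts):
    # solve P h = 0 for the 3x3 homography H from >= 4 point correspondences
    rows = []
    for (x, y), (u, v) in zip(court_pts, image_pts):
        p = np.array([x, y, 1.0])
        rows.append(np.hstack([p, np.zeros(3), -u * p]))   # h1.p - u (h3.p) = 0
        rows.append(np.hstack([np.zeros(3), p, -v * p]))   # h2.p - v (h3.p) = 0
    P = np.array(rows)
    _, _, Vt = np.linalg.svd(P)
    h = Vt[-1]                  # right singular vector for the smallest singular value
    return h.reshape(3, 3)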
3.4. Shot Chart Projection

Following camera calibration, the court coordinates (x, y) of each pixel in the court mask can be found by first solving for the homogeneous court coordinates p and then extracting the x and y coordinates:

p = H^(-1) (u, v, 1)^T   (9)

x = p1 / p3,   y = p2 / p3   (10)

Using the (x, y) coordinates of each court mask pixel, the corresponding shot chart region is identified by converting the (x, y) court coordinates to polar coordinates centered under the hoop. Figure 3e shows the coloring of court mask pixels according to the shot chart in Figure 1.

The final image combination is performed by combining the pixels that are not in the court mask with a weighted average of the original image pixels in the court mask and the colored court mask. One of these final combined images is shown in Figure 3f. The shot chart is projected onto the court (excluding the key), while players, referees, fans, and other areas that are not part of the court maintain their original coloring.
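A sketch of this projection step is shown below: each court-mask pixel is mapped back to court coordinates with H^(-1) and then assigned a shot-chart color via polar coordinates about the hoop. The region_color function and hoop position are placeholders, since the exact region boundaries are not given in the text, and the per-pixel loop is kept for clarity rather than speed.

import numpy as np

def project_shot_chart(image, court_mask, H, region_color, hoop_xy=(0.0, 0.0), alpha=0.6):
    out = image.copy()
    Hinv = np.linalg.inv(H)
    vs, us = np.nonzero(court_mask)                     # pixel coordinates inside the court mask
    pts = Hinv @ np.stack([us, vs, np.ones_like(us)])   # homogeneous court coordinates
    xs, ys = pts[0] / pts[2], pts[1] / pts[2]
    r = np.hypot(xs - hoop_xy[0], ys - hoop_xy[1])      # polar coordinates centered under the hoop
    phi = np.arctan2(ys - hoop_xy[1], xs - hoop_xy[0])
    for u, v, radius, angle in zip(us, vs, r, phi):
        color = region_color(radius, angle)             # e.g. red / yellow / green per region
        if color is not None:                           # None -> leave the pixel (e.g. the key) untouched
            out[v, u] = (1 - alpha) * image[v, u] + alpha * np.array(color)
    return out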
4. Experimental Setup and Results

The entire shot chart projection algorithm was tested using several combinations of NBA broadcast images and shot charts. Some of the resulting shot chart projections are shown in Figure 6. The algorithm appears to perform fairly accurately, although slight differences between the desired and projected shot chart boundaries can be observed in some instances. In each case, the court is successfully identified and largely separated from occlusions. However, in some images there are variations in the lighting on the court that result in small patches of the court not being recognized by the dominant color detection and therefore being excluded from the court mask.

Figure 6: Four results of the algorithm projecting shot charts onto broadcast NBA images. Image (a) corresponds to the blue X in Figure 5; (b) to the green X; (c) to the red X; and (d) to the blue O.

The error in key point location in the image was used as a quantitative metric to evaluate the algorithm. Inaccuracies in the inferred key point location in the image result in inaccuracies in the camera matrix H obtained from the camera calibration. This leads to error in the shot chart regions on the court when the shot chart is projected onto the image. Accurate pixel coordinates for each key point were selected manually and tested to ensure that they gave good shot chart projections when used for camera calibration. The key point image coordinates obtained using the algorithm are compared to the manually selected key point coordinates, and the distance in pixels between the accurate and inferred key points is normalized to a 640x360 image size and used as an error metric. Nine separate broadcast images of either the left or right side of the court were tested, and the error values are shown in Figure 5.

Figure 5: Normalized distance between inferred and actual key point locations for 9 images, with each image represented by a different color and shape. The key points are numbered according to the labels in Figure 4. Key point 5 was not visible in all images.

The greatest inaccuracies appear to occur in cases in which there are significant player occlusions along the court lines whose intersections are used to locate the key points for camera calibration. In these cases, many of the points used for Hough line detection are located along the player outlines rather than the court lines, resulting in some error in the court line returned by the Hough transform method. If the returned line is inaccurate, the key points taken at the intersection of the line with other court lines will also have errors. The results shown in Figure 6 demonstrate that the method is mostly robust to player occlusions of court lines, but the algorithm does not perform well in extreme cases when one of the court lines used in the calibration is almost completely blocked.

5. Conclusions

A method of projecting shot charts onto NBA broadcast images has been demonstrated. The major steps of the algorithm – illustrated in Figure 3 – include dominant color detection to identify which pixels belong to the court, identification of lines on the
court to find key points in the image, camera calibration and assignment of pixels on the court to shot chart regions, and combination of the original image with the image of the projected shot chart. The method shows reasonable accuracy in separating the court from other features in the image as well as having boundaries between shot chart regions at the appropriate locations in the image.

The method was applied to a variety of different NBA courts, and the dominant color detection method that was implemented through a Hough transform appeared to work very well despite some differing color schemes for these different courts. Some difficulties were encountered in images of certain playoff games, in which fans were provided with shirts of the same color. Rather than the region of fans near the top and bottom of the image having an arbitrary assortment of colors, in these cases there were many pixels in the image that would vote for the same incorrect bin in the HSV Hough space that was used. This issue was overcome by ignoring the top 25% of the image, which is nearly always occupied by the crowd, and by enforcing that the Hough bins corresponding to dark pixels with low values were ignored when identifying the dominant colors.

Only images using a broadcast angle and focusing
primarily on the left or right side of the court were
considered. However, the camera angle in the broad-
cast is not stable in these situations, so there was some
variability in the images used to test the algorithm.
This variability resulted in some difficulty in the step
for detection of potential points on the court lines of
interest, shown as the transition from Figure 3b to 3c.
Knowledge of the court based on the shape and ex-
tent of the court mask had to be incorporated in order
to accurately identify these candidate points across a
range of camera angles.
Considering images of both the left and right sides
of the court also resulted in a need to differentiate be-
tween these images when identifying points along the
lines of interest. Using different sets of court coordi-
nates, shown in Figures 2b and 2c, for camera calibra-
tion of images showing the left and right sides of the
court allowed these different cases to be treated iden-
tically for the later steps in the algorithm following
camera calibration.
This method excludes the key, the colored rectangle
near the hoop, when projecting the shot chart onto the
court. In order for a good projection to be made onto
the key, its color would have to be changed so that
the projected shot chart would be visible in the final
image. This was not done for this work due to dif-
ficulties in changing the color of the entire key while
ignoring occluding players and maintaining the court
markings that exist in the key. However, implement-
ing this into the algorithm would be a good next step
to extend this work further.
This algorithm could definitely improve the viewer
experience of an NBA broadcast if implemented ap-
propriately. The addition of shot chart data in proper
context to the broadcast would help commentators
make salient points with the help of a great visual
aid. Future work to further increase the accuracy of
the projection and identification of court pixels, along
with the inclusion of the key in the shot chart pro-
jection, would result in a useful tool to enhance NBA
broadcasts.
The Matlab code and images used for this work can be accessed at the following link: https://stanford.box.com/s/akh4sdyput0ez37yy0he3la43y2ono3s
The shot chart projection is done by running pro-
gram main.m.

6. References
[1] M.-C. Hu, M.-H. Chang, J.-L. Wu, and L. Chi. Robust camera calibration and player tracking in broadcast basketball video. IEEE Transactions on Multimedia, 13(2):266-279, 2011.

Optical Recognition of Hand-Drawn Chemical Structures
Bradley Emi
Stanford University
Dept. of Computer Science

Abstract

Optical chemical structure recognition is the task of converting a graphical image of a chemical molecule into its standard structural representation. Specifically, the chemical structure recognition algorithm should correctly identify the graph structure with correct atomic/group labels for each node, and the correct type of bond label for each vertex. We introduce a novel method to improve upon state-of-the-art methods with an eye towards solving the problem in the face of the additional difficulties when molecules are hand-drawn. We employ basic text recognition and corner detection methods to first label the atoms and groups that form the nodes of the chemical structure graph, and conclude that our approach to corner detection outperforms the line vectorization algorithms typically used in other systems. A Hough transform is used to recognize the presence of bonds between the nodes. The major difference in our approach is to use a new technique that classifies bonds according to various feature descriptors of sliding-window cross-sections of bonds using a supervised machine learning approach. In addition to the baseline method of using the Hough transform to also classify bonds, we use local maxima detectors on single-pixel slices of bond cross-sections and histogram of oriented gradients (HOG) features of wider bond cross-sections coupled with support vector machine (SVM), logistic regression, decision tree, and neural network classifiers. We compare the results of these feature descriptors, analyzing our pipeline on a hand-drawn dataset of 360 simple molecules, and conclude that this new bond recognition technique leads to major improvements in recognition performance over the baseline.

1 Introduction

The standard presentation of organic chemical data in a


wide variety of fields, such as biology, chemistry, and
medicine, remains the structural diagram, which contains
all the chemical information of a given molecule, but is
unsuitable for computational analysis. The problem of Fig. 2: A molecule from our dataset. Its overall structure is
optical structure recognition, the conversion of these much simpler, but its bonds and labels require different
images of structures into the usable, machine-readable treatment.
labeled graph data formats, remains highly inconvenient

There are several additional advantages to being able to recognize handwritten molecules in addition to computer-generated molecules. For example, the computer generation of molecular structures is currently quite tedious, and an application to perform real-time recognition of small components of hand-drawn structures does not yet exist.

2 Review of Previous Work

2.1 Summary of Previous Work

Previous work on the optical structure recognition problem has to date focused exclusively on computer-generated structures. Early research began in the 1990s, with IBM receiving a patent for recognition of chemical graphics among other printed material on a page, using basic line tracing techniques (Fig. 3) to recognize structures [4]. A similar approach was developed by University of Leeds researchers and called CLiDE in the same year [5].

Fig. 3: IBM's line-tracing algorithm.

More modern approaches, such as ChemReader from the University of Michigan [6] and the National Cancer Institute's open-source OSRA [7], have employed more sophisticated text recognition (OCR) and line detection algorithms; ChemReader uses a generalized Hough transform and OSRA uses the Potrace library.

State-of-the-art approaches, such as MLOCSR, developed by Italian researchers Frasconi, Gabbrielli, Lippi, and Marinai, generally recognize molecules in two stages: a low-level processing module, which detects edges, corners, and text; and a high-level reasoning engine, which uses Markov logic networks incorporating prior chemical and graphical knowledge to correct errors in the low-level module [8]. Modern approaches, such as a more recent iteration of CLiDE, also use a specialized artificial neural network to classify text labels.

2.2 Improvements to Existing Approaches

The focus of this paper is a novel approach to the correct identification of hand-drawn bonds in the low-level module: the correct identification of atoms and edges without the use of high-level correction based on chemical and graphical knowledge. Previous attempts at optical structure recognition, even state-of-the-art approaches, are heavily dependent on the correct identification of fine lines (the individual thin lines constituting double, triple, and dashed bonds), which fails in the case of imperfect hand-drawn bonds. Frasconi et al.'s algorithm, MLOCSR, uses the Douglas-Peucker algorithm [9] to approximate the contour of the molecule with a polygon, fitting the least-vertex polygon to the contour within a certain precision. We hypothesize that line-detection-only vectorization algorithms such as the Douglas-Peucker algorithm may fail in cases where bonds are not straight (Fig. 4), assigning too many vertices to the molecule. Furthermore, classification algorithms can fail when dashes follow an irregular pattern and/or touch (Fig. 5).

Fig. 4: Bonds are not straight, a difficulty for line-only vectorization. Fig. 5: Dashes touch and are irregular.

Several algorithms use the Hough transform to detect lines and line segments. However, Fig. 6 displays the difficulty of using the Hough transform on a hand-drawn image. Even when the threshold on the number of votes required for the Hough transform to detect a line is optimized, both false positives and false negatives still occur for reasons specific to hand-drawn images.

Our solution to these difficulties is to use the Hough transform only to detect bonds. Bonds are then classified according to the features of their cross-sections. In this paper, we experiment with a number of features and classifiers to optimize the accuracy of bond type recognition. These experiments constitute the majority of the current paper, with enhancements to text recognition and further experimentation on the higher-level module incorporating chemical knowledge representations forthcoming in future work.

Fig. 6: Difficulty of using the Hough transform on a hand-drawn molecule. A false positive is detected on the bottom left because of the bend in the single bond. The double bond on the right, meanwhile, is undetected, a false negative.

3 Methods

3.1 Summary

The general structure of the pipeline we use to recognize the chemical structure of our molecules is as follows:

• Recognize text labels using scale-invariant template matching
• Removal of text from the image
• Bond and corner detection
• Bond detection
• Bond classification
• Association of corners to atoms and groups

Due to the lack of a large amount of hand-drawn chemical data and the small number of labels, a simpler scale-invariant template-matching approach was used to detect text in images with reasonable accuracy. Other approaches, such as Google Tesseract [9] and supervised learning classifiers with histogram of oriented gradients (HOG) [10] features, were attempted. Tesseract was exceedingly difficult to configure due to its general-use parameters, which rely on a number of language models that make many assumptions in order to more accurately recognize more structured sources of text. These assumptions, which do not apply to our purposes, were difficult to remove programmatically and debug. Future work may incorporate a state-of-the-art OCR engine; however, this approach was ultimately unsuccessful. HOG classifiers were found to suffer from a large number of false positives due to a lack of negative training examples. With more data, this approach could also prove to be more successful, but it was inapplicable with our limited amount of training data.

Bond and corner detection was used to identify various points of interest of the coarse line features: intersections of lines that represent a carbon, or lines which end in a connection to a text box, representing a different atom or group. A coarse Gaussian filter was applied to the image before applying a Hough transform to find the lines and points of interest.

Finally, these points of interest were assembled to locate the atoms and groups of the molecule as well as the bonds, leaving bond classification as the final task. Cross-sections of the bonds were analyzed using various feature descriptors and classified with several machine-learning classifiers trained on 45 hand-drawn molecules. The success of our approach with a small training set size is again reason to believe this approach will become more accurate as more data is acquired. More details on the bond classification algorithm are described later in the methods section.

3.2 Technical Solution

3.2.1 Data

Our data comprises 360 images of 9 different simple hand-drawn molecules (40 images of each molecule) drawn on standard white printer paper in fine-point black Sharpie marker. Each image was taken with an iPhone 6 camera at 3264 x 2448 resolution with three color channels (no alpha channel) and downsampled to 400 x 300 grayscale using bilinear interpolation. All the images were taken in identical lighting and were drawn by three different people, so our model would not overfit to a particular person's drawing style. Each image was preprocessed with binarization using a 40% threshold. No other preprocessing stages were applied.

45 of the images (5 per molecule) were set aside as a training set.

3.2.2 Text Recognition

For text recognition, we tried a number of approaches, including scale-invariant template matching using 5 images of each of 6 templates ("O", "H", "OR", "RO", "N", and "OH"), and a number of supervised learning classifiers using 4 templates ("O", "H", "R", and "N"). By visual inspection of the dataset, we estimated a minimum and

maximum scale for the images at 20x20 pixels and 60x60 pixels respectively, and implemented a spatial-pyramid sliding window with the side length of each square window increasing from 20 to 60 pixels in steps of 5 pixels. This was a very conservative estimate; for reimplementation on different image sizes, we recommend scaling from 0.3% of the total area of the image to 3%.

For the supervised learning classifiers, to collect negative training examples, we randomly selected 1200 of these windows from the training set that were verified by hand to contain no text. We then collected 5 of each of the 4 templates from the training set. To augment the number of positive examples, we additionally used 55 images for each template from the open-source Chars74K handwritten dataset [11]. We cropped each image to eliminate whitespace and extracted histogram of oriented gradients features from each using 64 bins. We then compared the performance of a logistic regression classifier, a linear SVM classifier, and a neural network with one hidden layer of 30 nodes. Results are presented in section 4.

For the scale-invariant template matching, we applied a Gaussian filter with size equal to half the width of the measured strokes to all training templates and to the image being matched, and then used the spatial-pyramid sliding window described above to match the images. We then chose the tolerance level, 0.77, for which the F1 score was maximized. Non-maximal suppression is used to remove overlapping bounding boxes. A sample of the output of this stage is presented in Fig. 7. We used the results of this algorithm for the next stages of the pipeline. More details are presented in section 4.

Fig. 7: Sample results of template matching OCR.

3.2.3 Corner Detection Overview

We reimplement the Douglas-Peucker algorithm on our dataset for comparison with MLOCSR, and we also implement a corner detection algorithm based on a broad Harris corner detector.

For clarity, we use the terminology of MLOCSR, defining a C-point to be a corner corresponding to the intersection of the main bonds of a carbon, a D-point to be the endpoint of a line segment not connected to the main bond that represents a double or triple bond, and a T-point to be the end of a line segment drawn to a text box to indicate a bond to a non-carbon.

3.2.4 Best-Fit Polygon Reimplementation

As in MLOCSR, we use the Douglas-Peucker algorithm to detect the vicinity of C-points and T-points, and look for D-points later once the main corners are located. For each contour, this algorithm iteratively tries to fit n-vertex polygons to the contour, increasing n until no point on the contour is further than a threshold distance away from the polygon. The algorithm then returns the vertices of the polygon.

To accomplish this, we apply a Canny edge detector and search for clusters over all vertices of the polygons fitted to the opposite contours of the image. We use a threshold of √2 times the edge length, as prescribed in MLOCSR. Fig. 8 shows the results of this stage.

Fig. 8: Result of the Douglas-Peucker algorithm fitting a polygon to the contours of the Canny edges of the molecule image.

We then use a basic agglomerative clustering algorithm, setting the maximum distance between clusters to 50 pixels. If a polygon vertex is less than 50 pixels away from an existing cluster center, we assign it to that cluster and update the center point of that cluster. Otherwise, we initialize a new cluster. The results of this stage applied correctly to a molecule image are shown in Fig. 9, with the blue points representing final cluster centers. Testing results are shown in section 4.
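The polygon-fitting and vertex-clustering stages just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for this work: the OpenCV calls (cv2.Canny, cv2.findContours, cv2.approxPolyDP) and the 50-pixel merge radius follow the description above, while the Canny thresholds and the helper name cluster_vertices are our own.

```python
import cv2
import numpy as np

def polygon_vertices(binary_img, epsilon):
    """Fit least-vertex polygons (Douglas-Peucker) to the Canny contours."""
    edges = cv2.Canny(binary_img, 50, 150)
    # OpenCV 4 return signature: (contours, hierarchy)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    verts = []
    for c in contours:
        poly = cv2.approxPolyDP(c, epsilon, True)  # epsilon ~ sqrt(2) * stroke edge length
        verts.extend(p[0] for p in poly)
    return np.array(verts, dtype=float)

def cluster_vertices(verts, max_dist=50.0):
    """Greedy agglomerative clustering: merge a vertex into the nearest
    existing cluster if it is within max_dist pixels, else start a new cluster."""
    centers, counts = [], []
    for v in verts:
        if centers:
            d = np.linalg.norm(np.array(centers) - v, axis=1)
            i = int(np.argmin(d))
            if d[i] < max_dist:
                counts[i] += 1
                centers[i] = centers[i] + (v - centers[i]) / counts[i]  # running mean
                continue
        centers.append(v.copy())
        counts.append(1)
    return np.array(centers)
```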

Fig. 9: Results of the agglomerative clustering stage.

3.2.5 Harris Corner Detector

The goal of the Harris detector [12] in this context is the same: to identify the C- and T-points, but not necessarily the finer D-points that distinguish double and triple bonds. The Harris corner detector looks for high variation in the gradient of an image in two directions.

We first apply a coarse Gaussian filter to the image with size equal to the estimated stroke width. We then run the Harris corner detector, once again requiring corners to be a threshold distance apart.

A sample result on the same molecule after filtering is presented in Fig. 10.

Fig. 10: Results of the Harris corner detector.

A comparison of the results is presented in detail in Sec. 4, but we find that the Harris corner detector (89% accuracy) outperforms the polygon reconstruction method (75% accuracy).

3.2.6 Bond Detection

In contrast to methods based on line vectorization, which include not only MLOCSR but also a majority of the existing methods in the literature, we use the Hough transform only to detect bonds, rather than to classify them. Line vectorization methods make several errors and, in the case of hand-drawn molecules, are not precise enough to detect the D-points that the polygon reconstruction method can detect when molecules are drawn perfectly straight. As the state-of-the-art method in MLOCSR only recognizes under 80% of the C- and T-points, there is very little hope for such an algorithm to detect the finer D-points given the large amount of variability in hand-drawn bonds.

Since a carbon can only have four bonds, for each of the nodes detected in the previous stage, we look at the four closest nodes to see if there is a bond between them. While more distant nodes are not strictly forbidden from being connected to a carbon, it is extremely uncommon, and this case does not occur in any of the molecules in our dataset. For more general molecules, more nodes can be examined and spurious matches can be removed using a Markov logic network similar to what is implemented in MLOCSR, but we do not implement that here for simplicity.

The other heuristic we use is that if three nodes are collinear, there is not a bond between the two outer nodes. This situation only occurs when there are two bonds at a 180-degree bond angle, so the outer nodes cannot have a bond between them.

These heuristics leave us with a number of candidate bonds: a list of possible node-node pairs that could contain a bond between them. To refine this list, we create a bounding box of the edge between the two nodes at a fixed width (40 pixels) and split the bounding box into windows of fixed size. On each window, we then apply the Hough transform with a very low threshold to look for lines in the window. We only accept lines that are within 1 degree of the expected direction of the bond. We then require that all of the windows in the bounding box contain a line detected by the Hough transform. We assume that if a node-node pair does not contain a bond, at least one of its windows will not have a matching line in the orientation of the node-node pair. This approach leads to several false negatives, but these can be rectified by a Markov logic network in later steps, because most of the false negatives are simply a missing bond in a ring or another predictable structure. This approach does lead to a relatively low false positive rate, and the false positives are generally spurious triangular closings, which are very rare in organic

molecules and would also be removed by a Markov logic network. Results are presented in Sec. 4. The process is visualized in Fig. 11.

Fig. 11: Top left, top right: Hough line detections (blue) for node-node pairs with a bond, for a given window (red) in the bounding box. Bottom: a window between two opposite nodes that will not have a Hough line detection, so the algorithm will not assign it a bond, even though there is contamination elsewhere in the bounding box.

3.2.7 Bond Classification

The final part of the pipeline is bond classification, before higher-level modules use chemical knowledge to correct errors (future work).

We use a sliding window moving along the cross-sections of bonds extracted from our training set via screenshots, using a window size of 10 pixels down the length of the bond and 40 pixels across. Typically this results in around 3 to 10 windows per bond. Our training set contains 62 single bonds, 33 double bonds, 10 wedge bonds, 10 dashed bonds, and 5 triple bonds.

We then compute HOG features on each of the sliding windows and use these features to train a supervised learning classifier. We experiment with a multiclass logistic regression classifier, a linear support vector machine, and a decision tree. (We avoided using a neural network since it is prone to over-fitting in the case of a small training set.)

Then, for each bond in the test set, we extract the same sliding-window HOG features and use the classifier to predict the type of bond that appears in each of the windows. We then employ a voting system where each window gets a "vote" for the overall type of bond. A sample result is shown in Fig. 12, and overall results are presented in Sec. 4.

Fig. 12: Final result of bond classification for a benzene ring.

4 Results

4.1 Text Recognition Results

4.1.1 Scale-Invariant Template Matching

The results of text recognition are presented here. In order to optimize the tolerance of the scale-invariant template matching, we measured the precision and recall on the test set. The results are presented in Fig. 13. We chose the tolerance that maximizes the F1 score, 0.77.

Fig. 13: OCR precision and recall on the test set. The optimal tolerance value was found to be 0.77.

We then apply the template matching to the test set. The results, by molecule and in total, are presented in Table 1. Accuracy is by molecule image, so a molecule has to have all of its text completely recognized with no false positives for it to count positively towards the accuracy metric. Diagrams of the molecules associated with each molecule ID can be found in the Appendix.
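The sliding-window voting scheme of Sec. 3.2.7 can be sketched roughly as below. This is an illustrative sketch only, assuming scikit-image's hog and a scikit-learn classifier; the HOG parameters shown are placeholders rather than the tuned values, and the windows passed in are assumed to be the 40x10-pixel cross-section crops described above.

```python
from collections import Counter

from skimage.feature import hog
from sklearn.svm import LinearSVC

def window_features(windows):
    """HOG descriptor for each 40x10 cross-section window of a bond (parameters illustrative)."""
    return [hog(w, orientations=8, pixels_per_cell=(5, 5), cells_per_block=(2, 2))
            for w in windows]

def train_bond_classifier(train_windows, train_labels):
    # train_windows: list of cross-section windows; train_labels: bond type per window
    clf = LinearSVC()
    clf.fit(window_features(train_windows), train_labels)
    return clf

def classify_bond(clf, windows):
    """Each window votes for a bond type; the majority vote wins."""
    votes = clf.predict(window_features(windows))
    return Counter(votes).most_common(1)[0][0]
```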

Molecule ID   Precision   Recall   Accuracy
1             1.0         1.0      1.0
2             1.0         1.0      1.0
3             0.54        1.0      0.50
4             1.0         1.0      1.0
5             0.95        0.95     0.90
6             1.0         1.0      1.0
7             0.96        0.79     0.40
8             0.79        0.65     0.53
9             0.98        0.90     0.58
Total         0.91        0.92     0.77

Table 1: Results of scale-invariant template matching on the test set.

While an accuracy of 77% is far from ideal, it is surprisingly effective considering we only used 5 training images to build the templates. With more examples, this method could perform even better in future work. We use the images where text was accurately identified from this stage in the further stages of the pipeline.

4.1.2 Supervised Classifiers

Classifier (all with HOG features)   Parameters                              Train Set       Cross-Validation   #Iterations   Avg. Acc.
Logistic Regression                  Regularization coeff. = 1.0, L2 norm    1330 examples   10-fold            100           0.97
Linear Support Vector Machine        Regularization coeff. = 1.0, L2 norm    1330 examples   10-fold            100           0.96
One-Layer Neural Network             One hidden layer with 30 nodes          1330 examples   10-fold            100           0.99

Table 2: Results of supervised classifiers on the OCR training set.

The performance of the supervised learning classifiers using HOG features with 64 bins was evaluated using cross-validation, training a classifier with 90% of the 1100 negative test images and 60 positive test images per character, and testing on the remaining 10%. We conclude our training set is not large enough to provide accurate detections on the test set, recording less than 1% overall accuracy. This is because, although the performance of the supervised learning classifiers on the cross-validation set is relatively good, perfect matching on the test set requires a correct match on each of the 1000+ sliding windows used for detection, so even the neural network, with 99% accuracy on the cross-validation set, is unable to perform well in the natural setting, with nearly 0% accuracy and several false positives.

For future work we would like to expand the size of the training set to improve the accuracy; for now we use template matching.

4.2 Corner Detection Results

As shown in Table 3, we conclude that the Harris corner detector outperforms the baseline MLOCSR polygon reconstruction method quite significantly, by approximately 15% at the molecule level (node level refers to the number of correctly detected nodes over the total number of nodes; molecule level refers to the number of correctly detected molecules with no false positives divided by the total number of molecules). There are several reasons for this result. First, the polygon reconstruction method performs very poorly in the case of dashed bonds, whereas the Gaussian smoothing applied to the image before Harris corner detection "blends" dashed bonds into an edge before finding the corners. This kind of preprocessing is not feasible for the polygon reconstruction method since it relies on the narrow opposite contours that form the edges of the thick lines. The performance on a dashed molecule is demonstrated in Fig. []. Neither method performs particularly well on dashed bonds (the polygon method performs at 15% on dashed bonds, while the Harris method performs at 45%), which when blended are very wide, making corner detection difficult, especially when a dashed bond is near other corners.

                  Molecule accuracy   Overall Precision   Overall Recall
Polygon Method    0.748               0.960               0.970
Harris Method     0.896               0.987               0.989

Table 3: Comparison of corner detection methods.
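The Harris-based corner stage compared in Table 3 (a coarse Gaussian blur followed by Harris corners kept a minimum distance apart, Sec. 3.2.5) can be approximated with OpenCV as in the sketch below. The blur width, quality threshold, and corner cap are illustrative placeholders, not the values tuned for this work.

```python
import cv2
import numpy as np

def harris_corners(gray, stroke_width=9, min_dist=50, max_corners=50):
    """Coarse Gaussian blur ~ stroke width, then Harris corners spaced >= min_dist apart."""
    k = stroke_width | 1                      # Gaussian kernel size must be odd
    blurred = cv2.GaussianBlur(gray, (k, k), 0)
    corners = cv2.goodFeaturesToTrack(
        blurred,
        maxCorners=max_corners,
        qualityLevel=0.05,                    # illustrative response threshold
        minDistance=min_dist,                 # enforce corners a threshold distance apart
        useHarrisDetector=True,
        k=0.04,
    )
    return corners.reshape(-1, 2) if corners is not None else np.empty((0, 2))
```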

Fig. 14: Harris corner detection on a molecule with dashed bonds (left) and the polygon reconstruction method with initial corners in blue and clustered corners in red (right). A wide Gaussian filter helps "blend" dashed bonds together. Since each dash of a dashed bond is its own contour, many spurious corners (blue) are detected with the polygon reconstruction method, which makes the agglomerative clustering inaccurate.

As we hypothesized, the Harris method also outperforms the polygon reconstruction method when bonds are not perfectly straight. This was particularly evident in the benzene rings, where the Harris corner detector (95% accuracy on benzene rings) substantially outperformed the polygon reconstruction method (50% accuracy on benzene rings). This effect is shown in Fig. 15.

Fig. 15: The polygon reconstruction method detects several incorrect corners due to curved lines, as shown in blue. The Harris method does not suffer from this drawback.

4.3 Bond Detection Results

Bond detection is the worst-performing stage of the pipeline, but fortunately it is the most correctable by a higher-level Markov model as described in MLOCSR. Still, large improvements can be made to the algorithm by applying more heuristics about how the various atoms bond. We did not take chemical knowledge about the topological structure of molecules into account when looking for bonds, but there exist further constraints that could reduce our false positive rate, which would let us adjust the tolerance thresholds on the Hough detector and the angle acceptance to reduce the false negative rate. The results are presented in Table 4 and errors are characterized further in Figs. 16 and 17.

Molecule ID   Molecule Accuracy   Overall Precision   Overall Recall
1             1.0                 1.0                 1.0
2             0.31                0.94                0.95
3             0.0                 1.0                 0.78
4             0.26                1.0                 0.79
5             0.70                1.0                 0.88
6             0.82                1.0                 0.94
7             0.22                0.86                0.97
8             0.70                1.0                 0.90
9             0.10                0.90                0.83
Total         0.55                0.96                0.91

Table 4: Bond detection results.

Typical errors included a missing bond, as shown in Fig. 16, and false triangular closures with bonds at very wide angles (bonds that are nearly, but not quite, collinear), as shown in Fig. 17.

Fig. 16: A typical bond detection error, a missing bond, likely due to slight inaccuracy of corner detection. These "missing bonds" can be detected in later stages as long as most of the structure is correct.

Fig. 17: Another common bond detection error, triangular closure of wide bonds, due to too much contamination in the bounding box. These can also be corrected later if most of the molecule is correct.
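The per-window Hough check of Sec. 3.2.6, whose failure modes appear in Figs. 16 and 17, amounts to requiring a correctly oriented line in every window between two candidate nodes. A rough sketch follows; it assumes OpenCV's probabilistic Hough transform, and the windows_between helper (cropping fixed-size windows along the 40-pixel-wide box joining two nodes) is hypothetical, as are the specific thresholds.

```python
import math
import cv2
import numpy as np

def window_has_line(window, expected_angle_deg, angle_tol_deg=1.0):
    """True if the window contains a Hough line within angle_tol_deg of the bond direction."""
    lines = cv2.HoughLinesP(window, rho=1, theta=np.pi / 180, threshold=5,
                            minLineLength=5, maxLineGap=3)   # very low vote threshold
    if lines is None:
        return False
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
        diff = min(abs(angle - expected_angle_deg), 180.0 - abs(angle - expected_angle_deg))
        if diff <= angle_tol_deg:
            return True
    return False

def is_bond(binary_img, node_a, node_b, windows_between):
    """Accept the node pair as a bond only if every window along it has a matching line."""
    expected = math.degrees(math.atan2(node_b[1] - node_a[1],
                                       node_b[0] - node_a[0])) % 180.0
    return all(window_has_line(w, expected)
               for w in windows_between(binary_img, node_a, node_b))
```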

4.4 Bond Classification Results

4.4.1 Comparison of Classifiers

We split our training bonds randomly according to a 90-10 split and run cross-validation 10 times.

Classifier                        Cross-Validation   Accuracy
Logistic Regression               10-fold            0.85
Linear Support Vector Machine     10-fold            0.97
Decision Tree                     10-fold            0.88

Table 5: Cross-validation results on a 90-10 training set split of known bond labels.

4.4.2 Performance on Test Set

Based on the results on the cross-validation set, we use the SVM for classification on the full set of molecules. We find that there is no great disparity in confusing one type for another; despite having only 5 training examples of triple bonds, we find that double bonds are no more often mistaken for single bonds than for triple bonds, for example.

Molecule ID   Accuracy By Bond   Accuracy By Molecule
1             0.98               0.90
2             0.97               0.81
3             0.57               0.0
4             0.80               0.30
5             1.0                1.0
6             0.83               0.50
7             1.0                1.0
8             0.98               0.93
9             1.0                1.0
Total         0.94               0.75

Table 6: Test results by molecule using an SVM classifier.

4.5 Overall Results

When the overall pipeline is run on the entire set of molecules, 94 out of the original 360 molecules are correctly recognized in their entirety. While this accuracy may seem low, it is still higher than the performance of the "out of the box" existing optical structure recognition algorithms, the most well-known being OSRA, which have nearly 0% accuracy when used on handwritten data. We also find that even the approach of MLOCSR applied to the data, which relies on the Douglas-Peucker polygon fitting algorithm, does not detect C- and T-points as successfully as our algorithm on our hand-drawn dataset. We also find that our supervised learning bond classification algorithm performs extremely well given the very small training data set, which was extracted from only 5 images of each molecule. We are optimistic that with more training data we will be able to obtain nearly 100% accuracy with this method in the future.

Fig. 18: Two examples of correctly recognized molecules after completion of the full pipeline. These can be easily converted to a standard chemical data format.

5 Conclusion

Although our overall accuracy is low, we believe that the work presented in this paper will lay the foundation for hand-drawn structure recognition in the future.

Much of the low accuracy can simply be attributed to a lack of training data. State-of-the-art OCR methods, for example, would boost the accuracy of text recognition from 77% to near perfect. We also believe that more training data will ultimately allow us to use a convolutional neural network for the bond classification stage rather than an SVM, and more data will significantly improve the accuracy of bond classification as well.

Additionally, as mentioned previously, the focus of this project was on the low-level recognition of atoms and bonds, the nodes and edges that make up the overall graph. There are additional heuristics that can be applied in higher-level modules, such as bonding patterns like valence rules that we did not take into account, which would significantly improve the performance of bond detection.

Accuracy may also be a misleading metric for certain applications of hand-drawn structure recognition, in cases where more information is available. For example, in an electronic tablet drawing application, in a similar way to how Chinese and Japanese characters are recognized by OCR software, information about how the user is drawing the structure is also available. This can improve the localization of corners (using information about when the user picks up and puts down the pen) and identify bonds with much greater accuracy (based on speed of stroke, etc.). Additionally, if there is a limited subset of molecules that the engine is required to recognize, various molecule similarity algorithms can be used to compare the molecule against the database of

possible molecules and return the one with the greatest similarity. This is often the case for simple molecules and could be very useful in chemistry education.

We conclude that handwritten structure recognition and analysis is a difficult problem, one that cannot be treated in the same way as computer-generated structure recognition. More flexibility must be applied in accounting for the greater degree of variability in hand-drawn images, and we have accounted for that in this work with modern corner and line detection techniques. The key insight of this project was analyzing small cross-sections of bonds so that the algorithm can reach a consensus from many cross-sections, instead of trying to analyze bonds as a whole, as previous algorithms have done. Overall, there are many parts of this pipeline that can be improved as mentioned, but much progress has been made towards being able to apply these methods in a public application.

6 References

[1] Gaulton, A.; Overington, J. P. Role of open chemical data in aiding drug discovery and design. Future Med. Chem. 2010, 2, 903-7.

[2] Kind, T.; Scholz, M.; Fiehn, O. How large is the metabolome? A critical analysis of data exchange practices in chemistry. PLoS One 2009, 4, e5440.

[3] G. R. Rosania, G. Crippen, P. Woolf, D. States, and K. Shedden. A cheminformatic toolkit for mining biomedical knowledge. Pharmaceutical Research, vol. 24, no. 10, pp. 1791-1802, Oct 2007.

[4] Casey, R. et al. Optical Recognition of Chemical Graphics. Proceedings of the 2nd International Conference on Document Analysis and Recognition, 1993.

[5] Ibison, P. et al. Chemical Literature Data Extraction: The CLiDE Project. Journal of Chemical Informatics and Computer Science, 33, pp. 338-344, 1993.

[6] Park, J. et al. Image-to-Structure Task by ChemReader. Text Retrieval Conference, 2011.

[7] Filippov, I. and Nicklaus, M. Optical Structure Recognition Software to Recover Chemical Information: OSRA, An Open Source Solution. J. Chem. Inf. Model., 49 (3), pp. 740-743, 2009.

[8] Frasconi, P. et al. Markov Logic Networks for Optical Chemical Structure Recognition. Journal of Chemical Information and Modeling, 54, pp. 2380-2390, 2014.

[9] Douglas, D.; Peucker, T. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Can. Cartogr. 1973, 10, 112-122.

[10] Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. International Conference on Computer Vision & Pattern Recognition (CVPR '05), June 2005, San Diego, United States. IEEE Computer Society, 1, pp. 886-893, 2005.

[11] de Campos, T. E.; Babu, B. R.; Varma, M. Character Recognition in Natural Images. In Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, February 2009.

[12] Harris, C.; Stephens, M. A Combined Corner and Edge Detector. Plessey Research, 1988.

7 Code Access

The code is available open-source. The repository is located at https://github.com/bradleyemi/chemtype2. Instructions for downloading data and usage are located on GitHub.

8 Acknowledgements

This project was implemented in Python 2.7 with additional use of the Anaconda distribution, Scikit-Learn (for machine learning classifiers), and OpenCV, for CS 231A at Stanford University.

A Appendix: Molecule Table

Pedestrian Detection and Tracking in Images and Videos

Azar Fazel Viet Vo


Stanford University Stanford University
azarf@stanford.edu vtvo@stanford.edu

Abstract

The increase in population density and accessibility to cars over the past decade has led to extensive computer vision research in recognition and detection to promote a safer environment. Primarily, much of the research focuses on detecting pedestrians in order to reduce the chance of collision and to improve traffic control. The need for increased surveillance at the workplace and at home also promotes research in this area. We implemented a pedestrian detection and tracking algorithm using histogram of oriented gradients (HOG) features with a linear support vector machine (SVM) and a Random Forest classifier. Our goal was to analyze and generate different bounding boxes for people within static images, and finally apply this strategy to localize and track people within videos. We benchmarked different HOG parameters to find the best model, and furthered our experimentation by comparing the effectiveness of SVM versus Random Forests. Our implementation was able to achieve 80% accuracy in static images, and was able to track pedestrians in videos if the detected pedestrians' poses do not vary significantly.

1. INTRODUCTION

Just in the United States, 5,000 of the 35,000 annual traffic crash fatalities involve pedestrians [1]. Computer vision research in the area of pedestrian detection is becoming increasingly crucial as more intelligent motor vehicles are introduced onto the streets. However, pedestrian tracking and detection is inherently a hard problem to solve due to high intra-class variability and partial occlusions. Our goal was to benchmark different feature parameters from HOG, and to compare the success of SVM versus Random Forests for pedestrian detection. We implemented our algorithms in Python and utilized several computer vision packages from OpenCV, machine learning packages from sklearn, and image processing packages from scikit-image. In order to train our model, we used a combination of 5,400 positive 64 x 128 images from the Inria Person dataset [2], the PETA dataset [3], and the MIT database [4]. Our negative images were also from these databases and from the Daimler Mono dataset [5], containing a total of 2,100 images. For each non-pedestrian image, 10 random windows of 64 x 128 pixels were extracted for training, giving a total of 21,000 negative images. This trained model was then used to test the detection accuracy on images, and to track pedestrians in videos.

2. PREVIOUS WORK

Many techniques are being used today for pedestrian detection. One such technique that is similar to HOG is the Scale Invariant Feature Transform (SIFT). This technique generates features by using Difference-of-Gaussians (DoG) in an image's scale-space pyramid to find interesting local keypoints. Each keypoint has an orientation vector, and is invariant to scale and rotation. Due to the high dimensionality of SIFT features, principal component analysis is often used in conjunction with SIFT [10]. Although SIFT can be effective for detecting human features, Dalal and Triggs explained in their paper that locally normalized HOG descriptors were more effective [6]. They experimentally showed that the dense grid of uniformly spaced cells and overlapping local contrast normalization improved the detection performance as compared to SIFT. In fact, HOG features have been shown to produce 1 to 2 orders of magnitude fewer false positives

than other approaches.

The Deformable Parts Model (DPM) is another technique for object detection that performs well at classifying highly variable object classes. In this technique, for each image, a HOG feature pyramid is formed by varying the scale of the image and defining a root filter and part filters. The root filter is coarse and is used to capture the general shape of the object, while higher-resolution part filters are used to capture small parts of the object. Objects are then detected by computing the overall score for each root location based on the best possible placement of the parts [7].

Another technique in pedestrian detection is the Convolutional Neural Network (CNN). This technique shows outstanding power in addressing the pedestrian detection problem, especially in the context of autonomous driving. A CNN learns which convolution parameters produce features that best predict the desired output; these features, extracted from the last fully connected layers, can then be used to train an SVM model for pedestrian detection [11][12].

3. TECHNICAL APPROACH

Using our dataset of positive and negative images, we extracted features using the histogram of oriented gradients technique described by Dalal and Triggs. This technique divides the image into dense, equal-sized, overlapping blocks. Each of these blocks is further divided into cells, which are used to compute a 1-D histogram of gradient edge orientations over the pixels of the cell. For this project, we experimented with block sizes of 2x2 and 4x4 and cell sizes of 8x8 and 16x16 in order to find the best combination. For our histograms, we used 9 orientation bins across all experiments. Histograms for each block are combined and finally normalized to have better invariance to illumination and shadowing [6]. Figure 1 shows an example of extracted HOG features for a pedestrian.

Figure 1: Example of HOG features; the right picture is the original image and the left one is the extracted HOG features.

In order to train the model, the feature vectors for the images were fed into a linear SVM classifier. This model was then used to classify pedestrians from non-pedestrians. We implemented a sliding window approach to exhaustively search static images for windows with scores greater than 0.2. The score for each window was calculated using the weight and bias found from our SVM model. Since our sliding window was kept at a constant size of 64x128 pixels, we implemented an image pyramid approach for our detection. In this approach, for each image, we scaled down the image by 15% of its original size for several iterations until the size fell below a threshold of 64 pixels in width and 128 pixels in height. For each iteration, our detector window searched the entire scaled image and calculated scores using our SVM model. Once this algorithm was finished, scaled bounding boxes were displayed on the original image and non-maximal suppression was applied to eliminate redundant boxes.

In order to reduce false positive rates, we mined for hard negative examples using our negative training data. We extracted all false positive objects found within negative images and included these examples in our training data for retraining the classifier.

Due to the exhaustive search performed during HOG feature extraction, the time complexity of object detection is very high. This poses a problem for pedestrian tracking in videos because detection rates would be too slow. In order to remedy this problem, moving objects are extracted from each frame using background subtraction [8]. Using this method, we detect motion by segmenting moving objects from the background and passing these smaller images into our model instead of passing the whole frame for detection. The nth frame can be represented by its intensity image I_n, and I_{n-1} corresponds to the previous frame. Doing a pixelwise subtraction, we get the equation

    M_n(i, j) = I_n(i, j)   if ∆(i, j) ≥ T_threshold
    M_n(i, j) = 0           if ∆(i, j) < T_threshold

where i and j are pixel positions, ∆(i, j) = |I_n(i, j) − I_{n-1}(i, j)| is the pixelwise difference, and M_n is the motion image. By finding the motion image, we can dramatically reduce the complexity of our computation [9]. Using these motion images, we were able to run our model on video frames much faster than when we did not have any motion detection.

Figure 2 provides a summary of the steps in our detection algorithm.

4. EXPERIMENTS AND RESULTS

To obtain the model with the highest accuracy, we tried two different classifiers: SVM and Random Forest. To evaluate our models, we tested them on a validation set of 1,000 pedestrian images. For the SVM classifier, we investigated different regularization parameters (C) to get the highest accuracy. The regularization parameter tells the SVM optimization how much we want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. We got the highest accuracy for the SVM model when the regularization parameter had the value of 0.001.

For Random Forest, we examined different numbers of trees in training the model. Random Forest uses bagging (picking a sample of observations rather than all of them) and the random subspace method (picking a sample of features rather than all of them) to grow each tree. If the number of observations is large but the number of trees is too small, then some observations will be predicted only once or even not at all. If the number of predictors is large but the number of trees is too small, then some features can be missed in all subspaces used. Both cases result in a decrease of the random forest's predictive power, though the latter is a rather extreme case, since the selection of the subspace is performed at each node. In general, the more trees we use, the better the results. However, the improvement decreases as the number of trees increases, i.e., at a certain point the benefit in prediction performance from learning more trees will be lower than the cost in computation time for learning these additional trees. For our dataset, Random Forest provided the best accuracy with 1,000 trees.

Furthermore, in order to tune the hyperparameters for the HOG features, we extracted them using different block sizes and cell sizes. Table 1 shows the results of these experiments. As seen in this table, the block size and the cell size have a significant effect on the accuracy of our models. In other words, the effectiveness of the models strongly depends on the HOG feature parameters. Also from the table, we can see that Random Forest outperforms SVM in all cases except when the block size is 2 and the cell size is 8, where the accuracy of the SVM model is higher than that of the Random Forest. According to these results, we conclude that there is no single optimal configuration for HOG features; it depends on the dataset being used. In order to reduce false positive rates in our models, we exhaustively searched all 2,100 negative images, extracted 5,800 windows of size 64x128 pixels as false positive objects, and then retrained our model with the new augmented set. Using 1,000 new negative images for validation, the original model had a false positive rate of 0.005% while the new model with hard negative mining had a 0% false positive rate. Most of the false positives came from objects that are erect and skinny, such as poles and trees. However, our hard-negative-mined model eliminated many of these false positives. An example of this improvement is seen in Figure 3.

As mentioned in section 3, since we found multiple bounding boxes for each object, we used non-maximal suppression to remove the redundant bounding boxes. Figure 4 shows an example of using non-maximal suppression for two images.

For the purpose of background subtraction, we calculated a reference image using a Gaussian Mixture-based background/foreground segmentation algorithm. Then, we subtracted each new frame from this image to compute a foreground mask. The result is a binary segmentation of the image which highlights regions of non-stationary objects. This way we were able to get the segmentation of moving regions in image sequences in real time. Figure 5 shows an example of the background subtraction for one frame of a video.

Figure 2: Flow Chart of Major Steps for Pedestrian Detection and Tracking.

Block size   Cell size   SVM Accuracy   Random Forest Accuracy
2            8           80.8%          67.7%
2            16          69.7%          81.1%
4            8           66.7%          69.1%
4            16          0%             80.4%

Table 1: The accuracy of the SVM and Random Forest models using different HOG parameters.
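The parameter sweep in Table 1 corresponds to extracting HOG descriptors with different cell and block settings and scoring each classifier on the validation set. A rough sketch using scikit-image and scikit-learn follows; it is illustrative only, the mapping of "block size" and "cell size" onto skimage's pixels_per_cell and cells_per_block is our assumption, and only the C value and tree count quoted in the text are taken from the paper.

```python
from itertools import product

from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

def extract_hog(images, cell_px, block_cells):
    """HOG descriptors for 64x128 grayscale windows with the given cell/block sizes."""
    return [hog(img, orientations=9,
                pixels_per_cell=(cell_px, cell_px),
                cells_per_block=(block_cells, block_cells)) for img in images]

def sweep(train_imgs, train_y, val_imgs, val_y):
    """Accuracy of SVM and Random Forest for every block/cell combination in Table 1."""
    results = {}
    for block_cells, cell_px in product([2, 4], [8, 16]):
        X_tr = extract_hog(train_imgs, cell_px, block_cells)
        X_va = extract_hog(val_imgs, cell_px, block_cells)
        for name, clf in [("svm", LinearSVC(C=0.001)),
                          ("rf", RandomForestClassifier(n_estimators=1000))]:
            clf.fit(X_tr, train_y)
            results[(block_cells, cell_px, name)] = clf.score(X_va, val_y)
    return results
```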

Figure 3: Reduction in false positive rates using hard negative mining. Poles and other erect patterns were eliminated in the new SVM model trained with the augmented negative dataset.

Figure 4: Redundant bounding boxes were eliminated using non-maximal suppression on these images.

To track pedestrians in videos, after applying the background subtraction and getting the foreground mask, we found the contours in each frame and then computed the bounding boxes for each contour of that frame. Since all the training images have a size of 64 x 128, we resized the contours whenever their heights and widths were smaller than our training image size. This was accomplished by adding some padding to the heights or widths of the contours. Afterward, we applied our classifier to each contour to decide whether it contains a pedestrian or not. We then used the non-maximal suppression technique to remove multiple bounding boxes for each object. Figure 6 shows the results for one frame of a video. As seen in this figure, there are six pedestrians in the frame and the classifier detected 3 of them. The others are either occluded or in a pose that the classifier cannot detect. To observe whether

the classifier is confused when there are moving objects other than pedestrians, we tested it on videos that contain different types of moving objects. Figure 7 is an example of a frame that contains pedestrians, motorcycles, and a truck. As shown in the figure, the classifier only detected pedestrians and excluded the truck and motorcycles.

Furthermore, to evaluate our classifier, we tried 5 different videos with different durations. The total number of pedestrians in these videos was 40, and our classifier detected 24 of them, which corresponds to 60% accuracy. As a side note, we should mention that finding videos that have both moving pedestrians and moving non-pedestrian objects was difficult, since the cameras in most videos were not fixed and so detection was not possible; we instead evaluated our classifier on a small number of videos. To watch a complete demo of the performance of our classifier, please refer to the Youtube link that we have provided in section 6.

Figure 5: Example of applying background subtraction on one frame of a video.

Figure 6: Pedestrian detection and tracking in one frame of a sample video.

Figure 7: Pedestrian detection and tracking in a frame with different types of moving objects.

5. CONCLUSION AND FUTURE WORK

We have demonstrated that HOG feature descriptors combined with an SVM or Random Forest and hard negative mining provide an effective strategy for pedestrian detection and tracking. The main drawbacks of this approach are that our model is unable to track a large variety of human poses, and that it can only track pedestrians after some delay is added to the video due to the high complexity of HOG extraction. In many videos, occlusion is present while pedestrians are moving in the scene, causing difficulties in detection. In the future, other techniques for pedestrian tracking can be added to our system, such as optical flow and Kalman filtering. Tracking humans is inherently a difficult problem in the computer vision community, but solving this problem can greatly reduce the number of annual motor vehicle casualties and reduce crime rates through improved surveillance systems at home and at work.

6. GITHUB AND YOUTUBE LINKS

Our GitHub Code:
https://github.com/afazel/CS231A_Project
Our Youtube Video on Pedestrian Tracking:
https://www.youtube.com/watch?v=0lEJIh6dWAE

References

[1] 2014 Motor Vehicle Crashes: Overview. U.S. Department of Transportation, March 2016.

[2] http://pascal.inrialpes.fr/data/human/

[3] http://mmlab.ie.cuhk.edu.hk/projects/PETA.html

[4] http://cbcl.mit.edu/software-datasets/PedestrianData.html

[5] S. Munder and D. M. Gavrila. An Experimental Study on Pedestrian Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1863-1868, November 2006.

[6] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. International Conference on Computer Vision & Pattern Recognition, June 2005.

[7] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part Based Models.

[8] Nan Lu, Jihong Wang, Q.H. Wu and Li Yang. An Improved Motion Detection Method for Real-Time Surveillance. Annalen der Physik, 322(10):891921, 1905.

[9] Nan Lu, Jihong Wang, Q.H. Wu and Li Yang. Histograms of Oriented Gradients for Human Detection. IAENG International Journal of Computer Science, 35:1.

[10] S. Zickler and A. Efros. Detection of Multiple Deformable Objects using PCA-SIFT. Carnegie Mellon University, 2007.

[11] S. Canyameres Masip and A. M. López Peña. On the use of Convolutional Neural Networks for Pedestrian Detection. 2015.

[12] M. Szarvas, A. Yoshizawa, M. Yamamoto, and J. Ogata. Pedestrian detection with convolutional neural networks. Intelligent Vehicles Symposium, 2005.

Reconstructing Roller Coasters

Tyler J. Sellmayer
Stanford University
tsellmay@stanford.edu

Abstract

In this paper, we describe a method of three-dimensional reconstruction whose input is a series of two-dimensional images recorded by a passenger on a roller coaster (a "first-person ride video"), and whose output is a three-dimensional model which approximates the path of the roller coaster's track. We also describe a method for determining the approximate color of the roller coaster track from the same video. We conclude that our methods are irrevocably flawed and cannot be used to achieve our goal of printing scale 3D models of roller coasters.

This builds on previous work on the structure-from-motion problem, wherein an entire three-dimensional scene is reconstructed from a series of images. We modify this problem by attempting to reconstruct only one element of the scene, the roller coaster's track. We also implement a method of choosing a subset of input frames from a video. Previous SFM work has largely focused on either individually taken photographs [6, 12] or complete video sequences [2], taking every frame as input.

Our approach relies on a fundamental assumption about first-person ride videos: the camera's path through the world is an approximation of the roller coaster track. In other words, the rider keeps all hands and feet, and his camera, inside the ride at all times. This allows us to use the camera pose location as an approximation of the track location.

1. Introduction

The motivation of this paper is to enable the author to create 3D-printable models of roller coasters automatically. The problem is a modification of the structure-from-motion (SFM) problem as described in [6], simplified because we only need to recover the camera poses for our final output. We do not attempt to recover the 3D world points comprising the roller coaster track itself. To accomplish the camera pose estimation, we rely on MATLAB's structure-from-motion toolkit as described in [12].

We begin by breaking apart a video into individual frames at 30 frames per second of video, using the avconv tool [13]. This gives us a collection of frames stored as individual PNG images. Our MATLAB code then reads these images from disk.

We author our own MATLAB code for automated calculation of the color of the track. This color, track color centroid, is later used to paint the final rendering of points, and in the future could be used to choose an appropriate color for a 3D-printed model of the track.

We extend MATLAB's tutorial code [12] to make use of the high framerate of our input video [3]. Instead of taking each and every frame of video as input, we choose a start point s and a desired number of frames n, and introduce a frameskip parameter which controls how many video frames we ignore between SFM input frames. We call these ignored frames inbetweens. The unignored frames are called keyframes.

The SFM pipeline takes these keyframes as input. For each keyframe, it corrects for image distortion (using a manually-tuned set of camera parameters), detects SURF features [1], finds feature correspondences with the previous frame, and uses these correspondences to estimate the fundamental matrix [6, p. 284] between the previous frame and the current frame.

This estimated fundamental matrix is then used to compute the relative camera pose adjustment between frames, which is used to update a list of world-coordinate camera poses. After each new frame is successfully processed in this manner, we perform bundle adjustment [15] to increase the quality of our camera poses. Once we have processed n frames, we report the list of camera pose locations in world coordinates and plot them. The plot points are colored with the RGB value from track color centroid and displayed.

To enhance our results, we enable our algorithm to substitute a keyframe with a nearby inbetween frame when the fundamental matrix estimation fails for any reason, trying up to frameskip inbetweens.

In our experimental results, we report plots of camera poses generated under a variety of SURF parameters and
frameskip values. We then draw conclusions from these results.

Figure 1. Frame number 3070 from a first-person ride video [3], unedited.

Figure 2. Frame number 2784 from a first-person ride video [3], unedited. Here, the rider is inside a dark tunnel.

2. Problem Statement

Our problem has two independent pieces: estimating the roller coaster track's color, and estimating a three-dimensional model of the roller coaster track's path. We examine these problems separately.

2.1. Estimating Track Color

This problem can be concisely stated as "Given a first-person ride video of a roller coaster, return the RGB value which most closely approximates the paint color of the roller coaster's track."

First-person ride videos have the property that the image of the track always touches the bottom of the frame near the center, as shown in figure 1. This property is only untrue in cases where the camera is not pointing forward along the track (we found no examples of this) or when the track is not fully visible. For example, the track is not visible at the bottom of the frame when the camera's automatic white balance adjustment causes it to be blacked out (or whited out) in response to changing environment light, as seen in figure 2.

Using this mostly-true property, we can conclude that ideally, the track color will be approximately the color that appears most often in the bottom-center of our video frames. But our images are noisy, and the lighting changes throughout the video, so this ideal scenario doesn't quite work if we use the pixel colors directly from the recorded images. Instead we bucket these pixel colors into a color palette using nearest-neighbor search [11], then find the palette color whose member pixels occur most often in the bottom-center of the frame. This color we call our track color. The full explanation of this algorithm is in the Technical Content section 3.

2.2. Estimating Track Structure

As stated above, we use the locations in world space of our camera as an approximation of the track position. This lets us use the camera pose estimation stage of structure-from-motion [12] as the basis of our algorithm. We mark every frameskip-th frame as a keyframe, with the frames between them called inbetweens.

Before processing any frame, we first undistort it using manually-tuned camera parameters for correcting fisheye distortion and a calculated intrinsic camera matrix K. To obtain the approximate focal length of our camera, we compute the average width-at-widest-point of the roller coaster track's image in pixels across a random subset of frames, and use this average width to obtain a ratio between a width in world space (the real width of the track, which we assume to be 48 inches) and a width in the image plane in pixels. We use this ratio to convert the known focal length of our camera (14mm, according to [5]) to pixels. We assume square pixels and zero skew, so this focal length is all we need to compute our intrinsic camera matrix K.

For each keyframe, we attempt to compute the camera pose relative to that of the previous frame. MATLAB's pose estimation toolkit assumes that the camera poses are exactly 1 unit distance apart, an assumption which is corrected by performing bundle adjustment [15] after each frame's pose is added.

During the computation of the relative camera pose, MATLAB's helperEstimateRelativePose function [12] attempts to estimate the fundamental matrix F [6, p. 284]. This estimation can throw an exception when there are not enough matching correspondence points to complete the eight-point algorithm, or when there is no fundamental matrix found which creates enough epipolar inliers to meet the requirement set by our MetricThreshold parameter. When an exception occurs, we do not want to simply stop calculating. Instead, we make use of the inbetween frames, retrying the relative camera pose computation with each inbetween frame after our second keyframe until we find one that succeeds, or until we run into the next keyframe, whichever comes first. If we run out of inbetweens without successfully computing a relative camera pose, we terminate our SFM computations immediately and return a partial result. In our full results (see section 6)
2
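To make the calibration step at the start of this section concrete, the sketch below converts the known 14 mm focal length into pixels using the track-width ratio described above and assembles K. This is a minimal NumPy sketch, not the authors' MATLAB code; the function name, the example image size, and the centered principal point are our assumptions.

import numpy as np

def intrinsics_from_track_width(avg_track_width_px, image_size,
                                focal_length_mm=14.0, track_width_in=48.0):
    # The ratio between the track's width in the image (pixels) and in the
    # world (millimeters) serves as a rough pixels-per-millimeter scale,
    # as described in section 2.2.
    track_width_mm = track_width_in * 25.4
    px_per_mm = avg_track_width_px / track_width_mm
    f_px = focal_length_mm * px_per_mm           # focal length converted to pixels
    # image_size = (width, height); a centered principal point is assumed.
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    return np.array([[f_px, 0.0,  cx],
                     [0.0,  f_px, cy],
                     [0.0,  0.0,  1.0]])

# Example with assumed numbers: a 1920x1080 frame, track averaging 600 px wide.
K = intrinsics_from_track_width(600.0, (1920, 1080))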
In our full results (see section 6) we report the mean and maximum numbers of unsuccessful fundamental matrix computations per keyframe for each of our experimental runs.

SFM requires feature correspondences for the fundamental matrix calculation [6], which requires features. We choose to use SURF [1] as our feature detection algorithm because it provides scale- and rotation-invariant features. This is necessary because roller coasters often rotate the rider relative to the environment (which rotates the projected images of features in our scene between frames), and because the camera moves forward along the z-axis between frames, which changes the scale of the projected images of features in our scene. In our results we report experiments with controlling the SURF parameters NumOctaves, NumScaleLevels, and MetricThreshold as defined in [7].

To avoid needless reimplementation of past work, we use MATLAB's built-in toolkit [12] for computing correspondences between features, estimating camera poses, tracking views, triangulating 3D world points, and performing bundle adjustment [15]. Together, these produce a final set of camera poses, including camera location and orientation in world space. We then plot the camera locations and color our plot using the RGB value of the calculated track color.

3. Technical Content

3.1. Splitting Video Into Individual Frames

Our video is downloaded from YouTube [3]. We run the following command to split it into its individual PNG format images at a rate of 30 frames per second of video [13]:

$ avconv -i video.mp4 -r 30 -f image2 \
    output_dir/%05d.png

We manually select the range of frames [f1, . . . , fe] from the video which comprises the first-person ride video, excluding the copyright notice at the beginning and the credits at the end.

3.2. Calculating Track Color

3.2.1 Determining The Color Palette

We first decide on a palette size. For our experiments, we use palette size 10, meaning we will calculate 10 centroids in the RGB space. Our code examines a random subset of t frames [s1, . . . , st] ⊂ [f1, . . . , fe], and takes a random subset of q pixels in each frame [p1,1, . . . , pq,1, p1,2, . . . , pq,2, . . . , pq,t], where each pixel pi,j is represented as a triplet of values between 0 and 255, indicating the red, green, and blue values comprising the color of that pixel, respectively. This is a standard representation of colors in RGB space.

Figure 3. Ten color centroids calculated from a random subset of pixels in [3]. Notice that these colors are similar to those found in figure 1.

We take this set of several thousand pixels and run k-means clustering [10] on it. This gives us k centroids in RGB space, and we use the colors those centroids represent as our color palette. Because of the randomness in this algorithm, we do not get the exact same palette every time. One example of a k = 10 color palette is seen in figure 3.

3.2.2 Finding The Track

Once we have established our color palette, we need to determine which of the colors in the palette most closely approximates the color of our track. To accomplish this, we must rely on our knowledge that in first-person ride videos, the roller coaster track usually touches the bottom of the image frame, near the center, and almost never touches the left or right sides of the frame.

We first select a new random subset of t frames [r1, . . . , rt] ⊂ [f1, . . . , fe]. In each frame, we examine only the bottom 10 rows of pixels. We split this 10-pixel-high strip horizontally into g 10-pixel-high segments. For an image of width W, this gives us g regions [γ1, . . . , γg], each of size 10 × W/g. Our ultimate goal with these regions is to find which palette color is least often present in the left- and right-most regions.

Rather than just counting every pixel, we choose to count only those pixels which lie on either side of an edge. This increases the number of pixels we count that represent track (which is made of hard-edged steel parts, in focus, and relatively large in the frame, giving it more sharp edges) compared to the number of pixels we count in noisy background regions (which tend to be out of focus, motion-blurred, or so far away that their edges are not distinguishable at the camera's resolution, giving them few sharp edges). We count the pixels (ex − 1, ey), (ex + 1, ey) which lie on either side of the edge, rather than the pixel (ex, ey) which lies directly on the edge, because we want to capture the colors inside the regions more than we want to capture the colors of the edges themselves. We call the set of points (ei,x − 1, ei,y), (ei,x + 1, ei,y) for all edge pixels ei our set of half-edge pixels.
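Before moving on, the palette step from section 3.2.1 can be illustrated with a small NumPy sketch: sample random pixels from random frames, then cluster them. The authors use MATLAB's kmeans [10]; the plain Lloyd's-iteration version below, and all of its parameter values, are ours.

import numpy as np

def color_palette(frames, k=10, pixels_per_frame=2000, iters=25, seed=0):
    # frames: list of HxWx3 uint8 RGB arrays.
    rng = np.random.default_rng(seed)
    samples = []
    for img in frames:
        h, w, _ = img.shape
        ys = rng.integers(0, h, pixels_per_frame)
        xs = rng.integers(0, w, pixels_per_frame)
        samples.append(img[ys, xs].astype(np.float64))
    data = np.concatenate(samples)                  # (N, 3) RGB triplets in [0, 255]

    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):                          # Lloyd's k-means iterations
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)               # nearest centroid per sample
        for c in range(k):
            members = data[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids                                # (k, 3) palette colors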
Furthermore, because we care about hue more than saturation or value when determining which pixels to count, we perform the edge-finding computation on the 'hue' layer of our images after converting them to HSV (hue-saturation-value) format. We use MATLAB's edge function to accomplish this [8].

We use a nearest-neighbors search [11] to bucket our half-edge pixels' RGB values into our color palette. To illustrate this concept, we provide figure 4, where every pixel in the image has been replaced with its bucketed color.

We then count the number of half-edge pixels in each color palette bucket, in each segment [γ1, . . . , γg]. This gives us a two-dimensional histogram where each cell represents the total number of half-edge pixels of one particular color ci in one particular region γj across all the frames in [r1, . . . , rt]. Table 1 gives an example of one such histogram. From this histogram we can find a color column which most closely matches the pattern we expect for our track color. As stated above, we expect the track color to appear almost only in the center region(s), and almost never in the far-left and far-right regions. We get good results simply by choosing the color whose far-left region and far-right region counts have the minimum sum. In table 1 we see that the column [0, 5, 124, 26, 0]T satisfies this condition, so our track color is the one corresponding to that column. In this example, that color happens to be the reddish-orange palette color which covers the bulk of the track in figure 4.

Figure 4. Top: Frame number 3854 from a first-person ride video [3], unedited. Bottom: the same image with each pixel's color replaced by its nearest RGB-space neighbor from the color palette in figure 3.

Table 1. A histogram of half-edge pixel counts over 5 regions and 10 colors.

1612  163   15  14  244  117   72    0   59  818
 471   47  276  18  253  112   41    5    0  433
 683    0   70  55   32  285  142  124    0  251
 182    0  168   1  281  104    0   26   45  441
 312  141   74   0   94  183   18    0   14  663

3.3. Calculating Track Width

Now that we know the color palette and the track color, we can examine yet another random subset of frames and compute the average width (in pixels) of our image of the track. This will allow us to scale our model appropriately by adjusting the focal length parameter in our camera matrix K. We again examine only the bottom 10 rows of pixels of each frame. In each row of pixels, we find the leftmost and rightmost pixel whose color lands in the track-color bucket, and consider the distance between these two pixels to be the track width at that row. We simply take the mean of these track widths for each of the 10 bottom rows in each image as our average track width.

3.4. Camera Pose Estimation

The camera pose estimation task is the first part of the Structure-From-Motion problem as described in [12]. We operate on grayscale versions of our frames; color is not important for this part. We use a manually-tuned parameter for correcting the radial distortion caused by the GoPro's fisheye lens effect, giving us less distorted frames, like the one seen in figure 5.

We use SURF [1] to detect features, as seen in figure 6. We modify the parameters NumOctaves, NumScaleLevels, and MetricThreshold as defined in [7] in our experiments. SURF provides rotation- and scale-invariant features, which is useful to us because we want to find feature correspondences between frames in an environment where our camera is rotating and moving through space (because it's on a roller coaster!). As described in the Problem Statement section 2, we use MATLAB's toolkit to accomplish fundamental matrix estimation, triangulation, and bundle adjustment. We extend the tutorial code [12] to make use of the relatively large number of frames available to us.

We begin our SFM camera pose update loop with the intention of operating only on every frameskip-th frame [f1, f1+frameskip, f1+2·frameskip, . . .], called our keyframes. At each step in the loop, we first find feature correspondences between the next keyframe and the current last-used frame. Figure 7 shows one such set of correspondences. When these feature correspondences provide enough high-quality points for us to estimate the fundamental matrix and meet our threshold for epipolar inlier count, the calculation succeeds and our loop continues on to the next keyframe.
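Written with OpenCV from Python, the success test in this loop (enough correspondences, then enough epipolar inliers for the estimated F) looks roughly like the sketch below. It only illustrates the check: ORB stands in for SURF, the thresholds are placeholders, and the pipeline actually used here is MATLAB's estimateFundamentalMatrix [9].

import cv2
import numpy as np

def try_fundamental_matrix(gray_a, gray_b, min_inliers=50):
    # Detect and match features between the two frames (ORB as a SURF stand-in).
    orb = cv2.ORB_create(2000)
    ka, da = orb.detectAndCompute(gray_a, None)
    kb, db = orb.detectAndCompute(gray_b, None)
    if da is None or db is None:
        raise RuntimeError("not enough features")
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(da, db)
    if len(matches) < 8:                         # eight-point algorithm needs >= 8 points
        raise RuntimeError("not enough correspondences")
    pts_a = np.float32([ka[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kb[m.trainIdx].pt for m in matches])
    # Estimate F with RANSAC and count the epipolar inliers it explains.
    F, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC, 1.0, 0.99)
    if F is None or mask is None or int(mask.sum()) < min_inliers:
        raise RuntimeError("not enough epipolar inliers")
    return F, mask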
When the feature correspondences do not meet this requirement, or if MATLAB's estimateFundamentalMatrix function [9] fails for any other reason, we re-attempt the calculation with an inbetween instead of the failed keyframe. We try subsequent inbetween frames in the same order they appear in the video.

For example, when frameskip = 20 the algorithm will first attempt to compute the fundamental matrix between frames 1 and 21. If this computation fails, we try the computation again between frames 1 and 22, then 1 and 23, and so on, until the computation succeeds, or until we reach frame 41 (which is the next keyframe) and terminate. Examples of feature correspondences which led to a failed fundamental matrix calculation can be seen in figure 8.

Figure 5. Above: Frame number 839 in [3], unedited. Below: The undistorted frame.

Figure 6. SURF features in frame 91.

Figure 7. SURF feature correspondences between frames 192 (red) and 212 (blue).

Figure 8. Top: SURF feature correspondences between frames 212 (red) and 232 (blue). Bottom: SURF feature correspondences between frames 212 (red) and 241 (blue). Both of these sets of features lead to a failed fundamental matrix estimate calculation. Notice the yellow lines which are much longer than the distance between the image centers - these are incorrect correspondences. Having so many of these creates the situation where estimateFundamentalMatrix finds only fundamental matrices that do not meet the threshold for the number of epipolar inliers.

3.5. Output

Our output is a set of points, plotted in a 3D view, colored to match our calculated track color. Each point represents the estimated camera pose location from one frame of video. An example plot is given in figure 9.

Figure 9. Plot of 53 estimated camera pose locations, generated with frameskip = 16, NumOctaves = 4, NumScaleLevels = 3, MetricThreshold = 1100.0.
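The fallback behaviour in the frameskip = 20 example above (try the keyframe, then each inbetween, and give up with a partial result when the next keyframe is reached) can be summarized in a short Python sketch. Here estimate_relative_pose is a caller-supplied stand-in for the MATLAB pose-estimation step and is assumed to raise an exception on failure; the function is illustrative, not the code used in this project.

def reconstruct_poses(frames, frameskip, estimate_relative_pose):
    poses = []                        # accumulated relative poses
    last_used = 0                     # index of the last successfully used frame
    key = frameskip                   # index of the next keyframe
    while key < len(frames):
        candidate = key
        while candidate < min(key + frameskip, len(frames)):
            try:
                poses.append(estimate_relative_pose(frames[last_used], frames[candidate]))
                last_used = candidate
                break
            except Exception:
                candidate += 1        # fall back to the next inbetween frame
        else:
            return poses              # ran out of inbetweens: return a partial result
        key += frameskip
    return poses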
4. Experimental Setup and Results

4.1. Implementation And Setup

For all our experiments, we operate on the same set of images from [3]. This data set is roughly 14GB, and we're running our code on Stanford's corn servers, so this uses 99% of our filesystem quota. We did not have a computer with enough available storage to hold another data set.

All our code is implemented in MATLAB. We implement the following functions:

• main.m – This is the main function of our code. Responsible for generating figures 4, 5 and all of the camera pose plots, as well as running all other functions described below. Controls which image files are used as input to the other functions. Defines the experiment parameter TEST_FRAMESKIP, the desired number of inbetween frames between keyframes (also referred to as frameskip above).

• random_subset_images.m – Used by main.m to select random subsets of image paths, which are then passed to cluster_colors, track_color, and average_track_width.

• cluster_colors.m – Determines the color palette of a set of images using k-means [10], as described in section 3.2.1. Defines the experiment parameter NUM_COLORS, the number of color centroids in the palette. Returns a NUM_COLORS × 3 matrix of RGB values, with each row representing one palette color.

• track_color.m – Determines which color centroid from the output of cluster_colors best approximates the color of the roller coaster track, as described in section 3.2.2. Defines the experiment parameter BOTTOM_STRIP_SEGMENTS.

• average_track_width.m – Determines the width of the track, as described in section 3.3.

• sfm.m – Computes camera poses using the SFM pipeline as described in section 3.4. Defines the experiment parameters NUM_OCTAVES, NUM_SCALE_LEVELS, and METRIC_THRESHOLD.

For our color experiments, we run our main MATLAB function once per experiment, altering either the NUM_COLORS parameter in cluster_colors.m or the BOTTOM_STRIP_SEGMENTS parameter in track_color.m for each experiment. We record the output of main, which includes the list of color centroids, the histogram of pixels-per-centroid-per-segment, and the final track color.

For our SFM experiments, we run the same MATLAB function (main) once per experiment, changing one parameter out of NumOctaves, NumScaleLevels, MetricThreshold, and TEST_FRAMESKIP each time. We conduct testing in an exploratory manner, altering only one parameter at a time, searching for a configuration which gives us the longest possible series of camera poses before running into the failure condition of being unable to calculate a fundamental matrix for a keyframe or any of its subsequent inbetweens.

4.2. Results

4.2.1 Track Color Estimation

We experimented with computing the track color for various values of NUM_COLORS and BOTTOM_STRIP_SEGMENTS. These results are so uninteresting - the track color always comes out as a shade of orange (as seen in figures 1, 3, and 4) or nearly white (the color of the sky in those same figures) - that it is wholly unnecessary to present more than one full example. We present this example in tables 2 and 3.

We also present a table of track color centroids as calculated under different parameters in table 4. Remember that these color coordinates are on a scale of 0-255, with (0, 0, 0) representing pure black and (255, 255, 255) representing pure white.

4.2.2 Camera Pose Plots

We present a representative subset of our camera pose estimation results. Figures 9, 10, 11, 12, and 13 show a variety of camera pose location plots. Note that the long, straight path of points in each plot corresponds to the long, straight lift hill of the roller coaster.
The jumbles of points are badly-reconstructed camera locations on and after the peak of the lift hill, or during the initial turn out of the ride shelter area. Only figure 13 represents our coaster somewhat accurately - watch the video in [14] and observe the presence of the roller coaster's second hill. We were unable to find any configuration of our code which could reconstruct past frame 2689 of the video [3] without hitting the failure condition of being unable to find an inbetween that allows for successful fundamental matrix estimation. For full results, see the link in section 6.

Table 2. A histogram over 5 segments and 10 colors.

902    1  153   87  4  191  775   3  467   428
454    4   45  206  7  146  344  98  293   721
429    4   20  114  6  492  172   7  203  1084
375    6   60  233  0  333    0   0  103  1131
918    0  148  338  0  334    0   0   97   730

Table 3. The color centroids for each of the 10 colors in table 2.

Red       Green     Blue
81.1008   76.3207   57.9355
187.2241  128.2615  83.6437
72.8899   45.4787   24.6535
244.1506  242.3728  241.9210
171.5434  90.5595   36.1833
127.8087  105.3556  84.7148
36.8842   28.1582   19.3491
177.7524  181.0321  182.5577
184.3315  152.2799  126.4728
129.8242  55.6536   23.2184

Table 4. Estimated track colors.

#colors  #segments  Red       Green     Blue
3        3          161.7516  117.9050  85.0151
5        3          243.3168  241.6485  241.0941
7        3          243.2237  242.1020  242.0921
14       3          244.6650  243.3350  243.2677
3        5          161.7516  117.9050  85.0151
5        5          243.3168  241.6485  241.0941
7        5          142.3456  67.5242   28.2738
14       5          244.6650  243.3350  243.2677

Figure 10. Plot of 62 estimated camera pose locations (between frames 91 and 249) in [3], generated with frameskip = 4, NumOctaves = 4, NumScaleLevels = 3, MetricThreshold = 900.0.

Figure 11. Plot of 19 estimated camera pose locations (between frames 91 and 153) in [3], generated with frameskip = 8, NumOctaves = 4, NumScaleLevels = 3, MetricThreshold = 900.0.

5. Conclusions

5.1. Color Extraction

Our relatively low level of success in determining track color (see table 4; often the track color is determined to be white, which is actually the color of the sky in our video) suggests that our goal was not achieved, and our method was too reliant on manual tuning of the NUM_COLORS and BOTTOM_STRIP_SEGMENTS parameters. We suggest that future work on this topic should ignore our results and use better, existing image segmentation algorithms like the ones found in [4]. Should anyone choose to use our method, we offer some suggestions and conclusions about our results.

In section 2.1, we introduced a method for extracting the color of a roller coaster track from a first-person ride video. This method is extensible to other color extraction problems, where the general region that contains an object is known across multiple views. For instance, one might extract a person's eye color by overlaying a grid on the image, then creating a 2D histogram similar to table 1 with each row referring to a single cell in the grid. An existing face-detection algorithm would be used to determine the general location of the eye in each frame.
After doing the nearest-neighbors color palette bucketing and counting half-edge pixels to create the histogram, whichever color was present near the center of the eye, but not present at all in the boundary regions around the eye, would most likely represent the closest approximation of the person's eye color. Further refinements could be made by detecting circles to more accurately locate the iris and pupil.

For our specific roller coaster problem, our goal was to find a close approximation to our track color. Here we define "close approximation" in the context of our original goal, which was to create 3D-printed models of roller coasters, so we only need to approximate up to the color resolution of available 3D-printing filaments. In general this means we only need to be able to accurately distinguish orange from red or brown or any other primary or secondary color, and we have achieved this level of accuracy in this paper, but only by manually tuning the NUM_COLORS and BOTTOM_STRIP_SEGMENTS parameters until we got the desired result. This is less useful than simply picking the color manually.

Figure 12. Plot of 52 estimated camera pose locations (between frames 91 and 833) in [3], generated with frameskip = 16, NumOctaves = 4, NumScaleLevels = 3, MetricThreshold = 900.0.

Figure 13. Plot of 165 estimated camera pose locations (between frames 91 and 2641) in [3], generated with frameskip = 16, NumOctaves = 4, NumScaleLevels = 3, MetricThreshold = 2000.0. A video of these points is available in [14].

5.2. Camera Pose Estimation

In section 3.4 we describe our approach to reconstructing camera poses for a series of frames taken from a video. Our experimentation with modifying the frameskip parameter reveals that frameskip = 18 gives the longest reconstructable sequences, but the resulting plot looks far less smooth than with frameskip = 16. For our ultimate goal of creating 3D-printed models of the track, we want a smooth-looking plot, so we chose frameskip = 16 and explored the behavior when other parameters are modified.

Increasing SURF's NumOctaves parameter allows SURF to find larger blob features [7]. This is useful when our roller coaster moves past objects that are large in the frame, such as the tree seen in figure 8. Modifying this did not affect our results very much. Running with NumOctaves = 5, NumScaleLevels = 3, MetricThreshold = 2000.0, frameskip = 16 allowed our algorithm to reconstruct 168 total frames, whereas the same configuration with NumOctaves ∈ [3, 4, 6] only allowed our algorithm to reconstruct 165 total frames. This is only a 1.8% increase.

Increasing SURF's NumScaleLevels parameter allows SURF to find a greater quantity of small blobs [7]. It cannot be less than 3, but increasing it above 3 did not improve our results. Running with NumOctaves = 5, NumScaleLevels = 4, MetricThreshold = 2000.0, frameskip = 16 allowed our algorithm to reconstruct 40 total frames, and running with NumOctaves = 5, NumScaleLevels = 5, MetricThreshold = 2000.0, frameskip = 16 allowed our algorithm to reconstruct 167 total frames.

Increasing SURF's MetricThreshold parameter increases the minimum threshold for feature 'strength' [7]. This gives us a greater quantity of high-quality features which are more likely to find strong correspondences in subsequent keyframes. This also makes us less likely to find correspondences in general, because it may be that the same feature has 'strength' higher than MetricThreshold in one image but not in the other. The higher we set this threshold, the more likely it becomes that we will fail to find the same feature in two consecutive images, and thus the less likely we are to find a correspondence.
When the threshold is much higher than 2000 (we tested with MetricThreshold = 4000), we will reach a failure state earlier, because we will be unable to calculate a fundamental matrix due to too few features. When the threshold is much lower than 2000 (we tested with MetricThreshold ∈ [800, 850, 900, 1000, 1100]) we will also reach a failure state earlier. With low thresholds we obtain so many erroneous feature correspondences that they will cause MATLAB's estimateFundamentalMatrix function to fail with an exception, because there are never enough epipolar inliers for any of the sampled fundamental matrices [9]. Unfortunately, we cannot provide a specific suggestion for a good MetricThreshold parameter, because the effects of this value are entirely dependent on the quality and structure of the input images. We can suggest that future work start by doubling MetricThreshold until the quality of their output degrades, then doing binary search to find a good-enough MetricThreshold between the two best values.

Choosing a high MetricThreshold also increases the (totally subjective) smoothness of our point plot. This is because the high-quality features are less likely to be incorrectly corresponded with the wrong feature in an adjacent frame. This is especially important because the scene in and around a roller coaster is full of repetitive elements, like the repeating structure of the track, the similar pieces of support steel, and repeating patterns in the nearby rides and buildings. These elements are often incorrectly matched as correspondences, as seen in figure 8. Modifying NumOctaves and NumScaleLevels also helps with this by narrowing the range of feature scales we detect, reducing the occurrence where a nearby feature in one frame is incorrectly corresponded with a far-away feature in another frame.

Figure 14. Frame number 2814 from a first-person ride video [3], unedited.

5.3. Final Word

Overall, we consider these experiments a failure. Our camera pose estimation is not robust enough to create smooth models of the entire track. There are large portions of first-person ride videos which are totally inscrutable to our methods, including frames like the ones seen in figures 2 and 14, which have been nearly destroyed by the camera's auto white-balance feature. We were unable to find a configuration of SURF parameters and frameskip value which reduced the reconstruction error sufficiently to make a smooth-looking track model, so none of our results are worthy of being 3D printed. Also, the processing takes so long (on the order of 1 hour per 150 frames successfully processed, though we did not take explicit notes of our timing), and needs to be manually re-calibrated for each video (because the scale and quality of features in different videos varies widely depending on video quality and camera resolution), that this is not faster or better than simply constructing the model manually in some 3D modeling software.

6. Code And Full Results

MATLAB code and the full experimental results of this paper are available at https://github.com/tsell/reconstructing-roller-coasters.

References

[1] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. SURF: Speeded up robust features. Computer Vision and Image Understanding, 110(3):346–359, 2008.
[2] P. Beardsley, P. Torr, and A. Zisserman. 3D model acquisition from extended image sequences. Computer Vision, pages 683–695, 1996.
[3] FrontSeatCoasters. Six Flags Magic Mountain Goliath POV HD roller coaster on ride front seat GoPro steel 2013. Web, 2014. https://www.youtube.com/watch?v=N uV0Q2UH98.
[4] M. Frucci and G. S. di Baja. From segmentation to binarization of gray-level images. Journal of Pattern Recognition Research, 1:1–13, 2008.
[5] GoPro. Hero3 field of view (FOV) information. Web, 2016. https://gopro.com/support/articles/hero3-field-of-view-fov-information.
[6] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[7] MathWorks. detectSURFFeatures: Detect SURF features and return SURFPoints object. Web, 2016. http://www.mathworks.com/help/vision/ref/detectsurffeatures.html.
[8] MathWorks. edge: Find edges in intensity image. Web, 2016. http://www.mathworks.com/help/images/ref/edge.html.
[9] MathWorks. estimateFundamentalMatrix: Estimate fundamental matrix from corresponding points in stereo images. Web, 2016. http://www.mathworks.com/help/vision/ref/estimatefundamentalmatrix.html.
[10] MathWorks. kmeans: K-means clustering. Web, 2016. http://www.mathworks.com/help/stats/kmeans.html.
[11] MathWorks. knnsearch: Find k-nearest neighbors using data. Web, 2016. http://www.mathworks.com/help/stats/knnsearch.html.
[12] MathWorks. Structure from motion from multiple views. Web, 2016. http://www.mathworks.com/help/vision/examples/structure-from-motion-from-multiple-views.html.
[13] J. Nielsen. How to extract images from a video with avconv on Linux. Web, 2015. http://www.dototot.com/how-to-extract-images-from-a-video-with-avconv-on-linux/.
[14] T. Sellmayer. Rotate camera points fig. 13. Web, 2016. https://www.youtube.com/watch?v=N uV0Q2UH98.
[15] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment: A modern synthesis. Proceedings of the International Workshop on Vision Algorithms, pages 298–372, 1999.
Recovery and Reconstruction of Blackboard or 
Whiteboard images with occlusions 
 
Vijayaraj Gopinath 
vgopinat@stanford.edu 
Computer Vision: From 3D reconstruction to recognition 
CS231A 
Stanford University 
 
Abstract

We have all taken pictures of a blackboard for reference purposes, but the problem is that most of the time those pictures come with occlusions, such as a professor or other students standing in front of the board. Even if you can take multiple images, there will be one occlusion or another obstructing the blackboard information. By the time one student or professor has moved out of the way, another will often have moved in, making it difficult and time-consuming to get the perfect blackboard picture. We also can't simply wait for the right moment, since the blackboard can be erased multiple times, which makes this problem even more difficult. In this project we make use of multiple available images and automatically recover the blackboard by removing all the occlusions.

1 Review of previous work

We can find many works related to generic scene reconstruction with occlusion using multiple images, especially the work [1] from Microsoft Research, which tries to remove occlusions of landmarks or monuments from multiple pictures. Generic scene reconstruction requires many assumptions and workarounds, but in this work we are only concerned with a specific scenario: reconstructing blackboard images from occlusions such as people (professors or students) or any other objects.

2 Proposed solutions

In this work, the multiple pictures of the blackboard can be in any orientation. We assume the first picture dictates the required orientation of the result, and all other pictures are oriented towards the first picture. The proposed algorithm will also determine whether the given pictures are able to recover the blackboard (or whiteboard), and otherwise report an error. Since we have a specific scenario here, we tap into the properties of this scenario (blackboards only) to segment the occlusions with techniques such as background subtraction and image morphology, which provide better results than more complex, general-purpose segmentation techniques.

3 Summary of the technical solution

To solve this problem, we propose the following solution. 1, Since we are dealing with multiple images, find a homography between the first image and each of the other images, and use the found homographies to warp all other images to the first image. 2, Detect, segment and label all the occlusions in all the images.
Use image subtraction and other morphology techniques to detect and segment. 3, Once labelled, we need to find out which occlusions come from which original image and decide whether the scene can be recovered or not. 4, Using the labels and the identification of occlusions from the original images, recover the scene by copying pixels from non-occluded regions to the occluded regions. 5, Finally, blend the image to complete the reconstruction.

4.1 Finding Homography

It is important to have an accurate homography, since we later use image subtraction to identify occlusions. Nowadays it is common to have more whiteboards than blackboards, which makes finding the homography more challenging. In our experiments, we found that detecting rich features proved very difficult around whiteboards because of the homogeneous surface of the scene. Most of the found features also lie around the occlusion, since we often see a steep change in intensity levels near the occlusions.

Figure 1 - With outlier features.

In figure 1, you can see that a lot of features (more outliers than inliers) are found around the occlusion, since those regions have more intensity changes. We will also not always get enough features at the corners, since the color of the whiteboard sometimes closely matches the background of the scene itself. For all these reasons, we need user feedback to select the four corners of the blackboard for us. We use these four points to calculate our initial homography, and we remove all the outliers using RANSAC or other such methods.

Figure 2 - Scene after removing all outlier features.

Using only the inlier set which we found in the last step, we recalculate the homography, which becomes our final homography. We need this two-step approach because we need an accurate homography, and we assume that the user's selection of four points will not be accurate enough for the homography calculation but is accurate enough to remove outliers. We have used SURF features here. Using the final homography we warp all the images to match our base first image.
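A rough Python/OpenCV sketch of this two-step estimate is given below. ORB features stand in for SURF, outliers are rejected by their reprojection error under the corner-based homography (the text mentions RANSAC as one option), and the tolerance values and function name are our assumptions, not the implementation described here.

import cv2
import numpy as np

def align_to_base(base, other, base_corners, other_corners, tol=15.0):
    # Step 1: rough homography from the four user-clicked board corners.
    H0 = cv2.getPerspectiveTransform(np.float32(other_corners), np.float32(base_corners))

    # Detect and match features (ORB here; the text uses SURF).
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(cv2.cvtColor(other, cv2.COLOR_BGR2GRAY), None)
    k2, d2 = orb.detectAndCompute(cv2.cvtColor(base, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)

    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Keep only matches roughly consistent with the corner homography...
    err = np.linalg.norm(cv2.perspectiveTransform(src, H0) - dst, axis=2).ravel()
    keep = err < tol

    # ...then re-estimate the final homography from those inliers and warp.
    H, _ = cv2.findHomography(src[keep], dst[keep], cv2.RANSAC, 5.0)
    h, w = base.shape[:2]
    return cv2.warpPerspective(other, H, (w, h))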
   
 
4.2 Detecting, Segmenting and Labeling Occlusions

The first attempt we made at detecting the occlusion was face detection. Since most of the time our occlusion will be a person occluding the blackboard, it seemed very interesting to take this approach. We can find the bounding box for a face after detecting it, and we can use a face-to-body ratio to define a segment box that covers the entire occlusion.

Figure 3 - Face detection

This idea clearly has many issues. We will get some noise from detecting other small or non-faces in the images; the occlusion need not be a person; and even if the occlusions are assumed to be only people, a bounding box would be the wrong shape since a person can take other forms. After experimenting with many other methods, we found that morphological techniques suit this scenario much better.

Once we have an accurate homography and good warping, we can use image subtraction between the base image and every other image to remove everything except the occlusion. Although we found an accurate homography, the results after image subtraction will have noise around them, which we deal with later. After image subtraction, we can use global image thresholding with Otsu's method [2] to compute the global threshold level, which is normalized to [0, 1]. Using the found threshold level, we can convert the intensity image to a binary image.

Figure 4 - After image subtraction

Figure 5 - Binary image with noise

To remove noise in the binary image we can use morphological techniques: use a structuring element of type 'disk' with a radius of 3 and erode the picture. This step helps to remove the noise which we got after image subtraction and thresholding.
 
The binary image we found in the last step will have a lot of discontinuities. In order to get the complete segment, we need to dilate the binary image to get a supersized version of the occlusion. We can use a structuring element of type 'disk' with a radius of 25 to get the supersized occlusion. Figure 6 shows the binary image after dilation.

Figure 6 - Image dilation, disk radius 25

Depending on the intensity level of the occlusion in comparison to the blackboard or whiteboard, the found occlusion can potentially have holes inside it, which can be noticed in figure 6. Here the occlusion originally had an intensity level similar to the background, so after image subtraction and thresholding it has holes in it. We can find holes by finding the connected components in the binary image and locating the missing pixels inside them; we find such missing pixels and fill them. Figure 7 shows the binary image after filling up the holes. After finding the supersized occlusion, we erode the binary image with a structuring element of radius 22 to get back to the original size of the occlusion. Figure 8 shows the final binary image with the occlusion at its original size.

Figure 7 - After filling the holes

Figure 8 - Erode to get original size

After finding the binary image of the occlusions, we label all the independent connected components to get an accurate boundary for every occlusion in the scene.

Figure 9 - Color map of the found labels.
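Continuing the sketch above, the dilate, fill-holes, erode-back and label sequence might look as follows (radii 25 and 22 as in the text; scipy's binary_fill_holes is used here as a stand-in for the connected-component hole filling the text describes, and the function name is ours).

import cv2
import numpy as np
from scipy.ndimage import binary_fill_holes

def segment_occlusions(noisy_mask, grow_radius=25, shrink_radius=22):
    def disk(r):
        return cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * r + 1, 2 * r + 1))

    grown = cv2.dilate(noisy_mask, disk(grow_radius))        # supersized occlusion
    filled = binary_fill_holes(grown > 0).astype(np.uint8) * 255
    restored = cv2.erode(filled, disk(shrink_radius))        # back near original size
    num_labels, labels = cv2.connectedComponents(restored)   # one label per occlusion
    return num_labels - 1, labels                            # exclude the background label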
 
4.3 Mapping occlusions to the original image region

After the segmenting and labeling, it is important to map each found occlusion back to its original image. We assume that the occluding object has a totally different average intensity value in comparison to the blackboard itself. We can build a model based on the intensity levels around the occluding object and the occluded region in the original images, which can be used to map the objects.

Figure 10 - Recovered image

4.4 Recovering and Blending

Once we have identified, labeled and mapped the occlusions, we can recover the scene by copying pixels from the non-occluded portions of the images to the occluded portions. Since the copied pixels come from different images, the boundary of the recovered region will be visible in the final image, so we need to blend the image to complete our reconstruction. Figure 10 shows the final recovered image with all the occlusions removed.

5 Conclusion

Dealing with similar backgrounds in computer vision is very challenging; human eyes have evolved to handle this seamlessly. We address this in the future works section. Blackboard reconstruction by removing occlusions is a very interesting project and has an important application in the education domain. This could be a cool app which students install on a smartphone: take multiple pictures, and the app automatically recovers the entire blackboard scene in a self-contained manner.

6 Future works

1, As previously discussed, sometimes we won't get enough features to find the homography, and so we need the user to select corners for us. We can work on sophisticated rectangle detection to avoid user input. 2, Fine-tune the segmentation using other techniques such as the fast marching method. 3, For mapping occlusions to the original image, we currently build a model based on intensity values; this can be improved to a more sophisticated model using features. 4, After the scene has been reconstructed, we can try recognizing the text in it and creating a document.

REFERENCES

1. http://research.microsoft.com/pubs/69386/peoplemover.pdf
2. http://ijarcet.org/wp-content/uploads/IJARCET-VOL-2-ISSUE-2-387-389.pdf
3. http://www.eiti.uottawa.ca/school/research/viva/papers/homographie.pdf
4. http://www.cescg.org/CESCG-2016/papers/Jariabka-Generation_of_lecture_notes_as_images_from_recorded_whiteboard_and_blackboard_based_presentations.pdf
5. http://visual.cs.ucl.ac.uk/pubs/learningOcclusion/CVPR_2011_learning-occl.pdf
 
 
Real-Time Semi-Global Matching Using CUDA Implementation
Robert Mahieu and Michael Lowney
Stanford University, Department of Electrical Engineering
rmahieu@stanford.edu, mlowney@stanford.edu

Abstract—With the recent rise of technologies such as augmented reality and autonomous vehicles, there comes a necessity for speedy and accurate depth estimation to allow these products to effectively interact with their environments. Previous work using local methods to produce depth maps has generally been fast but inaccurate, and work using global methods has been accurate but too slow. A newer technique referred to as semi-global matching combines local and global methodologies to balance speed and accuracy, producing particularly useful results. This project focuses on the implementation of slight variations on the original algorithm set forth by Hirschmuller [2] to increase accuracy, and on using CUDA to accelerate the runtime. Results show sufficiently low error, though the runtime was found to be imperfectly optimized.

1. Introduction

Acquiring depth information from sets of images is incredibly important in many emerging fields such as augmented reality, robotics, autonomous vehicles, etc. However, these applications rely on the produced depth information to be both accurate and generated in a short amount of time—ideally close to real-time—to ensure safety of the system and users, as well as for reliable pose tracking.

Many algorithms for computing this information have been previously explored, either looking at localized information or global information throughout the entire image. Local techniques such as Winner-Takes-All (WTA) or scanline optimization (SO) [7] compute results for pixels independently and require minimal computation, but due to a lack of consideration for global trends they typically result in inaccurate conclusions. The dynamic programming [5] approach is also computationally efficient, but because the algorithm only looks at a single row per iteration, it also lacks consideration for global trends and commonly causes streaking patterns to show up in the output. On the other hand, while global techniques such as Graph Cuts [1] and Belief Propagation [9] produce more accurate results and better avoid the errors encountered in the local methods, these techniques are significantly more memory intensive and end up being much slower.

To obtain both reasonable accuracy as well as real-time performance, we instead move to what can be referred to as a semi-global matching technique [2], which makes some use of both local and global methods. Additionally, offloading data calculation onto the GPU, which is ideal for handling SIMD (single instruction multiple data) computation, allows us to exploit the parallelizable nature of our image-based calculations and significantly reduce computation time for estimating the optimal disparity map.

2. Problem Statement

In order to tackle this problem, we make the assumption that the input images are a rectified stereo pair. This is inherently the case when two cameras are orthogonal to the baseline and point in the same direction. Popular stereo vision datasets, such as the Middlebury dataset which is used in this paper [3][6][8], provide stereo pairs that have been rectified. The benefit of having rectified images is that the epipolar lines are horizontal and corresponding lines are at the same height in each image. This simplifies the problem because corresponding points will lie on the same epipolar line, and so we only have to search in horizontal directions.

Local methods are more prone to noise in their disparity maps due to the fact that there may be several local minima in their cost function. Because of this, the semi-global approach uses a model which penalizes changes in disparity values in local neighborhoods. This causes the resulting disparity map to be smoother by attenuating high frequency noise, which provides a clearer estimate of the true relative depth of the objects in the scene.

3. Technical Content

The implementation of the semi-global matching method comes down to minimizing an energy function describing the quality of a potential disparity image. This is represented by the expression below:
E(D) = \sum_p \Big( C(p, D_p) + \sum_{q \in N_p} P_1 \, \mathbf{1}\{|D_p - D_q| = 1\} + \sum_{q \in N_p} P_2 \, \mathbf{1}\{|D_p - D_q| > 1\} \Big)

where D is the disparity map, p is a pixel location on the map, D_p is the disparity value at pixel p, N_p is the neighborhood of pixels around p, \mathbf{1}\{\cdot\} is an indicator function that is equal to one if the argument within the braces is true and zero if false, and P_1 and P_2 represent the penalty values given to various changes in disparity within the local neighborhood. C(p, d) represents an initial cost function which is based on the absolute difference between gray levels at a pixel p in the base reference image and pixel p shifted by d along the epipolar line in the matching image (assumed to be the right image in the stereo pair):

C(p, d) = |I_b(p_x, p_y) - I_m(p_x - d, p_y)|

Note also, as stated in section 2, that the images are assumed to be rectified. The energy function thus penalizes heavily (with P_2) large jumps in the disparity map and less heavily (with P_1) small changes that may represent sloped surfaces. Note that P_1 < P_2. This allows us to reduce high frequency noise in the resulting disparity image.

Optimizing this energy function using global minimization in 2 dimensions is NP-complete, and requires too much computation to solve for many practical applications. On the other side of the spectrum, minimizing in 1 dimension over image rows, such as in the dynamic programming approach, is light on computation, but suffers from the accuracy issues discussed above. To handle this problem, semi-global matching leverages several 1-dimensional minimization functions to more efficiently construct an adequate estimate of the solution.

The first step in the actual implementation of the algorithm is to calculate initial costs for all pixels in the image pair at disparities ranging from 0 to some selected d_max. We found that using a d_max value of 64 was more than enough for all images we tested. This computation can be efficiently carried out on the GPU with a kernel that gives each thread a computation for one pixel and one disparity value. The system takes an RGB stereo pair as an input and then converts the images to grayscale. The cost function uses the difference in grayscale intensity values as a metric to determine how good or bad a potential match is. Once the images are converted to grayscale they are stored in texture memory on the GPU to increase speed. Texture memory is cached and allows for much faster reads than global memory.

To carry out the next step, denoted "cost aggregation", we iterate and compute the energy function locally over 8 directions (two horizontal, two vertical, and two for each diagonal). An example of the recursive expression is shown below for the horizontal direction, going from left to right across the image:

E(p_x, p_y, d) = C(p, d) + \min\big( E(p_x - 1, p_y, d),\ E(p_x - 1, p_y, d - 1) + P_1,\ E(p_x - 1, p_y, d + 1) + P_1,\ \min_i E(p_x - 1, p_y, i) + P_2 \big)

This step can be parallelized by having a different block for each direction, and within each block having each thread handle one iteration for one disparity. Note that after each step along the direction, the threads must be synchronized. This can be done by utilizing the CUDA command __syncthreads().

Once all paths have been traversed, the results are compiled into a single value:

S(p, d) = \sum_r w_r \cdot E_r(p, d)

where r represents a given direction and w_r represents a weight value for that particular direction. By introducing this weighting term, we account for the fact that results obtained from a certain path orientation may still lead to better estimates than the results from the other orientations. Therefore, we are able to weight each direction accordingly based on the scene type.

The final step is, for each pixel, to iterate over the calculated energy values and determine which disparity value corresponds to the minimum energy. This is represented by the following function:

D(p) = \arg\min_d S(p, d)

This can be parallelized on the GPU by having each thread handle all disparities for a given pixel. The disparities returned from this step represent the optimal disparity map that minimizes the energy function.
Figure 1 - First row shows the left input image of the stereo pair, second row is the depth map using semi-global matching, third row is ground truth.

Once we have a reasonable disparity map, the next step is to determine areas of occlusion in the image, meaning areas that are visible in the base image but blocked in the matching image. This can be done by running a slightly modified version of the algorithm defined above again to generate a disparity map for the match image. The only modification is in the initial cost function, which becomes:

C(p, d) = |I_m(p_x, p_y) - I_b(p_x + d, p_y)|

Once we have disparity maps for both the base and match image we can compare the results to identify occluded regions. For each pixel in the base disparity map we sample the disparity value. We then compare this to the value in the match disparity map at the same pixel location shifted by the base disparity value we just sampled. If these two values are the same (within some small tolerance), we judge them to be true correspondences; otherwise we mark them as occluded pixels and set them to zero in the base disparity map. This technique outputs a refined base disparity map.

Finally, to eliminate residual noise in the output we filter the disparity map using a median filter. Good results were observed while using a small kernel of 3x3. This allows us to keep the major edges and details in the map while removing the unwanted high frequency components.

A consideration worth noting is the memory requirements of this algorithm. The amount of memory used scales like O(m n d_max), where m is the number of rows in the image, n is the number of columns in the image, and d_max is the maximum number of disparity values. Storing the initial cost matrix and the matrices for the 8 search directions may exceed the total amount of memory on the GPU. For this paper the algorithm was run on a laptop with an NVIDIA GeForce 940M GPU with 2GB of memory. In order to stay below this 2GB threshold, the input images must be downsampled. Unless otherwise specified, the images are downsampled so that the number of columns is 450, with the number of rows scaled accordingly.
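Before moving to the results, the left-right consistency test described above can be sketched in a few lines of NumPy. The one-disparity tolerance and the sign convention (corresponding match pixel at x - d, following the cost definition C(p, d) = |I_b(p_x, p_y) - I_m(p_x - d, p_y)|) are assumptions of this sketch.

import numpy as np

def mark_occlusions(disp_base, disp_match, tol=1):
    # Zero out base disparities that fail the left-right consistency check.
    H, W = disp_base.shape
    ys, xs = np.indices((H, W))
    d = disp_base.astype(np.int64)
    match_x = np.clip(xs - d, 0, W - 1)              # corresponding column in the match map
    consistent = np.abs(d - disp_match[ys, match_x].astype(np.int64)) <= tol
    return np.where(consistent, disp_base, 0)        # occluded pixels set to zero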
4. Results

The quality of our algorithm was tested using the Middlebury dataset. Figure 1 shows the results of our algorithm compared to the ground truth of the depth map for various image pairs. For these trials we used the values suggested in [4] for P_1 and P_2. All w_r were set to the same value to ensure equal weighting. Table 1 shows the values of P_1 and P_2 used for our results.

Table 1: Penalty Values

      ↔      ↕      ↖↘     ↙↗
P1    22.02  17.75  14.93  10.67
P2    82.79  80.87  23.30  28.80

Qualitatively the results appear to be quite a close match to the ground truth. The regions towards the left border of the image are consistently unlabeled. This is due to the fact that they represent pixels that are only seen in the base image. Only pixels that are in the field of view of both cameras will result in accurate disparity values.

Table 2: MSE and runtime for images in the Middlebury dataset

IMAGE PAIR   MSE      Runtime
aloe         0.0296   3715 ms
books        0.0687   3394 ms
dolls        0.0712   3385 ms
laundry      0.0935   3517 ms
pots         0.1053   3607 ms
baby         0.0385   3759 ms
bowling      0.1083   3692 ms
art          0.0949   3385 ms
cones        0.0409   3152 ms
wood         0.0671   3403 ms

Table 2 shows the mean-squared error (MSE) between our experimental depth maps and their respective ground truths. Note that the error values are generally quite low, never reaching any higher than around 10% for any of the images we tested. Although differences in the scaling of depth to grayscale may be present between the experimental results and the ground truth, this appears to be minimal and therefore the MSE should still provide a good metric for analyzing the success of the algorithm.

While the performance of the algorithm seems to decrease in highly cluttered scenes such as the Art image pair or scenes with many strong gradients such as the Pots image pair, in general, variation in the scene structure appears to affect the quality of the produced depth map very little, both qualitatively and quantitatively.

These results also appear to indicate that our decision to weight the results from all path directions equally, as well as our decision to keep the penalty values constant throughout all tests, had relatively small impact on the quality of our depth maps. The results from Michael et al. [4] report that training these values for specific scenes does increase quality noticeably; however, our results seem to show that it is still possible to get reasonably good results without worrying about changing these parameters for every scene.

Unfortunately, due to time constraints on the project, we were unable to spend much time optimizing the CUDA implementation, so tests on runtimes have returned sub-optimal results. In the current somewhat naïve implementation, we are still able to get the runtime down to around one frame per second for images with sizes below about 250x217 (54250 pixels). The relationship between runtime and input image size is illustrated in Figure 2. As shown in the figure, the Semi-Global Matching algorithm has a runtime that scales linearly with the number of pixels in the input images. It is also worth noting that the graphics cards used in modern stereo research are much more powerful than the one used in this paper (a consumer grade laptop GPU). This difference in hardware is a main contributor to the longer runtimes found in this project.

Figure 2 - Runtime vs Image Size

It is important to note that one of the most significant factors in the runtime, however, is actually the resizing of images that occurs at the start of the program after they are read in by the CPU. This is necessary to ensure that we do not overload the GPU memory, however the time cost is very high. When images do not need to be resized on-the-fly, total speeds are greatly increased (about a 2x speedup). In future implementations, intelligent use of shared memory and memory access within warps should be able to dramatically increase performance. Some of these techniques are outlined by Michael et al. [4].

For posterity, to demonstrate the robustness of our algorithm, we also selected an arbitrary stereo image pair from reddit (https://www.reddit.com/r/crossview) of a Batman figurine. The results are displayed in Figure 3. While some poorly defined occlusion areas create a good bit of distracting noise in the depth map image, the large non-occluded areas actually show quite a high amount of detail.
the level of the muscles on the figurine. has been shown to be the case in [2][4].

The error could be further reduced in the future by


changing how we handle occlusions. Currently we are able
to detect occlusions by comparing the depth map for both
the base and match image. Once occlusions are found they
are set to zero, which can cause an increase in the measured
MSE. By adapting an interpolation technique as suggest by
[2] we can increase clarity of the depth map and increase
the accuracy.

Implementing a mutual information based initial cost


function can also further increase the accuracy of the depth
maps. We experimented with a basic mutual information
approach, but were unable to produce reasonable outputs.
We believe this is due to the need of multiple iterations for
the depth map to converge. Future work would also
explorer further the concept of mutual information, and
finding a way to incorporate it into stereo matching with as
little iterations as possible.

The results from this project further indicate that Semi-Global Matching is a robust approach to estimating depth information from a scene. Semi-Global Matching balances the tradeoff between speed and accuracy well, making it a strong contender for use in new technologies.

5. Conclusions

In this paper we demonstrate the effectiveness of the Semi-Global Matching technique for depth estimation from a set of stereo images. Leveraging a combination of global and local techniques, our results exhibited low error when compared to ground truths. With more powerful hardware and an optimized CUDA implementation, our initial results imply that this would be possible to run in real time; this has been shown to be the case in [2][4].

Figure 3 - Batman robustness test

Link to code: https://github.com/rmahieu/SemiGlobalMatching

References

[1] Y. Boykov, O. Veksler, and R. Zabih. Efficient approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222-1239, 2001.
[2] H. Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), pp. 807-814, vol. 2, 2005.
[3] H. Hirschmuller and D. Scharstein. Evaluation of cost functions for stereo matching. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), Minneapolis, MN, June 2007.
[4] M. Michael, J. Salmen, J. Stallkamp, and M. Schlipsing. Real-time stereo vision: Optimizing semi-global matching. In Intelligent Vehicles Symposium (IV), 2013 IEEE, pp. 1197-1202. IEEE, June 2013.
[5] G. Van Meerbergen, M. Vergauwen, M. Pollefeys, and L. Van Gool. A hierarchical symmetric stereo algorithm using dynamic programming. International Journal of Computer Vision, 47(1/2/3):275-285, April-June 2002.
[6] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003), volume 1, pages 195-202, Madison, WI, June 2003.
[7] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47:7-42, 2002.
[8] D. Scharstein and C. Pal. Learning conditional random fields for stereo. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), Minneapolis, MN, June 2007.
[9] J. Sun, H. Y. Shum, and N. N. Zheng. Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(7):787-800, July 2003.
Appendix:

Solving Large Jigsaw Puzzles


L. Dery and C. Fufa

Abstract—This project attempts to reproduce the genetic algorithm in the paper "A Genetic Algorithm-Based Solver for Very Large Jigsaw Puzzles" by D. Sholomon, O. David, and N. Netanyahu [3]. There are two main challenges in solving jigsaw puzzles. The first is finding the right fitness function to judge the compatibility of two pieces. This has been thoroughly studied, and as a result there are many fitness functions available. This paper explores the second part that is crucial to solving jigsaw puzzles: finding an efficient and accurate way to place the pieces. The genetic algorithm attempts to do just that. The crucial part of the algorithm is generating a new ordering of pieces, called a 'child', from two possible orderings of pieces, called 'parents'. Each generation learns from good traits in the parents, and after going through a hundred generations the ordering reflects the original image to a high accuracy. This paper also makes use of a CNN to start with reasonable orderings of 'parents', which cuts down on the number of generations required to reach the correct ordering of the pieces.

Index Terms—CS231A, Jigsaw Puzzles, Algorithms

1 INTRODUCTION

The problem of automating the solving of jigsaw puzzles is one that has been around since at least the 1950s. Jigsaw puzzles are image reconstruction problems where the image provided has been cut into non-overlapping boxes and shuffled around. The problem is then to reconstruct the original image from the shuffled pieces. For the problem to be tractable, the puzzle pieces are assumed to be of identical dimensions and no piece is assumed to have been rotated.
The problem has multiple applications, both in and outside of image reconstruction. Puzzle-solution techniques can be applied to broken tiles to simulate the reconstruction of archaeological artifacts. In fall 2011, DARPA held a competition, with a fifty-thousand-dollar prize, to automatically reconstruct a collection of shredded documents. Other applications include the molecular docking problem for drug design, DNA/RNA modeling, image-based CAPTCHA construction, and speech descrambling.

2 REVIEW OF PREVIOUS WORK

2.1 Previous Work
In 1964, Freeman and Garder [1] proposed a solution for a 9-piece problem, in which the shapes were allowed to be of different dimensions. After Freeman and Garder, most of the work has been based on color-based solvers, with the assumption that all pieces are rectangles of the same dimension. More recently, probabilistic puzzle solvers handling up to 432 pieces have been developed [2]; these solvers, however, require a priori knowledge of the puzzle. There are also particle filter-based solvers, which are improvements over the probabilistic puzzle solvers.
In 2013, Sholomon et al. [3] introduced a genetic algorithm-based technique for solving large jigsaw puzzles. It is our goal in this paper to replicate the results of that paper and also suggest areas where it could be improved.

2.2 This paper's contribution
This paper uses the genetic algorithm as a strategy for piece placement, together with a standard estimation function. While this is not the first time that the genetic algorithm has been used to solve the jigsaw puzzle problem, it has only been used to solve puzzles of a limited size; this paper attempts to solve puzzles with more pieces. In addition to the genetic algorithm, this paper also attempts to use a CNN to arrive at the correct reconstruction of the image in fewer iterations.

3 TECHNICAL DETAILS

3.1 Genetic Algorithm
The genetic algorithm as implemented for solving the jigsaw puzzle problem starts out with a thousand different ways to order the pieces. Each way of ordering the pieces is called a chromosome, and the entire set of a thousand chromosomes is called a population. At each stage of the process, called a generation, we have a population of a thousand chromosomes. The goal is that with each passing generation, i.e. with the next thousand chromosomes or population, the orderings of the pieces will begin to look more and more like the original, correct image. During each generation, the best chromosome is determined by the estimation function.
Above is the higher-level pseudocode for the Genetic Algorithm framework. The four best scorers according to the estimation (fitness) function are automatically placed into the next generation. The rest of the chromosomes for the next generation are hybrids of chromosomes from the current one: two chromosomes from the current population are selected, and a function called crossover generates a child chromosome that learns from the parents and has a better reordering of the pieces, and hence a better fitness score. It is via this mechanism that each generation gets a better fitness score than the previous generation. The selection of which parents give birth to a new child chromosome favors parents with a better fitness score. This selection process is called roulette selection: the likelihood of being selected is directly proportional to how good the fitness score is. This way, the algorithm makes sure that selected parent chromosomes have good traits (as evidenced by their fitness scores) to pass on to the children.

Fitness Function
The estimation function exploits the fact that adjacent pieces in the original image will most likely share similar colors along their edges. Hence, computing the sum of the squared color differences along pixels that are adjacent to each other (between two different pieces) gives an indication of whether the two pieces belong adjacently in the direction in which they share the pixels: the smaller this sum is, the more likely they are to be adjacent to each other. From the image below, for example, we can expect the fitness function to give a high score for pieces 5 and 6, as the color difference along their shared edge appears to be high, while pieces 8 and 9 will have a very low fitness score. We can further assume that pieces 5 and 8 will have a high fitness score while 6 and 9 will have a lower one in comparison.
The fitness function for a given chromosome computes the sum of this score for every edge in the chromosome. Below are examples of functions which compute the compatibility score of two pieces in a left-right adjacency relationship, and a function which computes the fitness score of a given chromosome (i.e. the score over all edges and directions).
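The equations for these functions appear as figures in the original report and did not survive text extraction; the following Python sketch is our reconstruction of the standard formulation (sum of squared RGB differences along the shared edge, as in Sholomon et al. [3]), not the authors' exact code.

import numpy as np

def edge_dissimilarity(piece_i, piece_j):
    # D(x_i, x_j, right): cost of placing piece_j immediately to the right of
    # piece_i. Pieces are (K, K, 3) arrays; lower means more compatible.
    right_edge = piece_i[:, -1, :].astype(float)   # last column of the left piece
    left_edge = piece_j[:, 0, :].astype(float)     # first column of the right piece
    return np.sum((right_edge - left_edge) ** 2)

def chromosome_fitness(chromosome, pieces):
    # Sum of the edge scores over every horizontal and vertical adjacency in a
    # chromosome, given as a 2D grid of piece indices.
    rows, cols = chromosome.shape
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:   # left-right neighbour
                total += edge_dissimilarity(pieces[chromosome[r, c]],
                                            pieces[chromosome[r, c + 1]])
            if r + 1 < rows:   # top-bottom neighbour: bottom row vs. top row
                top = pieces[chromosome[r, c]][-1, :, :].astype(float)
                bottom = pieces[chromosome[r + 1, c]][0, :, :].astype(float)
                total += np.sum((top - bottom) ** 2)
    return total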

Here K is the number of pixels along each piece in the vertical direction. Summed this way, the function covers all the available edges in a chromosome. Note that D is the fitness score for the compatibility of piece xj in the given direction (left, right, down, or up) of xi. The selection process of the algorithm (roulette selection) must ensure that a lower fitness score is treated as more likely to be chosen.

Crossover
Crossover can be considered the heart of the algorithm. Crossover receives two parent chromosomes and creates a child chromosome, allowing "good traits" to be transmitted from the parents to the child; the goal is for the child to attain a better fitness score than both parents. The fitness function does a good job of discriminating between adjacent pieces, but does not give any indication of whether the pieces are placed at the correct absolute position in the image. The implementation of crossover must therefore allow for independence in the placement of pieces. (It should be a dynamic process: just because a piece was at some point assigned to, say, position (2,3) of the image, it must not remain there; it should be able to transition into a different place based on how the pieces build up around it.)
The suggested implementation of crossover starts out with a single piece and then gradually joins other pieces at available boundaries. The image is always contiguous, since new pieces are only added adjacent to existing ones. Keeping track of the pieces used and the dimensions of the child being formed is important, so that the dimensions of the child are similar to those of the parents. The process of growing the kernel goes on until all the pieces have been used.
The final absolute location of a given piece is only determined after all the pieces have been used. This is because, as noted above, the kernel-growing process must allow for independence or flexibility in the placement as the algorithm plays out. To begin, crossover selects a random piece from either parent and places it in the kernel. After that, it keeps track of all the available boundaries where a new piece can be added to the kernel; an available boundary can be thought of as a piece plus the direction in which a new piece can be placed adjacent to it. There are three main phases involved in crossover.

Phase One
This phase goes through the boundary pieces in the kernel. Say that piece xi in the direction d is selected. Phase one checks whether both parents have the same piece xj in the direction d of xi. If so, xj is added to the kernel; if xj has already been added, it is of course skipped. The only pieces under consideration are unused pieces (pieces not in the kernel). This phase keeps going until there is no boundary on which both parents agree.

Phase Two
Assume (xi, R) is available on the kernel. Check whether one of the parents contains a piece xj in spatial relation R of xi which is also a best-buddy of xi in that relation. Two pieces xi and xj are considered best-buddies if D(xi, xj, R) is the lowest fitness score they can achieve, i.e. there is no better piece xk that gives a lower fitness score D(xk, xj, R), and no available xk that gives D(xi, xk, R) lower than D(xi, xj, R). The piece xj must also be adjacent to xi in one of the parents. If such a piece is found, go back to phase one; if not, proceed to phase three.

Phase Three
Pick a random (xi, R) from the kernel and assign to it the piece xj from the available pieces such that D(xi, xj, R) is lowest, then go back to Phase One. The three phases keep going until all the pieces are used up. Mutation is introduced in phases one and three: with a 5 percent probability, a random available piece is assigned instead of the piece both parents agree on in phase one, or instead of the most compatible piece in phase three.
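To make the best-buddy test in Phase Two concrete, here is a small sketch (our illustration, not the authors' code) that extracts all best-buddy pairs from a precomputed dissimilarity tensor; it assumes the diagonal entries D[i, i, r] have been set to infinity so a piece cannot pair with itself.

import numpy as np

def best_buddy_pairs(D):
    # D[i, j, r] is the dissimilarity of placing piece j in spatial relation r
    # (e.g. 0 = right-of, 1 = below) of piece i. Pieces i and j are best buddies
    # in relation r when each is the other's lowest-cost partner for that relation.
    n_pieces, _, n_relations = D.shape
    pairs = set()
    for r in range(n_relations):
        best_j_for_i = D[:, :, r].argmin(axis=1)  # cheapest partner j for each i
        best_i_for_j = D[:, :, r].argmin(axis=0)  # cheapest partner i for each j
        for i in range(n_pieces):
            j = int(best_j_for_i[i])
            if best_i_for_j[j] == i:
                pairs.add((i, j, r))
    return pairs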

3.2 Convolutional Neural Network Augmentation
The algorithm as proposed always requires 1000 randomly initialized chromosomes. As an extension to the algorithm proposed by Sholomon et al. [3], we decided to try to influence the starting state of the Genetic Algorithm. The rationale is that if the genetic algorithm starts out with chromosomes that are already quite good, then convergence would be faster and the number of generations required by the algorithm could be reduced. We therefore trained a convolutional neural network to solve the jigsaw task and used its output as input to the Genetic Algorithm.

Problem Formulation
The input to the neural network is an image whose color channels are made up of the jigsaw pieces stacked side by side. The task of the network is to predict the order in which the pieces were stacked, thus assigning each piece its right position in the original image. To clarify, say we have a 3x3 puzzle. The pieces are numbered 1 to 9 according to their position in the original image, and are then stacked in a random configuration along the color channel; it is the task of the network to predict the configuration in which they were stacked. We cast this as a classification problem. Since the configuration space is very large (9 pieces produce 9! = 362,880 possibilities), it would be near impossible, given the computing resources at our disposal, to have 9! classes. We therefore reduced the problem as follows: we keep 100 classes representing 100 randomly generated configurations, and the objective of the network is to predict the configuration (1-100) that has the closest Hamming distance to the actual configuration of the given image. This workaround made the solution space of the problem tractable.

Network Architecture and Implementation Details
The diagram below shows the network structure. Our implementation of the network was done in TensorFlow. We used a softmax cross-entropy loss function. A hyperparameter search led us to the Adam optimizer with a learning rate of 10^-3 and a batch size of 128. We normalized the image channels to 1 by dividing by 255.

3.3 Evaluation Metric
There are two major metrics used to evaluate a jigsaw reconstruction. The direct comparison measures the fraction of pieces located in their correct absolute location, and the neighbor comparison measures the fraction of correct neighbors. The direct method has been shown [1] to be less accurate and less meaningful, since it cannot handle slightly shifted cases. We thus used the neighbor comparison as our evaluation metric.
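For illustration, the neighbor comparison can be computed directly from the reconstructed and ground-truth grids of piece indices; the sketch below is our own formulation of the metric, not code from the report.

import numpy as np

def neighbor_accuracy(reconstruction, ground_truth):
    # Fraction of adjacent piece pairs in the ground-truth grid that are also
    # adjacent, in the same relative position, in the reconstruction.
    reconstruction = np.asarray(reconstruction)
    ground_truth = np.asarray(ground_truth)
    # Map each piece id to its (row, col) position in the reconstruction.
    pos = {int(p): (r, c) for (r, c), p in np.ndenumerate(reconstruction)}

    correct, total = 0, 0
    rows, cols = ground_truth.shape
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):      # right and down neighbors
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    total += 1
                    pr, pc = pos[int(ground_truth[r, c])]
                    qr, qc = pos[int(ground_truth[rr, cc])]
                    if (qr - pr, qc - pc) == (dr, dc):
                        correct += 1
    return correct / total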
4 EXPERIMENTS

4.1 Convolutional Neural Network
Given that we cast our solution space for the network into 100 classes, we set our baseline to be a validation accuracy of 0.01, corresponding to random guessing out of the 100 classes. Our convolutional neural network was able to achieve a validation accuracy of 0.022, which is more than twice our baseline. This performance is quite impressive considering that what the network is actually trying to achieve is predicting a configuration space of 9! that is being represented by 100 classes. The more puzzles provided to the network to solve, the better it got, as the plot below suggests.

4.2 Solving Jigsaws via Genetic Algorithm
The table below shows the run time and accuracy results for different puzzle piece counts, averaged over 10 runs. As can be seen from the table, we were able to achieve results comparable to those of the original paper in terms of the accuracy of rearrangement of the jigsaw pieces and the fitness score of the result returned. Our reconstructions, though not always as accurate as the original by the neighbor accuracy metric described above, had fitness scores equal to the fitness score of the original image. This suggests that the algorithm sometimes gets stuck in a local minimum whose fitness score is the same as the un-jumbled image's score.

Figure 1. Algorithm performances on different piece numbers.

In the domain of run time, however, we were unable to match the original paper's results. The paper describes solving a 432-piece puzzle in 43.63 seconds, whereas our implementation takes around 2 hours on a puzzle of the same size. We believe this large difference is due to differences in the specifics of the implementation of the crossover function.

Figure 2. 96 pieces, generation 1. Left: best reconstruction so far. Right: second best.

Figure 3. 96 pieces, generation 100. Left: actual image. Right: reconstructed image.

4.3 Genetic Algorithm + Convolutional Neural Network (CNN)
As an augmentation to the original algorithm, we fed the reconstruction output of the CNN as the starting population of the Genetic Algorithm. We were mainly interested in two effects:
1. Did the run time of the algorithm improve?
2. How did the accuracy of the reconstruction change?

The table above contrasts the performance of the CNN augmentation in different regimes of training data size with the pure Genetic Algorithm performance on the 3x3 jigsaw. In general, the augmented algorithm has a better run time. Though the difference is 1 second for this regime of 3x3 puzzles, it is easy to see how this gain could become more significant as the puzzle size increases. In general, the reconstructed puzzles from the augmented model are not as accurate as those of the pure Genetic Algorithm. However, the reconstructions always had the same minimal fitness score as the original, meaning that the algorithm is finding a good minimum, even if it is not the original reconstruction.

5 CONCLUSION
The jigsaw puzzle problem is an interesting problem with applications in many domains. Looking forward, one extension we plan to explore is to solve the jigsaw problem using only a neural network. We envision embedding convolutional layers in a Long Short-Term Memory or Recurrent Neural Network which would directly predict the right configurations instead of using our current trick of having 100 representative configurations. We would also like to investigate more avenues for improving the run time of our current model.

REFERENCES
[1] H. Freeman and L. Garder. Apictorial jigsaw puzzles: The computer solution of a problem in pattern recognition. IEEE Transactions on Electronic Computers, EC-13(2):118-127, 1964.

[2] T. Cho, S. Avidan, and W. Freeman. A probabilistic image jigsaw puzzle solver. In IEEE Conference on Computer Vision and Pattern Recognition, pages 183-190, 2010.
[3] D. Sholomon, O. E. David, and N. S. Netanyahu. A genetic algorithm-based solver for very large jigsaw puzzles. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1767-1774, 2013.
[4] D. Pomeranz, M. Shemesh, and O. Ben-Shahar. A fully automated greedy square jigsaw puzzle solver. In IEEE Conference on Computer Vision and Pattern Recognition, pages 9-16, 2011.
Video Stabilization and Face Saliency-based Retargeting

Yinglan Ma (1), Qian Lin (2), Hongyu Xiong (2)

(1) Department of Computer Science, Stanford University   (2) Department of Applied Physics, Stanford University

Abstract

The technology revolution has brought great convenience to daily life recording using cellphones and wearable devices. However, hand shake and human body movement are likely to happen during the capture period, which significantly degrades the video quality. In this work, we study and implement an algorithm that automatically stabilizes shaky videos. We first calculate the video motion path using feature matching and then smooth out high-frequency undesired jitters with L1 optimization. The method ensures that the smoothed paths are composed only of constant, linear and parabolic segments, mimicking the camera motions employed by professional cinematographers. Since human faces are of broad interest and appear in a large amount of videos, we further incorporate a face feature detection module for video retargeting purposes. The detected faces in the video also enable many potential applications, and we add decoration features in this work, e.g., glasses and hats on the faces.

1. Introduction

Nowadays nearly 2 billion people own smartphones worldwide, and an increasing number of videos are captured by mobile devices. However, videos captured by hand-held devices are often shaky and undirected due to the lack of stabilization equipment on the handheld device. Even though there are commercial hardware components that can stabilize the device while we record, they are relatively cumbersome and not handy for daily use. Moreover, most hardware stabilization systems only remove high-frequency jitters but are unable to remove low-frequency motions arising from panning shots or walking movements. Such slow motion is particularly problematic in shots that intend to track a prominent foreground object or person.
To overcome the above difficulties, we implement a post-processing video stabilization pipeline aiming to remove undesirable high- and low-frequency motions from casually captured videos. Similar to most post-processing video stabilization algorithms, our implementation involves three main steps: (1) estimate the original shaky camera path from feature tracking in the video; (2) calculate a smoothed path, which is cast as a constrained optimization problem; (3) synthesize the stabilized video using the calculated smooth camera path. To reduce high-frequency noise, we use the L1 path optimization method described in [1] to produce purely constant, linear or parabolic segments of smoothed motion, which follows cinematographic rules. To reduce low-frequency swanning in videos containing a person as the central object, we apply a further constraint to the motion of the facial features. In order to make the solution approachable, our method uses automatic feature detection and does not require user interaction.
Our video stabilization method is a purely software approach, and can be applied to videos from any camera device or source. Another popular class of mobile video stabilization methods uses the phone's built-in gyroscope to measure the camera path. Our method has the advantage of being applicable to any video from any source, for example online video, without any prior knowledge of the capturing device or other physical parameters of the scene. Our approach also enables facial retargeting, which can be extended to other kinds of salient features.

2. Previous Work

2.1. Literature
Video stabilization methods can be categorized into three major directions: 2D methods, 3D methods and motion estimation methods.
2D methods estimate frame-to-frame 2D transformations, and smooth the transformations to create a more stable camera path. Early work by Matsushita et al. [5] applied low-pass filters to smooth the camera trajectories. Gleicher and Liu [4] proposed to create a smooth camera path by inserting linearly interpolated frames. Liu et al. [6] later incorporated subspace constraints in smoothing camera trajectories, but this required longer feature tracks.
3D methods rely on feature tracking to stabilize shaky videos. Buehler et al. [8] utilized projective 3D reconstruction to stabilize videos from uncalibrated cameras. Liu et al. [9] were the first to introduce content-preserving warping in video stabilization. However, 3D reconstruction is difficult and not robust.
Liu et al. [6] reduced the problem to smoothing long feature trajectories, and achieved results comparable to 3D-reconstruction-based methods. Goldstein and Fattal [10] proposed an epipolar transfer method to avoid direct 3D reconstruction. Obtaining long feature tracks is often fragile in consumer videos due to occlusion, rapid camera motion and motion blur; Lee et al. [11] incorporated feature pruning to select more robust feature trajectories and resolve the occlusion issue.
Motion estimation methods calculate transitions between consecutive frames with view overlap. To reduce the alignment error due to parallax, Shum and Szeliski [12] imposed local alignment, and Gao et al. [7] introduced a dual-homography model. Liu et al. [13] proposed a mesh-based, spatially-variant homography model to represent the motion between video frames, but the smoothing strategy did not follow cinematographic rules.
Our implementation, based on [1], applies L1-norm optimization to generate a camera path that consists of only constant, linear and parabolic segments, following the cinematographic principles used in producing professional videos.

2.2. Our Contribution
In this work, we re-implement the L1-norm optimization algorithm [1] to automatically stabilize captured videos, with a smoothed feature path containing only constant, linear and parabolic segments. Additionally, in order to enable the video to retarget on human faces, we use the facial landmark detection algorithm from the OpenFace toolkit [3] to set facial saliency constraints for the path smoothing; the strength of the constraint can be tuned from 0 (no facial retargeting) to 1 (video fixed on facial features), so that we are able to combine video path smoothing and facial retargeting according to specific user needs.
Beyond that, in order to make our work more fun, we also attach decorations such as a hat, glasses, and a tie above, on, or below the detected human faces; their transformations are based on the movement of the human face in the video.

3. Proposed Method

3.1. L1-Norm Optimized Video Stabilization
In this section, we describe the video stabilization method used in this work.

3.1.1 Norms of smoothing
When applying a path smoothing algorithm, we should always be careful about which regularization method we use, since different regularization methods work differently for different error distributions [2]. For error distributions with sharply defined edges or extremes (typified by the uniform distribution), one should use Tchebycheff (L-infinity) smoothing. For error distributions at the other end of the spectrum, with long tails, one should use L1 smoothing. In between these extremes, for short-tail spectra such as the normal distribution, least squares or L2 smoothing appears to be best.

3.1.2 L1-Norm Optimization
From the perspective of a single feature point, the video motion can be viewed as a path of its coordinates (x, y) with respect to the frame number. Since it is difficult to avoid jitters with hand-held devices, we will observe that the path is wiggling. Video stabilization is to obtain new coordinates at each frame and thus a new path with enhanced smoothness. From the perspective of the frames, the task is to smooth the transformations between frames so that the feature point movement is minimal. The frame transformation is generalized as an affine transform, including translational and rotational motion, and scaling caused by changes in object/camera distance.
We estimate the camera path by first matching features between consecutive frames Ct and Ct+1, and then calculating the affine transformation Ft+1 based on the matching. That is, the process can be written as Ct+1 = Ft+1 Ct, and we estimate Ft+1 from the two sets of feature coordinates Ct and Ct+1. In this work, we extract features of each frame (OpenCV function cv::goodFeaturesToTrack), and find the matching in the next frame using the iterative Lucas-Kanade method with pyramids (cv::calcOpticalFlowPyrLK).
We denote the smoothed features as Pt; then the original features in frame t and the smoothed ones are related by Pt = Bt Ct, where Bt is the stabilization/retargeting matrix transforming the original features to the smoothed ones. Since we only want the smoothed path to contain constant, linear, and parabolic segments, we minimize the first, second, and third derivatives of the smoothed path with weights c = (c1, c2, c3)^T:

    O(P) = c1 |D(P)|_1 + c2 |D^2(P)|_1 + c3 |D^3(P)|_1,    (1)

where

    |D(P)|_1   = sum_t |P_{t+1} - P_t|_1 = sum_t |R_t|_1,
    |D^2(P)|_1 = sum_t |R_{t+1} - R_t|_1,                   (2)
    |D^3(P)|_1 = sum_t |R_{t+2} - 2 R_{t+1} + R_t|_1.

Here the residual is R_t = B_{t+1} F_{t+1} - B_t. Each affine transform

    B_t = [ b11  b12  t_x ]
          [ b21  b22  t_y ]                                 (3)

has 6 degrees of freedom, so we vectorize it as p_t = (b11, b12, b21, b22, t_x, t_y)^T, which is the parametrization of B_t; correspondingly,

    |R_t(p)|_1 = |p_{t+1}^T M(F_{t+1}) - p_t|_1.            (4)
We make use of the Linear Programming (LP) technique to solve this L1-norm optimization problem. To minimize |R_t(p)|_1 in an LP, we introduce slack variables e1 >= 0 such that -e1 <= R_t(p) <= e1; similarly there are e2 and e3 for |R_{t+1}(p) - R_t(p)|_1 and |R_{t+2}(p) - 2R_{t+1}(p) + R_t(p)|_1, respectively. For e = (e1, e2, e3)^T, the objective of the problem is to minimize c^T e.
In addition, we want to limit how much B_t (or p_t) can deviate from the original path, i.e. the actual shift should stay within the cropping window. Thus, we can add constraints on the parameters in the LP, such as lb <= U p_t <= ub, where U is the linear combination coefficient of p_t. The complete L1-minimization LP for the smoothed video path with constraints is summarized below:

Algorithm 1: Summarized LP for the smoothed video path
Input: frame-pair transforms F_t, t = 1, 2, ..., n
Output: update transforms B_t (each B_t can be converted to p_t)
Minimize: c^T e
  w.r.t. p = (p_1, p_2, ..., p_n),
  where e = (e1, e2, e3)^T, e_i = (e_i1, e_i2, ..., e_in), c = (c1, c2, c3)^T
Subject to:
  1. -e1_t <= R_t(p) <= e1_t
  2. -e2_t <= R_{t+1}(p) - R_t(p) <= e2_t
  3. -e3_t <= R_{t+2}(p) - 2R_{t+1}(p) + R_t(p) <= e3_t
  4. e_it >= 0
Constraints:
  lb <= U p_t <= ub

We use the lpsolve library for modeling and solving our LP system.
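To make the slack-variable construction concrete, here is a simplified one-dimensional sketch of the same LP (smoothing a scalar path rather than the full 6-DOF affine parametrization above), built with scipy.optimize.linprog instead of lpsolve; the weights and crop radius are arbitrary illustrative values, and at least four frames are assumed.

import numpy as np
from scipy.optimize import linprog

def smooth_path_1d(orig, c=(10.0, 1.0, 100.0), radius=30.0):
    # Minimize c1*sum(e1) + c2*sum(e2) + c3*sum(e3) over the smoothed path p and
    # slack variables bounding its 1st, 2nd and 3rd differences, while keeping
    # p within `radius` of the original path (the cropping-window constraint).
    orig = np.asarray(orig, dtype=float)
    n = len(orig)                              # assumes n >= 4
    n1, n2, n3 = n - 1, n - 2, n - 3
    n_vars = n + n1 + n2 + n3                  # path values + slacks e1, e2, e3

    cost = np.concatenate([np.zeros(n),
                           np.full(n1, c[0]),
                           np.full(n2, c[1]),
                           np.full(n3, c[2])])

    A, b = [], []

    def bound_abs(diff_coeffs, slack_index):
        # Encode |diff_coeffs . p| <= e_slack as two linear inequalities.
        for sign in (1.0, -1.0):
            row = np.zeros(n_vars)
            row[:n] = sign * diff_coeffs
            row[slack_index] = -1.0
            A.append(row)
            b.append(0.0)

    for t in range(n1):                        # |p[t+1] - p[t]| <= e1[t]
        d = np.zeros(n); d[t + 1], d[t] = 1.0, -1.0
        bound_abs(d, n + t)
    for t in range(n2):                        # |p[t+2] - 2p[t+1] + p[t]| <= e2[t]
        d = np.zeros(n); d[t + 2], d[t + 1], d[t] = 1.0, -2.0, 1.0
        bound_abs(d, n + n1 + t)
    for t in range(n3):                        # third difference bounded by e3[t]
        d = np.zeros(n); d[t + 3], d[t + 2], d[t + 1], d[t] = 1.0, -3.0, 3.0, -1.0
        bound_abs(d, n + n1 + n2 + t)

    bounds = [(orig[t] - radius, orig[t] + radius) for t in range(n)]
    bounds += [(0, None)] * (n1 + n2 + n3)     # non-negative slacks

    result = linprog(cost, A_ub=np.array(A), b_ub=np.array(b),
                     bounds=bounds, method="highs")
    return result.x[:n]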
3.2. Facial Feature Detection and Retargeting
In many videos a particular subject, usually a person, is featured. In this case it is not only important to remove fast, jittering camera motions, but also unintended slow panning or swanning that momentarily moves the subject off-center and distracts the viewer. This can be posed as a constraint on the path optimization, requiring that salient features of the subject stay close to the center region throughout the video.
The first step towards salient-point-preserving video stabilization is salient feature detection and tracking. In particular, it is desirable for the algorithm to automatically recognize and detect these salient features without user input. There are many face detectors available for this task. We use Constrained Local Neural Fields (CLNF) for facial landmark detection, available in OpenFace; details of the algorithm can be found in [3]. The CLNF algorithm works robustly under varied illumination and is stabilized for video. It outputs a fixed number of facial landmarks, including the face silhouette, the lips, the nose tip and the eyes, as shown in Fig. 2(c). These multiple landmarks allow a more stable and accurate estimate of the facial position. In contrast, other face detectors, for example the OpenCV built-in ones, were observed to produce inaccurate bounding boxes and were not stable over video frames in our experiments. The detailed facial landmarks from CLNF also enable other post-processing on the video, for example the face decoration described in Section 3.4.
After detecting the facial landmarks in each frame t, we estimate the center of the face C_f,t by averaging all the landmarks. Let C_0 be the desired position of the center of the face, for example the center of the frame. Let P_t and S_t be the original and smoothed camera trajectories; then the saliency constraint can be posed as an additional term in the loss function

    L_t = (1 - w_s)(S_t - P̄_t)^2 + w_s (S_t - P_t + C_f,t - C_0)^2,    (5)

where P̄_t is an average over a window of frames, and w_s is a parameter that adjusts how much weight the saliency constraint has in the optimization. Minimizing L_t then produces the desired smoothed trajectory S_t.

3.3. Metrics & Characterization

3.3.1 Evaluation of Smoothed Path
For the stabilization problem we are concerned with, it would be inappropriate to simply regard the undesired shaking as a short-tail normal distribution, so using the L1 norm between each frame pair during minimization is more suitable. In addition, L1 optimization has the property that the resulting solution is sparse, i.e. the computed path has derivatives which are exactly zero for most segments. On the other hand, L2 minimization (in a least-squares sense) tends to result in small but non-zero gradients. Qualitatively, an L2-optimized camera path always has some small non-zero motion (most likely in the direction of the camera shake), while the L1 optimization we used (on |D(P)|_1, |D^2(P)|_1, and |D^3(P)|_1) creates a path composed only of segments resembling a static camera, (uniform) linear motion, and constant acceleration [1].
Therefore, we compare the L1 norm |D(P)|_1 between the original video feature path and the smoothed one, and use this comparison as the metric for the experiments described below. Specifically, we calculate the average absolute shift between adjacent points on the video feature path, with respect to both the x and y directions, and the average absolute rotation angle increment. The same calculations are done for the smoothed path.

3.3.2 Evaluation of Facial Retargeting
For the facial retargeting part, in addition to the comparison between the L1 norm |D(P)|_1 of the original video feature path and the new one, from which we can extract information about smoothing, we are also interested in how well the facial features are targeted. We therefore calculate the average position of the face features with respect to the center of the frame, and simultaneously calculate the average absolute position deviation.

3.4. Face Decoration
With per-frame face features detected, we can add fun face decorations to our videos, such as glasses, a hat or a mustache. By incorporating feature locations, we are able to translate, scale and rotate the decorations to place them appropriately onto human faces. Since our videos are stabilized and focused on faces, the transitions of the decorations are smoother. Here is an example of how we utilize the feature points in adding decorations.
Adding glasses: we extract the left-eye, right-eye, left-brow and right-brow feature points to calculate a horizontal eye axis, and use it to estimate the orientation of the glasses. Scale is approximated from the eye distance, and translation depends on the locations of the eye points.
Since face silhouette feature points are usually less stable, we avoid using those points when adding face decorations. Screenshots of adding a hat and glasses are shown in Figure 4.
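A sketch of this placement logic (our illustration of the idea; the scale factor and the way landmarks are grouped are assumptions, not values from the report):

import numpy as np

def glasses_placement(left_eye_pts, right_eye_pts):
    # Estimate rotation angle, scale and translation for a glasses overlay from
    # the detected eye landmark points (arrays of (x, y) coordinates).
    left_center = np.mean(np.asarray(left_eye_pts, dtype=float), axis=0)
    right_center = np.mean(np.asarray(right_eye_pts, dtype=float), axis=0)

    eye_axis = right_center - left_center
    angle = np.degrees(np.arctan2(eye_axis[1], eye_axis[0]))  # orientation of the glasses
    eye_distance = np.linalg.norm(eye_axis)
    scale = eye_distance * 2.0        # assumed: the overlay spans about twice the eye distance
    center = (left_center + right_center) / 2.0  # translation target for the overlay

    return angle, scale, center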
4. Experiments

Table 1 lists the algorithm run time on our laptop. The second column lists the time for path smoothing without the facial feature, and the third column lists the time for path smoothing with the facial feature as a saliency constraint. In the latter case, the CLNF facial landmark detection takes up the biggest chunk of time (about 45 ms per frame). [1] reported 20 fps on low-resolution video, and 10 fps with un-optimized saliency.

Table 1. Timing per frame of the algorithm. Video resolution 640 x 360.

                             w/o face    w/ face
motion estimation (ms)       12.1        59.1
optimize camera path (us)    0.15        0.40
render final result (us)     2.7         2.7
face decoration (ms)         -           5.7
total (ms)                   15          68
speed (fps)                  67          15

4.1. Video Stabilization
We apply our path smoothing algorithm to shaky videos and observe a significant reduction of jittering. An example output can be found on YouTube.
To visualize the effect of stabilization, we plot the estimated camera trajectory before and after our algorithm in Fig. 1. We also provide a quantitative measurement of the L1 norm |D(P)|_1 before and after smoothing in Table 2. As we can see, the L1 norm decreases considerably, which means the abrupt jitters are significantly reduced.

Figure 1. Path before and after (left column) L2-norm smoothing and (right column) L1-norm smoothing. (Top) x-direction. (Middle) y-direction. (Bottom) rotational angle.

Table 2. L1 norm |D(P)|_1 between the original video feature path and the smoothed one, in the x and y directions and the rotational angle.

path        <|Δxt|>    <|Δyt|>    <|Δat|>
original    1569       857        1.12
smoothed    705        234        0.44

4.2. Facial Retargeting
Our experiments with video stabilization using facial features are shown in Fig. 2. Fig. 2(a) is the original video, which contains slow swanning motion of both the camera and the subject person. Fig. 2(b) is the stabilized output using only camera path smoothing; the slow motion of the subject is still prominent. Fig. 2(c) is the stabilized output using camera path smoothing with a constraint on the motion of facial features, which leads to stabilization of the subject at the center over frames. Both result videos can be found on YouTube (link 1 and link 2).
As expected, stabilization comes at the price of reduced resolution. The original images are cropped by 20% in Fig. 2(b) and (c) to remove black margins due to warping. There are still residual margins in Fig. 2(c).
We also quantify the smoothing effect and the facial targeting, as seen in Table 3. With an increase of the facial saliency constraint ratio ω, both the L1 norm and the absolute position shift drop, which means that the larger ω is, the smoother the video gets and the more centered the human face is. This result is expected from our algorithm.

4.3. Comparison with State-of-the-art Systems
Since there is no publicly available implementation of previous works, we obtained the original and output videos reported in Grundmann's paper [1], calculated the evaluation metrics described in Section 3.3 on their output video, and present them alongside our results. As can be seen from the comparison in Table 4, our implemented algorithm is comparable to the state-of-the-art system.

4.4. Face Decoration
With per-frame face features detected, we add face decorations such as glasses, a hat or a mustache, translated, scaled and rotated as described in Section 3.4. Since our videos are stabilized and focused on faces, the transitions of the decorations are smoother. Screenshots of adding a hat and glasses are shown in Fig. 4.

5. Conclusion & Perspectives
All in all, the video feature path is significantly smoothed using the L1-optimization stabilization algorithm; the L1 norm |D(P)|_1, which quantifies the motion between frames, drops greatly after applying the stabilization.
If the facial retargeting method is included, the video becomes more focused on human faces; the larger the saliency constraint ratio ω is, the more centered the human faces are with respect to the cropped video frame.
Decorations such as glasses, a hat, or a tie can also be attached to the faces in the video, with the same orientation as the faces. More fun decorations will be added to make this work fancier in the future.

References
[1] Matthias Grundmann, Vivek Kwatra, and Irfan Essa. Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths. CVPR, 2011.
[2] J. R. Rice and J. S. White. Norms for smoothing and estimation. SIAM Review, 1964.
[3] Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. Constrained Local Neural Fields for robust facial landmark detection in the wild. ICCVW, 2013.
[4] Michael L. Gleicher and Feng Liu. Re-cinematography: improving the camera dynamics of casual video. In Proceedings of the 15th ACM International Conference on Multimedia (MM '07), ACM, New York, NY, USA, pages 27-36, 2007.
[5] Yasuyuki Matsushita, Eyal Ofek, Weina Ge, Xiaoou Tang, and Heung-Yeung Shum. Full-Frame Video Stabilization with Motion Inpainting. IEEE Trans. Pattern Anal. Mach. Intell., 28(7):1150-1163, July 2006.
[6] F. Liu, M. Gleicher, J. Wang, H. Jin, and A. Agarwala. Subspace video stabilization. ACM Transactions on Graphics, volume 30, 2011.
[7] Junhong Gao, Seon Joo Kim, and M. S. Brown. Constructing image panoramas using dual-homography warping. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 49-56, June 2011.
[8] C. Buehler, M. Bosse, and L. McMillan. Non-metric image-based rendering for video stabilization. In Proc. CVPR, 2001.
[9] Feng Liu, Michael Gleicher, Hailin Jin, and Aseem Agarwala. Content-preserving warps for 3D video stabilization. ACM Transactions on Graphics (TOG), 28(3), August 2009.
[10] Amit Goldstein and Raanan Fattal. Video stabilization using epipolar geometry. ACM Transactions on Graphics (TOG), 31(5):1-10, August 2012.
[11] B.-Y. Chen, K.-Y. Lee, W.-T. Huang, and J.-S. Lin. Capturing intention-based full-frame video stabilization. Computer Graphics Forum, 27(7):1805-1814, 2008.
[12] Heung-Yeung Shum and Richard Szeliski. Systems and Experiment Paper: Construction of Panoramic Image Mosaics with Global and Local Alignment. International Journal of Computer Vision, 36(2):101-130, Feb. 2000.
[13] Shuaicheng Liu, Lu Yuan, Ping Tan, and Jian Sun. Bundled camera paths for video stabilization. ACM Trans. Graph., 32(4), Article 78, July 2013.
Figure 2. Demonstration of facial retargeting in video stabilization. The green dot indicates the center of the frame, and green lines show the border of the frame. Red dots in (c) indicate detected facial landmarks from OpenFace [3]; they are intended as a guide to the eye. Both videos can be found on YouTube ((b) and (c)).

Table 3. L1 norm |D(P)|_1 of the video feature path in the x and y directions, and average absolute deviation of the face position from the frame center, for different facial saliency constraint ratios ω.

ω           <|Δxt|>    <|Δyt|>    <|x - xcenter|>    <|y - ycenter|>
original    1392       496        32805              5882
0.2         1139       254        32583              4902
0.5         792        234        21568              3433
0.95        221        247        2695               1954
Figure 3. Path smoothing before and after, with facial saliency constraints. (Left column) x-direction. (Right column) y-direction. From top to bottom, the facial constraint ratios ω are 0.2, 0.5, and 0.95, respectively.

Table 4. Comparison between our algorithm and the state-of-the-art one from [1].

method                  <|Δxt|>    <|Δyt|>    <|Δat|>
state-of-the-art [1]    273        296        0.53693
our algorithm           705        234        0.44387

Figure 4. Face decoration with glasses and hat.