Abstract

In this project, we sought to track the movement of multiple people in 3D given security footage that is representative of what would be available in retail stores, without modifying existing camera deployments. More specifically, this involves using a single, possibly fisheye-distorted view to track people in 3D space and model where they are in a room or store. This involved bridging various papers across the computer vision literature, looking at radial distortion resolution for images from an uncalibrated camera; calibration techniques from single-view metrology, affine approximation, and a simplified special-case mathematical model for object detections on a ground plane; and deep region-based convolutional networks for 2D person detection.

We define an approach that uses this single security camera view to track people on a ground plane, relying on some assumptions about the geometry of the space, but no additional hardware, giving our work an advantage over existing companies and processes that rely on more sophisticated sensors or sensor networks for person-tracking. The assumptions we make are realistic in the retail store context, and our quantitative results are compelling: significantly more accurate than WiFi-based tracking solutions that are already being deployed [9]. Our approach can be deployed in most retail spaces without any hardware modifications to existing security setups. We are excited by the real-world applicability of our work.

1. Introduction

As we see much of retail moving online from brick-and-mortar, opportunities to analyze consumer shopping behavior are rapidly growing and becoming commercialized. Recommendation engines, product positioning on webpages, and sales funnels are all relentlessly A/B tested and optimized to seduce the consumer into clicking Buy. Much as the abundance of data makes optimization and iteration easy in the e-commerce space, brick-and-mortar retail stores are looking to obtain and harness similar data to shelve products, auction shelf space, and strategically place discounts and information.

There are numerous current approaches to obtaining real-time consumer data in retail stores. Many of these involve tracking people's movement throughout the store, with Bluetooth or WiFi tracking, or with photogates. All of these require significant hardware and deployment costs, which has hindered their scalability.

Computer vision is a promising tool to address this business need. As we've seen significant progress in the field as of late (notably from convolutional neural networks), cameras are an attractive option for tracking users in a store and obtaining data on customer retail behavior. Cameras are a particularly attractive method to obtain this data because most retail stores already have the necessary hardware in place for the purposes of security.

In this project, we track the 3D positions of multiple people throughout a store in real-time, given a camera source and certain assumptions about the store layout (namely, that only one floor is visible to the camera, and that people's feet are visible in the image frames). We show a live 3D bounding box whose coordinates are relative to the world frame that tracks each person as they move, with error levels within 55cm in outdoor settings and 30cm indoors, much better than the existing WiFi- and Bluetooth-based tracking solutions [9]. To do this, we tie together numerous concepts from computer vision in a single approach: we design an entirely new approach for this business need by integrating various existing techniques. For each subproblem, we tried various options, picked the best, and identified optimizations for this particular case where applicable - for instance, we found that affine approximation of tiled floor grids worked far better than vanishing points given the configuration of many security cameras with respect to the floor.

In this paper, we begin by examining the problem statement and related work, both work that we read to gain background knowledge and work whose approaches to distortion, calibration, and object detection we learned from and built on. We then dive into our approaches to each of these three problems, and how they tie together into an end-to-end approach that could be deployed in retail stores. Finally, we use a larger-scale dataset to obtain quantitative metrics with which to evaluate the success of our approach, and we leave space for future work, such as integrating Extended Kalman Filters to enforce temporal consistency.
2. Problem Statement

Our objective is to accurately predict the 3-D position of a person based on their location in a security camera image. If the person's true location (we use the location of their feet) is given by (x*, y*, z*), and we estimate a location (x', y', z') for them in 3-D space, then we are trying to minimize

d = \sqrt{(x^* - x')^2 + (y^* - y')^2 + (z^* - z')^2}

for each person in each image. Since we use people's feet, we can constrain z' = 0, enabling us to use a single view to predict position - this creates errors when people jump, but this is not typical behavior.

We use two coordinate systems in this paper. The primary system is standard: x spans the width of the image, increasing to the right, y spans the height, increasing downwards, and the origin (0, 0) is in the upper left hand corner of the image. We also use a second coordinate system when undistorting images, where (0, 0) is at the optical center of the image, and x and y increase right and downwards respectively. This coordinate system is necessary to model radial distortion parameters by expressing points in polar coordinates from the optical center.

3. Related Work

There is a significant amount of work that has been done in the space of retail analytics via camera, but this work has been done almost exclusively by startups which protect their methods as intellectual property. These include Prism Skylabs, Brickstream, and RetailNext. These are all dependent on custom hardware or sensors to augment the surveillance feed.

On the technical front, we had to integrate work from various frontiers in computer vision. Solving this business problem required us to solve a number of technical problems. One was correction of barrel distortion - we leaned heavily on Sing Bing Kang's work in Semiautomatic Methods for Recovering Radial Distortion Parameters from A Single Image, in which he defined an algorithm by which a user draws snakes on a distorted image, each approximately corresponding to a projected straight line in space [8]. In his paper, he outlines how these snakes can be used consistently with a model of radial image distortion to solve for the radial distortion parameters and thus undistort the image. For a given snake, the algorithm fits it to the line of best fit, rotates this line to be horizontal, and estimates constant distortion parameters that fit all of these snakes/lines.

Another clear problem was recovering both intrinsic and extrinsic camera parameters from a single view. To do this, we used the affine calibration approximation taught in class and covered in R. Hartley and A. Zisserman's textbook, Multiple View Geometry in Computer Vision [7]. In particular, we used calibration from a checkerboard with the direct linear transformation algorithm, with tiled floors as our checkerboard. We had also tried single view metrology with three sets of parallel lines, but this left us estimating extrinsics.

Finally, we had the problem of object detection, to find people in our image frame. Cutting-edge research in object detection suggests that deep convolutional nets are the best way to do this. Scalable Object Detection using Deep Neural Networks by Erhan et al. at Google demonstrated that convolutional neural nets are very powerful for finding regions of interest, while also having an effective recognition path that categorizes the object of interest. The two steps take a while, though, and are not necessarily fast enough to build real-time bounding boxes on video - Ren et al.'s Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks takes this work a step further by folding the localization and recognition paths into the same convolutional neural network, training the weights with the localization and recognition cost functions alternately [10]. We use this work directly as a component of our solution.

Finally, after proving our concept on some YouTube footage, we were able to find a much more expansive dataset with ground truths from A New Dataset for People Tracking and Reidentification via the Video Surveillance Online Repository [11]. This dataset was also pre-calibrated and undistorted for us, via the methodology outlined in Cooperative Object Tracking with Multiple PTZ Cameras, presented by Everts, Jones, and Sebe [3].

4. Technical Approach

4.1. Distortion Correction

While there are a number of approaches to fixing this issue, such as un-distorting the image with projections of area, or computing radial distortion coefficients, most methods depend on knowing the intrinsics of the camera pre-distortion. Sing Bing Kang's work, however, suggests a method to manually pick points on a line and accordingly fit distortion parameters, as referred to above [8]. In particular, it tries to fit all points that should be collinear (as denoted by a person) so that they are, while moving those points as little as possible, and only moving all points radially by adjusting radius parameters.
Radial distortion, of which barrel distortion is a type, can be modeled by imposing a polar coordinate system on the image. From the center of the image, each pixel has an angle and a distance from the optical center. Changes in this distance create radial distortion. The distortion at a point, which we call ∆r, is the change in distance from the optical center relative to the undistorted distance. We model this distortion with the equation:

\Delta r = \sum_{i=1}^{\infty} C_{2i+1} r^{2i+1} \quad \text{where} \quad r = \sqrt{x^2 + y^2}

Then, the approach is to find values of C such that all points we manually constrain to be collinear become collinear, while also minimizing the distance we move them. This is a common radial distortion correction algorithm, and we found that Photoshop actually provides a very effective implementation which can be used to adjust entire videos and imposes the same distortion parameters on each frame. We went this route to manually undistort video, rather than implement the polar geometry and solver for the parameters from scratch.

4.2. Calibration

Being able to map pixel coordinates to world coordinates is a central component of understanding shoppers' 3D locations from images. Thus it is necessary to find a robust camera calibration for any given video feed. We considered two approaches to solving this problem for our unlabeled retail store data. For our 3DPeS data, calibration parameters were included with the dataset; those parameters were given in a third type of calibration formulation, which we also explain below. Because retail cameras only need to be calibrated once, it is practical to do these calibrations by hand in a real-world context; thus we did not invest time in automating the calibration process.

4.2.1 Single View Metrology

Our first approach estimates three mutually orthogonal vanishing points in the scene and uses them to constrain the image of the absolute conic ω. Assuming zero skew and square pixels (reasonable for a retail security video that has been corrected for barrel distortion), we can constrain ω to:

\omega = \begin{pmatrix} \omega_1 & 0 & \omega_2 \\ 0 & \omega_1 & \omega_3 \\ \omega_2 & \omega_3 & \omega_4 \end{pmatrix}   (1)

This matrix has four unknowns, but it is only known up to scale. This means there are effectively three unknowns if we set one of the unknown variables to 1 and scale the rest accordingly. As a result, we can solve for the matrix ω by using our three vanishing points and exploiting the fact that, because they are mutually orthogonal, for each v_i, v_j with i ≠ j we have v_i^T ω v_j = 0. Thus we have three scalar equations in three unknowns:

v_1^T \omega v_2 = 0   (2)
v_1^T \omega v_3 = 0   (3)
v_2^T \omega v_3 = 0   (4)

It is known that ω = (KK^T)^{-1}, where K is the 3×3 matrix of camera intrinsics. So we can find K using the Cholesky decomposition of ω. We did this with our retail video and found the camera intrinsics. Unfortunately, after this process the extrinsics [R|T] are still unknown, so we could not recover the entire camera matrix P = K[R|T]. Setting the camera to be the origin in world coordinates is not helpful, because even though it resolves the [R|T] parameters (they would simply be [I|0]), we still need to know where the ground plane is in world coordinates to resolve the projective ambiguity of mapping a pixel to a world point. We tried estimating [R|T] by hand through trial and error, but the results were very unreliable.
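As a concrete illustration of this step, the following is a minimal sketch (our own helper, assuming the three vanishing points have already been estimated in homogeneous pixel coordinates) of solving the three constraints for ω and recovering K with a Cholesky factorization:

```python
import numpy as np

def intrinsics_from_vanishing_points(v1, v2, v3):
    """Solve v_i^T * omega * v_j = 0 for the constrained omega of equation (1),
    then recover K from omega = (K K^T)^(-1) via a Cholesky factorization.
    v1, v2, v3 are mutually orthogonal vanishing points in homogeneous pixels."""
    def row(a, b):
        # Coefficients of (w1, w2, w3, w4) in a^T * omega * b = 0.
        return [a[0] * b[0] + a[1] * b[1],
                a[0] * b[2] + a[2] * b[0],
                a[1] * b[2] + a[2] * b[1],
                a[2] * b[2]]

    A = np.array([row(v1, v2), row(v1, v3), row(v2, v3)], dtype=float)
    _, _, Vt = np.linalg.svd(A)
    w1, w2, w3, w4 = Vt[-1]                      # null space, defined up to scale
    omega = np.array([[w1, 0.0, w2],
                      [0.0, w1, w3],
                      [w2, w3, w4]])
    if omega[0, 0] < 0:                          # fix the arbitrary sign of the scale
        omega = -omega
    L = np.linalg.cholesky(omega)                # omega = L L^T = (L^T)^T (L^T)
    K = np.linalg.inv(L.T)                       # K is upper triangular
    return K / K[2, 2]
```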
4.2.2 Affine Calibration

Our problems with a calibration based on single view metrology could be resolved by finding point correspondences and solving for the camera matrix directly. For our retail video, we labeled 15 points by hand in the scene. We place the origin at the bottom left corner of the bottom-leftmost tile that is fully visible, we let each tile be 1 × 1 in width and height in world coordinates, and we say that all tiles lie on the ground plane z = 0. We model the camera matrix P as affine, which is a desirable approximation even though the true camera matrix is projective, because the lines in the scene are nearly parallel and solving for fewer unknowns is preferred with only 15 point correspondences. That is, we let:

P = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\ 0 & 0 & 0 & 1 \end{pmatrix}   (5)

Then, we use our n = 15 points to solve the following over-constrained system of 2n equations:

Ax = b   (6)

where the world coordinates of point i are (x_i, y_i, z_i), the image coordinates are (u_i, v_i), and:

A = \begin{pmatrix} x_1 & y_1 & z_1 & 1 & 0 & 0 & 0 & 0 \\ \vdots & & & & & & & \\ x_n & y_n & z_n & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & x_1 & y_1 & z_1 & 1 \\ \vdots & & & & & & & \\ 0 & 0 & 0 & 0 & x_n & y_n & z_n & 1 \end{pmatrix}   (7)

b = \begin{pmatrix} u_1 & \cdots & u_n & v_1 & \cdots & v_n \end{pmatrix}^T   (8)

And x is a column vector of the eight unknowns in P, arranged in order with unknowns from the first row of P before unknowns from the second. We solve the system of equations in the standard way: by rearranging so that the right hand side is 0 and the matrix A has additional columns (and x additional rows) to maintain the constraints imposed in the original equation by the values in b; then taking the SVD of this augmented left-hand matrix and using the last column of the third output of the SVD as the parameters of P.

4.2.3 PTZ Calibration for Ground Plane Object Detection

Affine calibration worked well for our retail video data. But we used a different approach when working with the 3DPeS dataset, because that dataset already included parameters for a different type of calibration. Due to the difficult position of the cameras in the dataset, the publisher used a simpler type of calibration [3] designed specifically for Pan, Tilt, and Zoom (PTZ) cameras, which are commonly used in surveillance. The methodology is fully described in the source paper; we briefly summarize it here for convenience.

The calibration assumes that objects are only detected along a ground plane of Z = 0. Let U, V, H be the displacement of the camera coordinate system relative to the world; ∆i = i − i_0 and ∆j = j − j_0 are the pixel positions relative to the image's optical center (i_0, j_0); α_x^f and α_y^f are the horizontal and vertical scales between the image and the image plane; t is the tilt angle of the camera; and p' = p + p_0 is the pan angle after the camera is aligned with the world coordinate system. An object's world coordinates X, Y are then given as:

\begin{pmatrix} X \\ Y \end{pmatrix} = \frac{H}{\alpha_y^f \Delta i \sin t + \cos t} \, R \begin{pmatrix} \alpha_x^f \Delta j \\ \alpha_y^f \Delta i \\ -1 \end{pmatrix} + \begin{pmatrix} U \\ V \end{pmatrix}   (9)

where

R = \begin{pmatrix} \cos p' & \sin p' \cos t & \sin p' \sin t \\ \sin p' & -\cos p' \cos t & -\cos p' \sin t \end{pmatrix}   (10)
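For concreteness, a minimal sketch of evaluating equations (9) and (10), assuming the calibration parameters U, V, H, α_x^f, α_y^f, t and p' are already known (the function and argument names are ours):

```python
import numpy as np

def ptz_ground_point(di, dj, U, V, H, ax, ay, t, p):
    """Map pixel offsets (di, dj) from the optical center to world (X, Y),
    assuming the point lies on the Z = 0 ground plane (equations 9-10)."""
    R = np.array([[np.cos(p),  np.sin(p) * np.cos(t),  np.sin(p) * np.sin(t)],
                  [np.sin(p), -np.cos(p) * np.cos(t), -np.cos(p) * np.sin(t)]])
    v = np.array([ax * dj, ay * di, -1.0])
    scale = H / (ay * di * np.sin(t) + np.cos(t))
    X, Y = scale * (R @ v) + np.array([U, V])
    return X, Y
```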
4.3. Person Detection

Figure 2. Our ConvNet can detect multiple pedestrians with high confidence, especially in clear environments such as this one. This image is from Camera 3 in the 3DPeS dataset.

Person detection in 2D images is a well-studied problem [12] [1] [6] with several existing solutions, offering various tradeoffs between speed, accuracy and simplicity. There are essentially two parts to the problem: generating regions of interest (RoIs) where a person might be, and classifying those regions to determine if the region does indeed contain a person.

Recently, approaches that utilize deep ConvNets have been shown to perform exceptionally well at object detection, and person detection specifically. For this part of our problem, we use a deep ConvNet object detection architecture proposed by Girshick et al. known as Faster R-CNN [10]. Faster R-CNN is an improvement on Fast R-CNN [4], which is itself an improvement on the original R-CNN architecture [5]. Faster R-CNN works as a single, unified ConvNet that uses shared convolutional layers to output feature maps that then get sent to a Region Proposal Network (RPN) and a classifier head. The network is trained end-to-end with back propagation and stochastic gradient descent, with a multi-task loss function. The full details can be found in [10].

We use a pretrained version of Faster R-CNN that we modified to only output person detections (the original version outputs detections of 20 types of objects).

The pipeline operates on a stream of video data. For each frame in the stream, we run our person detector over the image and get as output a set of bounding boxes. In the general case, the mapping between the pixels defining the bounding box and the world coordinate system is ambiguous, thanks to the ambiguity of the 3D to 2D transformation of the camera. But because we know these bounding boxes are people, we can make the assumption that each person's feet rest on the ground plane (Z = 0). This is a reasonable assumption in nearly all retail environments; the only test videos we encountered in which this is not the case are when the camera watches over an escalator or can see multiple floors at once.

Once we have bounding boxes for each person, we take the bottom center pixel of each box (call its image coordinates c_x, c_y) and find the 3D coordinates associated with that pixel, assuming it lies on the Z = 0 plane. In the affine calibration case, this means finding the intersection of the ray from the camera along which any point in 3D would project to (c_x, c_y), and the Z = 0 plane. By construction this intersection must resolve to a unique point.

In the PTZ calibration case, the ground plane assumption is built into the calibration model, so nothing else must be done besides converting c_x, c_y to offsets from the optical center and plugging the results into equation (9).

In both cases, once we have obtained the world coordinates of each person, we plot 3D voxels representing each person in a 3D graphing environment modeled after the room, to visualize the positions of the people in 3D.
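To make the affine-calibration case concrete, here is a minimal sketch (our own helper, not the original implementation) of intersecting the back-projected ray of a box's bottom-center pixel with the Z = 0 plane, given a 3×4 camera matrix P:

```python
import numpy as np

def bbox_bottom_to_ground(P, box):
    """Intersect the back-projected ray of a box's bottom-center pixel with Z = 0.
    P is the 3x4 camera matrix; box is (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, y_max           # bottom-center pixel (the feet)
    # With Z = 0, the projection reduces to a 3x3 map formed by columns 0, 1, 3 of P.
    H = P[:, [0, 1, 3]]
    X, Y, w = np.linalg.solve(H, np.array([cx, cy, 1.0]))
    return X / w, Y / w                              # world coordinates on the floor
```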
5. Experiments

…single 2D camera, in a cleaner environment.

5.1. Environment

We performed all tests on a late-2013 MacBook Pro with 16GB RAM and a 2.3GHz processor. Due to lack of hardware, we ran all code on the CPU, even though the ConvNet runs much faster on the GPU. This resulted in an average execution time of 3.19s per frame, a roughly 15X slowdown in prediction speed compared to the results reported by [10] on better hardware. The time spent outside of our ConvNet's forward pass was negligible. From these numbers it is clear that a real-world deployment of our work should have a dedicated GPU.

5.2. Results and Error Analysis: Retail Clip Experiments

Figure 4. Example bounding box predictions for the retail video data. These were generated without doing image distortion correction, although in our final implementation we were sure to correct distortion first if it was present.

Figure 5. The point correspondences we labeled for the retail camera video clip, vantage point 2. It is important that some of the points are off the ground plane, or the calibration would be degenerate.

The goal of our experiments on the retail clip data was to verify qualitatively that our approach was sound, and to produce for each frame a 3D visualization of the scene geometry with people accurately tracked throughout. In our affine calibration step, we hand-labeled 15 point correspondences, shown in figure 5. The root-mean-square error (RMSE) of the calibration matrix we found, measured on the data used to create it, was 32.7084 pixels, less than the width of one tile almost everywhere in the frame. Ad-hoc measurements of the final 3D voxel outputs showed they were generally within two thirds of a tile of the true position of each person when the bounding box was correct, or roughly 20cm. We did not analyze this rigorously, as we performed most of our quantitative analysis on the second dataset.

A common failure mode of our solution on the retail clip data was occlusions. Occlusions cause problems in two ways. The first is that they sometimes prevent our ConvNet from finding a person in the frame. Even if the ConvNet does find a bounding box, however, occlusions can still cause problems if the feet of the person are not visible in the image. This is because our pipeline assumes that the bottom of the bounding box is where a person's feet are, and thus where the ground plane is. When that assumption is violated (e.g. because the bounding box ends at the person's waist), then the output is noticeably inaccurate.

Another typical failure occurred when people were in rapid motion. In the video clip, there is a point at which the two women sprint out of the store. For most of these frames, the system loses track of them because no bounding boxes are predicted. We hypothesize two reasons for this failure. One is that the rapid and blurry stills of a human sprinting do not look very much like a typical person, and these types of images are likely underrepresented in the dataset on which our ConvNet was trained. The second is that, due to the underlying architecture of the ConvNet, it has a receptive field size of 228 pixels. This is suitable for most purposes, but when the people in this video clip are sprinting with arms extended on both sides, their width in the image easily exceeds 400 pixels. This makes it nearly impossible for the ConvNet to have a chance at detecting the entire bounding box.

5.3. Results and Error Analysis: 3DPeS Dataset Experiments

We also evaluate our pipeline on the 3D People Surveillance Dataset provided by [11]. In general our people detection ConvNet works much more reliably on this dataset because of the reduced occlusions, better lighting and higher definition of the images. In our run of the pipeline on a live stream of 17 frames from the same camera, we detect 51 of 53 total person bounding boxes where the person is more than halfway in the scene (i.e. not mostly cut off by an edge of the image). Across all frames we tested, the root-mean-square error of our position predictions in world coordinates was 554 millimeters.
This is about 2x higher than our ad-hoc estimate of our performance on the retail dataset, due mostly to the vastly greater field of view of the camera used in this dataset. (This is an outdoor camera which overlooks more than 200 square meters of space, much more than can be seen by the indoor camera, so being off by the same number of pixels translates to a much larger increase in RMSE.)

Figure 6. Bounding boxes found for a sample frame and the corresponding 3D scene model that we generated. In the 3D model, the origin is marked by the blue plus sign below the left voxel. It corresponds to the tile in the frame found right below the "i" in "DixonSecurity.com". We can see here that the model is rather accurate considering the low number of calibration points, the original fisheye distortion, and partial occlusions in the scene.

Figure 7. Our prediction errors in millimeters in the world frame for each person in each image we evaluated, shown collectively. Each point is the difference between the predicted x, y of a person in world coordinates and the true x, y of the person.

Figure 8. The same graph as before but with the outlier (Y offset > 3000) removed. Note the different scales of the X and Y axes.

We can glean several interesting insights from the prediction error graph. For example, we see that in general there is more error along the X axis than the Y axis, but most of the Y axis error that does occur is in the same direction: consistently slightly positive. This is because we use the bottom of the bounding box as the intersection point of the person with the ground, when in reality the ground truth label for the person's position in 3D considers the center of the person overall. (Imagine a circle on the ground around the person's feet. The centerpoint of this circle is the ground truth x, y label. It will consistently be slightly offset from a point at the edge of one foot, which is what we get with the bounding box method.)

The X axis error is also because of our bounding-box-to-intersection-point methodology. As people walk they swing their arms and stride their legs. The bounding box produced by our ConvNet will generally capture all of these extremities, so any time they are not displaced from the person's center symmetrically, the bottom center of the bounding box will not be an accurate representation of where the feet intersect. Finally, the bounding boxes are in general imperfect, and random noise is surely a factor as well.
6. Conclusions and Future Work

We've successfully developed and outlined an end-to-end approach to turning raw security footage into a 3D model of customer movement throughout a retail store. This involves a one-time calibration of distortion and camera parameters, and then the usage of Faster R-CNN to find people in the frame. The aforementioned parameters are used to relate their location in the frame to their real location inside the store. While scaling difficulties arise in the once-per-deployment cost of manually determining the camera's distortion, intrinsic, and extrinsic parameters, this method seems to be accurate enough to effectively provide data to retail environments.

With typical error in the range of 20cm or so in real space in indoor settings, this could very plausibly be used to track the location of shoppers in a retail space - information such as aisle choice, for instance, is easily determined at this level of granularity. Back-projecting the person's location into 3-D space is very important for these businesses, and it's exciting that a simple and practical assumption about position (that feet are on the ground) is so effective.

There is, however, a major obstacle to usage of this approach in practice. This is occlusion of the feet - it's not uncommon for shelves or other objects to block the feet of subjects, making it impossible for our existing algorithm to guess their position. We had to carefully select datasets because of this limitation, but real retail stores will not be able to do this, instead having to work with whatever their camera sees. One promising way to handle this would be to use temporal consistency (i.e. relate similar bounding boxes across timeframes) to estimate foot position now based on foot position in previous frames. This could be done with Extended Kalman Filters, which allow us to integrate a physical model of the world alongside noisy measurement data (the person detector) to produce an output that is overall more robust. We could also use a Faster R-CNN architecture ConvNet trained specifically to look for feet, and when feet are not detected in a bounding box (due to occlusions) we could instead use the position of the face and extrapolate downward based on assumptions about human proportions.

The problem of consistent offsets in one direction, caused by using the edge-of-feet point from the bounding box rather than the between-the-feet ground truth point, can be resolved with the simple addition of a mean error vector (x_∆, y_∆) that can be learned from training data. Overall, we think this is a compelling first step towards real-time 3D person detection in retail and that the remaining obstacles are surmountable.

References

[1] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. Computer Vision - ECCV, 2006.
[2] T. Dixon. Fight caught on cctv security camera. https://www.youtube.com/watch?v=Kla8W8IIAtk.
[3] I. Everts, G. Jones, and N. Sebe. Cooperative object tracking with multiple ptz cameras. Image Analysis and Processing, 2007.
[4] R. Girshick. Fast r-cnn. IEEE International Conference on Computer Vision (ICCV), 2015.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, 2013.
[6] I. Haritaoglu, D. Harwood, and L. S. Davis. W4s: A real-time system for detecting and tracking people in 2 1/2d. Computer Vision - ECCV, 1998.
[7] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[8] S. B. Kang. Semiautomatic methods for recovering radial distortion parameters from a single image. Technical Report CRL, 1997.
[9] F. Manzella and I. T. Teije. The truth about in-store analytics: Examining wi-fi, bluetooth, and video in retail. 2014.
[10] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv, 2015.
[11] Video Surveillance Online Repository. 3dpes: A new dataset for people tracking and reidentification. http://imagelab.ing.unimore.it/visor/3dpes.asp.
[12] Z. Zivkovic and B. Krose. Part based people detection using 2d range data and images. IEEE/RSJ International Conference on Intelligent Robots and Systems, 2007.
AUGMENTED REALITY IN LIVE VIDEO STREAMS USING POST-ITS
FINAL REPORT
JUNE 6TH, 2016
…of robust, accurate and fast Post-it detection for augmented reality, which had not yet been done in previous literature.

In addition, we show that it is indeed possible to use generic Post-it notes as fiducial markers for augmented reality, robustly obtaining full pose estimation at high accuracy and high speed. We do so using a combinatory approach of various low-level algorithms (color filtering, noise reduction, edge detection, line detection and several logical elements) specifically tailored to the use case, providing results superior to general methods such as SIFT and Template Matching.

3. Technical approach

Using Post-its for augmented reality imposes two important constraints. First, the Post-it is an object with very few distinguishing features, implying some general feature detection methods might not work well. Second, the speed requirement further restricts the methods available and creates a focus on minimizing execution time. We aim to consistently render 30 frames per second, which implies the full algorithm must take less than 33 milliseconds to execute on our hardware. We will test a manual "Color-Shape" approach, a SIFT approach and a Template Matching approach and compare their applicability to solving our problem.

(5) For bright colored variants, the saturation level is high compared to most surrounding scenes
(6) For various color variants (e.g. pink), the hue is uncommon in most scenes

The Color-Shape method aims to make use of all of these properties to achieve an optimal solution.

While creating this method and choosing parameters, we aim to optimize several factors. First, we aim for a high detection rate, which we define as the percentage of frames that return valid vertices (as opposed to frames that return no vertices). Second, we aim for high accuracy, defined as the percentage of detected vertices that are accurate. Third, we aim for speed, measured in milliseconds of execution, while also keeping the standard deviation in mind. High standard deviation can result in video stutter even when mean speed is low, as a single frame with high processing time will halt the video until processing is completed.

We use a combination of color masking, binary noise reduction, edge detection, line detection and various logic steps to estimate the Post-it's location and calculate the transformation matrix. We rely on Python OpenCV 3.0.0 implementations of the mentioned algorithms, and NumPy for other image-wide calculations, given both run optimized machine code for good performance. Figure 1 shows the main elements of the Color-Shape pipeline.
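As a rough sketch of the front end of such a pipeline (color masking, binary noise reduction, and edge detection) using the OpenCV and NumPy libraries mentioned above; the HSV thresholds shown are placeholders rather than the tuned values used in our experiments:

```python
import cv2
import numpy as np

def postit_edges(frame_bgr):
    """Color mask -> binary noise reduction -> edge image for one video frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Keep highly saturated, bright pixels whose hue is near the Post-it's color
    # (placeholder range for a pink note; tune per note color).
    mask = cv2.inRange(hsv, np.array((140, 80, 80)), np.array((175, 255, 255)))
    # Morphological opening removes small speckles; closing fills small holes.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # Edge detection on the simple binary mask.
    return cv2.Canny(mask, 50, 150)
```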
…same line ("thin the edges"). After this, a thresholding mechanism is applied to find the most likely true edges and remove ones more likely caused by noise.

Given the relatively simple mask provided by the previous steps, the edge detector provides good results as expected. Experimentation with the threshold values within reasonable bounds caused no significant difference in the results.

3.2.4. Iterative Hough transform & line erase. On the edge-detected output, we apply a Hough transform to find lines. The Hough transform translates Euclidean x-y coordinates into curves in polar space, representing lines with different distances from the origin (r) and angles (θ). Cells in Hough space with votes above a certain threshold are accepted as lines and converted back into x-y coordinate space, where points at extreme x-y coordinate values on the specified line are used as endpoint estimates. Figure 5 shows the initial output.

Figure 5. All Hough lines found

Figure 6. Initial edge image (left) and edge image after 2 Hough and Erase Line iterations (right)

One problem with the Hough transform is that it will fit multiple lines on a single Post-it edge. As a robust method for finding the most promising lines corresponding to the unique Post-it edges, while removing duplicate Hough matches for a single edge, we use only the highest-voted line from the Hough transform. We subsequently erase all pixels on the edge image corresponding to this line with a 3-pixel boundary radius. On the new edge image, we re-apply the Hough transform. We iterate through this method 4 times. The result on the edge image after 2 iterations can be seen in figure 6. As an alternative to the iterative approach, we have also tested an approach that aims to filter out multiple line detections per Post-it edge by filtering lines with similar rho and theta values. While this gave decent results (see experiments section), we found it to be less robust than our iterative erase-line approach.

3.2.5. Intersection finder. To find intersections, we use the lines' polar coordinates and solve the following equation:

\begin{pmatrix} \cos\theta_1 & \sin\theta_1 \\ \cos\theta_2 & \sin\theta_2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \rho_1 \\ \rho_2 \end{pmatrix}

3.3. Nearest group filter. Given the detected intersections, we find all intersections that lie within (or just outside of) the image. Of those, we filter for the 4 intersections in the closest group by using only the 4 intersections with minimum distance to the full group's geometric center (filtering out intersections at largest distance from the rest).

3.3.1. Validity decision. Given the resulting intersections, we perform several validations to verify whether or not the intersections correspond to a valid Post-it transformation. Specifically:

(1) There must be exactly 4 vertices
(2) There cannot be 3 vertices on a single line
(3) Opposing edges must be of similar length (near-affine transformation)

For the third rule, we apply a minimum to maximum line length range as such:

\mu = (l_1 + l_2)/2, \quad l_{min} = \mu(1 - \delta), \quad l_{max} = \mu(1 + \delta)

3.3.2. Find perspective. Subsequently, we find the projection matrix M that allows us to project pixels of the overlay image into the position of the Post-it: P_{trans} = M \cdot P_{overlay}. The projective matrix M has 8 degrees of freedom, and every matching pair of points gives us 2 equations, so we can find the matrix using 4 matching points. For the P_{overlay} coordinates (x & y), we use the 4 vertices of the square image we want to overlay. For the P_{trans} coordinates (x & y), we use the 4 vertices of the Post-it in the video frame. We can now solve the linear system (using Direct Linear Transformation):
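The following is a minimal sketch of the standard DLT solve for M from the four point pairs (the helper name is illustrative, not our exact implementation); in practice, OpenCV's cv2.getPerspectiveTransform(src, dst) computes the same matrix from exactly four pairs:

```python
import numpy as np

def find_perspective(p_overlay, p_trans):
    """Estimate the 3x3 projective matrix M with P_trans ~ M * P_overlay from
    4 (or more) point pairs via the standard DLT: two equations per pair,
    null space of the stacked system taken with an SVD."""
    A = []
    for (x, y), (u, v) in zip(p_overlay, p_trans):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    M = Vt[-1].reshape(3, 3)
    return M / M[2, 2]   # normalize so the bottom-right entry is 1
```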
…points are found) or rank deficient, we can use least-squares fitting to find the estimate.

3.5. Template Matching approach. In Template Matching, a small image is used as a template to see if a matching version can be detected in the larger image. This approach takes the template as the convolution mask and performs a convolution with the search image, sliding a window of the same size as the template over the search image. We then compare the pixel intensity difference between the search image in the window and the corresponding pixels in the template, and sum the differences over the window. The window with the lowest difference sum gives the best match.

3.5.1. Find the best matching window. Let us give a formal definition [15]. Suppose the coordinates (x_s, y_s) in the search image have intensity I_s(x_s, y_s) and the coordinates (x_t, y_t) in the template have intensity I_t(x_t, y_t). Define the absolute difference in pixel intensities as Diff(x_s, y_s, x_t, y_t) = |I_s(x_s, y_s) − I_t(x_t, y_t)|.

Define the Sum of Absolute Differences measure as

SAD(x, y) = \sum_{i=0}^{T_{row}} \sum_{j=0}^{T_{col}} Diff(x + i, y + j, i, j)

We loop over the entire search image and calculate the corresponding SAD(x, y). The pixel with the lowest SAD is the best match.

Note that Template Matching is not scale invariant, i.e. we do not know how big the object (Post-it) is in the search image (video frame). We therefore resize and rotate the original template to create a series of templates with different sizes and rotations, then perform a search with all those templates in the search frame and return the best match.

Figure 8. Edge image for Template Matching

In order to reduce noise and improve speed, we transform the image from RGB to gray scale and perform Canny edge detection to work only on the edges of the image (see figure 8). We use the OpenCV Python library for Template Matching. In the OpenCV library, the sum of differences can be calculated in different ways, and we will compare results.

3.5.2. Find the rotation and scaling matrix. In this case, we can only find the rotation and scaling of the matched window instead of the full perspective transformation. As we iterate through different scales s and rotation angles θ, the corresponding transformation matrix is:

P' = s R(\theta) P + p

where p is the location of the upper left corner of the matched window box.

3.6. Overlay. Once we have the applicable homography from either of the above methods, we apply it to consecutive frames of an animation to overlay the animation onto the video (as described in the Color-Shape section). We remove pixels with zero alpha values to allow for basic transparency (allowing us to overlay e.g. a spherical globe instead of only rectangular pictures).

4. Experiments and Results

4.1. Experiments using Color-Shape.

4.1.1. Hough transform parameters. To find the best rho, theta and threshold values for the Hough transform, we have done various experiments as shown in table 1.

Threshold        15     30     15     15     15     15
Rho              1      1      2      5      1      1
Theta            π/45   π/45   π/45   π/45   π/90   π/22.5
Speed (µ, ms)    4      1.8    3.6    3.6    6.1    3.7
Speed (σ, ms)    3.2    1.2    2.4    2.8    5.3    2.5
Accuracy         High   High   Mid    Mid    High   Mid
Detection rate   High   Low    High   Mid    High   High

Table 1. Hough transform parameter results

Per these results, we have chosen 1, π/45 and 15 as our optimal rho, theta and threshold respectively.
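A minimal sketch of the iterative Hough-and-erase step with these chosen parameters; the erase radius and iteration count follow section 3.2.4, but the helper itself is illustrative rather than our exact implementation:

```python
import cv2
import numpy as np

def top_four_lines(edge_img, iterations=4):
    """Repeatedly take the highest-voted Hough line, then erase its pixels
    (with roughly a 3-pixel boundary) from the edge image before the next pass."""
    edges = edge_img.copy()
    lines = []
    for _ in range(iterations):
        found = cv2.HoughLines(edges, rho=1, theta=np.pi / 45, threshold=15)
        if found is None:
            break
        rho, theta = found[0][0]          # lines come back ordered by accumulator votes
        lines.append((rho, theta))
        # Erase the line from the edge image so the next iteration finds a new edge.
        a, b = np.cos(theta), np.sin(theta)
        x0, y0 = a * rho, b * rho
        p1 = (int(x0 + 2000 * (-b)), int(y0 + 2000 * a))
        p2 = (int(x0 - 2000 * (-b)), int(y0 - 2000 * a))
        cv2.line(edges, p1, p2, color=0, thickness=7)
    return lines
```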
4.1.2. Speed of Color-Shape. An analysis of execution time for the various elements of the Color-Shape method can be seen in figure 9. Total execution time is well within our target range (below 33 ms) at 12 ms mean.

We find that the iterative Hough transform takes the most computation time. This triggered us to also design a non-iterative method for finding the four most promising lines using only one Hough transform. This alternative method filters lines with similar theta and rho (as described in the Hough transform section of the technical approach). While the method saved 3 ms of mean execution time, we found it to be less robust than the iterative approach (0.91 accuracy vs. 0.98 baseline). Given total execution time is already low, we decide to trade execution time for higher robustness.

Figure 10. Colored image with colored reference

We can see from figure 10 that the colored Post-it picture contains too many features and key points, making the matching process slow and noisy. The Post-it is detected inaccurately.

4.2.2. Gray scale reference and colored video frame. In this experiment, we use a gray scale plain Post-it as reference. Different from the above, the key points in the reference image are greatly reduced and the matching is more accurate. However, the detection is always at the corner of the Post-it and the results still occasionally contain false positives. The result can be seen in Figure 11.

(a) Match at corner (b) False positive (c) False positive (d) Key point matching

Figure 12. Colored image with rectangle

The value within the rectangle is 255 and 0 outside. This time, the features and key points from the reference image are far fewer. From the results in Figure 12, we can see that it can successfully find matches between the reference rectangle and the video frame. However, we still see some false positives.
4.2.4. Post-it with pattern. In this approach, instead of using a plain Post-it, we use a Post-it with a pattern drawn on it. For simplicity, we use gray scale for both reference and video frames. Figure 13 shows a result. Here the matching key points are within the Post-it instead of at the corners as in the previous cases. The lack of features on plain Post-its gives few interesting matching key points within the center of the Post-it. A patterned Post-it, however, has richer content and gradient variation within the area of the Post-it, making it easier to match local features inside.

…the method described above where we only accept a match if the closest distance is within 70% of the second-closest distance. We can see from figure 14 that when the reference is simple, e.g. a binary rectangle, the approach without the distance ratio gives more matches within the Post-it. We can see from the image that there are two key points within the Post-it (and the binary rectangle). Those two key points do not have much difference in terms of color intensity or texture. The gradients around those two key points are very similar, so they might have similar distances to the same key point within the reference. This is an example where the distance ratio can introduce false negatives.

However, when the reference image has richer features, i.e. with the patterned Post-it, the first approach generates many more false positives than the second approach.
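For reference, a minimal sketch of this kind of ratio-test matching with OpenCV's SIFT bindings (in OpenCV 3.x SIFT lives in the contrib module as cv2.xfeatures2d.SIFT_create(); newer releases expose cv2.SIFT_create()); the 0.7 ratio mirrors the 70% threshold above:

```python
import cv2

def ratio_test_matches(reference_gray, frame_gray, ratio=0.7):
    """Match SIFT descriptors and keep a match only when the best distance is
    within `ratio` of the second-best distance (Lowe's ratio test)."""
    make_sift = getattr(cv2, "SIFT_create", None) or cv2.xfeatures2d.SIFT_create
    sift = make_sift()
    _, desc_ref = sift.detectAndCompute(reference_gray, None)
    kp_frame, desc_frame = sift.detectAndCompute(frame_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    candidates = matcher.knnMatch(desc_ref, desc_frame, k=2)
    good = [pair[0] for pair in candidates
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return good, kp_frame
```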
…deficient when insufficient key points are matched. For Template Matching, we can estimate a similarity matrix but cannot estimate the full perspective transformation.

SIFT and Template Matching are designed for general, rich pattern detection, while the Color-Shape method we use is designed specifically for our Post-it use case. SIFT and Template Matching can be used for a much broader range of use cases, whereas the Color-Shape method is limited in scope to Post-it detection (and highly similar use cases).

5. Conclusions

We find that typical methods such as SIFT and Template Matching do not sufficiently meet our goals, as they do not provide accurate pose finding for the plain-faced Post-it and require too much computation time.

We find the manually designed Color-Shape approach, exploiting the saturated color and square shape properties of the Post-it, to work well and meet our goals. The full Color-Shape method takes an average of 12 milliseconds to execute (with 4.5 milliseconds standard deviation) on a 3-year-old MacBook Pro laptop, indicating the maximum 33 milliseconds goal should be achievable on a wide range of laptops. We achieve high-accuracy projective pose estimation with reasonable robustness to noise and lighting variation.

6. Future Work

A number of improvements could be considered for future work:
• Interest region: the Color-Shape approach could be made significantly faster by searching only in an interest region derived from the previous location of the Post-it
• Multiple Post-its: the Color-Shape approach could be generalized to find multiple Post-its in one image
• Dynamic color filter: the HSV color filter could be made more accurate by dynamically adapting parameters to the scene
• Wearables support: the source code could be ported to relevant augmented reality hardware such as smartglasses

References

[1] Canny, J. A Computational Approach to Edge Detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(6), pp. 679-698, 1986.
[2] Daniel Wagner, Gerhard Reitmayr, Alessandro Mulloni, Tom Drummond, Dieter Schmalstieg. Real-time detection and tracking for augmented reality on mobile phones. IEEE Transactions on Visualization and Computer Graphics (Volume: 16, Issue: 3), 2009.
[3] Charles A. Poynton. Frequently Asked Questions about Colour. Available at https://engineering.purdue.edu/~bouman/info/Color-FAQ.pdf
[4] Fernandes, Leandro A. F., and Manuel M. Oliveira. Real-time line detection through an improved Hough transform voting scheme. Pattern Recognition 41.1, 2008.
[5] Fleck, Margaret M., David A. Forsyth, and Chris Bregler. Finding naked people. Computer Vision - ECCV'96. Springer Berlin Heidelberg, 1996.
[6] Fleyeh, Hasan. Color detection and segmentation for road and traffic signs. Cybernetics and Intelligent Systems, 2004 IEEE Conference on. Vol. 2. IEEE, 2004.
[7] Lee, Jae Y., and Suk I. Yoo. An elliptical boundary model for skin color detection. Proc. of the 2002 International Conference on Imaging Science, Systems, and Technology, 2002.
[8] Lowe, David G. Object recognition from local scale-invariant features. Proceedings of the International Conference on Computer Vision, pp. 1150-1157. doi:10.1109/ICCV.1999.790410, 1999.
[9] M. Fiala. Designing Highly Reliable Fiducial Markers. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 7.
[10] Maini, R. et al. Study and Comparison of Various Image Edge Detection Techniques. International Journal of Image Processing (IJIP), Volume (3): Issue (1), 2009.
[11] Matas, J. et al. Robust Detection of Lines Using the Progressive Probabilistic Hough Transform. CVIU 78 1, pp. 119-137, 2000.
[12] Nipat Thiengtham and Yingyos Sriboonruang. Improve Template Matching Method in Mobile Augmented Reality for Thai Alphabet Learning. International Journal of Smart Home Vol. 6, No. 3, July, 2012.
[13] P. Kakumanu, S. Makrogiannis, and N. Bourbakis. A survey of skin-color modeling and detection methods. Pattern Recogn. 40, 3, March 2007.
[14] Palmer, Phil L., Josef Kittler, and Maria Petrou. An optimizing line finder using a Hough transform algorithm. Computer Vision and Image Understanding 67.1, 1997.
[15] Roberto, B. Template Matching techniques in computer vision: theory and practice. 2009.
[16] S. Garrido-Jurado, R. Muñoz-Salinas, F. J. Madrid-Cuevas, M. J. Marín-Jiménez. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, Volume 47, Issue 6, June 2014.
[17] Schumeyer, Richard P., and Kenneth E. Barner. Color-based classifier for region identification in video. Photonics West'98 Electronic Imaging. International Society for Optics and Photonics, 1998.
[18] Singh, Chandan, and Nitin Bhatia. A Fast Decision Technique for Hierarchical Hough Transform for Line Detection. arXiv preprint arXiv:1007.0547, 2010.
[19] Van Ginkel, Michael, C. L. Luengo Hendriks, and Lucas J. van Vliet. A short introduction to the Radon and Hough transforms and how they relate to each other. Delft University of Technology, 2004.
Augmenting Videos with 3D Objects
• Theia SFM [3]
• Bundler [11]
• Visual SFM [10]
• OpenCV SFM [9]

We were looking for a few things from the libraries:

• Camera position estimation
• Camera parameter estimation
• Reliable and accurate sparse 3D reconstruction

For this project, it was not our goal to try to improve or optimize any of these libraries. We tried a few of them and picked the one that was easiest to use. In our case it was the Theia SFM library.

3.2. Estimating Depth Map

A critical part of solving our problem was obtaining accurate and dense depth maps for each frame of the video. There were a number of techniques that we considered:

• Reconstructing 3D objects using volumetric stereo and using these reconstructions to obtain depth maps [12]
• Using a combination of the original images and the sparse 3D points obtained from SFM to approximate the 3D surface positions (using segmentation and planar reconstruction)
• Using a combination of SFM and stereo matching algorithms to obtain a dense 3D reconstruction of the scene [16]

While researching volumetric stereo, we found that it was generally used to get a 3D reconstruction of a single object within a scene. For our purposes, we needed information about the full scene. To extend this algorithm to work on a full scene, we would have needed very reliable image segmentation algorithms that were deterministic between frames. We were not able to find anything that looked promising in this space, so we abandoned this idea.

We also considered using the sparse 3D points obtained from SFM to approximate a dense 3D reconstruction. We thought about partitioning the original image into uniform segments. We would then approximate each segment as a plane in 3D and use the sparse 3D points to estimate these planes. Unfortunately, we found that we did not have enough points in each image segment to do a planar reconstruction, so we could not use this approach by itself.

Lastly, we found a lot of research about using stereo-matching algorithms to aid in dense 3D reconstruction [17]. Specifically, the Middlebury website [13] contains many submissions and evaluations of stereo-matching algorithms. We ended up pursuing this approach the most because the research here showed the most promising results.

To solve our problem, however, we do not need a full 3D reconstruction of the scene. We just need an approximate depth map that has good accuracy around the object boundaries. What we found while using just stereo-matching algorithms was that they were prone to noise. To overcome this problem and achieve the desired results, we propose a novel approach that uses a combination of image segmentation techniques, stereo-matching, and planar interpolation.

4. Technical Details

Below we describe our solution. We talk about how we calibrate our camera and run SFM. We then describe our method for getting accurate depth maps. Lastly, we describe how we project 3D objects back into the scene.

4.1. Sparse 3D Reconstruction and Camera Matrix Estimation

As per standard practice in the camera model used in computer vision, there are 2 parameters:

• Intrinsic matrix K: a 3x3 matrix which incorporates the focal length and camera center coordinates.
• Extrinsic matrix [R T]: a 3x4 matrix which maps world coordinates to camera coordinates. R denotes rotation and T denotes translation.

The camera transformation is given by the matrix

M = K[R \; T]

It transforms a point in homogeneous world coordinates to homogeneous image coordinates.

The point correspondence problem is defined as follows: given n images, find points in the images which correspond to the same 3D point. There are several well-known algorithms which work reasonably well in practice, for example SIFT, SURF and DAISY.

The Structure from Motion (SFM) problem is defined as follows: given m images and n point correspondences, find m camera matrices (M) and n 3D points. Solving the SFM problem for a set of images gives us a sparse 3D reconstruction of the scene.

To get the intrinsic parameters for our camera we tried a few different approaches:
• Computing K using single view metrology with 3 vanishing points derived from 3 pairs of mutually orthogonal lines in 3D
• Using a checkerboard image to calibrate using OpenCV routines
• Allowing Structure From Motion (SFM) algorithms to self-calibrate (which is possible given enough viewpoints of a static scene)

Each of these approaches gave us a similar K, so we decided to go with the self-calibration method since it is automatic and we have plenty of views.

4.2. Depth Map Estimation

A sparse reconstruction is not enough to get a full depth map for each frame. Below we propose a novel approach of using a combination of stereo-matching, image segmentation (using the watershed algorithm), and planar interpolation to get dense 3D depth maps for each frame.

4.2.1 Terminology

3. Image rectification. A transformation to project two images onto a single image plane, as seen in figure 5. After rectification, all epipolar lines are parallel in the horizontal axis, and all corresponding points have identical vertical coordinates.
4.2.4 From 3D points to a depth map

So far, we are able to obtain 3D points from a disparity map. These points are not in world coordinates though, so we need to transform them to world coordinates before generating depth maps.

Let x be the point in world coordinates. Let p be the point in the original image. We rectify the image for stereo matching. Rectification is a homographic transform; let H be the inverse of this transform. K, R and T are camera parameters.

p = KRx + KT

We get p by applying the rectification transform H to p_r,

p = H p_r

Using these equations, we can derive the equation for point x in original world coordinates:

x = R^{-1} K^{-1} H K_r R_r x_r + R^{-1} K^{-1} H K_r T_r - R^{-1} T

We reproject these points back into the original frame and compute the depth for each pixel. Figure 9 is a depth map for the frame from which we generated the disparity map.

4.2.5 Missing depth maps

We don't have a disparity map for every frame. Frames where the camera is moving forward, for example, don't have a good corresponding stereo frame; rectification between such views introduces too much distortion.

As such, we need to be able to reconstruct a depth map for any frame, using a depth map generated from some other frame. Figure 10 is an example of a depth map viewed from a different camera. There are far more missing depths, which are mostly due to occlusions. To help fill in the rest of this depth map, we use a novel technique which combines image segmentation and planar reconstruction, as described in the subsections below.

4.2.6 Image segmentation

We use the marker-controlled watershed algorithm for image segmentation. The watershed algorithm is based on the concept of flooding the image from its minima and preventing the merging of water coming from different sources. This partitions the image into 2 parts: the catchment basins and the watershed lines. This approach results in over-segmentation, so we use a variant which is based on starting to flood from a set of markers.
We compute the markers using an adaptive threshold, a morphological opening, and a distance transform, which gives us a candidate set of markers. Using this, we can apply the watershed algorithm to segment the images. The main parameters are:

• Window size of the adaptive threshold
• Kernel for the morphological opening
• Threshold for the distance transform

Figure 11 is an example of a segmented image.

4.2.7 Planar interpolation

We have now divided the image into several segments. We assume that each segment is part of a plane. We have the depth map for each 2D image point, which means we have a 3D point corresponding to each 2D point. The camera matrix K has the following structure:

K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}

For each 2D point, we can compute the corresponding 3D point, where z is the depth of this point from the depth map:

p = (z (p_y - c_y)/f_y, \; z (p_x - c_x)/f_x, \; z)

We use this set of points to estimate a plane using an SVD decomposition for the linear system Ax = t. This gives us a plane.

This plane gives us the depth for every point on it, irrespective of whether we had a depth for it previously from the stereo matching algorithms. This is how we fill in holes in the depth map. We can now trace a ray which starts from the camera and hits the approximate plane; the length of this line segment is the depth of this image point.

With the above, we get a depth map that looks something like:

Figure 12: Planar interpolation of image segments
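A minimal sketch of this per-segment plane fit, following the back-projection formula above; the plane is written as a·X1 + b·X2 + c·X3 = 1 and solved by least squares (NumPy's lstsq, which uses an SVD internally). The helper names are ours:

```python
import numpy as np

def fit_segment_plane(px, py, z, fx, fy, cx, cy):
    """Back-project the pixels of one segment using the formula above and fit a
    plane a*X1 + b*X2 + c*X3 = 1 to the resulting 3D points by least squares."""
    X1 = z * (py - cy) / fy
    X2 = z * (px - cx) / fx
    A = np.column_stack([X1, X2, z])               # one 3D point per row
    t = np.ones(len(z))
    plane, *_ = np.linalg.lstsq(A, t, rcond=None)  # solves A x = t
    return plane                                   # (a, b, c)

def plane_depth(plane, px, py, fx, fy, cx, cy):
    """Depth implied by the fitted plane at any pixel of the segment, obtained by
    substituting the back-projection into the plane equation and solving for z."""
    a, b, c = plane
    return 1.0 / (a * (py - cy) / fy + b * (px - cx) / fx + c)
```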
4.3. Augmenting Video with 3D Objects
We describe the quality metrics we use for evaluating the performance of various stereo correspondence algorithms, and the techniques we used for acquiring our image data sets and ground truth estimates [18].

1. RMS (root-mean-squared) error, measured in disparity units, between the computed disparity map d_C(x, y) and the ground truth map d_T(x, y), i.e.,

R = ( (1/N) * Σ_{(x,y)} |d_C(x, y) - d_T(x, y)|^2 )^{1/2}

where N is the total number of pixels.

2. Percentage of bad matching pixels,

B = (1/N) * Σ_{(x,y)} ( |d_C(x, y) - d_T(x, y)| > δ_d )

where δ_d is the disparity error tolerance.
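As an illustration only (not the authors' evaluation script), both metrics can be computed in a few lines of numpy; delta_d is the bad-pixel tolerance in disparity units.

import numpy as np

def rms_error(d_computed, d_truth):
    # Root-mean-squared disparity error over all pixels
    diff = d_computed.astype(float) - d_truth.astype(float)
    return np.sqrt(np.mean(diff ** 2))

def bad_pixel_ratio(d_computed, d_truth, delta_d=1.0):
    # Fraction of pixels whose disparity error exceeds the tolerance delta_d
    diff = np.abs(d_computed.astype(float) - d_truth.astype(float))
    return np.mean(diff > delta_d)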
Sometimes refined disparity maps may not be better than the original disparity map because of bad image segmentation. Figure 18b is a better approximation of Figure 18a than Figure 18c is; in Figure 18c, plane fitting on a bad image segmentation results in a distorted disparity map. So it is important to control the quality of the image segmentation to ensure that the refined disparity maps are better.

We compute disparity maps for 5 test images in the Middlebury evaluation dataset using our method as well as predefined algorithms from Middlebury. In the following table we report mean values of RMS error and percentage of bad pixels. normal-SGBM refers to our SGBM implementation of the disparity map; planar-SGBM refers to the filtered disparity map generated by fitting planes using image segmentation. [18] describes the other algorithms used for encoding. As seen in the table, planar-SGBM performs better than normal-SGBM in both metrics defined earlier.
(a) Image to be evaluated from the Middlebury dataset. (b) Ground truth disparity map. (e) Filtered disparity map combining SGBM and plane fitting on image segmentation. (f) Filtered disparity map combining SGBM and aggressive plane fitting on image segmentation.
Algorithm       Mean RMS error   Mean bad pixel ratio
SSD09bt05       1.559039         0.024049
SSD09t20        1.714662         0.030914
SADmf09bt05     1.733064         0.031358
SADmf09t02      2.4118           0.051564
SAD09t02        2.565821         0.058292
SADmf09t01      3.019650         0.084793
planar-SGBM     3.177850         0.071215
normal-SGBM     3.200869         0.072295
SAD09t01        3.717277         0.135821

7. Conclusion

In this paper, we proposed a system that takes a 3D object mesh and a video, and augments that video with the object. The system is able to estimate camera positions and generate depth maps for each frame (to support occlusions).

We used the Theia SFM library to estimate camera positions, and proposed a novel method to estimate depths in each frame. To estimate depths, we used a combination of image segmentation techniques (the watershed algorithm), stereo matching (SGBM), and planar interpolation. Compared to stereo matching alone, the combination of these techniques allowed us to improve depth map accuracy while at the same time significantly reducing noise and improving sharpness around object boundaries.
References

[1] Lee, J. C., and R. Dugan. "Google Project Tango."
[2] Perry, Simon. "Wikitude: Android app with augmented reality: Mind blowing." digital-lifestyles.info 23.10 (2008).
[3] Sweeney, Christopher, Tobias Hollerer, and Matthew Turk. "Theia: A Fast and Scalable Structure-from-Motion Library." Proceedings of the 23rd Annual ACM Conference on Multimedia. ACM, 2015.
[4] Bandara, Ravimal. "Image Segmentation using Unsupervised Watershed Algorithm with an Over-segmentation Reduction Technique."
[5] Tola, Engin, Vincent Lepetit, and Pascal Fua. "A fast local descriptor for dense matching." Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
[6] Mur-Artal, Raul, J. M. M. Montiel, and Juan D. Tardos. "ORB-SLAM: a versatile and accurate monocular SLAM system." Robotics, IEEE Transactions on 31.5 (2015): 1147-1163.
[7] Furukawa, Yasutaka, and Jean Ponce. "Accurate, dense, and robust multiview stereopsis." Pattern Analysis and Machine Intelligence, IEEE Transactions on 32.8 (2010): 1362-1376.
[14] Canny, John. "A computational approach to edge detection." Pattern Analysis and Machine Intelligence, IEEE Transactions on 6 (1986): 679-698.
[15] Haris, Kostas, et al. "Hybrid image segmentation using watersheds and fast region merging." Image Processing, IEEE Transactions on 7.12 (1998): 1684-1699.
[16] Pollefeys, Marc, Reinhard Koch, and Luc Van Gool. "A simple and efficient rectification method for general motion." Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on. Vol. 1. IEEE, 1999.
[17] Pollefeys, Marc, et al. "Metric 3D surface reconstruction from uncalibrated image sequences." 3D Structure from Multiple Images of Large-Scale Environments. Springer Berlin Heidelberg, 1998. 139-154.
[18] Scharstein, Daniel, and Richard Szeliski. "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms." International Journal of Computer Vision 47.1-3 (2002): 7-42.
[19] Pollefeys, Marc, Reinhard Koch, and Luc Van Gool. "A simple and efficient rectification method for general motion." Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on. Vol. 1. IEEE, 1999.
Classroom Data Collection and Analysis using Computer Vision
Jiang Han
Department of Electrical Engineering
Stanford University
Gender      Male     Female
Training    9,993    10,992
Testing     3,040    2,967

Table 1. Number of training and testing images for gender classification.
2. Problem Statement
The main framework of the system design is shown in Fig.1, which includes four modules to process the input image. The image is first transformed to a grayscale image, since color information is not that important in this classification problem. Then face detection is applied to the image to locate the positions of all human faces. Inside each face box, the gender classification and emotion analysis engines run to generate the corresponding labels. As Fig.1 shows, Module-3 and Module-4 share a very similar core, which includes:

• Image rescale: The subimage inside the face box needs to be rescaled for three reasons: (1) it makes it much easier to generate a consistent feature dimension later on; (2) the source image data sets for training and testing of the gender analysis differed in scale; (3) the face boxes derived from Module-1 may differ in size because faces differ in size.

• Feature extraction: After rescaling, this step generates features with a consistent dimension. Feature extraction can use algorithms like Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), Bag of Words (BoW), etc.

• Model training/classification: Based on the feature vectors output from feature extraction, we are able to train the classifier on the input training images. The training step may take a long time due to the data size, but once the model is trained, we are able to use it to classify testing images directly.

From Fig.1 we can see that each module may have multiple candidate algorithms, and designing a system with a good trade-off between performance and complexity is one of the project's targets.

Figure 3. Samples from the emotion data set (left to right: Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral).

3. Technical Content

3.1. Data set and initial analysis

The data sets used for gender classification and emotion analysis are shown in Tables 1 and 2.

The data set for gender classification is extracted partly from the Images of Groups (IoG) data set [10]; the original data set includes more properties for each person, such as face position, eye position, age, gender, and pose. In this project, we only care about the gender property. Thus I split the data into four folders: "Male Training", "Female Training", "Male Testing" and "Female Testing". From Table 1 we can see that the numbers of male and female images are roughly balanced, which helps the model train well. With this split, we can use the Matlab command imageSet to easily load the corresponding images. The total number of images used for both training and testing is 26,992. One thing I noticed is that the source images for gender are not all the same size. This is one of the reasons why we add "Image Rescale" before the feature extraction step, to turn the input into a 48 by 48 gray image.
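The project's preprocessing is written in Matlab. Purely as an illustration of the "Image Rescale" step described above (not the author's code), the equivalent operation in Python with OpenCV is just a grayscale conversion and a resize of each detected face box to 48 by 48:

import cv2

def preprocess_face(image_bgr, box, size=48):
    # box = (x, y, w, h) from the face detector
    x, y, w, h = box
    face = image_bgr[y:y + h, x:x + w]
    gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
    # A fixed 48x48 output keeps the feature dimension consistent downstream
    return cv2.resize(gray, (size, size))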
Fig.2 shows six gender sample images, including three males and three females. Also note that this data set includes people of different races and ages.

Figure 4. Bar chart of the emotion data set.

Furthermore, Table 2 shows the data set for emotion analysis. The data set is from ICML [11] and has 7 categories of emotions: Angry, Disgust, Fear, Happy, Sad, Surprise and Neutral (both for training and testing). Fig.3 shows samples of the seven types of emotions. We notice that this data set includes emotions from people of different genders and ages, which is good for training a robust model. However, human emotion is very complicated and vague; for example, the "Surprise" image in Fig.3 may also be treated as "Angry" in reality. Furthermore, Fig.4 shows a bar chart of the number of images per emotion type (including both training and testing). Notice that this data set is mostly balanced, except that the "Disgust" type has a significantly lower count than the other categories, which explains why the "Disgust" class has the lowest F1 score in later testing.

3.2. Evaluation metric

To evaluate the classification performance, accuracy (ACC) [12] is used as the metric, defined as the fraction of correctly classified samples, ACC = (tp + tn) / (tp + tn + fp + fn). Here, precision is defined as true positives (tp) over tp plus false positives (fp), and recall is defined as tp over tp plus false negatives (fn).

3.3. Face Detection

For face detection, I initially used a method similar to the one from our problem set 3, with HOG + SVM + a sliding window. I also tried the following dynamic boxing method to adjust the window size to the face scale:

size = minSize + 2^N × step    (3)

Here the size value is constrained by size ≤ maxSize, N counts the window expansions, and step is a value that controls the expansion speed. The advantage of this strategy is that, going from minSize to maxSize, we need at most N = ⌈log2((maxSize − minSize) / step)⌉ expansions, and the expansion step is smaller while the window is still small. Hence, instead of expanding in linear time and applying the SVM at each size, the cost is at the O(log) level; eventually we choose the face window size with the largest prediction score (only if this score exceeds the SVM threshold).

However, testing showed that the sliding-window scheme is quite slow because we need to run the SVM many times. Considering that face detection is not the main part of this project, I switched to the Matlab built-in vision.CascadeObjectDetector() to do the face detection, which uses the Viola-Jones object detection framework. The Viola-Jones algorithm is much faster and is good at detecting faces at different scales [3].

3.4. Gender Classification and Emotion Analysis

I put gender classification and emotion analysis in the same subsection since, from Fig.1, the two modules share very similar internal blocks. Therefore, in this subsection I introduce the following three blocks: image rescale, feature extraction, and training/classification.
Image rescaling is necessary for both training and classification. For the training set, each image for emotion analysis was originally given as 48 by 48 grayscale, which is fine to use directly. But the gender classification data came as RGB images of different sizes, so rescaling is necessary to turn the training images into a fixed-size grayscale format. This makes it much easier for the feature extraction step to produce feature vectors with a consistent dimension.

For the classification step, there may be multiple faces marked in the original image, each with a different box size. Thus we convert RGB to grayscale and rescale to 48 by 48 before applying the classifier. Fig.5 shows an example of classification rescaling inside a box.

3.4.2 Feature extraction (BoW)

Note: in order to save space, for the feature extraction part from 3.4.2 to 3.4.5, the table ACC values are based on testing of gender classification using the SVM. However, the emotion analysis data gives similar conclusions.

For feature extraction, I started with bag of words. Matlab provides embedded functions such as "bagOfFeatures", "trainImageCategoryClassifier", and "imageCategoryClassifier". The default feature vector is based on SURF; I also tried dense SIFT features. Based on the feature vectors extracted from each image, K-means is applied to the feature space over the entire training set. Here, K defines the vocabulary size for the histogram. Eventually the image feature is defined as a histogram of the nearest-cluster-center distribution for every image.

Table 3 shows the ACC of gender classification with both SURF and dense SIFT features. Comparing with the later ACC of the SVM, we can see that this performance is even slightly worse than Naive Bayes. This seems reasonable to me, since the testing objects are all faces with the same structures (eyes, nose, mouth, etc.), and clustering those features may lose some details. The scenario of gender classification and emotion analysis is different from the scenarios where bag of words is most often used (for example, object classification of cups, ships, etc.). In addition, BoW gave very slow training, with around 3 million feature vectors to be clustered. Thus BoW was not selected after testing.

Feature        SURF     Dense SIFT
BoW Test ACC   0.7137   0.7326

Table 3. BoW ACC performance of gender classification on SURF and dense SIFT features (vocabulary size is 300).

3.4.3 Feature extraction (HOG and LBP)

HOG and LBP were tested after BoW. HOG is a very well-known feature descriptor in computer vision [14], which accumulates local gradient information. There are several parameters we can tune for HOG, such as "Cell Size", "Block Size", "Block Overlap", and "Number of Bins". In my tests, I kept the default values of "Block Overlap" and "Number of Bins" since those are the typical settings. "Cell Size" and "Block Size" are the more important parameters, which control the feature vector size and the testing performance. Here, "Cell Size" defines the box over which the histogram of gradients is calculated; a smaller cell size gives a better chance of capturing small-scale details, while increasing the cell size captures large-scale spatial information. "Block Size" defines the number of cells inside a block; a smaller block size may reduce the influence of illumination changes on the HOG features [15].

Paras          Cell:8,8 Block:2,2   Cell:4,4 Block:2,2
Feature Size   900                  4356
HOG ACC        0.8369               0.8500
Paras          Cell:8,8 Block:3,3   Cell:16,16 Block:2,2
Feature Size   1296                 144
HOG ACC        0.8337               0.7718

Table 4. Gender classification HOG ACC performance with various cell/block settings.

Paras   Cell:8,8   Cell:12,12   Cell:16,16   Cell:24,24
Size    2124       944          531          236
ACC     0.8638     0.8390       0.8274       0.7864

Table 5. Gender classification LBP ACC performance with various cell settings.

In addition to HOG, LBP is another feature I found to be very useful; its performance is no worse than HOG's. The principle of LBP is different from HOG: instead of using gradient information, LBP compares each pixel's value with its neighbors and constructs the histogram from the binary comparison results. LBP can easily be extended to a rotation-invariant version [16].

Table 4 and Table 5 show the HOG and LBP testing ACC with different feature dimensions for gender classification (the emotion analysis data shows a similar trend, so it is not listed here). By setting the cell/block size we are able to obtain different feature dimensions. Unsurprisingly, a higher feature dimension gives better ACC, but it may also slow down the system significantly due to the increased complexity for the machine learning models.
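The features above were extracted in Matlab. Only as an illustrative sketch (not the author's code), the HOG/LBP extraction and concatenation described in 3.4.3 and 3.4.4 could be written in Python with scikit-image roughly as follows; the cell and block values mirror the settings in Tables 4-6, while the 10-bin LBP histogram is a simplification (Matlab's 59-bin uniform LBP is what produces the 944-dimensional vector reported in Table 5).

import numpy as np
from skimage.feature import hog, local_binary_pattern

def hog_lbp_features(gray48):
    # HOG over 8x8 cells and 2x2 blocks (the 900-dim setting of Table 4)
    h = hog(gray48, orientations=9, pixels_per_cell=(8, 8),
            cells_per_block=(2, 2), feature_vector=True)
    # LBP histogram per 12x12 cell (cell size as in Table 5), simplified to 10 bins
    lbp = local_binary_pattern(gray48, P=8, R=1, method='uniform')
    cells = []
    for y in range(0, 48, 12):
        for x in range(0, 48, 12):
            block = lbp[y:y + 12, x:x + 12]
            hist, _ = np.histogram(block, bins=10, range=(0, 10), density=True)
            cells.append(hist)
    # Joined HOG + LBP vector (the combination that gave the best ACC in Table 6)
    return np.concatenate([h] + cells)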
3.4.4 HOG and LBP feature combination

To get the best trade-off between performance and speed, and based on the research finding that a combination of HOG and LBP features can improve detection performance [17], I joined the HOG feature (Cell: 8,8, Block: 2,2, dimension 900) with the LBP feature (Cell: 12,12, dimension 944) to boost the ACC. Table 6 shows the result for the combined feature.

Paras          HOG      LBP      Joined
Feature size   900      944      1844
Test ACC       0.8369   0.8390   0.8673

Table 6. LBP and HOG feature combination result.

Table 6 shows the ACC performance of the combination of HOG and LBP. The original feature dimensions for HOG and LBP were both around 900, with ACC performance between 0.83 and 0.84. By joining HOG and LBP together, we obtain a significant ACC boost of about 0.03. Even though the feature dimension doubles after the combination, this performance is still higher than HOG or LBP alone at a similar size. Because LBP and HOG use different principles to construct features, this kind of combination achieves a diversity gain.

3.4.5 Feature dimension reduction

Since feature size matters for system speed, a reasonable pruning or feature dimension reduction is very helpful. Especially in a real-time system, we would rather lose a little performance to give users a smoother experience.

The way I reduced the feature dimension was to extract HOG/LBP features only from areas around key-points. Initially I tried several well-known key-point detection methods:

Harris: detects corners using the Harris-Stephens algorithm.
SURF: detects blobs using Speeded-Up Robust Features.
MSER: detects regions using Maximally Stable Extremal Regions.

Figure 6. Key-point detection result (each person, left to right: Harris, SURF, MSER, MSER Region).

Fig.6 shows key-point detection results using Harris, SURF, and MSER. These detection methods return a different number of points for different images. Since we are not using BoW, we need to construct a consistent feature dimension. I tried different ways to do this, such as running K-means or selecting the strongest K points out of the N key-points. However, the testing results showed a significant ACC loss compared to the sliding-window scheme.

Therefore, instead of using corner or blob key-point detections, which mostly return different physical positions in each image, I used a fixed key-point feature extraction, which extracts a fixed number of key-points on the face. It turns out that this method reduces the feature dimension with only a small performance loss.

Specifically, we use the Matlab CascadeObjectDetector system object to detect the nose on the face. If the object returns the nose position successfully, we select K points evenly spaced around a circle (with a predefined radius) centered on the nose. If the nose is not detected (CascadeObjectDetector may fail on the 48 by 48 low-resolution images), we simply use the center of the image as the circle center. Here, K can be set to different values to get the best trade-off between ACC performance and complexity.

Figure 7. Circle key-point detection (circle center at the nose or the image center, K = 5 and K = 10).

In Fig.7, we show the results of circle-based key-point detection. The Matlab CascadeObjectDetector is able to return the nose position in the two left images, but fails on the two images on the right (in which case we use the image center directly). Detections for K = 5 and K = 10 are shown. After this, we extract HOG/LBP features only around those key-points.

K              5       10      15      20
Feature size   20%     40%     60%     80%
ACC Loss       5.6%    2.68%   1.41%   1.16%

Table 7. Key-point based feature reduction performance.

Table 7 shows the testing results for different K values. We can see that with only 40% of the feature dimension, we lose only 2.68% of the ACC performance. For real-time systems, one might well sacrifice this 2.68% of ACC to get a smoother user experience.

3.4.6 Model Training and Classification

After the feature extraction method is selected, we can now test different machine learning classifiers. In this project, I tried different models including Naive Bayes (NB), K Nearest Neighbors (KNN), Random Forest, and Support Vector Machine (SVM). Also note that, to get better system speed, I used the same features for gender classification and emotion analysis. However, we still need to choose the most suitable learner for each of the two tasks.

Naive Bayes: NB is based on Bayes' rule to calculate the probability of each class. NB also naively assumes the independence of the features.

K Nearest Neighbors: KNN takes the majority vote of the labels of the K nearest neighbors.
The KNN distance can be calculated with different metrics, such as Euclidean distance, Hamming distance, etc. (the K value can be tuned for KNN).

Random Forest: a random forest is an ensemble model based on decision trees, where a decision tree trains and tests based on attribute splits and labels with its leaf nodes. (The tree number can be tuned.)

Support Vector Machine: a well-known method that splits the samples while maximizing the minimum margin. Matlab also provides the ClassificationECOC classifier to support multi-class classification with the SVM. (The C value, which controls overfitting, can be tuned, and different kernels can be tried.)

To select the best model, we need to run and tune each of the classifiers. Note that for a random guess, gender classification would have 50% ACC, and emotion analysis would have 14.28% ACC (7 classes in total).

           Gender Classification   Emotion Analysis
Test ACC   0.7495                  0.3589

Table 8. Naive Bayes ACC performance.

Table 8 shows the NB ACC testing results for both gender classification and emotion analysis. The advantage of NB is that it runs extremely fast, the fastest of all the models, but the ACC is poor, with only 0.7495 for gender and 0.3589 for emotion. Note that gender has a higher ACC since it only has two labels, while emotion has 7 labels.

Neighbor Number   1        5        10       20
Gender-ACC        0.7446   0.8017   0.8062   0.8190
Emotion-ACC       0.5306   0.4820   0.4727   0.4583

Table 9. KNN ACC performance.

Table 9 shows the testing results of KNN with different K values. For gender classification, we can see that with K = 20 we get a gender ACC of 0.8190, but the boost from K = 10 to K = 20 is relatively small, meaning that the neighbors ranked 10 to 20 contribute limited additional information. However, for emotion analysis, the ACC is best when K = 1, and performance drops significantly as we increase the K value. This means that for the emotion analysis data, neighbors beyond the first introduce more noise than positive contribution. Note that, compared with the NB method, KNN boosts the gender ACC from 0.7495 to 0.8190 and the emotion ACC from 0.3589 to 0.5312.

Tree Number    20       60       100      300
Gender-ACC     0.7696   0.8022   0.8102   0.8235
Emotion-ACC    0.4388   0.4859   0.4965   0.5074

Table 10. Random Forest ACC performance.

Table 10 shows the testing results of the random forest; I tested different tree counts. Here, we can see that the gender ACC increases from 0.7696 all the way to 0.8235 when the tree number is 300, and the emotion ACC increases from 0.4388 to 0.5074. We also notice that from a tree count of 100 to 300 the improvement is relatively small, which means the model has almost converged. It has been shown that random forests are able to prevent over-fitting, so with a larger number of trees the accuracy should converge to some value. A reasonable tree count should be chosen to get the best trade-off between performance and complexity.

               C=0.0008   C=0.01   RBF      Gaussian
Gender-ACC     0.8673     0.8608   0.4939   0.4950
Emotion-ACC    0.5089     0.5022   0.2064   0.2017

Table 11. SVM ACC performance.

Table 11 shows the ACC performance of the SVM. It turns out that the SVM gives less improvement on emotion analysis than on gender classification, mainly because emotion analysis is not a binary classification problem as gender is. Also, the value of the parameter C does not seem to influence the ACC much. I also tried different kernels, such as RBF and Gaussian. In both the gender and emotion problems, the RBF and Gaussian kernels perform very badly, so they are definitely not good kernel choices. Considering that there is some overlap/similarity between different emotions, and that some emotion types may have a negative influence (i.e., cause false positives for other emotions), I also tested pruning class labels. Table 12 shows the SVM ACC results when pruning different emotions. We can see that with a specific pruning we are able to boost the ACC to more than 0.60.

Pruned     0        1        2        3
SVM-ACC    0.5672   0.5205   0.5911   0.4752
Pruned     4        5        6
SVM-ACC    0.6073   0.5207   0.5558

Table 12. SVM ACC with class selection.

Classifier         NB     KNN     Random Forest   SVM
Gender Time (s)    1.71   63.3    57.1            170.8
Emotion Time (s)   4.12   103.2   231.2           457.2

Table 13. Time cost for training on the data set.

Table 13 shows the training time cost for each model, which indicates the following training-time relation: NB < KNN < Random Forest < SVM. However, a longer training time does not necessarily mean a longer test time.
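All of the models above were trained in Matlab. Purely as an illustration of the comparison procedure (not the author's code), an equivalent sweep over the same four classifier families could be set up in Python with scikit-learn along these lines; the parameter values are examples taken from the tables above.

from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def compare_classifiers(X_train, y_train, X_test, y_test):
    # One entry per model family tested in the paper
    models = {
        'NB': GaussianNB(),
        'KNN': KNeighborsClassifier(n_neighbors=20),
        'RandomForest': RandomForestClassifier(n_estimators=300),
        'SVM': LinearSVC(C=0.0008),   # the linear kernel worked best in the tables
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        results[name] = accuracy_score(y_test, model.predict(X_test))
    return results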
4. Experimental Setup and Results

Note that in Section 3 I already showed the majority of the numerical testing results (different feature performance, different classifier performance) in multiple tables. In this section, we show some additional experimental results.

The simulation tool used for this project was mainly Matlab; the total number of .m files is around 20. I also used Java and Python for some data/image parsing. There are several main functions to test the BoW features, data processing, gender detection, emotion analysis, etc., as well as other helper functions for feature extraction, classification, etc. The VLFeat toolbox was also used in order to test the dense SIFT feature.

For convenience, I parsed and split the images into the following folder structure:

emotion-train/test
  0-Angry
  1-Disgust
  2-Fear
  3-Happy
  4-Sad
  5-Surprise
  6-Neutral
gender-train/test
  Male
  Female

Here, each class of emotion analysis and gender classification has its own corresponding folder, which makes it very easy to load images using the Matlab command imageSet.

4.1. Class F1 score analysis

In Section 3, we already showed the majority of the numerical results, including tests on feature extraction, feature reduction, different classifier models, etc. In this part, I show how robust the system is on each class. Instead of accuracy, precision, recall, and the F1 score are used (definitions can be found in Section 3.2).

Gender      Male     Female
Precision   0.8745   0.8602
Recall      0.8615   0.8733
F1          0.8679   0.8667

Table 14. Precision, recall and F1 of gender classification.

Table 14 shows the precision, recall, and F1 of gender classification. We can see that the male and female classes have very close performance on all three metrics. Thus we can conclude that the system has no bias and will perform similarly well on males and females.

Emotion     Angry    Disgust   Fear     Happy
Precision   0.4004   0.9091    0.3730   0.6789
Recall      0.3779   0.1802    0.2754   0.7627
F1          0.3888   0.3008    0.3169   0.7183
Emotion     Sad      Surprise  Neutral
Precision   0.3719   0.6982    0.4405
Recall      0.3841   0.6041    0.5345
F1          0.3779   0.6477    0.4830

Table 15. Precision, recall and F1 of emotion analysis.

Table 15 shows the precision, recall, and F1 of the emotion analysis. Different from Table 14, here we notice that each class is highly biased. Among them, "Happy" and "Surprise" have the highest F1 scores, indicating that these two types of emotions will perform best in real-life testing. Some emotions have relatively low F1, like "Angry", "Fear" and "Sad", but this is understandable since those emotions are essentially somewhat vague; as shown later in Sections 4.2 and 4.3, we can tolerate some overlap among them. In addition, "Disgust" has the lowest F1 score, which is also reasonable since we have much less training data for the "Disgust" emotion (as shown in Fig.4 and Table 2). Note that it also has a very high precision of 0.9091 and a very low recall of 0.1802, which means it may be difficult for the system to retrieve "Disgust" from a testing image, but once it is marked, it will be correct with 90.91% probability.

However, these numerical results are measured on the testing data set, whose resolution was intentionally low and whose facial expressions are very complex. My impression when testing on real-life images or video streams is that the system works far better than the numerical performance on the testing data set (shown in Sections 4.2 and 4.3).

4.2. Real Life Image Test

The numerical ACC results from Section 3 should already be enough to verify the correctness of the system, but to give a more direct view of the performance, this subsection shows some test results on images and videos.

Figure 8. Emotion image-1 testing result.

Figure 9. Emotion image-2 testing result.
In order to test the emotion classification results, I found several images (Fig.8 and Fig.9) [18] with human faces showing various expressions. The images also include people of different races and ages. Note that the five faces of Fig.8 and of Fig.9 were each passed to the system inside one image. Hence, referring back to Fig.1, the process is:

• The face detection module circles out five face box areas.

• The image inside each face box is rescaled; then feature extraction generates a feature vector of consistent dimension for each face.

• The gender classifier is run on the provided feature.

• The emotion classifier is run on the provided feature.

• The face box, gender and emotion labels are added to the image.

From Fig.8 and Fig.9 we can see that we actually obtain very good test results for both gender and emotion classification. For gender classification, only one of the 10 faces is in error. For emotion analysis, five different emotion labels are assigned: Angry, Fear, Surprise, Happy and Neutral. I would say almost all the emotion classification results look reasonable to me. However, human facial expression is very complicated; it is sometimes vague even for us to judge another person's emotion from their facial expression. For example, the third face in Fig.9 could be interpreted as either "Surprise" or "Happy" by different people. Gender classification has only one error in the image, since the gender features of that person are somewhat ambiguous.

I also tested the system on a lot of my personal images; my impression is that the system gives much better performance than the ACC values shown in the tables of Section 3. For gender classification we get quite high accuracy here, because people in real-life images mostly have clearer features than the training data. For emotion analysis, as shown in Section 4.1, the system gives much higher F1 scores on emotions like "Happy" and "Surprise", which are the more common emotions in real life.

4.3. Video Stream Test

In addition to images, I also tested the system on a video stream to see how well it can handle continuous expression changes. Different from a static image, video is a more flexible way to test gender and emotion classification, since we can show different expressions and observe how the system handles those changes.
The expression can be ambiguous in some situations (like "Fear" and "Surprise"), but we are still able to get the correct prediction, indicating that the system is robust. Again, emotion prediction itself is somewhat subjective and vague in real life. It seems the major difference between the "Angry" image and the "Sad" one is the mouth feature, but both predictions are reasonable to me.

Figure 12. Emotion distribution during the video sampling time.

Instead of just focusing on one single image, a more interesting and useful analysis is to do this over an entire video stream. Fig.12 shows the emotion distribution over the 94 frames of this 15-second video. We can clearly see the proportion of each type of emotion. In this demo video, I was "Angry" for 6% of the time, never "Disgust", "Fear" for 25%, "Happy" for 15%, "Sad" for 10%, "Surprise" for 28% and "Neutral" for 16%. We notice that the emotion "Disgust" never appears in this video stream; this is because the training data for "Disgust" was significantly smaller than for the others (based on Table 2 and Figure 4). Also, from Section 4.1, we can see that "Disgust" has a very low recall, which means it is relatively difficult to recognize this emotion from an image, but once it is recognized, it will mostly be correct (based on the high precision from Section 4.1). The system detects 6 types of emotions in this short video because I was deliberately changing my expression frequently. In reality, this kind of emotion distribution may be useful to evaluate the quality of a class.

Figure 13. Gender distribution during the video sampling time.

Fig.13 shows the gender prediction distribution over the video. From it we can see that most of the time (93%) the system makes the correct gender prediction. The gender prediction engine seems quite accurate. It should also be very robust, since I made a lot of exaggerated facial expressions during this video; otherwise we should expect even higher accuracy.

5. Conclusion and Future Work

In this project, we touched on topics like face detection, gender classification and emotion analysis. Different feature extraction methods like BoW, LBP and HOG were tested. Key-point-detection-based feature dimension reduction was considered to reduce complexity. Multiple classifiers, including NB, KNN, Random Forest and SVM, were tested, and the parameters of each model were tuned to get the best performance. The numerical results indicate that we are able to get 0.8673 accuracy on gender classification and 0.5089 on emotion analysis (0.6073 when we prune a particular class). Further analysis was done on the precision, recall and F1 of each class. Testing on real-life images and a video stream also demonstrates the validity of the system.

Personally speaking, this was a very exciting project, which made me familiar with different vision algorithms and how to connect them with machine learning tools. Being able to develop a demo system that can be used immediately is very interesting, and I had a lot of fun testing it on my personal pictures and photos.

The ACC of the emotion analysis is one part that can be improved in future work. The current emotion analysis also has biased performance across different types of emotions; introducing deep learning should be very helpful for this multiclass problem.

Also, to be better used in real-life scenarios like class quality analysis, more information could be collected, such as human poses, age information, person recognition, etc. A good model that uses the collected information to generate an overall summary score (like a group analysis) would also be very interesting.

Code link: Follow this link.
The code link was also submitted through the Google Form.

References

[1] Yang M H, Kriegman D J, Ahuja N. Detecting faces in images: A survey[J]. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2002, 24(1): 34-58.
[2] Zhang C, Zhang Z. A survey of recent advances in face detection[J]. 2010.
[3] Viola P, Jones M. Rapid object detection using a boosted cascade of simple features[C]//Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. IEEE, 2001, 1: I-511-I-518 vol. 1.
[4] Mäkinen E, Raisamo R. An experimental comparison of gender classification methods[J]. Pattern Recognition Letters, 2008, 29(10): 1544-1556.
[5] Lian H C, Lu B L. Multi-view gender classification using local
binary patterns and support vector machines[M]//Advances
in Neural Networks-ISNN 2006. Springer Berlin Heidelberg,
2006: 202-209.
[6] Baluja S, Rowley H A. Boosting sex identification perfor-
mance[J]. International Journal of computer vision, 2007,
71(1): 111-119.
[7] Saatci Y, Town C. Cascaded classification of gender and facial
expression using active appearance models[C]//Automatic
Face and Gesture Recognition, 2006. FGR 2006. 7th Inter-
national Conference on. IEEE, 2006: 393-398.
[8] Kim Y, Lee H, Provost E M. Deep learning for robust feature
generation in audiovisual emotion recognition[C]//Acoustics,
Speech and Signal Processing (ICASSP), 2013 IEEE Interna-
tional Conference on. IEEE, 2013: 3687-3691.
[9] Fasel B, Luettin J. Automatic facial expression analysis: a sur-
vey[J]. Pattern recognition, 2003, 36(1): 259-275.
[10] Gallagher A, Chen T. Understanding images of groups of
people[C]//Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on. IEEE, 2009: 256-263.
[11] https://www.kaggle.com/c/challenges-in-representation-
learning-facial-expression-recognition-challenge/data.
[12] https://en.wikipedia.org/wiki/Accuracy and precision.
[13] https://en.wikipedia.org/wiki/F1 score.
[14] Dalal N, Triggs B. Histograms of oriented gradients for hu-
man detection[C]//Computer Vision and Pattern Recognition,
2005. CVPR 2005. IEEE Computer Society Conference on.
IEEE, 2005, 1: 886-893.
[15] http://www.mathworks.com/help/vision/ref/extracthogfeatures.html
[16] Ahonen T, Hadid A, Pietikainen M. Face description with
local binary patterns: Application to face recognition[J]. Pat-
tern Analysis and Machine Intelligence, IEEE Transactions
on, 2006, 28(12): 2037-2041.
[17] Wang X, Han T X, Yan S. An HOG-LBP human detector
with partial occlusion handling[C]//Computer Vision, 2009
IEEE 12th International Conference on. IEEE, 2009: 32-39.
[18] http://www.shutterstock.com/s/emotions/search.html
[19] https://jonmatthewlewis.wordpress.com/2012/02/24/posture-
and-gestures-in-the-classroom-and-on-the-date/
Computer Vision for Food Cost Classification
Dash Bodington
Stanford University
dashb@stanford.edu
This report will discuss the dataset, models, results, and conclusions of this project in detail.

2. Implementation

2.1. Experimental Setup

The implementation of this project was completely computational, and involved dataset extraction, data preprocessing, feature extraction, and classification. All code for this project was written in Python 2.7; OpenCV was used for most image processing [2], TensorFlow [1] was used to write all neural networks, and scikit-learn provided implementations of some other classifiers [5].

The project was run on a fast desktop computer. All neural network models were run on an Nvidia GTX 780 GPU, and the remainder of the processing was done on a 4.4 GHz quad-core CPU. Even with reasonably fast hardware, feature extraction was very time consuming, so it was often done only once and the features were cached to be used on demand.

2.2. Dataset Extraction

Though this project uses a nicely formatted and labeled dataset, the subset required for training and testing comes from several extraction steps. The entire Yelp dataset contains 200,000 color images from businesses' Yelp pages. Images can be uploaded by either businesses or customers, and are of varying quality and size. These images are tagged by Yelp's own computer vision algorithms, and can be corrected by users, as there are sometimes errors in tagging.

Initially, the image database is analyzed, and all images tagged as 'food' are extracted along with the 'id' of the business they come from. All images from the same business are grouped, and are then labeled with the consolidated cost label of the business (mapped from Yelp's $ - $$$$ to the binary $ or $$ for this project). If a business has not been labeled with a cost rating, which happens infrequently, the corresponding images are discarded. Next, the business ids are binned by their attached cost ratings, each bin is shuffled, and a predetermined train/test ratio (0.7 in this case) is used to assign each business id to the training set or the test set. This binning by business and cost rating is used to ensure that there is a similar distribution of data in the test and training datasets, and to avoid placing images of the same item from the same business into both the training and test datasets, which could be considered mixing the two. In total, there are 11,530 images available for training and 4,412 images available for testing, though these counts decrease depending on the desired training and testing distributions.

In the test dataset, images in each class are shuffled, and images are discarded from the class with more images until exactly 50% of the images in the test set are from each class. Enforcing this 50% distribution cuts the dataset approximately in half because of the relative rarity of expensive restaurants.

After the training and test datasets are fully defined, each image is cropped from the center to the largest square area possible, and is resized to a variable size depending on the feature extractor used (sizes range from 64x64 to 227x227). With this definition of the training and test datasets, there are several further processing steps which are sometimes used on the training set to improve training and performance.

• For validation or cross-validation, which is only used when tuning feature or classifier parameters or training neural networks, 20% of the training dataset is randomly designated as the validation set. All models except neural networks are trained on the whole training dataset before testing.

• Depending on the classifier and loss function being used, the training set will usually have images discarded in the same fashion as the test dataset to even out the distribution of images across classes.

2.3. Feature Extraction

Several feature extractors were used in this project as inputs to the various classification systems.

2.3.1 Images as Features

In some cases, preprocessed (scaled and cropped) images were used as features themselves. This method is usually most appropriate as input to a convolutional neural network classifier, which essentially defines its own features internally, but it can also be used with other classifiers with varying success. These features use images of size 128x128 or 64x64, which are vectorized unless the classifier is a convnet.

2.3.2 Color Histogram Features

As an initial step beyond images as features, it was thought that the colors in an image could give an indication of the cost of the food. This feature consists of a length-30, normalized (sum 1) vector which contains a 10-bin histogram for each color channel. These features use images of size 128x128 or 64x64 as inputs.

2.3.3 SIFT Bag of Words Features

SIFT Bag of Words features consist of frequency distributions of common feature descriptors. During training, a set of SIFT descriptors from all training images is extracted, and N (range: 20-100) 'words' are extracted with the K-means algorithm [2, 4].
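As a rough sketch of this bag-of-words pipeline (the project's actual code is not reproduced here, and the SIFT constructor name varies across OpenCV versions), the vocabulary construction and per-image histogram might look like this:

import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(gray_images, n_words=50):
    sift = cv2.SIFT_create()
    descs = []
    for img in gray_images:
        _, d = sift.detectAndCompute(img, None)
        if d is not None:
            descs.append(d)
    # Cluster all training descriptors into n_words visual words
    return KMeans(n_clusters=n_words).fit(np.vstack(descs))

def bow_histogram(gray_image, kmeans):
    sift = cv2.SIFT_create()
    _, d = sift.detectAndCompute(gray_image, None)
    words = kmeans.predict(d)            # nearest visual word for each descriptor
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()             # normalized frequency histogram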
Next, during training and testing, each image's feature vector is calculated as a normalized N-length histogram of the words, where each feature descriptor from the image is assigned to one word with the nearest-neighbor algorithm. These features use images of size 128x128 as inputs.

2.3.4 Alexnet Features

Alexnet is a 13-layer pretrained convolutional neural network which previously achieved state-of-the-art performance in the Imagenet large-scale image classification challenge [3]. Originally, the network's output was a 1000-class softmax layer, but because the network learns very useful features for other tasks, swapping the final layers of the network for custom-trained layers is a common practice, especially for those working without the computational power or the large data volume required to train a model with similar performance from scratch. For this project, 'Alexnet features' are considered to be the output of one of the final layers of the pretrained network.

With Alexnet, multiple feature sets were created from multiple layers of the convnet: fc8, fc7, and fc6 (the final fully-connected layers of the network). Because these features are effectively sparse (we are feeding the network a very small subset of the images it was trained to classify) and very large, PCA is often used to reduce the feature dimensionality before training.

This network requires image inputs of size 227x227, which are also mean-subtracted.

Using the GPU allows for a significant speedup in feature generation, over 70x faster than CPU computation on the hardware for this project, but it is not enough to train a network like Alexnet from scratch with limited time and data.

2.4. Classifiers

Because of the many feature inputs used in this project, classifiers with different properties and strengths are used to increase the likelihood of good performance with each feature set. Feature vector lengths range from 30 (color histogram) to 12,288 (cropped and rescaled images), so models which may overfit in some cases may perform well with fewer features.

2.4.1 K-Nearest Neighbor

The K-Nearest Neighbor classifier archives the entire training feature set and, at prediction time, calculates the Euclidean distance from the input feature vector and makes a prediction based on a majority vote over the labels of the K closest training examples (K = 20 for this project). While storage-inefficient, it is one of the simplest classifiers and would perform well if images in the training and testing datasets were similar enough, but it fails to generalize otherwise.

2.4.2 Naive Bayes

Naive Bayes classification assumes feature independence and a Gaussian probability distribution of the features to make a maximum likelihood estimate of the class, ŷ = argmax_y P(y) Π_i P(x_i | y), and usually can perform well on small training datasets because it has few parameters.

2.4.3 Support Vector Machine (SVM)

The SVM is a linear classifier which defines a class-dividing hyperplane f(x) = B_0 + B^T x that minimizes (1/2)||B||^2 subject to y(B_0 + B^T x) > 1 on the training set. Generally, SVMs have the advantage that they are less likely to overfit than other methods because of built-in regularization.

2.4.4 Neural Networks

While neural networks are the most general classifier used in this project, they are also the most difficult to tune, and it requires a great deal of data to train networks with many neurons. Neural networks are a layered structure of interconnected linear and nonlinear operations whose parameters can be learned with various gradient descent methods. In this project, the classification networks presented have zero or one hidden layers, and all layers are fully connected. Each fully connected hidden layer (when present) is followed by a Rectified Linear Unit (ReLU), and the final layer is a softmax layer which takes two inputs and computes pseudo-probabilities for the input on each class according to

p̂(x_i) = e^{x_i} / Σ_j e^{x_j}

Though neural networks have been responsible for many recent state-of-the-art results in computer vision, they are among the most difficult models to manage on small datasets. Because of this, L2 regularization is sometimes added to the iterative minimization of the cross-entropy loss

Loss = − Σ_i p(x_i) log(q(x_i)) + λ R(Weights)

where q contains the outputs of the softmax layer and p is the one-hot label vector for the training sample x. Other training tricks, such as dropout and projection of sparse feature vectors into lower dimensions with PCA, are also attempted to increase robustness and decrease overfitting. Batch gradient descent with momentum was used to train the networks for this project because it provided reliably converging results, especially when changing the class distribution (and size) of the training set.
Figure 2. These three images of cheesecake, chocolate dessert, and steak are the images with the highest estimated probability of being expensive according to the neural network.
Travis V. Allen
Stanford University
CS231A, Spring 2016
tallen1@stanford.edu
Abstract

Many in the computer vision community have written papers on solving the "problem" of jigsaw puzzles for decades. Many of the earlier papers cite the potential extension of this work to other, more serious endeavors, like reconstructing archaeological artifacts or fitting together scanned objects, while others simply do it because it is an interesting application of computer vision concepts. This author falls in the latter camp. Several methods for constructing jigsaw puzzles from images of the pieces were considered from a theoretical standpoint before the computing power and the high-resolution image capturing devices necessary to employ these methods could be fully realized. More recently, many algorithms and methods tend toward disregarding piece shape as a discriminant entirely by using "square pieces", relying instead on the underlying image properties to find a solution. The jigsaw puzzle solver described in this paper falls somewhere in between these two extremes. The author of this project paper describes the creation of an "automatic" jigsaw puzzle solving program that relies on multiple concepts from computer vision, as well as past work in the area, to assemble puzzles from a single image of the disassembled pieces. While the program is currently specifically tailored to solve rectangular puzzles with "canonical" puzzle pieces, concepts learned from this work can be used in concert with other computer vision advances to enhance the puzzle solver and make it more robust to varying piece and puzzle shapes. The puzzle solver created for this purpose is fairly unique in that it uses a picture of the disassembled pieces as input, makes no reference to the original puzzle image, and is implemented using the Matlab Image Processing Toolbox. The solver created for this project was successfully used on five separate puzzles with different rectangular bounds and dissimilar puzzle images. These results are similar to those of others who have created similar puzzle solvers in the past. Ultimately, the author hopes that this work could lay the groundwork for a smart phone application.

1. Introduction

This project focuses on the creation of an end-to-end "automatic" jigsaw puzzle solving program that assembles a jigsaw puzzle using an image of the disassembled puzzle pieces. The solver does not use the picture of the assembled puzzle for matching purposes. Rather, the solver created for this project attempts to solve the "lost the box" conundrum by displaying the assembled puzzle to the user without the need for the reference image. Needless to say, this program was created with an eye toward a potential smart phone application in the future. It should be noted that the puzzle solver is created entirely in Matlab and heavily utilizes functions found in the Matlab Image Processing Toolbox.

Solving jigsaw puzzles using computer vision (CV) is an attractive problem, as it benefits directly from many of the advances in the field while still proving to be both challenging and intellectually stimulating. Due to time constraints, manpower limitations, and the fact that I had little to no prior experience with many of these concepts and the Matlab Image Processing Toolbox when beginning this project, several assumptions about the problem were made up front in order to make it simple enough to solve in the time given. The primary assumptions and limitations are as follows:

1. The pieces in the source image do not overlap, nor do they touch.

2. The source image is captured in a "top-down" manner with minimal perspective distortion of the pieces.

3. The pieces in the source image comprise one entire puzzle solution (you cannot mix and match pieces from other puzzles).

4. The final puzzle is rectangular in shape and all pieces fit neatly into a grid.

5. The pieces of the puzzle are standard or "canonical" in shape – this means they are square with 4 distinct, sharp corners and four resulting sides.
6. All intersections of pieces in the puzzle will be at the corners of the puzzle pieces, and all internal intersections will be at the corners of four pieces.

7. Each side is characterized by having a "head," a "hole," or by being "flat."

In the following paper, I will discuss the previous work that has been done in this area in section 2, emphasizing those papers that most influenced the methodology I followed for my puzzle solver. I will then describe my technical approach to the problem and how my puzzle solver works in section 3. In section 4 I will show some of the results obtained with the puzzle solver so far and discuss both the experimentation that has been conducted and areas where more experimentation could occur. Finally, I will wrap things up with a conclusion in section 5.

2. Related Work

As one can imagine, a problem such as solving jigsaw puzzles might attract a good number of people in the computer vision community, and indeed it has. The problem has been considered for decades, going back to H. Freeman and L. Gardner [2] in 1964, who first looked at how to solve puzzles with shape alone. Then there is H. Wolfson et al. [4], who describe the very matching methodology that I use; they were able to assemble a 104-piece puzzle using this method in 1988. While I do not solve a 104-piece puzzle, their solution requires individual pictures of each piece, though it does solve the puzzle with shape alone. One area where their method differs from mine is that it can handle two intermixed puzzles, whereas mine can only handle the pieces from one at present. D. Goldberg et al. [3] expanded on Wolfson's work and developed an even more global approach to solving jigsaw puzzles – their method allowed for the solution of puzzles whose pieces did not necessarily intersect at corners. My inspiration for color matching across boundaries comes from D. Kosiba et al. [5], who proposed methods for using color in addition to shape in 1994.

Some insight into how to use Matlab to help solve this problem came from detailed student papers found with a Google search. A. Mahdi [8] from the University of Amsterdam and N. Kumbla [6] from the University of Maryland both attempt the problem using methods similar to what I ended up using, though neither ends up with a quite full and satisfactory solution, and both rely on high-resolution and highly controlled inputs. Finally, my inspiration for creating a smart phone application comes from L. Liang and Z. Liu [7] from Stanford, who do not use the same matching methodology (they use the actual image of the fully constructed puzzle and SURF/RANSAC), but who do try to implement their solution in a real-time smart phone application. This is a possibility in the future for my puzzle solver.

Lastly, a former student of CS231A, Jordan Davidson [1], did his project in this very area, though in a slightly different vein. He looked at a genetic algorithm that could solve large "jigsaw" puzzles with square pieces, using the information from the pieces to determine whether it had found the correct match. While the algorithm is interesting and probably applicable on some level to my puzzle solver, it was not quite what I was looking to do for this project. Jordan's work appears to be in an area that is growing, with others attempting to solve larger puzzles of this kind. Personally, I wanted to solve the puzzles as they are seen and manipulated in real life. Most of this area of research has other applications and was not quite what I was looking to do. Though, as I said, there is definite applicability of some of the algorithms to my ultimate solver, and future iterations may look at something like the algorithm explored in Jordan's paper.

3. Puzzle Solver Technical Approach

Creating a program to construct a puzzle using an image of the pieces requires a number of steps, each of which can be executed in a number of ways. In this section, I will describe the methods I used in my final code, but I will also discuss alternatives that were either attempted with suboptimal results or that were not used but could be in future iterations.

3.1. Image Capture and Segmentation

In order to capture the pieces to be assembled into a final puzzle, I used a fairly high resolution camera – a Canon Rebel T4i DSLR with 18.0 megapixel resolution. The pieces were placed face up on an easily segmentable background (i.e., a "green screen") with great care taken to ensure they were not overlapping (see Figure 1 for an example using the Wookie puzzle). The picture was taken from a "top-dead-center" position looking straight down onto the pieces in order to reduce perspective distortion. Lighting was kept as neutral as possible, with consideration given to sources of glare and to the possible disproportionate lighting of some pieces over others. With more time, the ability to compensate for off-axis image capture of the pieces (i.e., rectification to the ground plane) could be built into the code, though that was not explored for the current incarnation of the solver.

As discussed earlier, with an eye toward an eventual smart phone application, the first step I take is to significantly reduce the resolution of the input image from that of the original in order to shrink the memory burden and make the resulting image more comparable to one that might be obtained with a smart phone. Once I have resized the image (960x1440 was the main resolution used for the test cases), I use a Gaussian filter (with σ = 1) to blend the edges prior to the actual segmentation.
Figure 1. Input Image for the Wookie Puzzle
Using a method similar to [5], I identify small patches of pixels (I
found three, 2x2 patches to work well) distributed evenly
along the edge, grab the average L*a*b* color information
in each of the patches, and store it for the matching process
(see figure 3 for an example). Of course, I only do this
along the edges with holes and heads – this is unnecessary
along the flat edges. As with the piece information before,
I also gather the HSV values for each of the patches, but
no longer use them in the matching process after receiving
mixed results.
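Again as an illustration only (the project uses Matlab), a Python sketch of sampling three 2x2 patches evenly along an edge, averaging their L*a*b* values, and comparing two edges with the ∆E metric used during matching; the helper names and the scikit-image color conversion are assumed, not the original code.

```python
import numpy as np
from skimage.color import rgb2lab

def edge_patch_colors(img_rgb, edge_points, n_patches=3, patch=2):
    """Average L*a*b* color of small patches spaced evenly along an edge.

    img_rgb: HxWx3 float image in [0, 1]; edge_points: Nx2 array of (row, col)
    pixel coordinates tracing the edge. Sketch only."""
    lab = rgb2lab(img_rgb)
    idx = np.linspace(0, len(edge_points) - 1, n_patches).astype(int)
    colors = []
    for r, c in edge_points[idx]:
        window = lab[r:r + patch, c:c + patch].reshape(-1, 3)
        colors.append(window.mean(axis=0))
    return np.array(colors)  # n_patches x 3 (L*, a*, b*)

def delta_e(colors_a, colors_b):
    """Mean Euclidean distance in L*a*b* between corresponding patches."""
    return float(np.linalg.norm(colors_a - colors_b, axis=1).mean())
```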
curves along each edge to determine an average overlap (or gap) – obviously the smaller the overlap or gap, the better the match. With these factors in mind, here is how the matching algorithm works.
Once the algorithm determines the two pieces and the side of each piece to be matched, it first checks to make sure one is a head and one is a hole. If not, the match is discarded. It then compares the side lengths and the integrals along each edge. If the difference in side lengths is significant (beyond approximately 10 pixels for the standard resolution image), the match is discarded. The integral of the curve along each edge is then compared. If the integrals are not approximately equal and opposite, the match is discarded (the threshold for this varies based on image resolution – it turns out not to be the best method for weeding out candidate matches, so it is not the most strict). Next, the overlap is calculated by overlaying the two edges in XY-coordinate space and taking the difference (using pdist). The result is the overlap, or gap, between the two pieces. If the average, minimum, or maximum overlap is beyond specific thresholds based on image resolution, then the match is thrown out. Once we have looked at the basic shape discriminants for determining whether a match is likely correct, we then look at the color discriminants and determine the ∆E. We do this both from a regional perspective (piece to piece) and from an edge perspective (using the patches along the edge). ∆E is essentially the "distance" between two colors in the color space and is determined using the following formulas:

∆L* = L*_1,Avg − L*_2,Avg    (1)
∆a* = a*_1,Avg − a*_2,Avg    (2)
∆b* = b*_1,Avg − b*_2,Avg    (3)
∆E = sqrt(∆L*^2 + ∆a*^2 + ∆b*^2)    (4)

The ∆E is found for each pair of patches along the edge and the average of those patch values is used. Obviously, the lower the ∆E, the closer the colors are in the spectrum.
Once we have computed and captured these shape and color comparisons and have weeded out obviously bad matches, we then compute a match "score." This score is found using experimentally determined weights that are multiplied by the four key matching criteria: side length difference, overlap difference, ∆E along the edge, and ∆E between the pieces. Since these values tend to vary widely between matches and puzzles, the standard range of values found for "correct" matches on one of the test puzzles was used to develop a set of weights that somewhat normalizes each parameter so that one specific criterion is not favored much more than another. For most correct matches, when multiplied by the weights, each value should be no greater than 100. This means that, theoretically, a correct match could have a total score of up to approximately 400. However, in reality, most correct matches have low scores in at least two of the categories, so a score threshold of about 280 (again, experimentally determined) can be set in order to weed out incorrect matches.
Once the local matching algorithm has pared down to a final set of scored matches, it then returns these matches to the global algorithm in score-priority order. The primary nuance that differs between local matching for the border versus the inner pieces is the orientation of the piece. As I will soon discuss, the border pieces are aligned according to the flat edge, so the local matching only considers one potential edge for each piece. And, because the border is being matched in the absence of the rest of the puzzle, the local matching algorithm does not have to consider any additional sides from other pieces that may come into play. Not so for the inner pieces. For the inner portion of the puzzle, a piece is being matched to a slot in a grid, and that slot has neighboring sides. As will soon be discussed, the local matching algorithm will always have at least two sides to consider for each internal piece, but could have upwards of three or four depending on the location of the slot and what pieces have been matched so far. In the inner case, the local matching algorithm has to ensure that the heads and holes all line up first, as before, but then determines all of the matching metrics per side and takes the average over the number of sides. The major difference here is that a single piece could fit into a slot in multiple ways, so piece orientation must be accounted for and the piece must be rotated into all valid configurations before the algorithm returns a set of matches. A single piece could potentially have four possible "matches" to a single slot, depending on hole/head orientation. The match thresholding and scoring are the same across pieces and edges in the inner piece matching as in the border matching, the only true difference being that they are averaged across each piece/edge to which the piece is matched in the inner case (which is unnecessary in the border case).

3.3.2 Global Assembly – the Border Pieces

It is logical to begin the global assembly with the border because the border pieces are distinct – they each have at least one flat side. Since the piece matching algorithm is not perfect and does not always return the "correct" match as the "best" match, a so-called "greedy" algorithm that simply places the "best" match between two pieces in the next available slot will not necessarily result in a coherent solution (i.e. one might get something that is non-rectangular or even nonsensical). In order to provide for this possibility while not resorting to a "brute-force" approach that runs through every possible combination of pieces, I decided to use a so-called "branch-and-bound" algorithm, normally used in the solution of the Traveling Salesman Problem (TSP). In the general description of the TSP, a salesman
needs to do business in a number of cities spread out over a region with defined distances between each. The salesman wants to find the shortest overall route that visits every city exactly once and returns to his starting point – thus it is a distance minimization problem.
Much like the TSP, each match made between two pieces along the puzzle border is given a score. Once a solution is found, the total of all of the match scores that make up that border solution should reflect how good the solution is. Ideally, the smallest overall score will be the best and correct solution. However, that turns out not always to be the case, as will be discussed.
Since there are many ways of reaching a solution, we need a way of capturing a large number of possible solutions and then finding the best one of those potential solutions. One way of doing this is the branch-and-bound method. This algorithm will be discussed shortly, but first I will describe the greater methodology for how the border is constructed.
The general construction of the border begins with a corner piece. We orient the piece such that the counter-clockwise-most flat side is "down," with the other flat side to the "left" and a head or hole to the "right." Matching is then done to the "right" in a sequential manner. The local matching algorithm, then, receives a left piece and a list of possible right pieces, all with the flat side down. It returns a list of potential right pieces. We then choose one of the right pieces to be the new left piece and continue the process until we run out of possible right pieces. As we progress around the border, when we hit another corner, we rotate the entire puzzle and continue as if the flat side of each piece is "down."
Now, since the potential number of solutions is (n − 1)! where n is the number of border pieces, we want to constrain the number of solutions found through whatever means necessary. The local matching algorithm does a good job of weeding out very poor matches, but will still return multiple potential matches that could lead to a nonsensical full solution. In order to combat this we also place a set of side length constraints on the puzzle such that the border solution must have side lengths corresponding to a rectangular puzzle. If we have gone too long without a corner piece, or if we get sides defined by corner pieces that do not equal one another, the solution is thrown out and we search for a new one.
After all of this preamble, I am now going to describe the basic branch-and-bound algorithm that is used for the border matching problem. The algorithm can be described in the following manner (a sketch of this recursion appears below):
1. Choose a starting piece (usually a corner) and call it piece A.
2. Use piece A to find a set of potential matches – these can be considered "children" or "branches."
3. Since the local matching algorithm returns a rank-ordered list of potential matches, start with the first potential match as piece B.
4. Next, remove piece B from the list of pieces remaining to be matched and make piece B the new piece A.
5. Use the new piece A to find more potential matches.
6. This process continues until one of the following occurs:
   (a) We run out of pieces remaining – in this case we have found a solution or "leaf." We store this set of matches and their scores as a solution, back up, and see if we can find more solutions.
   (b) We run into a border constraint that isn't satisfied – in this case we back up and see if another match does satisfy the border constraint before moving on.
   (c) We do not get any potential matches for the current piece A – in this case we need to back up and see if we can find another path using a different piece from an older set of potential matches.
The algorithm either runs until it has exhausted the search space and found all possible solutions based on both the side constraints and the local matching thresholds, or until it has obtained the number of solutions requested of it. As can probably be surmised from the basic description above, one can visualize this approach as a tree with the first piece at the root and branches extending upward for each potential match. If we are able to make it all the way up a branch to a leaf, then we have found a solution to the problem. If we get stuck on a branch and cannot expand, we come back down the tree until we find another path that looks fruitful. In this way we can reduce the total number of solutions tried to well below that which would be found through simple brute force.
As one can see, because the problem is being solved in a nonlinear fashion, the number of solutions that might be found before the "correct" solution is highly dependent on several factors, not least of which are the first piece chosen and how well the first few pieces match. If incorrect matches are made early in the process, it can take a long time (and a lot of matches) before the correct solution is found. And even when the correct solution is found, it may not be the "best" solution as per the scoring system. Ideally the "best" and "correct" would be the same, but that is not always the case.
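To make the search concrete, here is a Python sketch of the depth-first branch-and-bound described above (not the author's Matlab code). The functions local_matches and violates_border_constraints are hypothetical placeholders for the local matching routine and the rectangular side-length checks.

```python
def border_search(piece_a, remaining, partial, solutions, max_solutions,
                  local_matches, violates_border_constraints):
    """Depth-first branch-and-bound over border piece orderings.

    piece_a: the current 'left' piece; remaining: set of unplaced pieces;
    partial: list of (piece, score) matches made so far. Sketch only."""
    if not remaining:                              # case (a): a leaf / full border
        total = sum(score for _, score in partial)
        solutions.append((total, list(partial)))
        return
    # Rank-ordered candidate matches for piece A among the remaining pieces.
    for piece_b, score in local_matches(piece_a, remaining):
        partial.append((piece_b, score))
        if violates_border_constraints(partial):   # case (b): prune this branch
            partial.pop()
            continue
        # Recurse with piece B as the new piece A.
        border_search(piece_b, remaining - {piece_b}, partial, solutions,
                      max_solutions, local_matches, violates_border_constraints)
        partial.pop()
        if len(solutions) >= max_solutions:        # stop once enough solutions are found
            return
    # case (c): no viable matches -> backtrack by returning to the caller
```

The collected solutions can then be sorted by total score, with the lowest-scoring border taken as the best candidate.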
3.3.3 Global Assembly – the Inner Pieces
Once a border solution is found, it is passed to the global assembly for the inner pieces. The global assembly algorithm assembles the border pieces into a grid and then grabs the "upper left" open slot as the new "piece A." This algorithm also creates a grid with relative orientations for each piece. When the pieces were first characterized, they were each oriented with "side 1" being "up." Once they are placed in the final puzzle grid, we create a second grid with the same dimensions that provides the relative 90-degree rotation from "up" for each piece (0-3). Because we know we have the border completed, we can be assured that that first slot will have at least two pieces along its edges. The more pieces along an edge, the more accurate and discriminating the score should be. The inner piece matching algorithm then uses a branch-and-bound algorithm similar to the border case to find potential matches for this first slot. It then removes the piece from those remaining and moves across the puzzle, filling in all available slots from left to right and then top down (like reading a book). Unlike in the border case, however, the potential matches could involve the same piece, just oriented differently; for this algorithm, orientation is very important. As with the border assembly algorithm, once all of the pieces are placed, that solution is stored and we then back up and see if we can find more, until either we run out of solutions or we have found the number of solutions desired. Ideally, the solution with the lowest total score (aka the "best" solution) will also be the "correct" solution.
One last note about the global assembly algorithms: while it might make sense to have the two algorithms separate during development and while trying to understand where each breaks down, the ideal case would be to combine them in order to weed out border solutions that do not admit full puzzle solutions. This was considered, though not implemented in the final code. Had there been more time, this would have helped to bring down the total number of end puzzle solutions. As it was, in the time allowed, I was simply able to get both of these algorithms working well enough to tweak the various variables to see how best to find matches. The next step would be to link these two algorithms and throw out border solutions that have no potential solutions based on all of the criteria discussed above.
3.4. Final Image Construction
Once we have assembled all of the pieces into a grid with their relative orientations, we have the solution to the puzzle. The next step is displaying that solution to a user. Ideally, I would like to display a completed and fully stitched-together puzzle image using the puzzle pieces as segmented from the original image. While this is most likely possible using the Matlab Image Processing Toolbox, it was not completed in a satisfactory manner by the end of this project. Instead, I use the functions vision.AlphaBlender and step along with the piece masks and the cropped segmented pieces from the original image to create a quasi-final image "grid" that shows the extracted pieces oriented per the solution. It is not ideal, but it at least shows how the final pieces should be laid out and arranged. For reference, see figure 4 for an example of the final solution.
4. Experimentation and Results
I ran my puzzle solver in Matlab on both a home PC with 4-year-old hardware (6 GB RAM, Intel i5 processor, AMD graphics card) and a 13-inch MacBook Air, 2015 model, with 4 GB memory and Intel graphics, with little difficulty. It takes about a minute or so to run one of the test puzzles (it can take more than a minute for the 24-piece puzzles) from image segmentation through to final construction. I carried a good amount of information in memory throughout the process since I was doing a lot of experimentation and wanted the ability to plug and play various modules for both fine-tuning and debugging. This could be pared down for a future implementation.
While the puzzle solver created for this project cannot be used on every puzzle (per the limitations noted earlier), for those it could be used on, I was able to experiment to find limitations and weaknesses. I also used this experimentation to find the best criteria for matching.
For this project I tested the puzzle solver on six Star Wars themed children's puzzles that I bought at Target. An overview of the results using the final matching parameters is in Table 1.

Table 1. Test Puzzle Results
Puzzle Name    Total Pieces  Border Pieces  Border Soln  Inner Soln
Wookie         12            10             1st          1st
Storm Troops   24            16             58th         5th
Droids         12            10             1st          1st
Speeder        16            12             1st          1st
Rey Finn       12            10             1st          1st
Kylo Ren       24            16             N/A          N/A

As one can see, most of the puzzles had 12 to 16 pieces, except for two that had 24 pieces. For all of the puzzles that had fewer than 24 pieces (four of the puzzles), the correct border solution was the best border solution returned, by score (hence the "1st" in the Border Soln column). Then, using the correct border solution as the lead-in to the inner puzzle algorithm, those same puzzles found the correct solution to be the one with the best score.
For the Storm Trooper puzzle, there were two primary
factors that led to it not doing as well. First, there are more pieces and therefore more potential matches. Still, if the matches were registering scores that reflected the true "correctness" of the match, then one would expect the overall score of the completed border to be better than 58th. And even when we fed the correct border solution to the inner puzzle algorithm, the correct solution was only 5th best by score. To put this into perspective, however, the global border solution algorithm returned in excess of 2000 potential border solutions for the Storm Trooper puzzle (of a mathematically possible 15!, approximately 1.3 trillion, solutions with brute force), of which the correct one was 58th by score – which isn't all that bad. Additionally, the Storm Trooper puzzle was by far and away the most homogeneous in terms of color of all the puzzles. It was very difficult to discriminate matches based on color using the methods I described before, especially because of the way the storm trooper line discontinuities happen to match up along the borders of the pieces, making cross-border color matching very difficult indeed. And finally, the pieces were fairly square, so all four sides were very even and comparable in length. If they were instead more elongated, with one pair of sides longer than the other, the side length discriminant would have knocked down potential matches.
This was a case where the experimentally determined match scoring algorithm broke down. While it worked very well in the other test cases, one can see quite clearly that other methods would need to be pursued in order to get the Storm Trooper puzzle to be solved correctly.
The other puzzle in the table that was looked at as a test case but does not have a rank for a solution is the Kylo Ren puzzle. This puzzle highlighted the need for a better corner-finding or edge-finding process. While the code I developed to find the corners worked quite well and repeatably on the other puzzles, the pieces of the Kylo Ren puzzle were extremely elongated and many of the "heads" were so small as to be mistaken for corners. Needless to say, the automatic corner-finding was not able to find the corners, and without them the rest of the algorithm simply does not work as is.
When experimentally determining the criteria to use for the piece matching, individual matches were observed with special attention paid to the values for correct matches. The Wookie puzzle was used as the baseline case and the values derived from this puzzle were applied to the others with general success (except for the Storm Trooper puzzle). Here were the typical values for correct matches and the final weights applied (a sketch of how these combine into a match score follows the list):
• Side Distance Difference: 0-8 pixels (wt = 12.5)
• Overlap Average Difference: 0-14 pixels (wt = 7.0)
• ∆E Patches: 7-45 (wt = 2.9)
• ∆E Pieces: 4-35 (wt = 3.2)
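As a concrete illustration of how these weights turn the four criteria into a single score, here is a short Python sketch (not the author's Matlab code; the dictionary keys are made-up names):

```python
# Weights from the Wookie baseline puzzle. A weighted criterion for a
# correct match should usually stay below ~100, so four criteria give a
# theoretical ceiling of ~400 and a practical cutoff near 280.
WEIGHTS = {
    "side_length_diff": 12.5,   # pixels
    "overlap_avg_diff": 7.0,    # pixels
    "delta_e_patches": 2.9,     # cross-boundary patch color difference
    "delta_e_pieces": 3.2,      # whole-piece color difference
}
SCORE_THRESHOLD = 280.0

def match_score(criteria):
    """criteria: dict with the four keys above -> weighted total score."""
    return sum(WEIGHTS[k] * criteria[k] for k in WEIGHTS)

def is_plausible(criteria):
    """Keep a candidate match only if its weighted score is under the threshold."""
    return match_score(criteria) <= SCORE_THRESHOLD
```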
While it would logically seem, and in most cases it actually is true, that matching the color patches across the boundaries should be one of the best ways to discriminate in order to find a true match, due to the variability in how the puzzle pieces are carved up this was not always the case. For instance, one piece was carved almost perfectly along the Wookie's nose, which is dark, and just on the other side of the edge there was a bright background. The ∆E in this case was fairly large even though the match is a correct one. In fact, this was also a case where the puzzle pieces had a small ∆E between them, but the patch difference was much higher, which is not an expected result. And while one might begin to think that maybe color is too volatile and should not be considered at all, since the perceived variability in shape is much smaller among true matches per the list above, that is not entirely accurate. Due to noise and distortion, the actual length of the sides and the measured overlap are not exact. And while these values are always small for the correct match, they are also small for many other matches. The shape criteria are best for weeding out those pieces whose shape isn't even close to correct. They can also help when the head of one piece is bent in one direction while the hole of the other piece is expecting it to be bent in another – then the overlap will suffer. For the most part, however, the shape measures only get you close. Unfortunately, many of the pieces have differences that fall within the acceptable ranges above. That is why color is then used to help with the ranking of those potential matches, and in most cases the color does help. There are just a few matches in every puzzle where large transitions, either within the puzzle region or just along the border, cause the match scoring to return some surprising values.
Additional parameters that were used early on for color scoring were RGB and HSV channel averages. While in some cases there was clear correlation, in many others there didn't appear to be any correlation whatsoever. Color variance was also considered, though it was disregarded because, after some thought, I could not see how it would return a marked improvement. The extreme variability in the color scoring led to a rethinking of how the colors were being compared and the eventual use of ∆E.
Additional methods for finding border matches that were considered but could not be implemented before this report were:
• Segmentation along the border (such as meanshift) – if we could determine that a certain number of segments along one border coincide with a certain number of segments along another, then maybe we could find a potential match.
• Find lines along a border that break at the border, then look for the continuation of these lines on the other side. This would have something to do with the flow of pixels – it seems difficult to implement, though it would
really help with the Storm Trooper puzzle.
• Grab features within the head piece or along the edge
and build a Bag of Words model. Then try to find
matches on the other side of the piece around the hole.
This approach has promise, but could be computationally expensive.
While the matching algorithm used isn't perfect, it worked for the test cases considered. And while others may have been able to solve larger puzzles with their algorithms [4][3], my code proved to be fairly robust and efficient at solving the puzzles provided. Since there appear to be no standardized "jigsaw puzzle metrics" against which to compare my puzzle solver for puzzles with irregular shapes, I cannot say exactly how my puzzle solver compares to others that have been developed. However, it is one of the few that I've seen that takes a raw image of all of the pieces at once and produces a fully constructed solution. Most of the puzzle solvers found in official papers and in the student submissions cited earlier produce only partial or theoretical solutions, or solutions that require even greater initial constraints than my own (i.e. the pieces have to lie in a grid at the outset or each piece has to be scanned individually with a high resolution scanner). Still others rely on the original image to find the location of the pieces in the final image, which is not the problem I set out to solve. One example using the Wookie puzzle can be seen with the original image in figure 1 and the final solution as found by my puzzle solver in figure 4.
Figure 4. Solution Created by the Automatic Puzzle Solver
5. Conclusion
I have created an end-to-end “automatic” jigsaw puzzle
solver that uses Matlab and the Matlab Image Processing
Toolbox to piece together a jigsaw puzzle using only an
image of the pieces. I used this puzzle solver on six test
puzzles and proved that it works on five of them quite reli-
ably, but also found where there were weaknesses in the cur-
rent implementation. Certain design decisions were made
early on that simplified the problem such that I could com-
plete the entire project by the deadline. Unfortunately, this
also meant it was hard to go back and try a completely new
method once I had begun going down a certain path.
I learned a lot over the course of this project. I learned
about how to think about a 3-dimensional, physical world
problem in terms of a 2-dimensional perception of that
problem. I learned how to think about manipulating every ounce of information I could glean from a single photograph to help the computer "think" like a human and make matches that would result in a correct solution. I learned about many functions inherent within Matlab, especially the Matlab Image Processing Toolbox. As I developed the program, I learned new tools and tricks that, had I known them earlier, I may have approached certain parts of the project differently. This knowledge will certainly be helpful in the future and could be applied to improving the puzzle solver.
Figure 5. The "Truth" – A Picture of the Assembled Wookie Puzzle
Several of the papers I read where people have attempted
this problem in the past did not make much sense to me un-
til I went and attempted it myself. I had believed it would
be easier to extract the corners of the four-sided pieces, and
therefore decided to go with canonical piece jigsaw puzzles.
However, this then meant I was fairly limited in the types and numbers of puzzles to which my program could be applied. It also meant that extraction of this information was absolutely essential to everything my program did afterward. Some of the other methods, like the use of fiducial points as in [3], may have proven more difficult at first, but could have paid dividends in their ability to scale.
If my end goal was to begin exploring the possibility of an automatic jigsaw puzzle solver smart phone application, which was the original idea, then I believe I have made a solid stride in that direction. However, my code is not yet robust enough to handle the kinds of inputs a smart phone might provide,
nor is it efficient enough in both memory allocation and
processor requirements to be feasible for that application.
Many changes would have to be made before I can get to
that end goal, which is something I realized about halfway
through the project. While I believe I have created a solid
and workable solution within the constraints of the problem
as I originally set forth, I see many areas where it could be
improved for future incarnations. All in all, I did what I set
out to do, I learned a lot, and I enjoyed the process.
References
[1] J. Davidson. A genetic algorithm-based solver for very large
jigsaw puzzles: Final report.
[2] H. Freeman and L. Garder. Apictorial jigsaw puzzles: The
computer solution of a problem in pattern recognition. IEEE
Transactions on Electronic Computers, EC-13(2):118–127,
April 1964.
[3] D. Goldberg, C. Malon, and M. Bern. A global approach to
automatic solution of jigsaw puzzles. Comput. Geom. Theory
Appl., 28(2-3):165–174, June 2004.
[4] H. Wolfson, E. Schonberg, A. Kalvin, and Y. Lamdan. Solving jigsaw puzzles by computer. Annals of Operations Research, 12:51–64, 1988.
[5] D. A. Kosiba, P. M. Devaux, S. Balasubramanian, T. L. Gandhi, and K. Kasturi. An automatic jigsaw puzzle solver. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 1 – Conference A: Computer Vision & Image Processing, volume 1, pages 616–618, Oct 1994.
[6] N. Kumbla. An automatic jigsaw puzzle solver.
[7] L. Liang and Z. Liu. A jigsaw puzzle solving guide on mobile
devices.
[8] A. Mahdi. Solving jigsaw puzzles using computer vision.
Database-Backed Scene Completion
Alex Alifimoff
aja2015@cs.stanford.edu
Author's note: I liberally use the third person pronoun "we" in this paper, as I am used to working on group projects. Rest assured, I am the only author of this project.
Introduction
Ever have an almost-perfect photo? Maybe it’s that photo of the beach that your dweeb
uncle stepped in front of, or that wedding photo ruined by the donut truck driving in the
background. Scene completion is the task of taking a photo and replacing a particular
region of that photo with an aesthetically sensible alternative. In this work we demonstrate
our implementation of a method originally implemented by Hayes & Efros which produces
interesting scene completions utilizing a very large database of images.
Previous Work
There are many different approaches to the problem of scene completion. One possible
approach is to use multiple images (either from multiple cameras, multiple pictures, or
video) to determine exactly what kind of information was in the masked part of the image,
and then adjust that information appropriately and place it back into the original image.
[2, 3]
A second common approach is to utilize information from the image itself to attempt to
guess what kinds of missing information should be used to fill the masked part of the image.
[3] The majority of these approaches involve utilizing nearby textures and other patterns
from the input image to fill the scene.
This project follows the methodology outlined by Hayes and Efros [1], who differ from previ-
ous approaches in that they try to complete the scene by finding plausible matching textures
from other photographs. In particular, the implemented system searches thousands to (ide-
ally) millions of photographs to find globally matching scenes, and then utilizes texture
patterns from those scenes to fill the missing hole in the input image.
Key Improvements
The significant improvement of the Hayes & Efros system over previous image completion
software is the ability to produce ”novel” scene completions through the use of the large
image database that is searched to find global scene matches.
Additionally, the Hayes & Efros system does not place wholly stringent restrictions on which
pixels must be used from the source image and which pixels must be completed. All of the
masked pixels must be replaced, but through the use of a novel application of min-cost
graph cutting, the system may decide to replace more pixels in the original image if it
makes for a better fit. This allows interesting completions that simply aren’t possible with
more stringent restrictions. We discuss this in depth in the following sections.
Figure 1: An example input image with corresponding mask (black region)
Technical Approach
The input to the algorithm is an image and a corresponding mask, which indicates which
part of the image needs to be filled.
Generally, we then follow three steps:
1. Semantic Scene Matching. Quickly identify possible images that we could use to fill
the hole in our scene.
2. Local Context Matching. Identify the local context and search all of the remaining
scene matches to find the best local matches.
3. Blending. Perform graph cut and Poisson blending to merge the two images.
The first part of the method involves finding images which represent similar scenes to the
image being filled. However, since there are potentially millions of images to search, any
comparison must be done extremely quickly. To implement this part of the algorithm,
we rely upon a scene descriptor called GIST. GIST descriptors build a low-dimensional
representation of the scene which is designed to capture a couple of perceptual dimensions.
The authors of the original GIST paper describe these dimensions as ”naturalness, openness,
roughness, expansion, ruggedness” [4, 5].
GIST descriptors are pre-generated for every image in the database. Once a masked image
is provided, the GIST descriptor for the masked image is calculated. I use GIST descriptors
with 5 oriented edge-responses at 4 scales aggregated to a 4x4 spatial resolution. These
descriptors are slightly smaller than the original ones in Hayes and Efros, but the slight
reduction did not impact performance while slightly improving the time it took to build
the GIST descriptors for the database. We augment each GIST descriptor with a color
histogram with 512 dimensions to capture color information.
The search is then performed by using the weighted combined GIST and color descriptor
and an l2-distance metric. Each generated descriptor is compared to the masked input
image, and the best 100 images are kept for local context matching.
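A minimal sketch of this search in Python (the array layout, the color-weighting constant, and the NumPy-based implementation are illustrative assumptions): each database image has a precomputed descriptor formed by concatenating its GIST vector (5 orientations x 4 scales over a 4x4 grid gives 320 dimensions) with the 512-bin color histogram, and the query is compared to all of them with an L2 distance.

```python
import numpy as np

GIST_DIM, COLOR_DIM = 320, 512   # 5 orientations x 4 scales x 4x4 cells; 512-bin histogram

def combined_descriptor(gist, color_hist, color_weight=0.5):
    """Concatenate GIST and color histogram; color_weight is an assumed value."""
    return np.concatenate([gist, color_weight * color_hist])

def best_scene_matches(query_desc, db_descs, k=100):
    """Return indices of the k database images closest to the query (L2 distance).

    query_desc: (GIST_DIM + COLOR_DIM,) vector; db_descs: (N, GIST_DIM + COLOR_DIM)."""
    dists = np.linalg.norm(db_descs - query_desc, axis=1)
    return np.argsort(dists)[:k]
```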
Figure 2: Best GIST matches for leftmost image
Computational Limitations
One of the main difficulties in pursuing this project was the computational resources nec-
essary to implement it in the same manner as the original authors. The original paper
utilized a network of 15 computers to examine millions of images simultaneously. Since
this computing power wasn’t available to me, I downloaded a pre-filtered subset of closely
matching images from Hayes and Efros’ project site to augment the thousands of images
I downloaded independently. This allowed me to get high quality matches to ensure the
rest of the scene completion pipeline worked appropriately. The graphics in this report
were generated from my own database of 200,000 images and the additional images from
the original project site. My database primarily consisted of photos downloaded from the
image sharing website, Flickr, that were tagged ”outdoors”. I restricted the category for the
purpose of getting quality completions for images within the same category.
It took about 12 hours to generate all of the GIST descriptors for the small dataset with
liberal use of multiprocessing. However, this computation only needs to be performed
once. Performing a single nearest-neighbor search of the dataset takes approximately five
minutes.
The next step is to find appropriate patches in semantically similar images to use as the
scene completion content. The first step of this process is to determine exactly what the
local context of a particular image is. To do this, I first dilate the mask. This effectively
produces a second, slightly larger mask. Then the mask is cropped so that it is only the
width and height of the dilated mask. I then subtract away the original mask and am
left with an image patch which corresponds to the local context of a particular image. The
remainder corresponds to the local context that we will be examining in each image. We will
use this patch to find the ”optimal” patch to use in our scene completion for each matching
image.
We illustrate this process. On the left we show an example dilated mask, where the red
corresponds to the area of local context we will be considering. Then, as we move it across the
image, we consider patches like the one on the right.
We take each patch and compute a HOG descriptor and a color histogram. We utilize these
as texture and color features and compare them to the local context of our source image.
We use sum-of-squared distances of this feature set to select which patch to use from a
particular image.
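A rough Python sketch of this local context matching (the feature parameters, window stride, and helper names are assumptions; the actual implementation may differ): slide a window over each candidate scene, compute HOG and color-histogram features over the local-context pixels, and keep the window with the smallest sum-of-squared differences to the query context.

```python
import numpy as np
from skimage.feature import hog

def context_features(patch_rgb, context_mask, bins=8):
    """HOG + per-channel color histogram over the local-context pixels only.

    patch_rgb is a float image in [0, 1]; context_mask marks the context region."""
    gray = patch_rgb.mean(axis=2)
    hog_vec = hog(gray, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
    hists = [np.histogram(patch_rgb[..., c][context_mask], bins=bins,
                          range=(0, 1), density=True)[0] for c in range(3)]
    return np.concatenate([hog_vec, np.concatenate(hists)])

def best_patch(query_feat, scene_rgb, context_mask, stride=8):
    """Scan the candidate scene and return the top-left corner of the best patch."""
    h, w = context_mask.shape
    best, best_pos = np.inf, (0, 0)
    for r in range(0, scene_rgb.shape[0] - h, stride):
        for c in range(0, scene_rgb.shape[1] - w, stride):
            feat = context_features(scene_rgb[r:r + h, c:c + w], context_mask)
            ssd = float(np.sum((feat - query_feat) ** 2))
            if ssd < best:
                best, best_pos = ssd, (r, c)
    return best_pos, best
```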
Blending
There are two main parts to the blending step. Given a mask and a dilated mask, we have
to use all of the pixels from the patch image for the area of the image covered by the mask.
However, for the dilated mask, we have a decision to make. One particular innovation of
Hayes and Efros’ method is actually choosing to remove more of the original image than the
mask requires.
To determine which part of the original image to keep and which part to patch, we use a
min-cost graph-cut algorithm. We assign each pixel a label, ”original” or ”patch”. We call
the set of all labels L, and we minimize:
C(L) = Σ_p C_unary(p, L(p)) + Σ_{p,q} C_pair(p, q, L(p), L(q))    (1)
We define the unary cost functions as follows. For any pixel in the space removed by the original mask, C_unary(p, original) >> 0 (any very large number) and we set C_unary(p, patch) = 0. For any pixel that is not covered by the mask or the dilated mask, C_unary(p, patch) >> 0 and C_unary(p, original) = 0. The intuition here is that for the former category we must choose pixels from the patch, and in the latter category we must choose pixels from the original picture. For all of the rest of the pixels (those under the dilated mask but outside the original mask), we define a cost of labeling the pixel as patch that grows with the distance between f(p) and f′(p), scaled by a constant k, where f is a function returning the location of a pixel and f′ is a function returning the nearest pixel in the original mask. Intuitively, we want to penalize choosing pixels from the patch more heavily the further away we get from the hole. Like Hayes and Efros, we use k = 0.002.
The remaining part of the graph cut algorithm is determining how to define C_pair. In our implementation, each pixel is connected in a four-way neighbor set-up, so for pixels that are not adjacent we have zero cost. For adjacent pixels, we use:

C_pair(p, q, L(p), L(q)) = ||h(p_patch) − h(p_original)|| + ||h(q_patch) − h(q_original)||    (3)

where h is a function returning the vectorized (RGB) representation of a pixel. The intuition here is that we want to minimize the gradient of the image difference, as opposed to the intensity difference, along the seam.
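The sketch below (my illustration, not the original implementation) builds the two cost terms: hard unary constraints inside the hole and outside the dilated mask, and the pairwise cost of Equation (3) on 4-connected neighbors. A min-cut solver such as PyMaxflow could then be run on these costs; that step is omitted here, as is the distance-based unary term discussed above.

```python
import numpy as np

BIG = 1e9  # "very large" unary cost used to enforce hard label constraints

def unary_costs(mask, dilated_mask):
    """Per-pixel cost of labeling 'original' vs 'patch' (hard constraints only)."""
    cost_original = np.zeros(mask.shape)
    cost_patch = np.zeros(mask.shape)
    cost_original[mask] = BIG                # inside the hole: must use the patch
    cost_patch[~dilated_mask] = BIG          # outside the dilated mask: must keep the original
    # Pixels between the mask and the dilated mask get a distance-based cost (omitted).
    return cost_original, cost_patch

def pairwise_costs(original_rgb, patch_rgb):
    """Equation (3) on 4-connected neighbors: penalize seams where the two
    images disagree, so the cut prefers regions where they already match."""
    diff = np.linalg.norm(patch_rgb.astype(float) - original_rgb.astype(float), axis=2)
    horizontal = diff[:, :-1] + diff[:, 1:]  # cost of a cut between (r, c) and (r, c+1)
    vertical = diff[:-1, :] + diff[1:, :]    # cost of a cut between (r, c) and (r+1, c)
    return horizontal, vertical
```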
We include a figure demonstrating the change from before the graph-cut is applied and
afterwards. Generally this causes a small expansion in the size of the mask.
Finally, Poisson blending is applied to the image and its patch to seamlessly blend the two images. This ensures that slight differences in color do not ruin the completion attempt. The Poisson solver is allowed to run on the entire domain of the image and not just the local region it is attempting to patch.
Results
Here are a number of possible good completions from various input masks and input images.
Generally, when I used the pre-seeded database of gist matches compiled from the Hayes &
Efros site, I got reasonable performance. Additionally, when I used images that were ”out-
doorsy” (this was the image category I primarily downloaded from Flickr), I got reasonable
Figure 4: The patch before and after applying graph-cut
completions. However, when trying to complete images that didn’t have particularly good
GIST descriptor matches in the dataset, the matches could be comically bad.
One of the main take-aways from this project is that this method highly relies upon having
a large dataset available to search for completions. Hayes and Efros required a significant
amount of computation power to search their dataset of 2 million images, and even they
largely restricted the semantic categories in which they downloaded images. Utilizing this
method as a production system for image completion would only be reasonable for companies
that have significant computation power and access to many images, like a search engine
provider or photo-sharing website.
Runtime is another issue of concern with this particular algorithm. As discussed previously,
Hayes and Efros required fifteen CPUs to process a single image in five minutes. On a single
CPU, their algorithm took 74 minutes to run. The average runtime of my implementation was measured across a sample set of 100 photographs; for this particular experiment, we chose to use the 200 best matching scenes for local context matching.¹ My implementation was comparable to Hayes and Efros, despite being implemented in Python.
Quantitatively evaluating the performance of the algorithm in regards to how effectively
it completes images is difficult, as there is no good metric for evaluating the ”realness” of
photographs without doing human evaluation. This evaluation is done by Hayes and Efros,
but sadly I did not have the time or the access to resources to adequately conduct human
trials.
Areas of Improvement
There are numerous situations in which this system fails. We generally classify these errors
into a couple of different categories.
1. The first category are failures of scene matching. These are situations in which the
GIST descriptor identifies scenes that just don’t belong together (i.e. filling a mask
in a tropical scene using an image from the snow)
2. The second category are blending issues. These errors occur typically when a sub-
optimal image patch is chosen that contains superfluous artifacts, or when the
graph-cut algorithm chooses to include something it should not.
¹ This was the same number as Hayes and Efros used, for purposes of comparison, although for smaller databases we suspect this number should be reduced, as many of the matches beyond the 20th were quite bad.
Figure 5: Mask and possible completions for grassy/forest scene
Figure 7: An example of a high-level semantic issue. Notice the partial rabbit in the grass.
3. The final category of errors includes issues with high level semantics. These are
situations in which partial objects are included in the patch, such as part of a rabbit
filling in a grassy scene because there is otherwise a good local context match. Since
the algorithm has no notion of objects, this happens quite often.
Final Thoughts
In general, I was quite happy with the output of the system. There was generally at least
one reasonable completion for the vast majority of images that I would input, provided I
was using input images that fell into the same semantic category as input images in my
database. Largely, this algorithm is effective with lots of data, but not generally effective
for solving the scene completion problem on a resource-limited budget.
Acknowledgments
I would like to thank Silvio Savarese for an awesome class, and the entire teaching staff for making a really strong effort to improve the class throughout the quarter.
References
[1] Hayes, J. and Efros, A. Scene Completion Using Millions of Photographs. SIGGRAPH, 2007.
http://graphics.cs.cmu.edu/projects/scene-completion/scene-completion.pdf
Figure 9: An example of a scene matching issue. The forest does not belong in the city!
[2] Irani, M., Anandan, P., and Hsu, S. 1995. Mosaic based representations of video sequences and
their applications.
[3] Agarwala, A., Dontcheva, M., Agrawala, M., Drucker, S., Colburn, A., Curless, B., Salesin, D.,
and Cohen, M. 2004. Interactive digital photomontage. ACM Trans. Graph. 23, 3, 294–302.
[4] Oliva, A., and Torralba, A. 2006. Building the gist of a scene: The role of global image features
in recognition. In Visual Perception, Progress in Brain Research, vol. 155.
[5] Oliva, A., and Torralba, A. 2001. Modeling the shape of the scene: a holistic repre-
sentation of the spatial envelope. In International Journal of Computer Vision, vol 42 (3).
https://people.csail.mit.edu/torralba/code/spatialenvelope/
Deep Drone: Object Detection and Tracking for
Smart Drones on Embedded System
that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounding boxes and scores at each position. RPNs are trained end-to-end to generate high-quality region proposals. Because the region proposal network is fused with the detection network and the whole model can be trained end to end, the network is faster than Fast R-CNN.
The YOLO detector [13] is a new approach to object detection. Prior work on object detection re-purposes classifiers to perform detection. Instead, YOLO frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance. The unified architecture is extremely fast but not as accurate as Faster R-CNN.
KCF [7, 8] is a kernelized correlation filter used for tracking. It exploits the fact that, under some conditions, the resulting data and kernel matrices become circulant. Their diagonalization by the DFT provides a general blueprint for creating fast algorithms that deal with translations, reducing both storage and computation by several orders of magnitude and yielding a state-of-the-art tracker that runs at 70 frames per second on an NVIDIA TK1 and is very simple to implement.
MDNet [12] is the state-of-the-art visual tracker based on a CNN trained on a large set of tracking sequences, and the winning tracker of the VOT2015 Challenge. The network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify the target in each domain. The network is trained with respect to each domain iteratively to obtain generic target representations in the shared layers. Online tracking is performed by evaluating candidate windows randomly sampled around the previous target state.
However, the drawback of MDNet is that it needs to run a CNN to extract image features, making it very slow. Considering the frame rate required for real-time tracking, we used the KCF algorithm for tracking, which achieves 70 frames per second on our hardware: an NVIDIA Tegra K1.
3. Contribution
Fast and Faster R-CNN originally used VGGNet for feature extraction. It is accurate but slow. Drones have limited hardware resources, both in memory and in computation power, so we need a smaller network. In order to run the CNN fast enough on an embedded device, we did not use those off-the-shelf network architectures. Instead, we used a relatively shallow and small network to extract image features and perform detection. The architecture is shown in Fig 1. Even using our small network architecture, the detection frame rate is still low on the TK1 mobile GPU. To compensate for the slow speed of detection, we used the cheap KCF tracker, although it is less accurate than MDNet, to track the bounding box returned by the detection pipeline. Thus we have the accurate but slow Faster R-CNN for detection, and the less accurate but very fast KCF for tracking. Detection is only called when the confidence of the tracker drops below a certain threshold, which is very infrequent. This architecture makes the pipeline accurate, robust and fast.
Accuracy is not our sole target in this project. We perform a thorough evaluation with respect to accuracy, power consumption, speed, and area of different hardware running the detection and tracking algorithms. Balancing these hardware constraints, rather than only optimizing for mAP, is our top priority.
4. System Architecture
The software architecture of our vision system consists of two components. The first component is a detection algorithm running a Convolutional Neural Network (CNN), and the second is a tracking algorithm using HOG features and KCF. These two algorithms are seamlessly integrated to ensure smooth and real-time performance. The detection algorithm, e.g. Faster R-CNN, is expensive to compute, since CNN-based detection requires many GOPs per frame, and is only called to initialize a bounding box for the key object in the scene. The tracking algorithm, e.g. KCF, is relatively inexpensive to compute and can run at a high frame rate to track the bounding box provided by the detection algorithm. The main algorithm loop is shown in the pseudo code below, and we discuss the detection and tracking details in the next two subsections.

Algorithm 1 Detection and Tracking Pipeline for Deep Drone
  boxFound ← false
  while true do
    f ← new frame
    while boxFound == false do
      detection(f)                 ▷ Invoke detection algorithm
      if Box is detected then
        boxFound ← true
      end if
    end while
    tracking(f)                    ▷ Invoke tracking algorithm
    if Tracking is lost then
      boxFound ← false
    end if
  end while
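A compact Python-style sketch of this loop (illustrative only: read_frame, detect, and tracker stand in for the camera interface, the Faster R-CNN detector, and the KCF tracker, and the confidence threshold is an assumed value):

```python
CONFIDENCE_THRESHOLD = 0.5   # assumed; re-detect when the tracker falls below this

def run_pipeline(read_frame, detect, tracker):
    """Alternate between expensive detection and cheap tracking.

    detect(frame) -> bounding box or None; tracker.init(frame, box);
    tracker.update(frame) -> (box, confidence). Sketch only."""
    box = None
    while True:
        frame = read_frame()
        if box is None:
            # Slow path: run the CNN detector until the target appears.
            box = detect(frame)
            if box is not None:
                tracker.init(frame, box)
            continue
        # Fast path: KCF-style tracking on every frame.
        box, confidence = tracker.update(frame)
        if box is None or confidence < CONFIDENCE_THRESHOLD:
            box = None   # tracking lost; fall back to detection on the next frame
```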
4.1. Detection
Drones are mainly used to take pictures of people, so we focus on detecting people as a first step. We made the further assumption in this project that there is only one person of interest to track, so that the person with the highest detection score is our target. We used two detectors for people detection, Faster R-CNN and the YOLO detector, and we analyze them both in the sections below.
4.1.1 Using Faster R-CNN
We used a 7-layer convolutional neural network with Faster R-CNN [15] for people detection. The framework takes raw image frames from a video stream and outputs bounding boxes and target classes for detected objects.
We used an in-house model trained on the KITTI [2] dataset. KITTI contains a rich amount of training samples that include objects such as cars, pedestrians and cyclists, and can easily generalize to our task. In this project, our detection target of interest is people, so we modified the script to detect people only. A detailed architecture is shown in Figure 1.
We measured the accuracy (mAP) and speed of our in-house model and compared it with the baseline, shown in Table 1. Our in-house model has slightly worse accuracy than the baseline, but is 12x faster.

Table 1. Accuracy and runtime for our detection network (runtime is measured on a GTX 980 GPU)
Model          mAP     Runtime
Baseline [15]  65.9 %  2 s
Ours           62.0 %  0.17 s

4.2.1 KCF
KCF [7][8] is a more old-school tracking algorithm than MDNet, but it is supposed to be faster and more succinct. The algorithm uses the Discrete Fourier Transform to diagonalize the data matrix, which is then used to train a discriminative classifier through linear regression and the kernel trick; this approach is called the Kernelized Correlation Filter (KCF).
We found that KCF runs very fast on video: it takes on average around 8.8 milliseconds per frame on a MacBook Pro CPU.
The downside of KCF is the requirement that the video has to be continuous. If the video fades to black, the object moves very fast, or a jump cut occurs, KCF will have a hard time recovering. When this occurs, the peak value of the detection score from running the Gaussian kernel on the correlation filter suffers a significant drop; it will also return a negative bounding box value if it cannot find any match. We leveraged this behavior to combine KCF with Faster R-CNN or another detection algorithm to solve the problem: when KCF is not confident or fails entirely, it calls a detection algorithm in the hope of recovering.

Table 2. Speed of detection and tracking on different hardware platforms
Hardware Platform     GTX 980   TX1     TK1
Power                 150 W     10 W    7 W
Detection             0.17 s    0.6 s   1.6 s
Tracking              5.5 ms    14 ms   14 ms
Tracking Frames/Sec   182 fps   71 fps  71 fps
Figure 1. The CNN architecture that we used for detection
In order to deal with the large form factor of the TX1 development board, we bought a small carrier board for the TX1. It is shown in Fig 4.3. This carrier board is as small as the heat sink: we can unplug the TX1 central board from the full development board and plug only the central board, which contains the actual TX1 chip, into this carrier board. This makes the size even smaller than the TK1. However, there is no free lunch: the interface, especially the power supply, is not compatible with the DJI drone, and we have not yet had a chance to connect the carrier board to the drone, which could be future work. The TK1 is fully working, if not optimized for speed, so the TX1 carrier board is not on the critical path of our project.
5. Implementation and Experiments
5.1. Detection
Faster R-CNN builds on top of Caffe, a deep learning framework that requires multiple dependencies, and has a number of customized region pooling layers. We spent great effort installing CUDA and Faster R-CNN onto all of our desktop and embedded platforms. During installation, we ran into the following issues.
1. We flashed our TK1 with the latest L4T (Linux for Tegra); however, the TK1 does not support CUDA versions higher than CUDA 6.5, so we only installed the CUDA 6.5 dev kit on the TK1. The latest Caffe is not backward compatible with CuDNN v3 and earlier, and if we revert Caffe to an earlier branch, it does not support the new layers required by Faster R-CNN. So we turned off the CuDNN switch for the Caffe installation to bypass this problem.
2. Some of the Faster R-CNN libraries were written in Python and compiled into native C++ code using Cython. When installing these libraries onto the embedded systems, i.e. TK1 and TX1, we ran into a compilation error in gpu_nms.cpp, a GPU implementation of non-maximum suppression. The root cause was identified as compiler incompatibility on the embedded systems.
Figure 2. Faster RCNN performs really well on detecting people from a drone's perspective, even when the object is far away and twisted (last figure).
Figure 3. The Yolo detector performs not as well as Faster RCNN. When the target person is small, the detector fails.
Figure 4. The KCF tracker performs very well on videos that do not have jump cuts. It fails if the video fades to black. However, this is not a problem, since detection will be called in this scenario.
5.3. Handshake between detection and tracking bounding box result will affect how well tracking algorithm
performs. If the bounding box is not tight enough, it might
Since the detection algorithm is implemented in Python,
incorporate irrelevant subject (in the nunchaku case incor-
we uses the Python.h to realize a C++/Python binding be-
porating big chunk of shadow, Figure 2 sub-image 4), track-
tween the two algorithms. The initialization of detection
ing will then mistakenly think that it’s the irrelevant subject
algorithm (loading neural network under Caffe) is stored in
it wants to track. Second, detection doesn’t work well when
the C++ main program as a PyObject. Upon receiving a
the person is disguised as a bulkier figure. As in the snow-
new frame, C++ main program converts the frame (stored in
boarding video, when the viewpoint is on the side of the
byte array) to a Python ndarray object and passes the result-
person, with the help of helmet, face mask and bulky cloth,
ing array to a call to the detection method. It then parse the
it’s hard to detect the person. We address the two problem
result and determine if the detection result matches a thresh-
by adjusting the confidence threshold for detection score,
old (how confident the object is a person). An interesting
and retrain the neural network with images from different
bug that arise is when importing a Python Module under
angles and viewpoints.
sudo(which is needed for activating and using DJI drone’s
camera); this is because some of the PythonThe NVIDIA 5.6. Online detection and tracking on DJI live cam
TX1 and TK1 development board that we used packages are
installed with root rw permission only, we worked around We then adjust the detection and tracking module to
this problem by command sudo su, and hacked the privilege function with DJI M100’s live cam. As mentions in sec-
of using the DJI live camera. tion earlier, the camera read in data as raw byte array in
5.4. Interacting with the DJI camera library

We use two sets of DJI libraries. They only contain interface code to talk to the camera; we did not use any DJI code for the vision algorithms. First, the camera input module provided by djicam.h leverages libdcam to read from the built-in camera on the drone. The library only provides three simple functions (manifold_cam_init, manifold_cam_read and manifold_cam_exit), which means we need to manipulate all data as raw pixel arrays. We initialize the camera with TRANSFER_MODE (transferring the image input to the controller and mobile app) and GET_BUFFER_MODE (storing the video input in a local buffer byte array in NV12 format; more on format conversion in a later subsection). We use the CAM_NON_BLOCK mode (which means not waiting for the camera to fully initialize) to ensure that we keep constant control of the drone even if the camera is not set up. At the end, we sleep the program and wait for the camera to exit before we return. The specifics of the library can be found here: https://github.com/dji-sdk/Manifold-Cam/blob/master/djicam.h
5.5. Offline Detection and Tracking

We first tested our detection and tracking module on our own offline video recordings from different perspectives (a DJI Inspire recording of Song Han playing nunchaku and a GoPro recording of Song Han snowboarding) with the TK1 board mounted on the drone. For detection using Faster R-CNN, each frame takes 1.6 seconds on average (compared to 0.6 s on the TX1, which is small enough to mount on the drone but not yet supported by DJI). For tracking under KCF, each frame takes only 14 ms (71 fps). Running our detection and tracking module on the two videos exposed several interesting problems. First, the detection bounding box is not always tight and can incorporate an irrelevant subject (in the nunchaku case, a big chunk of shadow, Figure 2 sub-image 4); tracking will then mistakenly lock onto that irrelevant subject. Second, detection does not work well when the person is disguised as a bulkier figure: in the snowboarding video, when the viewpoint is to the side of the person, the helmet, face mask and bulky clothing make it hard to detect the person. We address the two problems by adjusting the confidence threshold for the detection score and by retraining the neural network with images from different angles and viewpoints.
5.6. Online detection and tracking on the DJI live cam

We then adjusted the detection and tracking module to work with the DJI M100's live cam. As mentioned in an earlier section, the camera reads data as a raw byte array in NV12 format. This format has three components for each pixel: a luma component (the brightness) Y and two chrominance (color) components U and V, while the detection and tracking algorithms are all based on RGB pixel values. We therefore perform the following conversion to obtain an RGB representation of the frame (clamping means restricting the value to the 0-255 range for RGB):

R = clamp(Y + 1.4075(V - 128))
G = clamp(Y - 0.3455(U - 128) - 0.7169(V - 128))
B = clamp(Y + 1.779(U - 128))
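The sketch below is a vectorized version of this per-pixel conversion; it assumes the Y, U and V components have already been extracted from the NV12 buffer and upsampled to full resolution as float arrays (the plane deinterleaving itself is elided).

    import numpy as np

    def yuv_to_rgb(Y, U, V):
        # apply the three conversion formulas above
        R = Y + 1.4075 * (V - 128.0)
        G = Y - 0.3455 * (U - 128.0) - 0.7169 * (V - 128.0)
        B = Y + 1.779 * (U - 128.0)
        # clamp to the valid 0-255 range and pack into an HxWx3 uint8 image
        rgb = np.stack([R, G, B], axis=-1)
        return np.clip(rgb, 0, 255).astype(np.uint8)

Recent OpenCV builds also expose NV12 conversions (for example cv2.COLOR_YUV2RGB_NV12), which can replace the manual formula when available.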
After obtaining the frame encoded in RGB format, we use a similar approach to the offline videos and detect and track frame by frame. A challenge that surfaces in this step is that, since we are doing detection on a live stream, the subject might be moving rapidly while we are detecting: the object may already have moved out of the bounding box by the time detection finishes analyzing a frame from dt seconds ago. We addressed this issue by initializing the tracker with the frame from dt seconds ago instead of the current frame. The tracker then trains its positive and negative patches on the correct bounding area. This solves the problem unless the object has deformed too much within the dt time window.
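A minimal sketch of this delayed-initialization trick is shown below, using OpenCV's KCF tracker as a stand-in for the project's implementation (the exact tracker API and frame-buffering scheme are assumptions).

    import cv2

    def start_tracker_on_detection(frame_at_detection, newer_frames, bbox):
        """bbox: (x, y, w, h) returned by the detector for frame_at_detection."""
        tracker = cv2.TrackerKCF_create()          # opencv-contrib KCF tracker
        tracker.init(frame_at_detection, bbox)     # learn patches on the correct bounding area
        for frame in newer_frames:                 # frames buffered during the detection delay
            ok, bbox = tracker.update(frame)
            if not ok:
                break                              # target deformed or moved too much within dt
        return tracker, bbox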
5.7. Controlling the Drone

We control the camera using the OnBoard SDK provided by DJI. We first need to send activation data to the CoreAPI driver. Since DJI recently upgraded their drone operating system and their Onboard SDK was not updated accordingly, this step caused tremendous trouble: we had to contact the DJI engineer who wrote the Onboard SDK to get the new version of the encryption key (a magic number) to successfully activate. We can then gain control of the camera by sending GimbalAngleData to it. A GimbalAngleData contains three important fields corresponding to the three spatial degrees of freedom of the camera: yaw, roll and pitch.
6. Conclusion

We present Deep Drone, a detection and tracking system running in real time on embedded hardware, that powers drones with vision. We presented our software architecture, which combines an accurate but slow detection algorithm with a less accurate but fast tracking algorithm to make the system both fast and accurate. We also compared the runtime, power consumption and size of different hardware platforms, and discussed implementation issues and the corresponding solutions for those embedded platforms.

Acknowledgment

We thank Amber Garage for equipment support.
Eigenfaces and Fisherfaces – A comparison of face detection techniques
Pradyumna Desale (SCPD, NVIDIA), pdesale@nvidia.com
Angelica Perez (Stanford University), pereza77@stanford.edu
...achieve 95% accuracy with 4 training images, as shown in the graphs below.

Graph 2: 2D PCA against number of principal components for various training sets from ORL.
Graph 3: Segmented PCA against number of principal components for various training sets from ORL.
Graph 4: LDA against number of principal components for various training sets from ORL.
Graph 6: Segmented PCA against number of principal components for various training sets from Yale.
Graph 7: LDA against number of principal components for various training sets from Yale.
Looking closely at the inaccurate face detections in the Yale database gives us insight into this problem. When the first 4 images of each subject are picked for training, only subjects 15 and 3 have dark glasses while the other subjects do not have glasses. When test images are run against such a training set, all the probe images of subjects with glasses are categorized into either class 3 or class 15. Since not all subject images with glasses were used for training, our training model is heavily biased towards those two classes. That being said, increasing the number of training images improves the accuracy and we get around the bias.

4.2. Variations in facial expression of subjects, structural changes to the faces, and variations in pose

The Yale database also includes facial expressions of subjects, and Graphs 5, 6 and 7 show that the LDA method is more immune to variations in facial expressions, but the PCA method's accuracy is not substantially lower than LDA's. The LDA method is definitely superior when ...

Overall we are very surprised by how resilient both sets of algorithms are to significant changes in facial poses.

...Gaussian noise of different variance in the experiment.

We also use the Wiener filter, which is an MSE-optimal stationary linear filter for suppressing the degradation caused by additive noise and blurring. Fourier transforms are unable to recover components for which the Fourier transform of the point spread function is 0; this means they are unable to undo blurring caused by band-limiting of the Fourier transform. We can see that some of the face recognition accuracy is recovered when Wiener filters are used to correct additive noise, as shown in Graphs 8 and 9. Wiener-filtered, noise-suppressed images are shown below.
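A minimal example of this denoising step is sketched below; scipy's Wiener filter is used as a stand-in for our implementation, and the window size is an illustrative choice.

    import numpy as np
    from scipy.signal import wiener

    def denoise_face(image_gray):
        """image_gray: 2-D float array with values in [0, 255]."""
        # suppress additive Gaussian noise before feeding the image to the recognizer
        filtered = wiener(image_gray.astype(np.float64), mysize=(5, 5))
        return np.clip(filtered, 0, 255)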
4.3.2 Speckle Noise

This granular noise occurs in ultrasound, radar and X-ray images and in images obtained from magnetic resonance. This multiplicative, signal-dependent noise is generated by constructive and destructive interference of the detected signals; wave interference is the reason multiplicative noise occurs in the scanned image. Speckle noise is image dependent, so it is considered hard to find a mathematical model that describes its removal, especially if we expect randomness in the input data. We had identified Lee's filter as the method to counter speckle noise, but speckle noise does not greatly affect recognition performance, so we prioritized the study of Gaussian and salt-and-pepper (S&P) noise over correcting speckle noise.
https://pereza77@bitbucket.org/pereza77/cs231a_final_project_face_recognition.git
5. Summary

We examined different subspace methods of face recognition in this project. The two-stage recognition systems use PCA or LDA for feature extraction followed by an SVM for classification. All methods are significantly influenced by the settings of parameters related to the algorithm used (i.e. PCA, LDA or SVM).
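A sketch of this two-stage pipeline is shown below, using scikit-learn as a stand-in for our implementation; the component count is an illustrative choice.

    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def build_recognizer(method="pca", n_components=50):
        if method == "pca":
            extractor = PCA(n_components=n_components)
        else:
            # for LDA the number of components is capped at n_classes - 1
            extractor = LinearDiscriminantAnalysis()
        return make_pipeline(extractor, SVC(kernel="linear"))

    # X_train: (n_samples, n_pixels) flattened face images, y_train: subject labels
    # model = build_recognizer("pca").fit(X_train, y_train)
    # accuracy = model.score(X_test, y_test)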
For methods working in ideal conditions, both PCA and LDA achieve greater than 90% accuracy with three training images.

This project dealt with a 'closed' image set, so we did not have to deal with issues like detecting people who are not in the training set. On the other hand, our two test databases contain images of the same subjects that often differ in facial expression, hairstyle, presence of a beard, or wearing of glasses.
Emotion AI, Real-Time Emotion Detection using CNN
AWS, where we re-trained the first and last layer. We also had to experiment with various learning rate methods and parameters in order to generate a non-divergent model.

• Develop real-time interface: OpenCV allowed us to get images from our laptop's webcam. We then extracted the face as before, pre-processed the image for the CNN, and sent it to AWS. On the server, a script would run the image through the CNN, get a prediction, and the results would be pulled back to local.

3.2 Implementation

3.2.1 Dataset Development

CK+ Dataset The first step in developing our emotion-detection system was to acquire data with which to train our classifier. We sought to find the largest data set we could, and we selected the Extended Cohn-Kanade (CK+) data set. This data set is composed of over 100 individuals portraying 7 different labeled emotions: anger (1), contempt (2), disgust (3), fear (4), happy (5), sad (6), and surprise (7). In addition, we also introduced an additional class 0 to represent a neutral expression. One feature we really liked about this data set is that, for each person displaying an emotion, the directory contains 10 to 30 images demonstrating that individual's progression from a neutral expression to the target emotion. This is useful because it allows us to have multiple degrees of intensity for each emotion represented in our data set, as opposed to only the most extreme examples. We originally elected to take the first two images of each sequence and label them as neutral, and the last three and label them as the target emotion. We found that this greatly limited our training set size, however, as we were left with fewer than 1000 training images. To combat this, we looked more closely at the images and decided to take the last third of each sequence as the target emotion, as opposed to just the last three images.

Excluding Contempt Upon testing this 8-class classifier, we found that it tended to over-predict "contempt". This manifested in quantitatively lower recall and precision scores for the contempt class, as well as qualitatively worse predictions when we fed it live images. We conducted further research on this and found that many papers on emotion detection ignore the contempt class, as they say it is merely a combination of fear and disgust. Taking this into account, we dropped all instances of contempt from our data set and re-split it. Thankfully, contempt was the smallest class in terms of image count, so we didn't lose a substantial part of our data set.

JAFFE Data Set After eliminating contempt, we again tested our model qualitatively and quantitatively. We found that we were doing very well quantitatively (our precision, recall, and accuracy were all well over 90% on both test and train), but our qualitative results were still rather poor. Since the network was doing very well on the data it was given but was not generalizing well, we decided to find additional data sources. One of the research papers we investigated combined CK+ with the Japanese Female Facial Expression (JAFFE) data set and was able to achieve improved results. Unfortunately, that data set only contains around 250 images, but it was still able to boost the model's performance by a few percent.

Custom Images Since the real-time interface was being tested solely by us, we also decided to add ourselves to the data set to see if it would improve qualitative results. We found a friend to help us, and the 3 of us proceeded to take an additional 20 images for each class, further increasing our data set size. After this final inclusion, our final data set sizes were:

Set    Size
Train  2104
Val     300
Test    601

Including the images from the JAFFE data set and the ones we custom-made, we were able to again boost our quantitative results by a small margin, and our qualitative results also noticeably improved.

3.2.2 Data Pre-processing

Given the non-homogeneity of the data set, we had to pre-process the data into a common format. We first converted all images to grayscale. We then utilized OpenCV to detect faces within each image, which returned a set of bounding boxes to examine. In cases where no bounding box was found, we set that image aside and ran it again using different detection parameters until we were able to successfully detect the face. In cases where multiple bounding boxes were returned, we
analyzed the sizes and locations of the boxes and selected the one that was largest and/or most central in the image. Using this approach, we were able to extract the facial component of every image in our data set. Once extracted, we re-scaled the images to a common 250-by-250 size.

In order to help the CNN perform better, we also applied statistical pre-processing to the images. The first step was applying a small 5-by-5 Gaussian filter over the images, which is meant to help smooth out noise while still preserving the image's edges and distinctive features. The second step was subtracting the training set's mean image from every image. This is beneficial because the distribution of pixel values becomes centered at 0, and it is common practice for training data fed to any machine-learning model.
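The sketch below illustrates this pipeline with OpenCV; the Haar cascade file and detection parameters are illustrative assumptions, not necessarily the exact ones we used.

    import cv2
    import numpy as np

    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def preprocess(image_bgr, mean_image=None):
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        boxes = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) == 0:
            return None                              # caller retries with other parameters
        x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])   # pick the largest face
        face = cv2.resize(gray[y:y + h, x:x + w], (250, 250))
        face = cv2.GaussianBlur(face, (5, 5), 0)     # small smoothing filter
        face = face.astype(np.float32)
        if mean_image is not None:
            face -= mean_image                       # training-set mean subtraction
        return face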
Since we only have 2104 distinct training images, and Convolutional Neural Networks tend to perform better with more data, we sought ways to enrich this data set. To do this, we augmented each image in two ways. First, we mirrored the image across the Y-axis, which produced a similar but not identical training point. In addition, we also introduced slight rotations of 10 degrees in either direction for each image, which helped to boost our training set size and improve robustness.
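A minimal sketch of these two augmentations is shown below.

    import cv2

    def augment(face_250x250):
        h, w = face_250x250.shape[:2]
        out = [cv2.flip(face_250x250, 1)]                 # mirror across the Y-axis
        for angle in (-10, 10):                           # small rotations in either direction
            M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
            out.append(cv2.warpAffine(face_250x250, M, (w, h)))
        return out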
3.2.3 CNN Construction

In order to develop our Convolutional Neural Network, we decided to utilize pre-trained models. We believed that this would lead to better results for our project, since these pre-trained networks are much deeper than we could develop ourselves and would thus have much better feature-detection power. In researching existing networks, we couldn't find any that dealt directly with facial detection or recognition, so we chose to use networks with varying initial applications.

The first network we used was LeNet, which was trained on the MNIST data set. The MNIST data set is composed of hand-written digits, and the objective of the model is to classify each image as a digit. The second network we looked at was AlexNet, which was developed and trained for the ImageNet Challenge. This challenge seeks to classify images into one of 1000 categories, ranging from animals to beverages. Even though neither of these networks deals directly with faces, our hope is that the lower-level features learned by these networks, such as edges and curves, can be transferred from these data-rich environments to our data-poor environment.

Since LeNet and AlexNet were trained with different intentions than our own, we needed to tweak the networks slightly. First, since our input images were neither color nor the 227-by-227 dimension utilized by these networks, we had to change the input data layer and retrain the first convolutional layer to account for this. Second, since we are only predicting 7 classes rather than the 1000 originally used, we needed to retrain the final softmax layer. Because we have far fewer training images than these networks originally had, we also had to experiment with different learning rate hyper-parameters in order to induce convergence, as the original hyper-parameters often diverged. We ultimately settled on a "fixed" learning rate policy, with a base learning rate of 0.001, a momentum of 0.9 and a weight decay of 0.0005. Note that even though a "fixed" learning rate policy is used, the SGD solver of Caffe still uses the momentum and weight decay to steadily reduce weight updates over time.

3.2.4 Real-Time Interface

In order to create the real-time interface, we needed to gather local images and run them through the CNN on AWS. To accomplish this, we utilized OpenCV to extract images from our laptop's webcam. The images were then pre-processed in the same manner as our data set: converted to grayscale, the facial component extracted, and re-scaled to 250-by-250. We chose to use 250-by-250 images because AWS would only allow us to send at most 65KB in a single file, so the 250-by-250 images fit within this constraint.

On AWS, we had a server script that keeps our trained model loaded into memory and waits for an incoming file. When a file is received, a Gaussian filter is applied to the image and the mean image is subtracted, just as with the rest of the data set. The image is then augmented via mirroring and rotation, and the resulting set of images is fed into the neural network. A prediction is produced for each image, and we select our final prediction to be the most common class label among the images. If there is a tie, we select the class with the highest sum of class scores among the maximal classes.
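The aggregation rule just described can be expressed in a few lines; the sketch below assumes the network returns one score vector per augmented copy.

    from collections import Counter
    import numpy as np

    def aggregate_predictions(class_scores):
        """class_scores: (n_augmented_images, n_classes) array of network outputs."""
        class_scores = np.asarray(class_scores)
        votes = Counter(class_scores.argmax(axis=1))       # majority vote over copies
        top_count = max(votes.values())
        tied = [c for c, n in votes.items() if n == top_count]
        if len(tied) == 1:
            return tied[0]
        sums = class_scores.sum(axis=0)                    # break ties by summed class scores
        return max(tied, key=lambda c: sums[c])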
One limitation of AWS is that it does not allow you to send data directly to a local computer, so our script could not simply send the results back when the computation was finished, or even
signal to us that it was done. We therefore had our first script write the results to a file and a second socket listen for that file. Another local script called the second AWS server, after the images had been sent, to retrieve the results; this combination of four scripts and two sockets created a close-to-real-time interface.

4 Experiments and Results

Confusion matrix (8-class model):

149   0   2   1   0   0   1   0
  1  61   0   0   0   0   1   0
  4   0   5   0   0   0   0   0
  3   0   0  51   0   0   0   0
  0   0   0   0  27   0   0   0
  1   0   0   0   0  82   0   0
  0   0   0   0   0   0  33   0
  1   0   0   0   0   0   0  80
Confusion matrix (7-class model):

128   0   0   1   0   0   1
  1  51   0   0   0   0   0
  3   0  57   0   0   0   0
  0   0   0  34   0   0   0
  0   0   0   0  78   0   0
  0   0   0   0   0  32   0
  1   0   0   0   0   0 100

Precision: 0.96  1.0  1.0  0.97  1.0  1.0  0.99
Recall:    0.98  0.98  0.95  1.0  1.0  1.0  0.99

Dataset  Acc.
Train    1.0
Val      0.975
Test     0.986

It is difficult to directly compare the results from the 7-class and 8-class cases since the data was re-segmented when removing contempt. Despite this, one can clearly see that the precision, recall, and accuracy for every class increased between the two runs. This indicates that excluding contempt not only improved performance by not misclassifying contempt images, but also prevented other images from being confused with contempt, so they were classified correctly. Our qualitative results also improved as a result of this change, and we were able to get more correct predictions. In a sample of 20 webcam images we sent to the network, 8 were classified correctly, and we had 100% accuracy on surprise images. An example of a correctly classified image is shown below.

...it includes images for the remaining 6 emotions. The inclusion of this data set resulted in the following:

Precision: 0.98  0.95  0.88  0.97  0.99  0.75  1.0
Recall:    0.97  0.98  0.92  0.91  0.93  1.0  0.92

Dataset  Acc.
Train    0.979
Val      0.969
Test     0.945

We can see that the JAFFE data set noticeably reduced our accuracies across all three splits. In addition, the precision of class 5 (sadness) took a big hit, dropping from 1.0 to 0.75. In an attempt to better understand why this occurred, we looked at some of the JAFFE images that were labeled as sad. We found that some of these images were rather poor or subtle examples of a sad expression, and could easily be confused with neutral just by looking at them. Below is an example of an image that is labeled as sad, but was incorrectly classified as neutral by our model.
4.5 Custom Images

In the final iteration of constructing our data set, we added a total of 420 images equally split among the 7 classes. In addition to providing additional, unique images to the model, it also helped to balance the class distribution, which is an issue we had previously been unable to address. By adding these images into our data set, we were able to achieve results similar to the 7-class model in nearly every category, which is significant given that we retain the JAFFE data set, which had previously decreased our performance. In addition, this was the first instance where AlexNet outperformed LeNet, so below we show AlexNet's results:

Confusion matrix:

136   0   0   0   0   0   0
  0  74   0   0   0   0   0
  0   1  59   0   0   0   0
  1   1   1  40   0   0   1
  0   0   0   0 117   0   0
  0   0   0   1   1  54   0
  1   0   0   1   0   0 112

Precision: 0.99  0.97  0.98  0.95  0.99  1.0  0.99
Recall:    1.0  1.0  0.98  0.91  1.0  0.96  0.98

Dataset  Acc.
Train    1.0
Val      0.99
Test     0.985

Figure: AlexNet loss.

In addition to the quantitative improvements in our model, we also experienced qualitative improvements. When observing the live stream of predictions being returned to us by our network, the results were much better than with our previous models. When portraying extreme expressions of surprise or sadness, the model correctly classified them with near-perfect accuracy. When making expressions that were less exaggerated, the model was not able to classify the images very well, as one might expect. This is because the key features that differentiate the classes are not readily apparent, so the model cannot predict as well. One class on which the model qualitatively performs very poorly is happiness. In our numerous attempts to elicit a prediction of happiness from our model, we nearly always failed. Interestingly, when a friend of ours attempted to do the same, she was able to consistently get predictions of happiness when we couldn't. We are unsure why this occurs, but it could be caused by an underlying artifact of our data set, such as a woman's facial features being more highly associated with a happy expression.

In an attempt to quantify our relatively qualitative results, we tried to classify 50 live-stream images. Of those we sent, the network correctly classified 28, typically those with the most exaggerated expressions. As previously discussed, surprise and sadness performed the best, while happiness performed the worst. In addition to looking at the predicted class, we also analyzed the class scores output for incorrectly classified images. In comparing the class scores produced by our earlier models to those produced by our final model, we noted a respectable increase in the score for the correct class. In every case we examined, the final model produced class scores such that the correct class was either the second or third maximal score, while our previous models had no such guarantee. This shows that even though we couldn't correctly classify these images, our predictions were at least closer to being correct. In summary, we were able to achieve improved results on our live-streamed images, but not nearly as well as our quantitative results would indicate.

5 Conclusion

In this paper, a CNN-based emotion detection model is proposed that utilizes facial-detection software and cloud computing to accomplish its task. The final model resulted in accuracies comparable to the state-of-the-art papers in the field, reaching as high as 98.5% accuracy on our custom data set and 97.2% on the original CK+ data
set. Our code base can be found at https://github.com/barisakis/cs231a_eai. In addition, our model also exhibits more balanced accuracy results across the emotion spectrum. Lastly, the proposed model still worked significantly well with non-actor subjects, especially for physically expressive emotions like sadness, happiness and surprise.

One future area of work is to create a user interface where users can iteratively train the model by correcting false labels. This way the model can also learn more from real-world users who express various emotions in different ways. In addition, including a layer in the network that accounts for class imbalance could provide additional improvements over our results. We attempted to implement the latter of these, but were unable to get it working.

Another area of interest to explore is predicting, on a continuous scale, the intensity of the emotions being portrayed. We believe that we already have a reliable recognition algorithm, so by incorporating knowledge from the social sciences on emotion, a more powerful predictor could be built. In order to develop and train such a predictor, however, one would need an annotated data set with which to work. One possible means for creating such a data set would be to aggregate people's opinions of an image using a service like Amazon Mechanical Turk, and then average the responses together to produce an intensity measure for each emotion.

Even though many state-of-the-art algorithms are very good at detecting facial expressions, the images used are often exaggerated and unrealistic. As such, we believe there is a need for better data sets aimed at understanding 'real' emotions.
End-to-end learning of motion, appearance and interaction cues for multi-target
tracking
...target appearance.

2.1. Appearance model

Technically, the appearance model is closely related to visual representation features of objects. Depending on how precise and rich the visual features are, they are grouped into three sets: single cue, multiple cues, and deep cues. Because of their efficiency and simplicity, single-cue appearance models are widely used in MOT. Many single-cue models are based on raw pixel template representations for simplicity [25, 2, 22, 19], while the color histogram is the most popular representation for appearance modeling in MOT approaches [4, 11, 28]. Other single-cue approaches use covariance matrix representations, pixel comparison representations, or SIFT-like features. The multi-cue approaches combine different kinds of cues to make a more robust appearance model. The final appearance cue used in tracking is the deep visual representation of objects. These high-level features are extracted by deep neural networks, mostly convolutional neural networks trained for a specific task [7]. Our model shares some characteristics with [7], but differs in two crucial ways: first, we are learning to handle occlusion and solve the re-identification task in addition to the bounding box regression performed in [7]; we output both the similarity score (same object or not) and the bounding box. Second, there are differences in the overall architecture, e.g. the number of fully connected layers on top of the two networks for fusing, the loss function, and the inputs and outputs, so the training and testing procedures are different, since we want to address re-identification as well as bounding box regression to help tracking.

2.2. Motion model

The object motion model describes how an object moves. The motion cue is very important for multiple object tracking, since knowing the potential position of objects in future frames reduces the search space and helps the appearance model better detect similar objects. Popular motion models used in multiple object tracking are divided into linear and non-linear motion models. As the name "linear motion" indicates, objects following the linear motion model move with constant velocity. This simple motion model is the most popular model in MOT [3]. There are many cases that linear motion models cannot deal with; in these cases non-linear motion models are proposed to produce a more accurate motion model for objects [27]. We present a new Long Short-Term Memory (LSTM) model which jointly reasons about the past movements of an object and predicts the future trajectory of that object [1].

3. Multi Object Tracking Framework

As shown in Figure 1, MOT involves three primary components. Our model includes modeling of appearance, motion, and interaction. These components are described in more detail below.

Figure 1. MOT components: Appearance, Motion, Interaction.

3.1. Appearance

In this section, we describe the appearance model that we integrate into our framework for multi-object tracking. As we recall, our problem is fundamentally based on addressing the challenge of data association: that is, given a set of targets T_t at time step t, and a set of candidate detections D_{t+1} at timestep t + 1, we would like to compute all of the valid pairings that exist between members of T_t and D_{t+1}.

The idea underlying our appearance model is that we can compute the similarity score between a target and a candidate detection based on purely visual cues. More specifically, we can treat this problem as a specific instance of re-identification, where the goal is to take pairs of bounding boxes and determine whether their content corresponds to the same person. We thus desire our appearance model to recognize the subtle similarities between input pairs, as well as be robust to occlusions and other visual disturbances.

To approach this problem, we construct a Siamese Convolutional Neural Network (CNN), whose structure is depicted in Figure 2. Let BB_i and BB_j represent the two bounding boxes we wish to compare; in our case, BB_i might be a target bounding box at frame t, and BB_j would be a candidate detection at frame t + 1. We first crop the images containing BB_i and BB_j to contain only the bounding boxes themselves, while also ensuring that we include some amount of the surrounding image context. The network then accepts the raw content within each bounding box and passes it through its layers until it finally produces a 500-dimensional feature vector for each of the two inputs.

Let phi_i and phi_j thus be the final hidden activations extracted by our network for bounding boxes BB_i and BB_j. In order to compute the similarity, we then simply concatenate the two vectors to get a 1000-dimensional vector phi = phi_i || phi_j, and pass this as input to a final fully-connected layer. We lastly apply a Softmax classifier, which outputs the probabilities for the positive and negative classes, where
positive indicates that the inputs match, and negative indicates otherwise.

Figure 2. Our appearance model.

The actual network structure we use for this challenge consists of the 16-layer VGG net, which won the ImageNet 2014 localization challenge. In our case, we begin with the pre-trained weights of this network, but remove the last fully-connected layer so that the network outputs a 500-dimensional vector.
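The sketch below illustrates this Siamese arrangement; PyTorch is used as a stand-in for the original implementation, and the layer sizes follow the text.

    import torch
    import torch.nn as nn
    from torchvision.models import vgg16

    class AppearanceSimilarity(nn.Module):
        def __init__(self):
            super().__init__()
            trunk = vgg16(pretrained=True)
            # replace the last fully-connected layer so the trunk emits 500-d features
            trunk.classifier[-1] = nn.Linear(4096, 500)
            self.trunk = trunk                    # shared weights for both inputs
            self.head = nn.Linear(1000, 2)        # same-person / different-person logits

        def forward(self, crop_i, crop_j):
            phi_i = self.trunk(crop_i)            # 500-d feature for bounding box i
            phi_j = self.trunk(crop_j)            # 500-d feature for bounding box j
            phi = torch.cat([phi_i, phi_j], dim=1)
            return torch.softmax(self.head(phi), dim=1)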
We then fine-tune this network by training the overall network on positive and negative samples extracted from our training sequences. For positive pairs, we use instances of the same target that occur in different frames. For negative examples, we use pairs of different targets that may span across all frames.

We trained this model on the MOT3D dataset, which contains 2 scenes with more than 950 frames and more than 5500 objects. We extracted more than 100k positive and negative samples, trained on one scene, and validated on the other. The result was 84 percent accuracy on the binary classification problem of positive/negative pairs. We used the CUHK03 dataset [13] as a sanity check for our predictions. This dataset contains 13164 images of 1360 pedestrians and contains 150k pairs. The FPNN method, which achieved rank 1 in identification MAP rate, obtained 19.89 percent accuracy. Our method achieves 18.61 percent accuracy and outperforms several other methods such as LDM, KISSME and SDALF.
3.2. Motion

The second component of our overall framework is the inclusion of an independent motion prior for each target. The intuition is that the previous movements of a particular target can strongly influence what position the target is likely to occupy at a future time frame.

Additionally, a nuanced motion prior can help our model when tracking objects that are occluded or lost, since it provides a heuristic as to where these objects might generally be located. Thus, formulating a sophisticated model for the motion prior of a target is valuable for achieving robust performance during tracking.

We can therefore use this information to aid us in the task of data association, in which we can match members of T_t and D_{t+1} based on which detections are closest to the motion prior's next predicted location for each target.

To incorporate this information, we construct a Long Short-Term Memory (LSTM) network over the 3D velocities of each target. More concretely, let (x_0^i, y_0^i, z_0^i), (x_1^i, y_1^i, z_1^i), ..., (x_t^i, y_t^i, z_t^i) represent the 3D trajectory of the i-th target from timestep 0 through timestep t. Given a point (x_{t+1}^i, y_{t+1}^i, z_{t+1}^i), we want to determine whether this point belongs to the trajectory of the i-th target. Let us define the velocity of target i at the j-th timestep to be v_j^i = (vx_j^i, vy_j^i, vz_j^i) = (x_j^i - x_{j-1}^i, y_j^i - y_{j-1}^i, z_j^i - z_{j-1}^i). The decision can be made by assigning a score to the candidate point and checking whether it is large enough. For this purpose, we train our LSTM to accept as inputs the velocities of a single target for timesteps 1, ..., t and to produce H-dimensional outputs. We also pass the t + 1 velocity vector (which we wish to classify as belonging to a true trajectory or not) through a fully-connected layer that brings it to the H-dimensional vector space. The last LSTM output is then concatenated with this vector, and the result is passed to another fully-connected layer which brings the 2H-dimensional vector to a space of k features. Finally, another fully-connected layer reduces the dimension to 2, which is used for the 0/1 classification problem during training.

Figure 3. Our 3D motion prior model.

Note that training occurs from scratch, and weights are shared across all targets. Once we train the network, then given a query target i at timestep t_0, the LSTM will output a predicted velocity v_{t_0+1}^i. We can then simply add this velocity to the query target's position at t_0 in order to compute the motion prior's predicted position for frame t_0 + 1. That is,
(x_{t_0+1}^i, y_{t_0+1}^i, z_{t_0+1}^i) = (x_{t_0}^i + vx_{t_0+1}^i, y_{t_0}^i + vy_{t_0+1}^i, z_{t_0}^i + vz_{t_0+1}^i)

We therefore obtain the predicted position from the motion prior, and can use it to filter out candidate detections that are not sufficiently close to the prior.

For training this model, we used the MOT3D dataset, which only consists of true trajectories. We considered trajectories of length t + 1 = 7 and we assumed H = 128. For each true trajectory, we replaced its last element with a randomly chosen object among all other objects that exist in the same frame. By doing this we obtained the same number of invalid trajectories as valid trajectories (it is not good to have unbalanced distributions for training). After training this model, we were able to achieve an accuracy of 95 percent on the 0/1 classification problem.
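The sketch below shows one way to realize this motion-prior classifier; PyTorch is used as a stand-in, H = 128 and the trajectory length follow the text, and k is an illustrative choice.

    import torch
    import torch.nn as nn

    class MotionPrior(nn.Module):
        def __init__(self, H=128, k=64):
            super().__init__()
            self.lstm = nn.LSTM(input_size=3, hidden_size=H, batch_first=True)
            self.embed_candidate = nn.Linear(3, H)   # lifts the t+1 velocity to H dims
            self.fc1 = nn.Linear(2 * H, k)
            self.fc2 = nn.Linear(k, 2)               # valid / invalid trajectory logits

        def forward(self, past_velocities, candidate_velocity):
            # past_velocities: (batch, t, 3); candidate_velocity: (batch, 3)
            out, _ = self.lstm(past_velocities)
            last = out[:, -1, :]                     # last LSTM output
            cand = self.embed_candidate(candidate_velocity)
            h = torch.relu(self.fc1(torch.cat([last, cand], dim=1)))
            return self.fc2(h)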
3.3. Integration

Given these three components of our framework for multi-object tracking, we now describe the method by which we integrate these parts into a coherent system. To recall, we have identified appearance cues, motion priors, and interactive forces as critical parts of the MOT problem. We believe a sophisticated framework should merge these pieces together in an elegant way. The graphical model of our approach is shown in Figure 4. Each human has an appearance edge and a motion edge, and between every pair of humans there is an interaction edge.

Figure 4. The graphical model of our approach.

Our overarching model is a Long Short-Term Memory network which we construct over the already pre-trained appearance, motion, and interaction modules. This LSTM is trained to perform the task of data association: once again, suppose we are at timestep t and wish to determine whether target i is matched to a detection d found in timestep t + 1. We then train the LSTM to output the probabilities of whether the target and detection correspond to the same object.

The inputs to the LSTM are feature vectors that we extract from our individual models. Let phi_A represent the hidden activations extracted from our appearance model before the final fully connected layer of the network, where we input the bounding boxes surrounding target i and detection d. Let phi_{M_j} be the hidden state of the motion prior LSTM extracted at timestep j, and likewise let phi_{I_j} be the hidden feature vector of the interaction model extracted at timestep j. Then, the input to our integrator is given by

phi_j = phi_A || phi_{M_j} || phi_{I_j}

where we thus concatenate the individual feature vectors output by the modules. Therefore, when we set up the model, we use these features as inputs to the LSTM and train it to output either a positive or negative label for each timestep (indicating whether there is a valid match) using a standard Softmax classifier and cross-entropy loss.

An important point to note is that we train this LSTM without fine-tuning the weights of the individual components of the framework, which are each in fact trained separately. The overall model, composed of the previous components, is illustrated in Figure 5; the output of the model is a similarity score which is used as a weight for the edges of the matching graph when matching detections between time frames.

Figure 5. Our overall model.
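A compact sketch of this integrator is shown below; PyTorch is used as a stand-in and the feature sizes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class Integrator(nn.Module):
        def __init__(self, d_app=500, d_motion=128, d_inter=128, hidden=256):
            super().__init__()
            self.lstm = nn.LSTM(d_app + d_motion + d_inter, hidden, batch_first=True)
            self.classifier = nn.Linear(hidden, 2)   # match / no-match per timestep

        def forward(self, phi_app, phi_motion, phi_inter):
            # phi_app: (batch, d_app) is shared across timesteps; the motion and
            # interaction features are per-timestep: (batch, timesteps, d_*)
            T = phi_motion.size(1)
            phi_a = phi_app.unsqueeze(1).expand(-1, T, -1)
            phi = torch.cat([phi_a, phi_motion, phi_inter], dim=2)
            out, _ = self.lstm(phi)
            return self.classifier(out)              # trained with cross-entropy loss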
4. Experiments

In this section, we describe our various experiments and results, and then perform a qualitative analysis of the model's performance.

4.1. Baselines

We first discuss the various baselines that we use to establish a standard for comparison against our more nuanced model.
• Markov Decision Process Tracker: In [23], the authors demonstrated success on 2D multi-object tracking by formulating the tracking problem as a Markov Decision Process (MDP). They represented every target as being in an active, tracked, lost, or inactive state, and learned the appropriate transition probabilities and rewards based on extracted features. In order to evaluate this method on the 3D challenge, we project the bottom-midpoint of the predicted 2D bounding boxes to the ground plane (using the calibration parameters provided with the data sequences).
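A minimal sketch of this projection step is shown below; it assumes an image-to-ground homography can be derived from the provided calibration.

    import numpy as np

    def project_to_ground(bbox_xyxy, H_ground):
        """bbox_xyxy: (x1, y1, x2, y2) in pixels; H_ground: 3x3 image-to-ground homography."""
        x1, y1, x2, y2 = bbox_xyxy
        foot = np.array([(x1 + x2) / 2.0, y2, 1.0])   # bottom-midpoint in homogeneous coords
        X = H_ground @ foot
        return X[:2] / X[2]                            # 2D position on the ground plane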
...people into the 3D world. It consists of two publicly available datasets: a crowded town center, and the well-known PETS2009 dataset.

4.3. Results

The accuracy and results of each component of our system are described in the corresponding experimental sections. Here we show, in Table 6, the final results of our tracker compared to other baselines on the MOT3D challenge. The last 3 rows are our cross-validation on the MOT challenge training set.

Tracker                       MOTA (H)   MOTP (H)   MT (H)   ML (L)
DBN (state of the art), 1st   51.1       61.0       28.7%    17.9%
KalmanSFM (baseline), 5th     25.0       53.6       6.7%     14.6%
[4] W. Choi and S. Savarese. Multiple target tracking in world coordinate with single, minimally calibrated camera. In Computer Vision–ECCV 2010, pages 553–567. Springer, 2010.
[5] E. Fontaine, A. H. Barr, and J. W. Burdick. Model-based tracking of multiple worms and fish. In ICCV Workshop on Dynamical Vision. Citeseer, 2007.
[6] H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting. In BMVC, volume 1, page 6, 2006.
[7] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 FPS with deep regression networks. CoRR, abs/1604.01802, 2016.
[8] C. Huang, B. Wu, and R. Nevatia. Robust object tracking by hierarchical association of detection responses. In Computer Vision–ECCV 2008, pages 788–801. Springer, 2008.
[9] Z. Khan, T. Balch, and F. Dellaert. An MCMC-based particle filter for tracking multiple interacting targets. In Computer Vision–ECCV 2004, pages 279–290. Springer, 2004.
[10] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler. MOTChallenge 2015: Towards a benchmark for multi-target tracking. arXiv:1504.01942 [cs], Apr. 2015.
[11] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool. Coupled object detection and tracking from static cameras and moving vehicles. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(10):1683–1698, 2008.
[12] K. Li, E. D. Miller, M. Chen, T. Kanade, L. E. Weiss, and P. G. Campbell. Cell population tracking and lineage construction with spatiotemporal context. Medical Image Analysis, 12(5):546–566, 2008.
[13] W. Li, R. Zhao, T. Xiao, and X. Wang. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 152–159, 2014.
[14] C.-W. Lu, C.-Y. Lin, C.-Y. Hsu, M.-F. Weng, L.-W. Kang, and H.-Y. M. Liao. Identification and tracking of players in sport videos. In Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service, pages 113–116. ACM, 2013.
[15] W. Luo, T.-K. Kim, B. Stenger, X. Zhao, and R. Cipolla. Bi-label propagation for generic multiple object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1290–1297, 2014.
[16] E. Meijering, O. Dzyubachyk, I. Smal, and W. A. van Cappellen. Tracking in cell and developmental biology. In Seminars in Cell & Developmental Biology, volume 20, pages 894–902. Elsevier, 2009.
[17] P. Nillius, J. Sullivan, and S. Carlsson. Multi-target tracking - linking identities using bayesian network inference. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2187–2194. IEEE, 2006.
[18] K. Okuma, A. Taleghani, N. De Freitas, J. J. Little, and D. G. Lowe. A boosted particle filter: Multitarget detection and tracking. In Computer Vision–ECCV 2004, pages 28–39. Springer, 2004.
[19] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In Computer Vision, 2009 IEEE 12th International Conference on, pages 261–268. IEEE, 2009.
[20] C. Spampinato, Y.-H. Chen-Burger, G. Nadarajan, and R. B. Fisher. Detecting, tracking and counting fish in low quality unconstrained underwater videos. VISAPP (2), 2008:514–519, 2008.
[21] C. Spampinato, S. Palazzo, D. Giordano, I. Kavasidis, F.-P. Lin, and Y.-T. Lin. Covariance based fish tracking in real-life underwater environment. In VISAPP (2), pages 409–414, 2012.
[22] Z. Wu, A. Thangali, S. Sclaroff, and M. Betke. Coupling detection and data association for multiple object tracking. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1948–1955. IEEE, 2012.
[23] Y. Xiang, A. Alahi, and S. Savarese. Learning to track: Online multi-object tracking by decision making. In International Conference on Computer Vision (ICCV), pages 4705–4713, 2015.
[24] J. Xing, H. Ai, L. Liu, and S. Lao. Multiple player tracking in sports video: A dual-mode two-way bayesian inference approach with progressive observation modeling. Image Processing, IEEE Transactions on, 20(6):1652–1667, 2011.
[25] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg. Who are you with and where are you going? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1345–1352. IEEE, 2011.
[26] B. Yang, C. Huang, and R. Nevatia. Learning affinities and dependencies for multi-target tracking using a CRF model. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1233–1240. IEEE, 2011.
[27] B. Yang and R. Nevatia. Multi-target tracking by online learning of non-linear motion patterns and robust appearance models. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1918–1925. IEEE, 2012.
[28] A. R. Zamir, A. Dehghan, and M. Shah. GMCP-tracker: Global multi-object tracking using generalized minimum clique graphs. In Computer Vision–ECCV 2012, pages 343–356. Springer, 2012.
Near-Eye Display Gaze Tracking via Convolutional Neural Networks
scene but is forced to focus at a fixed distance. This distance is a function of the distance between the lenses and the display in the head-mounted display, as well as the focal length of the lenses themselves. This forces a mismatch between vergence (able to verge anywhere) and accommodation (only able to accommodate to one distance), known as the vergence-accommodation conflict. When exposed to such a conflict for extended periods of time, users develop symptoms of headache, eye strain, and, in extreme cases, nausea [13]. A system capable of determining the distance to which the user is verged, through gaze tracking, can use either focus-tunable optics or an actuated display to reduce the vergence-accommodation conflict by changing the distance to which users focus, as explained in [7].

2. Related Work

A variety of remote eye gaze tracking (REGT) algorithms have been reported in the literature over the past couple of decades. For our purposes, the general body of knowledge can be divided into two categories: methods which assume a model of the eye and methods which learn an eye tracking model.

2.1. Model Assumed REGT

Methods assuming a model of the eye generally extract features from an image of the eye and map them to a point on the display. This type of work generally either uses intensity images captured from a traditional camera, as in [19], or uses illumination from an infrared light source and captures the eye with an IR camera.

Infrared-Based REGT IR illumination creates two effects on the eye. Firstly, it creates the so-called bright-eye effect, similar to red-eye in photography, which results from the light reflecting off of the retina. The second effect, a glint on the surface of the eye, is caused by light reflecting off the corneal surface, creating a small bright point in the image. This glint is often used as a reference point because, if we assume that the eye is spherical, it does not move as the eye rotates in its socket.

After grabbing an image of the eye, the glint and pupil can be extracted via the image processing algorithms described in [6]. A glint-pupil vector can be calculated and mapped to a 2-D position on the screen via some mapping function. Although many functions have been proposed [20, 19], the most commonly used is the 2nd-order polynomial defined in [10]:

s_x = a_0 + a_1 x + a_2 y + a_3 xy + a_4 x^2 + a_5 y^2
s_y = b_0 + b_1 x + b_2 y + b_3 xy + b_4 x^2 + b_5 y^2

where (s_x, s_y) are the screen coordinates and (x, y) are the pupil-glint vector components. A calibration procedure is performed to estimate the unknown variables a_0, a_1, ..., b_5 via least-squares analysis by asking a user to look at (at least) 9 calibration targets. The accuracy of the best IR-based methods falls somewhere between 1 and 1.5 degrees.
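The least-squares calibration for this polynomial can be written in a few lines; the sketch below builds one design matrix per equation and solves for the a and b coefficient vectors independently.

    import numpy as np

    def fit_polynomial_mapping(pupil_glint, screen_xy):
        """pupil_glint: (N, 2) pupil-glint vectors; screen_xy: (N, 2) target coordinates.
        Requires at least 6 targets per axis (9 or more in practice)."""
        x, y = pupil_glint[:, 0], pupil_glint[:, 1]
        A = np.column_stack([np.ones_like(x), x, y, x * y, x**2, y**2])
        a, *_ = np.linalg.lstsq(A, screen_xy[:, 0], rcond=None)   # a_0 ... a_5
        b, *_ = np.linalg.lstsq(A, screen_xy[:, 1], rcond=None)   # b_0 ... b_5
        return a, b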
Figure 1. Table of existing gaze tracking datasets, which are mostly tailored towards model-assumed methods.

View-Based REGT In view-based REGT, only intensity images from traditional cameras are used, without any additional hardware. These techniques rely more on image processing to extract features from the eyes directly, which can then be mapped to 2D pixel coordinates. Tan et al. [16] use an image as a point in a high-dimensional space and, through an appearance-manifold technique, achieve a reported accuracy of 0.38 degrees. Zhu and Yang [19] proposed a method for feature extraction from intensity images and, using a linear mapping function, achieve a reported accuracy of 1.4 degrees.

2.2. Model Learned REGT

Model-learned REGT techniques use some sort of machine learning algorithm to learn an eye tracking model from training data consisting of input/output pairs, i.e. 2D coordinates of points on the screen and images of the eyes.

In the work by Baluja and Pomerleau [2], an artificial neural network (ANN) was trained to model the relationship between 15x40-pixel images of a single eye and the corresponding 2D coordinates of the observed point on the screen. In their calibration procedure the user was told to look at a cursor moving on the screen along a path made of two thousand positions. They reported a best accuracy of 1.5 degrees.

Similar work by Xu et al. [12] was presented, but instead of using raw image values as inputs, they segmented out the eye and performed histogram equalization in order to boost the contrast between eye features. They used three thousand points for calibration and reported an accuracy of around 1.5 degrees.

2.3. Our Approach

A key motivation behind this work is that gaze tracking systems lack the required robustness for commercial
applications. High variability in the appearance of the pupil and the lack of a clear view cause accuracy issues for detector-based tracking systems. Deep learning performs well on problems with high intra-class variability. The intuition behind this work was to use a CNN model instead of the parametric calibration, since it would likely be more robust to variations in skin/eye color, HMD fixation, etc.

In particular, the key contributions of this work are:

• Introduction and implementation of an end-to-end CNN-based approach to gaze tracking

• Creation of a new gaze tracking dataset with a near-eye camera, covering five different subjects and dense position sampling of the screen based on smooth pursuit

• Performance evaluation of the chosen method, conclusions and suggestions for improvement that can be incorporated in future work

3. Dataset

Many eye-gaze tracking datasets exist [14], as seen in Figure 1; however, they are mostly tailored to model-assumed REGT systems. The majority of the datasets use target-based calibration with large spacing between targets. Neural networks are not able to train on such few, sparsely sampled points and learn a good relationship between image data and pixel coordinates. The few datasets that used continuous targets also allowed continuous head movement, which is not a good representation of a near-eye display, where the display is strapped to the user's head (with the camera rigidly fixed inside of it).

Instead of using one of the above datasets, we decided to create our own dataset suitable for neural network training. Our strongest criterion was to have a large number of calibration points densely sampling the entire screen, with corresponding images of the eye. With such training data, we would expect the CNN to learn the fine differences between points on the screen.

Figure 2. Image of setup, sample captures, and path of the point on the screen. The left image depicts our setup, comprising an LCD monitor, webcam, and chin rest to keep the head roughly stable. The upper right images show sample images of what the webcam captures. The bottom right image displays a part of the path that the moving target follows during its trajectory.

In our setup, as seen in Figure 2, we placed a user with his/her head resting on a chin rest 51 cm away from a 1080p 24-inch monitor. A webcam was placed very close to the chin rest, imitating what a camera placed inside a near-eye display would see.

Asking a user to fixate on a series of targets is infeasible for the large number of calibration points we wanted to collect. Instead we used the fact that humans are able to track moving objects well, up to some angular velocity. This is the notion of smooth pursuit. We moved a point about the screen at 7.5 degrees/s in a winding pattern from left to right, top to bottom, as seen in Figure 2, and were able to collect 7316 calibration points during a single 4-minute sitting. At this angular velocity the eye is able to smoothly track the point as it moves about without saccades (which occur when the eye attempts to 'catch up' with a point that is moving very quickly). We chose this particular angular velocity based on [11], which introduced the concept of pursuit-based calibration and found that points moving between 3.9 and 7.6 degrees/s resulted in the best accuracy.

In order to achieve a smoothly moving point, we displayed four points per angular degree, giving us a total of 7316 points. We captured webcam video frames at approximately 30 fps, which roughly corresponded to one frame per
3
Because our goal was around 1° of accuracy, we binned the calibration points into 1° bins. For example, points displayed between 0.5° and 1.5° would be considered the same as 1°. We found that with a 4x reduction in classes, and a corresponding 4x increase in points per class, the CNN was able to learn better.

The cropped and downsampled captured dataset can be found at [1].

4. CNN Learning Approach

In this work we explore the use of Convolutional Neural Networks (CNNs) for gaze tracking in an end-to-end fashion. Traditional approaches, like the one mentioned in the previous section, rely on hand-engineered feature detectors and then use a parametric model to track the gaze direction of a user.

Recently, CNNs have outperformed traditional feature-engineering-based computer vision methods in a variety of tasks. This work explores their use for gaze tracking. The key benefit is that this approach is fully data driven. We train the CNN model to take images of the user's eye (taken from a camera very close to their face) as input and estimate the gaze direction in terms of x and y pixel coordinates on the screen.

Figure 3. LeNet Architecture

5. CNN Implementation

5.1. Caffe

The CNN implementation was performed in both Caffe and TensorFlow for comparison. Caffe has LeNet implemented in its model repository for MNIST digit classification. The .prototxt file was reconfigured to point to the lmdb file for our dataset. The image files and a binary with class labels are converted to the Caffe-compatible lmdb format, and then training is performed. The figure below shows the setup workflow for the Caffe implementation.
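The report's pipeline uses Caffe's stock LeNet prototxt (and a TensorFlow port). Purely as an illustration of the kind of LeNet-style classifier involved, the sketch below builds a comparable model with tf.keras; the layer sizes, input resolution, and optimizer are assumptions, not values from the report.

import tensorflow as tf

def lenet_style_gaze_classifier(num_classes, input_shape=(32, 32, 1)):
    """LeNet-style CNN mapping a cropped eye image to one of the binned
    screen positions. All hyperparameters here are placeholders."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(20, 5, activation='relu'),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(50, 5, activation='relu'),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(500, activation='relu'),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model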
6. Data Organization

Due to the large size of the data captured and the limited computational power available, a few steps were taken to
Figure 6. Augmented Images
Figure 7. CAVE Dataset Downsampled Images
Figure 11. Correctly and Incorrectly Classified Images from CAVE Dataset
Figure 13. Captured Average Angular Error
References

[1] Captured Dataset. https://www.dropbox.com/s/mxbis1osiedclkd/data.zip?dl=0.
[2] S. Baluja and D. Pomerleau. Non-intrusive gaze tracking using artificial neural networks. Technical report, Pittsburgh, PA, USA, 1994.
[3] B. A. Barsky and T. J. Kosloff. Algorithms for rendering depth of field effects in computer graphics. pages 999–1010, 2008.
[4] S. Hillaire, A. Lecuyer, R. Cozot, and G. Casiez. Using an eye-tracking system to improve camera motions and depth-of-field blur effects in virtual environments. In Proc. IEEE VR, pages 47–50, 2008.
[5] S. Hillaire, A. Lecuyer, R. Cozot, and G. Casiez. Depth-of-field blur effects for first-person navigation in virtual environments. IEEE Computer Graphics and Applications, 28(6):47–55, Nov 2008.
[6] T. E. Hutchinson, K. P. White, W. N. Martin, K. C. Reichert, and L. A. Frey. Human-computer interaction using eye-gaze input. IEEE Transactions on Systems, Man, and Cybernetics, 19(6):1527–1534, Nov 1989.
[7] R. Konrad, E. Cooper, and G. Wetzstein. Novel optical configurations for virtual reality: Evaluating user preference and performance with focus-tunable and monovision near-eye displays. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 16), 2016.
[8] G. Maiello, M. Chessa, F. Solari, and P. J. Bex. Simulated disparity and peripheral blur interact during binocular fusion. Journal of Vision, 14(8), 2014.
[9] M. Mauderer, S. Conte, M. A. Nacenta, and D. Vishwanath. Depth perception with gaze-contingent depth of field. ACM SIGCHI, 2014.
[10] C. H. Morimoto and M. R. M. Mimica. Eye gaze tracking techniques for interactive applications. Comput. Vis. Image Underst., 98(1):4–24, Apr. 2005.
[11] K. Pfeuffer, M. Vidal, J. Turner, A. Bulling, and H. Gellersen. Pursuit calibration: Making gaze calibration less tedious and more flexible. In Proceedings of the 26th Annual ACM Symposium on User Interface Software and Technology, UIST '13, pages 261–270, New York, NY, USA, 2013. ACM.
[12] L.-Q. Xu, D. Machin, P. Sheppard, M. Heath, and I. I. Re. A novel approach to real-time non-intrusive gaze finding, 1998.
[13] T. Shibata, T. Kawai, K. Ohta, M. Otsuki, N. Miyake, Y. Yoshihara, and T. Iwasaki. Stereoscopic 3-D display with optical correction for the reduction of the discrepancy between accommodation and convergence. SID, 13(8):665–671, 2005.
[14] B. Smith, Q. Yin, S. Feiner, and S. Nayar. Gaze Locking: Passive Eye Contact Detection for Human-Object Interaction. In ACM Symposium on User Interface Software and Technology (UIST), pages 271–280, Oct 2013.
[15] B. Smith, Q. Yin, S. Feiner, and S. Nayar. Gaze Locking: Passive Eye Contact Detection for Human-Object Interaction. In ACM Symposium on User Interface Software and Technology (UIST), pages 271–280, Oct 2013.
[16] K.-H. Tan, D. J. Kriegman, and N. Ahuja. Appearance-based eye gaze estimation. In Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision, WACV '02, pages 191–, Washington, DC, USA, 2002. IEEE Computer Society.
[17] S. Xu, X. Mei, W. Dong, X. Sun, X. Shen, and X. Zhang. Depth of field rendering via adaptive recursive filtering. In SIGGRAPH Asia 2014 Technical Briefs, SA '14, pages 16:1–16:4, New York, NY, USA, 2014. ACM.
[18] T. Zhou, J. X. Chen, and M. Pullen. Accurate depth of field simulation in real time. Computer Graphics Forum, 26(1):15–23, 2007.
[19] J. Zhu and J. Yang. Subpixel eye gaze tracking. pages 124–129, May 2002.
[20] Z. Zhu and Q. Ji. Novel eye gaze tracking techniques under natural head movement. IEEE Transactions on Biomedical Engineering, 54(12):2246–2260, Dec 2007.
Fine-grained Flower Classification
Leo Lou
qibinlou@stanford.edu
4.1. Features

Color: Colour is described by taking the HSV values of the pixels. The HSV space is chosen because it is less sensitive to variations in illumination and should cope better with pictures of flowers taken in different weather conditions and at different times of the day. The HSV values for each pixel in an image are clustered using k-means. Given a set of cluster centres (visual words) w_i^c, i = 1, 2, ..., V_c, each pixel in the image I is then assigned to the nearest cluster centre, and the frequency of assignments is recorded in a V_c-dimensional normalized frequency histogram n(w_c | I).
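A minimal sketch of this color bag-of-words step is shown below, assuming OpenCV and scikit-learn are available; V_c is the vocabulary size and the helper names are our own.

import cv2
import numpy as np
from sklearn.cluster import KMeans

def fit_hsv_vocabulary(training_images_bgr, Vc=100):
    """Cluster HSV pixel values pooled from training images into Vc visual words."""
    pixels = np.vstack([
        cv2.cvtColor(img, cv2.COLOR_BGR2HSV).reshape(-1, 3)
        for img in training_images_bgr
    ]).astype(np.float32)
    return KMeans(n_clusters=Vc, n_init=10).fit(pixels)

def hsv_bow_histogram(image_bgr, kmeans):
    """Assign every pixel to its nearest visual word and return the
    normalized frequency histogram n(w_c | I)."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).reshape(-1, 3).astype(np.float32)
    words = kmeans.predict(hsv)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(np.float64)
    return hist / hist.sum()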
Figure 3. The distribution of numbers of images over 102 classes

Table 1. Classification performance on the test set. mAP refers to classification performance averaged over all classes (not over all images).

Features                  mAP
HSV                       42.3%
HOG                       49.1%
SIFT                      53.0%
CNN w/o segmentation      73.9%
CNN w/ segmentation       54.1%
HSV+HOG+SIFT+CNN          84.0%
5.2. Setup

Our experiments are based on a Python environment and various Python packages (numpy, scipy, scikit-learn, etc.). The complete package list can be obtained from the import sections of our scripts. It is worth mentioning that we use the OpenCV 2.4.8 [1] Python package to process our images and extract different features. Due to some unknown issues, our OpenCV version does not support extracting SIFT features; instead, we use the open-sourced Octave and VLFeat [9] tools.

Method               mAP
Nilsback et al. [6]  76.3%
Anelia et al. [3]    80.66%
Ours                 84.0%
Ali et al. [7]       86.8%

Table 2 shows the comparison between our work and others. Note that in [7], the best performance is achieved by further augmenting the training set by adding
References

[1] OpenCV. https://github.com/itseez/opencv.
[2] Overfeat. https://github.com/sermanet/OverFeat.
[3] A. Angelova, S. Zhu, and Y. Lin. Image segmentation for large-scale subcategory flower recognition. In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 39–45. IEEE, 2013.
[4] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[5] M.-E. Nilsback and A. Zisserman. Delving into the whorl of flower segmentation. In BMVC, pages 1–10, 2007.
[6] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing, 2008. ICVGIP'08. Sixth Indian Conference on, pages 722–729. IEEE, 2008.
[7] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.
[8] N. Sünderhauf, C. McCool, B. Upcroft, and T. Perez. Fine-grained plant classification using convolutional neural networks for feature extraction. In CLEF (Working Notes), pages 756–762, 2014.
[9] A. Vedaldi and B. Fulkerson. VLFeat: An Open and Portable Library of Computer Vision Algorithms, 2008.
Gradient-learned Models for Stereo Matching
CS231A Project Final Report
Leonid Keselman
Stanford University
leonidk@cs.stanford.edu
Abstract
[Pipeline overview figure: Region Matching, Neighborhood Aggregation, Propagation, Outlier Removal]
Figure 4. The image on the left is the ground truth for the motorcycle scene in Figure 1. The image on the right shows the result of our semi-global matching pipeline with naive hole filling, as described in Section 3.1. For visual comparison, occluded and missing ground-truth pixels are masked out in both images.
…pixels in a given scanline. This cost accumulation is the primary computational bottleneck in the system, so parallelizing that component is enough to provide sufficient scaling across processor cores.

3.1.1 Cost Computation

As a baseline method of cost computation, we have implemented both the standard sum of absolute differences and the robust Census metric [20]. Census was recently tested and shown to be the best-performing stereo cost metric [8]. The weighted sum of absolute differences and Census was additionally state-of-the-art for Middlebury until a year or two ago [13]. The current state of the art in this space is the MC-CNN method [21], which uses a CNN to replace traditional cost metrics. However, since our project focuses on implementing neural networks in other parts of the stereo pipeline, re-implementing this cost metric is not a high priority.

Specifically, we implemented Census with 7x7 windows, which allows us to exploit a sparse census transform [5] and fit the result for every pixel into 32 bits. This enables efficient performance with the use of a single popcnt instruction on modern machines.
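To make the cost computation concrete, here is a small numpy sketch of a dense 7x7 census transform and its Hamming-distance matching cost. It is illustrative only: the report's version uses the sparse census of [5] to pack the signature into 32 bits and a hardware popcnt, neither of which is reproduced here.

import numpy as np

def census_transform_7x7(img):
    """Dense 7x7 census transform: each pixel's signature is the bit-string of
    neighbor-vs-center comparisons (48 bits, stored here in a uint64)."""
    H, W = img.shape
    r = 3
    padded = np.pad(img, r, mode='edge')
    census = np.zeros((H, W), dtype=np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[r + dy:r + dy + H, r + dx:r + dx + W]
            census = (census << np.uint64(1)) | (neighbor < img).astype(np.uint64)
    return census

def census_cost(census_left, census_right):
    """Matching cost between two census images = per-pixel Hamming distance."""
    xor = (census_left ^ census_right).ravel()
    return np.array([bin(int(v)).count('1') for v in xor],
                    dtype=np.float32).reshape(census_left.shape)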
3.1.2 Region Selection

For our region selection baseline, we have implemented both box correlation windows and weighting with a non-linear smoothing algorithm such as the bilateral filter [19]. This was inspired by recent unpublished ECCV 16 submissions on the Middlebury leaderboard, which claim to replace the popular cross-based support regions [22] with a smooth affinity mask such as a bilateral filter, as first shown in [10].

3.1.3 Propagation

In order to perform propagation across the image, we have implemented semi-global matching [7], in full, as described in the original paper. We chose to perform 5-path propagation for each pixel, as it represents a row-causal filter on the image, using a pixel's left, right, top, top-left, and top-right neighbors. This produces an answer that satisfies the cost function of Hirschmuller [7]:

E(D) = Σ_p [ C(p, D(p)) + Σ_{q∈N_p} P1·1[|D(p) − D(q)| = 1] + Σ_{q∈N_p} P2·1[|D(p) − D(q)| > 1] ]    (1)

Additionally, we added naive hole filling by propagating pixels from left to right in order to fill occluded regions. This is a naive heuristic, but it is a large part of the hole filling used in the state-of-the-art work [21].
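As a sketch of what one propagation pass does under the energy in equation (1), the numpy snippet below aggregates costs along a single left-to-right path; the full implementation uses five such paths and sums them. The penalty values P1 and P2 are illustrative defaults, not the ones used in the report.

import numpy as np

def sgm_aggregate_left_to_right(cost, P1=10.0, P2=120.0):
    """One-path semi-global matching aggregation. `cost` has shape (H, W, D):
    per-pixel matching cost for each of D disparity candidates."""
    H, W, D = cost.shape
    agg = np.empty((H, W, D), dtype=np.float64)
    agg[:, 0, :] = cost[:, 0, :]
    for x in range(1, W):
        prev = agg[:, x - 1, :]                         # (H, D)
        best_prev = prev.min(axis=1, keepdims=True)     # best cost at previous pixel
        same = prev                                     # same disparity: no penalty
        up = np.roll(prev, 1, axis=1) + P1              # disparity change of +/-1: P1
        down = np.roll(prev, -1, axis=1) + P1
        up[:, 0] = np.inf                               # invalidate wrap-around entries
        down[:, -1] = np.inf
        jump = best_prev + P2                           # larger disparity jumps: P2
        smooth = np.minimum(np.minimum(same, up), np.minimum(down, jump))
        # Subtracting best_prev keeps the accumulated values bounded (standard SGM trick).
        agg[:, x, :] = cost[:, x, :] + smooth - best_prev
    return agg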
4. Learning Propagation

Most papers on the KITTI dataset build on top of the successful method of semi-global matching [7], which is an algorithm for propagating successful matches from pixels to their neighbors. The goal of this part of the project was to replace this function with either a standard neural network or a recurrent neural network. Depending on one's perspective on what operation semiglobal matching is performing, there is a wide array of neural network architectures that may be amenable to replacing it. An overview of the formulations is shown in Figure 5.

The first and most straightforward view is that the energy function, as stated in equation 1, regularizes a single pixel's correlation curve into a more intelligent one. This view is fairly simple and does not incorporate any neighborhood information, but in our testing it was the most successful model. This is elaborated in Section 4.1.

A second view of what semiglobal matching does in practice is that it regularizes an entire scanline at a time, performing scanline optimization and producing a robust match for an entire set of correlation curves at once. This was the view we took when building the models in Section 4.2.

A third view is that semiglobal matching serves as a way of remembering good matches and propagating their information to their neighbors. This is straightforward and almost certainly what semiglobal matching does. It would require a pixel recurrent neural network such as that in [14]. In our limited time and testing, we were unable to get any of these architectures to converge and hence have excluded them from this paper. Our primary focus here was on building a bidirectional RNN with GRU [3] activations; in practice, small pixel patches did not converge, while large patches were not able to fit into the memory of the machines we had available for training.

For testing and training, we gather a subset of the Middlebury images [18] and split them into random training and testing sets with an 80%-20% split. The unseen samples are then used for evaluation. Middlebury provides 15 images for training and 15 for evaluation. For the classifiers in Section 4.1, this results in roughly 500,000 annotations per image (using quarter-sized images) and 500,000 tests of the network, while for the classifiers in Section 4.2, this results in 500,000 annotations computed over about 1,000 runs of the network (since it computes 500 outputs at the same time). See below for details of how this is implemented.

Figure 5. An example of two different ways to formulate semi-global matching as a classification task. On the left (one-dimensional smoothing, Section 4.1), a classifier maps the raw costs per disparity (70d) to a probability of the correct disparity (70d). On the right (two-dimensional smoothing, Section 4.2), a classifier maps a disparity image (750x70)d to column-wise probabilities (750x70)d.

4.1. 1D Smoothing

One straightforward view of semi-global matching is simply as a regularization function on top of a pixel's correlation curve. A correlation curve is the set of matching costs for a single pixel and its candidates. If this input is negated and fed to a softmax activation function, as used to train many neural networks, it treats the values as unnormalized log probabilities and selects the maximum (which would be the candidate with the lowest matching cost):

Li = −log( e^{f_{y_i}} / Σ_j e^{f_j} )

Our original implementations for this method were all straightforward multi-layer perceptrons (MLPs), using a one-, two-, or three-layer neural network to produce a smarter minimum-selection algorithm. However, no matter the loss function, shape, dimensions, regularization, or initialization function, we were unable to get any MLP to converge. That is, using a 0-layer neural network (the input itself) was better than any learned transformation of that shape and size.

Instead, we found success by using a one-dimensional convolutional neural network, as shown in Figure 6. We suspect a CNN was able to handle this task better because one bank of convolutions could learn an identity transform, while others could learn feature detectors that incorporated interesting feedback into that identity transform. In comparison, a randomly initialized fully connected network may struggle to learn a largely identity transform with minor modifications. We implemented the neural network on top of Keras [4] and TensorFlow [1]. We additionally learned several non-gradient-based classifier baselines, such as SVMs and random forests, using scikit-learn [17].

4.2. 2D Smoothing

As shown in Figure 5, there is an alternative concept of what semiglobal propagation does. This one incorporates pixel neighborhoods and seems a more natural fit for the energy function presented in equation 1. For this formulation, the correlation curves of an entire scanline are reshaped into an image in disparity-cost space; the notation is described in Figure 7. We then create a model using a two-dimensional convolutional neural network [12] on top of these disparity-cost-space images. The top level is a column-wise softmax classifier of the same size as the input dimensions. In order to implement this in TensorFlow [1], we first pass in a single disparity image as a single batch, run our convolutional architecture over it, and then reshape the output into pixel-many "batches", for each of which we have a label. This allows the built-in softmax and cross-entropy loss formulations to work out-of-the-box with no hand-made loops.
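Below is a rough tf.keras sketch of that scanline formulation: a small 2D CNN over the (width x disparity) cost image followed by a per-pixel (column-wise) softmax, so the standard cross-entropy loss applies without hand-written loops. The scanline width, disparity count, and layer sizes are assumptions for illustration; this is not the report's exact architecture.

import tensorflow as tf

SCANLINE_WIDTH, NUM_DISPARITIES = 750, 70  # assumed from the (750x70)d notation

def build_2d_smoothing_cnn():
    """2D smoothing of a whole scanline: input is one disparity-cost image,
    output is a softmax over disparities for every pixel in the scanline."""
    inp = tf.keras.layers.Input(shape=(SCANLINE_WIDTH, NUM_DISPARITIES, 1))
    x = tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu')(inp)
    x = tf.keras.layers.Conv2D(1, 3, padding='same')(x)
    # Drop the channel axis so each scanline pixel becomes its own "example"
    # with NUM_DISPARITIES logits, then apply a column-wise softmax.
    logits = tf.keras.layers.Reshape((SCANLINE_WIDTH, NUM_DISPARITIES))(x)
    probs = tf.keras.layers.Softmax(axis=-1)(logits)
    model = tf.keras.Model(inp, probs)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return model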
Table 1. A summary table of numerical results on the training Motorcycle image. The error metric is root-mean-squared error in disparity space, and the run-times are on a quad-core i7 desktop. The first three lines are baseline implementations implemented by us, while the last two are standard algorithms available on the dataset website [18]. The MC-CNN results [21] were run on a GPU.

Model           RMS Error   Runtime
Census          28.92       1.2s
SGBM            28.12       3.1s
SGBM + BF       32.8        5.8s
OpenCV SGBM     38.00       0.9s
MC-CNN (acct)   27.5        150s

[Architecture diagram for the 1D smoothing CNN: raw costs per disparity (70d) → 32 9x1 convolutions, stride 5 → 64 5x1 convolutions, stride 1 → 7x1 average pool → 70d linear projection → softmax]
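A hedged tf.keras rendering of the 1D smoothing architecture sketched above is given below; the padding, activations, and optimizer are our assumptions, and the authors' original Keras/TensorFlow implementation may differ in details we do not reproduce.

import tensorflow as tf

NUM_DISPARITIES = 70  # candidate disparities per pixel, as in the diagram above

def build_1d_smoothing_cnn():
    """1D CNN that maps a pixel's raw correlation curve to a probability
    distribution over its disparity candidates."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(NUM_DISPARITIES, 1)),
        tf.keras.layers.Conv1D(32, 9, strides=5, padding='same', activation='relu'),
        tf.keras.layers.Conv1D(64, 5, strides=1, padding='same', activation='relu'),
        tf.keras.layers.AveragePooling1D(pool_size=7),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(NUM_DISPARITIES),  # 70-d linear projection
        tf.keras.layers.Softmax(),
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return model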
[Table 2 — Test Accuracy / RMS Error for the learned models: 91.2, 84.0, 74.4, 70.4 (per-model row labels not recoverable)]
Additionally, as can be seen in Table 2, the 1D CNN model is not yet exhibiting overfitting on out-of-bag samples, and might benefit from additional training time. It can also be seen that our best 2D CNN architecture drastically underperforms even the standard baselines. While there may be some more optimal 2D CNN architecture than the one we tried, our poor initial results made us move towards trying to build an RNN method instead. However, we did not have enough time to finish designing and training our RNN models for replacing semiglobal matching.

Another interesting experimental result is the qualitative performance of the classifier models. As shown in Figure 9, the classification-based models sometimes generate completely erroneous results for parts of the image. Census will sometimes fail to generate a result, and semiglobal matching learns a smooth transformation; in contrast, while the classification models have lower error, they sometimes predict very non-smooth results, as the classifier is run per pixel. This suggests that a classifier that accounts for neighborhood information, such as an RNN, may perform even better. Also, while we did not combine semiglobal matching with our 1D CNN, it is possible to use the normalized probabilities from the neural network together with semiglobal matching to overcome this lack of smoothness and perhaps achieve an even better result.

6. Conclusion

We have presented a new method for taking stereoscopic correlation costs and smoothing them into a more refined estimate. This method is gradient-trainable, and outperforms the semiglobal matching [7] heuristic technique used in state-of-the-art methods such as MC-CNN [21]. This lends support to the hypothesis proposed in the introduction, which is that continuing to replace components of the stereo matching pipeline with machine-learned models is a way to improve their performance. Since the models presented here were trained with ADCensus costs [13] rather than MC-CNN costs [21], and we did not have enough time to train on the full Middlebury dataset [18], we do not present a new state of the art for stereoscopic correspondence. However, we believe these results suggest that one may be possible simply by running the proposed techniques with MC-CNN on the full dataset.

In addition, we have created a new, simple, fast, and cross-platform stereo correspondence implementation. We have shown it to be about as fast as the one in OpenCV, and to produce results that are notably more accurate. We hope this can be used as a base for others to experiment with other stereoscopic correspondence ideas without having to dive into complicated OpenCV SSE code or deal with slow MATLAB implementations.
Figure 10. An example of a pixelwise RNN from [14], a gradient-learned method for propagating information across images.

Figure 11. An example of a spatial transformer for region selection [11], a gradient-learned method for region selection.

7. Next Steps

To continue this theme of research, we wish to explore additional architectures for stereo correspondence algorithms that are trained with error gradients. While the one-dimensional CNN presented here works well, it is not able to capture the neighborhood information that semiglobal matching can. To incorporate neighborhood information, we would like to explore recurrent neural network models, which we began to design but were unable to get running in time for this project submission. By coupling our 1D-CNN architecture with either a spatial transformer network front-end [11] or a recurrent neural network backend [3] [9], we might produce a new state-of-the-art algorithm for the classic stereo problem. Examples of these models are shown in Figures 10 and 11.

8. Code

Code is made available at https://github.com/leonidk/centest. Running the stereo matching algorithm is straightforward and documented in the README, but running the learning algorithms (found in the learning/ folder) varies depending on the method.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] G. Bradski. OpenCV library. Dr. Dobb's Journal of Software Tools, 2000.
[3] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[4] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.
[5] W. S. Fife and J. K. Archibald. Improved census transforms for resource-optimized stereo vision. Circuits and Systems for Video Technology, IEEE Transactions on, 23(1):60–73, 2013.
[6] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[7] H. Hirschmüller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 807–814. IEEE, 2005.
[8] H. Hirschmüller and D. Scharstein. Evaluation of stereo matching costs on images with radiometric differences. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(9):1582–1599, 2009.
[9] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[10] A. Hosni, M. Bleyer, C. Rhemann, M. Gelautz, and C. Rother. Real-time local stereo matching using guided image filtering. In Multimedia and Expo (ICME), 2011 IEEE International Conference on, pages 1–6. IEEE, 2011.
[11] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. CoRR, abs/1506.02025, 2015.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[13] X. Mei, X. Sun, M. Zhou, S. Jiao, H. Wang, and X. Zhang. On building an accurate stereo matching system on graphics hardware. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 467–474. IEEE, 2011.
[14] A. V. D. Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016.
[15] OpenMP Architecture Review Board. OpenMP application program interface version 3.0, May 2008.
[16] M.-G. Park and K.-J. Yoon. Leveraging stereo matching with learning-based confidence measures. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 101–109. IEEE, 2015.
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Ma-
chine learning in Python. Journal of Machine Learning Re-
search, 12:2825–2830, 2011.
[18] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl,
N. Nešić, X. Wang, and P. Westling. High-resolution stereo
datasets with subpixel-accurate ground truth. In Pattern
Recognition, pages 31–42. Springer, 2014.
[19] C. Tomasi and R. Manduchi. Bilateral filtering for gray and
color images. In Computer Vision, 1998. Sixth International
Conference on, pages 839–846. IEEE, 1998.
[20] R. Zabih and J. Woodfill. Non-parametric local transforms
for computing visual correspondence. In Computer Vi-
sionECCV’94, pages 151–158. Springer, 1994.
[21] J. Zbontar and Y. LeCun. Stereo matching by training a con-
volutional neural network to compare image patches. CoRR,
abs/1510.05970, 2015.
[22] K. Zhang, J. Lu, and G. Lafruit. Cross-based local
stereo matching using orthogonal integral images. Circuits
and Systems for Video Technology, IEEE Transactions on,
19(7):1073–1079, 2009.
Human Pose Estimation for Multiple Frames
Abstract

Human pose estimation is a well studied topic in vision. However, most modern techniques for human pose estimation on multiple, consecutive frames, or motion capture, require 3D depth data, which is not always readily available. Prior work using single-view 2D data, on the other hand, has been limited to pose estimation in single frames. This raises some interesting questions. Can human pose estimation in multiple frames be effected using 2D single-frame techniques, thereby discarding the expensive reliance on 3D data? Can these 2D pose estimation models be improved upon by taking advantage of the data similarities across multiple consecutive images? In this paper, we endeavor to answer these questions. We take Yang et al.'s [1] single-frame pose estimation model using a flexible mixture of parts and apply it in a multi-frame context. We demonstrate that we can achieve improvements on the original method by taking advantage of the inherent data similarities between consecutive frames. We achieve speed improvements by restricting Yang et al.'s model to search locally in intermediate frames and, under certain circumstances, accuracy improvements by running a second, corrective pass using SVMs trained for instance recognition.

1. Introduction

Human pose estimation has become an extremely important problem in computer vision. Quality solutions to this problem have the potential to impact many different aspects of vision, such as activity recognition and motion capture. Additionally, success in these aspects can be applied to gaming, human-computer interaction, athletics, communication, and health-care. Despite huge progress in motion capture, as exemplified by the Xbox Kinect, the current solutions used in gaming require extensive hardware, making it impossible for such technology to be used in daily human-computer interactions [2]. We hope to improve motion capture to work with simple RGB single-view cameras, allowing this technology to exist on everyday phones and computers.

State-of-the-art models for human pose estimation that are implemented for single static RGB images also have some minimal but noticeable accuracy shortcomings [1]. Currently, when used on video frame sequences, these models do not utilize the additional information provided by surrounding frames. Operating under the assumption that human poses change minimally between frames, we improve the accuracy of Yang et al.'s [1] efficient and flexible model for human detection and human pose estimation in single static images. We take into account the SIFT features of other frames in the same video clip by training SVMs on these features. We can improve the output of Yang's model by testing the SVMs on parts of the images and adjusting the original body parts to reflect the scores calculated by the trained SVMs. The result is a notable increase in accuracy over the imperfect Yang pose estimation.

After discussing related work and the implications of our method in Section 2, we further describe our process, resulting algorithm, and evaluation process in detail in Section 3. Finally, we analyze our testing data and experimental results for our various methods and hyperparameters in Section 4.

2. Background

2.1. Review of Previous Work

Human pose estimation is a well studied subject, both in video (multiple frames) and in images (single frames). Currently, most modern techniques for pose estimation in video rely on 3D depth data. A well known example of this is the Xbox Kinect [2], which uses pose estimation to determine the gamer's motion. 3D depth data has many advantages over 2D image data, not the least of which is the additional dimension of information. However, 3D data can only be captured using specialized, and often expensive, equipment and is not nearly as ubiquitous as 2D videos.
Recent work in pose estimation on 2D image data features a wide range of techniques and approaches, among them Yang's [1], Agarwal's [3], Dantone's [4], and Toshev's [5]. These methodologies are similar in that they focus on pose estimation on single images. We focus primarily on Yang's [1] method of pose estimation using a flexible mixture of parts. Yang's method has the advantage of producing relatively good results on full-body images across a variety of poses and background contexts, while still retaining a significant speed advantage over certain other approaches, such as Toshev's [5] pose estimation using convolutional neural networks. A relatively fast algorithm is of particular significance when we consider pose estimation in the multi-frame context.

2.2. Our Method

Previous methods for pose estimation in the multi-frame realm rely on 3D depth data. Our method uses only RGB single-view image data to accurately locate 26 different body parts. Additionally, our SVMs are trained specifically on information from a given video clip, resulting in a more accurate classification of small, specific body parts. Because deep learning would not be feasible in this context, as neural networks take too long to train and require an extremely large amount of training data, we believe our method is the best learning-based technique to improve pose estimation in the multi-frame context.

3. Technical Details

3.1. Overview of Methodology

Utilizing the available source code, we improve Yang et al.'s Image Parse model algorithm [1] on a variety of image sequences of human motion gathered from Youtube. As an initial attempt, we implemented a HOG feature search algorithm where we compute HOG features for each frame and find the location of body parts by searching for features similar to those calculated for the body part in the prior frame. We found that although this method dramatically speeds up the process, the results are worsened. Then, we implemented an SVM correction method where we train an SVM for each body part for each video clip. We improve the original Yang output by testing the SVMs on parts of the image and adjusting the Yang output based on the scoring results. Expanding upon this method, we integrated hard negative mining [6] for computing logical negatives for each SVM. Additionally, we added a double pass with another SVM trained to classify a sub-image as a body part or background. Finally, in order to measure the accuracy of our computed bounding boxes, we manually annotate ground truth bounding boxes on the same image sequences.

3.2. Yang Algorithm Speedup

The original implementation of Yang's mixture-of-parts algorithm runs in 30 seconds on a typical clip from our test set (see Section 4.1). Since we are testing on upwards of 2000 images, this is unacceptably slow. Also, in a multi-frame video with multiple people, the highest scoring bounding boxes often migrate from person to person. To remedy these issues, we reduced the space in which the mixture-of-parts algorithm searches for the bounding boxes.

For the first frame of the video clip we run the full Yang algorithm. For the second frame, we crop the image to the bounding box containing the entire person plus a little extra, the size of a body-part bounding box, on the top, bottom, and sides. We then run the full Yang algorithm on the cropped image. We store the pyramid level that is used for the bounding boxes on the second image. For the third frame and all subsequent frames, we crop the image using the same method used to crop the second image, and we search only within the pyramid levels above, at, and below the previously stored pyramid level.

Cropping the image ensures the bounding boxes do not migrate to another person and speeds up the search for the bounding boxes. Reducing the pyramid levels also results in significant speedup. Instead of 30 seconds, the algorithm runs in about 0.1-0.4 seconds per frame. This speedup made our SVM correction method, described in Section 3.4, feasible, because it allowed us to run Yang on all the frames of a given video clip in a reasonable amount of time. This was necessary to obtain enough training data for the SVMs.

3.3. Interpolation with HOGs

The HOG interpolation method relies on the assumption that a person's pose can change only so much between consecutive frames. Therefore, given the bounding boxes for body parts in one frame, we are assured that the associated bounding boxes in the subsequent frame may be found in the same vicinity and would retain similar features.

Our implementation uses Yang's model to select bounding boxes for the first frame of the target sequence. In each subsequent frame, for each bounding box, we run a sliding-window search in the local vicinity of its location in the prior frame to select candidate bounding boxes. We then select the candidate with the closest match in HOG features to the associated bounding box in the prior frame.
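A simplified Python sketch of this search is given below, using scikit-image's HOG descriptor; the window sizes, search radius, and distance metric are placeholders rather than the exact settings used in our experiments.

import numpy as np
from skimage.feature import hog

def best_hog_match(prev_patch, frame, prev_box, search_radius=16, step=4):
    """Slide a window around the previous bounding-box location in a grayscale
    frame and return the candidate whose HOG descriptor is closest to the
    previous patch's descriptor."""
    x, y, w, h = prev_box
    target = hog(prev_patch)
    best_box, best_dist = prev_box, np.inf
    for dy in range(-search_radius, search_radius + 1, step):
        for dx in range(-search_radius, search_radius + 1, step):
            nx, ny = x + dx, y + dy
            if nx < 0 or ny < 0 or nx + w > frame.shape[1] or ny + h > frame.shape[0]:
                continue
            candidate = hog(frame[ny:ny + h, nx:nx + w])
            dist = np.linalg.norm(candidate - target)
            if dist < best_dist:
                best_box, best_dist = (nx, ny, w, h), dist
    return best_box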
By running Yang's relatively expensive procedure only on the first frame, we are able to achieve significant speed improvements over a full run of Yang's model across all frames. However, this methodology has two disadvantages. Firstly, any pose estimation errors made by Yang in the first frame are propagated into the subsequent frames. Secondly, the quality of the interpolation degrades the farther removed we are from the initial frame. The key weakness of interpolation with HOGs is that it takes into account the output of Yang's model for only a single frame. In subsequent investigations, we focus instead on producing accuracy improvements using SVMs trained on the output across all frames.

3.4. SVM Correction

Considering only a single human in each of the image sequences, we notice that various features, such as the color of their skin or clothing, do not change over frames. Using this observation, we train video-clip-specific SVMs to improve the output from Yang's model [1]. From Yang, there are 26 bounding boxes indicating the locations of 26 body parts for each frame. We split up the frames into sub-images defined by each bounding box, as seen in Figure 1, and treat each of these sub-images as training data for the SVMs. Additionally, for each frame, we compute negative examples by randomly selecting bounding boxes within a certain area around the human and then discarding those that overlap with any of the calculated body-part bounding boxes. We then repeat this process until enough negative examples are found (Figure 2).

Figure 2: The process of finding negative examples in each frame. The leftmost image shows the boundary around the person in which random bounding boxes are found. The center image shows these boxes. Then, all the boxes that overlap with any of the body parts are filtered out, and the resulting bounding boxes that will become negative examples are displayed in the rightmost image. This process is repeated until a sufficient number of negative examples is found.

After assigning the closest cluster center to each SIFT feature vector, we create a histogram of this distribution and concatenate all sub-image histograms together to form our Bag of Words. The Bag of Words features are then used to train the 26 SVMs. For a given SVM for body part a, all features for the 25 other body parts and for the negative sub-images are treated as negative examples.

In order to improve the original output from Yang's model [1], we test the SVMs on every 10th frame using a sliding window. As shown in Figure 3, for a given frame and a given body part a, we initialize a score for the SVM associated with a on the original calculation from Yang. Then, we start sliding a window of the same size as the original, computing a score at every position with the SVM for a. The window position with the maximum score becomes the corrected bounding box.
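A minimal scikit-learn sketch of the per-part SVM training described above is given below; the linear kernel, the regularization constant, and the label encoding (background = -1) are our own choices for illustration.

import numpy as np
from sklearn.svm import LinearSVC

def train_part_svms(bow_features, part_labels, num_parts=26):
    """Train one SVM per body part on Bag-of-Words features. For the SVM of
    part `a`, sub-images of the 25 other parts and the background samples
    (labelled -1) all serve as negatives."""
    bow_features = np.asarray(bow_features)
    part_labels = np.asarray(part_labels)
    svms = []
    for part in range(num_parts):
        y = (part_labels == part).astype(int)
        svms.append(LinearSVC(C=1.0).fit(bow_features, y))
    return svms

def score_window(svms, part, window_bow):
    """Signed distance to the decision boundary, used to rank sliding windows."""
    return float(svms[part].decision_function(window_bow.reshape(1, -1))[0])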
3.4.1 Double-Pass SVM

After our initial results, we noticed that if Yang's model mistakenly placed enough bounding boxes on parts of the background, our SVMs would do the same. We improve our method by using an additional, background-distinguishing SVM. We train this SVM on the same feature set as the 26 body-part SVMs, but using as positives all body-part bounding boxes and as negatives all background bounding boxes. During the sliding-window stage, this SVM is used to filter candidate bounding boxes. Only bounding boxes that are classified as non-background are kept and subsequently scored by the corresponding body-part SVM.

3.4.2 Hard Negative Mining

To further improve our method, we take advantage of the hard negative mining method [6]. In this addition to our SVM correction technique, we train our original 26 SVMs without any negative examples aside from other body parts. Then, using these SVMs, we test on the randomly selected negatives collected by our previous method. We do this over a series of iterations where, in each iteration, we collect new negative examples, test these negative examples on all 26 SVMs, take the maximum score, and then keep a maximum of 30 negative examples per video frame that have a positive score. Our iterations stop once we have kept a sufficient number of negative examples. Using this technique, we are able to collect the most "confusing" negatives to train our SVMs on. We then recompute the cluster centers and Bag of Words features including the negative examples and re-train all 26 SVMs. The correction step using the sliding-window technique remains the same.

3.5. Evaluation

To evaluate the performance of our algorithm, we measure how many body parts are correctly localized by comparing the pixel positions of the computed bounding boxes and the manually annotated ground truth bounding boxes. The Image Parse model outputs the four corners of a square bounding box, while the manual annotation only stores the centroid of a bounding box. We measure the intersection over union (IOU) of the computed bounding box and the ground truth. We assume the size of the bounding box for the ground truth is the same as the size of the computed bounding boxes. A bounding box is labeled "correct" if its IOU is above a certain threshold.

To aggregate this data for a single video clip, we count the number of frames in which a body part is correctly localized and divide that by the total number of frames. This number is the average precision (AP) of the algorithm for that body part in that video clip.

To evaluate the performance of our algorithms, we compute an AP vs. overlap threshold curve (AOC), similar to the AP curve described in [8]. A robust algorithm should generate a curve that maintains high AP for all overlap thresholds; however, some drop-off is expected. If there is a drop-off, it should occur at high overlap thresholds.

Different regions of the body have drastically different performance. In general, arms and legs perform more poorly than the head and torso in Yang's algorithm. Therefore, we also look at the average raw IOU for each region of the body for each clip to see if the relative performance between different algorithms depends on the body region. We defined seven regions: head, left torso, left arm, left leg, right torso, right arm, and right leg.
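The evaluation metric can be summarized in a few lines; the sketch below assumes boxes are given as (x, y, w, h) tuples and uses the per-part AP definition above.

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def per_part_ap(pred_boxes, gt_boxes, overlap_threshold=0.5):
    """Fraction of annotated frames in which the part is localized with IOU
    above the threshold; sweeping the threshold yields the AOC curve."""
    correct = sum(iou(p, g) >= overlap_threshold for p, g in zip(pred_boxes, gt_boxes))
    return correct / len(gt_boxes)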
4. Experiments

4.1. Dataset

Yang's model [1] is pre-trained on the Image Parse dataset [9]. For testing, we require a dataset containing human full-body footage, because the model is trained on images containing full-body poses.

To capture a variety of poses, we pulled video footage from Youtube containing varied subject matter [10], [11], [12], [13], such as people walking, dancing, and playing sports. We cut these videos such that each clip contains a single camera view and the full body of the subject. We preprocess the clips to obtain image sequences of the frames. Each frame is downsized using bicubic interpolation to be about 256x256 pixels while maintaining the original aspect ratio. The downsizing is done to match the approximate size of the testing images used in [1].

The ground truths associated with our dataset were made by manually clicking the points of all 26 different body parts for every 10th frame. Each click is the centroid of a bounding box for a given body part. For evaluation, we believe comparing every 10th frame with the ground truth values is sufficient to determine accuracy.
4.2. Results

4.2.1 HOG Interpolation

The HOG interpolation failed to provide accurate bounding boxes for subsequent frames because of drift. Any pose estimation errors made by Yang in the first frame are propagated into the subsequent frames, and the quality of the interpolation degrades the farther removed we are from the initial frame. Figure 4 shows the decrease in average IOU with increasing frame number. In general, the average IOU overlap with the ground truth over all frames in all clips is significantly lower than in the original Yang output (Figure 5).

Figure 4: Average IOU over all video clips. Each thin solid line represents a clip. There are 12 clips ranging from 51 to 121 frames. The black dotted line is the average IOU over all clips.

Figure 5: Average IOU over all clips for each body region of the Yang output (blue) and HOGs interpolation (yellow).

Figure 6: AP vs. overlap threshold curve of the original Yang output (red) and the HOGs interpolation output (blue). Lines with corresponding symbols indicate corresponding clips. For example, the triangle symbol is the Yang and HOG evaluation for Walking Clip 1.

Figure 7: Average IOU in Walking Clip 3 for each body region of the Yang output (blue) and HOGs interpolation (yellow).

All clips performed worse under HOG interpolation except for Walking Clip 3 (the diamond in Figure 6). A histogram of the average IOU for each body region reinforces that finding (Figure 7). This is likely not because the HOGs performed well, but because the Yang output performed poorly for that particular clip. Note that the left arm in Figure 8 is not properly localized by the Yang output, but the HOGs have some overlap with the ground truth. Also note that the right arm has better localization in the HOGs interpolation than in the Yang output.

4.2.2 One-Pass SVM with Randomly Selected Negatives

Our single-pass SVM has a pyramid depth of 5 and 100 cluster centers, because those parameters produced consistently good results. We trained and tested the SVM on 5 clips and found that it improved the performance of two of the clips, decreased performance in two of the clips, and did not change the performance in one of the clips (see Figure 9). Specifically, the SVM improved Beyonce Clip 1 and MLB Clip 1, it made Dog Walking Clip 2 and Walking Clip 1 worse, while Dog Walking Clip 1 remained the same. The improvement in Beyonce Clip 1 was very
[Figure panels: (a) Yang Output, (b) HOGs; (a) Yang Output, (b) Single-Pass SVM]
Figure 12: Example frames from various clips displaying the SVM correction using hard negative mining and a double pass: (a) MLB Clip 1, (b) Dog Walking Clip 1, (c) Beyonce Clip 1, (d) Walking Clip 1. The top row is the original Yang result and the bottom row is the result after our SVM correction.

best with the head, left arm, left leg, and right torso. This indicates that the extra background SVM pass and the hard negative mining did improve the performance of the SVM correction overall, especially since the arms from incorrect Yang outputs tend to include background sub-images.

Figure 12 shows example frames where the double-pass SVM corrects errors in the original Yang output. For example, the right arm for Walking Clip 1 in the Yang output bounds the background, while in the SVM correction it bounds the right arm. For MLB Clip 1 and Beyonce Clip 1 the arms move closer to the body in the SVM correction, except for one bounding box. The solitary bounding box remains far away because the true arm is outside of the search space defined by our correction algorithm. In Beyonce Clip 1 the left and right legs alternate, probably because the SIFT features are very similar between the left and right legs. There is also a right arm bounding box on the left leg, because Beyonce's legs basically look like arms anyway.

5. Future Work

If given the time, we could make several modifications to our SVM. Firstly, we did not tune all of the parameters of the SVM across all of the clips to find the best overall set of parameters. We also noticed that while some values worked well for some clips, they worked less well for others. More investigation in this area could produce interesting insights. Secondly, we could vary the threshold and create another AP curve or ROC curve based on that threshold to determine its effect on the performance of the SVM.

The current implementation of the SVM is impractically slow. The most time is spent computing the Bag of Words feature vectors in various parts of our algorithm, including the hard negative mining loop and the sliding-window correction section. Therefore, we believe this to be the bottleneck of our method. Thus, parallelizing this computation such that all frames, or even all body parts in each frame, are computed in tandem could provide a significant speedup.

6. Conclusion

It is certainly true that human pose estimation is a challenging subject with many avenues of research yet to be explored. We have made a small effort by introducing a method that utilizes the similarities among video frames to improve a single-image pose estimation model when used in a multi-frame context. The improvement was particularly marked on the clips where the original Yang algorithm performed most poorly - and arguably where improvement was most necessary.

More importantly, we have highlighted areas where more research is possible and laid the groundwork for future avenues of investigation.
(CVPR), pages 1385–1392, Washington, DC, USA,
2011. IEEE.
[2] B. Bonnechre, Jansen B., P. Salvia, H. Bouzahouene,
Omelina L., J. Cornelis, M. Rooze, and S. Van
Sint Jan. What are the current limits of the kinect sen-
sor? In 9th International Conf. on Disability, Virtual
Reality and Associated Technologies, pages 287–294,
Laval, France, 2012.
[3] A. Agarwal and B Triggs. 3d human pose from silhou-
ettes by relevance vector regression. In IEEE Conf.
on Computer Vision and Pattern Recognition (CVPR),
volume 2, pages II–882–II–888 Vol.2, June 2004.
[4] M. Dantone, J. Gall, C. Leistner, and L. van Gool. Hu-
man pose estimation using body parts dependent joint
regressors. In IEEE Conf. on Computer Vision and
Pattern Recognition (CVPR), pages 3041–3048, Port-
land, OR, USA, June 2013. IEEE.
[5] A Toshev and C Szegedy. Deeppose: Human
pose estimation via deep neural networks. CoRR,
abs/1312.4659, 2013.
[6] Andrea Vedaldi. Object category detection practical.
http://www.robots.ox.ac.uk/ vgg/practicals/category-
detection.
[7] A. Vedaldi and B. Fulkerson. VLFeat: An open
and portable library of computer vision algorithms.
http://www.vlfeat.org/, 2008.
[8] M. Everingham, L. Van Gool, C. K. I. Williams,
J. Winn, and A. Zisserman. The pascal visual object
classes (voc) challenge. International Journal of Com-
puter Vision, 88(2):303–338, June 2010.
[9] D. Ramanan. Learning to parse images of articulated
bodies. In Advances in Neural Information Processing
Systems 19, Proceedings of the Twentieth Annual Con-
ference on Neural Information Processing Systems,
Vancouver, British Columbia, Canada, December 4-7,
2006, pages 1129–1136, 2006.
[10] beyonceVEVO. Beyonce - sin-
gle ladies (put a ring on it).
https://www.youtube.com/watch?v=4m1EFMoRFvY.
[11] Barcroft TV. Dog whisperer: Trainer
walks pack of dogs without a leash.
https://www.youtube.com/watch?v=Cbtkoo3zAyI.
[12] Cesar Bess. Mlb top plays april 2015.
https://www.youtube.com/watch?v=mpe9w-CHsoE.
[13] BigDawsVlogs. Walking next to people extras.
https://www.youtube.com/watch?v=776niN4-A58.
Indoor Scene Segmentation using Conditional Random Fields
Abstract

Indoor scene segmentation is a problem that has become very popular in the field of computer vision, with applications that include robotics, medical imaging, home remodeling, and video surveillance. This problem proves even more difficult when the scene is cluttered. Our project aims to explore ways to improve indoor scene segmentation algorithms by examining and evaluating a popular method.

We focus on evaluating the robustness of the algorithm for indoor scene segmentation described in [16] by Silberman et al., which uses SIFT features and conditional random fields to produce segmentations. In our project, we re-implement their method and compare performance by modifying specific sections.

We find that the neural network used in [16] is not robust to an increasing number of classes, but the CRF model is, in the sense that as we increase the number of classes, the CRF becomes increasingly important for producing an accurate segmentation. Furthermore, we also demonstrate that changing the algorithm we use to generate superpixel segmentations increases the classification accuracy of the entire pipeline.

1. Introduction

In our paper, we explore the semantic segmentation of indoor scenes, even cluttered ones. The main goal is that, given an RGB or RGB-D image of a cluttered indoor scene, we output a properly labeled image with each individual pixel corresponding to an object class, such as a television, chair, or table. Although semantically segmenting a scene is an easy task for humans, automatic segmentation using machines proves to be a challenging problem.

The solution we implement is modeled after the segmentation algorithm described in [16], which utilizes neural networks trained on SIFT features along with conditional random fields in order to produce a semantic segmentation.

The goal of our paper is to thoroughly evaluate the algorithm described in [16] in order to understand its success cases and failure cases. We hope that in doing so, we can gain intuition on directions for future work. We follow and re-implement the technical details of Silberman et al.'s method of indoor semantic segmentation. Furthermore, we test modifications to the method.

2. Related Work and Contributions

2.1. Literature Review

There is a large body of work on using conditional random fields (CRFs) to produce semantic segmentations of images. In [11], He et al. present an approach using multi-scale conditional random fields for image segmentation. They leverage 3 different probabilistic models: a classifier relying only on local information, a conditional random field relying on hidden regional variables to model interactions between object classes, and a CRF that incorporates global label information. Their model relied on a complicated training loop. Subsequent works such as [15, 18] use pairwise potentials to model the interactions between neighboring pixels when producing semantic segmentations. In [16, 14], Silberman et al. follow the same CRF framework and also introduce the NYU dataset, a densely labeled dataset of indoor scenes. Finally, in [8], Chen et al. introduce a state-of-the-art segmentation pipeline which utilizes a deep convolutional neural network and a fully-connected CRF to produce accurate segmentations.

2.2. Our Contribution

We reimplement the semantic segmentation approach in [16], and run the following main experiments:

1. We evaluate the ability of the neural network used in [16] to learn 100 object classes. In [16], 13 object classes are used. This experiment allows us to measure the robustness of their neural network.

2. We evaluate the performance of their CRF pipeline using different superpixel algorithms to create initial low-level segmentations. We also qualitatively analyze the performance of the CRF on images in the test set and note potential areas for improvement.

3. We evaluate the performance of their CRF pipeline in the 100 object class setting, and show that the CRF model is robust to an increasing number of object classes.
3. Technical Details

3.1. Dataset

We use the SUN RGB-D dataset [17], which contains the RGB-D images from the NYU depth dataset [14] (the one that Silberman et al. created and used), the Berkeley B3DO dataset [12], and the SUN3D dataset [21].

In our project, we use 1449 images from only the NYU v2 dataset [14], although we use the labels provided by the SUN dataset [17]. The NYU dataset includes the raw RGB images, the raw depth images, and the labeled images, as shown by example in Figure 1. We chose to use the SUN RGB-D version of the NYU images because the SUN RGB-D dataset contains around 10000 images in total, allowing us to extend our work to settings with more data in the future. The SUN RGB-D dataset provides object class labels for each individual pixel of every image. Although the dataset provides depth information for each image, we only use RGB information for all of our implementations.

Figure 1: Example data from the NYU dataset: Left = RGB image; Middle = raw depth image; Right = ground truth class labels created by Amazon Turk.

3.2. Segmentation Pipeline

…Felzenszwalb segmentation, the approach used in [16], as well as quickshift and SLIC segmentation. All points i in a superpixel are assigned class probabilities P_i that are equal to the average of the class probabilities for all descriptors corresponding to grid points in the superpixel. If no grid points fall inside a superpixel, we assign the superpixel uniform class probabilities.

4. Model pixel labels as a conditional random field. The energy of the CRF is defined as follows:

E(y) = Σ_{i∈I} φ_i(y_i) + Σ_{i∈I} Σ_{j∈N(i)} ψ_ij(y_i, y_j)

The summations are taken over all pixels in the image. The first summation models unary potentials for each class, while the second summation models pairwise interactions between neighboring pixels.

Silberman et al. model φ_i as the negative log of the product of P_i(y_i) and a location prior on the class y_i. We did not implement location priors and instead set φ_i(y_i) = −log P_i(y_i).

Finally, we set the pairwise potentials

ψ_ij(y_i, y_j) = 1(y_i ≠ y_j) · η · exp(−α·‖I_i − I_j‖²)

where I_i and I_j are the RGB values at pixels i and j, and η, α are hyperparameters. This potential function mirrors the one used in [10]. Although Silberman et al. use a different pairwise potential function, we find that this potential function is easier to tune.

5. Minimize the energy function of the CRF. In [16], Silberman et al. use the scheme provided by [4]. We experiment with both Boykov et al.'s α-expansion algorithm in [4] and simulated annealing [2].
3.2. Segmentation Pipeline gorithm in [4] and simulated annealing [2].
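As a rough illustration of the energy in step 4 (this is a sketch under assumed array names, not the implementation in [16] or the exact code behind our reported numbers), the following NumPy function evaluates E(y) on a 4-connected grid, using the superpixel-averaged probabilities as unaries:

```python
# Illustrative NumPy sketch of the CRF energy from step 4 on a 4-connected grid.
import numpy as np

def crf_energy(labels, probs, rgb, eta=1.0, alpha=0.05):
    """labels: (H, W) int class map; probs: (H, W, C) per-pixel class probabilities
    (averaged over superpixels); rgb: (H, W, 3) image. Returns the scalar energy E(y)."""
    H, W = labels.shape
    eps = 1e-12
    # Unary term: phi_i(y_i) = -log P_i(y_i), summed over all pixels.
    rows, cols = np.arange(H)[:, None], np.arange(W)[None, :]
    unary = -np.log(probs[rows, cols, labels] + eps).sum()

    # Pairwise term: 1(y_i != y_j) * eta * exp(-alpha * ||I_i - I_j||^2),
    # accumulated over right and down neighbors so each pair is counted once.
    pairwise = 0.0
    for dy, dx in [(0, 1), (1, 0)]:
        a = labels[: H - dy, : W - dx]
        b = labels[dy:, dx:]
        diff = rgb[: H - dy, : W - dx].astype(float) - rgb[dy:, dx:].astype(float)
        weight = eta * np.exp(-alpha * (diff ** 2).sum(axis=-1))
        pairwise += (weight * (a != b)).sum()
    return unary + pairwise
```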
object class labels, a number that is too big for a neural network with a single hidden layer to accurately classify, we instead train on subsets of the classes. In total, we train 2 different neural networks. Following [16], we handpick 11 object classes to train on, shown in Table 1. Silberman et al. use 13 classes: the classes that we use, in addition to the "blind" and "background" classes. We do not use these classes, however, because they do not appear in the SUN RGB-D labels. We also choose the 100 most common labels, and we train a neural network to classify between these classes. For reference, our code includes a pickle file containing the names of these 100 classes. The 11 handpicked classes form a large subset of these 100 most common classes.

From Table 1, we can see that the class distributions between train and test are quite similar, but both are heavily skewed. This also holds for the 100 class set. Because of this skewed distribution, it is possible to train a neural network that achieves high test accuracy but only learns a few classes properly. To remedy this issue, we balance the training distribution by capping each class at 5000 examples per split in the 11 class case and 1000 examples per split in the 100 class case. We are unclear on how Silberman et al. work around this problem.

In Table 2 and Figure 2, we show the accuracy results and confusion matrices for the 11 class dataset. In Table 3 and Figure 3, we show the accuracy results and confusion matrices for the 100 class dataset. From the discrepancy between the training and testing accuracy for both datasets, it is clear that our models overfit, even though we use a substantial amount of training data given the size of the networks we train. There are two main reasons why this could happen. First, descriptors from the same image corresponding to nearby points possess some redundancy, which means the effective number of training samples is smaller than the actual number. Second, intra-class discrepancy is very high between different indoor scenes. Since the train/test splits in [14] are arranged so that not a single scene is in both train and test, test images could present variations of a class not seen during training.

Figure 2: Confusion matrices for the 11 class dataset (panel (b): confusion matrix for testing). Indices correspond to indices in Table 1.

Judging from the confusion matrices in Figure 2, it seems that the 11 class neural network performs worst on classes that are both relatively scarce and similar in appearance to other classes. For example, sofas and tables are commonly classified as floors. These sofa and table classes are very scarce compared to floors and exhibit similar properties, such as a large flat surface, leading to this incorrect classification. We have already mitigated many of these incorrect classifications by trying to balance classes during training, but we are unsure how to improve this further without switching to a deeper network architecture. Since Silberman et al. do not provide their neural network results in [16], we cannot perform a direct comparison. However, their CRF with only unary potentials achieves a 40.9% pixel accuracy on 13 classes, which implies that our results on 2 fewer classes are on a comparable performance level.

We cannot make any comparison to [16] on the 100 class case because they only provide results for 13 classes. However, the confusion matrix in Figure 3 shows that the single layer neural network is not robust enough for the 100 class case. Many of the classes that are very incorrectly
Table 4: Superpixel Algorithm and 11 Class CRF Performance

Superpixel Alg.    Unary Acc.    CRF Acc.
Felzenszwalb       48.74%        49.95%
Quickshift         48.20%        51.21%
SLIC               49.93%        51.35%

Accuracy of pixel-level labels for segmentation of test images on 11 classes. We only consider pixels that fall in one of the 11 classes. Unary accuracy is computed from the segmentation given by minimizing the unary terms of the energy function. CRF accuracy is computed from considering pairwise terms too.
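For reference, the superpixel step that Table 4 compares can be sketched with scikit-image's implementations of the three algorithms; the parameter values and the descriptor-grid inputs (grid_rc, grid_probs) below are illustrative assumptions, not the settings behind the reported accuracies.

```python
import numpy as np
from skimage.segmentation import felzenszwalb, quickshift, slic

def superpixel_unaries(image, grid_rc, grid_probs, method="slic", n_classes=11):
    """image: (H, W, 3) float image; grid_rc: (G, 2) row/col of descriptor grid points;
    grid_probs: (G, n_classes) network outputs at those points. Returns (H, W, n_classes)."""
    if method == "felzenszwalb":
        segments = felzenszwalb(image, scale=100, sigma=0.5, min_size=50)
    elif method == "quickshift":
        segments = quickshift(image, kernel_size=3, max_dist=6)
    else:
        segments = slic(image, n_segments=500, compactness=10)

    probs = np.full(image.shape[:2] + (n_classes,), 1.0 / n_classes)  # uniform fallback
    for s in np.unique(segments):
        mask = segments == s
        inside = mask[grid_rc[:, 0], grid_rc[:, 1]]        # grid points inside this superpixel
        if inside.any():
            probs[mask] = grid_probs[inside].mean(axis=0)  # average their class probabilities
    return probs
```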
is able to fix half of this superpixel, neural network inaccuracy is not the only problem; therefore, transition potentials must be an issue too.

3. The CRF can fix labels for pixels only if labels for nearby pixels in the same object are correct. This is because the CRF model is based on pairwise interactions between neighboring pixels. In future work, we could attempt to fix this issue by using a fully connected CRF model as in [8], which allows the model to account for global feature interactions.

Another way to potentially address the first and second limitations is to fine-tune the neural network probabilities and learn the pairwise interaction potentials by directly training a CRF model instead of handcrafting the pairwise potentials. We can do this by modeling the pairwise potentials as the result of some convolution kernel applied to the local image patch followed by some nonlinearity. We could formulate a training objective that maximizes the log-likelihood of the true labels, optimize it using contrastive divergence [6], and perform back-propagation into the unary and pairwise potentials to learn these functions. We suspect that this approach could mitigate the first and second limitations by providing an energy function that is optimized for the desired task, producing correct pixel labels. Because of a lack of computational power, we leave this idea for future work.

4.3. Varying the Number of Classes

We also explore the robustness of the approach in [16] to a varying number of classes. To do so, we run the segmentation pipeline in Section 3.2 on the test set using the 100 most common class labels. We run our experiments using Felzenszwalb segmentation and SLIC in order to generate superpixels; we forgo testing quickshift because it provided similar performance to SLIC on our 11 class test set, and the segmentation algorithm takes too long to run in conjunction with performing α-expansion optimization on 100 classes, which already requires significantly more time than the 11 class case.

Figure 6: Sample truth maps and segmentations for the 100 class case.

We provide our accuracy results in Table 5. Surprisingly, whereas the superpixel segmentations hurt our unary potential performance in the 11 class case, they actually improve our performance in the 100 class case over the train and test accuracy given in Table 3. We are unsure why this is the case. In addition, Felzenszwalb actually provides better performance for unary potentials now. This discrepancy might be due to the fact that in the 100 class case, larger clusters might result in significant improvement for some test images, because averaging the class probabilities of large clusters reduces noise, and there is more noise in the 100 class case as opposed to the 11 class case.

The increase in accuracy percentage between the unary and CRF models is an interesting result that demonstrates the robustness of the CRF model. As the number of classes increases, the CRF actually seems to make a bigger impact on the final segmentation. Although the actual increase in accuracy is lower in the 100 class case than the 11 class case, the increase is higher in proportion because many fewer pixels get labeled correctly in the 100 class case.

From the examples shown in Figure 6, we can also qualitatively observe the increased impact of the CRF on segmentation quality. In Figure 6, we show example segmentations using the SLIC superpixel algorithm; we choose to analyze SLIC because SLIC and Felzenszwalb exhibit similar CRF performance on the 100 class set, while SLIC is clearly better on the 11 class set. Whereas the truth maps in Figure 5 do not change much between the unary and pairwise cases, the truth maps shown in Figure 6 exhibit significant changes between the two cases. Furthermore, the segmentations obtained using the CRF contain far fewer clusters than the segmentations obtained only using the unary potentials.

We can explain the increased impact of the CRF as follows: since there is a larger number of classes, the network is less certain about its classification choices and therefore assigns more uniform class probabilities. As a result, the pairwise potential term is larger in magnitude than the unary
Table 6: Optimization Algorithm Comparison

Optimization Alg.       % Acc. on Random Sample
Simulated Annealing     51.55%
Boykov et al.           53.01%
Unary                   51.56%

Accuracy of pixel-level labels for segmentation on 11 classes using different optimization schemes. In the "Unary" scheme, we ignore pairwise potentials and take the argmax class probability for each pixel. For all optimization methods, we use SLIC to obtain superpixels.
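For the "Simulated Annealing" row of Table 6, a deliberately simple annealing loop over single-pixel relabels might look like the sketch below; it assumes the crf_energy helper sketched earlier, and the cooling schedule, proposal, and iteration count are illustrative choices rather than the tuned settings.

```python
import numpy as np

def anneal_labels(probs, rgb, n_iters=200000, t0=1.0, t_min=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    H, W, C = probs.shape
    labels = probs.argmax(axis=-1)                      # start from the unary (argmax) labeling
    energy = crf_energy(labels, probs, rgb)
    for it in range(n_iters):
        T = max(t_min, t0 * (1.0 - it / n_iters))       # linear cooling schedule
        i, j = rng.integers(H), rng.integers(W)
        old = labels[i, j]
        labels[i, j] = rng.integers(C)                  # propose a random relabel of one pixel
        new_energy = crf_energy(labels, probs, rgb)     # (a practical version would only
        delta = new_energy - energy                     #  recompute the local terms)
        if delta <= 0 or rng.random() < np.exp(-delta / T):
            energy = new_energy                         # accept the move
        else:
            labels[i, j] = old                          # reject and revert
    return labels
```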
Iro Armeni
Figure 1: Frameworks for 3D Parsing of Large-Scale Indoor Point Clouds into their Space-Semantics. Exploring two different network architectures: (1) A fully 3D CNN receives as an input a 3D voxelized sliding cube with binary occupancy and performs a per-voxel multi-class classification into 5 semantic labels. (2) A fully 3D CNN receives as an input a voxelized enclosed space with binary occupancy and performs a per-voxel multi-class classification into 10 semantic labels. The result in both cases is a class prediction per output voxel.
parts [18].

Most work in the context of 2.5D and 3D using ConvNets is targeting other applications like depth from RGB [7], camera registration [13], and human action recognition [11]. This has limited the amount of produced knowledge and available implementations, pretrained models and training data for 3D related tasks. Alternative approaches to the problem of detecting 3D space semantics in large-scale indoor point clouds could be formed by posing the problem as 2D or 2.5D. Although these could benefit from pretrained models, existing architectures and other 2D or 2.5D datasets for training (e.g. [27] and [22] respectively), they would not take advantage of the rich spatial information provided in 3D point clouds, which can help disambiguate problematic cases. It has been shown that 3D parsing methods can perform better than their 2.5D counterparts [5].

I propose instead a framework for the task of parsing 3D point clouds of large-scale indoor areas into their space-semantics using an end-to-end 3D CNN approach. At a high level, the network receives as input a voxelized 3D portion of a large-scale point cloud¹ and, through a series of fully 3D convolutional layers, performs multi-class classification at the voxel level, outputting the predicted class for each voxel. The network classifies each input voxel into 10 semantic labels² related to structural and building elements, clutter and empty space.

¹ The scale of the point cloud can range from a whole building to a floor, or any large portion of the former.
² Due to memory restrictions, some of the presented experiments use only 5 labels. For more details see Section 4.3.

2. Related Work

Traditional Approaches: Semantic RGB-D and 3D segmentation have been the topic of a large number of papers and have led to a considerable leap in this area during the past few years. For instance, [29, 24, 22] propose RGB-D segmentation methods using a set of heuristics for leveraging 3D geometric priors. [21] developed a search-classify based method for segmentation and modeling of indoor spaces. These are different from the proposed method as they mostly address the problem at a small scale. A few methods attempted using multiple depth views [28, 9], but they remain limited to small scale. Unlike approaches such as [26], [15], my method learns to extract features and classify voxels from the raw volumetric data. Vote3D [31] proposes an effective voting scheme for the sliding window approach on 3D data to address its sparse nature.

2.5D Convolutional Neural Networks: A subsequent extension to RGB-D data followed the success of 2D CNNs ([17], [30], [4], [10]). However, most work handles the depth data as an additional channel and hence does not make full use of the geometric information inherent in the 3D data. [8] proposes an encoding that makes better use of the 3D information in the depth, but remains 2D-centric. The presented work differs from these in that I employ a fully volumetric representation, resulting in a richer and more discriminative representation of the environment.

3D Convolutional Neural Networks: 3D convolutions have been successfully used in video analysis ([11], [12]), where time acts as the third dimension. Although on an algorithmic level such work is similar to the proposed one, the data is of a very different nature. In the RGB-D domain, [16] uses an unsupervised volumetric feature learning approach as part of a pipeline to detect indoor objects. [32] proposes a generative 3D convolutional model of shape and applies it to RGB-D object recognition, among other tasks. VoxNet [20] presents a 3D CNN architecture that can be applied to create fast object class detectors for 3D point cloud and RGB-D data. This work has similarities, however among the differences: it uses a different input representation, it does not perform voxel-to-voxel classification, and since the task is detection it uses fully connected layers.

3. Method

The proposed method receives as input a voxelized 3D portion of the point cloud and, through a series of 3D convolutions, produces a class label prediction for each voxel. I gradually explored 3 different approaches:

• 3D Sliding Window: In this network, the input is a voxelized 3D cube of constant size with binary occupancy that is slid over the large-scale point cloud. When fed to the network, it passes through a series of 3D fully convolutional layers which result in a per-voxel multi-class classification (see Figure 1-Left).

• Adding Context: The previous approach does not provide any context about the content of the sliding cube in relation to the rest of the point cloud. However, context can strongly influence inference. To this end, I provide the global position of the sliding cube in the point cloud as a second input to the network, following a similar approach to [6] (see Figure 1-Right).

• Enclosed Spaces: The use of a sliding cube with constant size cannot account for the different sizes that elements in the point cloud appear with. Although for elements that belong to the category of things (e.g. chairs or tables) one can learn a dictionary of shapes, for elements that can be categorized as stuff (e.g. walls or ceiling) it is harder to identify repetitive shape and size patterns. To address this issue I explored an approach similar to [5] to take advantage of the repetitive layout configuration that indoor enclosed spaces present (e.g. elements are placed in a consistent way inside a room with respect to the entrance location). The semantics in such spaces remain intact (e.g. the wall, ceiling and
3.1. Input

(Figure: network diagram showing the two inputs, a sliding 3D cube or an enclosed space with binary occupancy plus the x, y, z global location, passing through stacked 3DConv / Leaky ReLU layers, a fully connected branch, and Unpool3D layers, ending in a per-voxel classification.)

3D Sliding Window: Here the input is a 3D sliding cube of size 10x10x10 voxels. The stride of the cube on the point cloud [...] which can encompass both the minimum wall width and a gap between rooms larger than the standard wall size (either due to noise or occlusion of the wall surfaces by, e.g., a bookcase in highly cluttered scenes).

Adding Context: In this approach a second input to the voxelized sliding cube is fed to the network, which is the global location of the cube with respect to the whole point cloud. This is represented by its x, y, z coordinates from a defined starting point (one of the point cloud's corners) and forms a vector of size 3.

Enclosed Spaces: As mentioned above, the input here is an enclosed space. One can segment the point cloud
Table 1. Details of 3D Fully Convolutional Neural Network

Approach                   3D Sliding Cube       Enclosed Space
Input
  Size:                    N x10x10x10           N x50x50x50
  Number of Channels:      1                     1
  Stride:                  10x10x10              -
3D Conv1
  Number of Filters:       32
  Filter Size:             5x5x5
  Stride:                  1x1x1
  Padding:                 2x2x2
  Output Size:             N x32x10x10x10        N x32x50x50x50
3D Conv2 and 3D Conv3
  Number of Filters:       64
  Filter Size:             5x5x5
  Stride:                  1x1x1
  Padding:                 2x2x2
  Output Size:             N x64x10x10x10        N x64x50x50x50
3D Conv4 and 3D Conv5
  Number of Filters:       128
  Filter Size:             5x5x5
  Stride:                  1x1x1
  Padding:                 2x2x2
  Output Size:             N x128x10x10x10       N x128x50x50x50
3D Conv6
  Number of Filters:       5                     10
  Filter Size:             5x5x5
  Stride:                  1x1x1
  Padding:                 2x2x2
  Output Size:             N x5x10x10x10         N x10x50x50x50
Output
  Size:                    N x5x10x10x10         N x10x50x50x50

⁴ The number of layers and filters in the network was a direct result of the memory limitations.
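To make the layer configuration in Table 1 concrete, the following is a minimal PyTorch sketch of the fully convolutional network; it is an illustrative reimplementation rather than the original training code, and the framework choice, class name, and LeakyReLU slope are assumptions.

```python
import torch
import torch.nn as nn

class Voxel3DFCN(nn.Module):
    def __init__(self, num_classes=5):   # 5 labels for the sliding cube, 10 for enclosed spaces
        super().__init__()
        def block(c_in, c_out):
            # 5x5x5 convolution, stride 1, padding 2: the voxel grid size stays unchanged
            return nn.Sequential(nn.Conv3d(c_in, c_out, kernel_size=5, stride=1, padding=2),
                                 nn.LeakyReLU(0.01))
        self.features = nn.Sequential(
            block(1, 32),     # 3D Conv1
            block(32, 64),    # 3D Conv2
            block(64, 64),    # 3D Conv3
            block(64, 128),   # 3D Conv4
            block(128, 128),  # 3D Conv5
            nn.Conv3d(128, num_classes, kernel_size=5, stride=1, padding=2),  # 3D Conv6
        )

    def forward(self, x):
        # x: (N, 1, D, H, W) binary occupancy grid; returns per-voxel class probabilities
        return torch.softmax(self.features(x), dim=1)

# Example: a batch of two 10x10x10 sliding cubes -> (2, 5, 10, 10, 10) class scores
print(Voxel3DFCN(num_classes=5)(torch.zeros(2, 1, 10, 10, 10)).shape)
```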
Figure 2. Stanford Large-Scale 3D Indoor Dataset [5]: I split the dataset into training (4 areas), validation (1 area) and testing sets (1
area). The raw point clouds are shown in the first row and the voxelized ground truth ones in the second row.
4. Experiments

4.1. Dataset

For the evaluation I used the Stanford Large-Scale 3D Indoor Dataset [5], which comprises six large indoor parts in three buildings of mainly educational and office use (see Figure 2). The entire point clouds are automatically generated without any manual intervention as the output of the Matterport camera ([1]). Each area covers approximately 965, 1100, 450, 870, 1700 and 935 square meters (a total of 6020 square meters). Conference rooms, personal offices, auditoriums, restrooms, open spaces, lobbies, stairways and hallways are commonly found. The areas show diverse properties in architectural style and appearance. The dataset has been annotated for 12 semantic elements which pertain to the categories of structural building elements (ceiling, floor, wall, beam, column, window and door) and commonly found furniture (table, chair, sofa, bookcase and board). A clutter class exists as well for all other elements. The dataset was split into training, validation and testing as follows: 4 areas for training, one for validation and one for testing. In this way I ensure that the network sees areas from different buildings during training and testing. The same data split was used in all approaches.

4.1.1 Preprocessing

• All data preprocessing steps were implemented in Python3 as well.

• 3D Sliding Cube: After generating the input, I noticed that the number of sliding cubes that contained only empty voxels was far larger than the number of sliding cubes that contained at least one voxel of the rest of the classes. To counterbalance the skewness of the distribution I removed part of the empty sliding cubes.

• I shuffled the data before training, since without it the learning process was getting compromised: the network was receiving sequential inputs of similar classes, in the first case due to the sliding nature of the input and the semantic consistency of the configuration of spaces, and in the second due to spaces with similar functions.

• I used the Adam [14] adaptive learning rate method, with parameters β1 = 0.9, β2 = 0.99, and ε = 1e-08.

• The size of the batch per iteration was limited to 500 sliding cubes for the first and second approaches (sliding cube) and 4 for the third (enclosed space) due to memory restrictions.

• I used as metric the mean accuracy of prediction per voxel.

4.3. Results

3D Sliding Cube: The initial idea towards this problem was to use a 3D sliding window approach. The main motivation behind it was the fact that the input size to the network and the size of the voxelization grid could remain constant no matter the size of the point cloud (buildings have different sizes). This framework has previously proven successful in traditional pipelines. However, although substantial effort was put into tuning the hyperparameters, the network did not learn. I identify four main factors as the principal reasons: (a) memory limitations did not allow exploring a number of hyperparameters, such as using all available classes in the dataset or different sliding cube sizes; in both cases the resulting matrices were too large and the GPU would fail to handle them; (b) limiting the number of classes forced me to place a great number of elements under the other class, which as a result created a class with low discriminative power, due to the resulting amorphous shape and geometry, but highly represented in the dataset, due to the number of voxels falling in this category; (c) using a generic constant size of the cube did not permit capturing the geometry of other elements; and (d) there was a lack of context regarding the content of the voxelized input with respect to the rest of the point cloud. An example of the training loss can be seen in Figure 3.

Adding Context: Following the previous failed attempt to learn space-semantics, my next step was to add global context as a second input to the network. Following the architecture described above, the network continued not to be able to learn. Once again memory limitations restricted the number of layers, number of filters, and other network parameters. The training loss of this network is marked with a green line in Figure 4. Although the results are not as expected (see Figure 5), it did perform better than the previous network, which demonstrates that the global information was helpful, however not powerful enough to solve the ill-posed problem the sliding window approach created. I also experimented with the same architecture as that proposed in [6], however the results did not differ.

Enclosed Spaces: For the enclosed spaces approach I experimented with 5 and 10 semantic classes. In this case
Figure 5. Testing Accuracy.
References

[4] L. A. Alexandre. 3d object recognition using convolutional neural networks with transfer learning between input channels. In Intelligent Autonomous Systems 13, pages 889-898. Springer, 2016.
[5] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3d semantic parsing of large-scale indoor spaces. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 2016.
[6] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1538-1546, 2015.
[7] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366-2374, 2014.
[8] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from rgb-d images for object detection and segmentation. In Computer Vision - ECCV 2014, pages 345-360. Springer, 2014.
[9] A. Hermans, G. Floros, and B. Leibe. Dense 3d semantic mapping of indoor scenes from rgb-d images. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 2631-2638. IEEE, 2014.
[10] N. Höft, H. Schulz, and S. Behnke. Fast semantic segmentation of rgb-d scenes with gpu-accelerated deep neural networks. In KI 2014: Advances in Artificial Intelligence, pages 80-85. Springer, 2014.
[11] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):221-231, 2013.
[12] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725-1732, 2014.
[13] A. Kendall and R. Cipolla. Modelling uncertainty in deep learning for camera relocalization. Proceedings of the International Conference on Robotics and Automation (ICRA), 2016.
[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[15] H. S. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic labeling of 3d point clouds for indoor scenes. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, editors, NIPS, pages 244-252, 2011.
[16] K. Lai, L. Bo, and D. Fox. Unsupervised feature learning for 3d scene labeling. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 3050-3057. IEEE, 2014.
[17] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. IJRR, 2015.
[18] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR (to appear), Nov. 2015.
[19] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, page 1, 2013.
[20] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, September 2015.
[21] L. Nan, K. Xie, and A. Sharf. A search-classify approach for cluttered indoor scene understanding. ACM Transactions on Graphics (TOG), 31(6):137, 2012.
[22] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[23] S. Ochmann, R. Vock, R. Wessel, M. Tamke, and R. Klein. Automatic generation of structural building descriptions from 3d point cloud scans. In GRAPP 2014 - International Conference on Computer Graphics Theory and Applications. SCITEPRESS, Jan. 2014.
[24] J. Papon, A. Abramov, M. Schoeler, and F. Worgotter. Voxel cloud connectivity segmentation - supervoxels for point clouds. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2027-2034. IEEE, 2013.
[25] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91-99, 2015.
[26] X. Ren, L. Bo, and D. Fox. Rgb-(d) scene labeling: Features and algorithms. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2759-2766. IEEE, 2012.
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[28] T. Shao, W. Xu, K. Zhou, J. Wang, D. Li, and B. Guo. An interactive approach to semantic modeling of indoor scenes with an rgbd camera. ACM Transactions on Graphics (TOG), 31(6):136, 2012.
[29] N. Silberman and R. Fergus. Indoor scene segmentation using a structured light sensor. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 601-608. IEEE, 2011.
[30] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng. Convolutional-recursive deep learning for 3d object classification. In Advances in Neural Information Processing Systems, pages 665-673, 2012.
[31] D. Z. Wang and I. Posner. Voting for voting in online point cloud object detection. In Proceedings of Robotics: Science and Systems, Rome, Italy, July 2015.
[32] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912-1920, 2015.
Adding Shot Chart Data to NBA Scenes
Neerav Dixit
Stanford University
450 Serra Mall, Stanford, CA 94305
ndixit@stanford.edu
Abstract
(a) Broadcast image of left side of court (b) Left side court geometry (c) Right side court geometry
Figure 2: Broadcast image (a) with image coordinates (u, v). Court geometry shown in (b), (c) along with court
coordinates (x, y) used in each case
Figure 3: Flow chart of algorithm including images generated at each stage
ing to the HSV colorspace of the pixels. Each dimension was divided into n bins (n = 3 or n = 4) for a total of n^3 bins in the Hough space. Each pixel in the bottom 75% of the image voted for the bin corresponding to its HSV numbers. The top 25% of the image was discarded because the court typically does not extend to that part of the image. Bins corresponding to low value pixels were discarded since the court is not dark and black is a common background color that can result in coherent votes for non-court colors. The highest remaining m bins (m = 1 or m = 2) were retained, and pixels contributing to these bins were identified as the court mask. The values of n and m were varied depending on which court was shown in the image. Because of some differences in the court designs of the 30 NBA teams, different values of n and m gave the best results for the different courts.

Using this Hough voting scheme, some isolated pixels would end up as false positives or false negatives. Since the court is mostly continuous in the image, a median filter was applied to the court mask output from the Hough voting scheme in order to eliminate isolated pixels. The size of the median filter was varied for images of different courts to give the best results. After thresholding the output of the median filter, the final court mask, shown in Figure 3b, was obtained. This court mask does not include players or referees, and it successfully separates the court from most of the occlusions on the court.

3.2. Key Point Identification

Following identification of the court mask, this mask was used to identify key points on the court to be used for camera calibration. Without prior knowl-
fitting the selected points using the Hough transform,
so long as enough of the points fall along the desired
line of interest. These detected lines are superimposed
on the image in Figure 3d. Once the court lines in the
image have been found, their intersection can be used
to locate the key points in the image. These key points
are shown in green in Figure 3d.
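The line-fitting and key-point step described above can be sketched as follows; the threshold values and the candidate-point mask are assumptions, and the released implementation is in Matlab rather than Python.

```python
# Fit court lines to candidate edge pixels with a Hough transform, then intersect
# pairs of detected lines to obtain key-point candidates.
import numpy as np
import cv2

def court_key_points(candidate_mask):
    """candidate_mask: uint8 binary image of candidate court-line pixels."""
    lines = cv2.HoughLines(candidate_mask, 1, np.pi / 180, 80)
    if lines is None:
        return []
    params = [l[0] for l in lines]                     # each entry is (rho, theta)
    points = []
    for a in range(len(params)):
        for b in range(a + 1, len(params)):
            (r1, t1), (r2, t2) = params[a], params[b]
            A = np.array([[np.cos(t1), np.sin(t1)], [np.cos(t2), np.sin(t2)]])
            if abs(np.linalg.det(A)) < 1e-6:
                continue                               # skip near-parallel line pairs
            u, v = np.linalg.solve(A, np.array([r1, r2]))
            points.append((u, v))                      # image coordinates of a key-point candidate
    return points
```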
3.4. Shot Chart Projection

Following camera calibration, the court coordinates (x, y) of each pixel in the court mask can be found by first solving for the homogeneous court coordinates p:

p = H^{-1} [u, v, 1]^T    (9)

x = p_1 / p_3,   y = p_2 / p_3    (10)
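Equations (9) and (10) amount to applying the inverse homography and dehomogenizing; a short sketch (H is assumed to be the 3x3 NumPy homography obtained from the key points, while the released implementation does this in Matlab):

```python
import numpy as np

def image_to_court(u, v, H):
    p = np.linalg.inv(H) @ np.array([u, v, 1.0])   # homogeneous court coordinates (eq. 9)
    return p[0] / p[2], p[1] / p[2]                # dehomogenize to (x, y) (eq. 10)
```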
Figure 6: Four results of the algorithm projecting shot charts onto broadcast NBA images. Image (a) corresponds
to blue X in Figure 5; (b) to green X; (c) to red X, (d) to blue O
court to find key points in the image, camera calibration and assignment of pixels on the court to shot chart regions, and combination of the original image with the image of the projected shot chart. The method shows reasonable accuracy in separating the court from other features in the image as well as having boundaries between shot chart regions at the appropriate locations in the image.

The method was applied to a variety of different NBA courts, and the dominant color detection method that was implemented through a Hough transform appeared to work very well despite some differing color schemes for these different courts. Some difficulties were encountered in images of certain playoff games, in which fans were provided with shirts of the same color. Rather than the region of fans near the top and bottom of the image having an arbitrary assortment of colors, in these cases there were many pixels in the image that would vote for the same incorrect bin in the HSV Hough space that was used. This issue was overcome by ignoring the top 25% of the image, which is nearly always occupied by the crowd, and by enforcing that the Hough bins corresponding to dark pixels with low values were ignored when identifying the dominant colors.

Only images using a broadcast angle and focusing
primarily on the left or right side of the court were
considered. However, the camera angle in the broad-
cast is not stable in these situations, so there was some
variability in the images used to test the algorithm.
This variability resulted in some difficulty in the step
for detection of potential points on the court lines of
interest, shown as the transition from Figure 3b to 3c.
Knowledge of the court based on the shape and ex-
tent of the court mask had to be incorporated in order
to accurately identify these candidate points across a
range of camera angles.
Considering images of both the left and right sides
of the court also resulted in a need to differentiate be-
tween these images when identifying points along the
lines of interest. Using different sets of court coordi-
nates, shown in Figures 2b and 2c, for camera calibra-
tion of images showing the left and right sides of the
court allowed these different cases to be treated iden-
tically for the later steps in the algorithm following
camera calibration.
This method excludes the key, the colored rectangle
near the hoop, when projecting the shot chart onto the
court. In order for a good projection to be made onto
the key, its color would have to be changed so that
the projected shot chart would be visible in the final
image. This was not done for this work due to dif-
ficulties in changing the color of the entire key while
ignoring occluding players and maintaining the court
markings that exist in the key. However, implement-
ing this into the algorithm would be a good next step
to extend this work further.
This algorithm could definitely improve the viewer
experience of an NBA broadcast if implemented ap-
propriately. The addition of shot chart data in proper
context to the broadcast would help commentators
make salient points with the help of a great visual
aid. Future work to further increase the accuracy of
the projection and identification of court pixels, along
with the inclusion of the key in the shot chart pro-
jection, would result in a useful tool to enhance NBA
broadcasts.
The Matlab code and images used for this work can be accessed at the following link: https://stanford.box.com/s/akh4sdyput0ez37yy0he3la43y2ono3s

The shot chart projection is done by running the program main.m.
Optical Recognition of Hand-Drawn Chemical Structures
Bradley Emi
Stanford University
Dept. of Computer Science
1 Introduction
There are several additional advantages of being able to recognize handwritten molecules in addition to computer-generated molecules. For example, the computer generation of molecular structures is currently quite tedious, and an application to perform real-time recognition of small components of hand-drawn structures does not yet exist.

2 Review of Previous Work

2.1 Summary of Previous Work

Previous work on the optical structure recognition problem has to date focused exclusively on computer-generated structures. Early research began in the 1990s, with IBM receiving a patent for recognition of chemical graphics among other printed material on a page, as well as basic line tracing techniques (Fig. 3) to recognize structures. [4] A similar approach was developed by University of Leeds researchers and called CLiDE in the same year. [5]

2.2 Improvements to Existing Approaches

The focus of this paper is the novel approach to the correct identification of hand-drawn bonds in the low-level module: the correct identification of atoms and edges without the use of high-level correction using chemical and graphical knowledge. Previous attempts at optical structure recognition, even state-of-the-art approaches, are heavily dependent on the correct identification of fine lines (the individual thin lines constituting double, triple, and dashed bonds), which fails in the case of imperfect hand-drawn bonds. Frasconi et al.'s algorithm, MLOCSR, uses the Douglas-Peucker algorithm [9] to approximate the contour of the molecule with a polygon, fitting the least-vertex polygon to the contour within a certain precision. We hypothesize that line detection-only vectorization algorithms such as the Douglas-Peucker algorithm may fail in cases where bonds are not straight (Fig. 4), assigning too many vertices to the molecule. Furthermore, classification algorithms can fail when dashes follow an irregular pattern and/or touch (Fig. 5).
approach was ultimately unsuccessful. HOG classifiers
were found to suffer from a large number of false positives
due to a lack of negative training examples. With more
data, this approach could also prove to be more successful,
but was inapplicable with our limited amount of training
data.
We set the minimum and maximum scale for the images at 20x20 pixels and 60x60 pixels respectively, and implemented a spatial pyramid sliding window with the length of each square window increasing from 20 to 60 pixels in steps of 5 pixels. This was a very conservative estimate, and for reimplementation on different image sizes, we recommend scaling from 0.3% of the total area of the image to 3%.

For the supervised learning classifiers, to collect negative training examples, we randomly selected 1200 of these windows from the training set that were verified by hand to have no text. We then collected 5 of each of the 4 templates from the training set. To augment the number of positive examples, we additionally used 55 images for each template from the open source Chars74K handwritten dataset [11]. We cropped each image to eliminate whitespace and extracted histogram of oriented gradients features from each using 64 bins. We then compared the performance of a logistic regression classifier, a linear SVM classifier, and a neural network with one hidden layer with 30 nodes. Results are presented in Section 4.

For the scale-invariant template matching, we applied a Gaussian filter with size equal to half the width of the measured strokes to all training templates and the image for matching, and then used the spatial pyramid sliding window described above to match the images. We then chose the tolerance level, 0.77, for which the F1 score was maximized. Non-maximal suppression is used to remove overlapping bounding boxes. A sample of the output of this stage is presented in Fig. 7. We used the results of this algorithm for the next stages of the pipeline. More details are presented in Section 4.

For clarity, we use the terminology of MLOCSR, defining a C-point to be a corner corresponding to the intersection of the main bonds of a carbon, a D-point to be the endpoint of a line segment not connected to the main bond that represents a double or triple bond, and a T-point to be the end of a line segment drawn to a text box to indicate a bond to a non-carbon.

3.2.4 Best-Fit Polygon Reimplementation

As in MLOCSR, we use the Douglas-Peucker algorithm to detect the vicinity of C-points and T-points, and look for D-points later once the main corners are located. For each contour, this algorithm iteratively tries to fit n-vertex polygons to the contour, increasing n until no point on the contour is further than a threshold distance away from the polygon. The algorithm then returns the vertices of the polygon.

To accomplish this, we search for clusters of all points of polygons that fit the opposite contours of the image after a Canny edge detector is applied. We use the threshold of √2 times the edge length as prescribed in MLOCSR. Fig. 8 shows the results of this stage.
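The polygon-fitting step can be sketched with OpenCV, whose approxPolyDP routine implements the Douglas-Peucker algorithm; the epsilon value below stands in for the √2 times edge-length threshold described above, and edge_len itself is an assumed estimate.

```python
import numpy as np
import cv2

def polygon_corner_candidates(gray, edge_len=6.0):
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)
    candidates = []
    for contour in contours:
        approx = cv2.approxPolyDP(contour, np.sqrt(2) * edge_len, True)
        candidates.extend(pt[0] for pt in approx)   # (x, y) vertices of the fitted polygon
    return np.array(candidates)                     # cluster these to locate C- and T-points
```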
3.2.5 Harris Corner Detector

The goal of the Harris detector [12] in this context is the same: to identify the C- and T-points but not necessarily the finer D-points that distinguish double and triple bonds. The Harris corner detector looks for a high variation in the gradient of an image in two directions.

We first apply a coarse Gaussian filter to the image with the size of the estimated stroke width. We then run the Harris corner detector, once again requiring corners to be a threshold distance apart.

A sample result on the same molecule after filtering is presented in Fig. 10.

3.2.6 Bond Detection

Since a carbon can only have four bonds, for each of the nodes detected in the previous stage, we look at the four closest nodes to see if there is a bond between them. While further molecules are not strictly forbidden from being connected to a carbon, it is extremely uncommon, and this case does not occur in any of the molecules in our dataset. For more general molecules, more nodes can be examined and spurious matches can be removed using a Markov logic network similar to what is implemented in MLOCSR, but we do not implement that here for simplicity.

The other heuristic we use is that if three nodes are collinear, there is not a bond between the two outer nodes. This situation only occurs when there are two bonds at a 180-degree bond angle, so the outer nodes cannot have a bond between them.
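The two bond-candidate heuristics above (four nearest nodes per node, and no bond between the outer nodes of three collinear nodes) can be sketched as follows; the function name and the angle tolerance are assumptions.

```python
import numpy as np

def bond_candidates(nodes, tol=0.98):
    """nodes: (N, 2) array of detected node coordinates. Returns candidate bonds as index pairs."""
    nodes = np.asarray(nodes, dtype=float)
    n = len(nodes)
    pairs = set()
    for i in range(n):
        dists = np.linalg.norm(nodes - nodes[i], axis=1)
        for j in np.argsort(dists)[1:5]:            # four nearest neighbors (carbon has <= 4 bonds)
            pairs.add(tuple(sorted((i, int(j)))))

    def collinear_between(i, k, j):
        u, v = nodes[k] - nodes[i], nodes[j] - nodes[k]
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return denom > 0 and float(np.dot(u, v)) / denom > tol   # k lies on a ~straight path i -> j

    # If a third node sits between two candidates on a nearly straight line,
    # the two outer nodes are not directly bonded.
    return {(i, j) for (i, j) in pairs
            if not any(collinear_between(i, k, j) for k in range(n) if k not in (i, j))}
```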
molecules and would also be removed by a Markov logic network. Results are presented in Sec. 4. The process is visualized in Fig. 11.

Fig. 11: Top left, top right: Hough line detections (blue) for the node-node pairs with a bond for a given window (red) in the bounding box. Bottom: A window between two opposite nodes that will not have a Hough line detection, so the algorithm will not assign it a bond, even though there is contamination elsewhere in the bounding box.

window gets a "vote" for the overall type of bond. A sample result is shown below in Fig. 12, and overall results are presented in Sec. 4.

4 Results

4.1.1 Scale-Invariant Template Matching

The results of text recognition are presented here. In order to optimize the tolerance of the scale-invariant template matching, we measured the precision and recall on the test set. The results are presented in Fig. 13. We chose the tolerance that maximizes the F1 score, 0.77.
Molecule ID   Precision   Recall   Accuracy
1             1.0         1.0      1.0
2             1.0         1.0      1.0
3             0.54        1.0      0.50
4             1.0         1.0      1.0
5             0.95        0.95     0.90
6             1.0         1.0      1.0
7             0.96        0.79     0.40
8             0.79        0.65     0.53
9             0.98        0.90     0.58
Total         0.91        0.92     0.77

Table 1: Results of scale-invariant template matching on the test set.

While an accuracy of 77% is far from ideal, it is surprisingly effective considering we only used 5 training images to build the templates. With more examples, this method could perform even better in future work. We use the images where text was accurately identified from this stage in the further stages of the pipeline.

4.1.2 Supervised Classifiers

Classifier                                          Parameters                               Train Set        Cross-Validation   #Iterations   Avg. Acc.
Logistic Regression with HOG features               Regularization coeff. = 1.0, L2 norm     1330 examples    10-fold            100           0.97
Linear Support Vector Machine with HOG features     Regularization coeff. = 1.0, L2 norm     1330 examples    10-fold            100           0.96
One-Layer Neural Network with HOG features          One hidden layer with 30 nodes           1330 examples    10-fold            100           0.99

Table 2: Results of supervised classifiers on the OCR training set.

For future work we would like to expand the size of the training set to improve the accuracy; but for now we use template matching.

4.2 Corner Detection Results

As shown in Table 3, we conclude that the Harris corner detector outperforms the baseline method, the MLOCSR polygon reconstruction method, quite significantly, by approximately 15% on the molecule level (node level refers to the number of correctly detected nodes over the total number of nodes; molecule level refers to the number of correctly detected molecules with no false positives divided by the total number of molecules). There are several reasons for this result. First, the polygon reconstruction method performs very poorly in the case of dashed bonds, whereas the Gaussian smoothing applied to the image before applying Harris corner detection "blends" dashed bonds into an edge before finding the corners. This kind of preprocessing is not feasible for the polygon reconstruction method since it relies on the narrow opposite contours that form the edges of the thick lines. The performance on a dashed molecule is demonstrated in Fig. []. Neither method performs particularly well on dashed bonds (the polygon method performs at 15% on these dashed bonds, while the Harris method performs at 45%), which when blended are very wide, making corner detection difficult, especially when a dashed bond is near other corners.

                   Molecule accuracy   Overall Precision   Overall Recall
Polygon Method     0.748               0.960               0.970
Harris Method      0.896               0.987               0.989
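For reference, the Harris variant compared in the table above, a coarse Gaussian blur sized to the estimated stroke width followed by corner detection and a minimum-distance filter, can be sketched with OpenCV; the kernel sizes and thresholds here are illustrative rather than the tuned values.

```python
import numpy as np
import cv2

def harris_corners(gray, stroke_width=9, thresh_ratio=0.05, min_dist=10):
    k = int(stroke_width) | 1                              # Gaussian kernel size must be odd
    blurred = cv2.GaussianBlur(gray, (k, k), 0)
    response = cv2.cornerHarris(np.float32(blurred), 5, 3, 0.04)
    ys, xs = np.where(response > thresh_ratio * response.max())
    corners = []
    for y, x in sorted(zip(ys, xs), key=lambda p: -response[p[0], p[1]]):
        # Keep only corners that are at least min_dist apart, strongest response first.
        if all((y - cy) ** 2 + (x - cx) ** 2 >= min_dist ** 2 for cy, cx in corners):
            corners.append((int(y), int(x)))
    return corners
```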
Fig. 14: Harris corner detection on a molecule with dashed bonds (left) and the polygon reconstruction method with initial corners in blue and clustered corners in red (right). A wide Gaussian filter helps "blend" dashed bonds together. Since each dash of the dashed bond is a contour, many spurious corners (in blue) are detected with the polygon reconstruction method, making the agglomerative clustering inaccurate.

As we hypothesized, the Harris method also outperforms the polygon reconstruction method when bonds are not perfectly straight. This was particularly evident in the benzene rings, where the Harris corner detector (95% accuracy on benzene rings) was able to substantially outperform the polygon reconstruction method (50% accuracy on benzene rings). This effect is shown in Fig. 15.

Molecule ID   Molecule Accuracy   Overall Precision   Overall Recall
1             1.0                 1.0                 1.0
2             0.31                0.94                0.95
3             0.0                 1.0                 0.78
4             0.26                1.0                 0.79
5             0.70                1.0                 0.88
6             0.82                1.0                 0.94
7             0.22                0.86                0.97
8             0.70                1.0                 0.90
9             0.10                0.90                0.83
Total         0.55                0.96                0.91

Table 4: Bond detection results.

Typical errors included a missing bond, as shown in Fig. 16, and false triangular closures with bonds at very wide angles (bonds that are nearly, but not quite, collinear), as shown in Fig. 17.
4.4 Bond Classification Results

4.4.1 Comparison of Classifiers

Table 5: Cross-Validation Results on a 90-10 training set split of known bond labels.

4.4.2 Performance on Test Set

Based on the results on the cross-validation set, we use the SVM for classification on the full set of molecules. We find that there is no great disparity in confusing one type for another; despite having only 5 training examples of triple bonds, we find that double bonds are no more often mistaken for single bonds than for triple bonds, for example.

Molecule ID   Accuracy By Bond   Accuracy By Molecule
1             0.98               0.90
2             0.97               0.81
3             0.57               0.0
4             0.80               0.30
5             1.0                1.0
6             0.83               0.50
7             1.0                1.0
8             0.98               0.93
9             1.0                1.0
Total         0.94               0.75

Table 6: Test results by molecule using an SVM classifier.

4.5 Overall Results

When the overall pipeline is run on the entire set of molecules, 94 out of the original 360 molecules are correctly recognized in their entirety. While this accuracy may seem low, it is still higher than the performance of the "out of the box" existing optical structure recognition algorithms, the most well-known being OSRA, which when used on handwritten data have nearly 0% accuracy. We also find that even the approach of MLOCSR applied to the data, which relies on the Douglas-Peucker polygon fitting algorithm, does not detect C- and T-points as successfully as our algorithm on our hand-written dataset. We also find that our supervised learning bond classification algorithm performs extremely well given the very small training data set, which was extracted from only 5 images of each molecule. We are optimistic that with more training data we will be able to obtain nearly 100% accuracy with this method in the future.

Fig. 18: Two examples of correctly recognized molecules after completion of the full pipeline. These can be easily converted to a standard chemical data format.

5 Conclusion

Although our overall accuracy is low, we believe that the work presented in this paper will lay the foundation for hand-drawn structure recognition in the future.

Much of the low accuracy can simply be attributed to a lack of training data. State-of-the-art OCR methods, for example, would boost the accuracy of text recognition from 77% to near perfect. We also believe that more training data will ultimately allow us to use a convolutional neural network for the bond classification stage rather than an SVM, and more data will significantly improve the accuracy of bond classification as well.

Additionally, as mentioned previously, the focus of this project was on the low-level recognition of atoms and bonds, or the nodes and vertices that make up the overall graph. There are additional heuristics that can be applied in higher-level modules, such as bonding patterns like valence rules that we did not take into account, which would significantly improve the performance of bond detection.

Accuracy may also be a misleading metric for certain applications of hand-drawn structure recognition, in cases where more information is obtained. For example, in an electronic tablet drawing application, in a similar way to how Chinese and Japanese characters are recognized by OCR software, information about how the user is drawing the structure is also available. This can improve the localization of corners (using information about when the user picks up and puts down the pen) and identify bonds with much greater accuracy (based on speed of stroke, etc.). Additionally, if there is a limited subset of molecules that the engine is required to recognize, various molecule similarity algorithms can be used to compare the molecule against the database of
possible molecules and return the one with the greatest similarity. This is often the case for simple molecules and could be very useful in chemistry education.

We conclude that handwritten structure recognition and analysis is a difficult problem, one that cannot be treated in the same way that computer-generated structure recognition is treated. More flexibility must be applied in accounting for the greater degree of variability in hand-drawn images, and we have accounted for that in this work with modern corner and line detection techniques. The key insight of this project was analyzing small cross-sections of bonds so the algorithm can gain a consensus from many cross-sections instead of trying to analyze bonds as a whole, as previous algorithms have done. Overall, there are many parts of this pipeline that can be improved as mentioned, but much progress has been made towards being able to apply these methods to a public application.

6 References

[1] Gaulton, A.; Overington, J. P. Role of open chemical data in aiding drug discovery and design. Future Med. Chem. 2010, 2, 903-7.

[2] Kind, T.; Scholz, M.; Fiehn, O. How large is the metabolome? A critical analysis of data exchange practices in chemistry. PLoS One 2009, 4, e5440.

[7] Filippov, I. and Nicklaus, M. Optical Structure Recognition Software to Recover Chemical Information: OSRA, An Open Source Solution. J. Chem. Inf. Model., 49(3), pp. 740-743, 2009.

[8] Frasconi, P. et al. Markov Logic Networks for Optical Chemical Structure Recognition. Journal of Chemical Information and Modeling, 54, pp. 2380-2390, 2014.

[9] Douglas, D.; Peucker, T. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Can. Cartogr. 1973, 10, 112-122.

[10] Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. International Conference on Computer Vision & Pattern Recognition (CVPR '05), Jun 2005, San Diego, United States. IEEE Computer Society, 1, pp. 886-893, 2005.

[11] de Campos, T. E.; Babu, B. R.; Varma, M. Character Recognition in natural images. In Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, February 2009.

[12] Harris, C.; Stephens, M. A Combined Corner and Edge Detector. Plessey Research, 1988.
A Appendix: Molecule Table
Pedestrian Detection and Tracking in Images and Videos
than other approaches.

The Deformable Parts Model (DPM) is another technique for object detection that performs well at classifying highly variable object classes. In this technique, for each image, a HOG feature pyramid is formed by varying the scale of the image, and defining a root and parts filter. The root filter is coarse and is used to capture the general shape of the object, while higher resolution part filters are used to capture small parts of the object. Objects are then detected by computing the overall score for each root location based on the best possible placement of the parts [7].

The other main technique in pedestrian detection is the Convolutional Neural Network (CNN). This technique shows outstanding power in addressing the pedestrian detection problem, especially in the context of autonomous driving. A CNN learns which convolution parameters produce features that best predict the desired output; these features, extracted from the last fully connected layers, can then be used to train an SVM model for pedestrian detection [11][12].

3. TECHNICAL APPROACH

Using our dataset of positive and negative images, we extracted features using the histogram of oriented gradients technique described by Dalal and Triggs. This technique divides the image into dense, equal-sized overlapping blocks. Each of these blocks is further divided into cells which are used to find a 1-D histogram of gradient edge orientations over the pixels of the cell. For this project, we experimented with block sizes that were 2x2 and 4x4, and cell sizes that were 8x8 and 16x16, in order to find the best combination. For our histograms, we used 9 orientation bins across all experiments. Histograms for each block are combined and finally normalized to have better invariance to illumination and shadowing [6]. Figure 1 shows an example of extracted HOG features for a pedestrian.

Figure 1: Example of HOG features; the right picture is the original image and the left one is the extracted HOG features.

In order to train the model, the feature vectors for the images were fed into a linear SVM classifier. This model was then used to classify pedestrians from non-pedestrians. We implemented a sliding window approach to exhaustively search static images for windows with scores greater than 0.2. The scores for each window were calculated using the weight and bias found from our SVM model. Since our sliding window was kept at a constant size of 64x128 pixels, we implemented an image pyramid approach for our detection. In this approach, for each image, we scaled down the image by 15% of its original size for several iterations until the size was below a threshold of 64 pixels for width and 128 pixels for height. For each iteration, our detector window searched the entire scaled image and calculated scores using our SVM model. Once this algorithm was finished, scaled bounding boxes were displayed on the original image and non-maximal suppression was applied to eliminate redundant boxes.

In order to reduce false positive rates, we mined for hard negative examples using our negative training data. We extracted all false positive objects found within negative images and included these examples in our training data for retraining the classifier.

Due to the exhaustive search performed during HOG feature extraction, the time complexity of object detection is very high. This poses a problem for pedestrian tracking in videos because detection rates would be too slow. In order to remedy this problem, objects that are moving are extracted from each frame using background subtraction [8]. Using this method, we detected motion by segmenting moving objects from the background and passing these smaller images into our model instead of passing the whole frame for detection. The n-th frame can be represented as I_n, its intensity image, and I_{n-1} corresponds to the previous frame. Doing a pixelwise subtraction, we get the equation
2
$$ M_n(i,j) = \begin{cases} I_n(i,j) & \Delta(i,j) \geq T_{\text{threshold}} \\ 0 & \Delta(i,j) < T_{\text{threshold}} \end{cases} $$

where $i$ and $j$ are pixel positions and $M_n$ is the motion image. By finding the motion image, we can dramatically reduce the complexity of our computation [9]. Using these motion images, we were able to run our model on video frames much faster than when we did not have any motion detection.

Figure 2 provides a summary of the steps in our detection algorithm.

4. EXPERIMENTS AND RESULTS

To obtain the model with the highest accuracy, we tried two different classifiers: SVM and Random Forest. To evaluate our models, we tested them on a validation set including 1,000 pedestrian images. For the SVM classifier, we investigated different regularization parameters (C) to get the highest accuracy. The regularization parameter tells the SVM optimization how much we want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of getting all the training points classified correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. We got the highest accuracy for the SVM model when the regularization parameter has the value of 0.001.

For Random Forest, we examined different numbers of trees in training the model. Random Forest uses bagging (picking a sample of observations rather than all of them) and the random subspace method (picking a sample of features rather than all of them) to grow a tree. If the number of observations is large but the number of trees is too small, then some observations will be predicted only once or not at all. If the number of predictors is large but the number of trees is too small, then some features can be missed in all subspaces used. Both cases result in a decrease of the random forest's predictive power, although the latter is a rather extreme case, since the selection of a subspace is performed at each node. In general, the more trees we use, the better the results. However, the improvement decreases as the number of trees increases, i.e. at a certain point the benefit in prediction performance from learning more trees will be lower than the cost in computation time for learning these additional trees. For our dataset, Random Forest provided the best accuracy with 1,000 trees.

Furthermore, in order to tune the hyperparameters for the HOG features, we extracted them using different block sizes and cell sizes. Table 1 shows the results of these experiments. As seen in this table, the block size and the cell size have a significant effect on the accuracy of our models. In other words, the effectiveness of the models strongly depends on the HOG feature parameters. Also from the table, we can see that the Random Forest outperforms the SVM in all cases except when the block size is 2x2 and the cell size is 4x4, where the accuracy of the SVM model is higher than the Random Forest. According to these results, we can conclude that there is no single optimal configuration for HOG features; it depends on the dataset we are using.

In order to reduce false positive rates in our models, we exhaustively searched all 2,100 negative images and extracted 5,800 windows of size 64x128 pixels as false positive objects and then retrained our model with the new augmented set. Using 1,000 new negative images for validation, the original model had a false positive rate of 0.005% while the new model with hard negative mining had a 0% false positive rate. Most of the false positives came from objects that are erect and skinny such as poles and trees. However, our hard-negative-mined model eliminated many of these false positives. An example of this improvement is seen in Figure 3.

As mentioned in section 3, since we found multiple bounding boxes for each object, we used non-maximal suppression to remove the redundant bounding boxes. Figure 4 shows an example of using non-maximal suppression for two images.

For the purpose of background subtraction, we calculated a reference image using a Gaussian Mixture-based background/foreground segmentation algorithm. Then, we subtracted each new frame from this image to compute a foreground mask. The result is a binary segmentation of the image which highlights regions of non-stationary objects. This way we were able to get the segmentation of moving regions in image sequences in real time. Figure 5 shows an example of the background subtraction for one frame of a video.
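The motion-image thresholding above and the Gaussian-mixture background model can be sketched as follows. This is a minimal Python/OpenCV illustration, not the authors' code; the threshold value, video file name, and minimum blob area are placeholder assumptions.

import cv2
import numpy as np

T_THRESHOLD = 30                                   # placeholder intensity threshold
cap = cv2.VideoCapture("video.mp4")                # hypothetical input video
mog2 = cv2.createBackgroundSubtractorMOG2()        # GMM background/foreground model

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Variant 1: pixelwise difference against the previous frame, thresholded
    # to form the motion image M_n from the equation above.
    delta = cv2.absdiff(gray, prev_gray)
    motion = np.where(delta >= T_THRESHOLD, gray, 0).astype(np.uint8)

    # Variant 2: foreground mask from the Gaussian-mixture background model.
    fg_mask = mog2.apply(frame)

    # Candidate regions for the detector are the bounding boxes of moving blobs.
    contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    rois = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 200]
    prev_gray = gray
cap.release()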
Figure 2: Flow Chart of Major Steps for Pedestrian Detection and Tracking.
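The detection stage of the flow chart can be approximated by the hedged Python/OpenCV sketch below (HOG features, a pre-trained linear SVM, a 15% image pyramid, and non-maximal suppression, matching the numbers quoted in section 3). The model file name, window step, and the sign convention of OpenCV's raw SVM output are assumptions, not details taken from the paper.

import cv2
import numpy as np

hog = cv2.HOGDescriptor()                          # default 64x128 window, 8x8 cells, 9 bins
svm = cv2.ml.SVM_load("pedestrian_svm.xml")        # hypothetical pre-trained linear SVM
WIN_W, WIN_H, STEP, SCORE_T = 64, 128, 8, 0.2

def detect(image):
    boxes, scores, scale = [], [], 1.0
    img = image.copy()
    # Image pyramid: shrink by 15% per level until smaller than the window.
    while img.shape[1] >= WIN_W and img.shape[0] >= WIN_H:
        for y in range(0, img.shape[0] - WIN_H + 1, STEP):
            for x in range(0, img.shape[1] - WIN_W + 1, STEP):
                feat = hog.compute(img[y:y + WIN_H, x:x + WIN_W]).reshape(1, -1)
                # Raw output is the signed distance to the separating hyperplane;
                # the sign depends on how the labels were encoded during training.
                score = float(svm.predict(feat, flags=cv2.ml.STAT_MODEL_RAW_OUTPUT)[1][0][0])
                if score > SCORE_T:
                    boxes.append([int(x * scale), int(y * scale),
                                  int(WIN_W * scale), int(WIN_H * scale)])
                    scores.append(score)
        img = cv2.resize(img, None, fx=0.85, fy=0.85)
        scale /= 0.85
    # Non-maximal suppression to drop redundant, overlapping boxes.
    keep = cv2.dnn.NMSBoxes(boxes, scores, SCORE_T, 0.3)
    return [boxes[i] for i in np.array(keep).flatten()]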
Figure 5: Example of applying background subtraction on
one frame of a video.
in section 6.
[2] http://pascal.inrialpes.fr/data/human/
[3] http://mmlab.ie.cuhk.edu.hk/projects/PETA.html
[4] http://cbcl.mit.edu/software-datasets/PedestrianData.html
Reconstructing Roller Coasters
Tyler J. Sellmayer
Stanford University
tsellmay@stanford.edu
Figure 1. Frame number 3070 from a first-person ride video [3], unedited.

Figure 2. Frame number 2784 from a first-person ride video [3], unedited. Here, the rider is inside a dark tunnel.

frameskip values. We then draw conclusions from these results.

2. Problem Statement

Our problem has two independent pieces: estimating the roller coaster track's color, and estimating a three-dimensional model of the roller coaster track's path. We examine these problems separately.

2.1. Estimating Track Color

This problem can be concisely stated as "Given a first-person ride video of a roller coaster, return the RGB value which most closely approximates the paint color of the roller coaster's track."

First-person ride videos have the property that the image of the track always touches the bottom of the frame near the center, as shown in figure 1. This property is only untrue in cases where the camera is not pointing forward along the track (we found no examples of this) or when the track is not fully visible. For example, the track is not visible at the bottom of the frame when the camera's automatic white balance adjustment causes it to be blacked out (or whited out) in response to changing environment light, as seen in figure 2.

Using this mostly-true property, we can conclude that ideally, the track color will be approximately that color which appears most often in the bottom-center area of our video frames. But our images are noisy, and the lighting changes throughout the video, so this ideal scenario doesn't quite work if we use the pixel colors directly from the recorded images. Instead we bucket these pixel colors into a color palette using nearest-neighbor search [11], then find the palette color whose member pixels occur most often in the bottom-center of the frame. This color we call our track color. The full explanation of this algorithm is in the Technical Content section 3.

2.2. Estimating Track Structure

As stated above, we use the locations in world space of our camera as an approximation of the track position. This lets us use the camera pose estimation stage of structure-from-motion [12] as the basis of our algorithm. We mark every frameskipth frame as a keyframe, with the frames between them called inbetweens.

Before processing any frame, we first undistort it using manually-tuned camera parameters for correcting fisheye distortion and a calculated intrinsic camera matrix K. To obtain the approximate focal length of our camera, we compute the average width-at-widest-point of the roller coaster track's image in pixels across a random subset of frames, and use this average width to obtain a ratio between a width in world-space (the real width of the track, which we assume to be 48 inches) and a width in the image plane in pixels. We use this ratio to convert the known focal length of our camera (14mm, according to [5]) to pixels. We assume square pixels and zero skew, so this focal length is all we need to compute our intrinsic camera matrix K.

For each keyframe, we attempt to compute the camera pose relative to that of the previous frame. MATLAB's pose estimation toolkit assumes that the camera poses are exactly 1 unit distance apart, an assumption which is corrected by performing bundle adjustment [15] after each frame's pose is added.

During the computation of the relative camera pose, MATLAB's helperEstimateRelativePose function [12] attempts to estimate the fundamental matrix F [6, p. 284]. This estimation can throw an exception when there are not enough matching correspondence points to complete the eight-point algorithm, or when there is no fundamental matrix found which creates enough epipolar inliers to meet the requirement set by our MetricThreshold parameter. When an exception occurs, we do not want to simply stop calculating. Instead, we make use of the inbetween frames, retrying the relative camera pose computation with each inbetween frame after our second keyframe until we find one that succeeds, or until we run into the next keyframe, whichever comes first. If we run out of inbetweens without successfully computing a relative camera pose, we terminate our SFM computations immediately and return a partial result. In our full results (see section 6)
we report the mean and maximum numbers of unsuccessful fundamental matrix computations per keyframe for each of our experimental runs.

SFM requires feature correspondences for the fundamental matrix calculation [6], which requires features. We choose to use SURF [1] as our feature detection algorithm because it provides scale- and rotation-invariant features. This is necessary because roller coasters often rotate the rider relative to the environment (which rotates the projected images of features in our scene between frames), and because the camera moves forward along the z-axis between frames, which changes the scale of the projected images of features in our scene. In our results we report experiments with controlling the SURF parameters NumOctaves, NumScaleLevels, and MetricThreshold as defined in [7].

To avoid needless reimplementation of past work, we use MATLAB's built-in toolkit [12] for computing correspondences between features, estimating camera poses, tracking views, triangulating 3D world points, and performing bundle adjustment [15]. Together, these produce a final set of camera poses, including camera location and orientation in world-space. We then plot the camera locations and color our plot using the RGB value of the calculated track color.

3. Technical Content

3.1. Splitting Video Into Individual Frames

Our video is downloaded from YouTube [3]. We run the following command to split it into its individual PNG format images at a rate of 30 frames per second of video [13]:

$ avconv -i video.mp4 -r 30 -f image2 \
  output_dir/%05d.png

We manually select the range of frames [f1, ..., fe] from the video which comprises the first-person ride video, excluding the copyright notice at the beginning and the credits at the end.

3.2. Calculating Track Color

3.2.1 Determining The Color Palette

We first decide on a palette size. For our experiments, we use palette size 10, meaning we will calculate 10 centroids in the RGB space.

Our code examines a random subset of t frames [s1, ..., st] ⊂ [f1, ..., fe], and takes a random subset of q pixels in each frame [p1,1, ..., pq,1, p1,2, ..., pq,2, ..., pq,t], where each pixel pi,j is represented as a triplet of values between 0 and 255, indicating the red, green, and blue values comprising the color of that pixel, respectively. This is a standard representation of colors in RGB space.

Figure 3. Ten color centroids calculated from a random subset of pixels in [3]. Notice that these colors are similar to those found in figure 1.

We take this set of several thousand pixels, and run k-means clustering [10] on it. This gives us k centroids in RGB space, and we use the colors those centroids represent as our color palette. Because of the randomness in this algorithm, we do not get the exact same palette every time. One example of a k = 10 color palette is seen in figure 3.

3.2.2 Finding The Track

Once we have established our color palette, we need to determine which of the colors in the palette most closely approximates the color of our track. To accomplish this, we must rely on our knowledge that in first-person ride videos, the roller coaster track usually touches the bottom of the image frame, near the center, and almost never touches the left or right sides of the frame.

We first select a new random subset of t frames [r1, ..., rt] ⊂ [f1, ..., fe]. In each frame, we examine only the bottom 10 rows of pixels. We split this 10-pixel-high strip horizontally into g 10-pixel-high segments. For an image of width W, this gives us g regions [γ1, ..., γg] each of size 10 × W/g. Our ultimate goal with these regions is to find which palette color is least often present in the left- and right-most regions.

Rather than just counting every pixel, we choose to count only those pixels which lie on either side of an edge. This increases the number of pixels we count that represent track (which is made of hard-edged steel parts, in focus, and relatively large in the frame, giving it more sharp edges) compared to the number of pixels we count in noisy background regions (which tend to be out of focus, motion-blurred, or so far away that their edges are not distinguishable at the camera's resolution, giving them few sharp edges). We count the pixels (ex − 1, ey), (ex + 1, ey) which lie on either side of the edge, rather than the pixel (ex, ey) which lies directly on the edge, because we want to capture the colors inside the regions more than we want to capture the colors of the edges themselves. We call the set of points (ei,x − 1, ei,y), (ei,x + 1, ei,y) for all edge pixels ei our set of half-edge pixels.

Furthermore, because we care about hue more than saturation or value when determining which pixels to count, we perform the edge-finding computation on the 'hue' layer of
the column [0, 5, 124, 26, 0]T satisfies this condition, so our
track color is the one corresponding to that column. In this
example, that color happens to be the reddish-orange palette
color which covers the bulk of the track in figure 4.
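A hedged Python sketch of the palette and track-color steps described in sections 3.2.1–3.2.2 is given below. It uses cv2.kmeans in place of the MATLAB kmeans/knnsearch calls, counts all strip pixels rather than only half-edge pixels (a simplification), and the selection rule that favors colors present in the center segments but rare in the outermost ones is an approximation of the condition illustrated above; NUM_COLORS and BOTTOM_STRIP_SEGMENTS mirror the experiment parameters named in the text.

import cv2
import numpy as np

NUM_COLORS = 10               # palette size k (section 3.2.1)
BOTTOM_STRIP_SEGMENTS = 5     # g, number of bottom-strip segments (section 3.2.2)

def color_palette(frames, pixels_per_frame=2000):
    """k-means palette from a random subset of pixels across frames."""
    samples = []
    for f in frames:
        flat = f.reshape(-1, 3).astype(np.float32)
        idx = np.random.choice(len(flat), pixels_per_frame, replace=False)
        samples.append(flat[idx])
    samples = np.vstack(samples)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, _, centroids = cv2.kmeans(samples, NUM_COLORS, None, criteria, 5,
                                 cv2.KMEANS_RANDOM_CENTERS)
    return centroids                                   # NUM_COLORS x 3 palette

def track_color(frames, palette):
    """Histogram of palette colors per bottom-strip segment; pick the color
    common in the center segments but rare in the left/right-most ones."""
    g = BOTTOM_STRIP_SEGMENTS
    hist = np.zeros((g, len(palette)), dtype=np.int64)
    for f in frames:
        strip = f[-10:, :, :].astype(np.float32)       # bottom 10 rows
        seg_w = strip.shape[1] // g
        for s in range(g):
            seg = strip[:, s * seg_w:(s + 1) * seg_w, :].reshape(-1, 3)
            d = np.linalg.norm(seg[:, None, :] - palette[None, :, :], axis=2)
            hist[s] += np.bincount(d.argmin(axis=1), minlength=len(palette))
    edge_counts = hist[0] + hist[-1]
    center_counts = hist[1:-1].sum(axis=0)
    return palette[np.argmax(center_counts - edge_counts)]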
Figure 7. SURF feature correspondences between frames 192 (red)
and 212 (blue).
in section 3.2.2. Defines the experiment parameter
BOTTOM STRIP SEGMENTS.
Table 2. A histogram over 5 segments and 10 colors.
ange from red or brown or any other primary or secondary
color, and we have achieved this level of accuracy in this
paper, but only by manually tuning the NUM COLORS and
BOTTOM STRIP SEGMENTS parameters until we got the
desired result. This is less useful than simply picking the
color manually.
due to too few features. When the threshold is much
lower than 2000 (we tested with MetricThreshold ∈
[800, 850, 900, 1000, 1100]) we will reach a failure state
earlier. With low thresholds we obtain so many erroneous
feature correspondences that they will cause MATLAB’s
estimateFundamentalMatrix function to fail with
an exception because there are never enough epipolar in-
liers for any of the sampled fundamental matrices [9]. Un-
fortunately, we cannot provide a specific suggestion for a
good MetricThreshold parameter, because the effects
of this value are entirely dependent on the quality and struc-
ture of the input images. We can suggest that future work
start by doubling MetricThreshold until the quality of their output degrades, then doing binary search to find a good-enough MetricThreshold between the two best values.

Figure 14. Frame number 2814 from a first-person ride video [3], unedited.
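The suggested tuning procedure can be written out as a short sketch. Both callbacks here are hypothetical stand-ins: run_sfm(threshold) would run the reconstruction with a given MetricThreshold, and quality(result) would score its output (for example, by inverse reprojection error); the starting value and iteration count are arbitrary assumptions.

def tune_metric_threshold(run_sfm, quality, start=2000):
    """Doubling-then-bisection search over MetricThreshold, as suggested above."""
    lo = start
    best_q = quality(run_sfm(lo))
    hi = lo * 2
    # Phase 1: double the threshold until the output quality degrades.
    while True:
        q = quality(run_sfm(hi))
        if q < best_q:
            break
        lo, best_q = hi, q
        hi *= 2
    # Phase 2: binary search between the last good value and the degraded one.
    for _ in range(8):
        mid = (lo + hi) // 2
        if quality(run_sfm(mid)) >= best_q:
            lo = mid
        else:
            hi = mid
    return lo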
Choosing a high MetricThreshold also increases the (totally subjective) smoothness of our point plot. This is because the high quality features are less likely to be incorrectly corresponded with the wrong feature in an adjacent frame. This is especially important because the scene in and around a roller coaster is full of repetitive elements, like the repeating structure of the track, the similar pieces of support steel, and repeating patterns in the nearby rides and buildings. These elements are often incorrectly matched as correspondences, as seen in figure 8. Modifying NumOctaves and NumScaleLevels also helps with this by narrowing the range of feature scales we detect, reducing the occurrence where a nearby feature in one frame is incorrectly corresponded with a far-away feature in another frame.

5.3. Final Word

Overall, we consider these experiments a failure. Our camera pose estimation is not robust enough to create smooth models of the entire track. There are large portions of first-person ride videos which are totally inscrutable to our methods, including frames like the ones seen in figures 2 and 14 which have been nearly destroyed by the camera's auto white-balance feature. We were unable to find a configuration of SURF parameters and frameskip value which reduced the reconstruction error sufficiently to make a smooth-looking track model, so none of our results are worthy of being 3D printed. Also, the processing takes so long (on the order of 1 hour per 150 frames successfully processed, though we did not take explicit notes of our timing), and needs to be manually re-calibrated for each video (because the scale and quality of features in different videos varies widely depending on video quality and camera resolution), that this is not faster or better than simply constructing the model manually in some 3D modeling software.

6. Code And Full Results

MATLAB code and the full experimental results of this paper are available at https://github.com/tsell/reconstructing-roller-coasters.

References

[1] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool. Surf: Speeded up robust features. Computer Vision and Image Understanding, 110(3):346–359, 2008.
[2] P. Beardsley, P. Torr, and A. Zisserman. 3d model acquisition from extended image sequences. Computer Vision, pages 683–695, 1996.
[3] FrontSeatCoasters. Six flags magic mountain goliath pov hd roller coaster on ride front seat gopro steel 2013. Web, 2014. https://www.youtube.com/watch?v=N_uV0Q2UH98.
[4] M. Frucci and G. S. di Baja. From segmentation to binarization of gray-level images. Journal of Pattern Recognition Research, 1:1–13, 2008.
[5] GoPro. Hero3 field of view (fov) information. Web, 2016. https://gopro.com/support/articles/hero3-field-of-view-fov-information.
[6] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[7] MathWorks. detectsurffeatures: Detect surf features and return surfpoints object. Web, 2016. http://www.mathworks.com/help/vision/ref/detectsurffeatures.html.
[8] MathWorks. edge: Find edges in intensity image. Web, 2016. http://www.mathworks.com/help/images/ref/edge.html.
[9] MathWorks. estimatefundamentalmatrix: Estimate fundamental matrix from corresponding points in stereo images. Web, 2016. http://www.mathworks.com/help/vision/ref/estimatefundamentalmatrix.html.
[10] MathWorks. kmeans: K-means clustering. Web, 2016. http://www.mathworks.com/help/stats/kmeans.html.
[11] MathWorks. knnsearch: Find k-nearest neighbors using data. Web, 2016. http://www.mathworks.com/help/stats/knnsearch.html.
[12] MathWorks. Structure from motion
from multiple views. Web, 2016.
http://www.mathworks.com/help/vision/examples/structure-
from-motion-from-multiple-views.html.
[13] J. Nielsen. How to extract images from a video with av-
conv on linux. Web, 2015. http://www.dototot.com/how-to-
extract-images-from-a-video-with-avconv-on-linux/.
[14] T. Sellmayer. Rotate camera points fig. 13. Web, 2016.
https://www.youtube.com/watch?v=N uV0Q2UH98.
[15] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon.
Bundle adjustment: A modern synthesis. Proceedings of the
International Workshop on Vision Algorithms, pages 298–
372, 1999.
Recovery and Reconstruction of Blackboard or
Whiteboard images with occlusions
Vijayaraj Gopinath
vgopinat@stanford.edu
Computer Vision: From 3D reconstruction to recognition
CS231A
Stanford University
Abstract

We have all taken pictures of a blackboard for reference purposes, but the problem is that most of the time those pictures come with occlusions, such as a professor or other students standing in between. Even if you can take multiple images, there will be one occlusion or another obstructing the blackboard information. By the time one student or professor has moved out, another will often have moved in, making it difficult and time-consuming to get the perfect blackboard picture. We also can't wait for the right moment, since the blackboard can be erased multiple times, which makes this problem even more difficult. In this project we make use of the available multiple images and automatically recover the blackboard by removing all the occlusions.

1 Review of previous work

We can find many works related to generic scene reconstruction with occlusion using multiple images, especially the work [1] from Microsoft Research, which tries to remove occlusions from landmarks or monuments using multiple pictures. Generic scene reconstruction carries a lot of assumptions and workarounds, but in this work we are only concerned with a specific scenario: reconstructing blackboard images from occlusions such as a person (professor or students) or any other object.

2 Proposed solutions

In this work, the multiple pictures of the blackboard can be in any orientation; we assume the first picture dictates the required orientation for the result, and all other pictures are oriented towards the first picture. The proposed algorithm also determines whether the given pictures are able to recover the blackboard (or whiteboard) or not, and otherwise reports an error. Since we have a specific scenario here, we tap the properties of this scenario (blackboard only) to segment the occlusions with simpler techniques such as background subtraction and image morphology, which provide better results without resorting to other, more complex segmentation techniques.

3 Summary of the technical solution

To solve this problem, we propose the following solution. 1, Since we are dealing with multiple images, find the homography between the first image and each other image and, using the found homography, warp all other images to the first image. 2, Detect, segment and label all the occlusions in all the images, using image subtraction and other morphology techniques to detect and segment. 3, Once labelled, find out which occlusions come from which original image and decide whether the scene can be recovered or not. 4, Using the labels and identification of occlusions from the original images, recover the scene by copying pixels from non-occluded regions to the occluded regions. 5, Finally, blend the image to complete the reconstruction.

4.1 Finding Homography

It is important to have an accurate homography since we later use image subtraction to identify occlusions. Nowadays we commonly have more whiteboards than blackboards, which makes finding the homography more challenging. In our experiments, we found that detecting rich features proved very difficult around whiteboards because of the homogeneous surface nature of the scene. Most of the found features also lie around the occlusion, since we often see a steep change in intensity levels near the occlusions.

Figure 1 Scene with outlier features.

In figure 1, you can see that a lot of features (more outliers than inliers) are found around the occlusion, since those regions have more intensity changes. Also, we will not always get enough features on the corners, since the color of the whiteboard sometimes closely matches the background of the scene itself. For all these reasons, we need user feedback to select the four corners of the blackboard for us. We use these four points to calculate our initial homography, and we remove all the outliers using RANSAC or other such methods.

Figure 2 Scene after removing all outlier features.

Using only the inlier set found in the last step, we recalculate the homography, which becomes our final homography. We need this two-step approach because we need an accurate homography and we assume that the user's selection of four points will not be accurate enough for homography calculation, but accurate enough to remove outliers. We have used SURF features here. Using the final homography we warp all the images to match our base first image.
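A minimal Python/OpenCV sketch of this two-step homography is shown below, assuming the four user-clicked corners are available as 4x2 float arrays and that SURF is available (it lives in opencv-contrib; ORB would be a drop-in substitute). The 5-pixel inlier tolerance is an illustrative assumption.

import cv2
import numpy as np

def warp_to_base(base_gray, img_gray, base_corners, img_corners):
    """Rough homography from the user-selected corners, refined from inlier matches."""
    H0, _ = cv2.findHomography(img_corners, base_corners)        # initial estimate

    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    k1, d1 = surf.detectAndCompute(base_gray, None)
    k2, d2 = surf.detectAndCompute(img_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(d1, d2)

    pts_base = np.float32([k1[m.queryIdx].pt for m in matches])
    pts_img = np.float32([k2[m.trainIdx].pt for m in matches])

    # Keep only matches consistent with the rough homography (outlier removal).
    proj = cv2.perspectiveTransform(pts_img.reshape(-1, 1, 2), H0).reshape(-1, 2)
    inliers = np.linalg.norm(proj - pts_base, axis=1) < 5.0

    # Final homography from the inlier set, with RANSAC as a further safeguard.
    H, _ = cv2.findHomography(pts_img[inliers], pts_base[inliers], cv2.RANSAC, 3.0)
    h, w = base_gray.shape[:2]
    return cv2.warpPerspective(img_gray, H, (w, h))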
4.2 Detecting, Segmenting and Labeling occlusions

The first attempt we made at detecting the occlusion was face detection. Since most of the time our occlusion will be a person occluding the blackboard, it seemed interesting to take this approach. We can find the bounding box for a face after detecting it, and we can use a face-to-body ratio to define a segment box that covers the entire occlusion.

Figure 3 Face detection

This idea clearly has many issues. We get noise from detecting other small or non-faces in the images; the occlusion need not be a person; and even if the occlusions are assumed to be only people, a fixed bounding box would be wrong since a person can take other shapes. After experimenting with several other methods, we found that morphological techniques suit this scenario much better.

Once we have an accurate homography and good warping, we can use image subtraction between the base image and every other image to remove everything except the occlusion. Even with an accurate homography, the result after image subtraction will have noise around it, which we deal with later. After image subtraction, we can use global image thresholding with Otsu's method [2] to compute the global threshold level, which is normalized to [0, 1]. Using the found threshold level, we can convert the intensity image to a binary image.

Figure 4 After image subtraction

Figure 5 Binary image with noise

To remove noise in the binary image we can use morphological techniques. Using a 'Disk' structuring element with radius 3, we erode the picture. This step helps to remove the noise obtained after image subtraction and thresholding. The binary image found in the last step will have a lot of discontinuities. In order to get the complete segment, we dilate the binary image to get a supersized version of the occlusion. We use a 'Disk' structuring element with radius 25 to get the supersized occlusion. Figure 6 shows the binary image after dilation.

Figure 6 Image dilation, Disk radius 25

Depending upon the intensity level of the occlusion in comparison to the blackboard or whiteboard, the found occlusion can have holes inside it, which can be noticed in figure 6. Here the occlusion originally had an intensity level similar to the background, so after image subtraction and thresholding it has holes in it. We can find holes by finding the connected components in the binary image and locating the missing pixels inside them, then filling them. Figure 7 shows the binary image after filling the holes. Once we have the supersized occlusion, we erode the binary image with a structuring element of radius 22 to get back the original size of the occlusion. Figure 8 shows the final binary image with the occlusion at its original size.

Figure 7 After filling the holes

Figure 8 Erode to get original size

After finding the binary image of the occlusions, we label all the independent connected components to get an accurate boundary for every occlusion in the scene.
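The subtraction-and-morphology pipeline of section 4.2 can be sketched in Python/OpenCV as follows. The disk radii (3, 25, 22) come from the text; the hole-filling via flood fill assumes the top-left pixel belongs to the background, which is an assumption of this sketch rather than something stated in the paper.

import cv2
import numpy as np

def occlusion_labels(base_gray, warped_gray):
    """Image subtraction, Otsu threshold, erode/dilate/fill/erode, then labeling."""
    diff = cv2.absdiff(base_gray, warped_gray)
    _, binary = cv2.threshold(diff, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    disk = lambda r: cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2 * r + 1, 2 * r + 1))
    binary = cv2.erode(binary, disk(3))        # remove small subtraction noise
    binary = cv2.dilate(binary, disk(25))      # grow to a supersized occlusion

    # Fill interior holes: flood-fill the background from (0, 0), invert, OR back in.
    flood = binary.copy()
    h, w = binary.shape
    mask = np.zeros((h + 2, w + 2), np.uint8)
    cv2.floodFill(flood, mask, (0, 0), 255)
    binary = binary | cv2.bitwise_not(flood)

    binary = cv2.erode(binary, disk(22))       # shrink back towards original size
    n, labels = cv2.connectedComponents(binary)
    return labels, n - 1                       # label image and occlusion count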
Figure 9 Color map of the found labels.

4.3 Mapping occlusions to the original image region

After segmenting and labeling, it is important to map each found occlusion to its original image. We assume that the occluding object has a distinctly different average intensity value in comparison to the blackboard itself. We can build a model based on the intensity levels around the occluded object and the occluded region in the original images, which can be used to map the objects.

4.4 Recovering and Blending

Once we have identified, labeled and mapped the occlusions, we can recover the scene by copying pixels from the non-occluded portions of the images to the occluded portions. Since the copied pixels come from different images, the boundary of the recovered region in the final image will be visible, so we need to blend the image to complete our reconstruction. Figure 10 shows the final recovered image with all occlusions removed.

Figure 10 Recovered image

5 Conclusion

Dealing with similar backgrounds in computer vision is very challenging; human eyes have evolved to handle this seamlessly. We address this in the future works section. Blackboard reconstruction by removing occlusions is a very interesting project and has an important application in the education domain. It could become an app which students install on a smartphone to take multiple pictures, with the app automatically recovering the entire blackboard scene in a self-contained manner.

6 Future works

1, As previously discussed, sometimes we won't get enough features to find the homography and so need the user to select corners for us; we can work on sophisticated rectangle detection to avoid user input. 2, Fine-tune the segmentation using other techniques such as the fast marching method. 3, For mapping occlusions to the original image, we currently build a model based on intensity values; this can be improved with more sophisticated, feature-based models. 4, After the scene has been reconstructed, we can try recognizing the text in it and creating a document.

REFERENCES

1. http://research.microsoft.com/pubs/69386/peoplemover.pdf
2. http://ijarcet.org/wpcontent/uploads/IJARCETVOL2ISSUE2387389.pdf
3. http://www.eiti.uottawa.ca/school/research/viva/papers/homographie.pdf
4. http://www.cescg.org/CESCG2016/papers/JariabkaGeneration_of_lecture_notes_as_images_from_recorded_whiteboard_and_blackboard_based_presentations.pdf
5. http://visual.cs.ucl.ac.uk/pubs/learningOcclusion/CVPR_2011_learningoccl.pdf
Real-Time Semi-Global Matching Using CUDA Implementation
Robert Mahieu Michael Lowney
Stanford University Stanford University
Department of Electrical Engineering Department of Electrical Engineering
rmahieu@stanford.edu mlowney@stanford.edu
Abstract—With the recent rise of technology such as augmented reality and autonomous vehicles, there comes a necessity for speedy and accurate depth estimation to allow these products to effectively interact with their environments. Previous work using local methods to produce depth maps has generally been fast but inaccurate, and work using global methods has been accurate but too slow. A technique referred to as semi-global matching combines local and global methodologies to balance speed and accuracy, producing particularly useful results. This project focuses on the implementation of slight variations on the original algorithm set forth by Hirschmuller [2] to increase accuracy, and on using CUDA to accelerate the runtime. Results show sufficiently low error, though the runtime was found to be imperfectly optimized.

1. Introduction

Acquiring depth information from sets of images is incredibly important in many emerging fields such as augmented reality, robotics, and autonomous vehicles. However, these applications rely on the produced depth information being both accurate and generated in a short amount of time—ideally close to real-time—to ensure safety of the system and users, as well as for reliable pose tracking.

Many algorithms for computing this information have been previously explored, looking either at localized information or at global information throughout the entire image. Local techniques such as Winner-Takes-All (WTA) or scanline optimization (SO) [7] compute results for pixels independently and require minimal computation, but due to their lack of consideration for global trends they typically produce inaccurate conclusions. The dynamic programming [5] approach is also computationally efficient, but because the algorithm only looks at a single row per iteration, it also lacks consideration for global trends and commonly causes streaking patterns to show up in the output. On the other hand, while global techniques such as Graph Cuts [1] and Belief Propagation [9] produce more accurate results and better avoid the errors encountered in the local methods, these techniques are significantly more memory intensive and end up being much slower.

To obtain both reasonable accuracy as well as real-time performance, we instead move to what can be referred to as a semi-global matching technique [2], which makes some use of both local and global methods. Additionally, offloading data calculation onto the GPU, which is ideal for handling SIMD (single instruction multiple data) computation, allows us to exploit the parallelizable nature of our image-based calculations and significantly reduce computation time for estimating the optimal disparity map.

2. Problem Statement

In order to tackle this problem, we make the assumption that the input images are a rectified stereo pair. This is inherently the case when two cameras are orthogonal to the baseline and point in the same direction. Many popular stereo vision datasets, such as the Middlebury dataset, which is used in this paper [3][6][8], provide stereo pairs that have been rectified. The benefit of having rectified images is that the epipolar lines are horizontal and corresponding lines are at the same height in each image. This simplifies the problem because corresponding points will lie on the same epipolar line, and so we only have to search in horizontal directions.

Local methods are more prone to noise in their disparity maps due to the fact that there may be several local minima in their cost function. Because of this, the semi-global approach uses a model which penalizes changes in disparity values in local neighborhoods. This causes the resulting disparity map to be smoother by attenuating high frequency noise, which provides a clearer estimate of the true relative depth of the objects in the scene.

3. Technical Content

The implementation of the semi-global matching method comes down to minimizing an energy function describing the quality of a potential disparity image. This is represented by the expression below:
reads than global memory.
$$ E(D) = \sum_{p} \Big( C(p, D_p) + \sum_{q \in N_p} P_1 \, \mathbf{1}\{|D_p - D_q| = 1\} + \sum_{q \in N_p} P_2 \, \mathbf{1}\{|D_p - D_q| > 1\} \Big) $$

To carry out the next step, denoted "cost aggregation", we iterate and compute the energy function locally over 8 directions (two horizontal, two vertical, and two for each diagonal). An example of the recursive expression is shown below for the horizontal direction, going from left to right across the image:
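The recursive expression referenced here does not appear to have survived extraction. In Hirschmüller's standard formulation [2], which the text appears to follow, the aggregated path cost along a direction $\mathbf{r}$ is

$$ L_r(\mathbf{p}, d) = C(\mathbf{p}, d) + \min\Big( L_r(\mathbf{p}-\mathbf{r}, d),\; L_r(\mathbf{p}-\mathbf{r}, d-1) + P_1,\; L_r(\mathbf{p}-\mathbf{r}, d+1) + P_1,\; \min_i L_r(\mathbf{p}-\mathbf{r}, i) + P_2 \Big) - \min_k L_r(\mathbf{p}-\mathbf{r}, k) $$

with the final aggregated cost $S(\mathbf{p}, d) = \sum_r L_r(\mathbf{p}, d)$; subtracting the minimum path cost of the previous pixel only bounds the values and does not change the minimizing disparity.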
is to determine areas of occlusion in the image, meaning areas that are visible in the base image but blocked in the matching image. This can be done by running a slightly modified version of the algorithm defined above a second time to generate a disparity map for the match image. The only modification is in the initial cost function, which becomes:

$$ C(p, d) = |I_m(p_x, p_y) - I_b(p_x + d, p_y)| $$

Once we have disparity maps for both the base and match image we can compare the results to identify occluded regions. For each pixel in the base disparity map we sample the disparity value. We then compare this to the value in the match disparity map at the same pixel location shifted by the base disparity value we just sampled. If these two values are the same (within some small tolerance), we judge them to be true correspondences; otherwise we mark them as occluded pixels and set them to zero in the base disparity map. This technique outputs a refined base disparity map.

Finally, to eliminate residual noise in the output we filter the disparity map using a median filter. Good results were observed while using a small kernel of 3x3. This allows us to keep the major edges and details in the map while removing the unwanted high frequency components.

A consideration worth noting is the memory requirements of this algorithm. The amount of memory used scales like $O(m\,n\,d_{max})$, where $m$ is the number of rows in the image, $n$ is the number of columns in the image, and $d_{max}$ is the maximum number of disparity values. Storing the initial cost matrix and the matrices for the 8 search directions may exceed the total amount of memory on the GPU. For this paper the algorithm was run on a laptop with an NVIDIA GeForce 940M GPU with 2GB of memory. In order to stay below this 2GB threshold the input images must be downsampled. Unless otherwise specified, the images are downsampled so that the number of columns is 450, with the number of rows scaled accordingly.

4. Results

The quality of our algorithm was tested using the Middlebury dataset. Figure 1 shows the results of our algorithm compared to the ground truth of the depth map for various image pairs. For these trials we used the values suggested in [4] for $P_1$, $P_2$. All $w_r$ were set to the same value to ensure equal weighting. Table 1 shows the values of $P_1$ and $P_2$ used for our results.

Table 1: Penalty Values
        ↔        ↕        ↖↘       ↙↗
P1    22.02    17.75    14.93    10.67
P2    82.79    80.87    23.30    28.80
Qualitatively, the results appear to be quite a close match to the ground truth. The regions towards the left border of the image are consistently unlabeled. This is due to the fact that they represent pixels that are only seen in the base image. Only pixels that are in the field of view of both cameras will result in accurate disparity values.

Table 2: MSE and runtime for images in the Middlebury dataset
IMAGE PAIR    MSE      Runtime
aloe          0.0296   3715 ms
books         0.0687   3394 ms
dolls         0.0712   3385 ms
laundry       0.0935   3517 ms
pots          0.1053   3607 ms
baby          0.0385   3759 ms
bowling       0.1083   3692 ms
art           0.0949   3385 ms
cones         0.0409   3152 ms
wood          0.0671   3403 ms

Table 2 shows the mean-squared error (MSE) between our experimental depth maps and their respective ground truths. Note that the error values are generally quite low, never reaching any higher than around 10% for any of the images we tested. Although differences in the scaling of depth to grayscale may be present between the experimental results and the ground truth, this appears to be minimal and therefore the MSE should still provide a good metric for analyzing the success of the algorithm.

Unfortunately, due to time constraints on the project, we were unable to spend much time optimizing the CUDA implementation, so tests on runtimes have returned sub-optimal results. In the current somewhat naïve implementation, we are still able to get the runtime down to around one frame/sec for images with sizes below about 250x217 (54,250 pixels). The relationship between runtime and input image size is illustrated in Figure 2. As shown in the figure, the Semi-Global Matching algorithm has a runtime that scales linearly with the number of pixels in the input images. It is also worth noting that the graphics cards used in modern stereo research are much more powerful than the ones used in this paper (consumer grade laptop GPUs). This difference in hardware is a main contributor to the longer runtimes found in this project.

It is important to note that one of the most significant factors in the runtime, however, is actually the resizing of images that occurs at the start of the program after being read in by the CPU. This is necessary to ensure that we do not overload the GPU memory, however the time cost is very high. When images do not need to be resized on-the-fly, total speeds are greatly increased (about a 2x speedup). In future implementations, intelligent use of shared memory and memory access within warps should be able to dramatically increase performance. Some of these techniques are outlined by Michael et al. [4].

For posterity, to demonstrate the robustness of our algorithm we also selected an arbitrary stereo image pair
Link to code:
https://github.com/rmahieu/SemiGlobalMatching
References
Abstract—This project attempts to reproduce the genetic algorithm in a paper entitled ”A Genetic Algorithm-Based
Solver for Very Large Puzzles” by D. Sholomon, O. David, and N. Netanyahu. [3] There are two main challenges in
solving jigsaw puzzles. The first is finding the right fitness function to judge the compatibility of two pieces. This has
thoroughly been studied and as a result, there are many fitness functions available. This paper explores the second part
that is crucial to solving jigsaw puzzles: finding an efficient and accurate way to place the pieces. The genetic algorithm
attempts to do just that. The crucial part of the algorithm is in generating a new ordering of pieces called ’child’ from
two possible orderings of pieces, called ’parents’. Each generation learns from good traits in the parents. After going
through a hundred generations, the ordering will reflect the original image to a high accuracy. This paper also makes
use of CNN to start with reasonable orderings of ’parents’. This cuts down on the number of generations required to
reach the correct ordering of the pieces.
first time that the genetic algorithm has been used to solve the jigsaw puzzle problem, it has only been used to solve puzzles of a limited size. This paper attempts to solve puzzles with larger pieces. In addition to the genetic algorithm, this paper also attempts to use a CNN to arrive at the correct reconstruction of the image in fewer iterations.

3 TECHNICAL DETAILS

3.1 Genetic Algorithm

The genetic algorithm as implemented for solving the jigsaw puzzle problem starts out with a thousand different ways to order the pieces. Each way of ordering the pieces is called a chromosome. The entire set of a thousand chromosomes is called a population. At each stage of the process, called a generation, we have a population of a thousand chromosomes. Now, the goal is that with each passing generation, i.e. with the next thousand chromosomes or population, the orderings of the pieces will begin to look more and more like the original or correct image. During each generation, the best chromosome is determined by the estimation function.

Two chromosomes from the current population are selected, and a function called crossover generates a child chromosome that learns from the parents and has a better reordering of the pieces, and hence a better fitness score. It is via this mechanism that each generation gets a better fitness score than the previous generation. The selection process of which parents to choose to give birth to a new child chromosome discriminates towards parents with a better fitness score. The selection process is called roulette selection. The likelihood of being selected is directly proportional to how good the fitness score is. This way, the algorithm makes sure that the selected parent chromosomes have good traits (as evidenced by their fitness scores) to be passed on to the children.

Fitness Function

The estimation function utilizes the fact that adjacent pieces in the original image will most likely share similar colors along their edges. Hence, computing the sum of the squared color differences along pixels that are adjacent to each other (between two different pieces) gives us an indication of whether the two pieces belong adjacently in the direction in which they share the pixels. Hence, the lower this sum is, the more likely they are to be adjacent to each other. From the image below, for example, we can expect the fitness function to give us a high score for pieces 5 and 6, as the color difference along their edges seems to be high, while pieces 8 and 9 will have a very low fitness score. We can further assume that pieces 5 and 8 will have a high fitness score while 6 and 9 will have a lower one in comparison.
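A minimal Python/NumPy sketch of this edge-based dissimilarity and of a whole-chromosome fitness score is given below. The function and variable names, and the assumption that a chromosome is a row-major list of piece indices, are illustrative choices rather than details taken from the paper.

import numpy as np

def dissimilarity(piece_a, piece_b, relation="left-right"):
    """Sum of squared color differences along the shared edge of two KxKx3 pieces.
    Lower values mean the pieces are more likely to be adjacent in that relation."""
    if relation == "left-right":          # piece_a to the left of piece_b
        edge_a, edge_b = piece_a[:, -1, :], piece_b[:, 0, :]
    else:                                 # "top-bottom": piece_a above piece_b
        edge_a, edge_b = piece_a[-1, :, :], piece_b[0, :, :]
    return float(np.sum((edge_a.astype(float) - edge_b.astype(float)) ** 2))

def chromosome_fitness(chromosome, pieces, rows, cols):
    """Fitness of a full ordering: sum of edge dissimilarities over all
    horizontally and vertically adjacent placements."""
    grid = np.array(chromosome).reshape(rows, cols)
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                total += dissimilarity(pieces[grid[r, c]], pieces[grid[r, c + 1]], "left-right")
            if r + 1 < rows:
                total += dissimilarity(pieces[grid[r, c]], pieces[grid[r + 1, c]], "top-bottom")
    return total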
score of two pieces in a left-right adjacency relationship, and a function which computes the fitness score of a given chromosome (i.e. computes the score for all edges and directions). K is the number of pixels in each piece in the vertical direction. This way, it covers all the available edges in a chromosome. Note that D is the fitness score for the compatibility of piece xj to the direction (left, right, down, or up) of xi. The selection process of the algorithm (roulette selection) makes sure that a lower fitness score is treated as more likely to be chosen.

Crossover

Crossover can be considered the heart of the algorithm. Crossover receives two parent chromosomes and creates a child chromosome. It allows "good traits" to be transmitted from the parents to the child. The goal is to have a child with a better fitness score than both parents. The fitness function does a good job of discriminating between adjacent pieces, but does not give any indication of whether the pieces are placed at the correct absolute position in the image. The implementation of crossover must therefore allow for independence in the placement of pieces. (It should be a dynamic process: just because a piece was at some point assigned to, say, position (2,3) of the image, it must not remain there; it should be able to transition into a different place based on how the pieces build up around it.)

The implementation of crossover suggested here starts out with a single piece and then gradually joins other pieces at available boundaries. The image is always contiguous since new pieces are only added adjacent to existing ones. Keeping track of the pieces used and the dimensions of the child being formed is important so that the dimensions of the child are similar to those of the parents. The process of growing the kernel goes on until all the pieces have been used.

The final absolute location of a given piece is only determined after all the pieces have been used. This is because, as noted earlier, the kernel growing process must allow for independence or flexibility in the placement as the algorithm plays out. To begin, crossover selects a random piece from either parent and places it in the kernel. After that, it keeps track of all the available boundaries where a new piece can be added to the kernel. An available boundary can be thought of as a piece and the direction in which a new piece can be placed adjacent to it. There are three main phases involved in crossover.

Phase One

It goes through the boundary pieces in the kernel. Say that piece xi in the direction d is selected. Phase one checks whether both parents have the same piece xj in the direction d of xi. If so, xj is added to the kernel; if xj has already been added, it is of course skipped. Only unused pieces (pieces not in the kernel) are under consideration. This phase keeps going until there is no boundary on which both parents agree.

Phase Two

Assume (xi, R) is available on the kernel. Check whether one of the parents contains a piece xj in spatial relation R of xi which is also a best-buddy of xi in that relation. Two pieces xi and xj are considered best-buddies if D(xi, xj, R) is the lowest fitness score they can achieve, i.e. there is no better piece xk that gives a lower fitness score D(xk, xj, R), and no available xk that gives D(xi, xk, R) lower than D(xi, xj, R). The piece xj under consideration must be adjacent to xi in one of the parents. If such a piece is found, go back to phase one; if not, proceed to phase three.

Phase Three

Pick a random (xi, R) from the kernel and assign to it an xj from the available pieces such that D(xi, xj, R)
Figure 2. 96 PIECES:GENERATION 1
Left: Actual Image. Right: Reconstructed Image.

4.3 Genetic Algorithm + Convolutional Neural Network (CNN)

As an augmentation to the original algorithm, we fed the reconstruction output of the CNN as the starting population of the Genetic Algorithm. We were mainly interested in two effects:
1. Did the run time of the algorithm improve?
2. How did the accuracy of the reconstruction change?

5 CONCLUSION

The jigsaw puzzle problem is an interesting problem with applications in many domains. Looking forward, one extension we plan to explore is to solve the jigsaw problem using only a neural network. We envision embedding convolutional layers in a Long Short Term Memory or Recurrent Neural Network which would directly predict the right configurations instead of using our current trick of having 100 representative configurations. We would also like to investigate more avenues for improving the run time of our current model.

REFERENCES

[1] H. Freeman and L. Garder. Apictorial jigsaw puzzles: The computer solution of a problem in pattern recognition. IEEE Transactions on Electronic Computers, EC-13(2):118–127, 1964.
to smoothing long feature trajectories, and achieved comparable results to 3D-reconstruction-based methods. Goldstein and Fattal [10] proposed an epipolar transfer method to avoid direct 3D reconstruction. Obtaining long feature tracks is often fragile in consumer videos due to occlusion, rapid camera motion and motion blur. Lee et al. [11] incorporated feature pruning to select more robust feature trajectories to resolve the occlusion issue.

Motion estimation methods calculate transitions between consecutive frames with view overlap. To reduce the alignment error due to parallax, Shum and Szeliski [12] imposed local alignment, and Gao et al. [7] introduced a dual-homography model. Liu et al. [13] proposed a mesh-based, spatially-variant homography model to represent the motion between video frames, but the smoothing strategy did not follow cinematographic rules.

Our implementation, based on [1], applies L1-norm optimization to generate a camera path that consists of only constant, linear and parabolic segments, which follows cinematographic principles in producing professional videos.

2.2. Our Contribution

In this work, we re-implement the L1-norm optimization algorithm [1] to automatically stabilize captured videos, with a smoothed feature path containing only constant, linear and parabolic segments. Additionally, in order to enable the video to retarget on human faces, we use the facial landmark detection algorithm from the OpenFace toolkit [3] to set facial saliency constraints for the path smoothing; the strength of the constraint can be tuned from 0 (no facial retargeting) to 1 (video fixed on facial features), and in this way we are able to combine both video path smoothing and facial retargeting according to specific user needs.

Beyond that, in order to make our work more fun, we also manage to attach decorations such as a hat, glasses, and a tie above, on, or below the detected human faces; their transformations are based on the movement of the human face in the video.

3. Proposed Method

3.1. L1-Norm Optimized Video Stabilization

In this section, we describe the method of video stabilization used in this work.

3.1.1 Norms of smoothing

When applying a path smoothing algorithm, we should always be careful in choosing the regularization method, since different regularization methods work differently for different error distributions [2]. For error distributions with sharply defined edges or extremes (typified by the uniform distribution) one should use Tchebycheff (L∞) smoothing. For error distributions at the other end of the spectrum, with long tails, one should use L1 smoothing. In between these extremes, for short-tail spectra such as the normal distribution, least squares or L2 smoothing appears to be best.

3.1.2 L1-Norm Optimization

From the perspective of a single feature point, the video motion can be viewed as a path of its coordinates (x, y) with respect to the frame number. Since it is difficult to avoid jitter with hand-held devices, we will observe that the path is wiggling. Video stabilization is to obtain new coordinates at each frame and thus a new path with enhanced smoothness. From the perspective of the frames, the task is to smooth the transformations between frames so that the feature point movement is minimal. The frame transformation is generalized as an affine transform, including translational and rotational motion, and scaling caused by object/camera distance change.

We estimate the camera path by first matching features between consecutive frames $C_t$ and $C_{t+1}$, and then calculating the affine transformation $F_{t+1}$ based on the matching. That is, the process can be formatted as $C_{t+1} = F_{t+1} C_t$. We then estimate the affine transformation $F_{t+1}$ using these two sets of feature coordinates, $C_t$ and $C_{t+1}$. In this work, we extract features of each frame (OpenCV function cv::goodFeaturesToTrack), and find the matching in the next frame using the iterative Lucas-Kanade method with pyramids (cv::calcOpticalFlowPyrLK).

We denote the smoothed features as $P_t$; then we have a correlation between the original features in frame t and the smoothed ones, $P_t = B_t C_t$, where $B_t$ is the stabilization/retargeting matrix, transforming the original features to the smoothed ones. Since we only want the smoothed path to contain constant, linear, and parabolic segments, we minimize the first, second, and third derivatives of the smoothed path with weights $c = (c_1, c_2, c_3)^T$:

$$ O(P) = c_1 |D(P)|_1 + c_2 |D^2(P)|_1 + c_3 |D^3(P)|_1, \qquad (1) $$

where

$$ |D(P)|_1 = \sum_t |P_{t+1} - P_t|_1 = \sum_t |R_t|_1, \quad |D^2(P)|_1 = \sum_t |R_{t+1} - R_t|_1, \quad |D^3(P)|_1 = \sum_t |R_{t+2} - 2R_{t+1} + R_t|_1. \qquad (2) $$

Here the residual is $R_t = B_{t+1} F_{t+1} - B_t$.

For each affine transform:

$$ B_t = \begin{pmatrix} b_{11} & b_{12} & t_x \\ b_{21} & b_{22} & t_y \end{pmatrix} \qquad (3) $$
has 6 degrees of freedom, and we vectorize it as p_t = (b_{11}, b_{12}, b_{21}, b_{22}, t_x, t_y)^T, which is the parametrization of B_t. Correspondingly,

    |R_t(p)|_1 = |p_{t+1}^T M(F_{t+1}) - p_t|_1.        (4)

We use the Linear Programming (LP) technique to solve this L1-norm optimization problem. To minimize |R_t(p)|_1 in an LP, we introduce slack variables e_1 ≥ 0 such that -e_1 ≤ R_t(p) ≤ e_1; similarly, there are e_2 and e_3 for |R_{t+1}(p) - R_t(p)|_1 and |R_{t+2}(p) - 2R_{t+1}(p) + R_t(p)|_1, respectively. For e = (e_1, e_2, e_3)^T, the objective of the problem is to minimize c^T e.
In addition, we want to limit how much B_t (or p_t) can deviate from the original path, i.e. the actual shift should stay within the cropping window. We therefore add constraints on the LP parameters of the form lb ≤ U p_t ≤ ub, where U holds the linear combination coefficients of p_t. The complete L1-minimization LP for the smoothed video path, with constraints, is summarized below:
Algorithm 1 Summarized LP for the smoothed video path
Input: frame pair transforms F_t, t = 1, 2, ..., n
Output: update transforms B_t (each B_t can be converted to its parametrization p_t)
Minimize: c^T e
w.r.t. p = (p_1, p_2, ..., p_n),
where e = (e_1, e_2, e_3)^T, e_i = (e_{i1}, e_{i2}, ..., e_{in}), c = (c_1, c_2, c_3)^T
subject to:
1. -e_{1t} ≤ R_t(p) ≤ e_{1t}
2. -e_{2t} ≤ R_{t+1}(p) - R_t(p) ≤ e_{2t}
3. -e_{3t} ≤ R_{t+2}(p) - 2R_{t+1}(p) + R_t(p) ≤ e_{3t}
4. e_{it} ≥ 0
constraints: lb ≤ U p_t ≤ ub

We use the lpsolve library for modeling and solving our LP system.
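To make the slack-variable encoding concrete, the fragment below minimizes a single absolute value |x - r| with the lp_solve C API by introducing one slack variable e and the pair of constraints -e ≤ x - r ≤ e; it is a minimal sketch under that assumption (one term, two variables), not our full per-frame formulation.

#include <cstdio>
#include "lp_lib.h"

/* Sketch: minimize e subject to -e <= x - r <= e, i.e. minimize |x - r|.
 * Columns: 1 -> x, 2 -> e. The constant r plays the role of one residual term. */
int main() {
    const double r = 3.0;
    lprec* lp = make_lp(0, 2);            /* start with 0 rows and 2 variables */
    if (lp == NULL) return 1;

    int    cols[2];
    double row[2];

    /* Objective: minimize 0*x + 1*e. */
    cols[0] = 1; row[0] = 0.0;
    cols[1] = 2; row[1] = 1.0;
    set_obj_fnex(lp, 2, row, cols);
    set_minim(lp);

    /* x - e <= r   (from  x - r <= e). */
    cols[0] = 1; row[0] = 1.0;
    cols[1] = 2; row[1] = -1.0;
    add_constraintex(lp, 2, row, cols, LE, r);

    /* x + e >= r   (from  -e <= x - r). */
    cols[0] = 1; row[0] = 1.0;
    cols[1] = 2; row[1] = 1.0;
    add_constraintex(lp, 2, row, cols, GE, r);

    set_unbounded(lp, 1);                 /* let x take any sign; e >= 0 by default */

    if (solve(lp) == OPTIMAL) {
        double vars[2];
        get_variables(lp, vars);
        std::printf("x = %f, e = %f\n", vars[0], vars[1]);
    }
    delete_lp(lp);
    return 0;
}

In the full problem there is one such pair of constraints per residual term and per frame, plus the lb ≤ U p_t ≤ ub bounds, exactly as listed in Algorithm 1.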
3.2. Facial Features Detection and Retargeting

In many videos a particular subject, usually a person, is featured. In this case it is not only important to remove fast, jittering camera motions, but also unintended slow panning or drifting that momentarily moves the subject off-center and distracts the viewer. This can be posed as a constraint on the path optimization: the salient features of the subject should stay close to the center region throughout the video.
The first step towards salient-point-preserving video stabilization is salient feature detection and tracking. In particular, it is desirable for the algorithm to recognize and detect these salient features automatically, without user input. There are many face detectors available for this task. We use Constrained Local Neural Fields (CLNF) for facial landmark detection, as available in OpenFace; details of the algorithm can be found in [3]. The CLNF algorithm works robustly under varied illumination and is stabilized for video. It outputs a fixed number of facial landmarks, including the face silhouette, the lips, the nose tip, and the eyes, as shown in Fig. 2c. These multiple landmarks allow a more stable and accurate estimate of the facial position. In contrast, other face detectors, for example the OpenCV built-in ones, were observed in our experiments to produce inaccurate bounding boxes that are not stable over video frames. The detailed facial landmarks from CLNF also enable other post-processing on the video, for example the face decoration described in Section 3.4.
After detecting the facial landmarks in each frame t, we estimate the center of the face C_{f,t} by averaging all the landmarks. Let C_0 be the desired position of the face center, for example the center of the frame. Let P_t and S_t be the original and smoothed camera trajectories; the saliency constraint can then be posed as an additional term in the loss function

    L_t = (1 - w_s)(S_t - \bar{P}_t)^2 + w_s (S_t - P_t + C_{f,t} - C_0)^2,        (5)

where \bar{P}_t is the average over a window of frames, and w_s is a parameter that adjusts how much weight the saliency constraint has in the optimization. Minimizing L_t then produces the desired smoothed trajectory S_t.

3.3. Metrics & Characterization

3.3.1 Evaluation of Smoothed Path

For the stabilization problem we are concerned with, it would be inappropriate to model the undesired shaking as a short-tailed normal distribution, so using the L1 norm between each frame pair during minimization is more suitable. In addition, L1 optimization has the property that the resulting solution is sparse, i.e. the computed path has derivatives that are exactly zero for most segments. L2 minimization (in a least-squares sense), on the other hand, tends to produce small but non-zero gradients. Qualitatively, the L2-optimized camera path always retains some small non-zero motion (most likely in the direction of the camera shake), while the L1 optimization we use (over |D(P)|_1, |D^2(P)|_1, and |D^3(P)|_1) creates a path composed only of segments resembling a static camera, (uniform) linear motion, and constant acceleration [1].
Therefore, we compare the L1 norm |D(P)|_1 between the original video feature path and the smoothed one, and use this comparison as the metric in the experiments described below. Specifically, we calculate the average absolute shift between adjacent points on the video feature path, with respect to both the x and y directions, together with the average absolute rotation angle increment. The same calculations are done for the smoothed path.
3.3.2 Evaluation of Facial Retargeting

For the facial retargeting part, in addition to comparing the L1 norm |D(P)|_1 of the original video feature path with that of the new one, which captures the smoothing, we are also interested in how well the facial features are re-targeted. We therefore calculate the average position of the face features with respect to the center of the frame, and simultaneously the average absolute deviation of that position.
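Both metrics are simple averages over the feature path; the sketch below makes the definitions explicit, assuming a per-frame record (the PathSample structure and its field names are ours for illustration) holding the path position, rotation angle, and detected face center.

#include <cmath>
#include <cstdio>
#include <vector>

// Assumed per-frame sample of the feature path and the detected face center.
struct PathSample {
    double x, y, angle;    // feature path position and rotation angle
    double faceX, faceY;   // estimated face center in image coordinates
};

// Average absolute shift between adjacent path points (Section 3.3.1) and
// average absolute deviation of the face center from the frame center (Section 3.3.2).
void evaluatePath(const std::vector<PathSample>& path, double cx, double cy) {
    double dx = 0, dy = 0, da = 0, devX = 0, devY = 0;
    for (size_t t = 0; t + 1 < path.size(); ++t) {
        dx += std::fabs(path[t + 1].x - path[t].x);
        dy += std::fabs(path[t + 1].y - path[t].y);
        da += std::fabs(path[t + 1].angle - path[t].angle);
    }
    for (const PathSample& s : path) {
        devX += std::fabs(s.faceX - cx);
        devY += std::fabs(s.faceY - cy);
    }
    const double nPairs = path.size() > 1 ? double(path.size() - 1) : 1.0;
    const double n = path.empty() ? 1.0 : double(path.size());
    std::printf("<|dx|> = %.1f  <|dy|> = %.1f  <|da|> = %.3f  <|x-xc|> = %.1f  <|y-yc|> = %.1f\n",
                dx / nPairs, dy / nPairs, da / nPairs, devX / n, devY / n);
}

Running such a routine on both the original and the smoothed paths gives the kind of numbers reported in Tables 2 and 3.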
3.4. Face Decoration

With per-frame face features detected, we can add fun face decorations to our videos, such as glasses, a hat, or a mustache. By incorporating the feature locations, we can translate, scale, and rotate the decorations to place them appropriately onto the faces. Since our videos are stabilized and focused on faces, the transitions of the decorations are smoother. Here is an example of how we use the feature points when adding decorations.
Adding glasses: we extract the left-eye, right-eye, left-brow, and right-brow feature points to calculate a horizontal eye axis, and use it to estimate the orientation of the glasses. The scale is approximated from the eye distance, and the translation depends on the locations of the eye points. Since the face silhouette feature points are usually less stable, we avoid using them when adding decorations. Screenshots of adding a hat and glasses are shown in Figure 4.
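For the glasses example above, the placement reduces to an angle from the eye axis, a scale from the eye distance, and a translation to the eye midpoint. The sketch below assumes OpenCV, a BGR frame, and a four-channel (BGRA) glasses image; the helper name, the scale heuristic, and the blending loop are illustrative choices, not the exact implementation.

#include <cmath>
#include <opencv2/opencv.hpp>

// Sketch: overlay a glasses image (with alpha channel) onto a CV_8UC3 frame
// using the eye landmarks. Orientation comes from the eye axis, scale from the
// eye distance, translation from the midpoint between the eyes.
void addGlasses(cv::Mat& frame, const cv::Mat& glassesBGRA,
                const cv::Point2f& leftEye, const cv::Point2f& rightEye) {
    const cv::Point2f d = rightEye - leftEye;
    const double angle = std::atan2(d.y, d.x) * 180.0 / CV_PI;          // eye-axis orientation (deg)
    const double eyeDist = std::sqrt(d.x * d.x + d.y * d.y);
    const double scale = eyeDist / (0.5 * glassesBGRA.cols);            // heuristic: lenses span ~2x eye distance
    const cv::Point2f center = 0.5f * (leftEye + rightEye);

    // Rotate and scale the decoration about its own center, then move it onto the eyes.
    cv::Mat M = cv::getRotationMatrix2D(
        cv::Point2f(glassesBGRA.cols * 0.5f, glassesBGRA.rows * 0.5f), -angle, scale);
    M.at<double>(0, 2) += center.x - glassesBGRA.cols * 0.5;
    M.at<double>(1, 2) += center.y - glassesBGRA.rows * 0.5;

    cv::Mat warped;
    cv::warpAffine(glassesBGRA, warped, M, frame.size(),
                   cv::INTER_LINEAR, cv::BORDER_CONSTANT, cv::Scalar(0, 0, 0, 0));

    // Alpha-blend the warped decoration onto the frame.
    for (int y = 0; y < frame.rows; ++y)
        for (int x = 0; x < frame.cols; ++x) {
            const cv::Vec4b& p = warped.at<cv::Vec4b>(y, x);
            const double a = p[3] / 255.0;
            cv::Vec3b& q = frame.at<cv::Vec3b>(y, x);
            for (int c = 0; c < 3; ++c)
                q[c] = cv::saturate_cast<uchar>(a * p[c] + (1.0 - a) * q[c]);
        }
}

In practice the brow landmarks are combined with the eye landmarks, as described above, to make the estimated axis more stable, and the silhouette points are left out entirely.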
4. Experiments

Table 1 lists the per-frame run time of the algorithm on our laptop. The second column lists the time for path smoothing without facial features, and the third column the time for path smoothing with facial features as the saliency constraint. In the latter case, the CLNF facial landmark detection takes up the biggest share of the time (~45 ms per frame). [1] reported 20 fps on low-resolution video, and 10 fps with un-optimized saliency.

Table 1. Timing per frame of the algorithm. Video resolution 640 x 360.

                              w/o face    w/ face
motion estimation (ms)        12.1        59.1
optimize camera path (us)     0.15        0.40
render final result (us)      2.7         2.7
face decoration (ms)          -           5.7
total (ms)                    15          68
speed (fps)                   67          15

4.1. Video Stabilization

We apply our path smoothing algorithm to shaky videos and observe a significant reduction of jittering. An example output can be found on Youtube.
To visualize the effect of stabilization, we plot the estimated camera trajectory before and after our algorithm in Fig. 1. We also provide a quantitative measurement of the L1 norm |D(P)|_1 before and after smoothing in Table 2. The L1 norm drops substantially (by roughly 55% in x, 73% in y, and 61% in the rotation angle), which means the abrupt jitters are significantly reduced.

Figure 1. Path before and after (left column) L2-norm smoothing and (right column) L1-norm smoothing. (Top) x-direction. (Middle) y-direction. (Bottom) rotation angle.

Table 2. L1 norm |D(P)|_1 between the original video feature path and the smoothed one, in the x and y directions and the rotation angle.

path        <|dx_t|>    <|dy_t|>    <|da_t|>
original    1569        857         1.12
smoothed    705         234         0.44

4.2. Facial Retargeting

Our experiments with video stabilization using facial features are shown in Fig. 2. Fig. 2(a) is the original video, which contains slow drifting motion of both the camera and the subject person. Fig. 2(b) is the stabilized output using only camera path smoothing; the slow motion of the subject is still prominent. Fig. 2(c) is the stabilized output using camera path smoothing with a constraint on the motion of the facial features, which keeps the subject stabilized at the center across frames. Both result videos can be found on Youtube (link 1 and link 2).
As expected, stabilization comes at the price of reduced resolution. The original images are cropped by 20% in Fig. 2(b) and (c) to remove the black margins caused by warping. There are still residual margins in Fig. 2(c).
We also quantify the smoothing effect and the facial targeting, as shown in Table 3. As the facial saliency constraint ratio w increases, both the L1 norm and the absolute position deviation drop, which means that the larger w is, the smoother the video gets and the more centered the human face is. This is the behavior expected from our algorithm.

4.3. Comparison with State-of-the-art Systems

Since no implementation of previous work is publicly available, we obtained the original and output videos reported in Grundmann's paper [1], computed the evaluation metrics described in Section 3.3 on their output video, and present them alongside our results. As we can see from the comparison below, our implementation is comparable to the state-of-the-art system.

4.4. Face Decoration

Using the per-frame facial landmarks and the placement procedure described in Section 3.4, we add decorations such as glasses and a hat to the stabilized videos. Screenshots of adding a hat and glasses are shown in Fig. 4.

5. Conclusion & Perspectives

All in all, the video feature path is significantly smoothed by the L1-optimized stabilization algorithm; the L1 norm |D(P)|_1, which quantifies the motion between frames, drops greatly after applying the stabilization.
If the facial retargeting method is included, the video becomes more focused on the human faces; the larger the saliency constraint ratio w is, the more centered the human faces are with respect to the cropped video frame.
Decorations such as glasses, a hat, or a tie can also be attached to the faces in the video, with the same orientation as the faces. More decorations and effects will be added to make this work more fun in the future.

References

[1] Matthias Grundmann, Vivek Kwatra, and Irfan Essa. Auto-Directed Video Stabilization with Robust L1 Optimal Camera Paths. In CVPR, 2011.
[2] J. R. Rice and J. S. White. Norms for Smoothing and Estimation. SIAM Review, 1964.
[3] Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. Constrained Local Neural Fields for Robust Facial Landmark Detection in the Wild. In ICCV Workshops, 2013.
[4] Michael L. Gleicher and Feng Liu. Re-cinematography: Improving the Camera Dynamics of Casual Video. In Proceedings of the 15th ACM International Conference on Multimedia (MM '07), pages 27-36, 2007.
[5] Yasuyuki Matsushita, Eyal Ofek, Weina Ge, Xiaoou Tang, and Heung-Yeung Shum. Full-Frame Video Stabilization with Motion Inpainting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(7):1150-1163, 2006.
[6] F. Liu, M. Gleicher, J. Wang, H. Jin, and A. Agarwala. Subspace Video Stabilization. ACM Transactions on Graphics, 30, 2011.
[7] Junhong Gao, Seon Joo Kim, and M. S. Brown. Constructing Image Panoramas Using Dual-Homography Warping. In CVPR, pages 49-56, 2011.
[8] C. Buehler, M. Bosse, and L. McMillan. Non-metric Image-Based Rendering for Video Stabilization. In CVPR, 2001.
[9] Feng Liu, Michael Gleicher, Hailin Jin, and Aseem Agarwala. Content-Preserving Warps for 3D Video Stabilization. ACM Transactions on Graphics, 28(3), 2009.
[10] Amit Goldstein and Raanan Fattal. Video Stabilization Using Epipolar Geometry. ACM Transactions on Graphics, 31(5):1-10, 2012.
[11] B.-Y. Chen, K.-Y. Lee, W.-T. Huang, and J.-S. Lin. Capturing Intention-Based Full-Frame Video Stabilization. Computer Graphics Forum, 27(7):1805-1814, 2008.
[12] Heung-Yeung Shum and Richard Szeliski. Construction of Panoramic Image Mosaics with Global and Local Alignment. International Journal of Computer Vision, 36(2):101-130, 2000.
[13] Shuaicheng Liu, Lu Yuan, Ping Tan, and Jian Sun. Bundled Camera Paths for Video Stabilization. ACM Transactions on Graphics, 32(4), Article 78, 2013.
Figure 2. Demonstration of facial retargeting in video stabilization. The green dot indicates the center of the frame, and the green lines show the border of the frame. Red dots in (c) indicate the facial landmarks detected by OpenFace [3]; they are intended as a guide to the eye. Both videos, (b) and (c), can be found on Youtube.
Table 3. L1 norm |D(P)|_1 of the video feature path in the x and y directions, and average absolute deviation of the facial features from the frame center, for different facial saliency constraint ratios w.

w           <|dx_t|>    <|dy_t|>    <|x - x_center|>    <|y - y_center|>
original    1392        496         32805               5882
0.2         1139        254         32583               4902
0.5         792         234         21568               3433
0.95        221         247         2695                1954
Figure 3. Path smoothing before and after with facial saliency constraints. (Left column) x-direction. (Right column) y-direction. From top to bottom, the facial constraint ratios w are 0.2, 0.5, and 0.95, respectively.