
Spatial Pyramid Matching for Scene Category Recognition

Nityananda Jayadevprakash, Olzhas Makhambetov


University of California
Irvine, CA
njayadev@uci.edu, omakhamb@uci.edu

Abstract
In this work we present a method for scene category recognition based on matching approximate global correspondences, following the approach proposed by Lazebnik et al. [1]. The method partitions an image into increasingly fine sub-regions and computes histograms of local features within each sub-region. This creates a spatial pyramid, which is an extension of the orderless bag-of-features representation of an image. The approach yields a significant improvement on scene categorization tasks and is one of the best methods for object recognition on the Caltech-101 database [3]. Our work focuses on applying this approach to the newer Caltech-256 dataset [8], which is generally considered more challenging than the Caltech-101.

1. Introduction
Our task in this project is to determine the category of an image. One of the dominant approaches to this task is the bag-of-features method, which represents an image as an orderless collection of local features. However, since this method and its relatives compute a histogram of features at dominant interest points, they throw away all information about the spatial layout of the features, and are therefore incapable of capturing the shape of an object or segmenting it from the background. Ideally we would use the spatial information to build a structural object description, but this is not simple in the presence of occlusion, clutter, or viewpoint changes. There has been considerable work towards building robust structural object descriptors. Some of it involves generative part models [3], [4], which impose relationships between the positions of detected parts. Another approach finds pairwise relations between neighboring local features. However, these methods are either too computationally expensive or have yielded inconclusive results.

In this work, by contrast, the authors move away from trying to develop geometrically invariant structural representations and instead propose a global, non-invariant representation. The main idea is to gather local statistics of an image over small patches on a regular grid and then find geometric correspondences between these aggregated statistics. The method thus retains local invariance within small patches, while the corresponding local histograms of two images are matched to obtain correspondence at a global scale. To do this, we use kernel-based classification: we build a pyramid over the image by subdividing it into grids of increasingly fine resolution and taking a histogram of local features in each grid cell. We then find the geometric correspondence with another image by matching it against that image's pyramid, constructed in the same fashion. This idea is based on Grauman and Darrell's pyramid matching scheme [5].
To evaluate this method, we implement it and test it on the Caltech-256 [8]. This dataset is considered harder than the Caltech-101, since it has more clutter, objects are not centered, and its images do not contain corner artifacts.
The authors argue that since the global statistics of an image provide good hints about its category, this method can serve as a precursor to subsequent object recognition tasks, providing context to the object recognition component of a larger system.

2. Previous work
In computer vision, histograms are widely used for image description. Koenderink and Van Doorn [6] replaced local image structure with local histograms, essentially discarding the precise location of individual image elements; in this sense histogram images are locally orderless images. For each region of interest (ROI), defined as a Gaussian aperture with a given location and scale, they compute histograms of features over that ROI. The proposed spatial pyramid approach can be considered an alternative way of creating histogram images, where a fixed hierarchy of rectangular windows is used instead of Gaussian apertures. Experiments have shown that spatial pyramid matching is a very powerful mechanism for estimating overall perceptual similarity between images.
An opposite approach, multiresolution histograms, was proposed by Hadjidemetriou et al. There, an image is repeatedly subsampled and a histogram is computed at each resolution. Whereas they vary the resolution of the image while keeping the resolution of the histograms fixed, the spatial pyramid computes features of a fixed-resolution image but varies the spatial resolution at which the histograms are aggregated. Keeping the image resolution fixed preserves more information and yields higher-dimensional representations. Because of this key difference, the spatial pyramid can be used for approximate geometric matching when an appropriate kernel is used.
The method of subdividing an image into sub-blocks and computing histograms of local features has already been used in computer vision, both for global image description and for local description of regions of interest. The important point is to find the right subdivision scheme and the right balance between subdividing and disordering. The authors argue that good results can be achieved by combining multiple resolutions in a principled way, and that the success of "subdivide and disorder" techniques may stem precisely from the fact that, with an appropriate kernel, this operation performs approximate geometric matching.

3. Spatial pyramid matching


3.1. Pyramid Match Kernels
Let X and Y be two sets of vectors in a d-dimensional feature space. To find an approximate matching between these sets, Grauman and Darrell proposed the pyramid match kernel [5]. Briefly, they propose that we place a
sequence of increasingly coarser grids in the feature space
and take a weighted sum of the number of matches that occur at each level of resolution. We consider two points to
match if they appear in the same cell of the grid. Matches
at finer resolutions will be given higher weight when compared to matches that occur at a coarser resolution. Let us
construct a sequence of grids at resolutions 0, ..., L, such that the grid at level l has 2^l cells along each dimension, for a total of D = 2^{dl} cells. Let H_X^l and H_Y^l denote the histograms of X and Y at level l, so that H_X^l(i) and H_Y^l(i) are the numbers of points from X and Y that fall into the i-th cell of the grid. The number of matches at level l is then given by the histogram intersection function:

I^l = I(H_X^l, H_Y^l) = \sum_{i=1}^{D} \min\bigl(H_X^l(i), H_Y^l(i)\bigr)    (1)

Note that the matches found at level l include all the matches found at level l + 1. The number of new matches found at level l is therefore I^l - I^{l+1} for l = 0, ..., L - 1, and it is given a weight 1/2^{L-l}, which is inversely proportional to the cell width at that level. Intuitively, matches found at coarser resolutions are penalized because they involve increasingly dissimilar features. The resulting pyramid match kernel is:

k^L(X, Y) = I^L + \sum_{l=0}^{L-1} \frac{1}{2^{L-l}} \bigl(I^l - I^{l+1}\bigr) = \frac{1}{2^L} I^0 + \sum_{l=1}^{L} \frac{1}{2^{L-l+1}} I^l    (2)

Both the histogram intersection and the pyramid match kernel are Mercer kernels [5]. This means that evaluating the kernel function corresponds to taking an inner product in an (implicit) rich feature space, and consequently the kernel can be used in an SVM for classification.
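As a concrete illustration of eqs. (1) and (2), the following is a minimal Python sketch (our own, not the implementation of [5] or [1]); it assumes the per-level histograms have already been computed and are passed in as flat integer arrays, one per level.

import numpy as np

def histogram_intersection(h_x, h_y):
    # Eq. (1): number of matches at one level, given the two histograms.
    return np.minimum(h_x, h_y).sum()

def pyramid_match_kernel(hists_x, hists_y, L):
    # hists_x[l], hists_y[l]: flat histograms at level l (length 2^(d*l)).
    # Eq. (2): k = (1/2^L) I^0 + sum_{l=1..L} (1/2^(L-l+1)) I^l.
    I = [histogram_intersection(hists_x[l], hists_y[l]) for l in range(L + 1)]
    k = I[0] / 2 ** L
    for l in range(1, L + 1):
        k += I[l] / 2 ** (L - l + 1)
    return k

The same value can equivalently be computed as I^L plus the weighted sum of the new matches I^l - I^{l+1}, which is sometimes more convenient.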

3.2. Spatial Matching Scheme


Grauman and Darrell match two collections of features in the high-dimensional feature space, which throws away all the spatial information available. Here, however, the authors propose the opposite: pyramid matching is performed in the two-dimensional image layout space, and each feature vector is instead quantized into one of M discrete types (which we can call visual words), under the assumption that only features of the same type can be matched to one another. For each channel m we then get two sets of two-dimensional vectors, X_m and Y_m, containing the image coordinates of the features of type m found in the two images. The final kernel is the sum of the separate channel kernels:

K^L(X, Y) = \sum_{m=1}^{M} k^L(X_m, Y_m)    (3)

Since the features are quantized into a visual vocabulary, this method reduces to a standard bag-of-features approach when L = 0. Another important point is that the final kernel can be implemented as a single histogram intersection of "long" vectors formed by concatenating the appropriately weighted histograms of all channels at all resolutions (Fig. 1). This is possible because the pyramid match kernel is a weighted sum of histogram intersections and c min(a, b) = min(ca, cb) for positive numbers. For L levels and M channels, the resulting vector has dimensionality M \sum_{l=0}^{L} 4^l = \frac{M}{3}(4^{L+1} - 1). For example, with M = 400 channels and L = 3 levels this amounts to a 34000-dimensional histogram intersection, but since these histograms are sparse, these operations are efficient.

Figure 1. Example of constructing a three-level pyramid. The image has three feature types, indicated by circles, diamonds, and crosses. At the top, the image is subdivided at three different levels of resolution. Next, for each level of resolution and each channel, the features that fall in each spatial bin are counted. Finally, each spatial histogram is weighted according to eq. (2).

The authors note that there was no significant increase in performance beyond M = 200 and L = 2, where the concatenated histograms are 4200-dimensional. For simplicity, in our experiments we resize all images to the same size. If we did not do this, we would have to normalize the histograms, since the images in the dataset are not all of the same size; this amounts to dividing by the total weight of all features in the image, which effectively forces the number of features to be the same in every image.
Another performance improvement that we discovered and implemented is to compute the histograms of the patches only at the finest resolution. The histograms of the coarser cells at the lower resolutions can then be obtained by simply adding up the histograms of the finer-resolution cells that make up each larger patch.
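As a small sketch of this optimization (ours), with the finest-level histograms stored as a (cells x cells x M) array, each coarser level is obtained by summing 2 x 2 blocks of cells:

import numpy as np

def coarsen(hist_fine, cells, M):
    # hist_fine: per-cell histograms at the finer level, shape (cells, cells, M).
    # Returns the histograms one level up, shape (cells // 2, cells // 2, M).
    h = hist_fine.reshape(cells // 2, 2, cells // 2, 2, M)
    return h.sum(axis=(1, 3))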

4. Feature Extraction
The authors use two kinds of features in their experiments. The first kind, dubbed weak features, are oriented edge points, i.e., points whose gradient magnitude in a given direction exceeds a minimum threshold. To create features similar to Torralba's gist features [7], the authors extract edge points at two scales and eight orientations, for a total of M = 16 channels.
Then, for better discriminative power, they use what they dub strong features, which are SIFT descriptors of 16 x 16 pixel patches computed over a grid with a spacing of 8 pixels. They use a dense regular grid instead of interest points only, since the comparative evaluation of Fei-Fei and Perona [3] shows that dense features work better for scene classification, because they capture uniform regions such as sky or calm water. After this, k-means clustering is performed on a random subset of patches from the training set to form the visual vocabulary. Typical vocabulary sizes in their experiments are M = 200 and M = 400.
Since the results reported for the weak features were not very good, in our work we decided to use only the strong features, i.e., the SIFT descriptors.
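For reference, a dense-SIFT-plus-k-means pipeline along these lines can be sketched as follows (our own sketch using OpenCV and scikit-learn rather than the authors' implementation; the patch size and stride follow the numbers above, while the sampling size is an assumption):

import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def dense_sift(gray, step=8, patch=16):
    # SIFT descriptors on a dense grid: one 16 x 16 patch every 8 pixels.
    sift = cv2.SIFT_create()
    kps = [cv2.KeyPoint(float(x), float(y), float(patch))
           for y in range(patch // 2, gray.shape[0] - patch // 2, step)
           for x in range(patch // 2, gray.shape[1] - patch // 2, step)]
    kps, desc = sift.compute(gray, kps)
    return desc                                   # shape (num_patches, 128)

def build_vocabulary(descriptor_list, M=200, sample_size=100000):
    # Cluster a random subset of all training descriptors into M visual words.
    all_desc = np.vstack(descriptor_list).astype(np.float32)
    idx = np.random.choice(len(all_desc), min(sample_size, len(all_desc)), replace=False)
    return MiniBatchKMeans(n_clusters=M).fit(all_desc[idx])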

5. Authors' Experiments

For their experiments, Lazebnik et al. used three different datasets: fifteen scene categories, Caltech-101, and Graz.
They use the per-class recognition rate as the measure of performance, since the classes contain different numbers of images. They run each experiment 10 times with different random training and test sets and report the mean and standard deviation over the individual runs. Multi-class classification is done with multiple binary SVMs in a one-against-all setup; the classifier with the maximum positive response determines the class of the test image. They used only grayscale images for all their experiments.
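A sketch of this one-against-all setup with a precomputed spatial pyramid kernel (our own, using scikit-learn; K_train is the n_train x n_train kernel matrix between training images and K_test the n_test x n_train matrix between test and training images):

import numpy as np
from sklearn.svm import SVC

def one_vs_all_predict(K_train, y_train, K_test):
    # One binary SVM per class on the precomputed kernel;
    # the classifier with the maximum decision value determines the class.
    classes = np.unique(y_train)
    scores = np.zeros((K_test.shape[0], len(classes)))
    for j, c in enumerate(classes):
        svm = SVC(kernel="precomputed").fit(K_train, (y_train == c).astype(int))
        scores[:, j] = svm.decision_function(K_test)
    return classes[np.argmax(scores, axis=1)]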
The fifteen scene categories dataset is composed of thirteen categories provided by Fei-Fei and Perona [3] (eight of which were originally collected by Oliva and Torralba [7]) and two (industrial and store) collected by the authors of [1]. Each category has 200 to 400 images, and the average image size is 300 x 250 pixels. Classification experiments used 100 images per class for training and the rest for testing. Strong features do better than weak features, and going from M = 200 to M = 400 does not improve performance very much. Although single-level performance increases with the resolution of the grid, combining the levels into a pyramid performs better still. For all three kinds of features, results improve dramatically, and with statistical significance, in going from L = 0 (the bag-of-features baseline) to a multi-level pyramid. For strong features, however, single-level performance actually drops in going from L = 2 to L = 3, which can be explained by the pyramid becoming too finely subdivided, with individual bins yielding too few matches. Table 1 shows the results for the fifteen scene categories.
Their second set of experiments was done on the Caltech-101 dataset. Although diverse, most images in this dataset have relatively little clutter, and the objects are centered and occupy most of the image. It is also important to mention that some images are affected by corner artifacts resulting from artificial image rotation; these artifacts can provide cues that lead to misleadingly high recognition rates. They train on 30 images per class and test on the rest of the images, using at most 50 test images per class. Table 2 gives a breakdown of classification rates at different pyramid levels for weak features and for strong features with M = 200.

L         Weak features (M = 16)      Strong features (M = 200)    Strong features (M = 400)
          Single-level   Pyramid      Single-level   Pyramid       Single-level   Pyramid
0 (1x1)   45.3 ± 0.5                  72.2 ± 0.6                   74.8 ± 0.3
1 (2x2)   53.6 ± 0.3     56.2 ± 0.6   77.9 ± 0.6     79.0 ± 0.5    78.8 ± 0.4     80.1 ± 0.5
2 (4x4)   61.7 ± 0.6     64.6 ± 0.7   79.4 ± 0.3     81.1 ± 0.3    79.7 ± 0.5     81.4 ± 0.5
3 (8x8)   63.3 ± 0.8     66.8 ± 0.6   77.2 ± 0.4     80.7 ± 0.3    77.3 ± 0.5     81.1 ± 0.6

Table 1. Classification results for the scene category database.

Class     L = 0         L = 2
Bikes     82.4 ± 2.0    86.3 ± 2.5
People    79.5 ± 2.3    82.3 ± 3.1

Table 3. Results for the Graz database with strong features (M = 200).

In their experiments, the most successful classes are either dominated by rotation artifacts (like minaret), have very little clutter (like chair), or represent coherent natural scenes (like joshua tree and okapi). The least successful classes are either textureless animals (like beaver and cougar), animals that camouflage well in their environment (like crocodile), or thin objects (like ant). Their method beats state-of-the-art orderless methods as well as methods based on precise geometric correspondence.
Although their method was not designed to cope with heavy clutter and pose changes, the authors wanted to check how much of the global scene cues their algorithm can exploit even under these conditions. So they tested it on the Graz dataset [9], which is characterized by high intra-class variation. It has two object classes, bikes (373 images) and persons (460 images), and a background class (270 images). The image resolution is 640 x 480, and the range of scales and poses at which exemplars are presented is very diverse. They train detectors for persons and bikes on 100 positive and 100 negative images and test on a similarly distributed set. Table 3 summarizes their results for strong features with M = 200.

6. Our Experiments
We tested our implementation on the Caltech-256 [8], which is considered harder than the Caltech-101 on which the authors have already tested. This dataset is much bigger (30608 images) and has many more images per category than the Caltech-101. The Caltech-256 also has more clutter and occlusion; its images are not left-right aligned and do not suffer from the rotation artifacts present in the Caltech-101 (artifacts which inflated recognition rates there because they provide stable cues).
For our experiments we use only the strong features described before: we take SIFT descriptors of the image densely, over one 16 x 16 pixel patch at a time, moving the patch by 8 pixels.

                                            Level 1     Level 2     Level 3
Number of images classified correctly       21 / 150    26 / 150    27 / 150
Percentage of images classified correctly   14%         17.33%      18%
Number of classes detected                  17 / 50     17 / 50     18 / 50

Table 4. Results for the following setup: Number of Classes: 50, Training Images per Class: 5, Testing Images per Class: 3, Number of Visual Words: 300.

                                            Level 1    Level 2    Level 3
Number of images classified correctly       9 / 50     9 / 50     8 / 50
Percentage of images classified correctly   18%        18%        16%
Number of classes detected                  9 / 50     9 / 50     8 / 50

Table 5. Results for the following setup: Number of Classes: 50, Training Images per Class: 7, Testing Images per Class: 1, Number of Visual Words: 300.

Like the authors, we take the descriptors over a dense grid and not just at interest points. We also resize all images to 320 x 320 to eliminate the need for histogram normalization (normalization becomes necessary when the histograms are collected over different numbers of patches for images of different sizes). After generating the descriptors, we create a feature space by randomly selecting descriptors from patches of the images; random selection is used rather than taking all descriptors, since that would be prohibitively costly. We then run a clustering algorithm (we used k-means) on this feature space to obtain a set of visual words, which forms our visual vocabulary. For our experiments we used M = 300 visual words and, to check the benefit of a larger vocabulary, a second set of M = 500 visual words. After creating the visual words, we assign each feature to the visual word it is closest to (i.e., we assign each feature the label of the cluster it belongs to).
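The assignment step is plain vector quantization; a sketch (ours; with scikit-learn's k-means one can equivalently call the fitted model's predict method):

import numpy as np

def assign_visual_words(descriptors, centroids):
    # Label each SIFT descriptor with the index of its nearest cluster center.
    # descriptors: (N, 128), centroids: (M, 128).
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)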

L    Weak features (M = 16)       Strong features (M = 200)
     Single-level   Pyramid       Single-level   Pyramid
0    15.5 ± 0.9                   41.2 ± 1.2
1    31.4 ± 1.2     32.8 ± 1.3    55.9 ± 0.9     57.0 ± 0.8
2    47.2 ± 1.1     49.3 ± 1.4    63.6 ± 0.9     64.6 ± 0.8
3    52.2 ± 0.8     54.0 ± 1.1    60.3 ± 0.9     64.6 ± 0.7

Table 2. Classification results for the Caltech-101 database.

After this, we do multi-class classification using many binary SVMs, training each SVM to separate one class from the rest, so there are as many SVMs as there are classes. The kernel used for training and testing is the spatial pyramid match kernel described above. For testing, we compute for every test image the probability of belonging to each class and assign the image to the most probable class. We used 50 of the 256 categories of the Caltech-256, mainly because we were working on personal computers with limited processing power. As our performance measure we use the percentage of images classified correctly, which is a reasonable measure here because the test sets of all chosen classes are of the same size.
For the first round, we took 5 training and 3 testing images per class for the 50 classes, used a visual vocabulary of M = 300 words, and ran the classification for L = 1, 2, and 3. The classification rate goes up as L increases, from 14% for L = 1 to 18% for L = 3; the complete results are in Table 4. These recognition rates are consistent with current benchmark results on the Caltech-256 for a training set of size 5 (Fig. 2). As the size of the training set increases, the results should improve. To test this, we re-distributed the above set of training and testing images so that the training set contains 7 images per class and the test set 1 image per class; another reason for doing this is that extracting dense SIFT descriptors for 500 images takes a very long time on our laptops. With this new split we again built M = 300 visual words and ran the classification with the binary SVMs. The results were 18% for L = 1, 18% for L = 2, and 16% for L = 3; the complete results are in Table 5. Although this is better than the results obtained with 5 training images per category, we feel that the effect of a larger training set would be more pronounced with a larger test set per class.
With this in mind, we narrowed the number of classes down to 5 of the 256, this time testing with larger training and test sets: 20 training images and 3 test images per class. We once again extracted dense SIFT descriptors for these 115 images and created a visual vocabulary of size M = 300.

                                            Level 1       Level 2       Level 3
Number of images classified correctly       8 / 15        8 / 15        7 / 15
Percentage of images classified correctly   53%           53%           46.67%
Number of classes detected                  4 / 5         3 / 5         3 / 5
100% detection?                             Class 3       Classes 4,5   Class 4
No detection?                               Classes 2,3   Class 2       Classes 4,5

Table 6. Results for the following setup: Number of Classes: 5, Training Images per Class: 20, Testing Images per Class: 3, Number of Visual Words: 300.

Our results were indeed better: 53% correct for L = 1 and L = 2, but only 46.67% for L = 3. Another thing we wanted to test was the effect of increasing the size of the visual vocabulary from M = 300 to M = 500, so we repeated the same setup with M = 500 visual words. This gave a large benefit at level L = 1, where the correct classification percentage jumped to 66.67%; however, as the level increases to L = 2 and L = 3, performance drops to 46.67% and 40%. The complete results are in Tables 6 and 7.
Some classes were hard to classify even with a large training set. One was baseball bat, because the test images were very different from the training images. Another was baseball glove, since most of the training images contain two gloves while the test images (Fig. 3) contain only one; this class did, however, get some recognition, unlike baseball bat. An easy class was backpack.

7. Conclusion
The method discussed above is a holistic approach to image categorization. Despite its simplicity, it has shown good results compared to methods that construct a structural model for object recognition. It even does better than an orderless image representation scheme, which is not a trivial accomplishment. It highlights the power of global scene statistics and provides useful discriminative information.

                                            Level 1         Level 2     Level 3
Number of images classified correctly       10 / 15         7 / 15      6 / 15
Percentage of images classified correctly   66.67%          46.67%      40%
Number of classes detected                  4 / 5           4 / 5       3 / 5
100% detection?                             Classes 1,2,3   Class 3     None
No detection?                               Class 4         Class 4     Classes 4,5

Table 7. Results for the following setup: Number of Classes: 5, Training Images per Class: 20, Testing Images per Class: 3, Number of Visual Words: 500.

Figure 2. Training set: the images used to train the SVMs for the baseball bat, baseball glove, and backpack classes.

Figure 3. Test set: backpack always had high recognition, while the most difficult test images were the baseball bats (no recognition) and baseball gloves (some low recognition).

The authors claim that it can be used to provide context for larger object recognition systems or to evaluate biases in new datasets.

References
[1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR, 2006.
[2] G. Griffin, A. Holub, and P. Perona. The Caltech-256. Caltech Technical Report.
[3] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In IEEE CVPR Workshop on Generative-Model Based Vision, 2004.
[4] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In Proc. CVPR, volume 2, pages 264-271, 2003.
[5] K. Grauman and T. Darrell. Pyramid match kernels: Discriminative classification with sets of image features. In Proc. ICCV, 2005.
[6] J. Koenderink and A. van Doorn. The structure of locally orderless images. IJCV, 31(2/3):159-168, 1999.
[7] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision system for place and object recognition. In Proc. ICCV, 2003.
[8] G. Griffin, A. Holub, and P. Perona. Caltech-256 Object Category Dataset. California Institute of Technology, 2007.
[9] A. Opelt, M. Fussenegger, A. Pinz, and P. Auer. Weak hypotheses and boosting for generic object detection and recognition. In Proc. ECCV, volume 2, pages 71-84, 2004.
