

Architectural style classification of Mexican historical buildings
using deep convolutional neural networks and sparse features
Abraham Montoya Obeso,a,* Jenny Benois-Pineau,b Alejandro Álvaro Ramirez Acosta,c Mireya Saraí García Vázquez a,c

a Instituto Politécnico Nacional-Centro de Investigación y Desarrollo de Tecnología Digital, Ave. Instituto Politécnico Nacional 1310, Tijuana, México, 22435
b LaBRI-Université de Bordeaux, Cours de la Libération, Bordeaux, France, F-33405
c MIRAL R&D&I, Imperial Beach, San Diego, USA, 91932

Abstract. In this article, we propose a convolutional neural network (CNN) to classify images of buildings using
sparse features (SF) at the network's input in conjunction with primary colour pixel values. As a result, a trained
neural model is obtained that classifies Mexican buildings into three classes according to architectural style:
pre-hispanic, colonial, and modern, with an accuracy of 88.01%. We face the problem of poor information in the
training dataset due to the unequal availability of cultural material, and we propose a data augmentation and
oversampling method to solve it. The results are encouraging and allow pre-filtering of the content in search tasks.

Keywords: CNN, cultural heritage, indexing, classification, image processing, deep learning.

*Abraham Montoya, E-mail: amontoyao1500@alumno.ipn.mx

1 Introduction

Since 1986 the European Union (EU) has supported research and innovation projects aiming to
improve the preservation of cultural heritage. The FP7 and H2020 programs1 have funded interesting
projects in the sphere of European cultural heritage, such as AXES2, whose goal is to develop tools
that provide various types of users with new engaging ways to interact with audiovisual libraries,
helping them to discover, browse, navigate, search, and enrich archives; I-SEARCH3, which aims to
create a novel unified framework for multimedia and multimodal content indexing, sharing, search,
and retrieval; CultAR4, a platform achieving personalized and engaging digital cultural experiences
through enhanced representation, hybrid space mediation, social engagement and awareness; and
DECIPHER5, which changes the way people access digital heritage by combining much richer
event-based metadata with causal reasoning models. Other projects in cultural heritage are
EUROPEANA6, PRESIOUS7, CULTURA8, PATHS9, PRESTOPRIME10, i-Treasures11, EAGLE12 and TAG CLOUD13. Another
innovative project has been proposed by Alletto14, which aims at assisting the visitor during an
outdoor tour of a cultural site using the unique first-person perspective of wearable cameras.
Picard15 discusses the use of automatic tools to both automatically index the documents and search
through the heritage collections.

In Mexico, interest in digital cultural heritage has increased considerably. Currently,
through the co-funding mechanisms of Conacyt-H202016, Mexican partners can be funded in
Horizon 2020. The contribution in this paper results from a fruitful collaboration of Mexican and
French partners in the framework of the research project Mex-Culture, supported by both the French
National Agency of Research (ANR17) and CONACYT as a bi-national project between France
and Mexico. It resulted in the creation of the first multimedia indexing platform with the aim of
promoting Mexican culture.

With a view to enriching the digital visual archives of Mexican cultural artifacts, the first
step in content classification in an architectural environment consists in determining the style of
buildings. For Mexico, the majority of cultural heritage buildings can be divided into three classes:
pre-hispanic, colonial, and modern. Visually, these classes of architectural style can be
distinguished by the geometry of shapes and details. This is why it is interesting to use specific
geometrical features for their classification. These features, describing lines, corners, orientations,
etc., can be back-projected into the image plane, yielding sparse matrices.

On the other hand, the most powerful classification approaches for natural images and video
are Deep Neural Networks18. Most work on the image classification task ingests primary
colour pixel values. The novelty of our work consists in the use of a-priori knowledge, expressed
as specific geometrical features, with deep convolutional neural networks. The idea here is to
enhance primary colour pixel values with specifically engineered sparse features. We apply our
method to classify architectural styles of Mexican buildings. Hence, a side contribution is a new
dataset with images of Mexican buildings.

The paper is organized as follows. Section 2 presents the related work both on image classification
for cultural aspects and on deep convolutional neural networks as a new approach for large-scale
image recognition. Section 3 describes the method for Mexican building architectural style
classification. Here the concepts of the proposed sparse features and the convolutional neural network
architecture for image classification are introduced. The experimental dataset, experiments, and
results are presented in Section 4. Section 5 contains the discussion of our results and the perspectives
of the work.

2 Related Work

Before describing the proposed method, we briefly review related approaches for architectural
building style image classification and provide an overview of deep convolutional networks.

2.1 Image Classification

Image classification is still a challenging task and an open problem in the computer vision and
multimedia information retrieval communities. The automatic classification of building styles in
digital images has become a focal point in cultural heritage applications. In that sense, not much
work has been reported. Salvador19 proposes to automatically classify images from 50 different
cultural events. Liu20 tries to recognize cultural events by combining the ideas of object/scene
content mining and strong image representation via Convolutional Neural Networks. Shalunts21
proposes an approach based on clustering and learning of local features, integrating at the training
stage the knowledge that architects use to classify windows of different architectural styles.
Chu and Tsai22 classify images based on the idea that visual patterns can describe elements conveying
architectural style. Based on PHOG features, Zhang et al.23 apply a Support Vector Machine to
automatically classify ancient Chinese architectural images.
Using another approach, Yang and Förstner24 classify building facades in natural images with a
randomized forest classifier and local features. Their focus is the classification of the object classes
pavement, car, door, road, sky, window, and vegetation. Doersch et al.25 use geotagged images to
find visual elements such as windows, balconies, and street signs to characterize an area by its
architecture; they propose a clustering approach taking into account a weak geographic
supervision. However, methods to classify architectural images are still few.

The deep learning approach to image classification surpasses the performance of other supervised
learning methods based on engineered features23. In our proposal, we try to combine the feature
generation performed during the training process by a Deep Convolutional Neural Network (DCNN) with the
use of specifically generated sparse features as the input to the DCNN. In the following section, we
briefly introduce Deep CNNs.

2.2 Deep Convolutional Neural Networks

Deep Neural Networks have gained ever-growing popularity in supervised learning tasks due
to their generalization capability when a sufficient amount of training data is available26.

A Convolutional Neural Network (CNN27) is a type of feed-forward neural network, trained by
back-propagation, that models biological visual processes. It consists of multiple trainable convolutional
stages/layers, in which the input and output of each stage are one- or multi-dimensional arrays.
Each layer learns to extract features with larger receptive fields from the features of the previous
layers. As shown in Fig. 1, a typical CNN is composed of several layers of feature extraction followed
by a classification stage at the end.

Each stage/layer consists of four steps: trainable convolution, non-linearity activation, contrast
normalization, and pooling/sub-sampling. Convolution filters transform an input map into
translation-invariant maps with different trainable weights and biases as in Eq. 1:

X_j^l = f\left( \sum_{i \in M_j} X_i^{l-1} * \omega_{ij}^l + B_j^l \right)        (1)

where,
X_j^l is the activity of unit j on layer l,
X_i^{l-1} is the i-th input feature map from layer l−1, selected over the set M_j,
B_j^l is the additive bias of unit j on layer l,
ω_{ij}^l represents the synaptic weights between unit j of layer l and layer l−1, and
f(·) is a nonlinear activation function.
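As an illustration only (not part of the authors' implementation), a minimal numpy/scipy sketch of the forward pass of Eq. 1, assuming M_j simply selects all input maps and f is the ReLU introduced below:

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer_forward(X_prev, W, b):
    """Forward pass of Eq. 1 for one convolutional layer.
    X_prev : list of 2-D input maps X_i^{l-1} from layer l-1.
    W[j][i]: trainable kernel w_ij^l linking input map i to output map j.
    b[j]   : additive bias B_j^l of output map j."""
    outputs = []
    for j in range(len(W)):
        # sum of 2-D correlations over the selected input maps M_j
        acc = sum(correlate2d(X_prev[i], W[j][i], mode="valid")
                  for i in range(len(X_prev)))
        outputs.append(np.maximum(acc + b[j], 0.0))  # f(x) = max(x, 0), i.e. ReLU
    return outputs
```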

A nonlinear activation function, e.g., the hyperbolic tangent, the sigmoid, or the Rectified Linear Unit18
(ReLU) f(x) = max(x, 0), activates the neurons after each layer. Contrast normalization keeps the output
map values within a pre-defined range. The final feature map in each layer is subsampled or max-pooled
from the output maps of the non-linear transform layer to make it smaller in size. This brings out
more structured features in the next layers. Finally, the fully connected layers realize the dot-product
to map the features into probabilities for the input data to belong to a class from a given
taxonomy. These probabilities are usually computed by a soft-max transform28.
A simplified illustration of a CNN architecture is presented in Fig. 1 below. Here we have removed
the non-linear transformation layers and contrast normalization for the sake of easier
interpretation. The optimization method used for training the network parameters ω_{ij}^l and B_j^l
for all layers is Stochastic Gradient Descent18 (SGD). It seeks to minimize the loss function,
which quantifies the classification errors of the network. The loss function can be expressed in
different forms, but the most popular in the image classification task is the logistic regression. We
use the soft-max loss26 function with logistic regression.

For an input image X_i with known label l_i it is expressed as:

E(X_i, l_i) = -\frac{1}{N} \sum_{j=1}^{N} \log(\hat{P}_j)\,\delta(\hat{l}_j, l_i)        (2)

\delta(\hat{l}_j, l_i) = 1 \text{ if } \hat{l}_j = l_i, \; 0 \text{ otherwise}        (3)

where \hat{P}_j is the probability of label \hat{l}_j, obtained by a soft-max operator26.
The loss over a dataset D is:

L(W) = \frac{1}{|D|} \sum_{i=1}^{|D|} E(X_i, l_i) + \lambda r(W)        (4)

where r(W) is the regularization term with a weight decay λ = 0.0005.
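For concreteness, a small numpy sketch of Eqs. 2-4 (an illustration, not the authors' Caffe code), assuming `scores` holds the raw outputs of the last fully connected layer for a batch and `labels` the ground-truth class indices:

```python
import numpy as np

def softmax_cross_entropy(scores, labels, weights, lam=0.0005):
    """scores: (N, C) raw class scores; labels: (N,) true class indices;
    weights: list of weight arrays W entering the regularization term r(W)."""
    # soft-max probabilities P_hat per sample, numerically stabilized
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    P = e / e.sum(axis=1, keepdims=True)
    # only the probability of the true label contributes (delta of Eq. 3)
    data_loss = -np.mean(np.log(P[np.arange(len(labels)), labels]))
    # Eq. 4: average data loss plus weight decay r(W) = sum ||W||^2
    return data_loss + lam * sum(np.sum(w * w) for w in weights)
```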

Fig. 1 CNN architecture overview, with two layers of convolution and pooling, and two fully connected layers at the
end.

The approach to image classification with Deep CNNs involves three major stages: i) input data
selection, ii) CNN architecture design and iii) training.

First, the selection of the input image data, with or without additional information, is done. The
additional information is needed to enlarge the training dataset, as a large amount of training
samples is required to train the network parameters.
In practice, the most common way to increase the number of samples in the whole dataset is the
generation of new data computed directly from the source18,29-31. The most commonly used
techniques for training data augmentation are affine transformations such as flips, rotations, crops,
and the selection of several patches from the image (a patch of the object-of-interest, the context
area), amongst others. Data augmentation by transformations is a fast way to get more information
to train and test networks compared with manual annotation. These transformations are called
"label-preserving" as they do not change the assignment of the images to their class in a given
taxonomy. We think that the transformations must be selected in relation to the classification
problem and the variety of image acquisition conditions. For example, Simard29 presents a
classification of handwritten English digits with a CNN; the data augmentation that gives the best
results depends on the type of content, in that case an elastic transformation for handwritten digit
recognition, because elastic transformations correspond to uncontrolled oscillations of the hand
muscles. However, elastic transformations may not be useful to detect buildings; instead, simple
affine transformations such as rotations, translations, or scale changes could be profitable, because
rigid structures do not change their shape, only their perspective and scale. Note that we are rather
careful about generating geometrical distortions, as, e.g., perspective transformations do not preserve
angles and aliasing effects can appear. We suppose that the original dataset already contains images
with typical geometrical distortions and do not generate them.

Nevertheless, data augmentation is not the only way to get more information to train recognition
systems. The use of descriptors as a complementary input has been proposed by Simonyan32 to
increase classification accuracy. In our work we follow this idea and propose specific
features as input to the Deep CNN.

Second, the design of the convolutional neural network architecture according to the type of
content to process and the target classification task is fulfilled. Here the number of network layers,
the non-linear transformation, the pooling strategies, and the loss function are chosen.

Finally, the training must be planned in relation to the available memory and the dataset volume
with the selected optimization method. The method has to be sufficiently fast to tackle the high
dimensionality of the problem, but also sufficiently robust to the non-convexity of the loss functional.
We do not go into the details of optimization methods for parameter learning as we use stochastic
gradient descent18. Here buckets of data are randomly sampled from the training dataset and
several complete passes through the dataset, called "epochs", are realized. Convolution weight
initialization is an important issue in the optimization process. The direct way is to initialize
the weights randomly; otherwise, they can be initialized in different layers by transfer learning
strategies33.

To train a convolutional neural network, one needs a set of data as large as possible18,28. Dataset
creation is hard work; manual selection of content is a challenging problem due to the resources
needed for a good database consolidation: time and trained annotators. After dataset consolidation,
with a relatively limited number of samples for each class, data augmentation can be performed to
increase the number of training samples. Data augmentation consists in applying so-called
label-preserving transformations18,30,34 on selected image patches which do not change the class of
objects, such as geometrical transformations, smoothing, and colour changes. Hence the quantity and
variability of the training data are augmented and there is no danger of "by heart" learning at the
training step. The drop-out method consists in randomly switching off neurons in the intermediate
layers, so that the propagation in some positions of the grid is stopped. It proved to be an effective
regularization method and contributed to the winning entry of the 2012 ImageNet Large-Scale Visual
Recognition Challenge18. Moreover, Gong et al.35 show that the combination of convolutional neural
networks with top-k ranking improves the performance of multi-label image annotation. An architecture
similar to Krizhevsky's18 is used, which contains several convolutional layers followed by dense fully
connected layers. They use the top-k ranking loss to train the network on the NUS-WIDE36 dataset, the
largest available multi-label dataset. Image resizing and patch extraction are performed on each image
for data augmentation. They mainly focus on the loss layer, considering three different functions to
penalize the deviation between the true and predicted labels. Note that, in our work, we do not perform
multi-label predictions.

Later, Szegedy et al.37 proposed a deep CNN architecture called "Inception". This architecture
achieved the best performance in the ImageNet Large-Scale Visual Recognition Challenge 2014
(ILSVRC14). The results were obtained with GoogLeNet, a deep convolutional neural network
with 22 layers, whose architectural decisions were based on the intuition of multi-scale image
processing and the Hebbian principle.

Among the most recent publications on CNNs and recognition systems, Jaderberg et al.38 present an
end-to-end system to locate and recognize text in natural images, using bounding boxes to locate
words and convolutional neural networks for text recognition. The training data are obtained by
generating synthetic text, and no human annotation is required; this is the fundamental advantage
of their proposal. In all these works, the input data are raw, centered (or not) RGB pixel values.
In our work, we use specific features which are sparsely distributed in the image plane. We call
them "sparse features". These features are computed on the basis of domain knowledge for our
task of distinguishing architectural styles of historical buildings in Mexico.

In the following section, we present the design of DCNN for the classification of architectural
images to detect the style of Mexican buildings.

3 DCNN for Mexican buildings style classification

In the design of our Deep CNN, we start from the selection of the input data which are most suitable
for our classification task. Usually, the CNN's input consists of raw images; for example, most of the
literature uses RGB channels18,34. The particularity of our approach consists in adding, to the RGB
channels, specific engineered features which we build for our classification task on the basis of
architectural knowledge of the buildings. Besides, we also compare our approach with a
representation based on PCA-SIFT features.

3.1 Sparse features for classification of architectural style in Mexican cultural images

The distinction between the pre-hispanic, colonial, and modern styles of buildings in Mexican cultural
archives can be made with expert-designed features. Indeed, pre-hispanic buildings are
characterized by a strong presence of corners and slanted lines (Mexican pyramids), while colonial-
style buildings contain more curvilinear shapes. Modern buildings have more vertical
geometric structures. Therefore, after an analysis of architectural structures, we selected some
basic features and interest points that are easy to locate; these features can be useful to increase
the accuracy of the CNN owing to the important reference characteristics of each architectural
style.
Thus, in our data selection, we use specific sparse features, but we also concatenate them to the
primary RGB channels as shown in Fig. 2 below. Indeed, such a task-dependent "data enrichment"
proved to be efficient for the prediction of areas of interest in images39. A multidimensional array is
built to add the sparse features to each image at the training and test phases.
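As an illustrative sketch (the ordering of the planes is our own assumption; the paper does not fix it), the input volume can be assembled by stacking the three RGB planes and the four sparse-feature planes:

```python
import numpy as np

def build_input_volume(rgb, corners, lines, intersections, relations):
    """rgb: (256, 256, 3) image; the four sparse-feature maps: (256, 256) each.
    Returns the 256 x 256 x 7 volume fed to the CNN (RGB + sparse features)."""
    sf = np.stack([corners, lines, intersections, relations], axis=-1)
    return np.concatenate([rgb.astype(np.float32), sf.astype(np.float32)], axis=-1)
```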

Fig. 2 Multidimensional array for CNN feeding, (a) RGB planes and (b) sparse features.

Sparse Features (SF) are image characteristics present sparsely in the spatial domain. We consider
four types of sparse features: i) corner points, ii) lines, iii) line intersections, and iv) corner-line
relations. These characteristics are easy to relate to architectural structures because several
robust reference points can be seen in the image.

Therefore, we believe that adding sparse features to the input improves the detection of
discrepancies between classes, because this information is highly related to specific points/regions
of the buildings. The CNN receives extra information, in other words, a hint to learn deep
features. The four types of features are computed as follows.

Corner detection. One layer of the input data is the response of the Harris40 corner filter, without
the non-maximum suppression step. This feature was selected because a map with variations in
the region where a corner is located is richer than a binary corner neighborhood. The corner zone
is a critical point of the building structure, and the corner and its neighborhood represent a robust
area of the image for identifying structures. The parameters of the Harris corner detector were those
the authors claim as the best choice: a sensitivity parameter k = 0.04 and an observation window of
3 × 3 pixels.
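A minimal OpenCV sketch of this plane, under the assumption that cv2.cornerHarris with a 3 × 3 neighborhood and k = 0.04 matches the configuration described above (the raw response is kept, i.e., no non-maximum suppression):

```python
import cv2
import numpy as np

def corner_response_plane(image_bgr):
    """Harris corner response map (no non-maximum suppression), used as one SF plane."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    return cv2.cornerHarris(gray, blockSize=3, ksize=3, k=0.04)
```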

Lines detection. After Canny41 edge detection, the line map is calculated by combining the Hough
transform42 with Bresenham's43 line algorithm to draw a straight line between two points. To compute
the edge map with Canny we set the thresholds heuristically43, depending on the input data. In the
Hough transform algorithm, an important parameter is the angle step. We applied quite a fine
discretization to the angle variable: the angle step was 0.1 radians. The binary plane of straight-line
segments traced by Bresenham's method is used as a mask to extract the pixel values of the gray-scale
image and then to compute the gradient magnitude along the gray-scale lines. The combination of the
three methods represents the variations of contrast on the building borders.
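An illustrative OpenCV sketch of this plane, using the probabilistic Hough transform as a stand-in for the standard Hough + Bresenham tracing described above (cv2.line itself rasterizes segments with Bresenham's algorithm); the Canny and Hough thresholds here are placeholders, not the authors' values:

```python
import cv2
import numpy as np

def line_response_plane(image_bgr, canny_lo=50, canny_hi=150):
    """Gradient magnitude of the gray-scale image, masked by detected straight lines."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, canny_lo, canny_hi)
    # angle step of 0.1 rad, as in the text; the other Hough parameters are assumptions
    segments = cv2.HoughLinesP(edges, rho=1, theta=0.1, threshold=80,
                               minLineLength=30, maxLineGap=5)
    mask = np.zeros_like(gray)
    if segments is not None:
        for x1, y1, x2, y2 in segments[:, 0]:
            cv2.line(mask, (x1, y1), (x2, y2), 255, 1)   # Bresenham rasterization
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    magnitude = cv2.magnitude(gx, gy)
    return np.where(mask > 0, magnitude, 0.0)            # keep gradient only on the lines
```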

Line intersections. In order to obtain robust information related to specific points of the buildings,
the line intersections are calculated. Only the intersections located inside the image region are
taken into account. These intersections are necessary to compute the relation between corners and
lines.

Corner-line relation. With the purpose of representing local points in images, we define a rule to
set a relation between the corners and lines detected in the input image. The coincidence of an
intersection of two lines with a nearby corner in the image plane is taken as a feature, and such
coincidences are treated as local points of interest. The existence of those points in a neighborhood
centered on the intersection represents the predominant relation of a structure's section. Figure 3
illustrates this idea. Lines, corners, and the intersections are required to compute the relation
response.

The response magnitude of a detected relation at (x, y) is computed with Eq. 5 using the
Euclidean distance d of each relation found, as in Fig. 3:

R(x, y) = e^{-d(x, y)}        (5)

where d is the distance between the corner and the intersection and R is the response output map.

Fig. 3 Graphical description of a detected relation between corners and lines: (a) maximum response and (b) weak
relation due to the distance d.

Figure 3 shows that there are two cases for a detected relation between a corner and a line intersection:
(a) the maximum response, given by Eq. 5 when d = 0, and (b) the exponential response with d > 0
inside the window. The second case represents a weak relation due to the separation between the
corner and the computed intersection.
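A hedged sketch of this response map, under our own assumptions about the neighborhood search (a fixed window radius and the nearest detected corner per intersection; the paper does not fix these details):

```python
import numpy as np

def relation_response_plane(shape, corners_xy, intersections_xy, radius=5):
    """R(x, y) = exp(-d) at each line intersection, where d is the distance to the
    closest detected corner within a (2*radius+1)-pixel window; 0 elsewhere."""
    R = np.zeros(shape, dtype=np.float32)
    corners = np.asarray(corners_xy, dtype=np.float32)
    for x, y in intersections_xy:
        if len(corners) == 0:
            break
        d = np.min(np.hypot(corners[:, 0] - x, corners[:, 1] - y))
        if d <= radius:                                   # relation only inside the window
            R[int(round(y)), int(round(x))] = np.exp(-d)  # Eq. 5
    return R
```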

The examples of computed sparse features for representative images of three considered
architectural style classes are shown in Fig. 4. One can notice the presence of different orientations
of structures in the three different classes.

Fig. 4 Sparse features for pre-hispanic, colonial and modern buildings.

3.2 PCA-SIFT Sparse Features

Sparse features as additional data are an interesting way to enrich the CNN's input at the
training step in order to obtain a robust neural model. The SIFT44 descriptor is an interest point
detector which is capable of representing points robustly in terms of scale and illumination variations.
This method gives several key points, and associated with each key point is a vector (descriptor)
of 128 elements.
In order to compare the SIFT descriptor in a CNN with our sparse features, we performed a
descriptor reduction from 128 to 4 elements. As shown in Fig. 5, we created 4 images with the
principal components of the descriptors; for better visibility of the features, the black tone in the
images corresponds to zero values and the features are depicted in red. The reduction was
computed with the PCA45 method.
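A minimal sketch of such a reduction, assuming OpenCV's SIFT implementation and scikit-learn's PCA; back-projecting the 4 principal components onto 4 image planes at the keypoint locations follows our reading of the text:

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def pca_sift_planes(image_bgr, n_components=4):
    """Four image planes holding the first principal components of SIFT descriptors."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = cv2.SIFT_create().detectAndCompute(gray, None)
    planes = np.zeros((n_components,) + gray.shape, dtype=np.float32)
    if descriptors is None or len(keypoints) < n_components:
        return planes
    reduced = PCA(n_components=n_components).fit_transform(descriptors)   # 128 -> 4
    h, w = gray.shape
    for kp, vec in zip(keypoints, reduced):
        x = min(int(round(kp.pt[0])), w - 1)
        y = min(int(round(kp.pt[1])), h - 1)
        planes[:, y, x] = vec    # deposit the 4 components at the keypoint location
    return planes
```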

Fig. 5 PCA-SIFT features for pre-hispanic, colonial and modern buildings.

3.3 Design of the Architecture of a Deep CNN

The design of Deep CNN architectures depends on the complexity of the visual scenes to classify.
As we stated in Section 2, very deep architectures yielding higher accuracy have recently been
proposed. Nevertheless, there is also a trade-off between efficiency, complexity, and stability.
Here we hypothesize that, by adding engineered features, we can reduce the number of layers in
the network without decreasing the accuracy. Hence, instead of the popular network from
Krizhevsky18 with six convolutional layers, we propose only four. As shown in Fig. 6, the proposed
CNN architecture consists of four convolutional layers and two densely connected layers. To
introduce the images to the convolutional layers, each image is resized to 256 × 256 pixels,
irrespective of its initial height and width. The first convolutional layer filters the 256 × 256 × 7
data volume with 96 kernels of 11 × 11 and a 4-pixel stride. Then, the output of the first convolutional
layer is pooled and response-normalized to feed the second convolutional layer, which filters it with
256 kernels of 9 × 9 and a stride of 2 pixels. The third and the fourth convolutional layers each use
384 kernels, of size 5 × 5 and 3 × 3, respectively, with a stride of one pixel. At the end, two fully
connected layers with 1000 and 4 outputs, respectively, are included.
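The authors implement the network in Caffe; purely as an illustrative sketch (our own PyTorch-style rendering, with pooling/normalization placement and padding chosen by us since the text does not specify them), the layer stack can be written as:

```python
import torch
import torch.nn as nn

class MexCultureNet(nn.Module):
    """Sketch of the four-convolution architecture for a 256 x 256 x 7 input volume."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(7, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=9, stride=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=5, stride=1, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(384 * 2 * 2, 1000), nn.ReLU(inplace=True),  # flattened size under the pooling choices above
            nn.Linear(1000, num_classes),                          # 4-class output
        )

    def forward(self, x):                                          # x: (N, 7, 256, 256)
        return self.classifier(self.features(x))
```

An input batch would then be a tensor of shape (N, 7, 256, 256) built from the RGB + SF volumes of Section 3.1.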

Fig. 6 Diagram of the CNN's architecture, composed of four convolutional layers, three pooling layers, two
normalization layers and two fully-connected layers at the end.

3.4 Details of learning

The network optimization is accomplished by stochastic gradient descent18,46 with a momentum
coefficient μ of 0.9 and a batch size of 256 samples. Here we followed the parameter settings of
Krizhevsky's18 work. The algorithm updates the parameters W of the objective L(W_t) with the rule:

V_{t+1} = \mu V_t - \alpha \nabla L(W_t)        (6)

W_{t+1} = W_t + V_{t+1}        (7)

where,
W_t denotes the parameters at iteration t,
V_t denotes the momentum (update) term, and
α denotes the learning rate, which is initialized to 0.01.

The training is done on a GPU (NVIDIA K40m) only with the images from the Mex-Culture
database; pre-trained models are not used in this work. Instead of traditional tanh or sigmoid
neurons, we use Rectified Linear Units18,47 (ReLUs); ReLUs are applied to every convolutional
layer and after the first fully-connected layer. We initialized the weights of each layer from a
zero-mean Gaussian distribution with a standard deviation of 0.01. We initialized the biases of all
layers to the constant 0.
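Assuming the PyTorch-style MexCultureNet sketch given in Section 3.3 (the authors' actual setup is a Caffe solver), the same update rule and initialization could be expressed as:

```python
import torch
import torch.nn as nn

net = MexCultureNet(num_classes=4)      # model sketched in Sec. 3.3

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)   # zero-mean Gaussian, std 0.01
        nn.init.constant_(m.bias, 0.0)                  # biases start at 0

net.apply(init_weights)

# Eqs. 6-7 with momentum 0.9, base learning rate 0.01 and weight decay 0.0005
optimizer = torch.optim.SGD(net.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)
```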

4 Experiments

For the classification task we use a dataset with real-world images. The construction of a
training/validation dataset was necessary to train our CNN in this context.

4.1 The Dataset

We created a new dataset that includes images of the most important architectural structures of
Mexico, divided into three classes: pre-hispanic, colonial, and modern. The images were extracted
from videos obtained from Mexican cultural institutions. As shown in Fig. 7, the dataset contains
an important variety of images with different types of content in terms of quality and building
perspective, and also a fourth class, other, with images of non-architectural content.

This class was added due to the necessity of discarding images that do not contain buildings in the
scene. In a real-life scenario, human errors during the annotation process are possible, specifically if
the images are extracted from video documents. Our database consolidation process48 provides a
collaborative annotation environment, but even with an independent cross-checking of annotated
images we cannot avoid human errors.

The image resolution was fixed to 256 × 256 pixels for the CNN's input. During the image selection
task, the quality of the image and the quality of the content are the most important criteria
for an image to be included in a class of our taxonomy48. Training and validation datasets are
formed through a collaborative manual image classification.

The current dataset consists of 16,000 images divided into training/validation sets and split into four
classes. The counts are shown in Table 1, with the columns depicting the classes of our
taxonomy and the rows corresponding to the type of set. Examples of images from the Mexican
buildings dataset are presented in Fig. 7.

Table 1 Original images per class in the dataset.

Type         Pre-hispanic   Colonial   Modern   Other   Total
Training     2,000          2,000      2,000    2,000   8,000
Validation   2,000          2,000      2,000    2,000   8,000

Fig. 7 Images in the architectural dataset.

4.2 Data augmentation

In this work we used two types of transformations, denoted by T, for data augmentation. With the
aim of enriching the variability of each class in the training dataset, several transformations are
generated from each original image. The sparse features are computed on all images, both original
and transformed.

First, we use rotations of −10°, −5°, 5°, and 10°; the images were rotated with respect to their
centers in the image plane, clockwise or counterclockwise for positive and negative angles,
respectively. Most videos are created by common users and there is a high probability of video
degradation by movements; commonly, recording devices are not attached to a tripod or to an
accessory that provides good stability for recording without chattering or rotations. Second, we
perform mirror flip transformations, i.e., vertical, horizontal, and both. Thus, including the four
rotations, we have seven transformations for each image.
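An illustrative OpenCV sketch of these seven label-preserving transformations (rotation implementation details, e.g., border handling, are our own assumptions):

```python
import cv2

def augment(image):
    """Return the seven transformed versions of an image: four rotations and three flips."""
    h, w = image.shape[:2]
    out = []
    for angle in (-10, -5, 5, 10):                       # rotations about the image center
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        out.append(cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REFLECT))
    for flip_code in (0, 1, -1):                         # vertical, horizontal, both
        out.append(cv2.flip(image, flip_code))
    return out
```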

As in public datasets of natural images such as ImageNet49 or the Oxford buildings dataset50, we keep
the classes of our database balanced.
We use transformations to increase the number of training samples in every class. Table 2
shows the number of images per class after data augmentation.

Table 2 Augmented dataset: both original images and their transformations are present.

Type         Pre-hispanic   Colonial   Modern   Other    Total
Training     16,000         16,000     16,000   16,000   64,000
Validation   2,000          2,000      2,000    2,000    8,000

This augmented dataset contains both original and transformed images: each class has 2,000 original
images, so we generate 14,000 transformed images per class. For the other class, we likewise take the
2,000 images and create their transformations. We do not include transformed images in the
validation dataset.

4.3 Experiments and results

In this section we present the classification experiments and results obtained with the CNN described
in Section 3.3 and the CNN presented in Krizhevsky's work18. In order to analyze the effect of
concatenating sparse features in the input volume, we evaluated the performance of the CNN
architectures with two input data configurations: a training dataset of RGB images with
transformations, and the same dataset with sparse features. The addition of sparse features increases
the training volume and the depth of each training sample. The numbers of images for training and
validation are shown in Table 3.

Table 3 Images in the training and validation datasets; transformations were used to increase the data in the training dataset.

Data type                    Pre-hispanic   Colonial   Modern   Other
Original training images     2,000          2,000      2,000    2,000
Transformed images           14,000         14,000     14,000   14,000
Total (training)             16,000         16,000     16,000   16,000
Original validation images   2,000          2,000      2,000    2,000
Total (validation)           2,000          2,000      2,000    2,000

Training details. The training parameters were the same for all experiments. The learning rate
(lr) was decreased automatically by steps (every 1,000 iterations), starting from 0.1, as a function of
Eq. 8:

lr = e^{-\gamma i}        (8)

where γ indicates how much the learning rate should change when we reach the next step of the
training and i is the iteration number.
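A hedged sketch of this step-wise exponential decay (the value of γ is not given in the paper, so the one below is a placeholder, and the exact Caffe learning-rate policy used by the authors may differ):

```python
import math

def learning_rate(iteration, base_lr=0.1, gamma=1e-4, step=1000):
    """Learning rate updated every `step` iterations following lr = base_lr * exp(-gamma * i)."""
    i = (iteration // step) * step      # the rate changes only at step boundaries
    return base_lr * math.exp(-gamma * i)
```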

This prevents skipping the optimum, which would result in a decrease of accuracy. For the 64,000
images in the training dataset we need to pass 250 batches of 256 images to complete an epoch; an
epoch corresponds to the processing of the whole training dataset, and a single batch corresponds to
an iteration. After every two training epochs, a validation of the model was performed. The
maximum number of iterations was fixed experimentally to 20,000. To control large weights, we set a
weight decay regularization with the value 0.0005, as in Krizhevsky's18 work. The implementation
was done with the Caffe framework46.

In order to assess the effectiveness of our task-dependent sparse features in the framework of
Deep Learning, we use three different configurations as input to the CNN. As shown in Table 4,
we use: RGB pixel values, which is the standard approach in the state-of-the-art18,46; our Sparse
Features concatenated with the RGB primary values, as described in Section 3.1; and finally the most
popular sparse features, SIFT descriptors, reduced by Principal Component Analysis to the
dimension (4) of our sparse features and also concatenated with the RGB values. Transformed
images are included in all trainings; the validation dataset contains only original
images, as shown in Table 3. Trainings were performed with the same training parameters but
with different CNN architectures: that of Krizhevsky et al.18 and our architecture, which is
less deep.

The accuracy (computed on the validation dataset) is reported in Table 4 below. For training
3, we reached 88.01%, our best result, obtained with the proposed architecture. One can see that the
result is stable after relatively few thousands of iterations. As shown in Fig. 8, we fixed the number of
iterations at 20,000, which is still much lower than the hundreds of thousands reported in the
referenced literature.

In the last experiment, we compared Krizhevsky's architecture and our reduced-layer architecture.
Recall that Krizhevsky's18 network consists of six convolutional layers and two densely connected
layers at the end, while our architecture contains fewer layers: only four convolutional layers
instead of six. In this training we reached an accuracy of 53.28%, similar to that of our architecture
of four convolutional layers and two fully-connected layers trained in the fifth experiment
(see Table 4).

Table 4 Results of style recognition on the validation dataset.

Experiment   CNN            DB conf.     Iterations   Accuracy   Training time
1            Proposed       RGB          20,000       86.10%     10 hr 42 min
2            Krizhevsky18   RGB          20,000       74.24%     11 hr 45 min
3            Proposed       RGB + SF     20,000       88.01%     10 hr 40 min
4            Krizhevsky18   RGB + SF     20,000       68.70%     11 hr 40 min
5            Proposed       RGB + SIFT   20,000       65.00%     10 hr 31 min
6            Krizhevsky18   RGB + SIFT   20,000       53.28%     10 hr 41 min

Here it is important to note that the addition of sparse features increases the volume of information
related to each sample in the dataset but not the number of samples. Instead, transformations
increase the number of images in the dataset.

Table 5 shows the class-confusion matrix of the training with the best performance; it is very useful
for analyzing how each class is learned over the training phase. On the main diagonal, the
percentage of correct predictions for each class can be seen individually. Each class presents some
confusion with the others due to the inter-class similarity between some images in the
database. In pre-hispanic images we can see a lot of nature content, sky, or vegetation, as in the other
class, and this kind of content is also present in other-class images without pre-hispanic architecture.
This explains the strongest confusion being with the other class (see the first row of the table). The
ideal behavior of our classifier would be the discrimination of architectures from the other class.
Table 5 Class-confusion matrix, which gives an overall accuracy of 88.01%.

                                 Predicted label
Real label       Pre-hispanic   Colonial   Modern    Other
Pre-hispanic     89.16%         01.17%     01.02%    08.65%
Colonial         02.34%         87.95%     04.24%    05.47%
Modern           04.33%         06.57%     83.39%    05.72%
Other            01.11%         01.11%     02.78%    91.96%

Fig. 8 Accuracy of training 3; a different color is used for each class.

5 Discussion and future work

In this paper we have trained a network to classify images of the architectural style of Mexican
buildings into three different classes: pre-hispanic, colonial, and modern. We also detect the absence
of architectural structures by classifying such images as the other type of content. The proposed CNN
architecture demonstrates that the depth of a network is strongly related to the amount of
training data and the complexity of the problem. We also proposed the use of sparse features to
increase the information and introduce specific geometrical features into the training of the
network, improving the results.
Our relatively shallow network with only four convolutional layers achieves a better accuracy,
88.01%, than Krizhevsky's network with six convolutional layers, which yields 68.70% with the same
engineered sparse features.

The use of sparse features together with the primary RGB values of pixels allows for a slight increase
of accuracy of 1.91%. We also show that our sparse features largely outperform PCA-SIFT features
of the same dimension: 88.01% vs. 65.00%. We also found that the accuracy could be
improved by adding images to the training dataset. We proposed a data augmentation scheme to enrich
the classes in a real-world situation with a poor dataset.

Additionally, we showed that the accuracy of the CNN depends directly on the quality of the
training data content and the ground-truth annotation. The dataset has to be pure, and labeling
errors must be minimized to improve the quality of the results. Nevertheless, to take human errors
into account, we included the other class in our classification problem.

The perspectives of this work are numerous. Indeed, we have experimented with a kind of "early fusion"
approach in which the network ingests different kinds of features: primary or task-dependent
engineered features. Hence, other fusion strategies, at the upper layers of the network, remain to be
explored. Furthermore, a detailed analysis of the learned convolutional filters and feature maps has to
be done in order to add neuron inhibition mechanisms highlighting the most important features in each
layer. We are also open to adding specific sparse features, such as shape descriptors, in the input
layers of the network.

References

1. Digital Single Market. Project factsheets - Digital Single Market - European Commission.
https://ec.europa.eu/digital-single-market/en/node/77423?page=1 (Accessed 30 Jun.
2016).
2. Axes-project.eu. AXES. http://www.axes-project.eu/ (Accessed 30 Jun. 2016).
3. Isearch-project.eu. I-SEARCH | A unIfied framework for multimodal content SEARCH.
http://www.isearch-project.eu/isearch/ (Accessed 30 Jun. 2016).
4. CultAR. CultAR Platform. http://www.cultar.eu/overview/cultar-platform/ (Accessed 30
Jun. 2016).
5. Digital meets Culture. DECIPHER. http://www.digitalmeetsculture.net/heritage-
showcases/decipher/decipher/ (Accessed 30 Jun. 2016).
6. Europeana.eu. Europeana Collections. http://www.europeana.eu (Accessed 30 Jun. 2016).
7. Presious.eu. The PRESIOUS Project | Presious. http://www.presious.eu (Accessed 30 Jun.
2016).
8. Cultura-strep.eu. CULTivating Understanding and Research through Adaptivity.
https://www.cultura-strep.eu (Accessed 30 Jun. 2016).
9. Paths-project.eu. Personalized access to cultural heritage spaces. http://www.paths-
project.eu/ (Accessed 30 Jun. 2016)
10. Prestoprime.org. PrestoPRIME. http://www.prestoprime.org/ (Accessed 30 Jun. 2016).
11. I-treasures.eu. i-Treasures | Capturing the intangible. http://i-treasures.eu/ (Accessed 30
Jun. 2016).
12. Eagle-network.eu. Eagle portal. http://www.eagle-network.eu/ (Accessed 30 Jun. 2016).
13. Tagcloudproject.eu. TAG CLOUD project. http://www.tagcloudproject.eu/ (Accessed 30
Jun. 2016)
14. S. Alletto, D. Abati, G. Serra and R. Cucchiara, “Exploring Architectural Details through
a Wearable Egocentric Vision Device,” in Sensors 2016, 16(2) (2016).
15. D. Picard, P. H. Gosselin and M. Gaspard, “Challenges in Content-based Image Indexing
of Cultural Heritage Collections,” in IEEE Signal Process. Mag. 32(4), pp. 95-102 (2015).
16. Conacyt.mx. Horizon2020. http://www.conacyt.mx/pci/index.php/pe/programa-marco-de-
investigacion-y-desarrollo-tecnologico-de-la-union-europea/horizon-2020 (Accessed 30
Jun. 2016).
17. Agence-nationale-recherche-fr. Projet MEX-CULTURE. http://www.agence-nationale-
recherche.fr/?Projet=ANR-11-IS02-0001 (Accessed 30 Jun. 2016).
18. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep
convolutional neural networks,” in Advances in neural information processing systems, pp.
1097-1105 (2012).
19. A. Salvador, M. Zeppelzauer, D. Manchon-Vizuete, A. Calafell and X. Giro-i-Nieto,
“Cultural Event recognition with visual ConvNets and temporal models,” in CVPR
Workshops 2015, pp. 36-44 (2015) [10.1109/CVPRW.2015.7301334].
20. M. Liu, X. Liu, Y. Li, X. Chen, A. G. Hauptmann, S. Shan, “Exploiting Feature
Hierarchies with Convolutional Neural Networks for Cultural Event Recognition,” in
ICCV Workshops 2015, pp. 274-279 (2015) [10.1109/ICCVW.2015.44].
21. G. Shalunts, Y. Haxhimusa and R. Sablatnig, “Architectural Style Classification of
Building Facade Windows,” in Advances in Visual Computing, pp. 280–289 (2011).
22. W.-T. Chu and M.-H. Tsai, “Visual Pattern Discovery for Architecture Image Classification
and Product Image Search,” in Proceedings of the 2nd ACM International Conference on
Multimedia Retrieval, New York, NY, USA (2012) [10.1145/2324796.2324831].
23. B. Zhang, Y. Song, S. Guan and Y. Zhang, “Historic Chinese Architectures Image
Retrieval by SVM and Pyramid Histogram of Oriented Gradients Features,” in
International Journal of Soft Computing, pp. 19-28 (2010) [10.3923/ijscomp.2010.19.28].
24. M. Y. Yang and W. Förstner, “Regionwise Classification of Building Facade Images,” in
Photogrammetric Image Analysis, U. Stilla, F. Rottensteiner, H. Mayer, B. Jutzi, and M.
Butenuth, Eds., Springer Berlin Heidelberg, pp. 209–220 (2011).
25. C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. A. Efros, “What makes Paris look like
Paris?,” in Commun. ACM, 58(12), pp. 103–110 (2015).
26. P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus and Y. LeCun, “OverFeat:
Integrated Recognition, Localization and Detection using Convolutional Networks,” in
International Conference on Learning Representations (ICLR2014), CBLS,
(OpenReview), (2014).

27. Convolutional Neural Networks (LeNet) | DeepLearning 0.1 Documentation.
http://deeplearning.net/tutorial/lenet.html (Accessed 29 Sept. 2016).
28. R. B. Girshick, J. Donahue, T. Darrell, J. Malik, “Region-based convolutional networks
for accurate object detection and segmentation,” in IEEE Trans. Pattern Anal. Mach.
Intell. 38(1), pp. 142-158 (2016) [10.1109/TPAMI.2015.2437384].
29. P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural
networks applied to visual document analysis,” in Seventh International Conference on
Document Analysis and Recognition, pp. 958–963 (2003)
[10.1109/ICDAR.2003.1227801].
30. G. Hu, Y. Yang, D. Yi, J. Kittler, W. Christmas, S. Z. Li, and T. Hospedales, “When Face
Recognition Meets with Deep Learning: An Evaluation of Convolutional Neural Networks
for Face Recognition,” in 2015 IEEE International Conference on Computer Vision
Workshop (ICCVW), pp. 384–392 (2015) [10.1109/ICCVW.2015.58].
31. M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and Transferring Mid-level Image
Representations Using Convolutional Neural Networks,” in 2014 IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1717–1724 (2014)
[10.1109/CVPR.2014.222].
32. K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep Fisher Networks for Large-Scale Image
Classification,” in Advances in Neural Information Processing Systems 26, pp. 163-171 (2013).
33. J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural
networks?,” in Advances in Neural Information Processing Systems 27, Curran Associates, Inc.,
pp. 3320–3328 (2014).
34. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-Scale
Video Classification with Convolutional Neural Networks,” in 2014 IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
[10.1109/CVPR.2014.223].
35. Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe, “Deep Convolutional Ranking for
Multilabel Image Annotation,” ArXiv (2013).
36. Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yan-Tao Zheng,
"NUS-WIDE: A Real-World Web Image Database from National University of
Singapore," in ACM International Conference on Image and Video Retrieval, Greece
(2009) [10.1145/1646396.1646452].
37. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich, “Going Deeper With Convolutions,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1-9 (2015)
[10.1109/CVPR.2015.7298594].
38. M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Reading Text in the Wild with
Convolutional Neural Networks”, in International Journal of Computer Vision, 116(1), pp.
1-20 (2016) [10.1007/s11263-015-0823-z].

39. S. Chaabouni, J. Benois-Pineau, O. Hadar, Ch. Ben Amar, “Deep Learning for Saliency
Prediction in Natural Video,” arXiv:1604.08010 (2016).
40. C. Harris and M. Stephens, “A combined corner and edge detector,” in Alvey vision
conference, 15, p. 50 (1988) [10.5244/C.2.23].
41. J. Canny, “A computational approach to edge detection,” in Pattern Anal. Mach. Intell.
IEEE Trans. On, 6 , pp. 679–698 (1986).
42. D. H. Ballard, “Generalizing the Hough transform to detect arbitrary shapes,” in Pattern
Recognition, 13(2), pp. 111–122 (1981) [10.1016/0031-3203(81)90009-1].
43. Mathworks.com. Find edges in intensity images | MATLAB EDGE.
http://www.mathworks.com/help/images/ref/edge.html#buo5g3w-6 (Accessed 29 Sept.
2016).
44. D. G. Lowe, “Object recognition from local scale-invariant features,” in The Proceedings
of the Seventh IEEE International Conference on Computer Vision, 1999, 2, pp. 1150–
1157 (1999).
45. Mathworks.com. Principal component analysis of row data | MATLAB PCA.
https://www.mathworks.com/help/stats/pca.html (Accessed 29 Sept. 2016).
46. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T.
Darrell, “Caffe: Convolutional Architecture for Fast Feature Embedding,” in Proceedings
of the 22nd ACM International Conference on Multimedia, pp. 675–678 (2014)
[10.1145/2647868.2654889].
47. V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann
machines,” in Proc. 27th International Conference on Machine Learning, pp. 807-814
(2010).
48. A. Montoya Obeso, L. A. Oropesa Morales, L. Fernando Vázquez, S. I. Cocolán Almeda,
A. Stoian, M. S. García Vázquez, L. M. Z. Fuentes, J. Y. Montiel Perez, S. de la O Torres,
and A. A. Ramírez Acosta, “Annotations of Mexican bullfighting videos for semantic index,”
in Optics and Photonics for Information Processing IX, 9598, 959815 (2015).
49. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, “ImageNet: A Large-Scale
Hierarchical Image Database,” in IEEE Computer Vision and Pattern Recognition (CVPR)
(2009) [10.1109/CVPR.2009.5206848].
50. J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large
vocabularies and fast spatial matching,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (2007) [10.1109/CVPR.2007.383172].


