
UNNA: A Unified Neural Network for Aesthetic Assessment


Larbi Abdenebaoui∗ , Benjamin Meyer∗ , Albert Bruns∗ and Susanne Boll†
∗ OFFIS Institute for Information Technology, Oldenburg, Germany
Email: firstname.lastname@offis.de
† University of Oldenburg, Oldenburg, Germany

Email: firstname.lastname@uni-oldenburg.de

Abstract—Automatic photo assessment is a highly emerging research field with a wide range of useful 'real-world' applications. Due to the recent advances in deep learning, one can observe very promising approaches in the last years. However, the proposed solutions are adapted and optimized for 'isolated' datasets, making it hard to understand the relationship between them and to benefit from their complementary information. Following a unifying approach, we propose in this paper a learning model that integrates the knowledge from different datasets.
We conduct a study based on three representative benchmark datasets for photo assessment. Instead of developing a specific model for each dataset, we design and adapt sequentially a unique model, which we name UNNA. UNNA consists of a deep convolutional neural network that predicts, for a given image, three kinds of aesthetic information: technical quality, high-level semantic quality, and a detailed description of photographic rules. Due to the sequential adaptation that exploits the common features between the chosen datasets, UNNA achieves performance comparable to the state-of-the-art solutions with effectively fewer parameters. The final architecture of UNNA gives us some interesting indications of the kind of shared features as well as the individual aspects of the considered datasets.

I. INTRODUCTION

The recognition of high-level semantics of images, such as object detection, emotion recognition, and aesthetics assessment, has gained more and more interest as part of image retrieval systems in recent years [1], [2], [3]. Among these tasks, aesthetics assessment, which deals with the automatic judgment of the beauty of images, is and remains a complex one. The complexity emerges from the philosophical aspects of aesthetics in the world of art in general, involving social, cultural, and personal issues. This makes it especially hard to define standard metrics for measuring the beauty of photographic images. One approach to deal¹ with this complexity is to follow a data-driven approach, which proposes to acquire a large number of human-performed judgments and use machine learning methods to learn the mean rating values from the generated data. In the literature, one can find different studies that follow this approach with promising results (see e.g., [4], [5], [6]). However, the proposed solutions are adapted and optimized for 'isolated' datasets, making it hard to understand the relationship between them and to benefit from the complementary information. Following a unifying approach, we propose in this paper a learning model that integrates the knowledge from three benchmark datasets.
Figure 1 illustrates the benefit of such a network, where we can predict simultaneously the technical quality, the degree of high-level aesthetics, and detailed photographic aesthetic rule information.

[Figure 1: four images of the same objects, labeled Rank 1 (Tech.: high), Rank 2 (Tech.: high), Rank 3 (Tech.: low), and Rank 4 (Tech.: high), each with colored cells for Balancing Element, Color Harmony, Content, Depth of Field, Light, Object, Rule of Thirds, and Vivid Color.]

Fig. 1: An example of an application of UNNA to four images of the same objects with different shooting setups. The images are ordered (from Rank 1 to Rank 4) based on the values of the high-level output. The cells with three possible colors indicate the degree to which the corresponding photographic aesthetic rule is followed: green for a high, white for a neutral, and red for a low degree. The technical output (Tech.) is used to determine pixel-level noise. For example, the image ranked in position 3 (Rank 3) has a low resolution; therefore, it was correctly predicted to have low technical quality.

The approach followed in this paper is based on ideas of transfer learning. When training convolutional neural networks on large and different datasets, it has been observed that the first layers of such networks have something in common: the learned features appear to be general and not specific to a given dataset. Transfer learning proposes to exploit this phenomenon and re-employ, or transfer, such trained features from one convolutional neural network to another. Due to the generality of such features, both networks can have different target datasets and tasks.

¹ We deal with the complexity of aesthetics in order to find a technical metric that learns from a chosen group of people how they judge the beauty of images. How representative such a group is for a given task is a challenging research question.
In the proposed unification, the common features are used in order to reduce the number of layers of the whole network. That is, the resulting network is composed of blocks that are either common or specific to the underlying datasets. We demonstrate that it is possible to develop a unified network with a large number of common parts, reducing in that way the total number of parameters while still achieving state-of-the-art performance.
The rest of this paper is organized as follows. Section II gives an overview of the three datasets AVA, AADB, and TID2013. In Section III, we introduce the development steps that we followed in order to construct the unified solution. Section IV evaluates the performance of the developed solution and compares it with state-of-the-art solutions. The paper ends with a conclusion.
II. DATASETS

In a data-driven approach, the quality of the chosen dataset, and in our case of the chosen set of datasets, dramatically influences the quality of the result. This section gives a short description of the datasets employed for training purposes.

A. TID2013

TID2013 (Tampere Image Database 2013) [7] is a database originally created for the evaluation of image quality assessment metrics against human perception. Based on 25 reference images, the images in the database are generated by means of 24 types of distortions with 5 intensity levels each, leading to 3000 images in total. Each distorted image carries a label that corresponds to the "mean opinion score", which is calculated based on 985 experiments with different human raters from different countries. Each experiment corresponds to nine comparisons. In each comparison, the task was to choose the better of two distorted images; the preferred image then receives one point. The winning points are summed up over the nine comparisons and averaged over all experiments, leading to a mean score between zero and nine for each image. Figure 2 illustrates a reference image together with two distorted images generated using different levels of Gaussian blurring and their corresponding mean scores.
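To make the scoring scheme above concrete, here is a minimal sketch with made-up numbers (not taken from the dataset):

    # Made-up example: points an image won in each of three experiments,
    # out of nine pairwise comparisons per experiment.
    points_per_experiment = [6, 7, 5]

    # The mean opinion score averages the winning points over all
    # experiments, yielding a value between zero and nine.
    mean_opinion_score = sum(points_per_experiment) / len(points_per_experiment)
    print(mean_opinion_score)  # 6.0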
B. AVA

AVA (Aesthetic Visual Analysis) [4] is a large-scale dataset for aesthetic visual analysis that contains about 250,000 images. The images were originally collected from the social network www.dpchallenge.com and were voted on by the underlying community of amateur and professional photographers as a response to different photographic challenges. A vote corresponds to a score between 1 and 10. The number of votes per image ranges from 78 to 549, with an average of 210 votes. Each image is thus associated with a distribution of ratings; based on this distribution, it is possible to calculate the mean assessment and the standard deviation over the raters. See Figure 3 for three example images from the AVA dataset together with their corresponding score distributions and the calculated mean values.

C. AADB

AADB (Aesthetics and Attributes Database) contains real-scene images collected from Flickr. The collected images are annotated by means of Amazon Mechanical Turk (AMT)². The annotation contains eleven scores, each corresponding to one of the following attributes: interesting content, object emphasis, good lighting, color harmony, vivid color, shallow depth of field, motion blur, rule of thirds, balancing element, repetition, and symmetry. The attributes have been specified in consultation with professional photographers. The AADB dataset contains 10,000 labeled images in total. Figure 4 illustrates three examples from the AADB dataset representing three aesthetic quality categories.

² www.mturk.com

III. ADAPTATION STEPS

This section describes in detail the four steps followed in order to obtain the unified network. The first step consists of choosing an adequate initial network architecture as well as adequate initial weights. The three remaining steps correspond to the sequential adaptation of the network to the three datasets AVA, TID2013, and AADB. Following a transfer learning approach, we start with the largest aesthetic dataset, AVA; as reported in Section IV, the order thereafter is irrelevant. The adaptation consists of the sequential extension of the architecture of the initial network as well as training on the respective dataset. In each step, we had to make decisions concerning the hyperparameters; the taken decisions are based on a combination of knowledge gained from the literature, analysis of the underlying datasets, and empirical experiments. One important criterion that we followed during the whole design and adaptation process is to find a good balance between the size and the performance of the resulting network.

A. Initial Network

In the underlying study, we employ a variant of the efficient convolutional neural network class Mobilenet [8] as the starting architecture. Mobilenet is composed of 28 layers. The first layer corresponds to a standard full convolution with batchnorm [9], followed by a rectified linear activation (ReLU) [10]; each of the next 26 layers consists of a depthwise separable convolution (dsc) operation as introduced in [11]. A dsc itself consists of two kinds of convolution operations: a depthwise convolution followed by a pointwise convolution. The channels of the input are filtered independently using kernels of size 3 × 3; the pointwise convolution then performs a 1 × 1 convolution combining the outputs of the depthwise convolution. That is, the operations behind a dsc can be interpreted as first filtering and then combining the inputs into a new set of outputs. It was shown that this kind of convolution is efficient compared to regular convolutions, where both operations, i.e., filtering and combining, are performed simultaneously.

Fig. 5: Illustration of a depthwise separable convolution block.

Assuming the input layer has size D_I × D_I × C and the output layer a desired size of D_O × D_O × F, in the first step each channel i ∈ [1...C] is filtered using a 3 × 3 depthwise convolution; in the second step the results for the different channels are combined with each other using F 1 × 1 convolutions.
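The filter-then-combine structure of a dsc can be made concrete in code. The following is a minimal sketch in Keras (assuming TensorFlow 2.x; this is illustrative, not the authors' implementation, and the stride and filter arguments are assumptions):

    import tensorflow as tf
    from tensorflow.keras import layers

    def dsc_block(x, filters, stride=1):
        """One depthwise separable convolution (dsc) block: a 3x3
        depthwise convolution filters each channel independently, then
        F pointwise 1x1 convolutions combine the filtered channels.
        Batchnorm and ReLU follow each convolution, as in Mobilenet [8]."""
        x = layers.DepthwiseConv2D(kernel_size=3, strides=stride,
                                   padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Conv2D(filters, kernel_size=1, padding="same",
                          use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        return layers.ReLU()(x)

For an input with C channels and F output channels, this block uses 9C + CF convolution weights instead of the 9CF of a regular 3 × 3 convolution, which is the source of the efficiency mentioned above.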
Fig. 2: Example of image distortion in the TID2013 dataset. The left image is a reference image; the images in the middle and on the right are generated using Gaussian blur of low and high standard deviation, respectively.

Fig. 3: Example of images from the AVA dataset representing three categories: high, middle, and low aesthetic quality. Each image is labeled with a distribution of 10 elements representing how many people chose each score in [1...10].

[Figure 4: three example images, each with colored attribute cells for Balancing Element, Color Harmony, Content, Depth of Field, Light, Object, Rule of Thirds, and Vivid Color.]

Fig. 4: Example of images from the AADB dataset, representing three categories: high, middle, and low aesthetic quality. The left image has high scores for most attributes (green cells); the middle image can be considered an image of middle aesthetic quality, since most attributes have a middle score (white cells); the right image has many low scores (red cells) and therefore a low aesthetic quality.

We employ a model that has been pretrained on the ILSVRC-2012-CLS (ILSVRC) image classification dataset [12]. The model expects normalized images of size 224 × 224 × 3. A tail of the initial network is used as a feature generator, so that its output is employed as input for the next adaptation steps. The decision to use a feature generator learned from a very large dataset, namely ILSVRC, is based on a future-oriented design: in addition to aesthetics assessment, we would like to integrate other prediction capabilities into our network, such as emotion and object recognition, and such a feature generator can be used as a common part for all these recognition targets. In this paper, we use the first five depthwise separable convolutions as the feature generator. This size seems to be a good choice regarding the results discussed in Section IV.

Fig. 6: The initial model corresponds to a pretrained Mobilenet with 28 depthwise separable convolutions (dsc) and a fully connected layer (FC). The feature-generator part corresponds to the first five dsc, i.e., dsc_i for i ∈ [1...5].
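A sketch of how such a truncated feature generator could be obtained from a pretrained Keras Mobilenet; the cut-layer name is an assumption based on the Keras naming scheme and should be adjusted if it differs:

    import tensorflow as tf

    # Mobilenet pretrained on ILSVRC-2012-CLS; expects normalized
    # 224 x 224 x 3 inputs.
    base = tf.keras.applications.MobileNet(input_shape=(224, 224, 3),
                                           weights="imagenet",
                                           include_top=False)

    # Keep only the tail up to the fifth dsc block as the shared
    # feature generator ("conv_pw_5_relu" is assumed to be the name
    # of that block's last layer in this Keras implementation).
    feature_generator = tf.keras.Model(
        inputs=base.input,
        outputs=base.get_layer("conv_pw_5_relu").output,
        name="feature_generator")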
B. AVA Adaptation

Starting from the output of the feature generator, the AVA part consists of 23 depthwise separable filters followed by a fully connected layer with 10 neurons, each having a softmax activation. That is, the goal consists of learning the distribution of the scores as available in the AVA dataset. It was shown in recent works that considering the distributions of scores instead of the mean score has a positive impact on the performance [6], [13], [14]. The loss function that delivers the best results is a simplified form of the normalized Earth Mover's Distance (EMD), as introduced in [15] and adapted for deep learning purposes in [16]. Given a predicted score distribution p = ⟨p_1, ..., p_10⟩ and a target score distribution t = ⟨t_1, ..., t_10⟩, the EMD loss is given as

    EMD(p, t) = Σ_{i=1}^{10} (CDF_i(p) − CDF_i(t))²,    (1)

where CDF_i is the cumulative density function of the i-th element of the underlying score distribution, given by CDF_i(x) = Σ_{k=1}^{i} x_k for x ∈ {p, t}. Besides minimizing the distance between two score distributions, an important advantage of the EMD loss function is that it also takes into consideration the order relationship between the elements of a score distribution.
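Equation (1) translates directly into a few lines of code; a minimal sketch of the loss for batched 10-bin score distributions (not the authors' exact implementation):

    import tensorflow as tf

    def emd_loss(t, p):
        """Simplified squared EMD of Eq. (1) between target score
        distributions t and predicted distributions p, both of shape
        (batch, 10)."""
        cdf_t = tf.cumsum(t, axis=-1)  # CDF_i(t) = t_1 + ... + t_i
        cdf_p = tf.cumsum(p, axis=-1)
        return tf.reduce_sum(tf.square(cdf_p - cdf_t), axis=-1)

Because the loss compares cumulative sums, predicted mass that lands close to the target score is penalized less than mass that lands far from it, which is exactly the order sensitivity discussed above.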
In order to train the network, we followed a two-stage procedure. In the first stage, we train only the fully connected layer, freezing the weights of all previous layers. In the second stage, we fine-tune the whole AVA part. The first stage allows us to explore rapidly different hyper-parameter combinations. Furthermore, the results obtained using two stages were more accurate than updating all the weights from scratch, i.e., using only the second stage.
Our architectural design for the AVA part has been guided by previous works from the literature. Precisely, the total number of 28 dsc, including the feature-generator part, is based mainly on the results of the NIMA architecture [6], which shows the ability of such an architecture to incorporate the knowledge of the AVA dataset in an efficient way.

Fig. 7: Illustration of the model design in step 1.
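The two-stage procedure described above can be expressed with Keras trainability flags. A hedged sketch, where model is assumed to be the feature generator plus the AVA part ending in the fully connected head, train_ds a dataset of (image, score distribution) pairs, emd_loss the loss sketched above, and all hyper-parameters illustrative:

    import tensorflow as tf

    # Stage 1: train only the fully connected head; freeze all
    # previous layers for a fast hyper-parameter search.
    for layer in model.layers[:-1]:
        layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=emd_loss)
    model.fit(train_ds, epochs=5)

    # Stage 2: unfreeze the AVA part (the dsc blocks after the
    # feature generator) and fine-tune it with a smaller step size;
    # recompiling applies the new trainable flags.
    for layer in model.layers[len(feature_generator.layers):]:
        layer.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=emd_loss)
    model.fit(train_ds, epochs=20)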

C. TID2013 Adaptation

For incorporating a technical assessment of a given image, we use the same initial network as described in Section III-A. Then, three depthwise separable convolution blocks are added to train the network on the TID2013 dataset (see Fig. 8). After these blocks, we perform a global average pooling. Finally, the output section is represented by a fully connected layer with one output that represents the TID2013 score. The network is then compiled using the mean squared error loss function.

Fig. 8: Illustration of the model design in step 2.

Evaluating different architectures: To assess which network architecture of the TID part is the most efficient regarding parameter size and accuracy, we conducted five experiments. These evaluations are especially necessary considering the large structure of the total network; keeping each part as simple as possible is therefore desirable. By individually changing the number of convolution blocks after the initial feature generator, we were able to shrink (or increase) the number of trainable parameters and investigate the effect on the accuracy. The network for each individual experiment was constructed and compiled as described before; only the number of dsc blocks after the feature generator was incremented as described above. For each architecture, the network was then trained over 70 epochs with 10 repetitions.
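Putting the pieces together, the TID part as described could look as follows, reusing the feature_generator and dsc_block sketches from above; the filter count is an assumption:

    from tensorflow.keras import layers, Model

    x = feature_generator.output
    for _ in range(3):                  # three dsc blocks for the TID part
        x = dsc_block(x, filters=256)   # filter count is illustrative
    x = layers.GlobalAveragePooling2D()(x)
    tid_score = layers.Dense(1, name="tid_score")(x)  # predicted mean opinion score

    tid_model = Model(feature_generator.input, tid_score)
    tid_model.compile(optimizer="adam", loss="mean_squared_error")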
Fig. 9: Experiments with different numbers of blocks in the TID part. Figure 9a shows the overall learning curves of the different architectures. Figure 9b depicts a narrowed-down view of the last 30 epochs.

Interim result: To measure which architecture provides the most efficient result, we can look at the learning curves depicted in Figure 9. We can see that using only one convolution block results in faster learning compared to using more blocks. However, the validation loss stagnates after about 30 epochs. The two-block architecture delivers the best results here, with a minimum loss of 0.254 on the validation subset.
D. AADB Adaptation

In this step, we build and train our last part, the AADB part. Like the previous parts, the AADB part consists of a set of dsc layers, followed by a fully connected layer that produces the desired outputs (aesthetic attributes). The prediction of the attributes is formulated as a regression problem with mean squared error as loss function.
Fixing the total number of dscs, including the common parts, to 28, we explored different architectural designs. Our final AADB part contains two dsc layers and gets its input from the layer AVA_21, as illustrated in Figure 10.

Fig. 10: Illustration of the model constructed in step 4.
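Branching off an intermediate layer of the AVA part can be sketched as follows; here ava21 stands for the output tensor of the layer denoted AVA_21 in Figure 10, the pooling mirrors the TID part, and the filter counts are assumptions:

    from tensorflow.keras import layers, Model

    x = ava21                        # output tensor of layer AVA_21 (assumed given)
    x = dsc_block(x, filters=512)    # two dsc layers for the AADB part;
    x = dsc_block(x, filters=512)    # filter counts are illustrative
    x = layers.GlobalAveragePooling2D()(x)
    # One linear output per AADB attribute score, trained as regression.
    attributes = layers.Dense(11, name="aadb_attributes")(x)

    aadb_model = Model(feature_generator.input, attributes)
    aadb_model.compile(optimizer="adam", loss="mean_squared_error")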
Figure 11 summarizes an experiment that explores the tradeoff between the number of dscs in the AADB part and the performance of the resulting network on the AADB dataset. For each size, the experiment is repeated 10 times and the average is reported. It can be seen that the larger the AADB part, the better the overall performance; however, three dsc layers seem to be a good choice, since from two dsc layers on, the difference between the performances is acceptable regarding the cost of adding more dscs.

Fig. 11: Experiments with different numbers of blocks in the AADB part. Figure 11a shows the overall learning curves of the different architectures. Figure 11b depicts a narrowed-down view of the last 10 epochs.

IV. EVALUATION AND DISCUSSION

In this section, we evaluate the UNNA model and compare the results with other works. In order to compare the performance of our network with individual solutions on the three datasets AVA, AADB, and TID2013, we report Spearman's rank correlation coefficient [17], denoted ρ, between the predicted scores and the ground-truth scores over the test data of each dataset. ρ measures the correlation between the rankings of predicted and ground-truth scores, independently of any monotonic transformation of the predicted scores.
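For reference, ρ as reported in the tables below can be computed with scipy; a small sketch with made-up numbers:

    from scipy.stats import spearmanr

    predicted    = [5.1, 6.3, 4.2, 7.8, 5.0]   # made-up model outputs
    ground_truth = [5.0, 6.0, 4.5, 8.1, 4.9]   # made-up mean ratings

    rho, p_value = spearmanr(predicted, ground_truth)
    # Both lists have identical rankings here, so rho == 1.0; any
    # monotonic rescaling of the predictions leaves rho unchanged.
    print(rho)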
TABLE I: AADB correlation results: Spearman's rank correlations for the AADB attributes. The scores marked with * are the best results for the considered attribute.

Attribute            Malu et al. [18]   Kong et al. [5]   Ours
Balancing Elements   0.220*             0.186             0.179
Content              0.508              0.584*            0.572
Color Harmony        0.471              0.475             0.480*
Depth of Field       0.479              0.495             0.496*
Light                0.443*             0.399             0.401
Object Emphasis      0.602              0.666*            0.652
Rule of Thirds       0.225              0.178             0.232*
Vivid Colors         0.648              0.681             0.685*

TABLE II: AVA and TID2013 correlation results: Spearman's rank correlations for the AVA and TID2013 mean scores. The scores marked with * are the best results for the considered dataset.

Dataset   NIMA (Mobilenet) [6]   NIMA (Inception-v2) [6]   Ours
AVA       0.518                  0.636*                    0.620
TID2013   0.698                  0.750                     0.902*

Tables II and I report our results as well as results from, to the best of our knowledge, state-of-the-art solutions that use the same metric on the three chosen datasets. For the experiments on the AVA and TID2013 test data, we compare our results with those of two variants of the NIMA architecture, namely Mobilenet and Inception-v2. For the AVA dataset, our model performs clearly better than the Mobilenet version; compared with the larger architecture of NIMA, however, our result is narrowly worse. Taking into consideration the fact that Inception-v2 is about eight times larger than our model, our result can be considered very close to the state-of-the-art solutions. For the TID2013 test data, our model performs clearly better than both NIMA architectures while using significantly fewer parameters; a Mobilenet is three times larger than the feature generator combined with the TID part. For the AADB dataset, our results are also very close to the considered works of Malu et al. [18] and Kong et al. [5], with a cheaper architecture.
Regarding these results, the final architecture of UNNA can be interpreted as follows. The features learned from a large dataset can be used as a starting feature generator for all three chosen datasets. The datasets AVA and AADB share a large number of characteristics; in contrast, the knowledge of TID2013 can be well represented by means of a relatively small network without a common part with AVA and AADB.

V. CONCLUSION

In this paper, we have introduced a convolutional neural network that incorporates the unification of three benchmark datasets using the ideas of transfer learning. We sequentially design and train a network with a reasonable tradeoff between the number of parameters and performance. The first employment of the developed solution is very promising: besides predicting the aesthetic quality, the network delivers a kind of reasoning based on predefined photographic rules and also reports the technical quality of the images. In future work, we intend to extend the network to recognize more semantic information such as emotions or objects. The underlying goal is to develop a solution able to predict personalized aesthetic tastes.

REFERENCES

[1] C. Cui, H. Fang, X. Deng, X. Nie, H. Dai, and Y. Yin, "Distribution-oriented Aesthetics Assessment for Image Search," Aug. 2017, pp. 1013–1016.
[2] S. Li, S. Purushotham, C. Chen, Y. Ren, and C.-C. J. Kuo, "Measuring and Predicting Tag Importance for Image Retrieval," CoRR, vol. abs/1602.08680, 2016.
[3] W. Wei-ning, Y. Ying-lin, and J. Sheng-ming, "Image Retrieval by Emotional Semantics: A Study of Emotional Space and Feature Extraction," in 2006 IEEE International Conference on Systems, Man and Cybernetics, vol. 4, Oct. 2006, pp. 3534–3539.
[4] N. Murray, L. Marchesotti, and F. Perronnin, "AVA: A large-scale database for aesthetic visual analysis," in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2012, pp. 2408–2415.
[5] S. Kong, X. Shen, Z. Lin, R. Mech, and C. Fowlkes, "Photo Aesthetics Ranking Network with Attributes and Content Adaptation," arXiv:1606.01621 [cs], Jun. 2016.
[6] H. T. Esfandarani and P. Milanfar, "NIMA: Neural Image Assessment," CoRR, vol. abs/1709.05424, 2017.
[7] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, and C.-C. Jay Kuo, "Image database TID2013: Peculiarities, results and perspectives," Signal Processing: Image Communication, vol. 30, pp. 57–77, Jan. 2015.
[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," CoRR, vol. abs/1704.04861, 2017.
[9] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.
[10] V. Nair and G. E. Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines," in Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
[11] L. Sifre and S. Mallat, "Rigid-motion scattering for image classification," Ph.D. dissertation, 2014.
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[13] Z. Wang, F. Dolcos, D. Beck, S. Chang, and T. S. Huang, "Brain-Inspired Deep Networks for Image Aesthetics Assessment," CoRR, vol. abs/1601.04155, 2016. [Online]. Available: http://arxiv.org/abs/1601.04155
[14] Y. L. Hii, J. See, M. Kairanbay, and L. K. Wong, "Multigap: Multi-pooled inception network with text augmentation for aesthetic prediction of photographs," in 2017 IEEE International Conference on Image Processing (ICIP), Sep. 2017, pp. 1722–1726.
[15] E. Levina and P. Bickel, "The Earth Mover's distance is the Mallows distance: some insights from statistics," in Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001), vol. 2, 2001, pp. 251–256.
[16] L. Hou, C.-P. Yu, and D. Samaras, "Squared Earth Mover's Distance-based Loss for Training Deep Neural Networks," CoRR, vol. abs/1611.05916, 2016.
[17] A. Szczepańska, "Research Design and Statistical Analysis, Third Edition by Jerome L. Myers, Arnold D. Well, Robert F. Lorch, Jr.," International Statistical Review, vol. 79, no. 3, pp. 491–492, Dec. 2011.
[18] G. Malu, R. S. Bapi, and B. Indurkhya, "Learning Photography Aesthetics with Deep CNNs," CoRR, vol. abs/1707.03981, 2017.
