Email: firstname.lastname@uni-oldenburg.de
Abstract—Automatic photo assessment is an emerging research field with many useful real-world applications. Due to recent advances in deep learning, very promising approaches have appeared in the last years. However, the proposed solutions are adapted and optimized for 'isolated' datasets, making it hard to understand the relationship between them and to benefit from their complementary information. Following a unifying approach, we propose in this paper a learning model that integrates the knowledge from different datasets.

We conduct a study based on three representative benchmark datasets for photo assessment. Instead of developing a specific model for each dataset, we design and sequentially adapt a single model, which we name UNNA. UNNA consists of a deep convolutional neural network that predicts three kinds of aesthetic information for a given image: technical quality, high-level semantic quality, and a detailed description of photographic rules. Due to the sequential adaptation, which exploits the common features between the chosen datasets, UNNA achieves performance comparable to state-of-the-art solutions with effectively fewer parameters. The final architecture of UNNA also gives some interesting indications of the kinds of shared features as well as the individual aspects of the considered datasets.

I. INTRODUCTION

The recognition of high-level semantics of images, such as object detection, emotion recognition, and aesthetics assessment, has gained more and more interest as part of image retrieval systems in the last years [1], [2], [3]. Aesthetics assessment, which deals with the automatic judgment of the beauty of images, is and remains a complex task. The complexity emerges from the philosophical aspects of aesthetics in the world of art in general, involving social, cultural, and personal issues. This makes it especially hard to define standard metrics for measuring the beauty of photographic images. One approach to deal1 with this complexity is to follow a data-driven approach, which proposes to acquire a large number of human judgments and to use machine learning methods to learn the mean rating values from the generated data. In the literature, one can find different studies that follow this approach with promising results (see, e.g., [4], [5], [6]). However, the proposed solutions are adapted and optimized for 'isolated' datasets, making it hard to understand the relationship between them and to benefit from their complementary information. Following a unifying approach, we propose in this paper a learning model that integrates the knowledge from three benchmark datasets.

Figure 1 illustrates the benefit of such a network, with which we can simultaneously predict the technical quality, the degree of high-level aesthetics, and detailed photographic aesthetic rule information.

The approach followed in this paper is based on ideas from transfer learning. When training convolutional neural networks on large and different datasets, it has been observed that the first layers of such networks have something in common: the learned features appear to be general and not specific to a given dataset. Transfer learning proposes to exploit this phenomenon and to re-employ, or transfer, such trained features from one convolutional neural network to another one.

1 We deal with the complexity of aesthetics in order to find a technical metric that learns from a chosen group of people how they judge the beauty of images. How representative such a group is for a given task is a challenging research question.

Fig. 1: An example of an application of UNNA to four images of the same objects with different shooting setups. The images are ordered (from Rank 1 to Rank 4) based on the values of the high-level output. The cells with three possible colors indicate the degree to which the corresponding photographic aesthetic rule is followed: green for a high, white for a neutral, and red for a low degree. The technical output (Tech.) is used to determine pixel-level noise. For example, the image ranked in position 3 (Rank 3) has a low resolution; therefore, it was correctly predicted to have low technical quality. (Attributes shown per image: Balancing Element, Color Harmony, Content, Depth of Field, Light, Object, Rule of Thirds, Vivid Color.)

978-1-5386-7021-7/18/$31.00 © 2018 IEEE
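The transfer-learning idea described above, that early-layer features are general enough to be reused across tasks, can be illustrated with a toy sketch. All names and the "features" themselves are invented for illustration and are not part of UNNA:

```python
# Toy illustration of feature reuse: one shared front end, two task heads.
# The 'features' (brightness, contrast of a pixel list) are deliberately trivial.

def extract_features(image):
    """Stand-in for the general early layers shared by both networks."""
    n = len(image)
    mean = sum(image) / n
    contrast = (sum((p - mean) ** 2 for p in image) / n) ** 0.5
    return (mean, contrast)

def technical_head(features):
    """Task-specific block: classify technical quality from shared features."""
    _, contrast = features
    return "high" if contrast > 0.1 else "low"

def aesthetic_head(features):
    """A second task-specific block reusing the *same* shared features."""
    mean, _ = features
    return "bright" if mean > 0.5 else "dark"

image = [0.9, 0.8, 0.6, 0.2, 0.7, 0.4]
shared = extract_features(image)   # computed once
print(technical_head(shared))      # prints "high"
print(aesthetic_head(shared))      # prints "bright"
```

The point of the sketch is the sharing: the feature extractor is computed once and feeds several task-specific heads, which is exactly how the common and specific blocks of the unified network are organized.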
Due to the generality of such features, both networks can have different target datasets and tasks. In the proposed unification, the common features are used in order to reduce the number of layers of the whole network. That is, the resulting network is composed of blocks that are common to the underlying datasets and blocks that are specific to them. We demonstrate that it is possible to develop a unified network with a large number of common parts, reducing in that way the total number of parameters while still reaching state-of-the-art performance.

The rest of this paper is organized as follows. Section II gives an overview of the three datasets AVA, AADB, and TID2013. In Section III, we introduce the development steps that we followed in order to construct the unified solution. Section IV evaluates the performance of the developed solution and compares it with state-of-the-art solutions. The paper ends with a conclusion.

II. DATASETS

In a data-driven approach, the quality of the chosen dataset (in our case, of the set of chosen datasets) dramatically influences the quality of the result. This section gives a short description of the datasets employed for training.

A. TID2013

TID2013 (Tampere Image Database 2013) [7] is a database originally created for evaluating image quality assessment metrics against human perception. Based on 25 reference images, the images in the database are generated by means of 24 types of distortions with 5 intensity levels each, leading to 3,000 images in total. Each distorted image has a label that corresponds to the "mean opinion score", which is calculated from 985 experiments with different human raters from different countries. Each experiment corresponds to nine comparisons. In each comparison, the task was to choose the better of two distorted images; the preferred image then receives one point. The winning points are summed over the nine comparisons and averaged over all experiments, leading to a mean score between zero and nine for each image. Figure 2 illustrates a reference image together with two distorted images generated using different levels of Gaussian blurring and their corresponding mean scores.

B. AVA

AVA (Aesthetic Visual Analysis) [4] is a large-scale dataset for aesthetic visual analysis that contains about 250,000 images. The images were originally collected from the social network www.dpchallenge.com, where they were voted on by the underlying community of amateur and professional photographers in response to different photographic challenges. A vote corresponds to a score between 1 and 10. The number of votes per image ranges from 78 to 549, with an average of 210 votes. Each image is thus associated with a distribution of ratings. Based on this distribution, it is possible to calculate the mean assessment and the standard deviation over the raters. See Figure 3 for three example images from the AVA dataset together with their corresponding score distributions and the calculated mean values.

C. AADB

AADB (Aesthetics and Attributes Database) contains real-scene images collected from Flickr. The collected images are annotated by means of Amazon Mechanical Turk (AMT)2. The annotation contains eleven scores, each corresponding to one of the following attributes: interesting content, object emphasis, good lighting, color harmony, vivid color, shallow depth of field, motion blur, rule of thirds, balancing element, repetition, and symmetry. The attributes have been specified in consultation with professional photographers. The AADB dataset contains 10,000 labeled images in total. Figure 4 illustrates three examples from the AADB dataset representing three aesthetic quality categories.

2 www.mturk.com

III. ADAPTATION STEPS

This section describes in detail the four steps followed in order to obtain the unified network. The first step consists of choosing an adequate initial network architecture as well as adequate initial weights. The three remaining steps correspond to the sequential adaptation of the network on the three datasets AVA, TID2013, and AADB. Following a transfer learning approach, we start with the largest aesthetic dataset, AVA; as reported in Section IV, the order thereafter is irrelevant. The adaptation consists of the sequential extension of the architecture of the initial network as well as training on the respective dataset. In each step, we had to make decisions concerning the hyperparameters to be chosen. The decisions taken are based on a combination of knowledge gained from the literature, analysis of the underlying datasets, and empirical experiments. One important criterion that we follow during the whole design and adaptation process is to find a good balance between the size and the performance of the resulting network.

A. Initial Network

In the underlying study, we employ a variant of the efficient convolutional neural network class MobileNet [8] as the starting architecture. MobileNet is composed of 28 layers. The first layer corresponds to a standard full convolution with batch normalization [9] followed by a rectified linear activation (ReLU) [10]; the next 26 layers each consist of a depthwise separable convolution (dsc) operation as introduced in [11]. A dsc itself consists of two kinds of convolution operations: a depthwise convolution followed by a pointwise convolution. The channels of the input are filtered independently using kernels of size 3×3. The pointwise convolution then performs a 1×1 convolution combining the outputs of the depthwise convolution. That is, the operations behind a dsc can be interpreted as first filtering and then combining the inputs into a new set of outputs. It was shown that this kind of convolution is efficient compared to regular convolutions, where both operations, i.e., filtering and combining, are performed simultaneously.
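The efficiency of depthwise separable convolutions can be made concrete by counting parameters. A minimal sketch (ignoring biases; the channel sizes are arbitrary examples, not taken from the paper):

```python
# Parameter counts of a standard kxk convolution versus a depthwise separable
# convolution (kxk depthwise followed by 1x1 pointwise), biases ignored.

def standard_conv_params(c_in, c_out, k=3):
    # A full convolution mixes all input channels in every kxk filter.
    return k * k * c_in * c_out

def dsc_params(c_in, c_out, k=3):
    depthwise = k * k * c_in   # one kxk filter per input channel, applied independently
    pointwise = c_in * c_out   # 1x1 convolution that combines the filtered channels
    return depthwise + pointwise

c_in, c_out = 128, 128
print(standard_conv_params(c_in, c_out))  # 147456
print(dsc_params(c_in, c_out))            # 17536
```

For 128 input and output channels the factorized form needs roughly 8x fewer parameters than a standard 3×3 convolution, which is the saving the MobileNet-style layers above build on.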
Fig. 2: Example of image distortions in the TID2013 dataset. The left image is a reference image; the images in the middle and on the right are generated using Gaussian blur of low and high standard deviation, respectively.
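The TID2013 scoring scheme, in which an image earns one point per won pairwise comparison in each nine-comparison experiment and the points are averaged over all experiments, can be sketched as follows. The win counts below are invented for illustration:

```python
# TID2013-style mean opinion score: points won out of nine pairwise
# comparisons per experiment, averaged over all experiments (range 0..9).

def mean_opinion_score(wins_per_experiment):
    """wins_per_experiment: one entry per experiment, each the number of
    comparisons (out of nine) that the image won."""
    assert all(0 <= w <= 9 for w in wins_per_experiment)
    return sum(wins_per_experiment) / len(wins_per_experiment)

print(mean_opinion_score([9, 8, 9, 7]))  # 8.25 - a mildly distorted image
print(mean_opinion_score([1, 0, 2, 1]))  # 1.0  - a heavily distorted image
```

The averaging explains why the labels are continuous values between zero and nine rather than integer vote counts.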
Fig. 3: Example of images from the AVA dataset representing three categories: high, middle, and low aesthetic quality. Each image is labeled with a distribution of 10 elements representing how many people chose each score in [1 . . . 10].
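Deriving the mean assessment and standard deviation from an AVA-style rating distribution, as described in Section II-B, amounts to treating the 10-bin vote histogram as a discrete distribution. The histogram below is invented for illustration (though its 210 votes match the AVA average):

```python
# Mean and standard deviation of an AVA-style vote histogram,
# where votes[i] is the number of votes for score i+1 (i = 0..9).

def mean_and_std(votes):
    total = sum(votes)
    mean = sum((s + 1) * v for s, v in enumerate(votes)) / total
    var = sum(v * ((s + 1) - mean) ** 2 for s, v in enumerate(votes)) / total
    return mean, var ** 0.5

votes = [0, 1, 2, 10, 40, 70, 50, 25, 10, 2]  # hypothetical histogram, 210 votes
mean, std = mean_and_std(votes)
print(round(mean, 2), round(std, 2))  # 6.32 1.32
```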
Fig. 4: Example of images from the AADB dataset representing three categories: high, middle, and low aesthetic quality. The left image has high scores for most attributes (green cells); the middle image can be considered an image of middle aesthetic quality, since most attributes have a middle score (white cells); the right image has many low scores (red cells) and therefore a low aesthetic quality. (Attributes shown: Balancing Element, Color Harmony, Content, Depth of Field, Light, Object, Rule of Thirds, Vivid Color.)
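The per-attribute labels illustrated above come from multiple AMT workers. The paper does not spell out the aggregation, so the following is a hypothetical sketch of averaging worker ratings per attribute; the score range [-1, 1] and all data are assumptions made for illustration:

```python
# Hypothetical aggregation of AADB-style attribute annotations:
# each AMT worker rates all eleven attributes; a per-attribute mean is taken.

ATTRIBUTES = [
    "interesting content", "object emphasis", "good lighting", "color harmony",
    "vivid color", "shallow depth of field", "motion blur", "rule of thirds",
    "balancing element", "repetition", "symmetry",
]

def aggregate(worker_ratings):
    """worker_ratings: list of dicts, one per worker,
    each mapping attribute name -> assumed score in [-1, 1]."""
    return {a: sum(w[a] for w in worker_ratings) / len(worker_ratings)
            for a in ATTRIBUTES}

workers = [
    {a: 1.0 for a in ATTRIBUTES},   # enthusiastic worker
    {a: 0.0 for a in ATTRIBUTES},   # neutral worker
    {a: -0.5 for a in ATTRIBUTES},  # critical worker
]
scores = aggregate(workers)
print(scores["color harmony"])  # (1.0 + 0.0 - 0.5) / 3
```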
We employ a model that has been trained on the ILSVRC-2012-CLS (ILSVRC) image classification dataset [12]. The model expects normalized images of size 224 × 224 × 3. A tail of the initial network is used as feature generator, so that its output is employed as input for the next adaptation steps. The decision to use a feature generator learned from a very large dataset, namely ILSVRC, is based on a future-oriented design: in addition to aesthetics assessment, we would like to integrate other prediction capabilities into our network, such as emotion and object recognition. Such a feature generator can be used as a common part for all these recognition targets. In this paper, we use the first five depthwise separable convolutions as feature generator. This size seems to be a good choice regarding the results discussed in Section IV.

B. AVA Adaptation

Starting from the output of the feature generator, the AVA part consists of 23 depthwise separable convolutions followed by a fully-connected layer with 10 neurons and a softmax activation. That is, the goal consists of learning the distribution of the scores as available in the AVA dataset. It was shown in recent works that considering the distributions of scores instead of the mean score has a positive impact

stage, we fine-tune the whole AVA part. The first stage allows us to explore different hyper-parameter combinations rapidly. Furthermore, the results obtained using two stages were more accurate than updating all the weights from scratch, i.e., using only the second stage.

Our architectural design for the AVA part has been guided by previous works from the literature. Precisely, the total number of 28 dsc, including the feature generator part, is based mainly on the results of the NIMA architecture [6], which shows the ability of such an architecture to incorporate the AVA dataset knowledge in an efficient way.
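The AVA head described above, ten fully-connected outputs with a softmax activation, produces a distribution over the scores 1 to 10, from which a predicted mean score can be derived. A minimal sketch; the logits are made-up values, not outputs of the trained network:

```python
# Softmax over ten logits yields a probability distribution over scores 1..10;
# the expected value of that distribution is the predicted mean score.
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predicted_mean(dist):
    # Expected score under the predicted distribution.
    return sum((i + 1) * p for i, p in enumerate(dist))

logits = [0.1, 0.3, 0.5, 1.2, 2.0, 2.4, 1.8, 1.0, 0.4, 0.2]  # hypothetical
dist = softmax(logits)
print(round(sum(dist), 6))            # 1.0 - a valid probability distribution
print(round(predicted_mean(dist), 2))
```

Keeping the full distribution rather than only its mean preserves the rater disagreement encoded in the AVA labels, which is the motivation stated above for this head design.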