Contents
1 Introduction and previous concepts
1.1 Motivation and objectives
1.2 Artificial Intelligence
1.2.1 History
1.3 Feedforward neural networks
1.3.1 Perceptrons
1.3.2 Perceptron output with step function
1.3.3 An example
1.3.4 Layers
1.3.5 Network training and sigmoid neurons
1.4 Learning with gradient descent
1.4.1 Cost function
1.4.2 Backpropagation algorithm equations
1.4.3 Backpropagation algorithm steps
1.5 Types of layers
1.5.1 Convolutional layers
1.5.2 An example
1.5.3 Pooling Layers
1.5.4 Rectifier linear units layers
1.5.5 Local Response Normalization layers
1.5.6 Softmax layers
1.6 Why are CNN so effective on image data?
1.7 Tensorflow
1.8 Programming model
2 Related work
2.1 Universal approximation of functions
2.2 Recurrent Neural Networks
2.3 Deep Belief Networks (DBNs)
2.4 Random forest
2.5 Deep Dream
2.6 Image Inpainting
3 Methodology
3.1 The dataset
3.2 Hardware setup
3.3 Implemented and customized programs
3.4 Transfer learning
3.4.1 Bottlenecks
4 Results
4.1 Single softmax layer network
4.2 Convolution
4.3 Convolution and pooling
4.4 Two convolutional and pooling layers
4.4.1 Normalization Layers
4.5 Image size augmentation
4.6 Overfitting
4.7 Data amount augmentation
4.8 Batch size
4.9 Extracted features
4.10 Retrain ImageNet Model
1
Introduction and previous concepts
1.1
Motivation and objectives
The idea of a computer able to acquire, process, analyze, and understand images at a human level, known as computer vision, has been a challenge for the past 40 years. Lately research in this field has advanced so fast that in some particular classification tasks computers perform as well as humans (e.g. MNIST digit classification, [8]). These improvements have been possible mainly thanks to new computing capabilities that allow us to process large quantities of data very fast, a phenomenon known as big data. Among other advances, an important one was the possibility of using models such as artificial neural networks (ANNs), which boosted the predictive quality of computer vision and other artificial intelligence applications.
The purpose of this work is to gain some insight into convolutional neural networks, a special kind of ANN: the reasons why they have their particular structure, and how well they classify a self-elaborated image data set. We first give a general context and introduction to ANNs and their parts, and then describe how to train a fully connected network with the gradient descent method. Then we check how essential every part of the network is by running a training algorithm with different setups. Along the way we introduce the recently launched library Tensorflow and discuss the results obtained.
1.2
Artificial Intelligence
Artificial neural networks are one of the trending research topics in artificial intelligence (AI). The simulation of intelligence is usually split into several main goals: deduction, knowledge representation, planning, natural language processing (communication), perception, motion and manipulation, and learning. Neural networks belong to the last one, more specifically called machine learning (ML). In that branch of AI we study algorithms that improve automatically through experience, by approximating functions according to some given data, usually called training data. That is why, when running ANN algorithms, we say that the network is learning or being trained.
Machine learning is split into supervised, unsupervised and reinforcement learning. In unsupervised learning the objective is to infer a function that describes hidden structure in unlabeled data, for example finding similarities between data points, clustering data or reducing its dimension. Examples of unsupervised learning are k-means, maximum likelihood estimation and principal component analysis. Supervised learning consists of using labeled training data to estimate a map that returns the right label when it receives new data following the same pattern as the training set. Convolutional neural networks are an example of supervised learning: for example, we classify images by their label according to a labeled training data set. In reinforcement learning, programs take actions in an environment and are rewarded for them, with the goal of maximizing some notion of cumulative reward.
There is disagreement about whether biological foundations are important for the further development of AI, as happens with ANNs, which are inspired by biological neurons, or whether it should be a completely independent research field, the same way bird biology contributes nothing to most ideas in aeronautical engineering. Until now AI research has been mostly statistical. In many specific AI tasks, e.g. recognizing a song, where a fingerprint of the audio frequencies is generated, a purely statistical method produces results similar to those a human would give. But are we then just doing classical statistics? We can find some differences between classical statistics and AI:
• The dimensions of the data. In classical statistics we have low-dimensional data sets, e.g. fewer than 100 dimensions; in AI we can have many more than that.
• In classical statistics we have a lot of noise in the data, which might make it difficult to find a structure; in AI the noise is not sufficient to hide the structure of the data when properly processed.
• In classical statistics there is not much structure in the data and, if there is, it can be represented with a simple model with few parameters; in AI the structure is too complicated to be represented with a simple model with few parameters.
• Usually the objective when doing classical statistics is to reveal a structure hidden by noise; in AI the objective is to get to present a complicated
1.2.1
History
The first scientific approach to artificial neural networks was made by Warren McCulloch and Walter Pitts [1] in 1943. Their objective was to formalize mathematically the behavior of brain neurons when we perform logical reasoning or read the inputs of our senses.
In 1949 neuroscientist Donald Hebb proposed a theory for the adaptation of the neurons in the brain during the learning process, which was named Hebbian learning after him. He already had the idea of weights in the connections between neurons, which appear in present artificial neural network (ANN) models. The model states that the weight between two neurons increases if the two neurons activate simultaneously, and decreases if they activate separately. This is not exactly the principle of convolutional neural networks (CNNs), but it was a start.
In 1948 Alan Turing suggested a model of computation called the unorganized machine, thinking of the cortex of a human infant, which is largely random initially but can be trained to perform particular tasks. In this model Turing defined A-type machines, which consisted of randomly connected networks of NAND logic gates, and B-type machines, which were built by taking A-type machines and substituting inter-node connections with structures called connection modifiers, which were made of A-type nodes. These connection modifiers were supposed to undergo appropriate interference, mimicking education.
Frank Rosenblatt first created an electronic device called the perceptron in 1960, which represented a neuron as a logical gate with weights and a bias. The utility of the model, though, could not be observed due to the lack of computing resources, which is why the idea of neural networks did not become popular until recent times.
It was in 1975 that Paul Werbos [4] thought of applying the backpropagation algorithm to find an optimal solution for the parameters of a neural network, which greatly improved the problem-solving capability of neural networks. After this advance much research was again done in this field, and the quality of the predictions gradually improved up to the present, when ANNs are the state of the art for image recognition, achieving human-level precision and replacing methods such as support vector machines or random forests. This happened in the ImageNet 2012 contest, where the winning team in image classification used a convolutional neural network [2] that stood out from all other methods. Next we introduce the components of a present-day convolutional neural network.
1.3
Feedforward neural networks
1.3.1
Perceptrons
An artificial neural network is a graph whose nodes are a special kind of logical gates called perceptrons or artificial neurons. They have parameters that allow their behavior to change, so that patterns can be recognized in data sets.
They are called neural networks because the original model was supposed to be inspired by animal neurons, which have the
1.3.2
Perceptron output with step function
1.3.3
An example
input vector, and $z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$ is the weighted input to the activation function for neuron j in layer l.
1.3.4
Layers
[Figure: diagram of the network, with the hidden layers and the output layer labeled]
The first layer is called the input layer; it receives the information that has to be processed by the network (in our case image pixel intensities). Next come the hidden layers. In this example picture we show only two hidden layers, but there are networks with, for example, 12 hidden layers. Hidden layers process the outputs of the input layer and pass the result on to the output layer, which gives the final result.
1.3.5
Network training and sigmoid neurons
Our purpose is to get the network to give the output we want for a given input. For this we use a method called network training or learning. The idea of training the network is to give it a large amount of input data together with the expected outputs, and to adapt the network parameters to fit these expected outputs as well as possible. Say we want the network to recognize whether or not a face appears in a picture. Then we give as input many pictures that contain a face and many pictures that don't, together with the label of each image telling whether a face actually appears. For every picture we gradually modify the weights and biases of our network so that the output matches the label more and more often. To do that we use the gradient descent method, which we explain below. After training, the network should be able to recognize with high accuracy whether there is a face in a new input picture. The perceptrons described above are a simple, intuitive neuron model that was developed into sigmoid neurons, which we define below. The output function of a sigmoid neuron is not the one described in (1), since with a step function a small change in the weights and biases can flip the output of a neuron completely, so the behavior of the network would be chaotic when modifying them. We therefore introduce sigmoid neurons, whose output a is computed with the sigmoid function, a smooth version of the step function:
$$\sigma(z) = \frac{1}{1 + e^{-z}} \tag{2}$$
The reason why such a function is chosen for the perceptron output is its smooth shape and the property that makes it similar to the step function: for inputs multiplied by weights that are much greater than the bias (i.e. $z \to \infty$) the output is close to 1, and equivalently for $z \to -\infty$ the output is close to 0. The important difference from the step function is that now, when we slightly change the weights and biases of a perceptron, the output also changes only slightly, because the sigmoid function is continuous and smooth. This allows us to search, using the gradient descent method, for the weights and biases of each perceptron that give the targeted output for a given input. Putting together the definitions of neural network and sigmoid neuron, the activation of the jth neuron in the lth layer is
$$a_j^l = \sigma\Big(\sum_k w_{jk}^l a_k^{l-1} + b_j^l\Big) \tag{3}$$
where $w_{jk}^l$ is the weight applied to the activation of the kth neuron of the (l-1)th layer on its way into the jth neuron of the lth layer, and $b_j^l$ is the bias of the jth neuron in the lth layer.
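As an illustration of the sigmoid neuron and of the weighted input $z_j^l$, the following minimal Python sketch computes the output of one layer. The function names (`sigmoid`, `layer_forward`) and the numeric values are assumptions chosen for the example, not part of the original work.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(weights, biases, prev_activations):
    """Return (z, a) for one layer.

    weights[j][k] is the weight from neuron k of the previous layer
    into neuron j of this layer; biases[j] is the bias of neuron j.
    """
    zs = [sum(w * a for w, a in zip(row, prev_activations)) + b
          for row, b in zip(weights, biases)]
    activations = [sigmoid(z) for z in zs]
    return zs, activations

# Hypothetical layer: 3 inputs feeding 2 sigmoid neurons.
w = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.5]]
b = [0.0, -1.0]
x = [1.0, 2.0, 4.0]
z, a = layer_forward(w, b, x)
print(z)  # weighted inputs of the two neurons
print(a)  # their activations, each strictly between 0 and 1
```

Stacking calls to `layer_forward`, feeding each layer's activations into the next, gives the feedforward pass of a fully connected network.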
1.4
Learning with gradient descent
1.4.1
Cost function
We denote the target or desired output of the network for input x by y(x), and the neuron output by a(x, w, b) = a(z). The desired output does not depend on the weights and biases of the neurons, but the neuron output does. We want the training algorithm to determine which weights and biases bring the outputs a(x, w, b) closest to y(x) for all inputs x.
We define a cost function (also called loss function) as a measure of how well a neural network with weights and biases w and b fits a target y(x). First we have the quadratic cost function:
$$C(w, b) = \frac{1}{2n} \sum_x \big(a(x, w, b) - y(x)\big)^2 \tag{4}$$
where n is the number of training inputs and the sum is over each input x. In the quadratic cost function we can easily observe the main properties that any cost function should have:
• It is non-negative over the whole domain of the function.
• The more the outputs of the network differ from the labels, the higher the value taken by the function.
• The cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples x.
• It can be written as a function of the outputs of the neural network.
We will from now on write a instead of a(z) and y instead of y(x) to ease the reading.
Cost functions are defined with the objective of finding weights w and biases b such that the output a matches the target y as often as possible or, equivalently, of finding a minimum of the function C by varying w and b. We could use an analytic method, setting the gradient equal to zero to find the local minima and checking which one is the lowest. But the number of variables is going to be very large and the shape of the function tends to be quite complicated, so that method would be too costly and we would probably not get close to the real minimum. That is why we use the gradient descent method instead.
The gradient descent method consists of gradually getting closer to a local or absolute minimum $(w_0, b_0)$ of the function by subtracting the gradient scaled by a small value called the learning rate. It is based on the fact that, for a scalar function of several variables, the gradient vector indicates the direction of maximum growth, so the opposite vector indicates the direction of maximum descent. In each step of the gradient descent method the weights and biases therefore change as follows:
$$(w, b)_{n+1} = (w, b)_n - \eta \nabla C_x(w, b) \tag{5}$$
where w is the vector of weights, b the biases, $\eta$ the learning rate, x a fixed input and $C_x(w, b)$ the cost function.
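The update rule (5) can be sketched in a few lines of Python. To keep the example checkable by hand it minimizes a toy bowl-shaped cost with a known minimum at (w, b) = (3, -1) rather than the cost of a real network; the toy cost, the learning rate and the function names are all assumptions made for this illustration.

```python
def gradient_descent(grad, params, learning_rate, steps):
    """Repeat the step params <- params - learning_rate * grad(params)."""
    params = list(params)
    for _ in range(steps):
        g = grad(params)
        params = [p - learning_rate * gi for p, gi in zip(params, g)]
    return params

def toy_cost(params):
    """Hypothetical cost (w - 3)^2 + (b + 1)^2, minimal at w = 3, b = -1."""
    w, b = params
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

def toy_grad(params):
    """Gradient of toy_cost with respect to (w, b)."""
    w, b = params
    return [2.0 * (w - 3.0), 2.0 * (b + 1.0)]

w, b = gradient_descent(toy_grad, [0.0, 0.0], learning_rate=0.1, steps=200)
print(w, b)  # close to the minimum (3, -1)
```

With this toy cost each step shrinks the distance to the minimum by a constant factor, which is why a small learning rate still converges quickly; for a real network the gradient has to be obtained with backpropagation.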
1.4.2
Backpropagation algorithm equations
Now the question is how to calculate the gradient of this cost function, for which we do not even know a concrete expression. The answer is to use the backpropagation algorithm. Before describing it we need to define a couple of equations.
We define the error of a neuron j in layer l as the variation of the cost function with respect to the weighted input plus bias of that neuron:
$$\delta_j^l := \frac{\partial C}{\partial z_j^l} \tag{6}$$
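Claim 1. Let L be the number of layers of a neural network and j one of the neurons of its last layer. Then the neuron error satisfies
$$\delta_j^L = \frac{\partial C}{\partial a_j^L} \sigma'(z_j^L) \tag{8}$$
or, in matrix form,
$$\delta^L = \nabla_a C \odot \sigma'(z^L)$$
where $\odot$ denotes the elementwise product.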
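Proof. Let us first apply the definition of the error for a neuron in layer L:
$$\delta_j^L = \frac{\partial C}{\partial z_j^L} \tag{9}$$
We have to develop the right-hand side of this equality until we reach the right-hand side of equation (8). If we differentiate the cost function applying the chain rule, taking into account that the activations $a_k^L$ of the neurons in layer L depend on $z_j^L$, we get the intermediate step
$$\delta_j^L = \sum_k \frac{\partial C}{\partial a_k^L} \frac{\partial a_k^L}{\partial z_j^L} \tag{10}$$
where the sum is over all the neurons in the output layer. The activation of a neuron in a given layer depends only on the input received by that same neuron, not on the other neurons of the layer, so the term $\partial a_k^L / \partial z_j^L$ is equal to zero when $k \neq j$. In consequence we have
$$\delta_j^L = \frac{\partial C}{\partial a_j^L} \frac{\partial a_j^L}{\partial z_j^L} \tag{11}$$
Since $a_j^L = \sigma(z_j^L)$, the factor $\partial a_j^L / \partial z_j^L$ equals $\sigma'(z_j^L)$, which gives (8).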
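Claim 2. The errors of two consecutive neural network layers are related by the following equality:
$$\delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l) \tag{12}$$
Proof. Applying the chain rule to the definition of the error, now through the weighted inputs of layer l+1, we get
\begin{align}
\delta_j^l &= \frac{\partial C}{\partial z_j^l} \tag{13} \\
&= \sum_k \frac{\partial C}{\partial z_k^{l+1}} \frac{\partial z_k^{l+1}}{\partial z_j^l} \tag{14} \\
&= \sum_k \frac{\partial z_k^{l+1}}{\partial z_j^l} \delta_k^{l+1} \tag{15}
\end{align}
The weighted inputs of layer l+1 depend on those of layer l through
$$z_k^{l+1} = \sum_j w_{kj}^{l+1} \sigma(z_j^l) + b_k^{l+1} \tag{16}$$
If we differentiate we get
$$\frac{\partial z_k^{l+1}}{\partial z_j^l} = w_{kj}^{l+1} \sigma'(z_j^l) \tag{17}$$
and substituting back into the previous expression we get
$$\delta_j^l = \sum_k w_{kj}^{l+1} \delta_k^{l+1} \sigma'(z_j^l) \tag{18}$$
Claim 3. The components of the gradient of the cost function are given by the neuron errors:
$$\frac{\partial C}{\partial b_j^l} = \delta_j^l, \qquad \frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l \tag{19}$$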
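Proof. For the first equality we differentiate the expression $z_j^l = w_j^l (x_j^l)^T + b_j^l$ (equivalently $z_j^l = \sum_k w_{jk}^l a_k^{l-1} + b_j^l$) with respect to $b_j^l$:
\begin{align}
\frac{\partial C}{\partial b_j^l} &= \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial b_j^l} \tag{20} \\
&= \frac{\partial C}{\partial z_j^l} \cdot 1 = \delta_j^l \tag{21}
\end{align}
For the second equality we differentiate with respect to $w_{jk}^l$:
\begin{align}
\frac{\partial C}{\partial w_{jk}^l} &= \frac{\partial C}{\partial z_j^l} \frac{\partial z_j^l}{\partial w_{jk}^l} \tag{22} \\
&= \delta_j^l \frac{\partial z_j^l}{\partial w_{jk}^l} \tag{23} \\
&= \delta_j^l \frac{\partial}{\partial w_{jk}^l} \Big( \sum_{k'} w_{jk'}^l a_{k'}^{l-1} + b_j^l \Big) \tag{24} \\
&= a_k^{l-1} \delta_j^l \tag{25}
\end{align}
1.4.3
Backpropagation algorithm steps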
Now that we have shown all the necessary equations, we can list the steps of the backpropagation algorithm to calculate the gradient of the cost function. We denote by $a^{x,l} = (a_1^l, ..., a_n^l)$ the vector of neuron activations in layer l when x is the input, and by $\delta^{x,l}$ the corresponding vector of errors.
1. Input: For all neurons in the input layer, set the neuron values $a^1$ to the corresponding pixel intensities of the example image.
2. Feedforward: For each layer $l \in \{2, ..., L\}$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
3. Output error $\delta^L$: Calculate the output error $\delta^L = \nabla_a C \odot \sigma'(z^L)$.
4. Backpropagation: After calculating the error of the last layer we backpropagate it to the first layers: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$ for l in $\{L-1, ..., 2\}$.
5. Output gradient components: We compute the components of the gradient of the cost function as given in claim 3:
$$\frac{\partial C}{\partial b_j^l} = \delta_j^l; \qquad \frac{\partial C}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l$$
This process is done for all examples x in a given subset of the training set, usually called a batch, of size m, and then the weights and biases are updated (gradient descent step):
$$w^l \to w^l - \frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T \tag{26}$$
$$b^l \to b^l - \frac{\eta}{m} \sum_x \delta^{x,l} \tag{27}$$
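Notice that the output error is very simple to calculate in the case of a quadratic cost function:
$$\delta^L = \nabla_a C \odot \sigma'(z^L) = (a^L - y) \odot \sigma(z^L) \odot (1 - \sigma(z^L)) \tag{28}$$
since
$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{1 + e^{-z} - 1}{1 + e^{-z}} = \sigma(z)(1 - \sigma(z)) \tag{29}$$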
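The backpropagation steps can be made concrete with a small pure-Python sketch for a fully connected sigmoid network with quadratic cost and a single training example. It is only an illustration under assumptions: the tiny 2-3-1 network, its weights and the function names are hypothetical. The last lines compare one backpropagated gradient component with a finite-difference estimate, which is a common sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # equation (29)

def feedforward(weights, biases, x):
    """Return weighted inputs zs and activations (activations[0] is x)."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        z = [sum(wjk * ak for wjk, ak in zip(row, activations[-1])) + bj
             for row, bj in zip(w, b)]
        zs.append(z)
        activations.append([sigmoid(zj) for zj in z])
    return zs, activations

def cost(weights, biases, x, y):
    """Quadratic cost for one example: 0.5 * sum_j (a_j - y_j)^2."""
    _, activations = feedforward(weights, biases, x)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(activations[-1], y))

def backprop(weights, biases, x, y):
    """Return (grad_w, grad_b), shaped like weights and biases."""
    zs, activations = feedforward(weights, biases, x)
    # Output error, equation (28): (a^L - y) * sigma'(z^L)
    delta = [(a - t) * sigmoid_prime(z)
             for a, t, z in zip(activations[-1], y, zs[-1])]
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    for l in range(len(weights) - 1, -1, -1):
        grad_b[l] = delta                                  # dC/db_j = delta_j
        grad_w[l] = [[d * a for a in activations[l]]       # dC/dw_jk = a_k * delta_j
                     for d in delta]
        if l > 0:
            # Equation (12): propagate the error one layer back.
            delta = [sum(weights[l][k][j] * delta[k] for k in range(len(delta)))
                     * sigmoid_prime(zs[l - 1][j])
                     for j in range(len(zs[l - 1]))]
    return grad_w, grad_b

# Hypothetical 2-3-1 network and one training example.
weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],
           [[0.3, -0.1, 0.2]]]
biases = [[0.0, 0.1, -0.1], [0.05]]
x, y = [0.5, -1.0], [1.0]
grad_w, grad_b = backprop(weights, biases, x, y)

# Finite-difference estimate of dC/dw for one weight of the first layer.
eps = 1e-6
weights[0][1][0] += eps
c_plus = cost(weights, biases, x, y)
weights[0][1][0] -= 2 * eps
c_minus = cost(weights, biases, x, y)
weights[0][1][0] += eps
numeric = (c_plus - c_minus) / (2 * eps)
print(numeric, grad_w[0][1][0])  # the two values should agree closely
```

Averaging these per-example gradients over a batch and subtracting them scaled by the learning rate gives the updates (26) and (27).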
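\begin{align}
\frac{\partial C}{\partial w_j} &= -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \frac{\partial \sigma}{\partial w_j} \tag{31} \\
&= -\frac{1}{n} \sum_x \left( \frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)} \right) \sigma'(z) x_j \tag{32} \\
&= \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z)(1 - \sigma(z))} (\sigma(z) - y) \tag{33} \\
&= \frac{1}{n} \sum_x x_j (\sigma(z) - y) \tag{34}
\end{align}
and with respect to the bias:
$$\frac{\partial C}{\partial b} = \frac{1}{n} \sum_x (\sigma(z) - y) \tag{35}$$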
In any case, when we use linear neurons, that is, neurons with a linear (non-constant) activation function, neuron saturation does not happen, because their derivative is neither zero nor asymptotically close to it, so in that case we could use the quadratic cost function.
1.5
Types of layers
1.5.1
Convolutional layers
Sliding the local receptive field, also called the kernel, to the right by one neuron, or by any number of neurons defined as the stride, we connect the newly obtained region to the next hidden neuron by saving its activation, and we do so for the whole layer.
The previous operation is called convolution, which gives the network model its name. Notice that the weights and biases are shared by all local receptive fields, so with this process we check to what degree a feature is present across the whole image, and we slightly modify that feature during the learning process. This way we also greatly reduce the number of parameters compared with a fully connected network, and we get more meaningful information for each neuron in the hidden layer.
We call the map from local receptive fields to a hidden layer a feature map. A convolutional layer can have many feature maps, which lets us recognize different shapes in the images. Each feature map tells us, so to speak, whether a given pattern is present in a region, with a real value between 0 and 1.
With the kernel defined as above, the output of a convolutional layer has smaller dimensions than its input, but this is not always the case. In some cases we extend the layer at its edges with mean values, which is called padding, so that after filtering we get an output of the same size and shape as the input layer.
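The sliding of a shared kernel with a given stride, and the effect of padding on the output size, can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in this work: the names `pad2d` and `conv2d` are hypothetical, the border is filled with a constant (zero by default) instead of mean values, the kernel is applied without flipping, as is customary in CNNs, and the activation function is left out so that the result is the map of weighted inputs.

```python
def pad2d(image, pad, value=0.0):
    """Surround a 2D list with `pad` rows and columns filled with `value`."""
    width = len(image[0]) + 2 * pad
    padded = [[value] * width for _ in range(pad)]
    for row in image:
        padded.append([value] * pad + list(row) + [value] * pad)
    padded.extend([value] * width for _ in range(pad))
    return padded

def conv2d(image, kernel, bias=0.0, stride=1, pad=0):
    """Slide `kernel` over `image`; the same weights and bias are used everywhere."""
    if pad:
        image = pad2d(image, pad)
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    return [[sum(kernel[m][n] * image[r * stride + m][c * stride + n]
                 for m in range(kh) for n in range(kw)) + bias
             for c in range(out_w)]
            for r in range(out_h)]

# Hypothetical 4x4 "image" and a 3x3 kernel of ones.
image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
ones = [[1.0] * 3 for _ in range(3)]
valid = conv2d(image, ones)        # no padding: 2x2 output
same = conv2d(image, ones, pad=1)  # padding of 1: 4x4 output, same size as the input
print(len(valid), len(valid[0]), len(same), len(same[0]))
```

The kernel here has 9 weights and one bias regardless of the image size, which is the parameter saving that weight sharing gives over a fully connected layer.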
1.5.2
An example
The previous layer would have another feature map to recognize whether the above features are present or not,
and so on for every lower level of abstraction until we get to the input layer with the image data.
1.5.3
Pooling Layers
After a convolutional layer we usually have pooling layers, which simplify the information of the previous layer. A commonly used one is the max-pooling layer, which takes the maximum value of the activations in a given region, say 2 × 2 neurons of the previous layer:
$$a_{jk}^{l+1} = \max\{a_{2j+m,\,2k+n}^l\}_{m,n \in \{0,1\}} \tag{37}$$
[...] information of the feature maps, whether they appear or not in an approximate part of the image, since we do not care about the exact position of a feature when we are looking for patterns.
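Equation (37) translates almost literally into code. The sketch below assumes a feature map stored as a list of rows with even height and width; the function name `max_pool_2x2` and the sample values are made up for the illustration.

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of each non-overlapping 2x2 block, as in equation (37)."""
    rows, cols = len(feature_map) // 2, len(feature_map[0]) // 2
    return [[max(feature_map[2 * j + m][2 * k + n]
                 for m in (0, 1) for n in (0, 1))
             for k in range(cols)]
            for j in range(rows)]

# Hypothetical 4x4 feature map.
fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4, 2], [2, 8]]
```

Each output value only records that a strong activation occurred somewhere in its 2 × 2 block, so the layer halves each dimension while discarding the exact position.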
1.5.4
Rectifier linear units layers
As we explained before, when using the sigmoid output for all the neurons in a layer it can happen that the state of many neurons becomes saturated, due to the shape of this output function. Rectifier layers are characterized by having the activation function of their neurons defined as
$$f(x) = \max(0, x) \tag{38}$$
Using this kind of output we avoid saturation, so this kind of layer is usually combined with convolutional layers with the sigmoid function. A smooth approximation to the rectifier is the analytic function
$$f(x) = \ln(1 + e^x) \tag{39}$$
which is called the softplus function. These layers have the property of accelerating the learning process, that is, of achieving a lower cost value in fewer steps.
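A short sketch of equations (38) and (39) in Python follows. The function names are chosen for the example; the softplus is written in the algebraically equivalent form max(x, 0) + ln(1 + e^(-|x|)), an implementation choice made here to avoid overflow for large x.

```python
import math

def relu(x):
    """Rectifier, equation (38)."""
    return max(0.0, x)

def softplus(x):
    """Softplus ln(1 + e^x), equation (39), in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

for value in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(value, relu(value), softplus(value))
```

The printed values show that softplus stays slightly above the rectifier everywhere and approaches it as |x| grows.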
1.5.5
Local Response Normalization layers
$$a_{x,y}^i = z_{x,y}^i \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i - n/2)}^{\min(N-1,\, i + n/2)} (z_{x,y}^j)^2 \Big)^{\beta} \tag{40}$$
where $z_{x,y}^i$ is the activity of a neuron computed by applying kernel i at position (x, y) and then applying the ReLU nonlinearity, and $a_{x,y}^i$ is the response-normalized activity of $z_{x,y}^i$. The sum runs over n adjacent kernel maps at the same spatial position and N is the total number of kernels in the layer. Details about the other parameters can be found in [3].
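A direct, unoptimized reading of equation (40) is sketched below. The activities are stored as `z[i][x][y]` (kernel index first); the function name and the default values of k, n, alpha and beta are illustrative assumptions, and n/2 is taken as integer division.

```python
def local_response_norm(z, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize each activity by the squared activities of adjacent kernel maps.

    z[i][x][y] is the ReLU activity of kernel i at position (x, y).
    The default hyperparameters are illustrative assumptions.
    """
    num_kernels = len(z)
    out = []
    for i in range(num_kernels):
        lo = max(0, i - n // 2)
        hi = min(num_kernels - 1, i + n // 2)
        out.append([[z[i][x][y]
                     / (k + alpha * sum(z[j][x][y] ** 2
                                        for j in range(lo, hi + 1))) ** beta
                     for y in range(len(z[i][x]))]
                    for x in range(len(z[i]))])
    return out

# Hypothetical layer with 3 kernels and a single spatial position.
z = [[[1.0]], [[2.0]], [[3.0]]]
a = local_response_norm(z, k=1.0, n=3, alpha=1.0, beta=1.0)
print(a)  # approximately [[[1/6]], [[2/15]], [[3/14]]]
```

With these toy settings the middle kernel is divided by 1 + (1 + 4 + 9) = 15, while the kernels at the ends only see the neighbours that exist.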
1.5.6
Softmax layers
Softmax layers transform the activations of the previous layer into a probability distribution, keeping the information of the activations. Each neuron of the softmax layer has the following activation function:
$$a_j^L = \frac{e^{z_j^L}}{\sum_k e^{z_k^L}} \tag{41}$$
The activations $z_j^L$ of the previous layer are not necessarily between 0 and 1, nor do they necessarily sum to 1 over the whole layer, so with the softmax layer we make sure that we have a better representation of the probability that the image belongs to a particular class. We will actually not count softmax as a layer that helps to train the network, but as a layer that helps to make the classification results human readable.
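Equation (41) can be sketched in a few lines. The function name and the example scores are assumptions; subtracting the maximum before exponentiating is an implementation choice that leaves the result unchanged mathematically and avoids overflow.

```python
import math

def softmax(z):
    """Turn a list of scores into a probability distribution, equation (41)."""
    shift = max(z)  # does not change the result, only avoids overflow
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, -1.0]  # hypothetical activations of the previous layer
probs = softmax(scores)
print(probs, sum(probs))
```

The outputs are positive, sum to 1 and preserve the ordering of the scores, so the largest activation still identifies the predicted class.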
1.6
Why are CNN so effective on image data?
1.7
Tensorflow
To run the operations needed for training a neural network we used Google's recently launched open source machine learning library Tensorflow. TensorFlow is an open source software library for numerical computation using data flow graphs. It is an alternative to earlier libraries with similar purposes such as Caffe, Torch, Theano, DL4J or SciKit-Learn. In each graph, nodes represent mathematical operations, from simple ones like matrix multiplication and addition to more complex ones like convolution or softmax. Graph edges represent the multidimensional data arrays (tensors) communicated between them. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models. It has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
1.8
Programming model
two tensors of type float versus add of two tensors of type int32). A kernel is a
particular implementation of an operation that can be run on a particular type of
device (e.g., CPU or GPU). A TensorFlow binary defines the sets of operations
and kernels available via a registration mechanism, and this set can be extended
by linking in additional operation and/or kernel definitions/registrations.
A tensor in TensorFlow is a typed, multidimensional array.
A variety of tensor element types are supported, including signed and unsigned
integers ranging in size from 8 bits to 64 bits, IEEE float and double types, a
complex number type, and a string type (an arbitrary byte array).
2
Related work
2.1
Universal approximation of functions
The way artificial neural networks have evolved until today, when they solve many classification problems with good results, was heuristic: it was not determined analytically, through deductive steps, that neural networks could properly model certain types of data such as audio and video. It can, however, be proven analytically that linear combinations of sigmoid functions can uniformly approximate any continuous function on a compact set, which tells us that we could approximate any data set with neural networks. Details can be found in [6]. In spite of this formalization of the function approximation capability of ANNs, it is accepted that they have a black box nature in terms of feature extraction. The interpretation of the learned weights and biases is not exactly known, although we can observe that basic shapes that might be present in images are identified as features.
2.2
Recurrent Neural Networks
In our work we always used feedforward neural networks, which propagate the activations in one direction, but it is important to remark that there are other commonly used types of ANNs, such as recurrent neural networks, in which the connections form directed cycles.
2.3
Deep Belief Networks (DBNs)
The main condition for using CNNs is to have labeled data, which in most real-life cases is not available. Sometimes we have similar kinds of problems that need to be solved in an unsupervised way, and for this we can use deep belief networks, which are the unsupervised learning version of artificial neural networks. Another inconvenience of CNNs and backpropagation is that weights and biases can get stuck in a poor local optimum, leaving the model far from good prediction results. To overcome these limitations Smolensky [7] thought of a network that learns hidden patterns in the data. The idea is to have only one visible layer and many hidden ones, and to infer the states of the hidden variables for some states of the visible variables, so as to be able later to generate new samples of the visible variables. In the case of images, we would learn the probability of some features appearing in a given image, without the image being labeled.
DBNs are composed of Restricted Boltzmann Machines (RBMs). RBMs are a simpler kind of ANN than CNNs, and they learn a probability distribution over a set of inputs. In the case of image sets, the network learns a set of features from an input image dataset. This can be used to initialize the feature values of deep neural networks. RBMs have only 2 layers, a visible one with m neurons (in this method also called units) and a hidden one with n units, all with binary values. In the same way as in CNNs, there is a weight matrix $W = (w_{i,j})$ of size m × n, where $w_{i,j}$ is the weight of the connection between visible unit $v_i$ and hidden unit $h_j$, and there are also biases, $a_i$ for the visible units and $b_j$ for the hidden units. In RBMs we have a function that associates a scalar value called energy to each configuration of the variables:
$$E(v, h) = -\sum_{i=1}^m a_i v_i - \sum_{j=1}^n b_j h_j - \sum_{i=1}^m \sum_{j=1}^n v_i w_{i,j} h_j \tag{42}$$
or, in matrix notation,
$$E(v, h) = -a^T v - b^T h - v^T W h \tag{43}$$
Learning corresponds to modifying that energy function so that its shape has desirable properties. We would like plausible or desirable configurations to have low energy. We also have a probability distribution over the configurations of the network, which depends on the energy function:
$$P(v, h) = \frac{1}{Z} e^{-E(v,h)} \tag{44}$$
where $Z = \sum_{(v,h)} e^{-E(v,h)}$ is a normalizing constant that ensures the probability distribution sums to 1. The sum is over all possible configurations of visible and hidden units. Plausible configurations should have a higher probability value, that is, a lower value of the energy function. In a similar way, the probability of a given vector of visible units is the normalized sum of the exponentials of the negative energy over all possible configurations of the hidden units:
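$$P(v) = \frac{1}{Z} \sum_h e^{-E(v,h)} \tag{45}$$
Since there are no connections between units of the same layer, the hidden units are independent of each other given the visible units:
$$P(h \mid v) = \prod_{j=1}^n P(h_j \mid v) \tag{47}$$
$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_{i=1}^m w_{i,j} v_i\Big) \tag{48}$$
and
$$P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_{j=1}^n w_{i,j} h_j\Big) \tag{49}$$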
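For a very small RBM the quantities above can be computed by brute force, which makes it possible to check numerically that the conditional (48) agrees with the joint distribution (44). The sketch below is only an illustration: the parameter values and the function names are hypothetical, and enumerating all configurations is feasible only for a handful of units.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def energy(v, h, a, b, W):
    """E(v, h) = -a.v - b.h - v^T W h, equation (43)."""
    visible = sum(ai * vi for ai, vi in zip(a, v))
    hidden = sum(bj * hj for bj, hj in zip(b, h))
    interaction = sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
    return -visible - hidden - interaction

def partition(a, b, W):
    """Z: sum of exp(-E) over all binary configurations (tiny models only)."""
    return sum(math.exp(-energy(v, h, a, b, W))
               for v in product((0, 1), repeat=len(a))
               for h in product((0, 1), repeat=len(b)))

def joint_probability(v, h, a, b, W):
    """P(v, h), equation (44)."""
    return math.exp(-energy(v, h, a, b, W)) / partition(a, b, W)

def p_hidden_on(j, v, a, b, W):
    """P(h_j = 1 | v), equation (48)."""
    return sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(len(v))))

# Hypothetical RBM with m = 3 visible and n = 2 hidden units.
a = [0.1, -0.2, 0.3]
b = [0.0, 0.5]
W = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]

total = sum(joint_probability(v, h, a, b, W)
            for v in product((0, 1), repeat=3)
            for h in product((0, 1), repeat=2))
print(total)                               # sums to 1 up to rounding
print(p_hidden_on(0, (1, 0, 1), a, b, W))  # conditional from equation (48)
```

In practice Z cannot be enumerated for realistic layer sizes, which is why training relies on the factorized conditionals instead.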
Composing many RBMs, making each hidden layer the visible layer of another RBM, we form a deep belief network, which is able to extract features at different levels of abstraction from the data. To train a deep belief network we would proceed as follows:
• Given an input data sample X, we train a restricted Boltzmann machine on X to obtain its weight matrix W. We then use it as the weight matrix between the lower two layers of the network.
• We transform X by the RBM to produce a new data sample X', either by sampling or by computing the mean activation of the hidden units.
• We repeat the procedure with X ← X' for the next pair of layers, until the top two layers of the network are reached.
• Finally we fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).
2.4
Random forest
2.5
Deep Dream
Have you ever thought about something or someone for so long that for a second you feel you see it even when it is not there? This is a curious application of deep neural networks: the generation of images reminiscent of hallucinations. The idea in Deep Dream is to maximize the activations of the features of certain layers in a network that is already trained, and to mix the detected features with an input image. To do this gradient ascent is used, which is the opposite idea of gradient descent: instead of subtracting the gradient in each step we add it, to get a higher activation value. In Deep Dream the gradient is taken with respect to the pixels of the input image, and the weights and biases of the trained network stay fixed. The result of doing this on an image of Barcelona's skyline from Parc Güell, with a layer of an Inception neural network that detects the presence of canines and other animals, is this:
2.6
Image Inpainting
Another curious application of neural networks to image data mimics something humans do routinely: imagining the completion of a missing part of an image. This promising idea was made a reality in [9], which was accepted at this year's Conference on Computer Vision and Pattern Recognition (CVPR) in Las Vegas, and represents a big step for computer vision. In figure 8 we can see the impressive results:
3 Methodology
Our goal was to build our own CNN, taking inspiration from examples that already predict effectively, and to understand their architecture. To do that we use the TensorFlow library and draw ideas from its available examples.
3.1 The dataset
The aim of the class "other" is to make the model able to tell when a picture does not belong to any of the food classes. The dataset is composed of both Instagram photos and web images. The Instagram photos were obtained through the Instagram API, filtered by user-defined tags and manually purged. As user-defined tags are very noisy, this method proved to be inefficient and very time-consuming. To ease the generation of more ground-truth annotations and a larger training dataset, we also obtained images from Google Images through the Google Custom Search API. This method, which allowed us to annotate a bigger set of images automatically, turned out to be very useful, as almost all the retrieved images showed the desired food category and only minimal manual purging was required. The first model that we are going to build is a single-layer neural network. The images of the dataset have no specific size or format, so we store them in a .bin file whose records hold 32x32 pixels with 3 RGB channels. Then, to feed the network, we randomly crop them into 24x24 pixel images, which expands the data set size.
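The random cropping step can be illustrated as follows. This is a sketch with plain Python lists and a hypothetical helper random_crop, not the input pipeline used in our programs. A 32x32 image admits (32 - 24 + 1)^2 = 81 different 24x24 crops, which is why the step enlarges the effective training set.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Return a crop_h x crop_w window taken at a random position of the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)    # randint includes both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


rng = random.Random(0)
# Dummy 32x32 RGB image: every pixel stores its own (row, column) position.
image = [[(r, c, 0) for c in range(32)] for r in range(32)]
crops = [random_crop(image, 24, 24, rng) for _ in range(5)]
possible_positions = (32 - 24 + 1) ** 2
```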
3.2 Hardware setup
The models were trained on a high-end server with a quad-core Intel i7-3820 at 3.6 GHz, 64 GB of DDR3 RAM, and 4 NVIDIA Tesla K40 GPU cards with 12 GB of GDDR5 each, connected through PCIe 3.0 in x16 mode (with two PCIe switches). The machine runs a GNU/Linux system with Linux kernel 3.12 and NVIDIA driver 340.24. We performed experiments with different configurations (downscaling sizes, data augmentation, different numbers and compositions of layers, different layer geometries, etc.). We also tested aspects with no impact on classification accuracy but with practical implications, such as different input formats (TFRecords, compressed NumPy arrays, etc.) or different hardware configurations (one or more CPUs and GPUs, etc.). Our runs include an extensive set of configurations; for brevity, when parameters were shown to be irrelevant or to have a negligible effect, we use default values. Each experimental configuration was repeated at least 5 times. Unless otherwise stated, we report median values in seconds.
3.3
results in a file. The graph is saved in a .ckpt file so it can be read when evaluating the model. During training, the loss value, steps and execution time are saved.
Train network and track precision - ManuNet train eval.py:
A modified version of the previous program that saves the prediction precision on train and test data separately. Model inferences are grouped in a scope to allow reusing variables.
Evaluate model - ManuNet eval.py:
Returns the precision of our model's inference over test data.
Evaluation sample - ManuNet eval sample.py:
Performs model inference over the desired number of batches and saves the images with their predicted and correct labels.
Generate confusion matrix - ManuNet eval byclasses.py:
This program infers the images' labels over the desired number of batches and returns a matrix where each entry a_ij is the proportion of the examples labeled as class i that were predicted as class j; this way the values on the diagonal, a_ii, give the precision for class i.
Extract features - ManuNet get features.py:
Performs inference and saves the features extracted in the convolutional layer kernels as images.
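The confusion matrix described in the list above can be computed as in the following sketch. It is a standalone illustration with invented labels and a hypothetical function name, not the code of the program itself: counts are accumulated per (real class, predicted class) pair and every row is then divided by the number of examples of its real class.

```python
def confusion_matrix(labels, predictions, n_classes):
    """Entry [i][j] is the proportion of examples of real class i predicted as class j."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for real, predicted in zip(labels, predictions):
        counts[real][predicted] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no examples gets a row of zeros instead of a division by zero.
        matrix.append([count / total if total else 0.0 for count in row])
    return matrix


# Invented labels and predictions for three classes.
labels = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
predictions = [0, 0, 0, 1, 1, 1, 2, 2, 0, 2]
m = confusion_matrix(labels, predictions, 3)
# m[0] is [0.75, 0.25, 0.0]: three of the four class-0 examples were predicted correctly.
```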
3.4 Transfer learning
Bottlenecks
4 Results
4.1 Single softmax layer network
The first and simplest approach we took was to train the network with a single softmax layer, that is, a single matrix product of the image data with the weights plus the addition of the biases.
Since our dataset is quite noisy and not very large, and no features at different levels of abstraction are extracted, the first result is not going to be very good; if the network learns anything at all, it could already be considered an achievement.
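The forward pass of such a single softmax layer is small enough to write out. The sketch below uses invented weights and a 4-value input instead of a real image, and its function names are ours; it only shows the matrix product, the addition of the biases and the softmax normalization.

```python
import math


def softmax(logits):
    """Turn a list of scores into probabilities that add up to 1."""
    largest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_layer(x, W, b):
    """Compute softmax(x W + b) for one flattened input x."""
    logits = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
              for j in range(len(b))]
    return softmax(logits)


# Invented example: 4 input values, 3 classes.
x = [0.2, 0.7, 0.1, 0.5]
W = [[0.1, -0.2, 0.0],
     [0.4, 0.3, -0.1],
     [-0.3, 0.2, 0.5],
     [0.0, 0.1, 0.2]]
b = [0.0, 0.1, -0.1]
probs = softmax_layer(x, W, b)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

The predicted class is simply the index of the largest probability.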
After running the experiment we can see the results of running stochastic gradient descent with a single-layer network in the figure.
Figure 10: 1 softmax layer ANN
Clearly we are not getting close to a minimum of the cost function, since after some steps the cost is barely decreasing.
When checking the precision of the predictions, feeding the network with 600 test images and dividing the number of correct classifications by the total number of classifications, we get a disastrous score of 44% (44 out of 100 images are classified correctly).
Figure 11: Training loss in each step for a network with no hidden layers.
So with only a softmax layer receiving the product of the weights with the input layer plus the biases, the model does learn something, since a random classification would get around 10% precision, but it does not get much better than that. Given the evolution of the loss value, we leave this "hello world" experiment without spending time on runs with more steps, and move on to the layers that give their name to the networks we are studying. So let's see how the classification improves with a convolutional layer and, later, a pooling layer.
4.2 Convolution
After the simplest approach of a network with no hidden layers, we see how the model does with a single hidden convolutional layer.
The layer is going to extract 64 kernels of 5x5 neurons, with a stride of a single neuron in each direction and a padding that makes the output of the convolution layer have the same size as the input layer. We think of the convolution layer as a prism, since it extracts presence information for 64 features for every part of the image; we save this information in a 3D array (24x24x64).
Figure 12: Convolution network simple structure
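The 24x24x64 output shape follows from the stride and the padding. The sketch below, with hypothetical helper names, computes the output shape for "same" and "valid" padding and then applies one 5x5 kernel with zero padding to a single-channel 24x24 input, to confirm that the spatial size is preserved. It is a plain Python illustration, not the TensorFlow operation used in our model.

```python
import math


def conv_output_shape(height, width, n_kernels, kernel, stride, padding):
    """Output shape of a convolution layer for 'same' or 'valid' padding."""
    if padding == "same":
        out_h, out_w = math.ceil(height / stride), math.ceil(width / stride)
    else:  # 'valid': the kernel must fit entirely inside the input
        out_h = math.ceil((height - kernel + 1) / stride)
        out_w = math.ceil((width - kernel + 1) / stride)
    return (out_h, out_w, n_kernels)


def conv2d_same(image, kernel):
    """Stride-1 convolution of one channel with zero padding (odd kernel size)."""
    k, pad = len(kernel), len(kernel) // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for dr in range(k):
                for dc in range(k):
                    rr, cc = r + dr - pad, c + dc - pad
                    if 0 <= rr < h and 0 <= cc < w:  # outside the image counts as 0
                        out[r][c] += image[rr][cc] * kernel[dr][dc]
    return out


same_shape = conv_output_shape(24, 24, 64, kernel=5, stride=1, padding="same")
valid_shape = conv_output_shape(24, 24, 64, kernel=5, stride=1, padding="valid")

ones_image = [[1.0] * 24 for _ in range(24)]
ones_kernel = [[1.0] * 5 for _ in range(5)]
feature_map = conv2d_same(ones_image, ones_kernel)
```

With an all-ones image and kernel, the central outputs equal 25 (the full 5x5 window) while the corner outputs equal 9, because only a 3x3 part of the window overlaps the image there.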
The loss value keeps decreasing over a longer stretch of training (in the simplest model it almost stopped decreasing within the first 10,000 steps) and ends up with an average value of 0.3. Training took 2 hours 56 minutes to complete 50,000 steps, which we chose as a training length sufficient for the loss to stabilize. As expected, the precision of the predictions jumps to 85.5%, in contrast with the result of the previous model.
Figure 13: Loss function values during training, in steps and time
4.3
Now we check the effect of adding a pooling layer after the convolutional layer. This time the cost decreases considerably faster and reaches a value close to 0.16 after 50,000 training steps, in contrast with the 0.3 of the previous model. In terms of precision we get up to 88%, which is actually not bad taking into account that the images are not as simple as, for example, handwritten digits, and that our network has only 2 layers.
Figure 14: Convolution network with 1 conv. layer and pooling
If we randomly choose a sample of the classifications we see that, effectively, about 8 out of 10 images are classified in the right group. The real label of the picture follows "L:" and the prediction computed by the network follows "P:".
L: 0, P: 0
L: 1, P: 1
L: 2, P: 2
L: 3, P: 3
L: 4, P: 8
L: 5, P: 5
L: 6, P: 6
L: 7, P: 6
L: 8, P: 8
L: 9, P: 9
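The precision of the ten sampled classifications listed above can be checked with a few lines. The pairs are copied from the list; the function name precision is ours.

```python
# (real label, predicted label) pairs copied from the sample above.
sample = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 8),
          (5, 5), (6, 6), (7, 6), (8, 8), (9, 9)]


def precision(pairs):
    """Correct classifications divided by the total number of classifications."""
    correct = sum(1 for label, predicted in pairs if label == predicted)
    return correct / len(pairs)


sample_precision = precision(sample)  # 8 correct out of 10 gives 0.8
mistakes = [(label, predicted) for label, predicted in sample if label != predicted]
```

The two mistakes in this sample are class 4 predicted as 8 and class 7 predicted as 6.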
We can also see the precision results in the confusion matrix, which tells us, for each class, which proportion of the predictions were correct and which fell into wrong classes; rows are real class labels and columns are class predictions. For example, we can observe that 15% of the fried eggs were classified as sushi. That might be because of the similarity of colors and shapes (white ovals surrounded by black are present in both classes). So maybe we should raise the number of layers and of features detected in our network, or change other parameters, to let the network tell the difference between those classes. Apart from this, there is no notable confusion (greater than 10%) between other classes.
Table 1: Confusion matrix (rows: real class label, columns: predicted class)
      0     1     2     3     4     5     6     7     8     9
0  0.87  0.02  0.00  0.02  0.00  0.08  0.02  0.00  0.00  0.00
1  0.00  0.85  0.02  0.04  0.00  0.04  0.00  0.02  0.02  0.00
2  0.08  0.00  0.77  0.00  0.00  0.11  0.02  0.00  0.03  0.00
3  0.02  0.02  0.00  0.78  0.06  0.04  0.00  0.06  0.02  0.00
4  0.02  0.00  0.02  0.05  0.72  0.02  0.00  0.02  0.15  0.00
5  0.01  0.01  0.00  0.00  0.00  0.94  0.01  0.01  0.01  0.00
6  0.00  0.02  0.00  0.03  0.02  0.00  0.90  0.02  0.01  0.00
7  0.00  0.02  0.00  0.02  0.02  0.00  0.05  0.85  0.05  0.00
8  0.01  0.00  0.01  0.01  0.01  0.06  0.01  0.04  0.83  0.00
9  0.00  0.00  0.04  0.00  0.00  0.10  0.00  0.00  0.00  0.86
and pooling layer the network can already classify images with a clear color and shape pattern, but it has difficulties with classes that contain many colors and complicated shapes, so clearly the idea of convolution is the main key to the learning process of the network. In terms of time, it took us 2 hours and 24 minutes to train the network using the GPU cluster.
It is worth remarking that in this model we are estimating a function with a total of 32x32+5x5x64=2624 parameters, so we can imagine the complexity of the computation; this is what we meant by the differences between classical statistics and machine learning.
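The arithmetic behind the figure quoted above can be reproduced directly. For comparison, the sketch also counts the weights of a convolutional layer under two explicit assumptions that the text does not confirm: that each kernel spans all 3 RGB channels, and that each kernel has one bias. The helper conv_layer_parameters is hypothetical.

```python
def conv_layer_parameters(kernel_size, in_channels, n_kernels, with_bias=True):
    """Weights of a convolutional layer: one kernel_size x kernel_size x in_channels block per kernel."""
    weights = kernel_size * kernel_size * in_channels * n_kernels
    return weights + (n_kernels if with_bias else 0)


# The total as quoted in the text.
quoted_total = 32 * 32 + 5 * 5 * 64

# 64 kernels of 5x5 weights on a single channel, without biases.
conv_single_channel = conv_layer_parameters(5, 1, 64, with_bias=False)

# Assumption: kernels cover the 3 RGB channels and each one has a bias.
conv_rgb_with_bias = conv_layer_parameters(5, 3, 64)
```

Under the second set of assumptions the convolutional layer alone would hold 4864 parameters, so the total depends noticeably on how channels and biases are counted.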
4.4
Now that we have seen that the main improvement comes from having a convolutional layer, let's see whether we get another big improvement by adding a second convolutional layer.
Here is the confusion matrix obtained when training a network with 2 convolutional layers.
Table 2: Confusion matrix for network with 2 convolution and normalization layers (rows: real class label, columns: predicted class)
      0     1     2     3     4     5     6     7     8     9
0  0.86  0.00  0.02  0.00  0.00  0.08  0.02  0.00  0.00  0.02
1  0.00  0.79  0.00  0.04  0.00  0.05  0.02  0.04  0.05  0.00
2  0.10  0.02  0.70  0.03  0.03  0.07  0.00  0.00  0.00  0.05
3  0.02  0.04  0.00  0.82  0.04  0.00  0.02  0.04  0.02  0.00
4  0.02  0.00  0.00  0.04  0.85  0.02  0.00  0.02  0.04  0.00
5  0.01  0.01  0.00  0.00  0.00  0.93  0.02  0.00  0.01  0.00
6  0.00  0.00  0.00  0.05  0.02  0.00  0.91  0.00  0.02  0.00
7  0.02  0.03  0.00  0.01  0.00  0.00  0.06  0.85  0.03  0.00
8  0.00  0.00  0.00  0.02  0.00  0.02  0.00  0.01  0.94  0.00
9  0.00  0.00  0.00  0.00  0.00  0.09  0.05  0.00  0.00  0.86
Normalization Layers
If we add a normalization layer after each of the pooling layers, as is done in the cited models, surprisingly our accuracy decreases again, to 86.8%, so we look for other network configuration values that can be modified to get better results. Another parameter we took from previous works is the cropping size, which was 32x32 for all images; maybe using such small versions of the pictures is preventing our network from recognizing some features that need more resolution to be recognized.
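As a reminder of what such a normalization layer computes, here is a sketch of local response normalization across feature channels at one pixel position, in the form used in [3]: each activation is divided by a term that grows with the squared activations of its neighbouring channels. The function name and the constants (bias 2, alpha 1e-4, beta 0.75, two neighbours on each side) are illustrative assumptions, not necessarily the values of our model.

```python
def local_response_norm(activations, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75):
    """Normalize each channel by the squared activity of its neighbouring channels."""
    n = len(activations)
    normalized = []
    for i, a in enumerate(activations):
        lo = max(0, i - depth_radius)
        hi = min(n - 1, i + depth_radius)
        squared_sum = sum(activations[j] ** 2 for j in range(lo, hi + 1))
        normalized.append(a / (bias + alpha * squared_sum) ** beta)
    return normalized


# Invented activations of three feature channels at the same pixel.
channel_activations = [1.0, 2.0, 3.0]
normalized = local_response_norm(channel_activations)
```

With these constants the denominator is always greater than 1, so every activation is scaled down, and channels surrounded by strongly active neighbours are scaled down the most.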
4.5
So the next approach is to train the network with two convolutional, pooling and normalization layers on a higher-resolution version of the same dataset. We choose a size that does not make the computation too slow but gives a quite noticeable difference in resolution: 48x48 after the cropping step. With this size, the precision of the predictions rises to 89.99%, which is quite a significant improvement. So we can state that higher resolution with more convolutional layers gives better classification results. These experiments also take a long computation time: the last one (2 convolutional layers, 48x48 pixels) took 8 hours 21 minutes to complete the 50,000 training steps. The confusion matrix shows us that the predictions for some classes are now highly accurate, but for others the neural network still lacks a complete understanding of the patterns.
      0     1     2     3     4     5     6     7     8     9
0  0.83  0.00  0.04  0.02  0.02  0.07  0.02  0.00  0.00  0.00
1  0.00  0.82  0.00  0.06  0.00  0.04  0.00  0.04  0.04  0.00
2  0.02  0.02  0.88  0.00  0.00  0.08  0.00  0.00  0.00  0.00
3  0.00  0.00  0.02  0.83  0.04  0.00  0.04  0.02  0.04  0.00
4  0.00  0.00  0.05  0.02  0.81  0.02  0.00  0.02  0.07  0.00
5  0.01  0.00  0.00  0.00  0.00  0.96  0.01  0.00  0.01  0.00
6  0.00  0.02  0.02  0.02  0.02  0.02  0.86  0.05  0.00  0.00
7  0.02  0.03  0.00  0.02  0.00  0.02  0.00  0.90  0.02  0.00
8  0.00  0.00  0.01  0.01  0.00  0.00  0.01  0.00  0.96  0.00
9  0.00  0.00  0.04  0.00  0.00  0.09  0.00  0.00  0.00  0.87
Table 3: Confusion matrix of 2 convolution layers model trained with 48x48 images
4.6 Overfitting
The plotted loss values are calculated on the training data, but it would be good to see how the classification precision evolves during the training process, to know whether the network is actually learning the concepts behind the data or only memorizing the training data set.
Figure 19: Precision of classifications with training data set in green and test data set in red
To do that we look at the evolution of the precision on training and test data. In figure 19 we can see that there is a point at which the precision of the classifications stops improving, which means that our network is overfitting or that some of the neurons are saturated. To avoid this we can try adding a normalization layer.
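One simple way to locate the point where the test precision stops improving is sketched below. The precision values are invented for the example (they are not our measurements), and the function name and the patience parameter are our own choices: the function returns the index of the last evaluation that improved the test precision once no further improvement has been seen for a given number of evaluations.

```python
def early_stopping_point(test_precision, patience=3):
    """Index of the best test precision, found once `patience` evaluations pass without improvement."""
    best, best_index = float("-inf"), -1
    for i, value in enumerate(test_precision):
        if value > best:
            best, best_index = value, i
        elif i - best_index >= patience:
            break
    return best_index


# Invented precision curves, one value per evaluation.
train_precision = [0.62, 0.78, 0.86, 0.91, 0.94, 0.96, 0.98, 0.99, 0.99]
test_precision = [0.60, 0.75, 0.82, 0.86, 0.87, 0.87, 0.86, 0.86, 0.85]

stop_index = early_stopping_point(test_precision)
# Gap between train and test precision: it keeps growing when the network overfits.
gaps = [train - test for train, test in zip(train_precision, test_precision)]
```

In this invented run the test precision peaks at the fifth evaluation (index 4), while the gap to the training precision keeps widening afterwards, which is the signature of memorizing the training set.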
4.7
4.8 Batch size
Initially we set a commonly used batch size, 128 examples per batch, but we wanted to see how this affects training time and classification precision. We compare the training evolution and results for the last network model, which consisted of two convolutional layers, each followed by pooling and normalization. In figure 20 we see that the loss takes slightly lower and less oscillating values from step 30,000 on. We can see it more clearly in a close-up of the last 1,000 steps of training in figure 21.
Figure 20: Training loss with batch size of 128 examples in blue and 64 in red.
Figure 21: Training loss on 1000 last steps for batch size of 128 examples in blue and
64 in red.
The big difference is in training time, which is about half for the smaller batch size; we can see the comparison in minutes in figure 22.
Figure 22: Training loss depending on time for batch sizes 128 in blue and 64 in red.
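A back-of-the-envelope calculation suggests why the training time roughly halves. For a fixed number of steps, the number of examples processed is proportional to the batch size. The sketch assumes the same 50,000 steps used in the earlier experiments and assumes that the time per step grows roughly linearly with the batch size; the function name is ours.

```python
TRAINING_STEPS = 50000  # assumption: same number of steps as in the previous experiments


def examples_processed(steps, batch_size):
    """Total number of training examples seen after a given number of steps."""
    return steps * batch_size


workload = {batch_size: examples_processed(TRAINING_STEPS, batch_size)
            for batch_size in (128, 64)}
ratio = workload[128] / workload[64]
```

With 128 examples per batch the network sees 6,400,000 examples, against 3,200,000 with 64, so each run with the smaller batch does about half the work.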
4.9 Extracted features
Here are some features extracted by our network, in this case the 64 kernels of the first convolutional layer (5x5 weights). As usually happens with CNNs, we cannot know why a particular shape or kind of feature is learned during the process, but intuitively in some of them it looks like the network learns basic shapes to recognize the boundaries of the objects in the images.
Figure 23: Extracted kernels in first convolutional layer, in a model with 2 convolutional layers.
For each of the previous features we get a tensor with the shape of the image holding the activations for that feature, as explained in section 1.5.1, since we are using padding to make the output have the same shape as the input.
Figure 24: Output of first convolution layer for the model 4.4
For example, in figure 24 we can see the output of the convolution for some images, for the first extracted feature and for each color channel.
If we look at a sample of the classifications of test images for this deeper model, we notice that the network is indeed understanding some patterns within the images. Let's first see a sample of correctly classified images:
Figure 25: Some of the pictures correctly classified by the last built model.
But let's also take a look at the errors to see the level of misunderstanding in some cases:
Figure 26: Some of the pictures wrongly classified by the last built model.
So, as we can see, there is still a lot to improve in our model. For example, when a state-of-the-art model classifies the MNIST data set, the errors lie mostly in images that are really hard for a human eye to classify correctly, but that is not the case here. Of course our images are much more complicated than digits and our training set is smaller, but further work in the direction we are following (more layers, better data quality) should let us achieve a much better result.
4.10
As we explained before, the last attempt was to use a pretrained network that recognizes the 1000 classes contained in the ImageNet dataset. To do this we use the Inception v3 model provided by TensorFlow, and this time we achieved a precision of 91.2% in only 17 minutes. So we can definitely state that knowing the patterns of many image classes greatly helps to learn new classes faster and better, as happens with biological neural networks.
Summing up, we can see in figure 27 that, although transfer learning from a large previously trained model gives the best results, our last self-trained model was not far from it, and it helped us to understand some insights of the network that would be more difficult to understand without seeing and tuning the source code that builds the model.
The first thing that can be said is that this work was helpful as an introduction to a family of algorithms that achieve amazing results on some particular problems, and now I can understand how they work. After several unexpected accuracy values and training times, we can say that, although some of the inner workings of convolutional neural networks are still not completely clear, they do work quite well for image classification. Using the feature extraction program we can visualize what convolution does, which is to learn some edges, colors and points that help to learn and then identify the shapes of objects. One statement we can conclude with is that deeper networks are not always going to give better classification results: we gradually improved the precision when adding layers, but this did not always happen, as we saw when adding a normalization layer. Further work could investigate why this happened in our case. It is also observable that higher image resolution is likely to improve classification quality, as we saw in the last results. When looking for documentation on this aspect we saw that many of the parameters used in state-of-the-art networks, such as batch size or learning rate, are determined as we did, by trial and error, so it remains an open question why a specific number of layers, and of which depth, works better. Another thing we observed is that, as expected, using a previously trained net that already classifies 1000 classes gives a slightly better accuracy on our dataset than the networks trained from scratch only with our dataset (91% vs 90%), while also consuming much less time. With this work, taking into account that we did not use a very large data set, and with further reading, we could now say that object recognition in static images with CNNs is close to being a solved problem. A possibility for future work would be to continue exploring other applications, such as online user behavior prediction or improving speech recognition, or to look for other research areas in which this family of machine learning algorithms can also be useful.
References
[1] Warren McCulloch, Walter Pitts (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics.
[2] Alex Krizhevsky (2009). Learning Multiple Layers of Features from Tiny Images.
[3] Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks.
[4] Paul J. Werbos (1975). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.
[5] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell (2013). A Deep Convolutional Activation Feature for Generic Visual Recognition.
[6] George Cybenko (1989). Approximation by Superpositions of a Sigmoidal Function.
[7] Paul Smolensky (1986). Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory.
[8] Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, Rob Fergus. Regularization of Neural Networks using DropConnect.
[9] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, Alexei A. Efros. Context Encoders: Feature Learning by Inpainting.