
26/11/2018 ANNT : Convolutional neural networks - CodeProject


ANNT : Convolutional neural networks

Andrew Kirillov, 28 Oct 2018


The article demonstrates usage of ANNT library for creating convolutional ANNs and applying them to image classification tasks.

Download source code - 491 Kb

Theoretical background

Architecture of convolutional network

Convolutional layer
ReLU activation function
Pooling layer
Building convolutional neural network

Training convolutional network

Cross-entropy cost function

SoftMax activation function
ReLU activation function
Pooling layer
Convolutional layer

The ANNT library

Building the code

Usage examples

MNIST handwritten digits classification

CIFAR10 images classification

Conclusion

This article continues the topic of artificial neural networks and their implementation in the ANNT library. The first article started with
the basics and described feed forward fully connected neural networks and their training using the Stochastic Gradient Descent and Error
Back Propagation algorithms. It then demonstrated the application of this architecture to a number of tasks. One
of those was classification of handwritten characters from the MNIST database. Although a simple example, it managed to
achieve about 96.5% accuracy on a test dataset. In this article we'll have a look at a different architecture of artificial neural
networks known as Convolutional Neural Networks (CNN). This type of network is specifically designed for computer vision tasks
and outperforms classical fully connected neural networks when it comes to tasks like image recognition. As another sample
application will demonstrate, we'll get to about 99% accuracy on the handwritten characters classification.

Originally the convolutional neural network architecture was introduced by Yann LeCun when he published his work back in 1998.
However, it went largely unnoticed in those days. It took 14 years for convolutional networks to get wide attention, when the
ImageNet competition was won by a team using this architecture. CNNs became very popular after that and were applied to many
computer vision applications, resulting in the development of a variety of neural networks based on this architecture. These days
state-of-the-art convolutional neural networks achieve accuracies that outperform humans on many image recognition tasks.

Theoretical background
As in the case of feed forward fully connected artificial neural networks, the idea of convolutional networks was inspired by
studying nature - the brains of mammals. Work by Hubel and Wiesel in the 1950s and 1960s showed that cats' and monkeys' visual
cortexes contain neurons that individually respond to small regions of the visual field. Provided the eyes are not moving, the region
of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field. Neighbouring cells have
similar and overlapping receptive fields. Receptive field size and location varies systematically across the cortex to form a complete
map of visual space.

In their paper, they described two basic types of visual neuron cells in the brain that each act in a different way: simple cells and
complex cells. The simple cells activate, for example, when they identify basic shapes such as lines in a fixed area and at a specific angle.
The complex cells have larger receptive fields and their output is not sensitive to the specific position in the field. These cells
continue to respond to a certain stimulus, even though its absolute position on the retina changes.

In 1980, a researcher called Fukushima proposed a hierarchical neural network model, which was named neocognitron. This model
was inspired by the concepts of the simple and complex cells. The neocognitron was able to recognise patterns by learning about
the shapes of objects.

Later, in 1998, convolutional neural networks were introduced by Yann LeCun and his colleagues. Their first CNN was called LeNet-5
and was able to classify handwritten digits.

Architecture of convolutional network

Before getting into the details of building a convolutional neural network, let's have a look at some of the building blocks, which are
either specific to this type of network or became popular along with it. As was seen in the previous article, many
concepts of artificial neural networks can be implemented as separate entities, which perform calculations for both the inference and
training phases. Since the core structure was already laid out in the previous article, here we'll just be adding building blocks on top
and then stitching them together.

Convolutional layer

The convolutional layer is the core building block of a convolutional neural network. It assumes its input has a 3-dimensional shape of
some width, height and depth. For the first convolutional layer it is usually an image, which most commonly has a depth of 1
(grayscale image) or 3 (color image with 3 RGB channels). For subsequent convolutional layers the input is represented by a set of
feature maps produced by previous layers (here depth is the number of input feature maps). For now, let's assume we deal with
inputs having a depth of 1, which turns them into 2-dimensional structures.

So, what the convolutional layer does is essentially an image convolution with some kernel. It is a very common image processing
operation, which is used to achieve a variety of results. For example, it can be used to make images blurry or make them sharper. But
this is not what convolutional networks are interested in. Depending on the kernel in use, image convolution can be used to find
certain features in images – vertical or horizontal edges, corners, angles or more complex features like circles, etc. Recall the idea
of simple cells in the visual cortex?

Let's see how convolution is calculated. Suppose we have n (height) by m (width) matrices K (kernel) and I (image). Then their convolution can be
written as the sum of element-wise products of those matrices, where the kernel matrix is flipped horizontally and vertically.
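In symbols, following the description above and assuming 1-based row and column indices i and j (the index convention is an assumption made here):

$$K * I = \sum_{i=1}^{n} \sum_{j=1}^{m} K_{n-i+1,\,m-j+1} \, I_{i,j}$$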

For example, if we have 3 by 3 matrices K and I, the convolution of those can be calculated this way:
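Written out term by term with assumed 1-based indices, so that the flipped kernel pairs K_{3,3} with I_{1,1}:

$$K * I = K_{3,3} I_{1,1} + K_{3,2} I_{1,2} + K_{3,1} I_{1,3} + K_{2,3} I_{2,1} + K_{2,2} I_{2,2} + K_{2,1} I_{2,3} + K_{1,3} I_{3,1} + K_{1,2} I_{3,2} + K_{1,1} I_{3,3}$$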

The above is how convolution is defined in signal processing, where the kernel is flipped vertically and horizontally.
A more straightforward calculation would be just a sum of element-wise products of the K and I matrices, without any flipping. This
operation is called cross-correlation and is defined this way:
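In symbols, for n by m matrices K and I with assumed 1-based indices (the star operator is a notation chosen here):

$$K \star I = \sum_{i=1}^{n} \sum_{j=1}^{m} K_{i,j} \, I_{i,j}$$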

When it comes to signal processing, convolution and cross-correlation have different properties and are used for different purposes.
However, when it comes to image processing and neural networks the difference becomes subtle and cross-correlation is often
used instead. For neural networks it is really not important at all. As we'll see later, those "convolution" kernels are actually the
weights, which the neural network needs to learn. So, it is up to the network to decide which kernel to learn - flipped or not. With this in
mind, we'll keep it simple and use cross-correlation. Note: wherever "convolution" is mentioned further in the article, we'll
assume the sum of element-wise products of two matrices, i.e. cross-correlation.
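The relation between the two operations can be checked with a few lines of code. The following is a minimal Python sketch (plain lists, no libraries); the function names and the sample values are illustrative assumptions and are not part of the ANNT library.

```python
def flip(matrix):
    """Flip a matrix vertically and horizontally (a 180 degree rotation)."""
    return [row[::-1] for row in matrix[::-1]]


def cross_correlate(kernel, image):
    """Sum of element-wise products of two equally sized matrices."""
    return sum(k * p
               for kernel_row, image_row in zip(kernel, image)
               for k, p in zip(kernel_row, image_row))


def convolve(kernel, image):
    """Signal-processing convolution: cross-correlation with a flipped kernel."""
    return cross_correlate(flip(kernel), image)


# Illustrative values only; they are not taken from the article.
kernel = [[1, 0, -1],
          [2, 0, -2],
          [1, 0, -1]]
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

print(cross_correlate(kernel, image))  # kernel used as is
print(convolve(kernel, image))         # kernel flipped first
```

Convolving with an already flipped kernel gives the cross-correlation result, which is why a network that learns its kernels does not care which of the two definitions is used.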

OK, we now know how to calculate convolution for a kernel and an image of the same size. However, in
image processing this is rarely the case. The kernel is usually a square matrix of size 3 by 3, 5 by 5, 7 by 7, etc., while the image can be
of any size. So how is image convolution calculated then? To calculate image convolution the kernel is moved across the entire
image and the weighted sum is calculated at every possible location of the kernel. In image processing this concept is known as a
sliding window. Calculations start at the top left corner of the image, where convolution is calculated between the kernel and the
corresponding image area of the same size. Then the kernel is shifted right by one pixel and another convolution is calculated. This is
repeated until convolution is calculated at every position of the row. Once that is done, the kernel is moved to the start of the next row
of pixels and the process continues. When the entire image is processed, we get a feature map - values of individual
convolutions at every possible location of the image.
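The sliding window procedure described above can be sketched as follows. This is a minimal Python illustration under the article's convention that "convolution" means cross-correlation; the function name, the test image and the kernel values are assumptions made for the example.

```python
def cross_correlate_2d(image, kernel):
    """Slide the kernel over the image and return the feature map.

    The kernel is applied only where it fits entirely into the image.
    """
    n, m = len(kernel), len(kernel[0])   # kernel height and width
    h, w = len(image), len(image[0])     # image height and width
    feature_map = []
    for y in range(h - n + 1):           # top to bottom
        row = []
        for x in range(w - m + 1):       # left to right, one pixel at a time
            row.append(sum(kernel[i][j] * image[y + i][x + j]
                           for i in range(n) for j in range(m)))
        feature_map.append(row)
    return feature_map


# 8x8 image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 0, 1, 1, 1, 1] for _ in range(8)]

# Illustrative kernel reacting to a dark-to-bright vertical transition.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = cross_correlate_2d(image, kernel)
print(len(feature_map), len(feature_map[0]))
print(feature_map[0])
```

The non-zero values appear only in the columns where the window covers the transition between the two halves of the image.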

The picture below illustrates the process of calculating image convolution. For an image of 8x8 in size and a kernel of 3x3, we get a
feature map of 6x6 in size - convolution is calculated only at those locations where the kernel fits entirely into the image. The picture
below highlights a few regions of the source image and their corresponding convolution values in the resulting feature map.

The above 3x3 kernel is designed to look for an object's left edges (or the presence of a straight vertical line to the right of the center of
the sliding window). High positive values in the resulting feature map indicate presence of the feature we are looking for. Zeros
mean absence of the feature. And for this particular example, negative values indicate presence of the "inverse" feature – an object's
right edges.

As was shown above, the output feature map gets smaller than the source image when convolution is calculated. And the
bigger the kernel, the smaller the feature map we get. For a kernel of nxm in size, the input image loses n-1 rows and m-1 columns. So, if
we had a 5x5 kernel in the above example, the resulting feature map would get down to 4x4 in size. In many cases,
however, it is preferred to get an output feature map of the same size as the input. To obtain this, the source image needs to be padded
(usually with zeros). For example, if the source image is 8x8 in size and our kernel is 5x5 in size, then we would need to pad the
input so it gets to 12x12 in size, i.e. 4 extra rows/columns added. This is usually done by adding 2 rows/columns on each side of
the input image.
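The size arithmetic above is easy to verify in code. Below is a small Python sketch of zero padding; the helper names are hypothetical, and the padding rule (kernel size minus one, halved) assumes an odd kernel size.

```python
def zero_pad(image, pad):
    """Surround a 2D image with `pad` rows/columns of zeros on each side."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [[0] * width for _ in range(pad)]
    return padded


def same_padding(kernel_size):
    """Padding per side that keeps the output size equal to the input size.

    Assumes an odd kernel size (3, 5, 7, ...).
    """
    return (kernel_size - 1) // 2


def valid_output_size(height, width, n, m):
    """Feature map size when the n x m kernel must fit entirely into the input."""
    return height - n + 1, width - m + 1


image = [[1] * 8 for _ in range(8)]
padded = zero_pad(image, same_padding(5))

print(valid_output_size(8, 8, 5, 5))                          # without padding
print(len(padded), len(padded[0]))                            # padded input
print(valid_output_size(len(padded), len(padded[0]), 5, 5))   # with padding
```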

So far we've discussed how to compute convolution mathematically and how to compute image convolution when it comes to image
processing. However, we are doing artificial neural networks, so we need to see how all of the above relates to convolutional layers.
To keep it simple for now, let's use the example above – an 8x8 input image convolved with a 3x3 kernel, which gives us a 6x6
feature map (output). In this case, our input layer has 64 nodes and our convolutional layer has 36 neurons. However, unlike with
fully connected layer, where each neuron of the layer is connected to all neurons of the previous layer, neurons of convolutional
layer are connected only to a small group of the previous layer's neurons. Each neuron in convolutional layer has as many
connections as the number of weights in the convolution kernel it implements, which is 9 connections in the above example (kernel
size 3x3). Since convolutional layer assumes the input has 2D shape (3D in general, but keeping it simple for this example), those
connections are done to a rectangular group of previous neurons, which is of the same shape as the kernel in use. The group of
connected previous neurons is different for each neuron of the convolutional layer, however it does overlap for the neighbouring
neurons. These connections are made in the same way, as pixels of the source image are chosen, when calculating image
convolution using sliding window approach. For example, looking at the above image demonstrating image convolution, we can see
which of the highlighted outputs on the feature map get connected to which inputs (highlighted with the same color).

Ignoring the fact that neurons of fully connected layers and convolutional layers have different number of connections to the
previous layer and that these connections have certain structure, both layers essentially do the same – calculating weighted sum of
inputs to produce outputs. There is one more difference though. Unlike with fully connected layers, where each neuron has its own
weights, neurons of convolutional layers share them. So, if a layer does one single 3x3 convolution (in practice it does more than
one, but keep it for later), it just has one set of weights, i.e. 9, which are shared between each neuron for calculating weighted sum.
And, although it was not yet mentioned before, convolutional layers also add bias value to the weighted sum, which is also shared.
The table below summarizes the difference between fully connected and convolutional layers and provides some numbers for the
above example.

| Fully connected layer | Convolutional layer |
|---|---|
| No assumptions about input structure | Input is assumed to have 2D shape (3D in general) |
| Each neuron is connected to all neurons of the previous layer; 64 connections each | Each neuron is connected to a small rectangular group of neurons in the previous layer; number of connections equals the number of weights in the convolution kernel; 9 connections each |
| Each neuron has its own weights and bias value; 2304 weights and 36 bias values | Weights and bias value are shared; 9 weights and 1 bias value |

For now, we've kept things simple and assumed that both input and output of convolutional layer have 2D shape. However, it is not
the case in general. Instead, both input and output have 3D shape. First, let's start with the output. In practice, each convolutional
layer computes more than a single convolution. The number of convolutions it does is a configurable parameter, which is set when
designing artificial neural network. Each convolution uses its own set of weights (kernel) and bias value and so produces a different
feature map. As it was mentioned before, different kernels can be used to look for different features – lines at different angles,
curves, corners, etc. And so, it is often desired to get a number of feature maps, which highlight presence of different features.
Calculation of those maps is simple - the process of calculating convolution for the given input is repeated multiple times with
different kernel's weights/bias every time. Translating it to artificial neurons' world, we are simply adding additional groups of
neurons into the convolution layer, which are connected to inputs in the same way as in the case with single kernel. Having same
connection pattern, these groups of neurons share different weights and bias values though. Coming back to the example
described before, suppose we configure our convolution layer to do 5 convolutions, 3x3 each. In this case number of outputs
(number of neurons) is 36*5=180 – 5 groups of neurons organized into 2D shape and repeating same connection pattern. Each
group of neurons shares its own set of weights/bias, which gives us 45 weights and 5 bias values in total for the layer.

Now let's discuss 3D nature of inputs. If we speak about the very first convolutional layer, then its input will be some sort of image,
most likely. Most of the time it will be either grayscale image (2D data) or color RGB image (3D data). If we speak of subsequent
convolutional layers, then input's depth will be equal to the number of feature maps (number of convolutions) calculated by the
previous layer. When input gets higher depth, the number of neurons in convolutional layer is not growing. Instead, number of
connections with the previous layer is growing. In fact, convolution kernels get 3D shape as well and have nxmxd size, where d is
the depth of input. Translating it to neurons' world again, we can think of it as if each neuron gets additional connections to every
feature map input contains. In the case of 2D input, each neuron was connected to nxm (kernel size) rectangular area of the input.
In the case of 3D input, however, each neuron is connected to number (d) of such areas, which are coming from the same location,
but from different input feature maps.

Since we've generalized convolutional layers to 3D inputs/outputs and also mentioned bias values, we can update our convolution
formula, which is computed at every possible location (x, y) of the kernel within the input features.
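One way to write it, assuming 1-based indices and the following symbols, none of which are fixed by the text above: K^{(k)} is the k-th kernel of the layer (n by m by d in size), b^{(k)} is its bias value, I is the input of depth d, and o^{(k)} is the k-th output feature map:

$$o^{(k)}_{y,x} = b^{(k)} + \sum_{c=1}^{d} \sum_{i=1}^{n} \sum_{j=1}^{m} K^{(k)}_{i,j,c} \, I_{y+i-1,\,x+j-1,\,c}$$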

To finish with convolutional layers for now, let's summarize the parameters used to configure them. When creating a fully
connected layer, we use only two parameters - number of inputs and number of outputs (neurons in the layer). When creating
convolutional layers though, we don't need to specify the number of outputs. Instead we describe the shape of inputs, hxwxd, and the
shape and number of kernels, nxm@z. So, we have 6 numbers: w – width of input feature maps (image), h – height of input feature
maps, d – depth of input (number of feature maps), m – width of kernels, n – height of kernels, z – number of kernels (number of
output feature maps). The actual size of kernels depends on the input specification and so we get z kernels of nxmxd in size. And
the size of output then becomes (h-n+1)x(w-m+1)xz (here we assume input is not padded and kernel is applied only at valid
locations).

We'll get back to convolutional layers again when it comes to training them. The above, however, should give an idea of how output
is calculated on the inference phase (computing output of a trained network).
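The numbers used throughout this section follow directly from the six configuration values. As a minimal Python sketch (the function and key names are hypothetical, and no padding is assumed):

```python
def conv_layer_shape(h, w, d, n, m, z):
    """Sizes of a convolutional layer with h x w x d input and z kernels of n x m.

    Assumes no padding and a kernel step of one pixel.
    """
    out_h, out_w = h - n + 1, w - m + 1
    return {
        "output": (out_h, out_w, z),
        "neurons": out_h * out_w * z,
        "connections_per_neuron": n * m * d,
        "weights": n * m * d * z,   # each kernel is n x m x d, shared by its neurons
        "biases": z,                # one shared bias value per kernel
    }


# The example from the text: 8x8 grayscale input, 5 kernels of 3x3.
layer = conv_layer_shape(h=8, w=8, d=1, n=3, m=3, z=5)
print(layer)

# A fully connected layer with the same 64 inputs and 36 outputs, for comparison.
print(64 * 36, "weights and", 36, "bias values")
```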

ReLU activation function

The next building block to describe is ReLU activation function. It is not something new or specific to convolutional neural networks.
However, it was popularized a lot with the rise of deeper neural networks. And this is where convolutional networks usually fit.

One of the problems deep neural networks experience is known as the vanishing gradient problem. When training an artificial neural
network using gradient-based learning algorithms and backpropagation, each of the neural network's weights receives an update
proportional to the partial derivative of the error function with respect to the current weight. The problem is that in some cases the
gradient value can be so small that it effectively prevents the weight from changing its value. One of the causes of this problem is
the use of traditional activation functions such as sigmoid and hyperbolic tangent. The derivatives of these functions never exceed 1
(0.25 for the sigmoid) and are close to zero on the majority of the function's domain. And since the error's partial derivatives are
calculated using the chain rule, it means that for an n-layer network there will be n multiplications of these small numbers, meaning
the gradient decreases exponentially with n. As a result, the "front" layers of a deep network train very slowly, if at all.

The ReLU function is defined as f(x)=max(0, x). Its biggest advantage is that it has constant derivative equal to 1 for values of x
greater than zero. As the result, it allows better gradient propagation, which speeds up training of deeper artificial neural networks.
Also, it is more computationally efficient, making it faster to compute in comparison with sigmoid or hyperbolic tangent.
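The effect on gradient propagation can be illustrated numerically. The Python sketch below multiplies activation derivatives along a chain of 10 layers; it deliberately ignores the weights, which also take part in the real chain rule, so it is a simplification meant only to show the trend. The function names and the input value are assumptions of the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never exceeds 0.25


def relu(x):
    return max(0.0, x)


def relu_derivative(x):
    return 1.0 if x > 0 else 0.0  # constant 1 for positive inputs


layers = 10
x = 2.0  # the same illustrative pre-activation value in every layer

# Simplification: product of activation derivatives only, weights ignored.
sigmoid_chain = sigmoid_derivative(x) ** layers
relu_chain = relu_derivative(x) ** layers

print(sigmoid_chain)
print(relu_chain)
```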

[Figure: plots of the ReLU function and the sigmoid function]

Although ReLU function does have some potential problems as well, so far it looks like the most successful and widely-used
activation function when it comes to deep neural networks.

Pooling layer

It is a common practice to follow a convolutional layer with a pooling layer. The objective of this layer is to down-sample the feature
maps produced by the previous convolutions. By reducing the spatial size of inputs, we also reduce the number of parameters and the amount of
computation in the neural network. This also helps in controlling overfitting – fewer parameters means less chance to overfit.

The most common pooling technique is MAX pooling with a 2x2 filter and stride 2. For an nxm input feature map, it produces an
(n/2)x(m/2) map by replacing every 2x2 region in the input with a single value – the maximum of the 4 values in that region. These
regions don't overlap, but are adjacent to each other, since the filter is moved horizontally and vertically with the step size (stride) equal
to its size. Below is an example of applying MAX pooling to a 6x6 input (colored cells highlight source values of the MAX operator
and the corresponding result).

MAX pooling is not the only pooling technique. Another common one is Average pooling, which calculates average values of the
source regions instead of taking their maximum value.

Pooling layers also can be configured with different size of the filter and stride value. For example, some applications use 3x3 filter
with stride 2. Such configuration creates an overlapping pattern of pooling regions, since the filter's step size is smaller than its size.
Making stride value greater than filter size is uncommon however, since some features may get lost completely.

One important thing to mention about pooling layers is that they operate with 2D feature maps and don't affect depth of the input.
So, if input contains 10 feature maps produced by previous convolutional layer, for example, the pooling is applied individually to
each map. As the result, it produces same number of feature maps, but smaller in size.
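The pooling variants described in this section differ only in the filter size, the stride and the reducing operation, which the following Python sketch makes explicit. The function names and the sample feature map are illustrative assumptions.

```python
def pool_2d(feature_map, size=2, stride=2, reduce=max):
    """Down-sample one 2D feature map with a size x size filter."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for y in range(0, h - size + 1, stride):
        row = []
        for x in range(0, w - size + 1, stride):
            window = [feature_map[y + i][x + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


# Illustrative 4x4 feature map.
feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]]

max_pooled = pool_2d(feature_map)                  # MAX pooling, 2x2, stride 2
avg_pooled = pool_2d(feature_map, reduce=average)  # Average pooling

# Pooling never mixes feature maps: each one is processed individually.
maps = [feature_map, feature_map]
pooled_maps = [pool_2d(fm) for fm in maps]

print(max_pooled)
print(avg_pooled)
```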

Building convolutional neural network

As we now have the most common building blocks, we can put them together into a convolutional neural network. Although there
are some network architectures which are based entirely on convolutional layers, this is rare. Most of the time convolutional
networks only start with convolutional layers, which perform initial feature extraction, and are then followed by fully connected layers,
which perform final classification.

As an example, below is the architecture of LeNet-5 convolutional neural network, which was first described by Yann LeCun and
applied to classification of hand-written digits. It takes a 32x32 grayscale image as its input and produces a vector of 10 values –
probabilities of belonging to certain class (digits from 0 to 9). The table below summarizes the architecture of the network,
dimensions of layers’ outputs and number of trainable parameters (weights + biases).

| Layer type | Trainable parameters | Output size |
|---|---|---|
| Input image | | 32x32x1 |
| Convolution layer 1, 6 kernels of 5x5 in size; ReLU activation | 156 | 28x28x6 |
| MAX pooling 1 | | 14x14x6 |
| Convolution layer 2, 16 kernels of 5x5 in size; ReLU activation | 416 | 10x10x16 |
| MAX pooling 2 | | 5x5x16 |
| Convolution layer 3, 120 kernels of 5x5 in size | 3120 | 1x1x120 |
| Fully Connected layer 1, 120 inputs, 84 outputs; Sigmoid activation | 10164 | 84 |
| Fully Connected layer 2, 84 inputs, 10 outputs; SoftMax activation | 850 | 10 |

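With only 14706 trainable parameters, the structure of the above convolutional neural network is very simple. (The counts listed for
convolution layers 2 and 3 correspond to a single 5x5 set of weights per kernel; with kernels spanning the full input depth, 5x5x6 and
5x5x16, these layers have 2416 and 48120 parameters and the total becomes 61706.) These days there are
much more complicated deep networks being developed, which include many millions of parameters to train.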

Training convolutional network

So far we've discussed only the inference part of convolutional neural network, which is calculating its output for a given input.
However, the network needs to be trained first to get something meaningful out of it. When it comes to convolution operator in
image processing, the kernels there are usually handcrafted and serve specific purpose. Some kernels are used to find objects'
edges, some for making pictures sharper or blurry, etc. Very often it is a time-consuming process to design right kernel to perform
the task needed. With convolutional neural networks it is all different, however. When designing such network, we think about
number of layers, number and size of convolutions done, etc. But we don't set those convolution kernels. Instead, the network will
learn those during the training phase, since essentially those kernels are nothing more but weights – same as we have them in fully
connected layers.

Training of convolutional artificial networks is done using exactly the same algorithms as used for training of fully connected
networks – stochastic gradient descent and backpropagation. As was demonstrated in the previous article, to calculate partial
derivatives of a neural network's error with respect to its weights we can use the chain rule. It allows us to define complete equations for
weights' updates of any trainable layer. However, this time we'll concentrate more on the error back propagation side of things and
instead of providing one big equation containing all parts of the chain rule, we'll provide smaller equations, which are specific to
each building block of a neural network – fully connected and convolutional layers, activation functions, cost functions, etc.

If we revisit the chain rule from the previous article, we'll notice that every building block of a neural network calculates its error gradient
by taking the partial derivatives of its outputs with respect to its inputs and multiplying them by the error gradient coming from the block following it.
Remember we are moving backward, so calculations start at the last block and flow towards the first block. The last
block in the training phase is always a cost function, and so it computes the error gradient as the derivative of the cost (its output) with respect
to the neural network's output (input of the cost function). This can be defined the following way:
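In symbols, with E denoting the cost and y_i the i-th output of the network (both symbols are assumed here), the cost block hands the following gradient to the block before it:

$$\frac{\partial E}{\partial y_i}, \qquad i = 1, \dots, n$$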

All other building blocks take the error gradient from the next block and multiply it with partial derivatives of their own outputs with
respect to inputs.
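As a general rule, for a block with inputs x_j and outputs o_i (assumed symbols), where the gradient received from the next block is the set of derivatives of the cost E with respect to o_i:

$$\frac{\partial E}{\partial x_j} = \sum_{i} \frac{\partial E}{\partial o_i} \cdot \frac{\partial o_i}{\partial x_j}$$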

Before describing derivatives of the new building blocks, which we are going to use for convolutional networks, let's revisit
derivatives of the building blocks we've used for fully connected networks, but written in the new notation. First, we start with error
gradient of MSE cost function with respect to outputs of the network (yi – outputs produced by the network, ti – target outputs):
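Assuming the cost is written as E = ½ Σ (y_i − t_i)² (the ½ factor is an assumed convention; a different scaling only changes the constant in front), the gradient is:

$$\frac{\partial E}{\partial y_i} = y_i - t_i$$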

Now, when the error gradient passes backward through a sigmoid activation function, it is recalculated as the gradient from the next
block (whatever it is – it can be a cost function or another layer in a multi-layer network) multiplied by the sigmoid's derivative
(oi here is the output of the sigmoid):
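With x_i denoting the sigmoid's input and E the cost (assumed symbols), and using the fact that the sigmoid's derivative can be expressed through its output as o_i(1 − o_i):

$$\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial o_i} \cdot o_i \, (1 - o_i)$$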

Alternatively, if hyperbolic tangent is used as the activation function, its derivative is used instead:
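With o_i denoting the output of the hyperbolic tangent, x_i its input and E the cost (assumed symbols), and the derivative of tanh expressed through its output as 1 − o_i²:

$$\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial o_i} \cdot (1 - o_i^2)$$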

Now we need to propagate the error gradient backward through a fully connected layer. Since every input is connected to every output,
we get a sum of partial derivatives (n is the number of neurons in the fully connected layer, j is the input's index, i is the output's/neuron's index):
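With w_{i,j} denoting the weight connecting input j to neuron i, x_j the layer's inputs, o_i its outputs and E the cost (the symbols are assumptions of this note), the sum reads:

$$\frac{\partial E}{\partial x_j} = \sum_{i=1}^{n} \frac{\partial E}{\partial o_i} \cdot w_{i,j}$$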

Since a fully connected layer is a trainable layer, it needs not only to pass the error's gradient backward to the previous building block/layer,
but also to update its weights. Using the naming convention defined above, the update rule for weights and biases can be written as
below (classical SGD):
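With λ denoting the learning rate, w_{i,j} the weight between input x_j and neuron i, b_i the neuron's bias and o_i its output (assumed symbols), plain SGD updates take the form:

$$w_{i,j} \leftarrow w_{i,j} - \lambda \, \frac{\partial E}{\partial o_i} \, x_j, \qquad b_i \leftarrow b_i - \lambda \, \frac{\partial E}{\partial o_i}$$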

All of the equations above are a quick repetition of back propagation from the previous article. Why was it important? Well, first, to
remind ourselves of the basics. Second, to rewrite it in a different way, where each building block defines its own error gradient back
propagation equation, which is independent of the other blocks. The way the weights' update equation was given in the previous article
helps to understand the basics and how the chain rule works. But being one single equation makes it not generic at all. What if we
need a different cost function instead of MSE? What if we need hyperbolic tangent or ReLU activation instead of sigmoid? The way it
is presented in this article makes it more flexible and allows mixing building blocks of artificial neural networks in various ways and
training them without assumptions on which layer is followed by which activation and which cost function is in use (well, more or less).
Plus, this presentation is more in sync with the actual C++ implementation, where different building blocks are implemented as
separate classes, taking care of their own calculations for the forward pass and backward pass during training.

Note: If the above is not clear, it is recommended to go through the previous article.
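To make the idea of independent building blocks more concrete, here is a minimal sketch of a fully connected layer, a sigmoid activation and an MSE gradient, each taking care of its own forward and backward calculations. It is written in Python for brevity rather than in the library's C++, and the class and method names are hypothetical: this is not the ANNT API. The MSE gradient assumes the ½ Σ (y − t)² form of the cost.

```python
import math


class FullyConnected:
    def __init__(self, weights, biases):
        self.weights = weights   # weights[i][j]: input j -> neuron i
        self.biases = biases

    def forward(self, inputs):
        self.inputs = inputs
        return [sum(w * x for w, x in zip(row, inputs)) + b
                for row, b in zip(self.weights, self.biases)]

    def backward(self, grad_outputs, learning_rate):
        # Gradient for the previous block, computed with the old weights.
        grad_inputs = [sum(g * row[j] for g, row in zip(grad_outputs, self.weights))
                       for j in range(len(self.inputs))]
        # Classical SGD update of weights and biases.
        for i, g in enumerate(grad_outputs):
            for j, x in enumerate(self.inputs):
                self.weights[i][j] -= learning_rate * g * x
            self.biases[i] -= learning_rate * g
        return grad_inputs


class Sigmoid:
    def forward(self, inputs):
        self.outputs = [1.0 / (1.0 + math.exp(-x)) for x in inputs]
        return self.outputs

    def backward(self, grad_outputs):
        return [g * o * (1.0 - o) for g, o in zip(grad_outputs, self.outputs)]


def mse_gradient(outputs, targets):
    # Assumes E = 1/2 * sum((y - t)^2).
    return [y - t for y, t in zip(outputs, targets)]


# One training step on a single illustrative sample.
layer, activation = FullyConnected([[0.5, -0.5]], [0.0]), Sigmoid()
sample, target = [1.0, 2.0], [1.0]

before = activation.forward(layer.forward(sample))[0]
grad = mse_gradient([before], target)          # cost function block
grad = activation.backward(grad)               # activation block
layer.backward(grad, learning_rate=0.5)        # trainable layer block
after = activation.forward(layer.forward(sample))[0]

print(before, after)
```

Swapping the activation or the cost function means replacing one block only; the other blocks keep passing gradients in the same way.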

Cross-entropy cost function

One of the most common uses of convolutional neural networks is image classification. Given an image, a network needs to classify
it into one of the mutually exclusive classes. For example, it can be hand written digits classification, where we have 10 possible
classes corresponding to digits from 0 to 9. Or a network can be trained to recognize objects like car, truck, ship, airplane, etc., and
so we'll have as many classes as we have types of objects. The main point in this type of classification is that each input image
must belong to one class only, i.e. we cannot have objects which are classified as both car and airplane.

When dealing with multi class classification problems, the designed artificial neural network has as many outputs, as the number of
classes we have. On the training phase, target outputs are one-hot encoded, i.e. represented with vector of zeros with only one
element set to value '1' at the index corresponding to the class. For example, for a task of 4-class classification, our target outputs
may look something like this: {0, 1, 0, 0} – class 2, {0, 0, 0, 1} – class 4, etc. None of the target outputs are allowed to have multiple
elements set to '1' or another non-zero value. This can be viewed as target probabilities, i.e. the {0, 1, 0, 0} output means that the
presented input belongs to class 2 with 100% probability and to other classes with probability of 0%.

When training, however, the actual neural network's outputs will look different. It may provide an output something like {0.3,
0.35, 0.25, 0.1}, for example. Such output may have different meaning. For a trained network, it may mean the network was
presented with a tricky example and it is not very clear, but looks more like class 2 – the highest probability of 35%. Or, if we just
started training, it may mean little at all, other than "keep going".

And so, we need a cost function, which would tell us the amount of difference between target and the real output and direct
parameters' update of the neural network. When it comes to probabilistic models over mutually exclusive classes, we deal with
predicted and the ground-truth probabilities. In such cases, the common choice is the cross-entropy cost function, which has its
roots in information theory. By minimizing cross-entropy, we minimize the amount of extra data
(bits) required for encoding events appearing with probability distribution ti (target or real distribution) using some estimated
probabilities yi (which might be close, but not exactly the same). And to minimize the cross-entropy, we need to make our estimated
probabilities the same as the real probabilities – which is what we are looking for.

The cross-entropy cost function, the value we need to minimize, is defined as below (same as before – ti are target outputs, while yi
is the output provided by neural network):
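In its usual form, with E denoting the cost and the natural logarithm assumed (using another base only rescales the value):

$$E = -\sum_{i} t_i \ln y_i$$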

Getting its derivative, the gradient of the cost function with respect to neural network's output is then calculated as:
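For the cost written as E = −Σ t_i ln y_i (natural logarithm assumed), differentiating with respect to each output gives:

$$\frac{\partial E}{\partial y_i} = -\frac{t_i}{y_i}$$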

Now we have the cross-entropy cost function instead of MSE and so we can move to other building blocks and see how error
gradient propagates backward.
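As a quick numeric check, the cost and its gradient can be computed for the 4-class example used earlier in this section. This is a minimal Python sketch; the function names are hypothetical, the natural logarithm is assumed, and the small `eps` clamp that protects against log(0) and division by zero is an addition of this example.

```python
import math


def cross_entropy(outputs, targets, eps=1e-12):
    """Cross-entropy cost; eps guards against log(0) (an added safeguard)."""
    return -sum(t * math.log(max(y, eps)) for y, t in zip(outputs, targets))


def cross_entropy_gradient(outputs, targets, eps=1e-12):
    """Gradient of the cost with respect to each network output."""
    return [-t / max(y, eps) for y, t in zip(outputs, targets)]


targets = [0, 1, 0, 0]            # one-hot encoded class 2
outputs = [0.3, 0.35, 0.25, 0.1]  # network output from the example above

cost = cross_entropy(outputs, targets)
gradient = cross_entropy_gradient(outputs, targets)

print(cost)
print(gradient)
```

Only the output of the target class contributes to the cost and receives a non-zero gradient.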

SoftMax activation function

For the activation function of the last layer of a neural network used for a classification problem, we could use the sigmoid
function, which we've already seen in the previous article and quickly repeated above. Its output is in the (0, 1) range and so
can be interpreted as a probability between 0% and 100%. When a neural network is trained with sigmoid in the output layer, it
really may provide probabilities close to the ground truth. However, since we deal with mutually exclusive classes, it may not
always make perfect sense. For example, given a challenging example, a network may provide an output vector like this:
{0.6, 0.55, 0.1, 0.1}. Yes, it looks like class 1 with a probability of 60%! But the probability of class 2 is not too far away.
And another problem is that if we sum the four probabilities, we get 1.35, which is 135%.

There are two problems we want to address. First, we definitely want the sum of probabilities to equal 100%. Not more, not
less. Also, if we get a tricky example, which looks like class 1 but also seems close to class 2, can we really have a high
certainty of 60% that the classification is right?

To resolve the two issues above, we can use a different activation function, SoftMax. Same as sigmoid, it provides output
in the (0, 1) range. But unlike sigmoid, it does not operate on single values of the input vector, but on the entire vector, and so
makes sure the sum of the output vector equals 1. The SoftMax function is defined as follows:
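
In its standard form, for an input vector x and an output vector y (symbols chosen here for illustration), SoftMax is:

```latex
y_i = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
```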

If we apply the SoftMax function to the values of the above example, the output vector looks different and makes more sense:
{0.316, 0.3, 0.192, 0.192}. (These figures come from feeding the vector {0.6, 0.55, 0.1, 0.1} itself to SoftMax. Recovering the
sigmoid's source input values with the inverse sigmoid and applying SoftMax to those would instead give roughly
{0.509, 0.415, 0.038, 0.038}.) As we can see, the sum of all values equals 1, which is 100%. And even though the 1st class seems
to win, its probability is not that high, only 31.6%.
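
The numbers are easy to check with a short sketch. The function below is a plain SoftMax over a vector; its name and the max-subtraction trick (which leaves the result unchanged but avoids overflow in exp()) are choices made for this illustration, not the library's implementation. Feeding it {0.6, 0.55, 0.1, 0.1} gives approximately {0.316, 0.301, 0.192, 0.192}.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// SoftMax over an entire vector: y[i] = exp( x[i] ) / sum( exp( x[j] ) ).
std::vector<float> SoftMax( const std::vector<float>& input )
{
    std::vector<float> output( input.size( ) );

    if ( input.empty( ) )
    {
        return output;
    }

    // subtracting the maximum does not change the result, but keeps exp() from overflowing
    const float maxValue = *std::max_element( input.begin( ), input.end( ) );
    float       sum      = 0.0f;

    for ( std::size_t i = 0; i < input.size( ); i++ )
    {
        output[i] = std::exp( input[i] - maxValue );
        sum      += output[i];
    }

    for ( std::size_t i = 0; i < output.size( ); i++ )
    {
        output[i] /= sum;
    }

    return output;
}
```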

As for any other activation function, we need to define gradient back propagation equation for the SoftMax function. Here it is:
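
A form that follows from the standard SoftMax Jacobian, ∂yj/∂xi = yj(1[i=j] − yi), is sketched below. Here δ(k+1) denotes the error gradient coming from the next block (taken with respect to the SoftMax outputs y) and δ(k) the gradient passed further back; this notation is an assumption of the sketch.

```latex
\delta_i^{(k)} = \sum_{j} \delta_j^{(k+1)} \frac{\partial y_j}{\partial x_i}
              = y_i \left( \delta_i^{(k+1)} - \sum_{j} \delta_j^{(k+1)} y_j \right)
```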

Going now further backward through the LeNet-5 neural network's architecture, we see fully connected layers and sigmoid
activation function. Equations for both were already defined above. So now it is time to address the other building blocks introduced
in this article.

ReLU activation function

As already mentioned above, the ReLU activation function became a very popular choice for deeper neural networks, as it allows
much better propagation of the error's gradient through the network. This is due to its constant gradient, equal to 1 for input
values greater than zero. To complete ReLU activation, we also need to define its equation for gradient back propagation.
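
Since ReLU passes positive inputs through unchanged and outputs zero otherwise, its back propagation rule in the standard form is as follows (δ(k+1) is the error gradient coming from the next block, x is the input of the ReLU block; the notation is an assumption of this sketch):

```latex
\delta_i^{(k)} =
\begin{cases}
\delta_i^{(k+1)}, & \text{if } x_i > 0 \\
0, & \text{otherwise}
\end{cases}
```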

Pooling layer

Now it is time to propagate the error's gradient backward through the pooling layer. To make it simple, let's suppose we use a
2x2 kernel with stride 2 and we don't use input padding (we apply pooling to valid locations only). With this in mind, every value
of the output feature map is calculated based on 4 values of the input feature map.

Although pooling layers assume that input vectors represent 2D data, the math below will work with inputs/outputs as 1D
vectors. To make it all work, we'll define an i2j() function, which for the given index i of the input vector returns the
corresponding index j of the output vector. Since each output is calculated based on 4 input values, there are 4 input indexes
for which i2j() will return the same output index.
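
One possible way to write such a mapping is sketched below. It assumes zero-based indexes, a row-major flattened feature map of even width, and the 2x2 kernel with stride 2 described above; the function name and its second parameter are illustrative, not part of the library.

```cpp
#include <cstddef>

// Maps index i of the flattened input feature map to index j of the flattened
// output feature map for 2x2 pooling with stride 2 and no padding.
// inputWidth is the width of the 2D input map and is assumed to be even.
std::size_t I2J( std::size_t i, std::size_t inputWidth )
{
    const std::size_t inputRow    = i / inputWidth;
    const std::size_t inputCol    = i % inputWidth;
    const std::size_t outputWidth = inputWidth / 2;

    return ( inputRow / 2 ) * outputWidth + ( inputCol / 2 );
}
```

For a 4x4 input, the indexes 0, 1, 4 and 5 (the top-left 2x2 block) all map to output 0, while 10, 11, 14 and 15 (the bottom-right block) map to output 3.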

Let's start with Max Pooling. To define the equation for the error's gradient back propagation, we'll need one extra thing. On the
forward pass, when the neural network's output is calculated, the pooling layer also fills in the maxIndexes vector, which has the
same length as the output vector. While the output vector contains the maximum value of the corresponding input values, the
maxIndexes vector contains the index of that maximum value. With all the above, we can define the gradient back propagation
equation for the Max Pooling layer:
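
Using the i2j() function and the maxIndexes vector described above, a form consistent with that description is the following: the gradient goes only to the input that produced the maximum, every other input gets zero (δ(k+1) is the gradient coming from the next block; the notation is an assumption of this sketch).

```latex
\delta_i^{(k)} =
\begin{cases}
\delta_{i2j(i)}^{(k+1)}, & \text{if } i = \mathrm{maxIndexes}_{i2j(i)} \\
0, & \text{otherwise}
\end{cases}
```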

As for Average Pooling, it is even simpler: the error gradient coming from the next block is simply divided by the size of the
pooling kernel, which is 4 in our case:
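
Written with the same i2j() mapping (notation assumed as before, δ(k+1) being the gradient for the pooling layer's output):

```latex
\delta_i^{(k)} = \frac{\delta_{i2j(i)}^{(k+1)}}{4}
```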

Convolutional layer

Finally, it is time to define back propagation pass for convolutional layer. It is not much different from fully connected layer as long
as the fact of shared weights is kept in mind.

Let's start with the weights update of the convolutional layer. With fully connected layers it was simple: the partial derivative
of error with respect to weight wi,j equals the error gradient coming from the next block multiplied by the corresponding input
value, δi(k+1) · xj. The reason for this is that each input/output connection is assigned its own weight in a fully connected
layer, which is not shared. However, it is not the case in a convolutional layer. The picture below demonstrates that every weight
of the convolution kernel is used for many input/output connections. In the example below, the highlighted kernel's weights are
used 9 times each, since the kernel is applied in 9 different positions within the input image. And so, the partial derivative of
error with respect to a weight will need to have 9 terms as well, the number of times the weight is used.

Same as with pooling layers, we'll ignore here the fact that convolutional layers deal with 2D/3D data. Instead we'll assume that
inputs/outputs/kernels are plain vectors/arrays for now (this is what they end up as in C++ anyway). And so, for the example above,
the 1st kernel's weight (highlighted in red) is applied to inputs {1, 2, 3, 5, 6, 7, 9, 10, 11}, while the 4th weight is applied to
inputs {6, 7, 8, 10, 11, 12, 14, 15, 16}. Suppose that we have such a vector of input indexes used by every weight, which we'll name
weightInputsi, the inputs of the ith weight. Also, we'll define a function of two arguments i2o(i,j), which provides the index of
the output value for the ith weight and jth input. Here are a few examples for the picture above: i2o(1,1)=1, i2o(4,6)=1,
i2o(1,11)=9 and i2o(4,16)=9. With the above naming convention, the weights' update rule for a convolutional network can then be
defined as follows:
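
Using the weightInputs vectors and the i2o(i,j) function introduced above, a form consistent with the description is the sum, over every input the ith weight touches, of that input multiplied by the error gradient of the output it contributes to (E is the cost, x the layer's input, δ(k+1) the gradient coming from the next block; the symbols are assumptions of this sketch):

```latex
\frac{\partial E}{\partial w_i} = \sum_{j \in \mathrm{weightInputs}_i} \delta_{i2o(i,j)}^{(k+1)} \, x_j
```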

Does the above make sense? Well, the more you think about it, the more it will. All we do is take the error gradients for all the
outputs (since each kernel's weight is used to calculate all outputs) and multiply them by the corresponding inputs. Yes, we have
multiple kernels. But they are all applied in the same pattern, so even though we'll need to update weights of different kernels,
the weightInputs vectors stay the same. However, i2o(i,j) is specific to each kernel. Or it can be extended with an extra
parameter, the kernel index.

Updating bias value is much simpler. Since each kernel/bias is used to calculate every output value, we'll just sum all error
gradients for the feature map produced by that kernel.
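
In the notation used above, with δj(k+1) being the error gradient for the jth output of the feature map produced by the kernel, the bias gradient can be written as follows (the symbol b for the bias value is an assumption of this sketch):

```latex
\frac{\partial E}{\partial b} = \sum_{j} \delta_j^{(k+1)}
```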

Note: both equations above are done per feature map/kernel, i.e. weights and bias value are not parameterized there with kernel index.

Now it is time to get the final equation for the convolutional layer, which is for propagating the error gradient backward through
the network. This means calculating partial derivatives of error with respect to the inputs of the layer. Each input element can be
used multiple times to produce output values of a feature map. It can be used as many times as the number of elements in the
convolution kernel (the number of weights). Some inputs are used for only one output, though. For example, those are the inputs in
the corners of the 2D input feature map. But then we also need to keep in mind that every input feature map can be processed
multiple times with different kernels, which generate more output maps. Again, let's pretend it is all flat for now, no 2D/3D
indexing. Then, let's assume we have another set of helper vectors named inputOutputsi, keeping indexes of the outputs which the
ith input contributes to. Finally, we'll need the i2w(i,j) function, which provides the index of the weight used to connect the
ith input with the jth output. Here are a few examples again for the above picture: i2w(1,1)=1, i2w(6,1)=4, i2w(16,9)=4. With all
this, we can define the equation for propagating the error's gradient backward through a convolutional layer.
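
With the inputOutputs vectors and the i2w(i,j) function described above, a form consistent with the text is shown below: each input collects the gradients of the outputs it contributes to, weighted by the kernel weight that made the connection. As with the earlier rules, this is written for a single kernel; with several kernels the contributions of all of them are summed. The notation is an assumption of this sketch.

```latex
\delta_i^{(k)} = \sum_{j \in \mathrm{inputOutputs}_i} w_{i2w(i,j)} \, \delta_j^{(k+1)}
```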

Now it looks like the math is complete: we have everything we need to calculate both the forward pass through a convolutional
network and the backward pass. If it still puzzles, confuses or leaves some uncertainty, go through it all again and think about
it. Or dive into the code to see the relation between the math and the implementation.
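
To make that relation concrete, here is a minimal sketch of the three gradients for the simplest case: one square input feature map, one square kernel, stride 1 and no padding. It is not the ANNT implementation. The struct, the function name and the zero-based indexing are assumptions of this example, and the explicit 2D loops play the role of the weightInputs, inputOutputs, i2o() and i2w() helpers from the text.

```cpp
#include <cstddef>
#include <vector>

struct ConvGradients
{
    std::vector<float> weights; // dE/dw, one value per kernel weight
    float              bias;    // dE/db
    std::vector<float> inputs;  // dE/dx, one value per input element
};

// Backward pass of a single-kernel "valid" convolution over a square input.
// input holds inputSize x inputSize values, kernel holds kernelSize x kernelSize values,
// outputDelta holds the error gradient for each of the outputSize x outputSize outputs.
ConvGradients ConvBackward( const std::vector<float>& input, std::size_t inputSize,
                            const std::vector<float>& kernel, std::size_t kernelSize,
                            const std::vector<float>& outputDelta )
{
    const std::size_t outputSize = inputSize - kernelSize + 1;

    ConvGradients grads{ std::vector<float>( kernel.size( ), 0.0f ), 0.0f,
                         std::vector<float>( input.size( ), 0.0f ) };

    for ( std::size_t oy = 0; oy < outputSize; oy++ )
    {
        for ( std::size_t ox = 0; ox < outputSize; ox++ )
        {
            const float delta = outputDelta[oy * outputSize + ox];

            // the bias takes part in every output, so it sums all output gradients
            grads.bias += delta;

            for ( std::size_t ky = 0; ky < kernelSize; ky++ )
            {
                for ( std::size_t kx = 0; kx < kernelSize; kx++ )
                {
                    const std::size_t inputIndex  = ( oy + ky ) * inputSize + ( ox + kx );
                    const std::size_t weightIndex = ky * kernelSize + kx;

                    // shared weight: one term per position the kernel was applied at
                    grads.weights[weightIndex] += delta * input[inputIndex];
                    // input: one term per output it contributed to
                    grads.inputs[inputIndex] += delta * kernel[weightIndex];
                }
            }
        }
    }

    return grads;
}
```

For a 4x4 input and a 2x2 kernel, the inner loops touch each weight 9 times, a corner input once and an interior input 4 times, which matches the counts discussed above.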

The ANNT library

Implementation of convolutional artificial neural networks in the ANNT library is heavily based on the design set by the
implementation of fully connected networks described in the previous article. All the core classes are left as they were; only
new building blocks were implemented, which allow assembling convolutional neural networks. The new class diagram of the
library is shown below, and there is not much of a difference.

Similar to the way it was set before, the new building blocks take care of calculating their output on the forward pass and
propagating the error gradient on the backward pass (as well as calculating initial weights' updates in the case of trainable
layers). As a result, all the code for neural network training is left unchanged.

And, as in the case with the rest of the code, the new building blocks utilize SIMD instructions wherever possible to vectorize
computations, as well as OpenMP to parallelize them.

Building the code

The code comes with MSVC (2015 version) solution files and GCC make files. Using MSVC solutions is very easy – every
example's solution file includes projects of the example itself and the library. So MSVC option is as easy as opening solution file of
required example and hitting build button. If using GCC, the library needs to be built first and then the required sample application
by running make.

Usage examples

After the long discussion about the theory and math of convolutional neural networks, it is time to get to practice and actually
build some networks for image classification tasks: hand-written digits and different objects like cars, trucks, ships, airplanes, etc.

Note: none of these examples claims that the demonstrated neural network architecture is the best for its task. In fact, none of
these examples even says that artificial neural networks are the way to go. Instead, their only purpose is to demonstrate
the use of the library.

Note: the code snippets below are only small parts of the example applications. To see the complete code of the examples, refer to
the source code package provided with the article (which also includes examples for fully connected neural networks described in
the previous article).

MNIST handwritten digits classification

The first example to have a look at is classification of hand-written digits from the MNIST database. The database contains 60000
examples for neural network training and additional 10000 examples for testing of the trained network. The picture below
demonstrates some of the examples of different digits to classify.

The convolutional neural network used in this example has a structure very similar to the LeNet-5 network mentioned above. The
difference is that we'll use a slightly smaller network (well, actually a lot smaller, if we look at the number of weights to
train), which has only one fully connected layer. Here is the structure of the network we'll use:

Conv(32x32x1, 5x5x6 ) -> ReLU -> AvgPool(2x2)
Conv(14x14x6, 5x5x16 ) -> ReLU -> AvgPool(2x2)
Conv(5x5x16, 5x5x120) -> ReLU
FC(120, 10) -> SoftMax

The configuration above gives the size of the input for each convolutional layer and the size and number of convolutions it
performs. For the fully connected layer it gives the number of inputs and outputs. Let's create the convolutional neural network
of the above structure:

// connection table to specify which feature maps of the first convolution layer
// to use for feature maps produced by the second layer
vector<bool> connectionTable( {
    true,  true,  true,  false, false, false,
    false, true,  true,  true,  false, false,
    false, false, true,  true,  true,  false,
    false, false, false, true,  true,  true,
    true,  false, false, false, true,  true,
    true,  true,  false, false, false, true,
    true,  true,  true,  true,  false, false,
    false, true,  true,  true,  true,  false,
    false, false, true,  true,  true,  true,
    true,  false, false, true,  true,  true,
    true,  true,  false, false, true,  true,
    true,  true,  true,  false, false, true,
    true,  true,  false, true,  true,  false,
    false, true,  true,  false, true,  true,
    true,  false, true,  true,  false, true,
    true,  true,  true,  true,  true,  true
} );

// prepare a convolutional ANN
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );

net->AddLayer( make_shared<XConvolutionLayer>( 32, 32, 1, 5, 5, 6 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XAveragePooling>( 28, 28, 6, 2 ) );

net->AddLayer( make_shared<XConvolutionLayer>( 14, 14, 6, 5, 5, 16, connectionTable ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XAveragePooling>( 10, 10, 16, 2 ) );

net->AddLayer( make_shared<XConvolutionLayer>( 5, 5, 16, 5, 5, 120 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );

net->AddLayer( make_shared<XFullyConnectedLayer>( 120, 10 ) );
net->AddLayer( make_shared<XLogSoftMaxActivation>( ) );

Looking at the code above, it is quite clear how the neural network's configuration stated above is translated into code. Except
for one question: "What is the connection table we've got between the first and the second convolutional layers?" It was not
mentioned in the theory part, but it is pretty easy to grasp. As we can see from the network's structure and the code, the first
layer does 6 convolutions and so produces 6 feature maps, while the second layer does 16 convolutions. In some cases, it is
desirable to configure a layer's convolutions in such a way that they operate only on a subset of the input feature maps. As the
code above suggests, the first 6 convolutions of the second layer use different patterns of 3 feature maps produced by the first
layer. Then the next 9 convolutions use different patterns of 4 feature maps. Finally, the last convolution uses all 6 feature
maps of the first layer. This is done to reduce the number of parameters to train and also to make sure that different feature
maps of the second layer are not all based on the same input feature maps.
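
The saving is easy to quantify. The sketch below counts the kernel weights implied by a connection table, assuming one 5x5 kernel per enabled input/output connection; the function name is hypothetical and bias values are not counted. For the pattern described above there are 6·3 + 9·4 + 6 = 60 connections, so 60 · 25 = 1500 weights, against 96 · 25 = 2400 for a fully connected table.

```cpp
#include <cstddef>
#include <vector>

// Counts the kernel weights of a convolutional layer that uses a connection table.
// Every "true" entry is assumed to stand for one kernelWidth x kernelHeight kernel
// connecting one input feature map to one output feature map.
std::size_t CountKernelWeights( const std::vector<bool>& connectionTable,
                                std::size_t kernelWidth, std::size_t kernelHeight )
{
    std::size_t connections = 0;

    for ( bool connected : connectionTable )
    {
        if ( connected )
        {
            connections++;
        }
    }

    return connections * kernelWidth * kernelHeight;
}
```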

When the convolutional network is created, we can do the same as we did with fully connected network - create a training context,
specifying cost function and weights' optimizer, and then pass it all to a helper class, which runs training/validation loop and
completes it with testing.

// create training context with Adam optimizer and Negative Log Likelihood cost function
// (since we use Log-Softmax)
shared_ptr<XNetworkTraining> netTraining = make_shared<XNetworkTraining>( net,
    make_shared<XAdamOptimizer>( 0.002f ),
    make_shared<XNegativeLogLikelihoodCost>( ) );

// using the helper for training ANN to do classification
XClassificationTrainingHelper trainingHelper( netTraining, argc, argv );
trainingHelper.SetValidationSamples( validationImages, encodedValidationLabels, validationLabels );
trainingHelper.SetTestSamples( testImages, encodedTestLabels, testLabels );

// 20 epochs, 50 samples in batch
trainingHelper.RunTraining( 20, 50, trainImages, encodedTrainLabels, trainLabels );

Below is the sample output of the application, which shows training progress and the final result - classification accuracy on the test
data set. We've got 99.01% accuracy, which seems to be a good improvement over fully connected neural network from the
previous article, which demonstrated 96.55% accuracy.

MNIST handwritten digits classification example with Convolution ANN

Loaded 60000 training data samples

Loaded 10000 test data samples

Samples usage: training = 50000, validation = 10000, test = 10000

Learning rate: 0.0020, Epochs: 20, Batch Size: 50

Before training: accuracy = 5.00% (2500/50000), cost = 2.3175, 34.324s

Epoch 1 : [==================================================] 123.060s

Training accuracy = 97.07% (48536/50000), cost = 0.0878, 32.930s
Validation accuracy = 97.49% (9749/10000), cost = 0.0799, 6.825s
Epoch 2 : [==================================================] 145.140s
Training accuracy = 97.87% (48935/50000), cost = 0.0657, 36.821s
Validation accuracy = 97.94% (9794/10000), cost = 0.0669, 5.939s
Epoch 19 : [==================================================] 101.305s
Training accuracy = 99.75% (49877/50000), cost = 0.0077, 26.094s
Validation accuracy = 98.96% (9896/10000), cost = 0.0684, 6.345s
Epoch 20 : [==================================================] 104.519s
Training accuracy = 99.73% (49865/50000), cost = 0.0107, 28.545s
Validation accuracy = 99.02% (9902/10000), cost = 0.0718, 7.885s

Test accuracy = 99.01% (9901/10000), cost = 0.0542, 5.910s

Total time taken : 3187s (53.12min)

CIFAR10 images classification

The second example performs classification of color 32x32 images from the CIFAR-10 dataset. It contains 60000 images, of which
50000 are used for training and the other 10000 for testing. The images are divided between the following 10 classes: airplane,
automobile, bird, cat, deer, dog, frog, horse, ship and truck. A few examples of those can be seen below.

As the above picture suggests, the CIFAR-10 dataset is much more complex than the MNIST hand-written digits. First, the images
are in color. And second, they are much less obvious. Up to the point that if I was not told it is a dog, I would not say it
myself. As a result, the network's structure gets a bit bigger. Not that it becomes much deeper, but the number of performed
convolutions and trained weights grows. Below is the structure of the network:

Conv(32x32x3, 5x5x32, BorderMode::Same) -> ReLU -> MaxPool -> BatchNorm
Conv(16x16x32, 5x5x32, BorderMode::Same) -> ReLU -> MaxPool -> BatchNorm
Conv(8x8x32, 5x5x64, BorderMode::Same) -> ReLU -> MaxPool -> BatchNorm
FC(1024, 64) -> ReLU -> BatchNorm
FC(64, 10) -> SoftMax

Translating the above neural network's structure into code gives the result below. Note: since ReLU(MaxPool) produces the same
result as MaxPool(ReLU), we use the first, as it reduces ReLU computation by 75% (although the saving is negligible compared to
the rest of the network).

// prepare a convolutional ANN
shared_ptr<XNeuralNetwork> net = make_shared<XNeuralNetwork>( );

net->AddLayer( make_shared<XConvolutionLayer>( 32, 32, 3, 5, 5, 32, BorderMode::Same ) );
net->AddLayer( make_shared<XMaxPooling>( 32, 32, 32, 2 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 16, 16, 32 ) );

net->AddLayer( make_shared<XConvolutionLayer>( 16, 16, 32, 5, 5, 32, BorderMode::Same ) );
net->AddLayer( make_shared<XMaxPooling>( 16, 16, 32, 2 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 8, 8, 32 ) );

net->AddLayer( make_shared<XConvolutionLayer>( 8, 8, 32, 5, 5, 64, BorderMode::Same ) );
net->AddLayer( make_shared<XMaxPooling>( 8, 8, 64, 2 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 4, 4, 64 ) );

net->AddLayer( make_shared<XFullyConnectedLayer>( 4 * 4 * 64, 64 ) );
net->AddLayer( make_shared<XReLuActivation>( ) );
net->AddLayer( make_shared<XBatchNormalization>( 64, 1, 1 ) );

net->AddLayer( make_shared<XFullyConnectedLayer>( 64, 10 ) );
net->AddLayer( make_shared<XLogSoftMaxActivation>( ) );
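
The claim that applying ReLU after max pooling gives the same result as applying it before can be checked on a single pooling window. ReLU never changes the ordering of values, so the maximum of the clamped values equals the clamped maximum. The helper names below are hypothetical and only serve this check; with a 2x2 window the second variant calls ReLU once instead of four times, which is where the 75% saving comes from.

```cpp
#include <algorithm>
#include <vector>

float ReLU( float x )
{
    return std::max( 0.0f, x );
}

// Maximum of a pooling window; the window is assumed to be non-empty.
float MaxOf( const std::vector<float>& window )
{
    return *std::max_element( window.begin( ), window.end( ) );
}

// ReLU applied to every value first, max pooling second.
float ReLUThenMaxPool( std::vector<float> window )
{
    for ( float& value : window )
    {
        value = ReLU( value );
    }
    return MaxOf( window );
}

// Max pooling first, then a single ReLU on the pooled value.
float MaxPoolThenReLU( const std::vector<float>& window )
{
    return ReLU( MaxOf( window ) );
}
```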

The rest of the example application follows the same pattern as set by the other classification examples: a training context is
created with the required cost function and weights' optimizer and passed to a helper class to run the training loop. Below is an
example of its output:

CIFAR-10 dataset classification example with Convolutional ANN

Loaded 50000 training data samples

Loaded 10000 test data samples

Samples usage: training = 43750, validation = 6250, test = 10000

Learning rate: 0.0010, Epochs: 20, Batch Size: 50

Before training: accuracy = 9.91% (4336/43750), cost = 2.3293, 844.825s

Epoch 1 : [==================================================] 1725.516s

Training accuracy = 48.25% (21110/43750), cost = 1.9622, 543.087s
Validation accuracy = 47.46% (2966/6250), cost = 2.0036, 77.284s
Epoch 2 : [==================================================] 1742.268s
Training accuracy = 54.38% (23793/43750), cost = 1.3972, 568.358s
Validation accuracy = 52.93% (3308/6250), cost = 1.4675, 76.287s
Epoch 19 : [==================================================] 1642.750s
Training accuracy = 90.34% (39522/43750), cost = 0.2750, 599.431s
Validation accuracy = 69.07% (4317/6250), cost = 1.2472, 81.053s
Epoch 20 : [==================================================] 1708.940s
Training accuracy = 91.27% (39931/43750), cost = 0.2484, 578.551s
Validation accuracy = 69.15% (4322/6250), cost = 1.2735, 81.037s

Test accuracy = 68.34% (6834/10000), cost = 1.3218, 122.455s

Total time taken : 48304s (805.07min)

As mentioned above, the CIFAR-10 dataset is definitely more complex. If we managed to get up to 99% test accuracy on MNIST
dataset, here we don't get even close to it – about 91% accuracy on training set and 68-69% on test/validation. Plus, it took 13
hours to run the 20 epochs. Just using CPU is definitely not enough for convolutional networks.

Conclusion

In this article we've covered the new extensions to the ANNT library, which allow building convolutional neural networks. At this
point it allows building only relatively simple networks, where layers of the network follow each other sequentially. Building
more advanced popular architectures, which look more like a computational graph, is not yet supported. However, before
getting there, other features need to be implemented first. As the CIFAR-10 example demonstrates, once a neural network
gets bigger, it requires more computational power for training, and using just the CPU is not enough. These days GPU support is
a must when it comes to deep learning. And so, this feature will get higher priority than supporting complex networks.

As fully connected and convolutional neural networks are covered now, the following step will be to go through some common
architectures of recurrent networks, which is the topic for the next article. In the meantime, all the latest code can be found on
GitHub, which will get updates as the library evolves further.

Links

1. Kernel (image processing)
2. Image Convolution - Machine Learning Guru
3. Convolutional Neural Networks - Wikipedia
4. CS231n Convolutional Neural Networks for Visual Recognition
5. Convolutional Neural Networks from the ground up
6. Backpropagation In Convolutional Neural Networks
7. Vanishing gradient problem
8. ReLU activation function
9. LeNet-5 convolutional neural network
10. One Hot Encoding
11. Cross-entropy cost function
12. SoftMax activation function
13. Difference between SoftMax and Sigmoid functions
14. MNIST database of handwritten digits
15. CIFAR-10 dataset

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)


About the Author

Andrew Kirillov
Software Developer IBM
United Kingdom

Started software development at about 15 years old and it seems like now it lasts most part of my life. Fortunately did not spend
too much time with Z80 and BK0010 and switched to 8086 and further. Similar with programming languages – luckily managed
to get away from BASIC and Pascal to things like Assembler, C, C++ and then C#. Apart from daily programming for food, do it
also for hobby, where mostly enjoy areas like Computer Vision, Robotics and AI. This led to some open source stuff like
AForge.NET and not so open Computer Vision Sandbox.

Going out of computers I am just a man loving his family, enjoying travelling, a bit of books, a bit of movies and a mixture of
everything else. Always wanted to learn playing guitar, but it seems like 6 strings are much harder than few dozens of keyboard’s
keys. Will keep progressing ...

Last Updated 28 Oct 2018 | Article Copyright 2018 by Andrew Kirillov