ISSN: 2455-5703
Abstract
Remote Sensing (RS) image classification is one of the key research areas in image processing. The most important part of this
classification is the efficient extraction of features from the RS image, which is itself a complex process. Early work extracted
mainly spectral features. The spatial domain of an RS image, however, carries more information than the spectral features alone,
so spectral features dominated classification for only a few years. Much research was then conducted to further improve
classification accuracy, which led to the extraction of features using different neural networks and was shown to increase
accuracy. This paper surveys and discusses the work carried out by researchers over different periods to extract features using
neural networks. The survey also provides a brief outlook on future research and improvements.
Keywords- Remote Sensing, Feature Extraction, Neural Networks, Spatial Feature, Spectral Feature
I. INTRODUCTION
In recent years, the classification of remote sensing images has become a very attractive field for researchers. An RS image
carries a great deal of information in every single pixel, so RS images are used for land-use and land-cover mapping. Land cover
is the physical cover of the earth, which consists of forest, water, bare land, saline land, mountain ranges, etc. Land use is
land cover that has been converted into a built or managed environment, such as residential buildings, commercial buildings,
transport and agricultural land. To better understand land-cover/land-use mapping, consider the remotely sensed image of a
geographical location: the land cover and land use in it have to be identified and classified.
Land-cover and land-use information are required for many different kinds of spatial planning, from urban planning at a
local level up to regional development. They play an important role in agricultural policy making. For proper management of
natural resources, the land-cover data is important. They are increasingly needed for the assessment of impacts of economic
development on the environment. Hence, they are fundamental for guiding decision making at various geographical levels. The
Earth’s surface is changing at local, regional, national and global scales.
Land management and land planning need the current status of the landscape. Understanding the current land cover and its uses,
and monitoring changes over time, is essential for land management. The reasons for changes in land condition can also be found
more easily through land cover mapping. With these applications in mind, the classification of RS images has to be done
efficiently. Neural networks are employed to obtain features from the RS images, which in turn are used to classify each pixel
of the image.
The remaining chapters are organised as follows: Chapter II discusses some of the neural networks used for extracting features
from RS images. Chapter III discusses the neural networks that are trained layer-wise. Chapter IV provides the conclusion of the
study.
1) Training
The training process of a DBN can be divided into two stages: the pre-training stage and the fine-tuning stage. In the
pre-training stage, unsupervised training is carried out in the bottom-up direction for feature extraction, while in the
fine-tuning stage, a supervised up-down algorithm is used. The improved performance of DBNs can be largely attributed to the
pre-training stage, in which the initial weights of the network are learned from the structure of the input data. Compared with
randomly initialized weights, these weights tend to lie closer to a good optimum and can therefore give better performance.
1) Training
The hidden units act as latent variables (features) that allow the Boltzmann machine to model distributions over visible state
vectors that cannot be modelled by direct pairwise interactions between the visible units. The learning rule of the DBM remains
unchanged even with hidden units, so it is possible to learn binary features that capture higher-order structure in the data.
C. Stacked Autoencoders
A Stacked Autoencoder (SAE) [6] stacks autoencoders as hidden layers, which are trained by an unsupervised layer-wise learning
algorithm and then fine-tuned by a supervised method. An SAE is built in three steps. First, the first autoencoder is trained on
the input data to obtain the learned feature vector. Second, the feature vector of each layer is used as the input to the next
layer, and this procedure is repeated until all layers are trained. Third, after all the hidden layers are trained, the
backpropagation (BP) algorithm is used to minimize the cost function and update the weights with a labelled training set to
achieve fine-tuning.
1) Training
Stacked autoencoders use greedy layer-wise training to obtain their parameters. First, train the first layer on the raw input to
obtain its weight and bias parameters W(1,1), W(1,2), b(1,1), b(1,2). The first layer transforms the raw input into a vector A
consisting of the activations of its hidden units. Train the second layer on this vector to obtain its weight and bias
parameters W(2,1), W(2,2), b(2,1), b(2,2). The same procedure is repeated for the remaining layers, using the output of each
layer as the input to the next. In this way, the parameters of each layer are trained individually, after which fine-tuning is
done using backpropagation.
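The greedy layer-wise procedure described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in [6]: the sigmoid hidden units, linear decoder, learning rate, toy one-hot data and all function names are hypothetical choices, W(k,1), b(k,1) are read here as the encoder parameters of layer k and W(k,2), b(k,2) as its decoder parameters, and the supervised fine-tuning step is omitted.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def encode(x, W, b):
    """Hidden activations A of one layer for one input vector."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def train_autoencoder(data, n_hidden, epochs, lr, rng):
    """Train one autoencoder by gradient descent on the squared
    reconstruction error. Returns (encoder, decoder) parameters."""
    n_in = len(data[0])
    W_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b_enc = [0.0] * n_hidden
    W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    b_dec = [0.0] * n_in
    for _ in range(epochs):
        for x in data:
            h = encode(x, W_enc, b_enc)
            x_rec = [sum(w * hj for w, hj in zip(row, h)) + bi
                     for row, bi in zip(W_dec, b_dec)]
            # Error signals, computed before any weight is changed.
            d_out = [r - xi for r, xi in zip(x_rec, x)]
            d_hid = [sum(W_dec[i][j] * d_out[i] for i in range(n_in))
                     * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    W_dec[i][j] -= lr * d_out[i] * h[j]
                b_dec[i] -= lr * d_out[i]
            for j in range(n_hidden):
                for k in range(n_in):
                    W_enc[j][k] -= lr * d_hid[j] * x[k]
                b_enc[j] -= lr * d_hid[j]
    return (W_enc, b_enc), (W_dec, b_dec)

def greedy_pretrain(data, layer_sizes, epochs=100, lr=0.1, seed=0):
    """Train each layer on the activations produced by the layer below."""
    rng = random.Random(seed)
    encoders = []
    layer_input = data
    for n_hidden in layer_sizes:
        (W, b), _decoder = train_autoencoder(layer_input, n_hidden, epochs, lr, rng)
        encoders.append((W, b))
        # The output of this layer becomes the input of the next one.
        layer_input = [encode(x, W, b) for x in layer_input]
    return encoders, layer_input

# Toy data (an assumption): four one-hot vectors of length 4.
data = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
encoders, features = greedy_pretrain(data, layer_sizes=[3, 2])
```

Only the encoder parameters are kept for the stack; the decoders serve to train each layer and would be discarded before fine-tuning.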
A Stacked Denoising Autoencoder (SDA) [8] is a stack of denoising autoencoders. It is an autoencoder with multiple layers, but
its training differs from that of a multilayer NN: unsupervised pre-training is done layer by layer as the input is fed through.
The input may contain noise. The input is passed through the hidden layer, an output is generated, and the loss between the
output and the original input is calculated. The process continues until the loss is minimized. Finally, the full data set is
passed through the network and the activations of the hidden layer are collected. These activations form the new input. Noise is
added to this new input and the same procedure is followed for the next layer. After the last layer has been processed, the
activations collected from the last hidden layer form the new representation of the data.
1) Training
The network is trained to reconstruct the input from a corrupted version of it. Once the first k layers are trained, the
(k+1)-th layer can be trained, because the code (latent representation) can now be computed from the layer below. After
pre-training, in which each layer performs feature selection and extraction on the input from the preceding layer, a second
stage of fine-tuning can follow: the entire network is trained like a multilayer perceptron, considering only the encoding part
of each autoencoder. This stage is supervised.
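The denoising criterion can be illustrated with a short Python sketch. The source does not state which kind of noise is used; masking noise (setting randomly chosen components to zero) is assumed here, and `corrupt`, `denoising_loss` and the identity stand-in for a trained autoencoder are hypothetical names introduced only for illustration. The key point is that the loss compares the reconstruction with the clean input, not with the corrupted one.

```python
import random

def corrupt(x, corruption_rate, rng):
    """Masking noise (an assumed noise type): each component is set to 0
    with probability corruption_rate, otherwise left unchanged."""
    return [0.0 if rng.random() < corruption_rate else xi for xi in x]

def denoising_loss(reconstruct, x, corruption_rate, rng):
    """Mean squared error between the reconstruction of the CORRUPTED
    input and the CLEAN input x."""
    x_noisy = corrupt(x, corruption_rate, rng)
    x_rec = reconstruct(x_noisy)
    return sum((r - xi) ** 2 for r, xi in zip(x_rec, x)) / len(x)

rng = random.Random(0)
x = [0.2, 0.9, 0.4, 0.7]

def identity(v):
    # Hypothetical stand-in for a trained autoencoder's forward pass.
    return v

loss = denoising_loss(identity, x, corruption_rate=0.5, rng=rng)
```

With the identity stand-in, the loss is zero only when no component is masked, which shows why merely copying the input does not solve the denoising task.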
1) Training
A Gibbs sampler is used to train an RBM. To generate data from an RBM, start from a random state in either layer and perform
Gibbs sampling: once the states of the units in one layer are given, all the units in the other layer are updated. This update
process continues until the equilibrium distribution is reached. The weights of the RBM are then obtained by maximizing its
likelihood.
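The alternating update described above can be sketched as block Gibbs sampling in plain Python. This is a minimal sketch assuming binary units with logistic activation probabilities; the layer sizes, weight initialization, number of steps and all function names are illustrative assumptions, and the likelihood-based weight update is not shown.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, W, b_hid, rng):
    """Update all hidden units at once, given the visible states."""
    h = []
    for j in range(len(b_hid)):
        activation = b_hid[j] + sum(v[i] * W[i][j] for i in range(len(v)))
        h.append(1 if rng.random() < sigmoid(activation) else 0)
    return h

def sample_visible(h, W, b_vis, rng):
    """Update all visible units at once, given the hidden states."""
    v = []
    for i in range(len(b_vis)):
        activation = b_vis[i] + sum(W[i][j] * h[j] for j in range(len(h)))
        v.append(1 if rng.random() < sigmoid(activation) else 0)
    return v

def gibbs_chain(v0, W, b_vis, b_hid, steps, rng):
    """Alternate between the two layers for a fixed number of steps.
    A fixed step count only approximates reaching equilibrium."""
    v = list(v0)
    h = sample_hidden(v, W, b_hid, rng)
    for _ in range(steps):
        v = sample_visible(h, W, b_vis, rng)
        h = sample_hidden(v, W, b_hid, rng)
    return v, h

rng = random.Random(0)
n_vis, n_hid = 6, 3  # assumed toy sizes
W = [[rng.gauss(0.0, 0.1) for _ in range(n_hid)] for _ in range(n_vis)]
b_vis = [0.0] * n_vis
b_hid = [0.0] * n_hid
v0 = [rng.randint(0, 1) for _ in range(n_vis)]  # random starting state
v, h = gibbs_chain(v0, W, b_vis, b_hid, steps=100, rng=rng)
```

Because an RBM has no connections within a layer, all units of one layer can be sampled together once the other layer is fixed, which is what the two `sample_*` functions do.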
B. Autoencoder
An Autoencoder (AE) is a simple three-layer neural network trained by unsupervised learning, in which the output units are
directly connected back to the input units [1]. The number of hidden units is much smaller than the number of visible units. The
AE applies backpropagation with the target values set equal to the inputs, so it is trained to copy its input to its output, and
the hidden layer is used to represent the input. An AE is a one-hidden-layer feed-forward neural network similar to the MLP. The
difference is that the aim of the AE is to reconstruct the input, while the purpose of the MLP is to predict target values from
given inputs. The numbers of nodes in the input layer and the output layer are identical. In the coding process, the AE first
converts the input vector x into a hidden representation h using a weight matrix ω; then, in the decoding process, the AE maps
h back to the input space to obtain the reconstruction x̃ with another weight matrix ω′. Theoretically, ω′ should be the
transpose of ω. Parameter optimization is used to minimize the average reconstruction error, which is measured by the mean
squared error (MSE).
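The coding and decoding steps and the reconstruction error described above can be written compactly as follows. The element-wise activation functions f and g, the omission of bias terms, and the symbol N for the number of training vectors are assumptions made here for illustration; they are not specified in the text.

```latex
\begin{align}
h &= f(\omega x), \\
\tilde{x} &= g(\omega' h), \qquad \omega' = \omega^{\top}, \\
\mathrm{MSE} &= \frac{1}{N} \sum_{n=1}^{N}
  \bigl\lVert x^{(n)} - \tilde{x}^{(n)} \bigr\rVert^{2}.
\end{align}
```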
1) Training
The training process for an AE can also be divided into two stages: the first stage learns features using unsupervised learning
and the second fine-tunes the network using supervised learning. Specifically, in the first stage, feed-forward propagation is
performed for each input to obtain the output value x̃. The squared error is then used to measure the deviation of x̃ from the
input value. Finally, the error is backpropagated through the network to update the weights. In the fine-tuning stage, with the
network having suitable features at each layer, standard supervised learning is adopted and the gradient descent algorithm is
used to adjust the parameters of each layer.
There are three main types of layers used to build CNN architectures: (1) the convolutional layer, (2) the pooling layer, and (3)
the fully-connected layer. The fully-connected layer is the same as in regular neural networks. The convolutional layer performs
many convolutions over its input. The pooling layer can be thought of as downsampling, for example by taking the maximum of each
2 x 2 block of the previous layer.
reduced to very small but it will not lower the performance. To extract more features, the same block is connected to additional
neurons. The depth of a layer is the number of different neurons connected to the same area.
The stride is the shifting distance of the window. For example, if we use stride 1 and window size 3 x 3 on a 7 x 7 x 3 image
without zero-padding, there are 5 x 5 x depth neurons in the next layer. If we change the stride from 1 to 2 and keep everything
else the same, there are 3 x 3 x depth neurons in the next layer. In general, if we use stride s and window size w x w on a
W x H image, there are [(W-w)/s+1] x [(H-w)/s+1] x depth neurons in the next layer.
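The output-size formula can be checked with a few lines of Python. The function name and the optional `pad` argument are additions made here for illustration: with zero-padding of p pixels on each side, the standard formula becomes (W - w + 2p)/s + 1, which reduces to the formula in the text when p = 0.

```python
def conv_output_size(width, height, window, stride, pad=0):
    """Spatial size of the next layer for a square window.

    `pad` (zero-padding per side) is an extension of the formula in the
    text, which covers the unpadded case pad=0.
    """
    out_w, rem_w = divmod(width - window + 2 * pad, stride)
    out_h, rem_h = divmod(height - window + 2 * pad, stride)
    if rem_w or rem_h:
        raise ValueError("window positions do not fit the input evenly")
    return out_w + 1, out_h + 1

# The two cases from the text: 7 x 7 image, 3 x 3 window, no padding.
size_stride_1 = conv_output_size(7, 7, window=3, stride=1)
size_stride_2 = conv_output_size(7, 7, window=3, stride=2)
```

The function raises an error when the stride does not divide the available range evenly, since the formula then does not yield an integer number of neurons.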
E. Parameters Sharing
For example, suppose there are 32 x 32 x 5 neurons in the next layer, obtained with stride 1, window size 5 x 5 and zero-padding
on an input with 3 channels, so that the depth is 5. Each neuron has 5 x 5 x 3 = 75 parameters (or weights), so without sharing
there are 75 x 32 x 32 x 5 = 384000 parameters in the next layer. The idea is to share the parameters within each depth slice:
the 32 x 32 neurons in each depth slice use the same parameters. There are then only 5 x 5 x 3 = 75 parameters per depth slice
and 75 x 5 = 375 parameters in total, which greatly decreases the number of parameters. With sharing, computing the neurons of
each depth slice of the next layer amounts to applying a convolution to the image.
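The parameter counts in this example can be reproduced with the short Python sketch below. The function names are hypothetical, and bias terms are left out, as they are in the text.

```python
def params_without_sharing(out_w, out_h, depth, window, in_channels):
    """Every neuron in the next layer owns its own window of weights."""
    weights_per_neuron = window * window * in_channels
    return weights_per_neuron * out_w * out_h * depth

def params_with_sharing(depth, window, in_channels):
    """All neurons in one depth slice share a single window of weights."""
    weights_per_slice = window * window * in_channels
    return weights_per_slice * depth

# Values from the example: 32 x 32 x 5 output, 5 x 5 window, 3 input channels.
unshared = params_without_sharing(32, 32, depth=5, window=5, in_channels=3)
shared = params_with_sharing(depth=5, window=5, in_channels=3)
reduction_factor = unshared // shared  # equals the 32 * 32 neurons per slice
```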
F. Activation Function
In the traditional neuron model, the sigmoid function is often used as the activation function. Other choices are also
available, one of which is the Rectified Linear Unit (ReLU), defined by f(x) = max(0, x). Krizhevsky et al. [9] compared ReLUs
with a saturating activation function (tanh, a rescaled sigmoid) in CNNs. The ReLU model reached the same training error rate in
fewer iterations.
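The two activation functions can be compared with a small Python sketch. The derivative functions are added here as an illustration of one commonly cited reason for the faster training of ReLU networks, namely that the sigmoid gradient becomes very small for large inputs while the ReLU gradient does not; this explanation is an addition and is not taken from the text above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)

def sigmoid_grad(x):
    """Derivative of the sigmoid; it shrinks towards 0 for large |x|."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU; 1 for positive inputs, 0 for negative ones.
    The value at x = 0 is set to 0 here by convention."""
    return 1.0 if x > 0 else 0.0

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), relu(x), sigmoid_grad(x), relu_grad(x))
```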
G. Pooling Layer
Although locally connected networks and parameter sharing are used, there are still many parameters in the neural network. When
the dataset is relatively small, this can cause overfitting, so pooling layers are often inserted into the network. Pooling
progressively reduces the number of parameters and hence the computation time of the network. It operates on every depth slice
of the previous layer, which means that the depth of the next layer is the same as that of the previous layer. As in the
convolutional layer, the number of pixels by which the window moves (the stride) can be set. Pooling is illustrated in Figure 2.
Note that there are two types of pooling layers. If the window size is equal to the stride, it is traditional pooling. If the
window size is larger than the stride, it is overlapping pooling. In practice, window size 2 x 2 with stride 2 is used for
traditional pooling, and window size 3 x 3 with stride 2 for overlapping pooling. In addition to max pooling, other functions can
also be used: taking the average of the window as the value in the next layer is called average pooling, and taking the L2-norm
of the window is called L2-norm pooling.
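The pooling variants described above can be sketched in plain Python for a single depth slice. The function name, the `mode` strings and the toy input are assumptions for illustration; a real network would apply the same operation to every depth slice independently.

```python
import math

def pool2d(image, window, stride, mode="max"):
    """Pool one depth slice (a list of rows) with a square window."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for top in range(0, rows - window + 1, stride):
        pooled_row = []
        for left in range(0, cols - window + 1, stride):
            patch = [image[r][c]
                     for r in range(top, top + window)
                     for c in range(left, left + window)]
            if mode == "max":
                pooled_row.append(max(patch))
            elif mode == "average":
                pooled_row.append(sum(patch) / len(patch))
            elif mode == "l2":
                pooled_row.append(math.sqrt(sum(p * p for p in patch)))
            else:
                raise ValueError("unknown pooling mode: " + mode)
        pooled.append(pooled_row)
    return pooled

image = [[1, 3, 2, 4],
         [5, 6, 1, 2],
         [7, 2, 9, 0],
         [1, 8, 3, 4]]

# Traditional pooling: window size equals the stride.
max_pooled = pool2d(image, window=2, stride=2, mode="max")
avg_pooled = pool2d(image, window=2, stride=2, mode="average")

# Overlapping pooling: window size 3 is larger than stride 2.
image_5x5 = [[r * 5 + c for c in range(5)] for r in range(5)]
overlap_pooled = pool2d(image_5x5, window=3, stride=2, mode="max")
```

In the overlapping case, neighbouring windows share one row or column of pixels, so each output value summarizes a region that partly coincides with that of its neighbours.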
IV. CONCLUSION
This paper discussed different neural networks that have been used to extract features from remote sensing images. It also
discussed how the individual neurons are trained and how layer-wise training is carried out for the different neural networks.
Every network has its own advantages and disadvantages. From the efforts of various researchers, the CNN has been found to work
better for feature extraction. However, even the CNN has some practical disadvantages, and these issues have to be addressed in
future work.
REFERENCES
[1] G. E. Hinton and R. S. Zemel, “Autoencoders, minimum description length, and Helmholtz free energy,” Advances in Neural Information Processing Systems,
pp. 3, 1994.
[2] G. E. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, July 2006.
[3] G. E. Hinton, S. Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[4] A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” Proc. NIPS Workshop, Dec. 2009.
[5] R. Salakhutdinov and G. E. Hinton, “Deep Boltzmann machines,” AISTATS’2009, pp. 448–455, 2009.
[6] Y. Qi, Y. Wang, X. Zheng, and Z. Wu, “Robust feature learning by stacked autoencoder with maximum correntropy criterion,” IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 11, pp. 3371–3408, 2010.
[7] G. E. Hinton, “A practical guide to training restricted Boltzmann machines,” Technical Report UTML TR2010-003, Department of Computer Science,
University of Toronto, 2010.
[8] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol, “Stacked denoising autoencoders,” J. Machine Learning Res., vol. 11, pp. 3371–3408, 2010.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst.
Conf., 2012.