
Cluster-Based Feature Extraction and Data Fusion in the Wavelet Domain


Johannes R. Sveinsson, Magnus Orn Ulfarsson and Jon Atli Benediktsson

Department of Electrical and Computer Engineering, University of Iceland,

Hjardarhagi 2-6, Reykjavik, IS-107, Iceland

ABSTRACT

This paper concentrates on linear feature extraction methods for neural network classifiers. The considered feature extraction method is based on discrete wavelet transforms (DWTs) and a cluster-based procedure, i.e., cluster-based feature extraction of the wavelet coefficients of remote sensing and geographic data is considered. The cluster-based feature extraction is a preprocessing routine that computes feature-vectors to group the wavelet coefficients in an unsupervised way. These feature-vectors are then used as a mask or a filter for the selection of representative wavelet coefficients that are used to train the neural network classifiers. In experiments, the proposed feature extraction method performed well in neural network classification of multisource remote sensing and geographic data.

1. INTRODUCTION

The selection of variables is a key problem in pattern recognition and is termed feature selection or feature extraction [1]. However, few feature extraction algorithms are available for neural networks [2]. Feature extraction can, thus, be used to transform the input data and in some way find the best input representation for neural networks. For high-dimensional data, large neural networks (with many inputs and a large number of hidden neurons) are often used. The training time of a large neural network can be very long. Also, for high-dimensional data the curse of dimensionality or the Hughes phenomenon [1] may occur. Hence, it is necessary to reduce the input dimensionality for the neural network in order to obtain a smaller network which performs well both in terms of training and test classification accuracies. This leads to the importance of feature extraction for neural networks, that is, finding the best representation of the input data in a lower-dimensional space, where the representation does not lead to a significant decrease in overall classification accuracy as compared to the one obtained in the original feature space. In this paper, a linear feature extraction method for neural network classifiers, based on cluster-based feature extraction of the wavelet coefficients, is discussed and applied in classification of multisource remote sensing and geographic data. The method is an extension of a method proposed by Pittner and Kamarthi [3].

2. FEATURE EXTRACTION

Here we concentrate on linear feature extraction methods for neural networks and then leave the neural networks with the classification task.

2.1. Wavelets

The discrete wavelet transform (DWT) [4] provides a transformation of a signal from the time domain to the scale-frequency domain. The DWT is computed on several levels with different time/scale-frequency resolutions. As each level of the transformation is calculated, there is a decrease in temporal resolution and a corresponding increase in scale-frequency resolution. The full DWT for a time-domain signal x(t) in L^2 (finite energy) can be represented in terms of shifted versions of a scaling function \phi(t) and shifted and dilated versions of a so-called mother wavelet function \psi(t). The connection between the scaling and wavelet functions at different scales is given by the two-scale equations

  \phi(t) = \sum_{k \in Z} h(k) \phi(2t - k)   and   \psi(t) = \sum_{k \in Z} g(k) \phi(2t - k),     (1)

where h(k) and g(k) are the finite low-pass and high-pass impulse responses for the DWT, respectively. The representation of the DWT can be written as

  x(t) = \sum_{k} u_{j_0,k} \phi_{j_0,k}(t) + \sum_{j=1}^{j_0} \sum_{k} w_{j,k} \psi_{j,k}(t),     (2)
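The two-scale relations above define the filter pair (h, g) that drives the DWT; computationally, the transform is carried out as a cascade of filtering and down-sampling steps. A minimal sketch of that cascade, assuming periodic extension at the boundary and the orthonormal Haar filter pair (h = [1/sqrt(2), 1/sqrt(2)], g = [1/sqrt(2), -1/sqrt(2)]) purely for illustration (the experiments in this paper use a six-tap Daubechies filter [4]), is:

```python
import math

def dwt_full(x, h, g):
    """Full DWT of a length-2^n signal via a tree of filter banks:
    each stage low-pass/high-pass filters the current approximation
    and down-samples by 2 (periodic extension at the boundary)."""
    approx = list(x)
    details = []                      # details[j-1] holds w_{j,k}
    while len(approx) > 1:
        n = len(approx)
        lo = [sum(h[i] * approx[(2 * k + i) % n] for i in range(len(h)))
              for k in range(n // 2)]
        hi = [sum(g[i] * approx[(2 * k + i) % n] for i in range(len(g)))
              for k in range(n // 2)]
        details.append(hi)            # wavelet coefficients at this level
        approx = lo                   # propagate the low-pass branch
    return approx, details            # u_{j0,k} and [w_{1,k}, ..., w_{j0,k}]

s = 1 / math.sqrt(2)                  # Haar filters (illustrative choice)
u, w = dwt_full([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0], [s, s], [s, -s])
```

For an orthonormal filter pair the transform preserves signal energy, so the squared coefficients sum to the squared samples; this is a convenient sanity check on any implementation.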
This work was supported in part by the Research Fund of the
University of Iceland and the Icelandic Research Council.

IEEE 867

0-7803-7033-3/01/$10.00 (C) 2001 IEEE

where w_{j,k} are the wavelet coefficients and u_{j,k}, j <= j_0, are the scaling coefficients. These coefficients are given by the inner products in L^2, i.e.,

  w_{j,k} = < x(t), \psi_{j,k}(t) >   and   u_{j,k} = < x(t), \phi_{j,k}(t) >.     (3)

Here \phi_{j,k}(t) = 2^{-j/2} \phi(2^{-j} t - k) is a family of scaling functions and \psi_{j,k}(t) = 2^{-j/2} \psi(2^{-j} t - k) a family of wavelet functions, and with the right choice of these mother functions the family forms an orthogonal basis for the signal space. The wavelet coefficients, w_{j,k}, are then a measure of the signal content around time 2^j k and scale-frequency 2^{-j} f_0, and the scaling coefficients, u_{j,k}, represent the local mean around the time 2^j k. The DWT can be implemented by a tree of filter banks. Each stage of the tree structure then consists of a low-pass filter, h, and a high-pass filter, g, each followed by down-sampling by 2. Every time down-sampling is performed, the signal length is reduced by 2. The scale propagation is obtained as the output from the low-pass branch goes through the same process of filtering and down-sampling. Thus the DWT has a natural interpretation in terms of a tree structure (filter banks) in the time-scale/frequency domain. The wavelet coefficients w_{j,k} are the outputs of the high-pass branches of the tree structure, and the scaling coefficients, u_{j_0,k}, are the output of the low-pass branch of the last stage of the DWT.

2.2. Cluster-Based Feature Extraction

In this section, a preprocessing routine is presented that computes feature-vectors to group the wavelet coefficients that are going to be used to train neural network classifiers. This method is an extension of a feature extraction method proposed in [3].

Assume that we have l representative signals, x_i, of length L = 2^n. The DWT is computed for all the l representative signals, x_i. The wavelet coefficients are represented by w_{j,k} and u_{j_0,k}, where j = 1, 2, ..., j_0. Next, the DWT coefficients are arranged into a matrix B = [b_{j,k}] in the following form:

  b_{j,k} = | w_{j, \lceil k / 2^{j-1} \rceil} |   for j = 1, ..., j_0 and 1 <= k <= L/2,     (4)

  b_{j_0+1,k} = | u_{j_0, \lceil k / 2^{j_0-1} \rceil} |   for 1 <= k <= L/2,     (5)

i.e., each coefficient of level j is repeated 2^{j-1} times so that all rows of B have the same length L/2 and the columns are aligned in time. Let \mu(A) and \sigma(A) represent the sample mean and the standard deviation of the elements of a matrix A, and let R be the operator that, when applied to any matrix A, reduces the matrix by its last row. Next, calculate the matrix G as

  G = [g_{j,k}] = ( R(\sum_{m=1}^{l} B_m) - \mu(R(\sum_{m=1}^{l} B_m)) ) / \sigma(R(\sum_{m=1}^{l} B_m)).     (6)

Then the binary matrix G_b is obtained as

  G_b = [ \Theta(g_{j,k} - T) ],     (7)

where \Theta is the step function, \Theta(x) = 1 for x >= 0 and \Theta(x) = 0 for x < 0, and T is some threshold to be chosen. The 1s in the matrix G_b occur where the largest wavelet coefficients occur in the matrices B_m, and we can then use the row vectors of G_b as a mask or a filter for the selection of representative wavelet coefficients for the data to be classified. This can be done by only selecting the wavelet coefficients that correspond to the 1s in the row vectors of G_b as the inputs for the neural networks. The choice of the threshold T is used for the reduction of the 1s in G_b and hence the reduction of the wavelet coefficients which are used as inputs for the neural networks, i.e., feature extraction.

3. EXAMPLE AND DISCUSSION

The data set used in the experiments, the Anderson River data set, is a multisource remote sensing and geographic data set made available by the Canada Centre for Remote Sensing (CCRS) [2,5]. Six data sources were used: Airborne Multispectral Scanner System (11 spectral data channels), Steep Mode Synthetic Aperture Radar (SAR) (4 data channels), Shallow Mode SAR (4 data channels), Elevation data (1 data channel), Slope data (1 data channel), and Aspect data (1 data channel). Six information classes were used, as listed in Table 1.

Table 1: Training and Test Samples for Information Classes in the Experiment on the Anderson River Data.

  Class #   Information Class                        Training Size   Test Size
  1         Douglas Fir (31-40m)                     971             1250
  2         Douglas Fir (21-30m)                     551             817
  3         Douglas Fir + Other Species (31-40m)     548             701
  4         Douglas Fir + Lodgepole Pine             542             705
  5         Hemlock + Cedar (31-40m)                 517             405
  6         Forest Clearings                         1260            --

Here, training samples were selected uniformly, giving 10% of the total sample size. Test samples were then selected randomly from the rest of the labeled data.

The number of features for the data was 22. A conjugate gradient perceptron (CGP) neural network with one hidden layer was trained on the original data with different numbers of features. The classification results are listed in Table 2. Since the original data had 22 features, it was necessary, in order to apply the wavelet-based feature extraction method, to add zeros to the data so that the number of features became 32. The training set is transformed with the DWT down to level j_0. One masking vector is found for the training set and used to find the wavelet


coefficients as inputs to the neural networks. Then feature extraction is done by changing the threshold T and thus obtaining row vectors with different numbers of 1s. These row vectors are then used as masks to find the wavelet coefficients. The masking vectors for different numbers of features are also used for the test set. For the DWT, a six-tap Daubechies wavelet filter was used [4]. A conjugate gradient perceptron (CGP) neural network with one hidden layer was trained on the DWT-transformed input data with different numbers of features. The classification results for the DWT are listed in Table 3.

Table 2: Classification Accuracies for the Original Data.

Table 3: Classification Accuracies for DWT.

When the results in Table 2 are compared to the results in Table 3, it can be seen that higher overall accuracies were obtained when 32 DWT features were used instead of the 22 original features. The differences are significant, i.e., more than 6 per cent for training data and more than 3 per cent for test data. These results are very interesting since they show the discrimination capability of the proposed approach. The wavelet transformation by itself is a linear transformation and should not improve classification accuracies. However, the cluster-based approach introduces a technique which improves the accuracies. From Table 3 it can be seen that the classification accuracies decreased when only 10 features were used instead of the full 32 features. The training accuracies decrease by almost 10 per cent and the test accuracies by nearly 6 per cent. On the other hand, the accuracies for 10 features are far higher than the results obtained when 10 features are used for the original data in Table 2. The proposed approach gives more than a 10 per cent improvement in overall training accuracy and nearly 9 per cent improvement in overall test accuracy when compared to the 10-feature results in Table 2. Similar behavior is seen when 5 and 8 features are used. The accuracies for the proposed approach are significantly higher than the accuracies obtained when the original data are used.

4. CONCLUSIONS

A feature extraction method for neural network classifiers is proposed and applied in classification of multisource remote sensing and geographic data. The method is based on cluster-based feature extraction in the wavelet domain. The cluster-based method is an unsupervised preprocessing routine that computes feature-vectors to group the wavelet coefficients. Then, these feature-vectors are used as a mask or a filter for the selection of representative wavelet coefficients, i.e., representative features, that are used to train the neural network classifiers. For the application of this approach, the data had to be zero-padded. The method showed great promise in efficiently extracting important features for the multisource remote sensing and geographic data, but it should be even more appropriate for data which have a number of features that is a power of 2, i.e., so that the data need not be zero-padded. Also, the choice of the threshold T, and hence the reduction of the wavelet coefficients that are used as inputs for the neural networks after feature extraction, should be such that the parameters of the classifier and the feature extractor are jointly optimized. That is a topic of future research.

Acknowledgments

This work was supported in part by the Research Fund of the University of Iceland. The Anderson River SAR/MSS data set was acquired, preprocessed, and loaned by the Canada Centre for Remote Sensing, Department of Energy, Mines, and Resources, of the Government of Canada.

5. REFERENCES

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, NY, 1992.

[2] J.A. Benediktsson and J.R. Sveinsson, "Feature extraction for multisource data classification with artificial neural networks," Int. J. Remote Sensing, vol. 18, no. 4, pp. 727-740, 1997.

[3] S. Pittner and S.V. Kamarthi, "Feature extraction from wavelet coefficients for pattern recognition tasks," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, pp. 83-88, 1999.

[4] I. Daubechies, Ten Lectures on Wavelets, SIAM, USA, 1992.

[5] D.G. Goodenough, M. Goldberg, G. Plunkett, and J. Zelek, "The CCRS SAR/MSS Anderson River Data Set," IEEE Transactions on Geoscience and Remote Sensing, vol. GE-25, pp. 360-367, 1987.
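For concreteness, the cluster-based masking procedure of Section 2.2 can be sketched in Python. This is an illustrative reading of Eqs. (4)-(7), not the authors' implementation: in particular, repeating each level-j coefficient 2^{j-1} times to fill the rows of B is an assumption about the exact arrangement, and the threshold comparison follows the step function of Eq. (7).

```python
import math

def coeff_matrix(details, approx):
    """Arrange absolute DWT coefficients into the matrix B of Eqs. (4)-(5):
    row j holds |w_{j,k}|, each value repeated (assumed) so that every row
    has the length of the finest level; the last row holds |u_{j0,k}|."""
    width = len(details[0])                       # L/2 coefficients at level 1
    rows = []
    for level in details + [approx]:
        repeat = width // len(level)
        rows.append([abs(c) for c in level for _ in range(repeat)])
    return rows

def binary_mask(B_list, T):
    """Sum the B matrices of the l representative signals, drop the last row
    (the operator R), standardize the entries (Eq. 6), and threshold at T to
    obtain the 0/1 mask G_b of Eq. (7)."""
    nrows, ncols = len(B_list[0]), len(B_list[0][0])
    S = [[sum(B[r][c] for B in B_list) for c in range(ncols)]
         for r in range(nrows - 1)]               # R applied to the sum of B_m
    flat = [v for row in S for v in row]
    mu = sum(flat) / len(flat)
    sd = math.sqrt(sum((v - mu) ** 2 for v in flat) / len(flat))
    return [[1 if (v - mu) / sd >= T else 0 for v in row] for row in S]

# One signal of length 8: three detail levels plus the final scaling value.
B = coeff_matrix([[3, 0, 0, 0], [0, 1], [0.5]], [2.0])
mask = binary_mask([B, B], 1.0)                   # l = 2 identical signals
```

The 1s in `mask` mark the wavelet coefficients that are kept as neural-network inputs; raising T thins out the 1s and thus reduces the number of extracted features, as described in Section 3.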
