
Applied Soft Computing 13 (2013) 654–666


Meta-cognitive RBF Network and its Projection Based Learning algorithm for classification problems

G. Sateesh Babu, S. Suresh
School of Computer Engineering, Nanyang Technological University, Singapore

Article info

Article history:
Received 2 February 2012
Received in revised form 24 May 2012
Accepted 31 August 2012
Available online 23 September 2012

Keywords:
Meta-cognitive learning
Self-regulatory thresholds
Radial basis function network
Multi-category classification
Projection Based Learning

Abstract
A Meta-cognitive Radial Basis Function Network (McRBFN) and its Projection Based Learning (PBL) algorithm for classification problems in a sequential framework are proposed in this paper; the combination is referred to as PBL-McRBFN. McRBFN is inspired by human meta-cognitive learning principles and has two components, namely the cognitive component and the meta-cognitive component. The cognitive component is a single-hidden-layer radial basis function network with an evolving architecture. In the cognitive component, the PBL algorithm computes the optimal output weights with least computational effort by finding the analytical minima of the nonlinear energy function. The meta-cognitive component controls the learning process in the cognitive component by choosing the best learning strategy for the current sample and adapts the learning strategies through self-regulation. In addition, sample overlapping conditions are considered for proper initialization of new hidden neurons, thus minimizing misclassification. The interaction of the cognitive and meta-cognitive components addresses the what-to-learn, when-to-learn and how-to-learn principles of human learning efficiently. The performance of PBL-McRBFN is evaluated using a set of benchmark classification problems from the UCI machine learning repository and two practical problems, viz., acoustic emission signal classification and mammogram classification for cancer detection. The statistical performance evaluation on these problems shows the superior performance of the PBL-McRBFN classifier over results reported in the literature.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Neural networks are powerful tools that can approximate complex nonlinear input–output relationships efficiently. Hence, over the last few decades, neural networks have been extensively employed to solve real-world classification problems [1]. In a classification problem, the objective is to learn the decision surface that accurately maps an input feature space to an output space of class labels. Several learning algorithms for different neural network architectures have been used in various problems in science, business, industry and medicine, including handwritten character recognition [2], speech recognition [3], biomedical diagnosis [4], prediction of bankruptcy [5], text categorization [6] and information retrieval [7]. Among the various architectures reported in the literature, the Radial Basis Function (RBF) network has been gaining attention due to the localization property of its Gaussian function, and is widely used in classification problems. Significant contributions to RBF learning algorithms for classification problems can be broadly classified into two categories: (a) Batch learning algorithms: Gradient-descent-based learning was used to determine the network

∗ Corresponding author. Tel.: +65 6790 6185.
E-mail address: ssundaram@ntu.edu.sg (S. Suresh).
1568-4946/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.asoc.2012.08.047

parameters [8]. Here, the complete training data are presented multiple times until the training error is minimal. Alternatively, one can implement random input-parameter selection with a least-squares solution for the output weights [9,10]. In both cases, the number of Gaussian functions required to approximate the true function is determined heuristically. (b) Sequential learning algorithms: The number of Gaussian neurons required to approximate the input–output relationship is determined automatically [11–15]. Here, the training samples are presented one-by-one and discarded after learning. The Resource Allocation Network (RAN) [11] was the first sequential learning algorithm introduced in the literature. RAN evolves the network architecture required to approximate the true function using a novelty-based neuron growth criterion. The Minimal Resource Allocation Network (MRAN) [12] uses a similar approach, but incorporates an error-based neuron growing/pruning criterion. Hence, MRAN determines a more compact network architecture than the RAN algorithm. The Growing and Pruning Radial Basis Function Network [13] selects the growing/pruning criteria of the network based on the significance of a neuron. A sequential learning algorithm using recursive least squares was presented in [14], referred to as the On-line Sequential Extreme Learning Machine (OS-ELM). OS-ELM chooses input weights randomly with a fixed number of hidden neurons and analytically determines the output weights using minimum-norm least squares. In the case of sparse and imbalanced data sets, the random

selection of input weights with a fixed number of hidden neurons in OS-ELM affects the performance significantly, as shown in [16]. In the neuro-fuzzy framework, the Evolving Fuzzy Neural Network (EFuNN) [17] is a notable sequential learning algorithm. It has been shown in [15] that the aforementioned algorithms work better for function approximation problems than for classification problems. The Sequential Multi-Category Radial Basis Function network (SMC-RBF) [15] uses the within-class similarity measure, the misclassification rate and the prediction error in its neuron growing and parameter update criteria. It has been shown that updating the parameters of the nearest neuron in the same class as the current sample improves the performance more than updating the nearest neuron in any class.

Fig. 1. (a) Nelson and Narens' model of meta-cognition and (b) the McRBFN model.
The aforementioned neural network algorithms use all the samples in the training data set to gain knowledge about the information contained in the samples. In other words, they possess information-processing abilities of humans, including perception, learning, remembering, judging and problem-solving, and these abilities are cognitive in nature. However, recent studies on human learning have revealed that the learning process is most effective when learners adopt self-regulation in the learning process using meta-cognition [18,19]. Meta-cognition means cognition about cognition. In a meta-cognitive framework, human beings think about their cognitive processes, develop new strategies to improve their cognitive skills and evaluate the information contained in their memory. If a radial basis function network analyzes its cognitive process and chooses suitable learning strategies adaptively to improve that process, then it is referred to as a Meta-Cognitive Radial Basis Function Network (McRBFN). Such a McRBFN must be capable of deciding what-to-learn, when-to-learn and how-to-learn the decision function from the stream of training data by emulating human self-regulated learning.
The Self-adaptive Resource Allocation Network (SRAN) [20] and the Complex-valued Self-regulating Resource Allocation Network (CSRAN) [21] address the what-to-learn component of meta-cognition by selecting significant samples using the misclassification error and the hinge loss error. It has been shown that selecting appropriate samples for learning and removing repetitive samples helps in improving the generalization performance. Therefore, it is evident that emulating the three components of human learning with suitable learning strategies would improve the generalization ability of a neural network. The drawbacks of these algorithms are: (a) the samples for training are selected based on a simple error criterion, which is not sufficient to capture the significance of samples; (b) a new hidden neuron center is allocated independently, which may overlap with existing neuron centers, leading to misclassification; (c) knowledge gained from past samples is not used; and (d) they use a computationally intensive extended Kalman filter for parameter update. The Meta-cognitive Neural Network (McNN) [22] and the Meta-cognitive Neuro-Fuzzy Inference System (McFIS) [23] address the first two issues efficiently by using the three components of meta-cognition. However, McNN and McFIS use a computationally intensive parameter update and do not utilize the past knowledge stored in the network. Similar works using meta-cognition in the complex domain are reported in [24,25]. The recently proposed Projection Based Learning in a meta-cognitive radial basis function network [26] addresses the above issues in batch mode, except for proper utilization of the past knowledge stored in the network, and has been applied to solve biomedical problems in [27–29]. In this paper, we propose a meta-cognitive radial basis function network and its fast and efficient projection based sequential learning algorithm.
Several meta-cognition models are available in human physiology, and a brief survey of various meta-cognition models is reported in [30]. Among these, the model proposed by Nelson and Narens [31] is simple and clearly highlights the various actions in human meta-cognition, as shown in Fig. 1(a). The model is analogous to meta-cognition in human beings and has two components, the cognitive component and the meta-cognitive component. The information flow from the cognitive component to the meta-cognitive component is considered monitoring, while the information flow in the reverse direction is considered control. The information flowing from the meta-cognitive component to the cognitive component either changes the state of the cognitive component or changes the cognitive component itself. Monitoring informs the meta-cognitive component about the state of the cognitive component, thus continuously updating the meta-cognitive component's model of the cognitive component, including the case of no change in state.
McRBFN is developed based on the Nelson and Narens meta-cognition model [31], as shown in Fig. 1(b). Analogous to that model, McRBFN has two components, namely the cognitive component and the meta-cognitive component. The cognitive component is a single-hidden-layer radial basis function network with an evolving architecture. It learns from the training data by adding new hidden neurons and updating the output weights of hidden neurons to approximate the true function. The input weights of hidden neurons (center and width) are determined based on the training data, and the output weights are estimated using the projection based sequential learning algorithm. When a neuron is added to the cognitive component, the input/hidden layer parameters are fixed based on the input of the sample, and the output weights are estimated by minimizing an energy function given by the hinge loss error as in [32]. The problem of finding the optimal weights is first formulated as a linear programming problem using the principles of minimization and real calculus [33,34]. The Projection Based Learning (PBL) algorithm then converts the linear programming problem into a system of linear equations and provides a solution for the optimal weights, corresponding to the minimum energy point of the energy function. The meta-cognitive component of McRBFN contains a dynamic model of the cognitive component, knowledge measures and self-regulated thresholds. The meta-cognitive component controls the learning process of the cognitive component by choosing one of four strategies for each sample in the training data set. When a


sample is presented to McRBFN, the meta-cognitive component measures the knowledge contained in the current training sample with respect to the cognitive component using its knowledge measures. The predicted class label, maximum hinge error and class-wise significance are considered as knowledge measures of the meta-cognitive component. Class-wise significance is obtained from the spherical potential, which is widely used in kernel methods to determine whether all the data points are enclosed tightly by the Gaussian kernels [35]. Here, the squared distance between the current sample and the hyper-dimensional projection helps in measuring the novelty in the data. Since McRBFN addresses classification problems in this paper, we redefine the spherical potential in a class-wise framework and use it in devising the learning strategies. Using the above-mentioned measures, the meta-cognitive component constructs two sample-based learning strategies and two neuron-based learning strategies. One of these strategies is selected for the current training sample such that the cognitive component learns the true function accurately and achieves better generalization performance. These learning strategies are adapted by the meta-cognitive component using self-regulated thresholds. In addition, the meta-cognitive component identifies the overlapping/non-overlapping conditions by measuring the distance from the nearest neuron in the inter/intra-class. The McRBFN using PBL to obtain the network parameters is referred to as the Projection Based Learning algorithm for a Meta-cognitive Radial Basis Function Network (PBL-McRBFN).
The performance of the proposed PBL-McRBFN classifier is evaluated using a set of benchmark binary/multi-category classification problems from the University of California, Irvine (UCI) machine learning repository [36]. We consider five multi-category and five binary classification problems with varying values of imbalance factor. In all these problems, the performance of PBL-McRBFN is compared against the best performing classifiers available in the literature using class-wise performance measures such as overall/average efficiency and a non-parametric statistical significance test [37]. The non-parametric Friedman test, based on the mean ranking of each algorithm over multiple data sets, indicates the statistical significance of the proposed PBL-McRBFN classifier. Finally, the performance of the PBL-McRBFN classifier has also been evaluated using two practical classification problems, viz., acoustic emission signal classification [38] and mammogram classification for breast cancer detection [39]. The results clearly highlight that the PBL-McRBFN classifier provides better generalization performance than the results reported in the literature.

The outline of this paper is as follows: Section 2 describes the meta-cognitive radial basis function network for classification problems. Section 3 presents the performance evaluation of the PBL-McRBFN classifier on a set of benchmark and practical classification problems, and compares it with the best performing classifiers available in the literature. Section 4 summarizes the conclusions from this study.
2. Meta-cognitive radial basis function network for classification problems

In this section, we describe the meta-cognitive radial basis function network for solving classification problems. First, we define the classification problem. Next, we present the meta-cognitive radial basis function network architecture. Finally, we present the sequential learning algorithm and summarize it in pseudo-code form.
2.1. Problem definition

Given a stream of training data samples $\{(\mathbf{x}^1, c^1), \ldots, (\mathbf{x}^t, c^t), \ldots\}$, where $\mathbf{x}^t = [x_1^t, \ldots, x_m^t]^T \in \mathbb{R}^m$ is the $m$-dimensional input of the $t$-th sample and $c^t \in \{1, \ldots, n\}$ is its class label, with $n$ the total number of classes, the coded class labels $\mathbf{y}^t = [y_1^t, \ldots, y_j^t, \ldots, y_n^t]^T \in \mathbb{R}^n$ are given by:

$$y_j^t = \begin{cases} 1 & \text{if } c^t = j \\ -1 & \text{otherwise} \end{cases}, \qquad j = 1, \ldots, n \tag{1}$$

The objective of the McRBFN classifier is to approximate the underlying decision function that maps $\mathbf{x}^t \in \mathbb{R}^m \rightarrow \mathbf{y}^t \in \mathbb{R}^n$. McRBFN begins with zero hidden neurons and selects a suitable strategy for each sample to achieve this objective. In the next section, we describe the architecture of McRBFN and discuss each of these learning strategies in detail.
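As a concrete illustration, the coded class labels of Eq. (1) can be produced by a short routine; the helper name `encode_label` is ours, not the paper's:

```python
import numpy as np

def encode_label(c, n):
    """Coded class label of Eq. (1): +1 at the true class, -1 elsewhere.
    c is a 1-indexed class label in {1, ..., n}."""
    y = -np.ones(n)
    y[c - 1] = 1.0
    return y

# Example: class 2 out of n = 3 classes
y = encode_label(2, 3)  # array([-1., 1., -1.])
```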
2.2. McRBFN architecture

McRBFN has two components, namely the cognitive component and the meta-cognitive component, as shown in Fig. 2. The cognitive component is a single-hidden-layer radial basis function network with an evolving architecture, starting from zero hidden neurons. The meta-cognitive component contains a dynamic model of the cognitive component, knowledge measures and self-regulated thresholds. The meta-cognitive component controls the learning process of the cognitive component by choosing one of four strategies for each sample in the training data set. When a new training sample is presented to McRBFN, the meta-cognitive component estimates the knowledge present in the new training sample with respect to the cognitive component. Based on this information, the meta-cognitive component controls the learning process of the cognitive component by selecting a suitable strategy for the current training sample to address what-to-learn, when-to-learn and how-to-learn properly.

We present a detailed description of the cognitive and meta-cognitive components of McRBFN in the following sections.
2.2.1. Cognitive component of McRBFN

The cognitive component of McRBFN is a single-hidden-layer feedforward radial basis function network with linear input and output layers. The neurons in the hidden layer employ the Gaussian activation function. Without loss of generality, we assume that McRBFN has built $K$ Gaussian neurons from $t-1$ training samples. For a given input $\mathbf{x}^t$, the predicted output of the $j$-th output neuron $\hat{y}_j^t$ of McRBFN is

$$\hat{y}_j^t = \sum_{k=1}^{K} w_{kj} h_k^t, \qquad j = 1, \ldots, n \tag{2}$$

where $w_{kj}$ is the weight connecting the $k$-th hidden neuron to the $j$-th output neuron and $h_k^t$, the response of the $k$-th hidden neuron to the input $\mathbf{x}^t$, is given by

$$h_k^t = \exp\left(-\frac{\|\mathbf{x}^t - \boldsymbol{\mu}_k^l\|^2}{(\sigma_k^l)^2}\right) \tag{3}$$

where $\boldsymbol{\mu}_k^l \in \mathbb{R}^m$ is the center and $\sigma_k^l \in \mathbb{R}^+$ is the width of the $k$-th hidden neuron. Here, the superscript $l$ represents the corresponding class of the hidden neuron.
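A minimal sketch of the forward pass in Eqs. (2) and (3), assuming the centers, widths and output weights are stored as NumPy arrays (the function name `rbf_forward` is ours, not the paper's):

```python
import numpy as np

def rbf_forward(x, centers, widths, W):
    """Predicted outputs of the cognitive component.
    x: (m,) input; centers: (K, m); widths: (K,); W: (K, n) output weights."""
    # Gaussian response of each hidden neuron, Eq. (3)
    h = np.exp(-np.sum((x - centers) ** 2, axis=1) / widths ** 2)
    # Linear combination at the output layer, Eq. (2)
    return h @ W, h

# Two hidden neurons, two outputs: a sample sitting on the first center
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.array([1.0, 1.0])
W = np.eye(2)
y_hat, h = rbf_forward(np.array([0.0, 0.0]), centers, widths, W)
# h[0] = 1 (zero distance to its center); h[1] = exp(-2)
```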
The cognitive component uses the Projection Based Learning (PBL) algorithm for its learning process. The strategy proposed here is similar to that of the fast learning algorithm for single-layer neural networks in [33,34]. The PBL algorithm is described as follows.

Projection Based Learning algorithm: The Projection Based Learning algorithm works on the principle of minimization of an energy function and finds the optimal network output parameters

Fig. 2. Schematic diagram of McRBFN.

for which the energy function is minimum, i.e., the network achieves the minimum energy point of the energy function.

The considered energy function is the sum of the squared hinge loss errors at the McRBFN output neurons. The energy function for the $i$-th sample is defined as

$$J^i = \sum_{j=1}^{n} \left(e_j^i\right)^2, \qquad i = 1, \ldots, t \tag{4}$$

where $e_j^i$ is the hinge loss error defined as

$$e_j^i = \begin{cases} 0 & \text{if } y_j^i \hat{y}_j^i > 1 \\ y_j^i - \hat{y}_j^i & \text{otherwise} \end{cases}, \qquad j = 1, \ldots, n \tag{5}$$

When $y_j^i \hat{y}_j^i < 1$, the energy function for the $i$-th sample becomes

$$J^i = \sum_{j=1}^{n} \left(y_j^i - \hat{y}_j^i\right)^2 \tag{6}$$

For $t$ training samples, the overall energy function is defined as

$$J(\mathbf{W}) = \frac{1}{2}\sum_{i=1}^{t} J^i \tag{7}$$

The optimal output weights $\mathbf{W}^* \in \mathbb{R}^{K \times n}$ are estimated such that the total energy reaches its minimum:

$$\mathbf{W}^* := \arg\min_{\mathbf{W} \in \mathbb{R}^{K \times n}} J(\mathbf{W}) \tag{8}$$

The optimal $\mathbf{W}^*$ corresponding to the minimum energy point of the energy function $J(\mathbf{W}^*)$ is obtained by equating the first-order partial derivative of $J(\mathbf{W})$ with respect to the output weights to zero, i.e.,

$$\frac{\partial J(\mathbf{W})}{\partial w_{pj}} = 0, \qquad p = 1, \ldots, K; \; j = 1, \ldots, n$$

Substituting $\hat{y}_j^i = \sum_{k=1}^{K} w_{kj} h_k^i$, where $h_k^i$ is the response of the $k$-th hidden neuron for the $i$-th training sample, the energy function becomes

$$J(\mathbf{W}) = \frac{1}{2}\sum_{i=1}^{t}\sum_{j=1}^{n} \left(y_j^i - \sum_{k=1}^{K} w_{kj} h_k^i\right)^2 \tag{9}$$

Equating the first partial derivative to zero and re-arranging, we get

$$\sum_{k=1}^{K}\left(\sum_{i=1}^{t} h_k^i h_p^i\right) w_{kj} = \sum_{i=1}^{t} h_p^i y_j^i, \qquad p = 1, \ldots, K; \; j = 1, \ldots, n \tag{10}$$

Eq. (10) can be written as

$$\sum_{k=1}^{K} a_{kp} w_{kj} = b_{pj}, \qquad p = 1, \ldots, K; \; j = 1, \ldots, n \tag{11}$$

which can be represented in matrix form as

$$\mathbf{A}\mathbf{W} = \mathbf{B} \tag{12}$$

where the projection matrix $\mathbf{A} \in \mathbb{R}^{K \times K}$ is given by

$$a_{kp} = \sum_{i=1}^{t} h_k^i h_p^i, \qquad k = 1, \ldots, K; \; p = 1, \ldots, K \tag{13}$$

and the output matrix $\mathbf{B} \in \mathbb{R}^{K \times n}$ is

$$b_{pj} = \sum_{i=1}^{t} h_p^i y_j^i, \qquad p = 1, \ldots, K; \; j = 1, \ldots, n \tag{14}$$

Eq. (11) gives a set of $K \times n$ linear equations with $K \times n$ unknown output weights $\mathbf{W}$. We state the following propositions to find the closed-form solution for this set of linear equations.

Proposition 1. The responses of the hidden neurons are unique, i.e., $\forall \mathbf{x}^i$, when $k \neq p$, $h_k^i \neq h_p^i$; $k, p = 1, \ldots, K$, $i = 1, \ldots, t$.


Proof. Let us assume that, for a given $\mathbf{x}^i$, $h_p^i = h_k^i$ when $k \neq p$. This assumption is valid if and only if

$$\boldsymbol{\mu}_p^l = \boldsymbol{\mu}_k^l \quad \text{AND} \quad \sigma_p^l = \sigma_k^l \tag{15}$$

But the vectors $\boldsymbol{\mu}_k^l$ and $\boldsymbol{\mu}_p^l$ are allocated based on the selected significant training samples at neuron addition; these significant samples are selected using the neuron growth criterion in Eq. (33). The neuron growth criterion uses the maximum hinge error ($E^t$) and the class-wise significance ($\psi_c$). $\psi_c$ is defined such that a new neuron is added only when no existing neuron near the current sample produces a significant output for it. Thus, no two neuron centers are equal, and hence the responses of the $k$-th and $p$-th hidden neurons are not equal for all samples. □

Proposition 2. The response of each hidden neuron is non-zero for at least a few samples.

Proof. Let us assume that the response of the $k$-th hidden neuron is zero, i.e., $h_k^i = 0 \; \forall \mathbf{x}^i$. From Eq. (3), this is possible if and only if $\|\mathbf{x}^i - \boldsymbol{\mu}_k^l\| \rightarrow \infty$ or $\sigma_k^l \rightarrow 0$. The input variables $\mathbf{x}^i$ are normalized in a circle of radius 1 such that $|x_j| < 1$, $j = 1, \ldots, m$. As shown in the overlapping conditions of the growth strategy in Section 2.2.3, hidden neuron centers are allocated based on the selected significant training samples, and widths are determined based on inter/intra-class nearest-neuron distances, which are non-zero positive values. Hence, the response of each hidden neuron is non-zero for at least a few samples. □
We state the following theorem, using Propositions 1 and 2.

Theorem 1. The projection matrix $\mathbf{A}$ is a positive definite symmetric matrix, and hence it is invertible.

Proof. From the definition of the projection matrix $\mathbf{A}$ given in Eq. (13),

$$A_{pk} = \sum_{i=1}^{t} h_p^i h_k^i, \qquad p = 1, \ldots, K; \; k = 1, \ldots, K \tag{16}$$

it can be inferred that the diagonal elements of $\mathbf{A}$ are:

$$A_{kk} = \sum_{i=1}^{t} h_k^i h_k^i, \qquad k = 1, \ldots, K \tag{17}$$

From Proposition 2, the hidden neuron responses are non-zero. Therefore, Eq. (17) can be written as

$$A_{kk} = \sum_{i=1}^{t} |h_k^i|^2 > 0 \tag{18}$$

Hence the diagonal elements of the projection matrix are non-zero and positive, i.e., $A_{kk} \in \mathbb{R}^+$.

The off-diagonal elements of the projection matrix $\mathbf{A}$ are:

$$A_{kj} = \sum_{i=1}^{t} h_k^i h_j^i = \sum_{i=1}^{t} h_j^i h_k^i = A_{jk} \tag{19}$$

From Eqs. (17) and (19), it can be inferred that the projection matrix $\mathbf{A}$ is a symmetric matrix.

A symmetric matrix is positive definite iff $\mathbf{q}^T \mathbf{A} \mathbf{q} > 0$ for any $\mathbf{q} \neq 0$. Let us consider a unit basis vector $\mathbf{q}_1 \in \mathbb{R}^{K \times 1}$ such that $q_{11} = 1$ and $q_{12} = \cdots = q_{1K} = 0$, i.e., $\mathbf{q}_1 = [1 \; 0 \; \cdots \; 0]^T$. Therefore, $\mathbf{q}_1^T \mathbf{A} \mathbf{q}_1 = A_{11}$. In Eq. (17), it was shown that $A_{kk} > 0$ for all $k = 1, \ldots, K$. Therefore, $A_{11} > 0 \Rightarrow \mathbf{q}_1^T \mathbf{A} \mathbf{q}_1 > 0$. Similarly, for a unit basis vector $\mathbf{q}_k = [0 \; \cdots \; 1 \; \cdots \; 0]^T$, the product $\mathbf{q}_k^T \mathbf{A} \mathbf{q}_k$ is given by

$$\mathbf{q}_k^T \mathbf{A} \mathbf{q}_k = A_{kk} > 0, \qquad k = 1, \ldots, K \tag{20}$$

Let $\mathbf{p} \in \mathbb{R}^K$ be the linear transformed sum of the $K$ unit basis vectors, i.e., $\mathbf{p} = \mathbf{q}_1 t_1 + \cdots + \mathbf{q}_k t_k + \cdots + \mathbf{q}_K t_K$, where $t_k \in \mathbb{R}$ is the transformation constant. Then,

$$\mathbf{p}^T \mathbf{A} \mathbf{p} = \left(\sum_{k=1}^{K} \mathbf{q}_k t_k\right)^T \mathbf{A} \left(\sum_{k=1}^{K} \mathbf{q}_k t_k\right) = \sum_{k=1}^{K} |t_k|^2 A_{kk} \tag{21}$$

As shown in Eq. (17), $A_{kk} > 0$; also, $|t_k|^2 > 0$ is evident. Hence,

$$|t_k|^2 A_{kk} > 0, \; k = 1, \ldots, K \;\Rightarrow\; \sum_{k=1}^{K} |t_k|^2 A_{kk} > 0 \tag{22}$$

Thus, the projection matrix $\mathbf{A}$ is positive definite, and hence it is invertible. □
The solution $\mathbf{W}^*$ obtained from the set of equations given in Eq. (12) is a minimum if $\partial^2 J / \partial w_{pj}^2 > 0$. The second derivative of the energy function $J$ with respect to the output weights is given by

$$\frac{\partial^2 J(\mathbf{W})}{\partial w_{pj}^2} = \sum_{i=1}^{t} h_p^i h_p^i = \sum_{i=1}^{t} |h_p^i|^2 > 0 \tag{23}$$

As the second derivative of the energy function $J(\mathbf{W})$ is positive, the following observations can be made from Eq. (23):

1. The function $J$ is a convex function.
2. The output weight $\mathbf{W}^*$ obtained as a solution to the set of linear equations (Eq. (12)) is the weight corresponding to the minimum energy point of the energy function $J$.

Using Theorem 1, the solution of the system of equations in Eq. (12) can be determined as:

$$\mathbf{W}^* = \mathbf{A}^{-1} \mathbf{B} \tag{24}$$
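The closed-form solution of Eq. (24) amounts to assembling A and B from the hidden-layer responses and solving one linear system. A minimal batch-mode sketch with NumPy (the sequential algorithm in the paper updates A and B per sample; the function name here is ours):

```python
import numpy as np

def pbl_output_weights(H, Y):
    """Projection Based Learning output weights, Eqs. (12)-(14) and (24).
    H: (t, K) hidden responses h_k^i; Y: (t, n) coded class labels."""
    A = H.T @ H  # projection matrix, Eq. (13)
    B = H.T @ Y  # output matrix, Eq. (14)
    # Solving A W = B is cheaper and more stable than forming A^-1 explicitly
    return np.linalg.solve(A, B)

rng = np.random.default_rng(0)
H = rng.random((20, 4))   # 20 samples, K = 4 hidden neurons
Y = rng.random((20, 2))   # n = 2 outputs
W_star = pbl_output_weights(H, Y)
```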

2.2.2. Meta-cognitive component of McRBFN

The meta-cognitive component contains a dynamic model of the cognitive component, knowledge measures and self-regulated thresholds. During the learning process, the meta-cognitive component monitors the cognitive component and updates its dynamic model of it. When a new ($t$-th) training sample is presented to McRBFN, the meta-cognitive component estimates the knowledge present in the new training sample with respect to the cognitive component using its knowledge measures. The meta-cognitive component uses the predicted class label ($\hat{c}^t$), the maximum hinge error ($E^t$), the confidence of the classifier ($p(c^t|\mathbf{x}^t)$) and the class-wise significance ($\psi_c$) as the measures of knowledge in the new training sample. The self-regulated thresholds are adapted to capture the knowledge presented in the new training sample. Using the knowledge measures and self-regulated thresholds, the meta-cognitive component constructs two sample-based learning strategies and two neuron-based learning strategies. One of these strategies is selected for the new training sample such that the cognitive component learns it accurately and achieves better generalization performance.
The meta-cognitive component measures are defined as follows:

Predicted class label ($\hat{c}^t$): Using the predicted output $\hat{\mathbf{y}}^t$, the predicted class label is obtained as

$$\hat{c}^t = \arg\max_{j \in \{1, \ldots, n\}} \hat{y}_j^t \tag{25}$$

Maximum hinge error ($E^t$): The objective of the classifier is to minimize the error between the predicted output $\hat{\mathbf{y}}^t$ and the actual output $\mathbf{y}^t$. In classification problems, it has been shown in [32,40] that a classifier developed using the hinge loss error estimates the posterior probability more accurately than one developed using the mean square error. Hence, in McRBFN, we use the hinge loss error $\mathbf{e}^t = [e_1^t, \ldots, e_j^t, \ldots, e_n^t]^T \in \mathbb{R}^n$ defined as in Eq. (5). The maximum absolute hinge error ($E^t$) is given by

$$E^t = \max_{j \in \{1, \ldots, n\}} |e_j^t| \tag{26}$$

Confidence of the classifier ($p(c^t|\mathbf{x}^t)$): The confidence level of classification, or predicted posterior probability, is given as

$$\hat{p}(j|\mathbf{x}^t) = \frac{\min(1, \max(-1, \hat{y}_j^t)) + 1}{2}, \qquad j = \hat{c}^t \tag{27}$$
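The sample-level measures of Eqs. (25)–(27), together with the hinge loss of Eq. (5), can be sketched as follows (the helper name is ours, and labels are 1-indexed as in the paper):

```python
import numpy as np

def knowledge_measures(y_hat, y):
    """Meta-cognitive knowledge measures for one sample.
    y_hat: (n,) predicted outputs; y: (n,) coded targets in {-1, +1}."""
    c_hat = int(np.argmax(y_hat)) + 1                     # predicted label, Eq. (25)
    e = np.where(y * y_hat > 1, 0.0, y - y_hat)           # hinge loss, Eq. (5)
    E = float(np.max(np.abs(e)))                          # max hinge error, Eq. (26)
    p = (min(1.0, max(-1.0, y_hat[c_hat - 1])) + 1) / 2   # confidence, Eq. (27)
    return c_hat, E, p

c_hat, E, p = knowledge_measures(np.array([0.8, -0.9]), np.array([1.0, -1.0]))
# c_hat = 1, E ~= 0.2, p ~= 0.9
```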

Class-wise significance ($\psi_c$): In general, the input feature $\mathbf{x}^t$ is mapped onto a hyper-dimensional spherical feature space $S$ using the $K$ Gaussian neurons, i.e., $\mathbf{x}^t \rightarrow \mathbf{h}$. Therefore, all $\mathbf{h}(\mathbf{x}^t)$ lie on a hyper-dimensional sphere, as shown in [41]. The knowledge, or spherical potential, of any sample in the original space is expressed as the squared distance from the hyper-dimensional mapping to its center $\mathbf{h}_0$ [35].

In McRBFN, the centers ($\boldsymbol{\mu}$) and widths ($\sigma$) of the Gaussian neurons describe the feature space $S$. Let the center of the $K$-dimensional feature space be $\mathbf{h}_0 = \frac{1}{K}\sum_{k=1}^{K} \mathbf{h}(\boldsymbol{\mu}_k)$. The knowledge present in the new data $\mathbf{x}^t$ can be expressed as the potential of the data in the original space, which is the squared distance from the $K$-dimensional feature space to the center $\mathbf{h}_0$. The potential ($\psi$) is given as

$$\psi = \|\mathbf{h}(\mathbf{x}^t) - \mathbf{h}_0\|^2 \tag{28}$$

As shown in [35], the above equation can be expressed as

$$\psi = h(\mathbf{x}^t, \mathbf{x}^t) - \frac{2}{K}\sum_{k=1}^{K} h(\mathbf{x}^t, \boldsymbol{\mu}_k^l) + \frac{1}{K^2}\sum_{k,r=1}^{K} h(\boldsymbol{\mu}_k^l, \boldsymbol{\mu}_r^l) \tag{29}$$

From the above equation, we can see that, for a Gaussian function, the first term ($h(\mathbf{x}^t, \mathbf{x}^t)$) and the last term ($\frac{1}{K^2}\sum_{k,r=1}^{K} h(\boldsymbol{\mu}_k^l, \boldsymbol{\mu}_r^l)$) are constants. Since the potential is a measure of novelty, these constants may be discarded and the potential can be reduced to

$$\hat{\psi} = \frac{2}{K}\sum_{k=1}^{K} h(\mathbf{x}^t, \boldsymbol{\mu}_k^l) \tag{30}$$

Since we are addressing classification problems, the class-wise distribution plays a vital role and influences the performance of the classifier significantly [15]. Hence, we use the spherical potential of the new training sample $\mathbf{x}^t$ belonging to class $c$ with respect to the neurons associated with the same class (i.e., $l = c$). Let $K_c$ be the number of neurons associated with class $c$; the class-wise spherical potential, or class-wise significance ($\psi_c$), is then defined as

$$\psi_c = \frac{1}{K_c}\sum_{k=1}^{K_c} h(\mathbf{x}^t, \boldsymbol{\mu}_k^c) \tag{31}$$

The spherical potential explicitly indicates the knowledge contained in the sample: a higher value (close to one) indicates that the sample is similar to the existing knowledge in the cognitive component, and a smaller value (close to zero) indicates that the sample is novel.

2.2.3. Learning strategies

The meta-cognitive component devises various learning strategies using the knowledge measures and self-regulated thresholds, which directly address the basic principles of self-regulated human learning (i.e., what-to-learn, when-to-learn and how-to-learn). The meta-cognitive part controls the learning process in the cognitive component by selecting one of the following four learning strategies for the new training sample.

Sample delete strategy: If the new training sample contains information similar to the knowledge present in the cognitive component, then delete it from the training data set without using it in the learning process.

Neuron growth strategy: Use the new training sample to add a new hidden neuron in the cognitive component. During neuron addition, sample overlapping conditions are identified to allocate the new hidden neuron appropriately.

Parameter update strategy: The new training sample is used to update the parameters of the cognitive component. PBL is used for the update.

Sample reserve strategy: The new training sample contains some information, but it is not significant; such samples can be used at a later stage of the learning process for fine-tuning the parameters of the cognitive component, or eventually discarded without learning.

The principles behind these four learning strategies are described in detail below.

Sample delete strategy: When the predicted class label of the new training sample is the same as the actual class label and the estimated posterior probability is close to 1, the new training sample does not provide additional information to the classifier and can be deleted from the training sequence without being used in the learning process. The sample deletion criterion is given by

$$\hat{c}^t = c^t \quad \text{AND} \quad \hat{p}(c^t|\mathbf{x}^t) \geq \beta_d \tag{32}$$

The meta-cognitive deletion threshold ($\beta_d$) controls the number of samples participating in the learning process. If one selects $\beta_d$ close to 1, then all training samples participate in the learning process, which results in over-training with similar samples. Reducing $\beta_d$ below the desired accuracy results in the deletion of too many samples from the training sequence, and the resultant network may not satisfy the desired accuracy. Hence, it is fixed at the expected accuracy level; in our simulation studies, it is selected in the range [0.9, 0.95]. The sample deletion strategy prevents the learning of samples with similar information, thereby avoiding over-training and reducing the computational effort.
Neuron growth strategy: When a new training sample contains significant information, or the predicted class label is different from the actual class label, then one needs to add a new hidden neuron to represent the knowledge contained in the sample. The neuron growth criterion is given by

ĉ^t ≠ c^t  OR  (E^t ≥ β_a  AND  ψ^c(x^t) ≤ β_c)    (33)

where β_c is the meta-cognitive knowledge measurement threshold and β_a is the self-adaptive meta-cognitive addition threshold. The thresholds β_c and β_a allow samples with significant knowledge to be learned first, and then use the other samples for fine tuning. If the β_c threshold is chosen closer to zero and the initial value of the β_a threshold is chosen closer to the maximum value of the hinge error, then very few neurons will be added to the network; such a network will not approximate the function properly. If the β_c threshold is chosen closer to one and the initial value of the β_a threshold is chosen closer to the minimum value of the hinge error, then the resultant network may contain many neurons with poor generalization ability. Hence, the range for the meta-cognitive knowledge measurement threshold can be selected in the interval [0.3, 0.7] and the range for the initial value of the self-adaptive meta-cognitive

addition threshold can be selected in the interval [1.3, 1.7]. The β_a is adapted as follows:

β_a := δ β_a + (1 − δ) E^t    (34)

where δ is the slope that controls the rate of self-adaptation and is set close to one. The β_a adaptation allows McRBFN to add neurons only when the samples presented to the cognitive network contain significant information.
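A minimal sketch of the growth check (Eq. (33)) and the self-adaptive addition threshold update (Eq. (34)); all names and the example threshold values are illustrative assumptions:

```python
def should_grow(pred, actual, hinge_err, spherical_potential,
                beta_a, beta_c=0.5):
    # Grow on misclassification, or on a large error for a novel sample (Eq. (33)).
    return pred != actual or (hinge_err >= beta_a and
                              spherical_potential <= beta_c)

def adapt_beta_a(beta_a, hinge_err, delta=0.99):
    # A slope delta close to 1 makes the threshold drift slowly toward
    # the observed hinge error (Eq. (34)).
    return delta * beta_a + (1.0 - delta) * hinge_err

beta_a = 1.5
print(should_grow(2, 1, 0.4, 0.9, beta_a))   # True: misclassified sample
print(should_grow(1, 1, 1.6, 0.3, beta_a))   # True: large error on a novel sample
beta_a = adapt_beta_a(beta_a, 1.6)           # threshold moves slightly toward 1.6
```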
The new training sample may overlap with other classes, or it may come from a distinct cluster far away from the nearest neuron in the same class. Therefore, one needs to identify the status of the current sample (overlapping with other classes, or a distinct cluster in the same class) with respect to the existing neurons and initialize the parameters of the new neuron (K + 1) accordingly. The existing sequential learning algorithms initialize the width based on the distance to the nearest neuron and the output weight as the error based on the current sample; the influence of past samples is not considered in the weight initialization, which affects the performance of the classifier significantly. The above mentioned issues are dealt with in the proposed McRBFN as follows:
- Inter/intra class nearest neuron distances from the current sample are used for width determination.
- Existing knowledge of past samples, stored in the network as neuron centers, is used to initialize the weight of the new neuron.
Let nrS be the nearest hidden neuron in the intra-class and nrI be the nearest hidden neuron in the inter-class. They are defined as

nrS = arg min_{l = c; k} ||x^t − μ^k_l||;   nrI = arg min_{l ≠ c; k} ||x^t − μ^k_l||    (35)

The Euclidean distances from the new training sample to nrS and nrI are given as

d_S = ||x^t − μ^c_{nrS}||;   d_I = ||x^t − μ^l_{nrI}||    (36)
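A minimal sketch of the nearest-neuron distance computations of Eqs. (35)-(36); the data layout (parallel lists of centers and neuron class labels) is an assumption for this example:

```python
import numpy as np

def nearest_distances(x, centers, neuron_classes, c):
    """Return d_S (nearest intra-class neuron) and d_I (nearest
    inter-class neuron) for a sample x of class c."""
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(neuron_classes)
    intra = np.linalg.norm(centers[labels == c] - x, axis=1)
    inter = np.linalg.norm(centers[labels != c] - x, axis=1)
    return intra.min(), inter.min()

x = np.array([0.0, 0.0])
centers = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
classes = [1, 2, 1]
d_s, d_i = nearest_distances(x, centers, classes, c=1)
print(d_s, d_i)   # 1.0 2.0 -> ratio d_s/d_i = 0.5, the no-overlap case below
```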

Using the nearest neuron distances, we can determine the overlapping/no-overlapping conditions as follows:

Distinct sample: When a new training sample is far away from both the intra-class and inter-class nearest neurons (d_S >> σ^c_{nrS} AND d_I >> σ^l_{nrI}), the new training sample does not overlap with any class cluster and is from a distinct cluster. In this case, the new hidden neuron center (μ^c_{K+1}) and width (σ^c_{K+1}) parameters are determined as

μ^c_{K+1} = x^t;   σ^c_{K+1} = κ √((x^t)^T x^t)    (37)

where κ is a positive constant which controls the overlap of the responses of the hidden units in the input space, and lies in the range 0.5 ≤ κ ≤ 1.

No-overlapping: When a new training sample is close to the intra-class nearest neuron, i.e., the intra/inter class distance ratio is less than 1, the sample does not overlap with the other classes. In this case, the parameters are determined as

μ^c_{K+1} = x^t;   σ^c_{K+1} = κ ||x^t − μ^c_{nrS}||    (38)

Minimum overlapping with the inter-class: When a new training sample is closer to the inter-class nearest neuron than to the intra-class nearest neuron, i.e., the intra/inter class distance ratio is between 1 and 1.5, the sample has minimum overlapping with the other class. In this case, the center of the new hidden neuron is shifted away from the inter-class nearest neuron and towards the intra-class nearest neuron, and is initialized as

μ^c_{K+1} = x^t + η(μ^c_{nrS} − μ^l_{nrI});   σ^c_{K+1} = κ ||μ^c_{K+1} − μ^c_{nrS}||    (39)

where η is a center shift factor which determines how far the center is shifted from the new training sample location. In our simulation studies the η value is fixed at 0.1.

Significant overlapping with the inter-class: When a new training sample is very close to the inter-class nearest neuron compared to the intra-class nearest neuron, i.e., the intra/inter class distance ratio is more than 1.5, the sample has significant overlapping with the other class. In this case, the center of the new hidden neuron is shifted away from the inter-class nearest neuron and is initialized as

μ^c_{K+1} = x^t − η(μ^l_{nrI} − x^t);   σ^c_{K+1} = κ ||μ^c_{K+1} − μ^l_{nrI}||    (40)

The above mentioned center and width determination conditions help in minimizing the misclassification in the McRBFN classifier.

When a neuron is added to McRBFN, the output weights are estimated using the PBL, based on the existing knowledge of past samples stored in the network, as follows. The size of matrix A is increased from K × K to (K + 1) × (K + 1):

A^t_{(K+1)×(K+1)} = [ A^{t−1}_{K×K} + (h^t)^T h^t   a^T_{K+1} ;  a_{K+1}   a_{K+1,K+1} ]    (41)

where h^t = [h^t_1, h^t_2, ..., h^t_K] is the vector of the existing K hidden neurons' responses for the new (tth) training sample. In sequential learning, samples are discarded after learning, but the information present in the past samples is stored in the network. The centers of the neurons provide the distribution of the past samples in the feature space, and these centers can be used as pseudo-samples to capture the effect of past samples. Hence, the existing hidden neurons are used as pseudo-samples to calculate the a_{K+1} and a_{K+1,K+1} terms. a_{K+1} ∈ R^{1×K} is assigned as

a_{K+1,p} = Σ_{i=1}^{K+1} h_i^{K+1} h_i^p,   p = 1, ..., K,   where h_i^p = exp(−||μ^i_l − μ^p_l||² / (σ^l_p)²)    (42)

and the a_{K+1,K+1} ∈ R^+ value is assigned as

a_{K+1,K+1} = Σ_{i=1}^{K+1} h_i^{K+1} h_i^{K+1}    (43)

The size of matrix B is increased from K × n to (K + 1) × n:

B^t_{(K+1)×n} = [ B^{t−1}_{K×n} + (h^t)^T (y^t)^T ;  b_{K+1} ]    (44)

and b_{K+1} ∈ R^{1×n} is a row vector assigned as

b_{K+1,j} = Σ_{i=1}^{K+1} h_i^{K+1} ŷ^i_j,   j = 1, ..., n    (45)

where ŷ^i is the pseudo-output for the ith pseudo-sample or hidden neuron (μ^i_l), given as

ŷ^i_j = { 1, if l_i = j;  −1, otherwise },   j = 1, ..., n    (46)

Finally, the output weights are estimated as

[ W^t_K ; w^t_{K+1} ] = (A^t_{(K+1)×(K+1)})^{−1} B^t_{(K+1)×n}    (47)

where W^t_K is the output weight matrix for the K hidden neurons and w^t_{K+1} is the vector of output weights for the new hidden neuron after learning from the tth sample. The inverse of the matrix A^t_{(K+1)×(K+1)} is calculated recursively using matrix identities as

(A^t_{(K+1)×(K+1)})^{−1} = [ (A^t_{K×K})^{−1} + (1/Δ)(A^t_{K×K})^{−1} a^T_{K+1} a_{K+1} (A^t_{K×K})^{−1}   −(1/Δ)(A^t_{K×K})^{−1} a^T_{K+1} ;  −(1/Δ) a_{K+1} (A^t_{K×K})^{−1}   1/Δ ]    (48)

where A^t_{K×K} = A^{t−1} + (h^t)^T h^t, Δ = a_{K+1,K+1} − a_{K+1} (A^t_{K×K})^{−1} a^T_{K+1}, and (A^t_{K×K})^{−1} is calculated as

(A^t_{K×K})^{−1} = (A^{t−1})^{−1} − [ (A^{t−1})^{−1} (h^t)^T h^t (A^{t−1})^{−1} ] / [ 1 + h^t (A^{t−1})^{−1} (h^t)^T ]    (49)

After calculating the inverse of the matrix in Eq. (47) using Eqs. (48) and (49), the resultant equations are

W^t_K = ( I_{K×K} + (1/Δ)(A^t_{K×K})^{−1} a^T_{K+1} a_{K+1} ) ( W^{t−1}_K + (A^t_{K×K})^{−1} (h^t)^T (y^t)^T ) − (1/Δ)(A^t_{K×K})^{−1} a^T_{K+1} b_{K+1}    (50)-(52)

w^t_{K+1} = (1/Δ) ( b_{K+1} − a_{K+1} ( W^{t−1}_K + (A^t_{K×K})^{−1} (h^t)^T (y^t)^T ) )    (53)

Parameters update strategy: The current (tth) training sample is used to update the output weights of the cognitive component (W_K = [w_1, w_2, ..., w_K]^T) if the following criterion is satisfied:

ĉ^t == c^t  AND  E^t ≥ β_u

where E^t is the hinge loss error for the tth sample obtained from Eq. (5) and β_u is the self-adaptive meta-cognitive parameter update threshold. If the β_u threshold is chosen closer to 50% of the maximum hinge error, then very few samples will be used for adapting the network parameters and most of the samples will be pushed to the end of the training sequence; the resultant network will not approximate the function accurately. If a lower value is chosen, then all samples will be used in updating the network parameters without altering the training sequence. Hence, the range for the initial value of the meta-cognitive parameter update threshold can be selected in the interval [0.4, 0.7]. The β_u is adapted based on the hinge error as

β_u := δ β_u + (1 − δ) E^t    (54)

where δ is the slope that controls the rate of self-adaptation of the parameter update threshold and is set close to one.

When a sample is used to update the output weight parameters, the PBL algorithm updates them as follows:

∂J(W^t_K)/∂w_pj = ∂J^{t−1}(W^t_K)/∂w_pj + ∂J^t(W^t_K)/∂w_pj = 0,   p = 1, ..., K;  j = 1, ..., n    (55)

Equating the first partial derivative to zero and re-arranging Eq. (55), we get

(A^{t−1} + (h^t)^T h^t) W^t_K − (B^{t−1} + (h^t)^T (y^t)^T) = 0    (56)

By substituting B^{t−1} = A^{t−1} W^{t−1}_K and A^t = A^{t−1} + (h^t)^T h^t, and adding/subtracting the term (h^t)^T h^t W^{t−1}_K on both sides, Eq. (56) is reduced to

W^t_K = (A^t)^{−1} ( A^t W^{t−1}_K + (h^t)^T ((y^t)^T − h^t W^{t−1}_K) )    (57)

Finally, the output weights are updated as

W^t_K = W^{t−1}_K + (A^t)^{−1} (h^t)^T ((y^t)^T − h^t W^{t−1}_K)    (58)

Sample reserve strategy: If the new training sample satisfies neither the deletion, nor the neuron growth, nor the cognitive component parameters update criterion, then the current sample is pushed to the rear of the training sequence. Since McRBFN modifies the strategies based on the current sample knowledge, these samples may be used at a later stage.

Ideally, the training process stops when no further sample is available in the data stream. In practice, training stops when the set of samples in the reserve remains the same.

2.3. PBL-McRBFN classification algorithm

To summarize, the PBL-McRBFN algorithm is given in pseudo code form in Pseudo code 1.

Pseudo code 1. Pseudo code for the PBL-McRBFN classification algorithm.

Input: Present the training data one-by-one to the network from the data stream.
Output: Decision function that estimates the relationship between feature space and class label.
START
Initialization: Assign the first sample as the first neuron (K = 1).
  The parameters of the neuron are chosen as shown in Eq. (37).
Start learning for samples t = 2, 3, ...
DO
  The meta-cognitive component computes the significance of the sample
  with respect to the cognitive component:
    Compute the cognitive component output ŷ^t using Eq. (2).
    Find the predicted class label ĉ^t, maximum hinge error E^t, confidence
    of the classifier p̂(c^t|x^t) and class-wise significance ψ^c using
    Eqs. (25), (26) and (31).
  Based on the above calculated measures, the meta-cognitive component
  selects one of the following strategies:
  Sample Delete Strategy:
  IF ĉ^t == c^t AND p̂(c^t|x^t) ≥ β_d THEN
    Delete the sample from the sequence without learning.
  Neuron Growth Strategy:
  ELSEIF ĉ^t ≠ c^t OR (E^t ≥ β_a AND ψ^c(x^t) ≤ β_c) THEN
    Add a neuron to the network (K = K + 1).
    Choose the parameters of the new hidden neuron using Eqs. (37) to (52).
    Update the self-adaptive meta-cognitive addition threshold according to Eq. (34).
  Parameters Update Strategy:
  ELSEIF ĉ^t == c^t AND E^t ≥ β_u THEN
    Update the parameters of the cognitive component using Eq. (58).
    Update the self-adaptive meta-cognitive update threshold according to Eq. (54).
  Sample Reserve Strategy:
  ELSE
    The current sample (x^t, y^t) is pushed to the rear end of the sample
    stack to be used in future; it can later be used to fine-tune the
    cognitive component parameters.
  ENDIF
  The cognitive component executes the above selected strategy.
ENDDO
END
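As an illustrative sketch (not the authors' code), the two projection based learning computations invoked above can be written with NumPy: the weight estimation when a neuron is added (Eqs. (41)-(47)), here solved directly instead of through the block identities of Eqs. (48)-(53), and the recursive parameters update of Eqs. (49) and (58). All function and variable names, and the data layout, are assumptions for this example:

```python
import numpy as np

def gaussian(u, v, width):
    # Gaussian response of a neuron with center v and the given width.
    return np.exp(-np.linalg.norm(np.asarray(u) - np.asarray(v)) ** 2 / width ** 2)

def add_neuron_weights(A, B, h_t, y_t, centers, widths, y_pseudo):
    """Grow A (KxK) and B (Kxn) by one neuron and re-estimate all weights.
    h_t: responses of the K old neurons to x^t; y_t: coded target (length n);
    centers/widths: all K+1 neurons including the new one;
    y_pseudo: (K+1)xn coded (+1/-1) pseudo-outputs of the pseudo-samples."""
    K = A.shape[0]
    # H[i, p]: response of neuron p to pseudo-sample i (the ith center).
    H = np.array([[gaussian(centers[i], centers[p], widths[p])
                   for p in range(K + 1)] for i in range(K + 1)])
    a_new = H[:, K] @ H                      # Eqs. (42)-(43)
    A_new = np.zeros((K + 1, K + 1))
    A_new[:K, :K] = A + np.outer(h_t, h_t)   # Eq. (41)
    A_new[K, :] = a_new
    A_new[:K, K] = a_new[:K]
    b_new = H[:, K] @ y_pseudo               # Eq. (45)
    B_new = np.vstack([B + np.outer(h_t, y_t), b_new])   # Eq. (44)
    W = np.linalg.solve(A_new, B_new)        # Eq. (47), solved directly
    return A_new, B_new, W

def update_weights(A_inv, W, h, y):
    """Parameters update: rank-one inverse update (Eq. (49)) followed by
    the output weight correction of Eq. (58)."""
    h = h.reshape(1, -1)
    y = y.reshape(1, -1)
    Ah = A_inv @ h.T
    A_inv_new = A_inv - (Ah @ Ah.T) / (1.0 + (h @ Ah).item())
    W_new = W + A_inv_new @ h.T @ (y - h @ W)
    return A_inv_new, W_new

# Tiny usage: one existing neuron (K = 1), two classes (n = 2).
A = np.array([[1.0]]); B = np.array([[0.5, -0.5]])
centers = [np.array([0.0]), np.array([2.0])]; widths = [1.0, 1.0]
h_t = np.array([0.3]); y_t = np.array([1.0, -1.0])
y_pseudo = np.array([[1.0, -1.0], [-1.0, 1.0]])
A2, B2, W2 = add_neuron_weights(A, B, h_t, y_t, centers, widths, y_pseudo)

A_inv = np.eye(2); W = np.zeros((2, 3))
h = np.array([0.8, 0.2]); y = np.array([1.0, -1.0, -1.0])
A_inv, W = update_weights(A_inv, W, h, y)
```

After the update step, `A_inv` remains the exact inverse of the accumulated A matrix, so W stays the analytical minimizer of the energy function, which is the central point of the PBL derivation above.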

In PBL-McRBFN, the sample delete strategy addresses the what-to-learn by deleting insignificant samples from the training data set,
Table 1
Description of benchmark data sets selected from the UCI machine learning repository for the performance study.

Data sets                     Features  Classes  Samples             I.F
                                                 Training  Testing   Training  Testing
Image segmentation (IS)          19        7       210      2100       0         0
IRIS                              4        3        45       105       0         0
WINE                             13        3        60       118       0         0.29
Vehicle classification (VC)      18        4       424(a)    422       0.1       0.12
Glass identification (GI)         9        6       109(a)    105       0.68      0.77
HEART                            13        2        70       200       0.14      0.1
Liver disorders (LD)              6        2       200       145       0.17      0.14
PIMA                              8        2       400       368       0.22      0.39
Breast cancer (BC)                9        2       300       383       0.26      0.33
Ionosphere (ION)                 34        2       100       251       0.28      0.28

(a) Training samples are repeated three times randomly as suggested in [15].

the neuron growth strategy and the parameters update strategy address the how-to-learn by determining how the cognitive component learns from the samples, and the self-adaptive nature of the meta-cognitive thresholds, together with the sample reserve strategy, addresses the when-to-learn by presenting the samples to the learning process according to the knowledge present in each sample.
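The strategy selection described above can be condensed into a single dispatch function mirroring Pseudo code 1; the names and the example threshold values are illustrative assumptions, not the authors' implementation:

```python
def select_strategy(pred, actual, posterior, hinge_err, potential,
                    beta_d, beta_a, beta_c, beta_u):
    # Mirrors the IF/ELSEIF chain of Pseudo code 1.
    if pred == actual and posterior >= beta_d:
        return "delete"    # what-to-learn: discard redundant samples
    if pred != actual or (hinge_err >= beta_a and potential <= beta_c):
        return "grow"      # how-to-learn: add a hidden neuron
    if pred == actual and hinge_err >= beta_u:
        return "update"    # how-to-learn: refine the output weights
    return "reserve"       # when-to-learn: revisit the sample later

print(select_strategy(1, 1, 0.97, 0.1, 0.9, 0.9, 1.5, 0.5, 0.5))  # delete
print(select_strategy(2, 1, 0.40, 0.8, 0.9, 0.9, 1.5, 0.5, 0.5))  # grow
print(select_strategy(1, 1, 0.80, 0.6, 0.9, 0.9, 1.5, 0.5, 0.5))  # update
print(select_strategy(1, 1, 0.80, 0.3, 0.9, 0.9, 1.5, 0.5, 0.5))  # reserve
```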
3. Performance evaluation of the PBL-McRBFN classifier
The performance of the PBL-McRBFN classifier is evaluated on benchmark multi-category and binary classification problems from the UCI machine learning repository. The performance is compared with the best performing sequential learning algorithm reported in the literature (SRAN) [20], the batch ELM classifier [16] and the standard support vector machine [42]. The data sets are chosen with varying sample imbalance. The sample imbalance is measured
using the Imbalance Factor (I.F) defined as

I.F = 1 − (n/N) min_{j=1,...,n} N_j    (59)

where N_j is the total number of training samples belonging to class j and N = Σ_{j=1}^{n} N_j. The description of these data sets, including the number of input features, the number of classes, the number of samples in training/testing and the imbalance factor, is presented in Table 1. From Table 1, it can be observed that the problems chosen for the study include both balanced and unbalanced data sets, and that the imbalance factors of the data sets vary widely. Finally, the PBL-McRBFN classifier is used to solve two real-world classification problems: the acoustic emission signal processing for health monitoring data set presented in [38] and the mammogram classification for breast cancer detection data set presented in [43].
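A minimal sketch of the imbalance factor of Eq. (59), computed from a vector of class labels (the function name is an assumption):

```python
import numpy as np

def imbalance_factor(labels):
    # I.F = 1 - (n/N) * min_j N_j, where n is the number of classes and
    # N the total number of samples (Eq. (59)).
    _, counts = np.unique(labels, return_counts=True)
    n, N = len(counts), counts.sum()
    return 1.0 - (n / N) * counts.min()

print(imbalance_factor([0, 0, 0, 1]))   # 0.5 : 1 - (2/4)*1
print(imbalance_factor([0, 1, 0, 1]))   # 0.0 : perfectly balanced
```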
All the simulations are conducted in the MATLAB 2010 environment on a desktop PC with an Intel Core 2 Duo 2.66 GHz CPU and 3 GB RAM. For the ELM classifier, the number of hidden neurons is obtained using the constructive-destructive procedure presented in [44]. The simulations for batch SVM with Gaussian kernels are carried out using the LIBSVM package in C [45]. For the SVM classifier, the parameters (C, γ) are optimized using a grid search technique. The performance measures used to compare the classifiers are described below.
3.1. Performance measures
The class-wise performance measures (overall/average efficiencies) and a statistical significance test on the performance of multiple classifiers over multiple data sets are used for the performance comparison.
3.1.1. Class-wise measure
The confusion matrix Q is used to obtain the class-level and global performance of the various classifiers. Class-level performance is measured by the percentage classification (η_j), which is defined as

η_j = (q_jj / N_j) × 100%    (60)

where q_jj is the total number of correctly classified samples in class j and N_j is the total number of samples belonging to class j in the training/testing data set. The global measures used in the evaluation are the average per-class classification accuracy (η_a) and the over-all classification accuracy (η_o), defined as

η_a = (1/n) Σ_{j=1}^{n} η_j,   η_o = (1/N) Σ_{j=1}^{n} q_jj × 100%    (61)
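A minimal sketch of Eqs. (60)-(61) computed from a confusion matrix with rows indexed by the true class (names are illustrative):

```python
import numpy as np

def efficiencies(Q):
    """Return per-class (eta_j), average (eta_a) and overall (eta_o)
    percentage classification accuracies from confusion matrix Q."""
    Q = np.asarray(Q, dtype=float)
    per_class = 100.0 * np.diag(Q) / Q.sum(axis=1)   # eta_j, Eq. (60)
    eta_a = per_class.mean()                          # average, Eq. (61)
    eta_o = 100.0 * np.trace(Q) / Q.sum()             # overall, Eq. (61)
    return per_class, eta_a, eta_o

Q = [[45, 5], [10, 40]]
per_class, eta_a, eta_o = efficiencies(Q)
print(per_class)        # [90. 80.]
print(eta_a, eta_o)     # 85.0 85.0
```

With imbalanced classes, η_a and η_o diverge, which is exactly the effect discussed for the GI data set below.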

3.1.2. Statistical significance test
The classification efficiency by itself is not a conclusive measure of a classifier's performance [37]. Since the developed classifier is compared with multiple classifiers over multiple data sets, the Friedman test followed by the Bonferroni-Dunn test is used to establish the statistical significance of the PBL-McRBFN classifier. A brief description of the conducted tests is given below.

Friedman test: It is used to compare multiple classifiers (L) over multiple data sets (M). Let r_i^j be the rank of the jth classifier on the ith data set. Under the null-hypothesis, which states that all the classifiers are equivalent and so their average ranks R_j (R_j = (1/M) Σ_i r_i^j) over all data sets should be equal, the Friedman statistic is given by

χ²_F = [12M / (L(L+1))] [ Σ_j R_j² − L(L+1)²/4 ]    (62)

which follows the χ² (Chi-square) distribution with L − 1 degrees of freedom. A χ² distribution is the distribution of a sum of squares of independent standard normal variables.

Iman and Davenport showed that Friedman's statistic (χ²_F) is overly conservative and derived a better statistic [46]. It is given by

F_F = (M − 1) χ²_F / (M(L − 1) − χ²_F)    (63)

which follows the F-distribution with L − 1 and (L − 1)(M − 1) degrees of freedom, and is the statistic used in this paper. The F-distribution is defined as the probability distribution of the ratio of two independent χ² distributions over their respective degrees of freedom. The aim of the statistical test is to show that the performance of the PBL-McRBFN classifier is substantially different from that of the other classifiers with a confidence level of 1 − α. If the calculated F_F > F_{α/2,(L−1),(L−1)(M−1)} or F_F < F_{1−α/2,(L−1),(L−1)(M−1)}, then the null-hypothesis is rejected. The statistical tables for critical values can be found in [47].

Post-hoc test: The Bonferroni-Dunn test [48] is a post-hoc test that can be performed after rejection of the null-hypothesis. It is used to compare the PBL-McRBFN classifier against all the other classifiers. This test assumes that the performances of two classifiers are significantly different if the corresponding average ranks differ by at least the Critical Difference (CD), i.e., if (R_i − R_j) > CD then classifier i performs significantly better than classifier j. The critical difference is calculated using

CD = q_α √( L(L+1) / (6M) )    (64)

where the critical values q_α are based on the Studentized range statistic divided by √2, as given in [37].
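Eqs. (62)-(64) can be checked against the paper's own numbers (the average ranks of Table 3 for the overall efficiency); the function names are illustrative:

```python
import math

def friedman_chi2(avg_ranks, M):
    # Eq. (62): chi^2_F from average ranks over M data sets.
    L = len(avg_ranks)
    return 12.0 * M / (L * (L + 1)) * (
        sum(R * R for R in avg_ranks) - L * (L + 1) ** 2 / 4.0)

def iman_davenport(chi2, L, M):
    # Eq. (63): the modified (Iman and Davenport) statistic F_F.
    return (M - 1) * chi2 / (M * (L - 1) - chi2)

def critical_difference(q_alpha, L, M):
    # Eq. (64): Bonferroni-Dunn critical difference.
    return q_alpha * math.sqrt(L * (L + 1) / (6.0 * M))

ranks = [1.1, 2.6, 3.15, 3.15]          # Table 3, overall efficiency
chi2 = friedman_chi2(ranks, M=10)        # about 16.89, as reported below
ff = iman_davenport(chi2, L=4, M=10)     # about 11.59
cd = critical_difference(2.394, 4, 10)   # about 1.382
```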
3.2. Performance evaluation on UCI benchmark data sets
The class-wise performance measures (average/overall testing efficiencies), the number of hidden neurons and the number of samples used for the PBL-McRBFN, SRAN, ELM and SVM classifiers are reported in Table 2. Table 2 contains results for both the binary and the multi-category classification data sets from the UCI machine learning repository. From Table 2, we can see that the PBL-McRBFN classifier performs slightly better than the best performing SRAN classifier and significantly better than the ELM and SVM classifiers on all the 10 data sets. In addition, the proposed PBL-McRBFN classifier requires fewer samples to learn the decision function and develops a compact neural architecture to achieve better generalization performance.
Well balanced data sets: On the IS, IRIS and WINE data sets, the generalization performance of PBL-McRBFN is approximately 2% higher than the SRAN classifier and 3-4% higher than the ELM and SVM classifiers. On the IS data set, the proposed PBL-McRBFN uses fewer samples to achieve a 2% improvement over SRAN and approximately a 3-4% improvement over the ELM and SVM classifiers. Similarly, on the IRIS and WINE data sets, PBL-McRBFN uses fewer samples and fewer neurons to achieve better generalization performance. The PBL-McRBFN classifier achieves better generalization performance using the meta-cognitive learning algorithm, which selects appropriate samples to be used in learning based on the current knowledge, and deletes many redundant samples to avoid over-training. For example, on the IS data set, PBL-McRBFN uses only 89 samples out of 210 training samples to build the best classifier.
In order to highlight the above-mentioned advantages of the proposed PBL-McRBFN classifier, we conduct a simulation study in which the ELM classifier is trained with only the training samples used by the PBL-McRBFN classifier. On the IS data set, the PBL-McRBFN classifier selects the best 89 samples for training; these samples are used in the batch learning ELM algorithm, and we refer to this classifier as ELM*. The testing performance of the ELM* classifier (which uses the best 89-sample sequence) is better than that of the original ELM classifier developed using all 210 training samples. Also, ELM* achieves better generalization performance with a smaller number of hidden neurons (ELM* requires only 32 hidden neurons to achieve 92.14% testing efficiency, whereas ELM requires 49 hidden neurons to achieve 90.23%). This study clearly indicates that the sample deletion strategy present in PBL-McRBFN helps in achieving better decision making ability.
Imbalanced data sets: On the VC, GI, HEART, LD, PIMA, BC and ION data sets, the generalization performance of PBL-McRBFN is approximately 2-10% higher than the SRAN classifier and 2-15% higher than the ELM and SVM classifiers. In the case of imbalanced data sets, PBL-McRBFN requires more neurons to approximate the decision surface, while using fewer samples. The class-overlap based criterion for initializing the centers and widths of new neurons in PBL-McRBFN, together with the meta-cognitive learning, helps PBL-McRBFN achieve significantly better generalization performance. For example, on the VC data set the proposed PBL-McRBFN uses fewer samples to achieve a better average testing efficiency: approximately a 2% improvement over the SRAN and ELM classifiers, and a 10% improvement over the SVM classifier. The GI data set has an imbalance factor of 0.68 in training and 0.77 in testing. Such high imbalance influences the performance of the SRAN, ELM and SVM classifiers. On the GI data set, the SRAN overall testing efficiency (η_o) is 6% higher than its average testing efficiency (η_a), because the SRAN classifier is not able to accurately capture the knowledge for the classes which contain a smaller number of samples. In the case of the proposed PBL-McRBFN classifier, the average testing efficiency (η_a) is 8% higher than the overall testing efficiency (η_o); thus the proposed PBL-McRBFN classifier is able to capture the knowledge for the classes which contain a smaller number of samples accurately. On the GI data set the proposed PBL-McRBFN achieves a better average testing efficiency: a 12% improvement over SRAN with fewer samples, a 5% improvement over ELM with fewer neurons, and a 15% improvement over the SVM classifier with a smaller number of neurons.
Binary data sets: On the HEART and LD data sets, the proposed PBL-McRBFN achieves a better average testing efficiency, approximately 2-7% over SRAN, ELM and SVM, with fewer neurons. On the PIMA and BC data sets, the proposed PBL-McRBFN achieves a better average testing efficiency, approximately 1-2% over SRAN, ELM and SVM, with fewer samples. On the ION data set, the proposed PBL-McRBFN uses fewer samples and fewer neurons to achieve a better average testing efficiency: a 5% improvement over SRAN and an 8-9% improvement over ELM and SVM. The overlapping conditions and the class-specific criteria in the learning strategies of PBL-McRBFN help in capturing the knowledge accurately in the case of high sample imbalance problems. From Table 2, we can say that the proposed PBL-McRBFN improves the average/overall efficiency even under high sample imbalance.
3.2.1. Statistical significance analysis
In this section, we highlight the significance of the proposed PBL-McRBFN classifier on multiple data sets using the non-parametric Friedman test followed by the Bonferroni-Dunn test, as described in Section 3.1.2. The Friedman test identifies whether the measured average ranks differ significantly from the mean rank (here, 2.5) expected under the null-hypothesis. The Bonferroni-Dunn test highlights the statistical difference in performance of the PBL-McRBFN classifier over the other classifiers. As shown in Table 2, our comparison study uses four classifiers (L = 4) and ten data sets (M = 10).
Non-parametric test using overall testing efficiency (η_o): The ranks of all 4 classifiers based on the overall testing efficiency for each data set are provided in Table 3. The Friedman statistic (χ²_F as in Eq. (62)) is 16.89 and the modified (Iman and Davenport) statistic (F_F as in Eq. (63)) is 11.59. For four classifiers and ten data sets, the modified statistic is distributed according to the F-distribution with 3 and 27 degrees of freedom. The critical value for rejecting the null hypothesis at a significance level of 0.05 is 3.65. Since the modified statistic is greater than the critical value (11.59 > 3.65), we can reject the null hypothesis. Hence, we can say that the proposed PBL-McRBFN classifier performs better than the existing classifiers on these data sets.

Next, we conduct the Bonferroni-Dunn test to compare the proposed PBL-McRBFN classifier with all the other classifiers. From Eq. (64), the critical difference (CD) is calculated as 1.382 for a significance level of 0.05 (q_{0.05} = 2.394). From Table 3, we can see that the difference in average rank between the proposed

Table 2
Performance comparison of PBL-McRBFN with SRAN, ELM and SVM. K: number of hidden neurons; Samples: training samples used; η_o/η_a: overall/average testing efficiency (%).

Data sets  PBL-McRBFN                   SRAN                         ELM                    SVM
           K    Samples  η_o    η_a     K    Samples  η_o    η_a     K    η_o    η_a       SV(a)  η_o    η_a
IS         50    89      94.19  94.19   47   113      92.29  92.29   49   90.23  90.23     127    91.38  91.38
IRIS        6    20      98.10  98.10    8    29      96.19  96.19   10   96.19  96.19      13    96.19  96.19
WINE       11    29      98.31  98.69   12    46      96.61  97.19   10   97.46  98.04      36    97.46  98.04
VC        175   318      78.91  79.09  113   437      75.12  76.86  150   77.01  77.59     340    70.62  68.51
GI         71   115      84.76  92.72   59   159      86.21  80.95   80   81.31  87.43     183    70.47  75.61
HEART      20    69      81.50  81.47   28    56      78.50  77.53   36   76.50  75.91      42    75.50  75.10
LD         87   116      73.10  72.63   91   151      66.90  65.78  100   72.41  71.41     141    71.03  70.21
PIMA      100   162      79.62  76.67   97   230      78.53  74.90  100   76.63  75.25     221    77.45  76.43
BC         13    45      97.39  97.85    7    91      96.87  97.26   66   96.35  96.48      24    96.61  97.06
ION        18    58      96.41  96.47   21    86      90.84  91.88   32   89.64  87.52      43    91.24  88.51

(a) Number of support vectors.

Table 3
Ranks based on the overall (η_o) and average (η_a) testing efficiencies.

Data sets           PBL-McRBFN    SRAN          ELM            SVM
                    η_o    η_a    η_o    η_a    η_o    η_a     η_o    η_a
IS                   1      1      2      2      4      4       3      3
IRIS                 1      1      3      3      3      3       3      3
WINE                 1      1      4      4      2.5    2.5     2.5    2.5
VC                   1      1      3      3      2      2       4      4
GI                   2      1      1      3      3      2       4      4
HEART                1      1      2      2      3      3       4      4
LD                   1      1      4      4      2      2       3      3
PIMA                 1      1      2      4      4      3       3      2
BC                   1      1      2      2      4      4       3      3
ION                  1      1      3      2      4      4       2      3
Average rank (R_j)   1.1    1      2.6    2.9    3.15   2.95    3.15   3.15

PBL-McRBFN classifier and the other three classifiers is 1.5, 2.05 and 2.05, respectively. The difference in average rank is greater than the critical difference. Hence, based on the overall testing efficiency, the Bonferroni-Dunn test shows that the proposed PBL-McRBFN classifier is significantly better than the SRAN, ELM and SVM classifiers.
Non-parametric test using average testing efficiency (η_a): The ranks of all 4 classifiers based on the average testing efficiency for each data set are provided in Table 3. The Friedman statistic (χ²_F as in Eq. (62)) is 18.21 and the modified statistic (F_F as in Eq. (63)) is 13.9. Since the modified statistic is greater than the critical value (13.9 > 3.65), we can reject the null hypothesis. Hence, we can say that the proposed PBL-McRBFN classifier performs better than the other classifiers on these data sets.

From Table 3, we can see that the difference in average rank between the proposed PBL-McRBFN classifier and the other three classifiers is 1.9, 1.95 and 2.15, respectively. The difference in average rank is greater than the critical difference (1.382). Hence, based on the average testing efficiency, the Bonferroni-Dunn test also shows that the proposed PBL-McRBFN classifier performs better than the other well known classifiers. Next, we present the performance results of the PBL-McRBFN classifier on the two real-world classification problem data sets, viz., an acoustic emission data set for health monitoring presented in [38] and the mammogram classification data set for breast cancer detection presented in [43].
3.3. Acoustic emission signal classification for health monitoring
The stress or pressure waves produced by the transient energy released during irreversible deformation in a material, and picked up by a sensitive transducer, are called acoustic emission signals. These signals are produced by various sources, and the classification/identification of the sources using acoustic emission signals is a very difficult problem. The presence of ambient noise and pseudo acoustic emission signals in practical situations, together with the superficial similarities between the acoustic emission signals produced by different sources, increases the complexity further. In this section, we address the classification of such acoustic emission signals using the proposed PBL-McRBFN classifier. The experimental data for burst type acoustic emission signals from a metallic surface, as given in [38], is considered for our study. The burst type acoustic emission signal is characterized by 5 features, and these signals are classified into one of 4 sources, namely the pencil source, the pulse source, the spark source and the noise source. Out of 199 samples, 62 samples are used for training (as highlighted in [38]) and the remaining samples are used for testing the classifier. For details on the characteristics of the input features and the experimental setup, one should refer to [38].

The performance results of the PBL-McRBFN classifier are compared against SRAN, ELM and SVM in Table 4. It can be seen that the PBL-McRBFN classifier uses only 9 significant samples to build the classifier and requires only 5 neurons to achieve an over-all testing efficiency of 99.27%. Thus, the PBL-McRBFN classifier performs an efficient classification of the acoustic emission signals using a compact network.
Table 4
Performance comparison on acoustic emission signal problem.

Classifier    Hidden neurons   Samples used   Testing η_o   Testing η_a
PBL-McRBFN          5                9           99.27         98.91
SRAN               10               39           99.27         98.91
ELM                10               62           99.27         98.91
SVM                22(a)            62           98.54         97.95

(a) Number of support vectors.


Table 5
Performance comparison on mammogram classification problem.

Classifier    Hidden neurons   Samples used   Testing η_o   Testing η_a
PBL-McRBFN         22               60           100           100
SRAN               25               45           90.91         91.67
ELM                30               97           90.91         90.0
SVM               261(a)            97           90.91         91.67

(a) Number of support vectors.

Acknowledgements

The authors would like to thank the Nanyang Technological University-Ministry of Defence (NTU-MINDEF), Singapore, for the financial support (Grant number: MINDEF-NTU-JPP/11/02/05) to conduct this study.

3.4. Mammogram classification for breast cancer detection
Mammography is an effective means for early diagnosis of breast cancer, as tumors and abnormalities show up in a mammogram much before they can be detected through physical examinations. Clinically, the identification of malignant tissue involves detecting the abnormal masses or tumors, if any, and then classifying the mass as either malignant or benign, as given in [39]. However, once a tumor is detected, the only method of determining whether it is benign or malignant is by conducting a biopsy, which is an invasive procedure that involves the removal of cells or tissue from a patient. A non-invasive method of identifying the abnormalities in a mammogram can reduce the number of unnecessary biopsies, thus sparing the patients inconvenience and saving medical costs. In this study, the mammogram database available in [43] has been used. The 9 input features extracted from the mammogram of the identified abnormal mass are used to classify the tumor as either malignant or benign. Here, 97 samples are used to develop the PBL-McRBFN classifier, and its performance is evaluated using the remaining 11 samples. For further details on the input features and the data set, one should refer to [43].

The performance results of the PBL-McRBFN classifier, in comparison with SRAN, ELM and SVM, are presented in Table 5. From the table, it is seen that the PBL-McRBFN classifier performs a highly efficient classification, with 100% classification accuracy and a smaller number of hidden neurons. When compared to the SRAN, ELM and SVM classifiers, the performance of PBL-McRBFN is improved considerably.

Thus, from the performance study of PBL-McRBFN conducted against SRAN, ELM and SVM on the chosen benchmark data sets and practical classification problems, it can be observed that the proposed PBL-McRBFN classifier performs better than the other classifiers.
4. Conclusions
In this paper, we have presented a Meta-cognitive Radial Basis Function Network (McRBFN) and its Projection Based Learning (PBL) algorithm for classification problems in a sequential framework. The meta-cognitive component in McRBFN controls the learning of the cognitive component. The meta-cognitive component adapts the learning process appropriately by implementing self-regulation, and hence it decides what-to-learn, when-to-learn and how-to-learn efficiently. In addition, the overlapping conditions present in the neuron growth strategy help in the proper initialization of the new hidden neuron parameters and also minimize the misclassification error. The performance of the proposed PBL-McRBFN classifier has been evaluated using benchmark multi-category and binary classification problems from the UCI machine learning repository, with a wide range of imbalance factors, and two practical classification problems. The statistical performance comparison with well-known classifiers in the literature clearly indicates the superior performance of the proposed PBL-McRBFN classifier.
References
[1] G.B. Zhang, Neural networks for classification: a survey, IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews 30 (4) (2000) 451–462.
[2] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Computation 1 (1989) 541–551.
[3] F.F. Li, T.J. Cox, A neural network model for speech intelligibility quantification, Applied Soft Computing 7 (1) (2007) 145–155.
[4] S. Ari, G. Saha, In search of an optimization technique for artificial neural network to classify abnormal heart sounds, Applied Soft Computing 9 (1) (2009) 330–340.
[5] V. Ravi, C. Pramodh, Threshold accepting trained principal component neural network and feature subset selection: application to bankruptcy prediction in banks, Applied Soft Computing 8 (4) (2008) 1539–1548.
[6] M.E. Ruiz, P. Srinivasan, Hierarchical text categorization using neural networks, Information Retrieval 5 (2002) 87–118.
[7] M. Khan, S.W. Khor, Web document clustering using a hybrid neural network, Applied Soft Computing 4 (4) (2004) 423–432.
[8] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors, Nature 323 (1986) 533–536.
[9] G.-B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: a new learning scheme of feedforward neural networks, IEEE International Joint Conference on Neural Networks, Proceedings 2 (2004) 985–990.
[10] G.-B. Huang, X. Ding, H. Zhou, Optimization method based extreme learning machine for classification, Neurocomputing 74 (1–3) (2010) 155–163.
[11] J.C. Platt, A resource-allocating network for function interpolation, Neural Computation 3 (2) (1991) 213–225.
[12] L. Yingwei, N. Sundararajan, P. Saratchandran, A sequential learning scheme for function approximation using minimal radial basis function neural networks, Neural Computation 9 (2) (1997) 461–478.
[13] G.-B. Huang, P. Saratchandran, N. Sundararajan, An efficient sequential learning algorithm for growing and pruning RBF (GAP-RBF) networks, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 34 (6) (2004) 2284–2292.
[14] N.-Y. Liang, G.-B. Huang, P. Saratchandran, N. Sundararajan, A fast and accurate online sequential learning algorithm for feedforward networks, IEEE Transactions on Neural Networks 17 (6) (2006) 1411–1423.
[15] S. Suresh, N. Sundararajan, P. Saratchandran, A sequential multi-category classifier using radial basis function networks, Neurocomputing 71 (1) (2008) 1345–1358.
[16] S. Suresh, R.V. Babu, H.J. Kim, No-reference image quality assessment using modified extreme learning machine classifier, Applied Soft Computing 9 (2) (2009) 541–552.
[17] N. Kasabov, Evolving fuzzy neural networks for supervised/unsupervised online knowledge-based learning, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 31 (6) (2001) 902–918.
[18] W.P. Rivers, Autonomy at all costs: an ethnography of metacognitive self-assessment and self-management among experienced language learners, The Modern Language Journal 85 (2) (2001) 279–290.
[19] R. Isaacson, F. Fujita, Metacognitive knowledge monitoring and self-regulated learning: academic success and reflections on learning, Journal of the Scholarship of Teaching and Learning 6 (1) (2006) 39–55.
[20] S. Suresh, K. Dong, H.J. Kim, A sequential learning algorithm for self-adaptive resource allocation network classifier, Neurocomputing 73 (16–18) (2010) 3012–3019.
[21] S. Suresh, R. Savitha, N. Sundararajan, A sequential learning algorithm for complex-valued self-regulating resource allocation network-CSRAN, IEEE Transactions on Neural Networks 22 (7) (2011) 1061–1072.
[22] G. Sateesh Babu, S. Suresh, Meta-cognitive neural network for classification problems in a sequential learning framework, Neurocomputing 81 (2012) 86–96.
[23] K. Subramanian, S. Suresh, A meta-cognitive sequential learning algorithm for neuro-fuzzy inference system, Applied Soft Computing 12 (11) (2012) 3603–3614.
[24] R. Savitha, S. Suresh, N. Sundararajan, Metacognitive learning in a fully complex-valued radial basis function neural network, Neural Computation 24 (5) (2012) 1297–1328.
[25] R. Savitha, S. Suresh, N. Sundararajan, A meta-cognitive learning algorithm for a Fully Complex-valued Relaxation Network, Neural Networks 32 (2012) 209–218.
[26] G. Sateesh Babu, R. Savitha, S. Suresh, A projection based learning in meta-cognitive radial basis function network for classification problems, in: The 2012 International Joint Conference on Neural Networks (IJCNN), 2012, pp. 2907–2914.
[27] G. Sateesh Babu, S. Suresh, B.S. Mahanand, Alzheimer's disease detection using a Projection Based Learning Meta-cognitive RBF Network, in: The 2012 International Joint Conference on Neural Networks (IJCNN), 2012, pp. 408–415.
[28] G. Sateesh Babu, S. Suresh, K. Uma Sangumathi, H. Kim, A Projection Based Learning Meta-cognitive RBF network classifier for effective diagnosis of Parkinson's disease, in: J. Wang, G. Yen, M. Polycarpou (Eds.), Advances in Neural Networks – ISNN 2012, vol. 7368 of Lecture Notes in Computer Science, Springer, Berlin/Heidelberg, 2012, pp. 611–620.
[29] G. Sateesh Babu, S. Suresh, Parkinson's disease prediction using gene expression – a projection based learning meta-cognitive neural classifier approach, Expert Systems with Applications (2012), http://dx.doi.org/10.1016/j.eswa.2012.08.070
[30] M.T. Cox, Metacognition in computation: a selected research review, Artificial Intelligence 169 (2) (2005) 104–141.
[31] T.O. Nelson, L. Narens, Metamemory: A Theoretical Framework and New Findings, Allyn and Bacon, Boston, USA, 1992.
[32] S. Suresh, N. Sundararajan, P. Saratchandran, Risk-sensitive loss functions for sparse multi-category classification problems, Information Sciences 178 (12) (2008) 2621–2638.

[33] E. Castillo, O. Fontenla-Romero, B. Guijarro-Berdiñas, A. Alonso-Betanzos, A global optimum approach for one-layer neural networks, Neural Computation 14 (6) (2002) 1429–1449.
[34] E. Castillo, B. Guijarro-Berdiñas, O. Fontenla-Romero, A. Alonso-Betanzos, A very fast learning method for neural networks based on sensitivity analysis, Journal of Machine Learning Research 7 (2006) 1159–1182.
[35] H. Hoffmann, Kernel PCA for novelty detection, Pattern Recognition 40 (3) (2007) 863–874.
[36] C. Blake, C. Merz, UCI repository of machine learning databases, University of California, Irvine, Department of Information and Computer Sciences, 1998, http://archive.ics.uci.edu/ml/
[37] J. Demsar, Statistical comparisons of classifiers over multiple data sets, The Journal of Machine Learning Research 7 (2006) 1–30.
[38] S.N. Omkar, S. Suresh, T.R. Raghavendra, V. Mani, Acoustic emission signal classification using fuzzy C-means clustering, Proceedings of the ICONIP '02, 9th International Conference on Neural Information Processing 4 (2002) 1827–1831.
[39] C. Aize, Q. Song, X. Yang, S. Liu, C. Guo, Mammographic mass detection by vicinal support vector machine, Proceedings of the ICNN '04, International Conference on Neural Networks 3 (2004) 1953–1958.
[40] T. Zhang, Statistical behavior and consistency of classification methods based on convex risk minimization, Annals of Statistics 32 (1) (2004) 56–85.
[41] B. Scholkopf, A.J. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002.
[42] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297.
[43] J. Suckling, J. Parker, D.R. Dance, S. Astley, I. Hutt, C. Boggis, I. Ricketts, E. Stamatakis, N. Cerneaz, S. Kok, et al., The mammographic image analysis society digital mammogram database, Excerpta Medica International Congress Series 1069 (1994) 375–378.

[44] S. Suresh, S.N. Omkar, V. Mani, T.N.G. Prakash, Lift coefficient prediction at high angle of attack using recurrent neural network, Aerospace Science and Technology 7 (8) (2003) 595–602.
[45] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011) 27:1–27:27, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[46] R.L. Iman, J.M. Davenport, Approximations of the critical region of the Friedman statistic, Communications in Statistics (1980) 571–595.
[47] J.H. Zar, Biostatistical Analysis, 4th ed., Prentice-Hall, Englewood Cliffs, New Jersey, 1999.
[48] O.J. Dunn, Multiple comparisons among means, Journal of the American Statistical Association 56 (293) (1961) 52–64.
Mr. Giduthuri Sateesh Babu received the B.Tech degree
in electrical and electronics engineering from Jawaharlal Nehru Technological University, India, in 2007, and
the M.Tech degree in electrical engineering from the Indian Institute of Technology Delhi, India, in 2009. From 2009 to
2010, he worked as a senior software engineer at the Samsung R&D centre, India. He is currently a Ph.D. student
with the School of Computer Engineering, Nanyang Technological University, Singapore. His research interests
include machine learning, cognitive computing, neural
networks, control systems, optimization and medical
informatics.

Dr. Sundaram Suresh received the B.E degree in electrical
and electronics engineering from Bharathiyar University in 1999, and the M.E (2001) and Ph.D. (2005) degrees
in aerospace engineering from the Indian Institute of Science, India. He was a post-doctoral researcher in the School
of Electrical Engineering, Nanyang Technological University, from 2005 to 2007. From 2007 to 2008, he was in
INRIA-Sophia Antipolis, France, as an ERCIM research fellow. He was in Korea University for a short period as a
visiting faculty in Industrial Engineering. From January
2009 to December 2009, he was in the Indian Institute of
Technology-Delhi as an Assistant Professor in the Department
of Electrical Engineering. He has been working as an
Assistant Professor in the School of Computer Engineering, Nanyang Technological University, Singapore, since 2010. He was awarded best young faculty for the year
2009 by IIT-Delhi. His research interests include flight control, unmanned aerial
vehicle design, machine learning, applied game theory, optimization and computer
vision.