DOI 10.1007/s00521-012-1184-y
ISNN 2012
Received: 28 April 2012 / Accepted: 11 September 2012 / Published online: 26 September 2012
© Springer-Verlag London Limited 2012
Abstract In this paper, a regularized correntropy criterion (RCC) for the extreme learning machine (ELM) is proposed to deal with training sets corrupted by noise or outliers. In RCC, the Gaussian kernel function is utilized to substitute for the Euclidean norm of the mean square error (MSE) criterion. Replacing MSE with RCC enhances the anti-noise ability of ELM. Moreover, the optimal weights connecting the hidden and output layers, together with the optimal bias terms, can be obtained promptly by the half-quadratic (HQ) optimization technique in an iterative manner. Experimental results on four synthetic data sets and fourteen benchmark data sets demonstrate that the proposed method is superior to the traditional ELM and the regularized ELM, both trained by the MSE criterion.
Keywords Extreme learning machine · Correntropy · Regularization term · Half-quadratic optimization
1 Introduction
Recently, Huang et al. [1] proposed a novel learning method for single-hidden layer feedforward networks (SLFNs), named the extreme learning machine (ELM). Interestingly, the weights connecting the input and hidden layers, together with the bias terms of the hidden units, are all randomly generated.
H.-J. Xing (✉)
Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Computer Science, Hebei University, Baoding 071002, China
e-mail: hjxing@hbu.edu.cn

X.-M. Wang
College of Mathematics and Computer Science, Hebei University, Baoding 071002, China
In ELM-RCC, the correntropy-based criterion replaces the MSE criterion of ELM, which makes ELM-RCC more robust against noise than ELM.

To further improve the generalization ability of ELM-RCC, a regularization term is added to its objective function.

The related propositions of the proposed method are given, together with its algorithmic implementation.
A standard SLFN with $N_H$ hidden nodes approximates the $N$ training samples by

$$\sum_{j=1}^{N_H} \beta_j f(\mathbf{w}_j \cdot \mathbf{x}_p + b_j) = y_p, \quad p = 1, 2, \ldots, N,$$

which can be written compactly as $\mathbf{H}\boldsymbol{\beta} = \mathbf{Y}$, where

$$\mathbf{H} = \begin{pmatrix} f(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & f(\mathbf{w}_{N_H} \cdot \mathbf{x}_1 + b_{N_H}) \\ \vdots & \ddots & \vdots \\ f(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & f(\mathbf{w}_{N_H} \cdot \mathbf{x}_N + b_{N_H}) \end{pmatrix},$$

$\boldsymbol{\beta} = (\beta_1, \ldots, \beta_{N_H})^T$, $\mathbf{Y} = (y_1, \ldots, y_N)^T$, and $\mathbf{T} = (t_1, \ldots, t_N)^T$. Moreover, $\mathbf{H} = (h_{pj})_{N \times N_H}$ is the output matrix of the hidden layer, $\boldsymbol{\beta}$ is the matrix of the weights connecting the hidden and output layers, $\mathbf{Y}$ is the output matrix of the network, and $\mathbf{T}$ is the target matrix.
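For illustration, the following minimal Python/NumPy sketch mirrors this construction: the hidden-layer parameters are drawn at random and the output weights are obtained from the hidden-layer output matrix H. The tanh activation, the sampling range, and all function names are our illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np

def elm_fit(X, T, n_hidden=25, seed=0):
    """Basic ELM: random hidden-layer parameters, least-squares output weights."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, X.shape[1]))  # input weights w_j
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # hidden biases b_j
    H = np.tanh(X @ W.T + b)                                 # hidden-layer output matrix H (N x N_H)
    beta = np.linalg.pinv(H) @ T                             # output weights via the pseudoinverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network outputs y_p for the rows of X."""
    return np.tanh(X @ W.T + b) @ beta
```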
In the original ELM, the output weights are given by the smallest-norm least-squares solution $\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T}$, where the Moore-Penrose generalized inverse is

$$\mathbf{H}^{\dagger} = (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T.$$

To improve the generalization ability, the regularized ELM (RELM) computes

$$\hat{\boldsymbol{\beta}} = (\mathbf{H}^T \mathbf{H} + \lambda \mathbf{I})^{-1} \mathbf{H}^T \mathbf{T}, \qquad (7)$$

which is the solution of the optimization problem

$$\min_{\boldsymbol{\beta}} \; \|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\|_F^2 + \lambda \|\boldsymbol{\beta}\|_F^2, \qquad (8)$$

where $\|\boldsymbol{\beta}\|_F^2 = \sum_{j=1}^{N_H} \|\boldsymbol{\beta}_j\|_2^2$ is regarded as the regularization term, with $\|\cdot\|_2$ denoting the L2-norm of the vector $\boldsymbol{\beta}_j$. Moreover, $\lambda$ in (8) denotes the regularization parameter, which plays an important role in controlling the trade-off between the training error and the model complexity. Appendix 1 provides a detailed mathematical deduction of the solution (7) from the optimization problem (8).
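As a minimal sketch, computing (7) amounts to a ridge-style linear solve; `lam` stands for λ, and the helper name is ours:

```python
import numpy as np

def relm_output_weights(H, T, lam=1e-3):
    """RELM output weights, beta = (H^T H + lam * I)^{-1} H^T T, cf. (7)."""
    A = H.T @ H + lam * np.eye(H.shape[1])   # regularized normal matrix
    return np.linalg.solve(A, H.T @ T)       # linear solve instead of explicit inverse
```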
2.3 Correntropy

Correntropy is a similarity measure between two arbitrary random variables A and B, defined by

$$V_{\sigma}(A, B) = E[\kappa_{\sigma}(A - B)], \qquad (9)$$

where $\kappa_{\sigma}(\cdot)$ is a kernel function. In practice, the joint distribution is usually unknown and only a finite number of samples $\{(a_i, b_i)\}_{i=1}^{M}$ are available, which leads to the sample estimator

$$\hat{V}_{M,\sigma}(A, B) = \frac{1}{M}\sum_{i=1}^{M} \kappa_{\sigma}(a_i - b_i).$$

In this paper, the Gaussian kernel

$$\kappa_{\sigma}(a_i - b_i) \triangleq G(a_i - b_i) = \exp\left(-\frac{(a_i - b_i)^2}{2\sigma^2}\right) \qquad (10)$$

is adopted, so that

$$\hat{V}_{M,\sigma}(A, B) = \frac{1}{M}\sum_{i=1}^{M} G(a_i - b_i). \qquad (11)$$
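Computing the estimator (11) is straightforward; a sketch with the kernel width σ exposed as `sigma` (names are ours):

```python
import numpy as np

def correntropy(a, b, sigma=1.0):
    """Sample correntropy (11): mean Gaussian kernel of the errors a_i - b_i."""
    e = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.mean(np.exp(-e ** 2 / (2.0 * sigma ** 2)))
```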
To enhance the robustness of ELM against noise, the MSE criterion is replaced by the correntropy between the targets and the network outputs, which leads to maximizing

$$\sum_{p=1}^{N} G(t_p - y_p), \qquad (12)$$

where $t_p$ denotes the target vector for the $p$th input vector $\mathbf{x}_p$ and $y_p$ denotes the output of the SLFN for $\mathbf{x}_p$, which is given by

$$y_p = \sum_{j=1}^{N_H} \beta_j f(\mathbf{w}_j \cdot \mathbf{x}_p + b_j) + b_0, \quad p = 1, 2, \ldots, N, \qquad (13)$$

where $b_0$ is the bias term of the output layer.
In matrix form, (13) becomes

$$\mathbf{Y} = \tilde{\mathbf{H}} \tilde{\boldsymbol{\beta}}, \qquad (14)$$

where

$$\tilde{\mathbf{H}} = \begin{pmatrix} 1 & f(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & f(\mathbf{w}_{N_H} \cdot \mathbf{x}_1 + b_{N_H}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & f(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & f(\mathbf{w}_{N_H} \cdot \mathbf{x}_N + b_{N_H}) \end{pmatrix}$$

and $\tilde{\boldsymbol{\beta}} = (b_0, \beta_1, \ldots, \beta_{N_H})^T$. Moreover, (13) can be rewritten as

$$y_p = \sum_{j=0}^{N_H} \tilde{h}_{pj} \tilde{\beta}_j, \qquad (15)$$

where $\tilde{h}_{p0} = 1$ for every $p$.
Adding the regularization term to (12), the optimization problem of ELM-RCC becomes

$$\max_{\tilde{\boldsymbol{\beta}}} J_{L2}(\tilde{\boldsymbol{\beta}}) = \max_{\tilde{\boldsymbol{\beta}}} \left[ \sum_{p=1}^{N} G(t_p - y_p) - \lambda \tilde{\boldsymbol{\beta}}^T \tilde{\boldsymbol{\beta}} \right]. \qquad (16)$$

By the half-quadratic optimization technique, an augmented objective function $\hat{J}_{L2}(\tilde{\boldsymbol{\beta}}, \boldsymbol{\alpha})$ with auxiliary variables $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_N)^T$ can be constructed such that

$$J_{L2}(\tilde{\boldsymbol{\beta}}) = \max_{\boldsymbol{\alpha}} \hat{J}_{L2}(\tilde{\boldsymbol{\beta}}, \boldsymbol{\alpha}), \qquad (17)$$

so that (16) can be solved by maximizing $\hat{J}_{L2}$ with respect to $\tilde{\boldsymbol{\beta}}$ and $\boldsymbol{\alpha}$ in an alternating manner. At the $s$th iteration, the auxiliary variables are updated by

$$\alpha_p^{s+1} = -G(t_p - y_p^s), \quad p = 1, 2, \ldots, N, \qquad (19)$$

and, with $\boldsymbol{\Lambda} = \mathrm{diag}(\alpha_1^{s+1}, \ldots, \alpha_N^{s+1})$, the weights are updated by

$$\tilde{\boldsymbol{\beta}}^{s+1} = \arg\max_{\tilde{\boldsymbol{\beta}}} \left\{ \mathrm{Tr}\left[ (\mathbf{T} - \tilde{\mathbf{H}} \tilde{\boldsymbol{\beta}})^T \boldsymbol{\Lambda} (\mathbf{T} - \tilde{\mathbf{H}} \tilde{\boldsymbol{\beta}}) \right] - \lambda \tilde{\boldsymbol{\beta}}^T \tilde{\boldsymbol{\beta}} \right\}, \qquad (21)$$

whose closed-form solution, deduced in Appendix 2, is

$$\tilde{\boldsymbol{\beta}}^{s+1} = (\tilde{\mathbf{H}}^T \boldsymbol{\Lambda} \tilde{\mathbf{H}} - \lambda \mathbf{I})^{-1} \tilde{\mathbf{H}}^T \boldsymbol{\Lambda} \mathbf{T}. \qquad (22)$$

The two updates are alternated until the change of the objective value

$$E_s = \frac{1}{N} \sum_{p=1}^{N} G(t_p - y_p) \qquad (23)$$

between consecutive iterations falls below a prescribed tolerance $\delta$.
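The alternating updates above can be sketched as follows. The zero initialization, the tolerance `tol`, the iteration cap, and the use of positive weights $G(t_p - y_p)$ in place of Λ (which flips the sign of the λ term, cf. (22) and (30)) are our assumptions rather than the authors' exact implementation.

```python
import numpy as np

def elm_rcc_fit(H, T, lam=1e-3, sigma=1.0, tol=1e-6, max_iter=100):
    """ELM-RCC output weights via alternating half-quadratic updates."""
    N = H.shape[0]
    T = np.asarray(T, dtype=float).reshape(N, -1)   # targets as (N, outputs)
    Ht = np.hstack([np.ones((N, 1)), H])            # H_tilde: bias column + H
    beta = np.zeros((Ht.shape[1], T.shape[1]))      # beta_tilde, first row is b_0
    E_prev = -np.inf
    for _ in range(max_iter):
        e = T - Ht @ beta                                          # errors t_p - y_p
        g = np.exp(-np.sum(e ** 2, axis=1) / (2.0 * sigma ** 2))   # weights G(t_p - y_p)
        # HQ weighted ridge step, equivalent to (22) written with positive weights
        A = Ht.T @ (g[:, None] * Ht) + lam * np.eye(Ht.shape[1])
        beta = np.linalg.solve(A, Ht.T @ (g[:, None] * T))
        E = g.mean()                                # objective E_s of (23)
        if abs(E - E_prev) < tol:                   # stop when E_s stabilizes
            break
        E_prev = E
    return beta
```

In use, one would form H from random hidden parameters as in the earlier sketch and call `elm_rcc_fit(H, T)`; samples with large errors receive weights $G(t_p - y_p)$ close to zero, which is the source of the robustness to outliers.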
Two-Moon: The second moon is generated by $x_1 = 1 - \cos z$ and $x_2 = \frac{1}{2} - \sin z$. To alter the noise level, different percentages of the data points in each class are randomly selected and their class labels are reversed, namely from +1 to -1 and from -1 to +1.

Ripley: Ripley's synthetic data set [32] is generated from mixtures of two Gaussian distributions. There are 250 two-dimensional samples in the training set and 1000 samples in the test set. The noise level of the training samples is changed in the same manner as for Two-Moon.
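For reproducibility, a sketch of the Two-Moon generator with the label-flipping noise described above; the sampling of z and everything beyond the quoted formulas are illustrative assumptions.

```python
import numpy as np

def two_moon(n_per_class=200, flip_ratio=0.1, seed=0):
    """Two-Moon data with a fraction of labels flipped as class noise."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, np.pi, size=n_per_class)
    upper = np.column_stack([np.cos(z), np.sin(z)])              # class +1
    lower = np.column_stack([1.0 - np.cos(z), 0.5 - np.sin(z)])  # class -1, cf. the quoted formulas
    X = np.vstack([upper, lower])
    y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == -1)
    for idx, cls in ((pos, 1.0), (neg, -1.0)):
        flip = rng.choice(idx, size=int(flip_ratio * idx.size), replace=False)
        y[flip] = -cls                                           # reverse labels: +1 <-> -1
    return X, y
```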
The parameter settings of the three methods on the four synthetic data sets are summarized in Table 1.
[Figs. 1 and 2: fitting results of ELM, RELM, and ELM-RCC on the noisy Sinc data; panels (a)-(d) plot the noisy data, the original Sinc function, and the outputs of the three methods.]
[Fig. 3 image: scatter of positive-class, negative-class, and noise points with the decision boundaries of ELM, RELM, and ELM-RCC.]
Fig. 3 The classification results of the three methods on Two-Moon and Ripley with noise. a Two-Moon. b Ripley
Table 1 The parameter settings of the three methods on the four synthetic data sets

Data | ELM (N_H) | RELM (N_H, λ) | ELM-RCC (N_H, λ)
Sinc | 25 | 25, 10^-7 | 25, 10^-7
Func | 35 | 35, 10^-6 | 35, 10^-6
Two-Moon | 25 | 25, 10^-3 | 25, 10^-3
Ripley | 25 | 25, 10^-2 | 25, 10^-2
Table 2 Descriptions of the classification data sets

Datasets | # Features | # Train | # Test | # Classes
Balance scale | - | 313 | 312 | 3
Dermatology | 35 | 180 | 178 | 6
Glass | 11 | 109 | 105 | 7
Hayes | - | 132 | 28 | 3
Spect heart | 45 | 134 | 133 | 2
Vowel | 13 | 450 | 450 | 10
Wdbc | 31 | 285 | 284 | 2

# Classes: the number of classes

Table 3 Descriptions of the regression data sets

Datasets | # Features | # Train | # Test
Breast heart | 11 | 341 | 341
Concrete | - | 515 | 515
Housing | 14 | 253 | 253
Servo | - | 83 | 83
Slump | 10 | 52 | 51
Sunspots | - | 222 | 50
Wine red | 12 | 799 | 799
Table 4 The parameter settings of the three methods on the regression data sets

Datasets | ELM (N_H) | RELM (N_H, λ) | ELM-RCC (N_H, σ, λ)
Breast heart | 30 | 20, 0.1 | 50, 0.1, 0.05
Concrete | 35 | 50, 0.1 | 50, 0.25, 0.025
Housing | 40 | 45, 0.075 | 45, 0.3, 0.0005
Servo | 35 | 50, 0.0005 | 45, 0.9, 0.001
Slump | 80 | 20, 0.25 | 20, 0.2, 0.075
Sunspots | 35 | 40, 0.05 | 50, 0.3, 0.025
Wine red | 30 | 40, 0.0001 | 35, 0.9, -
Table 5 The parameter settings of the three methods on the classification data sets

Datasets | ELM (N_H) | RELM (N_H, λ) | ELM-RCC (N_H, σ, λ)
Balance scale | 45 | 15, 0.25 | 45, 0.1, -
Dermatology | 40 | 30, 0.25 | 35, 0.5, -
Glass | 40 | 50, 0.01 | 45, 8, 0.075
Hayes | 30 | 50, 0.005 | 35, 6, 0.0005
Spect heart | 15 | -, 0.5 | 45, 12, -
Vowel | 50 | 50, 0.01 | 50, 0.5, 0.0001
Wdbc | 20 | 35, - | 50, 14, 0.0005
Table 6 The results of the three methods on the regression data sets

Datasets | ELM Etrain (Mean±SD) | ELM Etest (Mean±SD) | RELM Etrain (Mean±SD) | RELM Etest (Mean±SD) | ELM-RCC Etrain (Mean±SD) | ELM-RCC Etest (Mean±SD)
Breast heart | 0.2937±0.0262 | 0.3487±0.0320 | 0.3100±0.0291 | 0.3490±0.0327 | 0.3185±0.0387 | 0.3387±0.0371
Concrete | 8.1769±0.3445 | 8.9393±0.4460 | 7.5839±0.3171 | 8.4891±0.3822 | 7.4522±0.3323 | 8.4795±0.3839
Housing | 3.7983±0.3211 | 5.1187±0.4737 | 3.6908±0.3141 | 4.9211±0.4148 | 3.6699±0.3964 | 4.8779±0.4335
Servo | 0.4440±0.0865 | 0.8768±0.1894 | 0.4028±0.0861 | 0.7849±0.1289 | 0.4434±0.0895 | 0.7757±0.1119
Slump | 0.0000±0.0000 | 3.9376±0.8711 | 2.5592±0.3338 | 3.4182±0.5297 | 2.3035±0.3393 | 3.2873±0.5018
Sunspots | 11.3481±0.3723 | 22.0417±1.7435 | 11.3353±0.3141 | 21.2425±1.2183 | 10.8793±0.2328 | 21.1526±1.1702
Wine red | 0.6277±0.0142 | 0.6627±0.0161 | 0.6174±0.0137 | 0.6646±0.0163 | 0.6234±0.0132 | 0.6574±0.0138
Table 7 The results of the three methods on the classification data sets

Datasets | ELM Errtrain (%, Mean±SD) | ELM Errtest (%, Mean±SD) | RELM Errtrain (%, Mean±SD) | RELM Errtest (%, Mean±SD) | ELM-RCC Errtrain (%, Mean±SD) | ELM-RCC Errtest (%, Mean±SD)
Balance scale | 7.22±0.78 | 10.19±1.07 | 11.01±0.89 | 11.54±1.18 | 9.17±0.62 | 9.96±0.82
Dermatology | 1.36±0.86 | 4.80±1.61 | 2.04±0.94 | 4.86±1.49 | 1.64±0.87 | 4.48±1.59
Glass | 2.67±1.36 | 21.33±4.88 | 1.88±1.23 | 19.30±4.86 | 3.95±1.52 | 17.77±4.22
Hayes | 14.73±1.49 | 30.50±5.89 | 13.54±0.50 | 28.36±3.89 | 13.64±0.91 | 26.57±3.83
Spect heart | 21.02±1.63 | 21.26±1.58 | 10.30±2.03 | 23.08±3.15 | 17.95±2.34 | 20.92±2.09
Vowel | 20.95±2.00 | 36.64±2.88 | 21.25±1.78 | 36.38±2.66 | 20.87±2.07 | 35.88±3.03
Wdbc | 2.83±0.88 | 4.40±1.24 | 2.44±0.79 | 4.00±1.29 | 1.82±0.60 | 4.16±1.13
5 Conclusions

In this paper, a novel criterion for training ELM, namely the regularized correntropy criterion (RCC), is proposed. The resulting training method inherits the advantages of both ELM and the regularized correntropy criterion. Experimental results on both synthetic and benchmark data sets indicate that the proposed ELM-RCC achieves higher generalization performance and lower standard deviations in comparison with ELM and the regularized ELM.

To make our method more promising, four tasks remain for future investigation. First, finding appropriate parameters for ELM-RCC is a tough issue, and the influence of the parameters on the model requires further study. Second, the optimization method of ELM-RCC needs to be improved to avoid the unfavorable influence of the singularity of the generalized inverse. Third, an ELM-RCC ensemble based on the fuzzy integral [34] will be investigated to further improve the generalization ability of ELM-RCC. Fourth, our future work will also focus on the localized generalization error bound [35] of ELM-RCC and on comparing the prediction accuracy of ELM-RCC with that of rule-based systems [36, 37] together with their refinements [38, 39].
Acknowledgments This work is partly supported by the National Natural Science Foundation of China (Nos. 60903089, 61075051, 61073121, 61170040) and the Foundation of Hebei University (No. 3504020). In addition, the authors would like to express their special thanks to the referees for their valuable comments and suggestions for improving this paper.

Appendix 1

The optimization problem (8) can be rewritten as

$$\min_{\boldsymbol{\beta}} \; \mathrm{Tr}\left[(\mathbf{H}\boldsymbol{\beta} - \mathbf{T})^T (\mathbf{H}\boldsymbol{\beta} - \mathbf{T})\right] + \lambda \, \mathrm{Tr}(\boldsymbol{\beta}^T \boldsymbol{\beta}), \qquad (24)$$

where $\mathrm{Tr}(\cdot)$ denotes the trace operator. Taking the derivative of the objective function in (24) with respect to the weight matrix $\boldsymbol{\beta}$ and setting it equal to zero, we can get

$$2\mathbf{H}^T\mathbf{H}\hat{\boldsymbol{\beta}} - 2\mathbf{H}^T\mathbf{T} + 2\lambda\hat{\boldsymbol{\beta}} = \mathbf{0}, \qquad (25)$$

which yields the solution (7), that is,

$$\hat{\boldsymbol{\beta}} = (\mathbf{H}^T\mathbf{H} + \lambda\mathbf{I})^{-1}\mathbf{H}^T\mathbf{T}. \qquad (26)$$

Appendix 2

By the half-quadratic optimization technique, the optimization problem (16) is equivalent to

$$\max_{\tilde{\boldsymbol{\beta}}, \boldsymbol{\alpha}} \left[ \sum_{p=1}^{N} \alpha_p (t_p - y_p)^T (t_p - y_p) - \lambda \, \mathrm{Tr}(\tilde{\boldsymbol{\beta}}^T \tilde{\boldsymbol{\beta}}) + \mathrm{const} \right]. \qquad (27)$$

For fixed auxiliary variables, with $\boldsymbol{\Lambda} = \mathrm{diag}(\alpha_1, \ldots, \alpha_N)$, the update of $\tilde{\boldsymbol{\beta}}$ is

$$\begin{aligned} \tilde{\boldsymbol{\beta}}^{s+1} &= \arg\max_{\tilde{\boldsymbol{\beta}}} \left\{ \mathrm{Tr}\left[(\mathbf{T} - \tilde{\mathbf{H}}\tilde{\boldsymbol{\beta}})^T \boldsymbol{\Lambda} (\mathbf{T} - \tilde{\mathbf{H}}\tilde{\boldsymbol{\beta}})\right] - \lambda \tilde{\boldsymbol{\beta}}^T \tilde{\boldsymbol{\beta}} \right\} \\ &= \arg\max_{\tilde{\boldsymbol{\beta}}} \left\{ \mathrm{Tr}\left[\mathbf{T}^T\boldsymbol{\Lambda}\mathbf{T} - \mathbf{T}^T\boldsymbol{\Lambda}\tilde{\mathbf{H}}\tilde{\boldsymbol{\beta}} - \tilde{\boldsymbol{\beta}}^T\tilde{\mathbf{H}}^T\boldsymbol{\Lambda}\mathbf{T} + \tilde{\boldsymbol{\beta}}^T\tilde{\mathbf{H}}^T\boldsymbol{\Lambda}\tilde{\mathbf{H}}\tilde{\boldsymbol{\beta}}\right] - \lambda \tilde{\boldsymbol{\beta}}^T\tilde{\boldsymbol{\beta}} \right\}. \qquad (28) \end{aligned}$$

Taking the derivative of the objective function in (28) with respect to $\tilde{\boldsymbol{\beta}}$ and setting it equal to zero, we can get

$$2\tilde{\mathbf{H}}^T\boldsymbol{\Lambda}\tilde{\mathbf{H}}\tilde{\boldsymbol{\beta}} - 2\tilde{\mathbf{H}}^T\boldsymbol{\Lambda}\mathbf{T} - 2\lambda\tilde{\boldsymbol{\beta}} = \mathbf{0}, \qquad (29)$$

which gives the update rule (22), that is,

$$\tilde{\boldsymbol{\beta}}^{s+1} = (\tilde{\mathbf{H}}^T\boldsymbol{\Lambda}\tilde{\mathbf{H}} - \lambda\mathbf{I})^{-1}\tilde{\mathbf{H}}^T\boldsymbol{\Lambda}\mathbf{T}. \qquad (30)$$

References
123