

A Novel K-Nearest Neighbor Classification Algorithm Based on Maximum Entropy

1 Xin-ying Xu, *2 Zhen-zhong Liu, 3 Qiu-feng Wu

1, First Author: College of Engineering, Northeast Agricultural University, Harbin 150030, China, xuxinying0423@126.com
*2, Corresponding Author: College of Science, Northeast Agricultural University, Harbin 150030, China, lzz00@126.com
3: College of Science, Northeast Agricultural University, Harbin 150030, China, neauqfwu@gmail.com
Abstract
In traditional K-Nearest Neighbor (TR_KNN), the Euclidean distance is usually used as the distance
metric between different samples, which can degrade classification performance. This paper
presents a novel KNN classification algorithm based on Maximum Entropy (ME_KNN), which improves the
distance metric without introducing any subjectivity. The proposed method is tested on 4 UCI datasets and
8 artificial Toy datasets. The experimental results show that the proposed algorithm achieves
significant improvements in recall, precision and accuracy over TR_KNN.

Keywords: K-nearest neighbor; Similarity metric; Maximum Entropy


1. Introduction
With the rapid development of database technology, data mining has become a research hotspot in many
fields [1], and classification is one of its key basic technologies. At present, the main
classification technologies include Decision Tree Induction (DTI), K-Nearest Neighbor (KNN),
Artificial Neural Network (ANN), Support Vector Machine (SVM), Bayesian Classification, etc.
Among these technologies, KNN is one of the best classifiers in terms of classification performance.
Furthermore, KNN is a simple, effective, easily implemented and nonparametric method [2]. Therefore,
KNN has been widely used in various fields such as text classification [3], pattern recognition [4],
and image and spatial classification [5].
KNN was initially proposed for pattern classification by Cover and Hart in 1967 [6]. However, in real
applications, traditional KNN (TR_KNN) presents some disadvantages [7]: firstly,
classification speed is slow; secondly, classification accuracy is greatly affected by feature weights;
thirdly, time complexity and space complexity increase rapidly with sample size;
finally, the only way to determine the value of the parameter k is to adjust it repeatedly by experiment.
One of the most important reasons for these defects is that the Euclidean distance is used in TR_KNN. The
Euclidean distance gives all feature components the same weight, which indirectly affects
classification performance. Therefore, many works improve the distance metric of TR_KNN. For
instance, Han et al. proposed an improved KNN based on Weight Adjustment (WAKNN) [8]; Jahromi
et al. introduced a weight adjustment coefficient in the distance function [9]; Lei et al. studied gear
crack level identification with a Two-stage Feature Selection and Weighting Technique (TFSWT) [10].
The improved methods above are based on the traditional distance metric (Euclidean distance) of KNN
and use feature weights in the distance metric. However, feature weights can be obtained either
according to the effect that each feature plays in classification or from the interaction of features in the
whole training sample database. Thus, determining feature weights involves a certain subjectivity, which can
affect classification accuracy. In this paper, we improve the distance metric in KNN by using
Maximum Entropy (ME). Entropy is a measure of the uncertainty of things and has received extensive
attention in information theory. When entropy reaches its maximum, the state is the most
uncertain, which is closest to the actual condition; this is called Maximum Entropy.
In addition, ME is a general mathematical method for solving inverse problems
where the available data is insufficient to determine the solution, and it is often used in model
estimation in linear and nonlinear problems [11]. This reflects that ME provides a similarity metric, namely
a distance metric, between observed values and actual values. Therefore, we use ME as the distance metric
in KNN instead of the Euclidean distance. The experimental results validate that the method is able to
classify the sample datasets effectively.
In this paper, starting from TR_KNN with the Euclidean metric, we propose a k-nearest neighbor algorithm
based on Maximum Entropy (ME_KNN) by combining TR_KNN and ME; see Section 2.
Experimental results on real data and artificial data are presented in Section 3. Conclusions are given
in Section 4.

2. Novel k-nearest neighbor classification algorithm based on Maximum Entropy


2.1. Traditional k-nearest neighbor
KNN is a mature algorithm in theory, with a very simple basic idea: given a test sample y,
calculate the distance between y and each training sample based on the Euclidean distance, and then
choose the k samples with the minimum distances as the k nearest neighbors of y. We assign y to the
category with the largest number of samples among the k nearest neighbors. The details of the algorithm are
described as follows [12]:
Step 1: Construct the training sample set and the test sample set. The training sample set is denoted by
$\{(x_i, c_i) \mid i = 1, 2, \ldots, n\}$, where $x_i = (x_i^1, x_i^2, \ldots, x_i^l)$ is an $l$-dimensional vector, that
is, the number of features is $l$, and $x_i^j$ denotes the $j$th feature component of the $i$th training
sample; $c_i$ denotes the corresponding class of the $i$th sample, and the label $c_i$ belongs to the label set
$C = \{1, 2, \ldots, t\}$, that is, the number of classes is $t$. The test sample set is denoted by
$\{y_j \mid j = 1, 2, \ldots, m\}$, where $y_j = (y_j^1, y_j^2, \ldots, y_j^l)$, in which $y_j^i$ denotes the $i$th feature component
of the $j$th test sample;
Step 2: Choose the k value. In general, take an initial value of k;
Step 3: Calculate the Euclidean distance between the test sample and each training sample. Generally,
the Euclidean distance is given by

$d(x_i, y_j) = \sqrt{\sum_{k=1}^{l} (x_i^k - y_j^k)^2}$   (1)

Step 4: Determine the k nearest neighbors. Sort the distances in ascending order, and take the k samples with
the smallest distances;
Step 5: Find the dominant class. Let the k nearest neighbors be $x_1, x_2, \ldots, x_k$, with corresponding
class labels $c_1, c_2, \ldots, c_k$ belonging to the label set $C$. The queried test sample is classified
according to the classes of the k nearest neighbors by means of maximum probability. The probability, which
means the percentage of each class appearing among the k nearest neighbors, is calculated as the number of
times the class appears among the k nearest neighbors divided by k, and the class with maximum probability is the
dominant class. Let $S = \{s_1, s_2, s_3, \ldots, s_t\}$ be the set of counts of each class among the k nearest neighbors.
The dominant class is

$c^{*} = \arg\max_{1 \le i \le t} \frac{s_i}{k}$   (2)

Step 6: Assign $y_j$ to the class $c^{*}$.
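To make Steps 1-6 concrete, here is a minimal Python sketch of TR_KNN. It is our illustration rather than code from the paper, and the names (tr_knn_predict, train_X, train_c) are assumptions.

```python
import numpy as np
from collections import Counter

def tr_knn_predict(train_X, train_c, y, k):
    """Classify one test sample y with traditional KNN (TR_KNN).

    train_X : (n, l) array of training samples
    train_c : (n,)  array of class labels
    y       : (l,)  test sample
    k       : number of nearest neighbors
    """
    # Step 3: Euclidean distance of Eq. (1) to every training sample
    d = np.sqrt(((np.asarray(train_X, dtype=float) - np.asarray(y, dtype=float)) ** 2).sum(axis=1))
    # Step 4: indices of the k smallest distances
    nn = np.argsort(d)[:k]
    # Steps 5-6: dominant class among the k nearest neighbors (Eq. (2))
    return Counter(np.asarray(train_c)[nn]).most_common(1)[0][0]
```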

2.2. Improvement principles on k-nearest neighbor


In TR_KNN, the Euclidean distance is used as the distance metric between different samples, which
leads to all feature components having the same weight and the same contribution to the classification result.
This does not hold in general, so many scholars have introduced weight coefficients to overcome this
shortcoming [8-10]. However, classification accuracy can be affected by subjective factors when a
weight coefficient is introduced into KNN. Therefore, we introduce ME as the distance metric in KNN.
In ME, different features can be merged into one probability model without any independence requirement,
which is a significant characteristic of ME. Furthermore, the ME model has the advantages of short
training time and low classification complexity. Therefore, we use ME as the distance metric between the
training sample $x_i$ and the test sample $y_j$. The details are described as follows [11]:

$d_{ME}(x_i, y_j) = \sum_{k=1}^{l} y_j^k \log \frac{y_j^k}{x_i^k}$   (3)

Eq.(3) is used as the distance metric instead of the Euclidean distance used in TR_KNN. ME stays very close to
the natural state of things by keeping all the uncertainty, and it does not involve any weighting problem, so Eq.
(3) is not affected by subjectivity and overcomes the significant defect of TR_KNN.
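As an illustration only, the distance of Eq.(3) can be written as the short function below; the name me_distance is ours, and strictly positive feature components are assumed, as discussed in Section 2.3.

```python
import numpy as np

def me_distance(x_i, y_j):
    """Maximum-Entropy distance of Eq. (3) between a training sample x_i
    and a test sample y_j; both must have strictly positive components."""
    x_i = np.asarray(x_i, dtype=float)
    y_j = np.asarray(y_j, dtype=float)
    return float(np.sum(y_j * np.log(y_j / x_i)))
```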

2.3. Improved k-nearest neighbor based on Maximum Entropy


According to the analyses above, we propose ME_KNN by combining TR_KNN and
Maximum Entropy. The details of the improved KNN are described as follows:
Step 1: Construct the training sample set, the test sample set and the feature vectors;
Step 2: Choose an initial k value;
Step 3: Calculate the distance between the test sample and the training samples by means of Eq.(3), sort the
distances in ascending order, and then take the k samples with the smallest distances as the k nearest
neighbors;
Step 4: Find the dominant class according to Step 5 of TR_KNN, and determine the class of the
test sample;
Step 5: Evaluation. If unsatisfied with the classification result, go back to Step 2 and repeat Steps 2
to 5; otherwise, stop.
ME_KNN inherits the simplicity and easy implementation of TR_KNN. Meanwhile,
it is completely objective, and the classification result is more accurate. Note, however, that
the distance in Eq.(3) is well defined only when the feature components are positive.
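Putting the steps together, a minimal sketch of ME_KNN might look as follows; it assumes positive feature values and the illustrative names me_knn_predict, train_X and train_c, none of which come from the paper.

```python
import numpy as np
from collections import Counter

def me_knn_predict(train_X, train_c, y, k):
    """Classify one test sample y with ME_KNN (distance of Eq. (3)).

    All feature components are assumed to be strictly positive.
    """
    train_X = np.asarray(train_X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step 3: Maximum-Entropy distance of Eq. (3) to every training sample
    d = (y * np.log(y / train_X)).sum(axis=1)
    nn = np.argsort(d)[:k]
    # Step 4: dominant class among the k nearest neighbors
    return Counter(np.asarray(train_c)[nn]).most_common(1)[0][0]
```

The only change relative to the TR_KNN sketch is the distance computation, which mirrors how ME_KNN differs from TR_KNN.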

3. Experimental results
In this section, datasets and evaluation indexes are given in section 3.1. The experiments on the two
decisive parameters, k value and the percentage of training samples in a dataset, for classification
performance are presented in section 3.2. The experiments on real datasets and artificial datasets for
performances of TR_KNN and ME_KNN are shown in section 3.3 and section 3.4 respectively.

3.1. Datasets and evaluation indexes


In this paper, we choose 4 standard datasets from UCI (http://archive.ics.uci.edu/ml/datasets/Adult) to
evaluate the performance of ME_KNN. The Abalone dataset in Table 1 is a part of the original Abalone
dataset: for convenience, only the first 100 samples of each category in the original Abalone dataset are
chosen. The details of the datasets are described in Table 1.
Table 1. UCI datasets used in the experiments
Datasets   Property type    Sample number   Class number   Feature number
Iris       Real             150             3              4
Wine       Integer, Real    178             3              13
Abalone    Real             300             3
Balance    Categorical      910


In order to compare the classification performance of TR_KNN with that of ME_KNN, macro
average recall (macro-r), macro average precision (macro-p), as well as the macro-F1 measure are
chosen here [13]. The macro-r, macro-p and macro-F1 of ME_KNN are denoted by
rME, pME, and F1ME respectively, and those of TR_KNN
are denoted by rTR, pTR, and F1TR respectively. macro-r, macro-p and macro-F1 are defined as follows:

$\text{macro-}r = \frac{1}{t}\sum_{k=1}^{t} r_k$   (4)

$\text{macro-}p = \frac{1}{t}\sum_{k=1}^{t} p_k$   (5)

$\text{macro-}F1 = \frac{1}{t}\sum_{k=1}^{t} F1_k$   (6)

In Eq.(4), the recall is $r_k = a_k / b_k$, where $a_k$ denotes the number of test samples of the $k$th class that are
predicted correctly, $b_k$ denotes the number of test samples of the $k$th class, and $t$ is the number of classes
in the test sample set. In Eq.(5), the precision is $p_k = a_k / d_k$, where $d_k$ is the number of test
samples that are predicted to be the $k$th class. In Eq.(6), $F1_k = 2 r_k p_k / (r_k + p_k)$, which combines
the recall ($r_k$) and precision ($p_k$) into a single measure.
In addition, the accuracy [14] is also used in Section 3.2.1; it is defined as

$ACC = \dfrac{\text{Number of samples classified correctly}}{\text{Number of samples in the whole dataset}} \times 100\%$   (7)
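For reference, the evaluation indexes of Eqs.(4)-(7) could be computed as in the sketch below; it assumes integer class labels 1..t and that each macro average is taken over the t classes, and the function names are ours.

```python
import numpy as np

def macro_scores(y_true, y_pred, t):
    """macro-r, macro-p and macro-F1 of Eqs. (4)-(6) for classes labelled 1..t."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    r, p, f1 = [], [], []
    for c in range(1, t + 1):
        a = np.sum((y_pred == c) & (y_true == c))  # class-c test samples predicted correctly
        b = np.sum(y_true == c)                    # test samples of class c
        d = np.sum(y_pred == c)                    # test samples predicted as class c
        r_c = a / b if b else 0.0
        p_c = a / d if d else 0.0
        f1_c = 2 * r_c * p_c / (r_c + p_c) if (r_c + p_c) else 0.0
        r.append(r_c); p.append(p_c); f1.append(f1_c)
    return np.mean(r), np.mean(p), np.mean(f1)

def accuracy(y_true, y_pred):
    """ACC of Eq. (7) as a percentage."""
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))
```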

3.2. Experimental results on optimal parametric selection


3.2.1. The influence of different k values on classification accuracy

The only way to determine k value is to repeatedly adjust it although k is a very important parameter
for KNN. In this section, in order to show the effect of different k values on the classification
performance, we assign 5, 10, 15 and 20 to k in the experiments. And given the number of training set
is two-thirds of the whole dataset. The results show that accuracy changes with different k values at Iris,
Wine, Abalone and Balance datasets. The experimental results are described in Figure 1.
In Figure 1, the classification accuracy of ME_KNN and TR_KNN reach the peaks when k equals
to both 5 and 15 at Iris dataset, while the accuracy of the two KNN algorithms reach peaks when k
equals to 15 at Wine dataset. At Abalone dataset, the accuracy of the two KNN reach the peaks when k
equals to 20. However, the accuracy reaches optimal value when k equals to 10 at Balance dataset. In
the four subfigures, the trends of the lines are absolutely different, and the k values are different when
the classification performances are the best [12]. Therefore, we still havent a determined method to
determine k value except adjusting it repeatedly by experiments. It is pointed out that k value must not
exceed the sample number of the class with minimum sample. Although there is not a determined
method to determine k value, it is very obvious that the accuracy of ME_KNN is much higher than that
of TR_KNN with the same k value.
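The k-selection experiment can be outlined in code as below. This is a sketch only: it assumes a random 2/3 training split and reuses the illustrative helpers me_knn_predict and accuracy defined above; the paper does not specify how the split was drawn.

```python
import numpy as np

def sweep_k(X, c, k_values=(5, 10, 15, 20), train_frac=2/3, seed=0):
    """Accuracy of ME_KNN for several k values on one dataset (illustrative)."""
    X, c = np.asarray(X, dtype=float), np.asarray(c)
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train = int(train_frac * len(X))
    tr, te = idx[:n_train], idx[n_train:]
    results = {}
    for k in k_values:
        pred = np.array([me_knn_predict(X[tr], c[tr], X[i], k) for i in te])
        results[k] = accuracy(c[te], pred)  # Eq. (7)
    return results
```

Running the same loop with tr_knn_predict in place of me_knn_predict gives the corresponding TR_KNN curve for comparison.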

Figure 1. The changes of accuracy with different k values at the four real datasets: (a) Iris dataset, (b) Wine dataset, (c) Abalone dataset, (d) Balance dataset.
3.2.2. The classification results with different percentage of training samples

The percentage of training samples is a critical parameter for classification, which directly affects
the classification result. In order to show the influence of the different percentage of training samples
on classification performance, we set the percentage of training samples in dataset to be 1/31/2 and
2/3, respectively. In this experimentwe set k to be 5. Table 2 demonstrates the effect of the percentage
of training sample on the classification performance.
In Table 2, Ptrs denotes the percentage of training samples in dataset, and the number of samples is
given for each dataset. At Iris dataset, when the percentage of training samples is 2/3, rME, pME, and
F1ME of ME_KNN reach the optimal valueswhich are 97.92%, 98.04%, and 0.9798, respectively.
While the best performance for TR_KNN is also obtained at the same condition with ME_KNN, which
is that rTR, pTR, and F1TR reach to 95.83%, 97.92%, and 0.9686, respectively. At Balance dataset, the
increasing extents from TR_KNN to ME_KNN of each evaluation index are much higher than that in
the other three datasets. Table 2 shows that rME reaches 76.17% when the percentage of training
samples is 1/2, pME reaches 76.65% and F1ME reaches 0.7518 with the percentage of training samples
equaling 2/3 at Balance dataset. All the evaluation indexes improve gradually with the increment of
training samples for every single dataset, which illustrate that the classification performance, both for
ME_KNN and TR_KNN, is improved with the increment of the percentage of the training samples [15].
In summary, the classification performance of ME_KNN is better than that of TR_KNN.
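Similarly, the influence of the training percentage can be examined with a loop over the split fractions used in the paper; the sketch below again relies on the illustrative helpers me_knn_predict and macro_scores, and t denotes the number of classes.

```python
import numpy as np

def sweep_train_fraction(X, c, t, fractions=(1/3, 1/2, 2/3), k=5, seed=0):
    """macro-r, macro-p, macro-F1 of ME_KNN for several training fractions (illustrative)."""
    X, c = np.asarray(X, dtype=float), np.asarray(c)
    idx = np.random.default_rng(seed).permutation(len(X))
    out = {}
    for frac in fractions:
        n_train = int(frac * len(X))
        tr, te = idx[:n_train], idx[n_train:]
        pred = np.array([me_knn_predict(X[tr], c[tr], X[i], k) for i in te])
        out[frac] = macro_scores(c[te], pred, t)  # Eqs. (4)-(6)
    return out
```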

Table 2. The classification results with different percentage of training samples
Datasets   Ptrs   rTR/%   rME/%   pTR/%   pME/%   F1TR     F1ME
Iris       1/3    92.93   94.95   93.33   94.98   0.9313   0.9496
           1/2    93.33   94.67   93.33   94.67   0.9333   0.9467
           2/3    95.83   97.92   97.92   98.04   0.9686   0.9798
Wine       1/3    90.87   91.74   92.87   93.01   0.9186   0.9237
           1/2    93.85   94.45   92.69   94.73   0.9326   0.9459
           2/3    92.14   96.28   94.93   95.14   0.9351   0.9571
Abalone    1/3    86.36   88.89   79.89   81.85   0.8300   0.8523
           1/2    86.50   88.00   81.24   84.17   0.8379   0.8604
           2/3    87.88   90.00   80.76   83.54   0.8417   0.8660
Balance    1/3    64.62   73.49   65.87   75.57   0.6524   0.7451
           1/2    67.09   76.17   67.94   73.06   0.6751   0.7458
           2/3    66.56   73.76   67.69   76.65   0.6712   0.7518

3.3. Experimental results in real datasets


In this experiment, we set the percentage of training samples to be 2/3, and k to be 10.
Table 3. The classification results of ME_KNN and TR_KNN at four datasets
Datasets   rTR/%   rME/%   pTR/%   pME/%   F1TR     F1ME
Iris       95.83   97.92   97.92   98.04   0.9686   0.9798
Wine       96.66   96.80   94.71   97.22   0.9567   0.9700
Abalone    86.87   89.74   78.04   78.89   0.8222   0.8397
Balance    63.15   75.80   69.30   77.50   0.6608   0.7664

The experimental results on the real datasets are shown in Table 3. Each evaluation index on the four
datasets is higher for ME_KNN than for TR_KNN. The improvement in macro-r from
TR_KNN to ME_KNN is smallest on the Wine dataset, but it still reaches 0.14 percentage points, while every
index on the Balance dataset improves greatly and the advantage of ME_KNN is significant.
On the Balance dataset, macro-r, macro-p, and macro-F1 increase by about 12 percentage points, 8 percentage points and 0.1, respectively.
Therefore, the classification performance of ME_KNN on the four real datasets is superior to that of
TR_KNN.

3.4. Experimental results in artificial datasets


We construct 8 Toy datasets to present the effect of feature dimension on classification
performance. The details of the artificial datasets are described in Table 4.
The percentage of training samples is set to 2/3, and the k value to 5. The detailed results are
shown in Table 5. In Table 5, the classification performances of the two KNN algorithms are optimal on Toy1,
whose feature dimension is 4; all index values of ME_KNN reach 100% on Toy1. As the feature dimension
increases, the performance becomes worse, and the performance on Toy8 is the worst. Even so, most
evaluation indexes remain higher for ME_KNN than for TR_KNN: rTR, pTR, and F1TR of Toy8 are
81.00%, 85.05%, and 0.8280 respectively, while rME, pME, and F1ME are 81.70%, 81.56%, and
0.8097 respectively. The results illustrate that the feature dimension affects the classification performance
of both TR_KNN and ME_KNN, which decreases as the feature dimension increases [16]. Note that the
stability of ME_KNN is better than that of TR_KNN.
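The paper does not give the exact procedure used to generate the Toy datasets, so the following sketch is only one plausible way to build Gaussian toy data of a given feature dimension (the standard deviation and variance columns of Table 4 suggest Gaussian features but the paper does not state this explicitly); the two-class setting, the class means, and the positive offset needed for the Eq.(3) distance are all assumptions of ours.

```python
import numpy as np

def make_toy(dim, n_samples=600, std=0.5, n_classes=2, seed=0):
    """Generate an artificial Gaussian dataset with the given feature dimension.

    Class means are placed well inside the positive range so that feature
    values are (with overwhelming probability) positive, as required by
    the Maximum-Entropy distance of Eq. (3).
    """
    rng = np.random.default_rng(seed)
    per_class = n_samples // n_classes
    X, c = [], []
    for cls in range(n_classes):
        mean = np.full(dim, 3.0 + 2.0 * cls)  # assumed class means
        X.append(rng.normal(mean, std, size=(per_class, dim)))
        c.append(np.full(per_class, cls + 1))
    return np.vstack(X), np.concatenate(c)
```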


Table 4. Artificial datasets
Datasets   Standard deviation   Variance   Number of samples   Number of classes   Feature dimension
Toy1       0.5                  0.5        600                                     4
Toy2       0.5                             600                                     5
Toy3                                       600                                     6
Toy4                                       600                                     7
Toy5       0.8                  1.5        600                                     8
Toy6                                       600                                     9
Toy7                                       600                                     10
Toy8                                       600                                     11

Table 5. Classification results with the changes of feature dimensions
Datasets   Feature dimension   rTR/%   rME/%    pTR/%   pME/%    F1TR     F1ME
Toy1       4                   99.01   100.00   99.01   100.00   0.9900   1.0000
Toy2       5                   98.50   99.50    98.54   99.51    0.9852   0.9950
Toy3       6                   98.50   98.50    98.50   98.50    0.9850   0.9850
Toy4       7                   97.00   97.50    97.08   97.57    0.9705   0.9747
Toy5       8                   93.95   96.94    94.14   96.94    0.9411   0.9697
Toy6       9                   84.62   86.00    86.13   86.95    0.8421   0.8563
Toy7       10                  82.89   85.11    83.42   85.12    0.8395   0.8535
Toy8       11                  81.00   81.70    85.05   81.56    0.8280   0.8097

4. Conclusions
In this paper, a novel KNN classification algorithm based on Maximum Entropy is proposed by
combining KNN and Maximum Entropy. Methods for determining the k value and the percentage of
training samples are examined in the two experiments on optimal parameter selection. ME_KNN
shows superiority over TR_KNN in recall, precision, accuracy and stability in the experiments on
real datasets and artificial datasets.

5. Acknowledgments
This work was supported in part by the Natural Science Foundation of Northeast Agricultural
University under contract no. 2011RCA01.

6. References
[1] Sarabjot S. Anand, David A. Bell, John G. Hughes, A General Framework for Data Mining Based
on Evidence Theory, Data & Knowledge Engineering, Elsevier, vol. 18, no. 3, pp.189-223, 1996.
[2] Huawen Liu, Shichao Zhang, Noisy Data Elimination Using Mutual k-Nearest Neighbor for
Classification Mining, Journal of Systems and Software, Elsevier, vol. 85, no. 5, pp.1067-1074,
2012.
[3] Taeho Jo, Malrey Lee, Yigon Kim, String Vectors as a Representation of Documents with
Numerical Vectors in Text Categorization, Journal of Convergence Information Technology,
AICIT, vol. 2, no. 1, pp.66-73, 2007.
[4] Jun Toyama, Mineichi Kudo, Hideyuki Imai, Probably Correct K-Nearest Neighbor Search in
High Dimensions, Pattern Recognition, Elsevier, vol. 43, no. 4, pp.1361-1372, 2012.
[5] Richard Nock, Paolo Piro, Frank Nielsen, Wafa Bel Haj Ali, Michel Barlaud, Boosting k-NN for
Categorization of Natural Scenes, International Journal of Computer Vision, Springer US, vol.
100, no. 3, pp.294-314, 2012.
[6] T. M. Cover, P. E. Hart, Nearest Neighbor Pattern Classification, IEEE Transactions on Information
Theory, IEEE, vol. 13, no. 1, pp.21-27, 1967.
[7] Mohammad Ashraf, Girija Chetty, Dat Tran, Dharmendra Sharma, A New Approach for
Constructing Missing Features Values, IJIIP, AICIT, vol. 3, no. 1, pp.110-118, 2012.
[8] Eui-Hong Han, George Karypis, Vipin Kumar, Text Categorization Using Weight Adjusted
k-Nearest Neighbor Classification, In Proceedings of the 5th Pacific-Asia Conference on
Knowledge Discovery and Data Mining, pp.53-65, 2001.
[9] Mansoor Zolghadri Jahromi, Elham Parvinnia, Robert John, A Method of Learning Weighted
Similarity Function to Improve the Performance of Nearest Neighbor, Information Sciences,
Elsevier, vol. 179, no. 17, pp.2964-2973, 2009.
[10] Yaguo Lei, Ming J. Zuo, Gear Crack Level Identification Based on Weighted k Nearest Neighbor
Classification Algorithm, Mechanical Systems and Signal Processing, Elsevier, vol. 23, no. 5,
pp.1535-1547, 2009.
[11] R. Barbuzza, P. Lotito, A. Clausse, Tomography Reconstruction by Entropy Maximization with
Smoothing Filtering, Inverse Problems in Science and Engineering, Taylor & Francis, vol. 18, no.
5, pp.711-722, 2010.
[12] Mehmet Aci, Mutlu Avci, K-Nearest Neighbor Reinforced Expectation Maximization Method,
Expert Systems with Applications, Elsevier, vol. 38, no. 10, pp.12585-12591, 2011.
[13] Shengyi Jiang, Guansong Pang, Meiling Wu, Limin Kuang, An Improved K-Nearest-Neighbor
Algorithm for Text Categorization, Expert Systems with Applications, Elsevier, vol. 39, no. 1,
pp.1503-1509, 2012.
[14] M. Govindarajan, R. M. Chandrasekaran, Evaluation of k-Nearest Neighbor Classifier Performance
for Direct Marketing, Expert Systems with Applications, Elsevier, vol. 37, no. 1, pp.253-258,
2010.
[15] Jing Peng, Chang-jie Tang, Dong-qing Yang, Jing Zhang, Jian-jun Hu, Similarity Computing
Model of High Dimension Data for Symptom Classification of Chinese Traditional Medicine,
Applied Soft Computing, Elsevier, vol. 9, no. 1, pp.209-218, 2009.
[16] S. Manocha, M. A. Girolami, An Empirical Analysis of the Probabilistic K-Nearest Neighbor
Classifier, Pattern Recognition Letters, Elsevier, vol. 28, no. 13, pp.1818-1824, 2007.
