

A Novel K-Nearest Neighbor Classification Algorithm Based on Maximum Entropy

1 Xin-ying Xu, *2 Zhen-zhong Liu, 3 Qiu-feng Wu

1, First Author: College of Engineering, Northeast Agricultural University, Harbin 150030, China, xuxinying0423@126.com
*2, Corresponding Author: College of Science, Northeast Agricultural University, Harbin 150030, China, lzz00@126.com
3: College of Science, Northeast Agricultural University, Harbin 150030, China, neauqfwu@gmail.com
Abstract
In traditional K-Nearest Neighbor (TR_KNN), the Euclidean distance is usually used as the distance
metric between different samples, which can degrade classification performance. This paper
presents a novel KNN classification algorithm based on Maximum Entropy (ME_KNN), which improves the
distance metric without introducing any subjectivity. The proposed method is tested on 4 UCI datasets and
8 artificial Toy datasets. The experimental results show that the proposed algorithm achieves
significant improvements in recall, precision and accuracy over TR_KNN.

Keywords: K-nearest neighbor; Similarity metric; Maximum Entropy


1. Introduction
With the rapid development of database technology, data mining has become a research hotspot in many
fields [1], and classification is one of its key basic technologies. At present, the main
classification technologies include Decision Tree Induction (DTI), K-Nearest Neighbor (KNN),
Artificial Neural Network (ANN), Support Vector Machine (SVM), Bayesian Classification, etc.
Among these technologies, KNN is one of the best classifiers in terms of classification performance.
Furthermore, KNN is a simple, effective, easily implemented and nonparametric method [2]. Therefore,
KNN has been widely used in various fields such as text classification [3], pattern recognition [4],
and image and spatial classification [5].
KNN was initially proposed for pattern classification by Cover and Hart in 1967 [6]. However, in real
applications, traditional KNN (TR_KNN) presents some disadvantages [7]: firstly,
classification speed is slow; secondly, classification accuracy is greatly affected by feature weights;
thirdly, time complexity and space complexity increase rapidly with sample size;
finally, the only way to determine the value of the parameter k is to adjust it repeatedly by experiment.
One of the most important reasons for these defects is that the Euclidean distance is used in TR_KNN. The
Euclidean distance gives all feature components the same weight, which indirectly affects
classification performance. Therefore, many works improve the distance metric of TR_KNN. For
instance, Han et al. proposed an improved KNN based on Weight Adjustment (WAKNN) [8]; Jahromi
et al. introduced a weight adjustment coefficient in the distance function [9]; Lei et al. studied gear
crack level identification with a Two-stage Feature Selection and Weighting Technique (TFSWT) [10].
The improved methods above are based on the traditional distance metric (Euclidean distance) of KNN
and use feature weights in the distance metric. However, feature weights can be obtained either
according to the effect that each feature plays in classification or from the interaction of features in the
whole training sample database. Thus, determining feature weights involves a certain subjectivity, which can
affect classification accuracy. In this paper, we improve the distance metric in KNN by using
Maximum Entropy (ME). Entropy is a measure of the uncertainty of things and has received extensive
attention in information theory. When entropy reaches its maximum, the state is the most
uncertain, which is closest to the actual condition; this is called Maximum Entropy.
In addition, ME is a general mathematical method for solving inverse problems
where the available data is insufficient to determine the solution, and it is often used in model
estimation in linear and nonlinear problems [11]. This reflects that ME provides a similarity metric, namely
a distance metric, between observed values and actual values. Therefore, we use ME as the distance metric
in KNN instead of the Euclidean distance. The experimental results validate that the method is able to
classify the sample datasets effectively.
In this paper, starting from TR_KNN with the Euclidean metric, we propose a k-nearest neighbor algorithm
based on Maximum Entropy (ME_KNN) by combining TR_KNN and ME; see Section 2.
Experimental results on real data and artificial data are presented in Section 3. Conclusions are given
in Section 4.

2. Novel k-nearest neighbor classification algorithm based on Maximum Entropy


2.1. Traditional k-nearest neighbor
KNN is a mature algorithm in theory, with a very simple basic idea: given a test sample y,
calculate the distance between y and each training sample based on the Euclidean distance, and then
choose the k samples with the minimum distances as the k nearest neighbors of y. We assign y to the
category with the largest number of samples among the k nearest neighbors. The details of the algorithm are
described as follows [12]:
Step 1: Construct the training sample set and the test sample set. The training sample set is denoted by
$\{(x_i, c_i) \mid i = 1, 2, \ldots, n\}$, where $x_i = (x_i^1, x_i^2, \ldots, x_i^l)$ is an $l$-dimensional vector, that
is, the number of features is $l$, and $x_i^j$ denotes the $j$th feature component of the $i$th training
sample; $c_i$ denotes the corresponding class of the $i$th sample, and the label $c_i$ belongs to the label set
$C = \{1, 2, \ldots, t\}$, that is, the number of classes is $t$. The test sample set is denoted by
$\{y_j \mid j = 1, 2, \ldots, m\}$, where $y_j = (y_j^1, y_j^2, \ldots, y_j^l)$, in which $y_j^i$ denotes the $i$th feature component
of the $j$th test sample;
Step 2: Choose the k value. In general, take an initial value of k;
Step 3: Calculate the Euclidean distance between the test sample and each training sample. Generally,
the Euclidean distance is given by

$d(x_i, y_j) = \sqrt{\sum_{k=1}^{l} (x_i^k - y_j^k)^2}$   (1)

Step 4: Determine the k nearest neighbors. Sort the distances in ascending order, and take the k samples with
the smallest distances;
Step 5: Find the dominant class. Let the k nearest neighbors be $x_1, x_2, \ldots, x_k$, with corresponding
class labels $c_1, c_2, \ldots, c_k$ belonging to the label set $C$. The queried test sample is classified
according to the classes of the k nearest neighbors by means of maximum probability. The probability, which
means the percentage of each class appearing among the k nearest neighbors, is calculated as the number of
times the class appears among the k nearest neighbors divided by k, and the class with maximum probability is the
dominant class. Let $S = \{s_1, s_2, s_3, \ldots, s_t\}$ be the set of counts of each class among the k nearest neighbors.
The dominant class is

$c^{*} = \arg\max_{1 \le i \le t} \frac{s_i}{k}$   (2)

Step 6: Assign $y_j$ to the class $c^{*}$.
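To make Steps 1-6 concrete, here is a minimal Python sketch of TR_KNN. It is our illustration rather than code from the paper, and the names (tr_knn_predict, train_X, train_c) are assumptions.

```python
import numpy as np
from collections import Counter

def tr_knn_predict(train_X, train_c, y, k):
    """Classify one test sample y with traditional KNN (TR_KNN).

    train_X : (n, l) array of training samples
    train_c : (n,)  array of class labels
    y       : (l,)  test sample
    k       : number of nearest neighbors
    """
    # Step 3: Euclidean distance of Eq. (1) to every training sample
    d = np.sqrt(((np.asarray(train_X, dtype=float) - np.asarray(y, dtype=float)) ** 2).sum(axis=1))
    # Step 4: indices of the k smallest distances
    nn = np.argsort(d)[:k]
    # Steps 5-6: dominant class among the k nearest neighbors (Eq. (2))
    return Counter(np.asarray(train_c)[nn]).most_common(1)[0][0]
```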

2.2. Improvement principles on k-nearest neighbor


In TR_KNN, the Euclidean distance is used as the distance metric between different samples, which
leads to all feature components having the same weight and the same contribution to the classification result.
This does not hold in general, so many scholars have introduced weight coefficients to overcome this
shortcoming [8-10]. However, classification accuracy can be affected by subjective factors when a
weight coefficient is introduced into KNN. Therefore, we introduce ME as the distance metric in KNN.
In ME, different features can be merged into one probability model without any independence requirement,
which is a significant characteristic of ME. Furthermore, the ME model has the advantages of short
training time and low classification complexity. Therefore, we use ME as the distance metric between the
training sample $x_i$ and the test sample $y_j$. The details are described as follows [11]:

$d_{ME}(x_i, y_j) = \sum_{k=1}^{l} y_j^k \log \frac{y_j^k}{x_i^k}$   (3)

Eq.(3) is used as the distance metric instead of the Euclidean distance used in TR_KNN. ME stays very close to
the natural state of things by keeping all the uncertainty, and it does not involve any weighting problem, so Eq.
(3) is not affected by subjectivity and overcomes the significant defect of TR_KNN.
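As an illustration only, the distance of Eq.(3) can be written as the short function below; the name me_distance is ours, and strictly positive feature components are assumed, as discussed in Section 2.3.

```python
import numpy as np

def me_distance(x_i, y_j):
    """Maximum-Entropy distance of Eq. (3) between a training sample x_i
    and a test sample y_j; both must have strictly positive components."""
    x_i = np.asarray(x_i, dtype=float)
    y_j = np.asarray(y_j, dtype=float)
    return float(np.sum(y_j * np.log(y_j / x_i)))
```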

2.3. Improved k-nearest neighbor based on Maximum Entropy


According to the analyses above, we propose ME_KNN by combining TR_KNN and
Maximum Entropy. The details of the improved KNN are described as follows:
Step 1: Construct the training sample set, the test sample set and the feature vectors;
Step 2: Choose an initial k value;
Step 3: Calculate the distance between the test sample and the training samples by means of Eq.(3), sort the
distances in ascending order, and then take the k samples with the smallest distances as the k nearest
neighbors;
Step 4: Find the dominant class according to Step 5 of TR_KNN, and determine the class of the
test sample;
Step 5: Evaluation. If unsatisfied with the classification result, go back to Step 2 and repeat Steps 2
to 5; otherwise, stop.
ME_KNN inherits the simplicity and easy implementation of TR_KNN. Meanwhile,
it is completely objective, and the classification result is more accurate. Note, however, that
the distance in Eq.(3) is well defined only when the feature components are positive.
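Putting the steps together, a minimal sketch of ME_KNN might look as follows; it assumes positive feature values and the illustrative names me_knn_predict, train_X and train_c, none of which come from the paper.

```python
import numpy as np
from collections import Counter

def me_knn_predict(train_X, train_c, y, k):
    """Classify one test sample y with ME_KNN (distance of Eq. (3)).

    All feature components are assumed to be strictly positive.
    """
    train_X = np.asarray(train_X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step 3: Maximum-Entropy distance of Eq. (3) to every training sample
    d = (y * np.log(y / train_X)).sum(axis=1)
    nn = np.argsort(d)[:k]
    # Step 4: dominant class among the k nearest neighbors
    return Counter(np.asarray(train_c)[nn]).most_common(1)[0][0]
```

The only change relative to the TR_KNN sketch is the distance computation, which mirrors how ME_KNN differs from TR_KNN.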

3. Experimental results
In this section, datasets and evaluation indexes are given in section 3.1. The experiments on the two
decisive parameters, k value and the percentage of training samples in a dataset, for classification
performance are presented in section 3.2. The experiments on real datasets and artificial datasets for
performances of TR_KNN and ME_KNN are shown in section 3.3 and section 3.4 respectively.

3.1. Datasets and evaluation indexes


In this paper, we choose 4 standard datasets from UCI (http://archive.ics.uci.edu/ml/datasets/Adult) to
evaluate the performance of ME_KNN. The Abalone dataset in Table 1 is a part of the original Abalone
dataset: for convenience, only the first 100 samples of each category in the original Abalone dataset are
chosen. The details of the datasets are described in Table 1.
Table 1. UCI datasets used in the experiments
Datasets   Property type    Sample number   Class number   Feature number
Iris       Real             150             3              4
Wine       Integer, Real    178             3              13
Abalone    Real             300             3
Balance    Categorical      910


In order to compare the classification performance of TR_KNN with that of ME_KNN, macro
average recall (macro-r), macro average precision (macro-p), as well as the macro-F1 measure are
chosen here [13]. The macro-r, macro-p and macro-F1 of ME_KNN are denoted by
rME, pME, and F1ME respectively, and those of TR_KNN
are denoted by rTR, pTR, and F1TR respectively. macro-r, macro-p and macro-F1 are defined as follows:

$\text{macro-}r = \frac{1}{t}\sum_{k=1}^{t} r_k$   (4)

$\text{macro-}p = \frac{1}{t}\sum_{k=1}^{t} p_k$   (5)

$\text{macro-}F1 = \frac{1}{t}\sum_{k=1}^{t} F1_k$   (6)

In Eq.(4), the recall is $r_k = a_k / b_k$, where $a_k$ denotes the number of test samples of the $k$th class that are
predicted correctly, $b_k$ denotes the number of test samples of the $k$th class, and $t$ is the number of classes
in the test sample set. In Eq.(5), the precision is $p_k = a_k / d_k$, where $d_k$ is the number of test
samples that are predicted to be the $k$th class. In Eq.(6), $F1_k = 2 r_k p_k / (r_k + p_k)$, which combines
the recall ($r_k$) and precision ($p_k$) into a single measure.
In addition, the accuracy [14] is also used in Section 3.2.1; it is defined as

$ACC = \dfrac{\text{Number of samples classified correctly}}{\text{Number of samples in the whole dataset}} \times 100\%$   (7)
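For reference, the evaluation indexes of Eqs.(4)-(7) could be computed as in the sketch below; it assumes integer class labels 1..t and that each macro average is taken over the t classes, and the function names are ours.

```python
import numpy as np

def macro_scores(y_true, y_pred, t):
    """macro-r, macro-p and macro-F1 of Eqs. (4)-(6) for classes labelled 1..t."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    r, p, f1 = [], [], []
    for c in range(1, t + 1):
        a = np.sum((y_pred == c) & (y_true == c))  # class-c test samples predicted correctly
        b = np.sum(y_true == c)                    # test samples of class c
        d = np.sum(y_pred == c)                    # test samples predicted as class c
        r_c = a / b if b else 0.0
        p_c = a / d if d else 0.0
        f1_c = 2 * r_c * p_c / (r_c + p_c) if (r_c + p_c) else 0.0
        r.append(r_c); p.append(p_c); f1.append(f1_c)
    return np.mean(r), np.mean(p), np.mean(f1)

def accuracy(y_true, y_pred):
    """ACC of Eq. (7) as a percentage."""
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))
```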

3.2. Experimental results on optimal parametric selection


3.2.1. The influence of different k values on classification accuracy

The only way to determine k value is to repeatedly adjust it although k is a very important parameter
for KNN. In this section, in order to show the effect of different k values on the classification
performance, we assign 5, 10, 15 and 20 to k in the experiments. And given the number of training set
is two-thirds of the whole dataset. The results show that accuracy changes with different k values at Iris,
Wine, Abalone and Balance datasets. The experimental results are described in Figure 1.
In Figure 1, the classification accuracy of ME_KNN and TR_KNN reach the peaks when k equals
to both 5 and 15 at Iris dataset, while the accuracy of the two KNN algorithms reach peaks when k
equals to 15 at Wine dataset. At Abalone dataset, the accuracy of the two KNN reach the peaks when k
equals to 20. However, the accuracy reaches optimal value when k equals to 10 at Balance dataset. In
the four subfigures, the trends of the lines are absolutely different, and the k values are different when
the classification performances are the best [12]. Therefore, we still havent a determined method to
determine k value except adjusting it repeatedly by experiments. It is pointed out that k value must not
exceed the sample number of the class with minimum sample. Although there is not a determined
method to determine k value, it is very obvious that the accuracy of ME_KNN is much higher than that
of TR_KNN with the same k value.
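The k-selection experiment can be outlined in code as below. This is a sketch only: it assumes a random 2/3 training split and reuses the illustrative helpers me_knn_predict and accuracy defined above; the paper does not specify how the split was drawn.

```python
import numpy as np

def sweep_k(X, c, k_values=(5, 10, 15, 20), train_frac=2/3, seed=0):
    """Accuracy of ME_KNN for several k values on one dataset (illustrative)."""
    X, c = np.asarray(X, dtype=float), np.asarray(c)
    idx = np.random.default_rng(seed).permutation(len(X))
    n_train = int(train_frac * len(X))
    tr, te = idx[:n_train], idx[n_train:]
    results = {}
    for k in k_values:
        pred = np.array([me_knn_predict(X[tr], c[tr], X[i], k) for i in te])
        results[k] = accuracy(c[te], pred)  # Eq. (7)
    return results
```

Running the same loop with tr_knn_predict in place of me_knn_predict gives the corresponding TR_KNN curve for comparison.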

Figure 1. The changes of accuracy with different k values at the four real datasets: (a) Iris dataset, (b) Wine dataset, (c) Abalone dataset, (d) Balance dataset.
3.2.2. The classification results with different percentage of training samples

The percentage of training samples is a critical parameter for classification, which directly affects
the classification result. In order to show the influence of the different percentage of training samples
on classification performance, we set the percentage of training samples in dataset to be 1/31/2 and
2/3, respectively. In this experimentwe set k to be 5. Table 2 demonstrates the effect of the percentage
of training sample on the classification performance.
In Table 2, Ptrs denotes the percentage of training samples in dataset, and the number of samples is
given for each dataset. At Iris dataset, when the percentage of training samples is 2/3, rME, pME, and
F1ME of ME_KNN reach the optimal valueswhich are 97.92%, 98.04%, and 0.9798, respectively.
While the best performance for TR_KNN is also obtained at the same condition with ME_KNN, which
is that rTR, pTR, and F1TR reach to 95.83%, 97.92%, and 0.9686, respectively. At Balance dataset, the
increasing extents from TR_KNN to ME_KNN of each evaluation index are much higher than that in
the other three datasets. Table 2 shows that rME reaches 76.17% when the percentage of training
samples is 1/2, pME reaches 76.65% and F1ME reaches 0.7518 with the percentage of training samples
equaling 2/3 at Balance dataset. All the evaluation indexes improve gradually with the increment of
training samples for every single dataset, which illustrate that the classification performance, both for
ME_KNN and TR_KNN, is improved with the increment of the percentage of the training samples [15].
In summary, the classification performance of ME_KNN is better than that of TR_KNN.
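Similarly, the influence of the training percentage can be examined with a loop over the split fractions used in the paper; the sketch below again relies on the illustrative helpers me_knn_predict and macro_scores, and t denotes the number of classes.

```python
import numpy as np

def sweep_train_fraction(X, c, t, fractions=(1/3, 1/2, 2/3), k=5, seed=0):
    """macro-r, macro-p, macro-F1 of ME_KNN for several training fractions (illustrative)."""
    X, c = np.asarray(X, dtype=float), np.asarray(c)
    idx = np.random.default_rng(seed).permutation(len(X))
    out = {}
    for frac in fractions:
        n_train = int(frac * len(X))
        tr, te = idx[:n_train], idx[n_train:]
        pred = np.array([me_knn_predict(X[tr], c[tr], X[i], k) for i in te])
        out[frac] = macro_scores(c[te], pred, t)  # Eqs. (4)-(6)
    return out
```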

Table 2. The classification results with different percentage of training samples
Datasets   Ptrs   rTR/%   rME/%   pTR/%   pME/%   F1TR     F1ME
Iris       1/3    92.93   94.95   93.33   94.98   0.9313   0.9496
           1/2    93.33   94.67   93.33   94.67   0.9333   0.9467
           2/3    95.83   97.92   97.92   98.04   0.9686   0.9798
Wine       1/3    90.87   91.74   92.87   93.01   0.9186   0.9237
           1/2    93.85   94.45   92.69   94.73   0.9326   0.9459
           2/3    92.14   96.28   94.93   95.14   0.9351   0.9571
Abalone    1/3    86.36   88.89   79.89   81.85   0.8300   0.8523
           1/2    86.50   88.00   81.24   84.17   0.8379   0.8604
           2/3    87.88   90.00   80.76   83.54   0.8417   0.8660
Balance    1/3    64.62   73.49   65.87   75.57   0.6524   0.7451
           1/2    67.09   76.17   67.94   73.06   0.6751   0.7458
           2/3    66.56   73.76   67.69   76.65   0.6712   0.7518

3.3. Experimental results in real datasets


In this experiment, we set the percentage of training samples to be 2/3, and k to be 10.
Table 3. The classification results of ME_KNN and TR_KNN at four datasets
Datasets   rTR/%   rME/%   pTR/%   pME/%   F1TR     F1ME
Iris       95.83   97.92   97.92   98.04   0.9686   0.9798
Wine       96.66   96.80   94.71   97.22   0.9567   0.9700
Abalone    86.87   89.74   78.04   78.89   0.8222   0.8397
Balance    63.15   75.80   69.30   77.50   0.6608   0.7664

The experimental results on the real datasets are shown in Table 3. Each evaluation index on the four
datasets is higher for ME_KNN than for TR_KNN. The improvement in macro-r from
TR_KNN to ME_KNN is smallest on the Wine dataset, but it still reaches 0.14 percentage points, while every
index on the Balance dataset improves greatly and the advantage of ME_KNN is significant.
On the Balance dataset, macro-r, macro-p, and macro-F1 increase by about 12 percentage points, 8 percentage points and 0.1, respectively.
Therefore, the classification performance of ME_KNN on the four real datasets is superior to that of
TR_KNN.

3.4. Experimental results in artificial datasets


We construct 8 Toy datasets to present the effect of feature dimension on classification
performance. The details of the artificial datasets are described in Table 4.
The percentage of training samples is set to 2/3, and the k value to 5. The detailed results are
shown in Table 5. In Table 5, the classification performances of the two KNN algorithms are optimal on Toy1,
whose feature dimension is 4; all index values of ME_KNN reach 100% on Toy1. As the feature dimension
increases, the performance becomes worse, and the performance on Toy8 is the worst. Even so, most
evaluation indexes remain higher for ME_KNN than for TR_KNN: rTR, pTR, and F1TR of Toy8 are
81.00%, 85.05%, and 0.8280 respectively, while rME, pME, and F1ME are 81.70%, 81.56%, and
0.8097 respectively. The results illustrate that the feature dimension affects the classification performance
of both TR_KNN and ME_KNN, which decreases as the feature dimension increases [16]. Note that the
stability of ME_KNN is better than that of TR_KNN.
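The paper does not give the exact procedure used to generate the Toy datasets, so the following sketch is only one plausible way to build Gaussian toy data of a given feature dimension (the standard deviation and variance columns of Table 4 suggest Gaussian features but the paper does not state this explicitly); the two-class setting, the class means, and the positive offset needed for the Eq.(3) distance are all assumptions of ours.

```python
import numpy as np

def make_toy(dim, n_samples=600, std=0.5, n_classes=2, seed=0):
    """Generate an artificial Gaussian dataset with the given feature dimension.

    Class means are placed well inside the positive range so that feature
    values are (with overwhelming probability) positive, as required by
    the Maximum-Entropy distance of Eq. (3).
    """
    rng = np.random.default_rng(seed)
    per_class = n_samples // n_classes
    X, c = [], []
    for cls in range(n_classes):
        mean = np.full(dim, 3.0 + 2.0 * cls)  # assumed class means
        X.append(rng.normal(mean, std, size=(per_class, dim)))
        c.append(np.full(per_class, cls + 1))
    return np.vstack(X), np.concatenate(c)
```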


Table 4. Artificial datasets
Datasets   Standard deviation   Variance   Number of samples   Number of classes   Feature dimension
Toy1       0.5                  0.5        600                                     4
Toy2       0.5                             600                                     5
Toy3                                       600                                     6
Toy4                                       600                                     7
Toy5       0.8                  1.5        600                                     8
Toy6                                       600                                     9
Toy7                                       600                                     10
Toy8                                       600                                     11

Table 5. Classification results with the changes of feature dimensions
Datasets   Feature dimension   rTR/%   rME/%    pTR/%   pME/%    F1TR     F1ME
Toy1       4                   99.01   100.00   99.01   100.00   0.9900   1.0000
Toy2       5                   98.50   99.50    98.54   99.51    0.9852   0.9950
Toy3       6                   98.50   98.50    98.50   98.50    0.9850   0.9850
Toy4       7                   97.00   97.50    97.08   97.57    0.9705   0.9747
Toy5       8                   93.95   96.94    94.14   96.94    0.9411   0.9697
Toy6       9                   84.62   86.00    86.13   86.95    0.8421   0.8563
Toy7       10                  82.89   85.11    83.42   85.12    0.8395   0.8535
Toy8       11                  81.00   81.70    85.05   81.56    0.8280   0.8097

4. Conclusions
In this paper, a novel KNN classification algorithm based on Maximum Entropy is proposed by
combining KNN and Maximum Entropy. Methods for determining the k value and the percentage of
training samples are examined in the two experiments on optimal parameter selection. ME_KNN
shows superiority over TR_KNN in recall, precision, accuracy and stability in the experiments on
real datasets and artificial datasets.

5. Acknowledgments
This work was supported in part by the Natural Science Foundation of Northeast Agricultural
University under contract no. 2011RCA01.

6. References
[1] Sarabjot S. Anand, David A. Bell, John G. Hughes, A General Framework for Data Mining Based
on Evidence Theory, Data & Knowledge Engineering, Elsevier, vol. 18, no. 3, pp.189-223, 1996.
[2] Huawen Liu, Shichao Zhang, Noisy Data Elimination Using Mutual k-Nearest Neighbor for
Classification Mining, Journal of Systems and Software, Elsevier, vol. 85, no. 5, pp.1067-1074,
2012.
[3] Taeho Jo, Malrey Lee, Yigon Kim, String Vectors as a Representation of Documents with
Numerical Vectors in Text Categorization, Journal of Convergence Information Technology,
AICIT, vol. 2, no. 1, pp.66-73, 2007.
[4] Jun Toyama, Mineichi Kudo, Hideyuki Imai, Probably Correct K-Nearest Neighbor Search in
High Dimensions, Pattern Recognition, Elsevier, vol. 43, no. 4, pp.1361-1372, 2012.
[5] Richard Nock, Paolo Piro, Frank Nielsen, Wafa Bel Haj Ali, Michel Barlaud, Boosting k-NN for
Categorization of Natural Scenes, International Journal of Computer Vision, Springer US, vol.
100, no. 3, pp.294-314, 2012.
[6] T. M. Cover, P. E. Hart, Nearest Neighbor Pattern Classification, IEEE Transactions on Information
Theory, IEEE, vol. 13, no. 1, pp.21-27, 1967.
[7] Mohammad Ashraf, Girija Chetty, Dat Tran, Dharmendra Sharma, A New Approach for
Constructing Missing Features Values, IJIIP, AICIT, vol. 3, no. 1, pp.110-118, 2012.
[8] Eui-Hong Han, George Karypis, Vipin Kumar, Text Categorization Using Weight Adjusted
k-Nearest Neighbor Classification, In Proceedings of the 5th Pacific-Asia Conference on
Knowledge Discovery and Data Mining, pp.53-65, 2001.
[9] Mansoor Zolghadri Jahromi, Elham Parvinnia, Robert John, A Method of Learning Weighted
Similarity Function to Improve the Performance of Nearest Neighbor, Information Sciences,
Elsevier, vol. 179, no. 17, pp.2964-2973, 2009.
[10] Yaguo Lei, Ming J. Zuo, Gear Crack Level Identification Based on Weighted k Nearest Neighbor
Classification Algorithm, Mechanical Systems and Signal Processing, Elsevier, vol. 23, no. 5,
pp.1535-1547, 2009.
[11] R. Barbuzza, P. Lotito, A. Clausse, Tomography Reconstruction by Entropy Maximization with
Smoothing Filtering, Inverse Problems in Science and Engineering, Taylor & Francis, vol. 18, no.
5, pp.711-722, 2010.
[12] Mehmet Aci, Mutlu Avci, K-Nearest Neighbor Reinforced Expectation Maximization Method,
Expert Systems with Applications, Elsevier, vol. 38, no. 10, pp.12585-12591, 2011.
[13] Shengyi Jiang, Guansong Pang, Meiling Wu, Limin Kuang, An Improved K-Nearest-Neighbor
Algorithm for Text Categorization, Expert Systems with Applications, Elsevier, vol. 39, no. 1,
pp.1503-1509, 2012.
[14] M. Govindarajan, R. M. Chandrasekaran, Evaluation of k-Nearest Neighbor Classifier Performance
for Direct Marketing, Expert Systems with Applications, Elsevier, vol. 37, no. 1, pp.253-258,
2010.
[15] Jing Peng, Chang-jie Tang, Dong-qing Yang, Jing Zhang, Jian-jun Hu, Similarity Computing
Model of High Dimension Data for Symptom Classification of Chinese Traditional Medicine,
Applied Soft Computing, Elsevier, vol. 9, no. 1, pp.209-218, 2009.
[16] S. Manocha, M. A. Girolami, An Empirical Analysis of the Probabilistic K-Nearest Neighbor
Classifier, Pattern Recognition Letters, Elsevier, vol. 28, no. 13, pp.1818-1824, 2007.
