Anda di halaman 1dari 5

International Journal of EmergingTrends & Technology in Computer Science(IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com


Volume 2, Issue 2, March April 2013 ISSN 2278-6856


Volume 2, Issue 2 March April 2013 Page 30


Abstract: The attribute reduction is one of the key processes
for knowledge acquisition. Some data set is multidimensional
and larger in size. When this data set is used for classification
it may produce wrong results and it may also occupy more
resources especially in terms of time. Most of the features
present are redundant, inconsistent and affects the
classification. To improve the efficiency of classification
these redundant and inconsistent nature must be eliminated.
This paper presents a new method for dealing with feature
subset selection based on fuzzy entropy measures for
handling classification problems. The first step is to discretize
numeric data to construct the membership function of each
fuzzy set of a feature. Then, select the feature subset based on
the proposed fuzzy entropy measure focusing on boundary
samples. The Paper also gives an experimental result to show
the applicability of the proposed method. The performance of
the system is evaluated in MATLAB on several benchmark
data sets with resides in the UCI machine learning repository.
Keywords: Fuzzy Entropy, Data Mining ,Attribute
Reduction, Feature selection

1.INTRODUCTION
Data collection and storage capabilities during
the past decades have led to an information overload in
all the fields especially in science. Researchers working in
domains as diverse as engineering, medical, astronomy,
remote sensing, economics, and consumer transactions
face larger and larger observations and simulations on a
daily basis. Such datasets that have been studied
extensively in the past present new challenges in data
analysis. Traditional statistical methods break down
partly because of the increase in the number of
observations which in turn lead to change in dimensions.
The dimension is the number of variables that
are measured on each observation. High-dimensional
datasets present many mathematical challenges as well as
some opportunities and are bound to give rise to new
theoretical developments. One of the problems with high-
dimensional datasets is that, in many cases, not all the
measured variables are important for understanding the
underlying phenomena of interest. While certain
computationally expensive novel methods can construct
predictive models with high accuracy from high-
dimensional data. It is still of interest in many
applications to reduce the dimension of the original data
prior to any modelling of the data. Feature selection is the
method that can reduce both the data and the
computational complexity. Dataset can also get more
efficient and can be useful to find out feature subsets.
Data mining is a technique that discovers the
reliable, intelligent information from raw data. The
dredging of data is needed to extract the knowledge from
large data. The discovery of knowledge follows several
steps, which includes, cleaning, integration, selection,
transformation, mining of data and further pattern
evolution and knowledge presentation.
Data mining tasks can be categorised as
descriptive and predictive tasks. The descriptive mining
summarizes the general properties of data, whereas the
predictive mining predicts the needed information by
inferring on the current data. Data mining tasks consists
of different kinds of patterns like discovery of class
descriptions, associations, classification, clustering,
prediction etc., but this paper mainly focuses on the
applications of clustering and classification. Clustering
and classification are generalized data mining techniques,
which twins the data to obtain the reasonable
information.

A feature selection method selects a subset of
meaningful or useful dimensions (specific for the
application) from the original set of dimensions. Using
feature subset selection techniques, redundant and
irrelevant features can be omitted to reduce the amount of
data during run time. There are many proposed feature
subset selection techniques. The goal of Feature selection
is to map a set of observations into a space that preserves
the intrinsic structure as much as possible so that each
target feature space is one feature of source feature space.

2. Related Work

Rough set based reduction [1] was proposed by
Wa'el M. Mahmud, Hamdy N.Agiza, and Elsayed
Radwan. The main contribution of this paper was to
create a new hybrid model RSC-PGA (Rough Set
Classification Parallel Genetic Algorithm) which
addresses the problem of identifying important features in
building an Intrusion Detection System. Tests has been
carried out using KDD-99 dataset.
FEATURE SELECTION BASED ON FUZZY
ENTROPY
B.AzhaguSundari
1
, Dr.Antony Selvadoss Thanamani
2


1
P.G Department of Computer Applications, N.G.M College,Pollachi
Coimbatore, TamilNadu, India.

2
Department of Computer Science, N.G.M College,Pollachi.
Coimbatore, TamilNadu, India.


International Journal of EmergingTrends & Technology in Computer Science(IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 2, March April 2013 ISSN 2278-6856


Volume 2, Issue 2 March April 2013 Page 31

Rough set and neural network based reduction has
been proposed by Thangavel .K, & Pethalakshmi .A [2],
Which describes the reduction attribute with the help of
medical datasets.
Protocol based classifications has been proposed by,
Kun-Ming Yu, Ming-Feng Wu, and Wai-Tak Wong [3],
which describes the protocol based classification by using
genetic algorithm with logistic Regression and
implemented by KDD 99 dataset.
Data Analysis methodologies were described
by,Shaik Akbar, Dr.K.Nageswara Rao Dr.J.A.Chandulal
[4],deals with eleven data computing techniques
associated with IDS are divided groups into categories.
Those methods are based on computation (Fuzzy logic
and Bayesian networks), some are Artificial Intelligence
(Expert Systems, agents and neural networks)and other
are biological concepts (Genetics and Immune systems).
Discernibility matrix was described by Chuzhou [5],
gives a neat explanation about the discernibility matrix
function and reduction of features. Misuse and Anomaly
detection using SVM, NBayes, ANN approaches
discussed by T.Subbulakshmi[6], notifies the detection
rate and false alarm rates. Multilayer Perceptrons, Nave
Bayes classifiers and Support vector machines with three
kernel functions were used for detecting intruders. The
Precision, Recall and F- Measure for all the techniques
were calculated.
A rough set theory is a new mathematical tool to
deal with uncertainty and vagueness of decision system
and it has been applied successfully in all the fields by
Y.Y.Yao and Y. Zhao [7], the authors give a
methodology to identify the reduction set of the set of all
attributes of the decision system. The reduction set has
been used as pre-processing technique for classification of
the decision system in order to bring out the potential
patterns or association rules or knowledge through data
mining techniques.
Jen-Da Shie Shyi-Ming Chen et al., has described
about the feature selection based fuzzy entropy for
handling classification problems. The fuzzy entropy
method is compared with OFFSS method ,OFEI method ,
the FQI method and the MIFS method to get more
accuracy. Hamid Parvin et al., describes about the fuzzy
entropy and compares it with the other feature selection
methods.

Entropy
Entropy can be defined as a measure of the
expected information content or uncertainty of a
probability distribution. This concept has been defined in
various ways and generalized in different applied fields,
such as communication theory, mathematics, statistical
thermodynamics, and economics. Shannon has
contributed the broadest and the most fundamental
definition of the entropy measure in information theory.
This paper proposes a fuzzy entropy measure which is an
extension of Shannons definition.

3. METHODOLOGY

This paper presents a fuzzy entropy based feature
subset selection approach. The feature selection approach
consists of two phases. In the first phase the entire dataset
is classified and according to the number of clusters in the
dataset then each feature is classified alone with the same
cluster number and proposed entropy fuzzy measures are
calculated. In the second phase to find a feature subset
that meets the boundaries to get a high accuracy degree.
The proposed method is examined on different datasets.

Step 1: Use K-means cluster to generate k cluster based
on the values of a feature, where k2.
Step 2: Calculate a new cluster center m
i
for each cluster
until each cluster is not changed.
Step 3: Construct the membership functions of the fuzzy
sets based on the k cluster centers.


Construct a membership function
vi
of the fuzzy set v
i

based on the i
th
cluster center m
i
is shown in Fig. 1.


Fig. 1. A numeric feature f, with fuzzy set v
i1
,v
i2
v
ik

Step4: Calculate the fuzzy entropy FE of a fuzzy set A is
defined as


Summation of fuzzy entropy of the samples in feature f

International Journal of EmergingTrends & Technology in Computer Science(IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 2, March April 2013 ISSN 2278-6856


Volume 2, Issue 2 March April 2013 Page 32


Find the minimizes the function SFE(f) and remove it
from the F feature.

Step 5: A Fuzzy Membership Grade Matrix (FMG) is
defined which consists of K fuzzy membership grades
(one fuzzy membership grade for each cluster) of feature f
for each one of N samples in dataset.


Where denotes the membership grade of the
value of the feature f of the sample belonging to
the fuzzy set v
1
, n denotes number of samples, k denotes
the number of fuzzy sets of the feature f .

Features with minimum entropy are more important than
the other features in feature subset selection operation.


FS=fs+{f}

where F is the set of features of dataset , f is the selected
feature with minimum fuzzy entropy, fs is current
selected subset of features and FS is new selected subset
after adding feature f.

Step 6 : The Combined Fuzzy Membership Grade matrix
CFMG(f1,f2,Tr) for constructing the extension matrix of
the membership grades of the values of a feature subset
{f1,f2}

CFH0(
1
,
2
,I
r
)
=_
p
:11
(r
11
) p
:21
(r
12
) p
:11
(r
n1
) p
:2]
(r
12
)

p
:11
(r
n1
) p
:21
(r
n2
) p
:11
(r
n1
) p
:2]
(r
n2
)

p
:1i
(r
11
) p
:21
(r
12
)

p
:1i
(r
n1
) p
:21
(r
n2
)

p
:1i
(r
n1
) p
:2]
(r
12
)

p
:1i
(r
n1
) p
:2]
(r
n2
)
_

T
r
denotes [0,1] , i denotes the number of fuzzy
of the feature f
1
, j denotes the number of fuzzy of the
feature f
2
, denotes the membership grade of
the value of the feature f
1
of the sample r
1
belonging
to a fuzzy set v
11 ,
denotes the minimum operator.

A threshold parameter given by the user is used
to create this matrix. In this matrix i is the number of
fuzzy sets defined in the feature f maximum class degree
is smaller than the given threshold value. j is the number
of fuzzy sets defined in the feature f whose maximum
class degree is smaller than the given threshold value.

step 7: The fuzzy entropy measure CFE(f
1
,f
2
)of a feature
subset focusing on boundary samples is dened as
follows:


CFE(
1
,
2
)
=

S
1B
S
1
X
S
w
S
FS
wI
FS
FE(w) +
S
:1
S
1
:1I
1uB
FE(:
1
)
i
S
1B
S
1
<
S
2B
S
2
S
2B
S
2
X
S
w
S
FS
wI
FS
FE(w) +
S
:2
S
2
:2I
1uB
FE(:
2
)
i 0tcrwisc

S
1B
denotes the summation of the member grade
of the value of the feature f
1
,S
FS
denotes the summation
of the membership grade values of the feature
subset(f
1
,f
2
), S
w
denotes the summation of the
membership grade values of the feature subset (f
1
,f
2
) of
the samples belongs to a combines fuzzy set w, FE(v
1
)
denotes the fuzzy entropy of a combined fuzzy set v
1
of
the feature f
1
and f
2
denotes the fuzzy entropy of the fuzzy
set v
2
of the feature f
2
.Find the minimizes the function
CFE and add it the selected feature subset.
Step 8: Convert the selected feature file into .arff file
format for calculating accuracy by using WEKA tool.
Pseudo code of fuzzy Entropy
Fuzzy entropy(Dataset,Threshold(T
c
,T
r
))
{
do
{
Select feature f;
K=2;
While true
{
Using K-Means algorithm fink K cluster
centres in feature f;
Find membership functions using clusters
K Centres;
Calculate fuzzy entropy of feature f;
If (decreasing rate of fuzzy entropy>T
c
)
K=K+1;
else
K=K-1;
break;
}
}
Create extension matrix for each feature f;
Calculate fuzzy entropy of each feature;
While true
{
Select feature subset f with minimum fuzzy
entropy value;
Add f into previous selected subset and update
combined extension matrix;
International Journal of EmergingTrends & Technology in Computer Science(IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 2, March April 2013 ISSN 2278-6856


Volume 2, Issue 2 March April 2013 Page 33

Calculate fuzzy entropy of new selected subset
according to T
r
;
If (new fuzzy entropy value >previous
fuzzy entropy value) or (fuzzy entropy
=zero) or (there is no additional feature
for selection)
break;
} While exist feature in dataset D

}
4. RESULT
The proposed entropy method is to be
implemented in MATLAB. The experimental data sets
are belongs to UCI machine learning repository. The Iris
data set, the Breast cancer data set, the Cleve data set is
used in this experiment. First, apply the proposed method
to select feature subsets of these three data sets (i.e., the
Iris data set, the Breast cancer data set, and the Cleve
dataset), respectively. The accuracy rate is shown in the
table.
Table 1: A comparison of the accuracy rates of different
methods

S. no Various Data
Sets
FQI
Method
MIFS
Method
Proposed
entropy
Method
1 Iris Data Set 94.67% 94.67% 94.69%

2 Breast Data Set 97.05% 96.05% 97.09%

3 Cleve Data Set 84.47% 83.69% 84.52%


From the table 1 the average classification
accuracy rates of different feature selection methods with
respect to different clustering approaches are shown in
the Table 1.
The proposed feature subset selection method
with the methods used to compare with the proposed
method in the experiments, i.e.,the FQI(Frequency
Quality Index) method ,the MIFS(Mutual information
based Feature Selector) method, where the Iris dataset,
the Breast cancer data set and the Cleve data set are used
in our experiments.

Figure 3: comparison between FQI, MFIS and proposed
entropy method
5. CONCLUSION
The paper is concerned with fuzzy sets and decision tree.
In this paper, feature selection based on fuzzy set theory
and information theory is presented. It proposes a fuzzy
method of which numeric attributes can be represented
by fuzzy number, interval value as well as crisp value, of
which nominal attributes are represented by crisp
nominal value, and of which class has confidence factor.
An example is used to prove the validity. First, the fuzzy
set theory is applied to transform real-world data into
fuzzy linguistic forms. Secondly, the information theory
is used to select the sub set of the feature selection.
Through the integration of both fuzzy set theory and
information theory, it can make classification tasks
originally thought too difficult or complex to become
possible.

REFERENCES
[1] Wa'el M. Mahmud, Hamdy N.Agiza, and Elsayed
Radwan (October 2009) ,Intrusion Detection Using
Rough Sets based Parallel Genetic Algorithm Hybrid
Model, Proceedings of the World Congress on
Engineering and Computer Science 2009 Vol II
WCECS 2009, San Francisco, USA.
[2] Thangavel, K., & Pethalakshmi, A. Elseviewer
(2009)., Dimensionality reduction based on rough
set theory 9, 1-12. doi: 10.1016/j.asoc.2008.05.006.
[3] Kun-Ming Yu, Ming-Feng Wu,and Wai-Tak Wong
(April,2008), Protocol-Based Classification for
Intrusion Detection, APPLIED COMPUTER &
APPLIED COMPUTATIONAL SCIENCE
(ACACOS '08), Hangzhou, China.
[4] Shaik Akbar, Dr.K.Nageswara Rao
,Dr.J.A.Chandulal (August 2010),Intrusion Detection
System Methodologies Based on Data Analysis,
International Journal of Computer Applications
(0975 8887) Volume 5 No.2.
[5] Chuzhou University,China, Guangshun Yao,
Chuanjian Yang,1Lisheng Ma, Qian Ren (June 2011)
An New Algorithm of Modifying Hus Discernibility
Matrix and its Attribute Reduction, International
Journal of Advancements in Computing Technology
Volume 3, Number 5.
[6] T. Subbulakshmi , A. Ramamoorthi, and Dr. S.
Mercy Shalinie(August 2009), Ensemble design for
intrusion detection systems, International Journal of
Computer science & Information Technology
(IJCSIT), Vol 1, No 1.
[7] Y.Y.Yao and Y. Zhao (2009), Discernibility matrix
simplication for constructing attribute reducts,
Information Sciences, Vol. 179, No. 5, 867-882.
[8] Jen-Da Shie Shyi-Ming Chen (Feb 2007),Feature
subset selection based on fuzzy entropy measures for
handling classication problems, Appl Intell
(2008) 28: 6982,DOI 10.1007/s10489-007-0042-6.
[9] Kosko B (1986) Fuzzy entropy and conditioning. Inf
Scie 40(2):165174
International Journal of EmergingTrends & Technology in Computer Science(IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 2, March April 2013 ISSN 2278-6856


Volume 2, Issue 2 March April 2013 Page 34

[10] Lee HM, Chen CM, Chen JM, Jou YL (2001) An
efficient fuzzy classifier with feature selection based
on fuzzy entropy. IEEE Trans Syst Man Cybern Part
B Cybern 31(3):426432
[11] Luca AD, Termini S (1972) A definition of a non-
probabilistic entropy in the setting of fuzzy set
theory. Inf Control 20(4):301 312
[12] Shannon CE (1948) A mathematical theory of
communication.Bell Syst Techn J 27(3):379423
[13] Hahn-Ming Lee, Member, IEEE, Chih-Ming Chen
Jyh-Ming Chen, and Yu-Lu Jou , An Efficient Fuzzy
Classifier with Feature Selection Based on Fuzzy
Entropy .
[14] Hamid Parvin, Behrouz Minaei Bidgoli,hossein
ghaffarin , An Innovative Feature Selection Using
Fuzzy Entropy, Advances in Neural Networks
ISNN 2011
[15] B.Azhagusundari,Dr.AntonySelvadossThanamani,
Feature Seelction based on Information Gain,
International Journal of Innovative Technology and
Exploring Engineering (IJITEE) ISSN: 2278-3075,
Volume-2, Issue-2, January 2013.

AUTHOR
B.AzhaguSundari received her B.Sc
Mathematics and Master of Computer
Applications from NGM College, Pollachi,
Coimbatore, India. She completed her
Master of Philosophy in Bharathidasan
University, trichy. Presently she is working as an
Assistant Professor in the P.G Department of Computer
Applications in NGM College (Autonomous), Pollachi.
Her area of interest includes data Mining. Now she is
pursuing her Ph.D Computer Science in Mother Teresa
University, Kodaikannal

Dr. Antony Selvadoss Thanamani is
presently working as Professor and Head,
Dept of Computer Science, NGM College,
Coimbatore, India (affiliated to Bharathiar
University, Coimbatore). He has published
more than 100 papers in international/ national journals
and conferences. He has authored many books on recent
trends in Information Technology. His areas of interest
include E-Learning, Knowledge Management, Data
Mining, Networking, Parallel and Distributed
Computing. He has to his credit 24 years of teaching and
research experience. He is a senior member of
International Association of Computer Science and
Information Technology, Singapore and Active member
of Computer Science Society of India, Computer Science
Teachers Association, New York

Anda mungkin juga menyukai