International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 2, March – April 2013
ISSN 2278-6856
B.AzhaguSundari ^{1} , Dr.Antony Selvadoss Thanamani ^{2}
^{1} P.G. Department of Computer Applications, N.G.M. College, Pollachi, Coimbatore, Tamil Nadu, India.
^{2} Department of Computer Science, N.G.M. College, Pollachi, Coimbatore, Tamil Nadu, India.
Abstract: Attribute reduction is one of the key processes for knowledge acquisition. Some data sets are multidimensional and large in size. When such a data set is used for classification it may produce wrong results and may also occupy more resources, especially in terms of time. Many of the features present are redundant or inconsistent and affect classification. To improve the efficiency of classification, this redundancy and inconsistency must be eliminated. This paper presents a new method for feature subset selection based on fuzzy entropy measures for handling classification problems. The first step is to discretize numeric data to construct the membership function of each fuzzy set of a feature. Then the feature subset is selected based on the proposed fuzzy entropy measure focusing on boundary samples. The paper also gives experimental results to show the applicability of the proposed method. The performance of the system is evaluated in MATLAB on several benchmark data sets that reside in the UCI machine learning repository.
Keywords: Fuzzy Entropy, Data Mining, Attribute Reduction, Feature Selection
1. INTRODUCTION
Data collection and storage capabilities during the past decades have led to an information overload in all fields, especially in science. Researchers working in domains as diverse as engineering, medicine, astronomy, remote sensing, economics, and consumer transactions face larger and larger observations and simulations on a daily basis. Datasets much larger than those studied extensively in the past present new challenges in data analysis. Traditional statistical methods break down partly because of the increase in the number of observations, which in turn leads to an increase in dimension. The dimension is the number of variables that are measured on each observation. High-dimensional datasets present many mathematical challenges as well as some opportunities, and are bound to give rise to new theoretical developments. One of the problems with high-dimensional datasets is that, in many cases, not all the measured variables are important for understanding the underlying phenomena of interest. While certain computationally expensive novel methods can construct predictive models with high accuracy from high-dimensional data, it is still of interest in many applications to reduce the dimension of the original data
prior to any modelling of the data. Feature selection is a method that can reduce both the data and the computational complexity; it also makes the dataset more efficient to process and helps find useful feature subsets. Data mining is a technique that discovers reliable, intelligent information from raw data. The dredging of data is needed to extract knowledge from large data. The discovery of knowledge follows several steps, which include cleaning, integration, selection, transformation and mining of data, followed by pattern evaluation and knowledge presentation. Data mining tasks can be categorised as descriptive and predictive. Descriptive mining summarizes the general properties of data, whereas predictive mining predicts the needed information by inferring from the current data. Data mining covers different kinds of patterns, such as the discovery of class descriptions, associations, classification, clustering and prediction, but this paper mainly focuses on the applications of clustering and classification. Clustering and classification are generalized data mining techniques which group the data to obtain reasonable information.
A feature selection method selects a subset of meaningful or useful dimensions (specific to the application) from the original set of dimensions. Using feature subset selection techniques, redundant and irrelevant features can be omitted to reduce the amount of data processed at run time. Many feature subset selection techniques have been proposed. The goal of feature selection is to map a set of observations into a space that preserves the intrinsic structure as much as possible, so that each feature of the target space is a feature of the source space.
2. RELATED WORK
Rough set based reduction [1] was proposed by Wa'el M. Mahmud, Hamdy N. Agiza, and Elsayed Radwan. The main contribution of this paper was to create a new hybrid model, RSCPGA (Rough Set Classification Parallel Genetic Algorithm), which addresses the problem of identifying important features in building an Intrusion Detection System. Tests have been carried out using the KDD 99 dataset.
Rough set and neural network based reduction has been proposed by Thangavel, K., & Pethalakshmi, A. [2], which describes attribute reduction with the help of medical datasets.
Protocol based classification has been proposed by Kun-Ming Yu, Ming-Feng Wu, and Wai-Tak Wong [3], who describe protocol based classification using a genetic algorithm with logistic regression, implemented on the KDD 99 dataset.
Data analysis methodologies were described by Shaik Akbar, K. Nageswara Rao and J.A. Chandulal [4], who deal with eleven data computing techniques associated with IDS, divided into categories: some are based on computation (fuzzy logic and Bayesian networks), some on artificial intelligence (expert systems, agents and neural networks) and others on biological concepts (genetics and immune systems).
The discernibility matrix was described by Guangshun Yao et al. of Chuzhou University [5], who give a neat explanation of the discernibility matrix function and the reduction of features. Misuse and anomaly detection using SVM, Naïve Bayes and ANN approaches is discussed by T. Subbulakshmi et al. [6], who report the detection rates and false alarm rates. Multilayer perceptrons, Naïve Bayes classifiers and support vector machines with three kernel functions were used for detecting intruders, and the precision, recall and F-measure for all the techniques were calculated.
Rough set theory is a mathematical tool to deal with uncertainty and vagueness in decision systems and has been applied successfully in many fields. Y.Y. Yao and Y. Zhao [7] give a methodology to identify the reduction set of the set of all attributes of a decision system. The reduction set has been used as a preprocessing technique for classification of the decision system in order to bring out potential patterns, association rules or knowledge through data mining techniques. Jen-Da Shie and Shyi-Ming Chen [8] have described feature selection based on fuzzy entropy for handling classification problems. Their fuzzy entropy method is compared with the OFFSS method, the OFEI method, the FQI method and the MIFS method to obtain higher accuracy. Hamid Parvin et al. [14] describe fuzzy entropy and compare it with other feature selection methods.
Entropy
Entropy can be defined as a measure of the expected information content or uncertainty of a probability distribution. This concept has been defined in various ways and generalized in different applied fields, such as communication theory, mathematics, statistical thermodynamics, and economics. Shannon has contributed the broadest and the most fundamental definition of the entropy measure in information theory.
This paper proposes a fuzzy entropy measure which is an extension of Shannon’s definition.
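For a discrete probability distribution p_1, ..., p_n, Shannon's entropy is H = -Σ p_i log2 p_i, measured in bits. A minimal sketch in plain Python (the function name is ours, not from the paper):

```python
import math

def shannon_entropy(probs):
    """Shannon entropy H(p) = -sum(p_i * log2(p_i)), in bits.

    Zero-probability outcomes contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin carries exactly one bit of uncertainty.
print(shannon_entropy([0.5, 0.5]))  # 1.0
```

A certain outcome ([1.0, 0.0]) yields zero entropy, and a uniform distribution over n outcomes yields the maximum, log2 n.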
3. METHODOLOGY
This paper presents a fuzzy entropy based feature subset selection approach consisting of two phases. In the first phase the entire dataset is clustered; then, according to the number of clusters in the dataset, each feature is clustered alone with the same number of clusters and the proposed fuzzy entropy measures are calculated. In the second phase a feature subset is found that meets the boundary conditions required for a high degree of accuracy. The proposed method is examined on different datasets.
Step 1: Use K-means clustering to generate k clusters based on the values of a feature, where k ≥ 2.
Step 2: Calculate a new cluster centre m_i for each cluster until the clusters no longer change.
Step 3: Construct the membership functions of the fuzzy sets based on the k cluster centres. The membership function µ_{v_i} of the fuzzy set v_i based on the i-th cluster centre m_i is shown in Fig. 1.
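Steps 1-3 can be sketched in plain Python. The triangular shape of the membership functions and all function names here are illustrative assumptions; the paper fixes only that each fuzzy set is centred on a cluster centre:

```python
import random

def kmeans_1d(values, k, iters=100):
    """Cluster scalar feature values into k clusters; return sorted centres."""
    centres = random.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each value to its nearest centre
            clusters[min(range(k), key=lambda i: abs(v - centres[i]))].append(v)
        new = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
        if new == centres:  # Step 2: stop once the centres no longer change
            break
        centres = new
    return sorted(centres)

def triangular_membership(v, centres, i):
    """Membership grade of value v in the fuzzy set centred at centres[i].

    The grade is 1 at the centre and falls linearly to 0 at the
    neighbouring centres (assumed triangular shape, open shoulders
    at both ends of the feature's range)."""
    m = centres[i]
    if v == m:
        return 1.0
    if v < m:
        if i == 0:
            return 1.0  # open shoulder on the left
        left = centres[i - 1]
        return max(0.0, (v - left) / (m - left))
    if i == len(centres) - 1:
        return 1.0      # open shoulder on the right
    right = centres[i + 1]
    return max(0.0, (right - v) / (right - m))
```

For two well-separated value groups, `kmeans_1d([...], 2)` recovers one centre per group, and each value's membership grades in the two fuzzy sets sum to at most 1 between the centres.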
Fig. 1. A numeric feature f with fuzzy sets v_{i1}, v_{i2}, ..., v_{ik}
Step 4: Calculate the fuzzy entropy FE of a fuzzy set A, defined as
FE(A) = -Σ_{c∈C} D_c(A) log2 D_c(A)
where C is the set of classes and the class degree D_c(A) is the share of the membership grades of A contributed by the samples of class c. The summation SFE(f) of the fuzzy entropies of the samples in a feature f is then
SFE(f) = Σ_{i=1}^{k} (S_{v_i}/S) FE(v_i)
where S denotes the total membership grade over all fuzzy sets of f.
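The fuzzy entropy of a single fuzzy set and the membership-weighted summation SFE(f) over a feature's fuzzy sets can be sketched in plain Python. The class-degree formulation follows the fuzzy entropy of Shie and Chen [8], which this paper extends; all function names are illustrative:

```python
import math
from collections import defaultdict

def fuzzy_entropy(grades, labels):
    """Fuzzy entropy FE of one fuzzy set.

    grades: membership grade of each sample in the fuzzy set.
    labels: class label of each sample.
    The class degree D_c is the share of the total membership
    grade contributed by class c; FE = -sum(D_c * log2(D_c)).
    """
    total = sum(grades)
    if total == 0:
        return 0.0
    per_class = defaultdict(float)
    for g, y in zip(grades, labels):
        per_class[y] += g
    return -sum((s / total) * math.log2(s / total)
                for s in per_class.values() if s > 0)

def sfe(grade_matrix, labels):
    """SFE(f): fuzzy entropies of a feature's k fuzzy sets, each
    weighted by that set's share of the total membership grade.

    grade_matrix[i][j] = grade of sample j in fuzzy set v_i.
    """
    set_sums = [sum(row) for row in grade_matrix]
    s = sum(set_sums)
    return sum((set_sums[i] / s) * fuzzy_entropy(row, labels)
               for i, row in enumerate(grade_matrix))
```

A fuzzy set whose grades all come from one class has zero entropy; one whose grades are split evenly between two classes has entropy 1 bit, so lower SFE indicates a feature whose fuzzy sets separate the classes more cleanly.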
Step 5: Find the feature f that minimizes the function SFE(f) and remove it from the feature set F. Here S_{v_i} denotes the summation of the membership grades of the samples in the fuzzy set v_i, n denotes the number of samples and k denotes the number of fuzzy sets of the feature f. Features with minimum entropy are more important than the other features in the feature subset selection operation. The selected subset is updated as
FS = fs + {f}
where F is the set of features of the dataset, f is the selected feature with minimum fuzzy entropy, fs is the current selected subset of features and FS is the new selected subset after adding the feature f.
Step 7: The fuzzy entropy measure CFE(f_1, f_2) of a feature subset focusing on boundary samples is defined as follows:
CFE(f_1, f_2) = Σ_w (S_w/S_FS) FE(w) + (S_1B/S_FS) FE(v_1) + (S_2B/S_FS) FE(v_2)

where the summation runs over the combined fuzzy sets w of the feature subset (f_1, f_2); S_1B and S_2B denote the summations of the membership grades of the boundary samples of the features f_1 and f_2 respectively, S_FS denotes the summation of the membership grade values of the feature subset (f_1, f_2), S_w denotes the summation of the membership grade values of the feature subset (f_1, f_2) of the samples belonging to a combined fuzzy set w, FE(v_1) denotes the fuzzy entropy of a combined fuzzy set v_1 of the feature f_1 and FE(v_2) denotes the fuzzy entropy of the fuzzy set v_2 of the feature f_2. Find the feature that minimizes the function CFE and add it to the selected feature subset.
Step 8: Convert the selected feature file into the .arff file format for calculating accuracy using the WEKA tool.
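Step 8's conversion can be done in a few lines of plain Python. The layout below (@RELATION, @ATTRIBUTE declarations, then @DATA rows) is the standard WEKA ARFF format; the relation name, attribute names and sample rows are placeholders:

```python
def write_arff(path, relation, feature_names, classes, rows):
    """Write selected features to a WEKA .arff file.

    rows: list of (feature_values, class_label) pairs.
    Numeric features are assumed; the class attribute is nominal.
    """
    with open(path, "w") as f:
        f.write(f"@RELATION {relation}\n\n")
        for name in feature_names:
            f.write(f"@ATTRIBUTE {name} NUMERIC\n")
        f.write(f"@ATTRIBUTE class {{{','.join(classes)}}}\n\n")
        f.write("@DATA\n")
        for values, label in rows:
            f.write(",".join(str(v) for v in values) + f",{label}\n")

# e.g. a subset of two selected Iris features (placeholder values):
write_arff("iris_subset.arff", "iris_subset",
           ["petal_length", "petal_width"],
           ["setosa", "versicolor", "virginica"],
           [([1.4, 0.2], "setosa"), ([4.7, 1.4], "versicolor")])
```

The resulting file can be opened directly in the WEKA Explorer to measure classification accuracy on the reduced feature set.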
Step 6: The Combined Fuzzy Membership Grade matrix CFMG(f_1, f_2, T_r) constructs the extension matrix of the membership grades of the values of a feature subset {f_1, f_2}:

CFMG(f_1, f_2, T_r) =
[ µ_11(v_1) ∧ µ_21(v_2)   ...   µ_11(v_1) ∧ µ_2j(v_2) ]
[          ...            ...             ...          ]
[ µ_1i(v_1) ∧ µ_21(v_2)   ...   µ_1i(v_1) ∧ µ_2j(v_2) ]

where ∧ denotes the minimum operator, v_1 and v_2 are the values of a sample on the features f_1 and f_2, and µ_1p and µ_2q are the membership functions of the p-th and q-th fuzzy sets of f_1 and f_2. T_r denotes a user-given threshold in [0, 1] used to create this matrix: i is the number of fuzzy sets defined on the feature f_1 whose maximum class degree is smaller than the given threshold value, and j is the number of fuzzy sets defined on the feature f_2 whose maximum class degree is smaller than the given threshold value.

Pseudo code of fuzzy entropy:

Fuzzy entropy(Dataset D, Threshold(T_c, T_r))
{
  do
  {
    Select feature f;
    K = 2;
    While true
    {
      Using the K-Means algorithm find K cluster centres in feature f;
      Find membership functions using the clusters' K centres;
      Calculate fuzzy entropy of feature f;
      If (decreasing rate of fuzzy entropy > T_c)
        K = K + 1;
      else
      {
        K = K - 1;
        break;
      }
    }
    Create extension matrix for each feature f;
    Calculate fuzzy entropy of each feature;
    While true
    {
      Select the feature f with minimum fuzzy entropy value;
      Add f into the previously selected subset and update the combined extension matrix;
      Calculate fuzzy entropy of the new selected subset according to T_r;
      If (new fuzzy entropy value > previous fuzzy entropy value) or (fuzzy entropy = zero) or (there is no additional feature for selection)
        break;
    }
  } While features remain in dataset D
}
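The selection loop of the pseudocode, which repeatedly takes the remaining feature whose entropy score is lowest and stops once the score no longer improves, can be sketched as follows. Here `entropy_of` stands in for the SFE/CFE measures (any callable scoring a candidate subset), and the toy `score` table is invented purely for illustration:

```python
def select_features(features, entropy_of):
    """Greedy forward selection driven by a fuzzy entropy score.

    features: candidate feature names.
    entropy_of: callable mapping a feature subset to its entropy;
                lower is better (a stand-in for SFE/CFE).
    """
    selected, remaining = [], list(features)
    best = float("inf")
    while remaining:
        # pick the remaining feature whose addition gives minimum entropy
        f = min(remaining, key=lambda c: entropy_of(selected + [c]))
        new = entropy_of(selected + [f])
        # stop when entropy rises, as in the pseudocode's break condition
        if new > best:
            break
        selected.append(f)
        remaining.remove(f)
        best = new
        if best == 0:  # perfect separation: nothing left to gain
            break
    return selected

# toy score table: pretend the 'petal' feature carries the information
score = {("petal",): 0.4, ("petal", "sepal"): 0.4, ("sepal",): 1.5}
print(select_features(["sepal", "petal", "id"],
                      lambda s: score.get(tuple(sorted(s)), 1.0)))
# → ['petal', 'sepal']
```

The loop adds 'petal' first (lowest entropy), keeps 'sepal' because it does not worsen the score, and stops before 'id' because adding it would raise the entropy.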
4. RESULTS
The proposed entropy method is implemented in MATLAB. The experimental data sets belong to the UCI machine learning repository. The Iris data set, the Breast cancer data set and the Cleve data set are used in this experiment. First, the proposed method is applied to select feature subsets of each of these three data sets. The accuracy rates are shown in Table 1.
Table 1: A comparison of the accuracy rates of different methods
S. No | Data Set        | FQI Method | MIFS Method | Proposed Entropy Method
1     | Iris Data Set   |            | 94.67%      | 94.69%
2     | Breast Data Set |            | 97.05%      | 97.09%
3     | Cleve Data Set  |            | 84.47%      | 84.52%
Table 1 shows the average classification accuracy rates of the different feature selection methods. The proposed feature subset selection method is compared with the FQI (Frequency Quality Index) method and the MIFS (Mutual Information based Feature Selector) method, where the Iris data set, the Breast cancer data set and the Cleve data set are used in the experiments.
Figure 3: Comparison between the FQI, MIFS and proposed entropy methods
5. CONCLUSION
This paper is concerned with fuzzy sets and decision trees. Feature selection based on fuzzy set theory and information theory is presented. The paper proposes a fuzzy method in which numeric attributes can be represented by fuzzy numbers, interval values as well as crisp values, in which nominal attributes are represented by crisp nominal values, and in which each class has a confidence factor. An example is used to demonstrate the validity of the method. First, fuzzy set theory is applied to transform real-world data into fuzzy linguistic forms. Secondly, information theory is used to select the feature subset. Through the integration of fuzzy set theory and information theory, classification tasks originally thought too difficult or complex become possible.
REFERENCES
[1] Wa'el M. Mahmud, Hamdy N. Agiza, and Elsayed Radwan (October 2009), Intrusion Detection Using Rough Sets based Parallel Genetic Algorithm Hybrid Model, Proceedings of the World Congress on Engineering and Computer Science 2009, Vol. II, WCECS 2009, San Francisco, USA.
[2] Thangavel, K., & Pethalakshmi, A. (2009), Dimensionality reduction based on rough set theory: A review, Applied Soft Computing 9, 1-12. doi: 10.1016/j.asoc.2008.05.006.
[3] Kun-Ming Yu, Ming-Feng Wu, and Wai-Tak Wong (April 2008), Protocol-Based Classification for Intrusion Detection, Applied Computer & Applied Computational Science (ACACOS '08), Hangzhou, China.
[4] Shaik Akbar, K. Nageswara Rao, J.A. Chandulal (August 2010), Intrusion Detection System Methodologies Based on Data Analysis, International Journal of Computer Applications (0975-8887), Volume 5, No. 2.
[5] Guangshun Yao, Chuanjian Yang, Lisheng Ma, Qian Ren, Chuzhou University, China (June 2011), A New Algorithm of Modifying Hu's Discernibility Matrix and its Attribute Reduction, International Journal of Advancements in Computing Technology, Volume 3, Number 5.
[6] T. Subbulakshmi, A. Ramamoorthi, and S. Mercy Shalinie (August 2009), Ensemble design for intrusion detection systems, International Journal of Computer Science & Information Technology (IJCSIT), Vol 1, No 1.
[7] Y.Y. Yao and Y. Zhao (2009), Discernibility matrix simplification for constructing attribute reducts, Information Sciences, Vol. 179, No. 5, 867-882.
[8] Jen-Da Shie, Shyi-Ming Chen (2008), Feature subset selection based on fuzzy entropy measures for handling classification problems, Applied Intelligence 28: 69-82. doi: 10.1007/s10489-007-0042-6.
[9] Kosko, B. (1986), Fuzzy entropy and conditioning, Information Sciences 40(2): 165-174.
[10] Lee, H.M., Chen, C.M., Chen, J.M., Jou, Y.L. (2001), An efficient fuzzy classifier with feature selection based on fuzzy entropy, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 31(3): 426-432.
[11] De Luca, A., Termini, S. (1972), A definition of a non-probabilistic entropy in the setting of fuzzy set theory, Information and Control 20(4): 301-312.
[12] Shannon, C.E. (1948), A mathematical theory of communication, Bell System Technical Journal 27(3): 379-423.
[13] Hahn-Ming Lee, Chih-Ming Chen, Jyh-Ming Chen, and Yu-Lu Jou, An Efficient Fuzzy Classifier with Feature Selection Based on Fuzzy Entropy.
[14] Hamid Parvin, Behrouz Minaei-Bidgoli, Hossein Ghaffarian (2011), An Innovative Feature Selection Using Fuzzy Entropy, Advances in Neural Networks - ISNN 2011.
[15] B. Azhagusundari, Antony Selvadoss Thanamani (January 2013), Feature Selection based on Information Gain, International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume 2, Issue 2.
AUTHOR
B.AzhaguSundari received her B.Sc
Mathematics and Master of Computer Applications from NGM College, Pollachi, Coimbatore, India. She completed her Master of Philosophy at Bharathidasan University, Trichy. Presently she is working as an Assistant Professor in the P.G. Department of Computer Applications at NGM College (Autonomous), Pollachi. Her areas of interest include data mining. She is now pursuing her Ph.D. in Computer Science at Mother Teresa University, Kodaikanal.
Dr. Antony Selvadoss Thanamani is
presently working as Professor and Head, Department of Computer Science, NGM College, Coimbatore, India (affiliated to Bharathiar University, Coimbatore). He has published more than 100 papers in international and national journals and conferences. He has authored many books on recent trends in information technology. His areas of interest include e-learning, knowledge management, data mining, networking, and parallel and distributed computing. He has 24 years of teaching and research experience to his credit. He is a senior member of the International Association of Computer Science and Information Technology, Singapore, and an active member of the Computer Science Society of India and the Computer Science Teachers Association, New York.