International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 3, March 2013)
II. METHODOLOGY

Steps of the System:
1. Selecting a dataset as an input to the algorithm for processing.
2. Selecting the classifiers.
3. Calculating the entropy, information gain and gain ratio of the attributes.
4. Processing the given input dataset according to the defined algorithm of C4.5 data mining.
5. Processing the given input dataset according to the defined algorithm of improved C4.5 data mining.
6. The data to be inputted to the tree generation mechanism is given by the C4.5 and improved C4.5 processors. The tree generator generates the tree for the C4.5 and improved C4.5 decision tree algorithms.

III. DATA MINING AND KNOWLEDGE DISCOVERY

A. Attribute Selection Measure:
The attribute selection measure provides a ranking for each attribute describing the given training tuples. The attribute having the best score for the measure is chosen as the splitting attribute for the given tuples. If the splitting attribute is continuous-valued or if we are restricted to binary trees then, respectively, either a split point or a splitting subset must also be determined as part of the splitting criterion. The tree node created for partition D is labeled with the splitting criterion, branches are grown for each outcome of the criterion, and the tuples are partitioned accordingly. The two most popular attribute selection measures are information gain and gain ratio [12].

Let S be a set consisting of data samples. Suppose the class label attribute has m distinct values defining m distinct classes Ci (for i = 1, ..., m). Let Si be the number of samples of S in class Ci. The expected information needed to classify a given sample is given by the equation

I(S1, S2, · · · , Sm) = − ∑i=1..m pi log2(pi)

where pi is the probability that an arbitrary sample belongs to class Ci, estimated by Si/|S|. Note that a log function to base 2 is used since the information is encoded in bits.

B. Classifiers:
In order to mine the data, the well-known data mining tool WEKA was used [17]. Since the data has numeric attributes with only the classification as nominal, it falls into the category of labeled data sets. Therefore it is necessary to perform supervised data mining on the target data set. This narrowed the choice down to only a few classifiers: those that can handle numeric data as well as give a classification (amongst a predefined set of classifications). Hence selecting C4.5 decision tree learning became obvious. Attribute evaluation was also performed in order to find the gain ratio and ranking of each attribute in the decision tree learning. In case data mining could not produce any suitable result for some data set, finding the correlation coefficient was resorted to, in order to investigate whether a relation between attributes exists.

C. Entropy:
Entropy is the minimum number of bits of information needed to encode the classification of an arbitrary member of S. Let attribute A have v distinct values a1, · · · , av. Attribute A can be used to partition S into v subsets S1, S2, · · · , Sv, where Sj contains those samples in S that have value aj of A. If A were selected as the test attribute, these subsets would correspond to the branches grown from the node containing the set S. Let Sij be the number of samples of class Ci in subset Sj. The entropy, or expected information based on the partitioning into subsets by A, is given by the equation

E(A) = ∑j=1..v ((S1j + S2j + · · · + Smj) / |S|) × I(S1j, · · · , Smj)

The first term acts as the weight of the jth subset and is the number of samples in the subset divided by the total number of samples in S. The smaller the entropy value, the greater the purity of the subset partitions, where

I(S1j, · · · , Smj) = − ∑i=1..m pij log2(pij)

and pij is the probability that a sample in Sj belongs to class Ci.

D. Information Gain:
Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute. More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is given by the equation

Gain(A) = I(S1, S2, · · · , Sm) − E(A)

In other words, Gain(A) is the expected reduction in entropy caused by knowing the value of attribute A. The algorithm computes the information gain of each attribute; the attribute with the highest information gain is chosen as the test attribute for the given set.

E. Gain ratio:
The gain ratio [12] differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning.
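The three measures above can be sketched in Python (an illustrative toy implementation on made-up data, not part of this paper or of WEKA; SplitInfo(A) is taken as the entropy of the attribute's own value distribution, following Quinlan's definition):

```python
import math
from collections import Counter

def entropy(labels):
    """I(S1,...,Sm) = -sum(p_i * log2(p_i)) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def expected_entropy(values, labels):
    """E(A): weighted entropy after partitioning S by attribute A's values."""
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [lab for a, lab in zip(values, labels) if a == v]
        total += (len(subset) / n) * entropy(subset)
    return total

def info_gain(values, labels):
    """Gain(A) = I(S1,...,Sm) - E(A)."""
    return entropy(labels) - expected_entropy(values, labels)

def gain_ratio(values, labels):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    split_info = entropy(values)  # entropy of the split itself
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical attribute "windy" against a class "play":
windy = ["no", "no", "yes", "yes", "no", "yes"]
play = ["yes", "yes", "no", "no", "yes", "yes"]
print(entropy(play), info_gain(windy, play), gain_ratio(windy, play))
```

Here the split information happens to be exactly 1 bit (three "yes" versus three "no" values of the attribute), so gain and gain ratio coincide; for skewed splits the ratio penalises attributes with many values.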
The gain ratio is defined as

Gain Ratio(A) = Gain(A) / Split Info(A)

The attribute with the maximum gain ratio is selected as the splitting attribute. Note, however, that as the split information approaches 0, the ratio becomes unstable. A constraint is added to avoid this, whereby the information gain of the test selected must be large: at least as great as the average gain over all tests examined.

IV. C4.5 ALGORITHM
C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan. Many scholars have made various improvements to the decision tree algorithm, but the problem is that these algorithms need to scan and sort the data collection several times during the construction of the decision tree. The processing speed is reduced greatly when the data set is so large that it cannot fit in memory. At present there is literature on improving the efficiency of decision tree classification algorithms. For example, Wei Zhao and Jamming Su [7] proposed improvements to the ID3 algorithm, which simplify the information gain by use of Taylor's formula. But this improvement is more suitable for a small amount of data, so it is not particularly effective on large data sets.

Because large datasets must be handled, a variety of decision tree classification algorithms have been considered. The advantages of the C4.5 algorithm are significant, so it can be chosen; but its efficiency must be improved to meet the dramatic increase in the demand for large amounts of data.

A. Pseudo Code [16]:
1. Check for base cases.
2. For each attribute a, calculate:
   i. the normalized information gain from splitting on attribute a.
3. Select the best attribute a, the one that has the highest normalized information gain.
4. Create a decision node that splits on the best a as the root node.
5. Recurse on the sublists obtained by splitting on the best a, and add those nodes as children of the node.

B. Improvements from the ID3 algorithm:
1. Handling both continuous and discrete attributes: in order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it.
2. Handling training data with missing attribute values: C4.5 allows attribute values to be marked as "?" for missing. Missing attribute values are simply not used in gain and entropy calculations.
3. Handling attributes with differing costs.
4. Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help by replacing them with leaf nodes.

V. THE IMPROVEMENT OF THE C4.5 ALGORITHM

A. The improvement
The C4.5 algorithm [8] [9] generates a decision tree through learning from a training set, in which each example is structured in terms of attribute-value pairs. The current attribute node is the one with the maximum calculated rate of information gain, and the root node of the decision tree is obtained in this way. On careful study we find that the selection of the test attribute at each node involves logarithmic calculations, and that these calculations are repeated every time. The efficiency of decision tree generation can therefore be impacted when the dataset is large. We also find, after studying the calculation process carefully, that the antilogarithm in the logarithmic calculation is usually small, so the process can be simplified by using L'Hospital's Rule, as follows.

If f(x) and g(x) satisfy:
(1) lim x→x0 f(x) and lim x→x0 g(x) are both zero or both ∞;
(2) in a deleted neighbourhood of the point x0, both f'(x) and g'(x) exist and g'(x) ≠ 0;
(3) lim x→x0 f'(x)/g'(x) exists or is ∞;
then

lim x→x0 f(x)/g(x) = lim x→x0 f'(x)/g'(x)
In the expression above, Gain-Ratio(A) involves only addition, subtraction, multiplication and division, with no logarithmic calculation, so the computing time is much shorter than for the original expression. What is more, the simplification can be extended to the multi-class case.

B. Reasonable arguments for the improvement:
In the improvement of C4.5 above, no term is added or removed; only approximate calculation is used when we calculate the information gain rate. Moreover, the antilogarithm in the logarithmic calculation is a probability, which is less than 1. To simplify the presentation there are only two categories in this article, where the probabilities are somewhat larger than in the multi-class case; the probabilities become smaller as the number of categories grows, which further supports the rationality of the approximation. Furthermore, the approximate calculation is also guaranteed by L'Hospital's Rule, so the improvement is reasonable.

C. Comparison of the complexity:
To calculate Gain-Ratio(S, A), the C4.5 algorithm's complexity is mainly concentrated in E(S) and E(S, A). When we compute E(S), each probability value needs to be calculated first, which takes O(n) time. Then each one is multiplied and accumulated, which takes O(log2 n) time, so the complexity is O(n log2 n). Again, in the calculation of E(S, A) the complexity is O(n(log2 n)^2), so the total complexity of Gain-Ratio(S, A) is O(n(log2 n)^2).

The improved C4.5 algorithm involves only the original data and only addition, subtraction, multiplication and division operations. So it needs only one scan to obtain the totals and then some simple calculations; the total complexity is O(n).

VI. CONCLUSION AND FUTURE WORK
In this paper we studied the C4.5 algorithm and an improved C4.5 algorithm that raises the performance of the existing algorithm, saving time by the use of L'Hospital's rule and increasing the efficiency considerably. We can not only speed up the growing of the decision tree, but also generate better rule information. We will verify the presented algorithm on different large datasets which are publicly available in the UCI machine learning repository. With the improved algorithm, we can get faster and more effective results without changing the final decision, and the presented algorithm constructs a clearer and more understandable decision tree. Efficiency and classification are greatly improved.

REFERENCES
[1] I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, China Machine Press, 2006.
[2] S. F. Chen, Z. Q. Chen, Artificial Intelligence in Knowledge Engineering [M]. Nanjing: Nanjing University Press, 1997.
[3] Z. Z. Shi, Senior Artificial Intelligence [M]. Beijing: Science Press, 1998.
[4] D. Jiang, Information Theory and Coding [M]. University of Science and Technology of China Press, 2001.
[5] M. Zhu, Data Mining [M]. Hefei: China University of Science and Technology Press, 2002. 67-72.
[6] A. P. Engelbrecht, A new pruning heuristic based on variance analysis of sensitivity information [J]. IEEE Trans. on Neural Networks, 2001, 12(6): 1386-1399.
[7] N. Kwad, C. H. Choi, Input feature selection for classification problem [J]. IEEE Trans. on Neural Networks, 2002, 13(1): 143-159.
[8] Quinlan, J. R., Induction of decision trees [J]. Machine Learning, 1986.
[9] Quinlan, J. R., C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[10] UCI Repository of machine learning databases. University of California, Department of Information and Computer Science, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html
[11] UCI Machine Learning Repository - http://mlearn.icsuci.edu/database
[12] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers.
[13] Chen Jin, Luo De-lin, Mu Fen-xiang, An Improved ID3 Decision Tree Algorithm, Xiamen University, 2009.
[14] Rong Cao, Lizhen Xu, Improved C4.5 Decision Tree Algorithm for the Analysis of Sales. Southeast University, Nanjing 211189, China, 2009.
[15] Huang Ming, Niu Wenying, Liang Xu, An Improved Decision Tree Classification Algorithm Based on ID3 and the Application in Score Analysis. Dalian Jiao Tong University, 2009.
[16] Surbhi Hardikar, Ankur Shrivastava and Vijay Choudhary, Comparison between ID3 and C4.5 in Contrast to IDS. VSRD-IJCSIT, Vol. 2 (7), 2012.
[17] Khalid Ibnal Asad, Tanvir Ahmed, Md. Saiedur Rahman, Movie Popularity Classification based on Inherent Movie Attributes using C4.5, PART and Correlation Coefficient. IEEE/OSA/IAPR International Conference on Informatics, Electronics & Vision, 2012.