Session_2b
MDI Gurgaon
Contents
Information gain in C4.5
Problem of Overfitting
Evaluating Performance
Ensemble Methods
Representative Data Set
(Table with columns: Sr No | Attributes | Class)
Suppose a training set has w classes C1, C2, ..., Cw. A function that measures the level of impurity of the data set under a given condition t, known as the Gini impurity function, is defined as

Gini(Class) = 1 - \sum_{i=1}^{w} P(C_i)^2
If two equally probable classes are present, we have 1 - 0.5^2 - 0.5^2 = 0.5.
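A minimal Python sketch of this function (illustrative, not from the slides; the function name and the toy counts are assumptions):

    def gini(counts):
        """Gini impurity for a list of per-class counts."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    print(gini([5, 5]))  # two equally probable classes -> 0.5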
To construct a tree, we select the attribute that reduces the impurity the most as the root of the tree.
Weather data: class counts per attribute value (P = positive class, N = negative class)

              Humidity        Windy         Outlook                  Temperature
              High  Normal    True  False   Sunny  Overcast  Rainy   Hot  Mild  Cool
P             3     6         3     6       2      4         3       2    4     3
N             4     1         3     2       3      0         2       2    2     1
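An illustrative sketch (the counts come from the table above; the dictionary layout is an assumption) that computes the weighted Gini impurity of each candidate split:

    gini = lambda counts: 1 - sum((c / sum(counts)) ** 2 for c in counts)

    splits = {
        "Humidity":    [(3, 4), (6, 1)],          # High, Normal
        "Windy":       [(3, 3), (6, 2)],          # True, False
        "Outlook":     [(2, 3), (4, 0), (3, 2)],  # Sunny, Overcast, Rainy
        "Temperature": [(2, 2), (4, 2), (3, 1)],  # Hot, Mild, Cool
    }
    total = 14  # 9 P + 5 N observations

    for attribute, branches in splits.items():
        weighted = sum(sum(b) / total * gini(b) for b in branches)
        print(f"{attribute}: weighted Gini = {weighted:.3f}")

Outlook gives the lowest weighted Gini (about 0.343) and would therefore be selected as the root.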
Gini Index of Impurity for the weather data
Class counts for the binary (two-way) groupings of Outlook:

              Sunny  Not Sunny    Overcast  Not Overcast    Rainy  Not Rainy
P             2      7            4         5               3      6
N             3      2            0         5               2      3
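A short sketch (same weather counts; the grouping labels are inferred from the table above) computing the weighted Gini of each binary split:

    gini = lambda counts: 1 - sum((c / sum(counts)) ** 2 for c in counts)

    binary_splits = {
        "Sunny vs rest":    [(2, 3), (7, 2)],
        "Overcast vs rest": [(4, 0), (5, 5)],
        "Rainy vs rest":    [(3, 2), (6, 3)],
    }
    for name, (left, right) in binary_splits.items():
        n = sum(left) + sum(right)
        weighted = sum(left) / n * gini(left) + sum(right) / n * gini(right)
        print(f"{name}: weighted Gini = {weighted:.3f}")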
Chi-square statistic in the CHAID Algo
The chi-square statistic is a measure of the degree of association or dependence between two variables:

\chi^2 = \sum_i \sum_j (x_{ij} - E_{ij})^2 / E_{ij}

where x_{ij} represents the actual frequency of attribute value a_j with class C_i, and E_{ij} represents the expected frequency under independence, E_{ij} = n_i n_j / N.
The chi-square statistic compares the actual frequencies of the classes within an attribute with the frequencies expected when no association is assumed.
The greater the difference, the stronger the association between the classes and the chosen attribute (provided the result is statistically significant).
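As a sketch (the observed counts are the Outlook columns of the weather table above; the variable names are assumptions), the statistic can be computed directly from its definition:

    observed = [[2, 4, 3],   # class P across Sunny, Overcast, Rainy
                [3, 0, 2]]   # class N

    N = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]        # n_i (class totals)
    col_totals = [sum(col) for col in zip(*observed)]  # n_j (value totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, x in enumerate(row):
            e = row_totals[i] * col_totals[j] / N      # E_ij = n_i * n_j / N
            chi2 += (x - e) ** 2 / e
    print(f"chi-square = {chi2:.3f}")                  # about 3.547 here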
Using the independent variable with the strongest significant association, split the data and repeat the previous step for each of the subsets of the data until:
all independent variables are exhausted, or none of the remaining associations is statistically significant.
(Worked example result: χ² ≈ 29.653.)
Bonferroni correction
Since the tree results from multiple splits, and hence multiple hypothesis tests, the significance level α must be corrected.
With two tests each at α = 0.05, the familywise Type I error is 1 - (1 - 0.05)^2 = 1 - 0.9025 = 0.0975, higher than the desired α = 0.05. The Bonferroni correction therefore tests each of the m splits at level α/m (here 0.05/2 = 0.025).
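A minimal numeric sketch of this correction (m = 2 tests is an assumption matching the example above):

    alpha, m = 0.05, 2
    familywise = 1 - (1 - alpha) ** m  # 1 - 0.9025 = 0.0975 > 0.05
    corrected = alpha / m              # Bonferroni-corrected level per test: 0.025
    print(familywise, corrected)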
Solving Overfitting
Overfitting: all classification algorithms suffer from overfitting, since the model is a representation of the training data rather than of the actual data.
The problem is even more severe if only a few training examples reach a particular leaf node.
Pruning replaces a subtree with a leaf, making the tree smaller and more robust.
Two types of pruning methods are used: pre-pruning and post-pruning.
For example, C4.5 makes certain assumptions about the data distribution and estimates the generalization errors that the tree may make on unseen examples.
Post-pruning, on the other hand, is conducted after the tree is fully grown and requires independent validation or testing data.
1. Grow the full tree and, using the validation data, count the number of classification errors it makes.
2. For every non-leaf node, count the number of errors if the subtree where the node is the root were to be replaced by a leaf node with the best possible class label.
3. Prune the subtree whose replacement reduces the number of errors the most.
4. Repeat steps 2 and 3 until further pruning increases the number of errors (see the sketch below).
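A minimal sketch of steps 2-4 as a single bottom-up sweep, one of the variants discussed below; the Node structure, the toy representation of examples as dictionaries, and the helper names are all assumptions:

    from collections import Counter

    class Node:
        def __init__(self, attribute=None, children=None, label=None):
            self.attribute = attribute      # attribute tested at this node
            self.children = children or {}  # attribute value -> child node
            self.label = label              # class label if this is a leaf

    def classify(node, example):
        while node.label is None:
            node = node.children[example[node.attribute]]
        return node.label

    def errors(node, data):
        return sum(classify(node, x) != y for x, y in data)

    def prune(node, validation):
        # Bottom-up: replace a subtree with a leaf whenever the leaf makes
        # no more validation errors than the subtree does (steps 2 and 3).
        if node.label is not None or not validation:
            return node
        for value, child in node.children.items():
            subset = [(x, y) for x, y in validation
                      if x[node.attribute] == value]
            node.children[value] = prune(child, subset)
        majority = Counter(y for _, y in validation).most_common(1)[0][0]
        leaf = Node(label=majority)
        return leaf if errors(leaf, validation) <= errors(node, validation) else node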
Second, there may be a number of subtrees with the same error reduction; in this case the largest subtree is pruned. (Should this be done iteratively, where all subtrees are considered and bigger trees are pruned first, or in a single bottom-up sweep, with each smaller subtree being pruned first?)
Third, should we use training data, testing data, or both for pruning?
Fourth, certain parts of the tree may be irrelevant for the testing data. Should these be removed, given that removing them does not increase the error on the testing data set? Is this a good strategy?
Research suggests specific solutions for each of the above
Evaluating the Performance
Error rate = (FP + FN) / (TP + TN + FP + FN)

                            Predicted Class
                            Positive   Negative
Actual Class   Positive     TP         FN
               Negative     FP         TN
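A minimal numeric sketch (the counts are made-up illustrative values):

    TP, FN, FP, TN = 50, 10, 5, 35
    error_rate = (FP + FN) / (TP + TN + FP + FN)
    print(error_rate, 1 - error_rate)  # error rate 0.15, accuracy 0.85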
Evaluating Performance
Holdout method: the available data is divided into separate training and testing data sets.
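An illustrative sketch of the holdout method with scikit-learn (the generated toy data and the one-third test fraction are assumptions, not from the slides):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=150, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, random_state=42)  # hold out one third for testing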
Ensemble Methods
An ensemble method is a machine learning algorithm that generates several classifiers using different sampling strategies.
Majority voting is then used to classify a new observation with the multiple classifiers that are developed.
For a new data point, all classifiers in the ensemble identify a class, and the final class is decided by majority if the classifiers are given equal weight.
If the classifiers are weighted instead, the final value is obtained as a linear combination of all the classifier outputs.
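A minimal sketch of equal-weight majority voting; the three one-line "classifiers" are hypothetical stand-ins for trained models:

    from collections import Counter

    classifiers = [
        lambda x: "P" if x["humidity"] == "normal" else "N",
        lambda x: "P" if x["outlook"] == "overcast" else "N",
        lambda x: "P" if not x["windy"] else "N",
    ]

    def predict(x):
        votes = Counter(clf(x) for clf in classifiers)
        return votes.most_common(1)[0][0]  # class with the most votes

    print(predict({"humidity": "normal", "outlook": "sunny", "windy": True}))
    # votes are P, N, N -> final class "N"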
Random Forest
Random forest is a well-known ensemble method in which several trees are developed using different sampling strategies.
A new observation is classified by using all the trees in the random forest, and majority voting is adopted.
The model is tested using the out-of-bag (OOB) data, i.e., the observations that are not part of the training sample used for creating a given tree.
Steps to create Random Forest
1. Assume the training data has N observations; generate many samples of size M (M < N) with replacement, say S such samples.
2. For each sample, randomly select a subset of the predictor variables from the full set of predictors.
3. Develop trees for each of the samples generated in step 1, using the sample of predictors from step 2 and CART (see the sketch below).
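An illustrative sketch of these steps using scikit-learn's RandomForestClassifier (the toy data and parameter values are assumptions, not from the slides):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=100,     # S bootstrap samples / trees (step 1)
        max_features="sqrt",  # random subset of predictors (step 2)
        bootstrap=True,       # sampling with replacement
        oob_score=True,       # evaluate on the out-of-bag data
        random_state=0,
    ).fit(X, y)

    print(rf.oob_score_)      # accuracy estimated from the out-of-bag observations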
Strengths / Weaknesses
Strengths
Assigns a label to an unseen record and also explains why the decision is made.
Performance depends only on the number of levels of the tree, not on the sample size.
Weaknesses
High error rates when the training set contains a small number of examples for a large variety of different classes.