
BDAM

Session_2b
MDI Gurgaon

Contents
Information gain in C4.5

Intro to CART Algo and example

Intro to CHAID Algo and example

Problem of Overfitting

Evaluating Performance

Ensemble Methods
Representative Data Set

Sr No  Outlook   Temperature  Humidity  Windy  Class
1      Sunny     Hot          High      False  N
2      Sunny     Hot          High      True   N
3      Overcast  Hot          High      False  P
4      Rain      Mild         High      False  P
5      Rain      Cool         Normal    False  P
6      Rain      Cool         Normal    True   N
7      Overcast  Cool         Normal    True   P
8      Sunny     Mild         High      False  N
9      Sunny     Cool         Normal    False  P
10     Rain      Mild         Normal    False  P
11     Sunny     Mild         Normal    True   P
12     Overcast  Mild         High      True   P
13     Overcast  Hot          Normal    False  P
14     Rain      Mild         High      True   N
Information Gain Ratio in the C4.5/C5.0 Algorithms
The information gain ratio is defined as the ratio of the information gain of an attribute A to the entropy of the attribute itself:

Gain Ratio(A) = Gain(A)/H(A)

The gain ratio normalizes the gain across different attributes in order to avoid bias towards attributes with many distinct values.

H(Outlook) = -(5/14)log2(5/14) - (4/14)log2(4/14) - (5/14)log2(5/14) = 1.577 bits

Gain Ratio(Outlook) = Gain(Outlook)/H(Outlook) = 0.246/1.577 = 0.156

Gain Ratio(Temperature) = 0.019; Gain Ratio(Humidity) = 0.151; Gain Ratio(Windy) = 0.049
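The gain-ratio calculation above can be sketched directly from the weather data table; this is a minimal stand-alone implementation, not the C4.5 source:

```python
from math import log2
from collections import Counter, defaultdict

# Weather data from the slide: (Outlook, Temperature, Humidity, Windy, Class)
data = [
    ("Sunny", "Hot", "High", False, "N"),  ("Sunny", "Hot", "High", True, "N"),
    ("Overcast", "Hot", "High", False, "P"), ("Rain", "Mild", "High", False, "P"),
    ("Rain", "Cool", "Normal", False, "P"),  ("Rain", "Cool", "Normal", True, "N"),
    ("Overcast", "Cool", "Normal", True, "P"), ("Sunny", "Mild", "High", False, "N"),
    ("Sunny", "Cool", "Normal", False, "P"), ("Rain", "Mild", "Normal", False, "P"),
    ("Sunny", "Mild", "Normal", True, "P"),  ("Overcast", "Mild", "High", True, "P"),
    ("Overcast", "Hot", "Normal", False, "P"), ("Rain", "Mild", "High", True, "N"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr):
    """Information gain of `attr` divided by the attribute's own entropy H(A)."""
    i = ATTRS[attr]
    classes = [row[-1] for row in data]
    n = len(data)
    groups = defaultdict(list)
    for row in data:
        groups[row[i]].append(row[-1])
    gain = entropy(classes) - sum(len(g) / n * entropy(g) for g in groups.values())
    h_attr = entropy([row[i] for row in data])  # intrinsic information of the split
    return gain / h_attr
```

With this, `gain_ratio("Outlook")` reproduces the 0.156 figure from the slide.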
Gini Index of Impurity in the CART Algo
Algorithms like CART, SPRINT and SLIQ treat a given dataset as an impure set, and constructing a decision tree is a process of splitting the data set into purer and purer partitions (examples of one class should overwhelm the others).

Suppose a training set has w classes C1, C2, ..., Cw. The Gini impurity function, which measures the level of impurity of the data set under a given condition t, is defined as

Gini(t) = 1 - Σ(i=1..w) P(Ci|t)^2, where P(Ci|t) is the fraction of examples of class Ci under the condition t.

Gini(Class) = 1 - Σ(i=1..w) P(Ci)^2
If two equally probable classes are present, we have 1 - 0.5*0.5 - 0.5*0.5 = 0.5.
To construct a tree we select the attribute that reduces the impurity the most as the root of the tree.

The Gini index of impurity over an attribute A with m values is defined as

GiniIndex(A) = Gini(Class) - Σ(j=1..m) P(aj)*Gini(A=aj), where P(aj) is the probability of A = aj

Gini(Class) = 1 - (9/14)^2 - (5/14)^2 = 0.459

Gini(Outlook=Sunny) = 1 - P(Class=P|Outlook=Sunny)^2 - P(Class=N|Outlook=Sunny)^2 = 1 - (2/5)^2 - (3/5)^2 = 0.48

Gini(Outlook=Overcast) = ?

Gini(Outlook=Rain) = ?
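The per-value Gini impurities and the Gini index of Outlook can be checked with a short sketch (class counts taken from the data table above):

```python
# Class counts per Outlook value, taken from the weather data table
outlook_counts = {
    "Sunny":    {"P": 2, "N": 3},
    "Overcast": {"P": 4, "N": 0},
    "Rain":     {"P": 3, "N": 2},
}

def gini(counts):
    """Gini impurity 1 - sum(p_i^2) from a dict of class counts."""
    n = sum(counts.values())
    return 1 - sum((c / n) ** 2 for c in counts.values())

n_total = sum(sum(c.values()) for c in outlook_counts.values())  # 14 examples
# Weighted average impurity after splitting on Outlook
weighted = sum(sum(c.values()) / n_total * gini(c) for c in outlook_counts.values())
gini_class = gini({"P": 9, "N": 5})            # 0.459
gini_index_outlook = gini_class - weighted     # reduction in impurity
```

A pure partition such as Outlook = Overcast (all P) has Gini impurity 0, which is what drives the large Gini index for Outlook.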


Gini Index of Impurity for the weather data

Class       Humidity       Windy         Outlook                 Temperature
            High  Normal   True  False   Sunny  Overcast  Rain   Hot  Mild  Cool
P           3     6        3     6       2      4         3      2    4     3
N           4     1        3     2       3      0         2      2    2     1
Gini        0.367          0.429         0.343                   0.440
Gini Index  0.092          0.031         0.116                   0.019
Gini Index of Impurity for the weather data (binary splits)

Class       Outlook                  Outlook                  Outlook
            Sunny  {Overcast,Rain}   Overcast  {Rain,Sunny}   Rain  {Sunny,Overcast}
P           2      7                 4         5              3     6
N           3      2                 0         5              2     3
Gini        0.394                    0.357                    0.457
Gini Index  0.065                    0.102                    0.002
Chi-square statistic in the CHAID Algo
The chi-square statistic is a measure of the degree of association or dependence between two variables.

For classification, the statistic can be used to measure the degree of dependence between a given attribute and the class variable.

Given a set of N examples of w classes C1, C2, ..., Cw and an attribute A of v values a1, a2, ..., av, the chi-square statistic is defined as follows:

X^2 = Σ(j=1..v) Σ(i=1..w) (Xij - Eij)^2/Eij,

where Xij represents the actual frequency of attribute value aj together with class Ci, and Eij represents the expected frequency under independence, Eij = ni*nj/N.
The chi-square statistic measures the difference between the actual frequencies of classes within an attribute and the expected frequencies when no association is assumed.

The greater the difference, the stronger the association between the classes and the chosen attribute (provided it is statistically significant).

Depending on the nature of the dependent variable, the following statistical tests are used for splitting the dataset:

Chi-square test of independence when the response variable is discrete

F-test when the response variable is continuous

Likelihood ratio test when the response variable is ordinal
CHAID Algo
The variables with the lowest p-values on the statistical test are used for splitting the dataset, thereby creating internal nodes; a Bonferroni correction is used for adjusting the significance level alpha.

Using the independent variables, repeat the previous step for each of the subsets of the data until:

All independent variables are exhausted or none is statistically significant

The stopping criterion is met

Generate business rules from the leaves.
Example (Weather data)
Ci: C1 and C2 are P and N

Parameter j stands for the values of Outlook: sunny, overcast and rain

X^2(Outlook) = X^2(Outlook=Sunny) + X^2(Outlook=Overcast) + X^2(Outlook=Rain)

X^2(Outlook) = [((2 - (5*9/14))^2/(5*9/14)) + ((3 - (5*5/14))^2/(5*5/14))] + ?? + ??

≈ 3.55
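The full statistic, including the two terms left as ?? above, can be verified with a small sketch over the Outlook-versus-Class contingency table:

```python
# Outlook vs Class contingency table from the weather data: value -> (P count, N count)
table = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}

def chi_square(table):
    """Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
    with expected frequency Eij = (row total * column total) / N."""
    n = sum(p + q for p, q in table.values())
    col_p = sum(p for p, _ in table.values())
    col_n = sum(q for _, q in table.values())
    stat = 0.0
    for p, q in table.values():
        row = p + q
        for observed, col in ((p, col_p), (q, col_n)):
            expected = row * col / n
            stat += (observed - expected) ** 2 / expected
    return stat
```

`chi_square(table)` evaluates to about 3.55 for Outlook on the 14-example data, the attribute's strength of association with the class.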
Bonferroni correction
Since the tree results from multiple splits and multiple tests of hypothesis, the significance level alpha must be corrected.

Type 1 error is the probability of rejecting the null hypothesis when it is true, denoted alpha.

In multiple tests of hypothesis at alpha = 0.05, the probability of correctly retaining the null hypothesis across two independent tests is 0.95*0.95 = 0.9025

The familywise Type 1 error in this case is 1 - 0.9025 = 0.0975, higher than the desired alpha = 0.05

For 10 tests: 1 - (0.95)^10 = 0.4013
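The familywise error and the standard Bonferroni adjustment (divide alpha by the number of tests) can be expressed in two lines; this is a sketch of the general correction, not CHAID's exact internals:

```python
def familywise_error(alpha, m):
    """Probability of at least one false rejection across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test significance level that keeps the familywise error near alpha."""
    return alpha / m
```

For example, `familywise_error(0.05, 10)` gives roughly 0.401, while testing each split at `bonferroni_alpha(0.05, 10) = 0.005` keeps the overall error near 0.05.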


Cost-based Splitting Criteria
Other than impurity measures such as the Gini index and entropy, decision makers also use the cost of misclassification to split data.

The total penalty is C01*P01 + C10*P10, where

P01 = proportion of 0s classified as 1

P10 = proportion of 1s classified as 0

C01 = cost of classifying a 0 as 1

C10 = cost of classifying a 1 as 0
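The penalty formula can be coded directly; the proportions and costs below are made-up numbers for illustration (false negatives costed at five times false positives):

```python
def misclassification_cost(p01, p10, c01, c10):
    """Total penalty C01*P01 + C10*P10 from the cost-based splitting criterion."""
    return c01 * p01 + c10 * p10

# Illustrative (made-up) inputs: 10% of 0s misclassified, 4% of 1s misclassified
cost = misclassification_cost(p01=0.10, p10=0.04, c01=1.0, c10=5.0)
```

A split is then chosen to minimize this total penalty rather than an impurity measure.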
Solving Overfitting
Overfitting: all classification algorithms suffer from overfitting because the model is a representation of the training data rather than the actual data.

The problem is even more severe if only a few training examples exist at a particular leaf node.

The causes of the problem are understood to be (a) the presence of noise records, (b) a lack of representative examples in the training set, and (c) the repeated attribute-selection process of the algorithm.

The presence of noise increases peculiarity, thereby adding new subtrees to accommodate the training data.
One solution is to denoise the data before building the tree, but this is not practical.
Tree Pruning
The problem of non-representative training examples, too, is not exactly solvable.

The problem of repeated attribute selection stems from creating subsets whenever the gain is above a threshold; however, if the threshold is set too low, we end up with a large tree.

The problems of noise and repeated attribute selection can be addressed by tree pruning.

The act of pruning is replacing a subtree by a leaf, to make the tree smaller and more robust.
Two types of pruning methods are used: pre-pruning and post-pruning.
For example, C4.5 makes certain assumptions about the data distribution and estimates the generalization errors that the tree may make on unseen examples.

Another approach is Occam's razor: penalizing complex trees, thereby selecting the tree with the least complexity and the lowest possible error rate.

Post-pruning, on the other hand, is conducted after the tree is fully grown and requires independent validation or testing data.

Experimental studies claim that post-pruning produces more accurate and robust trees than pre-pruning.

Among the post-pruning methods are cost-complexity pruning, pessimistic pruning, path-length pruning, cross-validation pruning and reduced-error pruning.
Reduced-Error Pruning Method
1. Classify all examples in the testing dataset using the tree, and note at each non-leaf node the type and number of errors.

2. For every non-leaf node, count the number of errors if the subtree rooted at that node were replaced by a leaf node with the best possible class label.

3. Prune the subtree that yields the largest reduction in the number of errors.

4. Repeat steps 2 and 3 until further pruning would increase the number of errors.

The method above raises a few important questions.
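The steps above can be sketched on a toy tree of nested dicts (leaves are class labels, internal nodes test one attribute); this is an illustrative bottom-up variant, not the exact procedure of any particular system:

```python
from collections import Counter

def classify(tree, example):
    """Walk the tree: internal nodes are dicts testing one attribute, leaves are labels."""
    while isinstance(tree, dict):
        tree = tree["children"][example[tree["attr"]]]
    return tree

def errors(tree, examples):
    """Number of misclassified (attributes, label) pairs."""
    return sum(classify(tree, x) != y for x, y in examples)

def reduced_error_prune(tree, examples):
    """Bottom-up sweep: replace a subtree by its best (majority) leaf whenever
    that does not increase the error count on the test examples reaching it."""
    if not isinstance(tree, dict) or not examples:
        return tree
    attr = tree["attr"]
    children = {
        value: reduced_error_prune(child,
                                   [(x, y) for x, y in examples if x[attr] == value])
        for value, child in tree["children"].items()
    }
    pruned = {"attr": attr, "children": children}
    majority = Counter(y for _, y in examples).most_common(1)[0][0]
    return majority if errors(majority, examples) <= errors(pruned, examples) else pruned
```

Note the `<=` comparison: this variant prunes even when the error reduction is zero, one of the design choices discussed next.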
Issues in Reduced-Error Pruning
First, a subtree may be pruned even with zero error reduction. Should it be?

Second, there may be a number of subtrees with the same error reduction; in this case the largest subtree is pruned. (Is this done iteratively, with all subtrees considered and the bigger trees pruned first, or in a single bottom-up sweep with each smaller subtree pruned first?)

Third, should we use training data, testing data or both for pruning?

Fourth, certain parts of the tree may be irrelevant for the testing data. Should these be removed, given that they do not increase the error on the testing dataset? Is this a good strategy?

Research suggests specific solutions for each of the above.
Evaluating the Performance
Error rate = (FP + FN)/(TP + TN + FP + FN)

Accuracy rate = 1 - error rate

Confusion Matrix

                        Predicted Class
                        Positive   Negative
Actual Class  Positive  TP         FN
              Negative  FP         TN
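The two metrics follow directly from the confusion-matrix cells; the counts in the example call are made up for illustration:

```python
def error_rate(tp, tn, fp, fn):
    """(FP + FN) / (TP + TN + FP + FN) from the confusion matrix."""
    return (fp + fn) / (tp + tn + fp + fn)

def accuracy(tp, tn, fp, fn):
    """Accuracy rate = 1 - error rate."""
    return 1 - error_rate(tp, tn, fp, fn)

# Illustrative (made-up) counts: 50 TP, 40 TN, 6 FP, 4 FN -> 10 errors in 100
```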
Evaluating Performance
Holdout method: a standard division of the data into training and testing datasets

Random subsampling: repeat the holdout method multiple times; the average accuracy over all runs is taken as the accuracy

Bootstrap: sampling with replacement, unlike holdout; the data is again divided into two parts and accuracy is calculated as the average over all runs

Cross-validation: k-fold cross-validation divides the data into k equal parts; in each run one part is used for testing and the other (k-1) parts for training. The process is repeated k times and the errors are averaged across the runs.
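The k-fold scheme can be sketched as a generator of (train, test) pairs; a minimal version, without the shuffling or stratification a production library would add:

```python
def k_fold_splits(data, k):
    """Yield k (train, test) pairs; fold i serves as the test set exactly once
    and the remaining k-1 folds form the training set."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each example appears in exactly one test fold, so every record is used for both training and testing across the k runs.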
Ensemble Methods
An ensemble method is a machine learning algorithm that generates several classifiers using different sampling strategies.

Majority voting is then used to classify a new observation using the multiple classifiers that were developed.

For a new data point, all classifiers in the ensemble are used to identify the class, and the final class is decided by majority vote if the classifiers are given equal weight.

A different weight can be assigned to each classifier based on its accuracy (adaptive boosting). The final value in these cases is obtained as a linear combination of the outputs of all the classifiers.
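Both the equal-weight and the weighted vote reduce to summing per-class scores; a minimal sketch:

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Combine classifier outputs by (weighted) majority vote.
    predictions: one class label per classifier;
    weights: optional per-classifier weights (equal weights if omitted)."""
    weights = weights or [1.0] * len(predictions)
    score = defaultdict(float)
    for label, weight in zip(predictions, weights):
        score[label] += weight
    return max(score, key=score.get)
```

With weights derived from accuracy, a single strong classifier can outvote several weak ones, which is the idea behind adaptive boosting.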
Random Forest
A random forest is a well-known ensemble method in which several trees are developed using different sampling strategies.

The most common strategy is bootstrap aggregating (bagging), which is random sampling with replacement.

A new observation is classified by using all the trees in the random forest, and majority voting is adopted.

In general, random forests are expected to provide much higher accuracy than a single tree.

The model is tested using the out-of-bag data, which is not part of the training dataset used for creating the many trees.
Steps to create a Random Forest
1. Assume the training data has N observations; generate many samples of size M (M < N) with replacement, say S such samples.

2. If the data has n predictors, sample m predictors (m < n).

3. Develop a tree for each of the samples generated in step 1, using the sample of predictors from step 2 and CART.

4. Repeat step 3 for all the samples generated in step 1.

5. Predict the class of a new observation using a majority vote.
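Steps 1, 2 and 5 can be sketched with the standard library; the tree-growing itself (step 3, CART) is left abstract here as any callable that maps an observation to a class label:

```python
import random
from collections import Counter

def bootstrap_sample(data, m):
    """Step 1: draw a sample of size m from `data` with replacement."""
    return [random.choice(data) for _ in range(m)]

def sample_predictors(predictors, m):
    """Step 2: pick m of the n predictor names without replacement."""
    return random.sample(predictors, m)

def forest_predict(trees, observation):
    """Step 5: majority vote over the predictions of all trees."""
    votes = Counter(tree(observation) for tree in trees)
    return votes.most_common(1)[0][0]
```

Records of `data` that never appear in a given bootstrap sample form that tree's out-of-bag set, which is what the previous slide uses for testing.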
Strengths / Weaknesses
Strengths

Assigns a label to an unseen record and also explains why the decision is made

Classification time depends only on the number of levels of the tree, not on the sample size

Weaknesses

High error rates when the training set contains a small number of examples spread over a large variety of different classes

Computationally expensive to build: O(|C|*|L|*|A|)
