
BDAM

Session_2b
MDI Gurgaon

Contents
Information gain in C4.5

Intro to CART Algo and example

Intro to CHAID Algo and example

Problem of Overfitting

Evaluating Performance

Ensemble Methods
Representative Data Set

Sr No  Outlook   Temperature  Humidity  Windy  Class
1      Sunny     Hot          High      False  N
2      Sunny     Hot          High      True   N
3      Overcast  Hot          High      False  P
4      Rain      Mild         High      False  P
5      Rain      Cool         Normal    False  P
6      Rain      Cool         Normal    True   N
7      Overcast  Cool         Normal    True   P
8      Sunny     Mild         High      False  N
9      Sunny     Cool         Normal    False  P
10     Rain      Mild         Normal    False  P
11     Sunny     Mild         Normal    True   P
12     Overcast  Mild         High      True   P
13     Overcast  Hot          Normal    False  P
14     Rain      Mild         High      True   N
Information Gain Ratio in the C4.5/C5.0 Algorithms
The information gain ratio is defined as the ratio of the information gain of an attribute A to the entropy of the attribute itself:

Gain Ratio(A) = Gain(A)/H(A)

The gain ratio normalizes the gain across different attributes in order to avoid bias towards attributes with many distinct values.

H(Outlook) = -(5/14)log2(5/14) - (4/14)log2(4/14) - (5/14)log2(5/14) = 1.577 bits

Gain Ratio(Outlook) = Gain(Outlook)/H(Outlook) = 0.246/1.577 = 0.156

Gain Ratio(Temperature) = 0.019; Gain Ratio(Humidity) = 0.151; Gain Ratio(Windy) = 0.049
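The gain-ratio calculation above can be sketched directly from the weather data table; this is a minimal stand-alone implementation, not the C4.5 source:

```python
from math import log2
from collections import Counter, defaultdict

# Weather data from the slide: (Outlook, Temperature, Humidity, Windy, Class)
data = [
    ("Sunny", "Hot", "High", False, "N"),  ("Sunny", "Hot", "High", True, "N"),
    ("Overcast", "Hot", "High", False, "P"), ("Rain", "Mild", "High", False, "P"),
    ("Rain", "Cool", "Normal", False, "P"),  ("Rain", "Cool", "Normal", True, "N"),
    ("Overcast", "Cool", "Normal", True, "P"), ("Sunny", "Mild", "High", False, "N"),
    ("Sunny", "Cool", "Normal", False, "P"), ("Rain", "Mild", "Normal", False, "P"),
    ("Sunny", "Mild", "Normal", True, "P"),  ("Overcast", "Mild", "High", True, "P"),
    ("Overcast", "Hot", "Normal", False, "P"), ("Rain", "Mild", "High", True, "N"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr):
    """Information gain of `attr` divided by the attribute's own entropy H(A)."""
    i = ATTRS[attr]
    classes = [row[-1] for row in data]
    n = len(data)
    groups = defaultdict(list)
    for row in data:
        groups[row[i]].append(row[-1])
    gain = entropy(classes) - sum(len(g) / n * entropy(g) for g in groups.values())
    h_attr = entropy([row[i] for row in data])  # intrinsic information of the split
    return gain / h_attr
```

With this, `gain_ratio("Outlook")` reproduces the 0.156 figure from the slide.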
Gini Index of Impurity in the CART Algo
Algorithms like CART, SPRINT and SLIQ treat a given dataset as an impure set, and constructing a decision tree is a process of splitting the data set into purer and purer partitions (examples of one class should overwhelm the others).

Suppose a training set has w classes C1, C2, ..., Cw. The Gini impurity function, which measures the level of impurity of the data set under a given condition t, is defined as

Gini(t) = 1 - Σ(i=1..w) P(Ci|t)^2, where P(Ci|t) is the fraction of examples of class Ci under the condition t.

Gini(Class) = 1 - Σ(i=1..w) P(Ci)^2
If two equally probable classes are present, we have 1 - 0.5*0.5 - 0.5*0.5 = 0.5.
To construct a tree we select the attribute that reduces the impurity the most as the root of the tree.

The Gini index of impurity over an attribute A with m values is defined as

GiniIndex(A) = Gini(Class) - Σ(j=1..m) P(aj)*Gini(A=aj), where P(aj) is the probability of A = aj

Gini(Class) = 1 - (9/14)^2 - (5/14)^2 = 0.459

Gini(Outlook=Sunny) = 1 - P(Class=P|Outlook=Sunny)^2 - P(Class=N|Outlook=Sunny)^2 = 1 - (2/5)^2 - (3/5)^2 = 0.48

Gini(Outlook=Overcast) = ?

Gini(Outlook=Rain) = ?
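The per-value Gini impurities and the Gini index of Outlook can be checked with a short sketch (class counts taken from the data table above):

```python
# Class counts per Outlook value, taken from the weather data table
outlook_counts = {
    "Sunny":    {"P": 2, "N": 3},
    "Overcast": {"P": 4, "N": 0},
    "Rain":     {"P": 3, "N": 2},
}

def gini(counts):
    """Gini impurity 1 - sum(p_i^2) from a dict of class counts."""
    n = sum(counts.values())
    return 1 - sum((c / n) ** 2 for c in counts.values())

n_total = sum(sum(c.values()) for c in outlook_counts.values())  # 14 examples
# Weighted average impurity after splitting on Outlook
weighted = sum(sum(c.values()) / n_total * gini(c) for c in outlook_counts.values())
gini_class = gini({"P": 9, "N": 5})            # 0.459
gini_index_outlook = gini_class - weighted     # reduction in impurity
```

A pure partition such as Outlook = Overcast (all P) has Gini impurity 0, which is what drives the large Gini index for Outlook.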


Gini Index of Impurity for the weather data

Class       Humidity       Windy         Outlook                 Temperature
            High  Normal   True  False   Sunny  Overcast  Rain   Hot  Mild  Cool
P           3     6        3     6       2      4         3      2    4     3
N           4     1        3     2       3      0         2      2    2     1
Gini        0.367          0.429         0.343                   0.440
Gini Index  0.092          0.031         0.116                   0.019
Gini Index of Impurity for the weather data (binary splits)

Class       Outlook                  Outlook                  Outlook
            Sunny  {Overcast,Rain}   Overcast  {Rain,Sunny}   Rain  {Sunny,Overcast}
P           2      7                 4         5              3     6
N           3      2                 0         5              2     3
Gini        0.394                    0.357                    0.457
Gini Index  0.065                    0.102                    0.002
Chi-square statistic in the CHAID Algo
The chi-square statistic is a measure of the degree of association or dependence between two variables.

For classification, the statistic can be used to measure the degree of dependence between a given attribute and the class variable.

Given a set of N examples of w classes C1, C2, ..., Cw and an attribute A of v values a1, a2, ..., av, the chi-square statistic is defined as follows:

X^2 = Σ(j=1..v) Σ(i=1..w) (Xij - Eij)^2/Eij,

where Xij represents the actual frequency of attribute value aj together with class Ci, and Eij represents the expected frequency under independence, Eij = ni*nj/N.
The chi-square statistic measures the difference between the actual frequencies of classes within an attribute and the expected frequencies when no association is assumed.

The greater the difference, the stronger the association between the classes and the chosen attribute (provided it is statistically significant).

Depending on the nature of the dependent variable, the following statistical tests are used for splitting the dataset:

Chi-square test of independence when the response variable is discrete

F-test when the response variable is continuous

Likelihood ratio test when the response variable is ordinal
CHAID Algo
The variables with the lowest p-values on the statistical test are used for splitting the dataset, thereby creating internal nodes; a Bonferroni correction is used for adjusting the significance level alpha.

Using the independent variables, repeat the previous step for each of the subsets of the data until:

All independent variables are exhausted or none is statistically significant

The stopping criterion is met

Generate business rules from the leaves.
Example (Weather data)
Ci: C1 and C2 are P and N

Parameter j stands for the values of Outlook: sunny, overcast and rain

X^2(Outlook) = X^2(Outlook=Sunny) + X^2(Outlook=Overcast) + X^2(Outlook=Rain)

X^2(Outlook) = [((2 - (5*9/14))^2/(5*9/14)) + ((3 - (5*5/14))^2/(5*5/14))] + ?? + ??

≈ 3.55
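The full statistic, including the two terms left as ?? above, can be verified with a small sketch over the Outlook-versus-Class contingency table:

```python
# Outlook vs Class contingency table from the weather data: value -> (P count, N count)
table = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}

def chi_square(table):
    """Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
    with expected frequency Eij = (row total * column total) / N."""
    n = sum(p + q for p, q in table.values())
    col_p = sum(p for p, _ in table.values())
    col_n = sum(q for _, q in table.values())
    stat = 0.0
    for p, q in table.values():
        row = p + q
        for observed, col in ((p, col_p), (q, col_n)):
            expected = row * col / n
            stat += (observed - expected) ** 2 / expected
    return stat
```

`chi_square(table)` evaluates to about 3.55 for Outlook on the 14-example data, the attribute's strength of association with the class.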
Bonferroni correction
Since the tree results from multiple splits and multiple tests of hypothesis, the significance level alpha must be corrected.

Type 1 error is the probability of rejecting the null hypothesis when it is true, denoted alpha.

In multiple tests of hypothesis at alpha = 0.05, the probability of correctly retaining the null hypothesis across two independent tests is 0.95*0.95 = 0.9025

The familywise Type 1 error in this case is 1 - 0.9025 = 0.0975, higher than the desired alpha = 0.05

For 10 tests: 1 - (0.95)^10 = 0.4013
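The familywise error and the standard Bonferroni adjustment (divide alpha by the number of tests) can be expressed in two lines; this is a sketch of the general correction, not CHAID's exact internals:

```python
def familywise_error(alpha, m):
    """Probability of at least one false rejection across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test significance level that keeps the familywise error near alpha."""
    return alpha / m
```

For example, `familywise_error(0.05, 10)` gives roughly 0.401, while testing each split at `bonferroni_alpha(0.05, 10) = 0.005` keeps the overall error near 0.05.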


Cost-based Splitting Criteria
Other than impurity measures such as the Gini index and entropy, decision makers also use the cost of misclassification to split data.

The total penalty is C01*P01 + C10*P10, where

P01 = proportion of 0s classified as 1

P10 = proportion of 1s classified as 0

C01 = cost of classifying a 0 as 1

C10 = cost of classifying a 1 as 0
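The penalty formula can be coded directly; the proportions and costs below are made-up numbers for illustration (false negatives costed at five times false positives):

```python
def misclassification_cost(p01, p10, c01, c10):
    """Total penalty C01*P01 + C10*P10 from the cost-based splitting criterion."""
    return c01 * p01 + c10 * p10

# Illustrative (made-up) inputs: 10% of 0s misclassified, 4% of 1s misclassified
cost = misclassification_cost(p01=0.10, p10=0.04, c01=1.0, c10=5.0)
```

A split is then chosen to minimize this total penalty rather than an impurity measure.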
Solving Overfitting
Overfitting: all classification algorithms suffer from overfitting because the model is a representation of the training data rather than the actual data.

The problem is even more severe if only a few training examples exist at a particular leaf node.

The causes of the problem are understood to be (a) the presence of noise records, (b) a lack of representative examples in the training set, and (c) the repeated attribute-selection process of the algorithm.

The presence of noise increases peculiarity, thereby adding new subtrees to accommodate the training data.
One solution is to denoise the data before building the tree, but this is not practical.
Tree Pruning
The problem of non-representative training examples, too, is not exactly solvable.

The problem of repeated attribute selection stems from creating subsets whenever the gain is above a threshold; however, if the threshold is set too low, we end up with a large tree.

The problems of noise and repeated attribute selection can be addressed by tree pruning.

The act of pruning is replacing a subtree by a leaf, to make the tree smaller and more robust.
Two types of pruning methods are used: pre-pruning and post-pruning.
For example, C4.5 makes certain assumptions about the data distribution and estimates the generalization errors that the tree may make on unseen examples.

Another approach is Occam's razor: penalizing complex trees, thereby selecting the tree with the least complexity and the lowest possible error rate.

Post-pruning, on the other hand, is conducted after the tree is fully grown and requires independent validation or testing data.

Experimental studies claim that post-pruning produces more accurate and robust trees than pre-pruning.

Among the post-pruning methods are cost-complexity pruning, pessimistic pruning, path-length pruning, cross-validation pruning and reduced-error pruning.
Reduced-Error Pruning Method
1. Classify all examples in the testing dataset using the tree, and note at each non-leaf node the type and number of errors.

2. For every non-leaf node, count the number of errors if the subtree rooted at that node were replaced by a leaf node with the best possible class label.

3. Prune the subtree that yields the largest reduction in the number of errors.

4. Repeat steps 2 and 3 until further pruning would increase the number of errors.

The method above raises a few important questions.
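The steps above can be sketched on a toy tree of nested dicts (leaves are class labels, internal nodes test one attribute); this is an illustrative bottom-up variant, not the exact procedure of any particular system:

```python
from collections import Counter

def classify(tree, example):
    """Walk the tree: internal nodes are dicts testing one attribute, leaves are labels."""
    while isinstance(tree, dict):
        tree = tree["children"][example[tree["attr"]]]
    return tree

def errors(tree, examples):
    """Number of misclassified (attributes, label) pairs."""
    return sum(classify(tree, x) != y for x, y in examples)

def reduced_error_prune(tree, examples):
    """Bottom-up sweep: replace a subtree by its best (majority) leaf whenever
    that does not increase the error count on the test examples reaching it."""
    if not isinstance(tree, dict) or not examples:
        return tree
    attr = tree["attr"]
    children = {
        value: reduced_error_prune(child,
                                   [(x, y) for x, y in examples if x[attr] == value])
        for value, child in tree["children"].items()
    }
    pruned = {"attr": attr, "children": children}
    majority = Counter(y for _, y in examples).most_common(1)[0][0]
    return majority if errors(majority, examples) <= errors(pruned, examples) else pruned
```

Note the `<=` comparison: this variant prunes even when the error reduction is zero, one of the design choices discussed next.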
Issues in Reduced-Error Pruning
First, a subtree may be pruned even with zero error reduction. Should it be?

Second, there may be a number of subtrees with the same error reduction; in this case the largest subtree is pruned. (Is this done iteratively, with all subtrees considered and the bigger trees pruned first, or in a single bottom-up sweep with each smaller subtree pruned first?)

Third, should we use training data, testing data or both for pruning?

Fourth, certain parts of the tree may be irrelevant for the testing data. Should these be removed, given that they do not increase the error on the testing dataset? Is this a good strategy?

Research suggests specific solutions for each of the above.
Evaluating the Performance
Error rate = (FP + FN)/(TP + TN + FP + FN)

Accuracy rate = 1 - error rate

Confusion Matrix

                        Predicted Class
                        Positive   Negative
Actual Class  Positive  TP         FN
              Negative  FP         TN
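The two metrics follow directly from the confusion-matrix cells; the counts in the example call are made up for illustration:

```python
def error_rate(tp, tn, fp, fn):
    """(FP + FN) / (TP + TN + FP + FN) from the confusion matrix."""
    return (fp + fn) / (tp + tn + fp + fn)

def accuracy(tp, tn, fp, fn):
    """Accuracy rate = 1 - error rate."""
    return 1 - error_rate(tp, tn, fp, fn)

# Illustrative (made-up) counts: 50 TP, 40 TN, 6 FP, 4 FN -> 10 errors in 100
```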
Evaluating Performance
Holdout method: a standard division of the data into training and testing datasets

Random subsampling: repeat the holdout method multiple times; the average accuracy over all runs is taken as the accuracy

Bootstrap: sampling with replacement, unlike holdout; the data is again divided into two parts and accuracy is calculated as the average over all runs

Cross-validation: k-fold cross-validation divides the data into k equal parts; in each run one part is used for testing and the other (k-1) parts for training. The process is repeated k times and the errors are averaged across the runs.
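The k-fold scheme can be sketched as a generator of (train, test) pairs; a minimal version, without the shuffling or stratification a production library would add:

```python
def k_fold_splits(data, k):
    """Yield k (train, test) pairs; fold i serves as the test set exactly once
    and the remaining k-1 folds form the training set."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal parts
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each example appears in exactly one test fold, so every record is used for both training and testing across the k runs.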
Ensemble Methods
An ensemble method is a machine learning algorithm that generates several classifiers using different sampling strategies.

Majority voting is then used to classify a new observation using the multiple classifiers that were developed.

For a new data point, all classifiers in the ensemble are used to identify the class, and the final class is decided by majority vote if the classifiers are given equal weight.

A different weight can be assigned to each classifier based on its accuracy (adaptive boosting). The final value in these cases is obtained as a linear combination of the outputs of all the classifiers.
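Both the equal-weight and the weighted vote reduce to summing per-class scores; a minimal sketch:

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Combine classifier outputs by (weighted) majority vote.
    predictions: one class label per classifier;
    weights: optional per-classifier weights (equal weights if omitted)."""
    weights = weights or [1.0] * len(predictions)
    score = defaultdict(float)
    for label, weight in zip(predictions, weights):
        score[label] += weight
    return max(score, key=score.get)
```

With weights derived from accuracy, a single strong classifier can outvote several weak ones, which is the idea behind adaptive boosting.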
Random Forest
A random forest is a well-known ensemble method in which several trees are developed using different sampling strategies.

The most common strategy is bootstrap aggregating (bagging), which is random sampling with replacement.

A new observation is classified by using all the trees in the random forest, and majority voting is adopted.

In general, random forests are expected to provide much higher accuracy than a single tree.

The model is tested using the out-of-bag data, which is not part of the training dataset used for creating the many trees.
Steps to create a Random Forest
1. Assume the training data has N observations; generate many samples of size M (M < N) with replacement, say S such samples.

2. If the data has n predictors, sample m predictors (m < n).

3. Develop a tree for each of the samples generated in step 1, using the sample of predictors from step 2 and CART.

4. Repeat step 3 for all the samples generated in step 1.

5. Predict the class of a new observation using a majority vote.
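Steps 1, 2 and 5 can be sketched with the standard library; the tree-growing itself (step 3, CART) is left abstract here as any callable that maps an observation to a class label:

```python
import random
from collections import Counter

def bootstrap_sample(data, m):
    """Step 1: draw a sample of size m from `data` with replacement."""
    return [random.choice(data) for _ in range(m)]

def sample_predictors(predictors, m):
    """Step 2: pick m of the n predictor names without replacement."""
    return random.sample(predictors, m)

def forest_predict(trees, observation):
    """Step 5: majority vote over the predictions of all trees."""
    votes = Counter(tree(observation) for tree in trees)
    return votes.most_common(1)[0][0]
```

Records of `data` that never appear in a given bootstrap sample form that tree's out-of-bag set, which is what the previous slide uses for testing.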
Strengths / Weaknesses
Strengths

Assigns a label to an unseen record and also explains why the decision is made

Classification time depends only on the number of levels of the tree, not on the sample size

Weaknesses

High error rates when the training set contains a small number of examples spread over a large variety of different classes

Computationally expensive to build: O(|C|*|L|*|A|)
