
Decision Trees

Outline

Decision tree representation

ID3 learning algorithm
Entropy, information gain
Overfitting

Instituto Superior de Estatística e Gestão de Informação

Decision Tree for PlayTennis

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

Each internal node tests an attribute
Each branch corresponds to an attribute value
Each leaf node assigns a classification

Decision Tree for PlayTennis

Classify the new instance:

Outlook  Temperature  Humidity  Wind  PlayTennis
Sunny    Hot          High      Weak  ?

Following the branches Outlook=Sunny, then Humidity=High, leads to the leaf No.

Decision Tree for Conjunction

Outlook=Sunny ∧ Wind=Weak

Outlook
  Sunny -> Wind
    Strong -> No
    Weak -> Yes
  Overcast -> No
  Rain -> No

Decision Tree for Disjunction

Outlook=Sunny ∨ Wind=Weak

Outlook
  Sunny -> Yes
  Overcast -> Wind
    Strong -> No
    Weak -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

Decision Tree for XOR

Outlook=Sunny XOR Wind=Weak

Outlook
  Sunny -> Wind
    Strong -> Yes
    Weak -> No
  Overcast -> Wind
    Strong -> No
    Weak -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

Decision Tree

Decision trees represent disjunctions of conjunctions:

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)

When to Consider Decision Trees

Instances describable by attribute-value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data
Missing attribute values
Examples:
  Medical diagnosis
  Credit risk analysis

Top-Down Induction of Decision Trees ID3

1. A ← the best decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to
the attribute value of the branch
5. If all training examples are perfectly classified
(same value of target attribute) stop, else
iterate over the new leaf nodes.
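The steps above can be sketched as a recursive procedure (a minimal sketch; the dict-based example representation and the helper names `entropy` and `information_gain` are assumptions, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr, target):
    """Expected reduction in entropy from splitting on attr."""
    labels = [ex[target] for ex in examples]
    g = entropy(labels)
    for value in set(ex[attr] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attr] == value]
        g -= len(subset) / len(examples) * entropy(subset)
    return g

def id3(examples, attributes, target):
    """Build a decision tree as nested dicts {attribute: {value: subtree}}."""
    labels = [ex[target] for ex in examples]
    # step 5: stop if all examples carry the same target value
    if len(set(labels)) == 1:
        return labels[0]
    # no attributes left: return the majority class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # steps 1-2: pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    # steps 3-4: one branch per value, examples sorted down the branches
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)
    return tree
```

On a tiny example, `id3([{"Wind": "Weak", "Play": "Yes"}, {"Wind": "Strong", "Play": "No"}], ["Wind"], "Play")` yields the one-node tree `{"Wind": {"Weak": "Yes", "Strong": "No"}}`.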

Entropy

S is a sample of training examples

p+ is the proportion of positive examples
p- is the proportion of negative examples
Entropy measures the impurity of S
Entropy(S) = -p+ log2 p+ - p- log2 p-

Entropy

Entropy(S) = expected number of bits needed to
encode the class (+ or -) of a randomly drawn member of S
(under the optimal, shortest-length code)

Information theory: an optimal-length code assigns
-log2 p bits to a message having probability p.
So the expected number of bits to encode the class
(+ or -) of a random member of S is:
-p+ log2 p+ - p- log2 p-
(with the convention 0 log2 0 = 0)
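The formula can be written out directly (a minimal sketch; the function name is an assumption, and the 0 log 0 = 0 convention is handled explicitly):

```python
import math

def entropy(p_pos, p_neg):
    """Entropy of a two-class sample with class proportions p_pos, p_neg."""
    result = 0.0
    for p in (p_pos, p_neg):
        if p > 0:                    # convention: 0 * log2(0) = 0
            result -= p * math.log2(p)
    return result

# A pure sample has entropy 0; an even split has entropy 1 bit.
print(entropy(1.0, 0.0))   # 0.0
print(entropy(0.5, 0.5))   # 1.0
```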

Information Gain

Gain(S,A): expected reduction in entropy due to sorting S on attribute A

Gain(S,A) = Entropy(S) - Σv∈Values(A) (|Sv|/|S|) · Entropy(Sv)

Entropy([29+,35-]) = -29/64 log2(29/64) - 35/64 log2(35/64)
= 0.99

Two candidate splits of S = [29+,35-]:
A1: [21+, 5-] and [8+, 30-]
A2: [18+, 33-] and [11+, 2-]


Information Gain

Entropy([21+,5-]) = 0.71    Entropy([8+,30-]) = 0.74
Entropy([18+,33-]) = 0.94   Entropy([11+,2-]) = 0.62

Gain(S,A1) = Entropy(S) - (26/64)·Entropy([21+,5-]) - (38/64)·Entropy([8+,30-])
           = 0.27
Gain(S,A2) = Entropy(S) - (51/64)·Entropy([18+,33-]) - (13/64)·Entropy([11+,2-])
           = 0.12
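The two candidate splits can be checked numerically (a minimal sketch working directly on the (positive, negative) counts; the helper names are assumptions):

```python
import math

def entropy(pos, neg):
    """Entropy from positive/negative example counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def gain(parent, children):
    """parent and each child are (pos, neg) count pairs."""
    n = sum(parent)
    return entropy(*parent) - sum(
        (p + q) / n * entropy(p, q) for p, q in children)

# A1 splits [29+,35-] into [21+,5-] and [8+,30-]
print(round(gain((29, 35), [(21, 5), (8, 30)]), 2))   # 0.27
# A2 splits [29+,35-] into [18+,33-] and [11+,2-]
print(round(gain((29, 35), [(18, 33), (11, 2)]), 2))  # 0.12
```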


Training Examples

Day  Outlook   Temp.  Humidity  Wind    PlayTennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Weak    Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Strong  Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
Selecting the Next Attribute

S = [9+,5-], E = 0.940

Humidity:
  High   [3+,4-]  E = 0.985
  Normal [6+,1-]  E = 0.592
Wind:
  Weak   [6+,2-]  E = 0.811
  Strong [3+,3-]  E = 1.0

Gain(S,Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151
Gain(S,Wind)     = 0.940 - (8/14)·0.811 - (6/14)·1.0   = 0.048

Selecting the Next Attribute

S = [9+,5-], E = 0.940

Outlook:
  Sunny    [2+,3-]  E = 0.971
  Overcast [4+,0-]  E = 0.0
  Rain     [3+,2-]  E = 0.971

Gain(S,Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971
                = 0.247

Outlook has the highest gain, so it is selected as the root attribute.
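The gains for all four attributes can be reproduced from the training table (a minimal sketch; the tuple encoding of the table is an assumption, not from the slides):

```python
import math
from collections import Counter

DATA = [  # (Outlook, Temp, Humidity, Wind, PlayTennis), days D1..D14
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    labels = [r[-1] for r in rows]
    g = entropy(labels)
    for v in set(r[col] for r in rows):
        sub = [r[-1] for r in rows if r[col] == v]
        g -= len(sub) / len(rows) * entropy(sub)
    return g

for name, col in ATTRS.items():
    print(name, round(gain(DATA, col), 3))
# Outlook has the largest gain and becomes the root.
```

Note that Gain(S,Humidity) evaluates to 0.1518, which the slides round to 0.151.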

ID3 Algorithm

[D1,D2,…,D14]  [9+,5-]

Outlook
  Sunny    -> Ssunny = [D1,D2,D8,D9,D11]  [2+,3-]  ?
  Overcast -> [D3,D7,D12,D13]             [4+,0-]  Yes
  Rain     -> [D4,D5,D6,D10,D14]          [3+,2-]  ?

Gain(Ssunny, Humidity) = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
Gain(Ssunny, Temp.)    = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(Ssunny, Wind)     = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019
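Picking the attribute for the Sunny branch can be checked with the same count-based computation (a minimal sketch; the class counts per value are read off the training table):

```python
import math

def entropy(counts):
    """Entropy from a tuple of class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain(parent, children):
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

# (pos, neg) counts inside Ssunny = [2+,3-] for each candidate split
splits = {
    "Humidity": [(0, 3), (2, 0)],          # High, Normal
    "Temp":     [(0, 2), (1, 1), (1, 0)],  # Hot, Mild, Cool
    "Wind":     [(1, 2), (1, 1)],          # Weak, Strong
}
gains = {a: gain((2, 3), parts) for a, parts in splits.items()}
best = max(gains, key=gains.get)
print(best)   # Humidity
```

Humidity splits Ssunny into two pure subsets, so it wins the branch.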

ID3 Algorithm

Outlook
  Sunny -> Humidity    [D1,D2,D8,D9,D11]
    High -> No         [D1,D2,D8]
    Normal -> Yes      [D9,D11]
  Overcast -> Yes      [D3,D7,D12,D13]
  Rain -> Wind         [D4,D5,D6,D10,D14]
    Strong -> No       [D6,D14]
    Weak -> Yes        [D4,D5,D10]

Hypothesis Space Search ID3

(figure: ID3 performs a simple-to-complex search through the space of decision trees, successively adding attribute tests such as A1, A2, A3, A4 to partially grown trees)

Hypothesis Space Search ID3

Hypothesis space is complete!
  The target function is surely in there
Outputs a single hypothesis
No backtracking on selected attributes (greedy search)
  Local minima (suboptimal splits)
Statistically-based search choices
  Robust to noisy data
Inductive bias (search bias)
  Prefers shorter trees over longer ones
  Places high information gain attributes close to the root

Converting a Tree to Rules

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes

Continuous Valued Attributes

Create a discrete attribute to test the continuous one:
  Temperature = 24.5 °C
  (Temperature > 20.0 °C) ∈ {true, false}
Where to set the threshold?
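A common answer (a sketch, not from the slides): sort the examples by the attribute's value and evaluate the information gain of a boolean test at each midpoint between adjacent examples with different labels; `best_threshold` below is a hypothetical helper name.

```python
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_threshold(values, labels):
    """Pick the threshold t for a test (value > t) that maximizes gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2 or v1 == v2:
            continue                 # only midpoints where the class changes
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = base - (len(left) / len(pairs)) * entropy(left) \
                 - (len(right) / len(pairs)) * entropy(right)
        if g > best_gain:
            best_t, best_gain = t, g
    return best_t

# hypothetical temperature readings and whether tennis was played
print(best_threshold([40, 48, 60, 72, 80, 90],
                     ["No", "No", "Yes", "Yes", "Yes", "No"]))   # 54.0
```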

Attributes with Many Values

Problem: if an attribute has many values, maximizing InformationGain will select it.
E.g.: imagine using Date=12.7.1996 as an attribute: it perfectly splits the data into subsets of size 1.
Use GainRatio instead of information gain as the criterion:
GainRatio(S,A) = Gain(S,A) / SplitInformation(S,A)
SplitInformation(S,A) = -Σi=1..c (|Si|/|S|) log2(|Si|/|S|)
where Si is the subset of S for which attribute A has value vi
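The penalty term can be computed directly (a minimal sketch; the function names are assumptions):

```python
import math

def split_information(sizes):
    """sizes: number of examples in each subset S_i."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s)

def gain_ratio(gain, sizes):
    return gain / split_information(sizes)

# A boolean attribute splitting 64 examples evenly costs 1 bit ...
print(split_information([32, 32]))   # 1.0
# ... while a Date-like attribute with 64 distinct values costs 6 bits,
# heavily penalizing its (spuriously perfect) gain.
print(split_information([1] * 64))   # 6.0
```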


Unknown Attribute Values

What if an example is missing a value of attribute A? Possible strategies:
If node n tests A, assign the most common value of A among the other examples sorted to node n
Assign the most common value of A among the other examples with the same target value
Assign probability pi to each possible value vi of A, and assign fraction pi of the example to each descendant in the tree
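The last strategy, fractional examples, can be sketched as follows (`distribute` is a hypothetical helper; the weights are proportional to the observed value frequencies at the node):

```python
from collections import Counter

def distribute(example_weight, observed_values):
    """Split one example's weight over A's values in proportion to how
    often each value occurs among the other examples at the node."""
    counts = Counter(observed_values)
    total = sum(counts.values())
    return {v: example_weight * c / total for v, c in counts.items()}

# 6 examples at the node have Humidity=High, 4 have Humidity=Normal:
print(distribute(1.0, ["High"] * 6 + ["Normal"] * 4))
# {'High': 0.6, 'Normal': 0.4}
```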

Occam's Razor

Fewer short hypotheses than long hypotheses

A short hypothesis that fits the data is unlikely to be a coincidence
A long hypothesis that fits the data might be a coincidence

Overfitting

Consider the error of hypothesis h over
  training data: error_train(h)
  the entire distribution D of data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is an
alternative hypothesis h′ ∈ H such that
  error_train(h) < error_train(h′)
and
  error_D(h) > error_D(h′)


Boosting: Combining Classifiers

Cross Validation

k-fold Cross Validation

Divide the data set into k subsamples
Use k-1 subsamples as the training data and one subsample as the validation data
Repeat k times, choosing a different subsample as the validation set each time
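The splitting steps above can be sketched as (a minimal sketch; `k_fold_splits` is a hypothetical helper name):

```python
def k_fold_splits(data, k):
    """Yield (train, validation) pairs; each fold serves once as validation."""
    folds = [data[i::k] for i in range(k)]   # k roughly equal subsamples
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

# each of the 10 points appears in exactly one validation fold
for train, val in k_fold_splits(list(range(10)), 5):
    print(val, len(train))
```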

Bagging

Generate a random sample from the training set
Repeat this sampling procedure, getting a sequence of K independent training sets
A corresponding sequence of classifiers C1,C2,…,Ck is constructed for each of these training sets, by using the same classification algorithm
To classify an unknown sample X, let each classifier predict
The bagged classifier C* then combines the predictions of the individual classifiers to generate the final outcome (sometimes the combination is simple voting)
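The procedure can be sketched as (a minimal sketch; the majority-label "learner" is a stand-in for a real tree learner, and all names are assumptions):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Random sample of size n drawn with replacement from the training set."""
    return [rng.choice(data) for _ in data]

def bag(data, k, train, rng):
    """Train the same algorithm on K independent bootstrap samples."""
    return [train(bootstrap_sample(data, rng)) for _ in range(k)]

def bagged_predict(classifiers, x):
    """C*: combine the individual predictions by simple voting."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Toy demonstration: each 'classifier' just memorizes the majority label
# of its bootstrap sample.
def train_majority(sample):
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

rng = random.Random(0)
data = [(i, "Yes") for i in range(8)] + [(i, "No") for i in range(8, 10)]
clfs = bag(data, k=11, train=train_majority, rng=rng)
print(bagged_predict(clfs, x=None))
```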

Boosting

INTUITION
Combining the predictions of an ensemble is more accurate than a single classifier
Reasons:
It is easy to find fairly accurate rules of thumb, but hard to find a single highly accurate prediction rule
If the training examples are few and the hypothesis space is large, then there are several equally accurate classifiers
The hypothesis space may not contain the true function, but it has several good approximations
Exhaustive global search in the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers

Boosting

The final prediction is a combination of the predictions of several predictors.
Differences between boosting and the previous methods:
It is iterative
Boosting: successive classifiers depend upon their predecessors; in the previous methods the individual classifiers were independent
Training examples may have unequal weights
Look at the errors from the previous classifier step to decide how to focus the next iteration over the data
Set the weights to focus more on "hard" examples (the ones on which we made mistakes in the previous iterations)


Boosting (Algorithm)

W(x) is the distribution of weights over the N training points, Σi W(xi) = 1
Initially assign uniform weights W0(x) = 1/N for all x, step k = 0
At each iteration k:
  Find the best weak classifier Ck(x) using the weights Wk(x)
  Compute its error rate εk and, based on a loss function, the weight αk (the classifier Ck's weight in the final hypothesis)
  For each xi, update the weights based on εk to get Wk+1(xi)
CFINAL(x) = sign [ Σi αi Ci(x) ]
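The loop above, written out for decision stumps on 1-D data (a minimal sketch; the stump learner and the standard weight αk = ½ ln((1-εk)/εk) are assumptions consistent with the usual AdaBoost formulation):

```python
import math

def best_stump(xs, ys, w):
    """Weak learner: threshold test minimizing the weighted error."""
    best, best_err = None, float("inf")
    for t in xs:
        for sign in (+1, -1):
            stump = lambda x, t=t, s=sign: s if x > t else -s
            err = sum(wi for wi, x, y in zip(w, xs, ys) if stump(x) != y)
            if err < best_err:
                best, best_err = stump, err
    return best

def adaboost(xs, ys, rounds):
    """xs: floats, ys: labels in {-1,+1}. Returns [(alpha, stump), ...]."""
    n = len(xs)
    w = [1.0 / n] * n                        # W0(x) = 1/N
    ensemble = []
    for _ in range(rounds):
        stump = best_stump(xs, ys, w)        # weak classifier under Wk
        err = sum(wi for wi, x, y in zip(w, xs, ys) if stump(x) != y)
        err = max(err, 1e-10)                # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # reweight: up-weight mistakes, down-weight correct examples
        w = [wi * math.exp(-alpha * y * stump(x))
             for wi, x, y in zip(w, xs, ys)]
        z = sum(w)                           # normalization constant Zk
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    """CFINAL(x) = sign of the weighted vote."""
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [-1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])   # [-1, -1, 1, 1]
```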


18

AdaBoost

Outline

Background
Theory / Interpretations

Background

Commonly used in many areas
Simple to implement
Not prone to overfitting

Instead of resampling, AdaBoost uses training-set re-weighting

Each training sample uses a weight to determine the probability of being selected for a training set.

AdaBoost is an algorithm for constructing a strong classifier as a linear combination of simple weak classifiers


ht(x): weak or basis classifier (classifier = learner = hypothesis)
H(x): strong or final classifier

Weak classifier: < 50% error over any distribution
Strong classifier: a thresholded linear combination of the weak classifier outputs


Each training sample has a weight, which determines the probability of being selected for training the component classifier

Find the Weak Classifier


The Algorithm Core: Reweighting

Wk+1(i) ∝ Wk(i) · exp(-αk yi hk(xi))

If yi · hk(xi) = +1 (correctly classified), the weight decreases
If yi · hk(xi) = -1 (misclassified), the weight increases
Reweighting

In this way, AdaBoost focuses on the informative or difficult examples.
Algorithm Recapitulation

(a sequence of slides stepping through the AdaBoost algorithm for t = 1, 2, …; the figures are not recoverable)

ROUND 1, ROUND 2, ROUND 3

(figures showing three boosting rounds on a toy example; not recoverable)


Pros of AdaBoost

Very simple to implement
Does feature selection, resulting in a relatively simple classifier
Fairly good generalization