
Case-Based Reasoning

Lecture 3: CBR Case-Base Indexing


Outline
Indexing CBR case knowledge
Why might we want an index?
Decision tree indexes
C4.5 algorithm
Summary
Why might we want an index?
Efficiency
  Similarity matching is computationally expensive for large case-bases
  Similarity matching can be computationally expensive for complex case representations
Relevancy of cases for similarity matching
  Some features of the new problem may make certain cases irrelevant, even though they are otherwise very similar
With an index:
  Cases are pre-selected from the case-base
  Similarity matching is applied only to this subset of cases

What to index?
Example case (client record):
Client Ref #: 64
Client Name: John Smith
Address: 39 Union Street
Tel: 01224 665544
Photo: (image)
Age: 37
Occupation: IT Analyst
Income: 20000

Case features are either:
- Indexed (here: Age, Occupation, Income)
- Unindexed (here: Client Ref #, Client Name, Address, Tel, Photo)
Indexed vs Unindexed Features
Indexed features:
  are used for retrieval
  are predictive of the case's solution
Unindexed features:
  are not used for retrieval
  are not predictive of the case's solution
  provide valuable contextual information and lessons learned

Playing Tennis Example (case-base)
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Cloudy Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Cloudy Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Cloudy Mild High True Yes
Cloudy Hot Normal False Yes
Rainy Mild High True No
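
For the sketches that follow in these notes, it is convenient to hold the case base as a plain Python list of dicts. This representation (and the exact field names) is an assumption made for illustration, not something prescribed by the lecture:

    # The 14 tennis cases as plain dicts; "Play" is the class (solution) attribute.
    CASES = [
        {"Outlook": "Sunny",  "Temperature": "Hot",  "Humidity": "High",   "Windy": False, "Play": "No"},
        {"Outlook": "Sunny",  "Temperature": "Hot",  "Humidity": "High",   "Windy": True,  "Play": "No"},
        {"Outlook": "Cloudy", "Temperature": "Hot",  "Humidity": "High",   "Windy": False, "Play": "Yes"},
        {"Outlook": "Rainy",  "Temperature": "Mild", "Humidity": "High",   "Windy": False, "Play": "Yes"},
        {"Outlook": "Rainy",  "Temperature": "Cool", "Humidity": "Normal", "Windy": False, "Play": "Yes"},
        {"Outlook": "Rainy",  "Temperature": "Cool", "Humidity": "Normal", "Windy": True,  "Play": "No"},
        {"Outlook": "Cloudy", "Temperature": "Cool", "Humidity": "Normal", "Windy": True,  "Play": "Yes"},
        {"Outlook": "Sunny",  "Temperature": "Mild", "Humidity": "High",   "Windy": False, "Play": "No"},
        {"Outlook": "Sunny",  "Temperature": "Cool", "Humidity": "Normal", "Windy": False, "Play": "Yes"},
        {"Outlook": "Rainy",  "Temperature": "Mild", "Humidity": "Normal", "Windy": False, "Play": "Yes"},
        {"Outlook": "Sunny",  "Temperature": "Mild", "Humidity": "Normal", "Windy": True,  "Play": "Yes"},
        {"Outlook": "Cloudy", "Temperature": "Mild", "Humidity": "High",   "Windy": True,  "Play": "Yes"},
        {"Outlook": "Cloudy", "Temperature": "Hot",  "Humidity": "Normal", "Windy": False, "Play": "Yes"},
        {"Outlook": "Rainy",  "Temperature": "Mild", "Humidity": "High",   "Windy": True,  "Play": "No"},
    ]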
Decision Tree (Index) for Playing Tennis

outlook
  sunny  -> humidity
              high   -> No
              normal -> Yes
  cloudy -> Yes
  rainy  -> windy
              true  -> No
              false -> Yes
Choosing the Root Attribute
Class distribution of the 14 cases under each candidate root attribute:

humidity:     high -> 3 Yes, 4 No        normal -> 6 Yes, 1 No
temperature:  hot -> 2 Yes, 2 No         mild -> 4 Yes, 2 No         cool -> 3 Yes, 1 No
outlook:      sunny -> 2 Yes, 3 No       cloudy -> 4 Yes, 0 No       rainy -> 3 Yes, 2 No
windy:        true -> 3 Yes, 3 No        false -> 6 Yes, 2 No

Which attribute is best for the root of the tree?
- the one that gives the best information gain
- in this case outlook (as we are going to see)
Building Decision Trees: C4.5 Algorithm
Based on information theory (Shannon, 1948)
Divide-and-conquer strategy:
  Choose an attribute for the root node
  Create a branch for each value of that attribute
  Split the cases according to the branches
  Repeat the process for each branch until all cases in the branch have the same class

Assumption:
  the simplest tree which classifies the cases is best
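
A minimal Python sketch of this divide-and-conquer loop (an illustration of the idea only, not the full C4.5 implementation, which also uses gain ratio and pruning; it assumes cases are dicts like the CASES list shown after the case-base table, with the class attribute passed in as target):

    import math
    from collections import Counter

    def entropy(cases, target):
        """Entropy of the class distribution of a set of cases."""
        counts = Counter(c[target] for c in cases)
        total = len(cases)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    def gain(cases, attribute, target):
        """Expected reduction in entropy from splitting on `attribute`."""
        total = len(cases)
        expectation = 0.0
        for value in set(c[attribute] for c in cases):
            subset = [c for c in cases if c[attribute] == value]
            expectation += (len(subset) / total) * entropy(subset, target)
        return entropy(cases, target) - expectation

    def build_tree(cases, attributes, target):
        """Divide and conquer: stop when a branch is pure, else split on the best attribute."""
        classes = set(c[target] for c in cases)
        if len(classes) == 1:                     # all cases in this branch share a class
            return classes.pop()                  # leaf: the class label
        if not attributes:                        # no attributes left: majority-class leaf
            return Counter(c[target] for c in cases).most_common(1)[0][0]
        best = max(attributes, key=lambda a: gain(cases, a, target))
        tree = {best: {}}
        for value in set(c[best] for c in cases):  # one branch per value of the chosen attribute
            subset = [c for c in cases if c[best] == value]
            remaining = [a for a in attributes if a != best]
            tree[best][value] = build_tree(subset, remaining, target)
        return tree

With the CASES list above, build_tree(CASES, ["Outlook", "Temperature", "Humidity", "Windy"], "Play") reproduces the tree on the previous slide, with Outlook at the root.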
Entropy of a set of cases
Playing Tennis Example:
S is the set of 14 cases
We want to classify the cases according to the values of
Play, i.e., Yes and No in this example.
the proportion of Yes cases is 9 out of 14: 9/14 = 0.64
the proportion of No cases is 5 out of 14: 5/14 = 0.36
The Entropy measures the impurity of S
Entropy(S) = -0.64 * log2(0.64) - 0.36 * log2(0.36)
           = -0.64 * (-0.644) - 0.36 * (-1.474) = 0.41 + 0.53 = 0.94
(Figure: the case-base table of 14 cases, with the 9 Yes cases and 5 No cases marked.)
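
The same Entropy(S) calculation checked in a couple of lines of Python:

    import math
    print(-9/14 * math.log2(9/14) - 5/14 * math.log2(5/14))   # about 0.94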
Entropy of a set of cases
S is a set of cases
A is a feature (Play in the example)
{S_1, ..., S_i, ..., S_n} are the partitions of S according to the values of A (Yes and No in the example)
{p_1, ..., p_i, ..., p_n} are the proportions of {S_1, ..., S_i, ..., S_n} in S

Entropy(S) = - Σ_{i=1..n} p_i * log2(p_i)
Gain of an attribute
Calculate Gain(S, A) for each attribute A
  the expected reduction in entropy due to splitting on A
Choose the attribute with the highest gain as the root of the tree

Gain(S, A) = Entropy(S) - Expectation(A)
           = Entropy(S) - Σ_{i=1..n} (|S_i| / |S|) * Entropy(S_i)

{S_1, ..., S_i, ..., S_n} = partitions of S according to the values of attribute A
n = number of values of attribute A
|S_i| = number of cases in the partition S_i
|S| = total number of cases in S
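
The same definitions transcribed into Python, working directly on lists of class labels rather than full cases (a sketch of the formulas above, not library code; the label lists in the example are assumptions taken from the tennis case base):

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels (the p_i are the label proportions)."""
        total = len(labels)
        return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

    def gain(all_labels, partitions):
        """Entropy(S) minus the size-weighted entropy of the partitions S_1 ... S_n."""
        total = len(all_labels)
        expectation = sum(len(part) / total * entropy(part) for part in partitions)
        return entropy(all_labels) - expectation

    # Splitting the 14 Play labels by Outlook: Sunny (2 Yes, 3 No), Cloudy (4 Yes), Rainy (3 Yes, 2 No)
    all_labels = ["Yes"] * 9 + ["No"] * 5
    partitions = [["Yes"] * 2 + ["No"] * 3, ["Yes"] * 4, ["Yes"] * 3 + ["No"] * 2]
    print(round(gain(all_labels, partitions), 3))   # about 0.247, as computed on the following slides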


Which attribute is root?
If Outlook is made root of the tree
There are 3 partitions of the cases:
S_1 for Sunny, S_2 for Cloudy, S_3 for Rainy

S_1 (Sunny) = {cases 1, 2, 8, 9, 11}
|S_1| = 5
In these 5 cases the values for Play are 3 No and 2 Yes
Entropy(S_1) = -2/5 * log2(2/5) - 3/5 * log2(3/5) = 0.97

Similarly
Entropy(S_2) = 0
Entropy(S_3) = 0.97
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Cloudy Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Cloudy Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Cloudy Mild High True Yes
Cloudy Hot Normal False Yes
Rainy Mild High True No
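
A quick numeric check of these partition entropies (plain Python, using only the counts given above):

    import math
    print(-2/5 * math.log2(2/5) - 3/5 * math.log2(3/5))   # Entropy(S_1), about 0.97
    # Entropy(S_2) = 0 because the Cloudy partition is pure (4 Yes, 0 No)
    # Entropy(S_3) is also about 0.97 (3 Yes, 2 No gives the same split as 2 Yes, 3 No)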
Which attribute is root?
Gain(S, Outlook) = Entropy(S) - Expectation(Outlook)
                 = Entropy(S) - [ (|S_1|/|S|) * Entropy(S_1) + (|S_2|/|S|) * Entropy(S_2) + (|S_3|/|S|) * Entropy(S_3) ]
                 = 0.94 - [5/14 * 0.97 + 4/14 * 0 + 5/14 * 0.97]
                 = 0.247

Similarly
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Windy) = 0.048

Gain(S, Outlook) is the highest gain
Outlook should be the root of the decision tree (index)
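
The same arithmetic as a short check in Python (nothing new, just the Expectation term and the subtraction):

    entropy_s = 0.94
    expectation_outlook = 5/14 * 0.97 + 4/14 * 0 + 5/14 * 0.97
    print(round(entropy_s - expectation_outlook, 3))   # 0.247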
Repeat for Sunny Node
Splitting the 5 Sunny cases on each remaining attribute:

temperature:  hot -> No, No          mild -> No, Yes      cool -> Yes      (not pure)
windy:        false -> No, No, Yes   true -> No, Yes                       (not pure)
humidity:     high -> No, No, No     normal -> Yes, Yes                    (both branches pure)

Humidity gives pure branches, so it is chosen for the Sunny node:

outlook
  sunny  -> humidity
              high   -> No
              normal -> Yes
  cloudy -> Yes
  rainy  -> ?
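
A small standalone sketch of the same step in Python, restricted to the five Sunny cases (the case values are copied from the case-base table; the helper functions repeat the earlier sketch so this snippet runs on its own):

    import math
    from collections import Counter

    SUNNY = [  # the five Sunny cases (class attribute "Play")
        {"Temperature": "Hot",  "Humidity": "High",   "Windy": False, "Play": "No"},
        {"Temperature": "Hot",  "Humidity": "High",   "Windy": True,  "Play": "No"},
        {"Temperature": "Mild", "Humidity": "High",   "Windy": False, "Play": "No"},
        {"Temperature": "Cool", "Humidity": "Normal", "Windy": False, "Play": "Yes"},
        {"Temperature": "Mild", "Humidity": "Normal", "Windy": True,  "Play": "Yes"},
    ]

    def entropy(cases):
        total = len(cases)
        counts = Counter(c["Play"] for c in cases)
        return -sum(n / total * math.log2(n / total) for n in counts.values())

    def gain(cases, attribute):
        total = len(cases)
        expectation = 0.0
        for value in set(c[attribute] for c in cases):
            subset = [c for c in cases if c[attribute] == value]
            expectation += len(subset) / total * entropy(subset)
        return entropy(cases) - expectation

    for attribute in ("Temperature", "Humidity", "Windy"):
        print(attribute, round(gain(SUNNY, attribute), 3))
    # Humidity wins (gain about 0.97): its High branch is all No and its Normal branch is all Yes.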
Repeat for Rainy Node
outlook
  sunny  -> humidity
              high   -> No
              normal -> Yes
  cloudy -> Yes
  rainy  -> ?

The 5 Rainy cases still to be classified:

Temperature  Humidity  Windy  Play
Mild         High      False  Yes
Cool         Normal    False  Yes
Cool         Normal    True   No
Mild         Normal    False  Yes
Mild         High      True   No

Splitting on Windy gives pure branches (false -> Yes, true -> No), so Windy is chosen for the Rainy node.

Decision Tree (Index) for Playing Tennis

outlook
  sunny  -> humidity
              high   -> No
              normal -> Yes
  cloudy -> Yes
  rainy  -> windy
              true  -> No
              false -> Yes
Case Retrieval via DTree Index
Typical implementation:
  The case-base is indexed using a decision tree
  The DTree is created from the cases (automated indexing of the case-base)
  Cases are stored in the leaves of the index
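
A minimal sketch of this retrieval pattern (the nested-dict tree, the case identifiers, and the traversal are all simplifications for illustration; a real index would be built by C4.5 as above): the new problem is first dropped down the index to a leaf, and similarity matching is then applied only to the cases stored at that leaf.

    # Index: internal nodes test an attribute; leaves hold the cases that reached them.
    # (Hand-built toy index matching the tennis example.)
    INDEX = {
        "attribute": "Outlook",
        "branches": {
            "Sunny":  {"attribute": "Humidity",
                       "branches": {"High":   {"cases": ["case 1", "case 2", "case 8"]},
                                    "Normal": {"cases": ["case 9", "case 11"]}}},
            "Cloudy": {"cases": ["case 3", "case 7", "case 12", "case 13"]},
            "Rainy":  {"attribute": "Windy",
                       "branches": {True:  {"cases": ["case 6", "case 14"]},
                                    False: {"cases": ["case 4", "case 5", "case 10"]}}},
        },
    }

    def retrieve(index, new_problem):
        """Walk the decision-tree index to a leaf, then return its cases for similarity matching."""
        node = index
        while "cases" not in node:                  # stop at a leaf
            value = new_problem[node["attribute"]]  # follow the branch for the problem's value
            node = node["branches"][value]
        return node["cases"]                        # k-NN / similarity matching runs on these only

    print(retrieve(INDEX, {"Outlook": "Sunny", "Humidity": "Normal", "Windy": False}))
    # ['case 9', 'case 11']  -> only these cases go to the (expensive) similarity matcher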
Summary
The decision tree is built from the cases
Decision trees are often used for problem-solving directly; in CBR, the decision tree is used to partition the cases
Similarity matching is applied only to the cases in the relevant leaf node
Indexing pre-selects relevant cases for k-NN retrieval
BRING CALCULATOR on MONDAY
