Data Mining: Concepts and Techniques (3rd ed.)
Chapter 6: Mining Frequent Patterns, Associations, and Correlations
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.
Chapter outline: Basic Concepts · Applications · Summary
Basic Concepts: Frequent Patterns and Association Rules

[Figure: a transaction table (Tid 10–50, items bought) alongside a diagram of the overlap between "Customer buys diaper" and "Customer buys beer".]

- confidence, c: conditional probability that a transaction having X also contains Y

Example. Let minsup = 50%, minconf = 50%.
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more!): Beer → Diaper (support 60%, confidence 100%)
A long pattern contains a combinatorial number of sub-patterns: {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns! E.g., with min_sup = 1, a single transaction ⟨a1, …, a100⟩ makes every one of these sub-patterns frequent.
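The arithmetic above is easy to confirm with exact integer arithmetic; a quick check (ours, not from the slides):

```python
# Count the non-empty sub-patterns of a 100-item itemset {a1, ..., a100}:
# C(100,1) + C(100,2) + ... + C(100,100) should equal 2^100 - 1.
from math import comb

n = 100
total = sum(comb(n, k) for k in range(1, n + 1))
assert total == 2**n - 1
print(f"{total:.3e}")  # prints 1.268e+30 (the slide rounds to 1.27 * 10^30)
```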
The Apriori Algorithm: An Example (Sup_min = 2)

Database TDB:

  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3 → L1: {A}:2, {B}:3, {C}:3, {E}:3
Generate C2 from L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2 → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
Generate C3 from L2: {B,C,E}
3rd scan → L3: {B,C,E}:2
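The level-wise loop above can be sketched in a few lines. This is our own compact rendering of Apriori over the example TDB (the helper names `support` and the list `L` are ours, not the book's code):

```python
from itertools import combinations

# Apriori sketch over the slide's example DB, min_sup = 2.
DB = {10: {"A", "C", "D"}, 20: {"B", "C", "E"}, 30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
MIN_SUP = 2

def support(itemset):
    """Count transactions containing every item of `itemset`."""
    return sum(1 for t in DB.values() if itemset <= t)

# L1: frequent 1-itemsets (D drops out with support 1)
items = sorted({i for t in DB.values() for i in t})
L = [{frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUP}]

# Level-wise: generate C(k+1) by joining Lk, prune, then count support.
while L[-1]:
    prev = L[-1]
    cands = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    # Apriori pruning: every k-subset of a candidate must itself be frequent.
    cands = {c for c in cands
             if all(frozenset(s) in prev for s in combinations(c, len(c) - 1))}
    L.append({c for c in cands if support(c) >= MIN_SUP})

L = [lk for lk in L if lk]
for k, lk in enumerate(L, 1):
    print(k, sorted("".join(sorted(s)) for s in lk))
```

Running this reproduces the slide: L1 = {A, B, C, E}, L2 = {AC, BC, BE, CE}, L3 = {BCE}.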
Implementation of Apriori

How to generate candidates?
- Step 1: self-joining L_k
- Step 2: pruning

Example of candidate generation. L3 = {abc, abd, acd, ace, bcd}.
- Self-joining L3*L3: abcd (from abc and abd), acde (from acd and ace)
- Pruning: acde is removed because ade is not in L3
- C4 = {abcd}
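The join-then-prune step can be sketched directly on that example; `apriori_gen` below is our own name for the two-step procedure, using sorted tuples as itemsets:

```python
from itertools import combinations

# Candidate generation (self-join + prune) on the slide's example:
# L3 = {abc, abd, acd, ace, bcd}.
L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]

def apriori_gen(Lk):
    """Join two k-itemsets sharing their first k-1 items, then prune."""
    Lk = sorted(Lk)
    joined = []
    for i, p in enumerate(Lk):
        for q in Lk[i + 1:]:
            if p[:-1] == q[:-1]:          # self-join step
                joined.append(p + (q[-1],))
    frequent = set(Lk)
    # prune step: drop a candidate if any k-subset is not in Lk
    return [c for c in joined
            if all(tuple(s) in frequent for s in combinations(c, len(c) - 1))]

print(apriori_gen(L3))  # [('a', 'b', 'c', 'd')]  -- acde pruned since ade not in L3
```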
How to count supports of candidates? Candidate itemsets are stored in a hash tree, and each transaction's subsets are found by probing the tree.

[Figure: a hash tree of candidate 3-itemsets (leaves such as 234, 567, 145, 136, 124, 457, 125, 458, 345, 356, 357, 689, 367, 368, 159) probed with transaction {1 2 3 5 6}; interior nodes branch on a hash of the next item (e.g. 1,4,7 / 2,5,8 / 3,6,9), with subset enumeration 1+2356, 12+356, 13+56, ….]
Step 2: pruning

  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck
Partitioning: any itemset that is frequent in DB must be frequent in at least one partition of DB. Equivalently, if sup_k(i) < σ·|DB_k| for every partition DB_k (DB1, DB2, …, DBK), then sup(i) < σ·|DB|.
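A small numeric check of this property, on toy partitions of our own choosing (illustrative only, not the book's data):

```python
# Partition property: if sup_k(i) < sigma * |DB_k| in EVERY partition,
# then sup(i) < sigma * |DB| -- so i cannot be globally frequent.
sigma = 0.5
DB1 = [{"a", "b"}, {"b"}, {"b", "c"}, {"c"}]   # partition 1
DB2 = [{"b"}, {"a", "c"}, {"c"}]               # partition 2
DB = DB1 + DB2

def sup(db, item):
    """Number of transactions in db containing item."""
    return sum(1 for t in db if item in t)

for item in "abc":
    local = all(sup(p, item) < sigma * len(p) for p in (DB1, DB2))
    if local:                                   # infrequent in every partition...
        assert sup(DB, item) < sigma * len(DB)  # ...hence infrequent globally
        print(item, "cannot be globally frequent")
```

This is the contrapositive the two-scan partition algorithm relies on: scan 1 mines each in-memory partition for locally frequent itemsets, scan 2 counts only those candidates globally.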
DHP: Reduce the Number of Candidates

A k-itemset whose hashing bucket count is below the threshold cannot be frequent.

Candidates: a, b, c, d, e

  Hash table (count | itemsets in bucket):
    35 | {ab, ad, ae}
    88 | {bd, be, de}
     … |  …
    10 | {yz, qs, wt}

Frequent 1-itemsets: a, b, d, e. E.g., if the count of the bucket holding {ab, ad, ae} is below the support threshold, none of ab, ad, ae needs to become a candidate 2-itemset.
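A DHP-style sketch on a toy DB (our own transactions and a deliberately simple deterministic hash function, for illustration only):

```python
from itertools import combinations

# DHP idea: during the 1-itemset scan, hash every 2-itemset of each
# transaction into a small table; a bucket with count < min_sup cannot
# contain any frequent 2-itemset, so its pairs are pruned from C2.
DB = [{"a", "b", "d"}, {"a", "b", "e"}, {"b", "d", "e"}, {"a", "c"}]
MIN_SUP = 2
N_BUCKETS = 7

def bucket(pair):
    a, b = sorted(pair)
    return (ord(a) * 31 + ord(b)) % N_BUCKETS  # toy deterministic hash

counts = [0] * N_BUCKETS
item_sup = {}
for t in DB:
    for i in t:
        item_sup[i] = item_sup.get(i, 0) + 1
    for pair in combinations(sorted(t), 2):
        counts[bucket(pair)] += 1

freq1 = {i for i, s in item_sup.items() if s >= MIN_SUP}        # {a, b, d, e}
C2 = [p for p in combinations(sorted(freq1), 2)
      if counts[bucket(p)] >= MIN_SUP]                          # bucket filter
print(C2)
```

Note the filter is conservative: a hash collision can let an infrequent pair survive into C2 (to be counted and rejected later), but it never drops a frequent pair.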
DIC: Reduce the Number of Scans

Once both A and D are determined frequent, the counting of AD begins; once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.

[Figure: an itemset lattice ({} … A … AC, BC, AD, BD, CD …) over a transaction stream, contrasting Apriori (one candidate length per full scan: 1-itemsets, then 2-itemsets, …) with DIC (counting of 1-itemsets, 2-itemsets, 3-itemsets started at different points within earlier scans).]

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97.
The FP-Growth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)

- Depth-first search
- Major philosophy: grow long patterns from short ones using only locally frequent items
Construct the FP-tree from a Transaction Database (min_support = 3)

  Items bought                 | (ordered) frequent items
  {f, a, c, d, g, i, m, p}     | {f, c, a, m, p}
  {a, b, c, f, l, m, o}        | {f, c, a, b, m}
  {b, f, h, j, o, w}           | {f, b}
  {b, c, k, s, p}              | {c, b, p}
  {a, f, c, e, l, p, m, n}     | {f, c, a, m, p}

Steps:
1. Scan DB once, find frequent 1-itemsets (single-item patterns).
2. Sort frequent items in frequency-descending order to get the f-list: F-list = f-c-a-b-m-p.
3. Scan DB again, inserting each transaction's ordered frequent items to construct the FP-tree.

Header table: f:4, c:4, a:3, b:3, m:3, p:3 (each entry heads the node-links to its occurrences in the tree).

[Figure: the resulting FP-tree, root {} with main path f:4–c:3–a:3–m:2–p:2, a branch b:1–m:1 under a:3, b:1 under f:4, and a second path c:1–b:1–p:1 directly under the root.]
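The two-scan construction above can be sketched as follows; the `Node` class and `header` dict are our own minimal representation (not the book's code), with the f-list taken from the slide:

```python
from collections import defaultdict

# FP-tree construction for the slide's five transactions, min_support = 3.
DB = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
FLIST = ["f", "c", "a", "b", "m", "p"]  # frequent items, frequency-descending

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

root = Node(None, None)
header = defaultdict(list)  # header table: item -> its nodes (node-links)

for trans in DB:
    node = root
    # keep only frequent items, order by the f-list, then insert the path
    for item in sorted((i for i in trans if i in FLIST), key=FLIST.index):
        if item not in node.children:
            node.children[item] = Node(item, node)
            header[item].append(node.children[item])
        node = node.children[item]
        node.count += 1

print({i: n.count for i, n in root.children.items()})  # {'f': 4, 'c': 1}
```

Shared prefixes are merged with counts incremented, which is exactly why the tree is compact: the two identical ordered transactions {f, c, a, m, p} share one path with counts bumped to 2.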
Partition Patterns and Databases: Conditional Pattern Bases

Starting from the header table (f:4, c:4, a:3, b:3, m:3, p:3), traverse each item's node-links and accumulate the prefix paths of its nodes:

  item | conditional pattern base
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1
From Conditional Pattern Bases to Conditional FP-trees

For each pattern base, accumulate the count of each item and build the FP-tree of the base. E.g., m's conditional pattern base (fca:2, fcab:1) yields the m-conditional FP-tree {} → f:3 → c:3 → a:3 (b is dropped: its count 1 is below min_support). All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam.
Recursion: Mining Each Conditional FP-tree

The m-conditional FP-tree ({} → f:3 → c:3 → a:3) is itself mined recursively:
- Cond. pattern base of cm: (f:3) → cm-conditional FP-tree: {} → f:3
- Cond. pattern base of am: (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
- Cond. pattern base of cam: (f:3) → cam-conditional FP-tree: {} → f:3
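The whole recursion can be sketched without explicit tree pointers by carrying conditional pattern bases as (prefix, count) lists; this is our own compact variant of the same divide-and-conquer (the function name `fp_growth` and the list representation are ours, not the book's implementation):

```python
from collections import Counter

# FP-growth-style recursion over the slide's DB (min_sup = 3): mine patterns
# ending in each frequent item, then recurse on its conditional pattern base.
MIN_SUP = 3
DB = [(t, 1) for t in (
    ("f", "c", "a", "m", "p"),   # transactions already projected onto
    ("f", "c", "a", "b", "m"),   # frequent items in f-list order
    ("f", "b"),
    ("c", "b", "p"),
    ("f", "c", "a", "m", "p"),
)]

def fp_growth(patterns, suffix, out):
    sup = Counter()
    for items, count in patterns:
        for i in set(items):
            sup[i] += count
    for item in [i for i in sup if sup[i] >= MIN_SUP]:
        new_suffix = (item,) + suffix
        out[new_suffix] = sup[item]
        # conditional pattern base of `item`: the prefix before it, with count
        cond = [(items[:items.index(item)], count)
                for items, count in patterns if item in items]
        fp_growth([p for p in cond if p[0]], new_suffix, out)

result = {}
fp_growth(DB, (), result)
print(sorted(p for p in result if p[-1] == "m"))
```

On this DB the patterns ending in m come out exactly as on the slide: m, fm, cm, am, fcm, fam, cam, fcam, each with support 3.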
A Special Case: Single Prefix Path in the FP-tree

Suppose a (conditional) FP-tree T has a single prefix path shared by the whole tree, e.g. ⟨a1:n1, a2:n2, a3:n3⟩ above the branching node. Mining can be decomposed: mine the single prefix-path part and the multipath part (with branches such as b1:m1, C1:k1, C2:k2, C3:k3 under root r1) separately, then concatenate the results.

[Figure: the tree r1 with prefix path a1:n1–a2:n2–a3:n3 above branches b1:m1–C2:k2 and C1:k1–C3:k3, split into the two parts.]
Benefits of the FP-tree structure: completeness and compactness. Scaling FP-growth by DB projection when the tree does not fit in memory: parallel projection vs. partition projection.
Partition-Based Projection

  Tran. DB:    fcamp, fcabm, fb, cbp, fcamp
  p-proj DB:   fcam, cb, fcam        m-proj DB: fcab, fca, fca      …
  am-proj DB:  fc, fc, fc            cm-proj DB: f, f, f            …
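Projection is just prefix extraction once transactions are in f-list order; a minimal sketch over the slide's DB (the helper `project` is our own name):

```python
# Projected databases for the slide's transactions: the alpha-projected DB
# keeps, for each transaction containing alpha, the items BEFORE alpha in
# f-list order. Projections nest: am-proj is project(m-proj, "a").
DB = [["f", "c", "a", "m", "p"],
      ["f", "c", "a", "b", "m"],
      ["f", "b"],
      ["c", "b", "p"],
      ["f", "c", "a", "m", "p"]]

def project(db, item):
    """Prefix of each transaction containing `item`, up to (excluding) it."""
    return [t[:t.index(item)] for t in db if item in t]

m_proj = project(DB, "m")        # [['f','c','a'], ['f','c','a','b'], ['f','c','a']]
am_proj = project(m_proj, "a")   # [['f','c'], ['f','c'], ['f','c']]
cm_proj = project(m_proj, "c")   # [['f'], ['f'], ['f']]
print(am_proj, cm_proj)
```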
[Figure: runtime (sec., 0–100) vs. support threshold (%, 0–2.5) on dataset D1, comparing FP-growth and Apriori.]

[Figure: runtime (sec., 0–140) vs. support threshold (%, 0–2) on dataset D2, comparing FP-growth and TreeProjection.]
Why is FP-growth fast?

- Divide-and-conquer: decompose both the mining task and the DB according to the frequent patterns obtained so far, leading to a focused search of smaller databases
- Other factors: compressed database (FP-tree structure); basic ops are counting local frequent items and building sub-FP-trees, with no pattern search and matching

Example of the divide-and-conquer decomposition (Min_sup = 2):

  TID | Items
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f

F-list: c-e-f-a-d. The search is partitioned into: patterns having d; patterns having a but no d; and so on down the f-list.
Mining max-patterns: an example

  Tid | Items
  10  | A, B, C, D, E
  20  | B, C, D, E
  30  | A, C, D, F

Frequent items: A, B, C, D, E. Potential max-patterns are checked scan by scan; since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan.
Visualization of Association Rules (SGI/MineSet 3.0)

[Figure: MineSet 3.0 screenshot visualizing association rules.]
Interestingness Measure: Correlation (Lift)

  lift = P(A ∪ B) / (P(A) × P(B))

Contingency table (5000 transactions):

             | Basketball | Not basketball | Sum (row)
  Cereal     | 2000       | 1750           | 3750
  Not cereal | 1000       | 250            | 1250
  Sum (col.) | 3000       | 2000           | 5000

  lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

So playing basketball and eating cereal are negatively correlated (lift < 1), even though the rule has high support and confidence.
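The two lift values are easy to recompute from the table; a quick check (our own helper, using the slide's counts):

```python
# Lift for the basketball/cereal contingency table:
# lift(A, B) = P(A and B) / (P(A) P(B)); lift < 1 means negative correlation.
N = 5000
basketball, cereal, both = 3000, 3750, 2000

def lift(n_ab, n_a, n_b, n=N):
    return (n_ab / n) / ((n_a / n) * (n_b / n))

print(round(lift(both, basketball, cereal), 2))      # 0.89
print(round(lift(1000, basketball, N - cereal), 2))  # 1.33
```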
Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02).
Null-Invariant Measures: Comparison of Interestingness Measures

            | Milk  | No Milk | Sum (row)
  Coffee    | m, c  | ~m, c   | c
  No coffee | m, ~c | ~m, ~c  | ~c
  Sum (col.)| m     | ~m      | Σ

Null-transactions w.r.t. m and c are those containing neither milk nor coffee (the ~m, ~c cell). A measure is null-invariant if its value is unaffected by the number of null-transactions. The Kulczynski measure (1927),

  Kulc(A, B) = (1/2) (P(A|B) + P(B|A)),

is null-invariant.
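A small numeric illustration of null-invariance, on counts of our own choosing (not the book's tables): vary only the ~m, ~c cell and watch lift swing while Kulc stays put.

```python
# Kulc(m, c) depends only on the m/c cells; lift also depends on the
# number of null-transactions (~m, ~c).
def kulc(mc, m_only, c_only):
    return 0.5 * (mc / (mc + m_only) + mc / (mc + c_only))

def lift(mc, m_only, c_only, null):
    n = mc + m_only + c_only + null
    return (mc / n) / (((mc + m_only) / n) * ((mc + c_only) / n))

mc, m_only, c_only = 100, 1000, 1000
for null in (100, 100000):           # vary only the null-transactions
    print(round(lift(mc, m_only, c_only, null), 3),
          round(kulc(mc, m_only, c_only), 3))
```

With 100 null-transactions lift is below 1 (negative correlation); with 100,000 it jumps far above 1, while Kulc stays at the same value throughout.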
Summary
Applications: weblog mining, collaborative filtering, bioinformatics.