Mining
UNIT III:
Association Rule Mining Single Dimensional Boolean Association Rules
from Transactional Databases Multi-Level Association Rules from
Transaction Databases. Association Rules: Basic Concepts -Efficient
and Scalable Frequent Item set Mining Methods - Mining Various Kinds
of Association Rules
Classification: Decision tree induction - Bayesian classification.
Prediction: Linear Regression
Regression-Based methods
Nonlinear
Regression
Other
Basic Concepts
Summary
Applications
A long pattern contains a combinatorial number of subpatterns, e.g., {a1, , a100} contains (1001) + (1002) + +
(110000) = 2100 1 = 1.27*1030 sub-patterns!
Let min_sup = 1
{a1, , a100}: 1
{a1, , a50}: 2
{a1, , a100}: 1
Basic Concepts
Summary
Method:
Items
10
A, C, D
20
B, C, E
30
A, B, C, E
40
B, E
L2
Supmin = 2 Itemset
C1
1st scan
C2
sup
{A}
{B}
{C}
{D}
{E}
Itemset
sup
1
Itemset
sup
{A, B}
{A, C}
{A, C}
{B, C}
{A, E}
{B, E}
{B, C}
{C, E}
{B, E}
{C, E}
C3
Itemset
{B, C, E}
3rd scan
L3
L1
Itemset
sup
{A}
{B}
{C}
{E}
C2
2nd scan
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}
Itemset
sup
{B, C, E}
2
12
Implementation of Apriori
Step 1: self-joining Lk
Step 2: pruning
Example of Candidate-generation
Self-joining: L3*L3
Pruning:
C4 = {abcd}
13
Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD 00)
Depth-first search
Major philosophy: Grow long patterns from short ones using local
frequent items only
20
Items bought
(ordered) frequent items
{f, a, c, d, g, i, m, p}
{f, c, a, m, p}
{a, b, c, f, l, m, o}
{f, c, a, b, m}
min_support = 3
{b, f, h, j, o, w}
{f, b}
{b, c, k, s, p}
{c, b, p}
{}
{a, f, c, e, l, p, m, n}
{f, c, a, m, p}
Header Table
1. Scan DB once, find
f:4
c:1
Item frequency head
frequent 1-itemset
f
4
(single item pattern)
c:3 b:1 b:1
c
4
2. Sort frequent items in
a
3
frequency descending
b
3
a:3
p:1
m
3
order, f-list
p
3
m:2 b:1
3. Scan DB again,
construct FP-tree
F-list = f-c-a-b-m-p p:2 m:1
21
Header Table
Item frequency head
f
4
c
4
a
3
b
3
m
3
p
3
f:4
c:3
c:1
b:1
a:3
b:1
p:1
m:2
b:1
p:2
m:1
f:3
fc:3
fca:2, fcab:1
fcam:2, cb:1
23
Header Table
Item frequency head
f
4
c
4
a
3
b
3
m
3
p
3
{}
f:4
c:3
c:1
b:1
a:3
b:1
p:1
{}
f:3
m:2
b:1
c:3
p:2
m:1
a:3
All frequent
patterns relate to m
m,
fm, cm, am,
fcm, fam, cam,
fcam
m-conditional FP-tree
24
f:3
Cond. pattern base of am: (fc:3)
c:3
f:3
c:3
a:3
am-conditional FP-tree
{}
f:3
m-conditional FP-tree
cm-conditional FP-tree
{}
{}
a1:n1
a2:n2
a3:n3
b1:m1
C2:k2
r1
{}
C1:k1
C3:k3
r1
a1:n1
a2:n2
a3:n3
b1:m1
C2:k2
C1:k1
C3:k3
26
Completeness
Compactness
DB projection
Parallel projection
Partition projection
Divide-and-conquer:
Other factors
Basic ops: counting local freq items and building sub FPtree, no pattern search and matching
Min_sup=2
TID
Items
Patterns having d
10
a, c, d, e, f
20
a, b, e
30
c, e, f
40
a, c, d, f
Flist: d-a-f-e-c
Mining Multi-Dimensional
Association
Single-dimensional rules:
buys(X, milk) buys(X, bread)
Basic Concepts
Summary
36
P( A B)
lift
P ( A) P( B)
lift ( B, C )
2000 / 5000
0.89
3000 / 5000 * 3750 / 5000
lift ( B, C )
1000 / 5000
1.33
3000 / 5000 *1250 / 5000
Basketba
ll
Not
basketball
Sum
(row)
Cereal
2000
1750
3750
Not
cereal
1000
250
1250
Sum(col.)
3000
2000
5000
37