
Data Mining

Association Analysis: Basic Concepts and Algorithms

Lecture Notes for Session 5
Delivered by Dr. K. Satyanarayan Reddy
Based on Introduction to Data Mining by Tan, Steinbach, Kumar


Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke


Example of Association Rules:

{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

Itemset
A collection of one or more items
  Example: {Milk, Bread, Diaper}
k-itemset: an itemset that contains k items

Support count (σ)
Frequency of occurrence of an itemset
E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
Fraction of transactions that contain an itemset
E.g. s({Milk, Bread, Diaper}) = 2/5
(counted over the market-basket transactions shown above)

Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
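To make the definitions concrete, here is a minimal sketch (not part of the original slides) that computes the support count σ and the support s over the five market-basket transactions above; the function names are illustrative.

transactions = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Diaper", "Beer", "Eggs"}),
    frozenset({"Milk", "Diaper", "Beer", "Coke"}),
    frozenset({"Bread", "Milk", "Diaper", "Beer"}),
    frozenset({"Bread", "Milk", "Diaper", "Coke"}),
]

def support_count(itemset):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}))  # 2
print(support({"Milk", "Bread", "Diaper"}))        # 0.4 (= 2/5)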


Definition: Association Rule

Association Rule
An implication expression of the form X → Y, where X and Y are itemsets
Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
Support (s): fraction of transactions that contain both X and Y
Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
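As a small illustrative sketch (not from the slides), both metrics can be computed from support counts; support_count is the helper sketched earlier.

def rule_metrics(antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    s = support_count(both) / len(transactions)
    c = support_count(both) / support_count(antecedent)
    return s, c

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}))  # (0.4, 0.666...)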

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
  support ≥ minsup threshold
  confidence ≥ minconf threshold

Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!


Mining Association Rules


Example of Rules (for the market-basket transactions above):

{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

Observations:
All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but
can have different confidence
Thus, we may decouple the support and confidence requirements

Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   Generate all itemsets whose support ≥ minsup
2. Rule Generation
   Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.


Frequent Itemset Generation


[Figure: the itemset lattice over items A–E, from the null set at the top down to ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.

Frequent Itemset Generation

Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database (the transactions with TIDs 1-5 above), matching each transaction of width w against every one of the M candidates

Complexity ~ O(NMw), where N is the number of transactions => expensive, since M = 2^d !!!



Computational Complexity

Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:

R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1

If d = 6, R = 602 rules
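A quick sanity check of the rule-count formula, sketched in Python (not from the original slides):

from math import comb

def rule_count(d):
    """Count rules with non-empty, disjoint antecedent and consequent over d items."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(rule_count(d))           # 602
print(3**d - 2**(d + 1) + 1)   # 602, matching the closed form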


Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
  Complete search: M = 2^d
  Use pruning techniques to reduce M

Reduce the number of transactions (N)
  Reduce size of N as the size of itemset increases
  Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
  Use efficient data structures to store the candidates or transactions
  No need to match every candidate against every transaction


Reducing Number of Candidates

Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)

Support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support

Illustrating Apriori Principle


[Figure: itemset lattice over items A–E. The itemset AB is found to be infrequent, so all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned.]

Illustrating Apriori Principle


Items (1-itemsets):

Item     Count
Bread    4
Coke     2
Milk     4
Beer     3
Diaper   4
Eggs     1

Minimum Support = 3

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):

Itemset          Count
{Bread,Milk}     3
{Bread,Beer}     2
{Bread,Diaper}   3
{Milk,Beer}      2
{Milk,Diaper}    3
{Beer,Diaper}    3

Triplets (3-itemsets):

Itemset               Count
{Bread,Milk,Diaper}   3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates

Apriori Algorithm

Method:
Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified:
  Generate length (k+1) candidate itemsets from length-k frequent itemsets
  Prune candidate itemsets containing subsets of length k that are infrequent
  Count the support of each candidate by scanning the DB
  Eliminate candidates that are infrequent, leaving only those that are frequent
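A minimal sketch of this loop in Python (illustrative only; it follows the steps above but is not the book's reference implementation):

from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset: support_count} for all frequent itemsets."""
    items = sorted({i for t in transactions for i in t})
    current = {frozenset([i]) for i in items}   # candidate 1-itemsets
    result = {}
    k = 1
    while current:
        # Count the support of each candidate by scanning the DB
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= minsup_count}
        result.update(level)
        frequent = set(level)
        # Generate length (k+1) candidates from length-k frequent itemsets,
        # pruning those that contain an infrequent length-k subset
        k += 1
        current = set()
        for a in frequent:
            for b in frequent:
                cand = a | b
                if len(cand) == k and all(frozenset(s) in frequent
                                          for s in combinations(cand, k - 1)):
                    current.add(cand)
    return result

txns = [frozenset(t) for t in (["Bread","Milk"],
                               ["Bread","Diaper","Beer","Eggs"],
                               ["Milk","Diaper","Beer","Coke"],
                               ["Bread","Milk","Diaper","Beer"],
                               ["Bread","Milk","Diaper","Coke"])]
print(apriori(txns, minsup_count=3))  # includes frozenset({'Bread','Milk','Diaper'}): 3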


Reducing Number of Comparisons

Candidate counting:
Scan the database of transactions to determine the support of each candidate itemset
To reduce the number of comparisons, store the candidates in a hash structure
Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets

[Figure: transactions (TIDs 1–5) are hashed into the buckets of a hash structure that stores the candidate itemsets.]

Generate Hash Tree


Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
Hash function
Max leaf size: max number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)

[Figure: the resulting candidate hash tree. The hash function sends items 1, 4, 7 to the left branch, items 2, 5, 8 to the middle branch, and items 3, 6, 9 to the right branch; the leaf nodes hold the candidate 3-itemsets.]
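A small sketch of the hash function used in this example (item mod 3 picks one of the three branches); the tree-building code itself is omitted and the helper name is illustrative:

def hash_branch(item):
    """Map an item to a branch: 1,4,7 -> 0 (left); 2,5,8 -> 1 (middle); 3,6,9 -> 2 (right)."""
    return (item - 1) % 3

# Hashing candidate {1 4 5} level by level: the first item picks the root branch,
# the second item the next branch, and so on until a leaf is reached.
print([hash_branch(i) for i in (1, 4, 5)])  # [0, 0, 1]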

Association Rule Discovery: Hash Tree

[Figure, repeated over three slides: the candidate hash tree with the hash function mapping 1, 4, 7 to the left branch, 2, 5, 8 to the middle branch, and 3, 6, 9 to the right branch. The three slides highlight the subtrees reached when hashing on 1, 4 or 7; on 2, 5 or 8; and on 3, 6 or 9, respectively.]

Subset Operation
Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3?

[Figure: the subsets are enumerated level by level by fixing a prefix. Level 1: 1, 2, 3; Level 2: 12, 13, 15, 23, 25, 35; Level 3: the ten size-3 subsets 123, 125, 126, 135, 136, 156, 235, 236, 256, 356.]
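For reference, the same enumeration can be obtained with a one-liner (a sketch, not from the slides):

from itertools import combinations

t = (1, 2, 3, 5, 6)
print([''.join(map(str, c)) for c in combinations(t, 3)])
# ['123', '125', '126', '135', '136', '156', '235', '236', '256', '356']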

Subset Operation Using Hash Tree

[Figure, developed over three slides: transaction 1 2 3 5 6 is hashed down the candidate hash tree. At the root it is split into 1+ 2356, 2+ 356 and 3+ 56; at the next level 1+ 2356 expands into 12+ 356, 13+ 56 and 15+ 6; and so on until the relevant leaf nodes are reached and their candidates are compared with the transaction.]

The transaction is matched against only 11 of the 15 candidates.

Factors Affecting Complexity

Choice of minimum support threshold
  lowering the support threshold results in more frequent itemsets
  this may increase the number of candidates and the max length of frequent itemsets

Dimensionality (number of items) of the data set
  more space is needed to store the support count of each item
  if the number of frequent items also increases, both computation and I/O costs may also increase

Size of database
  since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
  transaction width increases with denser data sets
  this may increase the max length of frequent itemsets and the traversals of the hash tree (the number of subsets in a transaction increases with its width)


Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have the same support as their supersets.

[Table: 15 transactions over 30 items A1–A10, B1–B10, C1–C10. Transactions 1–5 contain exactly the items A1–A10, transactions 6–10 contain exactly B1–B10, and transactions 11–15 contain exactly C1–C10.]

Number of frequent itemsets = 3 \times \sum_{k=1}^{10} \binom{10}{k}

Need a compact representation.

Maximal Frequent Itemset


An itemset is maximal frequent if none of its immediate supersets is frequent.

[Figure: itemset lattice over items A–E with a border separating the frequent itemsets from the infrequent ones; the maximal frequent itemsets are the frequent itemsets lying just inside the border.]

Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset.

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset   Support
{A}       4
{B}       5
{C}       3
{D}       4
{A,B}     4
{A,C}     2
{A,D}     3
{B,C}     3
{B,D}     4
{C,D}     3

Itemset     Support
{A,B,C}     2
{A,B,D}     3
{A,C,D}     2
{B,C,D}     3
{A,B,C,D}   2
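A hedged sketch (not from the slides) that derives both notions directly from the definitions, using the small transaction database above; the helpers are re-implemented locally so the snippet is self-contained:

from itertools import combinations

db = [frozenset("AB"), frozenset("BCD"), frozenset("ABCD"),
      frozenset("ABD"), frozenset("ABCD")]
items = sorted({i for t in db for i in t})

def sup(x):
    return sum(1 for t in db if x <= t)

minsup = 2
frequent = {frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(frozenset(c)) >= minsup}

def immediate_supersets(x):
    return [x | {i} for i in items if i not in x]

# Maximal frequent: no immediate superset is frequent
maximal = {x for x in frequent
           if not any(s in frequent for s in immediate_supersets(x))}
# Closed: no immediate superset has the same support
closed = {x for x in frequent
          if not any(sup(s) == sup(x) for s in immediate_supersets(x))}

print(len(frequent), len(closed), len(maximal))  # 15 6 1 for this toy database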

Maximal vs Closed Itemsets


TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: the itemset lattice over A–E annotated with the IDs of the transactions that support each itemset (e.g. A: 1,2,4; AB: 1,2; ABC: 1,2); itemsets supported by no transaction are marked as not supported by any transactions.]

Maximal vs Closed Frequent Itemsets


Minimum support = 2

[Figure: the same annotated lattice with the frequent-itemset border drawn for minsup = 2; closed-but-not-maximal and closed-and-maximal itemsets are highlighted.]

# Closed = 9
# Maximal = 4

Maximal vs Closed Itemsets

[Figure: nested sets; maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.]

Alternative Methods for Frequent Itemset Generation

Traversal of Itemset Lattice
General-to-specific vs Specific-to-general

[Figure: three lattice-traversal strategies and their frequent itemset borders: (a) general-to-specific, (b) specific-to-general, (c) bidirectional.]

Alternative Methods for Frequent Itemset Generation

Traversal of Itemset Lattice
Equivalence Classes

[Figure: the itemset lattice over A–D partitioned into equivalence classes: (a) prefix tree, (b) suffix tree.]

Alternative Methods for Frequent Itemset Generation

Traversal of Itemset Lattice
Breadth-first vs Depth-first

[Figure: (a) breadth-first and (b) depth-first traversal of the itemset lattice.]

Alternative Methods for Frequent Itemset Generation

Representation of Database
Horizontal vs vertical data layout

Horizontal Data Layout:
TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical Data Layout:
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

FP-growth Algorithm

Use a compressed representation of the database using an FP-tree
Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets


FP-tree construction
Transaction Database:
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

After reading TID=1:  null → A:1 → B:1
After reading TID=2:  two branches from null; A:1 → B:1 and B:1 → C:1 → D:1

FP-Tree Construction
[Figure: the complete FP-tree after reading all 10 transactions. Paths from the root: null → A:7 → B:5 → C:3 → D:1; A:7 → B:5 → D:1; A:7 → C:1 → D:1 → E:1; A:7 → D:1 → E:1; null → B:3 → C:3 → D:1; B:3 → C:3 → E:1. A header table with one entry per item (A–E) keeps pointers to the nodes of that item; the pointers are used to assist frequent itemset generation.]
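A compact, illustrative sketch of FP-tree construction (insertion only; the mining step and the header-table pointers are omitted, and this is not the book's reference code):

class FPNode:
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, item_order):
    """Insert each transaction (items sorted by item_order) into a counted prefix tree."""
    root = FPNode()
    for t in transactions:
        node = root
        for item in sorted(t, key=item_order.index):
            child = node.children.setdefault(item, FPNode(item))
            child.count += 1
            node = child
    return root

def dump(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        dump(child, depth + 1)

db = ["AB", "BCD", "ACDE", "ADE", "ABC", "ABCD", "BC", "ABC", "ABD", "BCE"]
tree = build_fp_tree(db, item_order="ABCDE")
dump(tree)  # prints A:7, B:5, C:3, D:1, ... matching the figure above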

FP-growth

[Figure: the subtree of the FP-tree reached by following the node pointers for item D.]

Conditional Pattern Base for D:
P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}

Recursively apply FP-growth on P.

Frequent itemsets found (with support > 1): AD, BD, CD, ACD, BCD

Tree Projection
Set enumeration tree:

[Figure: set enumeration tree over items A–E rooted at null. Each node can only be extended with items that follow it in lexicographic order, e.g. possible extensions E(A) = {B,C,D,E} and E(ABC) = {D,E}.]

Tree Projection
Items are listed in lexicographic order
Each node P stores the following information:
  Itemset for node P
  List of possible lexicographic extensions of P: E(P)
  Pointer to the projected database of its ancestor node
  Bitvector containing information about which transactions in the projected database contain the itemset

Projected Database
Original Database:
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

Projected Database for node A:
TID  Items
1    {B}
2    {}
3    {C,D,E}
4    {D,E}
5    {B,C}
6    {B,C,D}
7    {}
8    {B,C}
9    {B,D}
10   {}

For each transaction T, the projected transaction at node A is T ∩ E(A).
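A one-line sketch of the projection step (assumed helper names; not from the slides):

def project(transaction, extensions):
    """Projected transaction at a node = T intersected with the node's extensions E(P)."""
    return set(transaction) & set(extensions)

E_A = set("BCDE")                 # extensions of node A in lexicographic order
print(project(set("ACDE"), E_A))  # {'C', 'D', 'E'}  (TID 3 projected at node A)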

ECLAT

For each item, store a list of transaction ids (tids), its TID-list.

For the horizontal data layout shown earlier (TIDs 1–10), the vertical data layout is:

A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6

ECLAT

Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets.

A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
A ∩ B → AB: 1, 5, 7, 8

3 traversal approaches: top-down, bottom-up and hybrid
Advantage: very fast support counting
Disadvantage: intermediate tid-lists may become too large for memory
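A minimal sketch of ECLAT-style support counting via tid-list intersection (illustrative names, not the original implementation):

tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
}

tid_AB = tidlists["A"] & tidlists["B"]   # intersect the two (k-1)-subset tid-lists
print(sorted(tid_AB), len(tid_AB))       # [1, 5, 7, 8] 4 -> support count of {A,B}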


Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D,  ABD → C,  ACD → B,  BCD → A,
A → BCD,  B → ACD,  C → ABD,  D → ABC,
AB → CD,  AC → BD,  AD → BC,  BC → AD,
BD → AC,  CD → AB

If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L).
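A sketch of enumerating these 2^k - 2 candidate rules (not the book's code; it simply splits the itemset into every non-trivial antecedent/consequent pair):

from itertools import combinations

def candidate_rules(itemset):
    """Yield (antecedent, consequent) pairs for every non-empty proper subset."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

rules = list(candidate_rules({"A", "B", "C", "D"}))
print(len(rules))   # 14 = 2**4 - 2
print(rules[0])     # (('A',), ('B', 'C', 'D'))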

Rule Generation

How to efficiently generate rules from frequent itemsets?

In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)

But the confidence of rules generated from the same itemset does have an anti-monotone property, e.g., for L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.

Rule Generation for Apriori Algorithm


Lattice of rules:

[Figure: the lattice of rules for the itemset {A,B,C,D}, from ABCD ⇒ {} at the top down to rules of the form A ⇒ BCD at the bottom. If CD ⇒ AB is found to be a low-confidence rule, all of its descendants in the lattice (the rules whose consequents are supersets of AB) are pruned.]

Rule Generation for Apriori Algorithm

A candidate rule is generated by merging two rules that share the same prefix in the rule consequent:

join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC

Prune the rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence.

Effect of Support Distribution

Many real data sets have a skewed support distribution.

[Figure: support distribution of a retail data set.]

Effect of Support Distribution

How to set the appropriate minsup threshold?
If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
If minsup is set too low, it is computationally expensive and the number of itemsets is very large
Using a single minimum support threshold may not be effective.


Multiple Minimum Support

How to apply multiple minimum supports?
MS(i): minimum support for item i
e.g.: MS(Milk)=5%, MS(Coke)=3%, MS(Broccoli)=0.1%, MS(Salmon)=0.5%
MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%

Challenge: Support is no longer anti-monotone.
Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%.
Then {Milk, Coke} is infrequent but {Milk, Coke, Broccoli} is frequent.
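A tiny sketch of the item-specific thresholds and the resulting anti-monotonicity violation (values taken from the slide; helper names are illustrative):

MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

def itemset_minsup(itemset):
    """Minimum support of an itemset = the smallest MS among its items."""
    return min(MS[i] for i in itemset)

sup = {frozenset({"Milk", "Coke"}): 0.015,
       frozenset({"Milk", "Coke", "Broccoli"}): 0.005}

for s, v in sup.items():
    print(sorted(s), v >= itemset_minsup(s))
# ['Coke', 'Milk'] False (infrequent); ['Broccoli', 'Coke', 'Milk'] True (frequent)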


Multiple Minimum Support

Item  MS(I)   Sup(I)
A     0.10%   0.25%
B     0.20%   0.26%
C     0.30%   0.29%
D     0.50%   0.05%
E     3%      4.20%

[Figure, two slides: the candidate 1-itemsets and 2-itemsets over items A–E; the second slide shows which itemsets survive when each item uses its own minimum support.]

Multiple Minimum Support (Liu 1999)

Order the items according to their minimum support (in ascending order)
e.g.: MS(Milk)=5%, MS(Coke)=3%, MS(Broccoli)=0.1%, MS(Salmon)=0.5%
Ordering: Broccoli, Salmon, Coke, Milk

Need to modify Apriori such that:
L1: set of frequent items
F1: set of items whose support is ≥ MS(1), where MS(1) = min_i(MS(i))
C2: candidate itemsets of size 2 are generated from F1 instead of L1


Multiple Minimum Support (Liu 1999)

Modifications to Apriori:
In traditional Apriori,
  a candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
  the candidate is pruned if it contains any infrequent subsets of size k

The pruning step has to be modified:
  Prune only if the subset contains the first item
  e.g.: Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support)
  {Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent
  The candidate is not pruned because {Coke, Milk} does not contain the first item, i.e., Broccoli.


Pattern Evaluation

Association rule algorithms tend to produce too many rules
  many of them are uninteresting or redundant
  redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence

Interestingness measures can be used to prune/rank the derived patterns

In the original formulation of association rules, support & confidence are the only measures used.


Application of Interestingness Measure


[Figure: where interestingness measures fit in the mining process; they are used to prune and rank the patterns produced from the data before they are presented to the user.]

Computing Interestingness Measure

Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

          Y      not Y
X        f11     f10     f1+
not X    f01     f00     f0+
         f+1     f+0     |T|

f11: support count of X and Y
f10: support count of X and not Y
f01: support count of not X and Y
f00: support count of not X and not Y

Used to define various measures: support, confidence, lift, Gini, J-measure, etc.


Drawback of Confidence
          Coffee   not Coffee
Tea         15          5        20
not Tea     75          5        80
            90         10       100

Association Rule: Tea → Coffee

Confidence = P(Coffee | Tea) = 15/20 = 0.75
but P(Coffee) = 0.9
Although the confidence is high, the rule is misleading: P(Coffee | not Tea) = 75/80 = 0.9375

Statistical Independence

Population of 1000 students
  600 students know how to swim (S)
  700 students know how to bike (B)
  420 students know how to swim and bike (S,B)

P(S ∧ B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

P(S ∧ B) = P(S) × P(B) => statistical independence
P(S ∧ B) > P(S) × P(B) => positively correlated
P(S ∧ B) < P(S) × P(B) => negatively correlated

Statistical-based Measures

Measures that take into account statistical dependence:

Lift = \frac{P(Y \mid X)}{P(Y)}

Interest = \frac{P(X,Y)}{P(X)\,P(Y)}

PS = P(X,Y) - P(X)\,P(Y)

\phi\text{-coefficient} = \frac{P(X,Y) - P(X)\,P(Y)}{\sqrt{P(X)\,[1-P(X)]\,P(Y)\,[1-P(Y)]}}
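A hedged sketch computing these measures from a 2×2 contingency table (using the tea/coffee counts from the earlier slide; the function name is illustrative):

from math import sqrt

def measures(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    p_xy, p_x, p_y = f11 / n, (f11 + f10) / n, (f11 + f01) / n
    lift = (p_xy / p_x) / p_y          # = P(Y|X) / P(Y), same value as interest
    ps = p_xy - p_x * p_y
    phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
    return lift, ps, phi

print(measures(15, 5, 75, 5))   # lift ≈ 0.833 for Tea -> Coffee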

Example: Lift / Interest


          Coffee   not Coffee
Tea         15          5        20
not Tea     75          5        80
            90         10       100

Association Rule: Tea → Coffee

Confidence = P(Coffee | Tea) = 0.75
but P(Coffee) = 0.9
Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)

Drawback of Lift & Interest

        Y    not Y                   Y    not Y
X      10      0      10      X     90      0      90
not X   0     90      90      not X  0     10      10
       10     90     100            90     10     100

Lift = 0.1 / ((0.1)(0.1)) = 10        Lift = 0.9 / ((0.9)(0.9)) = 1.11

Statistical independence:
If P(X,Y) = P(X)P(Y) => Lift = 1

There are lots of measures proposed in the literature
Some measures are good for certain applications, but not for others
What criteria should we use to determine whether a measure is good or bad?
What about Apriori-style support-based pruning? How does it affect these measures?

Properties of A Good Measure

Piatetsky-Shapiro:
Three properties a good measure M must satisfy:
  M(A,B) = 0 if A and B are statistically independent
  M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
  M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged


Comparing Different Measures


10 examples of contingency tables:

Example   f11    f10    f01    f00
E1        8123     83    424   1370
E2        8330      2    622   1046
E3        9481     94    127    298
E4        3954   3080      5   2961
E5        2886   1363   1320   4431
E6        1500   2000    500   6000
E7        4000   2000   1000   3000
E8        4000   2000   2000   2000
E9        1720   7121      5   1154
E10         61   2483      4   7452

[Figure: rankings of these contingency tables under the various measures.]

Property under Variable Permutation


        B    not B                 A    not A
A       p      q              B    p      r
not A   r      s              not B q     s

Does M(A,B) = M(B,A)?

Symmetric measures:
  support, lift, collective strength, cosine, Jaccard, etc.
Asymmetric measures:
  confidence, conviction, Laplace, J-measure, etc.

Property under Row/Column Scaling


Grade-Gender Example (Mosteller, 1968):

        Male  Female                   Male  Female
High      2      3      5      High      4      30     34
Low       1      4      5      Low       2      40     42
          3      7     10                6      70     76
                               (Male column ×2, Female column ×10)

Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.

Property under Inversion Operation

[Figure: item vectors for transactions 1..N, panels (a), (b), (c). The inversion operation flips every 0 to 1 and every 1 to 0 in a vector; a measure is invariant under inversion if its value does not change when both item vectors are inverted.]

Example: φ-Coefficient

The φ-coefficient is analogous to the correlation coefficient for continuous variables.

        Y    not Y                   Y    not Y
X      60     10      70      X     20     10     30
not X  10     20      30      not X 10     60     70
       70     30     100            30     70    100

φ = (0.6 - 0.7×0.7) / sqrt(0.7×0.3×0.7×0.3) = 0.5238
φ = (0.2 - 0.3×0.3) / sqrt(0.7×0.3×0.7×0.3) = 0.5238

The φ coefficient is the same for both tables.

Property under Null Addition

        B    not B                  B    not B
A       p      q              A     p      q
not A   r      s              not A r      s+k

Invariant measures:
  support, cosine, Jaccard, etc.
Non-invariant measures:
  correlation, Gini, mutual information, odds ratio, etc.

Different Measures have Different Properties


[Table: 21 measures with their symbols, ranges, and whether they satisfy properties P1–P3 and O1–O4 (independence, monotonicity, symmetry, row/column scaling, inversion, null addition). Measures listed: correlation (φ), lambda, odds ratio, Yule's Q, Yule's Y, Cohen's kappa, mutual information, J-measure, Gini index, support, confidence, Laplace, conviction, interest, IS (cosine), Piatetsky-Shapiro's, certainty factor, added value, collective strength, Jaccard, and Klosgen's. Different measures satisfy different subsets of the properties.]

Support-based Pruning

Most of the association rule mining algorithms use the support measure to prune rules and itemsets.

Study the effect of support pruning on the correlation of itemsets:
  Generate 10000 random contingency tables
  Compute support and pairwise correlation for each table
  Apply support-based pruning and examine the tables that are removed


Effect of Support-based Pruning

[Figure: histogram of the pairwise correlations of all item pairs ("All Itempairs"), with correlation on the x-axis from -1 to 1.]

Effect of Support-based Pruning


[Figure: correlation histograms of the item pairs with support < 0.01, support < 0.03, and support < 0.05.]

Support-based pruning eliminates mostly negatively correlated itemsets.

Effect of Support-based Pruning

Investigate how support-based pruning affects other measures.

Steps:
  Generate 10000 contingency tables
  Rank each table according to the different measures
  Compute the pair-wise correlation between the measures


Effect of Support-based Pruning


Without Support Pruning (All Pairs):

[Figure: matrix of pairwise correlations between the measures, plus a scatter plot between the correlation and Jaccard measures. Red cells indicate that the correlation between the pair of measures is > 0.85.]

40.14% of the measure pairs have correlation > 0.85.

Effect of Support-based Pruning


With support-based pruning (0.5% ≤ support ≤ 50%, i.e. 0.005 ≤ support ≤ 0.500):

[Figure: pairwise-correlation matrix of the measures and the scatter plot between the correlation and Jaccard measures for the surviving tables.]

61.45% of the measure pairs have correlation > 0.85.

Effect of Support-based Pruning


With support-based pruning (0.5% ≤ support ≤ 30%, i.e. 0.005 ≤ support ≤ 0.300):

[Figure: pairwise-correlation matrix of the measures and the scatter plot between the correlation and Jaccard measures for the surviving tables.]

76.42% of the measure pairs have correlation > 0.85.

Subjective Interestingness Measure

Objective measure:
  Rank patterns based on statistics computed from data
  e.g., 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.)

Subjective measure:
  Rank patterns according to the user's interpretation
  A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin)
  A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin)


Interestingness via Unexpectedness

Need to model the expectation of users (domain knowledge):
  + : pattern expected to be frequent
  - : pattern expected to be infrequent
  (from the data: patterns found to be frequent or infrequent)

[Figure: combining the expected labels with the observed frequencies separates expected patterns from unexpected patterns.]

Need to combine the expectation of users with evidence from the data (i.e., the extracted patterns).


Interestingness via Unexpectedness

Web Data (Cooley et al 2001):
  Domain knowledge in the form of site structure
  Given an itemset F = {X1, X2, ..., Xk} (Xi: Web pages)
    L: number of links connecting the pages
    lfactor = L / (k × (k-1))
    cfactor = 1 (if the graph is connected), 0 (disconnected graph)

  Structure evidence = cfactor × lfactor

  Usage evidence = \frac{P(X_1 \cap X_2 \cap \cdots \cap X_k)}{P(X_1 \cup X_2 \cup \cdots \cup X_k)}

Use Dempster-Shafer theory to combine domain knowledge and evidence from the data.

THANK YOU
