
# Association Rules

Frequent Itemsets
A-priori Algorithm

## The Market-Basket Model

- A large set of *items*, e.g., things sold in a supermarket.
- A large set of *baskets*, each of which is a small set of the items, e.g., the things one customer buys on one day.

## Support

- Simplest question: find sets of items that appear "frequently" in the baskets.
- *Support* for itemset I = the number of baskets containing all items in I.
- Given a support threshold s, sets of items that appear in at least s baskets are called *frequent itemsets*.


## Example

- Items = {milk, coke, pepsi, beer, juice}.
- B1 = {m, c, b}, B2 = {m, p, j}, B3 = {m, b}, B4 = {c, j},
  B5 = {m, p, b}, B6 = {m, c, b, j}, B7 = {c, b, j}, B8 = {b, c}.
- Frequent itemsets (for support threshold s = 3): {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
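The frequent itemsets above can be reproduced by brute-force counting. A minimal sketch in Python, assuming a support threshold of s = 3 (which matches the answer given on the slide):

```python
from collections import Counter
from itertools import combinations

# The eight example baskets from the slide.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
s = 3  # assumed support threshold

# Count every singleton and pair by brute force (fine at this scale).
counts = Counter()
for basket in baskets:
    for size in (1, 2):
        counts.update(combinations(sorted(basket), size))

frequent = {itemset for itemset, c in counts.items() if c >= s}
# frequent == {('b',), ('c',), ('j',), ('m',),
#              ('b', 'c'), ('b', 'm'), ('c', 'j')}
```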

## Applications --- (1)

- Real market baskets: chain stores keep terabytes of data about what customers buy together.
  - Tells how typical customers navigate stores, lets them position tempting items.
  - Suggests tie-in "tricks," e.g., run a sale on diapers and raise the price of beer.
- High support needed, or no $$'s.
## Applications --- (2)

- "Baskets" = documents; "items" = words in those documents.
  - Lets us find words that appear together unusually frequently, i.e., linked concepts.
- "Baskets" = sentences; "items" = documents containing those sentences.
  - Items that appear together too often could represent plagiarism.
## Applications --- (3)

- "Baskets" = Web pages; "items" = linked pages.
  - Pairs of pages with many common references may be about the same topic.
- "Baskets" = Web pages p; "items" = pages that link to p.
  - Pages with many of the same links may be mirrors or about the same topic.
## Important Point

- "Market baskets" is an abstraction that models any many-many relationship between two concepts: "items" and "baskets."
  - Items need not be "contained" in baskets.
- The only difference is that we count co-occurrences of items related to a basket, not vice versa.
## Scale of Problem

- WalMart sells 100,000 items and can store billions of baskets.
- The Web has over 100,000,000 words and billions of pages.

## Association Rules

- If-then rules about the contents of baskets.
- {i1, i2, ..., ik} → j means: "if a basket contains all of i1, ..., ik, then it is likely to contain j."
- *Confidence* of this association rule is the probability of j given i1, ..., ik.

## Example

- B1 = {m, c, b}, B2 = {m, p, j}, B3 = {m, b}, B4 = {c, j},
  B5 = {m, p, b}, B6 = {m, c, b, j}, B7 = {c, b, j}, B8 = {b, c}.
- An association rule: {m, b} → c.
  - Confidence = 2/4 = 50%.
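The confidence computation can be sketched directly from the definition (the function name is illustrative):

```python
def confidence(lhs, rhs, baskets):
    """Confidence of lhs -> rhs: fraction of baskets containing lhs
    that also contain rhs."""
    lhs = set(lhs)
    with_lhs = [b for b in baskets if lhs <= b]
    with_both = [b for b in with_lhs if rhs in b]
    return len(with_both) / len(with_lhs)

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
print(confidence({"m", "b"}, "c", baskets))  # 0.5 (2 of the 4 {m, b} baskets)
```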

## Interest

- The *interest* of an association rule is the absolute value of the amount by which the confidence differs from what you would expect, were items selected independently of one another.

## Example

- B1 = {m, c, b}, B2 = {m, p, j}, B3 = {m, b}, B4 = {c, j},
  B5 = {m, p, b}, B6 = {m, c, b, j}, B7 = {c, b, j}, B8 = {b, c}.
- For association rule {m, b} → c, item c appears in 5/8 of the baskets.
- Interest = |2/4 - 5/8| = 1/8 --- not very interesting.
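The interest calculation for this rule, spelled out as a short check:

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

# Confidence of {m, b} -> c: fraction of {m, b}-baskets containing c.
with_mb = [b for b in baskets if {"m", "b"} <= b]
conf = sum("c" in b for b in with_mb) / len(with_mb)        # 2/4

# What we'd expect if c were independent of {m, b}.
fraction_c = sum("c" in b for b in baskets) / len(baskets)  # 5/8

interest = abs(conf - fraction_c)
print(interest)  # 0.125, i.e., 1/8
```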

## Relationships Among Measures

- Rules with high support and confidence may be useful even if they are not "interesting."
  - We don't care whether buying one item causes people to buy the other, or whether simply a lot of people buy both.
- But high interest suggests a cause that might be worth investigating.

## Finding Association Rules

- A typical question: "find all association rules with support ≥ s and confidence ≥ c."
  - Note: "support" of an association rule is the support of the set of items it mentions.
- Hard part: finding the high-support (*frequent*) itemsets.
  - Checking the confidence of association rules involving those sets is relatively easy.
## Computation Model

- Typically, data is kept in a "flat file" rather than a database system.
  - Stored on disk, basket-by-basket.
  - Expand baskets into pairs, triples, etc. as you read the baskets.

## Computation Model --- (2)

- The true cost of mining disk-resident data is usually the number of disk I/O's.
- In practice, association-rule algorithms read the data in passes --- all baskets read in turn.
- Thus, we measure the cost by the number of passes an algorithm takes.

## Main-Memory Bottleneck

- In many algorithms to find frequent itemsets we need to worry about how main memory is used.
  - As we read baskets, we need to count something, e.g., occurrences of pairs.
  - The number of different things we can count is limited by main memory.
  - Swapping counts in/out is a disaster.
## Finding Frequent Pairs

- The hardest problem often turns out to be finding the frequent pairs.
- We'll concentrate on how to do that, then discuss extensions to finding frequent triples, etc.

## Naïve Algorithm

- A simple way to find frequent pairs is:
  - Read the file once, counting in main memory the occurrences of each pair.
    - Expand each basket of n items into its n(n-1)/2 pairs.
- Fails if (#items)² exceeds main memory.

## Details of Main-Memory Counting

- There are two basic approaches:
  1. Count all item pairs, using a triangular matrix.
  2. Keep a table of triples [i, j, c] = "the count of the pair of items {i, j} is c."
- (1) requires only (say) 4 bytes/pair; (2) requires 12 bytes per pair, but only for those pairs with count > 0.
[Figure: memory layouts --- Method (1): 4 bytes per possible pair; Method (2): 12 bytes per occurring pair.]

## Details of Approach (1)

- Number items 1, 2, ..., n.
- Keep pair counts in the order {1,2}, {1,3}, ..., {1,n}, {2,3}, {2,4}, ..., {2,n}, {3,4}, ..., {3,n}, ..., {n-1, n}.
- Find pair {i, j}, with i < j, at the position (i - 1)(n - i/2) + j - i.
- Total number of pairs n(n-1)/2; total bytes about 2n².
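The position formula can be checked by enumerating the pairs in that order. A sketch, rewriting (i - 1)(n - i/2) with integer arithmetic:

```python
def pair_position(i, j, n):
    # 1-based position of pair {i, j} (1 <= i < j <= n) in the order
    # {1,2}, {1,3}, ..., {1,n}, {2,3}, ..., {2,n}, ..., {n-1,n}.
    # (i - 1)(n - i/2) + (j - i), as exact integer arithmetic:
    return (i - 1) * (2 * n - i) // 2 + (j - i)

n = 5
order = [(i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
assert len(order) == n * (n - 1) // 2      # total number of pairs
assert all(pair_position(i, j, n) == pos   # formula matches enumeration
           for pos, (i, j) in enumerate(order, start=1))
```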

## Details of Approach (2)

- You need a hash table, with i and j as the key, to locate (i, j, c) triples efficiently.
  - Typically, the cost of the hash structure can be neglected.
- Total bytes used is about 12p, where p is the number of pairs that actually occur.
  - Beats the triangular matrix if at most 1/3 of the possible pairs actually occur.
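The 1/3 break-even point follows from comparing 12p against the triangular matrix's 4 · n(n-1)/2 bytes. A quick check with an assumed n of 100,000:

```python
def triangular_bytes(n):
    # Method (1): 4 bytes for each of the n(n-1)/2 possible pairs.
    return 4 * n * (n - 1) // 2

def triples_bytes(p):
    # Method (2): 12 bytes for each of the p pairs that actually occur.
    return 12 * p

n = 100_000
possible = n * (n - 1) // 2
# 12p <= 4 * n(n-1)/2 exactly when p <= n(n-1)/6, i.e., when at most
# one third of the possible pairs occur.
assert triples_bytes(possible // 3) <= triangular_bytes(n)
assert triples_bytes(possible // 2) > triangular_bytes(n)
```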

## A-Priori Algorithm --- (1)

- A two-pass approach called *a-priori* limits the need for main memory.
- Key idea: *monotonicity*: if a set of items appears at least s times, so does every subset.
  - Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.
## A-Priori Algorithm --- (2)

- Pass 1: Read baskets and count in main memory the occurrences of each item.
  - Requires only memory proportional to #items.
- Pass 2: Read baskets again and count in main memory only those pairs both of which were found in Pass 1 to be frequent.
  - Requires memory proportional to the square of the number of frequent items only.
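The two passes can be sketched as follows (function and variable names are illustrative; a real implementation would stream the baskets from disk rather than hold them in a list):

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    # Pass 1: count each item; keep those with support >= s.
    item_counts = Counter(item for b in baskets for item in b)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: re-read the baskets, counting only pairs of frequent items.
    pair_counts = Counter()
    for b in baskets:
        pair_counts.update(combinations(sorted(frequent_items & b), 2))
    return {pair for pair, c in pair_counts.items() if c >= s}

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
print(apriori_pairs(baskets, 3))  # {('b', 'c'), ('b', 'm'), ('c', 'j')}
```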

## Picture of A-Priori

[Figure: main-memory layout for the two passes --- Pass 1 holds item counts; Pass 2 holds counts of candidate pairs of frequent items.]
## Detail for A-Priori

- You can use the triangular-matrix method with n = number of frequent items.
  - Saves space compared with storing triples.
- Trick: number the frequent items 1, 2, ..., and keep a table relating the new numbers to the original item numbers.

## Frequent Triples, Etc.

- For each k, we construct two sets of k-tuples:
  - Ck = candidate k-tuples = those that might be frequent sets (support ≥ s), based on information from the pass for k-1.
  - Lk = the set of truly frequent k-tuples.

[Figure: C1 --Filter (first pass)--> L1 --Construct--> C2 --Filter (second pass)--> L2 --Construct--> C3.]


## A-Priori for All Frequent Itemsets

- One pass for each k.
- Needs room in main memory to count each candidate k-tuple.
- For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory.

## Frequent Itemsets --- (2)

- C1 = all items.
- L1 = those items counted on the first pass to be frequent.
- C2 = pairs, both members of which are in L1.
- In general, Ck = k-tuples each of whose (k-1)-subsets is in Lk-1.
- Lk = those candidates with support ≥ s.