
"Association Rules"

Market Baskets
Frequent Itemsets
A-priori Algorithm

The Market-Basket Model
• A large set of items, e.g., things sold in a supermarket.
• A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.

Support
• Simplest question: find sets of items that appear "frequently" in the baskets.
• Support for itemset I = the number of baskets containing all items in I.
• Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.


Example
• Items = {milk, coke, pepsi, beer, juice}.
• Support threshold s = 3 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
• Frequent itemsets: {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}.
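The example above can be checked mechanically. The following is a minimal brute-force sketch, not an efficient method: it tests every subset of the items against the threshold. The names `BASKETS`, `support`, and `frequent_itemsets` are illustrative assumptions, and "frequent" is taken to mean support of at least s, which is what the listed answer implies.

```python
from itertools import combinations

# The eight baskets from the example (m=milk, c=coke, p=pepsi, b=beer, j=juice).
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)


def frequent_itemsets(baskets, s):
    """Brute force: test every non-empty subset of the items against threshold s."""
    universe = sorted(set().union(*baskets))
    result = {}
    for k in range(1, len(universe) + 1):
        for candidate in combinations(universe, k):
            count = support(candidate, baskets)
            if count >= s:
                result[frozenset(candidate)] = count
    return result


freq = frequent_itemsets(BASKETS, s=3)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Brute force is only workable here because there are five items; the number of subsets doubles with every item added.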

Applications --- (1)
• Real market baskets: chain stores keep terabytes of information about what customers buy together.
   - Tells how typical customers navigate stores, lets them position tempting items.
   - Suggests tie-in "tricks," e.g., run sale on diapers and raise the price of beer.
• High support needed, or no $$'s.

Applications --- (2)
• "Baskets" = documents; "items" = words in those documents.
   - Lets us find words that appear together unusually frequently, i.e., linked concepts.
• "Baskets" = sentences, "items" = documents containing those sentences.
   - Items that appear together too often could represent plagiarism.

Applications --- (3)
• "Baskets" = Web pages; "items" = linked pages.
   - Pairs of pages with many common references may be about the same topic.
• "Baskets" = Web pages p; "items" = pages that link to p.
   - Pages with many of the same links may be mirrors or about the same topic.

Important Point
• "Market Baskets" is an abstraction that models any many-many relationship between two concepts: "items" and "baskets."
   - Items need not be "contained" in baskets.
• The only difference is that we count co-occurrences of items related to a basket, not vice-versa.

Scale of Problem
• WalMart sells 100,000 items and can store billions of baskets.
• The Web has over 100,000,000 words and billions of pages.

Association Rules
• If-then rules about the contents of baskets.
• {i1, i2,…,ik} → j means: "if a basket contains all of i1,…,ik then it is likely to contain j."
• Confidence of this association rule is the probability of j given i1,…,ik.

Example
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
• An association rule: {m, b} → c.
   - {m, b} appears in B1, B3, B5, and B6; of these, B1 and B6 also contain c.
   - Confidence = 2/4 = 50%.
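The confidence calculation can be sketched in a few lines. This is an illustrative helper, not part of the original slides: the name `confidence` and the use of exact fractions are assumptions made to keep the arithmetic visible.

```python
from fractions import Fraction

# The eight baskets from the example.
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def confidence(lhs, rhs, baskets):
    """Fraction of the baskets containing all of `lhs` that also contain `rhs`."""
    lhs = set(lhs)
    containing_lhs = [basket for basket in baskets if lhs <= basket]
    if not containing_lhs:
        raise ValueError("left-hand side never occurs, so confidence is undefined")
    hits = sum(1 for basket in containing_lhs if rhs in basket)
    return Fraction(hits, len(containing_lhs))


rule_confidence = confidence({"m", "b"}, "c", BASKETS)
print(rule_confidence)  # 1/2
```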

Interest
• The interest of an association rule is the absolute value of the amount by which the confidence differs from what you would expect, were items selected independently of one another.

Example
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
• For association rule {m, b} → c, item c appears in 5/8 of the baskets.
• Interest = | 2/4 - 5/8 | = 1/8 --- not very interesting.
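The same arithmetic as a small sketch: interest is computed as the absolute difference between the rule's confidence and the overall fraction of baskets containing the right-hand item. The function name `interest` is an assumption for illustration.

```python
from fractions import Fraction

# The eight baskets from the example.
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def interest(lhs, rhs, baskets):
    """|confidence(lhs -> rhs) - fraction of all baskets containing rhs|."""
    lhs = set(lhs)
    containing_lhs = [basket for basket in baskets if lhs <= basket]
    if not containing_lhs:
        raise ValueError("left-hand side never occurs, so interest is undefined")
    conf = Fraction(sum(rhs in basket for basket in containing_lhs), len(containing_lhs))
    baseline = Fraction(sum(rhs in basket for basket in baskets), len(baskets))
    return abs(conf - baseline)


rule_interest = interest({"m", "b"}, "c", BASKETS)
print(rule_interest)  # 1/8
```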

Relationships Among Measures
• Rules with high support and confidence may be useful even if they are not "interesting."
   - We don't care if buying bread causes people to buy milk, or whether simply a lot of people buy both bread and milk.
• But high interest suggests a cause that might be worth investigating.

Finding Association Rules
• A typical question: "find all association rules with support ≥ s and confidence ≥ c."
   - Note: "support" of an association rule is the support of the set of items it mentions.
• Hard part: finding the high-support (frequent) itemsets.
   - Checking the confidence of association rules involving those sets is relatively easy.
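To illustrate why the confidence check is the easy part, here is a hedged sketch that derives rules directly from a table of frequent itemsets and their supports. The table `SUPPORT` holds the counts from the eight-basket example with s = 3; the function name `rules_from_frequent` and the restriction to a single item on the right-hand side (as in the rule form used in these slides) are assumptions.

```python
from fractions import Fraction

# Supports of the frequent itemsets from the eight-basket example (s = 3).
SUPPORT = {
    frozenset("m"): 5, frozenset("c"): 5, frozenset("b"): 6, frozenset("j"): 4,
    frozenset("mb"): 4, frozenset("cb"): 4, frozenset("jc"): 3,
}


def rules_from_frequent(support, min_conf):
    """Return (lhs, rhs, confidence) for rules lhs -> rhs with confidence >= min_conf."""
    rules = []
    for itemset, count in support.items():
        if len(itemset) < 2:
            continue
        for rhs in itemset:
            lhs = itemset - {rhs}
            # lhs is a subset of a frequent itemset, so it is frequent and has an entry.
            conf = Fraction(count, support[lhs])
            if conf >= min_conf:
                rules.append((lhs, rhs, conf))
    return rules


for lhs, rhs, conf in rules_from_frequent(SUPPORT, Fraction(3, 4)):
    print(sorted(lhs), "->", rhs, conf)
```

No further pass over the baskets is needed: every count the confidence requires is already in the table.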

Computation Model
• Typically, data is kept in a "flat file" rather than a database system.
   - Stored on disk.
   - Stored basket-by-basket.
   - Expand baskets into pairs, triples, etc. as you read baskets.

Computation Model --- (2)
• The true cost of mining disk-resident data is usually the number of disk I/O's.
• In practice, association-rule algorithms read the data in passes --- all baskets read in turn.
• Thus, we measure the cost by the number of passes an algorithm takes.
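A pass can be pictured as one full iteration over the file. The sketch below assumes a hypothetical file layout, one basket per line with items separated by whitespace; the slides do not specify a format. The generator `read_baskets` and the temporary demo file are illustrative only.

```python
import os
import tempfile


def read_baskets(path):
    """One pass over a flat file: yield each non-empty line as a set of items."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            items = line.split()
            if items:
                yield set(items)


# Hypothetical three-basket file, written only so the sketch can run.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("m c b\nm p j\nm b\n")
    demo_path = tmp.name

passes = 0
item_counts = {}
for basket in read_baskets(demo_path):  # pass 1: count items
    for item in basket:
        item_counts[item] = item_counts.get(item, 0) + 1
passes += 1

num_baskets = sum(1 for _ in read_baskets(demo_path))  # pass 2: re-read from the start
passes += 1
os.remove(demo_path)
```

Each call to `read_baskets` streams the file again, so only the counts stay in memory, never the baskets themselves.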

Main-Memory Bottleneck
• In many algorithms to find frequent itemsets we need to worry about how main memory is used.
   - As we read baskets, we need to count something, e.g., occurrences of pairs.
   - The number of different things we can count is limited by main memory.
   - Swapping counts in/out is a disaster.

Finding Frequent Pairs
• The hardest problem often turns out to be finding the frequent pairs.
• We'll concentrate on how to do that, then discuss extensions to finding frequent triples, etc.

Naïve Algorithm
• A simple way to find frequent pairs is:
   - Read file once, counting in main memory the occurrences of each pair.
      - Expand each basket of n items into its n(n-1)/2 pairs.
• Fails if #items-squared exceeds main memory.
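A minimal sketch of the naïve single-pass count, assuming the baskets are already available as Python sets. The function name `count_pairs_naive` is illustrative; each pair is stored as a sorted tuple so that {b, m} and {m, b} share one counter.

```python
from collections import Counter
from itertools import combinations

# The eight baskets from the earlier example.
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def count_pairs_naive(baskets):
    """Single pass: expand each basket into its n(n-1)/2 pairs and count them all."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


pair_counts = count_pairs_naive(BASKETS)
frequent_pairs = {pair for pair, count in pair_counts.items() if count >= 3}
print(sorted(frequent_pairs))
```

The counter grows with the number of distinct pairs seen, which is the memory problem the slide describes.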

Details of Main-Memory Counting
• There are two basic approaches:
   1. Count all item pairs, using a triangular matrix.
   2. Keep a table of triples [i, j, c] = the count of the pair of items {i, j} is c.
• (1) requires only (say) 4 bytes/pair; (2) requires 12 bytes, but only for those pairs with >0 counts.

[Figure: memory use of the two methods. Method (1): 4 bytes per pair. Method (2): 12 bytes per occurring pair.]

Details of Approach (1)
• Number items 1,2,…
• Keep pairs in the order {1,2}, {1,3},…, {1,n}, {2,3}, {2,4},…, {2,n}, {3,4},…, {3,n},…, {n-1,n}.
• Find pair {i, j} at the position (i-1)(n - i/2) + j - i.
• Total number of pairs n(n-1)/2; total bytes about 2n².
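The position formula can be sketched as follows. To stay in integer arithmetic, (i-1)(n - i/2) is written as (i-1)(2n - i)/2, which is always a whole number because one of the two factors is even. The name `pair_index` is an assumption, and the `array("i", ...)` counter is only typically 4 bytes per entry, depending on the platform.

```python
from array import array


def pair_index(i, j, n):
    """1-based position of pair {i, j} in the order {1,2}, {1,3}, ..., {n-1,n}."""
    if not 1 <= i < j <= n:
        raise ValueError("need 1 <= i < j <= n")
    return (i - 1) * (2 * n - i) // 2 + (j - i)


n = 5
counts = array("i", [0]) * (n * (n - 1) // 2)  # one counter per possible pair
counts[pair_index(2, 4, n) - 1] += 1           # subtract 1 for 0-based storage

positions = [pair_index(i, j, n) for i in range(1, n + 1) for j in range(i + 1, n + 1)]
print(positions)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```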

Details of Approach (2)
• You need a hash table, with i and j as the key, to locate (i, j, c) triples efficiently.
   - Typically, the cost of the hash structure can be neglected.
• Total bytes used is about 12p, where p is the number of pairs that actually occur.
   - Beats triangular matrix if at most 1/3 of possible pairs actually occur.
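The 1/3 break-even point follows from comparing 12p bytes with 4 bytes for each of the n(n-1)/2 possible pairs. A small sketch of that comparison, using the byte sizes quoted in the slides and ignoring hash-table overhead as the slides do; the function names are illustrative.

```python
def triangular_bytes(n, bytes_per_count=4):
    """Memory for one counter per possible pair of n items."""
    return bytes_per_count * n * (n - 1) // 2


def triples_bytes(p, bytes_per_triple=12):
    """Memory for one (i, j, c) triple per pair that actually occurs."""
    return bytes_per_triple * p


def triples_use_less(n, p):
    """True when the triples table is strictly smaller than the triangular matrix."""
    return triples_bytes(p) < triangular_bytes(n)


n = 1000
possible_pairs = n * (n - 1) // 2   # 499,500
break_even = possible_pairs // 3    # 166,500 pairs: both methods use the same space
print(triples_use_less(n, break_even - 1), triples_use_less(n, break_even))
```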

A-Priori Algorithm --- (1)
• A two-pass approach called a-priori limits the need for main memory.
• Key idea: monotonicity: if a set of items appears at least s times, so does every subset.
   - Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.

A-Priori Algorithm --- (2)
• Pass 1: Read baskets and count in main memory the occurrences of each item.
   - Requires only memory proportional to #items.
• Pass 2: Read baskets again and count in main memory only those pairs both of which were found in Pass 1 to be frequent.
   - Requires memory proportional to square of frequent items only.
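The two passes can be sketched as below. This is a minimal illustration under stated assumptions: `read_baskets` is a hypothetical callable that starts a fresh pass over the data each time it is called (here it simply re-iterates an in-memory list), and the counts live in Python `Counter` objects rather than a packed array.

```python
from collections import Counter
from itertools import combinations

# The eight baskets from the earlier example.
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def apriori_pairs(read_baskets, s):
    """Two passes: count items, then count only pairs of frequent items."""
    # Pass 1: one counter per item.
    item_counts = Counter()
    for basket in read_baskets():
        item_counts.update(basket)
    frequent_items = {item for item, count in item_counts.items() if count >= s}

    # Pass 2: drop infrequent items from each basket before expanding it into pairs.
    pair_counts = Counter()
    for basket in read_baskets():
        kept = sorted(item for item in basket if item in frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: count for pair, count in pair_counts.items() if count >= s}


frequent_pairs = apriori_pairs(lambda: iter(BASKETS), s=3)
print(frequent_pairs)
```

In this data only pepsi is infrequent, so pass 2 never creates a counter for any pair that includes it.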

[Figure: Picture of A-Priori]

Detail for A-Priori
• You can use the triangular matrix method with n = number of frequent items.
   - Saves space compared with storing triples.
• Trick: number frequent items 1,2,… and keep a table relating new numbers to original item numbers.
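The renumbering trick combined with the triangular layout might look like this. It is a sketch under assumptions: the helper names `renumber` and `count_frequent_item_pairs` are invented for illustration, new numbers are assigned in sorted order of the original items, and `array("i", ...)` stands in for a block of (typically) 4-byte counters.

```python
from array import array

# The eight baskets from the earlier example.
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def renumber(frequent_items):
    """Assign new numbers 1..n; new_to_old[k-1] is the original item for number k."""
    new_to_old = sorted(frequent_items)
    old_to_new = {old: new for new, old in enumerate(new_to_old, start=1)}
    return old_to_new, new_to_old


def count_frequent_item_pairs(baskets, frequent_items):
    """Pass 2 counts, stored in a triangular array over the renumbered items."""
    old_to_new, new_to_old = renumber(frequent_items)
    n = len(new_to_old)
    counts = array("i", [0]) * (n * (n - 1) // 2)
    for basket in baskets:
        ids = sorted(old_to_new[item] for item in basket if item in old_to_new)
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                i, j = ids[a], ids[b]
                # Position (i-1)(n - i/2) + j - i, shifted to 0-based.
                counts[(i - 1) * (2 * n - i) // 2 + (j - i) - 1] += 1
    return counts, new_to_old


counts, new_to_old = count_frequent_item_pairs(BASKETS, {"m", "c", "b", "j"})
print(new_to_old, list(counts))
```

With b, c, j, m renumbered 1 to 4, the six counters hold the pairs in the order {b,c}, {b,j}, {b,m}, {c,j}, {c,m}, {j,m}.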

Frequent Triples, Etc.
• For each k, we construct two sets of k-tuples:
   - C_k = candidate k-tuples = those that might be frequent sets (support ≥ s) based on information from the pass for k-1.
   - L_k = the set of truly frequent k-tuples.

C1 → Filter → L1 → Construct → C2 → Filter → L2 → Construct → C3
First pass: filter C1 into L1. Second pass: filter C2 into L2.

A-Priori for All Frequent Itemsets
• One pass for each k.
• Needs room in main memory to count each candidate k-tuple.
• For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory.

Frequent Itemsets --- (2)
• C1 = all items.
• L1 = those counted on first pass to be frequent.
• C2 = pairs, both chosen from L1.
• In general, C_k = k-tuples each k-1 of which is in L_{k-1}.
• L_k = those candidates with support ≥ s.
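The construct/filter cycle for general k can be sketched as follows. This is an illustrative, unoptimized version: candidates are built by enumerating all k-subsets of the items that still appear in L_{k-1} and keeping those whose every (k-1)-subset is in L_{k-1}. The function name `apriori` and the `read_baskets` callable (one call per pass) are assumptions.

```python
from itertools import combinations

# The eight baskets from the earlier example.
BASKETS = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]


def apriori(read_baskets, s):
    """Return {itemset: support} for every itemset with support >= s."""
    # Pass 1: C1 is every item; count each one.
    counts = {}
    for basket in read_baskets():
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {itemset: n for itemset, n in counts.items() if n >= s}  # L1
    frequent = dict(level)
    k = 2
    while level:
        # Construct C_k: keep a k-set only if all of its (k-1)-subsets are in L_{k-1}.
        items = sorted(set().union(*level))
        candidates = [
            frozenset(combo)
            for combo in combinations(items, k)
            if all(frozenset(sub) in level for sub in combinations(combo, k - 1))
        ]
        if not candidates:
            break
        # Filter: one pass over the baskets counts every candidate.
        counts = dict.fromkeys(candidates, 0)
        for basket in read_baskets():
            for candidate in candidates:
                if candidate <= basket:
                    counts[candidate] += 1
        level = {itemset: n for itemset, n in counts.items() if n >= s}  # L_k
        frequent.update(level)
        k += 1
    return frequent


result = apriori(lambda: iter(BASKETS), s=3)
print(len(result))  # 7
```

On the example data, no triple survives candidate construction (each one has an infrequent pair inside it), so the loop stops without a third pass.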
!