
Association Rule Mining:

Write short notes on Association Rule Mining (Or) Discuss the Market Basket Analysis problem to motivate association rule mining.

Definition: Association rule mining searches for interesting relationships among items in a given data set.

Market Basket Analysis: A Motivating Example for Association Rule Mining
We need to answer the following question: which groups or sets of items are customers likely to purchase on a given trip to the store? Market basket analysis may be performed on the retail data of customer transactions at the store. The results may be used to plan marketing or advertising strategies, as well as catalog design. For instance, market basket analysis may help managers design different store layouts: items that are frequently purchased together can be placed in close proximity in order to further encourage the sale of such items together.

Example: If customers who purchase computers also tend to buy financial management software at the same time, then placing the hardware display close to the software display may help increase the sales of both items. Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.

Which items are frequently purchased together by the customers?

Shopping baskets:
Basket 1: Bread, Milk, Cereal
Basket 2: Milk
Basket 3: Bread, Milk
Basket 4: Bread, Butter
Basket 5: Sugar, Eggs
Basket 6: Sugar, Eggs

Rule measures:
Support: the percentage of transactions in D that contain A U B.
Confidence: the percentage of transactions in D containing A that also contain B.
Support(A => B) = P(A U B)
Confidence(A => B) = P(B | A)
Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.

Example transaction database:
Transaction ID   Items bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have:
A => C (support 50%, confidence 66.6%)
C => A (support 50%, confidence 100%)
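The short Python sketch below is an added illustration (not part of the original notes; the function names support and confidence are ad hoc) of how these two measures can be computed for the example database above.

transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # support(A U B) / support(A)
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"A", "C"}, transactions))       # 0.5   -> support of A => C is 50%
print(confidence({"A"}, {"C"}, transactions))  # 0.666 -> confidence of A => C is 66.6%
print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> confidence of C => A is 100%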

Classification of association rule mining (Association Rule Mining: A Road Map): Association rules can be classified in several ways:
1. Based on the types of values handled in the rule.
2. Based on the dimensions of data involved in the rule.
3. Based on the levels of abstraction involved in the rule set.
4. Based on various extensions to association mining.
Each criterion is elaborated below.

Based on the types of values handled in the rule: If a rule concerns associations between the presence or absence of items, it is a Boolean association rule, for example:
buys(X, "SQLServer") => buys(X, "DBMiner")
If a rule describes associations between quantitative items or attributes, it is a quantitative association rule, for example:
age(X, "20..29") ^ income(X, "20K..29K") => buys(X, "CD player")

Based on the dimensions of data involved in the rule: If the items or attributes in an association rule reference only one dimension, it is a single-dimensional association rule. If a rule references two or more dimensions, such as the dimensions buys, time_of_transaction, and customer_category, it is a multidimensional association rule. For example:
buys(X, "computer") => buys(X, "financial_management_software")   (single-dimensional: only the dimension buys)
age(X, "30..39") ^ income(X, "42K..48K") => buys(X, "PC")          (multidimensional: age, income, and buys)

Based on the levels of abstraction involved in the rule set: Some methods for association rule mining can find rules at differing levels of abstraction, for example:
age(X, "30..39") => buys(X, "laptop computer")
Multilevel association rules: the items bought are referenced at different levels of abstraction, for example the pair of rules
buys(X, "computer") => buys(X, "HP printer")
buys(X, "laptop computer") => buys(X, "HP printer")
where "computer" is a higher-level abstraction of "laptop computer".
Single-level association rules: the rules within a given set do not reference items or attributes at different levels of abstraction.
Based on various extensions to association mining: Association mining can be extended to correlation analysis, where the absence or presence of correlated items can be identified. It can also be extended to mining max-patterns and frequent closed itemsets.
Max-pattern: a frequent pattern p such that any proper super-pattern of p is not frequent.
Frequent closed itemset: a frequent itemset c that is closed, i.e. there exists no proper superset c' of c such that every transaction containing c also contains c'.
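To make the last two definitions concrete, the following Python sketch (an added illustration; the data are made up) checks closedness and max-patterns over the complete set of frequent itemsets of a hypothetical database with transactions {a,b,c}, {a,b,c}, {a,b}, {a} and minimum support count 2.

# Complete set of frequent itemsets of the hypothetical database, with support counts.
frequent = {
    frozenset("a"): 4, frozenset("b"): 3, frozenset("c"): 2,
    frozenset("ab"): 3, frozenset("ac"): 2, frozenset("bc"): 2,
    frozenset("abc"): 2,
}

def is_closed(itemset, frequent):
    # Closed: no proper superset has the same support count.
    return not any(itemset < other and frequent[other] == frequent[itemset]
                   for other in frequent)

def is_max_pattern(itemset, frequent):
    # Max-pattern: no proper superset is frequent at all.
    return not any(itemset < other for other in frequent)

for s in sorted(frequent, key=sorted):
    print(sorted(s), frequent[s],
          "closed:", is_closed(s, frequent),
          "max-pattern:", is_max_pattern(s, frequent))
# Closed itemsets here: {a}, {a,b}, {a,b,c}; the only max-pattern is {a,b,c}.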

Mining Single-Dimensional Boolean Association Rules from Transactional Databases:
The Apriori algorithm: Finding frequent itemsets using candidate generation:
1. Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules.
2. Apriori employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
3. First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
4. The finding of each Lk requires one full scan of the database.
5. To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space: all nonempty subsets of a frequent itemset must also be frequent.
6. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.

How is the Apriori property used in the algorithm?
The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li. By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The join, Lk-1 ⋈ Lk-1, is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if
(l1[1]=l2[1]) ^ (l1[2]=l2[2]) ^ ... ^ (l1[k-2]=l2[k-2]) ^ (l1[k-1]<l2[k-1]).

The condition l1[k-1]<l2[k-1] simply ensures that no duplicates are generated.

The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk. Because Ck can be huge, the Apriori property is used to prune it first: any candidate that has a (k-1)-subset not in Lk-1 cannot be frequent and is removed from Ck.
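For instance (a small worked illustration added here, not taken from the original notes): suppose L3 = {{1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4}}. The join step produces the candidates {1,2,3,4} (from {1,2,3} and {1,2,4}) and {1,3,4,5} (from {1,3,4} and {1,3,5}). The prune step keeps {1,2,3,4}, because all of its 3-subsets {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4} are in L3, but removes {1,3,4,5}, because its subsets {1,4,5} and {3,4,5} are not in L3. Hence C4 = {{1,2,3,4}}.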

Algorithm: Apriori: Find frequent itemsets using an iterative level-wise approach based on candidate generation.
Input: Database, D, of transactions; minimum support threshold, min_sup. Output: L, frequent itemsets in D.

Method:
1.  L1 = find_frequent_1-itemsets(D);
2.  for (k = 2; Lk-1 ≠ ∅; k++) {
3.      Ck = apriori_gen(Lk-1, min_sup);
4.      for each transaction t ∈ D {         // scan D for counts
5.          Ct = subset(Ck, t);              // get the subsets of t that are candidates
6.          for each candidate c ∈ Ct
7.              c.count++;
8.      }
9.      Lk = {c ∈ Ck | c.count ≥ min_sup}
10. }
11. return L = ∪k Lk;

procedure apriori_gen(Lk-1: frequent (k-1)-itemsets; min_sup: minimum support threshold)
for each itemset l1 ∈ Lk-1
    for each itemset l2 ∈ Lk-1
        if (l1[1]=l2[1]) ^ (l1[2]=l2[2]) ^ ... ^ (l1[k-2]=l2[k-2]) ^ (l1[k-1]<l2[k-1]) then {
            c = l1 ⋈ l2;                    // join step: generate candidates
            if has_infrequent_subset(c, Lk-1) then
                delete c;                   // prune step: remove unfruitful candidate
            else add c to Ck;
        }
return Ck;

procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)
1) for each (k-1)-subset s of c
2)     if s ∉ Lk-1 then
3)         return TRUE;
4) return FALSE;
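The following is a minimal Python sketch of the level-wise procedure described above (an added illustration; apriori, min_sup_count, and the other names are my own, and the code favors clarity over efficiency).

from itertools import combinations

def apriori(transactions, min_sup_count):
    # Level-wise Apriori sketch: returns {frozenset(itemset): support count}.
    counts = {}
    for t in transactions:                       # first scan: frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup_count}
    all_frequent = dict(current)

    k = 2
    while current:
        # Join step: merge (k-1)-itemsets whose first k-2 items agree.
        prev = sorted(tuple(sorted(s)) for s in current)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:k-2] == prev[j][:k-2]:
                    c = frozenset(prev[i]) | frozenset(prev[j])
                    # Prune step: every (k-1)-subset of c must itself be frequent.
                    if len(c) == k and all(frozenset(sub) in current
                                           for sub in combinations(c, k - 1)):
                        candidates.add(c)
        # One scan of the database to count the surviving candidates.
        cand_counts = {c: 0 for c in candidates}
        for t in transactions:
            tset = set(t)
            for c in candidates:
                if c <= tset:
                    cand_counts[c] += 1
        current = {c: n for c, n in cand_counts.items() if n >= min_sup_count}
        all_frequent.update(current)
        k += 1
    return all_frequent

# Usage on the small transaction database from earlier in these notes (min support 50% of 4 = 2).
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(D, min_sup_count=2))   # {A}:3, {B}:2, {C}:2, {A,C}:2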

Apriori example:

Generating Association Rules from Frequent Itemsets


Once the frequent itemsets from the transactions in a database D have been found, it is straightforward to generate strong association rules from them:
Confidence(A => B) = P(B | A) = support_count(A U B) / support_count(A)
where support_count(A U B) is the number of transactions containing the itemset A U B, and support_count(A) is the number of transactions containing the itemset A.
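One common way to carry this out, sketched below in Python as an added illustration (generate_rules and its arguments are ad hoc names), is: for each frequent itemset l with at least two items, consider every nonempty proper subset s and output the rule s => (l - s) whenever support_count(l) / support_count(s) meets the minimum confidence.

from itertools import combinations

def generate_rules(frequent_counts, min_conf):
    # frequent_counts: {frozenset(itemset): support count}, covering ALL frequent itemsets
    # (by the Apriori property, every subset of a frequent itemset is also present).
    rules = []
    for itemset, count in frequent_counts.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for s in combinations(itemset, r):
                antecedent = frozenset(s)
                conf = count / frequent_counts[antecedent]   # support_count(l) / support_count(s)
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

# Usage with the earlier example: {A,C}:2, {A}:3, {C}:2.
counts = {frozenset("A"): 3, frozenset("C"): 2, frozenset("AC"): 2}
print(generate_rules(counts, min_conf=0.5))
# [({'A'}, {'C'}, 0.666...), ({'C'}, {'A'}, 1.0)]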

Improving the Efficiency of Apriori:


Different variations of the Apriori algorithm have been proposed that focus on improving the efficiency of the original algorithm.
Hash-based technique (hashing itemset counts): A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck.
Transaction reduction (reducing the number of transactions scanned in future iterations): A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets, so it can be removed from further scans.

Partitioning (partitioning the data to find candidate itemsets):

A partitioning technique can be used that requires just two database scans to mine the frequent itemsets. It consists of two phases (a rough code sketch follows the figure below).
1) In the first scan, the algorithm subdivides the transactions of D into n nonoverlapping partitions and finds the frequent itemsets local to each partition. Any itemset that is frequent in D must be locally frequent in at least one partition, so the union of all local frequent itemsets forms the global candidate set.
2) In the second scan of D, the actual support of each candidate is assessed in order to determine the global frequent itemsets.

Figure: Transactions in D -> Divide D into n partitions -> Find the frequent itemsets local to each partition (phase 1) -> Combine all local frequent itemsets to form candidate itemsets -> Find global frequent itemsets among candidates (phase 2) -> Frequent itemsets in D.
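The Python sketch below illustrates the two-phase idea (an added illustration, not the published Partition algorithm: the local miner is a deliberately naive brute-force enumeration, and all names are ad hoc).

from itertools import combinations
from collections import Counter

def local_frequent_itemsets(transactions, min_sup_count, max_len=3):
    # Naive local miner: count every itemset of up to max_len items.
    counts = Counter()
    for t in transactions:
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(t), k):
                counts[frozenset(combo)] += 1
    return {s for s, c in counts.items() if c >= min_sup_count}

def partition_mine(transactions, min_sup, n_partitions=2):
    size = len(transactions)
    part_len = max(1, (size + n_partitions - 1) // n_partitions)
    candidates = set()
    # Phase 1: mine each partition with a proportionally scaled support threshold;
    # a globally frequent itemset must be locally frequent in at least one partition.
    for i in range(0, size, part_len):
        part = transactions[i:i + part_len]
        local_min = max(1, int(min_sup * len(part)))
        candidates |= local_frequent_itemsets(part, local_min)
    # Phase 2: one more full scan to compute the global support of every candidate.
    counts = Counter()
    for t in transactions:
        tset = set(t)
        for c in candidates:
            if c <= tset:
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_sup * size}

D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(partition_mine(D, min_sup=0.5))   # {A}:3, {B}:2, {C}:2, {A,C}:2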

Sampling (mining on a subset of the given data): The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D.
Dynamic itemset counting (adding candidate itemsets at different points during a scan): A dynamic itemset counting technique was proposed in which the database is partitioned into blocks marked by start points. New candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately prior to each complete database scan.

Mining Frequent itemsets without candidate generation:


Problems with candidate generation: it may need to generate a huge number of candidate sets, and it may need to repeatedly scan the database and check a large set of candidates by pattern matching. An interesting method that avoids candidate generation is called frequent pattern growth, or FP-growth, which adopts a divide-and-conquer strategy as follows:
1) Compress the database representing frequent items into a frequent pattern tree, or FP-tree, which retains the itemset association information.
2) Divide such a compressed database into a set of conditional databases, each associated with one frequent item, and mine each such database separately.

Example:

The set of frequent items is sorted in descending order of support count. The resulting list is denoted L. Thus we have L = [I2:7, I1:6, I3:6, I4:2, I5:2].
The FP-tree is then constructed as follows (a code sketch follows the header table below):
1) Create the root of the tree, labeled "null".
2) The items in each transaction are processed in L order and a branch is created for each transaction. For example:
T100: I1, I2, I5 -> processed in L order as I2, I1, I5 -> branch <(I2:1), (I1:1), (I5:1)>
   i.  I2 is linked as a child of the root
   ii. I1 is linked to I2
   iii. I5 is linked to I1
T200: I2, I4 -> processed in L order as I2, I4 -> branch <(I2:2), (I4:1)>
   i.  I2 is linked to the root; since the branch shares the I2 prefix, the count of I2 is incremented to 2
   ii. I4 is linked to I2

Header table of the FP-tree (each entry also holds a node link pointing to the occurrences of that item in the tree):
Item ID   Support count
I2        7
I1        6
I3        6
I4        2
I5        2
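A compact Python sketch of this two-pass FP-tree construction is given below (an added illustration; FPNode, build_fp_tree, and the explicit nine-transaction list are my own reconstruction of the standard textbook example from which the counts I2:7, I1:6, I3:6, I4:2, I5:2 above are taken).

class FPNode:
    # FP-tree node: item name, count, parent link, and children indexed by item.
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_sup_count):
    # Pass 1: count item supports and keep the frequent items in descending order (the list L).
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_sup_count}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))

    # Pass 2: insert each transaction, filtered and sorted in L order, into the tree.
    root = FPNode(None)
    header = {i: [] for i in order}      # header table: item -> node links into the tree
    for t in transactions:
        node = root
        for item in [i for i in order if i in t]:
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1              # shared prefixes only increment the count
    return root, header, frequent

D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
root, header, freq = build_fp_tree(D, min_sup_count=2)
print(freq)   # item support counts: I2:7, I1:6, I3:6, I4:2, I5:2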

Mining of the FP-tree proceeds as follows: start from each frequent length-1 pattern (as an initial suffix pattern), construct its conditional pattern base, then construct its conditional FP-tree, and perform mining recursively on that tree.

Item   Conditional pattern base           Conditional FP-tree     Frequent patterns generated
I5     {(I2 I1: 1), (I2 I1 I3: 1)}        <I2:2, I1:2>            I2 I5:2, I1 I5:2, I2 I1 I5:2
I4     {(I2 I1: 1), (I2: 1)}              <I2:2>                  I2 I4:2
I3     {(I2 I1: 2), (I2: 2), (I1: 2)}     <I2:4, I1:2>, <I1:2>    I2 I3:4, I1 I3:4, I2 I1 I3:2
I1     {(I2: 4)}                          <I2:4>                  I2 I1:4
