
Q6: The following contingency table summarizes supermarket transaction data, where "hot dogs" refers to the transactions containing hot dogs, "no hot dogs" refers to the transactions that do not contain hot dogs, "hamburgers" refers to the transactions containing hamburgers, and "no hamburgers" refers to the transactions that do not contain hamburgers.

                 hot dogs   no hot dogs   Σrow
hamburgers          2,000           500   2,500
no hamburgers       1,000         1,500   2,500
Σcol                3,000         2,000   5,000
a. Suppose that the association rule "hot dogs => hamburgers" is mined. Given a
minimum support threshold of 25% and a minimum confidence threshold of
50%, is this association rule strong?
Answer: support(hot dogs => hamburgers) = 2,000/5,000 = 40%, which is above the
25% minimum support threshold. confidence(hot dogs => hamburgers) =
P(hamburgers | hot dogs) = 2,000/3,000 = 66.7%, which is above the 50% minimum
confidence threshold. Since the rule meets both thresholds, it is strong.
(b) Based on the given data, is the purchase of hot dogs independent of the purchase
of hamburgers? If not, what kind of correlation relationship exists between the two?
Answer: No, the purchase of hot dogs is not independent of the purchase of
hamburgers. Under independence we would have P(hot dogs, hamburgers) =
P(hot dogs) × P(hamburgers) = (3,000/5,000) × (2,500/5,000) = 0.6 × 0.5 = 0.3, but the
observed joint probability is 2,000/5,000 = 0.4. The correlation measure (lift) is
0.4/0.3 = 1.33 > 1, so the two purchases are positively correlated: a transaction
containing hot dogs is more likely than average to also contain hamburgers.
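These numbers are easy to verify with a short Python script (a minimal sketch; the variable names are mine, and the counts come from the contingency table above):

# Counts from the contingency table (5,000 transactions in total).
both = 2000        # transactions with hot dogs and hamburgers
hot_dogs = 3000    # column total for hot dogs
hamburgers = 2500  # row total for hamburgers
total = 5000

support = both / total                    # 0.40 >= 0.25, passes
confidence = both / hot_dogs              # 0.667 >= 0.50, passes
lift = confidence / (hamburgers / total)  # 1.33 > 1, positive correlation
print(support, confidence, lift)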

For attribute a2, the Gini index is given by:

5/9 × [1 - (2/5)² - (3/5)²] + 4/9 × [1 - (2/4)² - (2/4)²] = 0.4889


Therefore a1 produces a better split.
5. Consider the following data set for a binary class problem.

Record    A    B    Class Label
1         T    F    +
2         T    T    +
3         T    T    +
4         T    F    -
5         T    T    +
6         F    F    -
7         F    F    -
8         F    F    -
9         T    T    -
10        T    F    -

a) Calculate the information gain when splitting on A and B. Which attribute would the decision tree
induction algorithm choose?
Answer: The contingency tables after splitting on attributes A and B are:

        +    -
A = T   4    3
A = F   0    3

        +    -
B = T   3    1
B = F   1    5

The overall entropy before splitting is:

E_orig = -0.4 log2(0.4) - 0.6 log2(0.6) = 0.9710

The information gain after splitting on A is:

E_A=T = -(4/7) log2(4/7) - (3/7) log2(3/7) = 0.9852
E_A=F = -(0/3) log2(0/3) - (3/3) log2(3/3) = 0 (using the convention 0 log2 0 = 0)
Gain(A) = E_orig - (7/10) E_A=T - (3/10) E_A=F = 0.2813

Similarly, the information gain after splitting on B is:

E_B=T = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113
E_B=F = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.6500
Gain(B) = E_orig - (4/10) E_B=T - (6/10) E_B=F = 0.2565
Therefore, attribute A will be chosen to split the node.
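The calculation can be checked with a few lines of Python (a minimal sketch; the helper names are mine):

from math import log2

def entropy(counts):
    # Entropy of a class distribution given as a list of class counts.
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, children):
    # parent: class counts before the split; children: class counts per branch.
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

print(info_gain([4, 6], [[4, 3], [0, 3]]))  # split on A: 0.2813
print(info_gain([4, 6], [[3, 1], [1, 5]]))  # split on B: 0.2565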
b) Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree
induction algorithm choose?
Answer: The overall Gini index before splitting is:

G_orig = 1 - 0.4² - 0.6² = 0.48

The gain in the Gini index after splitting on A is computed from:

G_A=T = 1 - (4/7)² - (3/7)² = 0.4898
G_A=F = 1 - (0/3)² - (3/3)² = 0
Gain(A) = G_orig - (7/10) G_A=T - (3/10) G_A=F = 0.1371

Similarly, the gain after splitting on B is computed from:

G_B=T = 1 - (3/4)² - (1/4)² = 0.3750
G_B=F = 1 - (1/6)² - (5/6)² = 0.2778
Gain(B) = G_orig - (4/10) G_B=T - (6/10) G_B=F = 0.1633
Therefore, attribute B will be chosen to split the node.
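The same check works for the Gini-based gains (again a sketch with my own helper names):

def gini(counts):
    # Gini index of a class distribution given as a list of class counts.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_gain(parent, children):
    n = sum(parent)
    return gini(parent) - sum(sum(ch) / n * gini(ch) for ch in children)

print(gini_gain([4, 6], [[4, 3], [0, 3]]))  # split on A: 0.1371
print(gini_gain([4, 6], [[3, 1], [1, 5]]))  # split on B: 0.1633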

Problem 3 (Classification and Clustering) [30%]


3.1 [15%] Consider the database of a car insurance company shown below:
Name     AgeGroup   CarType   CrashRisk
Ben      30-40      Family    Low
Paul     20-30      Sports    High
Bill     40-50      Sports    High
James    30-40      Family    Low
John     20-30      Family    High
Steven   30-40      Sports    High

Assume that CrashRisk is the class attribute. Explain which of the remaining attributes
are appropriate for classification. Show the complete decision tree produced on this
dataset. Grow the tree from the root node until the leaf nodes are pure, i.e. contain
only records from the same class. Explain what the class label will be for nodes with
no training samples. Show the split test used at each node. For each leaf node, show
the class and the records associated with it. Explain how you derived each split using
the information gain and entropy concepts.
Using the produced classifier, determine the class label of the following records {Pete,
20-30, Sports} and {Bob, 40-50, Family}.
We will use attributes AgeGroup and CarType for classification; Name is a unique
identifier for each record and is therefore not appropriate for classification.
I(2, 4) = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
E(age) = 2/6 I(0, 2) + 3/6 I(2, 1) + 1/6 I(0, 1) = 0.46
E(cartype) = 3/6 I(2, 1) + 3/6 I(0, 3) = 0.46
Since E(cartype) = E(age), both attributes give the same information gain
(0.92 - 0.46 = 0.46), so we can use either of the two to split first.
We choose CarType arbitrarily.
If CarType = Sports then class = High (Paul, Bill, Steven).
If CarType = Family then we use AgeGroup to split further:
If AgeGroup = 20-30 then High (John); if AgeGroup = 30-40 then Low (Ben, James);
if AgeGroup = 40-50 then Low (no training samples reach this node, so it takes the
class of the majority of its parent's records, i.e. majority voting).
{Pete, 20-30, Sports} is classified as High.
{Bob, 40-50, Family} is classified as Low.
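The resulting tree is small enough to write down directly as code (a sketch; the function name is mine):

def crash_risk(age_group, car_type):
    # Decision tree derived above: split on CarType first, then on AgeGroup.
    if car_type == "Sports":
        return "High"
    # car_type == "Family"
    if age_group == "20-30":
        return "High"
    return "Low"  # 30-40 is pure Low; 40-50 is empty and labeled Low by majority vote

print(crash_risk("20-30", "Sports"))  # Pete -> High
print(crash_risk("40-50", "Family"))  # Bob -> Low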
3.2 [5%] Explain the concept class conditional independence assumption used by the Naive
Bayesian Classifiers. Briefly describe the difference between Naive Bayes Classification
and Bayesian Belief Networks.
Naive Bayesian classifiers use the class conditional independence assumption: given
the class label, the attribute values are assumed to be conditionally independent of
one another, so P(x1, ..., xn | C) = P(x1 | C) × ... × P(xn | C). This reduces estimating
the joint distribution P(X | C) to estimating one distribution per attribute. Bayesian
belief networks relax this assumption: a directed acyclic graph specifies conditional
dependencies among subsets of attributes, so dependencies between attributes can be
represented, at the cost of a more complex model and training procedure. (See
Chapter 6 of the book for details.)
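To illustrate the assumption on the car insurance data of 3.1 (a minimal sketch; the names are mine, and the probabilities are plain maximum-likelihood estimates without smoothing):

# (AgeGroup, CarType, CrashRisk) records from 3.1
data = [("30-40", "Family", "Low"), ("20-30", "Sports", "High"),
        ("40-50", "Sports", "High"), ("30-40", "Family", "Low"),
        ("20-30", "Family", "High"), ("30-40", "Sports", "High")]

def nb_score(age, car, label):
    # P(label) * P(age | label) * P(car | label), using class conditional independence.
    rows = [r for r in data if r[2] == label]
    prior = len(rows) / len(data)
    p_age = sum(r[0] == age for r in rows) / len(rows)
    p_car = sum(r[1] == car for r in rows) / len(rows)
    return prior * p_age * p_car

# Classify {Pete, 20-30, Sports} by comparing the unnormalized posteriors.
print(nb_score("20-30", "Sports", "High"))  # 0.25
print(nb_score("20-30", "Sports", "Low"))   # 0.0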
3.3 [10%] Consider a two dimensional database D with the records : R1 (2, 2), R2 (2, 4),
R3 (4, 2), R4 (4, 4), R5 (3, 6), R6 (7, 6), R7 (9, 6), R8 (5, 10), R9 (8, 10), R10 (10, 10). The distance function is the L1 distance (Manhattan distance). Show the results of the k-means
algorithm at each step, assuming that you start with two clusters (k = 2) with centers
C1 = (6, 6) and C2 = (9, 7).
We apply k-means with the L1 distance. The first step assigns points 1, 2, 3, 4, 5, 6,
and 8 to C1 and the remaining points 7, 9, and 10 to C2. The new centers are
(3.86, 4.86) and (9, 8.67). In the next step, point 8 moves from C1 to C2. The new
centers are (3.67, 4) and (8, 9). In the next step, point 6 moves from C1 to C2. After
that move no point changes cluster, so the algorithm stops. The final clusters are
points {1, 2, 3, 4, 5} and {6, 7, 8, 9, 10}.
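The iterations can be reproduced with a short simulation (a minimal sketch; note that, following the exercise, the assignment step uses the L1 distance while the centers are recomputed as coordinate means):

points = [(2, 2), (2, 4), (4, 2), (4, 4), (3, 6),
          (7, 6), (9, 6), (5, 10), (8, 10), (10, 10)]
centers = [(6.0, 6.0), (9.0, 7.0)]  # C1 and C2

def l1(p, q):
    # Manhattan distance between two 2-d points.
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

assignment = None
while True:
    # Assignment step: each point goes to the nearest center under L1.
    new = [min(range(2), key=lambda c: l1(p, centers[c])) for p in points]
    if new == assignment:
        break  # no point changed cluster: converged
    assignment = new
    # Update step: recompute each center as the mean of its members.
    for c in range(2):
        members = [p for p, a in zip(points, assignment) if a == c]
        centers[c] = (sum(x for x, _ in members) / len(members),
                      sum(y for _, y in members) / len(members))
    print(assignment, centers)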
Problem 4 (Misc) [10%]
True or False:

Problem 2 (Association Rules)


Consider the following transaction database:
TransID   Items
T100      A, B, C, D
T200      A, B, C, E
T300      A, B, E, F, H
T400      A, C, H

Suppose that minimum support is set to 50% and minimum confidence to 60%.
a) List all frequent itemsets together with their support.
A 100%; B 75%; C 75%; E 50%; H 50%
A,B 75%; A,C 75%; A,E 50%; A,H 50%; B,C 50%; B,E 50%
A,B,C 50%; A,B,E 50%
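These supports can be verified by brute-force enumeration over the four transactions (a sketch; a database this small does not need the full Apriori machinery):

from itertools import combinations

transactions = [{"A", "B", "C", "D"}, {"A", "B", "C", "E"},
                {"A", "B", "E", "F", "H"}, {"A", "C", "H"}]
items = sorted(set().union(*transactions))
minsup = 0.5

for size in range(1, len(items) + 1):
    for itemset in combinations(items, size):
        # Support = fraction of transactions containing the whole itemset.
        support = sum(set(itemset) <= t for t in transactions) / len(transactions)
        if support >= minsup:
            print(",".join(itemset), f"{support:.0%}")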

b) Which of the itemsets from a) are closed? Which of the itemsets from a) are maximal?
An itemset is closed if no proper superset has the same support; it is maximal if no
proper superset is frequent.
Closed: A; A,B; A,C; A,H; A,B,C; A,B,E
Maximal: A,H; A,B,C; A,B,E

c) For all frequent itemsets of maximal length, list all corresponding association rules satisfying the
requirements on (minimum support and) minimum confidence together with their confidence.
For A,B,C (support 50%):
A,B => C     confidence 66%
A,C => B     confidence 66%
B,C => A     confidence 100%
B => A,C     confidence 66%
C => A,B     confidence 66%

For A,B,E (support 50%):
A,B => E     confidence 66%
A,E => B     confidence 100%
B,E => A     confidence 100%
B => A,E     confidence 66%
E => A,B     confidence 100%

(The remaining candidate rules A => B,C and A => B,E have confidence 50% and do
not satisfy the minimum confidence requirement.)

d) The lift of an association rule is defined as follows:


lift = confidence / support(head), where the head is the consequent (right-hand side) of the rule

Compute the lift for the association rules from c).


For the rules from A,B,C:
A,B => C     lift = 0.66 / 0.75 = 0.89
A,C => B     lift = 0.66 / 0.75 = 0.89
B,C => A     lift = 1.00 / 1.00 = 1.00
B => A,C     lift = 0.66 / 0.75 = 0.89
C => A,B     lift = 0.66 / 0.75 = 0.89

For the rules from A,B,E:
A,B => E     lift = 0.66 / 0.50 = 1.33
A,E => B     lift = 1.00 / 0.75 = 1.33
B,E => A     lift = 1.00 / 1.00 = 1.00
B => A,E     lift = 0.66 / 0.50 = 1.33
E => A,B     lift = 1.00 / 0.75 = 1.33
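Each lift value follows mechanically from the supports in a); a short helper makes this explicit (a sketch; the function name and the dictionary are mine, with supports copied from part a):

# Supports from part a), as fractions of the 4 transactions.
support = {frozenset("ABC"): 0.50, frozenset("ABE"): 0.50,
           frozenset("AB"): 0.75, frozenset("AC"): 0.75, frozenset("AE"): 0.50,
           frozenset("BC"): 0.50, frozenset("BE"): 0.50,
           frozenset("A"): 1.00, frozenset("B"): 0.75,
           frozenset("C"): 0.75, frozenset("E"): 0.50}

def conf_and_lift(body, head):
    # confidence = supp(body union head) / supp(body); lift = confidence / supp(head).
    whole = frozenset(body) | frozenset(head)
    confidence = support[whole] / support[frozenset(body)]
    return confidence, confidence / support[frozenset(head)]

print(conf_and_lift("BC", "A"))  # (1.0, 1.0)
print(conf_and_lift("AB", "E"))  # (0.667, 1.333)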
