
Predicting Missing Items in Shopping Carts

Kasun Wickramaratna, Student Member, IEEE, Miroslav Kubat, Senior Member, IEEE, and
Kamal Premaratne, Senior Member, IEEE
Abstract: Existing research in association mining has focused mainly on how to expedite the search for frequently co-occurring
groups of items in shopping cart type of transactions; less attention has been paid to methods that exploit these frequent itemsets
for prediction purposes. This paper contributes to the latter task by proposing a technique that uses partial information about the
contents of a shopping cart for the prediction of what else the customer is likely to buy. Using the recently proposed data structure of
itemset trees (IT-trees), we obtain, in a computationally efficient manner, all rules whose antecedents contain at least one item from the
incomplete shopping cart. Then, we combine these rules by uncertainty processing techniques, including the classical Bayesian
decision theory and a new algorithm based on the Dempster-Shafer (DS) theory of evidence combination.
Index Terms: Frequent itemsets, uncertainty processing, Dempster-Shafer theory.

1 INTRODUCTION
The primary task of association mining is to detect
frequently co-occurring groups of items in transactional
databases. The intention is to use this knowledge for
prediction purposes: if bread, butter, and milk often
appear in the same transactions, then the presence of
butter and milk in a shopping cart suggests that the
customer may also buy bread. More generally, knowing
which items a shopping cart contains, we want to predict
other items that the customer is likely to add before
proceeding to the checkout counter.
This paradigm can be exploited in diverse applications.
For example, in the domain discussed in [1], each
shopping cart contained a set of hyperlinks pointing to
a Web page; in medical applications, the shopping cart
may contain a patient's symptoms, results of lab tests, and
diagnoses; in a financial domain, the cart may contain
companies held in the same portfolio; and Bollmann-Sdorra
et al. [2] proposed a framework that employs frequent
itemsets in the field of information retrieval.
In all these databases, prediction of unknown items
can play a very important role. For instance, a patient's
symptoms are rarely due to a single cause; two or more
diseases usually conspire to make the person sick. Having
identified one, the physician tends to focus on how to treat
this single disorder, ignoring others that can meanwhile
deteriorate the patient's condition. Such unintentional
neglect can be prevented by subjecting the patient to all
possible lab tests. However, the number of tests one can
undergo is limited by such practical factors as time, costs,
and the patients discomfort. A decision-support system
advising a medical doctor about which other diseases may
accompany the ones already diagnosed can help in the
selection of the most relevant additional tests.
The prediction task was mentioned as early as in the
pioneering association mining paper by Agrawal et al. [3],
but the problem is yet to be investigated in the depth it
deserves. The literature survey in [4] indicates that most
authors have focused on methods to expedite the search for
frequent itemsets, while others have investigated such
special aspects as the search for time-varying associations
[5], [6], [7] or the identification of localized patterns [8], [9].
Still, some prediction-related work has been done as well.
An early attempt by Bayardo and Agrawal [10] reports a
method to convert frequent itemsets to rules. Some papers
then suggest that a selected item can be treated as a binary
class (absence = 0; presence = 1) whose value can be
predicted by such rules. A user asks: does the current status
of the shopping cart suggest that the customer will buy
bread? If yes, how reliable is this prediction? Early
attempts achieved promising results and some authors
even observed that the classification performance of
association mining systems may compare favorably with
that of machine-learning techniques [11], [12], [13].
In our work, we wanted to make the next logical step by
allowing any item to be treated as a class label: its value is
to be predicted based on the presence or absence of other
items. Put another way, knowing a subset of the shopping
cart's contents, we want to guess (predict) the rest.
Suppose the shopping cart of a customer at the checkout
counter contains bread, butter, milk, cheese, and
pudding. Could someone who met the same customer
when the cart contained only bread, butter, and milk,
have predicted that the person would add cheese and
pudding? Implicitly or explicitly, this task stood at the
cradle of this field in the 1990s; now that many practical
obstacles (e.g., computational costs) have been reduced, we
want to return to it.
It is important to understand that allowing any item to be
treated as a class label presents serious challenges as
compared with the case of just a single class label. The
number of different items can be very high, perhaps hundreds, thousands, or even more. To generate association rules for each of them separately would give rise to a great many rules, with two obvious consequences: first, the memory space occupied by these rules can be many times larger than the original database (because of the task's combinatorial nature); second, identifying the most relevant rules and combining their sometimes conflicting predictions may easily incur prohibitive computational costs. In our work, we sought to solve both of these problems by developing a technique that answers users' queries (for
shopping cart completion) in a way that is acceptable not
only in terms of accuracy, but also in terms of time and
space complexity.
In Section 2, we specify the task more formally and then
briefly survey some earlier work in Section 3. Section 4
discusses how to address the problem within the traditional
Bayesian framework, while Sections 5 and 6 describe our
proposed solution. Experimental results on synthetic and
benchmark data are summarized in Section 7.
2 PROBLEM STATEMENT
Let I = {i_1, ..., i_n} be a set of distinct items and let a database consist of transactions T_1, ..., T_N such that T_i ⊆ I, ∀i. An itemset, A, is a group of items, i.e., A ⊆ I. The support of itemset A is the number, or the percentage, of transactions that subsume A. An itemset that satisfies a user-specified minimum support value is referred to as a frequent itemset or a high-support itemset.
An association rule has the form r^(a) ⇒ r^(c), where r^(a) and r^(c) are itemsets. The former, r^(a), is the rule's antecedent and the latter, r^(c), its consequent. The rule reads: if all items from r^(a) are present in a transaction, then all items from r^(c) are also present in the same transaction. The rule does not have to be absolutely reliable. The probabilistic confidence in the rule r^(a) ⇒ r^(c) can be defined with the help of the supports (relative frequencies) of the antecedent and consequent as the percentage of transactions that contain r^(c) among those transactions that contain r^(a):

conf = support(r^(a) ∪ r^(c)) / support(r^(a)).    (1)
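A minimal Python sketch of how the support and confidence in (1) can be computed directly from a list of transactions; the function names and the toy database are ours and serve only as an illustration.

    # Support and confidence of an association rule r_a => r_c, computed
    # directly from a list of transactions (each transaction is a set of items).
    def support(itemset, transactions):
        """Fraction of transactions that subsume the given itemset."""
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(r_a, r_c, transactions):
        """support(r_a ∪ r_c) / support(r_a), as in (1)."""
        return support(set(r_a) | set(r_c), transactions) / support(r_a, transactions)

    # Toy example (not from the paper):
    D = [{"bread", "butter", "milk"}, {"bread", "milk"}, {"milk", "cheese"}]
    print(confidence({"milk"}, {"bread"}, D))   # 2/3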
Let us assume that an association mining program has already discovered all high-support itemsets. For each such itemset, A, and any pair of subsets, r^(a) and r^(c), such that r^(a) ∪ r^(c) = A and r^(a) ∩ r^(c) = ∅, we can define an association rule r: r^(a) ⇒ r^(c). The number of rules implied by A grows exponentially in the number of items in A; therefore, for practical purposes, we usually consider only high-confidence rules derived from high-support itemsets.
Let s be a given itemset. An algorithm developed in [4] generates, in a computationally feasible manner, all rules s ⇒ y that satisfy the user-supplied minimum support and confidence values θ_s and θ_c, respectively. Of course, if no frequent itemset subsumes s, no rules are generated.
However, we are also interested in rules with antecedents that are subsumed by s. As an example, suppose no frequent itemset subsumes s = {wine, milk, bread}. To claim that s does not imply any item, as one may infer from the algorithm in [4], would mean to ignore associations implied by subsets of s. For instance, the antecedent of {milk, bread} ⇒ butter is subsumed by s; we will say that {milk, bread} ⇒ butter matches s. If we have other such matching rules, for instance, {wine, bread} ⇒ egg or wine ⇒ beer, we would like to consider them as well. We thus need a mechanism that not only generates but also combines rules that match s.
Furthermore, we need to be aware of the circumstance that the presence of an item might suggest the absence of other items. For instance, if the shopping cart contains chips, cookies, and cashews, the customer may not buy nuts. We are therefore interested in rules such as {chips, cookies, cashews} ⇒ (nuts = absent), where (nuts = absent) means that no nuts will be added to the cart. Classical association mining usually ignores this aspect, perhaps because negated items tend to increase significantly the total number of rules to be considered; another reason can be that rules with mutually contradicting consequents are not so easy to combine.
With all these issues in mind, we narrow down the space
of association rules by the following guidelines:
1. For a given itemset s, rule antecedents should be
subsumed by s.
2. The rule consequent is limited to any single un-
seen item (presence or absence of the unseen item).
In essence, this paper addresses the following tasks:
Given a transaction with the itemset s ⊆ I, find the set of matching rules (entailed by the training data set) that are of the form r^(a) ⇒ i_j, j = 1, ..., n, such that r^(a) ⊆ s and i_j ∉ s, and that exceed the user-supplied minimum support, θ_s, and minimum confidence, θ_c, thresholds. Then, devise a method to combine the matching rules that have mutually contradicting consequents and reach a decision on which other items would be added to the transaction.
3 EARLIER WORK
Association mining systems that have been developed with
classification purposes in mind are sometimes dubbed
classification rule mining. Some of these techniques can be
adapted to our needs.
Take, for instance, the approach proposed in [12]. If i_j is the item whose absence or presence is to be predicted, the technique can be used to generate all rules that have the form r^(a) ⇒ L_j, where r^(a) ⊆ (I \ {i_j}) and L_j is the binary class label (i_j = present or i_j = absent). For a given itemset s, the technique identifies, among the rules with antecedents subsumed by s, those that have the highest precedence according to the reliability of the rules; this reliability is assessed based on the rules' confidence and support values. The rule is then used for the prediction of i_j. The method suffers from three shortcomings. First, it is clearly not suitable in domains with many distinct items i_j. Second, the consequent is predicted based on the testimony of a single rule, ignoring the simple fact that rules with the same antecedent can imply different consequents; a method to combine these rules is needed. Third, the system may be sensitive to the subjective user-specified support and confidence thresholds.
Some of these weaknesses are alleviated in [11], where
a missing item is predicted in four steps. First, they use a
so-called partitioned-ARM to generate a set of association
rules (a ruleset). The next step prunes the ruleset (e.g., by
removing redundant rules). From these, rules with the
smallest distance from the observed incomplete shopping
cart are selected. Finally, the items predicted by these
rules are weighed by the similarity of the rules' antecedents to
the shopping cart.
The approach in [14] pursues a Dempster-Shafer (DS)
belief theoretic approach that accommodates general data
imperfections. To reduce the computational burden, Hewa-
wasam et al. employ a data structure called a belief itemset
tree. Here, too, rule generation is followed by a pruning
algorithm that removes redundant rules. In order to predict
the missing item, the technique selects a matching ruleset; a rule is included in the matching ruleset if the incoming itemset is contained in the rule antecedent. If no rules satisfy this condition, then, from those rules that have a nonempty intersection with the itemset s, rules whose antecedents are closer to s according to a given distance criterion (and a user-defined distance threshold) are picked.
Confidence of the rule, its entropy, and the length of its
antecedent are used to assign DS theoretic parameters to the
rule. Finally, the evidence contained in each rule belonging
to the matching ruleset is combined or pooled via a
DS theoretic fusion technique.
In principle, at least, we could adopt any of the above
methodologies; but the trouble is that they were all
designed primarily for the classification task and not for
shopping cart completion. Specifically, the number of times
such classifiers have to be invoked would be equal to the
number of all distinct items in the database (i.e., n) minus
the number of those already present in the shopping cart.
This is why we sought to develop a predictor that would
predict all items in a computationally tractable manner.
Another aspect of these approaches is the enormous
amount of effort/cost it takes to obtain a tangible and
meaningful set of rules. The root of the problem lies in the Apriori-like algorithms used to generate frequent itemsets and the corresponding association rules; the costs become prohibitive when the database is large and complicated. Here, the size and difficulty are determined by four parameters: the number of transactions, the number of distinct items, the average transaction length, and the minimum support threshold. For example, the problem can become intractable if the number of frequent items is large; and whether an item is frequent or not is affected by the minimum support threshold. It is well known that Apriori-based algorithms suffer from performance degradation in large-scale problems due to combinatorial explosion and repeated passes
through the database [15].
4 BAYESIAN APPROACH
We are not aware of any attempt to apply to our task the
Bayesian approach, a technique that has been around
since before World War II [16], [17]. The mathematically
clean version is known to be computationally expensive
in domains where many independent variables are present.
Fortunately, this difficulty can be sidestepped by the so-
called Naive-Bayes principle that assumes that all variables
are mutually pairwise conditionally independent [18].
Although this requirement is rarely strictly satisfied,
decades of machine learning research have shown that,
most of the time, Naive Bayes can be used nevertheless;
conditional interdependence of variables usually affects
classification behavior of the resulting formulas only
marginally.
Suppose we want to establish whether the presence of the itemset s = {i_1^(s), ..., i_k^(s)} increases the chance that an item i_j ∉ s is also present. Bayes' rule yields

P(i_j | i_1^(s), ..., i_k^(s)) = P(i_1^(s), ..., i_k^(s) | i_j) · P(i_j) / P(i_1^(s), ..., i_k^(s)).    (2)

We select all items for which P(i_j | i_1^(s), ..., i_k^(s)) > P(¬i_j | i_1^(s), ..., i_k^(s)), where P(¬i_j | i_1^(s), ..., i_k^(s)) is the probability of item i_j being absent given that the itemset s is present. Since the denominator is the same for any given itemset s, it is enough if the classifier chooses the items that maximize the value of the numerator.
With the Naive Bayes assumption, pairwise independence of i_1^(s), ..., i_k^(s) conditional on i_j yields the following expression for the numerator of (2):

P(i_1^(s), ..., i_k^(s) | i_j) = Π_{l=1}^{k} P(i_l^(s) | i_j).    (3)

Hence, i_j is chosen if it maximizes P(i_j) · Π_l P(i_l^(s) | i_j).
One practical problem with this formula is that its value is zero if P(i_l^(s) | i_j) = 0 for some l = 1, ..., k, a situation that will occur quite often in view of the sparseness of the data in association mining applications. This difficulty is easily rectified if we estimate the conditional probabilities by the m-estimate (originally proposed for machine learning applications in [19]), which makes it possible to bias the probabilities toward user-set prior values. Let us constrain ourselves to a binary domain where each trial in a set of trials results in one of two outcomes, a or b. Let us denote by n_a and n_b the numbers of occurrences of the respective outcomes, and let n_all = n_a + n_b. If, before the trials, the user's prior expectation of the probability of a was π_a, then, after the trials, the probability of a is estimated as

P_a = (n_a + m·π_a) / (n_all + m).    (4)

The parameter m quantifies the experimenter's confidence in this estimate. Note that, for n_all = 0 (which implies n_a = 0), (4) degenerates to the prior expectation, π_a. Conversely, the equation converges to the relative frequency if n_all is so large that the terms m·π_a and m can be neglected. Generally speaking, if m is small, then even weak evidence will affect the prior estimate; the higher the value of m, the stronger the evidence needed if the prior estimate is to be overturned.
In the prediction context, we want to compare the probability estimates of the presence/absence of one item based on another item. For m = 2, the m-estimate will be calculated as

P(i_l^(s) | i_j) = (count(i_l^(s), i_j) + 1) / (count(i_j) + 2),    (5)

where count(i_l^(s), i_j) is the number of appearances of the itemset {i_l^(s), i_j} and count(i_j) is the number of appearances of the item i_j in the data set. For a given i_j, this formula is used to calculate the values P(i_l^(s) | i_j) for all items in the antecedent. This makes it possible to evaluate the probabilities in (3),
which, in turn, are used to calculate the posteriors in (2). The
following rule is then used to predict the missing items:
. Finding s = {i_1^(s), ..., i_k^(s)} in the shopping cart, predict all i_j such that P(i_j) · Π_l P(i_l^(s) | i_j) > P(¬i_j) · Π_l P(i_l^(s) | ¬i_j).
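A minimal Python sketch of the Bayesian predictor just described, using the m-estimate of (5) with m = 2; the function and variable names are ours, the uniform prior of 0.5 is an assumption, and the toy data only illustrate the call.

    from itertools import chain

    def m_estimate(count_joint, count_cond, m=2, prior=0.5):
        # m-estimate of a conditional probability, cf. (4); with m = 2 and
        # prior = 0.5 this reduces to (count_joint + 1) / (count_cond + 2), as in (5).
        return (count_joint + m * prior) / (count_cond + m)

    def naive_bayes_predict(s, transactions):
        """Predict all i_j with P(i_j)·Π P(i_l|i_j) > P(¬i_j)·Π P(i_l|¬i_j)."""
        N = len(transactions)
        items = set(chain.from_iterable(transactions))
        predicted = []
        for j in items - set(s):
            with_j = [t for t in transactions if j in t]
            without_j = [t for t in transactions if j not in t]
            score_pos = len(with_j) / N        # P(i_j)
            score_neg = len(without_j) / N     # P(not i_j)
            for l in s:
                score_pos *= m_estimate(sum(l in t for t in with_j), len(with_j))
                score_neg *= m_estimate(sum(l in t for t in without_j), len(without_j))
            if score_pos > score_neg:
                predicted.append(j)
        return predicted

    # Toy run with a small database and the partial cart {2, 3}:
    D = [{1, 4}, {2, 5}, {1, 2, 3, 4, 5}, {1, 2, 4}, {2, 5}, {2, 4}]
    print(naive_bayes_predict({2, 3}, D))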
5 THE PROPOSED APPROACH
Association rule mining (ARM) in its original form finds all
the rules that satisfy the minimum support and minimum
confidence constraints. Many later papers tried to integrate
classification and ARM. The goal was to build a classifier
using so-called class association rules. In classification rule
mining, there is one and only one predetermined target, the
class label. Most of the time, classification rule mining is
applied to databases in a table format, with a predefined
set of attributes and a class label. Attributes usually take a
value out of a finite set of values (although missing values
are often permitted).
Some papers, such as [11] and [14], demonstrated
encouraging results by incorporating DS theoretic notions
with class association rules. But most of these methods
were designed for data sets with a limited number of attributes (or data sets with a small number of distinct items) and one class label. In our task, we do not have a predefined class label. In fact, all items in the shopping cart become attributes, and the presence/absence of the other items has to be predicted. What is needed is a feasible rule generation algorithm and an effective method to use the generated rules to this end.
For the prediction of all missing items in a shopping cart,
our algorithm speeds up the computation by the use of the
itemset trees (IT-trees) and then uses DS theoretic notions to
combine the generated rules. The flowchart in Fig. 1 shows
an outline of our proposed system.
5.1 Itemset Tree (IT-Tree)
Let us briefly summarize the technique of IT-trees as
developed in [4]. Here, items are identified by an unin-
terrupted sequence of integers. Let us order the items
i
1
.. . . .i
i
so that i
/
< i
|
for / < |, where i
/
and i
|
are integers
identifying the /th and the |th terms, respectively. From
now on, we will refer to the items by their corresponding
integers. We will say that the /th item is greater than the
,th item if i
/
i
,
.
Definition 1 (Ancestor, Largest Common Ancestor, Child). Let s, e, and a denote itemsets.

1. s is an ancestor of e, and we write s ⪯ e, iff s = {i_1, ..., i_p}, e = {i_1, ..., i_q}, and p ≤ q.
2. a is the largest common ancestor of s and e, and we write a = s ⊓ e, iff a ⪯ s, a ⪯ e, and there is no a' ≠ a such that a' ⪯ s, a' ⪯ e, and a ⪯ a'.
3. e is a child of s iff s ⪯ e and there is no x, different from s and e, such that s ⪯ x ⪯ e.

Note that an ancestor of e is an uninterrupted sequence of the smallest items in e. For instance, [1, 2] is an ancestor of [1, 2, 3, 4]. Itemsets [1, 2] and [2, 5] cannot have a common ancestor because any ancestor of [1, 2] has to begin with item 1, which is not contained in [2, 5].
Definition 2 (IT-Tree). An itemset tree, T, consists of a root and a (possibly empty) set, T_1, ..., T_k, of subtrees, each of which is itself an itemset tree. The root is a pair [s, f(s)], where s is an itemset and f(s) is a frequency. If s_i denotes the itemset associated with the root of the ith subtree, then s ⪯ s_i, s ≠ s_i, must be satisfied for all i.

An IT-tree is a partially ordered set of pairs, [itemset, f], where the f-value tells us how many occurrences of the itemset the node represents. An algorithm that builds the IT-tree in a single pass through the database is presented in [4], which also proves some of the algorithm's critical properties. For example, the number of nodes in the IT-tree is upper-bounded by twice the number of transactions in the original database (although experiments indicate that, in practical applications, the size of the IT-tree rarely exceeds the size of the database). Moreover, each distinct transaction database is represented by a unique IT-tree, and the original transactions can be reproduced from the IT-tree.
Li and Kubat [20] made a useful modification. Note that some of the itemsets in the IT-tree (e.g., [1, 2, 4] in Fig. 2) are identical to at least one of the transactions contained in the original database, whereas others (e.g., [1, 2]) were created during the process of tree building, where they came into being as common ancestors of transactions from lower levels. They modified the original tree-building algorithm by flagging each node that is identical to at least one transaction. In Fig. 2, the flags are indicated by black dots. This flagged IT-tree will become the basis of our rule generation algorithm.
Fig. 1. An overview of our proposed system.
Fig. 2. The IT-tree constructed from the database D in Example 1.
Example 1 (An IT-Tree). The flagged IT-tree of the database D = {[1, 4], [2, 5], [1, 2, 3, 4, 5], [1, 2, 4], [2, 5], [2, 4]} is shown in Fig. 2.
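To make the data structure concrete, the following Python sketch (ours) shows a flagged IT-tree node and a hand-built tree for the database of Example 1, as we read Fig. 2; the exact layout of the figure may differ, and the construction algorithm itself is described in [4], [20].

    # A flagged IT-tree node: an itemset, its frequency f (number of transactions
    # in the node's subtree), a flag marking nodes identical to at least one
    # transaction, and the list of child subtrees (cf. Definition 2 and [20]).
    class ITNode:
        def __init__(self, itemset, f, flagged=False, children=None):
            self.itemset = list(itemset)
            self.f = f
            self.flagged = flagged
            self.children = children or []

    def db_frequency(node):
        # Number of occurrences of node.itemset as an actual transaction:
        # f minus the f-values of the node's children (used later in Algorithm 1).
        return node.f - sum(c.f for c in node.children)

    # Hand-built reconstruction of the flagged IT-tree of Example 1 (flagged = black dots):
    tree = ITNode([], 6, False, [
        ITNode([1], 3, False, [
            ITNode([1, 4], 1, True),
            ITNode([1, 2], 2, False, [
                ITNode([1, 2, 3, 4, 5], 1, True),
                ITNode([1, 2, 4], 1, True)])]),
        ITNode([2], 3, False, [
            ITNode([2, 5], 2, True),
            ITNode([2, 4], 1, True)])])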
5.2 Rule Generation Mechanism
Recall our objective: given a record with itemset s ⊆ I, find all rules of the form r^(a) ⇒ i_j, j = 1, ..., n, where r^(a) ⊆ s and i_j ∉ s, that exceed the minimum support and minimum confidence thresholds. Note that the consequent i_j is a single unseen item, i.e., |i_j| = 1. Also, the prediction (of the consequent) could be (item = present) or (item = absent). For each unseen item, the corresponding ruleset is selected and a DS theoretic approach is used to combine the rules. This prediction is given as a DS theoretic mass structure over the set of singletons or the frame of discernment Θ = {(i_j = present), (i_j = absent)}. If no rule consequent in the generated ruleset contains a particular unseen item, no prediction is made regarding that item.
To expedite the rule generation process, we use the IT-tree approach from [20], which modifies the rule generation algorithm from [4], for two reasons. First, the algorithm from [4] addresses a slightly different task, generating all rules of the form s ⇒ y, where s and y are itemsets and s ∩ y = ∅ (as explained earlier, we are interested in a slightly different set of rules). Second, our goal is not to generate all association rules but, rather, to build a predictor from a set of effective association rules.
5.3 Proposed Solution
The proposed rule generation algorithm makes use of the
flagged IT-tree created from the training data set. The
algorithm takes an incoming itemset as the input and
returns a graph that defines the association rules entailed by
the given incoming itemset.
The graph consists of two lists: the antecedents list L^(a) and the consequents list L^(c). Each node, r_i^(a), in the antecedents list keeps the corresponding frequency count f(r_i^(a)). As shown in Fig. 3, a line, λ_{i,j}, between the two lists links an antecedent r_i^(a) with a consequent L_j. The cardinality of the link, f(λ_{i,j}), represents the support count of the rule r_i^(a) ⇒ L_j. The frequency counts denoted by f_a(·) are used in the process of building the graph. If the incoming itemset is s and if T_i represents a transaction in the database, then f_a(r^(a)) records the number of times s ∩ T_i = r^(a). Thus, f_a(λ_{i,j}) records the number of times where s ∩ T_i = r^(a) and L_j ∈ T_i. All the frequency counts are initialized to zero at the beginning of the algorithm and updated as we traverse the IT-tree.
Algorithm 1 conducts a depth-first search in the IT-tree to identify the nodes that have nonempty intersections with s. Note that the items in s and e_i are referred to by their corresponding integer representations and sorted in ascending order.
Algorithm 1. Algorithm that processes the itemset tree T and returns the rule graph G that predicts unseen items in a user-specified itemset s.
Let R denote the root of T and let [e_i, f(e_i)] be R's children. Let T_i denote the subtree whose root is [e_i, f(e_i)].
To invoke Rule_mining use: G = Rule_mining(s, T, ∅).
1: Rule_mining(s, T, G):
2: for all e_i such that first_item(e_i) ≤ last_item(s) do
3:   if s ∩ e_i = ∅ and last_item(e_i) < last_item(s) then
4:     G ← Rule_mining(s, T_i, G);
5:   else if r^(a) = s ∩ e_i ≠ ∅ then
6:     if e_i is not flagged then
7:       G ← Rule_mining(s, T_i, G);
8:     else
9:       if e_i does not have children then f_a ← f(e_i);
10:      else f_a ← f(e_i) − Σ f(e_i's children);   (f_a is the frequency of e_i in the database)
11:      end if
12:      G ← Update_Graph(G, r^(a), e_i, f_a);
13:      G ← Rule_mining(s, T_i, G);
14:    end if
15:  end if
16: end for
17: return G;
The individual steps of the algorithm can be summarized as follows: Let R = [s_1, f(s_1)] denote the root, let e_i denote the children of R, and let s denote the incoming itemset. If the first item in e_i is greater than the last item in s, it is certain that no tree node rooted at e_i will contain items from the itemset s. If e_i ∩ s = ∅ and the last item in e_i is greater than the last item in s, then, again, it is certain that no nodes in the subsequent subtrees have nonempty intersections with s. But if the first item of e_i is less than the last item of s, the subtree T_i with the root [e_i, f(e_i)] may contain one or more items from s. The algorithm then starts the search for rules in the subtrees rooted at the children of e_i. If e_i ∩ s ≠ ∅, the intersection (say, r^(a)) is a candidate for a rule antecedent. However, if the node [e_i, f(e_i)] is not flagged (i.e., if the itemset e_i does not exist in the actual data set), the candidate antecedent loses its candidacy status. Now, the nodes in the subtrees starting from the children of e_i possess intersections with s that are equal to r^(a) or larger than r^(a) (i.e., the intersection is a superset of r^(a)). The algorithm thus continues the search for rules in the subtrees rooted at the children of e_i. If e_i is flagged, the number of occurrences of e_i in the data set is calculated as f(e_i) − Σ f(e_i's children). Then, r^(a) becomes a rule antecedent and each item in e_i \ r^(a) becomes a consequent.

Fig. 3. The Rule Graph, G. f(r_i^(a)) = frequency count of the antecedent; f(λ_{i,j}) = support count of the rule r_i^(a) ⇒ L_j.
The new rules, r^(a) ⇒ i_j, where i_j ∈ (e_i \ r^(a)), are added to the rule graph.
Each nonempty intersection of s with a flagged node of the tree generates a set of association rules of the form r^(a) ⇒ i_j, where r^(a) is the intersection of s with the node and i_j is an item in the node such that i_j ∉ r^(a). Note that a flagged node represents an actual transaction in the data set; the number of flagged nodes is upper-bounded by N (the number of transactions in the data set). These rules are added to the rule graph using Algorithm 2. The idea is to update the frequency counts of all rules r_i^(a) ⇒ i_j, where r_i^(a) ⊆ r^(a). If the new rule does not exist in the rule graph, it has to be added to the graph and its frequency count has to be updated using all the rules of the form r_i^(a) ⇒ i_j, where r^(a) ⊆ r_i^(a).
Algorithm 2. Simplified algorithm to update the rule graph.
Let G denote the current rule graph.
Let e_i denote an itemset from a node and f_a denote the number of appearances of e_i in the database.
Let r^(a) = e_i ∩ s, where s is the incoming itemset.
To invoke Update_Graph use: G = Update_Graph(G, r^(a), e_i, f_a).
1: Update_Graph(G, r^(a), e_i, f_a):
2: for all r_i^(a) in L^(a) do
3:   if r_i^(a) ⊆ r^(a) then
4:     update the frequency count of r_i^(a): f(r_i^(a)) ← f(r_i^(a)) + f_a;
5:     for all r_i^(a) ⇒ i_j where i_j ∈ (e_i \ r^(a)) do
6:       update the frequency count of rule r_i^(a) ⇒ i_j: f(λ_{i,j}) ← f(λ_{i,j}) + f_a;
7:     end for
8:     if r_i^(a) = r^(a) then
9:       update the frequency count record: f_a(r_i^(a)) ← f_a(r_i^(a)) + f_a;
10:      for all r_i^(a) ⇒ i_j where i_j ∈ (e_i \ r^(a)) do
11:        update the frequency count record of the rule: f_a(λ_{i,j}) ← f_a(λ_{i,j}) + f_a;
12:      end for
13:    end if
14:  else if r^(a) ⊂ r_i^(a) then
15:    update the frequency count of the new rule antecedent: f(r^(a)) ← f(r^(a)) + f_a(r_i^(a));
16:    for all r_i^(a) ⇒ i_j in G where i_j ∈ (e_i \ r^(a)) do
17:      update the frequency count of the new rule r^(a) ⇒ i_j (add f_a(λ_{i,j}));
18:    end for
19:  end if
20: end for
21: for all i_j ∈ (e_i \ r^(a)) do
22:   if (r^(a) ⇒ i_j) ∉ G then
23:     add the rule to the graph with the corresponding frequency counts;
24:   end if
25: end for
26: return G;
The size of the antecedent list L^(a) of the rule graph G is upper-bounded by min(N, 2^p), where p is the size of the itemset s. The size of L^(c) is upper-bounded by (n − p), where n is the number of distinct items in the data set. At the beginning, each list is empty; the lists grow as we traverse the itemset tree. Algorithm 1 scans the IT-tree and calls Algorithm 2 each time it comes across a flagged node. At each call, Algorithm 2 carries out one traversal of the ruleset (and possibly adds some rules to it). Thus, the worst-case complexity of the rule generation process is O(N^2). In reality, the computational complexity is much lower because the nodes that have nonempty intersections with s usually constitute only a small fraction of N. In addition, when p is small, the size of the rule antecedent list can be much smaller than N (i.e., when 2^p < N).
Example 2 (A Rule Generation Example). We consider the same data set as in the previous example, viz., D = {[1, 4], [2, 5], [1, 2, 3, 4, 5], [1, 2, 4], [2, 5], [2, 4]}. Assume that the incoming itemset is s = [2, 3]. Fig. 4 shows the step-by-step building of the rule graph by Algorithm 1. The itemset s has nonempty intersections with four flagged nodes, [1, 2, 3, 4, 5], [1, 2, 4], [2, 4], and [2, 5]. A set of rules is added to the rule graph with each nonempty intersection.

Consider the intersection with the flagged node e_i = [1, 2, 3, 4, 5]. The intersection, r^(a) = [2, 3], is added to the antecedent list of the rule graph, and the consequents 1, 4, and 5 are added to the consequent list, together with links connecting the antecedent and the consequents. Since the frequency count of the node is f_a = 1, all the f(·) values in the graph assume 1, as shown in Fig. 4a (lines 21-25 of Algorithm 2). At the node [1, 2, 4], the intersection, [2], is taken as r^(a). The frequency counts f(r^(a)), f_a(r^(a)) and all frequency counts of both candidate rules [2] ⇒ 1 and [2] ⇒ 4 are initialized to the frequency count of the node [1, 2, 4], f_a = 1. Since the current rule graph contains [2, 3] in L^(a) and [2] is a subset of [2, 3], the frequency counts are updated according to lines 15-18, and the new rules are added to the graph (lines 21-25). At the node [2, 4], the intersection is again [2]. Since it is already in the graph, the frequency counts of the antecedent and the corresponding rule (i.e., [2] ⇒ 4) are updated according to lines 4-13. Processing of the node [2, 5] is similar to the previous case. However, in this case, a new rule, [2] ⇒ 5, is added to the graph (lines 21-25).

The ruleset that resides in G is given in Table 1. The rule [2, 3] ⇒ 1 suggests that, if the itemset [2, 3] is present in a shopping cart, item 1 is likely to be added to the cart. The support of this rule is 1/6 and the confidence is 1.

Fig. 4. Rule graph construction for the testing itemset [2, 3] using the IT-tree in Fig. 2 (the testing itemset possesses nonempty intersections with only four nodes of the tree). (a) After node [1, 2, 3, 4, 5], (b) after node [1, 2, 4], (c) after node [2, 4], and (d) the final rule graph, G, after node [2, 5].
Note that the ruleset in Table 1 consists of only two distinct antecedents: [2] and [2, 3]. Since no minimum support or confidence threshold has been applied yet, one may expect another ruleset with the antecedent [3]. However, our algorithm does not generate rules having the antecedent [3]. Note that no transaction T_i in the data set D provides an intersection T_i ∩ s = [3]; that is, whenever item 3 appears in a transaction, one or more of the other items from s happen to appear in T_i, too. So, item 3 alone does not provide any additional evidence for the given itemset s. This is why our rule-generating algorithm ignores such rules.
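For intuition, the ruleset of Example 2 can also be obtained by a brute-force pass over the raw transactions: every distinct nonempty intersection of s with a transaction becomes an antecedent, and every item of that transaction outside s becomes a consequent. The Python sketch below (ours, for illustration only) follows this definition directly rather than the IT-tree-based Algorithms 1 and 2, which compute the same counts far more efficiently.

    from collections import defaultdict

    def matching_rules(s, transactions):
        """All rules r_a => i_j with r_a = s ∩ T for some transaction T and i_j ∉ s,
        together with their support and confidence (no thresholds applied)."""
        s = frozenset(s)
        N = len(transactions)
        rule_count = defaultdict(int)   # (antecedent, consequent) -> supporting transactions
        antecedents = {s & set(t) for t in transactions} - {frozenset()}
        for t in transactions:
            t = set(t)
            for r_a in antecedents:
                if r_a <= t:
                    for item in t - s:
                        rule_count[(r_a, item)] += 1
        rules = []
        for (r_a, item), cnt in rule_count.items():
            f_ant = sum(r_a <= set(t) for t in transactions)
            rules.append((sorted(r_a), item, cnt / N, cnt / f_ant))
        return rules   # (antecedent, consequent, support, confidence)

    D = [[1, 4], [2, 5], [1, 2, 3, 4, 5], [1, 2, 4], [2, 5], [2, 4]]
    for rule in sorted(matching_rules([2, 3], D)):
        print(rule)   # e.g. ([2], 4, 0.5, 0.6) and ([2, 3], 1, 0.1666..., 1.0)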
It is important to note here that one might be interested in rules that suggest the absence of items. For instance, [2, 3] ⇒ (1 = absent); that is, when items 2 and 3 are already present in the cart, then item 1 is unlikely to be added to the cart in the future. In this event, the IT-tree-building algorithm has to regard the (item, value) pair as an item. For instance, (1 = present) is one item and (1 = absent) is another. Then, the generated ruleset will eventually consist of rules suggesting both the presence and absence of items. These rules then have to be combined to yield the final decision. Note that we select only the rules that exceed the minimum support and the minimum confidence in the rule combination step. In addition, if two rules with the same consequent have overlapping antecedents such that the antecedent of one rule is a subset of the antecedent of the other rule (e.g., a ⇒ c and {a, b} ⇒ c), we only consider the rule with the higher confidence. How the selected rules are used for prediction is described in the next section.
6 EMPLOYING DS THEORY
When searching for a way to predict the presence or absence of an item i_j in a partially observed shopping cart s, we wanted to use association rules. However, many rules with equal antecedents differ in their consequents; some of these consequents contain i_j, others do not. The question is how to combine (and how to quantify) the potentially conflicting evidence. One possibility is to rely on the DS theory of evidence combination. Let us now describe our technique, which we refer to by the acronym DS-ARM (Dempster-Shafer-based Association Rule Mining).
6.1 Preliminaries
Consider a set of mutually exclusive and exhaustive propositions, Θ = {θ_1, ..., θ_K}, referred to as the frame of discernment (FoD). A proposition θ_i, referred to as a singleton, represents the lowest level of discernible information. In our context, θ_i states that the value of the attribute is equal to θ_i. Elements in 2^Θ, the power set of Θ, form all the propositions of interest. Any proposition that is not a singleton, e.g., (θ_1, θ_2), is referred to as composite. DS theory assigns to any set A ⊆ Θ a numeric value m(A) ∈ [0, 1], called a basic belief assignment (BBA) or mass, that quantifies the evidence one has toward the proposition that the given attribute value is in A and only in A. The mass function has to satisfy the following conditions [21]:

m(∅) = 0;    Σ_{A ⊆ Θ} m(A) = 1.    (6)

Note that, if Ā denotes the complement of A, then m(A) + m(Ā) ≤ 1.

Any proposition that possesses a nonzero mass, i.e., m(A) > 0, is called a focal element; the set of focal elements, F, is referred to as the core. The triple {Θ, F, m(·)} is called the body of evidence (BoE).
An indication of the evidence one has toward all propositions that may themselves imply a given proposition A ⊆ Θ is quantified via the belief, Bel(A) ∈ [0, 1], defined as

Bel(A) = Σ_{B ⊆ A} m(B).    (7)

Note that Bel(A) = m(A) if A is a singleton. The plausibility of A is defined as Pl(A) = 1 − Bel(Ā); it represents the extent to which one finds A plausible.

A probability distribution Pr(·) satisfying Bel(A) ≤ Pr(A) ≤ Pl(A), ∀A ⊆ Θ, is said to be compatible with the underlying BBA m(·). An example of such a probability distribution is the pignistic probability distribution BetP(·), defined for each singleton θ_i as follows [22]:

BetP(θ_i) = Σ_{θ_i ∈ A} m(A) / |A|.    (8)

Here, |A| denotes the cardinality of the set A.
Dempster's rule of combination (DRC) makes it possible to arrive at a new BoE by fusing the information from several BoEs that span the same FoD. Consider the two BoEs {Θ, F_1, m_1(·)} and {Θ, F_2, m_2(·)}. Then,

K_12 = Σ_{B_i ∩ C_j = ∅} m_1(B_i) · m_2(C_j)    (9)

is referred to as the conflict because it indicates how much the evidence of the two BoEs is in conflict. If K_12 < 1, then the two BoEs are compatible and can be combined to obtain the overall BoE {Θ, F, m(·)} as follows: For all A ⊆ Θ,

m(A) = (m_1 ⊕ m_2)(A) = [Σ_{B_i ∩ C_j = A} m_1(B_i) · m_2(C_j)] / (1 − K_12).    (10)
TABLE 1. Rule Set That Resides in G

A variation of the DRC that can be used to address the reliability of the evidence provided by each contributing BoE is to incorporate a discounting factor δ_i, δ_i ≤ 1, into each BoE [21]. The BBA thus generated is m(A) = (m̂_1 ⊕ m̂_2)(A), where, for i = 1, 2,

m̂_i(A) = δ_i · m_i(A), for A ⊂ Θ;
m̂_i(Θ) = (1 − δ_i) + δ_i · m_i(Θ).    (11)
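A minimal Python sketch of (9)-(11) for the binary frame used in this paper, Θ = {present, absent}; masses are stored in a dictionary keyed by frozensets. The helper names are ours, and the fragment is only an illustration of the standard DRC with discounting, not the authors' implementation.

    PRESENT, ABSENT = "present", "absent"
    THETA = frozenset({PRESENT, ABSENT})

    def discount(m, delta):
        """Discount a BBA as in (11): scale the masses by delta, move the rest to Θ."""
        m_hat = {A: delta * v for A, v in m.items() if A != THETA}
        m_hat[THETA] = (1 - delta) + delta * m.get(THETA, 0.0)
        return m_hat

    def combine(m1, m2):
        """Dempster's rule of combination, (9)-(10); assumes the conflict K_12 < 1."""
        conflict = sum(v1 * v2 for A, v1 in m1.items() for B, v2 in m2.items() if not (A & B))
        combined = {}
        for A, v1 in m1.items():
            for B, v2 in m2.items():
                C = A & B
                if C:
                    combined[C] = combined.get(C, 0.0) + v1 * v2 / (1 - conflict)
        return combined

    def pignistic(m):
        """Pignistic probability (8) of each singleton."""
        return {x: sum(v / len(A) for A, v in m.items() if x in A) for x in THETA}

    # Two hypothetical rules about the same unseen item (masses are made up):
    m1 = {frozenset({PRESENT}): 0.6, THETA: 0.4}
    m2 = {frozenset({ABSENT}): 0.5, THETA: 0.5}
    print(pignistic(combine(discount(m1, 0.9), discount(m2, 0.8))))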
6.2 Concrete Application to Our Task
6.2.1 Basic Belief Assignment
In association mining techniques, a user-set minimum
support decides about which rules have high support.
Once the rules are selected, they are all treated the same,
irrespective of how high or how low their support.
Decisions are then made solely based on the confidence
value of the rule. However, a more intuitive approach
would give more weight to rules with higher support.
Therefore, we propose a novel method to assign to the rules
masses based on both their confidence and support values.
However, the support value should have a smaller impact
on the mass.
In many applications, the training data set is skewed. Thus, in a supermarket scenario, the percentage of shopping carts containing, say, canned fish can be 5 percent, with the remaining 95 percent of shopping carts not containing this
item. Hence, the rules that suggest the presence of canned
fish will have very low support while rules suggesting the
absence of canned fish will have a higher support. Unless
compensated for, a predictor built from a skewed training
set typically tends to favor the majority classes at the
expense of minority classes. In many scenarios, such a
situation must be avoided.
To account for this data set skewness, we propose to
adopt a modified support value as follows:
Definition 3 (Partitioned-Support). The partitioned-support p_supp of the rule r^(a) ⇒ r^(c) is the percentage of transactions that contain r^(a) among those transactions that contain r^(c), i.e.,

p_supp = support(r^(a) ∪ r^(c)) / support(r^(c)).
With Definition 3 in place, we take inspiration from the traditional F_β-measure [24] and use the weighted harmonic mean of support and confidence to assign the following BBA to the rule r^(a) ⇒ i_j:

m(A | r^(a)) = ω, for A = {i_j} (the rule's consequent); 1 − ω, for A = Θ; and 0, otherwise,    (12)

where

ω = (1 + β²) · conf · p_supp / (β² · conf + p_supp),    β ∈ [0, 1].    (13)

Note that, for the task at hand, |i_j| = 1 and, hence, Θ = {(i_j = present), (i_j = absent)}. Note also that, as β decreases, the emphasis placed upon the partitioned-support measure in m(·) decreases as well.
With this mass allocation, the effectiveness of a rule is essentially tied to both its confidence and its partitioned-support. Moreover, just as the F_β-measure enables one to trade off the precision and recall measures of an algorithm, the mass allocation above allows one to trade off the effectiveness of the confidence and partitioned-support of a rule. Indeed, the parameter β can be thought of as a way to quantify the user's willingness to trade increased confidence for a loss in partitioned-support.
6.2.2 Discounting Factor
Following the work in [15], the reliability of the evidence provided by each contributing BoE is addressed by incorporating the following discounting factor:

δ = [1 / (1 + Ent)] · [1 / (1 + ln(n − |r^(a)|))],    with
Ent = − Σ_{i_j ⊆ Θ} m(i_j | r^(a)) · ln m(i_j | r^(a)).    (14)

Recall that n denotes the number of items in the database. The term 1/(1 + Ent) accounts for the uncertainty of the rule about its consequent. The term 1/(1 + ln(n − |r^(a)|)) accounts for the nonspecificity of the rule antecedent. Note that δ increases as Ent decreases and as the length of the rule antecedent increases. As dictated by (11), the BBA then gets modified accordingly. The DRC is then used on the modified BoEs to combine the evidence.
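Putting (12)-(14) together, the following Python fragment (ours) shows how a single rule's mass and discounting factor might be computed; it reuses PRESENT, THETA, and discount from the sketch above, and the numbers are hypothetical.

    import math

    def rule_bba(conf, p_supp, consequent, beta=1/3):
        """BBA of a rule, cf. (12)-(13): mass ω on the consequent, 1 − ω on Θ."""
        omega = (1 + beta**2) * conf * p_supp / (beta**2 * conf + p_supp)
        return {frozenset({consequent}): omega, THETA: 1 - omega}

    def discount_factor(m, antecedent_len, n):
        """Discounting factor δ of (14); assumes antecedent_len < n."""
        ent = -sum(v * math.log(v) for v in m.values() if v > 0)
        return (1 / (1 + ent)) * (1 / (1 + math.log(n - antecedent_len)))

    # Hypothetical rule {bread, milk} => (egg = present), conf = 0.8, p_supp = 0.4, n = 10 symbols:
    m = rule_bba(0.8, 0.4, PRESENT)
    m_hat = discount(m, discount_factor(m, antecedent_len=2, n=10))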
Example 3. Table 2 shows a ruleset generated for the itemset (bread, milk) and a supermarket data set that contains five distinct items, viz., {egg, bread, butter, milk, wheat bread}. Integers from 1 to 5 are used to denote the presence of items and 6 to 10 are used to denote the absence of items. For instance, (egg = present) = 1, (bread = present) = 2, (egg = absent) = 6, etc. The last two columns show the computed BBA and δ values of the rules. In this example, the mass assignment is done using three times less weight for the partitioned-support than for the confidence, i.e., β = 0.33, and δ is computed using (14).

TABLE 2. An Example Ruleset

To keep the rules as independent as possible, we then removed the overlapping rules while keeping the highest confidence rule. If two overlapping rules have the same confidence, the rule with the lower support is dropped. For instance, rules 2 and 3 both suggest "no egg", and the antecedent of the second rule is a subset of the antecedent of the first rule. However, rule 2 has a lower confidence than rule 3. The ruleset obtained after pruning the overlapping rules is given in Table 3, where m̂ denotes the discounted BBA. The rules are combined using Dempster's rule of combination. For instance, to predict whether egg is present or not, the first two rules are combined. The pignistic probability can be used to pick the most probable event. According to the values in the table and using the pignistic approximation, the predictor may predict that only egg will be added to the cart (no butter, no wheat bread). However, the predictor assigns a higher belief to the prediction "no wheat bread" (BetP(no wheat bread) = 0.71) than to the other predictions (BetP(egg) = BetP(no butter) = 0.52).
7 EXPERIMENTS
7.1 Experimental Data
In our experiments, we relied on two sources of data. To start
with, we followed the common practice of experimenting
with the synthetic data obtained from the IBM generator.¹
This gave us the chance to control critical data parameters
and to explore our programs under diverse circumstances.
The generator employs various user-set parameters to
create an initial set of frequent itemsets whose sizes are
obtained by a random sampling of the Poisson distribution.
Then, transactions (shopping carts) are created in a
manner that guarantees that they contain these itemsets or
their fractions. The lengths of the transactions are estab-
lished by a random number generator that follows the
Poisson distribution [24].
Table 4 summarizes the user-set parameters. In the data
we worked with, we varied the average transaction length
and the average size of the artificially added itemsets. The
specific values of these parameters are indicated in the
names of the generated files. For instance, T10.I4 means that the average transaction contained 10 items and that the artificially implanted itemsets contained 4 items on average.
Apart from using the IBM generator, we experimented
with two benchmark domains from the UCI repository² that are broadly known, with characteristics well understood by the research community. To be more specific, we used the congressional vote domain and the SPECT-Heart domain.
The congressional vote data set has the form of a
table where each row represents one congressman or
congresswoman and each column represents a bill. The
individual fields contain 1 if the person voted in favor of
the bill, 0 if he or she voted against, and ? otherwise.
We numbered the bills sequentially as they appeared in the
table (from left to right) and then converted the data by
creating for each politician a shopping cart containing
the numbers of those (and only those) bills that he or she
voted for. For instance, the shopping cart containing [1. 4. 8[
indicated that the politician voted for bills represented by
the first, fourth, and eighth column in the table. In our
experiments, we ignored the information about party
affiliations (the class label in the original data).
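A small Python sketch (ours) of the conversion just described; the row encoding ('1', '0', '?') is an assumption about how one record of the UCI file might be represented in memory.

    def votes_to_cart(row):
        """Turn one politician's row of votes ('1', '0', '?') into a 'shopping cart'
        holding the 1-based indices of the bills he or she voted for."""
        return [i + 1 for i, vote in enumerate(row) if vote == "1"]

    # Hypothetical row: voted for the first, fourth, and eighth bill.
    print(votes_to_cart(["1", "0", "?", "1", "0", "0", "?", "1"]))   # [1, 4, 8]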
Our research questions were formulated as follows: being told about a subset of a politician's voting history, can we tell (ignoring party affiliation) how he or she has voted on the other bills? How much does the correctness of these predictions depend on the number of known voting decisions? How will alternative algorithms handle this task? To find out, we removed, in the shopping cart version of the data, a certain percentage of items to be predicted. For instance, the original cart containing [1, 3, 12, 25, 26] now contained only [3, 25, 26]. From this incomplete information, we induced the vote predictor and measured its accuracy against the known values of the votes; in this particular cart, a perfect predictor would tell us that the politician voted also for bills 1 and 12 (and no other bills). The evaluation of such predictions called for appropriate performance criteria. These will be discussed in the next section.
Apart from congressional vote, we used another binary
data set, the SPECT-Heart domain, where 267 instances
are characterized by 23 Boolean attributes. Again, we
converted these data into the shopping carts paradigm,
following the same methodology as described above,
seeking to predict items that have been hidden.
In all the experiments mentioned below, we set the minimum support threshold to 0.01, the minimum confidence threshold to 0.5, and β to 1/3, unless otherwise mentioned.
TABLE 4
IBM-Generator Parameters
TABLE 3
Pruned Ruleset and Combined BBA
1. http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html.
2. http://archive.ics.uci.edu/ml/.
7.2 Performance Criteria
Literature on traditional classification often measures a classifier's performance in terms of the percentage of correctly predicted classes or, conversely, as the percentage of incorrect classifications. In our domains, though, this criterion was not so appropriate. For one thing, we needed to evaluate the prediction accuracy in a situation where several class attributes had to be predicted at the same time. For another, the plain error rate failed to characterize the predictor's performance in terms of the two fundamental error types: a
false negative, where the predictor incorrectly labeled a
positive example of the given class as negative, and a false
positive, where the predictor incorrectly labeled a negative
example as positive. For these reasons, we preferred to rely
on the precision (Pr) and recall (Re) criteria borrowed from the
information retrieval literature.
Let us now define these metrics. To start with, suppose, for simplicity, that we are interested only in one specific class. Let us denote by TP the number of true positives (correctly labeled positive instances); by FN the number of false negatives; by FP the number of false positives; and by TN the number of true negatives (correctly labeled negative instances). These quantities are used to define Pr and Re in the following way:

Pr = TP / (TP + FP),    Re = TP / (TP + FN).    (15)

For the needs of combining these two into a single metric, Vilar et al. [25] proposed the so-called F_1-measure:

F_1 = 2 · Pr · Re / (Pr + Re).    (16)
Thus armed, we can proceed to the multiclass case, where each example can belong to two or more classes at the same time, and we need to average the prediction performance over all classes. Godbole and Sarawagi [26] proposed two alternative ways to handle this situation: 1) macro averaging, where the above metrics are computed for each item and then averaged; and 2) micro averaging, where they are computed by summing over all individual decisions. The formulas are summarized in Table 5, where Pr_i and Re_i stand for precision and recall as measured on item i, and TP_i, TN_i, FP_i, and FN_i denote the values of the four basic variables for the item i.
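A small Python sketch (ours) of the two averaging schemes of Table 5; each item's outcome is summarized by its (TP, FP, FN) counts, and TN is not needed for precision and recall.

    def macro_micro(counts):
        """counts: list of (TP, FP, FN) tuples, one per item.
        Returns ((macro Pr, macro Re), (micro Pr, micro Re))."""
        precisions = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, fn in counts]
        recalls = [tp / (tp + fn) if tp + fn else 0.0 for tp, fp, fn in counts]
        macro = (sum(precisions) / len(counts), sum(recalls) / len(counts))
        TP = sum(tp for tp, fp, fn in counts)
        FP = sum(fp for tp, fp, fn in counts)
        FN = sum(fn for tp, fp, fn in counts)
        micro = (TP / (TP + FP), TP / (TP + FN))
        return macro, micro

    # Hypothetical counts for three items:
    print(macro_micro([(8, 2, 1), (3, 3, 2), (10, 0, 5)]))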
7.3 Experiment 1: Simple Synthetic Data
The first question we asked was how the performance of the
technique DS-ARM, proposed in Section 6, compared to
that of Bayesian classifiers and to that of the older
techniques from [14], [12]. For the sake of fairness, we had
to respect that the older techniques (with the exception of
Bayes) had been developed for domains with small
numbers of items and with only one class attribute.
Therefore, for the first round of experiments, we generated
synthetic domains with only 10 distinct items. We used
transaction lengths from 4 to 6 and equally sized artificial
itemsets: T6.I6, T5.I5, T5.I4, and T4.I4. Note that
these domains were simple enough for all the algorithms to
be computationally affordable.
In each of these synthetic domains, we organized the
prediction task in the following manner. For each item in
each shopping cart, we generated a random number
(uniform distribution) from the interval [0, 1]. If the number was greater than 0.7, we would remove the item from the shopping cart. In other words, each item had a 30 percent
chance of being removed. From these incomplete shopping
carts, we induced association rules to be employed when
guessing which items were removed. We evaluated the
TABLE 5
Macro Averaging and Micro Averaging of Precision and Recall
n denotes the number of different classes (items).
Fig. 5. Macro recall, precision, and F_1 of prediction in the synthetic domains.
prediction performance by comparing this prediction with
the known lists of removed items.
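A tiny Python sketch (ours) of this hiding procedure; the function name and the fixed random seed are illustrative only.

    import random

    def hide_items(cart, removal_prob=0.3, rng=random.Random(0)):
        """Split a cart into the part shown to the predictor and the hidden part to be guessed."""
        hidden = [item for item in cart if rng.random() < removal_prob]
        observed = [item for item in cart if item not in hidden]
        return observed, hidden

    print(hide_items([1, 3, 12, 25, 26]))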
The results are summarized in Fig. 5 in terms of macro recall, precision, and F_1 (for the micro versions, the results were similar). The reader can see that, in terms of macro F_1,
our technique clearly outperformed the other techniques in
each domain. A closer look at the bar charts reveals that its
performance edge is particularly well pronounced in
precision, whereas recall is not much better than that of
the other techniques. We regarded these results as encoura-
ging in view of the fact that we used domains that were
intentionally made suitable for the other techniques.
7.4 Experiment 2: UCI Benchmark Domains
In the next experiment, we asked whether DS-ARM would
fare equally well in more realistic domains. It should be
noted that, here, the shopping carts contained on average
many more items than in the synthetic domains from
Experiment 1; this rendered the older two DS theory-based
techniques computationally so inefficient (almost prohibi-
tively so) that we decided to compare DS-ARM only with
the Bayesian classifier. We wanted to ascertain whether the
latter would beat DS-ARM along at least some of the
criteria; also, we wanted to know how the prediction
performance depended on the amount of available in-
formation (or, conversely, on how many of the items in the
shopping carts have been removed/hidden).
For the congressional vote domain, the results are summarized in the graphs in Figs. 6 and 7, and for the SPECT-Heart domain, the results are summarized in Figs. 8 and 9. The percentage of hidden or removed items is shown along the x-axis, and the y-axis shows the performance metric. In both domains, whether along the macro metrics or the micro metrics, we always observed the same behavior. First, DS-ARM outperformed the Bayesian approach along the more general F_1 metric. Second, DS-ARM had an almost perfect recall; for instance, if a politician voted for a certain bill, the system almost always correctly predicted this vote based
Fig. 6. Macro performance in the congressional vote domain.
Fig. 7. Micro performance in the congressional vote domain.
Fig. 8. Macro performance in the SPECT-Heart domain.
on the partial voting record. Third, the Bayesian approach had a slightly better precision; for instance, when the system said a politician had voted for a bill, the Bayesian approach was less frequently mistaken. Fourth, the predic-
tion performance dropped only very slightly with even as
many as 50 percent of the items in the shopping carts
removed. Fifth, the behavior was about the same whether the
macro or micro criteria were used (this was perhaps due to
the relative balance in the representation of 1s and 0s in these
two domains).
7.5 Experiment 3: Computational Costs
on Larger Data
In domains of more realistic size and more realistic
numbers of distinct items, earlier techniques are prohibi-
tively expensive due to the costs associated with the need to
generate all association rules with the given antecedents.
DS-ARM significantly reduces these costs by relying on the
recently proposed technique of IT-trees: The whole data-
base is permanently organized in a data structure that
makes it possible to obtain relevant rules in a time that does
not grow faster than linearly in the number of transactions,
and has been shown in [4] not to grow prohibitively in the
size of the shopping carts. The maintenance of the IT-tree
does not entail any additional costs compared to the case when the shopping carts are stored in a conventional transactional database; the only additional cost is the need to modify the IT-tree after the arrival of a new set of data; however, Kubat et al. [4] show that this can be done very efficiently and, importantly, offline.
Our next experiment was meant to evaluate the computational costs of DS-ARM on a few synthetic data sets, using the same parameter settings as in [24]. We considered N = 10,000 shopping carts and n = 100 distinct items, instructing the data generator that the number of frequent itemsets should be about 2,000. We generated three data sets differing in the average size of the shopping cart and the average length of the frequent itemsets, as summarized in Table 6.
The older algorithms are all very sensitive to the value of the user-set minimum support: as its value decreases, the number of frequent itemsets from which the classification rules are obtained grows very fast; this adds to the computations needed to select the matching rules and to combine them. Fig. 10 illustrates, for three different synthetic domains, the situation in the case of DS-ARM.
Since the same IT-tree was always used, it was fair to
ignore the costs of its building, particularly so because the
tree could be built offline. The vertical axis shows the time
needed to complete all 1,000 incomplete shopping carts in
the case where 20 percent of the shopping cart contents
were hidden. The measured time includes the costs of
finding all relevant rules for the given incomplete cart, and
of combining them when making the final decision about
which items to predict. The time is plotted against the varying values of the minimum support. The reader can see that, as expected, the computation time drops significantly with a growing minimum support threshold, which implies the obvious trade-off between completeness and computational demands.

TABLE 6
Parameter Settings

Fig. 9. Micro performance in the SPECT-Heart domain.

Fig. 10. Execution time versus minimum support: even though the rule-generation algorithm is not sensitive to the minimum support threshold, the rule-combination costs decrease with increasing minimum support due to the reduction in the number of rules.
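The measurement procedure itself is straightforward and can be summarized by the sketch below; here, predict stands for the whole retrieve-and-combine step of DS-ARM (whatever its implementation), the 20 percent hiding rate matches the setting described above, and building the IT-tree is excluded because it is done offline.

import random
import time

def time_completions(carts, predict, hide_fraction=0.2, seed=0):
    """Hide part of each cart and time the prediction of the hidden items."""
    rng = random.Random(seed)
    start = time.perf_counter()
    for cart in carts:
        items = sorted(cart)
        keep = max(1, round(len(items) * (1 - hide_fraction)))
        visible = rng.sample(items, keep)
        predict(visible)  # rule retrieval plus evidence combination
    return time.perf_counter() - start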
Fig. 11 shows how DS-ARM scaled up with a growing average transaction length. For this experiment, the number of distinct items was 100 and the number of shopping carts in the data set was 1,000. When generating the data, we used the following parameter settings: T5.I2, T10.I2, T15.I2, and T20.I2. That the costs grew very fast with the transaction length is explained by the fact that, as the shopping carts became larger, only a small portion of them had an empty intersection with the incomplete cart; the number of generated rules then grew exponentially. Still, the costs were affordable: many real-world applications rarely have more than a few dozen items in each shopping cart.
7.6 Experiment 4: The Impact of the Number
of Distinct Items
Finally, we wanted to know how DS-ARM would react to a growing number of distinct items. This was the task of the fourth round of experiments. Here, we fixed the number of shopping carts at 10,000 and generated synthetic data with the following parameter settings: T5.I2, T10.I4, and T20.I6. For the relatively low (and, hence, expensive) minimum support of 1 percent, and for the number of distinct items varied between 100 and 1,000, we obtained the results shown in Fig. 12. The curves indicate that, as the number of items increases, DS-ARM's computational costs grow exponentially. This tells us that, although this system can handle
more difficult domains than the older approaches, much
more work remains to be done.
8 CONCLUSION
The mechanism reported in this paper focuses on one of the
oldest tasks in association mining: based on incomplete
information about the contents of a shopping cart, can we
predict which other items the shopping cart contains? Our
literature survey indicates that, while some of the recently
published systems can be used to this end, their practical
utility is constrained, for instance, by being limited to
domains with very few distinct items. A Bayesian classifier can be used, too, but we are not aware of any systematic
study of how it might operate under the diverse circum-
stances encountered in association mining.
We refer to our technique by the acronym DS-ARM. The
underlying idea is simple: when presented with an incomplete list s of items in a shopping cart, our program first identifies all high-support, high-confidence rules that have as antecedent a subset of s. Then, it combines the
consequents of all these (sometimes conflicting) rules and
creates a set of items most likely to complete the shopping
cart. Two major problems complicate the task: first, how to
identify the relevant rules in a computationally efficient
manner; second, how to combine (and quantify) the
evidence of conflicting rules. We addressed the former
issue by the recently proposed technique of IT-trees and the
latter by a few simple ideas from the DS theory.
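The evidence-combination step can be illustrated with a generic implementation of Dempster's rule. In the sketch below, every matching rule that predicts a given item is read as a simple support function over the two-element frame {present, absent}, with mass equal to the rule's confidence; this is one plausible rendering of the combination step, not a verbatim transcription of DS-ARM.

from collections import defaultdict

FRAME = frozenset({"present", "absent"})

def dempster(m1, m2):
    """Dempster's rule of combination for two mass functions represented as
    dicts that map frozenset focal elements to masses."""
    out, conflict = defaultdict(float), 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                out[inter] += wa * wb
            else:
                conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {s: w / (1.0 - conflict) for s, w in out.items()}

def belief_in_presence(confidences):
    """Fuse the rules that predict one particular item: each confidence c
    becomes a simple support function m({present}) = c, m(FRAME) = 1 - c."""
    combined = {FRAME: 1.0}
    for c in confidences:
        combined = dempster(combined, {frozenset({"present"}): c, FRAME: 1.0 - c})
    return combined.get(frozenset({"present"}), 0.0)

Items whose combined belief in their presence exceeds a user-set threshold would then be proposed as the likely missing items.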
Our experimental results are promising: DS-ARM
compares favorably with the Bayesian approach and out-
performs more traditional approaches even in domains deliberately tailored to them. In
particular, DS-ARM performs well in applications where
the older approaches incur intractable computational costs
(e.g., if there are many distinct items).
Besides the encouraging results, our experiments have
also identified ample room for further improvements. As
indicated in Experiments 3 and 4, the computational costs of
DS-ARM still grow very fast with the average length of the
transactions and with the number of distinct items; in real-world applications, this can become a serious issue. Also, our implementation of the Bayesian classifier may well be suboptimal. Finally, completely different approaches (beyond Bayesian classification and DS theory) should be explored, a research strand that we strongly advocate. Indeed, one of our major motives in writing this paper was to attract the scientific community's attention to all these issues.
ACKNOWLEDGMENTS
This material is based upon work supported by the US
National Science Foundation (NSF) under Grant IIS-0513702.
REFERENCES
[1] S. Noel, V.V. Raghavan, and C.H. Chu, Visualizing Association
Mining Results through Hierarchical Clusters, Proc. Intl Conf.
Data Mining (ICDM 01), pp. 425-432, Nov./Dec. 2001.
[2] P. Bollmann-Sdorra, A. Hafez, and V.V. Raghavan, A Theoretical
Framework for Association Mining Based on the Boolean
Retrieval Model, Data Warehousing and Knowledge Discovery: Proc.
Third Intl Conf. (DaWaK 01), pp. 21-30, Sept. 2001.
[3] R. Agrawal, T. Imielinski, and A. Swami, Mining Association
Rules between Sets of Items in Large Databases, Proc. ACM
Special Interest Group on Management of Data (ACM SIGMOD),
pp. 207-216, 1993.
[4] M. Kubat, A. Hafez, V.V. Raghavan, J.R. Lekkala, and W.K. Chen,
Itemset Trees for Targeted Association Querying, IEEE Trans.
Knowledge and Data Eng., vol. 15, no. 6, pp. 1522-1534, Nov./Dec.
2003.
WICKRAMARATNA ET AL.: PREDICTING MISSING ITEMS IN SHOPPING CARTS 997
Fig. 11. Average time per prediction versus average transaction length.
Fig. 12. Average time per prediction versus the number of items in three
synthetic domains.
[5] V. Ganti, J. Gehrke, and R. Ramakrishnan, Demon: Mining and
Monitoring Evolving Data, Proc. Intl Conf. Data Eng., 1999.
[6] J. Gehrke, V. Ganti, and R. Ramakrishnan, Detecting Change in
Categorical Data: Mining Contrast Sets, Proc. ACM SIGMOD-
SIGACT-SIGART Symp. Principles of Database Systems, pp. 126-137,
2000.
[7] A. Rozsypal and M. Kubat, Association Mining in Time-Varying
Domains, Intelligent Data Analysis, vol. 9, pp. 273-288, 2005.
[8] V. Raghavan and A. Hafez, Dynamic Data Mining, Proc. 13th
Intl Conf. Industrial and Eng. Applications of Artificial Intelligence and
Expert Systems IEA/AIE, pp. 220-229, June 2000.
[9] C.C. Aggarwal, C. Procopius, and P.S. Yu, Finding Localized
Associations in Market Basket Data, IEEE Trans. Knowledge and
Data Eng., vol. 14, no. 1, pp. 51-62, Jan./Feb. 2002.
[10] R. Bayardo and R. Agrawal, Mining the Most Interesting Rules,
Proc. ACM SIGKDD Intl Conf. Knowledge Discovery and Data
Mining, pp. 145-154, 1999.
[11] J. Zhang, S.P. Subasingha, K. Premaratne, M.-L. Shyu, M. Kubat,
and K.K.R.G.K. Hewawasam, A Novel Belief Theoretic Associa-
tion Rule Mining Based Classifier for Handling Class Label
Ambiguities, Proc. Workshop Foundations of Data Mining (FDM
04), Intl Conf. Data Mining (ICDM 04), Nov. 2004.
[12] B. Liu, W. Hsu, and Y.M. Ma, Integrating Classification and
Association Rule Mining, Proc. ACM SIGKDD Intl Conf. Knowledge
Discovery and Data Mining (KDD 98), pp. 80-86, Aug. 1998.
[13] W. Li, J. Han, and J. Pei, CMAR: Accurate and Efficient
Classification Based on Multiple Class-Association Rules, Proc.
IEEE Intl Conf. Data Mining (ICDM 01), pp. 369-376, Nov./Dec.
2001.
[14] K.K.R.G.K. Hewawasam, K. Premaratne, and M.-L. Shyu, Rule
Mining and Classification in a Situation Assessment Application:
A Belief Theoretic Approach for Handling Data Imperfections,
IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 37, no. 6, pp. 1446-
1459, Dec. 2007.
[15] H.H. Aly, A.A. Amr, and Y. Taha, Fast Mining of Association
Rules in Large-Scale Problems, Proc. IEEE Symp. Computers and
Comm. (ISCC 01), pp. 107-113, 2001.
[16] J. Neyman and E. Pearson, On the Use and Interpretation of
Certain Test Criteria for Purposes of Statistical Inference,
Biometrika, vol. 20A, pp. 175-240, 1928.
[17] R. Fisher, The Use of Multiple Measurement in Taxonomic
Problems, Ann. Eugenics, vol. 7, pp. 111-132, 1936.
[18] I. Good, The Estimation of Probabilities: An Essay on Modern Bayesian
Methods. MIT Press, 1965.
[19] B. Cestnik and I. Bratko, On Estimating Probabilities in Tree
Pruning, Proc. European Working Session Machine Learning,
pp. 138-150, 1991.
[20] Y. Li and M. Kubat, Searching for High-Support Itemsets in
Itemset Trees, Intelligent Data Analysis, vol. 10, no. 2, pp. 105-120,
2006.
[21] G. Shafer, A Mathematical Theory of Evidence. Princeton Univ. Press, 1976.
[22] P. Smets, Practical Uses of Belief Functions, Proc. Conf.
Uncertainty in Artificial Intelligence (UAI 99), pp. 612-621, 1999.
[23] C.J. van Rijsbergen, Information Retrieval. Butterworths, 1979.
[24] R. Agrawal and R. Srikant, Fast Algorithms for Mining
Association Rules, Proc. Intl Conf. Very Large Data Bases (VLDB
94), pp. 487-499, 1994.
[25] D. Vilar, M.J. Castro, and E. Sanchis, Multilabel Text Classifica-
tion Using Multinomial Models, Proc. Espana for Natural Language
Processing (EsTAL 04), pp. 220-230, Oct. 2004.
[26] S. Godbole and S. Sarawagi, Discriminative Methods for Multi-
Labeled Classification, Proc. Pacific-Asia Conf. (PAKDD 04),
pp. 22-30, 2004.
Kasun Wickramaratna (S08) received the BSc
degree (with first class honors) in electronics
and telecommunication engineering in 2002
from the University of Moratuwa, Sri Lanka,
and the MS degree in computer science in 2005
from Florida International University, Miami,
Florida. He is presently working toward the
PhD degree at the University of Miami. From 2002
to 2003, he served as an assistant lecturer at Sri
Lanka Institute of Information Technology, Co-
lombo, Sri Lanka. From 2003 to 2004, he was an engineer/IT at Sri Lanka
Telecom Ltd., Colombo, Sri Lanka. His research interests include data
mining and machine learning. He is a student member of the IEEE.
Miroslav Kubat (SM00) is an associate
professor at the University of Miami, Florida.
In the fields of machine learning, neural
networks, and data mining, he has published
more than 90 refereed papers, having helped
pioneer research in the fields of induction of
time-varying concepts and induction from
imbalanced training sets. He also contributed
to the problem of how to initialize neural
networks and their architectures. He coedited
(with R. Michalski and I. Bratko) the book Machine Learning and
Data Mining: Methods and Applications (Wiley, 1998) and (with
J. Fuernkranz) the book Machines that Learn to Play Games (NOVA
Science Publishers, 2001). He has served on more than 40 program
committees of conferences and workshops, and is on editorial boards
of three journals: Intelligent Data Analysis, Informatica: An Interna-
tional Journal on Computing and Informatics, and Applied Artificial
Intelligence. He is a senior member of the IEEE.
Kamal Premaratne (SM94) received the BSc
degree (with first class honors) in electronics and
telecommunication engineering in 1982 from the
University of Moratuwa, Sri Lanka. He received
the MS and PhD degrees, both in electrical and
computer engineering, in 1984 and 1988, re-
spectively, from the University of Miami, Coral
Gables, Florida, where he is presently a profes-
sor. He has received the 1992/1993 Mather
Premium and the 1999/2000 Heaviside Premium
of the Institution of Electrical Engineers (IEE), London, United Kingdom,
and the 1991, 1994, and 2001 Eliahu I. Jury Excellence in Research
Award of the College of Engineering, University of Miami. He has served
as an associate editor of the IEEE Transactions on Signal Processing
(1994-1996) and the Journal of the Franklin Institute (1993-2005). He is a
fellow of the IET (formerly the IEE) and a senior member of the IEEE. His current
research interests include evidence fusion and resource management in
distributed decision and sensor networks, knowledge discovery from
imperfect data, and network congestion control.
> For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.