
Relational Data Mining

Donato Malerba

Dipartimento di Informatica
Università degli studi di Bari
malerba@di.uniba.it
http://www.di.uniba.it/~malerba/
Overview
• Single-table assumption
• (Multi-)relational data mining and ILP
• FO representations
• Upgrading propositional DM systems to FOL
• A case study: Mining Association rules
• Conclusions

MRDM – Prof. D. Malerba 2


Standard Data Mining
Approach
• Most existing data mining approaches look for patterns in
a single table of data (or DB relation)
ID    Name   First Name  Street      City     Sex  Social Status  Income  Age  Response
3478  Smith  John        38 Lake St  Seattle  M    single         160k    32   Y
3479  Doe    Jane        45 Sea St   Venice   F    married        180k    45   N
…     …      …           …           …        …    …              …       …    …

• Each row represents an object and columns represent
  properties of objects.

Single-table assumption
Standard Data Mining
Approach
• In the customer table we can add as many attributes about our customers
as we like.
 – A person’s number of children
• For other kinds of information the single-table assumption turns out to be
a significant limitation
 – Add information about orders placed by a customer, in particular:
   – delivery and payment modes
   – with which kind of store the order was placed (size, ownership, location)
   – for simplicity, no information on the goods ordered

ID    Name   First Name  …  Response  Delivery mode  Payment mode  Store size  Store type  Location
3478  Smith  John        …  Y         regular        cash          small       franchise   city
3479  Doe    Jane        …  N         express        credit        large       indep       rural
…     …      …           …  …         …              …             …           …           …
Standard Data Mining
Approach
• This solution works fine for once-only customers
• What if our business has repeat customers?
• Under the single-table assumption we can
1. Make one entry for each order in our customer table
ID    Name   First Name  …  Response  Delivery mode  Payment mode  Store size  Store type  Location
3478  Smith  John        …  Y         regular        cash          small       franchise   city
3478  Smith  John        …  Y         express        check         small       franchise   city
…     …      …           …  …         …              …             …           …           …

• We have the usual problems of non-normalized tables:
  redundancy, anomalies, …
Standard Data Mining
Approach
• One line per order ⇒ the analysis results will really be about
  orders, not customers, which may not be what we want!
2. Aggregate order data into a single tuple per customer.

ID    Name   First Name  …  Response  No. of orders  No. of stores
3478  Smith  John        …  Y         3              2
3479  Doe    Jane        …  N         2              2
…     …      …           …  …         …              …

• No redundancy. Standard DM methods work fine, but
  – there is a lot less information in the new table
  – what if the payment mode and the store type are important?
Relational Data
• A database designer would represent the information in
our problem as a set of tables (or relations)
ID    Name   First Name  Street      City     Sex  Social Status  Income  Age  Response
3478  Smith  John        38 Lake St  Seattle  M    single         160k    32   Y
3479  Doe    Jane        45 Sea St   Venice   F    married        180k    45   N
…     …      …           …           …        …    …              …       …    …

ORDER
Cust ID  Order ID  Store ID  Delivery mode  Payment mode
3478     213444    12        regular        cash
3478     372347    19        regular        cash
3478     334555    12        express        check
…        …         …         …              …

STORE
Store ID  Store size  Store type  Location
12        small       franchise   city
19        large       indep       rural
…         …           …           …
Relational Data Mining
• (Multi-)Relational data mining algorithms can analyze data
distributed in multiple relations, as they are available in
relational database systems.
• These algorithms come from the field of inductive logic
programming (ILP)
• ILP has been concerned with finding patterns expressed as
logic programs
• Initially, ILP focussed on automated program synthesis from
examples
• In recent years, the scope of ILP has broadened to cover
the whole spectrum of data mining tasks (association rules,
regression, clustering, …)



ILP successes in scientific
fields

• In the field of chemistry/biology:
  – toxicology
  – prediction of Diterpene classes from nuclear magnetic
    resonance (NMR) spectra
• Analysis of traffic accident data
• Analysis of survey data in medicine
• Prediction of ecological biodegradation rates
The first commercial data mining systems with ILP
technology are becoming available.



Relational patterns

• Relational patterns involve multiple relations from a relational
  database.
• They are typically stated in a more expressive language than
  patterns defined on a single data table:
  – relational classification rules
  – relational regression trees
  – relational association rules
IF   Customer(C1,N1,FN1,Str1,City1,Zip1,Sex1,SoSt1,In1,Age1,Resp1)
AND  order(C1,O1,S1,Deliv1,Pay1)
AND  Pay1 = credit_card
AND  In1 ≥ 108000
THEN Resp1 = Yes
Relational patterns
IF   Customer(C1,N1,FN1,Str1,City1,Zip1,Sex1,SoSt1,In1,Age1,Resp1)
AND  order(C1,O1,S1,Deliv1,Pay1)
AND  Pay1 = credit_card
AND  In1 ≥ 108000
THEN Resp1 = Yes

good_customer(C1) ←
  customer(C1,N1,FN1,Str1,City1,Zip1,Sex1,SoSt1,In1,Age1,Resp1)
  ∧ order(C1,O1,S1,Deliv1,credit_card)
  ∧ In1 ≥ 108000
This relational pattern is expressed in a subset of first-order logic!
A relation in a relational database corresponds to a predicate in
predicate logic (see deductive databases)



Relational decision tree

Equivalent Prolog program:


class(sendback) :- worn(X), not_replaceable(X), !.
class(fix) :- worn(X), !.
class(keep).
Relational regression rule
Background knowledge

Induced model



Relational association rule
Relational database
LIKES
KID     OBJECT
Joni    ice-cream
Joni    dolphin
Elliot  piglet
Elliot  gnu
Elliot  lion

HAS
KID     OBJECT
Joni    ice-cream
Joni    piglet
Elliot  ice-cream
Elliot  lion
Elliot  piglet

PREFERS
KID     OBJECT     TO
Joni    ice-cream  pudding
Joni    pudding    raisins
Joni    giraffe    gnu
Elliot  ice-cream  dolphin

likes(KID, piglet), likes(KID, ice-cream)
  → likes(KID, dolphin)                    (9%, 85%)
likes(KID, A), has(KID, B) → prefers(KID, A, B)    (70%, 98%)


First-order representations
• An example is a set of ground facts, that is a set of tuples
in a relational database
• From the logical point of view this is called a (Herbrand)
  interpretation, because the facts represent all the atoms that are
  true for the example; all facts not in the example are assumed to
  be false.
• From the computational point of view each example is a
small relational database or a Prolog knowledge base
• A Prolog interpreter can be used for querying an example.



FO representation (ground
clauses)
• Example:
eastbound(t1) :-
car(t1,c1),rectangle(c1),short(c1),none(c1),two_wheels(c1),
load(c1,l1),circle(l1),one_load(l1),
car(t1,c2),rectangle(c2),long(c2),none(c2),three_wheels(c2),
load(c2,l2),hexagon(l2),one_load(l2),
car(t1,c3),rectangle(c3),short(c3),peaked(c3),two_wheels(c3),
load(c3,l3),triangle(l3),one_load(l3),
car(t1,c4),rectangle(c4),long(c4),none(c4),two_wheels(c4),
load(c4,l4),rectangle(l4),three_loads(l4).
• Background theory:
polygon(X) :- rectangle(X).
polygon(X) :- triangle(X).
• Hypothesis:
eastbound(T) :- car(T,C), short(C), not none(C).



Background knowledge

• As background knowledge is visible to each example, all the facts
  that can be derived from the background knowledge and an example
  are part of the extended example.
• Formally, an extended example is the minimal
Herbrand model of the example and the
background theory.
• When querying an example, it suffices to assert the
background knowledge and the example; the Prolog
interpreter will do the necessary derivations.
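The bottom-up derivation of an extended example can be sketched in code. This is a minimal sketch of my own (not any ILP system's actual implementation): facts are tuples, background rules are (head, body) pairs, variables are capitalised strings, and naive forward chaining closes the example under a function-free background theory.

```python
# Sketch (my encoding): compute the extended example, i.e. the minimal
# Herbrand model of the example facts plus a datalog-style background
# theory, by naive forward chaining.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def bindings(body, facts, theta=None):
    """Yield substitutions that make every atom of `body` a fact."""
    theta = dict(theta or {})
    if not body:
        yield theta
        return
    pred, *args = body[0]
    for f in facts:
        if f[0] != pred or len(f) - 1 != len(args):
            continue
        trial, ok = dict(theta), True
        for a, v in zip(args, f[1:]):
            if is_var(a):
                if trial.setdefault(a, v) != v:
                    ok = False
                    break
            elif a != v:
                ok = False
                break
        if ok:
            yield from bindings(body[1:], facts, trial)

def extend(example, rules):
    """Close the example under the background rules (naive iteration)."""
    model = set(example)
    while True:
        new = set()
        for head, body in rules:
            for theta in bindings(body, model):
                fact = (head[0],) + tuple(theta.get(a, a) for a in head[1:])
                if fact not in model:
                    new.add(fact)
        if not new:
            return model
        model |= new

# The trains background theory from the previous slide:
# polygon(X) :- rectangle(X).  polygon(X) :- triangle(X).
background = [(("polygon", "X"), [("rectangle", "X")]),
              (("polygon", "X"), [("triangle", "X")])]
```

Extending {rectangle(c1), triangle(l3)} under this theory adds polygon(c1) and polygon(l3), which is exactly what the Prolog interpreter would derive on demand.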



Learning from interpretations
• The ground-clause representation is peculiar to an ILP
setting denoted as learning from interpretations.
• Similar to older work on structural matching.
• It is common to several relational data mining systems,
such as
 – CLAUDIEN: searches for a set of clausal regularities that hold on
   the set of examples
 – TILDE: top-down induction of logical decision trees
 – ICL: Inductive Constraint Logic (an upgrade of CN2)
• It contrasts with the classical ILP setting employed by
the systems PROGOL and FOIL.



FO representation (flattened)
• Example:
eastbound(t1).
• Background theory:
car(t1,c1).      car(t1,c2).        car(t1,c3).      car(t1,c4).
rectangle(c1).   rectangle(c2).     rectangle(c3).   rectangle(c4).
short(c1).       long(c2).          short(c3).       long(c4).
none(c1).        none(c2).          peaked(c3).      none(c4).
two_wheels(c1).  three_wheels(c2).  two_wheels(c3).  two_wheels(c4).
load(c1,l1).     load(c2,l2).       load(c3,l3).     load(c4,l4).
circle(l1).      hexagon(l2).       triangle(l3).    rectangle(l4).
one_load(l1).    one_load(l2).      one_load(l3).    three_loads(l4).
• Hypothesis:
eastbound(T) :- car(T,C), short(C), not none(C).



FO representation (terms)

• Example:
eastbound([c(rectangle,short,none,2,l(circle,1)),
           c(rectangle,long,none,3,l(hexagon,1)),
           c(rectangle,short,peaked,2,l(triangle,1)),
           c(rectangle,long,none,2,l(rectangle,3))]).
• Background theory: empty
• Hypothesis:
eastbound(T) :- member(C,T), arg(2,C,short),
                not arg(3,C,none).
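The same term representation can be mirrored as plain Python data (an illustrative encoding of mine, following the slide's c(Shape,Length,Roof,Wheels,Load) field order), with the hypothesis as an ordinary function:

```python
# Train t1 as a list of car terms: (shape, length, roof, wheels, (object, n)).
t1 = [("rectangle", "short", "none",   2, ("circle",    1)),
      ("rectangle", "long",  "none",   3, ("hexagon",   1)),
      ("rectangle", "short", "peaked", 2, ("triangle",  1)),
      ("rectangle", "long",  "none",   2, ("rectangle", 3))]

def eastbound(train):
    # the hypothesis: some car is short and has a roof (roof != "none")
    return any(car[1] == "short" and car[2] != "none" for car in train)
```

Here `member(C,T)` becomes iteration over the list, and `arg(2,C,short)` becomes positional indexing, which is the whole point of the term representation: the individual carries its parts inside one structured value.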



FO representation (strongly
typed)
• Type signature:
data Shape  = Rectangle | Hexagon | …;   data Length = Long | Short;
data Roof   = None | Peaked | …;         data Object = Circle | Hexagon | …;
type Wheels = Int;  type Load = (Object,Number);  type Number = Int;
type Car    = (Shape,Length,Roof,Wheels,Load);  type Train = [Car];

eastbound :: Train -> Bool;
• Example:
eastbound([(Rectangle,Short,None,2,(Circle,1)),
           (Rectangle,Long,None,3,(Hexagon,1)),
           (Rectangle,Short,Peaked,2,(Triangle,1)),
           (Rectangle,Long,None,2,(Rectangle,3))]) = True
• Hypothesis:
eastbound(t) = (exists \c -> member(c,t) &&
                proj2(c)==Short && proj3(c)!=None)
• Example language: Escher™ functional logic programming



FO representation (database)
TRAIN_TABLE
TRAIN  EASTBOUND
t1     TRUE
t2     TRUE
…      …
t6     FALSE

CAR_TABLE
CAR  TRAIN  SHAPE      LENGTH  ROOF    WHEELS
c1   t1     rectangle  short   none    2
c2   t1     rectangle  long    none    3
c3   t1     rectangle  short   peaked  2
c4   t1     rectangle  long    none    2
…    …      …          …       …       …

LOAD_TABLE
LOAD  CAR  OBJECT     NUMBER
l1    c1   circle     1
l2    c2   hexagon    1
l3    c3   triangle   1
l4    c4   rectangle  3
…     …    …          …

SELECT DISTINCT TRAIN_TABLE.TRAIN FROM TRAIN_TABLE, CAR_TABLE
WHERE TRAIN_TABLE.TRAIN = CAR_TABLE.TRAIN AND
      CAR_TABLE.LENGTH = 'short' AND CAR_TABLE.ROOF != 'none'


Individual-centered
representation
• The database contains information on a number of
trains.
• Each train is an individual.
• The database can be partitioned according to individuals
  to obtain a ground-clause representation.
• Problem: sometimes individuals share common parts.
• Example: we want to discriminate black and white figures
  on the basis of their position.
  Each geometric figure is an individual.



Object-centered
representation

The whole sequence is an object, which can be represented by
a multiple-head ground clause:

black(x11) ∧ black(x12) ∧ white(x13) ∧ black(x14) :-
    first(x11), crl(x11), next(x12,x11), crl(x12),
    sqr(x13), crl(x14), next(x14,x13), next(x13,x12)

This is the representation adopted in ATRE.



How to upgrade propositional
DM algorithms to first-order
1. Identify the propositional DM system that best matches the DM task
2. Use interpretations to represent examples
3. Upgrade the representation of propositional hypotheses: replace
   attribute-value tests by first-order literals and modify the
   coverage test accordingly.
4. Structure the search-space by a more-general-than relation that
works on first-order representations
 – θ-subsumption
5. Adapt the search operators for searching the corresponding rule
space
6. Use a declarative bias mechanism to limit the search space
7. Implement
8. Evaluate your (first-order) implementation on propositional and
relational data
9. Add interesting extra features
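The generality order of step 4 can be sketched in code. This is my own minimal encoding, not a production implementation: literals are tuples, variables are capitalised strings, and the subsumed clause is assumed ground (skolemised). Clause c1 θ-subsumes clause c2 iff some substitution θ maps every literal of c1θ onto a literal of c2.

```python
# Sketch of the theta-subsumption test: does some substitution map
# every literal of c1 onto a literal of (ground) c2?

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def subsumes(c1, c2, theta=None):
    theta = dict(theta or {})
    if not c1:
        return True                     # every literal of c1 is matched
    lit, rest = c1[0], c1[1:]
    for target in c2:
        if target[0] != lit[0] or len(target) != len(lit):
            continue                    # predicate or arity mismatch
        trial, ok = dict(theta), True
        for a, v in zip(lit[1:], target[1:]):
            if is_var(a):
                if trial.setdefault(a, v) != v:
                    ok = False          # variable already bound elsewhere
                    break
            elif a != v:
                ok = False
                break
        if ok and subsumes(rest, c2, trial):
            return True
    return False
```

Since a ground example is itself a set of literals, the same routine doubles as the modified coverage test of step 3 under learning from interpretations.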
Mining association rules: a
case study
A set I of literals, called items.
A set D of transactions t, with each t ⊆ I.

X → Y (s%, c%)    association rule
"IF a pattern X appears in a transaction t, THEN the pattern
Y tends to hold in the same transaction t"
• X ⊆ I, Y ⊆ I, X ∩ Y = ∅
• s% = p(X ∪ Y)    support
• c% = p(Y|X) = p(X ∪ Y) / p(X)    confidence
Agrawal, Imielinski & Swami.
Mining association rules between sets of items in large databases.
Proc. SIGMOD 1993.
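As a quick numeric check of the two measures, here is a sketch with toy baskets of my own (not from the paper):

```python
# support s% = p(X ∪ Y);  confidence c% = p(X ∪ Y) / p(X).

def support(itemset, transactions):
    # fraction of transactions that contain the whole itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    return support(x | y, transactions) / support(x, transactions)

baskets = [{"bread", "butter", "cheese"},
           {"bread", "cheese", "beer"},
           {"bread", "butter"},
           {"beer"}]
```

For the rule {bread, butter} → {cheese} on these baskets, the support is 1/4 (one basket out of four contains all three items) and the confidence is 0.5 (one of the two bread-and-butter baskets also has cheese).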



What is an association rule?
Example: market basket analysis.
Each transaction is the list of items bought by a customer on a
single visit to a store. It is represented as a row in a table
    Bread  Butter  Cheese  Beer
1   yes    yes     yes     no
2   yes    no      yes     yes
3   …      …       …       …
IF a customer buys bread and butter THEN he also buys cheese (20%, 66%),
i.e. 20% of all customers buy bread, butter and cheese, and
66% of the customers who buy bread and butter also buy cheese.
Mining association rules
The propositional approach

Problem statement
Given:
• a set of transactions D
• a pair of thresholds, minsup and minconf
Find
all association rules that have support and
confidence greater than minsup and minconf
respectively.



Mining association rules
The propositional approach

Problem decomposition
• Find large (or frequent) itemsets
• Generate highly-confident association rules

Representation issues
• The transaction set D may be a data file, a relational
table or the result of a relational expression
• Each transaction is a binary vector



Mining association rules
The propositional approach

Solution to the first sub-problem


The APRIORI algorithm (Agrawal & Srikant, 1999)
Find large 1-itemsets
Cycle on the size k (k > 1) of the itemsets:
  – APRIORI-gen: generate candidate k-itemsets from
    large (k-1)-itemsets
  – generate large k-itemsets from candidate k-itemsets
    (cycle on the transactions in D)
until no more large itemsets are found.
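The level-wise loop above can be sketched compactly. This is a simplification of mine, not the paper's optimised implementation: candidates are joins of frequent (k-1)-itemsets, kept only if all their (k-1)-subsets are frequent (APRIORI-gen's join and prune steps), then counted against the transactions.

```python
# Compact sketch of APRIORI's level-wise search for large itemsets.
from itertools import combinations

def apriori(transactions, minsup):
    n = len(transactions)
    items = {i for t in transactions for i in t}

    def freq(cands):
        # keep candidates whose support clears minsup
        return {c for c in cands
                if sum(c <= t for t in transactions) / n >= minsup}

    large = freq(frozenset([i]) for i in items)   # large 1-itemsets
    all_large, k = set(large), 2
    while large:
        # join step: unions of frequent (k-1)-itemsets of size k
        cands = {a | b for a in large for b in large if len(a | b) == k}
        # prune step: all (k-1)-subsets must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in large
                        for s in combinations(c, k - 1))}
        large = freq(cands)
        all_large |= large
        k += 1
    return all_large
```

On the transactions [{a,b,c}, {a,b}, {a,c}, {b,c}] with minsup 0.5, every singleton and every pair is large, but {a,b,c} (support 1/4) is not.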
Mining association rules
The propositional approach

Solution to the second sub-problem


• For every large itemset Z, find all non-empty proper subsets X of Z
• For every such X, output a rule of the form X → (Z - X) if
  support(Z)/support(X) ≥ minconf.
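The second sub-problem is then a single pass over the frequent itemsets. A sketch of mine: `supports` maps each frequent itemset (and, by downward closure, every subset of it) to its support, e.g. as computed by APRIORI.

```python
# Emit X -> (Z - X) whenever support(Z) / support(X) clears minconf.
from itertools import combinations

def rules(supports, minconf):
    out = []
    for z, s_z in supports.items():
        for k in range(1, len(z)):               # proper, non-empty subsets
            for x in map(frozenset, combinations(z, k)):
                conf = s_z / supports[x]
                if conf >= minconf:
                    out.append((x, z - x, conf))
    return out
```

With supports {bread}: 0.6, {butter}: 0.5, {bread, butter}: 0.4 and minconf 0.75, only butter → bread survives (confidence 0.8); bread → butter has confidence 0.4/0.6 ≈ 0.67 and is discarded.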

Relevant work
Agrawal & Srikant (1999). Fast Algorithms for Mining Association Rules,
in Readings in Database Systems, Morgan Kaufmann Publishers.
Han & Fu (1995). Discovery of Multiple-Level Association Rules from
Large Databases, in Proc. 21st VLDB Conference



Mining association rules
The ILP approach

Problem statement
Given:
• a deductive relational database D
• a pair of thresholds, minsup and minconf
Find
all association rules that have support and
confidence greater than minsup and minconf
respectively.



Mining association rules
The ILP approach

Problem decomposition
• Find large (or frequent) atomsets
• Generate highly-confident association rules

Representation issues
A deductive relational database is a relational database
which may be represented in first-order logic as follows:
• Relation → set of ground facts (EDB)
• View → set of rules (IDB)



Mining association rules
The ILP approach
Example Relational database
LIKES
KID     OBJECT
Joni    ice-cream
Joni    dolphin
Elliot  piglet
Elliot  gnu
Elliot  lion

HAS
KID     OBJECT
Joni    ice-cream
Joni    piglet
Elliot  ice-cream
Elliot  lion
Elliot  piglet

PREFERS
KID     OBJECT     TO
Joni    ice-cream  pudding
Joni    pudding    raisins
Joni    giraffe    gnu
Elliot  ice-cream  dolphin

likes(joni, ice-cream)                       atom

likes(KID, piglet), likes(KID, ice-cream)    atomset

likes(KID, piglet), likes(KID, ice-cream)
  → likes(KID, dolphin)                      (9%, 85%)
likes(KID, A), has(KID, B) → prefers(KID, A, B)    (70%, 98%)
Mining association rules
The ILP approach

Solution to the first sub-problem


The WARMR algorithm (Dehaspe & De Raedt, 1997)
L. Dehaspe & L. De Raedt (1997). Mining Association Rules in Multiple
Relations, Proc. Conf. Inductive Logic Programming
Compute large 1-atomsets
Cycle on the size k (k > 1) of the atomsets:
  – WARMR-gen: generate candidate k-atomsets from large
    (k-1)-atomsets
  – generate large k-atomsets from candidate k-atomsets
    (cycle on the observations loaded from D)
until no more large atomsets are found.
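The loop mirrors APRIORI's, with the coverage test abstracted out. A heavily simplified sketch of mine (the real WARMR refines queries under a declarative language bias and prunes via θ-subsumption; here refinement is "append one literal" and pruning is a plain superset check, and atomsets are not canonicalised, so permutations may repeat):

```python
# Level-wise WARMR-style loop: frequency of an atomset is the number
# of interpretations (examples) on which the coverage test succeeds.

def warmr(seed, literals, interpretations, covers, minsup):
    frequent, level, infrequent = [], [seed], []
    while level:
        candidates = []
        for atomset in level:
            for lit in literals:
                if lit in atomset:
                    continue
                cand = atomset + (lit,)
                # prune candidates that specialise a known infrequent
                # pattern (superset check stands in for theta-subsumption)
                if any(set(inf) <= set(cand) for inf in infrequent):
                    continue
                candidates.append(cand)
        level = []
        for cand in candidates:
            n = sum(covers(cand, ex) for ex in interpretations)
            if n / len(interpretations) >= minsup:
                frequent.append(cand)
                level.append(cand)
            else:
                infrequent.append(cand)
    return frequent
```

Passing a propositional `covers` (set inclusion) makes it behave like APRIORI over itemsets; a first-order `covers` would instead run each candidate as a query against each interpretation, which is exactly the coverage test of the next slide.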



Mining association rules
The ILP approach

WARMR:
• breadth-first search on the atomset lattice
• loading of an observation o from D (query result)
• largeness of candidate atomsets computed by a coverage test

APRIORI:
• breadth-first search on the itemset lattice
• loading of a transaction t from D (tuple)
• largeness of candidate itemsets computed by a subset check


Mining association rules
The ILP approach
Pattern space: a lattice ordered by generality, with "false" (the most
specific query) at one end and "true" (the most general) at the other:

Q1 ← is_a(X, large_town) ∧ intersects(X, R) ∧ is_a(R, road)
Q2 ← is_a(X, large_town) ∧ intersects(X, Y)
Q3 ← is_a(X, large_town)



Mining association rules
The ILP approach
Candidate generation

Refinement step (operator under θ-subsumption): add a literal to a
frequent pattern, e.g.

  is_a(X, large_town), intersects(X,R), is_a(R, road)
    ⇒ is_a(X, large_town), intersects(X,R), is_a(R, road),
      adjacent_to(X,W), is_a(W, water)

Pruning step: does the new candidate θ-subsume an infrequent pattern?
If yes, prune it; if no, keep it.
Mining association rules
The ILP approach
Candidate evaluation

is_a(X, large_town), intersects(X,R), is_a(R, road), adjacent_to(X,W), is_a(W, water)

The candidate is evaluated as a query against D:

  ?- is_a(X, large_town), intersects(X,R), is_a(R, road),
     adjacent_to(X,W), is_a(W, water).

  <X=barletta, R=a14, W=adriatico>
  <X=bari, R=ss16bis, W=adriatico>
  ...

Large? yes/no, depending on how many observations satisfy the query.
Mining association rules
The ILP approach

is_a(X, large_town), intersects(X,R), is_a(R, road), adjacent_to(X,W), is_a(W, water)

Rule generation

is_a(X, large_town), intersects(X,R), is_a(R, road), is_a(W, water)
  → adjacent_to(X,W)    (62%, 86%)

High confidence? If yes, output the rule; if no, discard it.


Conclusions and future work
• Multi-relational data mining: more data mining than
  logic program synthesis
  – choice of representation formalisms
  – input format more important than output format
  – data modelling (e.g. object-oriented data mining)
  – new learning tasks and evaluation measures
Reference
Saso Dzeroski and Nada Lavrac, editors,
Relational Data Mining,
Springer-Verlag, September 2001

