TAM002

Data mining with GUHA Part 1


Does my data contain something interesting?
Esko Turunen
Tampere University of Technology
Esko Turunen http://www.vrtuosi.com
The aim of data mining is to answer the question: does my data contain something interesting?
In this chapter we introduce the following basic concepts:
- knowledge discovery in databases and data mining
- data, typical data mining tasks and the outputs of data mining tasks
- GUHA and B-course: two dissimilar approaches to data mining
We also get acquainted with a real-life data set collected in Indonesia. We use this data set to illustrate issues throughout the course.
Knowledge discovery in databases (KDD) was initially defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [1]. A revised version of this definition states that KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [2].
According to this definition, data mining is a step in the KDD process concerned with applying computational techniques (i.e., data mining algorithms implemented as computer programs) to actually find patterns in the data.
In a sense, data mining is the central step in the KDD process. The other steps in the KDD process are concerned with preparing data for data mining, as well as evaluating the discovered patterns, the results of data mining.
I Data. The input to a data mining algorithm is most commonly a single flat table comprising a number of fields (columns) and records (rows). In general, each row represents an object and columns represent properties of objects.
II Typical data mining tasks
One task is to predict the value of one field (the class) from the other fields. If the class is continuous, the task is called regression. If the class is discrete, the task is called classification.
Clustering is concerned with grouping objects into classes of similar objects. A cluster is a collection of objects that are similar to each other and dissimilar to objects in other clusters.
Association analysis is the discovery of association rules. Association rules specify correlations between frequent item sets.
Data characterization sums up the general characteristics or features of the target class of data; this class is typically collected by a database query.
Outlier detection is concerned with finding data objects that do not fit the general behavior or model of the data; these are called outliers.
Evolution analysis describes and models regularities or trends of objects whose behavior changes over time.
III Outputs of data mining procedures can be
- equations, e.g. TotalSpent = 189.5275 × Age + 7146 [$]
- predictive rules, e.g. IF Income ≤ 100.000 [$] AND Gender = Male THEN Not a Big Spender
- association rules, e.g. {Gender = Female, Age ≥ 52} → {Big Spender = Yes}
- probabilistic models, e.g. Bayesian networks
- distance and similarity measures, decision trees
- many others
Our aim is to study in detail a particular data mining method called GUHA; its principle was formulated in a paper by Hájek, Havel and Chytil as early as 1966 [3]. GUHA is the acronym for General Unary Hypotheses Automaton. Its computer implementation, called LISp-Miner, is developed at the University of Economics, Prague, by Jan Rauch and Milan Šimůnek. LISp-Miner is freely downloadable from http://lispminer.vse.cz/ .
The GUHA approach is suitable e.g. for association analysis, classification, clustering and outlier detection tasks.
We start by introducing a real-life data set that will serve as a benchmark test set throughout the course. To show how GUHA differs from a Bayesian approach, we take a quick look at B-course, see http://b-course.cs.helsinki.fi/obc/.
The data we use is Tjen-Sien Lim's publicly available data set from the 1987 National Indonesia Contraceptive Prevalence Survey. These are the responses from interviews of m = 1473 married women who were not pregnant at the time of the interview. The challenge is to learn to predict a woman's contraceptive method from knowledge about her demographic and socio-economic characteristics. The 10 survey response variables and their types are

Variable                     Type
Age                          integer, 16–49
Education                    4 categories
Husband's education          4 categories
Number of children borne     integer, 0–15
Islamic                      binary (yes/no)
Working                      binary (yes/no)
Husband's occupation         4 categories
Standard of living           4 categories
Good media exposure          binary (yes/no)
Contraceptive method used    3 categories (None, Long-term, Short-term)
Speculating about causalities
Remember that dependencies are not necessarily
causalities. However, the theory of inferred causation
makes it possible to speculate about the causalities
that have caused the dependencies of the model.
There are two different speculations (called naive
model and not so naive model) which are based on
different background assumptions.
How to read a naive causal model?
Naive causal models are easy to read, but they are built on an assumption that is often unrealistic, namely that there are no latent (unmeasured) variables in the domain that cause the dependencies between variables. A simple example of a situation where this assumption is violated can be placed in Finland, where the cold winter covers lakes and the sea with ice. Because of that, most drowning accidents happen in summertime. The warm summer also makes people eat much more ice cream than in wintertime. If you measure both the number of drowning accidents and ice cream consumption, but do not include a variable indicating the season, there is a clear dependency between ice cream consumption and drowning. Evidently this dependency is not causal (ice cream does not cause drowning, or the other way round), but is due to the excluded variable summer (technically this is called confounding). Naive causal models are built on the assumption that there is no confounding.
In naive causal models there may be two kinds of connections between variables: undirected arcs and directed arcs. Directed arcs denote causal influence from cause to effect, and undirected arcs denote causal influence whose direction cannot be automatically inferred from the data.
You can also read a naive causal model as representing the set of dependency models sharing the same directed arcs. Unfortunately, this does not give you the freedom to re-orient the undirected arcs any way you want. You are free to re-orient the undirected arcs only as long as re-orienting them does not create new V-structures in the graph. A V-structure is a system of three variables A, B, C such that there is a directed arc from A to B and a directed arc from C to B, but there is no arc (neither directed nor undirected) between A and C.
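The V-structure condition can be checked mechanically. The following is a minimal Python sketch, not B-course's own code: the function name `v_structures`, the representation of arcs as pairs, and the example graph (including the variable Age) are assumptions made for illustration only.

```python
from itertools import combinations

def v_structures(directed, undirected=frozenset()):
    """Return all triples (A, B, C) with A -> B <- C and no arc between A and C."""
    # Unordered pairs connected by any arc, directed or undirected.
    adjacent = {frozenset(e) for e in directed} | {frozenset(e) for e in undirected}
    # Map every effect to the set of its direct causes.
    parents = {}
    for cause, effect in directed:
        parents.setdefault(effect, set()).add(cause)
    found = []
    for child, causes in parents.items():
        for a, c in combinations(sorted(causes), 2):
            if frozenset((a, c)) not in adjacent:
                found.append((a, child, c))
    return sorted(found)

# Hypothetical graph: Season -> IceCream, Season -> Drowning, Age -> Drowning.
arcs = {("Season", "IceCream"), ("Season", "Drowning"), ("Age", "Drowning")}
print(v_structures(arcs))
```

Adding an undirected arc between Age and Season removes the V-structure, because the two causes are then connected.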
How to read a causal graph produced by B-course?
Causal models are not difficult to read once you learn the difference between the different kinds of arcs. Arcs are drawn with two kinds of lines, solid and dashed. Solid lines indicate relations that can be determined from the data. Dashed lines are used when we know that there is a dependency, but we are not sure about its exact nature. The different types of arcs that can be found in causal models are:
- Solid arc from A to B: A has a direct causal influence on B (direct meaning that the causal influence is not mediated by any other variable included in the study).
- Dashed arc from A to B: there are two possibilities, but we do not know which holds. Either A is a cause of B or there is a latent cause for both A and B.
- Dashed line without any arrowheads between A and B: there is a dependency, but we do not know whether A causes B, B causes A, or there is a latent cause of both of them (confounding).
[1] W. Frawley, G. Piatetsky-Shapiro and C. Matheus: Knowledge Discovery in Databases: An Overview. In Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and W. Frawley (1991) 1–27. Cambridge, Mass.: AAAI Press / The MIT Press.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy: Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press (1996).
[3] P. Hájek, I. Havel and M. Chytil: The GUHA Method of Automatic Hypotheses Determination. Computing, Vol. 1 (1966) 293–308.
TAM002
Data mining with GUHA Part 2
GUHA produces hypotheses
Esko Turunen
Tampere University of Technology
GUHA does not test hypotheses; instead, GUHA produces them
In this chapter we learn the outlines of GUHA, e.g. that
- GUHA is suitable for exploratory analysis of large data sets
- GUHA goes through 2 × 2 contingency tables and picks out the interesting ones
- in logical terms, GUHA is based on first-order monadic logic whose models are finite
- a fundamental reference, the GUHA book, can be downloaded freely from www.cs.cas.cz/hajek/guhabook/
1. GUHA (General Unary Hypotheses Automaton) is a method of automatic generation of hypotheses based on empirical data, thus a method of data mining.
GUHA is one of the oldest methods of data mining: it was introduced by Hájek, Havel and Chytil in The GUHA method of automatic hypotheses determination, published in Computing 1 (1966) 293–308.
GUHA is still being developed: dozens of new features have been added to GUHA during the past ten years.
GUHA is a kind of automated exploratory data analysis. Instead of testing a given hypothesis against the data, a GUHA procedure systematically generates hypotheses supported by the data. With GUHA it is possible to find all interesting dependencies hidden in the data.
2. GUHA is suitable for exploratory analysis of large data sets. The processed data form a rectangular matrix, where rows correspond to objects belonging to the sample and each column corresponds to one investigated variable. A typical data matrix processed by GUHA has hundreds or thousands of rows and tens of columns.
Cells of the analyzed data matrix can contain any symbols; however, before a concrete data mining task the data must be categorized so that the cells contain only 0s or 1s (or are empty).
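One common way to obtain such a 0/1 matrix is to replace each categorical column by one 0/1 column per category. The sketch below is only an illustration under assumptions: the function name `dichotomize`, the record layout and the category labels are hypothetical, and this is not necessarily how LISp-Miner performs the categorization.

```python
def dichotomize(rows, column):
    """Turn one categorical column into 0/1 columns, one per category."""
    categories = sorted({row[column] for row in rows if row[column] is not None})
    out = []
    for row in rows:
        value = row[column]
        # A missing value yields empty (None) cells rather than 0s.
        out.append({f"{column}={c}": (None if value is None else int(value == c))
                    for c in categories})
    return out

# Hypothetical records; the category labels are illustrative only.
records = [{"Education": "low"}, {"Education": "high"}, {"Education": None}]
for line in dichotomize(records, "Education"):
    print(line)
```

Keeping missing values as empty cells, rather than coding them as 0, matches the remark that cells may be empty.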
Exploratory analysis means that there is no single specific hypothesis that should be tested by our data; rather, our aim is to get oriented in the domain of investigation, to analyze the behavior of chosen variables, interactions among them, etc. Such inquiry is not blind but directed by some general (possibly vague) direction of research (some general problem).
3. GUHA systematically creates all hypotheses that are interesting from the point of view of a given general problem and on the basis of the given data.
This is the main principle: all interesting hypotheses. Clearly, this contains a dilemma: "all" means as many as possible, "only interesting" means not too many. To cope with this dilemma, one may use different GUHA procedures implemented in the LISp-Miner system and, having selected one, fix its numerous parameters in various ways.
The LISp-Miner system guides the user and makes the selection of parameters relatively easy but somewhat laborious. LISp-Miner cannot be automated as much as e.g. B-course; however, the results of LISp-Miner are much more illustrative and detailed.
Three remarks:
- GUHA procedures generate polyfactorial hypotheses, i.e. not only hypotheses relating one variable to another, but hypotheses expressing relations among single variables, pairs, triples, quadruples of variables, etc.
- GUHA offers hypotheses. The exploratory character implies that the hypotheses produced by the computer (typically tens or hundreds of them) are just supported by the data, not verified. You are expected to use this offer as inspiration, and possibly to select a few hypotheses for further testing.
- GUHA is not suitable for testing a single hypothesis: routine statistical packages are good for this.
4. The 4ft-Miner procedure generates hypotheses (or observational statements) on associations between complex Boolean formulae (attributes). These formulae are constructed from unary predicates (corresponding to the columns of the processed 0/1 data matrix) by the logical connectives $\wedge$, $\vee$, $\neg$ (conjunction, disjunction, negation).
Examples of predicates are
TEMPERATURE: 38 °C, PRESSURE: high, DIAGNOSIS: infection
Examples of formulae are
[TEMPERATURE: 38 °C] $\wedge$ [PRESSURE: high] and [DIAGNOSIS: infection]
An example of a hypothesis is
[TEMPERATURE: 38 °C] $\wedge$ [PRESSURE: high] $\approx$ [DIAGNOSIS: infection]
More generally, hypotheses are of the form $\varphi \approx \psi$.
Notice that our terminology differs from the original GUHA approach: we want to keep things as simple as possible!
5. Given the 0/1 data matrix, each pair of Boolean attributes $\varphi$, $\psi$ determines its four-fold frequency table; the association of $\varphi$ with $\psi$ is defined by choosing an associational or generalized quantifier, i.e. a function assigning to each four-fold table either 1 (associated, or true in the data) or 0 (not associated, or false in the data) and satisfying some natural monotonicity conditions.
The four-fold table has the form:

$$
\begin{array}{c|cc|c}
 & \psi & \neg\psi & \\ \hline
\varphi & a & b & a + b = r \\
\neg\varphi & c & d & c + d = s \\ \hline
 & a + c = k & b + d = l & m
\end{array}
$$

where $a + b + c + d = m$ and
- $a$ is the number of objects satisfying both $\varphi$ and $\psi$,
- $b$ is the number of objects satisfying $\varphi$ but not $\psi$,
- $c$ is the number of objects not satisfying $\varphi$ but satisfying $\psi$,
- $d$ is the number of objects satisfying neither $\varphi$ nor $\psi$.
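Counting the four frequencies from two 0/1 columns is straightforward. Here is a small Python sketch; the function name `fourfold` and the two example columns are hypothetical and serve only to illustrate the definition.

```python
def fourfold(phi, psi):
    """Return (a, b, c, d) for two equally long 0/1 columns phi and psi."""
    a = sum(1 for p, q in zip(phi, psi) if p and q)          # phi and psi
    b = sum(1 for p, q in zip(phi, psi) if p and not q)      # phi, not psi
    c = sum(1 for p, q in zip(phi, psi) if not p and q)      # not phi, psi
    d = sum(1 for p, q in zip(phi, psi) if not p and not q)  # neither
    return a, b, c, d

phi = [1, 1, 0, 0, 1]  # hypothetical values of a Boolean attribute
psi = [1, 0, 1, 0, 1]  # hypothetical values of a second attribute
a, b, c, d = fourfold(phi, psi)
# Marginal sums as named in the table.
r, s, k, l, m = a + b, c + d, a + c, b + d, a + b + c + d
print((a, b, c, d), (r, s, k, l, m))
```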
6. There are various types of generalized quantifiers formalizing various kinds of associations:
Implicational quantifiers formalize the association "many $\varphi$ are $\psi$"; they do not depend on the values $c$, $d$.
Comparative quantifiers formalize the association "$\varphi$ makes $\psi$ more likely than $\neg\varphi$ does".
Some quantifiers just express observations on the data; others serve as tests of statistical hypotheses on unknown probabilities.
Some quantifiers are symmetric: $\varphi \approx \psi$ implies $\psi \approx \varphi$; some admit negation: $\varphi \approx \psi$ implies $\neg\varphi \approx \neg\psi$.
The 4ft-Miner procedure contains a dozen generalized quantifiers; a novel procedure, Ac4ft-Miner, also offers action quantifiers. An advantage of GUHA is that new quantifiers can be defined and their properties can be analyzed in a well-established logical framework.
7. After converting the original data matrix to 0/1 format, the user can start the 4ft-Miner procedure. The input of the procedure consists of
(1) the 0/1 data matrix,
(2) parameters determining a symbolic restriction on the pairs $\varphi$, $\psi$ of Boolean attributes to be generated; $\varphi$ is called the antecedent and $\psi$ the succedent,
(3) the quantifier to be used and some parameters of the quantifier,
(4) some other settings; for example, the user has to declare the predicates that can occur in the antecedent and the succedent, the logical connectives to be used, the minimal and maximal length of the antecedent and the succedent, the way missing data are processed, etc.
8. 4ft-Miner produces all associations that satisfy the syntactic restrictions and are true in the data. The generation is not done blindly but uses various techniques that avoid exhaustive search. The associations found, together with various parameters, are not mechanically printed but saved in a solution file for further processing.
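The idea of generating every association that satisfies the syntactic restrictions and is true in the data can be sketched naively in Python. This is only an illustration under stated assumptions: the function name `mine`, the tiny 0/1 matrix and the predicate names P1 to P3 are hypothetical, antecedents are restricted to conjunctions, the test is a simple confidence-and-support check on a and b, and the search is plain exhaustive enumeration, which is exactly what 4ft-Miner is said to avoid.

```python
from itertools import combinations

def mine(data, names, p, base, max_len=2):
    """Naive search for rules 'conjunction of predicates => one predicate'.

    A rule is kept when a / (a + b) >= p and a >= base, where a and b are
    the first-row frequencies of its four-fold table.
    """
    found = []
    for size in range(1, max_len + 1):
        for ante in combinations(range(len(names)), size):
            # Objects (rows) satisfying the whole antecedent.
            rows = [row for row in data if all(row[i] for i in ante)]
            for succ in range(len(names)):
                if succ in ante:
                    continue
                a = sum(row[succ] for row in rows)
                b = len(rows) - a
                if a >= base and a + b > 0 and a / (a + b) >= p:
                    found.append((tuple(names[i] for i in ante), names[succ], a, b))
    return found

# Hypothetical 0/1 data matrix with three predicates.
data = [[1, 1, 0], [1, 1, 1], [1, 0, 1], [0, 1, 1]]
names = ["P1", "P2", "P3"]
for rule in mine(data, names, p=0.6, base=2):
    print(rule)
```

The parameters `p`, `base` and `max_len` play the role of the user-chosen settings listed in item 7; raising `p` or `base` shrinks the output.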
9. The 4ft-Result module for interpretation of results enables the user to browse the associations found, sort them according to various criteria, select reasonably defined subsets and output concise information of various kinds.
There are, however, other procedures implemented in the LISp-Miner system that mine for other kinds of patterns, even for more complex ones than association rules.
10. The GUHA method has deep logical and statistical foundations. GUHA is being further developed at the Institute of Computer Science of the Academy of Sciences of the Czech Republic (Petr Hájek and his group), at the University of Economics, Prague (Jan Rauch and his group) and at Tampere University of Technology (Esko Turunen and his group).
Example
Assume we are observing children who have an allergic reaction to, say, tomato, apple, orange, cheese or milk. These observations are presented in the following table:

Child   Tomato  Apple  Orange  Cheese  Milk
Anna    1       1      0       1       1
Aina    1       1      1       0       0
Naima   1       1      1       1       1
Rauha   0       1      1       0       1
Kai     0       1      0       1       1
Kille   1       1      0       0       1
Lempi   0       1      1       1       1
Ville   1       0      0       0       0
Ulle    1       1      0       1       1
Dulle   1       0      1       0       0
Dof     1       0      1       0       1
Kinge   0       1      1       0       1
Laade   0       1      0       1       1
Koff    1       1      0       0       1
Olvi    0       1      1       1       1
Thus, we have observations such as "Child x is allergic to milk" and "Child y is allergic to cheese". We write these more briefly as Milk(x) and Cheese(y).
Milk(-), Cheese(-), Tomato(-), Orange(-) and Apple(-) are unary predicates of our observational language and x, y, z, ... are variables.
Expressions like Milk(x) or Cheese(y) are atomic (open) formulae. We combine formulae by the logical connectives $\neg$ (not), $\wedge$ (and), $\vee$ (or). E.g. $\text{Milk}(x) \wedge \neg\text{Cheese}(x)$ means "Child x is allergic to milk and is not allergic to cheese".
However, instead of open formulae, we are more interested in universal closed formulae, e.g. "All children are allergic to milk", "Most children are not allergic to orange", "There is a child allergic to tomato", "In most cases, if a child is allergic to milk then she/he is allergic to cheese, too".
Everyone who has passed a basic course in logic knows that statements like "All children are allergic to milk" and "There is a child allergic to tomato" are expressible by $\forall x\,\text{Milk}(x)$ and $\exists x\,\text{Tomato}(x)$.
Unfortunately, the classical mathematical quantifiers $\forall$ and $\exists$ are rather useless in the real world. Much more valuable are generalized quantifiers like "in most cases", e.g. in the statement "In most cases, if a child is allergic to milk then she/he is allergic to cheese, too".
The GUHA method is a logical formalism of generalized quantifiers for data mining purposes.
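The contrast between classical and generalized quantifiers can be seen directly on the allergy table. The following Python sketch copies three columns of that table (rows in the order Anna to Olvi); the variable names are assumptions, and how large a share counts as "most" is deliberately left open, since it is a parameter the analyst has to choose.

```python
# Columns of the allergy table, in row order Anna ... Olvi.
milk   = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1]
cheese = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1]
tomato = [1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0]

all_milk = all(milk)        # classical: every child is allergic to milk
some_tomato = any(tomato)   # classical: some child is allergic to tomato

# Generalized view: among the children allergic to milk, how many are
# also allergic to cheese?
both = sum(m and c for m, c in zip(milk, cheese))
share = both / sum(milk)

print(all_milk, some_tomato, both, sum(milk), round(share, 3))
```

The universal statement fails because of a few children, while the share conveys how strongly the data support the weaker "in most cases" reading.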
TAM002
Data mining with GUHA Part 3
GUHA is a logic approach to data mining
Esko Turunen
Tampere University of Technology
Theoretical basis of GUHA
In this chapter we learn e.g.
- what an observational predicate language is
- what it means that a statement is true in a data set
- how to transform a data matrix into a Boolean data matrix
- that Boolean data matrices are models in which a statement is true or false
- that GUHA theory rests on a firm foundation that allows us to introduce new quantifiers and study their properties.
This chapter also contains some exercises.
GUHA is based on a first-order logic formalism. We start by setting
Definition (A simplified GUHA language)
An observational predicate language $L_n$ consists of
- unary predicates $P_1, \dots, P_n$ and an infinite sequence $x_1, \dots, x_m, \dots$ of variables,
- logical connectives $\neg$ (negation) and $\wedge$ (conjunction),
- nonstandard or generalized binary quantifiers $Q_1, \dots, Q_k$, usually denoted, however, by $\sim$, $\Rightarrow$, $\Leftrightarrow$, etc. with some subscripts and superscripts.
Given an observational predicate language $L_n$, the atomic formulae are the symbols $P(x)$, where $P$ is a predicate and $x$ is a variable. Atomic formulae are formulae, and if $\varphi$, $\psi$ are formulae, then $\neg\varphi$, $\varphi \wedge \psi$ and $\varphi(x) \approx \psi(x)$ are formulae.
$\forall$, $\exists$, the classical quantifiers, are included in the original definition of the GUHA language, as are the truth constants $\top$, $\bot$. As in classical logic, the logical connectives $\vee$ (disjunction), $\to$ (implication) and $\leftrightarrow$ (equivalence) are definable by $\neg$ and $\wedge$. However, not all of this is implemented in LISp-Miner, so we mostly omit it.
Free and bound variables are defined as in classical predicate logic:
- in $P(x)$ and $\forall y[P(y) \to R(x)]$, $x$ is a free variable but $y$ is a bound variable,
- in $\varphi(x) \approx \psi(x)$, $x$ is a bound variable.
Formulae containing free variables are open formulae (not of much interest!); closed formulae do not contain free variables. Closed formulae are also called sentences.
Formulae containing only the variable $x$ and of the form $\varphi(x) \approx \psi(x)$ are in the scope of the GUHA method.
Exercises
1. Express in the GUHA language the natural-language statements (a) "Most x that are red are round, too." (b) "Almost all x that are red are round and vice versa."
2. Are (a) $Qx(\varphi(x) \to \psi(x))$, where $Q$ is a generalized quantifier, and (b) $\forall x(\varphi(x) \to \psi(x))$ well-formed observational formulae?
3. We write $\varphi(x) \Rightarrow \psi(x)$ for a generalized quantifier and $\varphi(x) \to \psi(x)$ for the logical connective $\to$. Discuss the difference between $\Rightarrow$ and $\to$.
Models in GUHA logic are data matrices
Given an observational language $L_n$, consider all $m \times n$ matrices composed of 0s and 1s; this is the set $\mathcal{M}$ of models $M$ of $L_n$. Thus, in a model $M$, rows correspond to variables and columns correspond to predicates. Fix such a model $M$. For example, a $4 \times 5$ matrix $M$:

Object  P_1  P_2  P_3  P_4  P_5
A       1    1    0    0    0
B       0    0    0    1    0
C       1    1    0    1    1
D       1    1    0    0    0

Associate to each cell $a_{ij}$, $i = 1, \dots, m$ (row) and $j = 1, \dots, n$ (column), a value True/False such that

$$
v(P_j(x_i)) =
\begin{cases}
\text{True} & \text{if } a_{ij} = 1, \\
\text{False} & \text{if } a_{ij} = 0.
\end{cases}
$$

For example, $v(P_2(B)) = \text{False}$, while $v(P_4(B)) = \text{True}$.
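The valuation of atomic formulae is just a table lookup. A minimal Python sketch of the example model follows; the dictionary layout and the function name `v` with a 1-based predicate index are assumptions chosen to mirror the notation above.

```python
# The example model M: one row per object, one column per predicate P_1..P_5.
M = {
    "A": [1, 1, 0, 0, 0],
    "B": [0, 0, 0, 1, 0],
    "C": [1, 1, 0, 1, 1],
    "D": [1, 1, 0, 0, 0],
}

def v(j, obj):
    """Truth value of the atomic formula P_j(obj); j is 1-based as in the text."""
    return M[obj][j - 1] == 1

print(v(2, "B"), v(4, "B"))
```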
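We have now defined the truth values $v(P(x)) \in \{\text{TRUE}, \text{FALSE}\}$ of all atomic formulae $P(x)$. Next we extend truth values to all formulae, that is, we define the set VAL of all valuation functions $v : \mathcal{M} \times L_n \to \{\text{TRUE}, \text{FALSE}\}$.
Let $v(\varphi)$ and $v(\psi)$ be defined. As in classical logic we set:

$$
\begin{array}{cc|cc}
v(\varphi) & v(\psi) & v(\neg\varphi) & v(\varphi \wedge \psi) \\ \hline
\text{TRUE} & \text{TRUE} & \text{FALSE} & \text{TRUE} \\
\text{TRUE} & \text{FALSE} & \text{FALSE} & \text{FALSE} \\
\text{FALSE} & \text{TRUE} & \text{TRUE} & \text{FALSE} \\
\text{FALSE} & \text{FALSE} & \text{TRUE} & \text{FALSE}
\end{array}
$$

Exercises
4. Write the truth tables of the connectives $\vee$, $\to$ and $\leftrightarrow$.
5. $v(\exists x\,\varphi(x)) = \text{TRUE}$ in a model $M$ iff $v(\varphi(x)) = \text{TRUE}$ for some $x$ (corresponding to a row of $M$!). How is $v(\forall x\,\varphi(x))$ defined?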
Truth value of formulae with nonclassical quantifiers
Given a model $M$, the value $v(\varphi)$ of any formula $\varphi$ of the language $L_n$ can be calculated immediately. In some cases we consider several models $M$, $N$. Thus, to avoid confusion, we write $v_M$, $v_N$.
To define the truth value of formulae with nonclassical quantifiers, recall the four-fold table:

$$
\begin{array}{c|cc}
 & \psi & \neg\psi \\ \hline
\varphi & a & b \\
\neg\varphi & c & d
\end{array}
$$

where $a + b + c + d = m$.
Quantifier of simple association $\varphi(x) \sim \psi(x)$ [read: coincidence of $\varphi(x)$ and $\psi(x)$ predominates over difference]. Given a model $M$, we define
$v(\varphi(x) \sim \psi(x)) = \text{TRUE}$ iff $ad > bc$.
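This quantifier is known in most data mining frameworks, not only in GUHA. In LISp-Miner it is implemented under the name Simple deviation; there the truth definition is more general: $ad > e^{\delta} bc$, where the parameter $\delta \geq 0$ can be chosen by the user.
Quantifiers of founded $p$-implication or basic implication $\Rightarrow_{p,\text{Base}}$, where $\text{Base} \in \mathbb{N}$, $0 < p \leq 1$, $p$ rational: $\varphi(x) \Rightarrow_{p,\text{Base}} \psi(x)$ [read: $\varphi(x)$ implies $\psi(x)$ with confidence $p$ and support Base]. Given a model $M$, we define $v(\varphi(x) \Rightarrow_{p,\text{Base}} \psi(x)) = \text{TRUE}$ iff $\frac{a}{a+b} \geq p$ and $a \geq \text{Base}$.
Example
The Allergy matrix induces a four-fold frequency table

$$
\begin{array}{c|cc}
M & \text{Apple} & \neg\text{Apple} \\ \hline
\text{Tomato} & 6 & 3 \\
\neg\text{Tomato} & 6 & 0
\end{array}
$$

$v(\text{Tomato} \sim \text{Apple}) = \text{FALSE}$, $v(\text{Tomato} \Rightarrow_{0.6,5} \text{Apple}) = \text{TRUE}$.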
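The example can be recomputed in a few lines of Python. The function names `simple_association` and `founded_implication` are assumptions; the two columns are copied from the Allergy matrix in row order Anna to Olvi, and the guard `a + b > 0` is an added safety check for an empty antecedent, not part of the definition.

```python
def simple_association(a, b, c, d):
    """Simple association: TRUE iff ad > bc."""
    return a * d > b * c

def founded_implication(a, b, p, base):
    """Founded implication: TRUE iff a / (a + b) >= p and a >= Base."""
    return a >= base and a + b > 0 and a / (a + b) >= p

# Tomato and Apple columns of the Allergy matrix, rows Anna ... Olvi.
tomato = [1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0]
apple  = [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

pairs = list(zip(tomato, apple))
a = pairs.count((1, 1))
b = pairs.count((1, 0))
c = pairs.count((0, 1))
d = pairs.count((0, 0))

print((a, b, c, d))
print(simple_association(a, b, c, d))
print(founded_implication(a, b, 0.6, 5))
```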
Some theoretical observations about GUHA
Definition
Given an observational language $L_n$, the set $\mathcal{M}$ of all models $M$, the set VAL of all valuations $v$ and the set $V = \{\text{TRUE}, \text{FALSE}\}$ of truth values, the system $\langle L_n, \mathcal{M}, \text{VAL}, V \rangle$ is an observational semantic system.
We say that a sentence $\varphi \in L_n$ is a tautology (denoted $\models \varphi$) if $v(\varphi) = \text{TRUE}$ for all valuations $v \in \text{VAL}$ (thus, in all models $M$). Moreover, $\varphi \in L_n$ is a logical consequence of a finite set $A$ of sentences of $L_n$ if, whenever $v(\psi) = \text{TRUE}$ for all $\psi \in A$, then $v(\varphi) = \text{TRUE}$, too.
In theoretical considerations we are interested in tautologies (true in all models), while in practical GUHA research we consider only one model $M$, induced by a given data matrix.
Definition
An observational semantic system $\langle L_n, \mathcal{M}, \text{VAL}, V \rangle$ is axiomatizable if the set of all tautologies is recursively enumerable.
We can prove
Theorem
For any natural number $n$, the semantic system $\langle L_n, \mathcal{M}, \text{VAL}, V \rangle$ is axiomatizable.
Remark
Axiomatizable means that there is a finite set of schemas of tautologies, called axioms, and a finite set of rules of inference such that all tautologies (and only they) can be deduced from the axioms by means of the rules of inference, i.e. that all tautologies have a proof (denoted $\vdash \varphi$). Thus, axiomatizability means: $\models \varphi$ iff $\vdash \varphi$.
For example, if $\varphi$ and $\varphi \to \psi$ are axioms (or if they have a proof), then one can infer $\psi$ by means of the rule of inference called Modus Ponens. Thus, $\psi$ has a proof, too.
Here we are not interested in the general axiom system of GUHA. However, we will need rules of inference to understand how LISp-Miner reduces the number of outcomes of practical GUHA procedures.
We give the definition of a (sound) rule of inference. It is of the form

$$
\frac{\varphi_1, \dots, \varphi_n}{\psi}
$$

such that, whenever $v(\varphi_i) = \text{TRUE}$ for all $i = 1, \dots, n$, then $v(\psi) = \text{TRUE}$, too. $\varphi_1, \dots, \varphi_n$ are the premises, $\psi$ is the conclusion.
Rules of inference are also called deduction rules.
Exercises
6. Prove or disprove
(a) [= [ ( ] [( ) ( )],
(b) [= ( ) [( )] and
(c) [= ( ),
where , and are wellformed formulae.
7. Is
(a) a logical consequences of a set , ,
(b) a logical consequences of a set , ,
(c) a logical consequences of a set , of
wellformed formulae?
8. Prove that Modus Ponens is a sound rule of inference.
Exercises
In exercises 9–11, consider the Allergy matrix.
9. Write down all sentences $P_i(x) \Rightarrow_{0.7,7} P_j(x)$ such that $v(P_i(x) \Rightarrow_{0.7,7} P_j(x)) = \text{TRUE}$ ($i \neq j$).
10. Does $v(\text{Apple}(x) \sim \text{Orange}(x)) = \text{TRUE}$ hold, where $\sim$ is simple association?
11. Does $v(\text{Tomato}(x) \sim \text{Cheese}(x)) = \text{TRUE}$ hold ($\sim$ is simple association)?
TAM002
Data mining with GUHA Part 4
More about foundations of GUHA
Esko Turunen
Tampere University of Technology
Theoretical foundations of GUHA, continued
We study desirable properties that the nonstandard quantifiers in GUHA should have, e.g.
- if a data set contains evidence for an interdependence of $\varphi$ and $\psi$, i.e. $\varphi \approx \psi$ is true in this data set, then $\varphi \approx \psi$ should remain true in all other data sets where there is even more evidence. Such quantifiers are called associational or implicational.
- the truth value of $\varphi \approx \psi$ should really depend on the values in the four-fold table; by altering these values, $\varphi \approx \psi$ becomes either true or false.
We also introduce sound rules of inference to minimize computations in practical data mining tasks. This chapter contains exercises.
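Definition (Associational quantifiers)
Let $\varphi(x)$, $\psi(x)$ be two fixed formulae of a language $L_n$ such that $x$ is the only free variable in both of them and they don't have common predicates. Let $M$ and $N$ be two models. Then we have the following two four-fold tables:

$$
\begin{array}{c|cc}
M & \psi & \neg\psi \\ \hline
\varphi & a_1 & b_1 \\
\neg\varphi & c_1 & d_1
\end{array}
\qquad
\begin{array}{c|cc}
N & \psi & \neg\psi \\ \hline
\varphi & a_2 & b_2 \\
\neg\varphi & c_2 & d_2
\end{array}
$$

We define: $N$ is associationally better than $M$ if $a_2 \geq a_1$, $d_2 \geq d_1$, $b_2 \leq b_1$ and $c_2 \leq c_1$. Moreover, a binary quantifier $\approx$ is associational if, for all formulae $\varphi(x)$, $\psi(x)$ and all models $M$, $N$: if $v_M(\varphi(x) \approx \psi(x)) = \text{TRUE}$ and $N$ is associationally better than $M$, then $v_N(\varphi(x) \approx \psi(x)) = \text{TRUE}$.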
Obviously, the quantifier of simple association is associational: this follows from the fact that, under the given circumstances, $a_2 d_2 \geq a_1 d_1 > b_1 c_1 \geq b_2 c_2$.
Quantifiers of basic implication are associational as well: if $a_2 \geq a_1 \geq \text{Base}$ and $b_2 \leq b_1$, then $a_2 b_1 \geq a_1 b_2$, therefore $a_2 a_1 + a_2 b_1 \geq a_2 a_1 + a_1 b_2$ and finally

$$
\frac{a_2}{a_2 + b_2} \geq \frac{a_1}{a_1 + b_1} \geq p.
$$

Instead of being associational, a nonclassical quantifier may satisfy a stronger condition. We define
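Definition (Implicational quantifiers)
Let $\varphi(x)$, $\psi(x)$ be two fixed formulae of a language $L_n$ such that $x$ is the only free variable in both of them and they don't have common predicates. Let $M$ and $N$ be two models. Then we have the following two four-fold tables:

$$
\begin{array}{c|cc}
M & \psi & \neg\psi \\ \hline
\varphi & a_1 & b_1 \\
\neg\varphi & c_1 & d_1
\end{array}
\qquad
\begin{array}{c|cc}
N & \psi & \neg\psi \\ \hline
\varphi & a_2 & b_2 \\
\neg\varphi & c_2 & d_2
\end{array}
$$

We define: $N$ is implicationally better than $M$ if $a_2 \geq a_1$ and $b_2 \leq b_1$. Moreover, a binary quantifier $\approx$ is implicational if, for all formulae $\varphi(x)$, $\psi(x)$ and all models $M$, $N$: if $v_M(\varphi(x) \approx \psi(x)) = \text{TRUE}$ and $N$ is implicationally better than $M$, then $v_N(\varphi(x) \approx \psi(x)) = \text{TRUE}$.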
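Basic implication quantifiers are implicational; however, simple association quantifiers are not implicational. To see this, consider the following two frequency tables

$$
\begin{array}{c|cc}
M & \psi & \neg\psi \\ \hline
\varphi & a_1 = 1 & b_1 = 1 \\
\neg\varphi & c_1 = 1 & d_1 = 2
\end{array}
\qquad
\begin{array}{c|cc}
N & \psi & \neg\psi \\ \hline
\varphi & a_2 = 1 & b_2 = 1 \\
\neg\varphi & c_2 = 2 & d_2 = 1
\end{array}
$$

Clearly $N$ is implicationally better than $M$, and $a_1 d_1 > c_1 b_1$, thus $v_M(\varphi(x) \sim \psi(x)) = \text{TRUE}$. However, $a_2 d_2 < c_2 b_2$, thus $v_N(\varphi(x) \sim \psi(x)) = \text{FALSE}$.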
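The counterexample is easy to verify numerically. In this Python sketch the function names are assumptions and each four-fold table is written as a tuple (a, b, c, d).

```python
def implicationally_better(n, m):
    """N is implicationally better than M: a2 >= a1 and b2 <= b1."""
    (a2, b2, _, _), (a1, b1, _, _) = n, m
    return a2 >= a1 and b2 <= b1

def associationally_better(n, m):
    """N is associationally better than M: a, d do not decrease; b, c do not increase."""
    (a2, b2, c2, d2), (a1, b1, c1, d1) = n, m
    return a2 >= a1 and d2 >= d1 and b2 <= b1 and c2 <= c1

def simple_association(a, b, c, d):
    """Simple association: TRUE iff ad > bc."""
    return a * d > b * c

M = (1, 1, 1, 2)  # a1, b1, c1, d1
N = (1, 1, 2, 1)  # a2, b2, c2, d2

print(implicationally_better(N, M))   # the premise of the definition holds
print(simple_association(*M))         # true in M
print(simple_association(*N))         # but not preserved in N
print(associationally_better(N, M))   # N is not associationally better
```

The last line shows why the example does not contradict the earlier observation that simple association is associational.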
Remark
Let $\approx$ be an implicational quantifier. Then $\approx$ is associational.
Proof.
Let $\approx$ be implicational and $v_M(\varphi(x) \approx \psi(x)) = \text{TRUE}$. If $N$ is associationally better than $M$, then $N$ is clearly also implicationally better than $M$, so $v_N(\varphi(x) \approx \psi(x)) = \text{TRUE}$. Therefore $\approx$ is associational, too.
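Theorem
Let $\varphi(x)$, $\psi(x)$, $\chi(x)$ be formulae, and let $\approx$ be an implicational quantifier. Then

$$
\frac{\varphi \approx \psi}{\varphi \approx [\psi \vee \chi]}
$$

is a sound rule of inference.
Proof.
Let $v_M(\varphi(x) \approx \psi(x)) = \text{TRUE}$ and

$$
\begin{array}{c|cc}
M & \psi & \neg\psi \\ \hline
\varphi & a_1 & b_1 \\
\neg\varphi & c_1 & d_1
\end{array}
\qquad
\begin{array}{c|cc}
M & \psi \vee \chi & \neg(\psi \vee \chi) \\ \hline
\varphi & a_2 & b_2 \\
\neg\varphi & c_2 & d_2
\end{array}
$$

We reason that

$$
a_1 = |\{x \mid v_M(\varphi(x)) = v_M(\psi(x)) = \text{TRUE}\}| \leq |\{x \mid v_M(\psi(x) \vee \chi(x)) = v_M(\varphi(x)) = \text{TRUE}\}| = a_2
$$

and

$$
\begin{aligned}
b_1 &= |\{x \mid v_M(\neg\psi(x)) = v_M(\varphi(x)) = \text{TRUE}\}| \\
&\geq |\{x \mid v_M(\neg\psi(x) \wedge \neg\chi(x)) = v_M(\varphi(x)) = \text{TRUE}\}| \\
&= |\{x \mid v_M(\neg(\psi(x) \vee \chi(x))) = v_M(\varphi(x)) = \text{TRUE}\}| = b_2.
\end{aligned}
$$

Thus the second table is implicationally better than the first and, since $\approx$ is implicational, $v_M(\varphi \approx [\psi \vee \chi]) = \text{TRUE}$.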
Theorem (Reduction Theorem 1.)
Let $\varphi(x)$, $\psi(x)$, $\chi(x)$ be formulae, and let $\approx$ be an implicational quantifier. Then
[ ]
[ ]
is a sound rule of inference.
Theorem (Reduction Theorem 2.)
Let $\varphi(x)$, $\psi(x)$, $\chi(x)$ be formulae, and let $\sim$ be the simple association quantifier. Then




are sound
rules of inference.
We have introduced sound rules of inference mainly to minimize computations in practical data mining tasks. For example, if $\approx$ is an implicational quantifier and $\varphi \approx \psi$ is true in a given model $M$, then $\varphi \approx [\psi \vee \chi]$ is true in $M$, too. Due to the solid logical foundations of GUHA, there are several means to reduce the amount of computation. However, we do not study them here.
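We introduce
Basic equivalence quantifiers $\equiv_p$, where $0 < p \leq 1$. In any model $M$, $v(\varphi(x) \equiv_p \psi(x)) = \text{TRUE}$ iff $(a + d) \geq p(a + b + c + d)$, except for the case $a + d = 0$, $b + c = 0$; then $v(\varphi(x) \equiv_p \psi(x)) = \text{FALSE}$.
Basic or double implication quantifiers $\Leftrightarrow_p$, where $0 < p \leq 1$. In any model $M$, $v(\varphi(x) \Leftrightarrow_p \psi(x)) = \text{TRUE}$ iff $a \geq p(a + b + c)$, except for the case $a + b + c = 0$, $d = 0$; then $v(\varphi(x) \Leftrightarrow_p \psi(x)) = \text{FALSE}$.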
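Both definitions translate directly into code. The Python sketch below follows the conditions exactly as stated above, including the exceptional cases; the function names are assumptions, and the frequencies (6, 3, 6, 0) are those of the Tomato/Apple table of the Allergy matrix.

```python
def basic_equivalence(a, b, c, d, p):
    """TRUE iff a + d >= p * (a + b + c + d), FALSE in the empty-table case."""
    if a + d == 0 and b + c == 0:
        return False  # exceptional case from the definition
    return a + d >= p * (a + b + c + d)

def double_implication(a, b, c, d, p):
    """TRUE iff a >= p * (a + b + c), FALSE in the stated exceptional case."""
    if a + b + c == 0 and d == 0:
        return False  # exceptional case as stated in the definition
    return a >= p * (a + b + c)

table = (6, 3, 6, 0)  # Tomato vs. Apple in the Allergy matrix
print(basic_equivalence(*table, p=0.3), basic_equivalence(*table, p=0.5))
print(double_implication(*table, p=0.3), double_implication(*table, p=0.5))
```

Note that the basic equivalence test looks at the diagonal a + d of the table, while double implication ignores d altogether.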
Exercises
12. Prove Reduction Theorem 1.
13. Prove Reduction Theorem 2.
14. Prove that Reduction Theorem 2 does not hold for basic implication quantifiers.
15. Prove that basic equivalence quantifiers are associational.
16. Prove that double implication quantifiers are associational.
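It is unquestionable that any valuable quantifier should be (at
least) implicational (associational). However, there are other
obvious conditions, too. We say that an implicational quantifier
$\Rightarrow^*$ is a-dependent if there are models M and N with frequency tables
$$M:\ \begin{array}{cc} a & b \\ c & d \end{array} \qquad \text{and} \qquad N:\ \begin{array}{cc} a' & b \\ c & d \end{array}$$
where $v_M(\varphi(x) \Rightarrow^* \psi(x)) \ne v_N(\varphi(x) \Rightarrow^* \psi(x))$, that is, changing
only the frequency a changes the truth value. Similarly we
define a b-dependent implicational quantifier $\Rightarrow^*$. If an
a-dependent and b-dependent quantifier $\Rightarrow^*$ satisfies the
condition
$a = b = 0$ implies $v(\varphi(x) \Rightarrow^* \psi(x)) = \mathrm{FALSE}$,
then $\Rightarrow^*$ is called interesting.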
Exercise 17
Show that basic implication quantifiers are interesting. What
about the case p = 1?
We can also define (a+d)-dependent, (b+c)-dependent, etc.
implicational quantifiers. For example, double implication
quantifiers $\Leftrightarrow_p$ are a-dependent. Indeed, let $0 < p \le 1$ and
take a table with $\frac{a}{a+b+c} = p$. Then, assuming $b + c > 0$,
$$\frac{a'}{a'+b+c} < \frac{a}{a+b+c}\ (= p) \iff a'a + a'b + a'c < a'a + ab + ac \iff a' < a.$$
Thus, there are models M and N such that the corresponding
frequency tables differ only with respect to the value a and
$v_N(\varphi(x) \Leftrightarrow_p \psi(x)) = \mathrm{FALSE}$ while $v_M(\varphi(x) \Leftrightarrow_p \psi(x)) = \mathrm{TRUE}$.
In a similar manner we prove that double implication
quantifiers $\Leftrightarrow_p$ are (b+c)-dependent, too.
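The argument can also be replayed by brute force. The sketch below is an illustration under assumptions, with hypothetical names: it uses the double implication condition a >= p(a + b + c), with FALSE for the empty table, keeps b, c and d fixed, and searches for two values of a that give different truth values.

```python
from fractions import Fraction

def double_implication(a, b, c, d, p):
    # TRUE iff a >= p * (a + b + c); FALSE when a + b + c = 0 and d = 0.
    if a + b + c == 0 and d == 0:
        return False
    return a >= p * (a + b + c)

def a_dependence_witness(p, b, c, d, max_a=50):
    # Search for a pair (a, a_prime), a_prime < a, whose tables differ only in
    # the first frequency but receive different truth values.
    for a in range(max_a + 1):
        for a_prime in range(a):
            if double_implication(a, b, c, d, p) != double_implication(a_prime, b, c, d, p):
                return a, a_prime
    return None

witness = a_dependence_witness(Fraction(1, 2), b=3, c=2, d=10)
print(witness)  # (5, 0)
```

With p = 1/2, b = 3 and c = 2, the quantifier is true exactly when a >= 5, so the first witness found is the pair a = 5, a' = 0. With b = c = 0 and d > 0 every table is true and the search returns None, which is why the derivation needs b + c > 0.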
We continue by having a closer look at the quantifiers implemented
in LISpMiner and see what kinds of tasks can be performed.
We use the National Indonesia Contraceptive Prevalence Survey
data as a benchmark data set. Along the way, we also examine which of the
above mentioned theoretical properties each individual
quantifier satisfies.
TAM002
Data mining with GUHA Part 5
Introduction to LISpMiner software
Esko Turunen
Tampere University of Technology
Instructions to download LISpMiner
In this chapter we show how to start working with LISpMiner.
We also introduce the first GUHA task, in which we use the founded
implication quantifier.
To download LISpMiner, it is highly recommended that you
print these slides and follow the instructions step by step.
It is even better if you have two computers
and two terminals available.
LISpMiner is not a commercial product, and you might find
some parts somewhat cumbersome. Updates are frequent,
so the layout today might not look precisely as it did when
these slides were completed. Patience!
LISpMiner is built on Microsoft products, so you need a
Microsoft Access database, version 2002 or later (.mdb file).
You might also like to look at http://lispminer.vse.cz/, the
homepage of LISpMiner.
LISpMiner
is academic software based on the GUHA method,
is developed and maintained at the Prague University of
Economics by Jan Rauch and Milan Šimůnek,
can be downloaded free of charge from http://lispminer.vse.cz/.
We show step by step how to install LISpMiner on
your own computer. Here we use LISpMiner version 12.01.00
from February 24, 2010.
Step 1. Create a new folder (e.g. under C:\) and name it
e.g. LispMiner2010.
Step 2. Go to the web page http://lispminer.vse.cz/, click
Download and download the following file to your folder
LispMiner2010. (Later we will download more files for this
course; there are also some more instructions there.)

Your C:\LispMiner2010 folder should now look like this.
Add a new folder there and name it e.g. My Data; you
will put your data files there!

To download the Indonesia data, go to http://b-course.cs.helsinki.fi/obc/readymade.html
and, under Contraceptive methods, choose The data set itself (a .txt file).
Put the file in My Data.

Name it Indonesia1.txt.
Delete the empty spaces from the names!

It should look like this now.
Next we create a Microsoft Access database
through the following steps:

Make a copy of LMEmpty, rename it LM_Indonesia1.mdb
and save it in My Data.

Your My Data folder should now look like this. Next we
create one more data matrix there. It will be called
Indonesia1.mdb. Do the following steps:

Your data file Indonesia1.mdb should now look like
this. Close it and go to LispMiner2010

Open LMAdmin.exe

Put here Indonesia1.mdb
...
and here
LM_Indonesia1.mdb

You should now have this on your screen!
Close it and go again to LispMiner2010

Open now LMDataSource.exe

Open from Database ->Data Matrices

Open from Database ->Attributes Three

Here is the distribution of ages

In this way we go through the whole list of Attributes.
The number of categories into which we divide each attribute
determines the total number of Predicates (= columns
composed of 0/1/- cells in the final data matrix) of our logic
language. We should always keep this number as small as possible.
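To make the step from categories to Predicates concrete, here is a minimal sketch of how one attribute could be turned into 0/1/- columns. It is not how LISpMiner works internally: the category boundaries, the sample ages and the reading of "-" as a missing value are all assumptions made for illustration.

```python
def to_predicates(values, categories):
    # One column per category: 1 if the value falls into it, 0 if not,
    # and "-" if the value is missing (None).
    columns = {name: [] for name in categories}
    for value in values:
        for name, (low, high) in categories.items():
            if value is None:
                columns[name].append("-")
            else:
                columns[name].append(1 if low <= value <= high else 0)
    return columns

# Hypothetical ages and a hypothetical division of Age into three categories.
ages = [17, 25, 33, None, 48]
age_categories = {"16-24": (16, 24), "25-34": (25, 34), "35-49": (35, 49)}

columns = to_predicates(ages, age_categories)
for name, cells in columns.items():
    print(name, cells)
print("Predicates from this attribute:", len(columns))
```

Each additional category adds one column, which is why a coarser division of the attributes keeps the final data matrix, and with it the search space, smaller.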

We are now ready to start the simplest data mining
tasks. Open 4ftTask.exe

We are interested in knowing which Attributes imply
Contraception Method, so we name the task like this.

Open the Succedent part: click it!

Since there are only 3 Predicates
(No use, Short Term and Long
Term), it is natural to take subsets
with min and max length = 1.

Next open the Antecedent part: click
here. There are many more choices.

Select all but Contraceptive

Age is typically an interval, say 16-18, 17-19, ..., 46-48,
16-19, 17-20, ..., 45-48 or 16-20, 17-21, ..., 44-48.
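The interval families listed above can be generated mechanically. The sketch below assumes that an interval such as 16-18 covers three consecutive ages and that the ages run from 16 to 48, as in the examples; the function name is hypothetical.

```python
def sliding_intervals(lowest, highest, min_len, max_len):
    # All intervals of consecutive integer ages with a length between
    # min_len and max_len that fit inside [lowest, highest].
    intervals = []
    for length in range(min_len, max_len + 1):
        for start in range(lowest, highest - length + 2):
            intervals.append((start, start + length - 1))
    return intervals

intervals = sliding_intervals(16, 48, min_len=3, max_len=5)
print(intervals[:3])   # the first intervals of length 3
print(intervals[-3:])  # the last intervals of length 5
print(len(intervals))
```

Every interval produced this way is one more candidate condition on Age in the antecedent, so the minimal and maximal lengths directly control how many hypotheses are examined.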

All the others contain only 2 or
4 Categories (=Predicates) so
let them be just 1-1 subsets.
With this division there are
34+15+4+4+4+2+4+2+2=71
columns in the data to be mined

Choosing the quantifier:
Double-click BASE, choose 5%.
Double-click FUI, choose p = 0.900.
Push Generate and see what
happens.

Let us see the shortest
hypotheses in detail
click here!

There are 95 + 2 = 97 women
who have no children;
95 of them use no contraception.
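These numbers can be checked by hand against the chosen quantifier. The sketch below assumes the usual reading of founded implication, namely a/(a + b) >= p together with a >= Base, where a = 95 and b = 2. The Base of 5% is relative to the number of rows in the data, which is not stated on this slide, so the row count used here is a made-up placeholder.

```python
from fractions import Fraction

def founded_implication(a, b, p, base_count):
    # Assumed definition: TRUE iff a >= p * (a + b) and a >= base_count.
    return a >= p * (a + b) and a >= base_count

a, b = 95, 2                     # frequencies read from the slide
p = Fraction(9, 10)              # p = 0.900 as chosen in the task
confidence = Fraction(a, a + b)  # share of childless women using no contraception

hypothetical_rows = 1000         # placeholder, NOT the real size of the data set
base_count = Fraction(5, 100) * hypothetical_rows

print(round(float(confidence), 3))
print(founded_implication(a, b, p, base_count))
```

The ratio 95/97 is about 0.979, comfortably above p = 0.900, so the hypothesis passes the implication part of the test whatever the actual number of rows is.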