[Screenshots of the B-Course web interface at http://b-course.cs.helsinki.fi/obc/]
Speculating about causalities
Remember that dependencies are not necessarily causalities. However, the theory of inferred causation makes it possible to speculate about the causal relations that have produced the dependencies of the model. There are two different speculations (called the naive model and the not-so-naive model), which are based on different background assumptions.
How to read a naive causal model?
Naive causal models are easy to read, but they are built on an assumption that is often unrealistic, namely that there are no latent (unmeasured) variables in the domain that cause the dependencies between variables. A simple example of a situation where this assumption is violated can be found in Finland, where the cold winter covers the lakes and the sea with ice. Because of that, most drowning accidents happen in summertime. The warm summer also makes people eat much more ice cream than in wintertime. If you measure both the number of drowning accidents and the ice cream consumption, but do not include a variable indicating the season, there is a clear dependency between ice cream consumption and drowning. Evidently this dependency is not causal (ice cream does not cause drowning or the other way round), but is due to the excluded variable, the season (technically this is called confounding). Naive causal models are built on the assumption that there is no confounding.
In naive causal models there may be two kinds of connections between variables: undirected arcs and directed arcs. Directed arcs denote causal influence from cause to effect, and undirected arcs denote causal influence whose direction cannot be automatically inferred from the data.
You can also read a naive causal model as representing the set of dependency models sharing the same directed arcs. Unfortunately, this does not give you the freedom to re-orient the undirected arcs any way you want. You are free to re-orient the undirected arcs as long as re-orienting them does not create new V-structures in the graph. A V-structure is a configuration of three variables A, B, C such that there is a directed arc from A to B and a directed arc from C to B, but there is no arc (neither directed nor undirected) between A and C; a sketch of this re-orientation check is given below.
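As a small illustration (not part of B-Course itself), the following Python sketch checks whether orienting an undirected arc A - B as A -> B would create a new V-structure at B; the graph representation and the example arcs are invented for this illustration.

    # Hypothetical arcs, each given as a pair of variable names.
    directed = {("Cold winter", "Ice cover")}
    undirected = {("Ice-cream", "Drowning")}

    def adjacent(x, y, directed, undirected):
        # Two variables are adjacent if any arc (directed or not) connects them.
        return (x, y) in directed or (y, x) in directed or \
               (x, y) in undirected or (y, x) in undirected

    def creates_new_v_structure(a, b, directed, undirected):
        # Orienting a - b as a -> b creates a V-structure at b if some other
        # parent c -> b already exists and a, c are not adjacent.
        for (c, target) in directed:
            if target == b and c != a and not adjacent(a, c, directed, undirected):
                return True
        return False

    # An undirected arc may be re-oriented only if this returns False.
    print(creates_new_v_structure("Ice-cream", "Drowning", directed, undirected))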
How to read a causal graph produced by B-Course?
Causal models are not difficult to read once you learn the difference between the different kinds of arcs. There are two kinds of lines in arcs, solid and dashed. With solid lines we indicate relations that can be determined from the data. Dashed lines are used when we know that there is a dependency, but we are not sure about its exact nature. The list below gives the different types of arcs that can be found in causal models.

Solid arc from A to B: A has a direct causal influence on B (direct meaning that the causal influence is not mediated by any other variable included in the study).
Dashed arc from A to B: There are two possibilities, but we do not know which one holds. Either A is a cause of B, or there is a latent cause for both A and B.
Dashed line without arrow heads between A and B: There is a dependency, but we do not know whether A causes B, B causes A, or a latent cause of both produces the dependency (confounding).
W. Frawley, G. Piatetsky-Shapiro and C. Matheus: Knowledge Discovery in Databases: An Overview. In Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and W. Frawley (1991) 1–27. Cambridge, Mass.: AAAI Press / The MIT Press.
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy: Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press (1996).
I. Havel, M. Chytil and P. Hájek: The GUHA Method of Automatic Hypotheses Determination. Computing, Vol. 1 (1966) 293–308.
TAM002
Data mining with GUHA, Part 2
GUHA produces hypotheses
Esko Turunen
Tampere University of Technology
GUHA does not test hypotheses, instead GUHA produces them
In this chapter we learn the outlines of GUHA, e.g. that
- GUHA is suitable for exploratory analysis of large data
- GUHA goes through 2×2 contingency tables and picks up the interesting ones
- in logical terms, GUHA is based on first order monadic logic whose models are finite
- a fundamental reference, the GUHA book, is free to be downloaded from www.cs.cas.cz/hajek/guhabook/
1. GUHA (General Unary Hypotheses Automaton) is a method of automatic generation of hypotheses based on empirical data, thus a method of data mining.
GUHA is one of the oldest methods of data mining: GUHA was introduced by Hájek, Havel and Chytil in The GUHA method of automatic hypotheses determination, published in Computing 1 (1966) 293–308.
GUHA still develops: dozens of new features have been added to GUHA during the past ten years.
GUHA is a kind of automated exploratory data analysis. Instead of testing a given hypothesis against the data, the GUHA procedure systematically generates hypotheses supported by the data. With GUHA it is possible to find all the interesting dependencies hidden in a data set.
2. GUHA is suitable for exploratory analysis of large data.
The processed data form a rectangular matrix, where each row corresponds to an object belonging to the sample and each column corresponds to one investigated variable. A typical data matrix processed by GUHA has hundreds or thousands of rows and tens of columns.
Cells of the analyzed data matrix may contain arbitrary symbols; however, before a concrete data mining task the data must be categorized so that the cells contain only 0s or 1s (or are empty). A minimal sketch of such a categorization is given below.
Exploratory analysis means that there is no single specific hypothesis that should be tested on our data; rather, our aim is to get oriented in the domain of investigation and to analyze the behavior of the chosen variables, the interactions among them, etc. Such an inquiry is not blind but is directed by some general (possibly vague) direction of research (some general problem).
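As a minimal sketch (not part of GUHA itself), the following Python fragment shows what categorizing a raw column into 0/1 columns might look like; the variable, the cut points and the category labels are invented for this illustration.

    # Hypothetical raw values of one variable (age of a respondent).
    ages = [17, 24, 31, 46, 29]

    # Invented categories: each becomes one 0/1 column of the categorized matrix.
    categories = {"Age 16-25": lambda x: 16 <= x <= 25,
                  "Age 26-35": lambda x: 26 <= x <= 35,
                  "Age 36-48": lambda x: 36 <= x <= 48}

    # Build the 0/1 columns: 1 when the object falls into the category, else 0.
    binary_columns = {name: [1 if test(a) else 0 for a in ages]
                      for name, test in categories.items()}

    for name, col in binary_columns.items():
        print(name, col)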
3. GUHA systematically creates all hypotheses that are interesting from the point of view of a given general problem and on the basis of the given data.
This is the main principle: all interesting hypotheses. Clearly, this contains a dilemma: "all" means as many as possible, while "only interesting" means not too many. To cope with this dilemma, one may choose among the different GUHA procedures implemented in the LISpMiner system and, having selected one, fix its numerous parameters in various ways.
The LISpMiner system guides the user and makes the selection of parameters relatively easy, but somewhat laborious. LISpMiner cannot be automatized as much as e.g. B-Course is; however, the results of LISpMiner are much more illustrative and detailed.
Three remarks:
- GUHA procedures produce polyfactorial hypotheses, i.e. not only hypotheses relating one variable to another, but hypotheses expressing relations among single variables, pairs, triples, quadruples of variables, etc.
- GUHA offers hypotheses. The exploratory character implies that the hypotheses produced by the computer (numerous in number: typically tens or hundreds of hypotheses) are just supported by the data, not verified. You are assumed to use this offer as inspiration, and possibly select a few hypotheses for further testing.
- GUHA is not suitable for testing a single hypothesis: routine statistical packages are good for this.
4. The 4ft-Miner procedure generates hypotheses (or observational statements) on associations between complex Boolean formulae (attributes). These formulae are constructed from unary predicates (corresponding to the columns of the processed 0/1 data matrix) by the logical connectives ∧, ∨, ¬ (conjunction, disjunction, negation); a small sketch of such constructions follows below.
Examples of predicates are
TEMPERATURE : 38
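A minimal Python sketch (illustration only, not LISpMiner code) of how Boolean attributes can be built from 0/1 predicate columns with conjunction, disjunction and negation; the predicate names and values are invented.

    # Invented 0/1 predicate columns, one value per object (row).
    no_children = [1, 0, 0, 1]
    high_education = [0, 1, 0, 1]

    # Conjunction, disjunction and negation applied column-wise.
    conj = [a & b for a, b in zip(no_children, high_education)]   # no_children AND high_education
    disj = [a | b for a, b in zip(no_children, high_education)]   # no_children OR high_education
    neg = [1 - a for a in no_children]                            # NOT no_children

    print(conj, disj, neg)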
Models in GUHA logic are data matrices
Given an observational language L_n, consider all m × n matrices composed of 0s and 1s; this is the set ℳ of models M of L_n. Thus, in a model M, rows correspond to objects and columns correspond to predicates. Fix such a model M, for example the following 4 × 5 matrix:

Object  P_1  P_2  P_3  P_4  P_5
A        1    1    0    0    0
B        0    0    0    1    0
C        1    1    0    1    1
D        1    1    0    0    0

Associate to each cell a_ij, i = 1, ..., m (row) and j = 1, ..., n (column), a value True/False such that

v(P_j(x_i)) = True if a_ij = 1, and False if a_ij = 0.

For example, v(P_2(B)) = False, while v(P_4(B)) = True.
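The same model and the atomic valuations can be written down directly; this is only an illustrative Python sketch of the definition above, with the matrix copied from the example.

    # The example 4 x 5 model M: rows are objects, columns are predicates P_1..P_5.
    M = {"A": [1, 1, 0, 0, 0],
         "B": [0, 0, 0, 1, 0],
         "C": [1, 1, 0, 1, 1],
         "D": [1, 1, 0, 0, 0]}

    def v(j, x):
        # Truth value of the atomic formula P_j(x): True iff the cell a_xj equals 1.
        return M[x][j - 1] == 1

    print(v(2, "B"))   # False
    print(v(4, "B"))   # True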
We have now defined the truth values v(P(x)) ∈ {TRUE, FALSE} of all atomic formulae P(x). Next we extend the truth values to all formulae, that is, to the set VAL of all valuation functions v : L_n → {TRUE, FALSE}, one valuation for each model M ∈ ℳ.
Let v(α) and v(β) be defined. As in classical logic we set:

v(α)   v(β)   v(¬α)   v(α ∧ β)
TRUE   TRUE   FALSE   TRUE
TRUE   FALSE  FALSE   FALSE
FALSE  TRUE   TRUE    FALSE
FALSE  FALSE  TRUE    FALSE

Exercises
4. Write the truth tables of the connectives ∨, → and ↔.
5. v(∃x φ(x)) = TRUE in a model M iff v(φ(x)) = TRUE for some x (corresponding to a row of M!). How is v(∀x φ(x)) defined?
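A tiny Python sketch (illustration only) of how the classical quantifiers of Exercise 5 are evaluated over the rows of the example model; the formula here is just the atomic P_4(x).

    # Reusing the example model: P_4 is the fourth 0/1 column.
    M = {"A": [1, 1, 0, 0, 0],
         "B": [0, 0, 0, 1, 0],
         "C": [1, 1, 0, 1, 1],
         "D": [1, 1, 0, 0, 0]}

    P4 = {x: row[3] == 1 for x, row in M.items()}

    exists_P4 = any(P4.values())   # v of the existential statement: TRUE, some row has a 1
    forall_P4 = all(P4.values())   # v of the universal statement: FALSE, not every row has a 1

    print(exists_P4, forall_P4)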
Truth value of formulae with non-classical quantifiers
Given a model M, the value v(α) of any formula α of the language L_n can be calculated immediately. In some cases we consider several models M, N. Thus, to avoid confusion, we write v_M, v_N.
To define the truth value of formulae with non-classical quantifiers, recall the fourfold table of two formulae φ(x) and ψ(x):

         ψ(x)   ¬ψ(x)
φ(x)      a       b
¬φ(x)     c       d

where a + b + c + d = m (a is the number of objects satisfying both φ(x) and ψ(x), b the number satisfying φ(x) but not ψ(x), and so on).
Quantifier of simple association φ(x) ∼ ψ(x) [read: the coincidence of φ(x) and ψ(x) predominates over the difference]. Given a model M, we define
v(φ(x) ∼ ψ(x)) = TRUE iff ad > bc.
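A minimal Python sketch (not LISpMiner code) of computing the fourfold frequencies a, b, c, d from two 0/1 columns and evaluating the simple association quantifier; the column values are invented.

    # Invented 0/1 columns for the antecedent phi and the succedent psi.
    phi = [1, 1, 0, 1, 0, 1]
    psi = [1, 0, 0, 1, 1, 1]

    a = sum(1 for f, s in zip(phi, psi) if f == 1 and s == 1)
    b = sum(1 for f, s in zip(phi, psi) if f == 1 and s == 0)
    c = sum(1 for f, s in zip(phi, psi) if f == 0 and s == 1)
    d = sum(1 for f, s in zip(phi, psi) if f == 0 and s == 0)

    # Simple association: coincidence predominates over difference.
    simple_association = a * d > b * c
    print((a, b, c, d), simple_association)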
This quantifier is known in most data mining frameworks, not only in GUHA. In LISpMiner it is implemented under the name Simple deviation; the truth definition there is more general, ad > e^δ · bc (for a parameter δ ≥ 0).
Founded implication ⇒_{p,Base}, where Base ∈ N, 0 < p ≤ 1, p rational: φ(x) ⇒_{p,Base} ψ(x) [read: φ(x) implies ψ(x) with confidence p and support Base]. Given a model M, we define
v(φ(x) ⇒_{p,Base} ψ(x)) = TRUE iff a/(a+b) ≥ p and a ≥ Base.

Example
The Allergy matrix induces the fourfold frequency table

M          Apple   ¬Apple
Tomato       6       3
¬Tomato      6       0

Then v(Tomato ∼ Apple) = FALSE, while v(Tomato ⇒_{0.6,5} Apple) = TRUE; a short check follows below.
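Checking the example against the definitions: ad = 6 × 0 = 0 and bc = 3 × 6 = 18, so ad > bc fails and the simple association is FALSE; on the other hand a/(a+b) = 6/9 ≈ 0.67 ≥ 0.6 and a = 6 ≥ 5, so the founded implication with p = 0.6 and Base = 5 holds.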
Some theoretical observations about GUHA
Definition
Given an observational language L_n, the set ℳ of all models M, the set VAL of all valuations v and the set V = {TRUE, FALSE} of truth values, the system ⟨L_n, ℳ, VAL, V⟩ is an observational semantic system.
We say that a sentence α of L_n is a tautology (denoted ⊨ α) if v(α) = TRUE for all valuations v ∈ VAL (thus, in all models M). Moreover, α ∈ L_n is a logical consequence of a finite set A of sentences of L_n if, whenever v(β) = TRUE for all β ∈ A, then v(α) = TRUE, too.
In theoretical considerations we are interested in tautologies (true in all models), while in practical GUHA research we consider only the one model M induced by a given data matrix.
Definition
An observational semantic system ⟨L_n, ℳ, VAL, V⟩ is axiomatizable if the set of all tautologies is recursively enumerable.
We can prove:
Theorem
For any natural number n, the semantic system ⟨L_n, ℳ, VAL, V⟩ is axiomatizable.
Remark
Axiomatizable means that there is a finite set of schemas of tautologies, called axioms, and a finite set of rules of inference such that all tautologies (and only they) can be deduced from the axioms by means of the rules of inference, i.e. that all tautologies have a proof (denoted ⊢ α). Thus, axiomatizability means: ⊨ α iff ⊢ α.
For example, if α and α → β are axioms (or if they have a proof), then one can infer β by means of the rule of inference called Modus Ponens. Thus, β has a proof, too.
Here we are not interested in the general axiom system of GUHA. However, we will need rules of inference to understand how LISpMiner decreases the amount of output of practical GUHA procedures.
We give the definition of a (sound) rule of inference: it is of the form φ_1, ..., φ_n / ψ, meaning that from the premises φ_1, ..., φ_n one may infer the conclusion ψ (soundness meaning that whenever all premises are TRUE in a model, so is the conclusion).
An implicational quantifier ⇒ is called a-dependent if there are models M and N whose fourfold tables ⟨a, b, c, d⟩ and ⟨a', b, c, d⟩ differ only in the frequency a and v_M(φ(x) ⇒ ψ(x)) ≠ v_N(φ(x) ⇒ ψ(x)). Similarly we define a b-dependent implicational quantifier ⇒. If an a-dependent and b-dependent quantifier ⇒ satisfies the condition
a = b = 0 implies v(φ(x) ⇒ ψ(x)) = FALSE,
then ⇒ is called interesting.
Exercise 17
Show that basic implication quantifiers are interesting. What about the case p = 1?
We can also define (a+d)-dependent, (b+c)-dependent, etc. implicational quantifiers. For example, double implication quantifiers ⇔_p (with v(φ(x) ⇔_p ψ(x)) = TRUE iff a/(a+b+c) ≥ p) are a-dependent. Indeed, let 0 < p ≤ 1. Then
a'/(a'+b+c) < a/(a+b+c) (= p) iff a'a + a'b + a'c < a'a + ab + ac iff a' < a.
Thus, there are models M and N such that the corresponding frequency tables differ only with respect to the value a and
v_N(φ(x) ⇔_p ψ(x)) = FALSE while v_M(φ(x) ⇔_p ψ(x)) = TRUE.
In a similar manner we can prove that double implication quantifiers ⇔_p are (b+c)-dependent, too. A small numerical illustration of this a-dependence is sketched below.
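A small Python illustration (under the assumption stated above that the double implication quantifier is evaluated as a/(a+b+c) ≥ p) showing two fourfold tables that differ only in a and get different truth values.

    def double_implication(a, b, c, d, p):
        # v(phi <=>_p psi) = TRUE iff a / (a + b + c) >= p
        return a / (a + b + c) >= p

    p = 0.8
    # Table of model M: a = 8, b = 1, c = 1, d = 5  ->  8/10 = 0.80 >= p
    # Table of model N: a = 7, b = 1, c = 1, d = 5  ->  7/9  < 0.80
    print(double_implication(8, 1, 1, 5, p))   # True
    print(double_implication(7, 1, 1, 5, p))   # False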
We continue by having a closer look at the quantifiers implemented in LISpMiner and see what kinds of tasks can be performed with them. We use the National Indonesia Contraceptive Prevalence Survey data as a benchmark test set. However, we also often examine which of the above mentioned theoretical properties each individual quantifier satisfies.
TAM002
Data mining with GUHA, Part 5
Introduction to LISpMiner software
Esko Turunen
Tampere University of Technology
Instructions to download LISpMiner
In this chapter we show how to start working with LISpMiner. We also introduce the first GUHA task, in which we use the founded implication quantifier.
It is highly recommended that, to download LISpMiner, you print these slides and follow the instructions in them step by step. The situation is even better if you have two computers and two terminals available.
LISpMiner is not a commercial product and you might find some parts somewhat cumbersome. There are frequent updates, so the layout might not look today precisely as it looked when these slides were completed; patience!
LISpMiner is built on Microsoft products, so you need Microsoft Access (version 2002 or later, .mdb files).
You might also like to look at http://lispminer.vse.cz/, the homepage of LISpMiner.
LISpMiner
- is academic software based on the GUHA method,
- is developed and maintained at the Prague University of Economics by Jan Rauch and Milan Šimůnek,
- can be downloaded free of charge from http://lispminer.vse.cz/.
We show step by step how you can download LISpMiner to your own computer. Here we have LISpMiner version 12.01.00 from February 24, 2010.
Step 1. Create a new folder (e.g. in C:\) and name it e.g. LispMiner2010.
Step 2. Go to the web page http://lispminer.vse.cz/, click Download and download the following file to your folder LispMiner2010. (Later we will download more files from there; there are also some more instructions there.)
Your C:\LispMiner2010 folder should look like this now. Add a new folder there and name it e.g. My Data; you will put your data files there!
Then go to the address http://b-course.cs.helsinki.fi/obc/readymade.html -> (under Contraceptive methods ->) The data set itself (a .txt file) to download the Indonesia data. Put it into My Data.
Name it Indonesia1.txt. Remove spaces from the file names!
It should look like this now. Next we create a Microsoft Access database through the following steps:
Take a copy of LMEmpty, rename it LM_Indonesia1.mdb and save it to My Data.
Your My Data folder should now look like this. Next we create there one more data matrix; it will be called Indonesia1.mdb. Do the following steps:
Your data file Indonesia1.mdb should now look like
this. Close it and go to LispMiner2010
Open LMAdmin.exe
Put Indonesia1.mdb here ... and LM_Indonesia1.mdb here.
You should now have this on your screen!
Close it and go again to LispMiner2010
Now open LMDataSource.exe
Open from Database -> Data Matrices
Open from Database -> Attributes Three
Here is the distribution of ages
In this way we go through the whole list of Attributes. Depending on how many categories we divide each attribute into, we get the total number of Predicates (= columns composed of 0/1 or empty cells in the final data matrix) of our logical language. We should always keep their number as small as possible.
We are now ready to start the simplest data mining
tasks. Open 4ftTask.exe
We are interested in knowing which Attributes imply
Contraception Method, so we name the task like this.
Open the Succedent part: click it!
Since there are only 3 Predicates (No use, Short Term and Long Term), it is natural to take subsets with min- and max-length = 1.
Next open the Antecedent part: click here. There are many more choices.
Select all but Contraceptive
Age is typically an interval, say 16-18, 17-19, ..., 46-48; 16-19, 17-20, ..., 45-48; or 16-20, 17-21, ..., 44-48. A sketch of generating such interval categories is given below.
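A small Python sketch (illustration only, not LISpMiner's implementation) of enumerating such sliding age intervals of lengths 3, 4 and 5 over the range 16-48.

    low, high = 16, 48

    # For each interval length, list all consecutive sub-intervals of the age range.
    for length in (3, 4, 5):
        intervals = [(start, start + length - 1)
                     for start in range(low, high - length + 2)]
        print(length, intervals[:3], "...", intervals[-1])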
All the others contain only 2 or
4 Categories (=Predicates) so
let them be just 1-1 subsets.
With this division there are
34+15+4+4+4+2+4+2+2=71
columns in the data to be mined
Choosing the quantifier:
Double-click BASE and choose 5%.
Double-click FUI (founded implication) and choose p = 0.900.
Push Generate and see what happens.
Let us see the shortest hypotheses in detail: click here!
There are 95 + 2 = 97 women who have no children; 95 of them use no contraception. A quick check of the quantifier is given below.
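A quick, hedged check against the chosen founded implication quantifier: the confidence of this hypothesis is a/(a+b) = 95/97 ≈ 0.98 ≥ 0.900. Under the reading that BASE = 5% requires a to be at least 5% of the rows of the data matrix, a = 95 satisfies the support condition whenever the matrix has at most 95/0.05 = 1900 rows.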