Data mining

Does my data contain something interesting?

Esko Turunen, Tampere University of Technology

Esko Turunen http://www.vrtuosi.com

The aim of data mining is to answer the question: does my data contain something interesting?

In this chapter we introduce the following basic concepts:

- knowledge discovery in databases and data mining
- data, typical data mining tasks, and the outputs of data mining tasks
- GUHA and B-course: two dissimilar approaches to data mining

We also get acquainted with a real-life data set collected in Indonesia. We use this data to illustrate issues throughout the course.

Knowledge discovery in databases (KDD) was initially defined as the "nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [1]. A revised version of this definition states that "KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [2].

According to this definition, data mining is the step in the KDD process concerned with applying computational techniques (i.e., data mining algorithms implemented as computer programs) to actually find patterns in the data.

In a sense, data mining is the central step in the KDD process. The other steps are concerned with preparing data for data mining and with evaluating the discovered patterns, that is, the results of data mining.

I Data

The input to a data mining algorithm is most commonly a single flat table comprising a number of fields (columns) and records (rows). In general, each row represents an object and the columns represent properties of objects.

II Typical data mining tasks

- Prediction: one task is to predict the value of one field (the class) from the other fields. If the class is continuous, the task is called regression. If the class is discrete, the task is called classification.
- Clustering is concerned with grouping objects into classes of similar objects. A cluster is a collection of objects that are similar to each other and dissimilar to objects in other clusters.
- Association analysis is the discovery of association rules. Association rules specify correlations between frequent item sets.
- Data characterization summarizes the general characteristics or features of a target class of data; this class is typically collected by a database query.

- Outlier detection is concerned with finding data objects that do not fit the general behavior or model of the data; these are called outliers.
- Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.

III Outputs of data mining procedures can be

- Equations, e.g. TotalSpent = 189.5275 × Age + 7146 [$]
- Predictive rules, e.g. IF income is 100.000[$] and Gender = Male THEN Not a Big Spender
- Association rules, e.g. {Gender = Female, Age 52} → {Big Spender = Yes}
- Probabilistic models, e.g. Bayesian networks
- Distance and similarity measures, decision trees
- Many others

Our aim is to study in detail a particular data mining method called GUHA. Its principle was formulated in a paper by Hájek, Havel and Chytil as early as 1966 [3]. GUHA is the acronym for General Unary Hypotheses Automaton. Its computer implementation, LISp-Miner, was developed at the Prague University of Economics by Jan Rauch and Milan Šimůnek. LISp-Miner is freely downloadable from http://lispminer.vse.cz/.

The GUHA approach is suitable e.g. for association analysis, classification, clustering and outlier detection tasks.

We start by introducing a real-life data set that will serve as a benchmark throughout the course. To show how GUHA differs from a Bayesian approach, we briefly look at B-course; see http://b-course.cs.helsinki.fi/obc/.

The data we use is Tjen-Sien Lim's publicly available data set from the 1987 National Indonesia Contraceptive Prevalence Survey. These are the responses from interviews of m = 1473 married women who were not pregnant at the time of the interview. The challenge is to learn to predict a woman's contraceptive method from knowledge about her demographic and socio-economic characteristics. The 10 survey response variables and their types are

Variable                     Type
Age                          integer, 16–49
Education                    4 categories
Husband's education          4 categories
Number of children borne     integer, 0–15
Islamic                      binary (yes/no)
Working                      binary (yes/no)
Husband's occupation         4 categories
Standard of living           4 categories
Good media exposure          binary (yes/no)
Contraceptive method used    3 categories (None, Long-term, Short-term)

http://b-course.cs.helsinki.fi/obc/


Speculating about causalities

Remember that dependencies are not necessarily

causalities. However, the theory of inferred causation

makes it possible to speculate about the causalities

that have caused the dependencies of the model.

There are two different speculations (called naive

model and not so naive model) which are based on

different background assumptions.


How to read a naive causal model?

Naive causal models are easy to read, but they are built on an assumption that is often unrealistic, namely that there are no latent (unmeasured) variables in the domain that cause the dependencies between variables. A simple example of a situation where this assumption is violated can be set in Finland, where the cold winter covers the lakes and the sea with ice. Because of that, most drowning accidents happen in summertime. The warm summer also makes people eat much more ice cream than in wintertime. If you measure both the number of drowning accidents and the ice-cream consumption, but do not include a variable indicating the season, there is a clear dependency between ice-cream consumption and drowning. Evidently this dependency is not causal (ice cream does not cause drowning, or the other way round), but is due to the excluded variable summer (technically, this is called confounding). Naive causal models are built on the assumption that there is no confounding.

In naive causal models there may be two kinds of connections between variables: undirected arcs and directed arcs. Directed arcs denote causal influence from cause to effect, and undirected arcs denote causal influence whose direction cannot be automatically inferred from the data.

You can also read a naive causal model as representing the set of dependency models that share the same directed arcs. Unfortunately, this does not give you the freedom to re-orient the undirected arcs any way you want: you are free to re-orient them only as long as doing so does not create new V-structures in the graph. A V-structure is a system of three variables A, B, C such that there is a directed arc from A to B and a directed arc from C to B, but no arc (neither directed nor undirected) between A and C.
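The V-structure condition can be checked mechanically. The following is a minimal sketch, not B-course's own algorithm: the function names `v_structures` and `orientation_allowed` and the three-variable graph are assumptions made for illustration. It lists the V-structures of a partially directed graph and tests whether orienting one undirected arc would create a new one.

```python
from itertools import combinations

def v_structures(directed, undirected=()):
    """Return all V-structures (A, B, C): A -> B <- C with no arc between A and C."""
    directed = set(directed)
    # Adjacency ignores direction: any arc, directed or not, links its endpoints.
    adjacent = {frozenset(e) for e in directed} | {frozenset(e) for e in undirected}
    parents = {}
    for cause, effect in directed:
        parents.setdefault(effect, set()).add(cause)
    found = set()
    for child, causes in parents.items():
        for a, c in combinations(sorted(causes), 2):
            if frozenset((a, c)) not in adjacent:
                found.add((a, child, c))
    return found

def orientation_allowed(directed, undirected, new_arc):
    """True if orienting one undirected arc as new_arc creates no new V-structure."""
    before = v_structures(directed, undirected)
    remaining = [e for e in undirected if frozenset(e) != frozenset(new_arc)]
    after = v_structures(list(directed) + [new_arc], remaining)
    return after <= before

# Hypothetical graph: A -> B is directed, the arc between C and B is undirected.
directed = [("A", "B")]
undirected = [("C", "B")]
print(orientation_allowed(directed, undirected, ("C", "B")))  # would create A -> B <- C
print(orientation_allowed(directed, undirected, ("B", "C")))  # creates no V-structure
```

In this toy graph only the orientation from B to C is admissible, because A and C are not connected by any arc.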

How to read a causal graph produced by B-course?

Causal models are not difficult to read once you learn the difference between the different kinds of arcs. Arcs are drawn with two kinds of lines, solid and dashed. Solid lines indicate relations that can be determined from the data. Dashed lines are used when we know that there is a dependency, but we are not sure about its exact nature. The table below lists the different types of arcs that can be found in causal models.

Solid arc from A to B:
    A has a direct causal influence on B (direct meaning that the causal influence is not mediated by any other variable that is included in the study).

Dashed arc from A to B:
    There are two possibilities, but we do not know which holds. Either A is a cause of B, or there is a latent cause for both A and B.

Dashed line without any arrow heads between A and B:
    There is a dependency, but we do not know whether A causes B, B causes A, or there is a latent cause of them both (confounding).

[1] W. Frawley, G. Piatetsky-Shapiro and C. Matheus: Knowledge Discovery in Databases: An Overview. In: Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and W. Frawley (1991) 1–27. Cambridge, Mass.: AAAI Press / The MIT Press.

[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy: Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press (1996).

[3] I. Havel, M. Chytil and P. Hájek: The GUHA Method of Automatic Hypotheses Determination. Computing, Vol. 1 (1966) 293–308.

TAM002

Data mining with GUHA, Part 2

GUHA produces hypotheses

GUHA does not test hypotheses; instead, GUHA produces them

In this chapter we learn the outlines of GUHA, e.g. that

- GUHA is suitable for exploratory analysis of large data sets
- GUHA goes through 2 × 2 contingency tables and picks out the interesting ones
- in logical terms, GUHA is based on first-order monadic logic whose models are finite
- a fundamental reference, the GUHA book, can be downloaded freely from www.cs.cas.cz/hajek/guhabook/

1. GUHA (General Unary Hypotheses Automaton) is a method of automatic generation of hypotheses based on empirical data, and thus a method of data mining.

GUHA is one of the oldest methods of data mining. It was introduced by Hájek, Havel and Chytil in "The GUHA method of automatic hypotheses determination", published in Computing 1 (1966) 293–308.

GUHA is still being developed: dozens of new features have been added during the past ten years.

GUHA is a kind of automated exploratory data analysis. Instead of testing a given hypothesis against the data, a GUHA procedure systematically generates hypotheses supported by the data. With GUHA it is possible to find all interesting dependencies hidden in the data.

2. GUHA is suitable for exploratory analysis of large data sets.

The processed data form a rectangular matrix, where rows correspond to the objects belonging to the sample and each column corresponds to one investigated variable. A typical data matrix processed by GUHA has hundreds or thousands of rows and tens of columns.

The cells of the analyzed data matrix can contain any symbols; however, before a concrete data mining task the data must be categorized so that the cells contain only 0's or 1's (or are empty).

Exploratory analysis means that there is no single specific hypothesis to be tested against our data; rather, our aim is to get oriented in the domain of investigation, to analyze the behavior of chosen variables, the interactions among them, etc. Such an inquiry is not blind but is guided by some general (possibly vague) direction of research (some general problem).
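The categorization step mentioned above can be pictured as follows. This is only a sketch of one possible way to turn categorical and numeric columns into 0/1 columns; it is not how LISp-Miner preprocesses data. The helper names `dichotomize` and `bin_numeric`, the two records and the age cut points 30 and 40 are all invented for illustration.

```python
def dichotomize(rows, column, categories):
    """Turn one categorical column into one 0/1 column per category."""
    return [
        {f"{column}={cat}": int(row[column] == cat) for cat in categories}
        for row in rows
    ]

def bin_numeric(value, cut_points):
    """Return the index of the interval that value falls into."""
    return sum(value >= cut for cut in cut_points)

# Invented records; the variable names only loosely follow the survey data.
records = [
    {"Age": 24, "Education": "low"},
    {"Age": 45, "Education": "high"},
]
bands = ("young", "middle", "old")
for record in records:
    # Assumed cut points: below 30, 30 to 39, 40 and above.
    record["AgeBand"] = bands[bin_numeric(record["Age"], [30, 40])]

# Merge the 0/1 columns of both variables into one Boolean row per object.
boolean_rows = [
    {**age_cols, **edu_cols}
    for age_cols, edu_cols in zip(
        dichotomize(records, "AgeBand", bands),
        dichotomize(records, "Education", ["low", "high"]),
    )
]
print(boolean_rows[0])
```

Each resulting key, such as `AgeBand=young`, plays the role of one column of the 0/1 data matrix.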

3. GUHA systematically creates all hypotheses that are interesting from the point of view of a given general problem and on the basis of the given data.

This is the main principle: all interesting hypotheses. Clearly, this contains a dilemma: "all" means as many as possible, while "only interesting" means not too many. To cope with this dilemma, one may choose among the different GUHA procedures implemented in the LISp-Miner system and, having selected one, fix its numerous parameters in various ways.

The LISp-Miner system guides the user and makes the selection of parameters relatively easy, though somewhat laborious. LISp-Miner cannot be automated to the extent that e.g. B-course is; however, the results of LISp-Miner are much more illustrative and detailed.

Three remarks:

GUHA procedures generate polyfactorial hypotheses, i.e. not only hypotheses relating one variable to another, but hypotheses expressing relations among single variables, pairs, triples, quadruples of variables, etc.

GUHA offers hypotheses. The exploratory character implies that the hypotheses produced by the computer (typically tens or hundreds of them) are just supported by the data, not verified. You are expected to use this offer as inspiration, and possibly to select a few hypotheses for further testing.

GUHA is not suitable for testing a single hypothesis: routine statistical packages are good for this.

4. The 4ft-Miner procedure generates hypotheses (or observational statements) on associations between complex Boolean formulae (attributes). These formulae are constructed from unary predicates (corresponding to the columns of the processed 0/1 data matrix) by the logical connectives $\wedge$, $\vee$, $\neg$ (conjunction, disjunction, negation).

Examples of predicates are

TEMPERATURE : 38

Examples of formulae are

[TEMPERATURE : 38

An example of a hypothesis is

[TEMPERATURE : 38

More generally, hypotheses are of the form $\varphi \approx \psi$, where $\varphi$ and $\psi$ are Boolean attributes and $\approx$ is a generalized quantifier.

Notice that our terminology differs from the original GUHA approach: we want to keep things as simple as possible!

Given the 0/1 data matrix, each pair of Boolean attributes $\varphi$, $\psi$ determines its four-fold frequency table; the association of $\varphi$ with $\psi$ is defined by choosing an associational or generalized quantifier, i.e. a function assigning to each four-fold table either 1 (associated, or true in the data) or 0 (not associated, or false in the data) and satisfying some natural monotonicity conditions.

The four-fold table has the form:

$$
\begin{array}{c|cc|c}
 & \psi & \neg\psi & \\ \hline
\varphi & a & b & a + b = r \\
\neg\varphi & c & d & c + d = s \\ \hline
 & a + c = k & b + d = l & m
\end{array}
$$

where $a + b + c + d = m$ and

- $a$ is the number of objects satisfying both $\varphi$ and $\psi$,
- $b$ is the number of objects satisfying $\varphi$ but not $\psi$,
- $c$ is the number of objects not satisfying $\varphi$ but satisfying $\psi$,
- $d$ is the number of objects satisfying neither $\varphi$ nor $\psi$.
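As an illustration of the definition above, the four frequencies can be counted directly from two 0/1 columns. This is a minimal sketch; the function name `fourfold` and the two toy columns are assumptions, not part of any GUHA implementation.

```python
def fourfold(phi, psi):
    """Count the frequencies a, b, c, d for two equally long 0/1 columns."""
    if len(phi) != len(psi):
        raise ValueError("columns must have the same number of rows")
    a = sum(1 for p, q in zip(phi, psi) if p and q)          # phi and psi
    b = sum(1 for p, q in zip(phi, psi) if p and not q)      # phi, not psi
    c = sum(1 for p, q in zip(phi, psi) if not p and q)      # not phi, psi
    d = sum(1 for p, q in zip(phi, psi) if not p and not q)  # neither
    return a, b, c, d

# Invented toy columns for six objects.
phi = [1, 1, 1, 0, 0, 1]
psi = [1, 0, 1, 1, 0, 1]
a, b, c, d = fourfold(phi, psi)
# Marginal sums, named as in the table: r, s are row sums, k, l column sums.
r, s, k, l, m = a + b, c + d, a + c, b + d, a + b + c + d
print((a, b, c, d), (r, s, k, l, m))
```

Any generalized quantifier can then be written as a function of the tuple `(a, b, c, d)` alone.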

6. There are various types of generalized quantifiers formalizing various kinds of associations:

Implicational quantifiers formalize the association "many $\varphi$ are $\psi$"; they do not depend on the values $c$, $d$.

Comparative quantifiers formalize the association "$\varphi$ makes $\psi$ more likely than $\neg\varphi$ does".

Some quantifiers just express observations on the data; others serve as tests of statistical hypotheses on unknown probabilities.

Some quantifiers are symmetric ($\varphi \approx \psi$ implies $\psi \approx \varphi$), and some admit negation ($\varphi \approx \psi$ implies $\neg\varphi \approx \neg\psi$).

The 4ft-Miner procedure contains a dozen generalized quantifiers; a more recent procedure, Ac4ft-Miner, also offers action quantifiers. An advantage of GUHA is that new quantifiers can be defined and their properties analyzed in a well-established logical framework.

7. After preparing the original data matrix in 0/1 format, the user can start the 4ft-Miner procedure. The input of such a procedure consists of

(1) the 0/1 data matrix,
(2) parameters determining symbolic restrictions on the pairs $\varphi$, $\psi$ of Boolean attributes to be generated; $\varphi$ is called the antecedent and $\psi$ the succedent,
(3) the quantifier to be used and some parameters of the quantifier,
(4) some other settings: for example, the user has to declare the predicates that can occur in the antecedent and in the succedent, the use of logical connectives, the minimal and maximal length of the antecedent and the succedent, the way missing data are processed, etc.

8. 4ft-Miner produces all associations that satisfy the syntactic restrictions and are true in the data. The generation is not done blindly but uses various techniques that avoid exhaustive search. The associations found, together with various parameters, are not mechanically printed but saved in a solution file for further processing.
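To make the idea of "all associations satisfying the syntactic restrictions and true in the data" concrete, here is a deliberately naive sketch. Unlike the real procedure it does search exhaustively, it allows only conjunctions in the antecedent and a single predicate in the succedent, and every name in it (`fourfold`, `mine`, `quantifier`, the columns `P1` to `P3`, the thresholds) is an assumption for illustration rather than LISp-Miner's interface.

```python
from fractions import Fraction
from itertools import combinations

def fourfold(rows, antecedent, succedent):
    """Four-fold table of (conjunction of antecedent columns) against succedent."""
    a = b = c = d = 0
    for row in rows:
        phi = all(row[name] for name in antecedent)
        psi = bool(row[succedent])
        if phi and psi:
            a += 1
        elif phi:
            b += 1
        elif psi:
            c += 1
        else:
            d += 1
    return a, b, c, d

def mine(rows, quantifier, max_antecedent_len=2):
    """Exhaustively list (antecedent, succedent, table) accepted by the quantifier."""
    names = sorted(rows[0])
    results = []
    for size in range(1, max_antecedent_len + 1):
        for antecedent in combinations(names, size):
            for succedent in names:
                if succedent in antecedent:
                    continue  # antecedent and succedent share no predicate
                table = fourfold(rows, antecedent, succedent)
                if quantifier(*table):
                    results.append((antecedent, succedent, table))
    return results

# Invented 0/1 data matrix with four objects and three predicates.
rows = [
    {"P1": 1, "P2": 1, "P3": 0},
    {"P1": 1, "P2": 1, "P3": 1},
    {"P1": 0, "P2": 1, "P3": 1},
    {"P1": 1, "P2": 0, "P3": 0},
]

def quantifier(a, b, c, d):
    """Assumed test: at least 2 supporting objects and a/(a+b) at least 9/10."""
    return a >= 2 and a >= Fraction(9, 10) * (a + b)

for antecedent, succedent, table in mine(rows, quantifier):
    print(" & ".join(antecedent), "=>", succedent, table)
```

The list returned by `mine` plays the role of the solution file: it is kept as data so that it can be sorted and filtered afterwards.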

9. The 4ft-Result module for the interpretation of results enables the user to browse the associations found, sort them according to various criteria, select reasonably defined subsets and output concise information of various kinds.

There are, however, other procedures implemented in the LISp-Miner system that mine for other kinds of patterns, even for patterns more complex than association rules.

10. The GUHA method has deep logical and statistical foundations. GUHA is being further developed at the Institute of Computer Science of the Academy of Sciences of the Czech Republic (Petr Hájek and his group), at the Prague University of Economics (Jan Rauch and his group) and at Tampere University of Technology (Esko Turunen and his group).

Example

Assume we are observing children who have an allergic reaction to, say, tomato, apple, orange, cheese or milk. These observations are presented in the following table:

Child   Tomato  Apple  Orange  Cheese  Milk
Anna      1       1      0       1      1
Aina      1       1      1       0      0
Naima     1       1      1       1      1
Rauha     0       1      1       0      1
Kai       0       1      0       1      1
Kille     1       1      0       0      1
Lempi     0       1      1       1      1
Ville     1       0      0       0      0
Ulle      1       1      0       1      1
Dulle     1       0      1       0      0
Dof       1       0      1       0      1
Kinge     0       1      1       0      1
Laade     0       1      0       1      1
Koff      1       1      0       0      1
Olvi      0       1      1       1      1

Thus, we have observations such as "Child x is allergic to milk" and "Child y is allergic to cheese". We write these more briefly as Milk(x) and Cheese(y).

Milk(-), Cheese(-), Tomato(-), Orange(-) and Apple(-) are the unary predicates of our observational language, and x, y, z, ... are variables.

Expressions like Milk(x) or Cheese(y) are atomic (open) formulae. Formulae are combined by the logical connectives $\neg$ (not), $\wedge$ (and), $\vee$ (or). E.g. $\text{Milk}(x) \wedge \neg\text{Cheese}(x)$ means "Child x is allergic to milk and is not allergic to cheese".

However, instead of open formulae, we are more interested in closed formulae, e.g. "All children are allergic to milk", "Most children are not allergic to orange", "There is a child allergic to tomato", "In most cases, if a child is allergic to milk then she/he is allergic to cheese, too".

Anyone who has passed a basic course in logic knows that statements like "All children are allergic to milk" and "There is a child allergic to tomato" are expressible by $\forall x\,\text{Milk}(x)$ and $\exists x\,\text{Tomato}(x)$.

Unfortunately, the classical quantifiers $\forall$ and $\exists$ are rather useless in the real world. Much more valuable are generalized quantifiers like "in most cases", e.g. in the statement "In most cases, if a child is allergic to milk then she/he is allergic to cheese, too".

The GUHA method is a logical formalism of generalized quantifiers for data mining purposes.
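The contrast between classical and generalized quantifiers can be tried out on the Allergy table. The sketch below is illustrative only: the function names are hypothetical, and reading "in most cases" as "at least a share p = 0.75 of the objects satisfying the antecedent" is an assumption made here, not a definition taken from the text.

```python
COLUMNS = ("Tomato", "Apple", "Orange", "Cheese", "Milk")
ALLERGY = {
    "Anna":  (1, 1, 0, 1, 1), "Aina":  (1, 1, 1, 0, 0), "Naima": (1, 1, 1, 1, 1),
    "Rauha": (0, 1, 1, 0, 1), "Kai":   (0, 1, 0, 1, 1), "Kille": (1, 1, 0, 0, 1),
    "Lempi": (0, 1, 1, 1, 1), "Ville": (1, 0, 0, 0, 0), "Ulle":  (1, 1, 0, 1, 1),
    "Dulle": (1, 0, 1, 0, 0), "Dof":   (1, 0, 1, 0, 1), "Kinge": (0, 1, 1, 0, 1),
    "Laade": (0, 1, 0, 1, 1), "Koff":  (1, 1, 0, 0, 1), "Olvi":  (0, 1, 1, 1, 1),
}

def column(name):
    """Return the 0/1 column of one predicate, in the row order of the table."""
    j = COLUMNS.index(name)
    return [row[j] for row in ALLERGY.values()]

def for_all(col):
    """Classical universal quantifier: every object satisfies the predicate."""
    return all(col)

def exists(col):
    """Classical existential quantifier: some object satisfies the predicate."""
    return any(col)

def in_most_cases(antecedent, succedent, p=0.75):
    """Assumed reading of 'in most cases': share p of antecedent objects."""
    hits = [s for ante, s in zip(antecedent, succedent) if ante]
    return bool(hits) and sum(hits) >= p * len(hits)

print(for_all(column("Milk")))                           # all children allergic to milk?
print(exists(column("Tomato")))                          # some child allergic to tomato?
print(in_most_cases(column("Milk"), column("Cheese")))   # milk -> cheese in most cases?
print(in_most_cases(column("Cheese"), column("Milk")))   # cheese -> milk in most cases?
```

With this threshold the statement about milk and cheese holds in only one direction, which the classical quantifiers cannot express.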

TAM002

Data mining with GUHA Part 3

GUHA is a logic approach to data mining


Theoretical basis of GUHA

In this chapter we learn e.g.

- what an observational predicate language is
- what it means that a statement is true in a data set
- how to transform a data matrix into a Boolean data matrix
- that Boolean data matrices are models in which a statement is true or false
- that GUHA theory rests on a firm foundation that allows us to introduce new quantifiers and study their properties.

This chapter also contains some exercises.

GUHA is based on a first-order logic formalism. We start by setting

Definition (A simplified GUHA language)

An observational predicate language $L_n$ consists of

- unary predicates $P_1, \dots, P_n$ and an infinite sequence $x_1, \dots, x_m, \dots$ of variables,
- logical connectives $\neg$ (negation) and $\wedge$ (conjunction),
- nonstandard or generalized binary quantifiers $Q_1, \dots, Q_k$, usually denoted, however, by symbols such as $\sim$, $\Rightarrow$, $\Leftrightarrow$, etc., with some subscripts and superscripts.

Given an observational predicate language $L_n$, the atomic formulae are the symbols $P(x)$, where $P$ is a predicate and $x$ is a variable. Atomic formulae are formulae, and if $\varphi$, $\psi$ are formulae, then $\neg\varphi$, $\varphi \wedge \psi$ and $\varphi(x) \approx \psi(x)$ are formulae, where $\approx$ stands for any of the generalized quantifiers.

$\forall$, $\exists$, the classical quantifiers, are included in the original definition of the GUHA language, as are the truth constants $\top$, $\bot$. As in classical logic, the logical connectives $\vee$ (disjunction), $\to$ (implication) and $\leftrightarrow$ (equivalence) are definable by $\neg$ and $\wedge$. However, not all of this is implemented in LISp-Miner, so we mostly omit it.

Free and bound variables are defined as in classical predicate logic:

- in $P(x)$ and $\forall y\,[P(y) \to R(x)]$, $x$ is a free variable but $y$ is a bound variable,
- in $\varphi(x) \approx \psi(x)$, where $\approx$ is a generalized quantifier, $x$ is a bound variable.

Formulae containing free variables are open formulae (not of much interest!); closed formulae do not contain free variables. Closed formulae are also called sentences. Formulae that contain only the variable $x$ and are of the form $\varphi(x) \approx \psi(x)$ are in the scope of the GUHA method.

Exercises

1. Write in the GUHA language the natural language expressions (a) "Most x that are red are round, too." (b) "Almost all x that are red are round, and vice versa."

2. (a) Is Qx((x) (x)), where Q is a generalized quantifier, and (b) x((x) (x)) a well-formed observational formula?

3. We write (x) (x) for a generalized quantifier and (x) (x) for the logical connective . Discuss the difference between and .


Models in GUHA logic are data matrices

Given an observational language $L_n$, consider all $m \times n$ matrices composed of 0's and 1's; this is the set $\mathcal{M}$ of models $M$ of $L_n$. Thus, in a model $M$, rows correspond to variables (objects) and columns correspond to predicates. Fix such a model $M$. For example, a $4 \times 5$ matrix $M$:

Object  P_1  P_2  P_3  P_4  P_5
A        1    1    0    0    0
B        0    0    0    1    0
C        1    1    0    1    1
D        1    1    0    0    0

Associate to each cell $a_{ij}$, $i = 1, \dots, m$ (row) and $j = 1, \dots, n$ (column), a value True/False such that

$$
v(P_j(x_i)) =
\begin{cases}
\text{True} & \text{if } a_{ij} = 1, \\
\text{False} & \text{if } a_{ij} = 0.
\end{cases}
$$

For example, $v(P_2(B)) = \text{False}$, while $v(P_4(B)) = \text{True}$.

We have now defined the truth values $v(P(x)) \in \{\text{TRUE}, \text{FALSE}\}$ of all atomic formulae $P(x)$. Next we extend truth values to all formulae, that is, we define the set VAL of all valuation functions $v : \mathcal{M} \times L_n \to \{\text{TRUE}, \text{FALSE}\}$.

Let $v(\varphi)$ and $v(\psi)$ be defined. As in classical logic, we set:

$v(\varphi)$   $v(\psi)$   $v(\neg\varphi)$   $v(\varphi \wedge \psi)$
TRUE           TRUE        FALSE              TRUE
TRUE           FALSE       FALSE              FALSE
FALSE          TRUE        TRUE               FALSE
FALSE          FALSE       TRUE               FALSE
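The valuation of atomic formulae and its extension by negation and conjunction can be sketched in a few lines. The representation of the model as a dictionary and the helper names `v_atomic`, `v_not` and `v_and` are assumptions chosen for illustration; the data are the example matrix with objects A to D.

```python
# The example model: one row per object, one entry per predicate P_1..P_5.
M = {
    "A": (1, 1, 0, 0, 0),
    "B": (0, 0, 0, 1, 0),
    "C": (1, 1, 0, 1, 1),
    "D": (1, 1, 0, 0, 0),
}

def v_atomic(model, j, obj):
    """Truth value of P_j(obj); j is 1-based, as in the text."""
    return model[obj][j - 1] == 1

def v_not(value):
    """Truth value of a negated formula."""
    return not value

def v_and(left, right):
    """Truth value of a conjunction."""
    return left and right

print(v_atomic(M, 2, "B"), v_atomic(M, 4, "B"))
# Truth value of "not P_2(B) and P_4(B)".
print(v_and(v_not(v_atomic(M, 2, "B")), v_atomic(M, 4, "B")))
```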

Exercises

4. Write the truth tables of the connectives $\vee$, $\to$ and $\leftrightarrow$.

5. $v(\exists x\,\varphi(x)) = \text{TRUE}$ in a model $M$ iff $v(\varphi(x)) = \text{TRUE}$ for some $x$ (corresponding to a row of $M$!). How is $v(\forall x\,\varphi(x))$ defined?

Truth value of formulae with nonclassical quantifiers

Given a model $M$, the value $v(\varphi)$ of any formula $\varphi$ of the language $L_n$ can be calculated immediately. In some cases we consider several models $M$, $N$. Thus, to avoid confusion, we write $v_M$, $v_N$.

To define the truth value of formulae with nonclassical quantifiers, recall the four-fold table:

$$
\begin{array}{c|cc}
 & \psi & \neg\psi \\ \hline
\varphi & a & b \\
\neg\varphi & c & d
\end{array}
$$

where $a + b + c + d = m$.

Quantifier of simple association $\varphi(x) \sim \psi(x)$ [read: coincidence of $\varphi(x)$ and $\psi(x)$ predominates over difference]. Given a model $M$, we define

$$v(\varphi(x) \sim \psi(x)) = \text{TRUE} \iff ad > bc.$$

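This quantifier is known in most data mining frameworks, not only in GUHA. In LISp-Miner it is implemented under the name Simple deviation, with a more general truth definition: $ad > e^{\delta}\, bc$, where the parameter $\delta$ is given by the user.

Quantifiers of founded $p$-implication, or basic implication, $\Rightarrow_{p,\text{Base}}$, where $\text{Base} \in \mathbb{N}$, $0 < p \le 1$, $p$ rational: $\varphi(x) \Rightarrow_{p,\text{Base}} \psi(x)$ [read: $\varphi(x)$ implies $\psi(x)$ with confidence $p$ and support Base]. Given a model $M$, we define

$$v(\varphi(x) \Rightarrow_{p,\text{Base}} \psi(x)) = \text{TRUE} \iff \frac{a}{a+b} \ge p \ \text{ and } \ a \ge \text{Base}.$$

Example

The Allergy matrix induces a four-fold frequency table

M          Apple   ¬Apple
Tomato       6        3
¬Tomato      6        0

Hence $v(\text{Tomato} \sim \text{Apple}) = \text{FALSE}$ (since $ad = 0 < 18 = bc$), while $v(\text{Tomato} \Rightarrow_{0.6,5} \text{Apple}) = \text{TRUE}$ (since $6/9 \ge 0.6$ and $6 \ge 5$).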
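Both quantifiers are simple functions of the four-fold table, so the Tomato/Apple example can be re-checked directly. This is a sketch with hypothetical function names; exact rational arithmetic via `fractions.Fraction` is a design choice made here because a floating-point threshold such as `0.7 * 10` can come out slightly above 7 and wrongly reject a borderline table.

```python
from fractions import Fraction

def simple_association(a, b, c, d):
    """Simple association: TRUE iff ad > bc."""
    return a * d > b * c

def founded_implication(a, b, c, d, p, base):
    """Founded implication: TRUE iff a/(a+b) >= p and a >= base.

    p may be a Fraction or a decimal string such as "0.6"; the comparison
    a >= p*(a+b) is done exactly and avoids dividing by zero.
    """
    p = Fraction(str(p))
    return a + b > 0 and a >= base and a >= p * (a + b)

# Four-fold table of Tomato against Apple, taken from the example above.
tomato_apple = (6, 3, 6, 0)
print(simple_association(*tomato_apple))
print(founded_implication(*tomato_apple, p="0.6", base=5))
```

Note that `c` and `d` are accepted but unused by `founded_implication`, reflecting that implicational quantifiers do not depend on them.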

Some theoretical observations about GUHA

Definition

Given an observational language $L_n$, the set $\mathcal{M}$ of all models $M$, the set VAL of all valuations $v$ and the set $V = \{\text{TRUE}, \text{FALSE}\}$ of truth values, the system $\langle L_n, \mathcal{M}, \text{VAL}, V \rangle$ is an observational semantic system.

We say that a sentence $\varphi \in L_n$ is a tautology (notation: $\models \varphi$) if $v(\varphi) = \text{TRUE}$ for all valuations $v \in \text{VAL}$ (thus, in all models $M$). Moreover, $\varphi \in L_n$ is a logical consequence of a finite set $A$ of sentences of $L_n$ if, whenever $v(\psi) = \text{TRUE}$ for all $\psi \in A$, then $v(\varphi) = \text{TRUE}$, too.

In theoretical considerations we are interested in tautologies (true in all models), while in practical GUHA research we consider only one model $M$, induced by a given data matrix.

Definition

An observational semantic system $\langle L_n, \mathcal{M}, \text{VAL}, V \rangle$ is axiomatizable if the set of all tautologies is recursively enumerable.

We can prove

Theorem

For any natural number $n$, the semantic system $\langle L_n, \mathcal{M}, \text{VAL}, V \rangle$ is axiomatizable.

Remark

Axiomatizable means that there is a finite set of schemas of tautologies, called axioms, and a finite set of rules of inference such that all tautologies (and only they) can be deduced from the axioms by means of the rules of inference, i.e. all tautologies have a proof (denoted by $\vdash \varphi$). Thus, axiomatizability means: $\models \varphi$ iff $\vdash \varphi$.

For example, if $\varphi$ and $\varphi \to \psi$ are axioms (or if they have a proof), then one can infer $\psi$ by means of the rule of inference called Modus Ponens. Thus $\psi$ has a proof, too.

Here we are not interested in the general axiom system of GUHA. However, we will need rules of inference to understand how LISp-Miner decreases the number of outcomes of practical GUHA procedures.

We give the definition of a (sound) rule of inference. It is of the form

$$\frac{\varphi_1, \dots, \varphi_n}{\psi}$$

such that, whenever $v(\varphi_i) = \text{TRUE}$ for all $i = 1, \dots, n$, then $v(\psi) = \text{TRUE}$, too. $\varphi_1, \dots, \varphi_n$ are the premises, and $\psi$ is the conclusion.

Rules of inference are also called deduction rules.

Exercises

6. Prove or disprove

(a) [= [ ( ] [( ) ( )],

(b) [= ( ) [( )] and

(c) [= ( ),

where , and are wellformed formulae.

7. Is

(a) a logical consequences of a set , ,

(b) a logical consequences of a set , ,

(c) a logical consequences of a set , of

wellformed formulae?

8. Prove that Modus Ponens is a sound rule of inference.


Exercises

In exercises 9–11, consider the Allergy matrix.

9. Write down all sentences $P_i(x) \Rightarrow_{0.7,7} P_j(x)$ such that $v(P_i(x) \Rightarrow_{0.7,7} P_j(x)) = \text{TRUE}$ $(i \ne j)$.

10. Does $v(\text{Apple}(x) \sim \text{Orange}(x)) = \text{TRUE}$ hold, where $\sim$ is simple association?

11. Is $v(\text{Tomato}(x) \sim \text{Cheese}(x)) = \text{TRUE}$ true ($\sim$ is simple association)?

TAM002

Data mining with GUHA Part 4

More about foundations of GUHA


Theoretical foundations of GUHA, continued

We study desirable properties that the nonstandard quantifiers in GUHA should have, e.g.

- if a data set contains evidence for an interdependence of $\varphi$ and $\psi$, i.e. $\varphi \approx \psi$ is true in this data for a generalized quantifier $\approx$, then $\varphi \approx \psi$ should remain true in all other data where there is even more evidence. Such quantifiers are called associational or implicational.
- the truth value of $\varphi \approx \psi$ should really depend on the values in the four-fold table: by altering these values, $\varphi \approx \psi$ becomes either true or false.

We also introduce sound rules of inference to minimize computations in practical data mining tasks. This chapter contains exercises.

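Definition (Associational quantifiers)

Let $\varphi(x)$, $\psi(x)$ be two fixed formulae of a language $L_n$ such that $x$ is the only free variable in both of them and they do not have common predicates. Let $M$ and $N$ be two models. Then we have the following two four-fold tables:

$$
M:\ \begin{array}{c|cc}
 & \psi & \neg\psi \\ \hline
\varphi & a_1 & b_1 \\
\neg\varphi & c_1 & d_1
\end{array}
\qquad
N:\ \begin{array}{c|cc}
 & \psi & \neg\psi \\ \hline
\varphi & a_2 & b_2 \\
\neg\varphi & c_2 & d_2
\end{array}
$$

We define: $N$ is associationally better than $M$ if $a_2 \ge a_1$, $d_2 \ge d_1$, $b_2 \le b_1$ and $c_2 \le c_1$. Moreover, a binary quantifier $\approx$ is associational if, for all formulae $\varphi(x)$, $\psi(x)$ and all models $M$, $N$: if $v_M(\varphi(x) \approx \psi(x)) = \text{TRUE}$ and $N$ is associationally better than $M$, then $v_N(\varphi(x) \approx \psi(x)) = \text{TRUE}$.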

Obviously, the quantifier of simple association is associational: this follows from the fact that, under the given circumstances,

$$a_2 d_2 \ge a_1 d_1 > b_1 c_1 \ge b_2 c_2.$$

Quantifiers of basic implication are associational, too: if $a_2 \ge a_1 \ge \text{Base}$ and $b_2 \le b_1$, then $a_2 b_1 \ge a_1 b_2$, therefore $a_2 a_1 + a_2 b_1 \ge a_2 a_1 + a_1 b_2$, and finally

$$\frac{a_2}{a_2 + b_2} \ge \frac{a_1}{a_1 + b_1} \ge p.$$
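The two claims above can also be sanity-checked by brute force on small tables. The sketch below is not a proof and only covers tables with entries below a small limit; the function names, the limit of 4 and the parameters p = 2/3, Base = 1 are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

def simple_association(a, b, c, d):
    return a * d > b * c

def founded_implication(a, b, c, d, p=Fraction(2, 3), base=1):
    return a + b > 0 and a >= base and a >= p * (a + b)

def associationally_better(n, m):
    """True if table n = (a2, b2, c2, d2) is associationally better than m."""
    (a2, b2, c2, d2), (a1, b1, c1, d1) = n, m
    return a2 >= a1 and d2 >= d1 and b2 <= b1 and c2 <= c1

def counterexamples(quantifier, limit=4):
    """Pairs (m, n): quantifier true on m, n associationally better, false on n."""
    tables = list(product(range(limit), repeat=4))
    return [
        (m, n)
        for m in tables if quantifier(*m)
        for n in tables
        if associationally_better(n, m) and not quantifier(*n)
    ]

print(counterexamples(simple_association))   # expected to be empty
print(counterexamples(founded_implication))  # expected to be empty
```

An empty list means that no violation of the associational property exists among the tables searched, in line with the argument above.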

Instead of being merely associational, a nonclassical quantifier may satisfy a stronger condition. We define

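Definition (Implicational quantifiers)

Let $\varphi(x)$, $\psi(x)$ be two fixed formulae of a language $L_n$ such that $x$ is the only free variable in both of them and they do not have common predicates. Let $M$ and $N$ be two models. Then we have the following two four-fold tables:

$$
M:\ \begin{array}{c|cc}
 & \psi & \neg\psi \\ \hline
\varphi & a_1 & b_1 \\
\neg\varphi & c_1 & d_1
\end{array}
\qquad
N:\ \begin{array}{c|cc}
 & \psi & \neg\psi \\ \hline
\varphi & a_2 & b_2 \\
\neg\varphi & c_2 & d_2
\end{array}
$$

We define: $N$ is implicationally better than $M$ if $a_2 \ge a_1$ and $b_2 \le b_1$. Moreover, a binary quantifier $\approx$ is implicational if, for all formulae $\varphi(x)$, $\psi(x)$ and all models $M$, $N$: if $v_M(\varphi(x) \approx \psi(x)) = \text{TRUE}$ and $N$ is implicationally better than $M$, then $v_N(\varphi(x) \approx \psi(x)) = \text{TRUE}$.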

Basic implication quantifiers are implicational; however, simple association quantifiers are not implicational. To see this, consider the following two frequency tables:

$$
M:\ \begin{array}{cc}
a_1 = 1 & b_1 = 1 \\
c_1 = 1 & d_1 = 2
\end{array}
\qquad
N:\ \begin{array}{cc}
a_2 = 1 & b_2 = 1 \\
c_2 = 2 & d_2 = 1
\end{array}
$$

Clearly $N$ is implicationally better than $M$, and $a_1 d_1 > c_1 b_1$; thus $v_M(\varphi(x) \sim \psi(x)) = \text{TRUE}$. However, $a_2 d_2 < c_2 b_2$; thus $v_N(\varphi(x) \sim \psi(x)) = \text{FALSE}$.

Remark

Let $\approx$ be an implicational quantifier. Then $\approx$ is associational.

Proof.

Let $\approx$ be implicational and $v_M(\varphi(x) \approx \psi(x)) = \text{TRUE}$. If $N$ is associationally better than $M$, then $N$ is clearly also implicationally better than $M$, so $v_N(\varphi(x) \approx \psi(x)) = \text{TRUE}$. Therefore $\approx$ is associational, too.
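The counterexample and the remark can be replayed numerically. This is a small illustrative sketch with hypothetical function names; tables are written as tuples (a, b, c, d), and the final check only samples tables with entries below 3.

```python
from itertools import product

def simple_association(a, b, c, d):
    return a * d > b * c

def implicationally_better(n, m):
    """n = (a2, b2, c2, d2) is implicationally better than m if a2 >= a1, b2 <= b1."""
    return n[0] >= m[0] and n[1] <= m[1]

def associationally_better(n, m):
    return n[0] >= m[0] and n[3] >= m[3] and n[1] <= m[1] and n[2] <= m[2]

# The two frequency tables of the counterexample, as (a, b, c, d).
M = (1, 1, 1, 2)
N = (1, 1, 2, 1)
print(implicationally_better(N, M))                     # N is implicationally better
print(simple_association(*M), simple_association(*N))   # yet the truth value is lost

# The step used in the proof of the remark: associationally better
# implies implicationally better (checked on all tables with entries 0..2).
tables = list(product(range(3), repeat=4))
print(all(
    implicationally_better(n, m)
    for m in tables for n in tables
    if associationally_better(n, m)
))
```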

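Theorem

Let $\varphi(x)$, $\psi(x)$, $\chi(x)$ be formulae, and let $\approx$ be an implicational quantifier. Then

$$\frac{\varphi \approx \psi}{\varphi \approx [\psi \vee \chi]}$$

is a sound rule of inference.

Proof.

Let $v_M(\varphi(x) \approx \psi(x)) = \text{TRUE}$, and let the four-fold tables of $\varphi \approx \psi$ and of $\varphi \approx [\psi \vee \chi]$ in $M$ be

$$
\varphi \approx \psi:\ \begin{array}{cc}
a_1 & b_1 \\
c_1 & d_1
\end{array}
\qquad
\varphi \approx [\psi \vee \chi]:\ \begin{array}{cc}
a_2 & b_2 \\
c_2 & d_2
\end{array}
$$

We reason that

$$a_1 = |\{x \mid v_M(\varphi(x)) = v_M(\psi(x)) = \text{TRUE}\}| \le |\{x \mid v_M(\psi(x) \vee \chi(x)) = v_M(\varphi(x)) = \text{TRUE}\}| = a_2$$

and

$$
\begin{aligned}
b_1 &= |\{x \mid v_M(\varphi(x)) = v_M(\neg\psi(x)) = \text{TRUE}\}| \\
&\ge |\{x \mid v_M(\neg\psi(x) \wedge \neg\chi(x)) = v_M(\varphi(x)) = \text{TRUE}\}| \\
&= |\{x \mid v_M(\neg(\psi(x) \vee \chi(x))) = v_M(\varphi(x)) = \text{TRUE}\}| = b_2.
\end{aligned}
$$

Obviously, since $\approx$ is implicational, $v_M(\varphi \approx [\psi \vee \chi]) = \text{TRUE}$.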
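The two inequalities used in the proof can be checked exhaustively on small data. The sketch below assumes the rule is the one stated in the proof (widening the succedent by a disjunct); it enumerates every 0/1 data matrix with three objects and three columns and confirms that the count a never decreases and the count b never increases. All names are hypothetical.

```python
from itertools import product

def ab(phi, psi):
    """Return the frequencies a and b of the four-fold table of phi against psi."""
    a = sum(1 for p, q in zip(phi, psi) if p and q)
    b = sum(1 for p, q in zip(phi, psi) if p and not q)
    return a, b

def rule_holds_everywhere(n_objects=3):
    """Check a2 >= a1 and b2 <= b1 on every data matrix with n_objects rows."""
    for rows in product(product((0, 1), repeat=3), repeat=n_objects):
        phi = [row[0] for row in rows]
        psi = [row[1] for row in rows]
        chi = [row[2] for row in rows]
        psi_or_chi = [q or x for q, x in zip(psi, chi)]
        a1, b1 = ab(phi, psi)
        a2, b2 = ab(phi, psi_or_chi)
        if not (a2 >= a1 and b2 <= b1):
            return False
    return True

print(rule_holds_everywhere())
```

Because an implicational quantifier only looks at a and b in this monotone way, these two inequalities are all that soundness of the rule requires.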

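Theorem (Reduction Theorem 1.)

Let $\varphi(x)$, $\psi(x)$, $\chi(x)$ be formulae, and let $\approx$ be an implicational quantifier. Then

$$\frac{[\varphi \wedge \neg\chi] \approx \psi}{\varphi \approx [\psi \vee \chi]}$$

is a sound rule of inference.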

Theorem (Reduction Theorem 2.)

Let $\varphi(x)$, $\psi(x)$, $\chi(x)$ be formulae, and let $\sim$ be the simple association quantifier. Then

are sound rules of inference.

We have introduced sound rules of inference mainly to minimize computations in practical data mining tasks. For example, if $\approx$ is an implicational quantifier and $\varphi \approx \psi$ is true in a given model $M$, then $\varphi \approx [\psi \vee \chi]$ is true, too. Due to the solid logical foundations of GUHA, there are several means to reduce the amount of computation. However, we do not study them here.

We introduce

Basic equivalence quantifiers $\equiv_p$, where $0 < p \le 1$. In any model M, $v(\varphi(x) \equiv_p \psi(x)) = \text{TRUE}$ iff $(a + d) \ge p(a + b + c + d)$, except for the case $a + d = 0$, $b + c = 0$; then $v(\varphi(x) \equiv_p \psi(x)) = \text{FALSE}$.

Basic double implication quantifiers $\Leftrightarrow_p$, where $0 < p \le 1$. In any model M, $v(\varphi(x) \Leftrightarrow_p \psi(x)) = \text{TRUE}$ iff $a \ge p(a + b + c)$, except for the case $a + b + c = 0$, $d = 0$; then $v(\varphi(x) \Leftrightarrow_p \psi(x)) = \text{FALSE}$.
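The two definitions translate directly into code. The following Python sketch is an illustration only; the function names `basic_equivalence` and `double_implication` are hypothetical, and the exceptional cases are implemented exactly as stated above.

```python
def basic_equivalence(a: int, b: int, c: int, d: int, p: float) -> bool:
    """TRUE iff (a + d) >= p*(a + b + c + d);
    FALSE in the exceptional case a + d = 0, b + c = 0."""
    if a + d == 0 and b + c == 0:
        return False
    return (a + d) >= p * (a + b + c + d)


def double_implication(a: int, b: int, c: int, d: int, p: float) -> bool:
    """TRUE iff a >= p*(a + b + c);
    FALSE in the exceptional case a + b + c = 0, d = 0."""
    if a + b + c == 0 and d == 0:
        return False
    return a >= p * (a + b + c)


print(basic_equivalence(45, 5, 5, 45, p=0.8))     # True:  90 >= 80
print(basic_equivalence(10, 40, 40, 10, p=0.8))   # False: 20 <  80
print(basic_equivalence(0, 0, 0, 0, p=0.8))       # False: exceptional case

print(double_implication(30, 5, 5, 60, p=0.7))    # True:  30 >= 28
print(double_implication(30, 20, 20, 60, p=0.7))  # False: 30 <  49
print(double_implication(0, 0, 0, 0, p=0.7))      # False: exceptional case
```

Note that the cell d does not enter the inequality of the double implication quantifier; it only appears in the exceptional case.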

Exercises

12. Prove Reduction Theorem 1.

13. Prove Reduction Theorem 2.

14. Prove that Reduction Theorem 2 does not hold for basic implication quantifiers.

15. Prove that basic equivalence quantifiers are associational.

16. Prove that double implication quantifiers are associational.

It is unquestionable that any valuable quantifier should be (at least) implicational (associational). However, there are other obvious conditions, too. We say that an implicational quantifier $\Rightarrow^*$ is a-dependent if there are models M and N with frequency tables

$$
M:\ \begin{array}{cc} a & b \\ c & d \end{array}
\qquad\text{and}\qquad
N:\ \begin{array}{cc} a' & b \\ c & d \end{array}
$$

where $v_M(\varphi(x) \Rightarrow^* \psi(x)) \ne v_N(\varphi(x) \Rightarrow^* \psi(x))$. Similarly we define a b-dependent implicational quantifier $\Rightarrow^*$. If an a-dependent and b-dependent quantifier $\Rightarrow^*$ satisfies the condition

$a = b = 0$ implies $v(\varphi(x) \Rightarrow^* \psi(x)) = \text{FALSE}$,

then $\Rightarrow^*$ is called interesting.

Exercise 17

Show that basic implication quantifiers are interesting. What about the case p = 1?

We can also define (a+d)-dependent, (b+c)-dependent, etc. implicational quantifiers. For example, double implication quantifiers $\Leftrightarrow_p$ are a-dependent. Indeed, let $0 < p \le 1$ and $b + c > 0$. Then

$$
\frac{a'}{a' + b + c} < \frac{a}{a + b + c}\ (= p)
\iff a'a + a'b + a'c < a'a + ab + ac
\iff a' < a .
$$

Thus, there are models M and N such that the corresponding frequency tables differ only with respect to the value a and $v_N(\varphi(x) \Leftrightarrow_p \psi(x)) = \text{FALSE}$ while $v_M(\varphi(x) \Leftrightarrow_p \psi(x)) = \text{TRUE}$.

In a similar manner we prove that double implication quantifiers $\Leftrightarrow_p$ are (b+c)-dependent, too.
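A concrete pair of tables makes the a-dependence argument tangible. The sketch below is illustrative: the names `ratio` and `double_implication` are hypothetical, the cell values are chosen freely, and exact fractions are used so that the threshold case $a/(a+b+c) = p$ is not affected by rounding. It assumes $a + b + c > 0$, so the exceptional case of the definition does not arise.

```python
from fractions import Fraction


def ratio(a: int, b: int, c: int) -> Fraction:
    """a / (a + b + c), the quantity that the double implication quantifier compares with p."""
    return Fraction(a, a + b + c)


def double_implication(a: int, b: int, c: int, p: Fraction) -> bool:
    """TRUE iff a >= p*(a + b + c); assumes a + b + c > 0."""
    return ratio(a, b, c) >= p


b, c = 1, 1
a = 2
p = ratio(a, b, c)   # p = 1/2, so model M sits exactly on the threshold
a_prime = 1          # model N differs from M only in this cell

print(double_implication(a, b, c, p))        # True  (model M)
print(double_implication(a_prime, b, c, p))  # False (model N)

# With b + c > 0 the ratio is strictly increasing in a.
print(all(ratio(k, b, c) < ratio(k + 1, b, c) for k in range(50)))  # True
```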

We continue by having a closer look at the quantifiers implemented in LISpMiner and see what kinds of tasks can be performed. We use the National Indonesia Contraceptive Prevalence Survey data as a benchmark data set. We also often study which of the above-mentioned theoretical properties each individual quantifier meets.

TAM002

Data mining with GUHA Part 5

Introduction to LISpMiner software

Esko Turunen

Tampere University of Technology

Instructions to download LISpMiner

In this chapter we show how to start working with LISpMiner. We also introduce the first GUHA task, where we use the founded implication quantifier.

To download LISpMiner, it is highly recommended that you print these slides and follow the instructions step by step. The situation is even better if you have two computers and two terminals available.

LISpMiner is not a commercial product and you might find some parts somewhat cumbersome. There are often updates, so the layout might not look today precisely as it looked when these slides were completed; patience!

LISpMiner is built on Microsoft products, so you need Microsoft Access 2002 or later (.mdb file).

You might also like to look at http://lispminer.vse.cz/, the homepage of LISpMiner.

LISpMiner

is academic software based on the GUHA method,

is developed and maintained at Prague University of Economics by Jan Rauch and Milan Šimůnek,

can be downloaded free from http://lispminer.vse.cz/.

We show step by step how you can download LISpMiner to your own computer. Here we have LISpMiner version 12.01.00 from February 24, 2010.

Step 1. Create a new folder (e.g. in C:\) and name it e.g. LispMiner2010.

Step 2. Go to the web page http://lispminer.vse.cz/, click Download and download the following file to your folder LispMiner2010. (Later we will download more files for this course; there are also some more instructions there.)

Your C:\LispMiner2010 should look like this now.

Add a new folder there and name it e.g. My Data; you will put your data files there!

Then go to the address http://b-course.cs.helsinki.fi/obc/readymade.html

-> (under Contraseptive methods ->) The data set itself (a .txt-file)

to download the Indonesia data. Put it in My Data.

Name it Indonesia1.txt

Delete empty space from the names!

It should look like this now.

Next we create a Microsoft Access database through the following steps:

Take a copy of LMEmpty, rename it LM_Indonesia1.mdb and save it to My Data
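The folder and copy steps above can also be scripted. The following Python sketch is only an illustration: the function name `prepare_workspace` is hypothetical, and it assumes that the empty database shipped with LISpMiner is a file called `LMEmpty.mdb` located directly in the LISpMiner folder (the file extension and location are assumptions). The demonstration runs in a temporary directory with a dummy file, so it does not touch a real installation.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_workspace(lm_dir: Path) -> Path:
    """Create the 'My Data' folder and copy LMEmpty.mdb into it as LM_Indonesia1.mdb.

    Assumes lm_dir already contains the downloaded LISpMiner files,
    including a file named LMEmpty.mdb (an assumption of this sketch).
    """
    data_dir = lm_dir / "My Data"
    data_dir.mkdir(parents=True, exist_ok=True)
    target = data_dir / "LM_Indonesia1.mdb"
    if not target.exists():  # never overwrite an existing copy
        shutil.copyfile(lm_dir / "LMEmpty.mdb", target)
    return target


# Demonstration in a throw-away directory with a dummy LMEmpty.mdb.
# On a real installation one would pass Path(r"C:\LispMiner2010") instead.
with tempfile.TemporaryDirectory() as tmp:
    lm_dir = Path(tmp)
    (lm_dir / "LMEmpty.mdb").write_bytes(b"")
    created = prepare_workspace(lm_dir)
    print(created.name, created.exists())  # LM_Indonesia1.mdb True
```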

Your My Data folder should now look like this. Next we create one more data matrix there. It will be called Indonesia1.mdb. Do the following steps:

Your data file Indonesia1.mdb should now look like

this. Close it and go to LispMiner2010


Open LMAdmin.exe


Put here Indonesia1.mdb

...

and here

LM_Indonesia1.mdb


You should now have this on your screen!

Close it and go again to LispMiner2010


Open now LMDataSource.exe


Open from Database ->Data Matrices


Open from Database ->Attributes Three


Here is the distribution of ages


In this way we go through the whole list of Attributes. Depending on how many categories we divide each attribute into, we get the total number of Predicates (= columns composed of 0/1/- cells in the final data matrix) of our logic language. We should always keep them as few as possible.
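How categories become predicate columns can be sketched in a few lines. The Python code below is an illustration, not LISpMiner's implementation: the function name `to_predicates` is hypothetical, the sample values are made up, and treating a missing value as a "-" cell is an assumption about the meaning of the third cell type.

```python
def to_predicates(column, categories):
    """Turn one attribute into 0/1 predicate columns, one per category.

    A missing value (None) gives '-' in every predicate column of the
    attribute (an assumption of this sketch).
    """
    return {
        category: ["-" if value is None else int(value == category) for value in column]
        for category in categories
    }


# Made-up sample rows for the contraception attribute.
method = ["No use", "Short Term", "Long Term", "No use", None]
predicates = to_predicates(method, ["No use", "Short Term", "Long Term"])

for name, cells in predicates.items():
    print(name, cells)
# No use [1, 0, 0, 1, '-']
# Short Term [0, 1, 0, 0, '-']
# Long Term [0, 0, 1, 0, '-']

print(len(predicates))  # 3 predicates for this attribute
```

The total number of predicates is the sum of the numbers of categories over all attributes, which is why coarser categories give a smaller data matrix.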


We are now ready to start the simplest data mining

tasks. Open 4ftTask.exe


We are interested in knowing which Attributes imply

Contraception Method, so we name the task like this.


Open the Succedent part: click it!

Since there are only 3 Predicates (No use, Short Term and Long Term), it is natural to take subsets with min and max length = 1.

Next open the Antecedent part: click here. There are many more choices.

Select all but Contraceptive


Age is typically an interval, say 16-18, 17-19, ..., 46-48, 16-19, 17-20, ..., 45-48 or 16-20, 17-21, ..., 44-48.
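The interval families listed above are sliding windows of 3, 4 and 5 consecutive ages. A minimal Python sketch, assuming the age range 16 to 48 read off from the endpoints of the listed intervals (the function name `sliding_intervals` is hypothetical):

```python
def sliding_intervals(lo: int, hi: int, length: int):
    """All intervals of `length` consecutive integer values inside [lo, hi]."""
    return [(start, start + length - 1) for start in range(lo, hi - length + 2)]


for length in (3, 4, 5):
    intervals = sliding_intervals(16, 48, length)
    print(length, intervals[0], intervals[-1], len(intervals))
# 3 (16, 18) (46, 48) 31
# 4 (16, 19) (45, 48) 30
# 5 (16, 20) (44, 48) 29
```

Allowing windows of several lengths multiplies the number of candidate antecedents, so the window lengths are worth choosing with care.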


All the others contain only 2 or

4 Categories (=Predicates) so

let them be just 1-1 subsets.

With this division there are

34+15+4+4+4+2+4+2+2=71

columns in the data to be mined


Choosing the quantifier:

Double-click BASE, choose 5%

Double-click FUI, choose p = 0.900

Push Generate and see what happens

Let us see the shortest hypotheses in detail: click here!

There are 95 + 2 = 97 women who have no children; 95 of them use no contraception.
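The numbers of this hypothesis can be checked against the chosen threshold p = 0.900. The sketch below is an illustration under stated assumptions: the function name `founded_implication` is hypothetical, the truth condition (a ≥ p(a + b) together with a ≥ base) is the usual form of the founded implication quantifier, and `base` is treated as an absolute count. Checking the 5% BASE setting itself would need the total number of rows, which is not stated here, so it is left out.

```python
def founded_implication(a: int, b: int, p: float, base: int = 0) -> bool:
    """TRUE iff a >= p*(a + b) and a >= base (assumed form; base is an absolute count)."""
    return a + b > 0 and a >= p * (a + b) and a >= base


a, b = 95, 2   # no children and no use / no children and some method

print(round(a / (a + b), 3))             # 0.979
print(founded_implication(a, b, p=0.9))  # True
```

About 98% of the women without children use no contraception, which is well above the required 90%.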
