
Fall 2004 1

Optimization Methods
in Data Mining
Fall 2004 2
Overview
Optimization
  Mathematical Programming
    Support Vector Machines -> Classification, Clustering, etc.
    Steepest Descent Search -> Neural Nets, Bayesian Networks (optimize parameters)
  Combinatorial Optimization
    Genetic Algorithm -> Feature selection, Classification, Clustering
Fall 2004 3
What is Optimization?
Formulation
  Decision variables
  Objective function
  Constraints
Solution
  Iterative algorithm
  Improving search

Problem --(formulation)--> Model --(algorithm)--> Solution
Fall 2004 4
Combinatorial Optimization
Finitely many solutions to choose from
Select the best rule from a finite set of rules
Select the best subset of attributes
Too many solutions to consider all
Solutions
Branch-and-bound (better than Weka's exhaustive search)
Random search
Fall 2004 5
Random Search
Select an initial solution x(0) and let k = 0
Loop:
  Consider the neighbors N(x(k)) of x(k)
  Select a candidate x from N(x(k))
  Check the acceptance criterion
  If accepted then let x(k+1) = x, otherwise let x(k+1) = x(k)
Until the stopping criterion is satisfied
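
A minimal Python sketch of this loop; the neighborhood, acceptance rule, and stopping criterion are assumed to be supplied by the caller rather than taken from any particular data mining tool.

    import random

    def random_search(x0, neighbors, accept, stop, max_iter=1000):
        """Generic random/local search skeleton following the loop above.

        neighbors(x) -> list of candidate solutions around x
        accept(x, candidate) -> True if the candidate should replace x
        stop(k, x) -> True when the stopping criterion is satisfied
        """
        x, k = x0, 0
        while k < max_iter and not stop(k, x):
            candidate = random.choice(neighbors(x))   # pick a neighbor of x(k)
            if accept(x, candidate):
                x = candidate                         # x(k+1) = candidate
            # otherwise x(k+1) = x(k), i.e. keep x unchanged
            k += 1
        return x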
Fall 2004 6
Common Algorithms
Simulated Annealing (SA)
Idea: accept inferior solutions with a given probability
that decreases as time goes on
Tabu Search (TS)
Idea: restrict the neighborhood with a list of solutions
that are tabu (that is, cannot be visited) because
they were visited recently
Genetic Algorithm (GA)
Idea: neighborhoods based on genetic similarity
Most used in data mining applications
Fall 2004 7
Genetic Algorithms
Maintain a population of solutions
rather than a single solution
Members of the population have certain
fitness (usually just the objective)
Survival of the fittest through
selection
crossover
mutation
Fall 2004 8
GA Formulation
Use binary strings (or bits) to encode
solutions:
0 1 1 0 1 0 0 1 0
Terminology
Chromosome = solution
Parent chromosome
Children or offspring
Fall 2004 9
Problems Solved
Data Mining Problems that have been
addressed using Genetic Algorithms:
Classification
Attribute selection
Clustering
Fall 2004 10
Classification Example
Outlook: Sunny = 100, Overcast = 010, Rainy = 001
Windy: Yes = 10, No = 01
Fall 2004 11
Representing a Rule
If windy = yes then play = yes
  outlook = 111, windy = 10, play = 10    (111 means any outlook)
If outlook = overcast and windy = yes then play = no
  outlook = 010, windy = 10, play = 01
Fall 2004 12
Single-Point Crossover

Parents:
  outlook = 111, windy = 10, play = 10
  outlook = 010, windy = 01, play = 01

Crossover point: after the outlook bits

Offspring:
  outlook = 111, windy = 01, play = 01
  outlook = 010, windy = 10, play = 10
Fall 2004 13
Two-Point Crossover

Parents:
  outlook = 111, windy = 10, play = 10
  outlook = 010, windy = 01, play = 01

Crossover points: around the windy bits

Offspring:
  outlook = 111, windy = 01, play = 10
  outlook = 010, windy = 10, play = 01
Fall 2004 14
Uniform Crossover

Parents:
  outlook = 111, windy = 10, play = 10
  outlook = 010, windy = 01, play = 01

Offspring (each bit drawn from either parent):
  outlook = 110, windy = 10, play = 00
  outlook = 011, windy = 01, play = 11

Problem? (offspring such as play = 00 are no longer valid encodings)
Fall 2004 15
Mutation

Parent:    outlook = 010, windy = 01, play = 01
Offspring: outlook = 010, windy = 11, play = 01
(the first windy bit is mutated)
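
A sketch of the three crossover operators and bit-flip mutation on 0/1 lists; the 7-bit rule encoding (3 outlook bits, 2 windy bits, 2 play bits) and the helper names are illustrative assumptions, not Weka code.

    import random

    def single_point_crossover(p1, p2, point):
        return p1[:point] + p2[point:], p2[:point] + p1[point:]

    def two_point_crossover(p1, p2, a, b):
        return (p1[:a] + p2[a:b] + p1[b:],
                p2[:a] + p1[a:b] + p2[b:])

    def uniform_crossover(p1, p2):
        c1, c2 = [], []
        for b1, b2 in zip(p1, p2):
            if random.random() < 0.5:
                c1.append(b1); c2.append(b2)
            else:
                c1.append(b2); c2.append(b1)
        return c1, c2

    def mutate(chrom, rate=0.05):
        return [1 - b if random.random() < rate else b for b in chrom]

    # parents from the slides: outlook=111, windy=10, play=10 and outlook=010, windy=01, play=01
    p1 = [1, 1, 1, 1, 0, 1, 0]
    p2 = [0, 1, 0, 0, 1, 0, 1]
    print(single_point_crossover(p1, p2, 3))   # crossover point after the outlook bits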
Fall 2004 16
Selection
Which strings in the population should
be operated on?
Rank and select the n fittest ones
Assign probabilities according to fitness
and select probabilistically, say

  P[select x_i] = Fitness(x_i) / Σ_j Fitness(x_j)
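
A minimal sketch of this fitness-proportional (roulette-wheel) selection; the toy population and fitness function are made up for illustration.

    import random

    def roulette_select(population, fitness, k=1):
        """Select k individuals with probability proportional to fitness."""
        weights = [fitness(x) for x in population]
        return random.choices(population, weights=weights, k=k)

    pop = ["0101101", "1110101", "0010010"]
    fit = lambda s: s.count("1")          # toy fitness: number of 1 bits
    print(roulette_select(pop, fit, k=2))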
Fall 2004 17
Creating a New Population
Create a population P_new with p individuals
Survival
  Allow individuals from the old population to survive intact
  Rate: (1 - r) of the population
  How to select the individuals that survive: deterministic / random
Crossover
  Select fit individuals and create new ones
  Rate: r of the population. How to select?
Mutation
  Slightly modify any of the above individuals
  Mutation rate: m
  Fixed number of mutations versus probabilistic mutations
Fall 2004 18
GA Algorithm
Randomly generate an initial population P
Evaluate the fitness f(x_i) of each individual in P
Repeat:
  Survival: probabilistically select (1-r)p individuals from P and add them to P_new, according to

    P[select x_i] = f(x_i) / Σ_j f(x_j)

  Crossover: probabilistically select rp/2 pairs from P, apply the crossover operator, and add the offspring to P_new
  Mutation: uniformly choose m percent of the members and invert one randomly selected bit in each
  Update: P <- P_new
  Evaluate: compute the fitness f(x_i) of each individual in P
Return the fittest individual from P
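
A compact sketch of the whole loop for bit-string individuals; the survival rate r, mutation rate m, and toy fitness are illustrative assumptions.

    import random

    def ga(fitness, pop, r=0.6, m=0.05, generations=50):
        """Sketch of the GA loop: survival, crossover, mutation on bit-string individuals."""
        p = len(pop)
        for _ in range(generations):
            weights = [fitness(x) for x in pop]
            new = random.choices(pop, weights=weights, k=int((1 - r) * p))   # survival
            while len(new) < p:                                              # crossover
                p1, p2 = random.choices(pop, weights=weights, k=2)
                cut = random.randrange(1, len(p1))
                new += [p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]]
            for i in range(p):                                               # mutation
                if random.random() < m:
                    j = random.randrange(len(new[i]))
                    new[i] = new[i][:j] + ("1" if new[i][j] == "0" else "0") + new[i][j+1:]
            pop = new[:p]
        return max(pop, key=fitness)

    start = ["".join(random.choice("01") for _ in range(10)) for _ in range(20)]
    print(ga(lambda s: s.count("1"), start))   # toy fitness: number of 1 bits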
Fall 2004 19
Analysis of GA: Schemas
Does GA converge?
Does GA move towards a good solution?
Local optima?

Holland (1975): Analysis based on schemas
Schema: string combination of 0s, 1s, *s
Example: 0*10 represents {0010,0110}

Fall 2004 20
The Schema Theorem
(all the theory on one slide)
  E[m(s, t+1)] >= m(s,t) * ( u(s,t) / f(t) ) * [ 1 - p_c * d(s)/(l-1) ] * (1 - p_m)^o(s)

where
  m(s,t) = number of instances of schema s at time t
  u(s,t) = average fitness of the individuals in schema s at time t
  f(t)   = average fitness of the population at time t
  p_c    = probability of crossover
  p_m    = probability of mutation
  o(s)   = number of defined bits in schema s
  d(s)   = distance between the defined bits in s (defining length)
  l      = string length
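
A tiny numeric illustration of the bound; the parameter values are invented for the example.

    def schema_bound(m, u, f_bar, p_c, p_m, d, o, l):
        """Lower bound on E[m(s, t+1)] from the schema theorem."""
        return m * (u / f_bar) * (1 - p_c * d / (l - 1)) * (1 - p_m) ** o

    # e.g. schema 0*10: o(s) = 3 defined bits, defining length d(s) = 3, string length l = 4
    print(schema_bound(m=5, u=1.2, f_bar=1.0, p_c=0.7, p_m=0.01, d=3, o=3, l=4))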
Fall 2004 21
Interpretation
Fit schemas grow in influence
What is missing?
  Crossover?
  Mutation?
  What about time t+1?
Other approaches:
Markov chains
Statistical mechanics
Fall 2004 22
GA for Feature Selection
Feature selection:
Select a subset of attributes (features)
Reason: too many, redundant, or irrelevant attributes

Set of all subsets of attributes very
large
Little structure to search
Random search methods
Fall 2004 23
Encoding
Need a bit code representation
Have some n attributes
Each attribute is either in (1) or out (0)
of the selected set

outlook = 1, temperature = 0, humidity = 1, windy = 0    (the subset {outlook, humidity} is selected)
Fall 2004 24
Fitness
Wrapper approach
Apply learning algorithm, say a decision tree, to
the individual x = {outlook, humidity}
Let fitness equal error rate (minimize)
Filter approach
Let fitness equal the entropy (minimize)
Other diversity measures can also be used
Simplicity measure?
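
A hedged sketch of the wrapper fitness using scikit-learn (an assumption, the course itself uses Weka): the fitness of an attribute subset is the cross-validated error rate of a decision tree trained on just those attributes.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def wrapper_fitness(X, y, mask, cv=3):
        # error rate of a decision tree trained on the attributes where mask == 1 (lower = better)
        cols = [i for i, bit in enumerate(mask) if bit == 1]
        accuracy = cross_val_score(DecisionTreeClassifier(), X[:, cols], y, cv=cv).mean()
        return 1.0 - accuracy

    # toy data: 60 instances, 4 numeric attributes, binary class
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 4))
    y = (X[:, 0] + X[:, 2] > 0).astype(int)
    # mask assumes the hypothetical column order outlook, temperature, humidity, windy
    print(wrapper_fitness(X, y, mask=[1, 0, 1, 0]))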
Fall 2004 25
Crossover

Parents:
  outlook = 1, temperature = 0, humidity = 1, windy = 0
  outlook = 0, temperature = 0, humidity = 1, windy = 1

Crossover point: between humidity and windy

Offspring:
  outlook = 1, temperature = 0, humidity = 1, windy = 1
  outlook = 0, temperature = 0, humidity = 1, windy = 0
Fall 2004 26
In Weka
Fall 2004 27
Clustering Example
Create two clusters for:

ID Outlook Temperature Humidity Windy Play
10 Sunny Hot High True No
20 Overcast Hot High False Yes
30 Rainy Mild High False Yes
40 Rainy Cool Normal False Yes

Encode a clustering as one bit per instance (0 = first cluster, 1 = second cluster):

Parents:
  0 0 1 1  ->  {10, 20} and {30, 40}
  1 0 1 0  ->  {20, 40} and {10, 30}

Crossover (single point after the second bit) gives the offspring:
  0 0 1 0  ->  {10, 20, 40} and {30}
  1 0 1 1  ->  {20} and {10, 30, 40}
Fall 2004 28
Discussion
GA is a flexible and powerful random
search methodology
Efficiency depends on how well you can
encode the solutions in a way that will
work with the crossover operator
In data mining, attribute selection is the
most natural application
Fall 2004 29
Attribute Selection in
Unsupervised Learning
Attribute selection typically uses a
measure, such as accuracy, that is directly
related to the class attribute
How do we apply attribute selection to
unsupervised learning such as clustering?
Need a measure
compactness of cluster
separation among clusters
Multiple measures
Fall 2004 30
Quality Measures
Compactness
  F_within = 1 - (1/Z_within) Σ_{k=1..K} Σ_{i=1..n} δ_ik (1/d) Σ_{j=1..d} (x_ij - c_kj)²

where
  δ_ik = 1 if instance x_i belongs to cluster k, and 0 otherwise
  c_kj = Σ_{i=1..n} δ_ik x_ij / Σ_{i=1..n} δ_ik    (j-th coordinate of the centroid of cluster k)
  n = number of instances, K = number of clusters, d = number of attributes
  Z_within = normalization constant chosen so that F_within ∈ [0, 1]
Fall 2004 31
More Quality Measures
Cluster Separation
  F_between = (1/Z_between) (1/(K-1)) Σ_{k=1..K} Σ_{i=1..n} (1 - δ_ik) (1/d) Σ_{j=1..d} (x_ij - c_kj)²

(distances from each instance to the centroids of the clusters it does not belong to; Z_between is a normalization constant)
Fall 2004 32
Final Quality Measures
Adjustment for bias:

  F_clusters = 1 - (K - K_min) / (K_max - K_min)

Complexity:

  F_complexity = 1 - (d - 1) / (D - 1)

(d = number of selected attributes, D = total number of attributes)
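
A sketch of the four measures in Python, using the reconstructed formulas above with Z_within = Z_between = 1 (an assumption); on the three-cluster solution of the later Iris example it reproduces F_within ≈ 0.55 and F_between ≈ 6.60.

    import numpy as np

    def quality(X, labels, K, K_min, K_max, D, Z_within=1.0, Z_between=1.0):
        # F_within, F_between, F_clusters, F_complexity for a clustering of X (n instances x d attributes)
        n, d = X.shape
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # (1/d) * squared distance from every instance to every centroid, shape (n, K)
        sq = ((X[:, None, :] - centroids[None, :, :]) ** 2).mean(axis=2)
        member = np.eye(K)[labels]                        # delta_ik
        f_within = 1.0 - (member * sq).sum() / Z_within
        f_between = ((1.0 - member) * sq).sum() / ((K - 1) * Z_between)
        f_clusters = 1.0 - (K - K_min) / (K_max - K_min)
        f_complexity = 1.0 - (d - 1) / (D - 1)
        return f_within, f_between, f_clusters, f_complexity

    # Sepal Width and Petal Width of instances 10..100, clustered as on the later slide
    X = np.array([[3.5, 0.6], [3.8, 0.4], [3.0, 0.3], [3.8, 0.2], [3.2, 0.2],
                  [2.8, 1.5], [2.8, 1.3], [3.3, 1.6], [2.4, 1.0], [2.9, 1.3]])
    labels = np.array([0, 0, 0, 0, 0, 2, 2, 1, 2, 2])     # {10..50}, {80}, {60,70,90,100}
    print(quality(X, labels, K=3, K_min=2, K_max=3, D=4))  # approx (0.55, 6.60, 0.0, 0.67)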
Fall 2004 33
Wrapper Framework
Loop:
Obtain an attribute subset
Apply k-means algorithm
Evaluate cluster quality
Until stopping criterion satisfied
Fall 2004 34
Problem
What is the optimal attribute subset?
What is the optimal number of clusters?

Try to find simultaneously
Fall 2004 35
Example
Find an attribute subset and optimal number of clusters (K_min = 2, K_max = 3) for:
ID Sepal Length Sepal Width Petal length Petal Width
10 5.0 3.5 1.6 0.6
20 5.1 3.8 1.9 0.4
30 4.8 3.0 1.4 0.3
40 5.1 3.8 1.6 0.2
50 4.6 3.2 1.4 0.2
60 6.5 2.8 4.6 1.5
70 5.7 2.8 4.5 1.3
80 6.3 3.3 4.7 1.6
90 4.9 2.4 3.3 1.0
100 6.6 2.9 4.6 1.3
Fall 2004 36
Formulation
Define an individual as a 5-bit string:

  [Sepal Length] [Sepal Width] [Petal Length] [Petal Width] [number of clusters]

The first four bits indicate which attributes are selected; the last bit encodes the number of clusters (0 -> K = 2, 1 -> K = 3).

Initial population:
  0 1 0 1 1
  1 0 0 1 0
Fall 2004 37
Evaluate Fitness
Start with 0 1 0 1 1
Three clusters and the attributes {Sepal Width, Petal Width}
Apply k-means with k = 3 to:
ID Sepal Width Petal Width
10 3.5 0.6
20 3.8 0.4
30 3.0 0.3
40 3.8 0.2
50 3.2 0.2
60 2.8 1.5
70 2.8 1.3
80 3.3 1.6
90 2.4 1.0
100 2.9 1.3
Fall 2004 38
K-Means
Start with random centroids: 10, 70, 80
[Figure: scatter plot of the ten instances (Sepal Width on the horizontal axis, Petal Width on the vertical axis) with instances 10, 70, and 80 as the initial centroids.]
Fall 2004 39
New Centroids
[Figure: the same scatter plot with the recomputed centroids marked.]
No change in assignment, so the k-means algorithm terminates.
Fall 2004 40
Quality of Clusters
Centers
Center 1 at (3.46, 0.34): {10, 20, 30, 40, 50}
Center 2 at (3.30, 1.60): {80}
Center 3 at (2.73, 1.28): {60, 70, 90, 100}
Evaluation:
  F_within = 0.55
  F_between = 6.60
  F_clusters = 0.00
  F_complexity = 0.67
Fall 2004 41
Next Individual
Now look at 1 0 0 1 0
Two clusters and the attributes {Sepal Length, Petal Width}
Apply k-means with k = 2 to:
ID Sepal Length Petal Width
10 5.0 0.6
20 5.1 0.4
30 4.8 0.3
40 5.1 0.2
50 4.6 0.2
60 6.5 1.5
70 5.7 1.3
80 6.3 1.6
90 4.9 1.0
100 6.6 1.3
Fall 2004 42
K-Means
Say we select 20 and 90 as initial centroids:
[Figure: scatter plot of the ten instances (Sepal Length on the horizontal axis, Petal Width on the vertical axis) with instances 20 and 90 as the initial centroids.]
Fall 2004 43
Recalculate Centroids
[Figure: the same scatter plot with the recalculated centroids C1 and C2 marked.]
Fall 2004 44
Recalculate Again
[Figure: the same scatter plot with the centroids C1 and C2 recalculated once more.]
No change in assignment, so the k-means algorithm terminates.
Fall 2004 45
Quality of Clusters
Centers
Center 1 at (4.92, 0.45): {10, 20, 30, 40, 50, 90}
Center 2 at (6.28, 1.43): {60, 70, 80, 100}
Evaluation:
  F_within = 0.39
  F_between = 14.59
  F_clusters = 1.00
  F_complexity = 0.67
Fall 2004 46
Compare Individuals
Individual 1 0 0 1 0:                  Individual 0 1 0 1 1:
  F_within     = 0.39          <         F_within     = 0.55
  F_between    = 14.59         >         F_between    = 6.60
  F_clusters   = 1.00          >         F_clusters   = 0.00
  F_complexity = 0.67          =         F_complexity = 0.67

Which is fitter?
Fall 2004 47
Evaluating Fitness
Can scale (if necessary)
Then weight them together, e.g.,

  fitness(1 0 0 1 0) = 1/4 (0.39/0.55 + 14.59/14.59 + 1.00 + 0.67) = 0.84
  fitness(0 1 0 1 1) = 1/4 (0.55/0.55 + 6.60/14.59 + 0.00 + 0.67) = 0.53

(F_within and F_between are scaled by their maxima over the two individuals; the four measures are then averaged with equal weight.)

Alternatively, we can use Pareto optimization.
Fall 2004 48
Fall 2004 49
Mathematical Programming
Continuous decision variables
Constrained versus unconstrained
Form of the objective function
Linear Programming (LP)
Quadratic Programming (QP)
General Mathematical Programming (MP)
Fall 2004 50
Linear Program
f(x) = 2 + 0.5x

  max  2 + 0.5x
  s.t. 0 <= x <= 15

[Figure: the line f(x) = 2 + 0.5x over 0 <= x <= 15; the optimal solution is at the boundary x = 15.]
Fall 2004 51
Two Dimensional Problem
  max  12 x1 + 9 x2
  s.t. x1 <= 1000
       x2 <= 1500
       x1 + x2 <= 1750
       4 x1 + 2 x2 <= 4800
       x1, x2 >= 0

[Figure: the feasible region in the (x1, x2) plane bounded by the constraints above, with the objective contours 12 x1 + 9 x2 = 6000 and 12 x1 + 9 x2 = 12000 drawn and the optimal solution marked.]

The optimum is always at an extreme point.
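
For reference, the same LP can be solved numerically; a sketch with scipy.optimize.linprog (an assumption, not part of the original slides), where the objective is negated because linprog minimizes.

    from scipy.optimize import linprog

    # max 12*x1 + 9*x2  s.t.  x1 <= 1000, x2 <= 1500, x1 + x2 <= 1750, 4*x1 + 2*x2 <= 4800, x1, x2 >= 0
    res = linprog(c=[-12, -9],
                  A_ub=[[1, 0], [0, 1], [1, 1], [4, 2]],
                  b_ub=[1000, 1500, 1750, 4800],
                  bounds=[(0, None), (0, None)])
    print(res.x, -res.fun)   # optimal extreme point and objective value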
Fall 2004 52
Simplex Method
[Figure: the same two-dimensional feasible region, used to illustrate how the simplex method moves between extreme points.]
Fall 2004 53
Quadratic Programming
f(x) = 0.2 + (x - 1)²

f'(x) = 2(x - 1) = 0   =>   x = 1

[Figure: the parabola f(x) = 0.2 + (x - 1)² with its minimum at x = 1.]
Fall 2004 54
General MP
[Figure: a function with several stationary points, each satisfying f'(x) = 0.]

The derivative being zero is a necessary but not a sufficient condition.
Fall 2004 55
Constrained Problem?
[Figure: a function with its unconstrained stationary point f'(x) = 0 and the constraint x <= 10.]
Fall 2004 56
General MP
We write a general mathematical program in matrix notation as

  min  f(x)
  s.t. h(x) = 0
       g(x) <= 0

where x is the vector of decision variables, h(x) = (h_1(x), h_2(x), ..., h_m(x)), and g(x) = (g_1(x), g_2(x), ..., g_m(x)).
Fall 2004 57
Karush-Kuhn-Tucker (KKT)
Conditions
If x* is a relative minimum for

  min  f(x)
  s.t. h(x) = 0
       g(x) <= 0

then there exist λ and μ >= 0 such that

  ∇f(x*) + λᵀ ∇h(x*) + μᵀ ∇g(x*) = 0
  μᵀ g(x*) = 0
Fall 2004 58
Convex Sets
A set C is convex if any line segment connecting two points in the set lies completely within the set, that is,

  α x1 + (1 - α) x2 ∈ C   for all x1, x2 ∈ C and α ∈ (0, 1)

[Figure: an example of a convex set and of a set that is not convex.]
Fall 2004 59
Convex Hull
The convex hull co(S) of a set S is the intersection of all convex sets containing S.

A set V ⊆ R^n is a linear variety if

  α x1 + (1 - α) x2 ∈ V   for all x1, x2 ∈ V and α ∈ R
Fall 2004 60
Hyperplane
A hyperplane in R^n is an (n - 1)-dimensional linear variety.

[Figure: a hyperplane in R³ (a plane) and a hyperplane in R² (a line).]
Fall 2004 61
Convex Hull Example
[Figure: Play and No Play instances plotted against Temperature and Humidity; the closest points c and d of the two convex hulls are marked, and the separating hyperplane bisects the segment between them.]
Fall 2004 62
Finding the Closest Points
Formulate as a QP:

  min  (1/2) ||c - d||²
  s.t. c = Σ_{i: Play = Yes} α_i x_i
       d = Σ_{i: Play = No}  α_i x_i
       Σ_{i: Play = Yes} α_i = 1
       Σ_{i: Play = No}  α_i = 1
       α_i >= 0 for all i
Fall 2004 63
Support Vector Machines
[Figure: Play and No Play instances plotted against Temperature and Humidity, with the separating hyperplane and the support vectors highlighted.]
Fall 2004 64
Example
ID Sepal Width Petal Width
10 3.5 0.6
20 3.8 0.4
30 3.0 0.3
40 3.8 0.2
50 3.2 0.2
60 2.8 1.5
70 2.8 1.3
80 3.3 1.6
90 2.4 1.0
100 2.9 1.3
Fall 2004 65
Separating Hyperplane
[Figure: scatter plot of the ten instances (Sepal Width vs. Petal Width) with a separating hyperplane.]
Fall 2004 66
Assume Separating Planes
Constraints:

  x_i · w + b >= +1   for all i with y_i = +1
  x_i · w + b <= -1   for all i with y_i = -1

Distance to each plane: 1 / ||w||
Fall 2004 67
Optimization Problem
  max_{w, b}  2 / ||w||
  subject to  x_i · w + b >= +1   for all i with y_i = +1
              x_i · w + b <= -1   for all i with y_i = -1
Fall 2004 68
How Do We Solve MPs?
[Figure: plot of an objective function over the range -5 to 5 in each coordinate.]
Fall 2004 69
Improving Search
Direction-step approach




  x(k+1) = x(k) + λ(k) Δx(k)

  x(k+1) = new solution,  x(k) = current solution,  λ(k) = step size,  Δx(k) = search direction
Fall 2004 70
Steepest Descent
Search direction equal to negative gradient


  Δx(k) = -∇f(x(k))

Finding the step size λ is a one-dimensional optimization problem: minimize φ(λ) = f(x(k) + λ Δx(k)).
Fall 2004 71
Newton's Method
Taylor series expansion:

  f(x) ≈ f(x(k)) + ∇f(x(k))ᵀ (x - x(k)) + (1/2) (x - x(k))ᵀ F(x(k)) (x - x(k))

The right-hand side is minimized at

  x(k+1) = x(k) - [F(x(k))]⁻¹ ∇f(x(k))
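
A sketch of the Newton update on the same kind of quadratic; the Hessian system is solved directly rather than forming the inverse, and for a quadratic a single step suffices.

    import numpy as np

    def newton(grad, hess, x0, steps=20, tol=1e-10):
        """Newton's method: solve F(x_k) * delta = -grad f(x_k) at each step."""
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            x = x - np.linalg.solve(hess(x), g)
        return x

    grad = lambda x: np.array([2 * (x[0] - 1), 20 * (x[1] + 2)])
    hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
    print(newton(grad, hess, [0.0, 0.0]))   # reaches (1, -2) after a single Newton step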
Fall 2004 72
Discussion
Computing the inverse Hessian is difficult
Quasi-Newton


Conjugate gradient methods
Does not account for constraints
Penalty methods
Lagrangian methods, etc.
  Quasi-Newton update:  x(k+1) = x(k) - λ(k) S(k) ∇f(x(k))
Fall 2004 73
Non-separable
Add an error term ε_i to the constraints:

  x_i · w + b >= +1 - ε_i   for all i with y_i = +1
  x_i · w + b <= -1 + ε_i   for all i with y_i = -1
  ε_i >= 0 for all i
Fall 2004 74
Wolfe Dual
  max_α   Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)
  subject to   0 <= α_i <= C          (simple constraints)
               Σ_i α_i y_i = 0

The training data enters only through the dot products x_i · x_j, and w = Σ_i α_i y_i x_i.
Fall 2004 75
Extension to Non-Linear
Kernel functions

  K(x, y) = Φ(x) · Φ(y)

where the mapping Φ: Rⁿ -> H sends the data into a high-dimensional Hilbert space.
The kernel takes the place of the dot product in the Wolfe dual.
Fall 2004 76
Some Possible Kernels
  K(x, y) = (x · y + 1)^p
  K(x, y) = exp( -||x - y||² / (2σ²) )
  K(x, y) = tanh( κ (x · y) - δ )
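
The three kernels written out directly; the test vectors are taken from the Iris subset above, and the closing comment notes the (assumed) equivalent kernel choices in scikit-learn's SVC.

    import numpy as np

    def poly_kernel(x, y, p=2):
        return (np.dot(x, y) + 1) ** p

    def rbf_kernel(x, y, sigma=1.0):
        return np.exp(-np.dot(x - y, x - y) / (2 * sigma ** 2))

    def sigmoid_kernel(x, y, kappa=1.0, delta=1.0):
        return np.tanh(kappa * np.dot(x, y) - delta)

    x, y = np.array([3.5, 0.6]), np.array([2.8, 1.5])
    print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))

    # equivalent choices in scikit-learn: SVC(kernel="poly"), SVC(kernel="rbf"), SVC(kernel="sigmoid")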
Fall 2004 77
In Weka
Weka.classifiers.smo
Support vector machine for nominal data only
Does both linear and non-linear models
Fall 2004 78
Optimization in DM
Optimization
  Mathematical Programming
    Support Vector Machines -> Classification, Clustering, etc.
    Steepest Descent Search -> Neural Nets, Bayesian Networks (optimize parameters)
  Combinatorial Optimization
    Genetic Algorithm -> Feature selection, Classification, Clustering
Fall 2004 79
Bayesian Classification
Naïve Bayes assumes independence between attributes
Simple computations
Best classifier if assumption is true
Bayesian Belief Networks
Joint probability distributions
Directed acyclic graphs
Nodes are random variables (attributes)
Arcs represent the dependencies
Fall 2004 80
Example: Bayesian Network
[Network: Family History and Smoker at the top, Lung Cancer and Emphysema in the middle layer, Positive X-Ray and Dyspnea at the bottom.]
Lung Cancer is conditionally independent of Emphysema, given Family History and Smoker
FH,S FH,~S ~FH,S ~FH,~S
LC 0.8 0.5 0.7 0.1
~LC 0.2 0.5 0.3 0.9
Lung Cancer depends on Family History and Smoker
Fall 2004 81
Conditional Probabilities
  Pr(z_1, ..., z_n) = Π_{i=1..n} Pr( z_i | Parents(Z_i) )

where Z_i is a random variable and z_i an outcome of that variable. For the network above, for example,

  Pr( LungCancer = "yes" | FamilyHistory = "yes", Smoker = "yes" ) = 0.8
  Pr( LungCancer = "no"  | FamilyHistory = "no",  Smoker = "no"  ) = 0.9

The node representing the class attribute is called the output node.
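
A small sketch of the factored joint probability for the Family History / Smoker / Lung Cancer fragment; the CPT values come from the table above, while the priors for the two parent nodes are made-up assumptions.

    # CPT for LungCancer given (FamilyHistory, Smoker), from the table above
    p_lc = {("yes", "yes"): 0.8, ("yes", "no"): 0.5, ("no", "yes"): 0.7, ("no", "no"): 0.1}
    # assumed (illustrative) priors for the parent nodes
    p_fh = {"yes": 0.3, "no": 0.7}
    p_smoker = {"yes": 0.4, "no": 0.6}

    def joint(fh, smoker, lc):
        """Pr(FH, Smoker, LC) = Pr(FH) * Pr(Smoker) * Pr(LC | FH, Smoker)."""
        p_lc_given_parents = p_lc[(fh, smoker)] if lc == "yes" else 1 - p_lc[(fh, smoker)]
        return p_fh[fh] * p_smoker[smoker] * p_lc_given_parents

    print(joint("yes", "yes", "yes"))   # 0.3 * 0.4 * 0.8 = 0.096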
Fall 2004 82
How Do we Learn?
Network structure
Given/known
Inferred or learned from the data
Variables
Observable
Hidden (missing values / incomplete data)
Fall 2004 83
Case 1: Known Structure and
Observable Variables
Straightforward
Similar to Naïve Bayes
Compute the entries of the conditional
probability table (CPT) of each variable
Fall 2004 84
Case 2: Known Structure and
Some Hidden Variables
Still need to learn the CPT entries
Let S be a set of s training instances X_1, X_2, ..., X_s.

Let w_ijk be the CPT entry for the variable Y_i = y_ij having parents U_i = u_ik.
Fall 2004 85
CPT Example
} " " , " {"
} , {
" "
yes yes u
Smoker ory FamilyHist U
yes y
LungCancer Y
ik
i
ij
i
=
=
=
=
FH,S FH,~S ~FH,S ~FH,~S
LC 0.8 0.5 0.7 0.1
~LC 0.2 0.5 0.3 0.9
ijk
w
{ }
k j i
ijk
w
, ,
= w
Fall 2004 86
Objective
Must find the values of w = { w_ijk }.

The objective is to maximize the likelihood of the data, that is,

  Pr_w(S) = Π_{d=1..s} Pr_w(X_d)

How do we do this?
Fall 2004 87
Non-Linear MP
Compute gradients (the conditional probabilities are obtained from the training data):

  ∂ Pr_w(S) / ∂ w_ijk = Σ_{d=1..s} Pr( Y_i = y_ij, U_i = u_ik | X_d ) / w_ijk

Move in the direction of the gradient:

  w_ijk <- w_ijk + l * ∂ Pr_w(S) / ∂ w_ijk        (l = learning rate)
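
A sketch of one gradient-ascent step for the CPT entries of a single parent configuration; the posterior probabilities are assumed to come from an inference routine (stubbed here), and the renormalization at the end is an added assumption to keep the column summing to one.

    def update_cpt(w, posteriors, learning_rate=0.01):
        """One gradient-ascent step on the CPT entries w[j] for a fixed parent configuration u_ik.

        posteriors[d][j] = Pr(Y_i = y_ij, U_i = u_ik | X_d), from inference over instance X_d.
        """
        grads = [sum(post[j] for post in posteriors) / w[j] for j in range(len(w))]
        w = [w[j] + learning_rate * grads[j] for j in range(len(w))]
        total = sum(w)                       # renormalize so the column still sums to one
        return [wj / total for wj in w]

    # toy example: two values of LungCancer ("yes", "no") for parents (FH="yes", Smoker="yes")
    w = [0.8, 0.2]
    posteriors = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]   # stub inference results for three instances
    print(update_cpt(w, posteriors))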
Fall 2004 88
Case 3: Unknown Network
Structure
Need to find/learn the optimal network
structure for the data

What type of optimization problem is this?
Combinatorial optimization (GA etc.)
