
Computational Intelligence:

Methods and Applications

Lecture 28
Non-parametric density modeling

Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch
Density from histograms
In 1-D or 2-D it is rather simple: histograms provide a piecewise-constant approximation. Since we do not assume any particular functional dependence, such estimation is called non-parametric.

Histograms change depending on the size of the bin B_i that measures the frequency P(X ∈ B_i).

Smoothing of histograms may be done by fitting some smooth functions, such as Gaussians.

How good is this approximation?
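A minimal sketch in Python (my own illustration, not from the lecture; the sample and bin width are assumed) showing a piecewise-constant density estimate built from histogram counts, where changing the bin size changes the estimate:

```python
import numpy as np

# Piecewise-constant (histogram) density estimate for a 1-D sample.
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)

bin_width = 0.5                      # the bin size B_i; a free smoothing choice
bins = np.arange(x.min(), x.max() + bin_width, bin_width)
counts, edges = np.histogram(x, bins=bins)

# P(X in B_i) ~ k_i / n, so the density inside B_i ~ k_i / (n * bin_width)
density = counts / (len(x) * bin_width)
for lo, hi, p in zip(edges[:-1], edges[1:], density):
    print(f"[{lo:5.2f}, {hi:5.2f})  density ~ {p:.3f}")
```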
Why does histogram estimation work?
The probability that a data point comes from some region R (belongs to some category, etc.) is:

P = ∫_R P(X) dX

We are given n data points; what is the chance Pr that k of these points are from region R? If n = k = 1 then Pr = P. In general Pr is the number of combinations in which k points could be selected out of n, multiplied by the probability of selecting k points from R, i.e. P^k, and of selecting n−k points not from R, i.e. (1−P)^(n−k); that is, the distribution is binomial:

Pr(k) = C(n, k) P^k (1−P)^(n−k)

Expected k value:  E(k) = nP
Expected variance: σ²(k) = nP(1−P), hence σ²(k/n) = P(1−P)/n

Since P(X)·V_R ≈ P ≈ k/n, and for a large number of samples n a small variance of k/n is expected, k/n is a useful approximation to P(X).
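A minimal sketch (my own illustration, with an assumed region probability P) showing that the fraction k/n concentrates around P with variance P(1−P)/n as n grows, which is why bin counts estimate probabilities reliably:

```python
import numpy as np

# k ~ Binomial(n, P); the estimate k/n has mean P and variance P(1-P)/n.
rng = np.random.default_rng(0)
P = 0.2                                  # assumed probability mass of region R
for n in (10, 100, 10_000):
    estimates = rng.binomial(n, P, size=2000) / n
    print(f"n={n:6d}  mean={estimates.mean():.3f}  "
          f"var={estimates.var():.6f}  theory={P*(1-P)/n:.6f}")
```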
Parzen windows 1D
Density estimate using (for standardized data) a bin of size h (a window on the data) in each dimension.
For 1D the cumulative density function is CP(x) = (# observations < x)/n.
The density is given as the derivative of this function, estimated as:

P(x) ≈ [CP(x + h) − CP(x − h)] / (2h)

For example, hyperrectangular windows with H(u) = 1 for all |u_j| < 0.5 may be used, or a hard sphere with 1 inside and 0 outside; h is called a smoothing parameter.

Number of points inside the window:

k = Σ_{i=1}^n H( (x^(i) − x)/h )

Density estimate:

P(x) = k/(nV) = (1/(nh)) Σ_{i=1}^n H( (x^(i) − x)/h )

The kernel should satisfy H(u) ≥ 0 and should integrate to 1.
Parzen windows 1D
Estimate the density using (for standardized data) a bin of size h (a window on the data) in each dimension. For 1D the cumulative density function is:

P(x < a) = (# observations with x < a)/n ≤ 1   (this is the probability that x < a)

Cumulative contributions from all points should sum up to 1; the contribution from each interval [x − h/2, x + h/2] with a single observation x_i inside is 1/n. For real data this is a stairway function.

The density is given as the derivative of this function, but for such staircase data it will be discontinuous: a series of spikes at the x = x_i values corresponding to real observations.

A numerical estimate of the density at point x = a is calculated as:

P(x = a) ≈ [P(x < a + h/2) − P(x ≤ a − h/2)] / h
Parzen 1D kernels
We need a continuous density estimate, not spikes.

Introduce a kernel function indicating whether the variable is in the [−1/2, +1/2] interval:

H(u) = 1 for |u| ≤ 1/2,  H(u) = 0 for |u| > 1/2

The density may now be written as:

P(x) = (1/(nh)) Σ_{i=1}^n H( (x_i − x)/h )

The density in each window is constant = 1, so integrating over each kernel gives:

∫ H( (x_i − x)/h ) dx = h

Integrating over all x therefore gives total probability = 1.

The smooth cumulative density for x ≤ a is then:

P(x ≤ a) = ∫_{−∞}^a P(x) dx

This is equal to 1/n times the number of x_i ≤ a, plus a fraction from the last interval [x_i − h/2, a] if a < x_i + h/2.
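A minimal sketch (my own; the function name parzen_1d, the sample, and the grid are assumptions, not from the lecture) of the 1-D Parzen estimate with the rectangular kernel H(u) = 1 for |u| ≤ 1/2:

```python
import numpy as np

def parzen_1d(x_data, x_grid, h):
    """P(x) = (1/(n*h)) * sum_i H((x_i - x)/h) with a rectangular kernel."""
    u = (x_data[None, :] - x_grid[:, None]) / h        # shape (grid, n)
    H = (np.abs(u) <= 0.5).astype(float)               # rectangular window
    return H.sum(axis=1) / (len(x_data) * h)

rng = np.random.default_rng(1)
sample = rng.normal(size=200)
grid = np.linspace(-4, 4, 9)
for h in (0.2, 1.0):                                   # h = smoothing parameter
    print(f"h={h}:", np.round(parzen_1d(sample, grid, h), 3))
```

Small h picks up details (a spiky estimate), large h smooths them out, as discussed in the following slides.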
Parzen windows dD
The window moves with X, which is in its middle, therefore the density is smoothed. 1D generalizes easily to dD situations:
Volume V = h^d and the kernel (window) function:

H(u) = H( (X^(i) − X)/h )

Typically hyperrectangular windows with H(u) = 1 for all |u_j| < 1 are used, or hard-sphere windows with 1 inside and 0 outside, or some other localized functions; h is called a smoothing parameter.

Number of points inside:

k = Σ_{i=1}^n H( (X^(i) − X)/h )

Density estimate:

P(X) = k/(nV) = (1/(nh^d)) Σ_{i=1}^n H( (X^(i) − X)/h )

Any function with H(u) ≥ 0 integrating to 1 may be used as a kernel.
Example with rectangles
With large h strong smoothing is achieved (imagine a window covering all the data ...).

Details are picked up when h is small, the general shape when it is large.

Use a smooth function such as a Gaussian as H(u); if it is normalized then the final density is also normalized:

∫ P(x) dx = (1/(nh)) Σ_{i=1}^n ∫ H( (x^(i) − x)/h ) dx = 1,  since ∫ H(u) du = 1
Example with Gaussians
The dispersion h is also called here the smoothing or regularization parameter.

A. Webb, Chapter 3.5, has a good explanation of Parzen windows.
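A minimal sketch (my own; gaussian_parzen, the two-component sample, and the h values are assumptions) of Parzen windows with a normalized Gaussian kernel, showing how the dispersion h acts as the smoothing/regularization parameter:

```python
import numpy as np

def gaussian_parzen(x_data, x_grid, h):
    u = (x_data[None, :] - x_grid[:, None]) / h
    H = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)     # normalized Gaussian kernel
    return H.sum(axis=1) / (len(x_data) * h)

rng = np.random.default_rng(2)
sample = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(2, 1.0, 150)])
grid = np.linspace(-5, 5, 11)
print(np.round(gaussian_parzen(sample, grid, h=0.5), 3))   # detailed estimate
print(np.round(gaussian_parzen(sample, grid, h=2.0), 3))   # oversmoothed estimate
```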


Idea
Assume that P(X) is a combination of some smooth functions φ_i(X):

P(X) = Σ_{i=1}^m W_i φ_i(X)

and use an iterative algorithm that adapts the density to the incoming data. Estimate the density P(X|C) for each class separately.

Since the calculation of parameters may be done on a network of independent processors, this leads to the basis set networks, such as the radial basis set networks.

This may be used for function approximation, classification, and discovery of logical rules by covering algorithms.
Computational Intelligence:
Methods and Applications

Lecture 29
Approximation theory, RBF and
SFN networks
Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch
Basis set functions
A combination of m functions

Φ(X) = Σ_{i=1}^m W_i φ_i(X)

may be used for discrimination or for density estimation.

What type of functions are useful here?

Most basis functions are composed of two functions, φ(X) = g(f(X)):

f(X): activation, defining how to use the input features, returning a scalar f;
g(f): output, converting the activation into a new, transformed feature.

Example: a multivariate Gaussian function, localized at R:

f(X) = ||X − R||;  g(f) = exp(−f²)

The activation function computes a distance, the output function localizes it around zero.
Radial functions
General form of the multivariate Gaussian:

f(X) = ||X − R||_Σ = [ (X − R)^T Σ^{-1} (X − R) ]^{1/2};  g(f) = exp(−f²)

This is a radial basis function (RBF) with Mahalanobis distance and Gaussian decay, a popular choice in neural networks and approximation theory. Radial functions are spherically symmetric with respect to some center. Some examples of radial functions:

Distance:               f(r) = r = ||X − R||
Inverse multiquadratic: h(r) = (σ² + r²)^(−α),  α > 0
Multiquadratic:         h(r) = (σ² + r²)^β,  1 > β > 0
Thin splines:           h(r) = (σr)² ln(σr)
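A minimal sketch (my own; the default σ, α, β values are assumptions) implementing the radial functions listed above as functions of the distance r:

```python
import numpy as np

def gaussian(r):
    return np.exp(-r**2)

def inverse_multiquadratic(r, sigma=1.0, alpha=1.0):
    return (sigma**2 + r**2) ** (-alpha)

def multiquadratic(r, sigma=1.0, beta=0.5):
    return (sigma**2 + r**2) ** beta

def thin_plate_spline(r, sigma=1.0):
    # (sigma*r)^2 * ln(sigma*r), taken as 0 at r = 0 (its limit value)
    sr = sigma * np.asarray(r, dtype=float)
    out = np.zeros_like(sr)
    mask = sr > 0
    out[mask] = sr[mask]**2 * np.log(sr[mask])
    return out

r = np.linspace(0.0, 3.0, 7)
print(np.round(gaussian(r), 3))
print(np.round(inverse_multiquadratic(r), 3))
print(np.round(multiquadratic(r), 3))
print(np.round(thin_plate_spline(r), 3))
```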
G + r functions

Multivariate Gaussian function and its contour:

G(r) = exp(−r²);  r = ||X − R||_Σ

Distance function and its contour:

h_d(r) = r = ||X − R||

Multiquadratic and thin spline

Multiquadratic and an inverse multiquadratic:

h_α(r) = (σ² + r²)^(−α),  α = 1
h_β(r) = (σ² + r²)^β,  β = 1/2

Thin spline function:

h_s(r) = (σr)² ln(σr)

All these functions are useful in the theory of function approximation.
Scalar product activation
Radial functions are useful for density estimation and function approximation. For discrimination, i.e. the creation of decision borders, an activation function equal to a linear combination of inputs is most useful:

f(X; W) = W·X = Σ_{i=1}^N W_i X_i

Note that this activation may be presented as:

W·X = ½ ( ||W||² + ||X||² − ||W − X||² ) = L(W, X) − D(W, X)²

where L(W, X) = ½(||W||² + ||X||²) and D(W, X)² = ½||W − X||².

The first term L is constant if the lengths of W and X are fixed.

This is true for standardized data vectors; the square of the Euclidean distance is equivalent (up to a constant) to a scalar product!

If ||X|| = 1, replace W·X by ||W − X||² and the decision borders will still be linear; but using various other distance functions instead of the Euclidean one will lead to non-linear decision borders!
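A minimal numerical check (my own sketch, with random vectors) of the identity above: for unit-length vectors the scalar product and the squared Euclidean distance differ only by a constant, so they define the same linear decision borders:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=5); W /= np.linalg.norm(W)   # unit-length weight vector
X = rng.normal(size=5); X /= np.linalg.norm(X)   # unit-length data vector

lhs = W @ X
rhs = 0.5 * (W @ W + X @ X - (W - X) @ (W - X))
print(lhs, rhs)                # identical up to rounding
print((W - X) @ (W - X))       # equals 2 - 2*W.X when ||W|| = ||X|| = 1
```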
More basis set functions
More sophisticated combinations of activation functions are also useful, for example:

f(X; W, D) = α W·X + β ||X − D||

This is a combination of distance-based activation with scalar-product activation, allowing very flexible PDF/decision border shapes to be achieved. Another interesting choice is a separable activation function:

f(X; θ) = Π_{i=1}^d f_i(X_i; θ_i)

Separable functions with Gaussian factors have a radial form, but the Gaussian is the only localized radial function that is also separable.

The f_i(X_i; θ_i) factors may represent probabilities (like in the Naive Bayes method), estimated from histograms using Parzen windows, or may be modeled using some functional form or a logical rule.
Output functions
Gaussians and similar bell-shaped functions are useful to localize the output in some region of space.
For discrimination the weighted combination f(X; W) = W·X is filtered through a step function, or, to create a gradual change, through a function with a sigmoidal shape (called a squashing function), such as the logistic function:

σ(x; β) = 1 / (1 + exp(−βx)) ∈ (0, 1)

The parameter β sets the slope of the sigmoidal function.

Other commonly used functions are:

tanh(βf) ∈ (−1, +1), similar to the logistic function;
the semi-linear function: first constant −1, then linear, and then constant +1.
Convergence properties
Multivariate Gaussians and weighted sigmoidal functions may approximate any function: such systems are universal approximators.

The choice of functions determines the speed of convergence of the approximation and the number of functions needed for the approximation.

The approximation error in d-dimensional spaces using weighted activation with sigmoidal functions does not depend on d; the rate of convergence with m functions is O(1/m).

Polynomials, orthogonal polynomials, etc. need for reliable estimation a number of points that grows exponentially with d, making them useless for high-dimensional problems! Their error convergence rate is O(1/n^(1/d)).

In 2-D we need 10 times more data points to achieve the same error as in 1-D, but in 10-D we need 10G (10^10) times more points!
Radial basis networks (RBF)
RBF is a linear approximation in the space of radial basis functions:

Φ(X; θ) = Σ_{i=1}^m W_i φ(||X − X^(i)||; θ_i) = Σ_{i=1}^m W_i φ_i(X)

Such computations are frequently presented in a network form:

input nodes: the X_i values;
internal (hidden) nodes: the φ_i(X) functions;
outgoing connections: the W_i coefficients;
output node: summation, giving Φ(X; θ).

Sometimes RBF networks are called neural, due to the inspiration for their development.
RBF for approximation
RBF networks may be used for function approximation, or for classification with infinitely many classes. The function should pass through the points:

Φ(X^(i); θ) = Y^(i),  i = 1...n

The approximation function should also be smooth to avoid high variance of the model, but not too smooth, to avoid high bias. Taking n identical functions centered at the data vectors:

Φ(X^(i); θ) = Σ_{j=1}^n W_j φ(||X^(i) − X^(j)||) = Σ_{j=1}^n H_ij W_j = Y^(i)

HW = Y  ⇒  W = H^{-1} Y

If the matrix H is not too big and non-singular this will work; in practice many iterative schemes to solve the approximation problem have been devised. For classification Y^(i) = 0 or 1.
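A minimal sketch (my own, with Gaussian kernels and assumed data; not the lecture's code) of exact RBF interpolation: build H_ij = φ(||X^(i) − X^(j)||) and solve HW = Y directly. In practice H may be ill-conditioned, which is one reason iterative or regularized schemes are preferred.

```python
import numpy as np

def rbf_fit(X, Y, h=1.0):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    H = np.exp(-(d / h) ** 2)                                    # H_ij = G(||X_i - X_j||)
    return np.linalg.solve(H, Y)                                 # W = H^{-1} Y

def rbf_predict(X_train, W, X_new, h=1.0):
    d = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=-1)
    return np.exp(-(d / h) ** 2) @ W

rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(30, 1))
Y = np.sin(X[:, 0])
W = rbf_fit(X, Y)
print(np.round(rbf_predict(X, W, X[:5]), 3))   # reproduces the targets at the centers
print(np.round(Y[:5], 3))
```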
Separable Function Networks (SFN)
For knowledge discovery and for mixtures of Naive Bayes models separable functions are preferred. Each component function

f^(j)(X; θ^(j)) = Π_{i=1}^d f_i(X_i; θ_i^(j))

is represented by a single node and, if localized functions are used, may represent some local conditions.

A linear combination of these component functions:

F_c(X; θ, W^c) = Σ_{j=1}^{n(c)} W_jc f^(j)(X; θ^(j))

specifies the output; several outputs F_c are defined, for different classes, conclusions, class-conditional probability distributions, etc.
SFN for logical rules
If the component functions are rectangular:

f_i(X_i; θ_i^(j)) = 1 if X_i ∈ [θ_{i−}^(j), θ_{i+}^(j)],  0 otherwise

then the product function realized by the node is a hyperrectangle, and it may represent a crisp logic rule:

IF X_1 ∈ [θ_{1−}^(j), θ_{1+}^(j)] ∧ ... ∧ X_i ∈ [θ_{i−}^(j), θ_{i+}^(j)] ∧ ... ∧ X_d ∈ [θ_{d−}^(j), θ_{d+}^(j)]
THEN Fact^(j) = True

Conditions that cover the whole data may be deleted.
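A minimal sketch (my own; the function name crisp_rule and the intervals are hypothetical) showing that a node with rectangular component functions is just a hyperrectangle membership test, i.e. a crisp rule:

```python
import numpy as np

def crisp_rule(X, lower, upper):
    """Product of rectangular factors: 1 inside the hyperrectangle, 0 outside."""
    X, lower, upper = map(np.asarray, (X, lower, upper))
    return float(np.all((X >= lower) & (X <= upper)))

# IF X1 in [0, 1] AND X2 in [2, 5] THEN Fact = True
lower, upper = [0.0, 2.0], [1.0, 5.0]
print(crisp_rule([0.5, 3.0], lower, upper))   # 1.0 -> rule fires
print(crisp_rule([1.5, 3.0], lower, upper))   # 0.0 -> rule does not fire
```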
SFN rules
The final function is a sum of all rules for a given fact (class):

F_c(X; θ, W^c) = Σ_{j=1}^{n(c)} W_jc f^(j)(X; θ^(j))

The output weights are either:

all W_j = 1, so all rules are on an equal footing;
W_j ~ rule precision (confidence): the ratio of the number of vectors correctly covered by the rule to the number of all elements covered,

W_j = N(X ∈ C | Rule_j) / N(X | Rule_j)

This may additionally be multiplied by the coverage of the rule:

W_j = N(X ∈ C | Rule_j)² / [ N(X | Rule_j) · N(X ∈ C) ]

W may also be fitted to the data to increase the accuracy of predictions.
Rules with weighted conditions
Instead of rectangular functions, Gaussian, triangular or trapezoidal functions may be used to evaluate the degree (not always equivalent to a probability) to which a condition is fulfilled.

For example, triangular functions:

f(X_i; D_i, σ_i) = max( 0, 1 − |X_i − D_i| / σ_i )

A fuzzy rule based on triangular membership functions is a product of such functions (conditions):

f(X; D, σ) = Π_{i=1}^d f_i(X_i; D_i, σ_i)

The conclusion is highly justified in areas where f() is large (see the membership shapes in the figure).
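A minimal sketch (my own; the centers D and widths σ are hypothetical) of a fuzzy rule as a product of triangular membership functions, as described above:

```python
import numpy as np

def triangular(x, center, width):
    return np.maximum(0.0, 1.0 - np.abs(x - center) / width)

def fuzzy_rule(X, centers, widths):
    X, centers, widths = map(np.asarray, (X, centers, widths))
    return float(np.prod(triangular(X, centers, widths)))

centers, widths = [0.0, 3.0], [1.0, 2.0]
print(fuzzy_rule([0.0, 3.0], centers, widths))   # 1.0  -> fully satisfied
print(fuzzy_rule([0.5, 4.0], centers, widths))   # 0.25 -> partially satisfied
print(fuzzy_rule([2.0, 3.0], centers, widths))   # 0.0  -> first condition fails
```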
RBFs and SFNs
Many basis set expansions have been proposed in approximation theory. In some branches of science and engineering such expansions have been widely used, for example in computational chemistry.
There is no particular reason why radial functions should be used, but all basis set expansions are now mistakenly called RBFs ...
In practice Gaussian functions are used most often, and Gaussian approximators and classifiers have been used long before RBFs.
Gaussian functions are also separable, so RBF = SFN for Gaussians.
For other functions:
SFNs have a natural interpretation in terms of fuzzy-logic membership functions and can be trained as neurofuzzy systems.
SFNs can be used to extract logical (crisp and fuzzy) rules from data.
SFNs may be treated as an extension of Naive Bayes, with voting committees of NB models.
SFNs may be used in combinatorial reasoning (see Lecture 31).
but remember ...
that all this is just a poor approximation to Bayesian analysis.

It allows us to model situations where we have linguistic knowledge but no data; in Bayesian terms one may say that we guess prior distributions from rough descriptions and improve the results later by collecting real data.

Example: RBF regression

Neural Java tutorial: http://diwww.epfl.ch/mantra/tutorial/

Transfer function interactive tutorial.


Computational Intelligence:
Methods and Applications

Lecture 30
Neurofuzzy system FSM
and covering algorithms.

Source: Włodzisław Duch; Dept. of Informatics, UMK; Google: W Duch
Training FSM network
Parameters of the network nodes may be estimated using maximum likelihood Expectation Maximization (EM) learning. Computationally simpler iterative schemes have been proposed.

An outline of the FSM (Feature Space Mapping) separable function network algorithm implemented in GhostMiner:

Select the type of functions and the desired accuracy.
Initialize the network parameters: find the main clusters, their centers and dispersions; include cluster rotations.
Adaptation phase: read the training data; if an error is made, adapt the parameters of the network to reduce it: move the closest centers towards the data, increase dispersions.
Growth phase: if accuracy cannot be improved further, add new nodes (functions) in areas where most errors occur.
Cleaning phase: remove functions with the smallest coverage, retrain.
Example 1: Wine rules
Select rectangular functions; the default initialization is based on histograms, looking for clusters around maxima in each dimension.

Create the simplest model, starting from a low learning accuracy, 0.90.

The FSM window shows the convergence and the number of neurons (logical rules for rectangular functions) created by the FSM system.

Different rules with similar accuracy exist (especially for low accuracy), and the learning algorithm is stochastic (the data is presented in randomized order), so many rule sets are created.
Experiments with Wine rules
Run FSM with different parameters and note how different sets of rules are generated.
FSM may discover new, simple rules that trees will not find, for example:

if proline > 929.5 then class 1 (48 covered, 3 errors, but 2 corrected by other rules);
if color < 3.792 then class 2 (63 cases, 60 correct, 3 errors).

Trees generate a hierarchical path; FSM covers the data samples with rectangular functions, minimizing the number of features used.

FSM includes stochastic learning (samples are randomized).

Weak point: large variance for high-accuracy models.
Strong point: many simple models may be generated, and experts may like some more than others.
Example 2: Pyrimidines
QSAR: Quantitative Structure-Activity Relationship problem.
Given a family of molecules, try to predict their biological activity.
The pyrimidine family has a common template:

R3, R4, R5 are places where chemical groups are substituted. A site may also be empty.

9 features are given per chemical group: group name, polarity, size, hydrogen-bond donor, hydrogen-bond acceptor, pi-donor, pi-acceptor, polarizability, and the sigma effect.
For a single pyrimidine 27 (= 3×9) features are given; evaluation of relative activity strength requires pair-wise comparison: A, B, True(A > B).
There were 54 features (columns) and 2788 compared pairs (rows).
Pyrimidine results
Since the ranking of activities is important, an appropriate measure of success is the Spearman rank-order correlation coefficient (d_i: difference in rankings for pair i, n: number of pairs):

r_S = 1 − 6 Σ_{i=1}^n d_i² / [ n(n² − 1) ]  ∈ [−1, +1]

Perfect agreement gives +1, perfect disagreement −1. For example, for the true ranking X_1 X_2 ... X_n and the predicted ranking X_n X_{n−1} ... X_1 the differences are d_i = (n−1), (n−3), ..., 0 (for odd n) or 1 (for even n), ..., (n−3), (n−1); the sum of d_i² is n(n²−1)/3, so r_S = −1.

5xCV results:
LDA           0.65
CART tree     0.50
FSM (Gauss)   0.77 ± 0.02  (86 nodes)
FSM (crisp)   0.77 ± 0.03  (41 nodes)

41 nodes with rectangular functions are equivalent to 41 crisp logic rules.
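A minimal check (my own sketch) of the Spearman coefficient: a completely reversed ranking gives −1, consistent with Σ d_i² = n(n²−1)/3:

```python
import numpy as np

def spearman(rank_true, rank_pred):
    d = np.asarray(rank_true) - np.asarray(rank_pred)
    n = len(d)
    return 1.0 - 6.0 * np.sum(d**2) / (n * (n**2 - 1))

n = 7
true_ranks = np.arange(1, n + 1)
print(spearman(true_ranks, true_ranks))         # +1.0, perfect agreement
print(spearman(true_ranks, true_ranks[::-1]))   # -1.0, perfect disagreement
```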
Covering algorithms
Many machine learning algorithms for learning rules try to cover as many positive examples as possible.
WEKA contains one such algorithm, called PRISM (a code sketch follows below):

For each class C
  E = training data
  Create a rule R: IF () THEN C (with empty conditions)
  Until there are no more features or R covers only cases from class C
    For each feature A and its possible subset of values (or an interval) consider adding a condition (A = v) or A ∈ [v, v']
    Select the feature A and value v that maximize the rule precision N(C, R)/N(R), i.e. the number of samples from class C covered by R divided by the number of all samples covered by R (ties are broken by selecting the largest N(C, R))
    Add to R: IF ((A = v) ∧ ...) THEN C
  Remove samples covered by R from E
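A minimal sketch of a PRISM-style covering loop (my own simplified version, not the WEKA implementation; the function name, the nominal-feature representation, and the toy data are assumptions):

```python
def learn_rules_for_class(data, labels, target):
    """Greedy covering: grow one rule at a time by adding the (feature = value)
    condition with the best precision for the target class, then remove the
    covered samples and repeat. Assumes discretized (nominal) features."""
    rules = []
    remaining = list(zip(data, labels))
    while any(y == target for _, y in remaining):
        rule = {}
        covered = remaining
        while True:
            candidates = []
            for x, _ in covered:
                for feat, val in x.items():
                    if feat in rule:
                        continue
                    sub = [(xx, yy) for xx, yy in covered if xx[feat] == val]
                    n_pos = sum(yy == target for _, yy in sub)
                    candidates.append((n_pos / len(sub), n_pos, feat, val))
            if not candidates:
                break
            _, _, feat, val = max(candidates)     # best precision, ties by N(C,R)
            rule[feat] = val
            covered = [(xx, yy) for xx, yy in covered if xx[feat] == val]
            if all(yy == target for _, yy in covered):
                break                             # rule covers only class C cases
        rules.append(rule)
        remaining = [(x, y) for x, y in remaining
                     if not all(x.get(f) == v for f, v in rule.items())]
    return rules

# Hypothetical toy data: each sample is a dict of nominal features.
data = [{"color": "red", "size": "big"}, {"color": "red", "size": "small"},
        {"color": "blue", "size": "big"}, {"color": "blue", "size": "small"}]
labels = ["C1", "C1", "C2", "C2"]
print(learn_rules_for_class(data, labels, "C1"))   # e.g. [{'color': 'red'}]
```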
PRISM for Wine
PRISM in the WEKA implementation cannot handle numerical attributes, so discretization is needed first; this is done automatically via the FilteredClassifier approach.

Run PRISM on the Wine data and note that:

PRISM has no parameters to play with!
Discretization determines the complexity of the rules.
Perfect covering is achieved on the whole data.
Rules frequently contain a single condition, sometimes two, rarely more conditions.
Rules require manual simplification.
A large number of rules (>30) is produced.
10xCV accuracy is 86%, the error is 7%, and 7% of vectors remain unclassified: covering leaves gaps in the feature space.
