Lecture 28
Non-parametric density modeling
Histograms change depending on the size of the bin B_i that measures
the frequency P(X \in B_i).
Smoothing histograms may be done by fitting some smooth functions,
such as Gaussians. How good is this estimate?
Why does histogram estimation work?
The probability that a data point comes from some region R (belongs to
some category, etc.) is:

P = \int_R P(X)\, dX
We are given n data points; what is the chance Pr that k of these points
are from region R? If n = k = 1, this Pr = P. In general Pr is the number of
combinations in which k points could be selected out of n, multiplied by
the probability of selecting k points from R, i.e. P^k, and selecting n-k
points not from R, i.e. (1-P)^{n-k}; that is, the distribution is binomial:

\Pr(k) = \binom{n}{k} P^k (1-P)^{n-k}

Expected value: E(k) = nP
Variance: \sigma^2(k) = nP(1-P), hence \sigma^2(k/n) = P(1-P)/n

Since P(X) V_R \approx P \approx k/n, for a large number of samples the
variance of k/n shrinks and the density estimate P(X) \approx k/(n V_R)
becomes reliable.
One may use an iterative algorithm that adapts the density to the
incoming data. Estimate the density P(X|C) for each class C separately.
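As a minimal sketch of this k/(n V_R) estimate in one dimension (all names are hypothetical, not from the lecture):

```python
import numpy as np

def histogram_density(samples, n_bins=10):
    """Density estimate P(X) ~ k / (n * V): k samples fall in a bin of volume V."""
    lo, hi = samples.min(), samples.max()
    counts, edges = np.histogram(samples, bins=np.linspace(lo, hi, n_bins + 1))
    volume = (hi - lo) / n_bins  # bin volume V_R (width, in one dimension)
    return counts / (len(samples) * volume), edges

# Larger n and well-chosen bins give a smoother, more reliable estimate.
rng = np.random.default_rng(0)
density, edges = histogram_density(rng.normal(size=1000), n_bins=20)
print(density.sum() * (edges[1] - edges[0]))  # ~1.0: the estimate integrates to 1
```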
Lecture 29
Approximation theory, RBF and SFN networks
Source: Włodzisław Duch, Dept. of Informatics, UMK; Google: W. Duch
Basis set functions
A combination of m functions:

\varphi(X) = \sum_{i=1}^{m} W_i \phi_i(X)
f(X) = \|X - R\|; \quad g(f) = \exp(-f^2)

Distance: f(r) = r = \|X - R\|
Inverse multiquadratic: h(r) = (\sigma^2 + r^2)^{-\alpha}, \alpha > 0
Multiquadratic: h(r) = (\sigma^2 + r^2)^{\beta}, 1 > \beta > 0
Thin splines: h(r) = (\sigma r)^2 \ln(\sigma r)
G + r functions

G(r) = e^{-r^2}; \quad r = \|X - R\|/\sigma
[Figure: distance function h_d(r) = r = \|X - R\| and its contour.]
Multiquadratic and thin spline:

h_\alpha(r) = (\sigma^2 + r^2)^{-\alpha}, \quad \alpha = 1
h_\beta(r) = (\sigma^2 + r^2)^{\beta}, \quad \beta = 1/2
h_s(r) = (\sigma r)^2 \ln(\sigma r)
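A minimal sketch of these radial profiles in code (parameter defaults are assumptions, not from the slides):

```python
import numpy as np

# Radial profiles as functions of r = ||X - R||; sigma is a scale parameter.
def gaussian(r):
    return np.exp(-r**2)

def inverse_multiquadratic(r, sigma=1.0, alpha=1.0):
    return (sigma**2 + r**2) ** (-alpha)

def multiquadratic(r, sigma=1.0, beta=0.5):
    return (sigma**2 + r**2) ** beta

def thin_spline(r, sigma=1.0):
    r = np.where(r == 0, 1e-12, r)  # avoid log(0) at the center
    return (sigma * r) ** 2 * np.log(sigma * r)

r = np.linspace(0.0, 3.0, 7)
for h in (gaussian, inverse_multiquadratic, multiquadratic, thin_spline):
    print(f"{h.__name__:>22}:", np.round(h(r), 3))
```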
The scalar product may be rewritten using norms and distances:

W \cdot X = \tfrac{1}{2}\left(\|W\|^2 + \|X\|^2 - \|W - X\|^2\right) = L(W, X) - D(W, X)^2

Combinations of both activation types are also used:

f(X; W, D) = a\, W \cdot X + b\, \|X - D\|
The f_i(X; \theta) factors may represent probabilities (as in the Naive Bayes
method), estimated from histograms using Parzen windows,
or may be modeled using some functional form or a logical rule.
Output functions
Gaussians and similar bell-shaped functions are useful to localize the
output in some region of space.
For discrimination, the weighted combination f(X; W) = W \cdot X is filtered
through a step function, or, to create a gradual change, through a function
with sigmoidal shape (called a squashing function), such as the logistic
function:

\sigma(x; \beta) = \frac{1}{1 + \exp(-\beta x)} \in (0, 1)

Parameter \beta sets the slope of the sigmoidal function.
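A minimal sketch of the logistic squashing function and its slope parameter (names assumed):

```python
import numpy as np

def logistic(x, beta=1.0):
    """sigma(x; beta) = 1 / (1 + exp(-beta * x)), with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-beta * x))

x = np.array([-2.0, 0.0, 2.0])
print(logistic(x, beta=1.0))   # gentle slope
print(logistic(x, beta=10.0))  # steep slope: approaches a step function
```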
A combination of radial functions centered at the data vectors:

\varphi(X; \theta) = \sum_{i=1}^{m} W_i \phi\left(\|X - X^{(i)}\|; \theta_i\right) = \sum_{i=1}^{m} W_i \phi_i(X)

Such computations are frequently presented in a network form:
input nodes: X_i values;
internal (hidden) nodes: \phi_i(X) functions;
outgoing connections: W_i coefficients;
output node: summation, giving \varphi(X; \theta).

[Figure: network diagram with inputs X_1, ..., X_4, hidden \phi_i nodes, weights W_i, and a summing output node.]
Sometimes RBF networks are called
neural, due to inspiration for their
development.
RBF for approximation
RBF networks may be used for function approximation, or classification with
infinitely many classes. The function should pass through the points:
\varphi(X^{(i)}; \theta) = Y^{(i)}, \quad i = 1 \ldots n

\varphi(X^{(i)}; \theta) = \sum_{j=1}^{n} W_j \phi\left(\|X^{(i)} - X^{(j)}\|\right) = \sum_{j=1}^{n} H_{ij} W_j = Y^{(i)}

HW = Y \quad \Rightarrow \quad W = H^{-1} Y
If matrix H is not too big and non-singular this will work; in practice
many iterative schemes to solve the approximation problem have been
devised. For classification, Y^{(i)} = 0 or 1.
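A minimal sketch of this exact interpolation scheme with a Gaussian kernel (function names and the sigma value are assumptions, not from the lecture):

```python
import numpy as np

def rbf_fit(X, Y, sigma=0.5):
    """Solve H W = Y, where H_ij = exp(-||X_i - X_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    H = np.exp(-d2 / (2 * sigma**2))
    return np.linalg.solve(H, Y)  # numerically safer than forming H^{-1} Y

def rbf_predict(X_train, W, X_new, sigma=0.5):
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2)) @ W

# Interpolate sin(x) with 10 Gaussian centers placed at the data points:
X = np.linspace(0, np.pi, 10)[:, None]
Y = np.sin(X[:, 0])
W = rbf_fit(X, Y)
print(np.allclose(rbf_predict(X, W, X), Y, atol=1e-6))  # True: exact at data points
```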
Separable Function Networks (SFN)
For knowledge discovery and mixtures of Naive Bayes models,
separable functions are preferred. Each function is a product of
one-dimensional components:

f^{(j)}(X; \theta^{(j)}) = \prod_{i=1}^{d} f_i(X_i; \theta_i^{(j)})
With rectangular (crisp) window components:

f_i(X_i; \theta_i^{(j)}) = \begin{cases} 1 & \text{if } X_i \in [\theta_{i-}^{(j)}, \theta_{i+}^{(j)}] \\ 0 & \text{otherwise} \end{cases}

each component function is equivalent to a logical rule:

IF X_1 \in [\theta_{1-}^{(j)}, \theta_{1+}^{(j)}] \wedge \ldots \wedge X_i \in [\theta_{i-}^{(j)}, \theta_{i+}^{(j)}] \wedge \ldots \wedge X_d \in [\theta_{d-}^{(j)}, \theta_{d+}^{(j)}]
THEN Fact^{(j)} = True
Conditions that cover the whole data may be deleted.
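A minimal sketch of one separable rectangular component as a product of interval indicators (all names hypothetical):

```python
import numpy as np

def rect_component(X, lo, hi):
    """f^(j)(X) = prod_i 1[theta_i- <= X_i <= theta_i+]: 1 inside the
    hyperrectangle, 0 outside, i.e. a product of interval indicators."""
    X, lo, hi = map(np.asarray, (X, lo, hi))
    return float(np.all((X >= lo) & (X <= hi)))

# Rule box in 2-D: IF X1 in [0, 1] AND X2 in [2, 3] THEN Fact = True.
print(rect_component([0.5, 2.5], lo=[0, 2], hi=[1, 3]))  # 1.0 -> rule fires
print(rect_component([0.5, 4.0], lo=[0, 2], hi=[1, 3]))  # 0.0 -> it does not
```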
SFN rules
The final function is a sum of all rules for a given fact (class):

F_c(X; \theta, W_c) = \sum_{j=1}^{n(c)} W_{jc} f^{(j)}(X; \theta^{(j)})
W_j = \frac{N(X \in C \mid \mathrm{Rule}_j)}{N(X \mid \mathrm{Rule}_j)\, N(X \in C)}
Lecture 30
Neurofuzzy system FSM and covering algorithms
if proline > 929.5 then class 1 (48 covered, 3 errors, but 2 corrected by
other rules).
if color < 3.792 then class 2 (63 cases, 60 correct, 3 errors)
Trees generate a hierarchical path; FSM covers the data samples with
rectangular functions, minimizing the number of features used.
9 features are given per chemical group: name, polarity, size,
hydrogen-bond donor, hydrogen-bond acceptor, pi-donor, pi-acceptor,
polarizability, and the sigma effect.
For a single pyrimidine 27 (= 3×9) features are given;
evaluation of relative activity strength requires pair-wise comparisons:
A, B, True(A > B).
There were 54 features (columns), and 2788 pairs compared (rows).
Pyrimidine results
Since the ranking of activities is important, an appropriate measure of
success is the Spearman rank order correlation coefficient:

r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \in [-1, +1]

where d_i is the distance in ranking of pair i and n is the number of pairs.
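A minimal sketch of this coefficient computed directly from two rankings (names assumed):

```python
import numpy as np

def spearman(rank_true, rank_pred):
    """r_S = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), d_i = rank differences."""
    d = np.asarray(rank_true) - np.asarray(rank_pred)
    n = len(d)
    return 1.0 - 6.0 * (d**2).sum() / (n * (n**2 - 1))

ranks = np.arange(1, 11)
print(spearman(ranks, ranks))        # +1.0: perfect agreement
print(spearman(ranks, ranks[::-1]))  # -1.0: perfect disagreement
```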
5xCV results:
LDA 0.65
CART tree 0.50
FSM (Gauss) 0.77 ± 0.02 (86 nodes)
FSM (crisp) 0.77 ± 0.03 (41 nodes)

Perfect agreement gives +1, perfect disagreement gives -1. Example: for the
true ranking X_1 X_2 ... X_n and the predicted ranking X_n X_{n-1} ... X_1,
the differences are d_i = (n-1), (n-3), ..., 0 (or 1, for even n), ..., (n-3), (n-1).
The sum of d_i^2 is n(n^2 - 1)/3, so r_S = 1 - 2 = -1.
The 41 nodes with rectangular functions are equivalent to 41 crisp logic rules.
Covering algorithms
Many machine learning algorithms for learning rules try to cover as
many positive examples as possible.
WEKA contains one such algorithm, called PRISM.
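As a minimal sketch of sequential covering (a generic illustration, not WEKA's actual PRISM code; all names hypothetical): repeatedly pick the candidate condition with the highest precision on the remaining examples, then remove the examples it covers.

```python
import numpy as np

def covering(X, y, conditions):
    """Greedy sequential covering: add the rule with the highest precision
    on the not-yet-covered examples until no positive example remains."""
    rules = []
    remaining = np.ones(len(y), dtype=bool)
    while (y & remaining).any():
        best, best_prec, best_cov = None, 0.0, None
        for feat, thr, op in conditions:  # candidate single-feature rules
            cov = (X[:, feat] > thr) if op == '>' else (X[:, feat] < thr)
            cov &= remaining
            prec = (y & cov).sum() / cov.sum() if cov.any() else 0.0
            if prec > best_prec:
                best, best_prec, best_cov = (feat, thr, op), prec, cov
        if best is None:
            break  # no condition covers any remaining positive example
        rules.append(best)
        remaining &= ~best_cov  # remove the covered examples
    return rules

# Toy data: one feature, positives are the large values.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([False, False, True, True])
print(covering(X, y, conditions=[(0, 2.5, '>'), (0, 1.5, '<')]))  # [(0, 2.5, '>')]
```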