
A STUDY OF JOINT CLASSIFIER AND FEATURE OPTIMIZATION: THEORY AND ANALYSIS

By FAN MAO

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2007

2007 Fan Mao

To my Mom, for her enormous patience and unfailing love.

ACKNOWLEDGMENTS

I would like to thank my thesis adviser, Dr. Paul Gader, for his encouragement and valuable advice, both on the general research direction and on the specific experimental details. As a beginner, I sincerely appreciate the opportunity to get direct help from an expert in this field. I also thank Dr. Arunava Banerjee and Dr. Joseph Wilson for serving on my committee and reading my thesis. Special thanks go to Xuping Zhang. We had many interesting discussions over the last half year, many of which greatly inspired me and informed my experiment designs.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
ABSTRACT

CHAPTER 1  VARIABLE SELECTION VIA JCFO
  1.1 Introduction
  1.2 Origin
    1.2.1 LASSO
    1.2.2 The Prior of β
  1.3 JCFO
  1.4 EM Algorithm for JCFO
  1.5 Complexity Analysis
  1.6 Alternative Approaches
    1.6.1 Sparse SVM
    1.6.2 Relevance Vector Machine

CHAPTER 2  ANALYSIS OF JCFO
  2.1 Brief Introduction
  2.2 Irrelevancy
    2.2.1 Uniformly Distributed Features
    2.2.2 Non-informative Noise Features
  2.3 Redundancy
    2.3.1 Oracle Features
    2.3.2 Duplicate Features
    2.3.3 Similar Features
    2.3.4 Gradually More Informative Features
    2.3.5 Indispensable Features
  2.4 Nonlinearly Separable Datasets
  2.5 Discussion and Modification
  2.6 Conclusion

APPENDIX A  SOME IMPORTANT DERIVATIONS OF EQUATIONS
APPENDIX B  THE PSEUDOCODES FOR EXPERIMENTS DESIGN
LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

1.1. Comparisons of computation time of θ and other parts
2.1. Feature divergence of HH
2.2. Feature divergence of Crabs
2.3. Uniformly distributed feature in Crabs
2.4. Uniformly distributed feature in HH
2.5. Non-informative noise in HH and Crabs
2.6. Oracle feature
2.7. Duplicate feature weights on HH
2.8. Duplicate feature weights on Crabs
2.9. Percentage of either of two identical features' weights being set to zero in HH
2.10. Percentage of either of two identical features' weights being set to zero in Crabs
2.11. Five features
2.12. Ten features
2.13. Fifteen features
2.14. Comparisons of JCFO, non-kernelized ARD and kernelized ARD
2.15. Comparisons of JCFO and kernelized ARD with an added non-informative irrelevant feature
2.16. Comparisons of JCFO and kernelized ARD with an added similar redundant feature
2.17. Comparisons of the three ARD methods

LIST OF FIGURES

1.1. Gaussian (dotted) vs. Laplacian (solid) prior
1.2. Logarithm of gamma distribution. From top down: a=1e-2, b=1e2; a=1e-3, b=1e3; a=1e-4, b=1e4
2.1. Two Gaussian classes that can be classified by either the x or y axis
2.2. Weights assigned by JCFO (from top down: x, y and z axes)
2.3. Weights assigned by ARD (from top down: x, y and z axes)
2.4. Two Gaussians that can only be classified by both x and y axes
2.5. Weights assigned by JCFO (from top down: x, y and z axes)
2.6. Weights assigned by ARD (from top down: x, y and z axes)
2.7. Cross data
2.8. Ellipse data

LIST OF SYMBOLS

x^i           the ith input object vector
x^i_j         the jth element of the ith input object vector
β             weight vector
‖·‖_p         the l_p-norm
I             identity matrix
0             zero vector
N(v | 0, 1)   zero-mean, unit-variance normal density evaluated at v
sgn(·)        sign function
H             design matrix
(·)_+         positive-part operator
Φ(z)          Gaussian cdf, Φ(z) = ∫_{-∞}^{z} N(x | 0, 1) dx
E[·]          expectation
β^(t)         the estimate of β in the tth iteration
O(·)          big-O notation of complexity
∘             element-wise Hadamard matrix multiplication
LIST OF ABBREVIATIONS

JCFO: Joint Classifier and Feature Optimization
LASSO: Least Absolute Shrinkage and Selection Operator
RBF: Radial Basis Function
SVM: Support Vector Machine
RVM: Relevance Vector Machine
ARD: Automatic Relevance Determination


Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

A STUDY OF JOINT CLASSIFIER AND FEATURE OPTIMIZATION: THEORY AND ANALYSIS

By Fan Mao

December 2007
Chair: Paul Gader
Major: Computer Engineering

Feature selection is a major focus in the modern processing of high-dimensional datasets with many irrelevant and redundant features and variables, such as text mining and gene expression array analysis. An ideal feature selection algorithm extracts the most representative features while eliminating the non-informative ones, achieving both identification of the significant features and computational efficiency. Furthermore, feature selection is also important for avoiding over-fitting and reducing the generalization error in regression estimation, where feature sparsity is preferred. In the first half of this thesis, we provide a thorough analysis of an existing state-of-the-art Bayesian feature selection method, presenting its theoretical background, implementation details and computational complexity. In the second half, we analyze its performance on several specifically designed experiments with real and synthetic datasets, point out certain limitations in practice, and finally give a modification.


CHAPTER 1
VARIABLE SELECTION VIA JCFO

1.1 Introduction

Joint Classifier and Feature Optimization (JCFO) was first introduced by (Krishnapuram et al., 2004). It is directly inspired by (Figueiredo, 2003), achieving sparsity in feature selection by driving feature weights to zero. Compared to traditional ridge regression, which shrinks the range of the regression parameters (or simply the weight coefficients of a linear classifier) but rarely sets them directly to zero, JCFO inherits the spirit of LASSO (Tibshirani, 1996): driving some of the parameters exactly to zero, which is equivalent to removing the corresponding features. In this way, it is claimed, JCFO eliminates redundant and irrelevant features. The remainder of this chapter is arranged as follows: Section 1.2 takes a close look at how this idea was derived; Section 1.3 gives the mathematical structure of JCFO; Sections 1.4 and 1.5 illustrate the EM algorithm used to derive the learning algorithm for JCFO and analyze its complexity. The last section briefly introduces two other approaches with similar functionality.

1.2 Origin

1.2.1 LASSO
Suppose we have a data set $Z = \{(x^i, y_i)\}_{i=1}^{n}$, where $x^i = (x_1^i, x_2^i, \ldots, x_k^i)$ is the ith input variable vector, whose elements are called features, and the $y_i$ are the responses or class labels (this thesis only considers $y_i \in \{0,1\}$). Ordinary least squares regression finds $\beta = (\beta_1, \beta_2, \ldots, \beta_k)^T$ minimizing

$$\sum_{i=1}^{n}\Big( y_i - \sum_{j=1}^{k} \beta_j x_j^i \Big)^2 \qquad (1.1)$$


The solution is the well-known least squares estimate $\hat{\beta}^0 = (X^TX)^{-1}X^Ty$. As mentioned in (Hastie et al., 2001), it often has low bias but large variance, especially when $X^TX$ is ill-conditioned, which tends to incur over-fitting. To improve prediction accuracy we usually shrink, or set to zero, some elements of $\beta$ in order to achieve parameter sparsity. If this is done appropriately, we also perform an implicit feature selection on our dataset, getting rid of the insignificant features and extracting the more informative ones. In addition to simplifying the structure of our estimation functions, according to (Herbrich, 2002), the sparsity of the weight vector also plays a key role in controlling the generalization error, which we will discuss in the next section. Ridge regression, a revised approach that penalizes large $\beta_j$, is

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}\Big( y_i - \sum_{j=1}^{k} \beta_j x_j^i \Big)^2 + \lambda \sum_{j=1}^{k} \beta_j^2 \qquad (1.2)$$

Here $\lambda$ is a shrinkage coefficient that adjusts the ratio of the squared $\ell_2$-norm of $\beta$ to the residual sum of squares in the objective function. The solution of (1.2) is $(1+\gamma)^{-1}\hat{\beta}^0$ (Tibshirani, 1996), where $\gamma$ depends on $\lambda$ and $X$. This shows that ridge regression shrinks $\hat{\beta}^0$ to a fraction of itself but rarely sets its elements exactly to zero, and hence cannot completely achieve the goal of feature selection. As an alternative approach, (Tibshirani, 1996) proposed changing (1.2) to

$$\sum_{i=1}^{n}\Big( y_i - \sum_{j=1}^{k} \beta_j x_j^i \Big)^2 + \lambda \sum_{j=1}^{k} |\beta_j| \qquad (1.3)$$

Or equivalently,

$$(y - X\beta)^T (y - X\beta) + \lambda \|\beta\|_1 \qquad (1.4)$$

where $\|\cdot\|_1$ denotes the $\ell_1$-norm. This is called the Least Absolute Shrinkage and Selection Operator (LASSO). To see why the $\ell_1$-norm favors more sparsity of $\beta$, note that


$\|(1/\sqrt{2},\, 1/\sqrt{2})\|_2 = \|(1, 0)\|_2 = 1$, but $\|(1/\sqrt{2},\, 1/\sqrt{2})\|_1 = \sqrt{2} > \|(1, 0)\|_1 = 1$: for vectors of equal $\ell_2$-norm, the $\ell_1$ penalty is larger for the dense vector, hence it tends to set more elements exactly to zero. By (Tibshirani, 1996), the solution of (1.4) for an orthogonal design $X$ is

$$\hat{\beta}_j = \mathrm{sgn}(\hat{\beta}_j^0)\,\big(|\hat{\beta}_j^0| - \gamma\big)_+ \qquad (1.5)$$

where $\mathrm{sgn}(\cdot)$ denotes the sign function and $(a)_+$ is defined as $(a)_+ = a$ if $a \ge 0$, and $0$ otherwise; $\gamma$ depends on $\lambda$. Intuitively, $\gamma$ can be seen as a threshold that filters out those $\hat{\beta}_j^0$ below a certain magnitude and truncates them to zero. This is a very desirable property, by which our purpose of obtaining sparsity is fulfilled.

1.2.2 The Prior of β

The ARD (Automatic Relevance Determination) approach proposed by (Figueiredo, 2003) inherits this idea of promoting sparsity from LASSO through a Bayesian approach. It considers the regression functional $h$ as linear with respect to $\beta$, so our estimate function would be

$$f(x, \beta) = \sum_{j=1}^{k} \beta_j h_j(x) = \beta^T h(x), \qquad (1.6)$$

where $h(x)$ could be a vector of linear transformations of $x$, nonlinear fixed basis functions, or kernel functions, which make up a so-called design matrix $H$ such that $H_{i,j} = h_j(x^i)$. Further, it assumes that the error $y_i - \beta^T h(x^i)$ is zero-mean Gaussian, $N(0, \sigma^2)$. Hence the likelihood is

$$p(y \,|\, \beta) = N(y \,|\, H\beta, \sigma^2 I), \qquad (1.7)$$

where $I$ is the identity matrix. Note that since we assume the samples are i.i.d. Gaussian, there is no correlation among them. More importantly, this method also assigns a Laplacian prior to $\beta$:

$$p(\beta \,|\, \lambda) = \prod_{i=1}^{k} \frac{\lambda}{2}\exp\{-\lambda |\beta_i|\} = \Big(\frac{\lambda}{2}\Big)^k \exp\{-\lambda \|\beta\|_1\}.$$


The influence that a prior exerts on the sparsity of $\beta$ was discussed in (Herbrich, 2002), where a $N(\mathbf{0}, \lambda I_k)$ prior is used to illustrate that $\beta$'s log-density is proportional to $-\|\beta\|_2^2 = -\sum_{i=1}^{k}\beta_i^2$, whose highest value is attained at $\beta = \mathbf{0}$. To compare the Gaussian and Laplacian priors, we plot both density functions in Figure 1.1. The latter is much more peaked at the origin and therefore favors more of $\beta$'s elements being zero.

Figure 1.1. Gaussian (dotted) vs. Laplacian (solid) prior

The MAP estimate of $\beta$ is given by

$$\hat{\beta} = \arg\min_{\beta}\big\{\|y - H\beta\|_2^2 + 2\sigma^2\lambda\|\beta\|_1\big\} \qquad (1.8)$$

It can easily be seen that this is essentially the same as (1.3). If $H$ is an orthogonal matrix, (1.8) can be solved separately for each $\beta_j$ (see Appendix 1.1 for the detailed derivation):

$$\hat{\beta}_i = \arg\min_{\beta_i}\big\{\beta_i^2 - 2\beta_i (H^Ty)_i + 2\sigma^2\lambda|\beta_i|\big\} = \mathrm{sgn}\big((H^Ty)_i\big)\,\big(|(H^Ty)_i| - \sigma^2\lambda\big)_+ \qquad (1.9)$$
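The soft-thresholding rule of (1.5) and (1.9) is simple enough to state directly in code. The sketch below is only illustrative; the coefficient values and the threshold are invented for the example:

```python
def soft_threshold(b0, gamma):
    # LASSO solution for one coefficient under an orthogonal design:
    # shrink the least-squares estimate b0 toward zero by gamma, and
    # truncate to exactly zero once |b0| falls below the threshold.
    sign = 1.0 if b0 >= 0 else -1.0
    return sign * max(abs(b0) - gamma, 0.0)

# Large coefficients are merely shrunk; small ones are set exactly to zero.
print(soft_threshold(2.5, 0.5))    # 2.0
print(soft_threshold(0.25, 0.5))   # 0.0
print(soft_threshold(-1.75, 0.5))  # -1.25
```

This exact-zeroing behavior is precisely what ridge regression, which rescales every coefficient by the same factor, cannot provide.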

Unfortunately, in the general case $\hat{\beta}_i$ cannot be solved directly from (1.8) due to its non-differentiability at the origin. As a modification, (Figueiredo, 2003) presents a hierarchical-Bayes view of the Laplacian prior, showing it is nothing but a two-level Bayes model, zero mean with independent, exponentially distributed variance: $p(\beta_i \,|\, \tau_i) = N(0, \tau_i)$ and $p(\tau_i \,|\, \gamma) = (\gamma/2)\exp\{-\gamma\tau_i/2\}$, such that:

$$p(\beta_i \,|\, \gamma) = \int_0^{\infty} p(\beta_i \,|\, \tau_i)\,p(\tau_i \,|\, \gamma)\,d\tau_i = \frac{\sqrt{\gamma}}{2}\exp\{-\sqrt{\gamma}\,|\beta_i|\}, \qquad (1.10)$$

where $\tau_i$ can be considered a hidden variable to be calculated by an EM algorithm, while $\gamma$ is the real hyper-parameter we need to specify. (This integration can be found in Appendix 1.2.)

1.3 JCFO

From the same Bayesian viewpoint, JCFO (Krishnapuram et al., 2004) applies a Gaussian cumulative distribution function (probit link) to (1.6) to get a probability measure of how likely an input object belongs to class $c \in \{0,1\}$. To be more precise:
$$P(y = 1 \,|\, x) = \Phi\Big(\beta_0 + \sum_{i=1}^{N}\beta_i K_{\theta}(x, x^i)\Big), \qquad (1.11)$$

where $\Phi(z) = \int_{-\infty}^{z} N(t \,|\, 0, 1)\,dt$. (Note that $\Phi(-z) = 1 - \Phi(z)$.) $K_{\theta}(x, x^j)$ is a symmetric measure of the similarity of two input objects. For example:


$$K_{\theta}(x, x^j) = \Big(1 + \sum_{i=1}^{k}\theta_i\, x_i\, x_i^j\Big)^r \qquad (1.12)$$

for polynomial functions and:

$$K_{\theta}(x, x^j) = \exp\Big\{-\sum_{i=1}^{k}\theta_i^2\,(x_i - x_i^j)^2\Big\} \qquad (1.13)$$

for Radial Basis Functions. (The good attributes of RBF functions will be discussed in Section 2.4.)
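As a concrete sketch of how (1.13) performs feature weighting, the function below computes the kernel for a single pair of input vectors. The vectors and weights are invented for illustration; the essential point is that a zero weight makes the kernel blind to that feature, which is exactly how JCFO removes it:

```python
import math

def rbf_kernel(x, xj, theta):
    # Feature-weighted RBF kernel: each squared feature difference is
    # scaled by theta_i squared before summing, so theta_i = 0 removes
    # feature i from the similarity measure entirely.
    s = sum((t * t) * (a - b) ** 2 for t, a, b in zip(theta, x, xj))
    return math.exp(-s)

x, xj = [1.0, 2.0, 3.0], [1.0, 0.0, 9.0]
# With the third weight zero, the large mismatch in feature 3 is ignored:
print(rbf_kernel(x, xj, [1.0, 1.0, 0.0]))  # exp(-4), about 0.0183
print(rbf_kernel(x, x, [1.0, 1.0, 1.0]))   # 1.0: an object is maximally similar to itself
```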


An apparent disadvantage of (1.6) is that if we do not choose $h$ as a linear function of $x$ (or simply the input vector $x$ itself), the calculation of $\beta$ amounts to selecting the kernel functions of $x$ rather than selecting the features of $x$. This is also why it is conceptually an ARD method. As a major modification that explicitly selects each feature, JCFO assigns each $x_i$ a corresponding parameter $\theta_i$ in (1.12) and (1.13). Since the same $\theta_i$ is applied to the ith element of every input $x^j$, it is equivalent to weighing the significance of the ith feature in the input object. This is how feature selection is incorporated. Another point worth mentioning is that all $\theta_i$ should be non-negative, because $K_{\theta}(x, x^j)$ measures the similarity of two objects as a whole, i.e., by accumulating the differences over all pairs of corresponding elements; were negative values allowed, the contributions of different elements could cancel each other in the kernel. Then, just as we treated $\beta_i$, each $\theta_i$ is given a (truncated) Laplacian prior, too:

$$p(\theta_k \,|\, \upsilon_k) = \begin{cases} 2N(\theta_k \,|\, 0, \upsilon_k) & \text{if } \theta_k \ge 0 \\ 0 & \text{if } \theta_k < 0 \end{cases} \qquad (1.14)$$

where $p(\upsilon_k \,|\, \gamma_2) = (\gamma_2/2)\exp\{-\gamma_2\upsilon_k/2\}$, thus:

$$p(\theta_k \,|\, \gamma_2) = \int p(\theta_k \,|\, \upsilon_k)\,p(\upsilon_k \,|\, \gamma_2)\,d\upsilon_k = \begin{cases} \sqrt{\gamma_2}\,\exp\{-\sqrt{\gamma_2}\,\theta_k\} & \text{if } \theta_k \ge 0 \\ 0 & \text{if } \theta_k < 0 \end{cases} \qquad (1.15)$$
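The scale-mixture identities (1.10) and (1.15) can be sanity-checked numerically: integrating the zero-mean Gaussian against its exponential variance prior should reproduce the Laplacian density. The quadrature grid below is an arbitrary choice made for the check:

```python
import math

def scale_mixture_density(beta, gamma, n=100000, hi=50.0):
    # Approximate the integral over tau of N(beta | 0, tau) * (gamma/2) exp(-gamma tau / 2)
    # with a plain Riemann sum; the integrand vanishes as tau -> 0, so no
    # special handling of the lower endpoint is needed.
    dt = hi / n
    total = 0.0
    for i in range(1, n + 1):
        tau = i * dt
        gauss = math.exp(-beta * beta / (2.0 * tau)) / math.sqrt(2.0 * math.pi * tau)
        expo = 0.5 * gamma * math.exp(-0.5 * gamma * tau)
        total += gauss * expo * dt
    return total

beta, gamma = 1.0, 1.0
approx = scale_mixture_density(beta, gamma)
closed_form = 0.5 * math.sqrt(gamma) * math.exp(-math.sqrt(gamma) * abs(beta))
print(approx, closed_form)  # both about 0.1839
```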

1.4 EM Algorithm for JCFO

From our i.i.d. assumption on the training data, the likelihood function becomes

$$p(D \,|\, \beta, \theta) = \prod_{i=1}^{N}\Phi\Big(\beta_0 + \sum_{j=1}^{N}\beta_j K_{\theta}(x^j, x^i)\Big)^{y_i}\;\Phi\Big(-\beta_0 - \sum_{j=1}^{N}\beta_j K_{\theta}(x^j, x^i)\Big)^{1-y_i} \qquad (1.16)$$


Rather than maximizing this likelihood directly, JCFO provides an EM algorithm by first assuming a random function $z(x, \beta, \theta) = \beta^T h_{\theta}(x) + \varepsilon$, where $h_{\theta}(x) = [1, K_{\theta}(x, x^1), \ldots, K_{\theta}(x, x^N)]^T$ and $\varepsilon \sim N(0, 1)$. It then treats $z = [z_1, z_2, \ldots, z_N]$, $\tau$, and $\upsilon$ as the missing variables whose expectations are calculated in the E step.

E step

The complete log-posterior is $\log p(\beta, \theta \,|\, D, z, \tau, \upsilon) \propto \log p(z \,|\, \beta, \theta, D) + \log p(\beta \,|\, \tau) + \log p(\theta \,|\, \upsilon)$, i.e.,

$$\log p(\beta, \theta \,|\, D, z, \tau, \upsilon) \propto -z^Tz - \beta^T H_{\theta}^T (H_{\theta}\beta - 2z) - \beta^T \Lambda \beta - \theta^T R\,\theta \qquad (1.17)$$

where $\Lambda = \mathrm{diag}(\tau_0^{-1}, \tau_1^{-1}, \ldots, \tau_N^{-1})$ and $R = \mathrm{diag}(\upsilon_1^{-1}, \upsilon_2^{-1}, \ldots, \upsilon_k^{-1})$. To calculate the expectation of (1.17), i.e. $Q(\beta, \theta \,|\, \beta^{(t)}, \theta^{(t)})$, the required expectations with respect to the missing variables are

$$v_i = E\big[z_i \,\big|\, D, \beta^{(t)}, \theta^{(t)}\big] = h_{\theta}(x^i)^T\beta^{(t)} + \frac{(2y_i - 1)\,N\big(h_{\theta}(x^i)^T\beta^{(t)} \,\big|\, 0, 1\big)}{\Phi\big((2y_i - 1)\,h_{\theta}(x^i)^T\beta^{(t)}\big)}, \qquad (1.18)$$

$$\omega_i = E\big[\tau_i^{-1} \,\big|\, D, \beta_i^{(t)}, \gamma_1\big] = \gamma_1\,\big|\beta_i^{(t)}\big|^{-1}, \quad\text{and}\quad \rho_i = E\big[\upsilon_i^{-1} \,\big|\, D, \theta_i^{(t)}, \gamma_2\big] = \gamma_2\,\big(\theta_i^{(t)}\big)^{-1},$$

where $\beta_i^{(t)}$ denotes the estimated value of $\beta_i$ in the tth iteration. (The derivation of (1.18) can be found in Appendix 1.3.)
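The update (1.18) needs only the standard normal pdf and cdf, both expressible through the standard library; a minimal sketch (the score value is an invented illustration):

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def latent_mean(score, y):
    # Expectation of the latent probit variable z_i in (1.18):
    # 'score' plays the role of h_theta(x_i)^T beta^(t), y is the 0/1 label.
    s = 2 * y - 1  # map the {0, 1} label to {-1, +1}
    return score + s * normal_pdf(score) / normal_cdf(s * score)

# At score 0 the correction is +/- phi(0)/Phi(0) = sqrt(2/pi), about 0.7979,
# pulling the latent mean toward the observed class.
print(latent_mean(0.0, 1), latent_mean(0.0, 0))
```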


M step

Now, with the expectations of these missing variables at hand, we can apply MAP estimation to the Q function of (1.17). After dropping all terms irrelevant to $\beta$ and $\theta$, and setting $v = [v_1, v_2, \ldots, v_N]^T$, $\Upsilon = \mathrm{diag}[\omega_1, \omega_2, \ldots, \omega_N]$ and $\Omega = \mathrm{diag}[\rho_1, \rho_2, \ldots, \rho_k]$, we get

$$Q(\beta, \theta \,|\, \beta^{(t)}, \theta^{(t)}) = -\beta^T H_{\theta}^T H_{\theta}\beta + 2\beta^T H_{\theta}^T v - \beta^T \Upsilon \beta - \theta^T \Omega\,\theta. \qquad (1.19)$$

Taking the derivatives of (1.19) with respect to $\beta$ and $\theta_k$ gives

$$\frac{\partial Q}{\partial \beta} = -2H_{\theta}^T H_{\theta}\beta + 2H_{\theta}^T v - 2\Upsilon\beta \qquad (1.20)$$

$$\frac{\partial Q}{\partial \theta_k} = -2\rho_k\theta_k - 2\sum_{i=1}^{N}\sum_{j=1}^{N+1}\Big[(H_{\theta}\beta - v)\,\beta^T \circ \frac{\partial H_{\theta}}{\partial \theta_k}\Big]_{(i,j)} \qquad (1.21)$$

where $\circ$ represents element-wise Hadamard matrix multiplication. By jointly maximizing (1.19) with respect to both $\beta$ and $\theta$, we not only select the kernel functions but also the features of the input objects, provided both $\beta$ and $\theta$ are parsimonious. This is the core motivation of JCFO. One thing that needs to be stressed is that $\beta$ can be solved for directly in each iteration, while $\theta_k$ can only be computed by an approximation method. This in fact increases the computational complexity and, even worse, influences the sparsity of $\theta$. We will repeatedly come back to this issue in Chapter 2.

1.5 Complexity Analysis

What is the complexity of the EM algorithm above in the general case? Since each iteration alternately calculates $\beta$ and $\theta$, we investigate these two scenarios respectively.

Complexity of estimating β

Let us first look at the search for $\beta$, i.e., setting (1.20) to zero:


$$\hat{\beta} = \big(\Upsilon + H_{\theta}^T H_{\theta}\big)^{-1} H_{\theta}^T v \qquad (1.22)$$
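In code, the β update of (1.22) is a single regularized linear solve. The sketch below uses random stand-ins for the design matrix, the E-step targets, and the expected inverse variances (all shapes and values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
H = rng.standard_normal((N, N + 1))              # design matrix rows h_theta(x_i)^T
v = rng.standard_normal(N)                       # E-step targets v_i from (1.18)
upsilon = np.diag(rng.uniform(0.5, 2.0, N + 1))  # Upsilon: expected inverse variances of beta

# Equation (1.22): beta = (Upsilon + H^T H)^{-1} H^T v
beta = np.linalg.solve(upsilon + H.T @ H, H.T @ v)

# beta satisfies the stationarity condition obtained by setting (1.20) to zero.
print(np.max(np.abs((upsilon + H.T @ H) @ beta - H.T @ v)))
```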

i) If kernel functions are applied, the calculation of each inner product is $O(k)$, while common matrix inversion and multiplication of an $N \times N$ matrix in MATLAB are $O(N^c)$, $2 < c < 3$, and $O(N^3)$ respectively. Therefore the whole complexity of (1.22) would be $O(N^2k + N^3 + N^c)$. For low-dimensional datasets with $k < N$, we can simplify it to $O(N^3)$.

ii) If only linear regression is used, there is no need to calculate $H_{\theta}$; thus the complexity is determined by the larger of matrix inversion and multiplication, i.e., $\max(N^c, N^2k)$, $2 < c < 3$, or $O(N^c)$ if we assume $k < N^{c-2}$. Compared with the kernel version, this does not seem to save much computation. However, if the constants hidden in the O-notation are taken into account, the difference is conspicuous. Meanwhile, without the kernel we avoid the calculation of $\theta$ altogether, which also saves enormous computation time, as we will show in the subsequent section.

As an improvement, we can exploit the fact that after a few iterations more and more $\beta_j$ vanish, which means there is no use re-calculating their corresponding kernel functions $K_{\theta}(x^i, x^j)$, $j = 1, 2, \ldots, N$; we can delete those entire rows from $H_{\theta}$ and shrink the size of the design matrix significantly. This desirable property can also be applied to $\theta$, especially in the case of high-dimensional datasets, to alleviate the computational pressure incurred by kernel methods.

Complexity of estimating θ

After getting the value of $\beta$ in the current iteration, $\theta$ is obtained by plugging this value into (1.21) and approximating the maximizer numerically. (The authors of JCFO chose a conjugate gradient method.) Depending on the dataset, the number of iterations required to estimate the minimum of the target function varies remarkably, which makes it difficult to quantify the exact time complexity of calculating $\theta$. Here we compare the time used to compute $\theta$ with that of all other computations by averaging 50 and 100 runs of JCFO on two datasets. The JCFO MATLAB source code is from Krishnapuram, one of the inventors of JCFO. The two hyper-parameters are chosen as $\gamma_1 = 0.002$ and $\gamma_2 = 4$; the $\theta_i$ are initialized to $1/k$, where $k$ is the number of features.

Table 1.1. Comparisons of computation time of θ and other parts.
                         θ (sec)                Other (sec)
Dataset                  mean       std         mean       std
HH (50 runs)             200.3      55.70       129.7      37.97
Crabs (100 runs)         16.27      0.4921      9.524      0.3577

We can see that nearly two-thirds of the computation is used to compute $\theta$. Furthermore, by choosing a conjugate gradient method (as the authors did with fmincon in MATLAB), we cannot assume the minimum will be attained by a parsimonious $\theta$. In other words, since the optimization itself is not sparsity-oriented, enforcing this requirement sometimes leads to poor termination. This is a major weakness of JCFO. To verify these hypotheses, we will examine more of $\theta$'s behavior in Chapter 2.

1.6 Alternative Approaches

1.6.1 Sparse SVM

In the following two sections we briefly discuss two other approaches derived from LASSO that have properties very similar to JCFO's. (B. Schölkopf and A. J. Smola, 2002) introduce a ν-SVM with ε-insensitive loss, which has the following objective function:


$$\min_{w \in \mathcal{H},\ \xi^{(*)} \in \mathbb{R}^N,\ \varepsilon, b \in \mathbb{R}}\ \tau(w, \xi^{(*)}) = \frac{1}{2}\|w\|^2 + C\Big(\nu\varepsilon + \frac{1}{N}\sum_{i=1}^{N}(\xi_i + \xi_i^*)\Big)$$

$$\text{s.t.}\quad (\langle w, x_i\rangle + b) - y_i \le \varepsilon + \xi_i, \qquad y_i - (\langle w, x_i\rangle + b) \le \varepsilon + \xi_i^*, \qquad (1.23)$$

$$\xi_i^{(*)} \ge 0, \qquad \varepsilon \ge 0.$$
From the constraints we know that if a sample falls inside the ε-tube, it gets no penalty in our objective function. The parameter $\xi_i^{(*)}$ denotes the actual distance of a point to the ε-tube. $C$ is a coefficient that controls the ratio of the total $\xi_i^{(*)}$ distances to the norm of the weight vector, both of which we want to minimize; more generally, it controls the balance between sparsity and prediction accuracy. The variable $\nu \in [0, 1]$ is a hyper-parameter that behaves as an upper bound on the fraction of errors (the number of points outside the ε-tube) as well as a lower bound on the fraction of SVs. Now let us take the derivatives with respect to $w$, $\varepsilon$, $\xi$, $b$ in the dual form of (1.23) and set them to zero. We get

$$\max_{\alpha^{(*)} \in \mathbb{R}^N}\ W(\alpha^{(*)}) = \sum_{i=1}^{N}(\alpha_i^* - \alpha_i)\,y_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)\,K(x_i, x_j)$$

$$\text{s.t.}\quad \sum_{i=1}^{N}(\alpha_i - \alpha_i^*) = 0, \qquad \alpha_i^{(*)} \in \Big[0, \frac{C}{N}\Big], \qquad \sum_{i=1}^{N}(\alpha_i + \alpha_i^*) \le C\nu. \qquad (1.24)$$

This is the standard form of the ν-SVM. Since points inside the ε-tube cannot serve as SVs, it intuitively attains more weight-vector sparsity than the ordinary SVM by setting more of the $\alpha_i^{(*)}$ to zero. (J. Bi et al., 2003) propose a modification that adds a term $\sum_{i=1}^{N}\alpha_i$ to the loss function (this revision is called Sparse-SVM) in order to get an even more parsimonious set of SVs. This also inherits the idea of LASSO. Of course, the optimization involves a search algorithm to find both the prior $C$ and $\nu$.

1.6.2 Relevance Vector Machine

The Relevance Vector Machine (RVM) (M. Tipping, 2000) has a design structure very similar to JCFO's. Indeed, it also assumes a Gaussian prior $\beta \sim N(\mathbf{0}, \Theta)$ and $p(t \,|\, \beta) \sim N(X\beta, \sigma^2 I)$, where $\Theta = \mathrm{diag}(\theta_1, \theta_2, \ldots, \theta_k)$; each $\theta_i$ can be considered a hyper-parameter of $\beta$, and $\sigma^2$ is known. These assumptions parallel (1.7) and (1.14). However, instead of calculating the expectations of $t$ and $\tau$ and plugging them back to maximize the posterior, as JCFO does in (1.17) and (1.19), RVM integrates $\beta$ out to get the marginal likelihood:

$$p(t \,|\, \theta, \sigma^2) = \int p(t \,|\, \beta, \sigma^2)\,p(\beta \,|\, \Theta)\,d\beta = (2\pi)^{-\frac{m}{2}}\,\big|\sigma^2 I + X\Theta X^T\big|^{-\frac{1}{2}}\exp\Big\{-\frac{1}{2}\,t^T\big(\sigma^2 I + X\Theta X^T\big)^{-1}t\Big\} \qquad (1.25)$$

Henceforth we can take the derivatives with respect to $\theta_i$ and $\sigma^2$ in order to maximize (1.25). Interestingly, the logarithm of the Gamma distribution $\Gamma(\theta_i \,|\, a, b)$ assigned by (Herbrich, 2002) as the prior of $\theta_i$ has the same effect as the Laplacian prior JCFO uses in promoting the sparsity of $\theta$. We plot the shape of this prior for different $a, b$ in Figure 1.2.

Figure 1.2. Logarithm of gamma distribution. From top down: a=1e-2 b=1e2; a=1e-3 b=1e3; a=1e-4 b=1e4.
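Evaluating (1.25) for a candidate Θ and σ² is a few lines of linear algebra; the sketch below checks it on a case small enough to do by hand (X, t and the hyper-parameter values are arbitrary illustrations):

```python
import numpy as np

def rvm_log_marginal(t, X, theta, sigma2):
    # Log marginal likelihood of (1.25): t ~ N(0, sigma^2 I + X Theta X^T),
    # where Theta = diag(theta) is the prior covariance of the weights.
    m = len(t)
    C = sigma2 * np.eye(m) + X @ np.diag(theta) @ X.T
    _, logdet = np.linalg.slogdet(C)
    quad = t @ np.linalg.solve(C, t)
    return -0.5 * (m * np.log(2.0 * np.pi) + logdet + quad)

# With X = I, Theta = I and sigma^2 = 1 the covariance is 2I, so for t = 0
# the value is -(m/2) log(2 pi) - (m/2) log 2, computable by hand.
val = rvm_log_marginal(np.zeros(2), np.eye(2), np.ones(2), 1.0)
print(val)
```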


CHAPTER 2
ANALYSIS OF JCFO

2.1 Brief Introduction

As the inventors of JCFO claim, its major objectives are twofold: (i) to learn a function that most accurately predicts the class of a new example (classifier design), and (ii) to identify a subset of the features that is most informative about the class distinction (feature selection). In this chapter we investigate these two attributes of JCFO on several real and synthetic datasets. From Experiment 2.2.1 to 2.3.2.2, we train on samples from the 220-sample, 7-feature HH landmine data and the 200-sample, 5-feature Crabs data as our basic datasets, both of which are separable by an appropriate classification algorithm, and then take a 4-fold cross validation. For the rest of the experiments, we generate several special Gaussian distributions. From the performance-evaluation perspective, we compare both JCFO's prediction and feature selection abilities, using the RBF kernel, with the non-kernelized (linear) ARD method presented by (Figueiredo, 2003), which is its direct ancestor. Given our interest, we devote more effort to feature selection, primarily to the elimination of irrelevant and redundant features. In each scenario we contrive a couple of small experiments that test a certain aspect; a more extensive result is then provided, together with an explanation of how it is derived. The pseudo-code of each experiment can be found in Appendix B.

2.2 Irrelevancy

Intuitively, if some features do not vary much from one class to another, meaning they are not sufficiently informative for differentiating the classes, we call them irrelevant features. To be more precise, it is convenient to introduce the between-class divergence as follows:


$$d_k = \frac{1}{2}\Big(\frac{\sigma_1^2}{\sigma_2^2} + \frac{\sigma_2^2}{\sigma_1^2} - 2\Big) + \frac{1}{2}\,(\mu_1 - \mu_2)^2\Big(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\Big),$$

where $k$ denotes the kth feature and $\mu_i$, $\sigma_i^2$ are the mean and variance of feature $k$ in class $i$. Note that $d$ is inversely related to the similarity of the two classes: the more similar the means and variances are, the smaller $d$ becomes. In the extreme case where the two distributions represented by the kth feature totally overlap, $d$ drops to zero; this is equivalent to saying that the kth feature is completely irrelevant. We will compare this concise measure with the feature weight $\theta_k$ assigned by JCFO in the following experiments. The divergences of each feature in HH and Crabs are given in Table 2.1 and Table 2.2.

Table 2.1. Feature divergence of HH


Feature      1        2        3         4        5        6        7
Divergence   0.3300   0.5101   32.7210   0.2304   1.3924   3.0806   3.9373

Table 2.2. Feature divergence of Crabs


Feature      1        2        3        4        5
Divergence   0.0068   0.3756   0.0587   0.0439   0.0326
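The between-class divergence defined above is straightforward to compute from per-class statistics; a small sketch (the inputs are invented illustrations, not the HH or Crabs values):

```python
def divergence(mu1, var1, mu2, var2):
    # Between-class divergence of a single feature: the first term compares
    # the class variances, the second compares the means; identical
    # class-conditional statistics give exactly zero.
    ratio = 0.5 * (var1 / var2 + var2 / var1 - 2.0)
    means = 0.5 * (mu1 - mu2) ** 2 * (1.0 / var1 + 1.0 / var2)
    return ratio + means

# Identical class-conditional distributions: a completely irrelevant feature.
print(divergence(0.0, 1.0, 0.0, 1.0))  # 0.0
# Well-separated means: a highly informative feature.
print(divergence(0.0, 1.0, 3.0, 1.0))  # 9.0
```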

2.2.1 Uniformly Distributed Features

This kind of feature has exactly the same distribution in both classes. Obviously, features like this should be removed. We examine the performance of JCFO and ARD in eliminating uniformly distributed features by adding a constant column as a new feature to Crabs and HH. Test results averaged over 50 runs are shown in Table 2.3 and Table 2.4. Only ARD completely sets the interfering feature's weight to zero on both datasets, while JCFO fails to get rid of the invariant feature on HH. JCFO's time consumption, moreover, is enormous.


Table 2.3. Uniformly distributed feature in Crabs


                                               JCFO       ARD
Train error rate mean (%)                      2          4
Train error rate std (%)                       0          0
Test error rate mean (%)                       6          6
Test error rate std (%)                        0          0
Percentage of the noise weights set to zero    100%       100%
Running time                                   2 hours    1 min

Table 2.4. Uniformly distributed feature in HH


                                               JCFO       ARD
Train error rate mean (%)                      1.49       14.5
Train error rate std (%)                       1.48       2.8e-15
Test error rate mean (%)                       21.7       20
Test error rate std (%)                        4.63       8.5e-15
Percentage of the noise weights set to zero    0%         100%
Running time                                   4 hours    1 min

2.2.2 Non-informative Noise Features

While a uniformly distributed feature is rarely seen in realistic datasets, most irrelevancies hide under the cover of various kinds of random data noise. We can emulate this by adding class-independent Gaussian noise as a feature to our datasets. Although different means and variances can be chosen, such a feature is still non-informative in terms of classification. Our purpose is to see whether JCFO can remove this noise feature. Adding unit-variance Gaussian noise with means of 5, 10 and 15 each time as a new feature, and training 50 times for each, we get an average divergence of 0.0001 for the noise feature. The test results are provided in Table 2.5.

Table 2.5. Non-informative Noise in HH and Crabs


                                 JCFO             ARD
                                 HH      Crabs    HH      Crabs
Train error rate mean (%)        1.55    1.82     14.55   2.40
Train error rate std (%)         1.32    0.83     0.1     1.96
Test error rate mean (%)         27.41   5.00     20.03   3.60
Test error rate std (%)          6.1     1.83     0.47    2.95
Noise weight set to zero (%)     0       8.87     96      99.3


It can be seen that almost none of the corresponding noise weights are set to zero by JCFO. By contrast, ARD almost always sets them to zero. This implies that JCFO does not eliminate irrelevant Gaussian features as well as ARD does. As mentioned in Section 1.5.2, the reason is likely that although α in (1.10) and θ in (1.14) share the same sparsity-promoting mechanism, θ can only be derived approximately in the implementation. This leaves its elements tiny but still nonzero, and thus compromises θ's sparsity. A similar situation recurs in the next section. Moreover, JCFO does not beat ARD in terms of test error. This can also be explained as a failure to realize the sparsity-related reduction of generalization error with respect to θ, since the two methods share the same functionality in the α part.

2.3 Redundancy

A feature is redundant when it has a similar, but not necessarily identical, structure to other features. We can employ a well-known measure of the correlation between two features:

    \rho_{i,j} = \frac{\sum_{k=1}^{N} x_{ki} x_{kj}}{\sqrt{\sum_{k=1}^{N} x_{ki}^2 \, \sum_{k=1}^{N} x_{kj}^2}},

where i and j denote the ith and jth features. The larger ρ is, the more correlated they are; the extreme case is ρ = 1 when they are identical.

2.3.1 Oracle Features

First let us look at a peculiar example: if we include the class label itself as a feature, it is obviously the most informative attribute for distinguishing the classes, and all other features are redundant compared to it. We shall check whether JCFO can directly select this oracle


feature and get rid of all the others; i.e., whether only the θ_i corresponding to this feature is nonzero. We test JCFO and ARD 100 times on HH and Crabs respectively with the label feature added.

Table 2.6. Oracle feature

                                          JCFO   ARD
Train error rate (%)                      0      0
Only the label feature weight nonzero?    Yes    Yes

The purpose of this toy experiment is to determine whether, as feature selection approaches, both methods can identify the most representative feature among relatively subtle ones. This should be considered a basic requirement for feature selection.

2.3.2 Duplicate Features

A duplicate feature is identical to another feature and thus completely redundant; the correlation between the two is exactly 1. This experiment examines whether JCFO can identify a duplicate feature and remove it. We simply replicate a feature in the original dataset and append it as a new feature. We repeat this process on each feature and test whether it or its replica is eliminated.
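The correlation measure ρ defined above is an uncentred cosine similarity and is straightforward to compute; for a duplicate feature it equals exactly 1 (the feature values below are illustrative):

```python
import math

def feature_correlation(xi, xj):
    """Uncentred correlation between two feature columns:
    rho = sum_k xi_k * xj_k / sqrt(sum_k xi_k^2 * sum_k xj_k^2)."""
    num = sum(a * b for a, b in zip(xi, xj))
    den = math.sqrt(sum(a * a for a in xi) * sum(b * b for b in xj))
    return num / den

feature = [0.5, -1.2, 2.0, 0.7, -0.3]
rho_dup = feature_correlation(feature, list(feature))  # duplicate: exactly 1
```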

Table 2.7. Duplicate feature weights on HH


Feature    JCFO                   ARD
           Itself    Duplicate    Itself     Duplicate
1          0.2157    0.2300       0          0
2          0.1438    0.1520       0          0
3          0.3130    0.3539       0          0
4          0.2134    0.2142       0          0
5          0.4208    0.4010       0          0
6          0.2502    0.2454       0          0.3172
7          0.2453    0.2604       -0.0094    0
Train/Test error (%): JCFO 4.082/20.32, ARD 14.63/20.25

Table 2.8. Duplicate feature weights on Crabs


Feature    JCFO                   ARD
           Itself    Duplicate    Itself     Duplicate
1          0         0            0          0
2          0.0981    0.0846       -0.9558    0
3          0.1091    0.1076       0.9050     0
4          0         0            0          0
5          0.0037    0.0032       0          0
Train/Test error (%): JCFO 2/6, ARD 4/6


From the above tables, JCFO apparently assigns each feature approximately the same weight as its duplicate, while ARD sets either the feature or its duplicate (or both) to zero. Meanwhile, considering the test errors, we encounter the situation of Section 2.2.2 again.

2.3.3 Similar Features

We now examine features that are not identical but highly correlated. This section has two parts.

2.3.3.1

This experiment is similar to the duplicate-feature experiment. We coin a new feature by replicating an original feature and mixing it with unit-variance Gaussian noise. We again check each feature to see whether either it or its noisy counterpart is removed, running the test 50 times on HH and 100 times on Crabs and averaging how often either of their weights is set to zero.
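A sketch of how such a similar feature can be coined and the correlation it typically exhibits with its original (the distribution parameters and seed are illustrative assumptions):

```python
import math
import random

random.seed(1)

def uncentred_corr(xi, xj):
    """rho = sum_k xi_k * xj_k / sqrt(sum_k xi_k^2 * sum_k xj_k^2)."""
    num = sum(a * b for a, b in zip(xi, xj))
    return num / math.sqrt(sum(a * a for a in xi) * sum(b * b for b in xj))

# Original feature: draws from an assumed N(10, 2^2) distribution.
original = [random.gauss(10.0, 2.0) for _ in range(200)]
# Similar feature: the original mixed with unit-variance Gaussian noise.
replica = [x + random.gauss(0.0, 1.0) for x in original]

rho = uncentred_corr(original, replica)  # highly but not perfectly correlated
```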

Table 2.9. Percentage of runs in which either of two similar features' weights is set to zero in HH

Feature (ρ)     JCFO    ARD
1 (0.4652)      30      100
2 (0.8951)      30      100
3 (0.9971)      20      100
4 (0.9968)      10      100
5 (0.9992)      10      99
6 (0.9992)      10      100
7 (0.1767)      20      92
Train/Test error (%): JCFO 3.853/21.4, ARD 14.51/20.11

Table 2.10. Percentage of runs in which either of two similar features' weights is set to zero in Crabs

Feature (ρ)     JCFO    ARD
1 (0.9980)      90      100
2 (0.9970)      10      91
3 (0.9995)      30      97
4 (0.9996)      90      99
5 (0.9977)      50      100
Train/Test error (%): JCFO 2.013/4.840, ARD 4.001/5.976

The results on the two datasets again imply that JCFO cannot perform as well as ARD in terms of redundancy elimination.


Figure 2.1. Two Gaussian classes that can be classified by either the x or y axis

2.3.3.2

In Figure 2.1 above, the two ellipsoids can be differentiated by either the x or the y variable, so either can be considered redundant given the other, while z is completely irrelevant. We generate 200 samples each time, half in each class, and run both methods 150 times. We plot the weights corresponding to the three axis features in the following figures.


Figure 2.2. Weights assigned by JCFO (from top to bottom: x, y and z axes)



Figure 2.3. Weights assigned by ARD (from top to bottom: x, y and z axes)

It can be seen that both methods discover the irrelevance of the z axis, though ARD performs much better. As for the redundancy, unfortunately neither method eliminates the x or y axis. More interestingly, the weights assigned by ARD to x and y are almost identical in every run, while those assigned by JCFO are much more chaotic. Note that the redundancy here differs from the previous experiments, since it is more implicit from a machine's perspective. We discuss this further in Section 2.4.

2.3.4 Gradually More Informative Features

In this experiment we generate samples from two multivariate Gaussians whose mean vectors differ by gradually larger amounts in each successive dimension; e.g., suppose μ₁ = (0, 0, ..., 0) and μ₂ = (10, 20, ..., 10k). The variance of each dimension is fixed to one fourth of the largest element difference, i.e., 10k/4 in the above case. Treating each dimension as a feature, we thus contrive a set of gradually more informative features: the last feature has the most distant means and the least overlapping variances, and is therefore the most separable. Compared to it, the rest could be deemed redundant. We run this experiment with 5-, 10- and 15-dimensional data. In each experiment we generate 200 samples and perform 4-fold cross-validation. The process is repeated 50 times for JCFO and 100 times for ARD; we then check whether the most informative feature is kept while the others are deleted. Results are listed in the following three tables.
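The sample generation just described can be sketched as follows (we read the text's "variance 10k/4" literally and apply it to every dimension):

```python
import random

random.seed(2)

def make_gradual_data(k, n_per_class=100):
    """Two k-dimensional Gaussians whose means drift apart dimension by
    dimension: mu1 = (0, ..., 0), mu2 = (10, 20, ..., 10k), each dimension
    having variance 10k/4 as in the text."""
    mu1 = [0.0] * k
    mu2 = [10.0 * (i + 1) for i in range(k)]
    std = (10.0 * k / 4.0) ** 0.5  # standard deviation for variance 10k/4
    data, labels = [], []
    for mu, label in ((mu1, 0), (mu2, 1)):
        for _ in range(n_per_class):
            data.append([random.gauss(m, std) for m in mu])
            labels.append(label)
    return data, labels

X, y = make_gradual_data(k=5)
```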

Table 2.11. Five features


Feature    JCFO mean    JCFO std    JCFO zero (%)    ARD mean    ARD std    ARD zero (%)
1          0.0001       0.0008      14               -0.0057     0.0072     57
2          0.0041       0.0013      0                -0.0352     0.0155     8
3          0.0090       0.0020      0                -0.0819     0.0168     0
4          0.0158       0.0021      0                -0.1486     0.0193     0
5          0.0256       0.0027      0                -0.2339     0.0224     0

Table 2.12. Ten features


Feature    JCFO mean    JCFO std    JCFO zero (%)    ARD mean    ARD std    ARD zero (%)
1          0            0           100              -0.0001     0.0010     95
2          0.0001       0.0003      80               -0.0026     0.0045     72
3          0.0011       0.0019      50               -0.0070     0.0082     52
4          0.0027       0.0028      34               -0.0187     0.0125     24
5          0.0040       0.0039      16               -0.0286     0.0145     14
6          0.0053       0.0049      6                -0.0460     0.0146     4
7          0.0099       0.0071      0                -0.0668     0.0171     0
8          0.0133       0.0075      0                -0.0856     0.0156     0
9          0.0144       0.0087      2                -0.1125     0.0190     0
10         0.0214       0.0106      0                -0.1356     0.0197     0

Table 2.13. Fifteen features


Feature    JCFO mean    JCFO std    JCFO zero (%)    ARD mean    ARD std    ARD zero (%)
1          0            0           100              -0.0002     0.0009     94
2          0.0001       0.0004      92               -0.0009     0.0024     86
3          0.0002       0.0006      84               -0.0022     0.0038     71
4          0.0009       0.0017      68               -0.0046     0.0063     61
5          0.0015       0.0027      62               -0.0062     0.0076     56
6          0.0021       0.0036      52               -0.0118     0.0099     36
7          0.0030       0.0050      44               -0.0143     0.0120     35
8          0.038        0.0050      30               -0.0242     0.0143     18
9          0.0044       0.0069      34               -0.0318     0.0156     10
10         0.0068       0.0089      30               -0.0413     0.0138     2
11         0.0057       0.0075      30               -0.0497     0.0157     2
12         0.0109       0.0120      16               -0.0588     0.0176     2
13         0.0117       0.0110      8                -0.0735     0.0159     0
14         0.0118       0.0136      20               -0.0845     0.0173     0
15         0.0151       0.0134      12               -0.0992     0.0193     0


The above tables indicate that although the most significant features are assigned the largest weights by both methods, the remaining features are not removed but ranked by their significance. Generally, JCFO eliminates more redundant features in the second and third cases. However, Table 2.13 also shows that JCFO sometimes incorrectly deletes the most informative feature (12% of runs for Feature 15). In some scenarios a feature selection method cannot fully identify the redundancy, but it provides a ranking of how much each feature contributes to the classification, which is also an acceptable alternative (Guyon and Elisseeff, 2003).

2.3.5 Indispensable Features

In pattern classification, if a set of features operates as a whole, i.e., no proper subset of it suffices for the classification, we call them indispensable features. Eliminating a feature from an indispensable set causes a selection error. We check how JCFO and ARD deal with this kind of feature. If we treat the x and y variables in Figure 2.4 as two features, they evidently constitute an indispensable feature set, since neither alone can determine the classes.

Figure 2.4. Two Gaussians that can only be classified by both the x and y axes

As in Experiment 2.3.3.2, we generate 200 samples each time, half in each class, and run both methods 150 times. Results are listed below:



Figure 2.5. Weights assigned by JCFO (from top to bottom: x, y and z axes)

Figure 2.6. Weights assigned by ARD (from top to bottom: x, y and z axes)

The fact that neither JCFO nor ARD removes the x or y axis features is a good result. However, the weights assigned by ARD are much more stable. Regarding the irrelevant z axis, ARD does a clean job, as it did in 2.3.3.2. Therefore, we can still rate its performance as better than JCFO's.


2.4 Non-linearly Separable Datasets

The synthetic-data experiments in the previous two sections concentrated on linearly separable features, which is not JCFO's strong suit. In the following two tests, we examine JCFO's performance on two non-linearly separable datasets compared with the ARD method. The 2-D Cross data comprises two crossing Gaussian ellipses as one class and four Gaussians surrounding this cross as the other class, each class having 100 samples (Figure 2.7). The 2-D Ellipse data has two equal-mean Gaussian ellipses as the two classes, each with 50 samples; one has a much wider variance than the other, which makes one class look surrounded by the other (Figure 2.8). Clearly, no single hyperplane can classify either dataset correctly. We run JCFO, non-kernelized ARD and kernelized ARD on both datasets 50 times each. A comparison of their performances is shown in Table 2.14. In the experiment summarized in Table 2.15, a non-informative irrelevant feature was added exactly as described in Section 2.2.2. In the experiment summarized in Table 2.16, one redundant feature was added as described in Section 2.3.3.1.
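The two datasets can be generated along the following lines; the thesis does not list the exact means and spreads, so every numeric parameter below is an illustrative guess that merely reproduces the described shapes:

```python
import random

random.seed(3)

def gauss2d(mu, std, n):
    """n samples from an axis-aligned 2-D Gaussian (one std per axis)."""
    return [[random.gauss(mu[0], std[0]), random.gauss(mu[1], std[1])]
            for _ in range(n)]

def make_cross_data():
    """Cross data sketch: two crossing elongated Gaussians form class 0,
    four surrounding Gaussians form class 1 (parameters are assumptions)."""
    class0 = (gauss2d((0.5, 0.5), (0.20, 0.03), 50)
              + gauss2d((0.5, 0.5), (0.03, 0.20), 50))
    corners = [(0.15, 0.15), (0.15, 0.85), (0.85, 0.15), (0.85, 0.85)]
    class1 = [p for c in corners for p in gauss2d(c, (0.04, 0.04), 25)]
    return class0, class1

def make_ellipse_data():
    """Ellipse data sketch: two equal-mean Gaussians, one much wider, so
    one class appears surrounded by the other (parameters are assumptions)."""
    inner = gauss2d((0.5, 0.5), (0.03, 0.03), 50)
    outer = gauss2d((0.5, 0.5), (0.15, 0.15), 50)
    return inner, outer

c0, c1 = make_cross_data()
inner, outer = make_ellipse_data()
```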

Table 2.14. Comparisons of JCFO, non-kernelized ARD and kernelized ARD


Methods               Cross mean (%)   Cross std   Ellipse mean (%)   Ellipse std
Non-kernelized ARD    50               0           48                 4e-14
Kernelized ARD        8.1              1e-14       8.4                3e-15
JCFO                  4.0              1e-15       12                 1e-14

Table 2.15. Comparisons of JCFO and kernelized ARD with an added non-informative irrelevant feature
Methods               Cross mean (%)   Cross std   Ellipse mean (%)   Ellipse std
Kernelized ARD        17.7             17.6        15.4               5.70
JCFO                  7.32             3.20        13.4               5.03


Table 2.16. Comparisons of JCFO and kernelized ARD with an added similar redundant feature
Methods               Cross mean (%)   Cross std   Ellipse mean (%)   Ellipse std
Kernelized ARD        15.3             4.01        13.4               5.21
JCFO                  8.76             3.77        12.3               4.41

Since non-kernelized ARD performs a linear mapping, it cannot produce a reasonable classification in this scenario. Kernelized ARD and JCFO each perform slightly better on one of the original datasets. JCFO outperforms ARD on both datasets mixed with irrelevant or redundant features, especially on the Cross data. However, the time required to train JCFO still dwarfs that of kernelized ARD, and in these two scenarios JCFO sets none of the parameters corresponding to the noise feature to zero.

Figure 2.7. Cross data



Figure 2.8. Ellipse data

2.5 Discussion and Modification

From the trouble JCFO suffered in our experiments, we reiterate the point made in 2.2.2: JCFO does not reliably achieve sparsity on the feature set and therefore does not realize the advantage of low generalization error due to reduced model complexity. Hence, it does not achieve both of the advantages it claims. A more effective approximation for calculating θ in (1.21) is needed in order to realize JCFO's theoretical superiority. Without such a method at hand, we suggest simply dropping the θ part and returning to a kernelized ARD model, e.g., plugging the RBF kernel function into (1.13). According to Herbrich (2002), this function has the appealing property that each linear combination of kernel functions of the training objects (x₁, x₂, ..., x_N) (see (1.6)) can be viewed as a density estimator in the input space, because it effectively puts a Gaussian on each x_i and weights its contribution to the final density by α_i in (1.6). Regarding feature selection, we can apply a non-kernelized (linear) ARD to


directly weigh each feature. (This, however, requires that the dataset not be non-linearly separable.) As we saw in 2.3.3.2, the x and y variables are each time assigned the same absolute value with different signs, which might help us detect the very implicit redundancy in that scenario. Combining the two approaches, we first use the latter to pre-select the more representative features, then feed them to the former for more precise learning. This increases the efficiency of the whole learning process. Below we compare the performance of non-kernelized ARD, kernelized ARD and their combination on the Crabs data and HH over 100 training runs.
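The density-estimation view of the RBF expansion mentioned above can be illustrated directly: the decision function Σ_i α_i K(x, x_i) is a weighted sum of Gaussian bumps centred on the training points (the kernel width, centres and weights below are illustrative):

```python
import math

def rbf(x, c, width=1.0):
    """Gaussian RBF kernel K(x, c) = exp(-||x - c||^2 / (2 * width^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * width ** 2))

def kernel_expansion(x, centres, alphas, width=1.0):
    """f(x) = sum_i alpha_i K(x, x_i): each training point contributes a
    Gaussian bump weighted by alpha_i, so with non-negative alphas f behaves
    like an (unnormalized) density estimate over the input space."""
    return sum(a * rbf(x, c, width) for a, c in zip(alphas, centres))

centres = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
alphas = [0.5, 0.3, 0.2]
near = kernel_expansion([0.0, 0.0], centres, alphas)   # close to a centre
far = kernel_expansion([10.0, 10.0], centres, alphas)  # far from all centres
```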

Table 2.17. Comparisons of the three ARD methods


Methods           Crabs mean (%)   Crabs std   Crabs iterations   HH mean (%)   HH std   HH iterations
Non-kernelized    3.60             2.95        100                20.03         0.47     100
Kernelized        2.2              0           5                  12.63         6.26     5
Combo             1~30             1~10        2                  9.17          2.11     3

Note that after feature selection on the Crabs data, only two features remain. Kernelized ARD makes the α_i vanish very fast on low-dimensional datasets (here, after only 2 iterations just two α_i are nonzero), which makes the training result unstable (see the mean and std of the combined method on the Crabs data). We must therefore recognize that there are cases where the combination method is inappropriate, and the tradeoff between parameter sparsity and prediction accuracy needs to be balanced.

2.6 Conclusion

In this thesis, we introduced the theoretical background of an existing Bayesian feature selection and classification method in the first chapter, from its origin to implementation details and complexity. In the second chapter, we systematically analyzed its


performances in a series of specially designed experiments, with several comparisons against its direct ancestor. From these experimental results we have seen that even though JCFO is theoretically more ideal in achieving sparsity in both the features and the basis functions, the lack of an effective implementation technique seriously restricts its performance. As an alternative, we suggest returning to the original ARD method, jointly using its kernelized and non-kernelized versions to perform class prediction and feature selection. Though the model thereby becomes less ambitious, its practical simplicity and time-efficiency are preserved, in keeping with our original design purpose.


APPENDIX A
SOME IMPORTANT DERIVATIONS OF EQUATIONS

Equation (1.9)

Consider the general case of minimizing

    f(x) = x^2 - 2ax + 2b|x|

with respect to x, where a and b are constants. This is equivalent to

    \min_x f(x) = \begin{cases} x^2 - 2ax + 2bx, & x > 0 \\ 0, & x = 0 \\ x^2 - 2ax - 2bx, & x < 0. \end{cases}

When x >= 0:

    \arg\min f(x) = a - b   if a - b > 0    (1)
    \arg\min f(x) = 0       if a - b <= 0   (2)

When x <= 0:

    \arg\min f(x) = a + b   if a + b < 0    (3)
    \arg\min f(x) = 0       if a + b >= 0   (4)

Combining the cases:

(1) & (3): b < 0 and x* = argmin(f(a - b), f(a + b)); this gives x* = a + b when a <= 0 and x* = a - b when a > 0, i.e. x* = sgn(a)(|a| - b)_+.
(1) & (4): a - b > 0 and a + b >= 0, so x* = a - b = sgn(a)(|a| - b)_+.
(2) & (3): a - b <= 0 and a + b < 0, so x* = a + b = sgn(a)(|a| - b)_+.
(2) & (4): |a| <= b, so x* = 0 = sgn(a)(|a| - b)_+.

Summarizing the above four scenarios yields

    x* = sgn(a)(|a| - b)_+.
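The closed form above is the familiar soft-thresholding operator; a quick numerical check against a brute-force grid search (the (a, b) pairs are arbitrary test values):

```python
def f(x, a, b):
    """Objective f(x) = x^2 - 2ax + 2b|x| from Equation (1.9)."""
    return x * x - 2 * a * x + 2 * b * abs(x)

def soft_threshold(a, b):
    """Closed-form minimiser x* = sgn(a) (|a| - b)_+ derived above."""
    mag = max(abs(a) - b, 0.0)
    return mag if a > 0 else -mag if a < 0 else 0.0

# Compare against an exhaustive grid search for several (a, b) pairs.
for a, b in [(2.0, 0.5), (-1.5, 0.7), (0.3, 1.0), (-0.2, 0.9)]:
    grid = [i / 1000.0 for i in range(-5000, 5001)]
    x_grid = min(grid, key=lambda x: f(x, a, b))
    assert abs(x_grid - soft_threshold(a, b)) < 1e-2
```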

Integration (1.10)

    \int_0^\infty \frac{1}{\sqrt{2\pi\tau}} e^{-\theta^2/(2\tau)} \cdot \frac{\gamma}{2} e^{-\gamma\tau/2} \, d\tau

Let x = \tau^{1/2}, so d\tau = 2x\,dx:

    = \frac{\gamma}{\sqrt{2\pi}} \int_0^\infty e^{-\frac{\gamma}{2}x^2 - \frac{\theta^2}{2x^2}} \, dx

Applying the tabulated integral \int_0^\infty e^{-a^2 x^2 - b^2/x^2} dx = \frac{\sqrt{\pi}}{2a} e^{-2ab} (Beyer, 1979) with a^2 = \gamma/2 and b^2 = \theta^2/2:

    = \frac{\gamma}{\sqrt{2\pi}} \cdot \frac{\sqrt{\pi}}{2\sqrt{\gamma/2}} \, e^{-2\sqrt{\gamma/2}\sqrt{\theta^2/2}}
    = \frac{\sqrt{\gamma}}{2} \, e^{-\sqrt{\gamma}\,|\theta|}.
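The identity can be checked numerically; here θ and γ denote the weight and the rate parameter as above, and the scale-mixture integral is approximated by a midpoint Riemann sum:

```python
import math

def laplace_mixture_integrand(tau, theta, gamma):
    """N(theta | 0, tau) times the exponential density (gamma/2) e^{-gamma*tau/2}."""
    return (math.exp(-theta ** 2 / (2 * tau)) / math.sqrt(2 * math.pi * tau)
            * (gamma / 2) * math.exp(-gamma * tau / 2))

def numeric_mixture(theta, gamma, upper=50.0, steps=200000):
    """Midpoint-rule approximation of the scale-mixture integral over tau."""
    h = upper / steps
    return sum(laplace_mixture_integrand((i + 0.5) * h, theta, gamma)
               for i in range(steps)) * h

theta, gamma = 1.0, 4.0
closed = (math.sqrt(gamma) / 2) * math.exp(-math.sqrt(gamma) * abs(theta))
numeric = numeric_mixture(theta, gamma)
```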

Expectation (1.18)

Since p(z_i | D, \alpha^{(t)}, \theta^{(t)}) is Gaussian with mean \mu_i = h_{\theta^{(t)}}(x_i)^T \alpha^{(t)} and unit variance, consider y_i = 1, i.e. z_i >= 0:

    p(z_i | D, \alpha^{(t)}, \theta^{(t)}) = \begin{cases} \dfrac{N(z_i \mid \mu_i, 1)}{\Phi(\mu_i)}, & z_i \ge 0 \\ 0, & z_i < 0 \end{cases}

since \int_0^\infty N(z \mid \mu_i, 1)\,dz = \Phi(\mu_i), where \Phi denotes the standard normal CDF. Hence

    v_i = E[z_i \mid D, \alpha^{(t)}, \theta^{(t)}] = \int_0^\infty z_i \, p(z_i \mid D, \alpha^{(t)}, \theta^{(t)}) \, dz_i = \mu_i + \frac{N(\mu_i \mid 0, 1)}{\Phi(\mu_i)}.

The case y_i = 0 follows in a similar way.
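The truncated-Gaussian mean above (μ + N(μ | 0, 1)/Φ(μ)) can be sanity-checked by Monte Carlo; the value μ = 0.5 below is arbitrary:

```python
import math
import random

random.seed(4)

def phi(x):
    """Standard normal pdf."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

mu = 0.5
# Closed form: E[z | z >= 0] for z ~ N(mu, 1), as derived above.
closed = mu + phi(mu) / Phi(mu)

# Monte Carlo: sample z ~ N(mu, 1) and keep only the non-negative draws.
kept = [z for z in (random.gauss(mu, 1.0) for _ in range(200000)) if z >= 0]
estimate = sum(kept) / len(kept)
```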


APPENDIX B
PSEUDOCODE FOR THE EXPERIMENT DESIGNS


2.2 Irrelevancy

NOTES: In this testing unit, our basic datasets are randomly generated, linearly separable. Each feature is normalized to zero-mean, unit variance.
2.2.1 Uniformly Distributed Feature
counter 0; For 1:50 ds load dataset; (n,k) get the matrix size from ds; ds append a column of 1s as the k+1 feature Run JCFO (ds); If theta(k+1) = 0; counter counter + 1; End End Display(counter/50)

2.2.2 Noninformative Noise Features


counter 0; For mu = 5, 10, 15 For 1:50 ds load dataset; (n,k) get the matrix size from ds; ds append a column of mu-mean, unit variance noise as the k+1 feature Run JCFO (ds); If theta(k+1) = 0 counter counter + 1; End End End Display(counter/150)

2.3 Redundancy

NOTES: In this testing unit, our basic datasets come from the randomly generated, linearly separable datasets or Gaussian distributions. Each feature is normalized to zero-mean, unit variance.


2.3.1 Oracle Features


counter 0; For 1:100 ds load dataset; (n,k) get the matrix size from ds; ds append the class label as the k+1 feature Run JCFO (ds); If theta(k+1) > 0 and theta(1 through k) = 0; counter counter + 1; End End Display(counter/100)

2.3.2 Duplicate Features


counter 0; For 1:50 ds load dataset; (n,k) get the matrix size from ds; For i 1:k ds replicate and append the ith feature Run JCFO (ds); If theta(i) = 0 or theta(k+1) = 0 counter counter + 1; End End End Display(counter/(50*k))

2.3.3 Similar Features 2.3.3.1


counter 0; For 1:50 ds load dataset; (n,k) get the matrix size from ds; For i 1:k ds mix the ith feature with a unit noise and append it Run JCFO (ds); If theta(i) = 0 or theta(k+1) = 0 counter counter + 1; End End End Display( counter/(50*k))

2.3.3.2


mu1 [10 0 0]; mu2 [0 10 0]; sig1 [7 0 0; 0 1 0; 0 0 1]; sig2 [1 0 0; 0 7 0; 0 0 1]; counter 0; For 1:150 ds randomly generate 200 samples from the two Gaussians with class label Run JCFO(ds); If either Feature X or Feature Y has been removed counter counter + 1; End End Display(counter/150)

2.3.4 Gradually More Informative Features


counter 0; For k = 5, 10, 15 For 1:100 mu1 a vector with k zeros; mu2 [10 20 30 ... 10*k]; sig (10*k/4) * (k-size identity matrix) ds randomly generate 200 samples from the Gaussians with class label Run JCFO (ds); If theta(k)>0 and theta(1 through k-1) = 0 counter counter + 1; End End End Display(counter/300)

2.3.5 Indispensable Features


mu1 [5 5 0]; mu2 [8 8 0]; sig [4 0 0; 0 1 0; 0 0 1]; counter 0; For 1:150 ds randomly generate 200 samples from the two Gaussians with class label then rotate it 45 degrees clockwise around the class center Run JCFO(ds); If theta(1)>0 and theta(2)>0 counter counter + 1; End End Display(counter/150)


LIST OF REFERENCES
W. Beyer. CRC Standard Mathematical Tables, 25th ed. CRC Press, 1979.

J. Bi, K. P. Bennett, M. Embrechts, C. M. Breneman, and M. Song. Dimensionality reduction via sparse support vector machines. JMLR, 3:1229-1243, 2003.

M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150-1159, 2003.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. JMLR, 3:1157-1182, 2003.

T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.

R. Herbrich. Learning Kernel Classifiers. MIT Press, 2002.

B. Krishnapuram, A. J. Hartemink and M. A. T. Figueiredo. A Bayesian approach to joint feature selection and classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1105-1111, 2004.

B. Krishnapuram, D. Williams, Y. Xue, A. Hartemink, L. Carin and M. Figueiredo. On semi-supervised classification. In L. K. Saul, Y. Weiss and L. Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, 2005.

B. Schölkopf and A. J. Smola. Learning with Kernels: Regularization, Optimization and Beyond. MIT Press, 2002.

R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statistical Soc. (B), 58:267-288, 1996.

M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen and K.-R. Müller, editors, Advances in Neural Information Processing Systems 11, pp. 218-224. MIT Press, 2000.


BIOGRAPHICAL SKETCH

Fan Mao was born in Chengdu, China, in 1983. He received his bachelor's degree in computer science and technology from Shanghai Maritime University, Shanghai, China, in 2006. He then came to the University of Florida, Gainesville, FL. In December 2007, he received his M.S. in computer science under the supervision of Dr. Paul Gader.

