
Jihoon Yang and Vasant Honavar, Iowa State University

Practical pattern-classification and knowledge-discovery problems require the selection of a subset of attributes or features to represent the patterns to be classified. The authors' approach uses a genetic algorithm to select such subsets, achieving multicriteria optimization in terms of generalization accuracy and costs associated with the features.

In practical pattern-classification tasks such as medical diagnosis, a classification function learned through an inductive learning algorithm assigns a given input pattern to one of a finite set of classes. Typically, the representation of each input pattern consists of a vector of attribute, feature, or measurement values. The choice of features to represent the patterns affects several aspects of pattern classification, including

* Accuracy. The features used to describe the patterns implicitly define a pattern language. If the language is not expressive enough, it fails to capture the information necessary for classification. Hence, regardless of the learning algorithm, the amount of information given by the features limits the accuracy of the classification function learned.
* Required learning time. The features describing the patterns implicitly determine the search space that the learning algorithm must explore. An abundance of irrelevant features can unnecessarily increase the size of the search space and hence the time needed for learning a sufficiently accurate classification function.
* Necessary number of examples. All other things being equal, the larger the number of features describing the patterns, the larger the number of examples needed to train a classification function to the desired accuracy.
* Cost. In medical diagnosis, for example, patterns consist of observable symptoms along with the results of diagnostic tests. These tests have various associated costs and risks; for instance, an invasive exploratory surgery can be much more expensive and risky than, say, a blood test.

In the automated design of pattern classifiers, these variables present us with the feature subset selection problem. This is the task of identifying and selecting a useful subset of pattern-representing features from a larger set of features. The features in the larger set have different associated measurement costs and risks, and some may be irrelevant or mutually redundant.

A significant, practical example of such a scenario is the task of selecting a subset of clinical tests (each with a different financial cost, diagnostic value, and associated risk) to be performed for medical diagnosis. Other instances of the feature subset selection problem arise in, for example, large-scale data-mining applications and power system control.

Several approaches to feature subset selection exist (see the Related work sidebar); ours employs a genetic algorithm. The experiments we describe in this article demonstrate the effectiveness of our approach in the automated design of neural networks for pattern classification and knowledge discovery.

1094-7167/98/$10.00 © 1998 IEEE    IEEE Intelligent Systems


Related work

There have been several proposals of approaches to feature subset selection. (We discuss only a few of these in this article; our recent work contains a more complete list of references.) Some of these approaches involve searching for an optimal subset of features based on particular criteria of interest.

Feature weighting is a variant of feature selection. It involves assigning a real-valued weight to each feature. The weight associated with a feature measures its relevance or significance in the classification task.[2] Feature subset selection is a special case of weighting with binary weights.

Several authors have examined the use of a heuristic search for feature subset selection; this often operates in conjunction with a branch-and-bound search.[3] Others have explored randomized and randomized, population-based heuristic search techniques (such as genetic algorithms) to select feature subsets for use with decision-tree or nearest-neighbor classifiers.

Feature subset selection algorithms fall into two categories based on whether or not they perform feature selection independently of the learning algorithm that constructs the classifier. If the technique performs feature selection independently of the learning algorithm, it follows a filter approach. Otherwise, it follows a wrapper approach. The filter approach is generally computationally more efficient. However, its major drawback is that an optimal selection of features may not be independent of the inductive and representational biases of the learning algorithm that constructs the classifier. The wrapper approach, on the other hand, incurs the computational overhead of evaluating candidate feature subsets by executing a selected learning algorithm on the data set using each feature subset under consideration.

Because exhaustive search over all possible combinations of features is not computationally feasible, most current approaches assume monotonicity of some measure of classification performance and then use branch-and-bound search. This ensures that adding features does not worsen performance. Techniques that make this monotonicity assumption in some form appear to work reasonably well with linear classifiers. However, they can exhibit poor performance with nonlinear classifiers such as neural networks. Furthermore, many practical scenarios do not satisfy the monotonicity assumption. For example, irrelevant features (for example, social security numbers in medical records in a diagnosis task) can significantly worsen a decision-tree classifier's generalization accuracy. Also, most of the proposed feature selection techniques (with the exception of those using genetic algorithms) are not designed to handle multiple selection criteria (classification accuracy, feature measurement cost, and so on).

The multicriteria approach that we explore in this article is wrapper-based and uses a genetic algorithm in conjunction with a relatively fast, interpattern distance-based, neural-network learning algorithm. However, this general approach works with any inductive learning algorithm.

References

1. J. Yang and V. Honavar, "Feature Subset Selection Using a Genetic Algorithm," Feature Extraction, Construction and Selection: A Data Mining Perspective, Liu and Motoda, eds., Kluwer Academic Publishers, Boston, forthcoming, 1998.
2. S. Cost and S. Salzberg, "A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features," Machine Learning, Vol. 10, No. 1, Jan. 1993, pp. 57-78.
3. G. John, R. Kohavi, and K. Pfleger, "Irrelevant Features and the Subset Selection Problem," Proc. 11th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1994, pp. 121-129.
4. H. Liu and R. Setiono, "A Probabilistic Approach to Feature Selection: A Filter Solution," Proc. 13th Int'l Conf. Machine Learning, Morgan Kaufmann, 1996, pp. 319-327.
5. F. Brill, D. Brown, and W. Martin, "Fast Genetic Selection of Features for Neural Network Classifiers," IEEE Trans. Neural Networks, Vol. 3, No. 2, Mar. 1991, pp. 324-328.
6. M. Richeldi and P. Lanzi, "Performing Effective Feature Selection by Investigating the Deep Structure of the Data," Proc. Second Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1996, pp. 379-383.
7. B. Ripley, Pattern Recognition and Neural Networks, Cambridge Univ. Press, New York, 1996.

Why a genetic algorithm?

Feature subset selection in the context of practical problems such as diagnosis presents a multicriteria optimization problem. The criteria to be optimized include the classification's accuracy, cost, and risk. Evolutionary algorithms offer a particularly attractive approach to multicriteria optimization because they are effective in high-dimensional search spaces.

Neural networks are densely interconnected networks of relatively simple computing elements, for example, threshold or sigmoid neurons. Neural networks' potential for parallelism and their fault and noise tolerance make them an attractive framework for the design of pattern classifiers for real-world, real-time, pattern-classification tasks. The classification function realized by a neural network is determined by the functions computed by the neurons, the connectivity of the network, and the parameters (weights) associated with the connections. Assume C is a finite set of classes, n a finite number of discrete or real-valued attributes, R the set of real numbers, and D a finite set of discrete values. Multilayer networks of nonlinear computing elements (such as threshold neurons) can realize any classification function $\phi: R^n \to C$ or $\phi: D^n \to C$. If the attributes are symbolic, they must first be mapped to numeric values using appropriate coding schemes. Evolutionary algorithms are generally quite effective for rapid global search of large search spaces in multimodal optimization problems. Neural networks are particularly effective for fine-tuning solutions once promising regions in the search space have been identified. Against this background, genetic algorithms offer an attractive approach to feature subset selection for neural-network pattern classifiers.

However, if we use traditional neural-network training algorithms to train the pattern classifiers, the use of genetic algorithms for subset selection presents some practical problems:

* Traditional neural-network learning algorithms (such as back-propagation) perform an error gradient-guided search for a suitable setting of weights in the weight space determined by a user-specified network architecture. This ad hoc choice of network architecture often inappropriately constrains the search for weight settings. For example, if the network has too few neurons, the learning algorithm will miss the desired classification function. If the network has far more neurons than necessary, it can result in overfitting of the training data, which leads to poor generalization. Either case would make it difficult to evaluate the usefulness of a feature subset describing the training patterns for the neural network.
* Gradient-based learning algorithms, although mathematically well-founded for unimodal search spaces, can get caught in local minima of the error function. This can complicate the evaluation of the usefulness of a feature subset employed to describe the neural network's training patterns.
* A typical run of a genetic algorithm involves many generations. In each generation, evaluation of an individual (a feature subset) involves training the neural network and computing its accuracy and cost. This can make the fitness evaluation rather expensive, because gradient-based algorithms are typically quite slow. The problem is exacerbated because we must use multiple neural networks to sample the space of ad hoc network architecture choices to get a reliable fitness estimate for each feature subset represented in the population.

Fortunately, constructive neural-network learning algorithms[2] eliminate the need for ad hoc and often inappropriate a priori choices of network architectures. In addition, such algorithms can potentially discover near-minimal networks whose size is commensurate with the complexity of the classification task implicitly specified by the training data. Several new, provably convergent, and relatively efficient constructive learning algorithms for multicategory real and discrete-valued pattern classification tasks have begun to appear in the literature. Many of these have demonstrated very good performance in terms of reduced network size, learning time, and generalization in several experiments with both artificial and fairly large real-world data sets.

The results we present in this article are from experiments using neural networks constructed by DistAl,[3] a simple and fast constructive neural-network learning algorithm for pattern classification. DistAl's key feature is to add hidden neurons one at a time, using a greedy strategy that ensures that each hidden neuron correctly classifies a maximal subset of training patterns belonging to a single class. Correctly classified examples can then be eliminated from further consideration. The process terminates when it results in an empty training set, that is, when the network correctly classifies the entire training set. At this point, the training set becomes linearly separable in the transformed space defined by the hidden neurons. In fact, it is possible to set the weights on the hidden-to-output neuron connections without going through an iterative process.

DistAl is guaranteed to converge to 100% classification accuracy on any finite training set in time that is polynomial in the number of training patterns. Earlier experiments[3] show that DistAl, despite its simplicity, yields classifiers that compare quite favorably with those generated by learning algorithms that are more sophisticated and substantially more demanding computationally. This makes DistAl an attractive choice for experimenting with evolutionary approaches to feature subset selection for neural-network pattern classifiers. Figure 1 shows the key steps in our approach.

Figure 1. Feature subset selection using a genetic algorithm with DistAl. Starting from the initial population of candidates having different feature subsets, we generate new populations repeatedly from the previous ones by applying genetic operators (crossover and mutation) to the selected parents. DistAl evaluates the fitness values of offspring and ranks them according to their fitness values. The last generation of the process yields the best individual.

Implementation

We ran our experiments using a standard genetic algorithm with a rank-based selection strategy. The probability of selection of the highest ranked individual is $p$ (where $0.5 < p < 1.0$ is a user-specified parameter); that of the second highest ranked individual is $p(1 - p)$; that of the third highest ranked individual is $p(1 - p)^2$; and that of the last ranked individual is 1 - (sum of the probabilities of selection of all the other individuals).[1] Our results are based on ten random partitions for each classification task with the following parameter settings:

* Population size: 50
* Number of generations: 20
* Probability of crossover: 0.6
* Probability of mutation: 0.001
* Probability of selection of the highest ranked individual: 0.6

We based these parameter settings on the results of several preliminary runs. The probabilities of crossover, mutation, and selection of the highest ranked individual are close to the typical values used in standard genetic algorithms.

Each individual in the population represents a candidate solution to the feature subset selection problem. Let m be the total number of features available to choose from to represent the patterns to be classified. In a medical diagnosis task, these would be observable symptoms and a set of possible diagnostic tests that can be performed on the patient. (Given m such features, there exist $2^m$ possible feature subsets. Thus, for large values of m, an exhaustive search is not feasible.) Each feature subset is represented by a binary vector of dimension m. If a bit is a 1, it means that the corresponding feature is selected. A value of 0 indicates that the corresponding feature is not selected.

We determine an individual's fitness by evaluating the neural network constructed by DistAl using a training set whose patterns are represented using only the selected subset of features. If an individual has n bits turned on, the corresponding neural network has n input nodes.

The fitness function combines two criteria: the accuracy of the classification function realized by the neural network and the cost of performing the classification. We can estimate the classification function's accuracy by calculating the percentage of patterns in a test set that the neural network in question correctly classifies. Several measures of classification cost suggest themselves: the cost of measuring the value of a particular feature needed for classification (the cost of performing the necessary test in a medical diagnosis application), the risk involved, and so on. To keep things simple, we chose this two-criteria fitness function:

$$\mathit{fitness}(x) = \mathit{accuracy}(x) - \frac{\mathit{cost}(x)}{\mathit{accuracy}(x) + 1} + \mathit{cost}_{max} \qquad (1)$$

Here, fitness(x) is the fitness of the feature subset represented by x; accuracy(x) is the test accuracy of the neural-network classifier trained by DistAl using the feature subset represented by x; cost(x) is the sum of measurement costs of the feature subset represented by x; and $\mathit{cost}_{max}$ is an upper bound on the costs of candidate solutions. In this case, $\mathit{cost}_{max}$ is simply the sum of the costs associated with all of the features. This is clearly a somewhat ad hoc choice. However, it does discourage trivial solutions (such as a zero-cost solution with very low accuracy) from being selected over reasonable solutions that yield high accuracy at a moderate cost. It also ensures that $\forall x,\ 0 \le \mathit{fitness}(x) \le (100 + \mathit{cost}_{max})$.
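As an illustration of the binary encoding, the rank-based selection probabilities, and Equation 1, the following Python sketch may help. It is not the authors' implementation: the function names, the use of plain lists, and the clamping of the last probability at zero (to guard against floating-point round-off) are assumptions made for this example, and the accuracy value would in practice come from training and testing a DistAl network, which is not shown.

```python
from typing import List, Sequence


def selection_probabilities(n: int, p: float = 0.6) -> List[float]:
    """Selection probabilities for n individuals ranked best first.

    The k-th ranked individual (k = 0 is best) gets p * (1 - p)**k,
    and the last one gets whatever probability mass remains.
    """
    if n < 1 or not 0.5 < p < 1.0:
        raise ValueError("need n >= 1 and 0.5 < p < 1.0")
    probs = [p * (1.0 - p) ** rank for rank in range(n - 1)]
    # Assumption: clamp at zero so round-off cannot yield a negative value.
    probs.append(max(0.0, 1.0 - sum(probs)))
    return probs


def select_features(pattern: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Keep only the feature values whose bit in the mask is 1."""
    return [value for value, bit in zip(pattern, mask) if bit == 1]


def subset_cost(mask: Sequence[int], costs: Sequence[float]) -> float:
    """Sum of the measurement costs of the selected features."""
    return sum(cost for cost, bit in zip(costs, mask) if bit == 1)


def fitness(accuracy: float, cost: float, cost_max: float) -> float:
    """Equation 1: accuracy is a percentage in [0, 100]."""
    return accuracy - cost / (accuracy + 1.0) + cost_max


# Hypothetical example: four features with made-up costs.
costs = [3.0, 1.0, 4.0, 2.0]
mask = [1, 0, 1, 0]                 # features 0 and 2 are selected
cost_max = sum(costs)               # upper bound used in the article
reduced = select_features([0.2, 0.7, 0.1, 0.9], mask)
score = fitness(90.0, subset_cost(mask, costs), cost_max)
```

With cost equal to zero and accuracy equal to 100, `fitness` returns 100 + cost_max, which matches the upper bound stated above; with both equal to zero it returns cost_max.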
In practice, we must define suitable trade-offs between the multiple objectives based on knowledge of the domain. In general, it is a nontrivial task to combine multiple optimization criteria into a single fitness function. The literature on utility theory examines a wide variety of approaches.[5]

Experimental data sets

The experiments we report here used real-world data sets as well as a carefully constructed artificial data set (called 3-bit parity) to explore the feasibility of using genetic algorithms for feature subset selection for neural-network classifiers. We obtained the real-world data sets from the machine-learning data repository at the University of California, Irvine (http://www.ics.uci.edu/AI/ML/MLDBRepository.html).

3-bit parity data set. We constructed this data set to explore the genetic algorithm's effectiveness in selecting an appropriate subset of relevant features in the presence of redundant features. If successful, the genetic algorithm would minimize the cost and maximize the accuracy of the resulting neural-network pattern classifier.

To introduce redundancy to the training set, we replicated the original features once, thereby doubling the number of features. Then, we generated an additional set of irrelevant features and assigned them random Boolean values. We generated 100 7-bit random vectors and augmented them with the 6-bit vectors (corresponding to the original three bits plus an identical set of three bits). We assigned each feature in the resulting data set a random cost between 0 and 9.

Real-world data sets. Our objective with real-world data sets was to compare the neural networks built using feature subsets that the genetic algorithm selected with the neural networks using the entire set of features available. Table 1 summarizes the data sets' characteristics.

Some medical data sets include measurement costs for the features, but most sets lack this information. Thus, our experiments focused on identifying a minimal subset of features to yield high-accuracy neural-network classifiers for all data sets. Where measurement costs were available, we compared the performance obtained by considering the cost in addition to the accuracy (see Equation 1) with that obtained by considering the accuracy alone.

Table 1. Data sets used in the experiments.

Data set    Description                 Size (no. of patterns)  Input features (no.)  Feature type      Output classes (no.)
3P          3-bit parity problem        100                     13                    Numeric           2
Annealing   Annealing database          798                     38                    Numeric, nominal  5
Audiology   Audiology database          200                     69                    Nominal           24
Bridges     Pittsburgh bridges          105                     11                    Numeric, nominal  6
Cancer      Breast cancer               699                     9                     Numeric           2
CRX         Credit screening            690                     15                    Numeric, nominal  2
Flag        Flag database               194                     28                    Numeric, nominal  8
Glass       Glass identification        214                     9                     Numeric           6
Heart       Heart disease               270                     13                    Numeric, nominal  2
HeartCle    Heart disease (Cleveland)   303                     13                    Numeric, nominal  2
HeartHun    Heart disease (Hungarian)   294                     13                    Numeric, nominal  2
HeartLB     Heart disease (Long Beach)  200                     13                    Numeric, nominal  2
HeartSwi    Heart disease (Swiss)       123                     13                    Numeric, nominal  2
Hepatitis   Hepatitis domain            155                     19                    Numeric, nominal  2
Horse       Horse colic                 300                     22                    Numeric, nominal  2
Ionosphere  Ionosphere structure        351                     34                    Numeric           2
Liver       Liver disorders             345                     6                     Numeric           2
Pima        Pima Indians diabetes       768                     8                     Numeric           2
Promoters   DNA sequences               106                     57                    Nominal           2
Sonar       Sonar classification        208                     60                    Numeric           2
Soybean     Large soybean               307                     35                    Nominal           19
Votes       House votes                 435                     16                    Nominal           2
Vehicle     Vehicle silhouettes         846                     18                    Numeric           4
Vowel       Vowel recognition           528                     10                    Numeric           11
Wine        Wine recognition            178                     13                    Numeric           3
Zoo         Zoo database                101                     16                    Numeric, nominal  7

Experimental results

We partitioned each data set into a training and test set (with 90% of the data used for training and the remaining 10% for testing). We did this partition 10 times and used each partition in five independent runs of the genetic algorithm. Tables 2, 3, and 4 show averaged performance. The table entries correspond to means and standard deviations, shown in the form mean ± standard deviation. (See our recent work[6] for more thorough experiments.)

Improving generalization. To study the effect of feature subset selection on generalization, we ran experiments using classification accuracy as the fitness function. The results shown in Table 2 indicate that the networks constructed using a GA-selected subset of features compare quite favorably with networks that use all of the features. In particular, feature subset selection resulted in significant generalization improvement.

Table 3 compares the results of our approach with other GA-based approaches[7] and several non-GA-based approaches cited in our recent work.[6] (These non-GA approaches use a decision-tree algorithm.) We limited the comparisons to only those data sets for which at least one of the two studies[6,7] reported results that could be compared with the results of our experiments. (It is not generally feasible to do a completely fair and thorough comparison between different approaches without complete knowledge of the parameters and setup used in the experiments.) The results indicate that our approach provided higher generalization accuracy in almost all cases, although it occasionally used more features.

Minimizing cost and maximizing accuracy. For this experiment, we based subset selection on both generalization accuracy and the features' measurement cost. (See the fitness function in Equation 1.) We used the 3-bit parity problem, hepatitis, Cleveland heart disease, and the Pima Indians diabetes data sets (with random costs in the 3-bit parity problem). Table 4 shows the results.

The fitness function that combined both accuracy and cost outperformed that based on accuracy alone in every respect: the number of features used, generalization accuracy, and the number of hidden neurons. This is not surprising, because the former tries to minimize cost while maximizing accuracy; this reduces the number of features. The latter emphasizes only the accuracy. Some of the runs resulted in feature subsets that did not necessarily have minimum cost. This suggests that we can improve the results with a more principled choice of a fitness function combining accuracy and cost.

Genetic algorithms offer an attractive approach to solving the feature subset selection problem in inductive learning of pattern classifiers in general and neural-network pattern classifiers in particular. This task finds applications in the cost-sensitive design of classifiers for tasks such as medical diagnosis and computer vision. Other applications of interest include automated data-mining and knowledge discovery from data sets with an abundance of irrelevant or redundant features. In such cases, identifying a relevant subset that adequately captures the regularities in the data can be particularly useful.

Some directions for further research in this field include

* the application of approaches based on genetic algorithms to feature subset selection for large-scale pattern classification tasks that arise in power systems control,[8] gene sequence recognition, and data-mining and knowledge discovery;
* extensive experimental and theoretical comparison of the performance of our approach with that of conventional methods for feature subset selection;
* more principled design of multiobjective fitness functions for feature subset selection using domain knowledge along with mathematically well-founded tools of multiattribute utility theory.[5]

Some of these research directions are currently being explored.

Acknowledgments

This research was partially supported by the National Science Foundation (through grants IRI-9409580 and IRI-9643299) and the John Deere Foundation.

References

1. M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, Cambridge, Mass., 1996.
2. V. Honavar and L. Uhr, "Generative Learning Structures and Processes for Connectionist Networks," Information Sciences, Vol. 70, 1993, pp. 75-108.
3. J. Yang, R. Parekh, and V. Honavar, "DistAl: An Inter-Pattern Distance-Based Constructive Learning Algorithm," Tech. Report ISU-CS-TR 97-05, Iowa State Univ., Ames, Ia., 1997. Also to appear in Proc. Int'l Joint Conf. Neural Networks, IEEE, Piscataway, N.J., 1998.
4. R. Parekh, J. Yang, and V. Honavar, "Constructive Neural Network Learning Algorithms for Multi-Category Real-Valued Pattern Classification," Tech. Report ISU-CS-TR 97-06, Dept. Computer Science, Iowa State Univ., 1997.
5. R. Keeney and H. Raiffa, Decisions with Multiple Objectives: Preferences and Value Tradeoffs, John Wiley & Sons, New York, 1976.
6. J. Yang and V. Honavar, "Feature Subset Selection Using a Genetic Algorithm," Feature Extraction, Construction and Selection: A Data Mining Perspective, Liu and Motoda, eds., Kluwer Academic Publishers, Boston, forthcoming in 1998.
7. M. Richeldi and P. Lanzi, "Performing Effective Feature Selection by Investigating the Deep Structure of the Data," Proc. Second Int'l Conf. Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, Calif., 1996, pp. 379-383.
8. G. Zhou, J. McCalley, and V. Honavar, "Power System Security Margin Prediction ... Radial Basis Function Networks," ... 29th Ann. North American Power ..., 1997, pp. 192-198.


Table 2. Neural-network pattern classifiers constructed using the entire set of features compared with those constructed using the most accurate subsets selected by the genetic algorithm. The column labeled accuracy shows the generalization accuracy, and the column labeled hidden shows the number of hidden neurons generated in the neural networks.

            All attributes                                GA-selected subset
Data set    Features (no.)  Accuracy (%)  Hidden          Features (no.)  Accuracy (%)  Hidden
3P          13              79.0 ± 12.2   5.0 ± 2.0       6.6 ± 1.6       100 ± 0.0     9.2 ± 4.9
Annealing   38              96.6 ± 2.0    12.1 ± 2.4      21.0 ± 3.1      99.5 ± 0.9    11.1 ± 2.9
Audiology   69              66.0 ± 9.7    24.7 ± 4.8      36.4 ± 3.5      83.5 ± 8.2    27.4 ± 5.6
Bridges     11              63.0 ± 7.8    5.2 ± 3.3       5.6 ± 1.5       81.6 ± 7.6    17.6 ± 12.4
Cancer      9               97.8 ± 1.2    2.9 ± 1.2       5.4 ± 1.4       99.3 ± 0.9    5.7 ± 2.9
CRX         15              87.7 ± 3.3    7.7 ± 6.9       8.0 ± 2.1       91.5 ± 2.8    12.5 ± 7.6
Flag        28              65.8 ± 9.5    9.1 ± 6.2       14.0 ± 2.6      78.1 ± 7.8    11.2 ± 6.5
Glass       9               70.5 ± 8.5    9.8 ± 6.9       5.5 ± 1.4       80.8 ± 5.0    14.5 ± 6.6
Heart       13              86.7 ± 7.6    5.7 ± 4.4       7.2 ± 1.6       93.9 ± 3.8    7.5 ± 3.9
HeartCle    13              85.3 ± 2.7    3.4 ± 1.1       7.3 ± 1.7       92.9 ± 3.6    7.6 ± 4.2
HeartHun    13              85.9 ± 6.3    5.0 ± 2.9       7.0 ± 1.2       93.0 ± 4.0    7.1 ± 3.7
HeartSwi    13              94.2 ± 3.8    2.2 ± 0.6       6.6 ± 1.7       98.3 ± 3.3    3.7 ± 1.5
HeartVa     13              80.0 ± 7.4    5.1 ± 2.6       7.1 ± 1.7       91.0 ± 5.7    8.5 ± 3.0
Hepatitis   19              84.7 ± 9.5    6.2 ± 4.0       9.2 ± 2.3       97.1 ± 4.3    8.1 ± 2.8
Horse       22              86.0 ± 3.6    5.3 ± 4.5       11.1 ± 2.3      92.6 ± 3.4    9.5 ± 4.1
Ionosphere  34              94.3 ± 5.0    5.5 ± 1.6       17.3 ± 3.5      98.6 ± 2.4    7.5 ± 2.4
Liver       6               72.9 ± 5.1    21.5 ± 27.3     4.1 ± 0.7       77.8 ± 4.0    25.9 ± 24.3
Pima        8               76.3 ± 5.1    8.1 ± 4.9       3.8 ± 1.5       79.5 ± 3.1    20.8 ± 21.2
Promoters   57              88.0 ± 7.5    2.2 ± 0.4       28.8 ± 3.3      100 ± 0.0     2.7 ± 1.0
Sonar       60              83.0 ± 7.8    6.4 ± 2.7       30.7 ± 3.7      97.2 ± 2.9    7.2 ± 3.0
Soybean     35              81.0 ± 5.6    20.2 ± 3.2      19.4 ± 2.7      92.8 ± 5.9    23.3 ± 4.3
Vehicle     18              65.4 ± 3.5    23.7 ± 5.0      9.1 ± 1.7       68.8 ± 4.3    36.2 ± 18.2
Votes       16              96.1 ± 1.5    3.2 ± 1.5       8.9 ± 1.8       98.8 ± 1.2    4.0 ± 1.8
Vowel       10              69.8 ± 6.4    38.0 ± 8.3      6.5 ± 1.2       78.4 ± 3.8    41.5 ± 7.7
Wine        13              97.1 ± 4.0    5.5 ± 1.7       6.7 ± 1.6       99.4 ± 2.1    5.9 ± 2.1
Zoo         16              96.0 ± 4.9    6.1 ± 1.1       9.3 ± 1.6       100 ± 0.0     6.2 ± 1.1

Jihoon Yang is a graduate student working on a PhD in computer science at Iowa State University. His research interests include intelligent agents, data mining and knowledge discovery, machine learning, neural networks, pattern recognition, and evolutionary computing. Yang received a BS in computer science from Sogang University, Seoul, Korea, and an MS in computer science from Iowa State University. Yang is a member of the AAAI and the IEEE. Contact him at the AI Research Group, Dept. of Computer Science, Iowa State Univ., Ames, IA 50011; yang@cs.iastate.edu.

Vasant Honavar is an associate professor in Iowa State University's Artificial Intelligence Research Laboratory, which he founded, and is also on the faculty of the Interdepartment Graduate Program in Neuroscience. His research interests include AI, cognitive science, machine learning, neural networks, intelligent agents and multiagent systems, adaptive systems, and intelligent manufacturing systems. He received a BE in electrical engineering from Bangalore University, Bangalore, India, an MS in electrical and computer engineering from Drexel University, an MS in computer science from the University of Wisconsin, and a PhD in computer science and cognitive science, also from the University of Wisconsin. He edited Advances in Evolutionary Synthesis of Neural Systems (MIT Press, 1998). He is a member of the IEEE, ACM, AAAI, Cognitive Science Society, Society for Neuroscience, Neural Network Society, and Sigma Xi, and is an associate of the Behavior and Brain Sciences Society. Contact him at the AI Research Group, Dept. of Computer Science, Iowa State Univ., Ames, IA 50011; honavar@cs.iastate.edu.

Table 3. Comparison between various approaches for feature subset selection. The non-GA column shows the best performance among several approaches not based on genetic algorithms; the Richeldi column shows the performance reported by Richeldi and Lanzi; and the DistAl column shows the performance of our approach.
            Non-GA                Richeldi               DistAl
Data set    Features  Accuracy    Features  Accuracy     Features    Accuracy
Annealing   -         -           8         95.0 ± 2.3   21.0 ± 3.1  99.5 ± 0.9
Cancer      -         74.7        -         -            5.4 ± 1.4   99.3 ± 0.9
CRX         -         85.0        7         85.1 ± 6.1   8.0 ± 2.1   91.5 ± 2.8
Glass       -         62.5        4         70.5 ± 7.8   5.5 ± 1.4   80.8 ± 5.0
Heart       -         79.2        5         80.8 ± 6.5   7.2 ± 1.6   93.9 ± 3.8
Hepatitis   -         84.6        -         -            9.2 ± 2.3   97.1 ± 4.3
Horse       -         85.3        -         -            11.1 ± 2.3  92.6 ± 3.4
Pima        -         -           3         73.2 ± 3.8   3.8 ± 1.5   79.5 ± 3.1
Sonar       -         -           16        76.0 ± 9.0   30.7 ± 3.7  97.2 ± 2.9
Vehicle     -         -           7         69.6 ± 6.1   9.1 ± 1.7   68.8 ± 4.3
Votes       4         97.0        5         95.7 ± 3.5   8.9 ± 1.8   98.8 ± 1.2

Table 4. Performance comparison: neural-network pattern classifiers that use features selected based on accuracy alone compared to those that use features selected based on both accuracy and cost.

           Accuracy only                          Accuracy and cost
Data set   Features   Accuracy    Hidden          Features   Accuracy    Cost          Hidden
3P         6.6 ± 1.6  100 ± 0.0   9.2 ± 4.9       4.3 ± 1.2  100 ± 0.0   26.7 ± 7.6    7.3 ± 4.2
Hepatitis  9.2 ± 2.3  97.1 ± 4.3  8.1 ± 2.8       8.3 ± 2.4  97.3 ± 3.5  19.0 ± 8.1    7.4 ± 2.8
HeartCle   7.3 ± 1.7  92.9 ± 3.6  7.6 ± 4.2       6.1 ± 1.6  93.0 ± 3.4  261.5 ± 94.4  7.2 ± 5.1
Pima       3.8 ± 1.5  79.5 ± 3.1  20.8 ± 21.2     3.1 ± 1.0  79.5 ± 3.0  22.8 ± 9.7    16.0 ± 11.1
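For readers who want to experiment with something like the 3P data set reported in the tables, the sketch below builds a comparable artificial data set in Python. It follows the description in the Experimental data sets section (three original bits, one identical copy of them, seven random Boolean features, 100 patterns, and a random cost between 0 and 9 per feature), but several details are assumptions made for illustration: the original bits are drawn at random, the features are ordered as original, copy, irrelevant, the costs are integers, and the function name and seed are hypothetical.

```python
import random
from typing import List, Tuple


def make_parity_dataset(
    n_patterns: int = 100, seed: int = 0
) -> Tuple[List[List[int]], List[int], List[int]]:
    """Build a 13-feature parity data set with redundant and irrelevant bits.

    Returns (patterns, labels, costs). The label of each pattern is the
    parity (sum modulo 2) of its three original bits.
    """
    rng = random.Random(seed)
    patterns: List[List[int]] = []
    labels: List[int] = []
    for _ in range(n_patterns):
        original = [rng.randint(0, 1) for _ in range(3)]    # relevant bits
        irrelevant = [rng.randint(0, 1) for _ in range(7)]  # random noise bits
        # Assumed feature order: original bits, their copy, irrelevant bits.
        patterns.append(original + original + irrelevant)
        labels.append(sum(original) % 2)
    # Assumption: one integer cost in 0..9 per feature.
    costs = [rng.randint(0, 9) for _ in range(13)]
    return patterns, labels, costs


patterns, labels, costs = make_parity_dataset()
```

A feature subset that keeps exactly one copy of each original bit is sufficient to determine the label, so a successful search should favor the cheaper copy of each bit and drop the seven irrelevant features.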

March/April 1998