
N.A. Abu Osman, F. Ibrahim, W.A.B. Wan Abas, H.S. Abd Rahman, H.N. Ting (Eds.): Biomed 2008, IFMBE Proceedings 21, pp. 266-269, 2008
www.springerlink.com © Springer-Verlag Berlin Heidelberg 2008

A Comparative Study of Imputation Methods to Predict Missing Attribute Values
in Coronary Heart Disease Data Set
N.A. Setiawan, P.A. Venkatachalam and A.F.M. Hani

Department of Electrical and Electronic Engineering, Universiti Teknologi PETRONAS,
Bandar Seri Iskandar 31750 Tronoh, Perak, Malaysia
Abstract The objective of this research is to investigate the effects of missing attribute value imputation methods on the quality of extracted rules when rule filtering is applied. Three imputation methods, Artificial Neural Network with Rough Set Theory (ANNRST), k-Nearest Neighbor (k-NN) and Concept Most Common Attribute Value Filling (CMCF), are applied to the University of California, Irvine (UCI) coronary heart disease data sets. Rough Set Theory (RST) is used to generate rules from the three imputed data sets, and support filtering is used to select the rules. Accuracy, coverage, sensitivity, specificity and the Area Under the Curve (AUC) of Receiver Operating Characteristic (ROC) analysis are used to evaluate the performance of the rules when they are applied to classify the complete testing data set. The evaluation results show that ANNRST performs best among the three methods.
Keywords Missing attribute value, imputation, rough set
theory.
I. INTRODUCTION
The problem of finding appropriate models that describe or classify data is encountered in many areas. Mathematical or empirical methods are used to find the best model; empirical methods are used when it is difficult to find an appropriate mathematical model that describes or classifies the data. One widely used empirical approach is knowledge discovery from data (KDD), where the discovered knowledge usually takes the form of rules. KDD is often used in biomedicine to build decision support systems.
The KDD process consists of several steps, one of the most important being data preprocessing. The quality of the extracted knowledge depends on the quality of the data, and preprocessing includes the handling of missing attribute values. Obtaining a large amount of high-quality data may not always be easy, or even possible. Medical data acquisition in practice is often a non-automated task: much medical data involves laboratory work, most hospitals transcribe dictated physician notes manually [1], and not all laboratory tests are available in every hospital. Incomplete data sets are therefore a common issue in medical data collection.
Much research on missing attribute value prediction has been conducted, on both general and domain-specific data sets. Grzymala-Busse and Hu [2] compare nine different approaches to missing attribute values; they do not find enough evidence to claim that any one approach is superior. Al Shalabi, Najjar and Al Kayed [3] present a framework to deal with missing data, compare four imputation algorithms and evaluate the generated rules. Li and Cercone [4] propose RSFit, a combination of rough set and distance methods, to assign missing data. Wasito and Mirkin [5, 6] present a nearest neighbor approach to least-squares data imputation algorithms. Ragel and Cremilleux [7] build missing value completion based on association rules. In the specific area of heart disease, ANNRST is presented by Setiawan, Venkatachalam and Hani [8, 9] to predict the missing attribute values of the UCI coronary heart disease data set. Troyanskaya et al. [10] compare missing data imputation methods for DNA microarrays; missing value estimation in DNA microarrays is also addressed in [11], and estimation methods in other specific areas are discussed in [12-14]. Most researchers consider only the accuracy or error of the imputation itself. Only Al Shalabi evaluates the generated knowledge to select the appropriate method, and he concludes that the best imputation model is task-dependent.
Rough set theory (RST), proposed by Pawlak [15], is a relatively new theory that can be applied in KDD. RST usually generates a large number of decision rules, so rule filtering must be used to select the most important, high-quality rules. The quality of the classifier is reduced during the rule filtering process; how the imputation methods affect the quality of the generated rules during filtering remains to be answered.
In this research, three imputation methods, namely ANNRST, k-NN and CMCF, are compared on the UCI coronary heart disease data sets [16]. Rules are generated from the three imputed data sets using RST. The most complete of the UCI coronary heart disease data sets is used as testing data while the generated rules are filtered. Support filtering is applied to the three sets of generated rules, and the accuracy, coverage, sensitivity, specificity and AUC of ROC analysis of the classifiers on the testing data are evaluated.

II. DESCRIPTION OF INVESTIGATED IMPUTATION METHODS
The following three imputation methods for missing attribute values are used in this experiment:
A. Concept most common attribute value filling (CMCF)
Most common attribute value filling is the simplest method for dealing with missing attribute values: the value that occurs most frequently is selected to fill the unknown value. CMCF restricts most common attribute value filling to the concept, i.e. the decision class: the value that occurs most frequently within the same concept is selected as the imputed value. CMCF is also called the maximum relative frequency method or the maximum conditional probability method.
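As an illustration, CMCF can be sketched as follows. This is a minimal sketch, not the code used in the experiments; the attribute names ('slope', 'disease') are hypothetical, and missing values are represented as None.

```python
from collections import Counter

def cmcf_impute(rows, attribute, decision):
    """Concept Most Common attribute value Filling (CMCF): replace each
    missing value of `attribute` with the value occurring most frequently
    among complete rows that share the same decision (concept)."""
    # Count attribute values separately within each concept.
    mode_by_concept = {}
    for row in rows:
        if row[attribute] is not None:
            mode_by_concept.setdefault(row[decision], Counter())[row[attribute]] += 1
    filled = []
    for row in rows:
        row = dict(row)
        if row[attribute] is None:
            # Take the most common value within this row's concept.
            row[attribute] = mode_by_concept[row[decision]].most_common(1)[0][0]
        filled.append(row)
    return filled

# Tiny illustrative table: 'slope' is missing for one diseased patient.
data = [
    {"slope": "up",   "disease": 1},
    {"slope": "flat", "disease": 1},
    {"slope": "flat", "disease": 1},
    {"slope": None,   "disease": 1},
    {"slope": "up",   "disease": 0},
]
imputed = cmcf_impute(data, "slope", "disease")
print(imputed[3]["slope"])  # "flat": the most common 'slope' within concept disease=1
```

Note that restricting the count to the concept is what distinguishes CMCF from plain most common value filling, which would count over all rows regardless of decision.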
B. k-Nearest neighbor method (k-NN)
The k-NN algorithm can be described as follows: take a row that contains a missing value as the target; compute the Euclidean distance between the incomplete row and each complete row; identify the k closest complete rows; and impute the missing value by averaging the corresponding attribute over those k nearest neighbors.
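The steps above can be sketched in a few lines. This is a hedged, minimal sketch assuming numeric attributes; the attribute names ('age', 'chol', 'ca') are illustrative and not taken from the paper's implementation.

```python
import math

def knn_impute(incomplete, complete_rows, missing_key, k=3):
    """Impute `missing_key` in `incomplete` by averaging that attribute
    over the k complete rows nearest in Euclidean distance, computed on
    the attributes that are present in the incomplete row."""
    keys = [key for key, value in incomplete.items()
            if value is not None and key != missing_key]

    def distance(row):
        return math.sqrt(sum((incomplete[key] - row[key]) ** 2 for key in keys))

    nearest = sorted(complete_rows, key=distance)[:k]
    return sum(row[missing_key] for row in nearest) / k

# Hypothetical rows; 'ca' (vessels colored by fluoroscopy) is missing.
complete = [
    {"age": 60, "chol": 240, "ca": 2},
    {"age": 62, "chol": 250, "ca": 3},
    {"age": 40, "chol": 180, "ca": 0},
    {"age": 42, "chol": 190, "ca": 0},
]
row = {"age": 61, "chol": 245, "ca": None}
imputed_ca = knn_impute(row, complete, "ca", k=2)
print(imputed_ca)  # 2.5: average of 'ca' over the two closest rows
```

For nominal attributes, averaging would typically be replaced by a majority vote among the k neighbors; the sketch keeps the averaging form described in the text.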
C. Artificial neural network with rough set theory method
The ANNRST algorithm consists of two parts: RST attribute reduction, and construction of an ANN to predict the missing attribute values. A reduct is a combination of attributes that discerns between objects as well as all attributes in the decision system

S = (U, A, d)   (1)

where U is the set of objects, A is the set of conditional attributes and d is the decision attribute. A reduct can be found as a prime implicant of the discernibility function for the attribute set A:

f_A(a_1*, ..., a_m*) = ∧ { ∨ c_ij* | 1 ≤ j < i ≤ n, c_ij ≠ ∅ }   (2)

where the Boolean variables a_1*, ..., a_m* correspond to the attributes a_1, ..., a_m, c_ij* = { a* | a ∈ c_ij }, and

c_ij = { a ∈ A | a(x_i) ≠ a(x_j) and d(x_i) ≠ d(x_j) }   (3)

for i, j = 1, ..., n, where n is the number of objects in the decision table. After a minimal reduct is computed, an ANN is constructed whose inputs are the condition attributes in the reduct together with the decision attribute. To impute missing values in m attributes, m ANN topologies must be constructed. If there are p condition attributes and q decision attributes, the ANN has p - 1 + q inputs (including the decision attribute), with the one condition attribute that contains missing values as its output. Full discernibility reduct computation is used to reduce the attributes, addressing the attribute reduction and feature selection problem. Further details on ANNRST can be found in [8, 9].
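The discernibility matrix of Eq. (3), and the reduct condition it induces, can be sketched as follows. This is an illustrative sketch on a toy decision table, not the reduct computation used by the authors; attribute names are hypothetical.

```python
from itertools import combinations

def discernibility_matrix(objects, attributes, decision):
    """Entries c_ij of Eq. (3): the conditional attributes that discern
    objects x_i and x_j, recorded only when their decisions differ."""
    matrix = {}
    for i, j in combinations(range(len(objects)), 2):
        xi, xj = objects[i], objects[j]
        if xi[decision] != xj[decision]:
            matrix[(i, j)] = {a for a in attributes if xi[a] != xj[a]}
    return matrix

def is_reduct_candidate(candidate, matrix):
    """A set of attributes discerns as well as A (satisfies the
    discernibility function of Eq. (2)) iff it intersects every
    non-empty entry of the discernibility matrix."""
    return all(not entry or candidate & entry for entry in matrix.values())

# Toy decision table with three conditional attributes and decision d.
table = [
    {"a1": 0, "a2": 1, "a3": 0, "d": 0},
    {"a1": 1, "a2": 1, "a3": 0, "d": 1},
    {"a1": 0, "a2": 0, "a3": 1, "d": 1},
]
m = discernibility_matrix(table, ["a1", "a2", "a3"], "d")
print(is_reduct_candidate({"a1", "a2"}, m))  # True: hits every matrix entry
print(is_reduct_candidate({"a3"}, m))        # False: cannot discern objects 0 and 1
```

A minimal reduct is then a smallest candidate passing this test; in practice heuristic search is used, since enumerating all candidates is exponential in the number of attributes.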
III. ROUGH SET THEORY CLASSIFIER
The RST classifier is rule-based. The rules are generated from relative reducts in the form IF condition(s) THEN decision(s), by combining the reducts with the attribute values in the data sets. A rule is called deterministic if the combination of attributes in its antecedent implies a single decision value in its consequent; otherwise it is called probabilistic.
The support of a rule is defined as the number of objects or instances in the training data that match the rule. Support is a basic filtering criterion when RST generates a large number of rules. The classifier is evaluated by applying the generated rules to the testing data and computing the accuracy, coverage, sensitivity and specificity. Receiver Operating Characteristic (ROC) analysis combines the sensitivity and specificity of the classifier, and the area under the ROC curve is also used to evaluate the classifier [17].
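The support-based filtering and the evaluation metrics above can be sketched as follows. This is a simplified sketch with hypothetical rule and attribute names; in particular, the definition of coverage as the fraction of testing objects on which any rule fires is an assumption consistent with the usual RST usage.

```python
def rule_support(rule, training):
    """Support: number of training objects matching every antecedent condition."""
    return sum(all(obj.get(attr) == val for attr, val in rule["if"].items())
               for obj in training)

def filter_rules(rules, training, threshold):
    """Keep only rules whose support meets the threshold."""
    return [r for r in rules if rule_support(r, training) >= threshold]

def evaluate(tp, fn, tn, fp, n_total):
    """Metrics used in the paper, from confusion counts over the objects
    the rules classified; n_total is the size of the testing set."""
    classified = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / classified,
        "coverage": classified / n_total,   # fraction of objects any rule fired on
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

train = [{"slope": "flat", "d": 1}, {"slope": "flat", "d": 1}, {"slope": "up", "d": 0}]
rules = [{"if": {"slope": "flat"}, "then": 1}, {"if": {"slope": "down"}, "then": 1}]
kept = filter_rules(rules, train, threshold=2)
print(len(kept))  # 1: the 'down' rule has support 0 and is pruned

metrics = evaluate(tp=40, fn=10, tn=45, fp=5, n_total=110)
print(metrics)  # accuracy 0.85, coverage ~0.91, sensitivity 0.8, specificity 0.9
```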
IV. EXPERIMENT AND RESULTS
The UCI coronary artery disease data set is used. Originally, the data comprises 920 objects, collected from four different sources: Cleveland Clinic Foundation, U.S.; Hungarian Institute of Cardiology, Budapest, Hungary; V.A. Medical Center, Long Beach, U.S.; and University Hospital, Zurich, Switzerland. After selection, 661 objects are used, of which 351 contain missing values. The data has thirteen conditional attributes and one decision attribute, the presence of coronary artery disease. All numerical values are discretized. Values are missing in three conditional attributes: the slope of the peak exercise ST segment of the EKG, the number of vessels colored by fluoroscopy, and exercise thallium scintigraphic defects. ANNRST, k-NN and CMCF are used to impute the missing values in these three attributes. The imputed data sets are used as training data for the RST classifier and the complete data set is used as testing data. The generated rules are then filtered based on their support. There are three sets of imputed data: the ANNRST, k-NN and CMCF imputed data sets.

The RST classifier generates 4095, 5566 and 1537 rules for the ANNRST, k-NN and CMCF data sets respectively.
Support filtering is applied by removing rules whose support falls below a threshold, varied from one to twenty-five. A maximum threshold of twenty-five is chosen because the accuracy of the classifier is still good at that point (0.823 for the ANNRST data set) and the number of rules pruned changes little beyond twenty-five. The effect of rule support filtering on the number of rules can be seen in Fig. 1: the number of rules decreases roughly exponentially for all three data sets. The effects of rule support filtering on the accuracy, coverage, sensitivity and specificity are shown in Fig. 2, Fig. 3, Fig. 4 and Fig. 5 respectively.
[Figure: number of rules (0 to 6000) vs. support threshold (0 to 25) for the ANNRST, KNN and CMCF data sets]
Fig. 1 Support filtering effect on the number of rules with three different data sets.
[Figure: accuracy (0.7 to 0.95) vs. support threshold (0 to 25) for the ANNRST, KNN and CMCF data sets]
Fig. 2 Support filtering effect on the accuracy of the classifier with three different data sets.
[Figure: coverage (0.2 to 1.1) vs. support threshold (0 to 25) for the ANNRST, KNN and CMCF data sets]
Fig. 3 Support filtering effect on the coverage of the classifier with three different data sets.
[Figure: sensitivity (0.55 to 0.9) vs. support threshold (0 to 25) for the ANNRST, KNN and CMCF data sets]
Fig. 4 Support filtering effect on the sensitivity of the classifier with three different data sets.
[Figure: specificity (0.7 to 1.0) vs. support threshold (0 to 25) for the ANNRST, KNN and CMCF data sets]
Fig. 5 Support filtering effect on the specificity of the classifier with three different data sets.
The classifier built from the ANNRST data set performs better on accuracy and sensitivity than those built from the k-NN and CMCF data sets, except at support 17, where k-NN performs better but with decreased coverage. The coverage of the classifier on the k-NN data set drops significantly once the support threshold exceeds about ten. The CMCF data set gives good specificity and outperforms ANNRST on that measure.
ROC analysis is used to determine the classifier performance with respect to sensitivity and specificity by calculating the area under the ROC curve (AUC). Neither sensitivity nor specificity alone can represent classifier performance; both must be considered together. Fig. 6 shows the effect of rule support filtering on the AUC of the ROC analysis: the classifier built from the ANNRST data set has a better AUC than the k-NN and CMCF classifiers.
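The AUC computation can be sketched with the trapezoidal rule over ROC points. This is a generic sketch, not the exact procedure of [17]; the sensitivity/specificity values below are illustrative.

```python
def auc_from_roc(points):
    """Area under an ROC curve by the trapezoidal rule. `points` are
    (false positive rate, true positive rate) pairs, i.e.
    (1 - specificity, sensitivity), including the endpoints (0,0) and (1,1)."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# A single-threshold classifier with sensitivity 0.8 and specificity 0.9
# gives the three-point curve below.
auc = auc_from_roc([(0.0, 0.0), (0.1, 0.8), (1.0, 1.0)])
print(round(auc, 3))  # 0.85
```

For a single-threshold classifier this reduces to (sensitivity + specificity) / 2, which is why the AUC summarizes both measures at once.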
[Figure: AUC (0.76 to 0.92) vs. support threshold (0 to 25) for the ANNRST, KNN and CMCF data sets]
Fig. 6 Support filtering effect on the AUC-ROC of the classifier with three different data sets.

The AUC in Fig. 6 is plotted against the support threshold. The number of rules must also be considered, because the same support threshold yields different numbers of rules, as shown in Fig. 1. The accuracy, coverage and AUC-ROC evaluated at the same number of rules for the three data sets can be seen in Fig. 7, Fig. 8 and Fig. 9 respectively.
V. CONCLUSIONS
The main objective of this research is to investigate the effects of missing attribute value imputation methods on the quality of extracted rules when rule filtering is applied. Determining the best imputation method is a difficult task: many aspects of the impact of imputation must be considered, and the choice depends on the task and the type of data. This paper uses the quality of the resulting classifier to evaluate imputation methods for missing data in coronary heart disease. ANNRST can be considered the best method in the case of the UCI coronary heart disease data sets, and a suitable imputation method for data with properties similar to those of this data set.
VI. REFERENCES
1. Ohrn A (1999) Discernibility and rough sets in medicine: tools and applications. PhD thesis, Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim.
2. Grzymala-Busse J, Hu M (2001) A Comparison of several approaches
to missing attribute values in data mining, Rough Sets and Current
Trends in Computing, p. 378.
3. Al Shalabi L, Najjar M, Al Kayed A (2006) A framework to deal with
missing data in data sets. Journal of Computer Science, 2(9): p. 740-
745.
4. Li J, Cercone N (2006) Assigning missing attribute values based on
rough sets theory, IEEE International Conference on Granular Com-
puting. 2006
5. Wasito I, Mirkin B (2005) Nearest neighbour approach in the least-
squares data imputation algorithms. Information Sciences, 169(1-2):
p. 1.
6. Wasito I, Mirkin B (2006) Nearest neighbours in least-squares data
imputation algorithms with different missing patterns. Computational
Statistics & Data Analysis. 50(4): p. 926.
7. Ragel A, Cremilleux B (1999) MVC--a preprocessing method to deal
with missing values. Knowledge-Based Systems, 12(5-6): p. 285.
8. Setiawan NA, Venkatachalam PA, Hani AFM (2007) Missing data
estimation on heart disease using artificial neural network and rough
set theory, International Conference on Intelligent and Advanced Sys-
tems, 2007, Kuala Lumpur, Malaysia.
9. Setiawan NA, Venkatachalam PA, Hani AFM (2008) Missing attrib-
ute value prediction based on artificial neural network and rough set
theory, International Conference on Biomedical Engineering and In-
formatics, Sanya, Hainan, China. In press
10. Troyanskaya O, et al (2001) Missing value estimation methods for
DNA microarrays. Bioinformatics, 17(6): p. 520-525.
11. Wang X, et al (2006) Missing value estimation for DNA microarray
gene expression data by Support Vector Regression imputation and
orthogonal coding scheme. BMC Bioinformatics, 7(1): p. 32.
12. Junninen H, et al (2004) Methods for imputation of missing values in
air quality data sets. Atmospheric Environment, 38(18): p. 2895.
13. Bhattacharya B, Shresta DL, Solomatine DP (2003) Neural networks
in constructing missing wave data in sedimentation modelling.
XXXth IAHR Congress, Thessaloniki, Greece.
14. Siripitayananon P, Hui-Chuan C, Kang-Ren J (2002) Estimating
missing data of wind speeds using neural network. Proceedings IEEE
SoutheastCon, 2002.
15. Pawlak Z (1982) Rough Sets. International Journal of Computer and
Information Sciences, 11(5): p. 341-355.
16. Newman DJ, et al (1998) UCI Repository of machine learning databases, University of California, Irvine, Department of Information and Computer Science.
17. Fawcett T (2006) An introduction to ROC analysis. Pattern Recogni-
tion Letters, 27(8): p. 861.

[Figure: accuracy (0.7 to 0.95) vs. number of rules (0 to 2000) for the ANNRST, KNN and CMCF data sets]
Fig. 7 Rule filtering effect on the accuracy of the classifier with three different data sets.
[Figure: coverage (0.3 to 1.0) vs. number of rules (0 to 2000) for the ANNRST, KNN and CMCF data sets]
Fig. 8 Rule filtering effect on the coverage of the classifier with three different data sets.
[Figure: AUC (0.76 to 0.92) vs. number of rules (0 to 2000) for the ANNRST, KNN and CMCF data sets]
Fig. 9 Rule filtering effect on the AUC-ROC of the classifier with three different data sets.
