N.A. Abu Osman, F. Ibrahim, W.A.B. Wan Abas, H.S. Abd Rahman, H.N. Ting (Eds.): Biomed 2008, Proceedings 21, pp. 266-269, 2008
www.springerlink.com Springer-Verlag Berlin Heidelberg 2008
A Comparative Study of Imputation Methods to Predict Missing Attribute Values in Coronary Heart Disease Data Set

N.A. Setiawan, P.A. Venkatachalam and A.F.M. Hani
Department of Electrical and Electronic Engineering, Universiti Teknologi PETRONAS, Bandar Seri Iskandar 31750 Tronoh, Perak, Malaysia

Abstract The objective of this research is to investigate the effects of missing attribute value imputation methods on the quality of extracted rules when rule filtering is applied. Three imputation methods, Artificial Neural Network with Rough Set Theory (ANNRST), k-Nearest Neighbor (k-NN) and Concept Most Common Attribute Value Filling (CMCF), are applied to the University of California Irvine (UCI) coronary heart disease data sets. Rough Set Theory (RST) is used to generate rules from the three imputed data sets. Support filtering is used to select the rules. Accuracy, coverage, sensitivity, specificity and Area Under Curve (AUC) of Receiver Operating Characteristics (ROC) analysis are used to evaluate the performance of the rules when they are applied to classify the complete testing data set. Evaluation results show that ANNRST can be considered the best of the three methods.

Keywords Missing attribute value, imputation, rough set theory.

I. INTRODUCTION

The problem of finding appropriate models that describe or classify data is encountered in many areas. Mathematical or empirical methods are used to find the best model. Empirical methods are used when it is difficult to find an appropriate mathematical model that describes or classifies the data. One widely used method is knowledge discovery from data (KDD). The knowledge discovered is usually in the form of rules. In biomedicine, KDD is commonly used to build decision support systems.

The KDD process consists of several steps. One of the important steps is data preprocessing, because the quality of the extracted knowledge depends on the quality of the data. The preprocessing step includes handling missing attribute values. Obtaining a large amount of high quality data may not always be an easy task or even possible.
Medical data acquisition in practice may be a non-automated task. Much medical data involves laboratory work. Most hospitals transcribe the dictated notes from physicians manually [1]. Not all laboratory tests are available in most hospitals. Incompleteness is therefore a common issue in medical data set collection.

Much research on missing attribute value prediction has been conducted, both on general data sets and on data sets from specific areas. Grzymala-Busse and Hu [2] present and compare nine different approaches to missing attribute values. They do not find enough evidence to claim that any one approach is superior. Al Shalabi, Najjar and Al Kayed [3] present a framework to deal with missing data. They compare four algorithms for imputing the missing data and evaluate the generated rules. Li and Cercone [4] proposed RSFit, a combination of rough set and distance methods, to assign the missing data. Wasito and Mirkin [5, 6] present a nearest neighbor approach in least-squares data imputation algorithms. Ragel and Cremilleux [7] built a missing values completion method based on association rules.

In specific areas, ANNRST is presented by Setiawan, Venkatachalam and Hani [8, 9] to predict the missing attribute values of the UCI coronary heart disease data set. Troyanskaya et al. [10] compared missing data imputation methods for DNA microarrays. Missing value estimation in DNA microarrays is also found in [11]. Missing value estimation methods in other specific areas are discussed in [12-14].

Most of these researchers consider only the accuracy or error of the imputations. Only Al Shalabi et al. [3] considered the evaluation of the generated knowledge to select the appropriate method, although they concluded that the best imputation model is task-dependent. Rough set theory (RST), proposed by Pawlak [15], is considered a new theory that can be applied in KDD. RST usually generates a large number of decision rules.
Rule filtering must be used to select the most important, high quality rules. The quality of the classifier will be reduced during the rule filtering process. How the imputation methods affect the quality of the generated rules during filtering needs to be answered.

In this research three imputation methods, namely ANNRST, k-NN and CMCF, are compared on the UCI coronary heart disease data sets [16]. Rules are generated from the three imputed data sets using RST. The most complete data set of the UCI coronary heart disease data is used as testing data while filtering the generated rules. Support filtering is applied to the three sets of generated rules. Accuracy, coverage, sensitivity, specificity and AUC of ROC analysis of the classifiers on the testing data are evaluated.

IFMBE Proceedings Vol. 21
II. DESCRIPTION OF INVESTIGATED IMPUTATION METHODS

The following three methods for imputing missing attribute values are used in this experiment:

A. Concept most common attribute value filling (CMCF)

Most common attribute value filling is the simplest method of dealing with missing attribute values. The value that occurs most frequently is selected as the value of the missing attribute. CMCF restricts most common attribute value filling to a concept (decision class): the value that occurs most frequently within the same concept is selected as the imputed value. CMCF is also called the maximum relative frequency method or the maximum conditional probability method.

B. k-Nearest neighbor method (k-NN)

The k-NN algorithm can be described as follows: take the row that contains the missing value as the target; determine its k nearest neighbors by computing the Euclidean distance between the incomplete row and the complete rows; identify the k closest rows; and impute the missing value in the target row by averaging the corresponding values of the k closest complete rows.

C. Artificial neural network with rough set theory method

The ANNRST algorithm has two parts: RST attribute reduction and construction of an ANN to predict the missing attribute values. A reduct is a combination of attributes that can discern between objects as well as all attributes in the decision system:

$$S = (U, A, d) \tag{1}$$

where $U$ is the set of objects, $A$ is the set of conditional attributes and $d$ represents the decision attribute. Reducts can be found as the prime implicants of the discernibility function for the attribute set $A$:

$$f_A(a_1^*, \ldots, a_m^*) = \bigwedge \left\{ \bigvee c_{ij}^* \;\middle|\; 1 \le j \le i \le n,\ c_{ij} \ne \emptyset \right\} \tag{2}$$

where the Boolean variables $c_{ij}^* = \{ a^* \mid a \in c_{ij} \}$ (with $a_1^*, \ldots, a_m^*$ corresponding to the attributes $a_1, \ldots, a_m$) and

$$c_{ij} = \{ a \in A \mid a(x_i) \ne a(x_j) \text{ and } d(x_i) \ne d(x_j) \} \tag{3}$$

for $i, j = 1, \ldots, n$, where $n$ is the number of objects in the decision table.
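The discernibility definitions above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the dictionary-based row format and the brute-force search are assumptions made for illustration, and the search returns only minimum-cardinality reducts, which is practical for a handful of attributes only.

```python
from itertools import combinations


def discernibility_matrix(rows, cond_attrs, decision):
    """Build c_ij for every pair j < i of objects.

    An entry holds the condition attributes on which the two objects
    differ, and is empty when both objects share the same decision.
    """
    matrix = {}
    for j, i in combinations(range(len(rows)), 2):  # guarantees j < i
        xi, xj = rows[i], rows[j]
        if xi[decision] == xj[decision]:
            matrix[(i, j)] = set()
        else:
            matrix[(i, j)] = {a for a in cond_attrs if xi[a] != xj[a]}
    return matrix


def hits_all(attrs, matrix):
    """True if attrs intersects every non-empty entry of the matrix,
    i.e. attrs satisfies the conjunction of disjunctions f_A."""
    return all(entry & attrs for entry in matrix.values() if entry)


def minimal_reducts(rows, cond_attrs, decision):
    """Brute-force search for the smallest attribute subsets that
    satisfy the discernibility function (hypothetical helper)."""
    matrix = discernibility_matrix(rows, cond_attrs, decision)
    for size in range(0, len(cond_attrs) + 1):
        found = [set(c) for c in combinations(cond_attrs, size)
                 if hits_all(set(c), matrix)]
        if found:
            return found
    return []


if __name__ == "__main__":
    # Toy decision table; attribute names are made up for the example.
    table = [
        {"a": 0, "b": 1, "c": 0, "d": 0},
        {"a": 0, "b": 0, "c": 1, "d": 1},
        {"a": 1, "b": 1, "c": 1, "d": 1},
        {"a": 1, "b": 1, "c": 0, "d": 0},
    ]
    print(minimal_reducts(table, ["a", "b", "c"], "d"))
```

In this toy table only attribute c separates every pair of objects with different decisions, so it alone forms the reduct.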
After the minimal reduct is computed, an ANN is constructed whose inputs are the corresponding reduct condition attributes and the decision attribute. To impute missing values of m attributes, m ANN topologies must be constructed. If there are p condition attributes and q decision attributes, then the ANN has p-1+q input attributes, including the decision attribute, with the one condition attribute that contains missing values as output. Full discernibility reduct computation is used to reduce the attributes in the attribute reduction and feature selection problem. Further detail on ANNRST can be found in [8, 9].

III. ROUGH SET THEORY CLASSIFIER

The RST classifier is based on rules. The rules are generated from relative reducts in the form IF condition(s) THEN decision(s), by combining the reducts with the attribute values in the data set. A rule is called deterministic if the combination of attributes in the antecedent implies only a single decision in the consequent; otherwise it is called probabilistic.

The support of a rule is defined as the number of objects or instances in the training data that match the rule. Support is a basic filtering criterion when RST generates a large number of rules. The classifier is evaluated by applying the generated rules to the testing data and computing the accuracy, coverage, sensitivity and specificity. Receiver operating characteristic (ROC) analysis is introduced as a combination of the sensitivity and specificity of the classifier. The area under the ROC curve is also used to evaluate the classifier [17].

IV. EXPERIMENT AND RESULTS

The UCI coronary artery disease data set is used. Originally, the data comprise 920 objects, collected from four different sources: Cleveland Clinic Foundation, U.S.; Hungarian Institute of Cardiology, Budapest, Hungary; V.A. Medical Center, Long Beach, U.S.; and University Hospital, Zurich, Switzerland.
After selection, 661 objects are used, of which 351 contain missing values. The data have thirteen conditional attributes and one decision attribute, which is the presence of coronary artery disease. All numerical values are discretized. Data are missing in three conditional attributes: the slope of the peak exercise ST segment of the EKG, the number of vessels colored by fluoroscopy and exercise thallium scintigraphic defects. ANNRST, k-NN and CMCF are used to impute the missing values in these three attributes. Each imputed data set is used as training data for the RST classifier and the complete data set is used as testing data. The generated rules are then filtered based on their support. There are three sets of imputed data: the ANNRST, k-NN and CMCF imputed data sets.
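The two simpler imputation methods described in Section II, CMCF and k-NN, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the experiment: rows are plain dictionaries, a missing value is encoded as `None`, the attribute names are hypothetical, and the features used for the Euclidean distance are assumed to be complete and numeric.

```python
import math
from collections import Counter

MISSING = None  # assumed encoding of a missing attribute value


def cmcf_impute(rows, attr, decision):
    """Fill missing values of attr with the value that occurs most
    frequently among rows of the same concept (decision class)."""
    counts = {}
    for r in rows:
        if r[attr] is not MISSING:
            counts.setdefault(r[decision], Counter())[r[attr]] += 1
    filled = []
    for r in rows:
        r = dict(r)  # copy, so the input rows stay unchanged
        if r[attr] is MISSING and r[decision] in counts:
            r[attr] = counts[r[decision]].most_common(1)[0][0]
        filled.append(r)
    return filled


def knn_impute(rows, attr, features, k=3):
    """Fill missing values of attr with the average of attr over the
    k complete rows closest in Euclidean distance on the features."""
    complete = [r for r in rows if r[attr] is not MISSING]
    filled = []
    for r in rows:
        r = dict(r)
        if r[attr] is MISSING and complete:
            target = [r[f] for f in features]
            nearest = sorted(
                complete,
                key=lambda c: math.dist(target, [c[f] for f in features]),
            )[:k]
            r[attr] = sum(c[attr] for c in nearest) / len(nearest)
        filled.append(r)
    return filled


if __name__ == "__main__":
    data = [
        {"age": 1, "chol": 2, "slope": 1, "cad": 0},
        {"age": 3, "chol": 3, "slope": 2, "cad": 1},
        {"age": 1, "chol": 1, "slope": MISSING, "cad": 0},
    ]
    print(cmcf_impute(data, "slope", "cad")[-1])
    print(knn_impute(data, "slope", ["age", "chol"], k=1)[-1])
```

Because the attributes in this study are discretized, the k-NN average may fall between two categories; rounding it to the nearest category would be one option, but that choice is an assumption and is not specified in the text.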
The RST classifier generates 4095, 5566 and 1537 rules for the ANNRST, k-NN and CMCF data sets respectively. Support filtering is applied by removing rules whose support falls below a threshold that is varied from one to twenty-five. The maximum threshold of twenty-five is chosen because the accuracy of the classifier is still good at that value (0.823 for the ANNRST data set) and the number of pruned rules changes little beyond twenty-five. The effect of rule support filtering on the number of rules can be seen in Fig. 1. The number of rules decreases roughly exponentially for all three data sets. The effects of rule support filtering on the accuracy, coverage, sensitivity and specificity are shown in Fig. 2, Fig. 3, Fig. 4 and Fig. 5 respectively.
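The support filtering step described above can be sketched as follows. The `Rule` structure and function names are hypothetical, and each rule's support is assumed to have been computed already on the training data; rules are kept when their support is at least the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: tuple  # e.g. (("slope", 2), ("vessels", 0)); illustrative only
    decision: int      # class predicted by the rule
    support: int       # number of training objects that match the rule


def filter_by_support(rules, min_support):
    """Remove every rule whose support is below min_support."""
    return [r for r in rules if r.support >= min_support]


def rules_per_threshold(rules, max_support=25):
    """Number of surviving rules for each threshold 1..max_support."""
    return {s: len(filter_by_support(rules, s))
            for s in range(1, max_support + 1)}


if __name__ == "__main__":
    # Made-up supports, only to show the shape of the sweep.
    toy = [Rule((("slope", 1),), 0, s) for s in (1, 1, 2, 5, 12, 30)]
    print(rules_per_threshold(toy, max_support=5))
```

Sweeping the threshold in this way yields the kind of rules-versus-support curve that Fig. 1 reports for the three data sets.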
Fig. 1 Support filtering effect on the number of rules with three different data sets. [Plot: number of rules versus support, 0 to 25, for ANNRST, k-NN and CMCF.]

Fig. 2 Support filtering effect on the accuracy of classifier with three different data sets. [Plot: accuracy versus support, 0 to 25, for ANNRST, k-NN and CMCF.]

Fig. 3 Support filtering effect on the coverage of classifier with three different data sets. [Plot: coverage versus support, 0 to 25, for ANNRST, k-NN and CMCF.]

Fig. 4 Support filtering effect on the sensitivity of classifier with three different data sets. [Plot: sensitivity versus support, 0 to 25, for ANNRST, k-NN and CMCF.]
Fig. 5 Support filtering effect on the specificity of classifier with three different data sets. [Plot: specificity versus support, 0 to 25, for ANNRST, k-NN and CMCF.]

The ANNRST data set performs better on accuracy and sensitivity than k-NN and CMCF, except at support 17, where k-NN performs better but with decreased coverage. The coverage of the classifier on the k-NN data set drops significantly once the support threshold reaches about ten. The CMCF data set gives good results on specificity and outperforms ANNRST on that measure.

ROC analysis is used to assess classifier performance on sensitivity and specificity together by calculating the area under the ROC curve (AUC). Neither sensitivity nor specificity alone can represent the classifier performance; both must be considered. Fig. 6 shows the effect of rule support filtering on the AUC of ROC analysis. It can be seen that the classifier built on the ANNRST data set has a better AUC than the k-NN and CMCF classifiers.
Fig. 6 Support filtering effect on the AUC-ROC of classifier with three different data sets. [Plot: AUC versus support, 0 to 25, for ANNRST, k-NN and CMCF.]
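The evaluation measures used in these comparisons can be sketched as follows. Several details here are assumptions, since the text does not define them: a prediction of `None` marks a test object not matched by any rule, accuracy is computed over covered objects only, class 1 denotes presence of disease, and the AUC is computed with the rank-based (Mann-Whitney) formulation from hypothetical per-object scores.

```python
def _ratio(num, den):
    """Safe division that returns NaN for an empty denominator."""
    return num / den if den else float("nan")


def evaluate(y_true, y_pred):
    """Accuracy, coverage, sensitivity and specificity of a rule-based
    classifier; y_pred holds 0, 1 or None (object not covered)."""
    covered = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    tp = sum(1 for t, p in covered if t == 1 and p == 1)
    tn = sum(1 for t, p in covered if t == 0 and p == 0)
    fp = sum(1 for t, p in covered if t == 0 and p == 1)
    fn = sum(1 for t, p in covered if t == 1 and p == 0)
    return {
        "coverage": _ratio(len(covered), len(y_true)),
        "accuracy": _ratio(tp + tn, len(covered)),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


def auc(y_true, scores):
    """Area under the ROC curve as the probability that a positive
    object receives a higher score than a negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))


if __name__ == "__main__":
    truth = [1, 1, 1, 0, 0, 0]
    preds = [1, 1, 0, 0, 0, None]             # last object is not covered
    certainty = [0.9, 0.8, 0.4, 0.3, 0.5, 0.1]  # made-up scores
    print(evaluate(truth, preds))
    print(auc(truth, certainty))
```

Reporting coverage alongside accuracy matters here because aggressive support filtering can leave some test objects unmatched by any rule.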
The AUC in Fig. 6 is plotted against the support threshold. The number of rules must also be considered, because the same support threshold yields a different number of rules for each data set, as shown in Fig. 1. The accuracy, coverage and AUC-ROC evaluated at the same number of rules for the three data sets can be seen in Fig. 7, Fig. 8 and Fig. 9 respectively.

V. CONCLUSIONS

The main objective of this research is to investigate the effects of missing attribute value imputation methods on the quality of extracted rules when rule filtering is applied. Determining the best imputation method is a difficult task. Many aspects of the impact of imputation must be considered, and selecting the best imputation method depends on the task and the type of data. This paper used classifier quality as the criterion for evaluating imputation methods for missing data in coronary heart disease. ANNRST imputation can be considered the best method in the case of the UCI coronary heart disease data sets. It can also be considered for data that have the same properties as the UCI coronary heart disease data.

VI. REFERENCES

1. Ohrn A (1999) Discernibility and rough sets in medicine: tools and applications. PhD thesis, Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim.
2. Grzymala-Busse J, Hu M (2001) A comparison of several approaches to missing attribute values in data mining. Rough Sets and Current Trends in Computing, p. 378.
3. Al Shalabi L, Najjar M, Al Kayed A (2006) A framework to deal with missing data in data sets. Journal of Computer Science, 2(9): p. 740-745.
4. Li J, Cercone N (2006) Assigning missing attribute values based on rough sets theory. IEEE International Conference on Granular Computing, 2006.
5. Wasito I, Mirkin B (2005) Nearest neighbour approach in the least-squares data imputation algorithms. Information Sciences, 169(1-2): p. 1.
6. Wasito I, Mirkin B (2006) Nearest neighbours in least-squares data imputation algorithms with different missing patterns. Computational Statistics & Data Analysis, 50(4): p. 926.
7. Ragel A, Cremilleux B (1999) MVC - a preprocessing method to deal with missing values. Knowledge-Based Systems, 12(5-6): p. 285.
8. Setiawan NA, Venkatachalam PA, Hani AFM (2007) Missing data estimation on heart disease using artificial neural network and rough set theory. International Conference on Intelligent and Advanced Systems, 2007, Kuala Lumpur, Malaysia.
9. Setiawan NA, Venkatachalam PA, Hani AFM (2008) Missing attribute value prediction based on artificial neural network and rough set theory. International Conference on Biomedical Engineering and Informatics, Sanya, Hainan, China. In press.
10. Troyanskaya O, et al (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6): p. 520-525.
11. Wang X, et al (2006) Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme. BMC Bioinformatics, 7(1): p. 32.
12. Junninen H, et al (2004) Methods for imputation of missing values in air quality data sets. Atmospheric Environment, 38(18): p. 2895.
13. Bhattacharya B, Shresta DL, Solomatine DP (2003) Neural networks in constructing missing wave data in sedimentation modelling. XXXth IAHR Congress, Thessaloniki, Greece.
14. Siripitayananon P, Hui-Chuan C, Kang-Ren J (2002) Estimating missing data of wind speeds using neural network. Proceedings IEEE SoutheastCon, 2002.
15. Pawlak Z (1982) Rough sets. International Journal of Computer and Information Sciences, 11(5): p. 341-355.
16. Newman DJ, et al (1998) UCI Repository of machine learning databases. University of California, Irvine, Department of Information and Computer Science.
17. Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters, 27(8): p. 861.
Fig. 7 Rule filtering effect on the accuracy of classifier with three different data sets. [Plot: accuracy versus number of rules, 0 to 2000, for ANNRST, k-NN and CMCF.]

Fig. 8 Rule filtering effect on the coverage of classifier with three different data sets. [Plot: coverage versus number of rules, 0 to 2000, for ANNRST, k-NN and CMCF.]

Fig. 9 Rule filtering effect on the AUC-ROC of classifier with three different data sets. [Plot: AUC versus number of rules, 0 to 2000, for ANNRST, k-NN and CMCF.]