I. INTRODUCTION
Predicting heart disease with data mining is one of the most interesting and challenging
tasks in data analysis. A shortage of specialists and a large number of wrongly diagnosed
cases have made fast and efficient detection systems necessary. Using a classifier model,
we can identify the key patterns and prevalent features in medical data and predict heart
disease. The attributes most relevant to the diagnosis of heart disease can also be observed,
which can help medical practitioners understand the root cause of the disease. The prediction
of heart disease is based on risk factors such as the individual's age, family history,
diabetes, blood pressure (BP), high cholesterol, smoking, excessive alcohol intake and obesity.
The World Health Organization reports every year that one third of all global deaths are due
to cardiovascular disease. This disease is expected to become the leading cause of death in
developing countries because of changes in lifestyle, work culture and food habits. Hence,
more careful and efficient methods for periodically monitoring cardiac disease are gaining
importance.
The health care industry collects huge amounts of data which, owing to limitations of
technology, go unmined. The information embedded in the data and the patterns it contains
remain untraced, which limits the effectiveness of decision making. Advanced data mining
techniques and the introduction of Big Data technology make it possible to use this data to
better predict the occurrence of diseases, whether heart disease or any other medical
ailment. The advent of big data has had a tremendous impact on e-commerce, the economy and
scientific research. Combining the benefits of big data with the health care industry could
help serve mankind better by providing advanced patient care and supporting clinical
decisions.
Because the heart is the most important organ in the human body, taking care of it and
detecting any ailments early has become increasingly important. In developing countries
such as South Africa, cardiovascular diseases account for the largest number of deaths. A
number of attributes have an impact on heart disease. Data mining plays an important role in
all areas of today's world. Data mining systems can answer complex queries that traditional
decision support systems do not support. Data mining algorithms such as SVM (support vector
machine), Naive Bayes and K-nearest neighbors are widely used for prediction because they
give accurate results compared with other algorithms. In this paper, we use these algorithms
to analyze and visualize a data set for predicting heart disease.
Healthcare organizations face the challenge of providing services to their patients at
reasonable prices. Providing the right diagnosis and administering the right treatment to
patients is a huge task. Most hospitals today employ some kind of hospital information
system to manage patient data. These systems generate huge amounts of data in the form of
numbers, text, charts and images. Unfortunately, these data are rarely used to support
clinical decision making. They hold a wealth of hidden information that goes untapped
because of their size and lack of structure. This data can be considered Big Data.
Big data is not a single technology but a combination of older and newer technologies that
helps companies gain actionable insight. Big data technology therefore has the capability to
manage a huge volume of varied data, at the right velocity and within the right time frame,
to allow real-time analysis and reaction. Hadoop is an open source framework consisting of
HDFS (Hadoop Distributed File System), HBase and Hadoop MapReduce, and it is capable of
analyzing big data [6-9]. With its help we can write and run application programs for
processing big data.
II. DATA MINING TECHNIQUES
A. Data Mining has existed for quite some time, but its potential has been realized
only lately. It combines machine learning, database technology and statistics to find
hidden patterns and relationships in huge amounts of data. Every mining technique has
a different purpose, such as classification or prediction. Classification models
predict categorical labels (discrete or unordered), while prediction models predict
continuous-valued functions. Decision Trees and Neural Networks are classification
algorithms; on the other hand, we have Regression, Association Rules and Clustering.
B. Naive Bayes Classifier: The Naive Bayes algorithm often outperforms more
sophisticated algorithms. It is a good tool for the medical diagnosis of heart disease,
followed by Neural Networks and Decision Trees. Naive Bayes performs better than
decision trees because it can identify the significant medical predictors. The
probabilities used in the Naive Bayes algorithm are calculated with Bayes' rule. Here
the probability of a hypothesis H given evidence E is calculated from the prior
probability of H and the probability of the evidence according to the following formula:
$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$$
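As a rough illustration of how Bayes' rule is applied under the naive independence assumption, the sketch below multiplies per-attribute likelihoods and normalizes over the two hypotheses (disease present or absent). The function name `naive_bayes_posterior` and all probability values are hypothetical and chosen only for the arithmetic; they are not taken from the data set used in this paper.

```python
from math import prod


def naive_bayes_posterior(prior_h, lik_given_h, lik_given_not_h):
    """Return P(H | E1..En) for a two-class problem.

    prior_h          -- P(H), prior probability of the hypothesis
    lik_given_h      -- list of P(Ei | H), one entry per piece of evidence
    lik_given_not_h  -- list of P(Ei | not H), same order
    The evidence items are assumed conditionally independent given the class.
    """
    joint_h = prior_h * prod(lik_given_h)
    joint_not_h = (1.0 - prior_h) * prod(lik_given_not_h)
    # P(E) is expanded by the law of total probability over H and not H.
    return joint_h / (joint_h + joint_not_h)


# Made-up numbers: P(H) = 0.30, one symptom with P(E|H) = 0.80, P(E|not H) = 0.20
single = naive_bayes_posterior(0.30, [0.80], [0.20])
# Adding a second, independent piece of evidence
double = naive_bayes_posterior(0.30, [0.80, 0.60], [0.20, 0.30])
print(round(single, 4), round(double, 4))
```

With a single piece of evidence the function reduces to the formula above, since the denominator is P(E) written out over both hypotheses.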
C. SVM (support vector machine): SVMs are supervised learning models with associated
learning algorithms that analyze data and recognize patterns. They can be used for
classification and regression analysis.
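To make the classification step concrete, the following minimal sketch shows how a linear SVM that has already been trained assigns a class and how the hinge loss measures a margin violation. The weight vector `w`, bias `b` and sample points are hypothetical values picked for easy arithmetic; training itself, and the non-linear kernels that SVM tools commonly offer, are outside the scope of this sketch.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_predict(w, b, x):
    """Return +1 (e.g. disease present) or -1 (absent)."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def hinge_loss(w, b, x, y):
    """Zero when x is on the correct side with margin >= 1, positive otherwise."""
    return max(0.0, 1.0 - y * decision_value(w, b, x))


# Hypothetical separating hyperplane and two-attribute samples
w, b = (0.5, 0.5), -3.0
print(svm_predict(w, b, (1.0, 1.0)))      # score -2.0
print(svm_predict(w, b, (4.0, 5.0)))      # score 1.5
print(hinge_loss(w, b, (3.0, 3.5), 1))    # correct side, but inside the margin
```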
D. K-nearest neighbors: K-nearest neighbors (KNN) is a simple algorithm that stores all
available cases and classifies new cases based on a similarity measure. KNN has
been used in statistical estimation and pattern recognition.
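The store-and-compare behaviour described above can be sketched in a few lines. The example assumes Euclidean distance as the similarity measure and a majority vote among the k closest stored cases; the function name `knn_predict`, the value k = 3 and the toy points are illustrative assumptions only.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest stored cases."""
    # Rank the stored cases by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy stored cases with two numeric attributes each
points = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (4.0, 4.5), (5.0, 4.0), (4.5, 5.0)]
labels = ["absent", "absent", "absent", "present", "present", "present"]

print(knn_predict(points, labels, (1.2, 1.4)))
print(knn_predict(points, labels, (4.6, 4.4)))
```

An odd k avoids tied votes in a two-class problem such as this one.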
III. DATASET
There are 14 attributes in the medical data [16]. These fourteen attributes are listed in Figure 1.
For simplicity, categorical attributes were used for all models. The reduced data set of 13
attributes is used in the experimentation and is fed to the three classification models.
Sno   Attribute Name        Description
1.    Age                   Age in years
2.    Sex                   Male=1, Female=0
3.    Cp
4.    Rbp
5.    Cholesterol
6.    Fasting Blood Sugar   Fasting blood sugar > 120 mg/dl: true=1, false=0
7.    Resting ECG
8.    Thalach
9.    Induced Angina
11.   Slope
12.   Thal
13.   CA
The ORANGE tool is used to experiment with three data mining algorithms: SVM, Naive Bayes
and K-NN.
A. Attributes considered
To analyze the data, we need attributes on which to base our predictions. We have used 13
attributes: age, gender, chest pain, rest SBP, cholesterol, fasting blood sugar, rest ECG,
max HR, ST by exercise, induced angina, slope peak by ST, thal and number of vessels colored.
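One possible way to turn a patient record into the 13-value input expected by a classifier is sketched below. The identifiers in `ATTRIBUTES` are hypothetical names for the attributes listed above, and the sample record is made up. Only the sex coding (male = 1, female = 0) and the fasting blood sugar threshold (greater than 120 mg/dl = 1) come from the attribute table.

```python
# Hypothetical identifiers for the 13 attributes named in the text.
ATTRIBUTES = [
    "age", "sex", "chest_pain", "rest_sbp", "cholesterol",
    "fasting_blood_sugar", "rest_ecg", "max_hr", "st_by_exercise",
    "induced_angina", "slope", "thal", "vessels_colored",
]

SEX_CODES = {"male": 1, "female": 0}  # coding taken from the attribute table


def encode_sex(sex):
    """Map 'male'/'female' (any case) to 1/0; other values raise KeyError."""
    return SEX_CODES[sex.strip().lower()]


def encode_fasting_blood_sugar(mg_per_dl):
    """Return 1 if fasting blood sugar exceeds 120 mg/dl, else 0."""
    return 1 if mg_per_dl > 120 else 0


def to_feature_vector(record):
    """Order a dict of already-encoded values into a fixed-length list."""
    missing = [name for name in ATTRIBUTES if name not in record]
    if missing:
        raise ValueError(f"missing attributes: {missing}")
    return [record[name] for name in ATTRIBUTES]


# Made-up record: every attribute defaults to 0 except the three set here.
record = {name: 0 for name in ATTRIBUTES}
record.update(age=54,
              sex=encode_sex("Female"),
              fasting_blood_sugar=encode_fasting_blood_sugar(135))
vector = to_feature_vector(record)
print(vector)
```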
B. Result Observed
We build a model from a dataset with known classes and use another dataset to evaluate the
model, comparing the predicted classes against the actual classes. The confusion matrix
shows the number of correct and incorrect predictions made by the classification model
against the actual classes.
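The evaluation step can be sketched as follows: a confusion matrix is tallied from paired actual and predicted classes, and accuracy is read off its diagonal. The label vectors are invented for illustration and are not results from the experiments reported here; the helper names `confusion_matrix` and `accuracy` are likewise assumptions.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Fraction of predictions that fall on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Invented evaluation data: 1 = disease present, 0 = absent
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted, labels=[0, 1])
print(matrix)            # [[3, 1], [1, 3]]
print(accuracy(matrix))  # 0.75
```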
[Figure: Confusion matrix for SVM; axes: Prediction, Correct class]
[Figure: Calibration graph]
The ORANGE tool was used to experiment with three data mining algorithms: SVM, Naive Bayes
and K-NN. On comparing these three algorithms, SVM gives the highest correctness, followed
by Naive Bayes and then K-NN. Most of the information generated by the health care
industry is a mix of structured, semi-structured and unstructured data, and this data meets
the 3V requirement of big data. Therefore, the paper proposes carrying out the
above-mentioned prediction methods in parallel using MapReduce.
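As one possible reading of this proposal, the counting step of Naive Bayes training fits the map and reduce pattern naturally: each mapper emits a count of 1 per (attribute, value, class) triple, and the reducers sum them. The sketch below only simulates the map, shuffle and reduce phases in a single Python process. It does not use the Hadoop API, and the function names, key layout and sample records are all assumptions made for illustration.

```python
from collections import defaultdict


def map_record(record):
    """Map phase: emit one count per class and per (attribute, value, class)."""
    features, label = record
    yield (("__class__", None, label), 1)
    for name, value in features.items():
        yield ((name, value, label), 1)


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the counts collected for one key."""
    return key, sum(values)


def run_job(partitions):
    """Run map over every partition, then shuffle and reduce."""
    pairs = [pair
             for part in partitions
             for rec in part
             for pair in map_record(rec)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Two made-up data partitions, as if stored on two different nodes
partitions = [
    [({"sex": 1, "fbs": 0}, "present"), ({"sex": 0, "fbs": 0}, "absent")],
    [({"sex": 1, "fbs": 1}, "present")],
]
counts = run_job(partitions)
print(counts[("sex", 1, "present")], counts[("__class__", None, "present")])
```

Dividing an attribute count by the matching class count gives the conditional probability estimate that Bayes' rule needs, so the per-partition work can proceed independently.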
REFERENCES
[1] ByungKwan Lee, EunHee Jeong, A Design of a Patient-customized Healthcare
System based on the Hadoop with Text Mining (PHSHT) for an efficient Disease
Management and Prediction, International Journal of Software Engineering and Its
Applications, Vol. 8, No. 8 (2014), pp. 131-150,
http://dx.doi.org/10.14257/ijseia.2014.8.8,13
[2] S. Vijiyarani et al., An Efficient Classification Tree Technique for Heart Disease
Prediction, International Conference on Research Trends in Computer
Technologies (ICRTCT - 2013), Proceedings published in International
Journal of Computer Applications (IJCA) (0975-8887), 2013.
[3] Chaitrali S. Dangare, Sulabha S. Apte, A data mining approach for
prediction of heart disease using neural networks, International Journal of Computer
Engineering and Technology, 2012.
[4] V.V.R. Maheswara Rao, V. Valli Kumari, N. Silpa, An Extensive Study On
Leading Research Paths On Big Data Techniques, International Journal of
Computer Engineering & Technology (IJCET), Volume 6, Issue 12, Dec 2015, pp. 2034, Article ID: IJCET_06_12_004.
[5] N. Aditya Sundar, P. Pushpa Latha, M. Rama Chandra, Performance analysis of
classification data mining techniques over heart diseases data base, International
Journal of Engineering Science and Advanced Technology, 2012.
[6] Shadab Adam Pattekari and Asma Parveen, Prediction system for heart disease using
naive Bayes, International Journal of Advanced Computer and Mathematical
Sciences, 2012.
[7] Hadoop: http://hadoop.apache.org
[8] T. White, Hadoop: The Definitive Guide, O'Reilly, (2009).
[9] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters,
In Proceedings of the 6th Symposium on Operating Systems Design and
Implementation, San Francisco, CA, (2004) Dec.
[10] Dhruba Borthakur, The Hadoop Distributed File System: Architecture and Design,
http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf.
[11] Sayantan Sur, Hao Wang, Jian Huang, Xiangyong Ouyang and Dhabaleswar K.
Panda, Can High-Performance Interconnects Benefit Hadoop Distributed File
System?, http://doczine.com/bigdata/2/1371884970_6c99485db9/sur-masvdc10.pdf.
[12] MapReduce, https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
[13] I. Hwang, K. Jung, K. Im, and J. Lee, Improving the Map/Reduce Model through
Data Distribution and Task Progress Scheduling, Journal of the Korea Contents
Association, vol. 10, no. 10, (2010), pp. 78-85.
[14] P. K. Pedireddla, S. A. Yadwad, An Effective and Efficient Clustering Based on
K-Means Using MapReduce and TLBO, Proceedings of the Second International
Conference on Computer and Communication Technologies, 2016, pp. 619-628,
Springer India.