International Journal of Computer Application (2250-1797)

Volume 6 No.6, November December 2016

Prediction of Heart Disease using Hadoop MapReduce


Sunita A Yadwad , Pediredla Praveen Kumar
Department of Computer science and Engineering
Pragati Engineering College,Kakinada Andhra Pradesh,India.
9059529510
Department of Computer science and Engineering
Anil Neerukonda Institute of
Technology and Sciences,Visakhapatnam,Andhra Pradesh,India.
9985063343
ABSTRACT
The healthcare industry has advanced tremendously in the last century. Since the age of the
Vedas, medicine has evolved, innovated, and broadened its approach to patient care and
service. This paper proposes the use of data mining techniques on big data to predict heart
attacks. It discusses the variety of platforms and data mining algorithms available for big
data analytics.
In this paper, we propose a parallel scheme for prediction using classification methods such
as SVM, Naive Bayes, and K-nearest neighbors. The parallel classification is based on the
MapReduce algorithm for heart disease prediction. The algorithm we propose performs much
better than the Naive Bayes (NB) classifier and the one-by-one SVM classifier. We take
advantage of MapReduce through Hadoop's implementation, which is a distributed
computing framework.
Keywords: SVM, Naive Bayes, Prediction, Hadoop, MapReduce

I. INTRODUCTION
The prediction of heart disease using data mining is one of the most interesting and
challenging tasks in data analysis. A dearth of specialists and an abundance of wrongly
diagnosed cases have necessitated the development of fast and efficient detection systems.
Using a classifier model, we can identify the key patterns or prevalent features in medical
data and predict heart disease. The most relevant attributes for heart disease diagnosis can
be observed, which also helps medical practitioners understand the root cause of the disease.
Prediction of heart disease is based on risk factors such as the individual's age, family
history, diabetes, blood pressure (BP), high cholesterol, smoking, excessive alcohol intake,
and obesity.
The World Health Organization reports every year that one third of global deaths are due to
cardiovascular disease. This disease is expected to be the leading cause of death in
developing countries because of changes in lifestyle, work culture, and food habits. Hence,
more careful and efficient methods for periodically monitoring cardiac diseases are gaining
importance.
The healthcare industry collects huge amounts of data which, due to limitations of
technology, go unmined. The information embedded in the data and the patterns it portrays
remain untraced. This limits the effectiveness of decision making. Advanced data mining
techniques and the introduction of big data technology make it possible to use this data to
better predict the occurrence of diseases, whether heart disease or any other medical
ailment. The advent of big data has had a tremendous impact on e-commerce, the economy,
and scientific research. Combining the benefits of big data with the healthcare industry
could help serve mankind better by providing advanced patient care and supporting clinical
decisions.
Because the heart is the most important organ in the human body, caring for it and detecting
any ailments has become increasingly important. In developing countries such as South
Africa, cardiovascular diseases account for the largest number of deaths. A number of
attributes have an impact on heart disease. Data mining plays an important role in all areas
of today's world. Data mining systems can answer complex queries not supported by
traditional decision support systems. Data mining algorithms such as SVM (support vector
machine), Naive Bayes, and K-nearest neighbors are widely used for prediction because they
give accurate results compared to other algorithms. In this paper, we use these algorithms to
analyze and visualize a data set for predicting heart disease. Healthcare organizations face
the great challenge of providing services to their patients at reasonable prices. Providing
the right diagnosis and administering the right treatment to patients is an enormous task.
Most hospitals today employ some kind of hospital information system to manage patient
data. These systems generate huge amounts of data in the form of numerical figures or text,
and also charts and images. Unfortunately, these data are rarely used to support clinical
decision making. There is a mine of hidden information in these data that goes untapped due
to its size and lack of structure. This data can be considered big data.
Big data is not a single technology but a combination of older and newer technologies that
helps companies gain actionable insight. Big data technology can therefore manage a huge
volume of varied data, at the right velocity, and within the right time frame to allow
real-time analysis and reaction. Hadoop consists of HDFS (Hadoop Distributed File System),
HBase, and Hadoop MapReduce, and is capable of analyzing big data [6-9]. It is an open
source framework with which we can write and run application programs for processing big
data.
II. DATA MINING TECHNIQUES

A. Data mining has existed for quite some time, but its potential has been realized
only lately. It combines machine learning, database technology, and statistics to
find hidden patterns and relationships in large amounts of data. Every mining
technique has a different purpose, such as classification or prediction.
Classification models predict categorical labels (discrete or unordered), while
prediction models predict continuous-valued functions. Decision Trees and Neural
Networks are classification algorithms; on the other hand, we have Regression,
Association Rules, and Clustering.
B. Naive Bayes Classifier: The Naive Bayes algorithm often outperforms more
sophisticated algorithms. It is a good tool for the medical diagnosis of heart disease,
followed by Neural Networks and Decision Trees. Naive Bayes performs better than
decision trees because it can identify the significant medical predictors. The
probabilities used in the Naive Bayes algorithm are calculated using Bayes' rule.
Here, the probability of a hypothesis H given evidence E is calculated from the prior
probability of H, the probability of E given H, and the probability of E, according
to the following formula:

$$P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}$$

C. SVM (support vector machine): SVMs are supervised learning models with associated
learning algorithms that analyze data and recognize patterns. They can be used for
classification and regression analysis.
D. K-nearest neighbors: K-nearest neighbors (KNN) is a simple algorithm that stores all
available cases and classifies new cases based on a similarity measure. KNN has
been used in statistical estimation and pattern recognition.
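
The K-nearest neighbors description above can be illustrated with a minimal sketch. It assumes Euclidean distance as the similarity measure and a majority vote among the k closest stored cases; the function name `knn_predict`, the two features (age and resting blood pressure), and the sample cases are hypothetical and are not taken from the paper's data.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest stored cases."""
    # Rank all stored cases by Euclidean distance to the query (assumed measure).
    nearest = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical stored cases: ((age, resting blood pressure), class label).
train = [
    ((45, 120), "healthy"),
    ((50, 125), "healthy"),
    ((48, 118), "healthy"),
    ((62, 160), "disease"),
    ((65, 155), "disease"),
    ((70, 165), "disease"),
]

print(knn_predict(train, (64, 158)))
print(knn_predict(train, (47, 121)))
```

In practice the attributes would usually be scaled before computing distances, since attributes with large numeric ranges otherwise dominate the similarity measure.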
III. DATASET
There are 14 attributes in the medical data [18]. These fourteen attributes are listed in Fig 1.
For simplicity, categorical attributes were used for all models. The reduced data set of 13
attributes is used in the experimentation and is fed to the three classification models.
Sno | Attribute Name      | Description
1   | Age                 | Age in years
2   | Sex                 | Male=1, Female=0
3   | Cp                  | Chest pain type
4   | Rbp                 | Resting blood pressure upon hospital admission
5   | Cholesterol         | Serum cholesterol in mg/dl
6   | Fasting Blood Sugar | Fasting blood sugar > 120 mg/dl (true=1, false=0)
7   | Resting ECG         | Resting electrocardiographic results
8   | Thalach             | Maximum heart rate
9   | Induced Angina      | Does the patient experience angina as a result of exercise (value 1: yes, value 0: no)
10  | Old Peak            | ST depression induced by exercise relative to rest
11  | Slope               | Slope of the peak exercise ST segment
12  | Thal                | Value 3: normal, value 6: fixed defect, value 7: reversible defect
13  | CA                  | Number of major vessels colored by fluoroscopy (value 0-3)
14  | Concept Class       | Angiographic disease status

Fig 1. The 14 attributes listed in the medical data
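
The attribute list in Fig 1 maps naturally onto a record type. The sketch below is one possible representation, assuming comma-separated input with the fields in the order of Fig 1; the class name `HeartRecord`, the field names, the `features` helper, and the sample line are hypothetical. It also assumes that the reduced set of 13 attributes is the 13 predictors without the concept class.

```python
from dataclasses import dataclass, fields


@dataclass
class HeartRecord:
    # Field order follows Fig 1; names are illustrative.
    age: float
    sex: int                  # male=1, female=0
    cp: int                   # chest pain type
    rbp: float                # resting blood pressure
    cholesterol: float        # serum cholesterol in mg/dl
    fasting_blood_sugar: int  # > 120 mg/dl: true=1, false=0
    resting_ecg: int
    thalach: float            # maximum heart rate
    induced_angina: int       # yes=1, no=0
    old_peak: float           # ST depression induced by exercise
    slope: int
    thal: int                 # 3, 6 or 7
    ca: int                   # number of major vessels (0-3)
    concept_class: int        # angiographic disease status

    def features(self):
        """Return the 13 predictor values, leaving out the concept class."""
        return [getattr(self, f.name) for f in fields(self)][:-1]


def parse_record(line):
    """Parse one comma-separated line into a HeartRecord."""
    values = line.strip().split(",")
    if len(values) != len(fields(HeartRecord)):
        raise ValueError(f"expected 14 values, got {len(values)}")
    # Convert through float first so that inputs like "1.0" work for int fields.
    return HeartRecord(*(f.type(float(v)) for f, v in zip(fields(HeartRecord), values)))


# Hypothetical input line in the column order of Fig 1.
record = parse_record("63,1,1,145,233,1,2,150,0,2.3,3,6,0,0")
print(record.features())
```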


IV. ABOUT HADOOP
Hadoop consists of HDFS (Hadoop Distributed File System), HBase, and Hadoop MapReduce,
and is capable of analyzing big data [4-9]. It is an open source framework with which we
can write and run application programs for processing big data.
HDFS is made up of a Master Node and numerous Slave Nodes, as shown in Fig 2. The Master
Node comprises a Name Node, which controls client access to files, and a Job Tracker, which
schedules the submitted jobs. The Master Node also manages the namespace of HDFS [10].
Each Slave Node consists of a Data Node, which manages the storage on that node, and a Task
Tracker, which carries out the tasks assigned by the Job Tracker.

Fig 2. The Components of HDFS


MapReduce [8, 11-15] is a distributed, parallel processing model based on key/value pairs.
Through distributed and parallel processing it scales with data growth, and it minimizes the
congestion caused by data movement among nodes. The Map task generates intermediate
results as key/value pairs from the input data. The intermediate results, grouped by key, are
transferred to a Reduce task. The Reduce task combines all the values for each intermediate
key and transfers the final result to HBase.

Fig 3. The Data Flow of MapReduce
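
The data flow described above (map, group by key, reduce) can be sketched in a few lines. This is a single-process simulation for illustration only, not Hadoop code; the function names `map_task`, `shuffle`, `reduce_task`, and `run_job` and the sample records are hypothetical. The job counts how many records fall into each class.

```python
from collections import defaultdict


def map_task(record):
    """Map: emit one (key, value) pair per record, here (class label, 1)."""
    label = record[-1]
    yield label, 1


def shuffle(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def run_job(records):
    intermediate = [pair for record in records for pair in map_task(record)]
    return dict(reduce_task(key, values) for key, values in shuffle(intermediate).items())


# Hypothetical records: (age, sex, class label).
records = [
    (63, 1, "disease"),
    (41, 0, "healthy"),
    (57, 1, "disease"),
    (35, 0, "healthy"),
    (60, 1, "disease"),
]

print(run_job(records))  # {'disease': 3, 'healthy': 2}
```

On a real cluster the map calls run on the nodes that hold the input blocks, and only the much smaller intermediate pairs travel over the network, which is how data movement among nodes is kept low.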


V. EXPERIMENTS AND RESULTS

The ORANGE tool is used to experiment with three data mining algorithms: SVM,
Naive Bayes, and K-NN.

A. Attributes considered
To analyze the data, we need attributes on which to base the prediction. We have used 13
attributes: age, gender, chest pain, rest SBP, cholesterol, fasting blood sugar, rest ECG,
max HR, ST by exercise, induced angina, slope peak by ST, thal, and number of vessels
colored.
B. Result Observed
We build a model from a dataset with known classes and use another dataset to evaluate the
model. We compare the predicted classes against the actual classes. The confusion matrix
shows the number of correct and incorrect predictions made by the classification model
against the actual classes.
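
As a small illustration of the evaluation step just described, the sketch below builds a confusion matrix from actual and predicted classes and derives the accuracy from it. The function names and the label lists are hypothetical and do not reproduce the paper's results.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Share of predictions that lie on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels: 1 = heart disease, 0 = no heart disease.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0]

matrix = confusion_matrix(actual, predicted, labels=[0, 1])
print(matrix)            # [[3, 1], [1, 3]]
print(accuracy(matrix))  # 0.75
```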
[Confusion matrix for SVM; axes: Prediction, Correct class]

[Confusion matrix for Naive Bayes; axes: Prediction, Correct class]

[Confusion matrix for KNN; axes: Prediction, Correct class]

Calibration graph:

Fig 4. Calibration of all three methods


On comparing these three algorithms, SVM gives the highest classification accuracy,
followed by Naive Bayes and then K-NN.
Proposed Technique:
With the advent of the Hadoop distributed computing platform and the MapReduce
programming model, it has become easier for machine learning algorithms to process data in
parallel. Machine learning algorithms can readily be transformed to the MapReduce paradigm
to use the Hadoop Distributed File System (HDFS) [14]. Naive Bayes and SVM, which have
proved to be efficient algorithms, could be programmed in MapReduce form to carry out the
prediction with higher accuracy. By increasing the number of nodes in the cluster, we can
considerably increase the speed of data processing. A Naive Bayes MapReduce model running
across multiple nodes could save a good amount of time compared with the model running on
a single node, without compromising accuracy.
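
One way to express Naive Bayes in MapReduce form is to let the map step emit counts and the reduce step sum them, since training a categorical Naive Bayes model only requires class counts and per-class attribute-value counts. The sketch below simulates this in a single process under stated assumptions: categorical attributes, add-one smoothing, and partitions processed one after another rather than on separate nodes. All function names and sample records are hypothetical, and this is not necessarily the exact scheme intended by the proposal.

```python
from collections import defaultdict


def map_counts(partition):
    """Map: emit count pairs for each class and each (class, attribute, value)."""
    for *features, label in partition:
        yield ("class", label), 1
        for index, value in enumerate(features):
            yield ("feature", label, index, value), 1


def reduce_counts(pairs):
    """Reduce: sum the counts that share a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return totals


def train(partitions):
    # On a cluster each partition would be mapped on a different node.
    pairs = (pair for part in partitions for pair in map_counts(part))
    return reduce_counts(pairs)


def predict(model, features, labels, n_values=2):
    """Pick the class H that maximizes P(H) * product of P(E_i | H).

    P(E) is the same for every class, so it is left out of the comparison.
    """
    total = sum(model[("class", c)] for c in labels)
    best_label, best_score = None, -1.0
    for c in labels:
        class_count = model[("class", c)]
        score = class_count / total  # prior P(H)
        for index, value in enumerate(features):
            # P(E_i | H) with add-one smoothing over n_values possible values.
            score *= (model[("feature", c, index, value)] + 1) / (class_count + n_values)
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Hypothetical binary attributes (sex, fasting blood sugar, induced angina) plus class.
partitions = [
    [(1, 1, 1, "disease"), (1, 0, 1, "disease"), (0, 0, 0, "healthy")],
    [(0, 1, 0, "healthy"), (1, 1, 1, "disease"), (0, 0, 0, "healthy")],
]
model = train(partitions)
print(predict(model, (1, 1, 1), labels=["disease", "healthy"]))
print(predict(model, (0, 0, 0), labels=["disease", "healthy"]))
```

Because the summed counts are identical no matter how the records are split into partitions, the distributed model matches the single-node model, which is consistent with the claim that speed is gained without compromising accuracy.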
A MapReduce version of the Naive Bayes classifier is extremely efficient when dealing with
large amounts of data. SVM is a powerful method for classification and regression, but its
computing and storage requirements increase with the number of training vectors. This can
be addressed through parallelization: by distributing subsets of the training data across
several participating nodes and processing and optimizing them there, the SVM can be
trained in parallel. The parallel SVM based on the MapReduce algorithm reduces the training
time significantly. By using parallel algorithms efficiently, we can meet the scalability and
performance requirements of large-scale data mining.
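
A minimal sketch of one way to parallelize SVM training over data subsets follows. It assumes a linear SVM trained by sub-gradient descent on the hinge loss in the map step, and simple averaging of the per-partition models in the reduce step. This averaging scheme is an illustrative stand-in chosen for brevity, not necessarily the parallel SVM the paper has in mind; the function names, hyperparameters, and the toy two-feature data are hypothetical.

```python
def train_linear_svm(rows, epochs=200, rate=0.1, lam=0.01):
    """Map step: fit a linear SVM on one partition (labels are +1 or -1)."""
    w = [0.0] * len(rows[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in rows:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wi * (1 - rate * lam) for wi in w]
            if margin < 1:
                # Hinge loss is active: move the separator toward this example.
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y
    return w, b


def average_models(models):
    """Reduce step: average the weights and biases of the partition models."""
    n = len(models)
    dim = len(models[0][0])
    w = [sum(m[0][i] for m in models) / n for i in range(dim)]
    b = sum(m[1] for m in models) / n
    return w, b


def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical standardized features; +1 = disease, -1 = healthy.
partitions = [
    [((2.0, 2.0), 1), ((-2.0, -2.0), -1)],
    [((3.0, 2.0), 1), ((-2.0, -3.0), -1)],
]
# Each call below could run on a different node.
models = [train_linear_svm(part) for part in partitions]
model = average_models(models)
print(predict(model, (2.5, 2.5)), predict(model, (-2.5, -2.5)))
```

Each node only ever sees its own subset of the training vectors, so the per-node computing and storage cost depends on the partition size rather than on the full data set.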
VI. CONCLUSION
The main objective of our work is to study the different data mining techniques that can be
employed in heart disease prediction. Several data mining classifiers have long been tried.
This work describes a few that have emerged in recent years for efficient and effective
heart disease diagnosis. By applying data mining techniques to heart disease prediction, we
can identify the problem and suggest corrective treatment. The paper evaluates the
performance of the classification algorithms. The ORANGE tool is used
to experiment with three data mining algorithms: SVM, Naive Bayes, and K-NN. On
comparing these three algorithms, SVM gives the highest classification accuracy, followed
by Naive Bayes and then K-NN. Most of the information generated by the healthcare industry
consists of semi-structured, unstructured, and structured data. This data meets the 3V
requirement of big data. Therefore, the paper proposes carrying out the above-mentioned
prediction methods in parallel using MapReduce.
REFERENCES
[1] ByungKwan Lee, EunHee Jeong, A Design of a Patient-customized Healthcare
System based on the Hadoop with Text Mining (PHSHT) for an efficient Disease
Management and Prediction, International Journal of Software Engineering and Its
Applications, Vol. 8, No. 8 (2014), pp. 131-150,
http://dx.doi.org/10.14257/ijseia.2014.8.8.13
[2] S. Vijiyarani et al., An Efficient Classification Tree Technique for Heart Disease
Prediction, International Conference on Research Trends in Computer
Technologies (ICRTCT - 2013), Proceedings published in International
Journal of Computer Applications (IJCA) (0975-8887), 2013.
[3] Chaitrali S. Dangare, Sulabha S. Apte, A data mining approach for
prediction of heart disease using neural networks, International Journal of Computer
Engineering and Technology, 2012.
[4] V.V.R. Maheswara Rao, V. Valli Kumari, N. Silpa, An Extensive Study On
Leading Research Paths On Big Data Techniques, International Journal of
Computer Engineering & Technology (IJCET), Volume 6, Issue 12, Dec 2015, pp. 2034,
Article ID: IJCET_06_12_004.
[5] N. Aditya Sundar, P. Pushpa Latha, M. Rama Chandra, Performance analysis of
classification data mining techniques over heart diseases data base, International
Journal of Engineering Science and Advanced Technology, 2012.
[6] Shadab Adam Pattekari and Asma Parveen, Prediction system for heart disease using
naive bayes, International Journal of Advanced Computer and Mathematical
Sciences, 2012.
[7] Hadoop, http://hadoop.apache.org
[8] T. White, Hadoop: The Definitive Guide, O'Reilly, (2009).
[9] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters,
In Proceedings of the 6th Symposium on Operating Systems Design and
Implementation, San Francisco CA, (2004) Dec.
[10] Dhruba Borthakur, The Hadoop Distributed File System: Architecture and Design,
http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf.
[11] Sayantan Sur, Hao Wang, Jian Huang, Xiangyong Ouyang and Dhabaleswar K.
Panda, Can High-Performance Interconnects Benefit Hadoop Distributed File
System?, http://doczine.com/bigdata/2/1371884970_6c99485db9/sur-masvdc10.pdf.
[12] MapReduce, https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
[13] I. Hwang, K. Jung, K. Im, and J. Lee, Improving the Map/Reduce Model through
Data Distribution and Task Progress Scheduling, Journal of the Korea Contents
Association, vol. 10, no. 10, (2010), pp. 78-85.
[14] P K Pedireddla, S A Yadwad, An Effective and Efficient Clustering Based on
K-Means Using MapReduce and TLBO, Proceedings of the Second International
Conference on Computer and Communication Technologies, 2016, pp. 619-628,
Springer India.

[15] H. C. Yang, A. Dasdan, R. L. Hsiao, and D. S. Parker, Map-Reduce-Merge:
Simplified Relational Data Processing on Large Clusters, In Proceedings of the
ACM SIGMOD International Conference on Management of Data, (2007) Jun 11,
pp. 1029-1040.
[16] S. Ghemawat, H. Gobioff, and S. Leung, The Google File System, In
Proceedings of ACM Symposium on Operating Systems Principles, (2003) October
19, pp. 29-43.
[17] H. Zhao, S. Yang, Z. Chen, S. Jin, H. Yin, and L. Li, MapReduce Model-Based
Optimization of Range Queries, In Proceedings of the International Conference on
Fuzzy Systems and Knowledge Discovery (FSKD '12), Sichuan, China, (2012) May
29-31, pp. 2487-2492.
[18] UCI Machine Learning Repository,
http://archive.ics.uci.edu/ml/datasets/Heart+Disease.
