
INTRODUCTION

Medical informatics lies at the intersection of the information sciences and the health care
industry. It deals with managing and using patients' health care information by drawing on
several disciplines, chiefly computer science, information science, and health care itself. It
covers the resources, strategies, and methods required to optimise the acquisition, storage,
retrieval, and efficient use of the available patient data.

Cardiovascular diseases are rising at an alarming rate in India and in other countries. Heart
attack is one of the major killers worldwide for both men and women: about 25% of deaths,
roughly one in four, are caused by heart attacks. The sedentary lifestyle of younger
generations has pushed the incidence of heart disease higher still. It is therefore extremely
important to predict cardiovascular disease accurately and early enough for the doctor to
treat the condition in time.

Medical science is one of the fields in which a huge amount of data is produced in various
ways, and much of that data remains unused. In hospitals this data is usually widely
scattered and unorganised. With a hospital management system, however, it can be
collected into a dataset that can be used in ways not yet thought of. If ‘mined’ effectively,
this data can support better and more accurate diagnosis of patients, which can save a lot of
money. Data mining is a computational process that uses statistical techniques to find
‘patterns’ in huge data sets and to establish relationships by analysing chunks of data. The
main objective here is to use data mining to evaluate past data in order to help predict
future outcomes.

Data mining is the process of extracting interesting patterns and knowledge from huge
amounts of information. The data mining process combines selecting, analysing, preparing,
interpreting, and evaluating the outcomes. Data mining techniques for prediction and
clustering have been applied successfully in numerous clinical studies. Data mining
comprises several specialised approaches, including machine learning, database systems,
and statistics.

The healthcare industry collects immense amounts of healthcare data which are not mined
to discover hidden information for effective decision making. Using distinctive medical
attributes such as sex, blood pressure, age, hypertension, lack of physical activity, and blood
sugar, data mining can estimate the probability of a patient developing coronary illness.

Automated diagnostic systems are quite useful in this process because not every doctor has
knowledge of every kind of disease. Doctors can therefore use an automated diagnostic
system to help diagnose a problem accurately. Such systems typically generate huge
amounts of information in the form of numbers, charts, and images. The WHO has reported
that around ten million deaths worldwide are a consequence of heart disease, which makes
it an extremely dangerous problem. There are numerous types of heart disease, such as
coronary heart disease, cardiomyopathy, and cardiovascular disease.

Cardiovascular disease specifically affects blood circulation in the body and the blood
vessels connected to the heart. The increase in heart disease is due to many factors, such as
high blood pressure, smoking, and family history. Other factors that also cause heart disease
include elevated cholesterol levels, hypertension, and improper diet.

1.1 OBJECTIVE

The main aim of this project is to design a model or prototype that helps predict heart
disease for a given patient. The heart disease prediction system will use the ‘Naive Bayes’
classifier, ‘Decision Trees’, and other data mining techniques. The application will examine
and extract hidden information about the patient's health. It will help in diagnosing the
disease and aid doctors in making informed decisions. The result will be displayed in
tabular form, which makes it easier to interpret.
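
To make the Naive Bayes part of this objective concrete, the following is a minimal sketch
of a categorical Naive Bayes classifier with add-one (Laplace) smoothing. It is an
illustration under stated assumptions, not the project's actual implementation: the class
name NaiveBayesSketch, the two attributes (blood pressure level, smoker) and the six
training rows are all hypothetical and do not come from any real patient dataset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Minimal categorical Naive Bayes with add-one (Laplace) smoothing. */
class NaiveBayesSketch {
    private final Map<String, Integer> classCounts = new HashMap<>();
    // Key "label|featureIndex|value" -> number of training rows with that combination.
    private final Map<String, Integer> valueCounts = new HashMap<>();
    // Distinct values seen for each feature, needed for the smoothing denominator.
    private final List<Set<String>> distinctValues = new ArrayList<>();
    private int total = 0;

    void train(String[] features, String label) {
        classCounts.merge(label, 1, Integer::sum);
        total++;
        for (int i = 0; i < features.length; i++) {
            if (distinctValues.size() <= i) {
                distinctValues.add(new HashSet<>());
            }
            distinctValues.get(i).add(features[i]);
            valueCounts.merge(label + "|" + i + "|" + features[i], 1, Integer::sum);
        }
    }

    /** Returns the normalised posterior P(label | features) for every trained label. */
    Map<String, Double> posterior(String[] features) {
        Map<String, Double> scores = new HashMap<>();
        double sum = 0.0;
        for (Map.Entry<String, Integer> entry : classCounts.entrySet()) {
            String label = entry.getKey();
            int n = entry.getValue();
            double score = (double) n / total; // prior P(label)
            for (int i = 0; i < features.length; i++) {
                int count = valueCounts.getOrDefault(label + "|" + i + "|" + features[i], 0);
                int k = distinctValues.get(i).size();
                score *= (count + 1.0) / (n + k); // smoothed P(value | label)
            }
            scores.put(label, score);
            sum += score;
        }
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            entry.setValue(entry.getValue() / sum);
        }
        return scores;
    }

    public static void main(String[] args) {
        NaiveBayesSketch model = new NaiveBayesSketch();
        // Hypothetical rows {bloodPressure, smoker} -> label; not real patient data.
        model.train(new String[] {"high", "yes"}, "disease");
        model.train(new String[] {"high", "yes"}, "disease");
        model.train(new String[] {"high", "no"}, "disease");
        model.train(new String[] {"normal", "no"}, "healthy");
        model.train(new String[] {"normal", "no"}, "healthy");
        model.train(new String[] {"normal", "yes"}, "healthy");
        System.out.println(model.posterior(new String[] {"high", "yes"}));
    }
}
```

The sketch assumes every row has the same number of attributes and that all values are
categorical; continuous measurements such as age would first have to be binned.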

1.2 SCOPE

The scope of this project is to integrate a medical decision support system with
computer-based patient records, which would reduce errors during diagnosis and support
effective decisions and outcomes. This is possible with different data mining techniques, as
they can create an environment rich in knowledge and hidden information that may improve
the quality of the diagnosis as well as of the decisions the doctor has to make. The
fundamental purpose of this study is to build a Heart Disease Prediction System (HDPS) by
incorporating data mining techniques including ‘Decision Trees’, ‘Naive Bayes’, and
‘Neural Networks’. Such a system can support successful treatment, help bring down
treatment costs substantially, and improve the visualisation and ease of interpretation of
results.

REQUIREMENT ANALYSIS

2.1 JAVA
Java is a general-purpose computer-programming language that is concurrent, class-
based, object-oriented, and specifically designed to have as few implementation depen-
dencies as possible. It is intended to let application developers "write once, run any-
where" (WORA) meaning that compiled Java code can run on all platforms that support
Java without the need for recompilation. Java applications are typically compiled to
bytecode that can run on any Java virtual machine (JVM) regardless of computer archi-
tecture. As of 2016, Java is one of the most popular programming languages in use par-
ticularly for client-server web applications, with a reported 9 million developers. Java
was originally developed by James Gosling at Sun Microsystems (which has since been
acquired by Oracle Corporation) and released in 1995 as a core component of Sun Mi-
crosystems' Java platform. The language derives much of its syntax from C and C++, but
it has fewer low-level facilities than either of them.

Figure 1.1: Java

Principles

There were five primary goals in the creation of the Java language:
• It must be "simple, object-oriented, and familiar"

• It must be "robust and secure"

• It must be "architecture-neutral and portable"

• It must execute with "high performance"

• It must be "interpreted, threaded, and dynamic"

Figure 1.2: Java SE Lifecycle

Java JVM, Swings and Bytecode

One design goal of Java is portability, which means that programs written for the Java
platform must run similarly on any combination of hardware and operating system with
adequate runtime support. This is achieved by compiling the Java language code to an
intermediate representation called Java bytecode, instead of directly to architecture-spe-
cific machine code. Java bytecode instructions are analogous to machine code, but they
are intended to be executed by a virtual machine (VM) written specifically for the host
hardware. End users commonly use a Java Runtime Environment (JRE) installed on
their own machine for standalone Java applications, or in a web browser for Java ap-
plets. Standard libraries provide a generic way to access host-specific features such as
graphics, threading, and networking.
The use of universal bytecode makes porting simple. However, the overhead of inter-
preting bytecode into machine instructions made interpreted programs almost always run
more slowly than native executables. Just-in-time (JIT) compilers that compile byte-
codes to machine code during runtime were introduced from an early stage. Java itself is
platform-independent and is adapted to the particular platform it is to run on by a Java
virtual machine for it, which translates the Java bytecode into the platform's machine
language.
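
As a small illustration of this portability, the sketch below can be compiled once with
javac; the resulting class file contains bytecode that any compatible JVM can run, and
javap -c can be used to inspect that bytecode. The class name PortabilityDemo and its
method are hypothetical names chosen for this example; only the standard system properties
java.vm.name, os.name and os.arch are assumed.

```java
class PortabilityDemo {
    /** Describes the JVM and host platform the bytecode is currently running on. */
    static String describeRuntime() {
        return System.getProperty("java.vm.name")
                + " on " + System.getProperty("os.name")
                + " (" + System.getProperty("os.arch") + ")";
    }

    public static void main(String[] args) {
        // The same compiled class prints a different line on each host platform.
        System.out.println(describeRuntime());
    }
}
```

Running the unchanged class file on two different machines should print two different
descriptions, which is the behaviour that "write once, run anywhere" refers to.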

Figure 1.3: Java Swings

Figure 1.4: Java Swings

Performance
Programs written in Java have a reputation for being slower and requiring more memory
than those written in C++. However, Java programs' execution speed improved signifi-
cantly with the introduction of just-in-time compilation in 1997/1998 for Java 1.1, the
addition of language features supporting better code analysis (such as inner classes, the
StringBuilder class, optional assertions, etc.), and optimisations in the Java virtual ma-
chine, such as HotSpot becoming the default for Sun's JVM in 2000. With Java 1.5, the
performance was improved with the addition of the java.util.concurrent package, includ-
ing lock free implementations of the ConcurrentMaps and other multi-core collections,
and it was improved further with Java 1.6.

Non-JVM
Some platforms offer direct hardware support for Java; there are micro-controllers that
can run Java bytecode in hardware instead of a software Java virtual machine, and some
ARM based processors could have hardware support for executing Java bytecode
through their Jazelle option, though support has mostly been dropped in current
implementations of ARM.

Automatic memory management
Java uses an automatic garbage collector to manage memory in the object lifecycle. The
programmer determines when objects are created, and the Java runtime is responsible for
recovering the memory once objects are no longer in use. Once no references to an ob-
ject remain, the unreachable memory becomes eligible to be freed automatically by the
garbage collector. Something similar to a memory leak may still occur if a programmer's
code holds a reference to an object that is no longer needed, typically when objects that
are no longer needed are stored in containers that are still in use. If methods for a nonex-
istent object are called, a "null pointer exception" is thrown.

One of the ideas behind Java's automatic memory management model is that programmers
can be spared the burden of having to perform manual memory management. In
some languages, memory for the creation of objects is implicitly allocated on the stack
or explicitly allocated and deallocated from the heap. In the latter case, the responsibility
of managing memory resides with the programmer. If the program does not deallocate
an object, a memory leak occurs. If the program attempts to access or deallocate memo-
ry that has already been deallocated, the result is undefined and difficult to predict, and
the program is likely to become unstable or crash. This can be partially remedied by the
use of smart pointers, but these add overhead and complexity. Note that garbage collec-
tion does not prevent "logical" memory leaks, i.e., those where the memory is still refer-
enced but never used.

Garbage collection may happen at any time. Ideally, it will occur when a program is idle.
It is guaranteed to be triggered if there is insufficient free memory on the heap to allo-
cate a new object; this can cause a program to stall momentarily. Explicit memory man-
agement is not possible in Java.
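
The two failure modes described in this section can be sketched as follows. This is an
illustrative example only: the class MemoryDemo, the CACHE list and both method names are
hypothetical. It shows a long-lived container keeping objects reachable (a "logical" leak
that the garbage collector cannot fix) and a method call on a null reference raising a
NullPointerException.

```java
import java.util.ArrayList;
import java.util.List;

class MemoryDemo {
    // A long-lived container: everything added here stays reachable,
    // so the garbage collector is never allowed to reclaim it.
    static final List<byte[]> CACHE = new ArrayList<>();

    /** Adds blocks that are never used again but remain referenced by CACHE. */
    static void leakLike(int blocks) {
        for (int i = 0; i < blocks; i++) {
            CACHE.add(new byte[1024]);
        }
    }

    /** Calls a method on a null reference and reports whether an NPE was thrown. */
    static boolean throwsNullPointer() {
        String missing = null;
        try {
            missing.length();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        leakLike(3);
        System.out.println("Blocks still reachable: " + CACHE.size());
        CACHE.clear(); // dropping the references makes the blocks eligible for collection
        System.out.println("NullPointerException thrown: " + throwsNullPointer());
    }
}
```

Clearing the container only makes the blocks eligible for collection; when the collector
actually reclaims them is left to the runtime.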

2.2 DATA MINING

Figure 2.1: Processes in Data Mining

Data mining is the process of discovering patterns in large data sets involving methods at
the intersection of machine learning, statistics, and database systems. Data mining is an
interdisciplinary subfield of computer science with an overall goal to extract information
(with intelligent methods) from a data set and transform the information into a compre-
hensible structure for further use. Data mining is the analysis step of the "knowledge
discovery in databases" process, or KDD. Aside from the raw analysis step, it also in-
volves database and data management aspects, data pre-processing, model and inference
considerations, interestingness metrics, complexity considerations, post-processing of
discovered structures, visualization, and online updating.

The term "data mining" is in fact a misnomer, because the goal is the extraction of patterns
and knowledge from large amounts of data, not the extraction (mining) of data itself. It is
also a buzzword and is frequently applied to any form of large-scale data or information
processing (collection, extraction, warehousing, analysis, and statistics) as well as any
application of a computer decision support system, including artificial intelligence (e.g.,
machine learning) and business intelligence. The book Data mining: Practical machine
learning tools and techniques with Java (which covers mostly machine learning material)
was originally to be named just Practical machine learning, and the term data mining was
only added for marketing reasons. Often the more general terms (large scale) data analysis
and analytics, or, when referring to actual methods, artificial intelligence and machine
learning, are more appropriate.

The related terms data dredging, data fishing, and data snooping refer to the use of data
mining methods to sample parts of a larger population data set that are (or may be) too
small for reliable statistical inferences to be made about the validity of any patterns dis-
covered. These methods can, however, be used in creating new hypotheses to test against
the larger data populations.

In data mining, association rules are created by analyzing data for frequent if/then
patterns, then using the support and confidence criteria to locate the most important
relationships within the data. Support is how frequently the items appear in the database,
while confidence is the proportion of cases containing the "if" part in which the "then"
part also holds.
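
Support and confidence can be computed directly from a list of transactions. The sketch
below is a minimal illustration; the class AssociationRuleSketch and the grocery items in
the usage note are hypothetical and not taken from the project's data.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssociationRuleSketch {
    /** Fraction of transactions that contain every item in the given item set. */
    static double support(List<Set<String>> transactions, Set<String> items) {
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    /** Confidence of the rule antecedent -> consequent: support(both) / support(antecedent). */
    static double confidence(List<Set<String>> transactions,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        return support(transactions, both) / support(transactions, antecedent);
    }
}
```

For four transactions {bread, milk}, {bread, butter}, {bread, milk, butter} and {milk}, the
rule "if bread then milk" has support 2/4 and confidence 2/3. The sketch does not guard
against an antecedent that never occurs, in which case the division yields NaN.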

Other data mining parameters include Sequence or Path Analysis, Classification, Clus-
tering and Forecasting. Sequence or Path Analysis parameters look for patterns where
one event leads to another later event. A Sequence is an ordered list of sets of items, and
it is a common type of data structure found in many databases.

Classification parameters look for new patterns and might result in a change in the way
the data is organized. Classification algorithms predict variables based on other factors
within the database. Clustering parameters find and visually document groups of facts that
were previously unknown. Clustering groups a set of objects and aggregates them based on
how similar they are to each other. There are different ways a user can implement the
clustering, which differentiate the clustering models. Forecasting parameters within data
mining can discover patterns in data that can lead to reasonable predictions about the
future, also known as predictive analysis.

The knowledge discovery in databases (KDD) process is commonly defined with the
following stages:

• Selection

• Pre-processing

• Transformation

• Data mining

• Interpretation/evaluation

It exists, however, in many variations on this theme, such as the Cross Industry Standard
Process for Data Mining (CRISP-DM) which defines six phases:

• Business understanding

• Data understanding

• Data preparation

• Modeling

• Evaluation

• Deployment

Figure 2.2: Phases in Data Mining

The process can also be simplified to (1) pre-processing, (2) data mining, and (3) results
validation.

Polls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is
the leading methodology used by data miners. The only other data mining standard
named in these polls was SEMMA. However, 3–4 times as many people reported using
CRISP-DM. Several teams of researchers have published reviews of data mining process
models, and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA
in 2008.

Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data
mining can only uncover patterns actually present in the data, the target data set must be
large enough to contain these patterns while remaining concise enough to be mined
within an acceptable time limit. A common source for data is a data mart or data ware-
house. Pre-processing is essential to analyze the multivariate data sets before data min-
ing. The target set is then cleaned. Data cleaning removes the observations containing
noise and those with missing data.

Data mining involves six common classes of tasks:

• Anomaly detection (outlier/change/deviation detection): the identification of unusual
data records that might be interesting, or data errors that require further investigation.

• Association rule learning (dependency modelling): searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.

• Clustering: the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.

• Classification: the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".

• Regression: attempts to find a function that models the data with the least error, that
is, for estimating the relationships among data or datasets.

• Summarization: providing a more compact representation of the data set, including
visualization and report generation.
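
As an illustration of the regression task in the list above, the following sketch fits a
straight line to paired observations by ordinary least squares, i.e. it finds the intercept
and slope with the least squared error. The class name LinearRegressionSketch is
hypothetical, and a single input variable is assumed for brevity.

```java
class LinearRegressionSketch {
    /** Returns {intercept, slope} minimising the squared error of y = intercept + slope * x. */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[] {meanY - slope * meanX, slope};
    }
}
```

For the points (1, 3), (2, 5) and (3, 7) the fit returns intercept 1 and slope 2. The
sketch assumes at least two distinct x values; otherwise the slope is undefined.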

The concept of bagging (voting for classification, averaging for regression-type problems
with continuous dependent variables of interest) applies to the area of predictive data
mining, to combine the predicted classifications (prediction) from multiple models, or from
the same type of model for different learning data. It is also used to address the inherent
instability of results when applying complex models to relatively small data sets. Suppose
your data mining task is to build a model for predictive classification, and the dataset from
which to train the model (learning data set, which contains observed classifications) is
relatively small. You could repeatedly sub-sample (with replacement) from the dataset, and
apply, for example, a tree classifier (e.g., C&RT and CHAID) to the successive samples. In
practice, very different trees will often be grown for the different samples, illustrating the
instability of models often evident with small data sets. One method of deriving a single
prediction (for new observations) is to use all trees found in the different samples, and to
apply some simple voting: The final classification is the one most often predicted by the
different trees. Note that some weighted combination of predictions (weighted vote,
weighted average) is also possible, and commonly used. A sophisticated (machine learning)
algorithm for generating weights for weighted prediction or voting is the Boosting
procedure.
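
The two ingredients of bagging described above, sub-sampling with replacement and simple
voting, can be sketched as follows. The class BaggingSketch and its method names are
hypothetical; the base models are passed in as plain functions, so training a tree
classifier on each sample is deliberately left out of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

class BaggingSketch {
    /** Draws a bootstrap sample: same size as the data, sampled with replacement. */
    static <T> List<T> bootstrap(List<T> data, Random rng) {
        List<T> sample = new ArrayList<>(data.size());
        for (int i = 0; i < data.size(); i++) {
            sample.add(data.get(rng.nextInt(data.size())));
        }
        return sample;
    }

    /** Simple voting: returns the label predicted by the largest number of models. */
    static <X> String vote(List<Function<X, String>> models, X input) {
        Map<String, Integer> counts = new HashMap<>();
        for (Function<X, String> model : models) {
            counts.merge(model.apply(input), 1, Integer::sum);
        }
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

In a full pipeline, one model would be trained on each bootstrap sample and the trained
models would then be passed to vote. Ties are broken arbitrarily in this sketch.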

Boosting

The concept of boosting applies to the area of predictive data mining, to generate multi-
ple models or classifiers (for prediction or classification), and to derive weights to com-
bine the predictions from those models into a single prediction or predicted classification
(see also Bagging).

Boosting will generate a sequence of classifiers, where each consecutive classifier in the
sequence is an "expert" in classifying observations that were not well classified by those
preceding it. During deployment (for prediction or classification of new cases), the pre-
dictions from the different classifiers can then be combined (e.g., via voting, or some
weighted voting procedure) to derive a single best prediction or classification.

Note that boosting can also be applied to learning methods that do not explicitly support
weights or misclassification costs. In that case, random sub-sampling can be applied to the
learning data in the successive steps of the iterative boosting procedure, where the
probability for selection of an observation into the subsample is inversely proportional to
the accuracy of the prediction for that observation in the previous iteration (in the
sequence of iterations of the boosting procedure).
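
The re-sampling step described in the last paragraph can be sketched as below. This shows
only how selection probabilities inversely proportional to the previous accuracy might be
computed and used; it is not a complete boosting algorithm. The class
BoostingResampleSketch, the per-observation accuracy scores in the range (0, 1] and the
small floor of 1e-6 that avoids division by zero are all assumptions of this example.

```java
import java.util.Random;

class BoostingResampleSketch {
    /** Selection probabilities proportional to 1 / accuracy, normalised to sum to 1. */
    static double[] selectionProbabilities(double[] accuracy) {
        double[] p = new double[accuracy.length];
        double sum = 0.0;
        for (int i = 0; i < accuracy.length; i++) {
            p[i] = 1.0 / Math.max(accuracy[i], 1e-6); // floor is an assumption of this sketch
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    /** Draws observation indices at random according to the given probabilities. */
    static int[] sampleIndices(double[] probabilities, int size, Random rng) {
        int[] picked = new int[size];
        for (int s = 0; s < size; s++) {
            double r = rng.nextDouble();
            double cumulative = 0.0;
            int chosen = probabilities.length - 1; // fallback guards against rounding error
            for (int i = 0; i < probabilities.length; i++) {
                cumulative += probabilities[i];
                if (r < cumulative) {
                    chosen = i;
                    break;
                }
            }
            picked[s] = chosen;
        }
        return picked;
    }
}
```

With accuracies 0.5 and 0.25, the second observation is twice as likely to be selected as
the first (probabilities 1/3 and 2/3), so the next classifier in the sequence sees the
poorly predicted cases more often.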

Data preparation and cleaning is an often neglected but extremely important step in the
data mining process. The old saying "garbage-in-garbage-out" is particularly applicable to
typical data mining projects where large data sets collected via some automatic methods
(e.g., via the Web) serve as the input into the analyses. Often, the method by which the data
were gathered was not tightly controlled, and so the data may contain out-of-range values
(e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), and
the like. Analyzing data that has not been carefully screened for such problems can produce
highly misleading results, in particular in predictive data mining.
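
A screening step of this kind might look like the sketch below, which drops records with
missing values, out-of-range values, or impossible combinations, using the two examples
from the text. The class CleaningSketch, the PatientRecord layout and the specific rules
are hypothetical; a real project would derive its rules from the actual dataset.

```java
import java.util.ArrayList;
import java.util.List;

class CleaningSketch {
    /** Hypothetical record layout; a null field marks a missing value. */
    static final class PatientRecord {
        final Integer income;
        final String gender;
        final boolean pregnant;

        PatientRecord(Integer income, String gender, boolean pregnant) {
            this.income = income;
            this.gender = gender;
            this.pregnant = pregnant;
        }
    }

    static boolean isValid(PatientRecord r) {
        if (r.income == null || r.gender == null) {
            return false; // missing data
        }
        if (r.income < 0) {
            return false; // out-of-range value, e.g. Income: -100
        }
        if ("male".equalsIgnoreCase(r.gender) && r.pregnant) {
            return false; // impossible combination
        }
        return true;
    }

    /** Returns only the records that pass every screening rule. */
    static List<PatientRecord> clean(List<PatientRecord> records) {
        List<PatientRecord> kept = new ArrayList<>();
        for (PatientRecord r : records) {
            if (isValid(r)) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```

Dropping records is the simplest policy; depending on how much data is lost, imputing
missing values may be preferable, which this sketch does not attempt.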

Data Reduction (for Data Mining)

The term Data Reduction in the context of data mining is usually applied to projects
where the goal is to aggregate or amalgamate the information contained in large datasets
into manageable (smaller) information nuggets. Data reduction methods can include
simple tabulation, aggregation (computing descriptive statistics) or more sophisticated
techniques like clustering, principal components analysis, etc.

Deployment

The concept of deployment in predictive data mining refers to the application of a model
for prediction or classification to new data. After a satisfactory model or set of models
has been identified (trained) for a particular application, we usually want to deploy those
models so that predictions or predicted classifications can quickly be obtained for new
data. For example, a credit card company may want to deploy a trained model or set of
models (e.g., neural networks, meta-learner) to quickly identify transactions which have
a high probability of being fraudulent.

Machine Learning

Machine learning, computational learning theory, and similar terms are often used in the
context of Data Mining, to denote the application of generic model-fitting or classification
algorithms for predictive data mining. Unlike traditional statistical data analysis, which is
usually concerned with the estimation of population parameters by statistical inference, the
emphasis in data mining (and machine learning) is usually on the accuracy of prediction
(predicted classification), regardless of whether or not the "models" or techniques that are
used to generate the prediction are interpretable or open to simple explanation. Good
examples of this type of technique often applied to predictive data mining are neural
networks or meta-learning techniques such as boosting, etc. These methods usually involve
the fitting of very complex "generic" models that are not related to any reasoning or
theoretical understanding of underlying causal processes; instead, these techniques can be
shown to generate accurate predictions or classification in cross-validation samples.

The term Predictive Data Mining is usually applied to identify data mining projects with
the goal to identify a statistical or neural network model or set of models that can be
used to predict some response of interest. For example, a credit card company may want
to engage in predictive data mining, to derive a (trained) model or set of models (e.g.,
neural networks, meta-learner) that can quickly identify transactions which have a high
probability of being fraudulent. Other types of data mining projects may be more
exploratory in nature (e.g., to identify clusters or segments of customers), in which case
drill-down descriptive and exploratory methods would be applied. Data reduction is an-
other possible objective for data mining (e.g., to aggregate or amalgamate the informa-
tion in very large data sets into useful and manageable chunks).


Stacking (Stacked Generalization)

The concept of stacking (Stacked Generalization) applies to the area of predictive data
mining, to combine the predictions from multiple models. It is particularly useful when
the types of models included in the project are very different.

Suppose your data mining project includes tree classifiers, such as C&RT or CHAID,
linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes pre-
dicted classifications for a crossvalidation sample, from which overall goodness-of-fit
statistics (e.g., misclassification rates) can be computed. Experience has shown that
combining the predictions from multiple methods often yields more accurate predictions
than can be derived from any one method (e.g., see Witten and Frank, 2000). In stacking,
the predictions from different classifiers are used as input into a meta-learner, which at-
tempts to combine the predictions to create a final best predicted classification. So, for
example, the predicted classifications from the tree classifiers, linear model, and the
neural network classifier(s) can be used as input variables into a neural network meta-
classifier, which will attempt to "learn" from the data how to combine the predictions
from the different models to yield maximum classification accuracy.

Other methods for combining the prediction from multiple models or methods (e.g.,
from multiple datasets used for learning) are Boosting and Bagging (Voting).

MEDICAL SCIENCE

3.1 CARDIOVASCULAR DISEASES

Cardiovascular disease (CVD) is a class of diseases that involve the heart or blood ves-
sels. Cardiovascular disease includes coronary artery diseases (CAD) such as angina and
myocardial infarction (commonly known as a heart attack). Other CVDs include stroke,
heart failure, hypertensive heart disease, rheumatic heart disease, cardiomyopathy, heart
arrhythmia, congenital heart disease, valvular heart disease, carditis, aortic aneurysms,
peripheral artery disease, thromboembolic disease, and venous thrombosis. The underlying
mechanisms vary depending on the disease. Coronary artery disease, stroke, and peripheral
artery disease involve atherosclerosis. This may be caused by high blood pressure,
smoking, diabetes, lack of exercise, obesity, high blood cholesterol, poor diet, and
excessive alcohol consumption, among others. High blood pressure results in 13% of
CVD deaths, while tobacco results in 9%, diabetes 6%, lack of exercise 6% and obesity
5%. Rheumatic heart disease may follow untreated strep throat.
It is estimated that 90% of CVD is preventable. Prevention of atherosclerosis involves
improving risk factors through: healthy eating, exercise, avoidance of tobacco smoke
and limiting alcohol intake. Treating risk factors, such as high blood pressure, blood
lipids and diabetes is also beneficial. Treating people who have strep throat with
antibiotics can decrease the risk of rheumatic heart disease. The use of aspirin in people, who
are otherwise healthy, is of unclear benefit.
Cardiovascular diseases are the leading cause of death globally. This is true in all areas
of the world except Africa. Together they resulted in 17.9 million deaths (32.1%) in
2015, up from 12.3 million (25.8%) in 1990. Deaths, at a given age, from CVD are more
common and have been increasing in much of the developing world, while rates have
declined in most of the developed world since the 1970s. Coronary artery disease and
stroke account for 80% of CVD deaths in males and 75% of CVD deaths in females.
Most cardiovascular disease affects older adults. In the United States 11% of people be-
tween 20 and 40 have CVD, while 37% between 40 and 60, 71% of people between 60
and 80, and 85% of people over 80 have CVD. The average age of death from coronary
artery disease in the developed world is around 80 while it is around 68 in the develop-
ing world. Disease onset is typically seven to ten years earlier in men as compared to
women.

Types
There are many types of heart disease that affect different parts of the organ and occur in
different ways.

Congenital heart disease

This is a general term for some deformities of the heart that have been present since
birth. Examples include:
Septal defects: There is a hole between the two chambers of the heart.
Obstruction defects: The flow of blood through various chambers of the heart is partially
or totally blocked.
Cyanotic heart disease: A defect in the heart causes a shortage of oxygen around the
body.

Figure 3.1: Effects of Cardiovascular Diseases

Existing cardiovascular disease or a previous cardiovascular event, such as a heart attack
or stroke, is the strongest predictor of a future cardiovascular event.[52] Age, sex,
smoking, blood pressure, blood lipids and diabetes are important predictors of future
cardiovascular disease in people who are not known to have cardiovascular disease. These
measures, and sometimes others, may be combined into composite risk scores to estimate
an individual's future risk of cardiovascular disease. Numerous risk scores exist although
their respective merits are debated. Other diagnostic tests and biomarkers remain under
evaluation but currently these lack clear-cut evidence to support their routine use. They
include family history, coronary artery calcification score, high sensitivity C-reactive
protein (hs-CRP), ankle–brachial pressure index, lipoprotein subclasses and particle
concentration, lipoprotein(a), apolipoproteins A-I and B, fibrinogen, white blood cell
count, homocysteine, N-terminal pro B-type natriuretic peptide (NT-proBNP), and
markers of kidney function. High blood phosphorus is also linked to an increased risk.

Figure 3.2: Statistics of Cardiovascular Diseases

Causes
Heart disease is caused by damage to all or part of the heart, damage to the coronary ar-
teries, or a poor supply of nutrients and oxygen to the organ.

Some types of heart disease, such as hypertrophic cardiomyopathy, are genetic. These,
alongside congenital heart defects, can occur before a person is born.

There are a number of lifestyle choices that can increase the risk of heart disease. These
include:

• high blood pressure and cholesterol

• smoking

• overweight and obesity

• diabetes

• family history

• a diet of junk food

• age

• a history of preeclampsia during pregnancy

• staying in a stationary position for extended periods of time, such as sitting at work

Having any of these risk factors greatly increases the risk of heart disease. Some, such as
age, are unavoidable. For example, once a woman reaches 55 years of age, heart disease
becomes more likely.

Treatment

There are two main lines of treatment for heart disease. Initially, a person can attempt to
treat the heart condition using medications. If these do not have the desired effect, surgi-
cal options are available to help correct the issue.

Medication

A very wide range of medication is available for the majority of heart conditions. Many
are prescribed to prevent blood clots, but some serve other purposes.

The main medications in use are:

• statins, for lowering cholesterol

• aspirin, clopidogrel, and warfarin, for preventing blood clots

• beta-blockers, for treating heart attack, heart failure, and high blood
pressure

• angiotensin-converting enzyme (ACE) inhibitors, for heart failure and
high blood pressure

The doctor works with the patient to find a medication that is safe and effective.
Medications are also used to treat underlying conditions that can affect the heart, such as
diabetes, before they become problematic.

Surgery

Heart surgery is an intensive option from which it can take a long time to recover.

However, surgery can be effective in treating blockages and heart problems for which
medications may not be effective, especially in the advanced stages of heart disease.

The most common surgeries include:

• angioplasty, in which a balloon catheter is inserted to widen narrowed
blood vessels that might be restricting blood flow to the heart

• coronary artery bypass surgery, which allows blood flow to reach a blocked
part of the heart in people with blocked arteries

• surgery to repair or replace faulty heart valves

• pacemakers, or electronic machines that regulate a heartbeat for people
with arrhythmia

Heart transplants are another option. However, it is often difficult to find a suitable heart
of the right size and blood type in the required time. People are put on a waiting list for
donor organs and can sometimes wait years.

Prevention

Some types of heart disease, such as those that are present from birth, cannot be prevent-
ed.

Other types, however, can be prevented by taking the following measures:

• Eat a balanced diet. Stick to low-fat, high-fiber foods and be sure to
consume five portions of fresh fruit and vegetables each day. Increase
your intake of whole grains and reduce the amount of salt and sugar in
the diet. Make sure the fats in the diet are mostly unsaturated.

• Exercise regularly. This will strengthen the heart and circulatory system,
reduce cholesterol, and maintain blood pressure.

• Maintain a healthy body weight for your height.

• If you smoke, quit. Smoking is a major risk factor for heart and cardio-
vascular conditions.

• Reduce the intake of alcohol. Do not drink more than 14 units per week.

• Control conditions that affect heart health as a complication, such as
high blood pressure or diabetes.

While these steps do not completely eliminate the risk of heart disease, they can help
improve overall health and greatly reduce the chances of heart complications.

3.2 Statistics

Exercise is one easy way to keep heart disease at bay.

Heart disease is the most common cause of death for both sexes. Here are some statistics
demonstrating the scale of heart disease in the U.S.

• Heart disease causes the deaths of around 630,000 people in the U.S. each
year.

• In the U.S., a person has a heart attack every 40 seconds, and at least one
person dies per minute from an event related to heart problems.

• The health burden placed by heart disease on the U.S. economy is around
$200 billion.

• The most common type of heart disease is coronary heart disease.

• Mississippi is the state with the highest rate of death from heart disease at
233.1 deaths per 100,000 members of the population. The state is closely
followed by Oklahoma, Arkansas, Alabama, and Louisiana. Minnesota,
Hawaii, and Colorado have the lowest rates.

3.3 Age standardised Statistics

3.2 NEURAL NETWORKS

A neural network is a series of algorithms that endeavors to recognize underlying
relationships in a set of data through a process that mimics the way the human brain
operates. Neural networks can adapt to changing input, so the network generates the best
possible result without needing to redesign the output criteria. The concept of neural
networks is swiftly gaining popularity in the area of trading system development.

In the world of finance, neural networks assist in the development of processes such as
time-series forecasting, algorithmic trading, securities classification, credit risk modeling
and the construction of proprietary indicators and price derivatives. A neural network works
similarly to the human brain’s neural network. A “neuron” in a neural network is a
mathematical function that collects and classifies information according to a specific
architecture. The network bears a strong resemblance to statistical methods such as curve
fitting and regression analysis.

A neural network contains layers of interconnected nodes. Each node is a perceptron and
is similar to a multiple linear regression. The perceptron feeds the signal produced by a
multiple linear regression into an activation function that may be nonlinear.
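
As a rough illustration of the node described above, the following Java sketch computes a
weighted sum of the inputs plus a bias (the linear regression part) and passes it through an
activation function. The class name PerceptronSketch, the choice of a sigmoid activation and
the example weights are assumptions made for this illustration; they are not taken from the
project's code.

```java
import java.util.function.DoubleUnaryOperator;

public class PerceptronSketch {

    /** Weighted sum of the inputs plus a bias: the linear part of the node. */
    public static double linearCombination(double[] weights, double bias, double[] inputs) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * inputs[i];
        }
        return sum;
    }

    /** Feeds the linear signal into an activation function, which may be nonlinear. */
    public static double activate(double[] weights, double bias, double[] inputs,
                                  DoubleUnaryOperator activation) {
        return activation.applyAsDouble(linearCombination(weights, bias, inputs));
    }

    /** Logistic sigmoid, one commonly used nonlinear activation (an assumed choice here). */
    public static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        double[] weights = {0.5, -0.25};   // illustrative values only
        double[] inputs = {2.0, 4.0};
        // 0.5 * 2.0 + (-0.25) * 4.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
        System.out.println(activate(weights, 0.0, inputs, PerceptronSketch::sigmoid));
    }
}
```

Swapping the sigmoid for the identity function would reduce the node to a plain linear
regression, which is the resemblance the paragraph above points to.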

In a multi-layered perceptron (MLP), perceptrons are arranged in interconnected layers.
The input layer collects input patterns. The output layer has classifications or output sig-
nals to which input patterns may map. For instance, the patterns may comprise a list of
quantities for technical indicators about a security; potential outputs could be “buy,”
“hold” or “sell.” Hidden layers fine-tune the input weightings until the neural network’s
margin of error is minimal. It is hypothesized that hidden layers extrapolate salient fea-
tures in the input data that have predictive power regarding the outputs. This describes
feature extraction, which accomplishes a utility similar to statistical techniques such as
principal component analysis.

Application of Neural Networks


Neural networks are broadly used, with applications for financial operations, enterprise
planning, trading, business analytics and product maintenance. Neural networks have
also gained widespread adoption in business applications such as forecasting and mar-
keting research solutions, fraud detection and risk assessment.

A neural network evaluates price data and unearths opportunities for making trade
decisions based on the data analysis. The networks can distinguish subtle nonlinear
interdependencies and patterns that other methods of technical analysis cannot. However, a
10 percent improvement in efficiency is all an investor can ask for from a neural network.
There will always be data sets and task classes that are better analyzed by using previously
developed algorithms. It is not so much the algorithm that matters; it is the well-prepared
input data on the targeted indicator that ultimately determines the level of success of a
neural network.

An Artificial Neural Network, often just called a neural network, is a mathematical model
inspired by biological neural networks. A neural network consists of an interconnected
group of artificial neurons, and it processes information using a connectionist approach
to computation. In most cases a neural network is an adaptive system that changes its
structure during a learning phase. Neural networks are used to model complex
relationships between inputs and outputs or to find patterns in data.

The inspiration for neural networks came from examination of central nervous systems.
In an artificial neural network, simple artificial nodes, called “neurons”, “neurodes”,
“processing elements” or “units”, are connected together to form a network which mim-
ics a biological neural network.

There is no single formal definition of what an artificial neural network is. Generally, it
involves a network of simple processing elements that exhibit complex global behavior
determined by the connections between the processing elements and element parameters.
Artificial neural networks are used with algorithms designed to alter the strength of the
connections in the network to produce a desired signal flow.

Neural networks are also similar to biological neural networks in that functions are per-
formed collectively and in parallel by the units, rather than there being a clear delin-
eation of subtasks to which various units are assigned. The term “neural network” usual-
ly refers to models employed in statistics, cognitive psychology and artificial intelli-
gence. Neural network models which emulate the central nervous system are part of the-
oretical neuroscience and computational neuroscience.

In modern software implementations of artificial neural networks, the approach inspired
by biology has been largely abandoned for a more practical approach based on statistics
and signal processing. In some of these systems, neural networks or parts of neural net-
works (such as artificial neurons) are used as components in larger systems that combine
both adaptive and non-adaptive elements. While the more general approach of such
adaptive systems is more suitable for real-world problem solving, it has far less to do
with the traditional artificial intelligence connectionist models. What they do have in
common, however, is the principle of non-linear, distributed, parallel and local processing
and adaptation. Historically, the use of neural network models marked a paradigm
shift in the late eighties from high-level (symbolic) artificial intelligence, characterized
by expert systems with knowledge embodied in if-then rules, to low-level (sub-symbolic)
machine learning, characterized by knowledge embodied in the parameters of a
dynamical system.

Applications
The utility of artificial neural network models lies in the fact that they can be used to in-
fer a function from observations. This is particularly useful in applications where the
complexity of the data or task makes the design of such a function by hand impractical.
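
To make the idea of inferring a function from observations concrete, here is a minimal Java
sketch in which a single linear unit learns y = w*x + b from a handful of (x, y) pairs by
gradient descent on the mean squared error. The class name FunctionFitSketch, the learning
rate, the number of epochs and the sample observations are all assumptions chosen for
illustration; a real network would have many units and nonlinear activations.

```java
public class FunctionFitSketch {

    /**
     * Fits y ≈ w * x + b to the observations by batch gradient descent on the
     * mean squared error. Returns the learned parameters as {w, b}.
     */
    public static double[] fit(double[] xs, double[] ys, double learningRate, int epochs) {
        double w = 0.0;
        double b = 0.0;
        int n = xs.length;
        for (int epoch = 0; epoch < epochs; epoch++) {
            double gradW = 0.0;
            double gradB = 0.0;
            for (int i = 0; i < n; i++) {
                double error = (w * xs[i] + b) - ys[i];
                gradW += 2.0 * error * xs[i] / n;   // d(MSE)/dw
                gradB += 2.0 * error / n;           // d(MSE)/db
            }
            w -= learningRate * gradW;
            b -= learningRate * gradB;
        }
        return new double[] {w, b};
    }

    public static void main(String[] args) {
        // Hypothetical observations generated by y = 2x + 1.
        double[] xs = {0.0, 1.0, 2.0, 3.0};
        double[] ys = {1.0, 3.0, 5.0, 7.0};
        double[] params = fit(xs, ys, 0.05, 5000);
        System.out.println("w = " + params[0] + ", b = " + params[1]);
    }
}
```

The function is never written down by hand; only the observations are supplied, and the
parameters move toward values that reproduce them.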

3.4 Complexity of Data

Real-life applications

The tasks artificial neural networks are applied to tend to fall within the following broad
categories:

• Function approximation, or regression analysis, including time series prediction, fitness
approximation and modeling.

• Classification, including pattern and sequence recognition, novelty detection and
sequential decision making.

• Data processing, including filtering, clustering, blind source separation and compression.

• Robotics, including directing manipulators and computer numerical control.

Application areas include system identification and control (vehicle control, process
control, natural resources management), quantum chemistry, game-playing and decision
making (backgammon, chess, poker), pattern recognition (radar systems, face identifica-
tion, object recognition and more), sequence recognition (gesture, speech, handwritten
text recognition), medical diagnosis, financial applications (automated trading systems),
data mining (or knowledge discovery in databases, “KDD”), visualization and e-mail
spam filtering.

Artificial neural networks have also been used to diagnose several cancers. An ANN-based
hybrid lung cancer detection system named HLND improves the accuracy of diagnosis and
the speed of lung cancer radiology. These networks have also been used to diagnose
prostate cancer. The diagnoses can be used to build specific models from a large group of
patients, against which the information of one given patient is compared. The models do
not depend on assumptions about correlations of different variables. Colorectal cancer
has also been predicted using neural networks, which could predict the outcome for a
patient with colorectal cancer with a lot more accuracy than the current clinical methods.
After training, the networks could predict multiple patient outcomes from unrelated
institutions.

Neural networks and neuroscience

Theoretical and computational neuroscience is the field concerned with the theoretical
analysis and computational modelling of biological neural systems. Since neural systems
are intimately related to cognitive processes and behaviour, the field is closely related to
cognitive and behavioural modelling.

3.3 DECISION TREE

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test,
and each leaf node holds a class label. The topmost node in the tree is the root node.

The following decision tree is for the concept buy_computer that indicates whether a
customer at a company is likely to buy a computer or not. Each internal node represents
a test on an attribute. Each leaf node represents a class.

3.5 Decision Tree

The benefits of having a decision tree are as follows −

• It does not require any domain knowledge.

• It is easy to comprehend.

• The learning and classification steps of a decision tree are simple and fast.

Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to
noise or outliers. The pruned trees are smaller and less complex.

Tree Pruning Approaches


There are two approaches to prune a tree −

• Pre-pruning − The tree is pruned by halting its construction early.

• Post-pruning − This approach removes a sub-tree from a fully grown tree.

Cost Complexity

The cost complexity is measured by the following two parameters −

• Number of leaves in the tree, and

• Error rate of the tree.
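
One common way to combine these two parameters, used in CART-style cost-complexity pruning,
is a single score of the form error rate + alpha * number of leaves, where a lower score is
better. The Java sketch below assumes that formulation; the class name CostComplexitySketch,
the penalty weight alpha and the example trees are hypothetical and are not taken from this
project.

```java
public class CostComplexitySketch {

    /**
     * Cost-complexity score of a tree: its error rate plus a penalty of
     * alpha for every leaf. Lower is better.
     */
    public static double cost(double errorRate, int leaves, double alpha) {
        return errorRate + alpha * leaves;
    }

    public static void main(String[] args) {
        double alpha = 0.01;                        // assumed penalty per leaf
        double fullTree = cost(0.10, 12, alpha);    // hypothetical fully grown tree
        double prunedTree = cost(0.12, 5, alpha);   // hypothetical pruned tree
        // The pruned tree makes slightly more errors but has far fewer leaves.
        System.out.println(prunedTree < fullTree ? "prefer pruned tree" : "prefer full tree");
    }
}
```

With alpha set to zero the score reduces to the error rate alone, so the larger tree would
win; raising alpha shifts the preference toward smaller trees.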

A decision tree builds classification or regression models in the form of a tree structure.
It breaks down a dataset into smaller and smaller subsets while at the same time an
associated decision tree is incrementally developed. The final result is a tree with decision
nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g.,
Sunny, Overcast and Rainy). A leaf node (e.g., Play) represents a classification or
decision. The topmost decision node in a tree, which corresponds to the best predictor, is
called the root node. Decision trees can handle both categorical and numerical data.

Decision trees have been applied to problems such as assigning protein function and
predicting splice sites. How do these classifiers work, what types of problems can they
solve and what are their advantages over alternatives?

Many scientific problems entail labeling data items with one of a given, finite set of
classes based on features of the data items. For example, oncologists classify tumors as
different known cancer types using biopsies, patient records and other assays. Decision
trees, such as C4.5, CART and newer variants, are classifiers that predict class
labels for data items. Decision trees are at their heart a fairly simple type of classifier,
and this is one of their advantages.

Decision trees are constructed by analyzing a set of training examples for which the
class labels are known. They are then applied to classify previously unseen examples. If
trained on high-quality data, decision trees can make very accurate predictions.

Classifying with decision trees

A decision tree classifies data items (Fig. 1a) by posing a series of questions about the
features associated with the items. Each question is contained in a node, and every inter-
nal node points to one child node for each possible answer to its question. The questions
thereby form a hierarchy, encoded as a tree. In the simplest form, we ask yes-or-no ques-
tions, and each internal node has a ‘yes’ child and a ‘no’ child. An item is sorted into a
class by following the path from the topmost node, the root, to a node without children, a
leaf, according to the answers that apply to the item under consideration. An item is as-
signed to the class that has been associated with the leaf it reaches. In some variations,
each leaf contains a probability distribution over the classes that estimates the condition-
al probability that an item reaching the leaf belongs to a given class. Nonetheless, esti-
mation of unbiased probabilities can be difficult.
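
The root-to-leaf walk described above can be sketched in a few lines of Java. The class
DecisionTreeSketch, its Node type, the feature names and the thresholds are all hypothetical
and chosen only to show the mechanics; the thresholds have no clinical meaning, and the
sketch assumes every tested feature is present in the item.

```java
import java.util.Map;

public class DecisionTreeSketch {

    /** A node is either a leaf (label set) or an internal yes/no test on one feature. */
    public static final class Node {
        final String feature;    // feature tested at an internal node
        final double threshold;  // question asked: "is the feature value > threshold?"
        final Node yes;          // child followed when the answer is yes
        final Node no;           // child followed when the answer is no
        final String label;      // class label; non-null only for leaves

        private Node(String feature, double threshold, Node yes, Node no, String label) {
            this.feature = feature;
            this.threshold = threshold;
            this.yes = yes;
            this.no = no;
            this.label = label;
        }

        static Node leaf(String label) {
            return new Node(null, 0.0, null, null, label);
        }

        static Node test(String feature, double threshold, Node yes, Node no) {
            return new Node(feature, threshold, yes, no, null);
        }
    }

    /** Follows the path from the root to a leaf and returns that leaf's class label. */
    public static String classify(Node root, Map<String, Double> item) {
        Node node = root;
        while (node.label == null) {
            node = item.get(node.feature) > node.threshold ? node.yes : node.no;
        }
        return node.label;
    }

    /** A tiny hand-built tree used only for illustration. */
    public static Node exampleTree() {
        return Node.test("age", 55.0,
                Node.test("cholesterol", 240.0,
                        Node.leaf("high risk"),
                        Node.leaf("moderate risk")),
                Node.leaf("low risk"));
    }

    public static void main(String[] args) {
        System.out.println(classify(exampleTree(), Map.of("age", 60.0, "cholesterol", 250.0)));
    }
}
```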

Questions in the tree can be arbitrarily complicated, as long as the answers can be com-
puted efficiently. A question’s answers can be values from a small set, such as
{A,C,G,T}. In this case, a node has one child for each possible value. In many instances,
data items will have real-valued features. To ask about these, the tree uses yes/no ques-
tions of the form “is the value > k?” for some threshold k, where only values that occur
in the data need to be tested as possible thresholds. It is also possible to use more
complex questions, taking either linear or logical combinations of many features at once.
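
The remark that only values occurring in the data need to be tested as thresholds can be
illustrated with a short Java sketch. It tries each observed value k in the question "is the
value > k?" and keeps the one that gives the purest two subsets. Gini impurity is used as the
purity measure here, which is an assumption (the text does not name a measure), and the class
name ThresholdSplitSketch and the sample data are hypothetical.

```java
public class ThresholdSplitSketch {

    /** Gini impurity of a two-class group with `positives` positive items out of `total`. */
    static double gini(int positives, int total) {
        if (total == 0) {
            return 0.0;
        }
        double p = (double) positives / total;
        return 2.0 * p * (1.0 - p);
    }

    /** Size-weighted impurity of the two subsets produced by the test "value > k". */
    public static double splitImpurity(double[] values, boolean[] labels, double k) {
        int above = 0;
        int abovePositives = 0;
        int belowPositives = 0;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > k) {
                above++;
                if (labels[i]) {
                    abovePositives++;
                }
            } else if (labels[i]) {
                belowPositives++;
            }
        }
        int n = values.length;
        int below = n - above;
        return (above * gini(abovePositives, above) + below * gini(belowPositives, below)) / n;
    }

    /** Tries every observed value as a threshold and returns the one with the lowest impurity. */
    public static double bestThreshold(double[] values, boolean[] labels) {
        double best = values[0];
        double bestImpurity = Double.POSITIVE_INFINITY;
        for (double k : values) {
            double impurity = splitImpurity(values, labels, k);
            if (impurity < bestImpurity) {
                bestImpurity = impurity;
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] values = {1, 2, 3, 10, 11, 12};                    // hypothetical feature values
        boolean[] labels = {false, false, false, true, true, true}; // hypothetical class labels
        System.out.println("best threshold: " + bestThreshold(values, labels));
    }
}
```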

Decision trees are sometimes more interpretable than other classifiers such as neural
networks and support vector machines because they combine simple questions about the
data in an understandable way. Approaches for extracting decision rules from decision
trees have also been successful. Unfortunately, small changes in input data can
sometimes lead to large changes in the constructed tree. Decision trees are flexible enough to
handle items with a mixture of real-valued and categorical features, as well as items with
some missing features. They are expressive enough to model many partitions of the data
that are not as easily achieved with classifiers that rely on a single decision boundary
(such as logistic regression or support vector machines). However, even data that can be
perfectly divided into classes by a hyperplane may require a large decision tree if only
simple threshold tests are used. Decision trees naturally support classification problems
with more than two classes and can be modified to handle regression problems. Finally,
once constructed, they classify new items quickly.

3.6 Example of Decision Tree

We continue to select questions recursively to split the training items into ever-smaller
subsets, resulting in a tree. A crucial aspect of applying decision trees is limiting the
complexity of the learned trees so that they do not overfit the training examples. One
technique is to stop splitting when no question increases the purity of the subsets more
than a small amount. Alternatively, we can choose to build out the tree completely until
no leaf can be further subdivided. In this case, to avoid overfitting the training data, we
must prune the tree by deleting nodes. This can be done by collapsing internal nodes into
leaves if doing so reduces the classification error on a held-out set of training examples.
Other approaches, relying on ideas such as minimum description length, remove
nodes in an attempt to explicitly balance the complexity of the tree with its fit to the
training data. Cross-validation on left-out training examples should be used to ensure
that the trees generalize beyond the examples used to construct them.

IMPLEMENTATION

4.1 DATA SOURCE

4.1 Attributes

This project takes input values from the user and then employs the Naive Bayes
algorithm to match the patient's data against the data present in a number of databases.
After calculating and matching the values, the system predicts which diseases are
most probable and when the patient should pay a visit to the doctor.

Data mining uses many different techniques for finding the optimal solution
while recognising patterns:

1. Statistics

2. Clustering

3. Decision Trees

4. Classification

5. Neural Networks

IMPLEMENTATION OF NAIVE BAYES CLASSIFIER

A. Classifier

A classifier is a mapping from a (discrete or continuous) feature space X to a
discrete set of labels Y. Here we are dealing with learning classifiers, which are
divided into supervised and unsupervised learning classifiers [2]. The applications of
classifiers are wide-ranging. They find use in medicine, finance, mobile
phones, computer vision (face recognition, target tracking), voice recognition, data
mining and countless other areas.

An example is a classifier that accepts a person's details, such as age, marital status,
home address and medical history and classifies the person with respect to the conditions
of the project.

B. Naive Bayes

In probability theory, Bayes' theorem (often called Bayes' law after Thomas Bayes)
relates the conditional and marginal probabilities of two random events. It is often used to
compute posterior probabilities given observations.

For example, a patient may be observed to have certain symptoms. Bayes' theorem can
be used to compute the probability that a proposed diagnosis is correct, given that obser-
vation.
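
A small worked example may help here. The Java sketch below applies Bayes' theorem to a
single symptom: it combines the prior probability of the disease with how often the symptom
appears in patients with and without it. The class name BayesDiagnosisSketch and all three
probabilities are hypothetical numbers chosen for illustration, not clinical figures.

```java
public class BayesDiagnosisSketch {

    /**
     * Bayes' theorem for a diagnosis given one observed symptom:
     * P(disease | symptom) = P(symptom | disease) * P(disease) / P(symptom).
     */
    public static double posterior(double prior,
                                   double pSymptomGivenDisease,
                                   double pSymptomGivenHealthy) {
        // Total probability of seeing the symptom at all.
        double evidence = pSymptomGivenDisease * prior
                + pSymptomGivenHealthy * (1.0 - prior);
        return pSymptomGivenDisease * prior / evidence;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: 1% of patients have the disease, 90% of those show
        // the symptom, and 10% of patients without the disease show it too.
        double p = posterior(0.01, 0.90, 0.10);
        // 0.009 / (0.009 + 0.099) = 1/12, i.e. about 8.3%
        System.out.println(p);
    }
}
```

Even with a symptom that is nine times more likely in sick patients, the posterior stays low
because the disease itself is rare in this made-up population.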

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes'
theorem. In simple terms, a naive Bayes classifier assumes that the presence (or absence)
of a particular feature of a class is unrelated to the presence (or absence) of any other
feature. For example, a fruit may be considered to be an apple if it is red, round, and
about 4" in diameter. Even if these features depend on the existence of the other
features, a naive Bayes classifier considers all of these properties to independently
contribute to the probability that this fruit is an apple.

Depending on the precise nature of the probability model, naive Bayes classifiers can be
trained very efficiently in a supervised learning setting. Here, the variables are treated
as independent for the purpose of predicting the occurrence of the event.

In spite of their naive design and apparently over-simplified assumptions, naive Bayes
classifiers often work much better in many complex real-world situations than one
might expect. Recently, careful analysis of the Bayesian classification problem has
shown that there are some theoretical reasons for the apparently unreasonable efficacy of
naive Bayes classifiers.

An advantage of the naive Bayes classifier is that it requires a small amount of training
data to estimate the parameters (means and variances of the variables) necessary for
classification. Because independent variables are assumed, only the variances of the
variables for each class need to be determined and not the entire covariance matrix.

C. Theorem

This is a simple probabilistic classifier based on the Bayes theorem, from the Wikipedia
article. This project contains source files that can be included in any C# project.

The Bayesian Classifier is capable of calculating the most probable output depending on
the input. It is possible to add new raw data at runtime and have a better probabilistic
classifier. A naive Bayes classifier assumes that the presence (or absence) of a particular
feature of a class is unrelated to the presence (or absence) of any other feature, given the
class variable. For example, a fruit may be considered to be an apple if it is red, round,
and about 4" in diameter. Even if these features depend on each other or upon the exis-
tence of other features, a naive Bayes classifier considers all of these properties to inde-
pendently contribute to the probability that this fruit is an apple.
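
The two properties described above, picking the most probable output and accepting new raw
data at runtime, can be sketched with a count-based naive Bayes classifier for categorical
features. This is an illustrative Java sketch, not the project's implementation: the class
name CategoricalNaiveBayesSketch, the add-one smoothing and the assumption that each feature
takes one of two possible values are choices made for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class CategoricalNaiveBayesSketch {

    private final Map<String, Integer> classCounts = new HashMap<>();
    // Counts keyed by "label|featureIndex|value".
    private final Map<String, Integer> featureCounts = new HashMap<>();
    private int total = 0;

    /** Adds one labelled example; may be called at any time, also between predictions. */
    public void add(String label, String... features) {
        classCounts.merge(label, 1, Integer::sum);
        for (int i = 0; i < features.length; i++) {
            featureCounts.merge(label + "|" + i + "|" + features[i], 1, Integer::sum);
        }
        total++;
    }

    /** Returns the label with the highest log posterior score for the given features. */
    public String predict(String... features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Integer> entry : classCounts.entrySet()) {
            String label = entry.getKey();
            int n = entry.getValue();
            double score = Math.log((double) n / total);   // log prior of the class
            for (int i = 0; i < features.length; i++) {
                int c = featureCounts.getOrDefault(label + "|" + i + "|" + features[i], 0);
                // Add-one smoothing; the "+ 2.0" assumes two possible values per feature.
                score += Math.log((c + 1.0) / (n + 2.0));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        CategoricalNaiveBayesSketch nb = new CategoricalNaiveBayesSketch();
        // Hypothetical records: (chest pain, blood pressure) -> label.
        nb.add("disease", "yes", "high");
        nb.add("disease", "yes", "high");
        nb.add("healthy", "no", "normal");
        nb.add("healthy", "no", "high");
        System.out.println(nb.predict("yes", "high"));
    }
}
```

Because the model is only a table of counts, calling add again with a new record updates the
classifier immediately, with no separate retraining step.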

D. Bayesian interpretation

In the Bayesian (or epistemological) interpretation, probability measures a degree of
belief. Bayes’ theorem then links the degree of belief in a proposition before and after
accounting for evidence. For example, suppose somebody proposes that a biased coin is
twice as likely to land heads as tails. Degree of belief in this might initially be 50%. The
coin is then flipped a number of times to collect evidence. Belief may rise to 70% if the
evidence supports the proposition.

For proposition A and evidence B,

• P(A), the prior, is the initial degree of belief in A.

• P(A | B), the posterior, is the degree of belief having accounted for B.

• P(B | A) / P(B) represents the support B provides for A.
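
The coin example can be worked through numerically. The Java sketch below takes the
proposition "the coin lands heads twice as often as tails" (heads probability 2/3) and
updates the belief in it after a number of flips. To compute P(B), some alternative has to be
specified; the sketch assumes the only alternative is a fair coin, which is an assumption
added for illustration, as are the class name CoinBeliefSketch and the flip counts.

```java
public class CoinBeliefSketch {

    /**
     * Posterior belief that the coin is biased 2:1 toward heads, given the
     * observed flips, when the only alternative considered is a fair coin.
     */
    public static double posteriorBiased(double prior, int heads, int tails) {
        // P(B | A): likelihood of the flips if the coin is biased.
        double likelihoodBiased = Math.pow(2.0 / 3.0, heads) * Math.pow(1.0 / 3.0, tails);
        // Likelihood of the same flips if the coin is fair.
        double likelihoodFair = Math.pow(0.5, heads + tails);
        // P(B): total probability of the evidence under both hypotheses.
        double evidence = likelihoodBiased * prior + likelihoodFair * (1.0 - prior);
        return likelihoodBiased * prior / evidence;
    }

    public static void main(String[] args) {
        // Start from a 50% belief and observe 7 heads and 3 tails (hypothetical flips).
        System.out.println(posteriorBiased(0.5, 7, 3));
    }
}
```

Under these assumptions, seven heads in ten flips move the belief from 50% to roughly 69%,
close to the rise described above; a run dominated by tails would push it below 50% instead.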

E. Sex classification

Problem: classify whether a given person is male or female based on the measured
features. The features include height, weight, and foot size.

F. Training

Example training set is shown below.
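
The classification problem above can be sketched as a Gaussian naive Bayes classifier in
Java: for each class, the mean and variance of every feature are estimated from the training
values, and a new sample is assigned to the class with the higher posterior score. The
numbers in the sketch are made up for illustration and are not this report's training set;
the class name GaussianNaiveBayesSketch, the equal class priors and the units (cm, kg, cm)
are likewise assumptions.

```java
public class GaussianNaiveBayesSketch {

    /** Sample mean of one feature. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Unbiased sample variance of one feature. */
    static double variance(double[] values, double mean) {
        double sum = 0.0;
        for (double v : values) {
            sum += (v - mean) * (v - mean);
        }
        return sum / (values.length - 1);
    }

    /** Natural log of the normal density with the given mean and variance at x. */
    static double logGaussian(double x, double mean, double variance) {
        return -0.5 * Math.log(2.0 * Math.PI * variance)
                - (x - mean) * (x - mean) / (2.0 * variance);
    }

    /**
     * Log posterior score of one class, up to a constant shared by all classes:
     * log prior plus the sum of per-feature log likelihoods.
     * classData[f] holds the training values of feature f for this class.
     */
    public static double logScore(double[][] classData, double prior, double[] sample) {
        double score = Math.log(prior);
        for (int f = 0; f < classData.length; f++) {
            double m = mean(classData[f]);
            double var = variance(classData[f], m);
            score += logGaussian(sample[f], m, var);
        }
        return score;
    }

    /** Picks the class with the higher score, assuming equal priors of 0.5. */
    public static String classify(double[][] male, double[][] female, double[] sample) {
        return logScore(male, 0.5, sample) >= logScore(female, 0.5, sample) ? "male" : "female";
    }

    // Hypothetical training values: rows are height (cm), weight (kg), foot size (cm).
    public static final double[][] MALE = {
        {180, 175, 170, 183}, {82, 86, 77, 75}, {28, 27, 28, 26}};
    public static final double[][] FEMALE = {
        {152, 167, 165, 175}, {45, 68, 59, 68}, {23, 25, 24, 25}};

    public static void main(String[] args) {
        System.out.println(classify(MALE, FEMALE, new double[] {183, 85, 28}));
    }
}
```

Only a mean and a variance per feature and class are stored, which reflects the point made
earlier that no full covariance matrix is needed under the independence assumption.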

RESULTS

Medical diagnosis is an important yet intricate task that demands special attention
and must be carried out precisely and efficiently, so automating it would be highly
advantageous. Decisions made by doctors can be unintentionally biased, which
introduces errors and wastes money through medical overhead and a lower quality of
service for patients. Data mining, with the help of machine learning, therefore makes it
possible to build an automated system and a knowledge-rich environment that can
improve the quality of service provided to patients.

SUMMARY AND CONCLUSIONS

The decision in cardiovascular disease prediction is made by a knowledge environment
built with the help of a supervised machine learning technique called naive Bayesian
classification. The system extracts hidden knowledge from previous heart disease
databases. With this model, complex queries can be answered easily, drawing on its
access to detailed information and on its accuracy. The project can be expanded further
to include additional attributes from the medical reports that are not listed above. It may
also include time series analysis, association rules, clustering and so on. Continuous
data can be used instead of the categorical data used at present. In addition, text mining
can be incorporated so that unstructured data can be analysed, integrating data mining
with text mining.

Heart disease is by its nature an incurable disease, and it leads to dangerous
complications such as heart attack and death. The relevance of data mining in the
medical field is acknowledged, and steps are being taken to apply suitable strategies to
disease prediction. Research works by various people, using some compelling procedures,
were studied. Various classification techniques are widely used for heart disease
prediction.

REFERENCES

1. Blake, C.L., Mertz, C.J.: “UCI Machine Learning Databases”,
http://mlearn.ics.uci.edu/databases/heartdisease/, 2004

2. Chapman, P., Clinton, J., Kerber, R. Khabeza, T., Reinartz, T., Shearer, C., Wirth, R.:
“CRISP-DM 1.0: Step by step data mining guide”, SPSS, 1-78, 2000.

3. Mrs.G.Subbalakshmi, “Decision Support in Heart Disease Prediction System using
Naive Bayes”, Indian Journal of Computer Science and Engineering.

4. Fayyad, U: “Data Mining and Knowledge Discovery in Databases: Implications for
scientific databases”, Proc. of the 9th Int. Conf. on Scientific and Statistical Database
Management, Olympia, Washington, USA, 2-11, 1997.

5. Giudici, P.: “Applied Data Mining: Statistical Methods for Business and Industry”,
New York: John Wiley, 2003.


6. Han, J., Kamber, M.: “Data Mining Concepts and Techniques”, Morgan Kaufmann
Publishers, 2006.

7. Ho, T. J.: “Data Mining and Data Warehousing”, Prentice Hall, 2005.


8. Intelligent Heart Disease Prediction System Using Data Mining Techniques-Sellappan
Palaniappan, Rafiah Awang 978-1-4244-1968-5/08/ ©2008 IEEE

9. Obenshain, M.K: “Application of Data Mining Techniques to Healthcare Data”,
Infection Control and Hospital Epidemiology, 25(8), 690–695, 2004.


10. Wu, R., Peters, W., Morgan, M.W.: “The Next Generation Clinical Decision Support:
Linking Evidence to Best Practice”, Journal Healthcare Information Management. 16(4),
50-55, 2002.

11. BalaSundar V, T Devi and N Saravan, “Development of a Data Clustering Algorithm
for Predicting Heart”, International Journal of Computer Applications, vol. 48, pp.
423-428, 2012

12. Sayali D. Jadhav, H. P. Channe, “Comparative Study of KNN, Naive Bayes and De-
cision Tree Classification Techniques”, International Journal of Science and Research
(IJSR) , Volume 5, Issue 1 , Paper ID: NOV153131,2016

13. Sellappan Palaniappan, Rafiah Awang, “Intelligent Heart Disease Prediction System
Using Data Mining Techniques”, IEEE, pp.978-1-4244-1968, 2008

14. Mr. P Sai Chandrasekhar Reddy, Mr. Puneet Palagi, Ms. Jaya, “HEART DISEASE
PREDICTION USING ANN ALGORITHM IN DATA MINING”, IJCSMC, Vol. 6, Issue
4, pg.168 – 172, 2016

Dursun Delen, Asil Oztekin, Leman Tomak, “An analytic approach to better understanding
and management of coronary surgeries”, Decision Support Systems vol. 52, pp.698–705, 2012

David L. Olson, Dursun Delen, Yanyan Meng, “Comparative analysis of data mining
methods for bankruptcy prediction”, Decision Support Systems vol. 52, pp.464–473, 2012

15. Akhilesh Kumar Yadav, DivyaTomar and Sonali Agarwal, “Clustering of Lung Can-
cer Data Using Foggy K-Means”, International Conference on Recent Trends in Infor-
mation Technology (ICRTIT), vol. 21, pp.121-126,2013

16. Daljit Kaur and KiranJyot, “Enhancement in the Performance of K-means Algo-
rithm”, International Journal of Computer Science and Communication Engineering,
vol. 2, pp. 724- 729,2013

17. Sanjay Chakrabotry, Prof. N.K Nigwani and Lop Dey , “Weather Forecasting using
Incremental K-means Clustering”, vol. 8, pp. 142-147,2014

18. K. Rajalakshmi, Dr. S. S. Dhenakaran and N. Roobin, “Comparative Analysis of
K-Means Algorithm in Disease Prediction”, International Journal of Science, Engineering
and Technology Research (IJSETR), Vol. 4, pp. 1023-1028,201

19. Mustafa A. Al-Fayoumi, “Enhanced Associative classification based on incremental
mining Algorithm (E-ACIM)”, IJCSI International Journal of Computer Science Issues,
Volume 12, Issue 1, 2015

20. Sajida Perveen, Muhammad Shahbaz, Aziz Guergachi, Karim Keshavjee, “Performance
Analysis of Data Mining Classification Techniques to Predict Diabetes”, Procedia
Computer Science vol.82, pp.115 – 121, 2015

21. Basma Boukenze, Hajar Mousannif and Abdelkrim Haqiq, “PERFORMANCE OF
DATA MINING TECHNIQUES TO PREDICT IN HEALTHCARE CASE
STUDY: CHRONIC KIDNEY FAILURE DISEASE”, International Journal of Database
Management Systems (IJDMS) Vol.8, No.3, 2016

22. Monire Norouzi, Alireza Souri, and Majid Samad Zamini, “A Data Mining Classification
Approach for Behavioral Malware Detection”, Hindawi Publishing Corporation,
Journal of Computer Networks and Communications, 2016

23. Nancy.P, Sudha.V, Akiladevi.R, “Analysis of feature Selection and Classification al-
gorithms on Hepatitis Data”, International Journal of Advanced Research in Computer
Engineering & Technology (IJARCET), Volume 6, Issue 1,2017

APPENDIX A

SCREEN SHOTS

APPENDIX B

SOURCE CODE

package HeartDisease;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.swing.JOptionPane;

public class frmMain extends javax.swing.JFrame {


int count=0;
String addressFile="";

public frmMain() {
initComponents();
}

@SuppressWarnings("unchecked")
// <editor-fold defaultstate="collapsed" desc="Generated Code">
private void initComponents() {

jPanel1 = new javax.swing.JPanel();


jButton6 = new javax.swing.JButton();
jButton7 = new javax.swing.JButton();
jLabel1 = new javax.swing.JLabel();
cmbecg = new javax.swing.JComboBox();
txtthal = new javax.swing.JTextField();
jLabel2 = new javax.swing.JLabel();
txtAge = new javax.swing.JTextField();
cmbslope = new javax.swing.JComboBox();
txtoldpeak = new javax.swing.JTextField();
jButton1 = new javax.swing.JButton();
cmbexang = new javax.swing.JComboBox();
jLabel3 = new javax.swing.JLabel();
cmbca = new javax.swing.JComboBox();
jLabel4 = new javax.swing.JLabel();
txtResult = new javax.swing.JLabel();
jLabel5 = new javax.swing.JLabel();
txtThalach = new javax.swing.JTextField();
jLabel6 = new javax.swing.JLabel();
jLabel7 = new javax.swing.JLabel();
jLabel8 = new javax.swing.JLabel();
jLabel9 = new javax.swing.JLabel();
jLabel10 = new javax.swing.JLabel();

jLabel11 = new javax.swing.JLabel();
jLabel12 = new javax.swing.JLabel();
jLabel13 = new javax.swing.JLabel();
jLabel14 = new javax.swing.JLabel();
cmbSex = new javax.swing.JComboBox();
cmbCPain = new javax.swing.JComboBox();
txtBP = new javax.swing.JTextField();
txtchol = new javax.swing.JTextField();
cmbfbs = new javax.swing.JComboBox();
jScrollPane1 = new javax.swing.JScrollPane();
txtOutput = new javax.swing.JTextArea();

setDefaultCloseOperation(javax.swing.WindowConstants.EXIT_ON_CLOSE);
setTitle("Health Prediction Tools");

jPanel1.setBackground(new java.awt.Color(255, 95, 131));


jPanel1.setBorder(javax.swing.BorderFactory.createTitledBorder("Application Menu"));

jButton6.setIcon(new javax.swing.ImageIcon(getClass().getResource("/images/Exit.png"))); // NOI18N
jButton6.setText("Logout");
jButton6.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jButton6ActionPerformed(evt);
}
});

jButton7.setIcon(new javax.swing.ImageIcon(getClass().getResource("/images/Open file.png"))); // NOI18N
jButton7.setText("Symptoms");
jButton7.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jButton7ActionPerformed(evt);
}
});

javax.swing.GroupLayout jPanel1Layout = new javax.swing.GroupLayout(jPanel1);


jPanel1.setLayout(jPanel1Layout);
jPanel1Layout.setHorizontalGroup(
jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel1Layout.createSequentialGroup()
.addContainerGap()
.addGroup(jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jButton7, javax.swing.GroupLayout.DEFAULT_SIZE, 160, Short.MAX_VALUE)
.addComponent(jButton6, javax.swing.GroupLayout.Alignment.TRAILING, javax.swing.GroupLayout.DEFAULT_SIZE, 160, Short.MAX_VALUE))
.addContainerGap())
);
jPanel1Layout.setVerticalGroup(
jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel1Layout.createSequentialGroup()
.addGap(58, 58, 58)
.addComponent(jButton7, javax.swing.GroupLayout.PREFERRED_SIZE, 41, javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(18, 18, 18)
.addComponent(jButton6, javax.swing.GroupLayout.PREFERRED_SIZE, 41, javax.swing.GroupLayout.PREFERRED_SIZE)
.addContainerGap(316, Short.MAX_VALUE))
);

jLabel1.setFont(new java.awt.Font("Times New Roman", 1, 24)); // NOI18N


jLabel1.setText("Enter Value to View Prediction");

cmbecg.setModel(new javax.swing.DefaultComboBoxModel(new String[] { "0", "1", "2" }));

jLabel2.setText("Age");

cmbslope.setModel(new javax.swing.DefaultComboBoxModel(new String[] { "1", "2", "3" }));

jButton1.setText("Check");
jButton1.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jButton1ActionPerformed(evt);
}
});

cmbexang.setModel(new javax.swing.DefaultComboBoxModel(new String[] { "0", "1" }));

jLabel3.setText("Sex");

cmbca.setModel(new javax.swing.DefaultComboBoxModel(new String[] { "0", "1", "2", "3" }));

jLabel4.setText("Chest Pain");

txtResult.setText("Result :");

jLabel5.setText("Blood Pressure");

jLabel6.setText("Cholesterol");

jLabel7.setText("FBS");

jLabel8.setText("ECG");

jLabel9.setText("Thalach");

jLabel10.setText("Exang");

jLabel11.setText("Old Peak");

jLabel12.setText("Slope");

jLabel13.setText("CA");

jLabel14.setText("Thal");

cmbSex.setModel(new javax.swing.DefaultComboBoxModel(new String[] { "Male", "Female" }));

cmbCPain.setModel(new javax.swing.DefaultComboBoxModel(new String[] { "1", "2", "3", "4" }));

cmbfbs.setModel(new javax.swing.DefaultComboBoxModel(new String[] { "0", "1" }));

txtOutput.setColumns(20);

txtOutput.setRows(5);
jScrollPane1.setViewportView(txtOutput);

javax.swing.GroupLayout layout = new javax.swing.GroupLayout(getContentPane());


getContentPane().setLayout(layout);
layout.setHorizontalGroup(
layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addGap(0, 0, 0)
.addComponent(jPanel1, javax.swing.GroupLayout.PREFERRED_SIZE, javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addGap(102, 102, 102)
.addComponent(jLabel1)
.addGap(0, 0, Short.MAX_VALUE))
.addGroup(layout.createSequentialGroup()
.addGap(10, 10, 10)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addGap(17, 17, 17)
.addComponent(txtResult)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.UNRELATED)
.addComponent(jScrollPane1, javax.swing.GroupLayout.PREFERRED_SIZE, 558,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addContainerGap(javax.swing.GroupLayout.DEFAULT_SIZE,
Short.MAX_VALUE))
.addGroup(layout.createSequentialGroup()
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING, false)
.addGroup(layout.createSequentialGroup()
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jLabel3)
.addComponent(jLabel4)
.addComponent(jLabel5)
.addComponent(jLabel2))
.addGap(67, 67, 67)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING, false)
.addComponent(cmbSex, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(txtBP)
.addComponent(txtAge, javax.swing.GroupLayout.DEFAULT_SIZE,
121, Short.MAX_VALUE)
.addComponent(cmbCPain, 0, javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE)))
.addGroup(layout.createSequentialGroup()
.addComponent(jLabel8)
.addGap(120, 120, 120)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(cmbfbs, 0, javax.swing.GroupLayout.DEFAULT_SIZE,
Short.MAX_VALUE)
.addComponent(cmbecg, 0, javax.swing.GroupLayout.DEFAULT_SIZE,
Short.MAX_VALUE)))
.addComponent(jLabel7))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED,
javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE))
.addGroup(layout.createSequentialGroup()
.addComponent(jLabel6)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED,
javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE)
.addComponent(txtchol, javax.swing.GroupLayout.PREFERRED_SIZE, 83,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED,
javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE)))
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jLabel9)
.addComponent(jLabel10)
.addComponent(jLabel12)
.addComponent(jLabel11)
.addComponent(jLabel13)
.addComponent(jLabel14))
.addGap(44, 44, 44)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING, false)
.addComponent(txtoldpeak)
.addComponent(txtthal)
.addComponent(txtThalach, javax.swing.GroupLayout.PREFERRED_SIZE, 103,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addComponent(cmbexang, javax.swing.GroupLayout.PREFERRED_SIZE, 40,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(cmbslope, javax.swing.GroupLayout.PREFERRED_SIZE, 40,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(cmbca, javax.swing.GroupLayout.PREFERRED_SIZE, 40,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGap(71, 71, 71))))
.addGroup(javax.swing.GroupLayout.Alignment.TRAILING, layout.createSequentialGroup()
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED,
javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE)
.addComponent(jButton1, javax.swing.GroupLayout.PREFERRED_SIZE, 218,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(171, 171, 171))))
);
layout.setVerticalGroup(
layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addComponent(jPanel1, javax.swing.GroupLayout.PREFERRED_SIZE, javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(0, 0, Short.MAX_VALUE))
.addGroup(layout.createSequentialGroup()
.addGap(22, 22, 22)
.addComponent(jLabel1)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jLabel9)

.addComponent(txtThalach, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE))
.addGap(14, 14, 14)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.TRAILING)
.addComponent(jLabel10)
.addComponent(cmbexang, javax.swing.GroupLayout.PREFERRED_SIZE, 20,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jLabel11)
.addComponent(txtoldpeak, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.UNRELATED)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jLabel12, javax.swing.GroupLayout.PREFERRED_SIZE, 20,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(cmbslope, javax.swing.GroupLayout.PREFERRED_SIZE, 20,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGap(18, 18, 18)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jLabel13)
.addComponent(cmbca, javax.swing.GroupLayout.PREFERRED_SIZE, 20,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGap(18, 18, 18)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jLabel14)
.addComponent(txtthal, javax.swing.GroupLayout.PREFERRED_SIZE, javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)))
.addGroup(layout.createSequentialGroup()
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(txtAge, javax.swing.GroupLayout.PREFERRED_SIZE, javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jLabel2))
.addGap(23, 23, 23)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jLabel3)
.addComponent(cmbSex, javax.swing.GroupLayout.PREFERRED_SIZE, javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE))
.addGap(18, 18, 18)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jLabel4)
.addComponent(cmbCPain, javax.swing.GroupLayout.PREFERRED_SIZE, 29,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addGap(10, 10, 10)
.addComponent(jLabel5))
.addGroup(layout.createSequentialGroup()
.addGap(11, 11, 11)
.addComponent(txtBP, javax.swing.GroupLayout.PREFERRED_SIZE, javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jLabel6)
.addComponent(txtchol, javax.swing.GroupLayout.PREFERRED_SIZE, 25,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()

.addGap(12, 12, 12)
.addComponent(jLabel7)
.addGap(18, 18, 18)
.addComponent(jLabel8))
.addGroup(layout.createSequentialGroup()
.addGap(5, 5, 5)
.addComponent(cmbfbs, javax.swing.GroupLayout.PREFERRED_SIZE, 32,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.UNRELATED)
.addComponent(cmbecg, javax.swing.GroupLayout.PREFERRED_SIZE, 34,
javax.swing.GroupLayout.PREFERRED_SIZE)))))
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addGap(46, 46, 46)
.addComponent(jScrollPane1, javax.swing.GroupLayout.PREFERRED_SIZE, 67,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(layout.createSequentialGroup()
.addGap(78, 78, 78)
.addComponent(txtResult)))
.addGap(18, 18, 18)
.addComponent(jButton1, javax.swing.GroupLayout.PREFERRED_SIZE, 33, javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(49, 49, 49))
);

pack();
}// </editor-fold>

private void jButton6ActionPerformed(java.awt.event.ActionEvent evt) {


// Close this window and return to the login screen.
this.dispose();
new frmLogin().setVisible(true);
}

private void jButton1ActionPerformed(java.awt.event.ActionEvent evt) {
// Collect the 13 attribute values from the form.
String age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal;
age=txtAge.getText().trim();
// Sex is passed to the classifier as 1 (male) or 0 (female).
sex="Male".equals(cmbSex.getSelectedItem().toString()) ? "1" : "0";
cp=cmbCPain.getSelectedItem().toString();
trestbps=txtBP.getText().trim();
chol=txtchol.getText().trim();
fbs=cmbfbs.getSelectedItem().toString().trim();
restecg=cmbecg.getSelectedItem().toString().trim();
thalach=txtThalach.getText().trim();
exang=cmbexang.getSelectedItem().toString();
oldpeak=txtoldpeak.getText().trim();
slope=cmbslope.getSelectedItem().toString();
ca=cmbca.getSelectedItem().toString();
thal=txtthal.getText().trim();
// Join the values with tabs: the Thalach value is eighth and the Thal value is last.
String input=age+"\t"+sex+"\t"+cp+"\t"+trestbps+"\t"+chol+"\t"+fbs+"\t"+restecg+"\t"+thalach+"\t"+exang+"\t"+oldpeak+"\t"+slope+"\t"+ca+"\t"+thal;
String output=DB.NavieBayesClassifier.analayseData(input);
// Map the predicted class to the advice shown to the user.
if(output.equalsIgnoreCase("Zero"))
{
String st1="You should contact the doctor for re-admission within 10 days";
txtOutput.setText(output+"\nPrediction: "+st1);
}
else if(output.equalsIgnoreCase("One"))
{
String st1="You should contact the doctor for re-admission within 5 days";
txtOutput.setText(output+"\nPrediction: "+st1);
}
else if(output.equalsIgnoreCase("Two"))
{
String st1="You should contact the doctor for re-admission immediately";
txtOutput.setText(output+"\nPrediction: "+st1);
}
}

private void jButton7ActionPerformed(java.awt.event.ActionEvent evt) {
// Open the symptom checker window.
CheckSymp chk=new CheckSymp();
chk.setVisible(true);
}

public static void main(String args[]) {


/* Set the Nimbus look and feel */
//<editor-fold defaultstate="collapsed" desc=" Look and feel setting code (optional) ">
/* If Nimbus (introduced in Java SE 6) is not available, stay with the default look and feel.
* For details see http://download.oracle.com/javase/tutorial/uiswing/lookandfeel/plaf.html
*/
try {
for (javax.swing.UIManager.LookAndFeelInfo info : javax.swing.UIManager.getInstalledLookAndFeels()) {
if ("Nimbus".equals(info.getName())) {
javax.swing.UIManager.setLookAndFeel(info.getClassName());
break;
}
}
} catch (ClassNotFoundException ex) {
java.util.logging.Logger.getLogger(frmMain.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
} catch (InstantiationException ex) {
java.util.logging.Logger.getLogger(frmMain.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
} catch (IllegalAccessException ex) {
java.util.logging.Logger.getLogger(frmMain.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
} catch (javax.swing.UnsupportedLookAndFeelException ex) {
java.util.logging.Logger.getLogger(frmMain.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
}
//</editor-fold>

java.awt.EventQueue.invokeLater(new Runnable() {
public void run() {
new frmMain().setVisible(true);

}
});
}
// Variables declaration - do not modify
private javax.swing.JComboBox cmbCPain;
private javax.swing.JComboBox cmbSex;
private javax.swing.JComboBox cmbca;
private javax.swing.JComboBox cmbecg;
private javax.swing.JComboBox cmbexang;
private javax.swing.JComboBox cmbfbs;
private javax.swing.JComboBox cmbslope;
private javax.swing.JButton jButton1;
private javax.swing.JButton jButton6;
private javax.swing.JButton jButton7;
private javax.swing.JLabel jLabel1;
private javax.swing.JLabel jLabel10;
private javax.swing.JLabel jLabel11;
private javax.swing.JLabel jLabel12;
private javax.swing.JLabel jLabel13;
private javax.swing.JLabel jLabel14;
private javax.swing.JLabel jLabel2;
private javax.swing.JLabel jLabel3;
private javax.swing.JLabel jLabel4;
private javax.swing.JLabel jLabel5;
private javax.swing.JLabel jLabel6;
private javax.swing.JLabel jLabel7;
private javax.swing.JLabel jLabel8;
private javax.swing.JLabel jLabel9;
private javax.swing.JPanel jPanel1;
private javax.swing.JScrollPane jScrollPane1;
private javax.swing.JTextField txtAge;
private javax.swing.JTextField txtBP;
private javax.swing.JTextArea txtOutput;
private javax.swing.JLabel txtResult;
private javax.swing.JTextField txtThalach;
private javax.swing.JTextField txtchol;
private javax.swing.JTextField txtoldpeak;
private javax.swing.JTextField txtthal;
// End of variables declaration
}
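
The Check button handler above joins thirteen form values into one tab-separated record before passing it to the classifier. The following is a minimal sketch of how that record could be assembled with basic validation, so that an empty or non-numeric field is reported before classification. The class name HeartRecordSketch, its method names and the sample values are assumptions made for illustration and are not part of the project code.

```java
import java.util.StringJoiner;

public class HeartRecordSketch {
    // Column order mirrors the input string built in jButton1ActionPerformed.
    public static final String[] FIELDS = {
        "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
        "thalach", "exang", "oldpeak", "slope", "ca", "thal"
    };

    // Encodes the combo box label the same way the form does: Male = 1, otherwise 0.
    public static String encodeSex(String label) {
        return "Male".equals(label) ? "1" : "0";
    }

    // Validates each value and joins them with tabs in FIELDS order.
    public static String buildRecord(String... values) {
        if (values.length != FIELDS.length) {
            throw new IllegalArgumentException(
                "Expected " + FIELDS.length + " values but got " + values.length);
        }
        StringJoiner joiner = new StringJoiner("\t");
        for (int i = 0; i < values.length; i++) {
            String value = values[i] == null ? "" : values[i].trim();
            if (value.isEmpty()) {
                throw new IllegalArgumentException("Missing value for " + FIELDS[i]);
            }
            try {
                Double.parseDouble(value); // every attribute is numeric on this form
            } catch (NumberFormatException ex) {
                throw new IllegalArgumentException(
                    "Non-numeric value for " + FIELDS[i] + ": " + value);
            }
            joiner.add(value);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Illustrative values only; they are not taken from the project's dataset.
        String record = buildRecord("63", encodeSex("Male"), "1", "145", "233",
                "1", "2", "150", "0", "2.3", "3", "0", "6");
        System.out.println(record);
    }
}
```

In the form, a handler could catch the IllegalArgumentException and show its message in txtOutput instead of calling the classifier with incomplete data.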

package Analayse.heart.disease.opensource.classifiers;

import Analayse.heart.disease.opensource.dataobjects.Document;
import Analayse.heart.disease.opensource.dataobjects.FeatureStats;
import Analayse.heart.disease.opensource.dataobjects.NaiveBayesKnowledgeBase;
import Analayse.heart.disease.opensource.features.FeatureExtraction;
import Analayse.heart.disease.opensource.features.TextTokenizer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/**
* Implements a basic form of Multinomial Naive Bayes Text Classifier
*/

public class NaiveBayes {
private double chisquareCriticalValue = 10.83; //equivalent to p-value 0.001; used by the feature selection algorithm

private NaiveBayesKnowledgeBase knowledgeBase;

/**
* This constructor is used when we load an already trained classifier
*
* @param knowledgeBase
*/
public NaiveBayes(NaiveBayesKnowledgeBase knowledgeBase) {
this.knowledgeBase = knowledgeBase;
}

/**
* This constructor is used when we plan to train a new classifier.
*/
public NaiveBayes() {
this(null);
}

/**
* Gets the knowledgebase parameter
*
* @return
*/
public NaiveBayesKnowledgeBase getKnowledgeBase() {
return knowledgeBase;
}

public double getChisquareCriticalValue() {


return chisquareCriticalValue;
}

/**
* Sets the chisquareCriticalValue parameter.
*
* @param chisquareCriticalValue
*/
public void setChisquareCriticalValue(double chisquareCriticalValue) {
this.chisquareCriticalValue = chisquareCriticalValue;
}

/**
* Preprocesses the original dataset and converts it to a List of Documents.
*
* @param trainingDataset
* @return
*/
private List<Document> preprocessDataset(Map<String, String[]> trainingDataset) {
List<Document> dataset = new ArrayList<Document>();

String category;
String[] examples;

Document doc;

Iterator<Map.Entry<String, String[]>> it = trainingDataset.entrySet().iterator();

//loop through all the categories and training examples


while(it.hasNext()) {
Map.Entry<String, String[]> entry = it.next();
category = entry.getKey();
examples = entry.getValue();

for(int i=0;i<examples.length;++i) {
//for each example in the category tokenize its text and convert it into a Document object.
doc = TextTokenizer.tokenize(examples[i]);
doc.category = category;
dataset.add(doc);

//examples[i] = null; //try freeing some memory


}

//it.remove(); //try freeing some memory


}

return dataset;
}

/**
* Gathers the required counts for the features and performs feature selection
* on the above counts. It returns a FeatureStats object that is later used
* for calculating the probabilities of the model.
*
* @param dataset
* @return
*/
private FeatureStats selectFeatures(List<Document> dataset) {
FeatureExtraction featureExtractor = new FeatureExtraction();

//the FeatureStats object contains statistics about all the features found in the documents
FeatureStats stats = featureExtractor.extractFeatureStats(dataset); //extract the stats of the dataset

//we pass this information to the feature selection algorithm and we get a list with the selected features
Map<String, Double> selectedFeatures = featureExtractor.chisquare(stats, chisquareCriticalValue);

//clip from the stats all the features that are not selected
Iterator<Map.Entry<String, Map<String, Integer>>> it =
stats.featureCategoryJointCount.entrySet().iterator();
while(it.hasNext()) {
String feature = it.next().getKey();

if(selectedFeatures.containsKey(feature)==false) {
//if the feature is not in the selectedFeatures list remove it
it.remove();
}
}

return stats;
}

/**

* Trains a Naive Bayes classifier by using the Multinomial Model by passing
* the trainingDataset and the prior probabilities.
*
* @param trainingDataset
* @param categoryPriors
* @throws IllegalArgumentException
*/
public void train(Map<String, String[]> trainingDataset, Map<String, Double> categoryPriors) throws
IllegalArgumentException {
//preprocess the given dataset
List<Document> dataset = preprocessDataset(trainingDataset);

//produce the feature stats and select the best features


FeatureStats featureStats = selectFeatures(dataset);

//initialize the knowledgeBase of the classifier


knowledgeBase = new NaiveBayesKnowledgeBase();
knowledgeBase.n = featureStats.n; //number of observations
knowledgeBase.d = featureStats.featureCategoryJointCount.size(); //number of features

//check if prior probabilities are given


if(categoryPriors==null) {
//if not estimate the priors from the sample
knowledgeBase.c = featureStats.categoryCounts.size(); //number of categories
knowledgeBase.logPriors = new HashMap<String,Double>();

String category;
int count;
for(Map.Entry<String, Integer> entry : featureStats.categoryCounts.entrySet()) {
category = entry.getKey();
count = entry.getValue();

knowledgeBase.logPriors.put(category, Math.log((double)count/knowledgeBase.n));
}
}
else {
//if they are provided then use the given priors
knowledgeBase.c = categoryPriors.size();

//make sure that the given priors are valid


if(knowledgeBase.c!=featureStats.categoryCounts.size()) {
throw new IllegalArgumentException("Invalid priors Array: Make sure you pass a prior probability for every supported category.");
}

String category;
Double priorProbability;
for(Map.Entry<String, Double> entry : categoryPriors.entrySet()) {
category = entry.getKey();
priorProbability = entry.getValue();
if(priorProbability==null) {
throw new IllegalArgumentException("Invalid priors Array: Make sure you pass a prior probability for every supported category.");
}
else if(priorProbability<0 || priorProbability>1) {
throw new IllegalArgumentException("Invalid priors Array: Prior probabilities should be between 0 and 1.");
}

knowledgeBase.logPriors.put(category, Math.log(priorProbability));
}
}

//We perform Laplace smoothing (also known as add-1). This requires estimating the total feature occurrences in each category
Map<String, Double> featureOccurrencesInCategory = new HashMap<String, Double>();

Integer occurrences;
Double featureOccSum;
for(String category : knowledgeBase.logPriors.keySet()) {
featureOccSum = 0.0;
for(Map<String, Integer> categoryListOccurrences : featureStats.featureCategoryJointCount.values()) {
occurrences=categoryListOccurrences.get(category);
if(occurrences!=null) {
featureOccSum+=occurrences;
}
}
featureOccurrencesInCategory.put(category, featureOccSum);
}

//estimate log likelihoods


String feature;
Integer count;
Map<String, Integer> featureCategoryCounts;
double logLikelihood;
for(String category : knowledgeBase.logPriors.keySet()) {
for(Map.Entry<String, Map<String, Integer>> entry : featureStats.featureCategoryJointCount.entrySet()) {
feature = entry.getKey();
featureCategoryCounts = entry.getValue();

count = featureCategoryCounts.get(category);
if(count==null) {
count = 0;
}

logLikelihood = Math.log((count+1.0)/(featureOccurrencesInCategory.get(category)+knowledgeBase.d));
if(knowledgeBase.logLikelihoods.containsKey(feature)==false) {
knowledgeBase.logLikelihoods.put(feature, new HashMap<String, Double>());
}
knowledgeBase.logLikelihoods.get(feature).put(category, logLikelihood);
}
}
featureOccurrencesInCategory=null;
}

/**
* Wrapper method of train() which enables the estimation of the prior
* probabilities based on the sample.
*
* @param trainingDataset
*/
public void train(Map<String, String[]> trainingDataset) {
train(trainingDataset, null);
}

/**
* Returns the log-score of the winning category from the most recent call to predict().
*
* @return the highest score found, or negative infinity if predict() has not been called
*/
public double getMaxScore(){
return maxScore;
}
Double maxScore=Double.NEGATIVE_INFINITY;

/**
* Predicts the category of a text by using an already trained classifier
* and returns its category.
*
* @param text
* @return
* @throws IllegalArgumentException
*/
public String predict(String text) throws IllegalArgumentException {
if(knowledgeBase == null) {
throw new IllegalArgumentException("Knowledge base missing: Make sure you train a classifier before you use it.");
}
maxScore=Double.NEGATIVE_INFINITY;
//Tokenizes the text and creates a new document
Document doc = TextTokenizer.tokenize(text);
String category;
String feature;
Integer occurrences=0;
Double logprob;

String maxScoreCategory = null;

//Map<String, Double> predictionScores = new HashMap<>();


for(Map.Entry<String, Double> entry1 : knowledgeBase.logPriors.entrySet()) {
category = entry1.getKey();
logprob = entry1.getValue(); //initialize the scores with the priors

//foreach feature of the document


for(Map.Entry<String, Integer> entry2 : doc.tokens.entrySet()) {
feature = entry2.getKey();

if(!knowledgeBase.logLikelihoods.containsKey(feature)) {
continue; //if the feature does not exist in the knowledge base skip it
}

occurrences = entry2.getValue(); //get its occurrences in text

logprob += occurrences*knowledgeBase.logLikelihoods.get(feature).get(category); //multiply loglikelihood score with occurrences
}
//predictionScores.put(category, logprob);

if((logprob>maxScore)) {

maxScore=logprob;
maxScoreCategory=category;
}

}

return maxScoreCategory; //return the category with highest score


}
}
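
The train() and predict() methods above rely on two calculations: the add-one (Laplace) estimate log((count + 1) / (total + d)), and a score per category equal to the log prior plus the sum of occurrences times log-likelihood. The sketch below reproduces both on a toy model so that the arithmetic can be followed by hand. The class LaplaceSketch, the feature names f1 and f2 and all counts are hypothetical and were chosen only for illustration; the category names reuse the labels "Zero" and "Two" that the form checks for.

```java
import java.util.HashMap;
import java.util.Map;

public class LaplaceSketch {
    // Add-one smoothed log-likelihood, the same expression used in train().
    public static double logLikelihood(int count, double totalInCategory, int d) {
        return Math.log((count + 1.0) / (totalInCategory + d));
    }

    // Returns the category whose log prior plus weighted log-likelihoods is largest.
    public static String argMax(Map<String, Double> logPriors,
                                Map<String, Map<String, Double>> logLikelihoods,
                                Map<String, Integer> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> prior : logPriors.entrySet()) {
            double score = prior.getValue();
            for (Map.Entry<String, Integer> token : tokens.entrySet()) {
                Map<String, Double> perCategory = logLikelihoods.get(token.getKey());
                if (perCategory == null) {
                    continue; // feature unknown to the model, skipped as in predict()
                }
                score += token.getValue() * perCategory.get(prior.getKey());
            }
            if (score > bestScore) {
                bestScore = score;
                best = prior.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int d = 2; // two features: f1 and f2 (hypothetical)
        // Toy counts: f1 occurs 3 times in "Zero" and once in "Two"; f2 the reverse.
        Map<String, Map<String, Double>> logLikelihoods = new HashMap<>();
        Map<String, Double> f1 = new HashMap<>();
        f1.put("Zero", logLikelihood(3, 4, d));
        f1.put("Two", logLikelihood(1, 4, d));
        Map<String, Double> f2 = new HashMap<>();
        f2.put("Zero", logLikelihood(1, 4, d));
        f2.put("Two", logLikelihood(3, 4, d));
        logLikelihoods.put("f1", f1);
        logLikelihoods.put("f2", f2);

        Map<String, Double> logPriors = new HashMap<>();
        logPriors.put("Zero", Math.log(0.5));
        logPriors.put("Two", Math.log(0.5));

        Map<String, Integer> tokens = new HashMap<>();
        tokens.put("f2", 2); // the document contains f2 twice

        System.out.println(argMax(logPriors, logLikelihoods, tokens)); // prints Two
    }
}
```

Because of the added 1 in the numerator, a feature never seen in a category still receives a finite log-likelihood rather than log(0), so a single unseen feature cannot rule a category out on its own.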
