NATIONAL INSTITUTE OF TECHNOLOGY KURUKSHETRA

MAJOR PROJECT REPORT
ON
FINDING SEMANTIC RELATIONSHIP AMONG ASSOCIATED MEDICAL TERMS

SUBMITTED TO:
Dr. R.M. Sharma

SUBMITTED BY:
Manisha Singh (111497)
Sneha Bairagi (111717)
Abhinav Rai (511004)

CONTENTS
1. Introduction
2. Motivation
3. Problem Statement
4. Description
   4.1 Steps involved
       4.1.1 Tokenization
       4.1.2 Stop Word Removal
       4.1.3 Stemming
       4.1.4 POS Tagging
       4.1.5 Annotating corpora and searching patterns
5. Java
6. JDBC
7. Conclusion
8. Future work
9. References

Acknowledgments

We express our profound gratitude and indebtedness to Prof. R.M. Sharma, Department of
Computer Science and Engineering, NIT Kurukshetra, for supporting the present topic and for
his inspiring intellectual guidance, constructive criticism and valuable suggestions
throughout the project work.

Date - 4/05/2015
Kurukshetra

Manisha Singh
Sneha Bairagi
Abhinav Rai

ABSTRACT

Machine learning (ML) has gained momentum in almost every domain of research and has
recently become a reliable tool in the medical domain. Automatic learning is used in tasks
such as medical decision support, medical imaging, protein-protein interaction, extraction
of medical knowledge, and overall patient management care. ML is envisaged as a tool by
which computer-based systems can be integrated into the healthcare field in order to provide
better, well-organised medical care. This report describes an ML-based methodology for
building an application that is capable of identifying and disseminating healthcare
information. It extracts sentences from published medical papers that mention diseases and
treatments, and identifies the semantic relations that exist between diseases and treatments.
Our evaluation results for these tasks show that the proposed methodology obtains reliable
outcomes that could be integrated in an application to be used in the medical care domain.
The potential value of this work lies in the ML settings that we propose and in the fact that
we outperform previous results on the same data set.

1. Introduction

Because of the enormous growth of research in the medical domain, information extraction
tools are becoming more and more important for medical practitioners. Finding relevant
information in this domain is still difficult because most of the data on the internet is
poorly structured and amorphous, which makes it hard to process algorithmically. Most of the
data is contained in medical and biological journals, which makes this kind of text mining a
central problem. In this project we focus on extracting disease-medicine co-occurrence
relationships from the text of the literature. Automatically identifying the relationship
between a disease and its treatment from medical records would be a valuable contribution to
public health, because it supports the process of diagnosis.
We present a methodology for extracting useful information from large medical data. We apply
data mining techniques to extract the treatment corresponding to a disease from a huge
corpus. The system tries to identify the relationships of a given disease and to extract the
relevant medicines for the patient. With the growing number of medical theses, research
papers and research articles, researchers face the difficulty of reading a large number of
papers to gain knowledge in their field of interest. This system helps the user to extract
disease-treatment relationships without reading the whole document. The treatment of the
particular disease is filtered from the extracted file and displayed to the user. The user
thus gets only the required information, which saves time and improves the quality of the
result. The text-mined output can be used in the healthcare domain, where a doctor can
analyse the various kinds of treatment that can be given to a patient with a particular
medical disorder and can update his or her knowledge of a disease or its treatment
methodology.
A large-scale and accurate list of drug-disease treatment pairs derived from published
biomedical literature can be used for drug repurposing [1]. First, the extracted pairs
themselves contain many interesting drug-disease repurposing pairs with evidence from case
studies or small-scale clinical studies. Second, these pairs can be used in network-based
systems approaches for drug repurposing. For example, if drug 1 is similar to drug 2 and
disease 1 can be treated by drug 1, then we can hypothesize that disease 1 can also be
treated by drug 2. Here the drug-disease relationships are needed to connect drugs to
diseases.

2. Motivation
There is a huge and growing volume of data on the internet in the form of research papers
and web documents. The amount of medical literature continues to grow and specialize. The
traditional healthcare system is also becoming one that embraces the internet and the
electronic world. Electronic Health Records (EHR) are becoming the standard in the
healthcare domain. Research and studies show that the potential benefits of having an EHR
system are:
- Health information recording and clinical data repositories: immediate access to patient
  diagnoses, allergies, and lab test results that enables better and time-efficient medical
  decisions;
- Medication management: rapid access to information regarding potential adverse drug
  reactions, immunizations, supplies, etc.;
- Decision support: the ability to capture and use quality medical data for decisions in the
  workflow of healthcare; and
- Obtaining treatments that are tailored to specific health needs: rapid access to
  information that is focused on certain topics.
In order to realize the vision of the EHR system, we need better, faster, and more reliable
access to information. Research discoveries enter the repositories at a high rate, making
the process of identifying and disseminating reliable information a very difficult task.
A system can be effective only if it takes the needs of its users into account. There are
two types of users: (1) those with interest in and knowledge of the medical field, and
(2) those without. Both groups face problems when retrieving information about a disease.
For people in the second group the task is tedious and time consuming: because they do not
know the medical terms, it is hard for them to understand medical documents and to separate
relevant data from irrelevant data. People in the first group find it time consuming to
separate relevant documents from irrelevant ones. They want a system that provides all the
relevant information quickly and efficiently. This system addresses the problems of both
groups. It also gives them access to recent data.

3. Problem Statement
Problem: Finding semantic relationships among associated medical terms using patterns.
A semantic relationship among terms refers to the hidden meaning that connects them; for
example, between a drug and a disease the hidden meaning is treatment. We try to find the
treatments for diseases by processing the relevant documents using natural language
processing (NLP) techniques. The results can be used by doctors to improve their knowledge
of the latest treatments discovered, and can also be used in drug repurposing.

4. Description
In this project we build a system that identifies the medicines available for a particular
disease. The input is the disease name, and the system extracts the medicines available for
that disease from text documents in unstructured format. In other words, we process the text
documents to obtain the disease-treatment pairs they contain.
Proposed Algorithm:
Input : Disease, Rules.
Output: Medicine, Semantic Relationship.
1. For any disease, extract papers from Medline.
2. Tokenize the document.
3. Remove all stop words.
4. Perform stemming.
5. Perform POS tagging to separate the required parts of speech.
6. Convert the corpus to an annotated corpus.
7. From the annotated sentences, extract the sentences having at least one medicine and one
   disease.
8. Search for a pattern between the disease and the medicine.
9. Associate the medicines and rank them based on frequency and superiority.
10. Present the semantic relationships to the user.
4.1 Steps involved:
4.1.1 Tokenization
Tokenization is the process of breaking up the given text into units called tokens. The
tokens may be words, numbers or punctuation marks. Tokenization does this by locating word
boundaries, i.e. the ending point of one word and the beginning of the next. Tokenization is
also known as word segmentation.
Challenges in tokenization
The challenges in tokenization depend on the type of language. Languages such as English and
French are referred to as space-delimited, as most of the words are separated from each
other by white space. Languages such as Chinese and Thai are referred to as unsegmented, as
words do not have clear boundaries. Tokenizing sentences of an unsegmented language requires
additional lexical and morphological information (root words, affixes, parts of speech).
Tokenization is also affected by the writing system and the typographical structure of the
words. Structures of languages can be grouped into three categories:
- Isolating: words do not divide into smaller units. Example: Mandarin Chinese.
- Agglutinative: words divide into smaller units. Example: Japanese, Tamil.
- Inflectional: boundaries between morphemes are not clear and are ambiguous in terms of
  grammatical meaning. Example: Latin.
More generally, a token can be a word, phrase, symbol, or other meaningful element. The list
of tokens becomes the input for further processing such as parsing or text mining. In NLP we
tokenize a large piece of text to generate tokens, which are smaller pieces of text (words,
sentences, etc.) that are easier to work with.
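As an illustration of tokenizing a space-delimited language, the following is a minimal sketch in Java. The class name SimpleTokenizer and the regular expression are assumptions made for this example, not the tokenizer used in the project: a token is taken to be either a run of letters and digits (with optional internal hyphens or apostrophes) or a single punctuation mark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class SimpleTokenizer {
    // A word (letters/digits, optionally joined by '-' or '\'') or one punctuation mark.
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+(?:[-'][\\p{L}\\p{N}]+)*|\\p{Punct}");

    // Scans the text left to right and collects every match as one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Chloroquine, is, used, to, treat, malaria, .]
        System.out.println(tokenize("Chloroquine is used to treat malaria."));
    }
}
```

White space is never matched, so it acts purely as a boundary; an unsegmented language would need a lexicon-based segmenter instead of a pattern like this.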
4.1.2 Stop Word Removal
In computing, stop words are words which are filtered out before or after the processing of
natural language data (text). There is no single universal list of stop words used by all
natural language processing tools, and indeed not all tools even use such a list. Before
text analysis, a stop word list is developed for the removal of semantically insignificant
words; these lists vary in size. For our technique we use a stop word list that includes
common words, phrases and characters. The list contains the high-frequency terms that are to
be ignored because they do not give any useful information for our scenario. The most common
stop words in our case are 'a', 'the', 'of', etc. Stop words are a set of commonly used
words in any language, not just English. Stop words are critical to many applications
because, if we remove the words that are very commonly used in a given language, we can
focus on the important words instead. For example, in the context of a search engine,
suppose the search query is "how to develop information retrieval applications". If the
search engine tries to find web pages that contain the terms "how", "to", "develop",
"information", "retrieval" and "applications", it is going to find many more pages that
contain the terms "how" and "to" than pages that contain information about developing
information retrieval applications, because the terms "how" and "to" are so commonly used in
the English language. If we disregard these two terms, the search engine can focus on
retrieving pages that contain the keywords "develop", "information", "retrieval" and
"applications", which would more closely bring up pages that are really of interest. This is
the basic intuition for using stop words. Stop words can be used in a whole range of tasks,
and these are just a few:
- Supervised machine learning: removing stop words from the feature space
- Clustering: removing stop words prior to generating clusters
- Information retrieval: preventing stop words from being indexed
- Text summarization: excluding stop words from contributing to summarization scores and
  removing stop words when computing ROUGE scores
Types of stop words: Stop words are generally thought to be a single set of words, but the
term can mean different things to different applications. For example, in some applications
removing all stop words, from determiners (e.g. the, a, an) to prepositions (e.g. above,
across, before) to some adjectives (e.g. good, nice), can be appropriate. For other
applications, however, this can be detrimental. For instance, in sentiment analysis removing
adjectives such as "good" and "nice" as well as negations such as "not" can throw algorithms
off track. In such cases, one can choose a minimal stop list consisting of just determiners,
or determiners with prepositions, or just coordinating conjunctions, depending on the needs
of the application.
Examples of minimal stop word lists:
- Determiners: determiners tend to mark nouns, where a determiner will usually be followed
  by a noun. Examples: the, a, an, another.
- Coordinating conjunctions: coordinating conjunctions connect words, phrases, and clauses.
  Examples: for, and, nor, but, or, yet, so.
- Prepositions: prepositions express temporal or spatial relations. Examples: in, under,
  towards, before.
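Stop word removal over a token list can be sketched as a simple set lookup. The class name StopWordFilter and the ten-word stop list below are hypothetical and chosen only for illustration; the project's actual list also contains phrases and characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only.
class StopWordFilter {
    // Assumed minimal stop list: a few determiners, prepositions and conjunctions.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "the", "of", "to", "in", "is", "and", "or", "for"));

    // Keeps every token whose lower-cased form is not in the stop list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens =
            Arrays.asList("The", "treatment", "of", "malaria", "is", "chloroquine");
        // prints [treatment, malaria, chloroquine]
        System.out.println(removeStopWords(tokens));
    }
}
```

A hash set keeps each lookup constant-time, so the filter stays linear in the number of tokens even for a long stop list.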
4.1.3 Stemming:
Stemming is the term used in linguistic morphology and information retrieval for the process
of reducing inflected words to their word stem, base or root form, generally a written word
form. The stem need not be identical to the morphological root of the word; it is usually
sufficient that related words map to the same stem, even if this stem is not in itself a
valid root. Stemming is a pre-processing step in text mining applications as well as a very
common requirement of natural language processing functions, and it is very important in
most information retrieval (IR) systems. The main purpose of stemming is to reduce the
different grammatical forms of a word (noun, adjective, verb, adverb, etc.) to its root
form [2]. In other words, the goal of stemming is to reduce inflectional forms and sometimes
derivationally related forms of a word to a common base form. For example, 'reader' and
'reading' are reduced to 'read' so that the terms can be detected as similar.
Stemming does not seem to depend on the domain, but it does depend on the language of the
text. Our findings show, however, that stemming affects the semantics of a term. Most of the
time the morphological variants of a word have similar semantic interpretations and can be
considered equivalent for the purpose of IR applications. Since the meaning is the same but
the word form is different, it is necessary to identify each word form with its base form.
In stemming, the morphological forms of a word are converted to its stem on the assumption
that they are all semantically related. There are two main kinds of error in stemming:
over-stemming and under-stemming. Over-stemming is when two words with different stems are
stemmed to the same root; this is also known as a false positive. Under-stemming is when two
words that should be stemmed to the same root are not; this is also known as a false
negative. Paice showed that light stemming reduces the over-stemming errors but increases
the under-stemming errors, whereas heavy stemmers reduce the under-stemming errors while
increasing the over-stemming errors.
The various stemming algorithms available are:
Truncate(n): The most basic stemmer is the Truncate(n) stemmer, which truncates a word at
the nth symbol, i.e. keeps the first n letters and removes the rest. Words shorter than n
are kept as they are. The chance of over-stemming increases when the word length is small.
S-Stemmer: An algorithm that conflates the singular and plural forms of English nouns. It
was proposed by Donna Harman. The algorithm has rules that remove the suffixes of plurals so
as to convert them to the singular forms.
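The two simplest stemmers above can be sketched in a few lines of Java. The class name SimpleStemmers is hypothetical. The three S-stemmer rules shown are the ones commonly attributed to Harman ("ies" to "y", "es" to "e", and dropping a final "s", each with its exceptions); treat the exact rule set and the fall-through between rules as assumptions of this sketch.

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
class SimpleStemmers {
    // Truncate(n): keep the first n characters; words of length <= n are unchanged.
    static String truncate(String word, int n) {
        return word.length() <= n ? word : word.substring(0, n);
    }

    // S-stemmer: the first rule whose suffix and exceptions both hold is applied.
    static String sStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ies") && !w.endsWith("eies") && !w.endsWith("aies")) {
            return w.substring(0, w.length() - 3) + "y";   // therapies -> therapy
        }
        if (w.endsWith("es") && !w.endsWith("aes") && !w.endsWith("ees") && !w.endsWith("oes")) {
            return w.substring(0, w.length() - 1);         // diseases -> disease
        }
        if (w.endsWith("s") && !w.endsWith("us") && !w.endsWith("ss")) {
            return w.substring(0, w.length() - 1);         // drugs -> drug
        }
        return w;                                          // virus stays virus
    }

    public static void main(String[] args) {
        System.out.println(truncate("treatment", 5)); // prints treat
        System.out.println(sStem("therapies"));       // prints therapy
        System.out.println(sStem("diseases"));        // prints disease
        System.out.println(sStem("virus"));           // prints virus
    }
}
```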
Lovins Stemmer: This was the first popular and effective stemmer, proposed by Lovins in
1968. It performs a lookup on a table of 294 endings, 29 conditions and 35 transformation
rules, which are arranged on a longest-match principle [6]. The Lovins stemmer removes the
longest suffix from a word. Once the ending is removed, the word is recoded using a
different table that makes various adjustments to convert these stems into valid words.
Being a single-pass algorithm, it removes at most one suffix from a word. The advantages of
this algorithm are that it is very fast, that it can handle the removal of double letters in
words like 'getting', which is transformed to 'get', and that it handles many irregular
plurals like mouse and mice, index and indices, etc.
The drawbacks of the Lovins approach are that it is time and data consuming. Furthermore,
many suffixes are not available in the table of endings. It is sometimes highly unreliable
and frequently fails to form words from the stems or to match the stems of like-meaning
words, because of the technical vocabulary used by its author.
Porter's Stemmer: Porter's stemming algorithm, proposed in 1980, is currently one of the
most popular stemming methods. Many modifications and enhancements have been made and
suggested on the basic algorithm. It is based on the idea that the suffixes in the English
language (approximately 1200) are mostly made up of a combination of smaller and simpler
suffixes. It has five steps, and within each step, rules are applied until one of them
passes the conditions. If a rule is accepted, the suffix is removed accordingly, and the
next step is performed. The resultant stem at the end of the fifth step is returned.
A rule looks like the following: <condition> <suffix> -> <new suffix>
For example, the rule (m>0) EED -> EE means: if the part of the word before the ending EED
contains at least one vowel-consonant sequence (measure m > 0), change the ending to EE. So
'agreed' becomes 'agree' while 'feed' remains unchanged. This algorithm has about 60 rules
and is very easy to comprehend.
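The single rule (m>0) EED -> EE can be sketched as follows. This is not a full Porter stemmer: the class name PorterRuleDemo is hypothetical, and only the measure m (the number of vowel-consonant sequences in the stem, with 'y' counted as a vowel when it follows a consonant) and this one rule are implemented.

```java
// Hypothetical helper for illustration only; implements one Porter rule, not the whole algorithm.
class PorterRuleDemo {
    // True if the character at index i is a consonant in Porter's sense.
    static boolean isConsonant(String w, int i) {
        char c = w.charAt(i);
        if ("aeiou".indexOf(c) >= 0) return false;
        // 'y' is a consonant at the start of a word or after a vowel, otherwise a vowel.
        if (c == 'y') return i == 0 || !isConsonant(w, i - 1);
        return true;
    }

    // Measure m: counts each point where a vowel is immediately followed by a consonant.
    static int measure(String stem) {
        int m = 0;
        boolean previousWasVowel = false;
        for (int i = 0; i < stem.length(); i++) {
            boolean consonant = isConsonant(stem, i);
            if (consonant && previousWasVowel) m++;
            previousWasVowel = !consonant;
        }
        return m;
    }

    // Applies (m>0) EED -> EE; the condition is tested on the part before "eed".
    static String applyEedRule(String word) {
        if (word.endsWith("eed")) {
            String stem = word.substring(0, word.length() - 3);
            if (measure(stem) > 0) return stem + "ee";
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(applyEedRule("agreed")); // prints agree (stem "agr" has m = 1)
        System.out.println(applyEedRule("feed"));   // prints feed  (stem "f" has m = 0)
    }
}
```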
Paice/Husk Stemmer: The Paice/Husk stemmer is an iterative algorithm with one table
containing about 120 rules indexed by the last letter of a suffix [14]. On each iteration,
it tries to find an applicable rule by the last character of the word. Each rule specifies
either a deletion or a replacement of an ending. If there is no such rule, it terminates. It
also terminates if a word starts with a vowel and there are only two letters left, or if a
word starts with a consonant and there are only three characters left. Otherwise, the rule
is applied and the process repeats. The advantage is its simple form, with every iteration
taking care of both deletion and replacement as per the rule applied. The disadvantage is
that it is a very heavy algorithm and over-stemming may occur.
Dawson Stemmer: This stemmer is an extension of the Lovins approach, except that it covers a
much more comprehensive list of about 1200 suffixes. Like Lovins, it is a single-pass
stemmer and hence is fast. The suffixes are stored in reversed order, indexed by their
length and last letter. They are organized as a set of branched character trees for rapid
access. The advantage is that it covers more suffixes than Lovins and is fast in execution.
The disadvantage is that it is very complex and lacks a standard reusable implementation.
4.1.4 POS Tagging:
Part-of-speech tagging (POS tagging or POST), also called grammatical tagging or
word-category disambiguation, is the process of marking up a word in a text (corpus) as
corresponding to a particular part of speech, based on both its definition and its context,
i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph.
In short, it is the process of assigning a part of speech to each word in a sentence.
E.g.: <s>Come September, and the UJF campus is abuzz with new and returning students.</s>
After POS tagging the sentence will be:
<s>Come_VB September_NNP ,_, and_CC the_DT UJF_NNP campus_NN is_VBZ abuzz_JJ
with_IN new_JJ and_CC returning_VBG students_NNS ._.</s>
These labels come from the Penn tagset, developed at the University of Pennsylvania, which
is well known for its natural language processing work.
The tagger is founded on the noisy channel model.
Noisy Channel
correct sequence (w_n, w_{n-1}, ..., w_1)  ->  [noisy transformation]  ->  (t_m, t_{m-1}, ..., t_1)
Sequence w is transformed into sequence t. The guess at the correct sequence is
$w^* = \arg\max_{w} P(w \mid t)$

Here the noisy channel is a metaphor for a computation in which the input is subjected to
noise at every stage of processing before an output is generated. On the input side we have
the word sequence and on the output side we have the tag sequence.
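Argmax Computation:
Let y = f(x) be a function. Then
$y^* = \max_{x} f(x)$ is the largest value that f takes over all x.
Compare max with argmax: $x^* = \arg\max_{x} f(x)$ is the value of x at which f takes that
largest value.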
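The difference between max and argmax can be shown on a small array, where the index plays the role of x and the stored value plays the role of f(x). The class name ArgmaxDemo and the sample values are made up for this sketch.

```java
// Hypothetical helper for illustration only.
class ArgmaxDemo {
    // Returns the index x* at which f[x] is largest (the first such index on ties).
    static int argmax(double[] f) {
        int best = 0;
        for (int x = 1; x < f.length; x++) {
            if (f[x] > f[best]) best = x;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] f = {0.2, 0.7, 0.1};   // f(0), f(1), f(2)
        int xStar = argmax(f);          // where the maximum occurs
        double yStar = f[xStar];        // the maximum itself
        System.out.println("argmax = " + xStar + ", max = " + yStar); // argmax = 1, max = 0.7
    }
}
```

In tagging, x ranges over candidate tag sequences and f is their probability, so argmax returns the sequence itself rather than its probability.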

Bayesian Decision Theory
Given the random variables A and B,
$P(A \mid B) = \dfrac{P(A)\,P(B \mid A)}{P(B)}$
where P(A|B) is the posterior probability, P(A) is the prior probability and P(B|A) is the
likelihood.
Assumption: choose as the decision the value whose probability is highest:
$A^* = \arg\max_{A} P(A \mid B) = \arg\max_{A} P(A)\,P(B \mid A)$
Computing and using P(A) and P(B|A) both need:
- looking at the internal structures of A and B
- making independence assumptions
- putting together a computation from smaller parts
The best tag sequence t* is
$t^* = \arg\max_{t} P(t \mid w)$
After applying Bayes' theorem,
$t^* = \arg\max_{t} P(t)\,P(w \mid t)$
Here P(w) can be ignored because it remains the same for all t.
P(t) is the prior probability of the tag sequence t. It also acts as a filter for bad tag
sequences.
Some of the POS tags are:
- NN: noun, e.g. dog_NN
- VM: main verb, e.g. run_VM
- VAUX: auxiliary verb, e.g. is_VAUX
- JJ: adjective, e.g. red_JJ
- PRP: pronoun, e.g. you_PRP
- NNP: proper noun, e.g. John_NNP
- CC: coordinating conjunction, e.g. Jack and_CC Jill
- CD: cardinal number, e.g. four_CD children
- MD: modal, e.g. you may_MD go
POS tag ambiguity
"I bank1 on the bank2 on the river bank3 for my transactions."
Here bank1 is a verb; the other two are nouns.
Process of POS Tagging
1. List all possible tags for each word in the sentence.
2. Choose the best possible tag sequence.
Example:
Sentence: people jump high
- People: noun/verb
  "People are the assets of a country." Here people is a noun.
  "The place was peopled with the members of the tribes." Here people is used as a verb.
- Jump: noun/verb
  "I jumped over the fence." Here jump is used as a verb.
  "This was a good jump." Here jump is a noun.
- High: noun/adjective
  "High hills." Here high is used as an adjective.
  "After the win, he was on a high." Here high is used as a noun.

^   people   jump   high   .

Each tag is considered a state; ^ (hat) is the starting state and . (dot) is the ending
state. If there are N words in a sentence, the tag sequence has N+2 states, because the hat
and dot states are also counted. All the possible tag sequences are tried, and the tag
sequence with the maximum probability is assigned to the word sequence. Finding the tag
sequence is thus reduced to graph traversal.
Best tag sequence:
$T^* = \arg\max_{T} P(T \mid W) = \arg\max_{T} P(T)\,P(W \mid T)$
$P(T) = P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1 t_0)\,P(t_3 \mid t_2 t_1 t_0) \cdots P(t_n \mid t_{n-1} t_{n-2} \ldots t_0)\,P(t_{n+1} \mid t_n t_{n-1} \ldots t_0)$
$\quad\;\; = P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1 t_0)\,P(t_3 \mid t_2 t_1) \cdots P(t_n \mid t_{n-1} t_{n-2})\,P(t_{n+1} \mid t_n t_{n-1})$
(Trigram assumption)
$P(t_n \mid t_{n-1} t_{n-2}) = \dfrac{\text{number of times the sequence } t_{n-2}\,t_{n-1}\,t_n \text{ occurs}}{\text{number of times the sequence } t_{n-2}\,t_{n-1} \text{ occurs}}$
$P(W \mid T) = P(w_0 \mid t_0 \ldots t_{n+1})\,P(w_1 \mid w_0, t_0 \ldots t_{n+1})\,P(w_2 \mid w_1 w_0, t_0 \ldots t_{n+1}) \cdots P(w_n \mid w_0 \ldots w_{n-1}, t_0 \ldots t_{n+1})\,P(w_{n+1} \mid w_0 \ldots w_n, t_0 \ldots t_{n+1})$
Assumption: a word is completely determined by its tag. This is inspired by speech
recognition.
$P(W \mid T) = P(w_0 \mid t_0)\,P(w_1 \mid t_1)\,P(w_2 \mid t_2) \cdots P(w_n \mid t_n)\,P(w_{n+1} \mid t_{n+1})$ (lexical probability)
Example of calculation from actual data.
Corpus
Let the data be: ^ Ram got many NLP books. ^ He found them all very interesting.
POS tagged
^ N V A N N . ^ N V N A R A .
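Recording numbers (bigram assumption). The entry in row X, column Y is the number of times
tag X is immediately followed by tag Y.

         ^     N     V     A     R     .
   ^     0     2     0     0     0     0
   N     0     1     2     1     0     1
   V     0     1     0     1     0     0
   A     0     1     0     0     1     1
   R     0     0     0     1     0     0
   .     1     0     0     0     0     0

Probabilities (each count divided by the total of its row)

         ^     N     V     A     R     .
   ^     0     1     0     0     0     0
   N     0     1/5   2/5   1/5   0     1/5
   V     0     1/2   0     1/2   0     0
   A     0     1/3   0     0     1/3   1/3
   R     0     0     0     1     0     0
   .     1     0     0     0     0     0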
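$P(\text{Ram} \mid N) = P(w_i = \text{Ram} \mid t_i = N) = \dfrac{\text{number of times Ram occurs as a noun}}{\text{total number of nouns}}$
Lexical probabilities
The lexical probability table has one row per tag (^, N, V, A, R, .) and one column per word
of the corpus (Ram, got, many, NLP, books, he, found, them, all, very, interesting). Each
cell holds P(word | tag), estimated as above.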
HIDDEN MARKOV MODEL


A very powerful tool for problem solving in statistical AI is the combination of the Markov
assumption and Bayes' theorem. Consider a problem with three urns, each containing red, blue
and green balls. Balls are drawn randomly from the urns, and the sequence of colours drawn
is called the observation sequence. We have to find the state sequence, i.e. the sequence of
urns from which the balls were drawn.

Contents of the urns

                  Urn 1   Urn 2   Urn 3
   Red balls       30      10      60
   Green balls     50      40      10
   Blue balls      20      50      30

Probability of transition to another urn after picking a ball (row: current urn, column:
next urn)

         U1     U2     U3
   U1    0.1    0.4    0.5
   U2    0.6    0.2    0.2
   U3    0.3    0.4    0.3

Probability of drawing a ball of each colour from an urn

         R      B      G
   U1    0.3    0.2    0.5
   U2    0.1    0.5    0.4
   U3    0.6    0.3    0.1

Let the observation sequence be RRGGBRGR. We have to find the state sequence. Many problems
in AI fall into this class: predict the hidden from the observed.
Diagrammatic representation

Observations and states

Here S0 and S9 are the initial and final states respectively. After S8 the next state is S9
with probability 1, i.e. P(S9|S8) = 1. O0 is an ε-transition.
         O0  O1  O2  O3  O4  O5  O6  O7  O8
OBS:     ε   R   R   G   G   B   R   G   R
State:   S0  S1  S2  S3  S4  S5  S6  S7  S8  S9
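
S_i: a particular state, one of U1, U2, U3
S: state sequence
O: observation sequence
S*: best possible state (urn) sequence
Goal: maximize P(S|O) by choosing the best S, where S is the state sequence and O is the
observation sequence.
$S^* = \arg\max_{S} P(S \mid O) = \arg\max_{S} P(S)\,P(O \mid S)$
$P(S) = P(S_0 S_1 \ldots S_9) = P(S_0)\,P(S_1 \mid S_0)\,P(S_2 \mid S_0 S_1)\,P(S_3 \mid S_0 \ldots S_2)\,P(S_4 \mid S_0 \ldots S_3) \cdots P(S_9 \mid S_0 \ldots S_8)$
By the Markov assumption:
$P(S) = P(S_0)\,P(S_1 \mid S_0)\,P(S_2 \mid S_1)\,P(S_3 \mid S_2)\,P(S_4 \mid S_3) \cdots P(S_9 \mid S_8)$
$P(O \mid S) = P(O_0 \mid S_0 \ldots S_9)\,P(O_1 \mid O_0, S_0 \ldots S_9)\,P(O_2 \mid O_0 O_1, S_0 \ldots S_9)\,P(O_3 \mid O_0 \ldots O_2, S_0 \ldots S_9) \cdots P(O_8 \mid O_0 \ldots O_7, S_0 \ldots S_9)$
Assuming that the ball drawn depends only on the urn chosen:
$P(O \mid S) = P(O_0 \mid S_0)\,P(O_1 \mid S_1)\,P(O_2 \mid S_2)\,P(O_3 \mid S_3) \cdots P(O_8 \mid S_8)$
Since P(O) is the same for every S, $P(S \mid O) \propto P(S)\,P(O \mid S)$, and
$P(S)\,P(O \mid S) = P(S_0)\,P(S_1 \mid S_0)\,P(S_2 \mid S_1) \cdots P(S_8 \mid S_7)\,P(S_9 \mid S_8) \cdot P(O_0 \mid S_0)\,P(O_1 \mid S_1)\,P(O_2 \mid S_2) \cdots P(O_8 \mid S_8)$
Taking P(S_0) = 1 for the fixed initial state and grouping the factors step by step:
$P(S)\,P(O \mid S) = [P(O_0 \mid S_0)\,P(S_1 \mid S_0)]\,[P(O_1 \mid S_1)\,P(S_2 \mid S_1)]\,[P(O_2 \mid S_2)\,P(S_3 \mid S_2)] \cdots [P(O_8 \mid S_8)\,P(S_9 \mid S_8)]$
After S8 the next state is S9 with probability 1, i.e. P(S9|S8) = 1.
Here we have 3 urns and 8 observations, so the number of state sequences for which the above
computation has to be done is $3^8 = 6561$, i.e.
|number of states|^(length of the observation sequence).
To reduce this computation, the Viterbi algorithm is used.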
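To make the cost of the brute-force approach concrete, the following Java sketch scores one candidate urn sequence for the observations RRGGBRGR using the transition and emission tables of the urn example. The class name UrnSequenceScore is hypothetical. The report does not give the probability of the first urn, so a uniform choice (1/3 each) is assumed here. Urns are numbered 0, 1, 2 for U1, U2, U3 and colours 0, 1, 2 for R, B, G.

```java
// Hypothetical helper for illustration only.
class UrnSequenceScore {
    // Transition probabilities between urns (row: current urn, column: next urn).
    static final double[][] TRANS = {
        {0.1, 0.4, 0.5},
        {0.6, 0.2, 0.2},
        {0.3, 0.4, 0.3}};
    // Emission probabilities (row: urn, column: colour in the order R, B, G).
    static final double[][] EMIT = {
        {0.3, 0.2, 0.5},
        {0.1, 0.5, 0.4},
        {0.6, 0.3, 0.1}};
    // Assumption: the first urn is chosen uniformly at random.
    static final double START = 1.0 / 3.0;

    // P(S) * P(O|S) for one candidate urn sequence under the Markov and emission assumptions.
    static double score(int[] urns, int[] colours) {
        double p = START * EMIT[urns[0]][colours[0]];
        for (int t = 1; t < urns.length; t++) {
            p *= TRANS[urns[t - 1]][urns[t]] * EMIT[urns[t]][colours[t]];
        }
        return p;
    }

    public static void main(String[] args) {
        int[] observed = {0, 0, 2, 2, 1, 0, 2, 0};   // R R G G B R G R
        int[] candidate = {2, 2, 0, 0, 1, 2, 0, 2};  // an arbitrary guess: U3 U3 U1 U1 U2 U3 U1 U3
        System.out.println("score of candidate = " + score(candidate, observed));
        System.out.println("sequences to try by brute force = "
            + (long) Math.pow(3, observed.length));
    }
}
```

Brute force would call score once for each of the 6561 candidate sequences and keep the best one; this is the work that the Viterbi algorithm avoids.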
The medical terms from the above stage are provided to the POS tagger, which assigns
syntactic categories such as noun, verb, adjective and pronoun. For our task we use a POS
tagger which is based on the Markov assumption [3] and uses the Viterbi algorithm [4]. From
all the text we consider only four parts of speech: noun, pronoun, verb and adjective. Nouns
are used because each entity of our domain (e.g. dengue, malaria) is treated as a noun by
the POS tagger. Pronouns are used because in most paragraphs a term is first introduced by
name, which the tagger marks as a noun, and is referred to by a pronoun in the remainder of
the paragraph. Each pronoun is resolved by moving backward through the text to find the noun
it points to. Verbs show the relationship among the nouns, and adjectives show the strength
of the relation, e.g. severe, low, high.
In POS tagging:
$t^* = \arg\max_{t} P(t \mid s) = \arg\max_{t} P(t)\,P(s \mid t)$
where t is the tag sequence and s is the sequence which is to be tagged.

If a hidden Markov model is evaluated by trying every tag sequence, the complexity is high.
Instead, the trigram assumption is made, i.e. the current tag depends only on the previous
two tags, and the Viterbi algorithm is applied to perform the tagging.
Viterbi Algorithm
Given:
The HMM, which means:
a) Start state: s1
b) Alphabet A = {a1, a2, ..., ap}
c) Set of states S = {s1, s2, ..., sN}
d) Transition probability P(si -> sj via ak) for all i, j, k, which is equal to P(sj, ak | si)
e) The output string a1 a2 ... aT
To find: the most likely sequence of states c1 c2 ... cT which produces the given output
sequence, i.e. $c_1 c_2 \ldots c_T = \arg\max_{c} P(c \mid a_1 a_2 \ldots a_T)$
Data structures:
a) An N*T array called SEQSCORE to maintain the winner sequence always (N = number of
   states, T = length of the output sequence).
b) Another N*T array called BACKPTR to recover the path.
Three distinct steps in the Viterbi implementation:
a) Initialization
b) Iteration
c) Sequence identification
Initialization:
SEQSCORE(1,1) = 1.0
BACKPTR(1,1) = 0
For (i = 2 to N) do
    SEQSCORE(i,1) = 0.0
Iteration:
For (t = 2 to T) do
    For (i = 1 to N) do
        SEQSCORE(i,t) = max over j = 1..N of [SEQSCORE(j,t-1) * P(sj -> si via ak)],
                        where ak is the output symbol consumed at step t
        BACKPTR(i,t) = index j that gives the max above
Sequence identification:
C(T) = i that maximizes SEQSCORE(i,T)
For i from (T-1) to 1 do
    C(i) = BACKPTR[C(i+1), i+1]
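The three steps above can be sketched in Java as follows. This is a minimal sketch, not the tagger used in the project: it uses the common state-emission form of an HMM (separate transition and emission tables, as in the urn example) instead of the arc-emission probabilities P(si -> sj via ak), it starts from an initial distribution instead of a fixed start state, and the uniform initial distribution in main is an assumption. The class and method names are hypothetical.

```java
// Hypothetical helper for illustration only.
class ViterbiDemo {
    // Returns the most likely state sequence for the given observation indices.
    static int[] viterbi(double[] init, double[][] trans, double[][] emit, int[] obs) {
        int n = init.length;
        int len = obs.length;
        double[][] seqScore = new double[n][len]; // best score of any path ending in state i at t
        int[][] backPtr = new int[n][len];        // predecessor state on that best path

        // Initialization: score of starting in each state and emitting the first symbol.
        for (int i = 0; i < n; i++) {
            seqScore[i][0] = init[i] * emit[i][obs[0]];
        }
        // Iteration: extend the best path into every state, one observation at a time.
        for (int t = 1; t < len; t++) {
            for (int i = 0; i < n; i++) {
                int bestJ = 0;
                double best = -1.0;
                for (int j = 0; j < n; j++) {
                    double s = seqScore[j][t - 1] * trans[j][i] * emit[i][obs[t]];
                    if (s > best) { best = s; bestJ = j; }
                }
                seqScore[i][t] = best;
                backPtr[i][t] = bestJ;
            }
        }
        // Sequence identification: pick the best final state and follow the back pointers.
        int[] path = new int[len];
        int last = 0;
        for (int i = 1; i < n; i++) {
            if (seqScore[i][len - 1] > seqScore[last][len - 1]) last = i;
        }
        path[len - 1] = last;
        for (int t = len - 2; t >= 0; t--) {
            path[t] = backPtr[path[t + 1]][t + 1];
        }
        return path;
    }

    // Probability of one fixed path, used to check a result against other candidates.
    static double pathProbability(double[] init, double[][] trans, double[][] emit,
                                  int[] obs, int[] path) {
        double p = init[path[0]] * emit[path[0]][obs[0]];
        for (int t = 1; t < obs.length; t++) {
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]];
        }
        return p;
    }

    public static void main(String[] args) {
        double[] init = {1.0 / 3, 1.0 / 3, 1.0 / 3};                           // assumed uniform
        double[][] trans = {{0.1, 0.4, 0.5}, {0.6, 0.2, 0.2}, {0.3, 0.4, 0.3}}; // urn example
        double[][] emit = {{0.3, 0.2, 0.5}, {0.1, 0.5, 0.4}, {0.6, 0.3, 0.1}};  // columns R, B, G
        int[] obs = {0, 0, 2, 2, 1, 0, 2, 0};                                   // R R G G B R G R
        StringBuilder sb = new StringBuilder();
        for (int s : viterbi(init, trans, emit, obs)) sb.append("U").append(s + 1).append(' ');
        System.out.println(sb.toString().trim());
    }
}
```

The two nested loops over states inside the loop over time give a cost proportional to N*N*T, compared with N^T for trying every sequence.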
Understanding the Viterbi algorithm with an example

[Figure: developing the tree]

The output sequence is a1 a2 a1 a2. We have to find the state sequence.

Probability table (winning score for each state after each output symbol)

         ε      a1     a2     a1      a2
   S1    1.0    0.1    0.09   0.012   0.0081
   S2    0.0    0.3    0.06   0.027   0.0054

BACKPTR table

         ε      a1     a2     a1      a2
   S1    0      1      2      2       2
   S2    -      1      2      1       2

Using the BACKPTR table, the state sequence is obtained by tracing back from the state with
the highest final score. The best state sequence obtained is S1 S2 S1 S2 S1. Compared with
trying every state sequence, this reduces the complexity to a great extent.
4.1.5 Annotating Corpora and Searching Patterns:
In this step the corpus is annotated with disease and drug terms. Sentences are tagged with
disease entities from the clean disease lexicon and drug entities from the drug list. The
tagging is based on case-insensitive exact string matching for high precision and
efficiency. Then a pattern is searched for between the disease and the drug. The pattern can
be of the form "drug pattern disease" if the drug entity precedes the disease entity, or
"disease pattern drug" if the disease precedes the drug.
The patterns that we use for "drug pattern disease" are: in, in the treatment of, for, in
patients with, for the treatment of, treatment of, therapy for, therapy in, for treatment
of, against, in the management of, therapy of, treatment for, treatment in, in a patient
with, in treatment of, in children with, to cure, is used to cure, is used for curing, is
used to manage, reduces, in the treatment of patients, prevents, is used to prevent, to
prevent, for the management of, to treat, can be used to control symptoms of, can be used as
medication for, can be used to improve symptoms, can be used as an antibiotic for, can be
used to relieve sign symptoms, can be used to relieve symptoms, can be used to reduce
symptoms, can be effective for, may be effective in the treatment of, can be used to
prevent.
The patterns that we use for "disease pattern drug" are: can be treated with, interventions
to control disease are, symptoms can be improved with, symptoms can be controlled with,
antibiotics for the disease are, antibiotics that can be used are, your doctor may
recommend, symptoms can be reduced with, can be prevented with.
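The pattern search between a tagged drug and a tagged disease can be sketched as follows. The class name PatternMatcher is hypothetical, only a small subset of the patterns listed above is included, and the text between the two entities is required to match a pattern exactly; the project's matcher may be more permissive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only.
class PatternMatcher {
    // Illustrative subsets of the two pattern lists.
    static final List<String> DRUG_FIRST = Arrays.asList(
        "in the treatment of", "is used to cure", "to treat", "for");
    static final List<String> DISEASE_FIRST = Arrays.asList(
        "can be treated with", "can be prevented with");

    // Returns the pattern linking the two entities, or null if none of the patterns matches.
    static String findPattern(String sentence, String drug, String disease) {
        String s = sentence.toLowerCase(Locale.ROOT);   // case-insensitive matching
        int drugAt = s.indexOf(drug.toLowerCase(Locale.ROOT));
        int diseaseAt = s.indexOf(disease.toLowerCase(Locale.ROOT));
        if (drugAt < 0 || diseaseAt < 0) return null;   // need one drug and one disease

        boolean drugFirst = drugAt < diseaseAt;
        int start = drugFirst ? drugAt + drug.length() : diseaseAt + disease.length();
        int end = drugFirst ? diseaseAt : drugAt;
        if (start > end) return null;                   // the entities overlap

        String between = s.substring(start, end).trim();
        List<String> patterns = drugFirst ? DRUG_FIRST : DISEASE_FIRST;
        return patterns.contains(between) ? between : null;
    }

    public static void main(String[] args) {
        // prints is used to cure
        System.out.println(findPattern("Chloroquine is used to cure malaria.",
                                       "chloroquine", "malaria"));
        // prints can be treated with
        System.out.println(findPattern("Malaria can be treated with chloroquine.",
                                       "chloroquine", "malaria"));
    }
}
```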
How to check the quality of tagging
Three parameters, where A is the set of actual (correct) tags and O is the set of tags
obtained from the tagger:
- Precision: $P = \dfrac{|A \cap O|}{|O|}$. It measures what proportion of the tags obtained
  is correct.
- Recall: $R = \dfrac{|A \cap O|}{|A|}$. It measures how many of the correct tags are
  actually obtained.
- F-score: $F = \dfrac{2PR}{P + R}$, the harmonic mean of precision and recall.
If every word is given a tag and no word is left out, then |A| = |O|.
Therefore, precision = recall = F-score.
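The three measures can be computed directly from the two sets. In the sketch below the class name TaggingQuality is hypothetical, and each tagged word is represented as a string such as "jump_V"; the example data is made up.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only.
class TaggingQuality {
    // Returns {precision, recall, F-score} for the actual set A and the obtained set O.
    static double[] precisionRecallF(Set<String> actual, Set<String> obtained) {
        Set<String> common = new HashSet<>(actual);
        common.retainAll(obtained);                       // A intersected with O
        double p = obtained.isEmpty() ? 0.0 : (double) common.size() / obtained.size();
        double r = actual.isEmpty() ? 0.0 : (double) common.size() / actual.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] {p, r, f};
    }

    public static void main(String[] args) {
        Set<String> actual = new HashSet<>(Arrays.asList("people_N", "jump_V", "high_A"));
        Set<String> obtained = new HashSet<>(Arrays.asList("people_N", "jump_N", "high_A"));
        double[] prf = precisionRecallF(actual, obtained);
        // Two of three tags are right and both sets have three elements, so P = R = F.
        System.out.println("P=" + prf[0] + " R=" + prf[1] + " F=" + prf[2]);
    }
}
```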
5. JAVA


Java is a programming language and a platform. Java is a high-level, robust, secure and
object-oriented programming language. Any hardware or software environment in which a
program runs is known as a platform. Since Java has its own runtime environment (JRE) and
API, it is called a platform.
A simple Java example:
class Simple
{
    // Entry point: prints a greeting to standard output.
    public static void main(String[] args)
    {
        System.out.println("Hello Java");
    }
}
According to Sun, 3 billion devices run Java. Java is currently used in many kinds of devices and applications. Some of them are as follows:

Desktop applications such as Acrobat Reader, media players, antivirus software etc.

Web Applications such as irctc.co.in, javatpoint.com etc.

Enterprise Applications such as banking applications.

Mobile

Embedded System

Smart Card

Robotics

Games etc.

Features of JAVA

Simple : Java is a simple language because its syntax is based on C++ (so it is easier for programmers to learn after C++). It has removed many confusing and/or rarely used features, e.g. explicit pointers and operator overloading. There is no need to remove unreferenced objects because Java has automatic garbage collection.

Object-Oriented : Object-oriented means we organize our software as a combination of different types of objects that incorporate both data and behaviour. Object-oriented programming (OOP) is a methodology that simplifies software development and maintenance by providing some rules. The basic concepts of OOP are: Object, Class, Inheritance, Polymorphism, Abstraction, Encapsulation.


Platform independent : A platform is the hardware or software environment in which a program runs. There are two types of platforms: software-based and hardware-based. Java provides a software-based platform. The Java platform differs from most other platforms in that it is a software-based platform that runs on top of other hardware-based platforms. It has two components: the runtime environment and the API (application programming interface). Java code can be run on multiple platforms, e.g. Windows, Linux, Sun Solaris, Mac OS etc. Java code is compiled by the compiler and converted into bytecode. This bytecode is platform-independent code because it can be run on multiple platforms, i.e. Write Once, Run Anywhere (WORA).

Secured : Java is secure because it has no explicit pointers and programs run inside a virtual machine sandbox.

Robust : Robust simply means strong. Java uses strong memory management. The lack of explicit pointers avoids security problems. There is automatic garbage collection in Java. There are exception handling and type checking mechanisms in Java. All these points make Java robust.

Architecture neutral : There are no implementation-dependent features; for example, the size of each primitive type is fixed.

Portable : Java bytecode can be carried to any platform.

High Performance : Java is faster than traditional interpreted languages since bytecode is "close" to native code. It is still somewhat slower than a compiled language (e.g. C++).

Multithreaded : A thread is like a separate program, executing concurrently. We can write Java programs that deal with many tasks at once by defining multiple threads. The main advantage of multi-threading is that threads share the same memory area, so a separate memory area is not needed for each thread. Threads are important for multimedia, Web applications etc.

Distributed : Distributed applications can also be created in Java. RMI and EJB are used for creating distributed applications. We may access files by calling methods from any machine on the internet.
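
To illustrate the multithreading feature from the list above, the following sketch starts several threads that all update one shared counter object. It is a minimal example under stated assumptions: the class name SharedCounterDemo and the method countWithThreads are hypothetical, and AtomicInteger is used so that concurrent increments on the shared memory are not lost.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: several threads increment a single counter that lives in
// memory shared by all of them.
final class SharedCounterDemo {
    // Starts `threads` workers, each adding `perThread` increments, and
    // returns the final value once every worker has finished.
    static int countWithThreads(int threads, int perThread) {
        AtomicInteger counter = new AtomicInteger(); // one object seen by all threads
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.incrementAndGet(); // atomic update of the shared value
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // wait for this worker to finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for workers", e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 1000)); // prints 4000
    }
}
```

With a plain int field instead of AtomicInteger, some increments could be lost, because sharing memory between threads also means that access to it has to be coordinated.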

6. JDBC
JDBC (Java Database Connectivity) is a Java API to connect to a database and execute queries against it. The JDBC API uses JDBC drivers to connect to the database. Before JDBC, the ODBC API was the database API used to connect and execute queries. However, the ODBC API uses an ODBC driver written in C (i.e. platform dependent and unsecured). That is why Java defined its own API (the JDBC API), which uses JDBC drivers (written in Java). An API (application programming interface) is a specification that describes the features of a product or software. It defines the classes and interfaces that software programs can use to communicate with each other. An API can be created for applications, libraries, operating systems, etc.


5 Steps to connect to the database


There are 5 steps to connect any Java application with a database using JDBC. They are as follows:

Register the driver class

Create the connection

Create the statement

Execute the queries

Close the connection

Register the driver class: The forName() method of the Class class is used to register the driver class. This method dynamically loads the driver class.
Syntax of the forName() method
public static Class<?> forName(String className) throws ClassNotFoundException
Create the connection object: The getConnection() method of the DriverManager class is used to establish a connection with the database.
Syntax of the getConnection() method
public static Connection getConnection(String url) throws SQLException
public static Connection getConnection(String url, String user, String password) throws SQLException
Create the statement object: The createStatement() method of the Connection interface is used to create a statement. The Statement object is responsible for executing queries against the database.
Syntax of the createStatement() method
public Statement createStatement() throws SQLException
Execute the query: The executeQuery() method of the Statement interface is used to execute queries against the database. This method returns a ResultSet object that can be used to get all the records of a table.
Syntax of the executeQuery() method
public ResultSet executeQuery(String sql) throws SQLException
Close the connection object: Closing the connection object automatically closes its Statement and ResultSet objects. The close() method of the Connection interface is used to close the connection.
Syntax of the close() method
public void close() throws SQLException
Connectivity with Access using a DSN

Connectivity with a type 1 driver is not considered good practice. To connect a Java application through a type 1 driver, create a DSN first; here the DSN name is mydsn.
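import java.sql.*;

class Test
{
    public static void main(String[] ar)
    {
        try
        {
            // JDBC URL for the ODBC data source named mydsn
            String url = "jdbc:odbc:mydsn";
            // Load the JDBC-ODBC bridge driver (removed from the JDK in Java 8)
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            Connection c = DriverManager.getConnection(url);
            Statement st = c.createStatement();
            ResultSet rs = st.executeQuery("select * from login");
            // Print the first column of every row
            while (rs.next())
            {
                System.out.println(rs.getString(1));
            }
            // Closing the connection also closes the statement and result set
            c.close();
        }
        catch (Exception ee)
        {
            System.out.println(ee);
        }
    }
}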
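
The same query can also be written with try-with-resources (available since Java 7), which closes the ResultSet, Statement and Connection automatically even when an exception is thrown. The sketch below is an alternative under stated assumptions, not the project's code: the class name LoginReader and the method readFirstColumn are hypothetical, the login table is taken from the example above, and a JDBC driver that accepts the given URL is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that reads the first column of the login table.
final class LoginReader {
    static List<String> readFirstColumn(String url) throws SQLException {
        List<String> values = new ArrayList<>();
        // Resources declared here are closed in reverse order when the block
        // ends, whether it completes normally or with an exception.
        try (Connection c = DriverManager.getConnection(url);
             Statement st = c.createStatement();
             ResultSet rs = st.executeQuery("select * from login")) {
            while (rs.next()) {
                values.add(rs.getString(1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // The default URL is the DSN from the example above; pass another
        // JDBC URL as the first argument to use a different driver.
        String url = args.length > 0 ? args[0] : "jdbc:odbc:mydsn";
        try {
            for (String value : readFirstColumn(url)) {
                System.out.println(value);
            }
        } catch (SQLException e) {
            System.err.println("Database error: " + e.getMessage());
        }
    }
}
```

No explicit Class.forName() call is made here because JDBC 4.0 and later drivers found on the classpath are loaded automatically by DriverManager.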

7. Conclusion
The task we tackle in this project has applications in information retrieval, information extraction, and text summarization. We identify potential improvements in results when more information is brought into the representation technique for the task of classifying short medical texts. Experimental results show that the technique used in the proposed work reduces the time and workload of doctors in analyzing information about a certain disease and its treatment in order to make decisions about patient monitoring and treatment. The system helps users, especially doctors, save time: they can easily learn about a disease, its treatment and its symptoms, and can analyse the various treatments associated with a particular disease. The text-mined document can be used in the medical health care domain, where a doctor can analyse the various kinds of treatment that can be given to a patient with a particular medical disorder. The doctor can update their knowledge of a particular disease, its treatment methodology, or the medicines that are under research for that disease. The doctor can also learn about medicines that are effective for some patients but cause side effects in patients with an additional medical disorder. Patients can also use the extracted document to get a clear understanding of a particular disease, its symptoms, side effects, medicines and treatment methodologies.

8. Future Scope
There is wide future scope for this project. We can make it more user-friendly by allowing the user to also extract information regarding the cure, symptoms and prevention of a disease. The project could be expanded to find the root cause of a disease and then, taking the patient's history or condition into account, to suggest a dose accordingly. A further idea is to examine the composition of a medicine and, by comparing it against the patient's report, to identify whether it suits the patient.

