
Intelligent Q&A System

Beyond

124069T K.M.K. Hasantha


124181E P.N. Udawatta
124003M I.A. Abeysekera

Faculty of Information Technology

University of Moratuwa

June 2017

Dissertation submitted to the Faculty of Information Technology, University of Moratuwa,
Sri Lanka for the partial fulfillment of the requirements of the Honours Degree of Bachelor of
Science in Information Technology.

June 2017

Declaration

We declare that this thesis is our own work and has not been submitted in any form for
another degree or diploma at any university or other institution of tertiary education.
Information derived from the published or unpublished work of others has been
acknowledged in the text and a list of references is given.

Name of Student (s) Signature of Student (s)

K.M.K. Hasantha

P.N. Udawatta

I.A. Abeysekera

Date:

Supervised by

Name of Supervisor(s) Signature of Supervisor(s)

Dr. (Ms.) A.T.P. Silva

Date:

We dedicate this work to
our Parents

Acknowledgement

We wish to express our sincere gratitude to our supervisor, Dr. (Ms.) A.T.P. Silva, for the
excellent supervision, guidance, support, encouragement and patience she has given in making
this a successful research project.

It also gives us great pleasure to acknowledge the generous support of all academic staff at
the Faculty of Information Technology, University of Moratuwa, and of all our external
lecturers, who shared their knowledge and provided supportive guidance that contributed
immensely to making our project successful.

We would like to express our appreciation and gratitude to all our friends and fellow batch
mates for their support and help given throughout this project.

Last but not least, we would like to offer our deepest gratitude to our loving parents for their
continuing love and support throughout this project, not forgetting our supportive brothers
and sisters, helpful cousins and all our relatives, who were there for us to make this project
a success.

Abstract
With human knowledge growing exponentially, it is impossible for a person to learn or memorize
all the knowledge even within a very limited field of study. Knowledge on demand is the next
big requirement in any profession. The intention of this project is to build a computer system
that can understand the meaning and intention of text content and provide answers to user
questions. Our proposed solution is an ontology-based intelligent question answering system
that collects information through a text mining process and maps it to the ontology. A question
classifier then classifies each question into a predefined set of classes. The resulting class
becomes the input to the answer generation module, which generates a SPARQL query to
retrieve the expected answer.

Table of Contents
Table of Figures

1. Introduction
1.1. Background
1.2. Aim
1.3. Objectives

2. Literature Review
2.1. Text mining approaches and ontology learning related approaches
2.1.1. Info Sleuth Research Project
2.1.2. TEXT-TO-ONTO Ontology
2.1.3. Ontology Learning (AIFB)
2.1.4. Hasti project
2.2. Question classification and machine learning related approaches
2.2.1. Question Classification comparison Research
2.2.2. Multi-class classification method
2.2.3. Conclusion
2.3. Natural Language to SPARQL Generation Approach
2.3.1. QASYO [8]
2.3.2. Natural Language Query Interpretation into SPARQL Using Patterns [9]
2.3.3. AutoSPARQL [10]

3. Technology Adapted
3.1. NLTK
3.2. WordNet
3.3. Scikit-learn
3.4. SPARQL Wrapper
3.5. Apache Jena Fuseki
3.6. SPARQL
3.7. OwlReady
3.8. Stanford Named Entity Recognizer
3.9. Random Forest Classifier
3.10. Pandas: data analysis toolkit

4. Our Approach
4.1. Ontology generation through text mining
4.2. Question classification
4.2.1. Introduction
5.1. Answer Generation Module

6. Analysis and Design
6.1. High Level Architecture
6.2. Ontology learning and knowledge gathering
6.2.1. Web Spider - Crawler
6.2.2. Tokenizer
6.2.3. Named Entity Recognition Tagger
6.2.4. POS Tagger
6.2.5. Word Analyzer
6.2.6. Relationship Builder
6.2.6.1. Relationship Categorization - Hand Built Pattern
6.2.6.2. Relationship Categorization Verbs Relations
6.2.6.3. Relationship Classification using bag of words
6.2.6.3.1. TF-IDF
6.2.6.3.2. Bag of Words
6.2.6.3.3. Random Forest Classification Model
6.2.7. OWL - Ontology Creator
6.2.8. Ontology Manager
6.2.9. OWL2RDF converter
6.2.10. RDF - Ontology
6.2.11. Apache Fuseki Endpoint
6.3. Question Classification
6.3.1. Question Preprocessing
6.3.2. Process of disease word identification and replacement
6.3.3. N-grams
6.3.4. Vectorization
6.3.5. Vector space model
6.3.6. Classification
6.3.7. Logistic regression
6.3.8. Gaussian Naive Bayes
6.3.9. Support Vector Machine
6.4. Answer Generation Module
6.4.1. Name Entity Recognition
6.4.2. Medical Term Recognition
6.4.3. SPARQL Query Generation
6.4.4. Question Clarification Module

7. Implementation
7.1. Knowledge Extraction to Ontology automation
7.1.1. Named Entity Recognition (NER) Model
7.1.2. Supervised Relationship Type Classification Model
7.1.2.1. Preprocessing of documents
7.1.2.2. Setting up the bag of features
7.1.3. Hand built patterns to generate relations
7.1.4. Semantic Templates
7.1.5. OWL2RDF converter
7.1.6. Ontology Endpoint
7.2. Question Classification Module
7.2.1. Preprocessor
7.2.2. Advantage of having a disease word replacer
7.2.3. Vectorization Process
7.2.4. Classification process
7.2.5. Accuracy calculation
7.3. Answer Generation Module
7.3.1. Name Entity Recognition
7.3.2. Medical Term Recognition
7.3.2.1. Pseudo code for searching by term:
7.3.2.2. Pseudo code for searching by CUI:
7.3.3. SPARQL Query Generation
7.3.4. Question Clarification Module

8. Evaluation
8.1. Ontology Automation Modules Evaluation
8.1.1. Data Retrieval
8.1.2. Information Conceptualization
8.1.3. Relationship Identification
8.1.4. Ontology Automation
8.2. Question classification Evaluation Model
8.2.1. Following test were performed.
8.2.2. For the Training Set
8.2.3. For the Testing dataset
8.3. Answer Generation Module

9. Conclusion and Further Work

References

Appendix A

Appendix B
Table of Figures
Figure 2:1 Architecture of the ontology learning environment [2]
Figure 2:2 Hasti project architecture [4]
Figure 2:3 the Structure of a Halex Entry [4]
Figure 2:4 A
Figure 2:5 B
Figure 2:6 Table P [6]
Figure 2:7 Table Q [7]
Figure 2:8 Table R
Figure 2:9 Components and search for synonyms using WordNet
Figure 2:10 generic query pattern used in this approach
Figure 4:1 Answer Generation Module
Figure 6:1 high level architecture
Figure 6:2 Text to Ontology Design
Figure 6:3 Question Classification Flow
Figure 6:4 Question Pre-processing model
Figure 6:5 Process of disease word identification and replacement
Figure 6:6 Answer Generation Module High Level Architecture
Figure 7:1 NER model operation
Figure 7:2 Relationship Construction Mechanism
Figure 7:3 Key steps of question classification module
Figure 7:4 Question conversion to preprocessed question

Table 5:1 Syntactic patterns for relationship extraction
Table 5:2 Question Classification Samples
Table 6:1 Property features
Table 7:1 NER iterative classification performances
Table 7:2 Accuracy Comparison between levels of Learning
Table 7:3 Zhou et al. results [13]
Table 7:4 With disease word replacement module
Table 7:5 Without disease word replacement module
Table 7:6 With disease word replacement module
Table 7:7 Without disease word replacement module
Table 7:8 Ability of providing correct answer

Chapter 1
1. Introduction
The main goal of the intelligent Q&A system is to provide accurate answers to natural language
questions using a given context. The system consists of three main components: an ontology
automation module, a question classification module and an answer generation module. The
system currently focuses on answering questions about the medical domain. An ontology is
created for the medical domain and populated with information about diseases and their
symptoms, causes, risks and prevention methods using unstructured data extraction methods.
The system will be able to identify meaningful links between various parts of a text and to
understand how words are interrelated to make meaningful sentences. Further, the system will
be able to identify the hidden content of sentences by retrieving information from relationships
and analyzing dependencies between words and clauses. Since questions can take various
forms, a question classification module is used to identify the question type and classify each
question into one of a set of predefined classes. The answer generation module takes the
question and the question class from the classifier and extracts the key information from the
question. The collected information is then used to build SPARQL queries that traverse the
ontology to generate meaningful answers. The accuracy of the system depends on the accuracy
of each module. Since every module's output becomes an input to another module, an error in
one module can easily propagate to the next. Thus the accuracy of each module is measured
and improved individually.
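
To make the hand-off between the question classifier and the answer generation module more concrete, the following is a minimal sketch of how a question class and a key entity could be turned into a SPARQL query. The namespace, the property names (`hasSymptom`, `hasCause`, `hasPrevention`), the class labels and the function `build_query` are all assumptions made for illustration. They are not taken from the system described here.

```python
# Hypothetical namespace and property names; the actual ontology may differ.
PREFIX = "PREFIX med: <http://example.org/medical#>"

# One illustrative SPARQL template per question class (assumed class labels).
TEMPLATES = {
    "symptom": "SELECT ?answer WHERE {{ med:{entity} med:hasSymptom ?answer . }}",
    "cause": "SELECT ?answer WHERE {{ med:{entity} med:hasCause ?answer . }}",
    "prevention": "SELECT ?answer WHERE {{ med:{entity} med:hasPrevention ?answer . }}",
}


def build_query(question_class, entity):
    """Fill the template for the given question class with the key entity."""
    template = TEMPLATES.get(question_class)
    if template is None:
        raise ValueError(f"no template for question class: {question_class}")
    # Keep only characters that are safe inside a prefixed name.
    local_name = "".join(ch for ch in entity if ch.isalnum() or ch == "_")
    if not local_name:
        raise ValueError("entity has no usable characters")
    return PREFIX + "\n" + template.format(entity=local_name)


print(build_query("symptom", "Dengue"))
```

Because the query is selected by question class, a misclassified question produces a well-formed but wrong query, which illustrates why an error in one module carries over to the next.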

1.1. Background
It is said that until 1900 human knowledge doubled approximately every century. By the end of
World War II knowledge was doubling every 25 years [1]. Current figures show that all of
human knowledge is doubling about every 13 months. With human knowledge growing
exponentially, it has become increasingly difficult for one person to become an expert in a
particular field of study or profession. We are at a point where traditional approaches such as
memorizing and summarizing information are not enough to function efficiently in any
profession. Humans are no longer efficient enough to process all the information available and
generate valuable knowledge out of it. This is a bottleneck that limits human creativity [1]. A
requirement therefore arises to automate the process of providing accurate knowledge on
demand. Since the majority of human knowledge is in written form, a system that understands
the complexity of human language well enough to recognize the meaning and intention of
written knowledge, so that it can abstract that information and knowledge to address human
demands, could easily be the next revolutionary step for mankind.

1.2. Aim
The aim of this project is to develop an intelligent agent capable of generating an accurate
answer to a given question by extracting details from a given knowledge base. The system must
be able to answer all sorts of questions, ranging from direct questions to questions that require
deep understanding. In addition, the system will be able to improve itself continuously during
both the training process and continued usage.

1.3. Objectives
Build an ontology model to represent the knowledge in the text.
Analyze the distribution of semantics using latent semantic analysis [2].
Map the text (the given inputs: question and document) in natural language into useful
representations.
Improve the system knowledge base through a continuous training process.
Extract meaning from text using statistics and machine learning.
Generate accurate answers using natural language generation (text planning, sentence
planning, text realization).
Develop an interface to get user inputs (questions) and display the output.

Chapter 2
2. Literature Review
2.1. Text mining approaches and ontology learning related approaches
In this section we consider the techniques and methodologies used in several research efforts
on information retrieval and extraction from documents. Text mining generally consists of a
set of natural language processing steps carried out before information is actually extracted.

2.1.1. Info Sleuth Research Project


The Info Sleuth research project at MCC (Microelectronics and Computer Technology
Corporation) used an active information gathering concept. It was based mainly on an
agent-based system constructed to carry out several distinct information management
activities in a variety of environments. The approach used several text mining techniques and
other real-time data gathering approaches. A thesaurus (GEMET) [1] was used to create a
standardized vocabulary and to enable multilingual translation of the terms of queries and
results. Using web crawlers and document classifiers, offline resource agents gather
information that includes JDBC sources, text, flat files and images. Here we are interested only
in the text component. Source data, retrieved from free documents on the internet, is extracted
and POS tagged automatically. To carry out this methodology, human experts provide seed
words that represent high-level concepts. The system processes documents according to the
seed words and places them in the respective places in the ontology.

To extract information related to concepts, the system uses superficial syntactic analysis,
which includes pattern matching and local context (NPs) with word sense disambiguation.
Relation extraction is carried out automatically, based on the linguistic properties of noun
components and the inheritance hierarchy. The system relies mainly on corpus-based learning.

The ontology produced by this methodology is naturally a hierarchical structure. Optionally,
while collecting data into the ontology, the system indexes documents in databases for later
retrieval.

The problems of this approach are recognizing different sentences that talk about the same
concept and word sense problems. It also uses heterogeneous resources as its data sources for
producing the ontology.

2.1.2. TEXT-TO-ONTO Ontology
The TEXT-TO-ONTO Ontology Learning Environment is a project that constructs ontologies
from texts and conceptual structures, based on discovering general architectures for those
ontologies and structures [2].

Figure 2:1 Architecture of the ontology learning environment [2].

The text processing server returns text annotated with XML, and this XML-tagged text is fed
to the Learning & Discovering component for further evaluation, which models the ontology.
This architecture makes the process of text mining into an ontology very convenient.

The main components of this system are Text & Processing Management, the Text Processing
Server, Learning & Discovering Algorithms, the lexical database and domain lexicon, and the
Ontology Modeling Environment. TextToOnto proceeds through ontology import, extraction,
pruning and refinement stages. The main advantage of this system is that it has diverse
algorithms that help in term extraction and taxonomy construction. It also provides ontology
maintenance algorithms, such as ontology pruning and refinement algorithms.

2.1.3. Ontology Learning (AIFB)


This is another study, which uses the above-mentioned TextToOnto as an associated tool for
generating ontologies, together with other tools, namely SMES and OntoEditor. In this study,
data was gathered from free text available on the internet. The methodology uses a tokenizer,
morphological analysis, named entity recognition, part-of-speech tagging and a chunk parser
throughout the concept extraction process. To extract relations between data, it uses
co-occurrence clustering of concepts, which clusters similar data, as well as heuristic rules
based on linguistic dependency relations and general association rules from machine learning.

Further relations are identified through term extraction and synonym extraction, the latter
using the Pointwise Mutual Information (PMI) measure to extract synonyms. The PMI of two
events x and y is defined as:

\[ \mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)} \]

where P(x,y) is the probability of a joint occurrence of x and y, and P(x) is the probability of
the event x [3]. This means that if P(x,y) > P(x)·P(y), PMI takes a positive value; otherwise
it takes a zero or negative value. This approach can be used to calculate the statistical
dependence of two words on the Web, with the PMI value generated from Google by
counting hits.
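
As a worked illustration of the PMI measure, the sketch below computes PMI as the logarithm of the ratio between the joint probability and the product of the individual probabilities, estimated from raw counts. The function name `pmi`, its parameters and the use of base 2 for the logarithm are assumptions for this example. The counts are made up and do not come from any search engine.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information estimated from raw counts (log base 2)."""
    if min(count_xy, count_x, count_y) <= 0 or total <= 0:
        raise ValueError("all counts must be positive")
    p_xy = count_xy / total  # probability of the joint occurrence of x and y
    p_x = count_x / total    # probability of x
    p_y = count_y / total    # probability of y
    return math.log2(p_xy / (p_x * p_y))


# Made-up counts: x and y each occur 100 times in 1000 documents.
together_often = pmi(50, 100, 100, 1000)   # co-occur more than chance
together_rarely = pmi(1, 100, 100, 1000)   # co-occur less than chance
print(together_often > 0, together_rarely < 0)
```

Word pairs that co-occur more often than independence would predict get a positive score, which is what makes the measure usable as a signal for candidate synonyms.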

For concept learning, this system focuses on approaches that induce concepts using concept
clustering, linguistic analysis and inductive methods [3]. In concept clustering, concepts are
formed and ordered hierarchically at the same time. Linguistic analysis is used to derive an
intensional description of concepts in natural language form.

The ontology learning in this study uses several tools, such as SMES, TextToOnto and
OntoEditor. When implementing the knowledge representation, the learned ontology
primitives are stored in a meta model called the Possible Ontologies Model (POM). An
advantage of this model is that it gives the ontology engineer more control, since changes can
easily be traced back to the original corpus.

2.1.4. Hasti project


This project implements and tests an ontology building approach. Starting from a small
ontology kernel, the approach constructs the ontology automatically from a natural language
corpus. The kernel consists of the primitive concepts, relations and operators needed to build
a suitable ontology [5].

Figure 2:2 Hasti project architecture [4]

Input text is converted into the ontology through the following steps:

Morphological and syntactic analysis and extraction of new word features.
Building sentence structures (SSTs).
Extracting conceptual-relational knowledge (primary concepts).
Adding primary concepts to the ontology.
Ontology reorganization.

In the Hasti project the lexicon is called Halex (Hasti Lexicon). In brief, the Halex structure is
a set of knowledge type entries such as morphological knowledge, syntactic knowledge,
semantic knowledge and pragmatic knowledge. These are used to mitigate ambiguity in natural
language. Halex holds the N different senses that a word may take.

Figure 2:3 the Structure of a Halex Entry [4]

In this project an ontology is defined as O = (C, R, A, Top) [5], where O is the ontology, C is
the set of all concepts, R is the set of all assertions, A is the set of axioms and Top is the top
level of the hierarchy.

Hasti has a natural language processing component consisting of lexical, morphological,
syntactic, SST and predicator analyzers. Input text passes through these analyzers, which
generate knowledge about words.

2.2. Question classification and machine learning related approaches

Although the classification of text and documents is a common and well-studied area of
natural language processing, it is quite challenging to classify questions in order to support an
intelligent question answering system. This is because question answering differs from the
common search engine process: it requires finding a concise answer instead of a matching set
of documents. The target text is less likely to match the question text, so it is important to
understand the syntax and semantics of the question [6]. This is commonly achieved through
a machine learning approach.

2.2.1. Question Classification comparison Research

This research article focuses on the different types of classification methods used to create
intelligent question answering systems. It emphasizes the importance of using machine
learning approaches over manually constructed rule-based question mapping techniques by
pointing out the following advantages.

Advantages of Machine learning approach [6]

Efficient and effective in learning insightful features.

Flexibility: ability to be re-trained into a new taxonomy.

They compared two classifiers using different levels of features extracted from the questions
and reported their accuracy. Results are presented in Table P. The two classifiers used are
support vector machines (SVM) and maximum entropy models.

2.2.1.1 Feature extraction from the question.

To feed questions to the classifiers, some form of representation of the questions is needed. It
is not possible to input a question as a raw string of words and apply the classification methods.
The following features are extracted from each question as a bag of features.

Wh-word.

Head word.

WordNet semantic feature for the head word.

Word grams.

Word shape feature.

The wh-word is the question word, which is one of who, why, where, which, when, how,
what and rest, where rest covers the questions that do not have any question word. Example: - Name of
a disease that causes rashes?
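
The wh-word feature can be sketched in a few lines. This is a minimal illustration, not the implementation used in [6]; scanning for the first matching token is an assumption made here for simplicity, and the function name is hypothetical.

```python
import re

# Wh-words listed in the text; "rest" is the fallback label.
WH_WORDS = ("who", "why", "where", "which", "when", "how", "what")

def wh_word(question):
    """Return the first wh-word found in the question, or 'rest' if none."""
    for token in re.findall(r"[a-z]+", question.lower()):
        if token in WH_WORDS:
            return token
    return "rest"

first = wh_word("What year did the Titanic sink")
second = wh_word("Name of a disease that causes rashes?")
```

Here the first question is labelled with its wh-word and the second, which has no question word, falls into the rest category.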

The head word is the noun or verb that is considered the key word of the question. Example: -
What is a group of fish called? In this question the head word is fish and it is of type Entity: animal
(refer Table Q) [6]. However, as the article explains, identifying the head word is not
straightforward. The first approach to head word extraction is via a syntactic parser (Charniak parser
/ Stanford parser) and a modified form of the Collins rules.

Consider the following example.

What year did the Titanic sink

Figure A [6] is the result of a syntactic parser with the use of the Collins rules, where the head word is
identified as did. However, after modifying the rules to give priority to nouns over
verbs or verb phrases, the head word is identified as year (Figure B) [6].

Figure 2:4 A
Figure 2:5 B

Before the above method is used for head word searching, a rule-based algorithm built on a set of regular
expressions is used to identify the commonly known question types and find the head
word. Questions with when, where or why return no head word, since those
wh-words are considered highly informative on their own. If the algorithm fails to identify the
head word, parse trees based on the modified Collins rules are used to find it. In the case where
the extracted head word does not carry a noun or noun phrase tag, as the last step the first word of the
question tagged as a noun or noun phrase is taken as the head word.

The WordNet semantic feature identifies meaningful related words for the extracted head word.
WordNet is a tool used for semantic analysis, and here it is used to identify the hypernyms of the head words
found. A hypernym is a more generic word for a given word. It can be at different levels.

Example

Dog - > Domestic Animal - > Animal.

The hypernym of Dog at depth 1 is Domestic Animal and at depth 2 is Animal. Hypernyms can
exist for both the verb sense and the noun sense of a word. The type of sense can introduce ambiguity to the head
word and cause noise in the data set. The correct sense for the head word is identified using the Lesk
algorithm, which selects the sense whose words share the maximum number of common words with
the question.
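
The word-overlap idea behind the Lesk algorithm can be sketched as follows. This is a simplified illustration under stated assumptions: the sense names and glosses are toy data invented for the example, whereas a real system would read them from WordNet, and no stopword filtering is applied.

```python
def simplified_lesk(question, senses):
    """Pick the sense whose gloss shares the most words with the question.

    `senses` maps a sense name to its gloss text (toy data here).
    """
    question_words = set(question.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(question_words & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical glosses for two senses of the head word "bank".
toy_senses = {
    "bank.n.01": "sloping land beside a river or lake",
    "bank.n.02": "a financial institution that accepts deposits of money",
}
chosen = simplified_lesk("which bank is beside the river", toy_senses)
```

The question shares "beside" and "river" with the first gloss and nothing with the second, so the first sense is selected.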

The fourth feature extracted from the question is word grams. An n-gram is a subsequence of
N words from a given question. They used unigram, bigram and trigram features for
the experiment.

The last feature considered is word shape, which is one of all upper case, all lower case, mixed
case and all digits.

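
The word gram and word shape features described above can be sketched as below. The function names are hypothetical and the fallback of labelling every other token as mixed is an assumption of this sketch.

```python
def word_ngrams(question, n):
    """Return all contiguous n-word subsequences of the question."""
    words = question.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def word_shape(token):
    """Classify a token as all digits, all upper, all lower or mixed."""
    if token.isdigit():
        return "all_digits"
    if token.isupper():
        return "all_upper"
    if token.islower():
        return "all_lower"
    return "mixed"  # fallback for everything else in this sketch

question = "What year did the Titanic sink"
bigrams = word_ngrams(question, 2)
shapes = [word_shape(w) for w in question.split()]
```
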
Experiments were done using subsets of the total feature set with both the SVM and ME classifiers, and the
results are reported in the following Table P.

Figure 2:6 Table P [6]

2.2.2. Multi-class classification method


This is a similar approach that focuses on the same hierarchical classification, using the same set of classes
and subclasses as the previous method. The following is the table of classes and
subclasses used for classification in this and the previous method.

Figure 2:7 Table Q [7]

Classification is done by identifying the type of answer expected by the question. The type
of the answer is first classified into one of 5 coarse classes using a coarse classifier,
and each coarse class has fine classes, which are assigned using a fine classifier [7]. Both
classifiers are fed with features extracted from the questions by the same feature extractor. Six
primitive features of the question are considered here: nouns, POS tags, chunks, head
chunks, named entities and semantically related words [7].

The noticeable difference in this approach is the multi-class classification. In the previous
method, using the SVM and ME classifiers, each question is identified as belonging to exactly one class and
subclass. In this approach multiple candidate coarse classes are selected by the coarse
classifier, and all the fine classes of each selected coarse class are used to identify candidate
fine classes using the fine classifier.

However, they only managed to construct a multi-class classifier for the coarse classes, and
classification is done using one feature at a time.

Figure 2:8 Table R

2.2.3. Conclusion
When comparing the experimental results of both approaches (Table P and Table R), it is
clear that significant improvements in accuracy can be achieved using multi-class
classifiers. Even when classification is done using all the 6 features, accuracy remains below
90% in the first approach. However, the multi-class classifier with only one feature achieves an
accuracy close to, if not over, 90% in most cases.

2.3. Natural Language to SPARQL Generation Approach
This section gives an overview of research that has been conducted on extracting
information from ontologies by converting natural language to SPARQL queries.

2.3.1. QASYO[8]
QASYO is a question answering system that uses the YAGO ontology as its knowledge base. YAGO
is an ontology which extracts information from WordNet, Wikipedia and GeoNames, and it has
been integrated into the linked data cloud by linking to the DBpedia and SUMO ontologies. QASYO
integrates semantic ontologies with natural language processing in a unified framework. It
extracts key words from the question to identify the question type using semantic analysis.

The question answering process of QASYO consists of a question analysis phase and an answer retrieval
phase.

The question analysis phase generates a query pattern by classifying and parsing the question. A query
pattern is a natural language query which is labelled with ontology concepts and morphological
information. Classification is done by categorizing the question as a wh-question (who, what,
when, which and where) or a yes/no question. The answer type is identified based on the question
category, and the question type is matched with the entity type in the ontology. Then semantic
triples are generated using the linguistic components, and WordNet is searched for synonyms of
the detected unknown components.

Figure 2:9 Components and search for synonyms using WordNet


The above figure shows the system architecture of QASYO. The answer retrieval phase uses the query
pattern as input, searches for the pattern in the ontological databases and detects relations. It then
gives the answer if it exists in the ontologies, or simply gives a "don't know" message if it is not
in the knowledge base.

2.3.2. Natural Language Query Interpretation into SPARQL Using Patterns [9]

This system suggests a way of designing queries expressed in terms of conceptual graphs and
adapts it to Semantic Web languages instead of graphs. The system introduces a pivot language that
allows relations to be expressed in keyword queries. A pivot query is a new query obtained
by translating the dependency graph of the question into a new graph using the identified named entities and
dependencies. These queries contain relationships which are connected to keywords. Pivot
queries are matched with predefined patterns to obtain a list of potential query interpretations. The system
takes a natural language query as input and outputs ranked SPARQL queries and the associated answers.
It justifies patterns from the literature by translating natural language to pivot queries. These patterns
contain repeatable sub-patterns and optional patterns.

A generated pattern is a 4-tuple (G, Q, SP, S). G is an RDF graph which represents a query
family and generalizes the structure of the pattern. Q shows the quantification of elements and
is a subset of G. SP contains the set of sub-patterns sp of p such that,

v is a cut vertex of G. card min and card max are the minimum and maximum cardinalities, which
are used to categorize sub-patterns. Sub-patterns with a cardinality of zero are
optional, while sub-patterns with a cardinality of 1 or more are not optional. S is the
descriptive sentence template.

In the descriptive sentence, n substrings swi correspond to the n sub-patterns and wj correspond
to the m selected elements in m substrings, which are unique. The following image shows the generic
query pattern used in this approach.

Figure 2:10 generic query pattern used in this approach

Instantiation of a pattern element can be displayed using the equation below.

This is the sub-pattern obtained by instantiating q by the resource in the pattern
p. This instantiation is only possible if the resource is compatible with q. Sub-patterns are generated by
recursively nesting a pattern p = (G, Q, SP, S) in the main pattern. A pattern can itself be considered
as a sub-pattern that is not nested in another pattern and whose maximal
and minimal cardinalities are equal to 1. The instantiation mechanism of patterns and sub-patterns
remains the same.

This approach generates a relevance mark and a suggested query interpretation for reformulating the
query, in order to address the habitability problem in answer generation. The habitability problem arises
when a user enters questions which are beyond the system's capabilities.

2.3.3. AutoSPARQL [10]

AutoSPARQL proposes the QTL (Query Tree Learner) algorithm, which fills one of the gaps
between research and practice in the area of generating SPARQL queries from natural language. A query
tree is the structure used internally by the Query Tree Learner algorithm, and it
roughly represents a SPARQL query. The system
uses supervised machine learning techniques to allow users to ask questions without knowing
the underlying knowledge base schema beforehand. It generates SPARQL queries based
on positive and negative examples. Positive examples are resources which are included in
the results of the SPARQL query, and negative examples are resources which are not included
in them. This approach gives the user the freedom to ask questions as in
other question answering systems, or to search directly for a resource of interest.

This is the definition of a query tree. The set of RDF resources is denoted R, L represents the set
of RDF literals, S represents the set of strings and SQ denotes the set of SPARQL queries. The restriction of
a function f to a domain D is denoted f|D. The definition of a subtree is given below.

Each resource in the RDF graph can be mapped to a query tree. When mapping a resource to a query
tree, the system is limited to a recursion depth to increase efficiency. The recursion depth
corresponds to the maximum nesting of triple patterns that can be learned by the QTL algorithm.

This is the work flow of AutoSPARQL. The system suggests questions if the query result does not
match the user's intent. If the user is interested in a suggested question, the QTL
process is executed again to generate answers. This forms an active learning environment on top of QTL.

Chapter 3
3. Technology Adapted
3.1. NLTK
NLTK is an easy-to-use natural language toolkit that can be imported in Python. It
provides useful techniques for most natural language processing tasks. In our context we have used
NLTK for several processes, such as sentence and word tokenization, and POS tagging in the new
relationship type builder section.

In addition, it has built-in packages for other POS tagging and parsing facilities which
we have not used in our approach. NLTK is useful for carrying out most of the initial-level
processing, such as word tokenization, sentence tokenization, regexp-based tokenization and access to
several corpora for training models.

3.2. WordNet
WordNet is a lexical database of the English language. It provides synonyms for each word
through synsets. In addition, WordNet stores short definitions and examples for these
synonym sets.

WordNet uses a hypernym hierarchy, which allows users to traverse up and down
the hierarchy to find the relationships between word classes.

3.3. Scikit-learn
Scikit-learn is a machine learning library for Python. It is used for classification, regression,
clustering, dimensionality reduction, model selection and preprocessing. SVM is a binary
classifier. However, our requirement is to develop a multi-class classifier. Scikit-learn has a One
Vs One Classifier which enables multi-class classification from binary classifiers. The One Vs One
Classifier does this by constructing one classifier per pair of classes. At prediction time
the class with the most votes is selected.
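
The pairwise voting scheme can be illustrated in plain Python. This is only a sketch of the idea and not Scikit-learn's implementation: the keyword rules stand in for trained binary classifiers, and the class names and helper names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(sample, classes, pairwise):
    """Predict by majority vote over one binary classifier per class pair."""
    votes = Counter(pairwise[pair](sample) for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0]

def keyword_rule(first, second, keyword):
    """Toy binary classifier: pick `first` if the keyword occurs, else `second`."""
    return lambda text: first if keyword in text.lower() else second

classes = ["ABT_causes", "ABT_symptoms", "ABT_treatment"]
pairwise = {
    ("ABT_causes", "ABT_symptoms"): keyword_rule("ABT_causes", "ABT_symptoms", "cause"),
    ("ABT_causes", "ABT_treatment"): keyword_rule("ABT_causes", "ABT_treatment", "cause"),
    ("ABT_symptoms", "ABT_treatment"): keyword_rule("ABT_symptoms", "ABT_treatment", "symptom"),
}
predicted = one_vs_one_predict("What are the causes of heart attack?", classes, pairwise)
```

With three classes there are three pairwise classifiers; two of them vote for the causes class in this example, so it wins the vote.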

3.4. SPARQL Wrapper
This is a wrapper for SPARQL services. It is used to run SPARQL queries against locally hosted .owl
files. (It is more efficient than the rdflib Python library in query execution.)

3.5. Apache Jena Fuseki


Apache Jena Fuseki serves RDF data over HTTP. It is used to host the ontology locally and acts as the
ontology endpoint against which SPARQL queries are run to retrieve data and information.

3.6. SPARQL
This is the RDF query language. It is used to retrieve and manipulate data stored in RDF format.

3.7. OwlReady
OwlReady is a library available for Python 3 for carrying out OWL-related operations. The
operations relevant to our system are adding new classes, inserting new individuals under their
classifications, adding new property types and mapping each relationship using a predefined property
type.

In addition, this library allows creating new ontologies and updating an ontology between online
and local files.

3.8. Stanford Named Entity Recognizer


This is a Java-based tool which allows us to train our own NER models and use them to recognize
new named entity types. Stanford NER provides external jar files which must be referenced
from the Python module in order to use this tagger.

3.9. Random Forest Classifier


The Random Forest classification approach uses decision trees to predict results.
The classifier creates several decision trees from the training data. These decision trees vote for
the expected outcome, and the final predicted class is the one with the highest
number of votes.

3.10. Pandas: data analysis toolkit


The Pandas toolkit is used to work with structured data types and carry out fast and flexible operations
on data. It allows working with relational or labeled data easily. In our data analysis we have
used real-world data, and in such situations it is advisable to use a powerful and flexible
open source data analysis and manipulation tool.

Chapter 4
4. Our Approach
4.1. Ontology generation through text mining
Learning an ontology through text mining is a critical process, as we found throughout this project. In
order to proceed to content-based question answering, we must first provide high-quality data
extracted from the content. Answer quality, trueness and validity depend mostly on the data
extracted from the source. To support answer generation for a requested query, we
provide an ontology which has the ability to learn from the context provided.

Throughout this project we have selected medical web sources as our domain, for which we try to
provide answers to questions asked about diseases. Initial data gathering is carried
out on the web content available at these web sources with the help of a web spider
which crawls through web pages. This crawler gathers documented data from web sources
as list type and paragraph type separately. These unstructured data documents are stored in
data directories which can later be used for processing.

In this approach, which is aligned with the design, we first feed data as documents to the system.
An input data document is unstructured data, crawled from the web as it is. This
data then goes through a set of preprocessing steps inside the mechanism. Tokenization is the
process of breaking a stream of text into sentences, words, phrases, symbols or other
meaningful elements, which are then called tokens. A token is a group of characters which has a
collective meaning.

Tokenizing general example:

Sentence: int height = 100;

After Tokenization

Token1 (keyword, int)

Token2 (identifier, height)

Token3 (=,)

Token4 (constant, 100)

Token5 (;,)

After tokenizing we use an NER model to recognize named entities and tag them using the BIO-NER
tagging approach. The tags include DISEASE (e.g.: Heart Attack), CREATURE (e.g.:
Mosquito), MICROORGANISM (e.g.: Virus), BIOLOGY (e.g.: Muscles) and NN (e.g.: Airborne
Droplets). Each entity tag carries a B or I prefix, where B indicates the
beginning of the entity name and I indicates the inside of the entity name. We also use O to represent
any word outside of a chunk.

From these recognized named entities we identify the subjects and objects between which a
relationship is to be created. The part of the sentence between them is then analyzed by the Relationship
builder, which is associated with several internal and external modules. For common scenarios
this relationship builder identifies several common relationship types.

Semantic relationship types identified are Synonymic relationships, Causal relationships and
Hyponymic relationships.

Synonymic relationships have the form where Entity1 is equivalent to Entity2. E.g.
Dengue is a type of Virus.

Causal relationships capture causative relationships between entities. E.g. Brain
disorder is caused by human immunodeficiency virus.

Hyponymic relationships have the form where Entity1 and Entity2 are of similar types. E.g.
Myocarditis is similar to Influenza.

If none of these is identified, we use an alternative approach inside this model: following
POS tagging we identify verb and adverb types (e.g. VB, VBD, RB, RBR, etc.)
and create new relationship types based on them.

The external module is a separately trained module that learns relationships from sentences
and classifies sentences into disease_is, is_caused_by, cause_symptom,
has_treatment and other types. This helps to capture relations which may have been missed
by the internal module.

These semantic relationships are then mapped to the concepts identified by the previous module and
arranged into semantic templates, which hold the subject, the object and their properties in a fixed
format.

Knowledge extracted at sentence level and word level is represented as concepts
and their relations. Later these are converted into ontology elements by the ontology creator.

The ontology creator's task is to define primary concepts such as disease, creature and microorganism
types. It is then able to create inter-relations between noun and verb phrases and
place them in their proper place in the ontology.

This ontology is used for further development and for learning from the questions which users
ask the system. The output of text mining and ontology development is this learned
ontology.

4.2. Question classification


4.2.1. Introduction
Question answering is different from an average search engine. A search engine outputs a set
of documents related to the search phrase, whereas a Q & A system needs to provide a specific
answer to the asked question. This requires a syntactic and semantic understanding of the
question, which is provided as a string without any other references.

(Table C)

Inputs: - Factual wh-question, of type String.

(Questions such as "Can you bring me the pen?" are not considered here.)

Output based on: - Question type: the type of the entities present in the question.

Expected answer type: the semantic type of the expected answer.

Goal: - Categorize the question into different semantic classes based on the nature of the
question and the expected answer type.

Question semantic classes: - Identify the rdf:type matching the search phrase.

Answer semantic classes: - Identify the type of the expected answer.

Example:-

What causes diabetes?

Question Type = Entity:diseases

Answer Type = Description: Reason

SELECT ?cause
WHERE {
  ?de rdfs:label ?label .
  FILTER regex(?label, "diabetes", "i") .
  ?de rdf:type base:Diseases .    # Question Type
  ?de base:causes ?cause .        # Answer Type
}

5.1. Answer Generation Module


The answer generation module takes its input from the question classifier module and uses the
question class type to identify the keywords of the question. Using the keywords, the system identifies
predicates in the ontology and then generates SPARQL queries. The SPARQL query matches the
entities in the ontology with the natural language query and generates the answers.
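
How a question class and a keyword could be turned into a SPARQL query is sketched below. This is a minimal illustration and not the system's actual generator: the class-to-predicate mapping and the function name are hypothetical, base:symptoms is an assumed predicate, PREFIX declarations are omitted, and the query uses the standard rdfs:label and rdf:type terms.

```python
# Hypothetical mapping from question class to ontology predicate.
CLASS_TO_PREDICATE = {
    "ABT_causes": "base:causes",
    "ABT_symptoms": "base:symptoms",  # assumed predicate name
}

def build_query(question_class, keyword):
    """Build a SPARQL SELECT query for a classified question and its keyword."""
    predicate = CLASS_TO_PREDICATE[question_class]
    # Escape backslashes and quotes so the keyword stays inside the literal.
    safe_keyword = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?answer\n"
        "WHERE {\n"
        "  ?de rdfs:label ?label .\n"
        f'  FILTER regex(?label, "{safe_keyword}", "i") .\n'
        "  ?de rdf:type base:Diseases .\n"
        f"  ?de {predicate} ?answer .\n"
        "}"
    )

query = build_query("ABT_causes", "diabetes")
```

The resulting string could then be sent to the SPARQL endpoint; a class without a mapping raises a KeyError in this sketch.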

Figure 4:1 Answer Generation Module

Chapter 5
6. Analysis and Design
6.1. High Level Architecture

Figure 6:1 high level architecture

6.2. Ontology learning and knowledge gathering


In this section we describe the major components related to text mining and to ontology learning
from the knowledge gathered through text mining. In our design we have separately shown
the important areas: document collection, data extraction, data
preprocessing and analysis, sentiment analysis of words when carrying out NER tagging
operations, relationship building approaches and the overall ontology creation via the semantic
templates built by the system.

Figure 6:2 Text to Ontology Design

6.2.1. Web Spider - Crawler
The task of this crawler is to gather data from the web sources provided, focusing on web content
which is predefined for the crawler. From that the system can gather a separate set of documents
for each web page and web context. These documents are classified into answer
types and question types and stored in separate data directories for future reference
in the ontology learning procedure.

6.2.2. Tokenizer
After a document is fed into the system, the tokenizer first identifies sentences and then
splits them on whitespace to generate tokens. With the help of
regular expressions we were able to make this process more accurate by removing punctuation. At
this stage we have also considered stopwords (e.g. is, am, the, with, etc.) in order to
optimize the tokenization process.

e.g.

Sentence: People with flu can spread it to others up to about 6 feet away.

Tokenized: ['People', 'with', 'flu', 'can', 'spread', 'it', 'to', 'others', 'up', 'to', 'about', '6', 'feet', 'away']

Stopword removed: ['People', 'flu', 'spread', 'others', '6', 'feet', 'away']
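
The example above can be reproduced with a small regular-expression tokenizer. This is a sketch only: the stopword list is a short illustrative subset chosen for this sentence, whereas a real system would use a fuller list such as the one shipped with NLTK.

```python
import re

# Small illustrative stopword list (an assumption of this sketch).
STOPWORDS = {"is", "am", "the", "with", "can", "it", "to", "up", "about"}

def tokenize(sentence):
    """Split a sentence into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence)

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

sentence = "People with flu can spread it to others up to about 6 feet away."
tokens = tokenize(sentence)
content_tokens = remove_stopwords(tokens)
```
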

6.2.3. Named Entity Recognition Tagger


After word tokenization the words are passed through the NER tagging process, which tags the
custom named entities learned during the training process. For this section
a model trained using Stanford's NER Conditional Random Field
classifier has been used. Training details are described further in the Implementation section under
experiments.

Eg. ([Heart, B-DISEASE], [Attack, I-DISEASE])

6.2.4. POS Tagger


This is a very important step in the process, in which each word is tagged
based on the given sequence of words. Each token is assigned a syntactic
word category (i.e. NNP, VB, TO, DT, etc.). Usually this uses a Viterbi and Hidden Markov
Model based POS tagging approach, where tags are assigned by learning from the context.
e.g.

Tagged: [('People', 'NNS'), ('with', 'IN'), ('flu', 'NN')]

In the design, POS tagging comes under new relation type generation, which involves identifying
verbs (VB, VBD, VBG, VBN, VBP and VBZ) and adverbs (RB, RBR and RBS).

6.2.5. Word Analyzer


At this stage we have words with their named entity tags. This stage first identifies the NER tags and
carries out a merging operation based on the BIO concept. Before proceeding to relationship identification,
this module checks whether the sentence being analyzed has at least two NER tags that are not
outside-the-chunk (O) tags. This is a systematic process which is carried out whenever an
abstract is analyzed.
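
The BIO merging step and the two-entity check can be sketched as follows. The function names are hypothetical and the handling of stray I- tags (treated like O) is an assumption of this sketch rather than documented behaviour of the system.

```python
def merge_bio(tagged_tokens):
    """Merge B-/I- tagged tokens into (entity text, entity type) pairs."""
    entities, current_words, current_type = [], [], None
    for word, tag in tagged_tokens:
        if tag.startswith("B-"):
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            current_words.append(word)
        else:  # an O tag (or a stray I- tag) closes any open entity
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

def has_enough_entities(tagged_tokens):
    """True when the sentence holds at least two named entities."""
    return len(merge_bio(tagged_tokens)) >= 2

tagged = [("Heart", "B-DISEASE"), ("Attack", "I-DISEASE"), ("is", "O"),
          ("unlike", "O"), ("Dengue", "B-DISEASE")]
entities = merge_bio(tagged)
```

Only sentences for which `has_enough_entities` is true would be passed on to relationship identification.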

6.2.6. Relationship Builder


The relationship builder module is used to create relationships between natural language concepts.
Relationships can be categorized using several approaches, such as hand-built patterns,
bootstrapping methods, supervised methods, etc.

This module uses a semi-supervised method, followed by a bootstrapping
method and hand-built patterns.

6.2.6.1. Relationship Categorization - Hand Built Pattern


This module recognizes hand-built patterns and classifies relationships into the
following categories. Table 6:1 gives the detailed categorization of relationship
types.

Table 6:1 Syntactic patterns for relationship extraction

Relationship Type    Relationship Pattern    Categorized Into

Synonymic            is a                    is
                     is equivalent to
                     also known as
                     is also called

Causal               caused by               caused by
                     "spread by"
                     "occurs by"

Hyponymic            such as                 similar to
                     e.g., for example
                     for instance
                     including
                     especially
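
Matching a sentence against the hand-built patterns of Table 6:1 can be sketched as below. This is an illustration under assumptions: trying longer patterns first and stripping a trailing "is" or "are" from the left-hand side are choices made for this sketch, and the function name is hypothetical.

```python
import re

# Patterns and target categories taken from Table 6:1.
PATTERNS = {
    "is": ["is a", "is equivalent to", "also known as", "is also called"],
    "caused by": ["caused by", "spread by", "occurs by"],
    "similar to": ["such as", "e.g.", "for example", "for instance",
                   "including", "especially"],
}

def categorize(sentence):
    """Return (category, left text, right text) for the first matching pattern."""
    candidates = [(p, cat) for cat, ps in PATTERNS.items() for p in ps]
    # Try longer patterns first so the most specific pattern wins.
    for pattern, category in sorted(candidates, key=lambda c: -len(c[0])):
        regex = r"(?<!\w)" + re.escape(pattern) + r"(?!\w)"
        match = re.search(regex, sentence, re.IGNORECASE)
        if match:
            left = sentence[:match.start()].strip()
            left = re.sub(r"\s+(is|are)$", "", left)  # drop a dangling copula
            right = sentence[match.end():].strip(" .")
            return category, left, right
    return None

causal = categorize("Brain disorder is caused by human immunodeficiency virus")
synonymic = categorize("Dengue is a virus.")
```
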

6.2.6.2. Relationship Categorization - Verb Relations
This is another approach, used to identify relationships which are not categorized by the hand-built
patterns. This module uses a set of natural language processing
techniques to categorize relationships by identifying the verbs and adverbs of sentences.

Using the NLTK POS tagger, the words in a sentence can be tagged appropriately. From the words
tagged as verbs and adverbs we can create new types of relationships between the
named entities identified in previous modules. The POS tags for verbs are VB, VBP, VBZ, VBD, VBG and
VBN, and the POS tags for adverbs are RB, RBR and RBS. Therefore a
simple regex pattern that recognizes tags starting with V and R can pick out the words which are
verbs and adverbs. Words are fed to this section in order, so by keeping that order we
create new relationship types from verbs and adverbs.
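
Selecting verbs and adverbs by tag prefix can be sketched as follows. The input is hand-tagged for illustration, whereas in the pipeline the tags would come from the POS tagger. Joining the words with underscores is an assumed naming convention, and the pattern matches the VB and RB prefixes, which is slightly stricter than matching any tag starting with V or R.

```python
import re

VERB_OR_ADVERB = re.compile(r"^(VB|RB)")  # VB, VBD, VBG, ..., RB, RBR, RBS

def relation_from_tags(tagged_tokens):
    """Join verbs and adverbs, in sentence order, into a relationship name."""
    words = [w.lower() for w, tag in tagged_tokens if VERB_OR_ADVERB.match(tag)]
    return "_".join(words)

# Hand-tagged example sentence (illustrative data).
tagged = [("Flu", "NNP"), ("usually", "RB"), ("spreads", "VBZ"),
          ("through", "IN"), ("droplets", "NNS")]
relation = relation_from_tags(tagged)
```
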

6.2.6.3. Relationship Classification using bag of words


To gather more relationships and to identify the relationship expressed by a sentence, this separate
module has been trained. All related data and accuracy levels are
stated in the experiment section. This module works on features collected from sentences, which are
processed to train a bag of words with the 5000 most prominent features. Using a TF-IDF vectorizer, features
which are infrequent are selected as the most important words to create the bag of words vector array.

These feature vectors are then fitted to the Random Forest classification model, which allows it
to predict classes for newly entered sentences. To do that, we use pickle to store the trained
model in a dump file. When the class of a new data element has to be predicted, we
just load the model from the dump file and use it as the classification
model.

This model has been trained to classify sentences into 5 classes: is_caused_by,
cause_symptom, has_treatment, disease_is and other. These are very basic classes to sort the data
into, which has kept the model simple and increased its accuracy.

6.2.6.3.1. TF-IDF
TF-IDF stands for Term Frequency and Inverse Document Frequency. The idea behind this
concept is to identify the most important words. Words such as is, a, the, etc. appear in almost all
documents and are the least significant. By applying TF-IDF, the
significant words can be identified by scoring each word using the following equation.

TF-IDF = TF x IDF

Where TF stands for Term Frequency and IDF stands for Inverse Document Frequency.

TF and IDF can be calculated using following two equations.

TF = Number of times term t appears in a document / Total number of words in that document

IDF = log_e(Total number of documents/ Number of documents with term t in it)

This implies that a term which appears in fewer documents, and which appears more often
in a single document, is a highly important term.
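
The two equations can be checked with a direct implementation. This sketch follows the formulas above exactly, using the natural logarithm; library vectorizers typically apply additional smoothing and normalization, so their scores differ. The function names are hypothetical.

```python
import math

def tf(term, document):
    """Term frequency: occurrences of the term divided by document length."""
    return document.count(term) / len(document)

def idf(term, documents):
    """Inverse document frequency with the natural logarithm.

    Assumes the term occurs in at least one document.
    """
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)

docs = [
    "dengue is a virus".split(),
    "dengue is caused by mosquitoes".split(),
]
score_common = tf_idf("dengue", docs[0], docs)  # term occurs in every document
score_rare = tf_idf("virus", docs[0], docs)     # term occurs in one document
```

A term that occurs in every document gets an IDF of log(1) = 0, while a term confined to one document keeps a positive score.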

6.2.6.3.2. Bag of Words


Bag of words is one of the most common techniques used in information extraction. The idea behind this
concept is to represent text sentences as a bag (multiset) of words, disregarding grammar and
word order. It is used as a tool for feature generation. By applying TF-IDF to the bag of words,
the most prominent features of a given set of documents can be generated.

Eg:

D1: dengue is a virus.

D2: dengue is caused by mosquitoes

Bag of words: [dengue, virus, is, caused, a, by, mosquitoes]

D1: [1,1,1,0,1,0,0]

D2: [1,0,1,1,0,1,1]
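
The two vectors above can be reproduced with a few lines of Python. This is a minimal sketch with a fixed vocabulary order taken from the example; the function name is hypothetical.

```python
vocabulary = ["dengue", "virus", "is", "caused", "a", "by", "mosquitoes"]

def to_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    words = sentence.lower().rstrip(".").split()
    return [words.count(term) for term in vocabulary]

first_vector = to_vector("dengue is a virus.", vocabulary)
second_vector = to_vector("dengue is caused by mosquitoes", vocabulary)
```
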

6.2.6.3.3. Random Forest Classification Model


RFC is an ensemble learning method used for classification. It operates by constructing a large
number of decision trees at training time and outputting the mode of the classes predicted by the
individual trees.

In our case we have configured this algorithm to use 100 estimators (100 decision trees).
The algorithm can be found under the sklearn ensemble algorithms and can be used in the same way as a
standard SVM or Naive Bayes algorithm.

6.2.7. OWL - Ontology Creator
A list of semantic relations is extracted from the given sentence. This structure is
guided by semantic templates, which allow finding the most suitable method to update the
ontology based on the extracted knowledge. Primary concepts are created as inter-related noun phrase
and verb phrase concepts. The templates then guide those concepts towards their proper
placement in the ontology.

The following is a sample semantic template, which demonstrates how a subject and
object are arranged in a specific template to match and map into semantic RDF triples.

eg.

Classes [disease, creature, biology ]

Main Properties [caused_by, disease, creature]

Classification [[dengue, disease], [mosquito, creature]]

Relationship [dengue, caused_by, mosquito]
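
How such a template could be flattened into triples is sketched below. The dictionary layout and function name are hypothetical, and reading the main property entry as (property, domain, range) is an assumption of this sketch rather than something stated by the template itself.

```python
template = {
    "classes": ["disease", "creature", "biology"],
    "main_properties": [("caused_by", "disease", "creature")],
    "classification": [("dengue", "disease"), ("mosquito", "creature")],
    "relationships": [("dengue", "caused_by", "mosquito")],
}

def template_to_triples(template):
    """Flatten a semantic template into (subject, predicate, object) triples."""
    triples = [(c, "rdf:type", "owl:Class") for c in template["classes"]]
    for prop, domain, range_ in template["main_properties"]:
        # Assumed reading: second and third entries are domain and range.
        triples.append((prop, "rdfs:domain", domain))
        triples.append((prop, "rdfs:range", range_))
    triples += [(ind, "rdf:type", cls) for ind, cls in template["classification"]]
    triples += list(template["relationships"])
    return triples

triples = template_to_triples(template)
```
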

6.2.8. Ontology Manager


The task of the ontology manager is to write new data into the ontology, update data and remove
irrelevant data. The ontology manager can be used as a tool to develop new concepts
from the text-based knowledge stored in the ontology.

6.2.9. OWL2RDF convertor


This is an external plugin developed using a Java-based open RDF API. It is supported by a
Windows batch file which allows communication between Python and the Java libraries created.
This module has two library components developed during this project: the OWL2RDF
convertor (converts OWL formatted files into RDF format) and the RDF2OWL convertor
(converts RDF formatted files into OWL format). This component was created because the Apache
Fuseki RDF endpoint only supports RDF content. Therefore OWL files have to be converted into RDF
format.

6.2.10. RDF- Ontology


In our design the ontology carries major weight in the project, since it has to provide
knowledge to the major processes: question type generation, answer generation and the text to
ontology learning process that identifies primary concepts.

An ontology is an explicit, formal specification of a shared conceptualization [a formal
definition]. In other words, an ontology is an abstract model which is understandable by
machines and represents shared knowledge, from which information can be extracted or
new knowledge inferred.

An OWL ontology has entities which represent OWL classes, and two types of property
assertions: object type properties and data type properties. Objects in the classes are
mapped to each other using these property relationships.

E.g. people with flu

:disease a owl:Class

:people a owl:Class

:flu :isDiseasTypeOf :disease

:people :infected :flu

In this case a (rdf:type), :isDiseasTypeOf and :infected can be identified as object type
properties which create relations among the entities. The corresponding individuals have been
mapped into their classes.

6.2.11. Apache Fuseki Endpoint


This module is used to host RDF files at an online endpoint, so that anyone outside the system
can run SPARQL queries to retrieve the knowledge stored inside this ontology.

The ontology can be updated as required after the first upload to the server via HTTP PUT requests. The
advantage of using PUT is that it overwrites the entire ontology. Therefore all the modifications
made to the local ontology are also reflected in the online ontology.
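
Preparing such a PUT request with the Python standard library can be sketched as below. The request is only built, not sent. The dataset name and URL are hypothetical and assume a local Fuseki server exposing a graph store endpoint; the actual address depends on the server configuration.

```python
import urllib.request

def build_put_request(endpoint_url, rdf_bytes):
    """Prepare (but do not send) an HTTP PUT that replaces the stored graph."""
    return urllib.request.Request(
        url=endpoint_url,
        data=rdf_bytes,
        method="PUT",
        headers={"Content-Type": "application/rdf+xml"},
    )

# Hypothetical local dataset URL; adjust to the actual Fuseki configuration.
request = build_put_request(
    "http://localhost:3030/ontology/data?default",
    b"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>",
)
# urllib.request.urlopen(request) would send it to a running server.
```

Because PUT replaces the target graph as a whole, the body should contain the complete converted ontology rather than only the changed triples.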

6.3. Question Classification


The question classifier takes natural language questions as input and classifies them according to the
expected answer and the type of the question.

Figure 6:3 Question Classification Flow

The natural language question is first presented to the preprocessor, which removes or replaces
unnecessary features and enhances features that benefit the classification process. Once the
question is processed, it is presented to the vectorizer, which uses its predetermined feature
vector to vectorize the question so that it can be fed to the classifier in vector form. The
classifier outputs the class of the question based on the expected answer and the question type.

Table 6:2 Question Classification Samples

Class Name       Description

ABT_diseases     Questions that directly request information about a particular disease
                 (ex: what is chickenpox?, want to know more about arthritis)

ABT_symptoms     Questions that request information about the symptoms of a particular disease
                 (ex: what are the symptoms of heart attack?, how to diagnose dengue fever?)

ABT_causes       Questions that request information about the causes of a disease
                 (ex: what are the causes of heart attack?, what leads to autism?)

ABT_prev         Questions that request information about the prevention methods of a disease
                 (ex: how to prevent cholera?, how to counter osteonecrosis?)

IS_sym_dis       Questions that ask to confirm or guess the disease from given symptoms
                 (ex: Are red spots and fever symptoms of dengue fever?, Difficult breathing
                 and muscle twitching)

IS_cau_dis       Questions that check whether a particular behaviour leads to a disease
                 (ex: Does STD transmit via public bathrooms?, Does excessive drinking lead to
                 liver problems?)

IS_prev_dis      Questions that check whether a particular prevention method works for a disease
                 (ex: Do vaccines prevent dengue fever?, Can I reduce the risk of having cholera
                 by boiling drinking water?)

ABT_risk         Questions about the risk posed by a particular disease
                 (ex: Is chickenpox fatal?, Can cataract lead to blindness?)

ABT_treatment    Questions about the treatment for a particular disease
                 (ex: how to treat skin burns?, how to cure chickenpox?)

6.3.1. Question Preprocessing

The question preprocessor takes the natural language question and processes it into a form that
is more suitable for the vectorization process. The purpose of the preprocessor is to enhance
the features of the question in order to increase the accuracy of the classification process.

Figure 6:4 Question Pre-processing model


The question is first tokenized into a set of words. Each word is subjected to a lemmatization
process that identifies its base (dictionary) form, to avoid repeating similar words in the
feature set. The question is then presented to the Disease Word Replace Module, which replaces
words (unigrams, bigrams) that represent diseases with a single word. This is done because the
question dataset consists of a large number of questions, each referring to a particular
disease. If each disease name were considered a feature when creating the feature set, the word
vector would be significantly larger. Replacing every disease name with the single word disease
significantly reduces the feature vector size and increases the efficiency of the classifier.
Also, when a new question is presented with a disease unseen during training, the Disease Word
Replace Module identifies the word as representing a disease and replaces it with the word
disease. This increases the accuracy of the classifier for that question.
Example:-
Original Question
What are the early symptoms of heart attack?
After the Disease word replace module
What are the early symptoms of disease?
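The replacement step can be sketched as follows. This is an illustration only:
`KNOWN_DISEASES` is a hypothetical stand-in for the WordNet and ontology lookups, and trying
the bigram before the unigram at each position is an assumption made so that multi-word names
are matched whole.

```python
# Hypothetical stand-in for the WordNet and ontology lookups.
KNOWN_DISEASES = {"heart attack", "chickenpox", "dengue fever", "arthritis"}
REPLACEMENT = "disease"


def replace_disease_words(tokens, is_disease=lambda p: p in KNOWN_DISEASES):
    """Replace disease bigrams, then disease unigrams, with one token."""
    out, i = [], 0
    while i < len(tokens):
        bigram = " ".join(tokens[i:i + 2]).lower()
        if i + 1 < len(tokens) and is_disease(bigram):
            out.append(REPLACEMENT)   # two words collapse into one token
            i += 2
        elif is_disease(tokens[i].lower()):
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out


question = "What are the early symptoms of heart attack ?".split()
print(" ".join(replace_disease_words(question)))
```

Passing a different `is_disease` function lets the same loop be driven by any lookup source.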

6.3.2. Process of disease word identification and replacement.

In each question a disease can be represented by a different number of words. Here only unigrams
and bigrams are considered. First, each word of the question is checked against words
representing diseases; if a match is found, it is replaced with the replacement word
(ex: disease). The rest of the words are then used to form bigrams, and each bigram is checked
for phrases representing diseases. If found, those phrases are replaced with the single
replacement word (ex: disease).
Two methods are used to identify disease words. The first method uses the NLTK WordNet library
as a source to identify hypernyms for the given word or phrase. The list of hypernyms is then
compared with the set of root words that represent diseases. If a match is found, the word is
identified as a disease and replaced with the replacement term.
The following set of root words is used:
{'disease', 'illness', 'cancer', 'contamination', 'defect', 'disorder', 'epidemic', 'fever',
'flu', 'sickness', 'syndrome'}

The second method looks for disease words in the ontology. It uses a predefined SPARQL query to
check whether a particular term is represented as a disease in the ontology. It searches for
instances that are of type disease and that share a name with the word or phrase being compared.

Figure 6:5 Process of disease word identification and replacement

6.3.3. N-grams
If the feature set consists only of unigrams, different sentences made of the same words (in a
different order) are represented as the same vector.
Example: "I can go home" and "can I go home" have the same vector if only unigrams are used as
the feature set. With N-grams, however, "can I" and "I can" are represented as two distinct
bigrams, which results in two different vectors. Thus it is important to use N-grams to preserve
the sentence structure when a sentence is represented as a vector.
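The effect can be checked with a few lines of Python. The `ngram_counts` helper below is a
hypothetical illustration, not the project's vectorizer.

```python
from collections import Counter


def ngram_counts(sentence, max_n):
    """Count all n-grams of length 1..max_n in a lower-cased sentence."""
    words = sentence.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


a, b = "I can go home", "can i go home"
same_with_unigrams = ngram_counts(a, 1) == ngram_counts(b, 1)
same_with_bigrams = ngram_counts(a, 2) == ngram_counts(b, 2)
print(same_with_unigrams, same_with_bigrams)
```

The two sentences are indistinguishable with unigrams alone and become distinguishable once
bigrams are added to the feature set.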

6.3.4. Vectorization
Questions must be represented in a vector form in order to train a classifier. There are several methods
that can be used to vectorize a text content.
Bag of words
Word2vec
Vector space model representation

Bag of words creates a feature vector from every word available in the dataset (documents), and
that feature vector is used to vectorize a given statement or document. It considers the
presence of each word and the frequency with which it appears. Every word carries similar weight
in this representation, which does not consider the semantics or the structure of the documents.
Word2Vec is a recent deep learning approach that vectorizes words in a way that captures their
relations with other words. However, to apply this method the dataset must contain a minimum of
over one million words. When initially tested, it resulted in very poor accuracy. Thus, as a
moderate approach, the vector space model representation is used for the vectorization.

6.3.5. Vector space model


The initial step for the vector space model is to select a feature vector to represent the given
text documents (in this case the dataset of questions). Normally, feature vectors consist of the
words, bigrams and possibly trigrams that are present in the documents. The feature vector is
then used to represent each question (or sentence) as a vector. When creating the vector for
each question (or sentence), the tf-idf (term frequency-inverse document frequency) of each
feature word of the question (or sentence) is used as the feature value. The vector is then
normalized so that the length of the question (or sentence) does not influence the vector
created from it.
Term frequency (tf) is the frequency of each feature in a question (or sentence). Idf is the
inverse document frequency. It adds a weight to each word feature so that rare words/phrases in
a sentence have a higher weight than common words/phrases when represented as a vector. Idf is
calculated as follows.

$$\mathrm{idf}(t, d) = \log\left(\frac{1 + n}{1 + \mathrm{df}(t, d)}\right) + 1$$

Here t is the term (word) and d is the document (in our case the question) for which idf is to
be found, n is the number of documents (questions) in the dataset, and df(t,d) is the number of
documents (questions) in which the term t is present. Idf is multiplied by tf (term frequency)
to calculate the tf-idf value for each term. The tf-idf representation balances common words
against rare words. After the tf-idf vector is calculated, it is normalized using the following
method.

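$$v_{norm} = \frac{v}{\lVert v \rVert_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_k^2}}$$

where v_1, ..., v_k are the components of the tf-idf vector v.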

Once normalized, the size of the sentence/document/question no longer influences the vector
values.
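The calculation can be followed step by step in plain Python. This is a minimal sketch that
assumes the smoothed idf with a natural logarithm and the Euclidean (L2) normalization; the
`tfidf_vectors` function and the three toy questions are illustrative only.

```python
import math


def tfidf_vectors(documents):
    """Return (vocabulary, L2-normalised tf-idf vectors) for tokenised docs."""
    vocab = sorted({term for doc in documents for term in doc})
    n = len(documents)
    # df(t): number of documents that contain term t
    df = {t: sum(1 for doc in documents if t in doc) for t in vocab}
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in documents:
        raw = [doc.count(t) * idf[t] for t in vocab]       # tf * idf
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0   # avoid division by zero
        vectors.append([x / norm for x in raw])
    return vocab, vectors


questions = ["what is disease", "what causes disease", "how to prevent disease"]
docs = [q.split() for q in questions]
vocab, vectors = tfidf_vectors(docs)
first = dict(zip(vocab, vectors[0]))
print(first["is"] > first["disease"])
```

In the first question the rare word "is" receives a higher weight than "disease", which occurs
in every question, and every vector has unit length regardless of question length.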

6.3.6. Classification

Vectorized questions are passed to a classifier to identify the class of each question. To find
the better classification method, three types of classifier are used and their accuracies are
compared with each other. Since the dataset is quite small, the following classification methods
were selected. The better classifier is chosen by its accuracy on the training set and the
testing set.
Logistic Regression.
Gaussian Naive Bayes.
Support Vector Machine.

6.3.7. Logistic regression

Logistic regression is a regression model that is used for classification. It is a binary
classifier that can be extended to multiple classes using one-vs-rest or cross-entropy loss
schemes. The hypothesis is a sigmoid function; inputs for which the hypothesis outputs a value
greater than or equal to 0.5 are classified as class 1 and the others as class 0. The cost
function and the gradient used by the gradient descent approach to minimize it are given below.

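$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} \log\left(h_\theta(x^{(i)})\right) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right)$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$$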
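As an illustration of the gradient descent update, the sketch below fits a binary logistic
regression to four toy points in plain Python. The learning rate, the number of epochs and the
data are arbitrary assumptions; the project itself relies on a library implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit theta (bias first) by batch gradient descent on the log loss."""
    m, n = len(xs), len(xs[0])
    theta = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for x, y in zip(xs, ys):
            features = [1.0] + list(x)          # prepend the bias term
            error = sigmoid(sum(t * f for t, f in zip(theta, features))) - y
            for j, f in enumerate(features):
                grad[j] += error * f / m        # average gradient over m examples
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def predict(theta, x):
    """Return 1 when the hypothesis is at least 0.5, otherwise 0."""
    z = theta[0] + sum(t * f for t, f in zip(theta[1:], x))
    return 1 if sigmoid(z) >= 0.5 else 0


xs, ys = [[0.0], [1.0], [3.0], [4.0]], [0, 0, 1, 1]
theta = train_logistic(xs, ys)
predictions = [predict(theta, x) for x in xs]
print(predictions)
```

Each epoch moves theta against the averaged gradient, so the decision boundary settles between
the two groups of points.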

6.3.8. Gaussian Naive Bayes

Gaussian naive Bayes is considered the baseline (popular) classification method for text-based
data. Since it considers the frequency with which words appear in documents, it is better suited
to the classification of larger documents. However, it is a simple and fast model that can be
easily implemented with little preprocessing. Naive Bayes is a conditional probability model
that calculates the probability of a hypothesis given a set of conditions. Assume X is an
n-dimensional vector representing a certain sentence. Conditional probability can be used to
calculate the probability that the sentence belongs to a particular class (type) given the
vector X. (P(Class=ABT_diseases|X) = probability that the class of the question is ABT_diseases
given the question vector X.) To achieve this, Gaussian naive Bayes uses Bayes' theorem in
combination with the chain rule. In this experiment the Gaussian naive Bayes classifier is used
as a baseline against which the accuracy of the other two classifiers is measured.
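Written out, the decision rule described above takes the following standard form. The notation
is introduced here for illustration only: C is a class, x_i is the i-th component of X, and
\mu_{C,i} and \sigma_{C,i}^2 are the mean and variance of feature i within class C.

```latex
P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)}
\;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
P(x_i \mid C) = \frac{1}{\sqrt{2\pi\sigma_{C,i}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{C,i})^{2}}{2\sigma_{C,i}^{2}}\right)
```

The predicted class is the one that maximizes the right-hand side of the proportionality.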

6.3.9. Support Vector Machine

Support vector machine is a binary classifier that uses a separating hyperplane to classify data
points. Given labeled training data, the algorithm calculates an optimal hyperplane that
categorizes new examples. Since this is a binary classifier, a one-against-one approach is used
for multi-class classification problems. An advantage of SVM is that it can handle data points
that are not linearly separable by using a non-linear kernel (ex: RBF).
Using the same dataset and vectorizer, each classifier is trained, and their accuracies on the
training and testing sets are compared to identify the classifier best suited to this
application. Once the question type is identified by the selected classifier, the question class
is presented to the answer generation module.

6.4. Answer Generation Module

Figure 6:6 Answer Generation Module High Level Architecture

The answer generation module takes a natural language query (question) as its initial input and
generates answers (output) using a semantic approach. First, the module recognizes entities in
the question using natural language processing, together with the question type provided by the
question classification module. The module then recognizes medical terms in the question using
the Unified Medical Language System (UMLS) and generates SPARQL queries from the recognized
semantic types and entities. The generated SPARQL queries are run against the ontology, which is
the output of the ontology generation module and an input to the answer generation module. When
ambiguity occurs in selecting an answer, the module asks the user questions to clarify the
original question so as to generate the best answer.

6.4.1. Name Entity Recognition

This phase uses the user's question and the question type from the classification module as its
initial inputs.
First the question is tokenized into sentences and words. A filtering process then removes stop
words from the tokenized words. Synonyms are generated for each filtered word using WordNet, and
the entities that have a connection with the ontology are identified. In general, this phase
identifies what the user is asking about and which ontology classes hold the answers to the
specific user question.

Ex: What are the symptoms of Arthritis?


Output:- Disease , Symptom (Matching classes in ontology)

For the above question this phase recognizes that there is a disease in the question and that
the user needs the symptoms of that specific disease as the answer.

6.4.2. Medical Term Recognition

The medical term recognition phase identifies medical terms such as disease names and symptoms
using the Unified Medical Language System (UMLS). UMLS is a set of files and software that
brings together many health and biomedical vocabularies and standards to enable interoperability
between computer systems. This phase uses the Metathesaurus of UMLS; UMLS also provides the
Semantic Network and the SPECIALIST Lexicon and Lexical Tools. Searching in UMLS has two
sub-phases: search by term and search by Concept Unique Identifier (CUI).

(I) Search by term in UMLS


First, for all filtered tokens this process searches UMLS for unigram terms. If no medical terms
are found in the unigram search, a bigram search is executed. Both of these searches are done
using natural language terms.

(II) Search by CUI in UMLS


After searching for terms, the system returns a list of matching CUIs for the specific natural
language terms. The process then identifies the exactly matching term and searches for that
specific CUI to retrieve the result. The CUI search gives a basic overview, the definition and
the semantic type of the specific medical term.

Based on the semantic type, this phase recognizes medical terms.

Ex: Arthritis
Search Results (2219)
C0003864 Arthritis
C0003865 Arthritis, Adjuvant-Induced
C0003868 Arthritis, Gouty
C0003869 Arthritis, Infectious
C0003872 Arthritis, Psoriatic
C0003873 Rheumatoid Arthritis
C0003875 Arthritis, Viral
C0003892 Neurogenic arthropathy ...............

CUI: C0003864
Semantic type: Disease or Syndrome

At first, medical term detection was tried with WordNet, but it failed to detect some diseases
whose names consist of more than one word. To increase the accuracy of the process, the search
for medical terms is done with the Unified Medical Language System.

6.4.3. SPARQL Query Generation

The final outputs of the Name Entity Recognition and Medical Term Recognition phases form an
array consisting of all the classes, medical terms and their values.

Ex: What leads to diabetes and what are the symptoms of it?
Output:- [[Class:Disease, Value:Diabetes], [Class:Cause, Value:''NULL''],
[Class:Symptom, Value:''NULL'']]

The SPARQL Query Generation phase generates semantic triples, consisting of subject, predicate
and object, for every sub-array that consists of a class and a value. The phase treats NULL
values as the terms that need to be answered. The query uses random variables for unrecognized
subjects and then relates the triples to each other. Using this process the system generates
SPARQL queries. The generated queries are run via the Apache Jena Fuseki SPARQL server and
extract the answer from the ontology, which is the output of the Ontology Generation Module.

The reason for using an ontology instead of a relational database is that ontologies are capable
of inferring implicit information from existing details and relations.

6.4.4. Question Clarification Module

When there is an ambiguity in selecting the answer for a given question, the system asks the
user a question to clarify the original question, then regenerates the SPARQL queries for the
clarified question and extracts the answer from the ontology.

CASE 01
Ex: I have excessive thirst and increased urination. My vision is getting blurry. Do I have
diabetes?
Output:- You have high probability of having diabetes

Here the medical term recognition phase recognizes excessive thirst, increased urination and
blurry vision as symptoms, and then searches for these symptoms in the ontology using SPARQL
queries. The search does not find any other disease with the same symptom set, so no ambiguity
occurs and diabetes is selected as the answer.

CASE 02
Ex: I have fatigue, breathing difficulties and coughing up blood. Recently I have lost a lot of
weight.
What might I be having?
Output 1:- Do you have pain in bones?
If user input YES , Output 2:- You have probability of having lung cancer
If user input NO , Output 2:- You have probability of having bronchiectasis

In this question the recognized symptoms are fatigue, breathing difficulties, coughing up blood
and weight loss. When searching for these symptoms, the module finds the same symptom set in two
places; that is, both lung cancer and bronchiectasis have all the given symptoms. This is the
case where ambiguity occurs. The system therefore compares the symptoms of the two selected
diseases and selects a symptom that is unique to one of them. For this question the system
identifies that bronchiectasis does not show the symptom of bone pain, and asks the user whether
they have bone pain in order to clarify the disease. The system selects the best answer
according to the user's input.
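The selection of the clarifying symptom can be sketched as a set difference. The function name
and the toy symptom sets below are assumptions made for illustration; in the system the symptom
sets come from the ontology.

```python
def clarifying_symptom(candidates):
    """Pick a symptom that only one of two candidate diseases shows.

    `candidates` maps disease name -> set of symptoms. Returns a tuple
    (symptom, disease_if_yes, disease_if_no) or None if no symptom differs.
    """
    (first, first_symptoms), (second, second_symptoms) = sorted(candidates.items())[:2]
    only_first = sorted(first_symptoms - second_symptoms)
    if only_first:
        return only_first[0], first, second
    only_second = sorted(second_symptoms - first_symptoms)
    if only_second:
        return only_second[0], second, first
    return None


# Toy symptom sets; the real ones come from the ontology.
candidates = {
    "lung cancer": {"fatigue", "breathing difficulties", "coughing up blood",
                    "weight loss", "bone pain"},
    "bronchiectasis": {"fatigue", "breathing difficulties", "coughing up blood",
                       "weight loss"},
}
symptom, if_yes, if_no = clarifying_symptom(candidates)
print(f"Do you have {symptom}?  yes -> {if_yes}, no -> {if_no}")
```

A "yes" from the user selects the disease that shows the distinguishing symptom, a "no"
selects the other one.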

Chapter 6
7. Implementation
7.1. Knowledge Extraction to Ontology automation
This section gives a detailed explanation of the experiments carried out and the experimental
conditions used. In some cases experiments were carried out at several levels, and the details
and accuracy improvements are stated within this section. In knowledge extraction we focused on
identifying key concepts and their relations using the following models: an NER model to
classify entities, and a relationship classification model to predict the relationship class of
a given sentence.

7.1.1. Named Entity Recognition (NER) Model


For this model we used Stanford's CRF classifier to tag entities relevant to our domain, which
is diseases and their relations. Training was carried out on manually created datasets for the
training and testing operations.

Figure 7:1 NER model operation

The flow model [Figure 7:1] describes the way the processes have been conducted.

Unstructured data was gathered from a large disease-related corpus from the CDC [11]. The data
was then preprocessed into words and their labels, with the guidance of the CDC and other
medical information providers such as PubMed and WHO, to assign each word to the appropriate
class. The constructed dataset is attached in the Appendix.

The manually tagged data was split into a training set and a test set to train the model and
test its accuracy. Using the CRF classifier, the labeled data is fitted into a model. To train
the model a separate batch file was written; it includes the path of the Java libraries to be
included and the path of the property file, which contains all other details, from the feature
selection to the dataset from which the model is trained.

Table 7:1 Property features

trainFileList classifiers/dis_train4_BOI.tok
serializeTo classifiers/dis-bio-ner-model.gz
map word=0,answer=1
useClassFeature TRUE
useWord TRUE
useNGrams TRUE
noMidNGrams TRUE
useDisjunctives TRUE
maxNGramLeng 6
usePrev TRUE
useNext TRUE
useSequences TRUE
usePrevSequences TRUE
maxLeft 3
useTypeSeqs TRUE
useTypeSeqs2 TRUE
useTypeSequences TRUE
wordShape chris2useLC

7.1.2. Supervised Relationship Type Classification Model


This model uses the pandas [12] data analysis toolkit to read the manually constructed datasets,
which are labeled by category and stored as tab-separated text documents. pandas plays a role
similar to Python's built-in file operations, but it makes complex data gathering and storing
operations easier. Separate pandas data objects are assigned for the training and test datasets.
The data is read with the following pandas settings:

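import pandas as pd

# train_path is a placeholder for the tab-separated training file.
train = pd.read_csv(
    train_path,
    header=0,               # first line of the file contains column names
    delimiter="\t",         # fields are separated by tabs
    quoting=3,              # ignore double quotes (csv.QUOTE_NONE)
    encoding="ISO-8859-1",
)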

These settings mean that the first line of the document contains the column names.

In our case: id classification relation

Encoding is set to ISO-8859-1 because some characters in the documents could not be decoded as
UTF-8. Delimiter defines how the columns are separated from each other; \t indicates tab
separation. Quoting set to 3 states that double quotes inside the document are not treated as
quoting characters.
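The meaning of the quoting value can be checked with Python's standard csv module, whose
constants pandas reuses. The sketch below is not the project's loading code; the sample row is
invented for illustration and only the three column names come from the text.

```python
import csv
import io

# Toy tab-separated content with the three columns named in the text.
sample = 'id\tclassification\trelation\n1\tcause\tSmoking "often" causes cancer\n'

reader = csv.DictReader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
print(csv.QUOTE_NONE)          # the integer passed as quoting=3
print(rows[0]["relation"])     # double quotes are kept as ordinary characters
```

With this setting a stray double quote inside a sentence cannot swallow the following tab or
newline, which keeps every row aligned with the three columns.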

7.1.2.1. Preprocessing of documents


The retrieved data is then passed through several preprocessing steps to extract features from
the sentences. First we remove HTML markup using a BeautifulSoup-based parser. It is recommended
to use an HTML parser rather than regex operations to remove HTML tags, since regex reduces the
efficiency of the code and might miss some tags if the pattern is not well defined. Next,
non-letters are removed from the sentences. The remaining sentences, which now contain only
letters, are converted to lower case and split into words. From these words we remove stopwords
with the help of NLTK's stopword list. Converting the NLTK stopword list into a set speeds up
the membership test against the words to be removed. Finally the words are joined back into
sentences with spaces.
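The steps above can be sketched with the standard library alone. Note the substitutions:
Python's built-in `html.parser` is used here in place of BeautifulSoup, and `STOPWORDS` is a
tiny illustrative set rather than NLTK's English stopword list. The function name and the
sample sentence are also assumptions.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word set; the project uses NLTK's English list.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "by", "to", "and"}


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_sentence(raw_html):
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.parts)
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)     # drop non-letters
    words = letters_only.lower().split()               # lower-case and split
    return " ".join(w for w in words if w not in STOPWORDS)


cleaned = clean_sentence("<p>Malaria is caused by the <b>Plasmodium</b> parasite (2017).</p>")
print(cleaned)
```

Keeping the stopwords in a set makes each membership test a constant-time lookup, which is the
speed-up mentioned in the text.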

7.1.2.2. Setting up the bag of features


After preprocessing, we defined the tf-idf vectorizer with its initial properties set as
follows.

TfidfVectorizer(analyzer="word", stop_words=None, ngram_range=(1, 2),
                max_df=1.0, min_df=1,
                max_features=5000, norm="l2", use_idf=True)

This vectorizer fits our preprocessed data into a vector model and transforms it into features.
In our case the number of features is capped at 5000. The feature vectors must then be converted
into an array to perform training with the classification models.
Several models were trained: a random forest classifier (RFC), a logistic regression model and a
linear SVM (SVC) model. The trained models are evaluated in the Evaluation chapter of this
dissertation.
When training the random forest classifier, the number of estimators was set to 100. This
creates 100 decision trees, which together predict the final result for an input sentence.
7.1.3. Hand built patterns to generate relations
Hand-built patterns form a relationship extraction model that recognizes the relationship
between two concepts. These two concepts are a Subject and its Object, which are linked by a
property. From the NER model we recognize the Subject and Object together with their related
classes. We then use these patterns to recognize the general relationship category that the
relationship expresses.

Figure 7:2 Relationship Construction Mechanism

7.1.4. Semantic Templates


Semantic templates are used to arrange conceptualized knowledge so that it is placed in the
correct position in the ontology that is created at the end of this ontology generation module.
The semantic templates were first arranged in a separate class with the use of Owlready. Once an
instance of the class has been created, that object is able to put data into the ontology. This
class requires data to be arranged in the way the semantic templates are organized. The design
part gives a more detailed structure of the semantic templates.

7.1.5. OWL2RDF converter


This model has been created separately as an external module on the Java platform using the OWL
API. It is linked to the Python implementation by a batch file, which is supported in the
Windows environment.

7.1.6. Ontology Endpoint


The ontology generated via this process is finally uploaded to an online endpoint. In this
project we use the Apache Fuseki web endpoint to host the created ontologies. Hosting of the
ontology is automated and integrated into the system, so updates can be carried out from time to
time as appropriate.

7.2. Question Classification Module


Following are the key steps of the implementation process
Formation of the training and testing dataset (medical related questions )
Creating the pre-processor
Training the classifier.
Accuracy measurements and improvement.

Figure 7:3 Key steps of question classification module

The dataset includes over 900 questions, with a similar number of questions representing each
class. Since the collected questions were unstructured, the class of each question was entered
manually. The next batch of testing questions was generated as a byproduct of the ontology
generation process. Instead of manually going through each question, the trained classifier was
used to classify them into classes, and errors were then corrected manually. However, the
testing set is somewhat biased towards particular question classes. It is important to have a
roughly similar number of questions for each class when training the classifier. Questions were
collected from various medical forums using web crawlers and the Python library BeautifulSoup
(which is used to collect data from various forms of web pages).

Training Dataset - 909 questions, 9 classes and roughly 100 questions per class.
Testing Dataset - 600 questions; the number of questions is somewhat biased towards some classes.

7.2.1. Preprocessor
The preprocessor is used to enhance the features of each question before it is submitted to the
vectorizer. The lemmatization and tokenization modules of the NLTK library are used for the
early stages of the preprocessor.
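from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# The specific NLTK calls are assumed; the text names only the modules.
lemmatizer = WordNetLemmatizer()

def preprocess(question_list):
    """Tokenize and lemmatize every question in the dataset."""
    processed = []
    for question in question_list:
        tokens = word_tokenize(question)
        processed.append([lemmatizer.lemmatize(word) for word in tokens])
    return processed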

Once the question is tokenized and lemmatized, it is presented to the Disease Word Replace
Module. This module searches each sentence for words or phrases that represent a disease and
replaces the word/phrase with the single term disease. To increase efficiency, bigrams
representing disease phrases are replaced first. After that, the rest of the words are checked
as unigrams.

Figure 7:4 Question conversion to preprocessed question

There are two methods used for finding disease words.


Using hypernyms from wordnet corpus of nltk library
Using ontology class matching.

NLTK is a natural language processing toolkit/platform. It includes a corpus named WordNet,
which is a lexical database for the English language. WordNet can be used to find the meanings
of words, synonyms, antonyms and also hypernyms.

Hypernyms :- a hypernym of a word is a word with a related but broader meaning. For example, a
hypernym of car is vehicle. With the NLTK WordNet corpus, the hypernym path of a given word can
be found.
Example :- hypernym path for word heart_attack is
Entity > physical_condition > disorder > cardiovascular_disease

If the hypernym path of a particular word leads to words such as disease, illness, cancer,
contamination, defect, disorder, epidemic, fever, flu, sickness and syndrome, it can be decided
that the word is some kind of name for a disease. The following function is used to find
hypernyms using the NLTK WordNet corpus.

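from nltk.corpus import wordnet as wn

DISEASE_ROOTS = {"disease", "illness", "cancer", "contamination", "defect", "disorder",
                 "epidemic", "fever", "flu", "sickness", "syndrome"}

def find_disease_word(word):
    """Return True if any hypernym path of `word` passes through a root word."""
    for synset in wn.synsets(word):
        for path in synset.hypernym_paths():
            for hypernym in path:
                if DISEASE_ROOTS & set(hypernym.lemma_names()):
                    return True
    return False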

The second method uses the central ontology from which answers are generated. There, diseases
are recorded as instances of rdf:type disease. Thus, SPARQL queries can be used to find out
whether a disease with the given name exists.

SELECT (COUNT(?disease) AS ?count)
WHERE {
    ?disease rdf:type base:disease .
    ?disease base:label ?strName .
    FILTER regex(?strName, "chickenpox", "i")
}

If the count returned is 1 or more, it can be concluded that a disease with the name chickenpox
exists.

7.2.2. Advantage of having a disease word replacer


The number of feature words reduces significantly (unwanted features can reduce the efficiency
of the classifier).
When the ontology grows with new information, the classifier does not need to be retrained or
improved.
    o Suppose the input question is "what are the symptoms of cystinosis". The classifier was
      not trained with the word cystinosis and fails to identify it as a disease. If cystinosis
      is identified as a disease in the ontology, the preprocessor replaces the word cystinosis
      with the word disease. Now the classifier can easily identify that the question is in the
      ABT_symptoms class.

7.2.3. Vectorization Process

Vectorization is done using TfidfVectorizer from the sklearn.feature_extraction.text module.
Preprocessed questions are transformed into vectors using the TfidfVectorizer.fit_transform
function.

vectorizer =TfidfVectorizer(min_df=1,ngram_range=(1,2),token_pattern=r'\b\w+\b')

min_df, the minimum document frequency, is set to one: a term is included in the feature set
only if it appears in at least one document of the training dataset. ngram_range is set to
(1, 2) to consider both unigrams and bigrams. TfidfVectorizer performs the function of
CountVectorizer (creating the feature set and forming vectors from each term frequency) and
additionally weights each frequency by the rarity of the word, converting it to a tf-idf value.
The output vectors of TfidfVectorizer are also normalized. Once the vectorizer is created, it is
saved using Python's pickle module so that it can be reused in the training, testing and usage
phases of the classifier.

7.2.4. Classification process.


Three classification algorithms are used: logistic regression, Gaussian naive Bayes and support
vector machine. The Gaussian naive Bayes classifier is used as a baseline for the other two
classifiers. The training set consists of 909 balanced questions, all preprocessed and
vectorized. The testing set consists of 600 questions that are also preprocessed and vectorized.
The same training set is subjected to each classifier under the following conditions.

Pre_processor
Without disease word replacement.
With disease word replacement.

Vectorizer
Ngram_range is set to 1
Ngram_range is set to 1-2
Ngram_range is set to 1-3

Logistic regression
Multi_class = multinomial, that is, a cross-entropy loss scheme is used for multi-class
classification (recommended in [PAS_6]).

7.2.5. Accuracy calculation

For both training set and testing set accuracy is calculated as shown below.
Accuracy = Correct Predictions / Total number of tests
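The formula translates directly into a small helper. The function name and the sample labels
below are illustrative only.

```python
def accuracy(predicted, expected):
    """Fraction of predictions that equal the expected class labels."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


predicted = ["ABT_symptoms", "ABT_causes", "ABT_prev", "ABT_risk"]
expected = ["ABT_symptoms", "ABT_causes", "ABT_treatment", "ABT_risk"]
score = accuracy(predicted, expected)
print(score)
```

Three of the four sample predictions match, so the sketch reports an accuracy of 0.75.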

7.3. Answer Generation Module

7.3.1. Name Entity Recognition

The Python Natural Language Toolkit (NLTK) was used to tokenize questions and remove stop words
in the Name Entity Recognition phase. The first approach to finding entities searches for
synsets in WordNet and gets the hypernym path for each synonym. Classes are then identified by
matching synonyms. To make sure that the module does not miss any entities in questions, a
second approach to entity finding is executed on the tokens. This approach uses an already
defined synonym set for each class in the ontology. Tokens are iterated through those synonym
sets to identify the corresponding classes in the ontology. Before searching for matching
synonyms, morphological affixes are removed from the tokens and from the given synonym sets so
as to reduce all words to their basic form. This is done using the Porter Stemmer function in
NLTK. The code for this approach is given below.

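from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def find_class(tokens, synonyms_list, ontology_entity):
    """Return the ontology class of the first token matching a synonym."""
    for token in tokens:
        for row_number, syn_set in enumerate(synonyms_list):
            for synonym in syn_set:
                # Compare stems so that inflected forms still match.
                if stemmer.stem(token) in stemmer.stem(synonym):
                    return ontology_entity[row_number]
    return None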

7.3.2. Medical Term Recognition
Medical Term Recognition is done through the UMLS API. UMLS version 2016AB is used for the
search.
Authentication for accessing the API is obtained with the API key given by UMLS. First, words
are searched through the API to identify the matching CUI (Concept Unique Identifier). The
selected CUI is then searched again through the API so as to obtain its semantic type. The
outputs of the API requests are converted into JSON format.
7.3.2.1. Pseudo code for searching by term:
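import json
import requests

# AuthClient, uri and content_endpoint are assumed to be set up elsewhere.
def search_by_term(string):
    """Return the CUI of the first exact match for `string`, or None."""
    access = AuthClient.getst(AuthClient.gettgt())
    query = {'string': string, 'ticket': access, 'pageNumber': 1}
    query['searchType'] = "exact"
    r = requests.get(uri + content_endpoint, params=query)
    r.encoding = 'utf-8'
    items = json.loads(r.text)
    jsonData = items["result"]
    for result in jsonData["results"]:
        cui = result["cui"]
        return cui
    return None  # no result on the first page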

7.3.2.2. Pseudo code for searching by CUI:

query = {'ticket': AuthClient.getst(AuthClient.gettgt())}
r = requests.get(uri + content_endpoint, params=query)
r.encoding = 'utf-8'
items = json.loads(r.text)
jsonData = items["result"]
FOR stys IN jsonData["semanticTypes"]:
    semantic_type = stys["name"]
    RETURN semantic_type
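
The parsing step of the CUI lookup can be illustrated without network access. The sketch below assumes a response body shaped like the fields the pseudo code reads ("result", "semanticTypes", "name"). The sample payload is hypothetical and the function name is an assumption made for this example.

```python
import json


def extract_semantic_types(response_text):
    """Return the semantic type names found in a CUI lookup response."""
    items = json.loads(response_text)
    result = items.get("result", {})
    return [sty["name"] for sty in result.get("semanticTypes", []) if "name" in sty]


# Hypothetical response body, shaped like the fields the pseudo code reads.
sample = json.dumps(
    {"result": {"semanticTypes": [{"name": "Disease or Syndrome"}]}}
)
print(extract_semantic_types(sample))
```

Unlike the pseudo code, this sketch returns every semantic type rather than only the first one, and returns an empty list when the field is missing.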

7.3.3. SPARQL Query Generation

The input for this phase is an array of (class, value) tuples. The value that the module needs
to search for is given as NULL. A SPARQL query pattern is generated for each (class, value)
tuple, and the patterns are then matched with relations. Pseudo code for generating queries is
given below (?rv denotes a randomly named string variable).

FOREACH tuple IN entity_List:
    IF (class != "NULL" AND value != "NULL"):
        ?rv1 TYPE :class
        ?rv1 LABEL value
    IF (class != "NULL" AND value == "NULL"):
        ?rv2 TYPE :class
        ?rv2 LABEL ?NULL
?rv1 ?rv2 ?rv3
RETURN ?NULL

The SPARQL queries are run on an Apache Jena Fuseki SPARQL server.
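
A possible Python rendering of the query generation step is sketched below. Several details are assumptions made for illustration and are not confirmed by the text: TYPE and LABEL are mapped to rdf:type and rdfs:label, the ontology prefix is a placeholder URI, labels are matched as plain literals, and each known entity is linked to each unknown entity through a free relation variable.

```python
import itertools


def build_query(entities, prefix="http://example.org/medical#"):
    """Build a SPARQL SELECT query from (class, value) tuples.

    A value of None marks the entity the question is asking for.
    """
    counter = itertools.count(1)
    patterns, known_vars, unknown_vars = [], [], []
    for class_name, value in entities:
        var = f"?rv{next(counter)}"
        patterns.append(f"{var} rdf:type :{class_name} .")
        if value is None:
            patterns.append(f"{var} rdfs:label ?answer .")
            unknown_vars.append(var)
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            patterns.append(f'{var} rdfs:label "{escaped}" .')
            known_vars.append(var)
    # Link every known entity to every unknown one through any relation.
    for known in known_vars:
        for unknown in unknown_vars:
            patterns.append(f"{known} ?rel{next(counter)} {unknown} .")
    body = "\n  ".join(patterns)
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f"PREFIX : <{prefix}>\n"
        f"SELECT DISTINCT ?answer WHERE {{\n  {body}\n}}"
    )


print(build_query([("Disease", "diabetes"), ("Symptom", None)]))
```

The resulting string could then be sent to the SPARQL endpoint. The class names "Disease" and "Symptom" are hypothetical examples.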

7.3.4. Question Clarification Module

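When the ontology returns more than one candidate answer, the module asks the user a yes/no
question about a value that appears in only one of the candidates. Pseudo code for this module
is given below.

IF answer_count == 1:
    RETURN answer
ELSE IF answer_count > 1:
    FOR value IN answer1:
        IF value NOT IN answer2:
            unique_value = value
            question_generation(unique_value)
            IF answer_unique_value == "YES":
                RETURN answer1
            ELSE:
                RETURN answer2
    ELSE:
        RETURN answer_list
ELSE:
    RETURN "not enough facts"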
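
The clarification logic can be sketched in Python as follows. This is an illustration under assumptions: the user interaction is modelled as a callback named `ask` (hypothetical), only the first two candidates are compared as in the pseudo code, and the sample candidates are invented for the example.

```python
def clarify(answers, ask):
    """Pick one answer, asking the user about a distinguishing value if needed.

    answers: list of candidate answers, each a list of values (e.g. symptoms).
    ask: callable that takes a value and returns True for "yes".
    """
    if not answers:
        return "not enough facts"
    if len(answers) == 1:
        return answers[0]
    first, second = answers[0], answers[1]
    for value in first:
        if value not in second:
            # This value separates the two candidates, so ask about it.
            return first if ask(value) else second
    # No distinguishing value was found: return every candidate.
    return answers


# Hypothetical candidates; the user confirms having a rash.
candidates = [["fever", "rash", "joint pain"], ["fever", "cough"]]
print(clarify(candidates, ask=lambda value: value == "rash"))
```

Modelling the question as a callback keeps the selection logic testable without a user interface.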

Chapter 7
8. Evaluation
8.1. Ontology Automation Modules Evaluation
8.1.1. Data Retrieval
The objective of this model is to automate data gathering from online web sources. This has
been achieved with a web spider that crawls through web pages and organizes the data into local
directories. Initial approaches carried out information retrieval on individual semi-structured
web pages. The approach was then developed further to retrieve unstructured data as well. The
control experiments were limited to a single web domain, from which most of the data
significant to ontology development in the medical domain was gathered.

8.1.2. Information Conceptualization


The target of the information conceptualization model was to recognize subjects and objects
so that they can be used in semantic templates to automate ontology generation. To identify
concepts, we used a Named Entity Recognition (NER) model. Once entities are tagged with their
respective class tags, they can be classified into ontology classes.

This conceptualization model was developed by training an NER tagging model with Stanford's
CRF classifier. A CRF classifier is a suitable approach for applications such as POS tagging,
binary classification and named entity classification.

Following table indicates the accuracies for different levels.

Table 8:1 NER iterative classification performances

Level of   Size of the   Classification   Accuracy
Training   Data Set      Used             Precision        Recall           F1_Score
1          2556          CRF              0.690318627451   0.680196078431   0.689900110988
2          33554         CRF              0.799929721441   0.818681318681   0.76328489306

These accuracies were obtained with a Python-based accuracy score tester, which calculates the
precision, recall and F1-score measures.
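
For reference, the three measures can be computed for a single tag as sketched below. This is not the project's tester: the function name, the per-label formulation and the sample tag sequences are assumptions made for illustration, and the averaging scheme used for the figures in Table 8:1 is not stated in the text.

```python
def precision_recall_f1(expected, predicted, label):
    """Compute precision, recall and F1 for one label from parallel tag lists."""
    tp = sum(1 for e, p in zip(expected, predicted) if e == label and p == label)
    fp = sum(1 for e, p in zip(expected, predicted) if e != label and p == label)
    fn = sum(1 for e, p in zip(expected, predicted) if e == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted tags, for illustration only.
expected = ["B-DISEASE", "I-DISEASE", "O", "O", "B-DISEASE", "O"]
predicted = ["B-DISEASE", "O", "O", "B-DISEASE", "B-DISEASE", "O"]
print(precision_recall_f1(expected, predicted, "B-DISEASE"))
```

In this example the B-DISEASE tag has two true positives, one false positive and no false negatives.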

8.1.3. Relationship Identification
During the implementation of this model it was difficult to find datasets relevant to the
selected domain. Therefore, datasets were prepared from web-based data sources and tagged
iteratively while training the classifiers and improving their accuracy. Table 8:2 shows the
levels of training carried out and the accuracy gained at each level.

Table 8:2 Accuracy Comparison between levels of Learning

Level of   Size of the Data   Classifier   Precision     Recall        F1_Score
Training   Set (documents)
1          300                RFC          0.497759075   0.507692308   0.44534499
2          500                RFC          0.559786974   0.564102564   0.556693338
3          750                RFC          0.606862304   0.61025641    0.601844704
4          1050               RFC          0.635260831   0.620512821   0.612085159
5          1226               RFC          0.647260905   0.641025641   0.633216417
5          1226               SVM          0.599861657   0.584615385   0.572464081
5          1226               LRM          0.635601292   0.625641026   0.612799712

This shows that the model reaches higher accuracies as the training data set grows. The results
of Zhou et al. [13], who used a similar approach, show that the maximum accuracy (precision)
obtained is around 69%. Their results are given in the following table.

Table 8:3 Zhou et al. results [13]

Features             Precision   Recall   F1_Score
Words                69.2        23.7     35.3
Entity type          67.1        32.1     43.4
Mention Level        67.1        33.0     44.2
Overlap              57.4        40.9     47.8
Chunking             61.5        46.5     53.0
Dependency Tree      62.1        47.2     53.6
Parse Tree           62.3        47.6     54.0
Semantic Resources   63.1        49.5     55.5

8.1.4. Ontology Automation


Most studies have carried out this operation using semi-automated approaches, which start by
filling in some of the ontology data manually using packages such as Protégé and then fill in
the remaining ontology using automated approaches. The objective of our model is to fully
automate this ontology generation. By applying several Named Entity Identification and
Relationship Extraction techniques, this module has achieved its objective.

8.2. Question classification Evaluation Model


8.2.1. Following tests were performed.

• Different n-gram counts for the tf-idf vectorizer.
    o Find the n-gram count with maximum accuracy.
• With and without the disease word replacement module.
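
The two test dimensions above can be illustrated with a small sketch. It shows only the feature side (disease word replacement followed by word n-gram extraction) and not the tf-idf weighting or the classifiers. The placeholder token "DISEASE", the function names and the disease list are assumptions made for this example. How the actual module detects disease names is not described in this section.

```python
import re


def replace_disease_words(question, disease_names, placeholder="DISEASE"):
    """Replace every known disease name in the question with a placeholder."""
    # Longest names first so "dengue fever" wins over "dengue".
    for name in sorted(disease_names, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        question = re.sub(pattern, placeholder, question, flags=re.IGNORECASE)
    return question


def word_ngrams(text, max_n):
    """Return all word n-grams of length 1..max_n, in order of n."""
    words = re.findall(r"\w+", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams


question = "What are the symptoms of dengue fever?"
normalized = replace_disease_words(question, ["dengue", "dengue fever"])
print(normalized)
print(word_ngrams(normalized, 2))
```

Replacing disease names with one shared token lets questions about different diseases produce the same n-grams, which is one plausible reason the replacement module improves accuracy.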

8.2.2. For the Training Set

Table 8:4 with disease word replacement module

Classifier            unigram   bi-gram   tri-gram
Logistic regression   90.788    92.34     92.34
Naive bayes           84.29     91.076    92.229
SVM                   91.33     91.44     91.44

Table 8:5 without disease word replacement module

Classifier            Without disease word replacement module
Logistic regression   87.22
Naive bayes           82.13
SVM                   86.55

8.2.3. For the Testing dataset

Table 8:6 with disease word replacement module

Classifier            unigram   bi-gram   tri-gram
Logistic regression   81.54     83.87     83.87
Naive bayes           76.23     80.34     81.51
SVM                   83.23     85.43     85.43

Table 8:7 without disease word replacement module

Classifier            Without disease word replacement module
Logistic regression   74.39
Naive bayes           75.15
SVM                   73.12

The test question set contained a small number of questions whose meanings varied from those of
their respective classes. Questions that combine symptoms and preventions were often identified
incorrectly. For example, "What are the symptoms and preventions of heart attack" was identified
as ABT_symptoms, and "What are the preventions and symptoms of heart attack" was identified as
ABT_Preventon. However, the class that best suits such questions is ABT_diseases, since users can
then get the overall information about the disease, including the symptoms and prevention
methods. The testing set did not have enough examples from the IS_sym_dis, IS_caus_dis and
IS_prev_dis classes. However, those questions were correctly predicted.
From both the training set and testing set accuracies it is clear that using bigrams increases
the accuracy significantly. However, going from bigrams to trigrams did not increase the accuracy
in most cases. Logistic regression showed high accuracy in all situations. The disease word
replacement module has a significant impact on the accuracy of the classifier: roughly a 10%
increase in accuracy can be seen when the module is used.
The multiclass classification module with 6 primary classes in [6] reported a maximum accuracy
of 92.2% using bigrams, head words and wh-words, with over 5000 training examples. This
classifier, with 9 classes, managed to achieve similar accuracy with less training data. Since
this work is focused on the medical domain, the classes used for classification are more
specific than the classes used in the multiclass classifier mentioned in the literature review
section [6]. In order to change the domain, a new generalized class system needs to be
introduced for the classification process.

8.3. Answer Generation Module


This system can answer some specific types of questions. It gives correct answers to questions
that directly ask about the causes, prevention, symptoms, treatment and risk of a disease. In
addition, the system is capable of predicting a disease when the user provides a list of the
symptoms they have. However, the system cannot answer very specific questions such as "What is
the diabetes blood sugar range?" or "What is the normal blood sugar range?". To answer these
types of questions, the ontology should be extended with new criteria, classes and more
information. Some question types and the system's ability to answer them correctly are listed
below.
Table 8:8 Ability of providing correct answer

Question types                 Example                                              Capability of
                                                                                    answering
Descriptive/disease            What is Dengue?                                      YES
                               Want to know about Alzheimer's Disease.
Direct/list/causes             What stimulate/trigger diabetes?                     YES
                               What leads/drives to diabetes?
                               What are the causes of diabetes?
Direct/list/symptoms           What are the symptoms/sign/indication of             YES
                               Arthritis?
                               What are the clues that I am having arthritis?
                               How to diagnose arthritis?
Direct/list/prevention         How can I prevent from Autism?                       YES
                               How to stop/halt/deter/hinder/restrain Autism?
                               Autism forestall/foreclose ways?
Direct/list/treatment          Medication for skin burn?                            YES
                               What are the treatments for skin burn?
Direct/yes-no/causes           Does STD transmit via public bathrooms?              YES
Direct/yes-no/symptoms         Is excessive thirst symptom of diabetes?             YES
Direct/yes-no/prevention       Does vaccines prevent dengue fever?                  YES
                               Can I reduce the risk of having cholera by boiling
                               drinking water?
Yes-no or                      Is chickenpox fatal?                                 YES
descriptive/risk               Can cataract lead to blindness?
Specific/prediction            I have fatigue, breathing difficulties and           YES
                               Coughing up blood. Recently I have loss my
                               weight a lot. What may be am I having?
Direct/specific                Normal blood sugar range?                            NO
Questions which don't have     Drugs                                                NO
enough details in ontology     Diet
to answer                      Emergency
                               Side effects
                               Health habits
                               When to meet doctor
                               severeness of disease
                               How treatments work
                               Vaccine
                               Details of test/exams/surgeries

Chapter 8
9. Conclusion and Further Work

Within the one-year time frame of this project, the data extraction and ontology automation
model has taken the basic steps towards ontology definition for any given domain. In other
words, by following the initial steps described here, an ontology definition can be developed
for another domain. However, a drawback of this model is that it requires large data sets, and
the training data has to be improved manually to improve the accuracy of the model. For general
topics this can be addressed when datasets are available. Relationship extraction is heavily
dependent on the NER and supervised relationship recognition models. The NER model is at a
satisfactory level, with accuracies above 75%. Therefore entities can be recognized with high
probability. But relationship extraction is below the satisfactory level (around 63%). This
undesired outcome could be a result of taking a domain-specific approach to training the model.
However, with an increased volume of data this can be improved further, at least up to 70%.

Question identification is an important step in creating an intelligent Q & A system. Since
questions can come in different forms, it is important to identify their intention before
answering them. A group of classes can be used to represent different types of questions (with
different intentions/expectations). That group of classes must be broad enough to cover most
of the common questions asked in the domain and specific enough to tackle questions that
request specific details about a certain topic. Our question classifier managed to cover a
broad area of questions and correctly identify them based on the intended answer type. However,
in some cases it failed to identify questions that ask for small, specific details about
certain topics. This shows that the question classification module can benefit from a secondary
classifier that classifies each primary class into its own secondary classes.

This work presents a semantic and linguistic approach for the extraction of medical entities
using semantic relations in the medical domain. This approach has five main steps.

(I) Natural language query analysis
(II) Entity and term recognition
(III) SPARQL query generation
(IV) Extracting answers from ontology
(V) Ambiguity clarification

The accuracy of the module is based on the precision of the identified terms and entities. The
use of UMLS and WordNet has increased the performance of the system. In addition, the
effectiveness and accuracy of the system depend on the accuracy of the question classification
module. This module is useful for common users, and it can be developed further for use by
doctors through ontology expansion and pattern recognition between diseases and other entities
and relations.

Further work

• Expanding the ontology with more classes, entities and information
• Getting descriptive user feedback to further solve the ambiguity issues
• Usage of other standard medical libraries to increase the accuracy (OMIM - Online
  Mendelian Inheritance in Man, NCBI - National Center for Biotechnology Information,
  MeSH - Medical Subject Headings)
• Referencing previously asked keywords by a specific user
• Question linking (extracting info from previous questions)
• Integrating a spell corrector

References
[1] R. J. Bayardo et al., "InfoSleuth: Agent-Based Semantic Integration of Information in Open
and Dynamic Environments", Microelectronics and Computer Technology Corporation, Austin,
Texas.
[2] A. Maedche and S. Staab, "The TEXT-TO-ONTO Ontology Learning Environment", Institute
AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany.
[3] P. Cimiano, A. Madche, S. Staab and J. Volker, "Ontology Learning", Institute AIFB,
University of Karlsruhe, Karlsruhe, Germany.
[4] S. Mishra and S. Jain, "Automatic Ontology Acquisition and Learning", Department of
Computer Applications, Teerthankar Mahaveer University, Moradabad, U.P, 2014.
[5] M. Shamsfard and A. Barforoush, "Learning Ontologies from Natural Language Texts",
Computer Engineering Dept., Amir Kabir University of Technology, Hafez ave., Tehran, Iran.
[6] Z. Huang, M. Thint and Z. Qin, "Question Classification Using Head Words and Their
Hypernyms", Proceedings of the Conference on Empirical Methods in Natural Language
Processing - EMNLP '08, 2008. [Online]. Available:
http://www.aclweb.org/anthology/D08-1097.
[7] X. Li and D. Roth, "Learning Question Classifiers", 2004. [Online]. Available:
http://cogcomp.cs.illinois.edu/Data/QA/QC/. [Accessed: 01- Feb- 2017].
[8] A. M. Moussa and R. F. Abdel-Kader, "QASYO: A Question Answering System for YAGO
Ontology", Undergraduate, Electrical Engineering Department, Faculty of Engineering,
Port-Said University, 2011.
[9] C. Pradel, O. Haemmerle and N. Hernandez, "Natural Language Query Interpretation into
SPARQL Using Patterns", IRIT, Universite de Toulouse le Mirail, 2013.
[10] J. Lehmann and L. Buhmann, "AutoSPARQL: Let Users Query Your Knowledge Base",
University of Bonn, University of Leipzig, 2011.
[11] "CDC", Centers for Disease Control and Prevention. [Online]. Available:
https://www.cdc.gov/. [Accessed: 15- Feb- 2017].
[12] "pandas: powerful Python data analysis toolkit", Pandas.pydata.org, 2017. [Online].
Available: http://pandas.pydata.org/pandas-docs/stable/. [Accessed: 03- Jun- 2017].
[13] M. Mintz, S. Bills, R. Snow and D. Jurafsky, "Distant supervision for relation extraction
without labeled data", Stanford University / Stanford, CA 94305.
[14] [Online]. Available: https://www.nlm.nih.gov/research/umls/
[15] [Online]. Available: https://www.w3.org/TR/rdf-sparql-query/
[16] [Online]. Available: https://jena.apache.org/tutorials/sparql.html

Appendix A
Individual Contribution

Name of the student: P. N Udawatta (124181E)

As a member of the Beyond project group I selected the question classification module of the
intelligent question answering system. The goal of the question classification module is to
train a question classifier that can identify the question type based on the expected answer to
the question. During the early stages (before the interim) my contributions to the project were
as follows.
• Populating the testing ontology using data collected by traversing web pages with the
  beautifulsoup module (for testing purposes).
• Identifying a set of classes for the question classifier.
• Building training and testing datasets based on the classes defined.
• Training the prototype classifier.
After the interim period my focus was on improving the accuracy using different methods.
• Creation of the question preprocessing system.
• Implementation of the vector-based model.
• Using 3 different classifiers and tweaking small details of the training process to reach
  a higher accuracy.
• Studying the possibility of a generalized classification model that can be used to expand
  the domain.

Name of the student: I. A. Abeysekera (124003M)

I was responsible for researching a new concept: collecting data from data sources and mapping
that unstructured data into an ontology. This seemed a minor task at the beginning. However, on
digging deeper I found that retrieving high-quality information is important for providing
quality answers. There was a high correlation between question answering and data extraction.
Therefore I went through several related works that have carried out similar approaches. While
gathering this information my mindset changed and I understood how those principles are applied
in text mining and data gathering.

Following are the areas in which I was most involved during this research study:

• Worked on two different text learning models: an NER classification model and a supervised
  relation recognition model.
• Developed a hand-written pattern-based relation recognition model.
• Improved a data extraction model based on a web spider.
• Manually trained data sets using an approach similar to iterative bootstrapping.
• Created an ontology auto-generation model.
• Spent adequate time on generalizing the ontology generation model.
• Carried out a set of experiments on others' work, such as Word2Vec, SyntaxNet and
  TensorFlow, to assess whether those models can be used in the project.

As a group member I always had to communicate with my colleagues to keep the individual
processes on track. I carried out the design phase by continually asking what inputs they
needed from my system.

During the final implementation I worked mostly on improving the accuracies of the trained
models. At this stage I had to carry out several test cases related to the ontology results.
Because the relations extracted by the system kept showing new patterns that had to be
categorized properly, I had to test several grammar types on chunk parsers and other relation
extraction models. I trained two separate models to recognize entities: a named entity
recognition model and a supervised relationship extraction model. This was a great new
experience in learning several machine learning approaches and their internal mechanisms.

Name of student: K.M.K.Hasantha (124069T)

As a member of the project group I chose the Answer Generation Module of the Intelligent
Question Answering System. This module includes key term extraction from natural language
queries and automatic generation of SPARQL queries to extract answers from ontologies.

The areas of this process that I contributed to are listed below.

• Entity recognition in natural language queries
• Medical term recognition
• SPARQL query generation
• Extracting answers from ontology
• Natural language question generation for ambiguity clarification
• User interface design

Appendix B
Sample Dataset used during NER classification Module

Dengue B-DISEASE
fever I-DISEASE
is O
a O
disease B-NN
caused O
by O
a O
family O
of O
viruses B-NN
that O
are O
transmitted O
by O
mosquitoes B-CREATURE
. O
Symptoms O
of O
dengue B-DISEASE
fever I-DISEASE
include O
severe O
joint O
and O
muscle O
pain O
, O
swollen O
lymph O

Sample Dataset used to train Supervised Relationship extraction model

id    classification   relation
301   cause_symptom    They can also build up and cause inflammation.
302   other            Normally your blood doesn't have a large number of eosinophils.
303   is_caused_by     Your body may produce more of them in response to, allergic
                       disorders, skin conditions, parasitic and fungal infection, autoimmune
                       diseases, some cancers, and bone marrow disorders.
304   cause_symptom    In some conditions, the eosinophils can move outside the bloodstream
                       and build up in organs and tissues.
305   cause_symptom    Symptoms of EoE include nausea, vomiting, and abdominal pain after
                       eating.
306   cause_symptom    A person may also have symptoms that resemble acid reflux from the
                       stomach.
307   cause_symptom    In older children and adults, it can cause more severe symptoms, such
                       as difficulty swallowing solid food or solid food sticking in the
                       esophagus for more than a few minutes.
308   is_caused_by     In infants, this disease may be associated with failure to thrive.
309   has_treatement   In some situations, avoiding certain food allergens will be an effective
                       treatment for EoE.
310   disease_is       Eosinophilic fasciitis is a very rare syndrome in which muscle tissue
                       under the skin, called fascia, becomes swollen and thick.
