
An Analysis of the AskMSR Question-Answering System

Question answering systems

They can be either:

Data redundancy based


focusing on data per se as a key resource available to drive our system design
the greater the answer redundancy in the source data collection, the more likely it is that we
can find an answer that occurs in a simple relation to the question

Based on sophisticated linguistic analysis of either the question or the candidate answers


Linguistic resources include: part-of-speech tagging, parsing, named entity extraction,
semantic relations, dictionaries, and WordNet.
systems typically perform complex parsing and entity extraction for both queries and best
matching Web pages, and maintain local caches of pages or term weights
uncovering complex lexical, syntactic, or semantic relationships between question string and
answer string
The need for anaphora resolution and synonymy, the presence of alternate syntactic
formulations, and indirect answers all make answer finding a potentially challenging task.

Anaphora resolution is the problem of resolving what a pronoun or a noun phrase refers to.
In the following example, 1) and 2) are utterances; together, they form a discourse.

1) John helped Mary.

2) He was kind.

As humans, readers and listeners can quickly and unconsciously work out that the pronoun "he" in
utterance 2) refers to "John" in 1). The underlying process of how this is done is still unclear,
especially when we encounter more complex sentences:

An example involving Noun phrases (Webber 93)

1a) John traveled around France twice.

1b) They were both wonderful. ??

2a) John took two trips around France.

2b) They were both wonderful.

In 1b) the pronoun "they" lacks an accessible antecedent, because "twice" does not introduce discourse referents, whereas "two trips" in 2a) does, which is why 2b) is felicitous. Consequently, anaphora resolution presents a challenge and is an active area of research.


Question answering is a multidisciplinary field

Information retrieval (IR), information extraction (IE), machine learning (ML), and natural language processing (NLP)

The architecture of the system can be described by four main steps:


1. Query reformulation,
2. n-gram mining,
3. filtering, and
4. n-gram tiling.

Query Reformulation
Given a question, the system generates a number of weighted rewrite strings which are likely
substrings of declarative answers to the question.
For example, When was the paper clip invented? is rewritten as The paper clip was
invented.
Approaches to query reformulation:
o Rule-based approach
o Machine-learning approach (learn query-to-answer reformulations and their weights)
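
As a rough illustration of the rule-based flavor, the sketch below maps question patterns to weighted declarative rewrites. The patterns, weights, and function name are illustrative assumptions, not AskMSR's actual rewrite rules:

```python
import re

# Illustrative rewrite rules: (question pattern, declarative template, weight).
# Higher weights go to more precise rewrites, in the redundancy-based spirit.
REWRITE_RULES = [
    (r"^when was (.+) invented\?$", r"\1 was invented", 5),
    (r"^where is (.+)\?$",          r"\1 is located in", 5),
    (r"^who (is|was) (.+)\?$",      r"\2 \1", 3),
]

def reformulate(question: str):
    """Generate weighted declarative rewrite strings for a question."""
    q = question.strip().lower()
    rewrites = []
    for pattern, template, weight in REWRITE_RULES:
        match = re.match(pattern, q)
        if match:
            rewrites.append((match.expand(template), weight))
    # Fall back to a plain bag-of-words query with a low weight.
    rewrites.append((q.rstrip("?"), 1))
    return rewrites

print(reformulate("When was the paper clip invented?"))
# [('the paper clip was invented', 5), ('when was the paper clip invented', 1)]
```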

N-gram Mining
From the page summaries returned by the search engine, n-grams are collected as possible
answers to the question.

N-Gram Filtering
Next, the n-grams are filtered and reweighted according to how well each candidate matches
the expected answer-type, as specified by a handful of handwritten filters.

N-Gram Tiling
Finally, an answer tiling algorithm is applied, which both merges similar answers and
assembles longer answers from overlapping smaller answer fragments.
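
The sketch below illustrates the mining and tiling steps on toy snippets (filtering is omitted). The vote-summing and greedy merge strategy are assumptions in the spirit of the description above, not the paper's exact algorithm:

```python
from collections import Counter

def mine_ngrams(snippets, n_max=3):
    """Collect 1..n_max-grams from search-engine snippets, summing votes."""
    counts = Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()
        for n in range(1, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def overlap_merge(a, b):
    """If the tail of a overlaps the head of b, return the tiled n-gram."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[len(a) - k:] == b[:k]:
            return a + b[k:]
    return None

def tile(counts):
    """Greedily merge overlapping candidates; a tile inherits both votes."""
    candidates = dict(counts)
    changed = True
    while changed:
        changed = False
        ranked = sorted(candidates, key=candidates.get, reverse=True)
        for a in ranked:
            for b in ranked:
                merged = overlap_merge(a, b) if a != b else None
                if merged and merged not in candidates:
                    candidates[merged] = candidates.pop(a) + candidates.pop(b)
                    changed = True
                    break
            if changed:
                break
    return sorted(candidates.items(), key=lambda kv: -kv[1])

snippets = ["the paper clip was invented in 1899",
            "clip was invented in 1899 by vaaler"]
for answer, votes in tile(mine_ngrams(snippets))[:3]:
    print(votes, " ".join(answer))
```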

Therefore, it is important that a QA system has some idea as to how likely an answer is to be correct, so it
can choose not to answer rather than answer incorrectly.

Answering Complex Questions with Random Walk Models


In the Document Understanding Conferences (DUC), the answer to a complex question like Q0 is
considered to be a multi-sentence, multiple-document summary (MDS) that meets the information
need of the question.

We introduce a new paradigm for processing complex questions that relies on a combination of:
a. question decompositions (of the complex question);
b. factoid question answering (Q/A) techniques (to process decomposed questions); and
c. multi-document summarization techniques (to fuse together the answers provided for each
decomposed question).

A novel question decomposition procedure that operates on a Markov chain model


Question decomposition depends on the successive recognition (and exploitation) of the relations
that exist between words and concepts extracted from topic-relevant sentences.

Two types of question decomposition modules:


a syntactic question decomposition module
o sub-questions are extracted from a complex question by separating conjoined phrases and recognizing embedded questions
o i.e., syntactic decomposition operates on the structure of the question alone
a random walk based question decomposition module.

The keywords are expanded by


(1) identifying the semantic class to which they belong, and
(2) using other terms from the lexicons associated with such semantic classes.
To identify the semantic class, the keyword is matched against the lexicon of the class.

We use two different models of topic signatures to identify the most representative terms and relations
for the topic referred to by the complex question and evidenced by the document collection.
The first topic signature (TS1) is defined by a set of terms t_i.
The second topic signature (TS2) is a binary relation between two topics; it identifies two forms of relations:
(a) syntax-based relations, and (b) salience-based context relations.
The arguments of these relations may be (1) nouns or nominalizations; (2) named entity types that a Named Entity Recognizer (NER) identifies; and (3) verbs.
When topic signatures are available, each sentence from the document collection receives a score
based on (a) the presence of a term from TS1; (b) the presence of a relation from TS2; and (c) the
presence of any of the keywords extracted from the sub-question or their alternations.
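
A minimal sketch of such sentence scoring, assuming illustrative weights for the three evidence sources (the paper does not publish these exact weights):

```python
def score_sentence(tokens, ts1_terms, ts2_relations, keywords,
                   w_term=1.0, w_rel=2.0, w_kw=0.5):
    """Score a sentence by (a) TS1 terms, (b) TS2 relations, and (c)
    sub-question keywords it contains. Weights are assumptions."""
    toks = set(tokens)
    score = w_term * len(ts1_terms & toks)
    # A TS2 relation (t1, t2) counts only if both arguments co-occur.
    score += w_rel * sum(1 for t1, t2 in ts2_relations
                         if t1 in toks and t2 in toks)
    score += w_kw * len(keywords & toks)
    return score

sentence = "researchers developed a new drug to treat the disease".split()
print(score_sentence(sentence,
                     ts1_terms={"drug", "disease"},
                     ts2_relations={("developed", "drug")},
                     keywords={"treat"}))
# 1.0*2 + 2.0*1 + 0.5*1 = 4.5
```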

STEP 1: The complex question is lexically, syntactically and semantically analyzed with the
goal of identifying the relationships between words that may lead to the generation of simpler
questions.
STEP 2: For a relation discovered at Step 1, we generate questions that involve that relation.

Query Formulation
(a) queries involving the lexical arguments of the relation,
(b) queries that involve semantic extensions.
Four forms of extensions were considered:
(1) extensions based on the semantic class of names that represent the nominal category (e.g.
names of drugs);
(2) extensions based on verbs which are semantically related to the verb in the WordNet database
(e.g. develop(v) is semantically related to create(v) and to produce(v));
(3) extensions that allow the nominal to be anaphoric, and therefore replaced by a pronoun, e.g.
[develop it]; and
(4) extensions that allow the nominalizations, as well as the verbal conjuncts, to be considered.

Automatic Question Answering: Beyond the Factoid


Techniques for answering factoid questions:
o question parsing
o question-type determination
o WordNet exploitation
o Web exploitation
o noisy-channel transformations
o semantic analysis
o inferencing

We employ learning mechanisms for question-answer transformations


We build our QA system around a noisy-channel architecture which exploits both a language
model for answers and a transformation model for answer/question terms, trained on a corpus of
1 million question/answer pairs collected from the Web.

One of the first challenges to be faced in automatic question answering is the lexical and stylistic
gap between the question string and the answer string.
o Question reformulations
o QA typologies and hierarchical question types (useful for general QA)
The proposed work is less linguistically motivated and more statistically driven.

Question Classification for a Croatian QA System

Classification of questions according to the expected answer type. If a QA system knows the
type of the answer it is looking for (e.g., number, city, person name), it can narrow the search for candidate answers.
The QC problem can be tackled using various approaches:
o rule-based methods (e.g., regular expression matching),
o statistical language modeling (e.g., with improved absolute discounting and log-linear interpolation), and
o machine learning methods for question classification:
o SVM (SVMLib), Decision Tree (RapidMiner), k-nearest neighbors (k-NN), as well as language modeling (LM).

Features used for classification,


o simple features: words and n-grams,
o syntactic features such as noun phrases, chunks, and head chunks
o semantic features such as named entities and WordNet hypernyms
The QC approaches also differ in what classification taxonomy they use
o one-level taxonomy consisting of a few coarse-grained classes
o multilevel taxonomy

The taxonomy used in the proposed work consists of two levels:


o a coarse-grained level that groups questions into six basic classes,
o a fine-grained level consisting of 50 classes.

Feature selection methods [18]: information gain (IG), the χ² statistic (CHI), and document
frequency (DF).
Results are slightly better when using morphological normalization (stemming or
lemmatization), and somewhat surprisingly better on Croatian than on English data.
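
For concreteness, here is a minimal sketch of answer-type classification with word n-gram features and a linear SVM via scikit-learn; the tiny training set and label names are illustrative stand-ins for a real labeled corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; a real system would use thousands of
# labeled questions over the coarse classes mentioned above.
questions = [
    "How many people live in Zagreb?",
    "Who wrote War and Peace?",
    "Where is the Taj Mahal?",
    "When did the Berlin Wall come down?",
]
labels = ["NUMERIC", "HUMAN", "LOCATION", "NUMERIC"]  # dates treated as numeric here

# Word unigrams and bigrams as features, as in the "simple features" above.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(questions, labels)
print(clf.predict(["Who invented the paper clip?"]))  # likely ['HUMAN']
```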

Question-Answering by Predictive Annotation


QA
IR techniques alone are insufficient for QA
Natural Language Processing and Information Extraction (a solution)
IR techniques combined with NLP and IE (a better solution)
Information extraction or deep parsing techniques must be applied; the
question-answering problem remains open

New text-processing technique called Predictive Annotation (PA)


Our approach is based on the following observations about fact-seeking questions:
Questions can be classified by the kind of answer they are seeking.
Answers are usually in the form of phrases.
Answer phrases can be classified by the same scheme as the questions.
Answers can be extracted from text using shallow parsing techniques.
The context of the answer phrase that validates it as an answer to the question is usually a
small fraction of the document it is embedded in.
We modified:
1. Query analysis to detect the question type
2. The indexing process to perform shallow linguistic analysis of the text and to identify and
annotate appropriate phrases with class labels.
3. The search engine to rank passages instead of documents, and to use a simple ranking
formula.
There are two kinds of ambiguity that affect this process, semantic and granular. Semantic
ambiguity occurs with questions like "How long ..." which could be asking about time or
distance (or conceivably, for works of literature, number of pages). Granular ambiguity occurs
when there are QA-Tokens that represent nested classes (for example DATE$ and YEAR$ for
"When" questions; PERSON$, ORG$, ROLE$, and NAME$ for "Who" questions).

We experimented with two different answer-selection algorithms, AnSel and WerLect, as reported in [20, 21] (Ranking Suspected Answers to Natural Language Questions using Predictive Annotation).

A Question Answering System Supported by Information Extraction


NLP-based retrieval of medical information is the extraction of medical data from narrative
clinical documents. In this paper, we review Natural Language Processing (NLP) applications
designed to extract medical problems from narrative text clinical documents.

Textract IE
MUC divides IE into distinct tasks, namely:
1. NE (Named Entity), with types person, organization, location, time, date, money and percent;
2. TE (Template Element),
3. TR (Template Relation),
4. CO (Co-reference), and
5. ST (Scenario Templates)

AQUA: A Closed-Domain Question Answering System


It combines:
Natural Language processing (NLP),
Ontologies, Logic, and
Information Retrieval technologies in a uniform framework.

The ontology is used


in the refinement of the initial query,
in the reasoning process (a generalization/specialization process using classes and
subclasses from the ontology), and
in the (novel) similarity algorithm
Query Logic Language (QLL) is used in the translation of English-written questions.
In the process model, there are four phases:
1. user interaction,

2. question processing,
understanding of the question requires several steps such as parsing the question,
representation of the question and classification.
NLP parser. This segments the sentence into subject, verb, prepositional phrases,
adjectives and objects. The output of this module is the logic representation of the query.
Interpreter. This finds a logical proof of the query over the knowledge base using
unification and resolution algorithms.
WordNet/Thesaurus. AQUA's lexical resource.
Ontology. This currently contains people, organizations, research areas, projects,
publications, technologies and events.
Failure-analysis system. This analyzes the failure of a given question and explains why
the query failed. Then the user can provide new information for the pending proof and the
proof can be re-started. This process can be repeated as needed.
Question classification & reformulation. This classifies questions as belonging to any of
the types supported in AQUA, ( what, who, when, which, why and where). This
classification is only performed if the proof failed.

3. document processing
documents are selected and a set of paragraphs are extracted
This relies on the identification of the focus of the question. Document processing
consists of two components:
Search query formulation. This transforms the original question, Q, using
transformation rules into a new question Q'. Synonymous words can be used, punctuation
symbols are removed, and words are stemmed.
Search engine. This searches the web for a set of documents using a set of keywords.
4. answer extraction.
answers are extracted from passages and given a score, using two components:
Passage selection. This extracts passages from the set of documents likely to have the
answer.
Answer selection. This clusters answers, scores answers (using a voting model), and
lastly obtains a final ballot.
QUERY LOGIC LANGUAGE (QLL)
Query Logic Language (QLL) is used for the translation from the English question into its Logic
form. In QLL, variables and predicates are assigned types.

Like Prolog or OCML (Motta, 1999), QLL uses unification and resolution (Lloyd, 1984).
However, in the future we plan to use Contextual Resolution (Pulman, 2000). Given a context,
AQUA could then provide interpretation for sentences containing contextually dependent
constructs.
Contextual resolution based transformation of English sentences into logic form
During the segmentation, the system finds nouns, verbs, prepositions, and adjectives.
AQUA makes use of an inference engine, which is based on the Resolution algorithm. However,
in the future it will be tested with the Contextual Resolution algorithm, which will allow
carrying context through several related questions.
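
A toy, QLL-inspired reconstruction of typed unification over ground facts (an illustration only; AQUA's actual QLL syntax and resolution engine are not reproduced here):

```python
# Toy KB of typed ground facts: predicate(arg, type).
KB = [
    ("works_on", ("john", "person"), ("ontologies", "research_area")),
    ("works_on", ("mary", "person"), ("logic", "research_area")),
]

def unify_arg(query_arg, fact_arg, bindings):
    """Unify one typed query argument against a fact argument."""
    name, typ = query_arg
    if name.startswith("?"):                 # a typed variable
        if typ != fact_arg[1]:               # type mismatch blocks unification
            return None
        if name in bindings and bindings[name] != fact_arg[0]:
            return None
        return {**bindings, name: fact_arg[0]}
    return bindings if query_arg == fact_arg else None

def solve(query):
    """Resolution over ground facts."""
    pred, *args = query
    for fact in KB:
        if fact[0] != pred:
            continue
        bindings = {}
        for qa, fa in zip(args, fact[1:]):
            bindings = unify_arg(qa, fa, bindings)
            if bindings is None:
                break
        if bindings is not None:
            yield bindings

# "Who works on ontologies?" -> works_on(?X:person, ontologies:research_area)
print(list(solve(("works_on", ("?X", "person"),
                  ("ontologies", "research_area")))))
# -> [{'?X': 'john'}]
```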

Knowledge-Based Question Answering


ExtrAns uses a combination of robust natural language processing technology and dedicated
terminology processing to create a domain-specific Knowledge Base, containing a semantic
representation for the propositional content of the documents. Knowing what forms the
terminology of a domain and understanding the relation between the terms is vital for the answer
extraction task.

Terminological Knowledge Base (TermKB) [13].


To this end we have adapted the terminology extraction tool FASTR [7].

Natural language question answering: the view from here


Question-answering: dimensions of the problem
A system must analyse the question
It must present the answer to the user with justification or supporting materials
QA combines information retrieval and natural language processing techniques.
Dimensions of QA in terms of:
1. Applications
Applications based on the source of the answers:
structured data (databases),
semi-structured data (for example, comment fields in databases)
free text (the focus of the articles in this volume).
Further distinguish among:
search over a fixed set of collections, as used in TREC (particularly useful for
evaluation);
search over the Web, as discussed in the Buchholz and Daelemans paper;
search over a collection or book, e.g. an encyclopedia (Kupiec, 1993); or
search over a single text, as done for reading comprehension evaluations.
Also distinguish between:
domain-independent question answering systems
domain specific systems
2. Users
First-time or casual users
Expert users
3. Question types
Distinguish questions by answer type:
Factual answers vs. opinion vs. summary.
4. Answer types
Answers may be long or short; they may be lists or narrative
Methodologies for constructing an answer:
through extraction (cutting and pasting snippets from the original document(s) containing the answer), or
via generation.
When the answer is drawn from multiple sentences or multiple documents, the coherence of an
extracted answer may be reduced, requiring generation to synthesize the pieces into a
coherent whole.
Question answering and summarization may merge as research areas
5. Evaluation
6. Presentation.

The question answering roadmap


Natural language front-ends to databases: BASEBALL, LUNAR
The aim was to allow users to communicate in their own language with an interface that knew
about questions and about the database structure and could negotiate the translation.
syntactic and semantic analysis of questions

One interesting difference between the questions typically discussed in the literature on natural
language front ends to databases and those in the literature on QA against open text collections is
the role of quantifiers and logical connectives. In questions posed against databases, quantifiers
and connectives frequently play a significant role, e.g. Who are all the students in MAT201 who
also take MAT216?. Put otherwise, such questions tend to ask about the extensions of complex
sets defined in terms of set theoretic operations on simpler sets. Questions against open text
collections, on the other hand, tend to be about finding properties or relations of entities known
via a definite description (Where is the Taj Mahal?, What year did the Berlin Wall come down?,
Which team won the FA cup in 1953?).
In such questions quantifiers and connectives do not play a major role. No doubt this will change
as open text collection QA gets more ambitious, bringing these two traditions closer together.

Information retrieval, information extraction and Question Answering


IR
Information Retrieval (IR), which, following convention, we take to be the retrieval of
relevant documents in response to a user query
IR is relevant to question answering for two reasons:
o First, IR techniques have been extended to return not just relevant documents, but
relevant passages within documents.
o Second, recent question answering evaluations developed out of IR
methodology and the IR community

IE
It was initially known as message understanding
the activity of filling predefined templates from natural language texts, where the
templates are designed to capture information about key role players in stereotypical
events
In the current context, IE templates can be viewed as expressing a question and a filled
template as containing an answer.
IE grew up around an evaluation exercise, the (now terminated) Message Understanding
Conferences (MUC), which drove language understanding technology

QA
Questions should be syntactically correct interrogatives

a serious attempt to apply formal logical techniques to the analysis of questions, i.e. to define a
suitable syntax and semantics for a formal language of questions and answers.
First order predicate calculus with functions and identity
Processing stages of QA model
1. Question Analysis: The natural language question input by the user needs to be analyzed
into whatever form or forms the downstream components require. If candidate documents are
retrieved by an IR system, then one question representation might be a stemmed, weighted term
vector for input to the search engine.
A more detailed analysis of the question typically involves two steps:
1. identifying the semantic type of the entity sought by the question (a date, a
person, a company, and so on);
2. determining additional constraints on the answer entity by, for example:
identifying key words in the question which will be used in matching
candidate answer-bearing sentences; or,
identifying relations (syntactic or semantic) that ought to hold between a
candidate answer entity and other entities or events mentioned in the question.
Systems have built hierarchies of question types based on the types of answer sought.
The constituent analysis of the question produced by the parser is transformed into a semantic
representation which captures dependencies between terms in the question.

2. Document Collection Preprocessing: Assuming the system has access to a large
document collection as a knowledge resource for answering questions, this collection
may need to be processed before querying, in order to transform it into a form which is
appropriate for real-time question answering.

3. Candidate Document Selection: A subset of documents from the total document
collection (typically several orders of magnitude smaller) is selected, comprising those
documents deemed most likely to contain an answer to the question.

4. Candidate Document Analysis: If the preprocessing stage has only superficially analyzed
the documents in the document collection, then additional detailed analysis of the
candidates selected at the preceding stage may be carried out (e.g. deriving a
first-order logical representation).
5. Answer Extraction. Using the appropriate representation of the question and of each
candidate document, candidate answers are extracted from the documents and ranked in
terms of probable correctness.
6. Response Generation: A response is returned to the user.
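
To make the stages concrete, here is a toy end-to-end run with minimal stand-ins for each component (word-overlap retrieval, regex "analysis"); this sketches the control flow only and does not reflect any particular system's implementation:

```python
import re

# Toy "collection" standing in for stage 2's preprocessed index.
DOCS = [
    "The Berlin Wall came down in 1989.",
    "The Taj Mahal is located in Agra, India.",
]

def answer_type(q):                        # stage 1: semantic type of the answer
    return "DATE" if q.lower().startswith("when") else "OTHER"

def keywords(q):                           # stage 1: matching constraints
    return set(re.findall(r"\w+", q.lower())) - {"when", "did", "the", "is"}

def select_docs(q):                        # stage 3: candidate document selection
    return sorted(DOCS, key=lambda d: -len(keywords(q) &
                                           set(re.findall(r"\w+", d.lower()))))

def extract(q):                            # stages 4-6, collapsed
    best_doc = select_docs(q)[0]
    if answer_type(q) == "DATE":
        m = re.search(r"\b\d{4}\b", best_doc)
        if m:
            return m.group(0), best_doc    # answer plus justifying sentence
    return None, best_doc

print(extract("When did the Berlin Wall come down?"))
# ('1989', 'The Berlin Wall came down in 1989.')
```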

QA should be assessed with the following criteria:


Relevance: the answer should be a response to the question.
Correctness: the answer should be factually correct.
Conciseness: the answer should not contain extraneous or irrelevant information.
Completeness: the answer should be complete, i.e. a partial answer should not get full
credit.
Coherence: an answer should be coherent, so that the questioner can read it easily.
Justification: the answer should be supplied with sufficient context to allow a reader to
determine why this was chosen as an answer to the question.

As a research agenda, question answering poses long-term research challenges in many critical
areas of natural language processing:
Applications:
o structured information (in databases)
o free text
o automated help
o Web content access
o front-ends to knowledge sources such as on-line encyclopedias or to bibliographic
resources (e.g. to MEDLINE for biomedical literature)

Ripple Down Rules for Question Answering


Ontology-based question answering systems use semantic web information to produce more
precise answers to users' queries.
Traditional restricted-domain QA systems make use of relational databases to represent target
domains.
Subsequently, with the advantages of the semantic web, recent restricted-domain QA systems
employ knowledge bases such as ontologies as the target domains [30].
Thus, semantic markup can be used to add meta-information to return precise answers for
complex natural language questions.

KbQAS pipeline: Natural language question → KB grammar rules → Intermediate Representation → Ontology with Concept-Matching technique → Answer

KbQAS consists of two components:


Question analysis
Uses a knowledge base of grammar rules for analyzing input questions
Answer retrieval
Responsible for interpreting the input questions with respect to a target ontology
The association between the two components is an intermediate representation element which
captures the semantic structure of any input question.
This intermediate element contains properties of the input question including:
Question structure,
Question category,
Keywords and semantic constraints between the keywords
The key innovation of KbQAS is that it proposes a knowledge acquisition approach to
systematically build a knowledge base for analyzing natural language questions.

To convert a natural language question into an explicit representation in a QA system:


rule-based approaches
Single Classification Ripple Down Rules knowledge acquisition methodology
Many of the proposed open-domain QA systems are based on machine learning as well as knowledge
representation and reasoning

Traditional restricted-domain QA systems are called natural language interfaces to databases.


Some QA Systems use:
Syntactic-semantic interpretation rules deriving logical forms to process the input question
Semantic grammars to analyze questions

Ontology-Based Medical Question Answering System


Design of QA system requires:
Efficient and Deep analysis of NLQ
Translation of the semantic relation expressed in the question to machine-readable form
Inferring the focus and correct characteristics of the question (efficiency of outcome
depends on this)
Applicable to the following users in their information search:
Biomedical researchers
Healthcare professionals
Patients
QA task has two reference inputs:
Corpora- to be used to extract the relevant answers
Question itself

Sources of knowledge
WordNet, UMLS

Summarization techniques
Cluster-based

Asma Ben Abacha [3] proposes an approach for translating NLQs into a machine-readable representation:
Medical entity recognition
Semantic relation extraction
Automatic translation to SPARQL queries
The system uses two ontologies: WordNet and UMLS.

AUTOMATED QUESTION-ANSWERING TECHNIQUES AND THE MEDICAL DOMAIN
Three major QA approaches:
1. Deep NLP
Converts text input into formal representation of meaning such as:
Logic (first order predicate calculus),
Semantic networks,
Conceptual dependency diagrams
Frame-based representations
Performs a semantic analysis of text in NL, i.e. the process of studying the meaning of a
linguistic input and giving a formal representation of it
Approach for semantic analysis
The user input is first passed through a syntactic parser, whose output, represented with a
parse tree, is then processed by a semantic analyzer which delivers a meaning
representation
The system derives logical representations of both user questions and the documents in
the collection
The documents are analysed in an offline stage and their semantic form is stored in a
database. In an online stage, user questions are converted into their semantic
representation, prior to being compared to the representations of the documents in the
matching process (a toy illustration of this offline/online split appears after the component list below).
When a match occurs, the sentences that originated the match are extracted as possible
answers to the user question.
Drawbacks of the deep NLP approach are:
its computational intensiveness and high processing time (Andrenucci and Sneiders, 2005; Rinaldi et al., 2004), as well as
portability difficulties (Andrenucci and Sneiders, 2005; Hartrumpf, 2006)

Six components:
Linguistic front-end: changes when the input language changes
o The linguistic front-end parses and analyses the user input in NL
Three components (domain-dependent knowledge) change when the knowledge domain changes
o These contain information specific to the domain of interest: a lexicon and a world model
o The lexicon contains admissible vocabulary words from the knowledge domain
o The world model describes the structure of the domain of interest, i.e. the hierarchy of classes of the domain objects, plus the properties and the constraints that characterize the relationships between them
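
A toy illustration, assuming a crude subject-verb-object "parser", of the offline semantic-form database and the online question matching described above:

```python
import re

def semantic_form(sentence):
    """Crude subject-verb-object extraction; real systems use full parsers."""
    m = re.match(r"(\w+) (\w+) (\w+)", sentence.lower())
    return (m.group(2), m.group(1), m.group(3)) if m else None

# Offline stage: store each document sentence's semantic form in a "database".
DOC_SENTENCES = ["Aspirin treats headaches", "Insulin regulates glucose"]
DB = {semantic_form(s): s for s in DOC_SENTENCES}

# Online stage: convert the user question and match against the database.
def answer(question):
    # "What treats headaches?" -> (treats, ?x, headaches)
    m = re.match(r"what (\w+) (\w+)\?", question.lower())
    pred, obj = m.group(1), m.group(2)
    for (p, subj, o), sentence in DB.items():
        if p == pred and o == obj:      # unify ?x with the stored subject
            return sentence
    return None

print(answer("What treats headaches?"))  # 'Aspirin treats headaches'
```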

2. IR with shallow NLP


Shallow NLP, which does not imply text understanding, i.e. semantic analysis of NL
input
It focuses on extracting text chunks, matching patterns or entities that contain the
answer to user questions
The IR approach is more domain-independent than traditional NLP
Answers retrieved with IR techniques are less justified by the context
This approach is typical for information extraction and is largely used in the Text
Retrieval Conferences
The IR approach distinguishes the expected answer type (e.g. person, place or time) with the
help of the so-called wh-words in the user question, as in the sketch below
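
A minimal sketch of wh-word based answer typing; the mapping is an assumption patterned on the examples in the text:

```python
# Minimal illustration of wh-word based answer typing.
WH_MAP = {
    "who": "PERSON",
    "where": "PLACE",
    "when": "TIME",
    "how many": "NUMBER",
}

def expected_answer_type(question: str) -> str:
    q = question.lower()
    # Check longer wh-phrases first so "how many" wins over shorter prefixes.
    for wh, atype in sorted(WH_MAP.items(), key=lambda kv: -len(kv[0])):
        if q.startswith(wh):
            return atype
    return "OTHER"   # e.g. "what" questions need deeper analysis

print(expected_answer_type("Where is the Taj Mahal?"))  # PLACE
```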

3. Template-based QA
Exploits a collection of manually created question templates, i.e. questions which have
open concepts to be filled with data instances, mapped into the conceptual model of the
knowledge domain
The interpretation is done manually, individuating for each single template the concepts
that cover a part of the conceptual model of the knowledge domain

It is the most viable approach when it comes to medical information portals on the Web
(Andrenucci and Sneiders, 2005). This is due to the following characteristics:
i. its suitability to support multilingual content,
ii. its relative ease of maintenance,
iii. its capacity to solve linguistic ambiguities such as word sense disambiguation
without computationally expensive software,
iv. and its capability to return answers in different formats.
Drawbacks:
Manual creation of the templates is required
Does not provide a natural flow in user/system dialogue or provides dialogues of
poor quality
The most viable commercially and fits Web-based medical applications that are aimed at
retrieving multilingual content in different multimedia formats

A Framework of Ontology-based KMS


Our ontology-based KMS encompasses four main modules:
Ontology Building, Document Formalization, Similarity Calculation and User Interface.
Ontology Building: We adopt Protégé, developed by Stanford University, to build our domain
ontologies. The concepts and relations are from the standard subject category of China.
Document Formalization: Benefiting from the ontologies that we have built, we can use the
concepts to formalize the documents containing information about projects and domain experts.
Similarity Calculation: By conducting the proposed integrated method to the concept trees
corresponding to projects and domain experts respectively, we can calculate the similarities
between them and rank the candidate domain experts afterwards. As a result, the most
appropriate domain expert can be obtained.
User Interface: This matching system implements the typical client-server paradigm. End users
can access and query the system from the Internet, while domain experts or system
administrators can manipulate the formalization and ontology building process.

OntoNLQA: a Framework for Ontology-Based Question Answering

Ontology-based question answering (QA) systems transform the natural language question into
RDF-triples in order to create a query that retrieves an answer

An Ontology-Based Arabic Question Answering System


Research on Arabic QA has not explored the field of QA on the Semantic Web; it has mainly
focused on information retrieval from unstructured Arabic documents, even though a huge
amount of information is available on the Web in terms of RDF and OWL.
This information can be queried using the standard SPARQL query language. However, naïve
users who have no experience with the Semantic Web cannot express their questions in SPARQL.
This problem can be resolved by using Natural Language (NL) interfaces that translate NL
queries to SPARQL.

This work aims to make a step towards supporting Arabic QA on the Semantic Web. It
introduces a QA system that can interface to any Arabic ontology, take an NL user query as
input, and retrieve an answer from an RDF knowledge base.

The core of the system is the approach we propose to translate Arabic NL queries to SPARQL.
Common NLP system components include:
tokenization, part-of-speech tagging (POS), stemming, named entity recognition (NER),
semantic relations, dictionaries, and WordNet.
QA systems for the Arabic language are very few, mainly due to the lack of accessible
linguistic resources, such as corpora and basic NLP tools (tokenizers, morphological analyzers,
etc.). Moreover, the Arabic language has a very complex morphology (inflectional and
derivational characteristics), and texts suffer from the scarcity of vowels as well as the absence of
capitalization. These specificities of the Arabic language introduce many processing problems
related to word tokenization and the identification and categorization of named entities.

Resource Description Framework (RDF) is a framework for annotating information resources in a machine-understandable way.
Web Ontology Language (OWL) is an ontology language for authoring ontologies or knowledge
bases.
Ontology based reasoning can help in resolving word disambiguation or identifying intelligent
answers.
Interpret a natural language (NL) query and translate it into SPARQL
Rule-based

A combined approach of ontology-based reasoning and natural language processing is used to automatically answer questions posed by humans in a natural language.

The main objective of this research is to develop an ontology based Arabic Question Answering
system that transforms NL queries in Arabic to SPARQL queries.

The proposed QA system is composed of four major components: Data Processing, Question
Processing, Ontology Mapper and Answer Retrieval.
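
As a sketch of the NL-to-SPARQL idea (with English stand-in patterns, a made-up toy ontology namespace, and Python's rdflib in place of the system's Java/Jena stack):

```python
import re
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/univ#")   # assumed toy ontology namespace

g = Graph()
g.add((EX.John, EX.supervises, EX.Mary))     # toy RDF individual data

# One illustrative rewrite rule: "who <verb> <name>?" -> a SPARQL SELECT.
# The real system maps Arabic question patterns to ontology terms.
def to_sparql(question: str) -> str:
    m = re.match(r"who (\w+) (\w+)\?", question.lower())
    predicate, obj = m.group(1), m.group(2).capitalize()
    return f"""
        PREFIX ex: <http://example.org/univ#>
        SELECT ?x WHERE {{ ?x ex:{predicate} ex:{obj} . }}
    """

query = to_sparql("Who supervises Mary?")
for row in g.query(query):
    print(row.x)   # http://example.org/univ#John
```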

There are a couple of open-source natural language processing packages, such as OpenNLP,
the Stanford parser, etc. [13].

Tools used:
To implement the Arabic QA system, we used the Java programming language.
The ontology, which represents the schema, is stored in a file. RDF data representing the
individuals and other annotations are stored as RDF triples in a MySQL database table.
We used the Jena API for ontology manipulation and reasoning.

Tools and Programs


To implement the Arabic QA system and write the thesis documentation, the following tools were
used:
For ontology building, we used Protégé 3.4, which is a free, open source tool for editing and
managing ontologies [55].
To implement the queries, the Jena framework has been used. Jena is a Java toolkit which
provides an API for creating and manipulating RDF models.
Java Development Kit (JDK) 1.6: A software development package from Sun
Microsystems that implements the basic set of tools needed to write, test and debug Java
applications.
Eclipse Standard/SDK: This is the program which helped us build and finish the system
implementation using the Java language.
MySQL 5.6: Used to store all the data we need to retrieve answers.
We use Stanford NLP for normalization, tokenization and POS tagging.
Shereen Khoja Stemmer [18]: This is a free Arabic stemmer. We use it to stem each Arabic
word in the document. It also removes foreign words and non-letters from the text.
Microsoft Word 2010: This is the main program used to write the documentation of the system.

Medical Question Answering: Translating Medical Questions into SPARQL Queries

Designing question answering systems requires efficient and deep analysis of natural language
questions. A key process for this task is to translate the semantic relations expressed in the
question into a machine-readable representation.
Question analysis in the medical field
Study how to translate a natural language question into a machine-readable representation.
The underlying transformation process requires determining three key points:
1. What are the main characteristics of medical questions?
2. Which methods are the most fitted for the extraction of these characteristics? and
3. How to translate the extracted information into a machine-understandable representation?

We present a complete question analysis approach including:


Medical entity recognition,
Semantic relation extraction and
Automatic translation to SPARQL queries

Both questions and candidate answers (or the whole corpus) are represented in a homogeneous
semantic representation that can be processed by information systems, using the meta-languages
RDF(S) and OWL to formalize the representation of meaning on the Web:
o supported by efficient storage systems and APIs (e.g. Sesame, Jena, Virtuoso).

The overall question analysis process is evaluated on a real question corpus collected from the
Journal of Family Practice (JFP)

Information is extracted by a chunk annotation method which classifies question-level information into five types according to their semantic role.

Medical Domain resources:


UMLS encompasses a semantic network of medical concepts and relationships, a
Metathesaurus including 2 million concepts and 7 million concept names, and a
Specialist Lexicon.
Medline contains more than 18 million medical article citations.
Taxonomies for medical questions were proposed:
Taxonomy of medical questions which contains the 10 most frequent question categories
among 1396 collected questions
Another taxonomy which classifies questions into Clinical vs Non-Clinical, General vs
Specific, Evidence vs No Evidence, and Intervention vs No Intervention
Characteristics of Medical Questions
1. Question Type:
WH Question
o Definition
o List
Yes/No Question.
2. Expected Answer Type
Treatment
Medical test
3. Focus: The focus of the question is the medical entity closest to the expected answer
4. Main relation: For WH questions, the main relation of a question is the semantic relation
that links the expected answer with the focus
5. Medical Entity Recognition (MER)
6. Semantic relations: correctly extracting not only the main relation but also contextual ones
Medical Entity Recognition (MER) consists of two main steps:
(i) detection and delimitation of phrasal information referring to medical entities, and
(ii) classification of the located entities into a set of predefined medical categories.
These medical categories have been chosen according to an analysis of different medical
question taxonomies.
The proposed MER approach uses a combination of two methods to recognize medical entities:
MetaMap Plus
A tool that maps Noun Phrases (NP) in texts to the best matching UMLS concepts and
assigns them matching scores
BIO-CRF-H
This method simultaneously identifies entity boundaries and categories
To extract semantic relations, we use a combination of two methods:
1. a pattern-based method and
2. a machine-learning method based on a SVM-classifier
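
A sketch of how the extracted focus, its category, and the main relation might be assembled into a SPARQL query; the vocabulary URIs and category names are assumptions, not the system's actual UMLS-derived resources:

```python
# Illustrative assembly of a SPARQL query from question-analysis output.
def build_sparql(focus, focus_category, main_relation, answer_category):
    return f"""
        PREFIX med: <http://example.org/med#>
        SELECT ?answer WHERE {{
            ?answer a med:{answer_category} .
            ?answer med:{main_relation} ?focus .
            ?focus a med:{focus_category} .
            ?focus med:name "{focus}" .
        }}
    """

# "What is the best treatment for oral thrush?" after MER + relation extraction:
print(build_sparql(focus="oral thrush",
                   focus_category="Problem",
                   main_relation="treats",
                   answer_category="Treatment"))
```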

ONTOLOGY LEARNING AND QUESTION ANSWERING (QA) SYSTEMS


First find the question template, and then translate it into SPARQL (see the sketch after the tools list below).
The Architecture of the System
It consists of three major parts:
1. User Interface Module: allows the user to enter full NL queries. After executing a query,
it displays the results to the user
2. Storage Module: a repository that stores the ontology
3. Inference Module: based on the Jena API, provides an ontology-based search or inference
mechanism for the most appropriate answer
Tools used:
1. Jena2 Inference Engine: allows a range of inference engines or reasoners to be plugged
into Jena
2. Jena Ontology API: for developing functions for accessing the reasoner server
3. Servlet container: Apache Tomcat
4. Integrated Development Environment: Eclipse, Java
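
The system itself is built on Java and Jena; the Python sketch below only mirrors the "template first, SPARQL second" control flow, with made-up templates:

```python
import re

# (question regex, SPARQL template with {0}-style slots) -- made-up examples.
TEMPLATES = [
    (r"what is the definition of (\w+)\?",
     "SELECT ?def WHERE {{ ex:{0} ex:definition ?def . }}"),
    (r"who is the author of (\w+)\?",
     "SELECT ?a WHERE {{ ex:{0} ex:author ?a . }}"),
]

def translate(question: str):
    q = question.lower().strip()
    for pattern, sparql in TEMPLATES:
        m = re.match(pattern, q)
        if m:
            return sparql.format(*m.groups())
    return None   # no template matched; the system would ask the user to rephrase

print(translate("What is the definition of ontology?"))
# SELECT ?def WHERE { ex:ontology ex:definition ?def . }
```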

Implementation
It can be seen as consisting of two parts:
1. Building a document ontology
It can be used as both the knowledge representation and the knowledge base
The OWL-DL web ontology language is used for building the document ontology
2. Interpreting a user question entered in natural language
Example question: What are the best online medical knowledge bases?
Example answer: UpToDate is a very good source, as are primary literature sources (go to
databases such as PubMed, JSTOR, Web of Knowledge, etc.).
