Anaphora Resolution == the problem of resolving what a pronoun or a noun phrase refers to.
In the following example, 1) and 2) are utterances; together, they form a discourse.
1) John came in.
2) He was kind.
As humans, readers and listeners can quickly and unconsciously work out that the pronoun "he" in
utterance 2) refers to "John" in 1). The underlying process of how this is done is still unclear,
especially when we encounter more complex sentences:
Query Reformulation
Given a question, the system generates a number of weighted rewrite strings which are likely
substrings of declarative answers to the question.
For example, "When was the paper clip invented?" is rewritten as "The paper clip was
invented".
Approaches to query reformulation:
o Rule-based approach
o Machine Learning approach (learn query-to-answer reformulations & their weights)
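A rule-based reformulation step can be sketched as follows; the patterns, templates, and weights below are illustrative assumptions, not the system's actual rules:

```python
import re

# Hypothetical rewrite rules: each maps a question pattern to weighted
# declarative rewrite strings likely to appear in answer text.
RULES = [
    # "When was X invented?" -> "X was invented"
    (re.compile(r"^When was (.+) invented\?$", re.I),
     [("{0} was invented", 5.0), ("invented {0}", 2.0)]),
    # "Who wrote X?" -> "X was written by"
    (re.compile(r"^Who wrote (.+)\?$", re.I),
     [("{0} was written by", 5.0), ("wrote {0}", 2.0)]),
]

def reformulate(question):
    """Return a list of (rewrite string, weight) pairs for the question."""
    rewrites = []
    for pattern, templates in RULES:
        m = pattern.match(question)
        if m:
            for template, weight in templates:
                rewrites.append((template.format(m.group(1)), weight))
    # Fall back to the bag of question words, with a low weight.
    if not rewrites:
        rewrites.append((question.rstrip("?"), 1.0))
    return rewrites
```

The machine-learning variant would learn such query-to-answer rewrites and their weights from data instead of hand-writing them.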
N-gram Mining
From the page summaries returned by the search engine, n-grams are collected as possible
answers to the question.
N-Gram Filtering
Next, the n-grams are filtered and reweighted according to how well each candidate matches
the expected answer-type, as specified by a handful of handwritten filters.
N-Gram Tiling
Finally, an answer tiling algorithm is applied, which both merges similar answers and
assembles longer answers from overlapping smaller answer fragments.
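N-gram tiling can be sketched as a greedy word-overlap merge; the additive scoring and the control flow here are simplifying assumptions:

```python
def tile(a, b):
    """If the tail of a overlaps the head of b (word-wise), return the
    merged string, else None.  E.g. 'Abraham Lincoln' + 'Lincoln was'
    -> 'Abraham Lincoln was'."""
    wa, wb = a.split(), b.split()
    for k in range(min(len(wa), len(wb)), 0, -1):
        if wa[-k:] == wb[:k]:
            return " ".join(wa + wb[k:])
    return None

def tile_answers(scored):
    """Greedy tiling over (answer, score) pairs: repeatedly merge an
    overlapping pair, summing the scores, until no overlap remains."""
    answers = dict(scored)
    changed = True
    while changed:
        changed = False
        items = list(answers.items())
        for a, sa in items:
            for b, sb in items:
                if a != b:
                    merged = tile(a, b)
                    if merged and merged not in answers:
                        answers.pop(a); answers.pop(b)
                        answers[merged] = sa + sb
                        changed = True
                        break
            if changed:
                break
    return sorted(answers.items(), key=lambda kv: -kv[1])
```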
Therefore, it is important that a QA system has some idea as to how likely an answer is to be correct, so it
can choose not to answer rather than answer incorrectly.
We introduce a new paradigm for processing complex questions that relies on a combination of:
a. question decompositions (of the complex question);
b. factoid question answering (Q/A) techniques (to process decomposed questions); and
c. Multi-document summarization techniques (to fuse together the answers provided for each
decomposed question).
We use two different models of topic signatures to identify the most representative relations
for the topic referred to by the complex question and evidenced by the document collection.
The first topic signature (TS1) is defined by a set of terms t_i.
The second topic signature (TS2) is a binary relation between two topics;
it identifies two forms of relations:
(a) syntax-based relations, and (b) salience-based context relations.
The arguments of these relations may be (1) nouns or nominalizations;
(2) named entity types that a Named Entity Recognizer
(NER) identifies; and (3) verbs
When topic signatures are available, each sentence from the document collection receives a score
based on (a) the presence of a term from TS1; (b) the presence of a relation from TS2; and (c) the
presence of any of the keywords extracted from the sub-question or their alternations.
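The three-part sentence scoring can be sketched as below; the weights and the bag-of-words matching are illustrative assumptions, not the paper's actual formula:

```python
def score_sentence(sentence, ts1_terms, ts2_relations, keywords):
    """Score a sentence by (a) TS1 term presence, (b) TS2 relation
    presence, and (c) sub-question keyword presence."""
    words = set(sentence.lower().split())
    score = 0.0
    # (a) one point per TS1 term occurring in the sentence
    score += 1.0 * len(words & {t.lower() for t in ts1_terms})
    # (b) a TS2 relation counts only if both topic arguments occur
    for arg1, arg2 in ts2_relations:
        if arg1.lower() in words and arg2.lower() in words:
            score += 2.0
    # (c) half a point per sub-question keyword
    score += 0.5 * len(words & {k.lower() for k in keywords})
    return score
```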
STEP 1: The complex question is lexically, syntactically and semantically analyzed with the
goal of identifying the relationships between words that may lead to the generation of simpler
questions.
STEP 2: For each relation discovered at Step 1, we generate questions that involve that relation
Query Formulation
(a) queries involving the lexical arguments of the relation, and
(b) queries that involve semantic extensions.
Four forms of extensions were considered:
(1) extensions based on the semantic class of names that represent the nominal category (e.g.
names of drugs),
(2) extensions based on verbs which are semantically related to the verb in the WordNet database
(e.g. develop(v) is semantically related to create(v) and produce(v));
(3) extensions that allow the nominal to be anaphoric, therefore replaced by a pronoun, e.g.
[develop it]; and
(4) extensions that allow the nominalizations, as well as the verbal conjuncts, to be considered.
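The four extension forms can be sketched roughly as follows; the toy lexicon, the naive nominalization heuristic, and all names are assumptions standing in for real WordNet lookups:

```python
# Toy semantic lexicon standing in for WordNet verb relations.
RELATED_VERBS = {"develop": ["create", "produce"]}

def extend_queries(nominal, verb, names_of_class=()):
    """Generate the four forms of query extensions for a
    (verb, nominal) relation."""
    queries = [f"{verb} {nominal}"]                      # base query
    # (1) semantic-class names for the nominal (e.g. drug names)
    queries += [f"{verb} {name}" for name in names_of_class]
    # (2) semantically related verbs
    queries += [f"{v} {nominal}" for v in RELATED_VERBS.get(verb, [])]
    # (3) anaphoric nominal, replaced by a pronoun
    queries.append(f"{verb} it")
    # (4) nominalization of the verb (naive suffix heuristic)
    queries.append(f"{verb}ment of {nominal}")
    return queries
```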
One of the first challenges to be faced in automatic question answering is the lexical and stylistic
gap between the question string and the answer string.
o Question reformulations
o QA typologies and hierarchical question types (useful for general QA)
The proposed work is less linguistically motivated and more statistically driven.
Question Classification for a Croatian QA System
Classification of questions according to the expected answer type. If a QA system knows the
type of the answer it is looking for (e.g., number, city, person name), it can narrow down the
search for candidate answers.
The QC problem can be tackled using various approaches:
o Rule-based methods (e.g., regular expression matching) and
o Statistical language modeling (with improved absolute discounting and log-linear
interpolation)
o machine learning methods for question classification
o SVM (SVMLib), Decision Tree (RapidMiner), k-nearest neighbors (k-NN), as
well as language modeling (LM).
Feature selection methods [18]: information gain (IG), the χ2 statistic (CHI), and document
frequency (DF)
Results are slightly better when using morphological normalization (stemming or
lemmatization), and somewhat surprisingly better on Croatian than on English data.
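A minimal rule-based classifier in the regular-expression style mentioned above might look like this; the rules and answer-type labels are illustrative, not those of the Croatian system:

```python
import re

# Illustrative answer-type rules (regular-expression matching).
QC_RULES = [
    (re.compile(r"^(how many|how much)\b", re.I), "NUMBER"),
    (re.compile(r"^who\b", re.I), "PERSON"),
    (re.compile(r"^(what|which) (city|town)\b", re.I), "CITY"),
    (re.compile(r"^when\b", re.I), "DATE"),
]

def classify(question):
    """Return the expected answer type, or OTHER if no rule fires."""
    for pattern, answer_type in QC_RULES:
        if pattern.search(question):
            return answer_type
    return "OTHER"
```

The machine-learning alternatives (SVM, decision trees, k-NN, LM) would instead learn the mapping from question features to answer types.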
Textract IE
MUC divides IE into distinct tasks, namely:
1. NE (Named Entity), covering seven types (person, organization, location, time, date,
money and percent),
2. TE (Template Element),
3. TR (Template Relation),
4. CO (Co-reference), and
5. ST (Scenario Templates)
2. question processing,
understanding of the question requires several steps such as parsing the question,
representation of the question and classification.
NLP parser. This segments the sentence into subject, verb, prepositional phrases,
adjectives and objects. The output of this module is the logic representation of the query.
Interpreter. This finds a logical proof of the query over the knowledge base using
unification and resolution algorithms.
WordNet/Thesaurus. AQUA's lexical resource.
Ontology. This currently contains people, organizations, research areas, projects,
publications, technologies and events.
Failure-analysis system. This analyzes the failure of a given question and explains why
the query failed. Then the user can provide new information for the pending proof and the
proof can be re-started. This process can be repeated as needed.
Question classification & reformulation. This classifies questions as belonging to any of
the types supported in AQUA (what, who, when, which, why and where). This
classification is only performed if the proof failed.
3. document processing
documents are selected and a set of paragraphs are extracted
This relies on the identification of the focus of the question. Document processing
consists of two components:
Search query formulation. This transforms the original question, Q, into a new question Q'
using transformation rules. Synonymous words can be used, punctuation symbols are
removed, and words are stemmed.
Search engine. This searches the web for a set of documents using a set of keywords.
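The query formulation step above can be sketched as follows; the stopword list, synonym table, and toy stemmer are assumptions standing in for AQUA's actual resources:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "of", "did", "was"}
SYNONYMS = {"fall": ["come down", "collapse"]}  # toy synonym table

def naive_stem(word):
    """Very small suffix-stripping stemmer (a stand-in for Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def formulate_query(question):
    """Transform question Q into keyword query Q': strip punctuation,
    drop stopwords, stem, and append synonyms."""
    text = re.sub(r"[^\w\s]", "", question.lower())
    keywords = [w for w in text.split() if w not in STOPWORDS]
    terms = [naive_stem(w) for w in keywords]
    for w in keywords:
        terms.extend(SYNONYMS.get(w, []))
    return terms
```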
4. answer extraction.
answers are extracted from passages and given a score, using the two components:
Passage selection. This extracts passages from the set of documents likely to have the
answer.
Answer selection. This clusters answers, scores answers (using a voting model), and
lastly obtains a final ballot.
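The cluster-and-vote answer selection can be sketched as below; clustering by lowercased string equality is a simplifying assumption about how the answers are grouped:

```python
from collections import defaultdict

def select_answer(candidates):
    """Cluster candidate answers by normalized form and take a vote:
    each candidate contributes its score to its cluster's ballot, and
    the cluster with the highest ballot wins."""
    ballots = defaultdict(float)
    originals = {}
    for answer, score in candidates:
        key = answer.strip().lower()
        ballots[key] += score
        originals.setdefault(key, answer.strip())
    best = max(ballots, key=ballots.get)
    return originals[best], ballots[best]
```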
QUERY LOGIC LANGUAGE (QLL)
Query Logic Language (QLL) is used for the translation from the English question into its logic
form. In QLL, variables and predicates are assigned types.
Like Prolog or OCML (Motta, 1999), QLL uses unification and resolution (Lloyd, 1984).
However, in the future we plan to use Contextual Resolution (Pulman, 2000). Given a context,
AQUA could then provide interpretation for sentences containing contextually dependent
constructs.
Contextual resolution based transformation of English sentences into logic form
During the segmentation, the system finds nouns, verbs, prepositions, and adjectives.
AQUA makes use of an inference engine, which is based on the resolution algorithm. However,
in the future it will be tested with the Contextual Resolution algorithm, which will allow the
carrying of context through several related questions.
One interesting difference between the questions typically discussed in the literature on natural
language front ends to databases and those in the literature on QA against open text collections is
the role of quantifiers and logical connectives. In questions posed against databases, quantifiers
and connectives frequently play a significant role, e.g. Who are all the students in MAT201 who
also take MAT216?. Put otherwise, such questions tend to ask about the extensions of complex
sets defined in terms of set theoretic operations on simpler sets. Questions against open text
collections, on the other hand, tend to be about finding properties or relations of entities known
via a definite description: Where is the Taj Mahal?, What year did the Berlin Wall come
down?, Which team won the FA cup in 1953?.
In such questions quantifiers and connectives do not play a major role. No doubt this will change
as open text collection QA gets more ambitious, bringing these two traditions closer together.
IE
IE was initially known as "message understanding":
the activity of filling predefined templates from natural language texts, where the
templates are designed to capture information about key role players in stereotypical
events.
In the current context, IE templates can be viewed as expressing a question and a filled
template as containing an answer.
The evaluation exercise that drove this language understanding technology, the Message
Understanding Conferences (MUC), has now terminated.
QA
Questions should be syntactically correct interrogatives.
a serious attempt to apply formal logical techniques to the analysis of questions, i.e. to define a
suitable syntax and semantics for a formal language of questions and answers.
First order predicate calculus with functions and identity
Processing stages of QA model
1. Question Analysis: The natural language question input by the user needs to be analyzed
into whatever form the downstream components require. If the system relies on an
IR system, then one question representation might be a stemmed, weighted term vector
for input to the search engine.
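A stemmed, weighted term vector of that kind can be sketched with tf-idf weighting; the default stemmer and the fallback weight are illustrative assumptions:

```python
from collections import Counter

def term_vector(question, idf, stem=lambda w: w.rstrip("s")):
    """Represent a question as a stemmed, weighted term vector
    (tf * idf weights) suitable as input to an IR engine.  `idf` maps
    stems to inverse-document-frequency weights; unseen stems get a
    default weight of 1.0."""
    tf = Counter(stem(w) for w in question.lower().split() if w.isalpha())
    return {term: count * idf.get(term, 1.0) for term, count in tf.items()}
```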
A more detailed analysis of the question typically involves two steps:
1. identifying the semantic type of the entity sought by the question (a date, a
person, a company, and so on);
2. determining additional constraints on the answer entity by, for example:
identifying key words in the question which will be used in matching
candidate answer-bearing sentences; or,
identifying relations (syntactic or semantic) that ought to hold between a
candidate answer entity and other entities or events mentioned in the question.
Systems have built hierarchies of question types based on the types of answer sought.
The constituent analysis of a question that it produces is transformed into a semantic
representation which captures dependencies between terms in the question.
4. Candidate Document Analysis: If the preprocessing stage has only superficially analyzed
the documents in the document collection, then additional detailed analysis of the
candidates selected at the preceding stage may be carried out.
First order logical representation
5. Answer Extraction. Using the appropriate representation of the question and of each
candidate document, candidate answers are extracted from the documents and ranked in
terms of probable correctness.
6. Response Generation: A response is returned to the user.
As a research agenda, question answering poses long-term research challenges in many critical
areas of natural language processing:
Applications:
o structured information (in databases)
o Free text.
o automated help,
o Web content access
o front-ends to knowledge sources such as on-line encyclopedias or to bibliographic
resources (e.g. to MEDLINE for biomedical literature)
Sources of knowledge
WordNet, UMLS
Summarization techniques
Cluster-based
Asma Ben Abacha [3] proposed an approach for translating NLQs into a machine-readable representation:
Medical entity recognition
Semantic relation extraction
Automatic translation to SPARQL queries
The system uses two ontologies: WordNet and UMLS
AUTOMATED QUESTION-ANSWERING TECHNIQUES AND THE
MEDICAL DOMAIN
Three major QA approaches:
1. Deep NLP
Converts text input into formal representation of meaning such as:
Logic (first order predicate calculus),
Semantic networks,
Conceptual dependency diagrams
Frame-based representations
Performs a semantic analysis of the text in NL:
the process of studying the meaning of a linguistic input and giving a formal
representation of it
Approach for semantic analysis
The user input is first passed through a syntactic parser, whose output, represented with a
parse tree, is then processed by a semantic analyzer which delivers a meaning
representation
The system derives logical representations of both user questions and the documents in
the collection
The documents are analysed in an offline stage and their semantic form is stored in a
database. In an on-line stage, user questions are converted into their semantic
representation, prior to being compared to the representations of the documents in the
matching process.
When a match occurs, the sentences that originated the match are extracted as possible
answers to the user question.
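The offline/online matching loop can be sketched with toy predicate triples as the semantic representation; the three-word "analyzer" below is a stand-in for a real parser plus semantic analyzer:

```python
def semantic_form(sentence):
    """Toy semantic analyzer: subject-verb-object triple from a
    three-word sentence."""
    subj, verb, obj = sentence.lower().rstrip(".?").split()
    return (subj, verb, obj)

def build_index(documents):
    """Offline stage: store each document's semantic form in a 'database'."""
    index = {}
    for doc in documents:
        index.setdefault(semantic_form(doc), []).append(doc)
    return index

def answer(question_form, index):
    """Online stage: match the question's form against the stored forms;
    '?x' is a wildcard in any argument position.  Matching sentences
    are returned as possible answers."""
    matches = []
    for form, docs in index.items():
        if all(q in ("?x", f) for q, f in zip(question_form, form)):
            matches.extend(docs)
    return matches
```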
Drawbacks of the deep NLP approach are:
Its computational intensiveness and its high processing time (Andrenucci and
Sneiders, 2005, Rinaldi et al., 2004) as well
Portability difficulties (Andrenucci and Sneiders, 2005, Hartrumpf 2006)
Six components:
Linguistic front-end: changes when the input language changes
o The linguistic front-end parses and analyses the user input in NL
Three components (domain-dependent knowledge) change when the knowledge
domain changes.
o Contains information specific for the domain of interest: a lexicon and a
world model.
o The lexicon contains admissible vocabulary words from the knowledge
domain.
o The world model describes the structure of the domain of interest, i.e. the
hierarchy of classes of the domain objects, plus the properties and the
constraints that characterize the relationship between them.
3. Template-based QA
Exploits a collection of manually created question templates, i.e. questions which have
open concepts to be filled with data instances, mapped into the conceptual model of the
knowledge domain
The interpretation is done manually, identifying for each single template the concepts
that cover a part of the conceptual model of the knowledge domain
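Template matching with open concepts can be sketched as follows; the templates and the conceptual-model names are hypothetical examples for a medical domain:

```python
import re

# Manually created question templates; each open concept (?P<drug>...)
# maps into the conceptual model of the knowledge domain.
TEMPLATES = [
    (re.compile(r"what are the side effects of (?P<drug>.+)\?", re.I),
     "side_effects_of(drug)"),
    (re.compile(r"what is the dosage of (?P<drug>.+)\?", re.I),
     "dosage_of(drug)"),
]

def match_template(question):
    """Return (conceptual query, concept bindings) for the first
    matching template, or None if no template fits."""
    for pattern, concept in TEMPLATES:
        m = pattern.search(question)
        if m:
            return concept, m.groupdict()
    return None
```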
the most viable approach when it comes to medical information portals on the Web
(Andrenucci and Sneiders, 2005). This is due to the following characteristics:
i. its suitability to support multilingual content,
ii. its relative ease of maintenance,
iii. its capacity to solve linguistic ambiguities such as word sense disambiguation
without computationally expensive software,
iv. and its capability to return answers in different formats.
Drawback:
Manual creation of the templates is required
Does not provide a natural flow in the user/system dialogue, or provides dialogues of
poor quality
The most viable commercially and fits Web-based medical applications that are aimed at
retrieving multilingual content in different multimedia formats
This work aims to make a step towards supporting Arabic QA on the Semantic Web. It
introduces a QA system that can interface with any Arabic ontology, take an NL user query as
input, and retrieve an answer from an RDF knowledge base.
The core of the system is the approach we propose to translate Arabic NL queries to SPARQL.
The most common NLP components are:
tokenization, part-of-speech tagging (POS), stemming, named entity recognition (NER),
semantic relations, dictionaries, WordNet
QA systems for the Arabic language are very few, mainly due to the lack of accessible
linguistic resources, such as corpora and basic NLP tools (tokenizers, morphological analyzers,
etc.). Moreover, the Arabic language has a very complex morphology (inflectional and
derivational characteristics) and texts suffer from the scarcity of vowels as well as the absence of
capitalization. These specificities of the Arabic language introduce many processing problems
related to word tokenization and the identification and categorization of named entities.
The main objective of this research is to develop an ontology based Arabic Question Answering
system that transforms NL queries in Arabic to SPARQL queries.
The proposed QA system is composed of four major components: Data Processing, Question
Processing, Ontology Mapper and Answer Retrieval.
There are several open-source natural language processing packages, such as OpenNLP, the
Stanford parser, etc. [13].
Tools used:
To implement the Arabic QA system, we used the Java programming language
The Ontology, which represents the schema, is stored in a file. RDF data representing the
individuals and other annotations are stored as RDF triples in a MySQL database table.
We used Jena API for ontology manipulation and reasoning.
Both questions and candidate answers (or whole corpus) are represented in a homogeneous
semantic representation that can be processed by information systems. Using meta-languages:
RDF(S) and OWL to formalize the representation of meaning on the Web
o Supported by efficient storage systems and APIs (e.g. Sesame, Jena,
Virtuoso).
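The NL-to-SPARQL translation step can be sketched as below; the ex: namespace, the entity/relation names, and the in-memory triple store are illustrative assumptions (a real system would query an RDF store through an API such as Jena):

```python
def to_sparql(entity, relation):
    """Build a SPARQL SELECT query from an (entity, relation) pair
    extracted from the question."""
    return ("PREFIX ex: <http://example.org/med#> "
            f"SELECT ?answer WHERE {{ ex:{entity} ex:{relation} ?answer }}")

# A toy in-memory triple store standing in for the RDF knowledge base.
TRIPLES = [("ex:Aspirin", "ex:treats", "ex:Headache")]

def run(entity, relation):
    """Evaluate the same (s, p, ?o) pattern over the toy store."""
    return [o for s, p, o in TRIPLES
            if s == f"ex:{entity}" and p == f"ex:{relation}"]
```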
The overall question analysis process is evaluated on a real question corpus collected from the
Journal of Family Practice (JFP).
Implementation
The implementation can be seen as two parts:
1. Building a document ontology
It can be used as both the knowledge representation and knowledge base
The OWL-DL web ontology language is used for building the document ontology
2. How to interpret a user question which is entered in natural language
Example question: What are the best online medical knowledge bases?
Example answer: UpToDate is a very good source, as are primary literature sources
(databases such as PubMed, JSTOR, Web of Knowledge, etc.).