This document describes how Apache UIMA can be used in a search engine. UIMA stands
for Unstructured Information Management Architecture. It is a component software
architecture for the development, discovery, composition, and deployment of
multi-modal analytics for the analysis of unstructured information and its integration
with search technologies.

The Traditional Retrieval Process:

The workflow of a typical search engine is shown in the figure.

Before the retrieval process starts, documents are indexed. Each document first goes
through various text operations.
The text operations involve:
1. Word Tokenizer
2. Stopword Removal
3. Noun groups
4. Stemming
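
The text operations listed above can be sketched in a few lines of Python. This is a minimal illustration only, not part of UIMA: the function names, the tiny stopword list, and the suffix-stripping "stemmer" are all assumptions made for the example, and noun-group detection is left out because it needs a part-of-speech tagger.

```python
import re

# Tiny illustrative stopword list (assumption); real systems use far larger ones.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Split text into lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that carry little meaning for retrieval."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """Toy suffix stripper. This is NOT a real stemming algorithm such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_operations(text):
    """Apply tokenization, stopword removal, and stemming in that order."""
    return [stem(t) for t in remove_stopwords(tokenize(text))]


print(text_operations("The engines are indexing documents"))
```

The same function would be applied to documents at indexing time and to the user's query at search time, so that both sides are reduced to comparable terms.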

After that, the document is indexed. Once the document database is indexed, the
retrieval process can be initiated. The user first specifies a user need, which is then parsed
and transformed by the same text operations applied to the documents. Query operations
might then be applied before the actual query, which provides a system representation of
the user need, is generated. The query is then processed to obtain the retrieved documents.
Fast query processing is made possible by the index structure built earlier. Before
being sent to the user, the retrieved documents are ranked according to a likelihood of
relevance.
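
The indexing and retrieval steps described above can be sketched as a small inverted index. This is an illustration under stated assumptions rather than how any particular engine works: `analyze` is a stand-in for the text operations, the sample documents are invented, and summed term frequency is used as a deliberately simple ranking score.

```python
from collections import Counter, defaultdict


def analyze(text):
    """Stand-in for the text operations: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs):
    """Build an inverted index mapping each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(analyze(text)).items():
            index[term][doc_id] = freq
    return index


def search(index, user_need):
    """Return doc ids ranked by summed term frequency (a toy ranking score)."""
    scores = Counter()
    for term in analyze(user_need):  # same operations as at indexing time
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    # Highest score first; ties are broken by doc id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked]


# Invented sample documents (assumption, for illustration only).
docs = {
    "d1": "uima annotators analyze unstructured text",
    "d2": "search engines index text and rank text results",
    "d3": "stemming reduces words to a common stem",
}
index = build_index(docs)
print(search(index, "text search"))
```

Because the index already maps terms to documents, a query only touches the entries for its own terms instead of scanning every document.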

Proposed Process:

The proposed design is to include UIMA in the text operations, where it can serve as a
natural language processor.

Two UIMA annotators are proposed:

1. Regular Expression (Dynamic entity recognition): The Regular Expression
Annotator, an Apache UIMA analysis engine, detects entities like email
addresses, URLs, phone numbers, zip codes or any other entity based on regular
expressions and concepts.

2. Concept Mapper: Concept Mapper is a powerful, highly configurable, UIMA-based
dictionary annotator. Numerous parameters can be used to specify various
aspects of the lookup algorithm, input processing and output options. The
dictionary structure is flexible, allowing any number of synonyms to be associated
with an entry, and any number of attributes to be associated with entries or
synonyms. Lookup and matching against dictionary entries can be performed
against contiguous or non-contiguous blocks of text, and token-order-independent
lookup is also allowed (for example, the tokens "A" "B" would be considered a
match against dictionary entry "B" "A").
Consider a dictionary as follows:

<token canonical="New York City">
    <variant base="New York"/>
    <variant base="NYC"/>
    <variant base="Big Apple"/>
    <variant base="New York Capital"/>
</token>

So if a user searches for “NYC” or “New York Capital” and a document contains “New York
City”, the search returns that document, because the variants in the query resolve to the
same canonical entry.
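
The behaviour of the two proposed annotators can be approximated in plain Python to show what kind of annotations each would produce. This sketch does not use the UIMA API. The patterns, function names, and tuple layout are assumptions made for illustration, and listing the canonical form "New York City" as a variant of itself is also an assumption, added so that text containing the canonical form is matched too.

```python
import re

# Hypothetical patterns for illustration; production patterns would be stricter.
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "Url": re.compile(r"https?://[^\s]+"),
}

# Lowercased variant -> canonical form, mirroring the dictionary entry above.
DICTIONARY = {
    "new york": "New York City",
    "nyc": "New York City",
    "big apple": "New York City",
    "new york capital": "New York City",
    "new york city": "New York City",  # assumption: canonical form also listed
}


def regex_annotate(text):
    """Return (type, begin, end, covered_text) tuples, ordered by begin offset."""
    found = [
        (name, m.start(), m.end(), m.group())
        for name, rx in PATTERNS.items()
        for m in rx.finditer(text)
    ]
    return sorted(found, key=lambda a: a[1])


def dictionary_annotate(text):
    """Return (canonical, begin, end) tuples for each dictionary variant found."""
    # Longest variants first so "New York City" wins over "New York".
    variants = sorted(DICTIONARY, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(map(re.escape, variants)) + r")\b"
    rx = re.compile(pattern, re.IGNORECASE)
    return [(DICTIONARY[m.group().lower()], m.start(), m.end())
            for m in rx.finditer(text)]


def canonical_terms(text):
    """Set of canonical forms present in the text, usable as index or query terms."""
    return {canonical for canonical, _, _ in dictionary_annotate(text)}


print(regex_annotate("Contact bob@example.com or see http://example.org/docs today"))
print(canonical_terms("NYC hotels") == canonical_terms("Hotels in New York City"))
```

If the canonical forms are indexed for documents and also computed for queries, a query containing "NYC" and a document containing "New York City" end up sharing the term "New York City", which is the matching behaviour described above. The non-contiguous and order-independent lookup that Concept Mapper supports is not reproduced in this sketch.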