
ALVIS Superpeer Semantic Search Engine

002068 ALVIS Superpeer semantic Search Engine STREP / IST

Report: The ALVIS Document Architecture

Title of contract: ALVIS - Superpeer Semantic Search Engine
Acronym: ALVIS
Contract number: IST-1-002068-STP
Start date of the project: 1.1.2004
Duration: 36 months, until 31.12.2006
Document name: Report: The ALVIS Document Architecture
Date of preparation: February 3, 2005
Author(s): Kimmo Valtonen, Antti Tuominen and Wray Buntine (HUT)
Coordinator of the deliverable: Wray Buntine, Helsinki Institute for Information Technology (HIIT), University of Helsinki & Helsinki University of Technology, P.O. Box 9800, FIN-02015 HUT, Finland. Phone: +358 50 384 1533. Fax: +358 9 451 8129. Email: Wray.Buntine@Hiit.FI
Document location: http://project.alvis.info/copies/2005 06/ALVIS X8 20050130 HUT KV.pdf

Project funded by the European Community under the Information Society Technology Programme (2002-2006)


Abstract

The ALVIS document architecture takes input from a crawler, or some other source, and converts and augments a document with additional information. The resulting XML document is then used as input for the ALVIS search runtime system, or for the ALVIS resource discovery system, which performs corpus-wide linguistic and semantic analysis. This report reviews the ALVIS document processing architecture as a pipeline of processes that a document passes through. A simple example illustrates the process at each stage.

Keywords: integration, document processing, compression, reannotation
Work package: WP8
Deliverable type: article
Change log: 20.01.2005 First draft 0.5; 30.01.2005 Version 1.0


Executive summary
The ALVIS document architecture takes input from a crawler, or some other source, and converts and enriches a document with additional information. The components of the document architecture are developed in Work Packages 2, 3 and 5, and the system interfaces with Work Packages 6 and 7. The document architecture described in this report is the integration module that binds the other components together, and is developed as part of Task 8.1. The architecture's task is to produce the full enriched document as specified in the report for Milestone 3.2: Metadata Format for Enriched Documents. The XML document produced by the architecture is used as input for the ALVIS search runtime system, or for the ALVIS resource discovery system, which performs corpus-wide linguistic and semantic analysis. This report reviews the ALVIS document processing architecture as a pipeline of processes that a document passes through. A simple example illustrates the process at each stage.



Contents

1 Overview
2 Summary and rationale
  2.1 Rationale
  2.2 Outline
  2.3 An example document
3 Converting to a canonical document
  3.1 Recognizing sections
  3.2 Recognizing lists
  3.3 Recognizing ulinks
  3.4 The example continued
4 Linguistic annotation
  4.1 Continuing the example
5 Compression/decompression of enriched documents
6 Modeling
7 Re-annotating the canonical document
  7.1 Specifying indexing deviations in the document
    7.1.1 Indexing deviations
    7.1.2 Tagging support
  7.2 Continuing the example
8 Conclusion



1 Overview

This report reviews the ALVIS document processing architecture as a pipeline of processes that a document passes through. A simple example illustrates the process at each stage. The overall setting is shown in Fig. 1. The document system digests documents coming from a set of diverse sources (e.g. WWW, a database, MS Word, i.e., WP7), processes them, and then stores them on disk in enriched format. The enriched format is the metadata format described in [4]. Note that the document system is implemented as an integration module with separate components for document-level linguistic processing (i.e., WP5) and for relevance analysis and document ranking (i.e., WP2). The ALVIS runtime system (i.e., WP3, WP4 and WP9), implemented efficiently using a peer-to-peer architecture, then processes queries by consulting the document system. The addition and removal of documents is controlled by a separate maintenance system. The maintenance system does corpus-wide analysis of content to develop new linguistic resources (i.e., WP6) for subsequent use inside the document system.

Figure 1: Overview of the Alvis architecture.

2 Summary and rationale

The entire document processing pipeline is shown in Fig. 2. Rectangular boxes denote subprocesses. Those in bold outline are major components associated with work packages (as given in the boxes with a dashed outline). Stored databases are denoted by cylindrical sections. First a document is converted to the canonical document format from its original format, be it a database record, a web page, a Word document or some other document source. The goal is to obtain a well-defined, abstract version of the essential contents in XML; see [4] for more detail.


Figure 2: Overview of the process pipeline.


The canonical document consists of the visible text of the original document along with some structural annotation. The annotation describes the hierarchical structure of the text and separates list-type structures from the rest of the text. The conversion step is described in Section 3, and is part of the integration software developed in Task 8.1.

The canonical document then goes through linguistic analysis as developed in Work Package 5. In the process, sets of tokens are recognized as words, which are then disambiguated, analyzed morphosyntactically and lemmatized. The results of the analysis are added to the document using stand-off annotation. Note that this implies we are using so-called document expansion, in contrast to the standard method of query expansion. Query expansion can only be performed effectively when the expansion method is context free, thus we have to use document expansion.

The outcome of this analysis is then passed to the model updating process of Work Package 2, which adds to the enriched document the results of its probabilistic modeling and ranking algorithms. For reasons of scalability, the enriched document is compressed before saving it on disk at any one of these stages.

Before reaching the ALVIS runtime system, the enriched document goes through a re-annotation phase where a synthesis of the canonical document and its analysis is formed. This results in an enriched version of the original canonical document with the results of analyses incorporated. This enriched canonical document is the XML content seen by the ALVIS runtime system. Note that re-annotation is a general tool that can be used at any processing stage when a convenient representation of the document is needed for processing. Thus, for instance, it is also used in the model updating process above. Re-annotation also allows the easy display and summary of a document when presenting results to a user.
For instance, the display of results in some search engines is done by showing snippets, small relevant extracts of the text. We may wish to augment this with identified named entities, or also mark approximate matches or synonyms, such as hound for dog. Extracting relevant snippets from a document is a disk-intensive operation at runtime, thus the stored documents in the runtime system need to be as small as possible, with any unnecessary detail or markup deleted. Now for snippet matching and display with enriched (i.e., expanded) documents, we have two options:

1. Lose some accuracy and only use the original canonical document for the snippet task, thus using query expansion for matching snippets in order to handle the document enrichment.

2. Do full matching on the enriched canonical document to find snippets. Thus snippet matching will use exactly the same terms as stored in the index.

We will not make a hard decision on this issue at present, and implement the second approach initially. There is probably not much between these two approaches with respect to implementation effort, but the first could be twice as fast because snippet matching is a disk-bound activity.


2.1 Rationale

Our rationale for the choices made in the processing is as follows:

- A canonical representation for a document is necessary to make feasible the handling of formats of diverse kinds. It also encodes the essential structure of the document in an abstract manner, enabling the use of structural information in subsequent stages of analysis.

- Stand-off annotation for linguistic analysis enables the use of several differing analyses for the same piece of text. It also enables easy labeling of non-contiguous sets of tokens. It is generally viewed as the best annotation technique in the linguistic community. It also allows a variety of different foreign tools to be more readily incorporated into the linguistic processing pipeline.

- Document expansion is a necessary complement to using any form of sophisticated context-sensitive routines for expanding terms, i.e., as the linguistic processing and named-entity recognition imply.

- Re-annotation is required for scalability reasons, since after linguistic processing the size of the enriched document has typically increased to more than ten times that of the original, and there is no need for much of this content in the runtime system. It is used in the maintenance system. Re-annotation is also required to make the document summary, display and indexing operations at runtime work with foreign tools. Typically, available indexing tools do not have the capability to perform re-annotation, and indeed it should not be part of their function. The document summary stage (or snippet matching) will work off the enriched document, but could just as easily work off the non-enriched document.

Thus these choices come down to ease of integration of multiple foreign tools into the processing pipeline, good use of data compression, and support for an acceptable input format for the search engine runtime.

2.2 Outline

Thus, the document processing system consists of the four stages given in Fig. 2:

1. Conversion to a canonical document.
2. Linguistic annotation.
3. Relevance calculation.
4. Output of the re-annotated document to the run-time system.
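As an illustration only, the four stages can be sketched as a chain of functions. All names and data shapes below are invented for the sketch; they are not the actual ALVIS module interfaces.

```python
# Hypothetical sketch of the four-stage pipeline; function names and the
# dictionary layout are invented, not the real ALVIS interfaces.

def to_canonical(raw: str) -> str:
    # Stage 1: convert the source document to canonical XML form.
    return "<section>" + raw + "</section>"

def annotate(canonical: str) -> dict:
    # Stage 2: attach linguistic analysis as stand-off annotation.
    return {"canonical": canonical, "tokens": canonical.split()}

def rank(enriched: dict) -> dict:
    # Stage 3: add relevance factors from the modelling phase.
    enriched["categories"] = {}   # the weighted category list goes here
    return enriched

def reannotate(enriched: dict) -> str:
    # Stage 4: synthesize the enriched canonical document for the runtime.
    return enriched["canonical"]

def process(raw: str) -> str:
    return reannotate(rank(annotate(to_canonical(raw))))

print(process("Tell me, Alvis!"))
```

Each stage consumes the output of the previous one, which is the shape of the real pipeline even though the real stages are separate components rather than in-process functions.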



The full XML metadata is stored for subsequent use in analysis. The document processing system provides the following tools to support this task:

- Converter for canonical documents.
- Codec tools for the full XML metadata document.
- Re-annotation tools for subsequent export of the document to other components.

The following sections briefly describe these stages and the supporting tools.

2.3 An example document

We will use as a running example of a document entering the system the following small piece of HTML1:

<HTML>
<BODY>
<TABLE cellpadding="20" cellspacing="20">
<TR>
<TD>
Tell me, Alvis!<BR>
You're the dwarf who knows everything about our fates and fortunes:<BR>
what is the name for ale, that men quaff, in each and every world?
</TD>
<TD>
Men call it Ale, the dwarf replied.<BR>
The gods say Beer and the Vanir say Foaming.<BR>
The giants' name is Cloudless Swill, and in Hel it's known as Mead.<BR>
Suttung's sons call it Feast Draught.
</TD>
</TR>
</TABLE>
</BODY>
</HTML>

3 Converting to a canonical document

The conversion of a document source to a canonical document form is a non-trivial task. So far, web pages have been the only available type of document source, so we will limit ourselves to describing their conversion below. See [5] for a more detailed description of the algorithms. At the highest level, a canonical document is a sequence of (possibly nested) sections. By a section we mean a sequence of text with high semantic or functional coherence.
1 An excerpt from a modern English adaptation [2] of the twelfth-century Alvissmal, a lay telling how the Norse god Thorr puts the dwarf Alvis through a series of queries



3.1 Recognizing sections

We have considered DIV, headers H* and P to be HTML elements which clearly denote substantial sections of text. The following facts cause problems, however:

- Headers H* do not have explicit scope. Note also that intermediate levels might be lacking.
- DIVs can interleave freely with headers. Unfortunately, sections cannot interleave similarly.
- Ps usually have implicit scope and are often used as a break marker, not in the XML element sense.

We have adopted an approach which attempts to approximate the human perception of the overall structure of a document as seen through a typical browser. Hence, headers take precedence over other elements in defining where a subsection starts (and ends). DIVs take second place, and sections spawned by them are considered to close whenever we come across a new header. Ps have the lowest rank and have break semantics, i.e. the text preceding a P is at the same level as the text following it.
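A much-simplified sketch of the header-precedence rule is given below: H1..H6 open sections whose scope runs until the next header of equal or higher rank. The class and its data representation are invented for illustration; the real converter also handles DIV interleaving and P break semantics, which are omitted here.

```python
from html.parser import HTMLParser

class SectionBuilder(HTMLParser):
    # Toy sectioner: headers open nested sections; a header of equal or
    # higher rank closes all sections opened by lower-or-equal headers.
    def __init__(self):
        super().__init__()
        self.stack = [[]]   # stack of open section contents; root at index 0
        self.levels = []    # header level that opened each open section

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            level = int(tag[1])
            # close sections opened by headers of equal or higher rank
            while self.levels and self.levels[-1] >= level:
                self._close()
            self.stack.append([])
            self.levels.append(level)

    def _close(self):
        done = self.stack.pop()
        self.levels.pop()
        self.stack[-1].append(("section", done))

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].append(data.strip())

    def result(self):
        while self.levels:
            self._close()
        return self.stack[0]

b = SectionBuilder()
b.feed("<h1>Ale</h1>intro<h2>Names</h2>detail<h1>Mead</h1>more")
print(b.result())
# two top-level sections: one for each H1, with the H2 nested in the first
```

The key point the sketch demonstrates is that header scope is implicit and must be reconstructed by rank comparison, exactly the problem noted in the first bullet above.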

3.2 Recognizing lists

The other major structural component of a canonical document is the list. We have chosen to consider the HTML elements OL, UL and DL as one-dimensional lists. TABLEs are likewise considered to be N-dimensional lists, i.e. lists which contain items that contain lists, etc. Quite often TABLEs are used not as N-dimensional lists, but as a layout tool, i.e. to place text segments two-dimensionally on a page. Currently this is handled by converting a TABLE into a section if any of its constituents will be converted into a section. In the process its structure has to be flattened, because we now view such a TABLE as a collection of text segments, and hence drop all redundant semantics related to the dimensions of a table.
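The TABLE heuristic can be illustrated as follows. The cell-size test and the 40-character threshold are invented for the sketch; the real converter decides by whether a constituent would itself become a section.

```python
# Hypothetical illustration of the TABLE heuristic: if any cell would be
# converted to a section, the whole table is flattened into a sequence of
# sections; otherwise it is kept as an N-dimensional list. The
# 40-character "sectional" test is a stand-in for the real criterion.

def is_sectional(cell: str) -> bool:
    return len(cell) > 40

def convert_table(rows):
    cells = [cell for row in rows for cell in row]
    if any(is_sectional(c) for c in cells):
        # layout table: drop the row/column structure entirely
        return [("section", c) for c in cells]
    # genuine table: keep each row as a one-dimensional list
    return [("list", row) for row in rows]

layout = [["Tell me, Alvis! You're the dwarf who knows everything..."],
          ["Men call it Ale, the dwarf replied. The gods say Beer..."]]
print(convert_table(layout))  # flattened into sections

data = [["Ale", "men"], ["Beer", "gods"]]
print(convert_table(data))    # kept as lists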

3.3 Recognizing ulinks

The link element, ulink, is the least problematic one to recognize. We have chosen to recognize A, FRAME and IFRAME as ulink-inducing elements. IMG is not included, because ulink is here thought to be a textual link, and currently we do not parse images in any way.

3.4 The example continued

The HTML example of 2.3 is a simple example of TABLEs being used for layout. The conversion to a canonical document produces the following:



<section>
  <section>Tell me, Alvis! You're the dwarf who knows everything
  about our fates and fortunes: what is the name for ale, that men
  quaff, in each and every world?
  </section>
  <section>Men call it Ale, the dwarf replied. The gods say Beer
  and the Vanir say Foaming. The giants' name is Cloudless Swill,
  and in Hel it's known as Mead. Suttung's sons call it Feast
  Draught.
  </section>
</section>

which corresponds nicely to the visual rendition of 2.3 by a typical browser.

4 Linguistic annotation

In the linguistic processing phase the text of the canonical document goes through a sequence of analysis processes. We summarize the pipeline below; for a much more detailed description, see [3]. The subprocesses performed vary depending on the language in use. For English and French, they are, in order:

1. Tokenization. The text is broken up into pieces that are viable candidates to be linguistically meaningful units. Four categories of tokens are recognized: word separators, sequences of alphabetical characters, numbers, and sequences of other symbols.

2. Named entity recognition. This step identifies sets of tokens that act as a single semantic component referring to a particular place, location, person, etc. The recognition is performed using a dictionary of known named entities and by matching named entity patterns to the canonical document.

3. Segmentation into words. A subset of tokens are identified as words. A problem is posed by e.g. abbreviations. This is dealt with by using a dictionary of known abbreviations. The remaining cases where a period follows an alphabetic character are matched against a pattern defining period-containing words. All other such tokens are broken up into separate words.

4. Segmentation into sentences. Word sequences are recognized as sentences separated by a period.

5. Morphosyntactic analysis of sentences. The text is studied one sentence at a time and morphosyntactic analysis is applied to each word: its category (part of speech) and features dependent on the category (e.g. number, tense) are determined, and the word form is disassembled into a set of morphemes. After this, the words are lemmatized in a second pass over the text, i.e. the canonical form of each word is determined.
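Step 1 can be sketched with a single regular expression covering the four token categories. The category names mirror the element names in the stand-off annotation below, but the regex and the record layout are assumptions for illustration, not the WP5 tokenizer.

```python
import re

# Sketch of tokenization with the four categories described above:
# separators, alphabetic sequences, numbers, and other symbol sequences.
# The regex and the output record layout are invented for the sketch.
TOKEN_RE = re.compile(r"""
      (?P<c_sep>\s+)               # word separators
    | (?P<c_alpha>[A-Za-z]+)       # sequences of alphabetical characters
    | (?P<c_num>\d+)               # numbers
    | (?P<c_symb>[^\sA-Za-z\d]+)   # sequences of other symbols
""", re.VERBOSE)

def tokenize(text):
    tokens = []
    for i, m in enumerate(TOKEN_RE.finditer(text), start=1):
        tokens.append({"id": f"token{i}",
                       "cat": m.lastgroup,   # which category matched
                       "cont": m.group(),
                       "from": m.start(),
                       "to": m.end()})
    return tokens

for t in tokenize("Cloudless Swill"):
    print(t)
```

Running this on the phrase Cloudless Swill yields an alphabetic token, a separator, and a second alphabetic token, the same three tokens shown in the stand-off annotation in Section 4.1.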


The results of the conversion are added to the enriched document as stand-off annotation. Note that this processing repeatedly submits parts of the full XML document to the linguistic sub-processes, reformats the output into appropriate stand-off annotation, and then reinserts it into the XML.

4.1 Continuing the example

For brevity, we will only display (part of) what happens to the phrase Cloudless Swill of Section 3.4. But one can see from this example that, first, the detail is needed to clearly demarcate named entities and identify word forms, and second, that there is ample room for compression in the resultant output. XML is designed for clarity and is not intended as a storage format. The tokenization produces
<c_alpha>
  <cont>Cloudless</cont>
  <from>100</from>
  <id>token1</id>
  <to>108</to>
</c_alpha>
<c_sep>
  <cont> </cont>
  <from>108</from>
  <id>token2</id>
  <to>108</to>
</c_sep>
<c_alpha>
  <cont>Swill</cont>
  <from>109</from>
  <id>token3</id>
  <to>113</to>
</c_alpha>

and named entity recognition results in


<semantic_unit>
  <named_entity>
    <id>sem_unit1</id>
    <form>Cloudless Swill</form>
  </named_entity>
</semantic_unit>



5 Compression/decompression of enriched documents

Since the output of linguistic processing will increase the size of the enriched document to more than ten times that of the original document, a compressor/decompressor is called for to enable storage on disk for large sources. The compression proceeds in two basic steps:

1. Redundant elements (results of a previous conversion step) are removed from the enriched document. For example, the tokenization of the canonical document might be removed. Also, the content entry of token elements can be removed, and the token ids stripped back to integers. Note that all other elements of the enriched document can be regenerated given the originalDocument element of WP7, including the canonical document itself.

2. Standard text compression is performed on the remaining stripped-down enriched document. Gzip is the algorithm used at the moment, chosen for its speed, availability and adequate performance.

We anticipate a more sophisticated compression process will be developed at a later stage that uses XML-aware compression or a compression-enabled XML database. For instance, unnecessary elements do not need to be stored, and elements can be grouped for compression purposes where they should share codecs.

Decompression naturally proceeds in the reverse direction, re-performing those earlier conversions whose results were deleted in step 1. Note that decompression can also be done partially: some parts of the metadata might be ignored, and thus their reconstruction can be skipped. This would be used, for instance, in the re-annotation process discussed later.
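The two-step codec can be sketched as follows. The dictionary layout and the stand-in tokenizer are invented for the sketch; only the structure (strip regenerable elements, then gzip; regenerate on decompression) reflects the scheme described above.

```python
import gzip
import json

def tokenize(text):
    # Stand-in for the real tokenizer; used to regenerate stripped data.
    return text.split()

def compress(enriched: dict) -> bytes:
    # Step 1: drop elements regenerable from the canonical text,
    # then step 2: gzip what remains.
    stripped = dict(enriched)
    stripped.pop("tokens", None)
    return gzip.compress(json.dumps(stripped).encode("utf-8"))

def decompress(blob: bytes) -> dict:
    # Reverse direction: gunzip, then re-run the conversions whose
    # results were deleted in step 1.
    doc = json.loads(gzip.decompress(blob).decode("utf-8"))
    doc["tokens"] = tokenize(doc["canonical"])
    return doc

doc = {"canonical": "Cloudless Swill", "tokens": ["Cloudless", "Swill"]}
assert decompress(compress(doc)) == doc
```

Partial decompression corresponds to simply skipping the regeneration call for elements the consumer does not need.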

6 Modeling

In the modelling phase, the canonical document, the URL links given, and the linguistic annotations will be used to develop a number of auxiliary resources for the relevance calculations made during runtime and querying. This is described further in [1]. These will either apply to the document as a whole, or apply to large sections within a document. It is assumed that there is a category structure that applies to the full set of documents. There are two factors (real-valued numbers) being computed for each category: (1) the amount of the document (or section of the document) that is relevant to a particular category, and (2) the authority of the document (or section of the document) for a particular category. These factors are then added to the document in markup, which is just a simple weighted category list. The modelling phase adds an analysis element at the top level of the XML document as follows:


<analysis>
  <ranking> ??? </ranking>
  <ranking> ??? </ranking>
  <topic> ??? </topic>
  <topic> ??? </topic>
</analysis>

I'm not sure what to put here, because I have to understand Index Data's MS3.2. The modeling phase would use the same input system as the indexer; thus, it would input the documents via decompression followed by a selective re-annotation (to choose the particular elements of the linguistic annotations it wishes to use). From this, it would produce its weighted category lists and insert them back into the full document.

7 Re-annotating the canonical document

Once the enriched document has been decompressed, it will need preprocessing to enable standard search runtime systems to input and index it. Systems such as Zebra or Lucene2 cannot digest documents with stand-off annotation. Moreover, re-annotation can also be used by the other phases of document processing if they need a reduced form of input. In this, the main goals are:

- Extraction of the canonical document and the removal of everything else, to obtain a version of the document that is digestible by the search runtime in terms of size and complexity.

- Re-insertion of the results of previous analyses (WP5 and WP2) into the canonical document, so as not to lose any information relevant to the indexer and the snippet generator.

7.1 Specifying indexing deviations in the document

The second re-annotation goal above is met by introducing a new element into the text of the canonical document. It encapsulates words (including named entities and terms) and provides information derived from WP5 and WP2 in its attributes. This added information enables both indexing and snippet generation to use the same form of the document, with relevant information cleanly separated.

7.1.1 Indexing deviations

Note that there may well be more than one index entry needed per item in the document. Possible entries in the index are:
2 http://jakarta.apache.org/lucene/docs/index.html



- The lemma of the word form.
- If a compound or multi-word named entity is being stored, its component words may also be indexed.
- Synonyms of the word.
- Semantic category of the word.
- Positional information for proximity evaluation.

Thus we need to support the addition of entries to the index during the indexing phase. Note that sophisticated synonym or lemmatization processing might have been performed to identify these alternative entries, thus we need to embed these alternatives in the document prior to the indexing stage. We need to convey, for words in the document, how they will be indexed and how they will be displayed during snippet retrieval. Consider the following extract of XML:

<section> billy bob<place>Billy Bobs</place> </section>

The default behaviour of an XML-aware indexing system is to break at spaces and to index the tokens so created with their XPATH. In this example, billy and bob are entered as two terms at the level XPATH=section, and Billy and Bobs are entered as two terms at the level XPATH=section/place. The retriever to show snippets would display billy bob if only the XPATH=section level was to be retrieved, and just Billy Bobs if the XPATH=section/place level was to be retrieved. If both are retrieved, then the full content appears, duplicating the name in two slightly different forms.

This example illustrates the problem with the default behaviour. We might like to index words for a named entity such as Billy Bobs under both the XPATH=section/place and XPATH=section parts, so matching can occur in different ways. But adding these words into the document, as was done above, can lead to duplication in display. As seen in this example, named entities and other tagged items also need additional support. When we see a named entity, tagged with an XML element such as place or person, how do we index it in possibly several different ways, if at all, and what is actually displayed as the content of the original document?
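The default behaviour described above can be sketched with a toy indexer; this is not the actual API of Zebra or Lucene, only an illustration of space-splitting with XPath-qualified entries.

```python
import xml.etree.ElementTree as ET

# Toy model of a default XML-aware indexer: every whitespace-separated
# token is entered under the XPath of the element it appears in.
# The (xpath, token) entry format is invented for the sketch.

def default_index(xml_text):
    entries = []
    root = ET.fromstring(xml_text)

    def walk(elem, path):
        if elem.text:
            for tok in elem.text.split():
                entries.append((path, tok))
        for child in elem:
            walk(child, path + "/" + child.tag)
            if child.tail:   # text following the child, at the parent level
                for tok in child.tail.split():
                    entries.append((path, tok))

    walk(root, root.tag)
    return entries

print(default_index("<section> billy bob<place>Billy Bobs</place> </section>"))
# billy, bob at "section"; Billy, Bobs at "section/place"
```

The duplication problem is visible here: retrieving both levels yields both the lowercase and the tagged form of the same name.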
The issue is that if the named entity has been identified by our processing, then it does not exist in the original document, and perhaps it should not be included in the display.

7.1.2 Tagging support

We assume that the XML document as entered is what you want retrieved. We then add an extra <t> element whose only task is to augment the index entries added, to account for the above effects. If the default behaviour does indexing as you would like, then no <t> elements are needed. The special element <t> is intended to specify different effects. The values of certain elements inside the t element are stored in the index, with possibly some additional XPATH entries added where we want indexing but no display. In document summary/retrieval, all the <t> elements are replaced by their original contents. The definition of t is as follows:

<!ELEMENT t (#PCDATA|i)*>
<!ATTLIST t orig CDATA #REQUIRED>
<!ELEMENT i (#PCDATA)>

The orig attribute is what is displayed. The contained i elements (there may be several) specify the index entries, which default to the PCDATA enclosed by the t element.

7.2 Continuing the example

The re-annotation process first extracts the canonical document version of Section 3.4 and then re-inserts relevant information from the output shown in Section 4.1. For brevity, we will only display what happens to the phrase Cloudless Swill under different configurations.

<t orig="Cloudless Swill">
  <i>cloudless</i>
  <i>swill</i>
</t>

In this case the display will be Cloudless Swill, and the two tokens cloudless and swill are indexed separately. If indexing ignores capitals, then this would be equivalent to the plain text Cloudless Swill with no tagging at all. Various other effects are possible depending on the choices used. In all the examples below, the display will be the same, just Cloudless Swill.

Indexing compounds: If the indexing system supports phrase searches within text, compounds do not need any special tagging. If phrase searches are unavailable but complete subtag matches can be done, indexing compounds is done as follows:

<t orig="Cloudless Swill">
  <i>cloudless swill</i>
</t>

Indexing synonyms: Several alternatives can be indexed.


<t orig="Cloudless Swill">
  <i>cloudless swill</i>
  <i>beer</i>
  <i>mead</i>
</t>

Indexing tagged compounds with higher-level classes: The two-word compound appears in multiple indexes.

<t><i>cloudless swill</i></t>
<beverage>Cloudless Swill</beverage>

Indexing tagged compounds as classes only: Default behavior is used; the compound is indexed.

<beverage>Cloudless Swill</beverage>

Indexing tagged compounds in different ways: The two-word compound is indexed differently at different XPATH levels.

<t><i>cloudless</i><i>swill</i></t>
<beverage>Cloudless Swill</beverage>

8 Conclusion

This report introduces the document processing architecture for ALVIS, and some of the integration tools it provides in support of linguistic processing and document modeling. In future work, we hope to achieve the following:

- Integrate some of the tools to support pipelining and manipulating the XML from WP5 [3].
- Upgrade the compression scheme once the stand-off annotation standards have settled. XML compression does not appear to be well supported in open source.
- Consider the use of compressing XML databases to store documents.
- Properly integrate this document with the relevant sections from Milestone 3.2 [4].



References

[1] W.L. Buntine. WP2 milestone 2.2: Requirements and specifications for WP2. Technical report, HIIT, September 2004.
[2] K. Crossley-Holland. The Norse Myths. Penguin Books, Great Britain, 1987.
[3] A. Nazarenko, E. Alphonse, C. Nédellec, S. Aubin, J. Derivière, T. Hamon, T. Poibeau, D. Weissenbacher, D. Mladenic, and Z. Qiang. WP5 deliverable 5.1: Report on method and language for the production of the augmented document representations. Technical report, INRA, LIPN, JSI, TU, December 2004.
[4] M. Taylor. WP3 milestone 3.2: Metadata format for enriched documents. Technical report, Index Data, December 2004.
[5] K. Valtonen. Alvis documentation for conversion tools v.0.2. Technical report, HIIT, December 2004.

