
Languages Data Sheet

Language Support Information


Autonomy IDOL server is based on probabilistic modeling and therefore does not require language-dependent parsing, dictionaries or translation modules. Treating words as abstract symbols of meaning allows Autonomy's technology to derive understanding from the context in which symbols occur rather than from a rigid definition of grammar. Slang and other variations in language do not confuse the software.

Because it builds up a statistical understanding of the patterns in its content, Autonomy IDOL server can be trained on any language. The more information IDOL server is given about a particular type of information (for example, legal terms, pharmaceutical developments or technology), the more understanding it gains of those topics. A new language can be thought of as simply another type of information, for which IDOL server needs enough material to learn from. It is therefore possible to mix more than one language in IDOL server, as long as there is enough material in each language to build its understanding. The choice of language does not compromise the accuracy of the concepts extracted by IDOL server: the underlying algorithm is the same regardless of the language used.

While Autonomy's technology is language independent, it can be beneficial to use language-dependent features to optimize IDOL server's ability to match concepts irrespective of their appearance in text. Autonomy therefore provides the following features:

Supported Languages
Support for over 60 languages including:
Afrikaans, Albanian, Arabic, Azeri, Basque, Belarussian, Breton, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English (all common varieties), Estonian, Faroese, Finnish, French (incl. Canadian), Gaelic, Galician, German, Greek, Greenlandic, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kazakh, Korean, Kurdish, Kyrgyz, Lappish, Latin, Latvian, Lithuanian, Luxembourgish, Macedonian, Malay, Maltese, Maori, Mongolian, Norwegian, Persian, Polish, Portuguese (incl. Brazil), Romanian, Russian, Serbian, Slovak, Slovenian, Somali, Sorbian, Spanish (incl. South/Central America), Swahili, Swedish, Tagalog, Tatar, Thai, Turkish, Ukrainian, Urdu, Uzbek, Valencian, Vietnamese, Welsh


Stemming
In many languages, groups of words share a common morphological root. Autonomy provides stemming algorithms that reduce words to this root form. This is useful because it allows concepts to be matched regardless of the grammatical use of words. In English, for example, the words "run", "runner" and "running" can all be stripped down to their stem "run" without significant loss of meaning. Autonomy provides as standard a set of stemming algorithms for the most commonly used languages.
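The idea of reducing words to a stem can be illustrated with a minimal Python sketch. This is a toy suffix-stripping rule set written for this example, not Autonomy's stemming algorithm; the function name `naive_stem` and the suffix list are assumptions chosen only to reproduce the "run" example.

```python
# Toy suffix stripper for illustration only; not a production stemmer
# and not the algorithm used by IDOL server.
SUFFIXES = ("ing", "er", "s")  # hypothetical rule set, checked in this order


def naive_stem(word: str) -> str:
    """Strip one common English suffix and undo consonant doubling."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


for w in ("run", "runner", "running"):
    print(w, "->", naive_stem(w))
```

A rule set this small mis-stems many words (it would shorten "telling" to "tel", for instance), which is why practical stemmers use much larger, language-specific rule sets.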

Transliteration schemes
Transliteration represents letters that do not belong to the Latin alphabet, or words that contain accented letters, using the corresponding characters of another alphabet. This makes familiarity with the accents and special characters of different languages unnecessary.
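As a rough illustration of transliteration, the Python sketch below strips accents using the standard `unicodedata` module and maps a handful of Cyrillic letters to Latin ones. The mapping table `CYRILLIC_TO_LATIN` and the function `transliterate` are hypothetical and deliberately incomplete; they do not reflect the schemes shipped with IDOL server.

```python
import unicodedata

# Tiny illustrative table (an assumption for this sketch); a real scheme
# would cover the whole alphabet, including upper case.
CYRILLIC_TO_LATIN = {"м": "m", "о": "o", "с": "s", "к": "k", "в": "v", "а": "a"}


def transliterate(text: str) -> str:
    """Drop accents, then map known non-Latin letters to Latin ones."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Discard the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace letters found in the table; leave everything else untouched.
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in stripped)


print(transliterate("café"))
print(transliterate("москва"))
```

With this in place, a query typed as "cafe" and a document containing "café" reduce to the same string.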

Canonicalization of characters
Some encodings have more than one way of representing a character. The Japanese katakana script, for example, can be written in full-width or half-width characters. Regardless of its width, a character carries the same meaning. Autonomy's software infrastructure uses canonicalization to ensure that all character forms are treated equally, by automatically converting them to an internationally recognized canonical form.
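The katakana example can be sketched in Python with Unicode normalization. The text does not say which canonical form Autonomy uses, so Unicode NFKC is an assumption made here purely for illustration; the function name `canonicalize` is likewise hypothetical.

```python
import unicodedata


def canonicalize(text: str) -> str:
    """Fold compatibility variants (such as half-width katakana) together.

    NFKC is an assumed stand-in for the canonical form mentioned in the text.
    """
    return unicodedata.normalize("NFKC", text)


# The word "katakana" written with half-width and full-width characters.
half_width = "\uff76\uff80\uff76\uff85"
full_width = "\u30ab\u30bf\u30ab\u30ca"

# Different code points before canonicalization, identical afterwards.
print(half_width == full_width)
print(canonicalize(half_width) == canonicalize(full_width))
```

Applying the same function to both indexed content and incoming queries means either width matches the other.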

Stoplists
Every language has words that carry little meaning on their own. In grammatical terms these are normally prepositions, conjunctions, auxiliary verbs and so on (for example, words such as "the", "a", "and" and "to" in English). These words can be safely ignored when processing content. Autonomy provides as standard a set of stoplists for the most commonly used languages.
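A stoplist amounts to a filter applied to the words of a text, as in the minimal Python sketch below. The set `STOPLIST` contains only the four English examples named above, and the whitespace tokenization and the function name `remove_stopwords` are simplifying assumptions; real stoplists are considerably longer.

```python
# Only the four example words from the text; a real stoplist is much larger.
STOPLIST = {"the", "a", "and", "to"}


def remove_stopwords(text: str) -> list[str]:
    """Lower-case the text, split on whitespace, and drop stoplist words."""
    return [token for token in text.lower().split() if token not in STOPLIST]


print(remove_stopwords("The cat and the dog went to a park"))
```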

Multiple encodings
Autonomy supports multiple encodings for languages such as Greek and Russian. Different encodings can be used interchangeably, so it does not matter which encoding content in a given language arrives in. It is possible, for example, to query in one recognized encoding for a language and receive results that are in other encodings.
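One way to picture encoding-independent matching is to decode everything to Unicode before comparing, as in the Python sketch below. It assumes the encoding of each document and of the query is already known, which the text does not specify, and the names `DOCUMENTS` and `search` are hypothetical.

```python
# The same Russian word stored in two different legacy encodings.
# Each entry pairs the raw bytes with its (assumed known) encoding.
DOCUMENTS = [
    ("Москва".encode("koi8_r"), "koi8_r"),
    ("Москва".encode("cp1251"), "cp1251"),
    ("Athens".encode("ascii"), "ascii"),
]


def search(query: bytes, query_encoding: str, documents) -> list[int]:
    """Return indexes of documents containing the query, whatever their encoding."""
    needle = query.decode(query_encoding)
    # Decode each document to Unicode so byte-level differences disappear.
    return [
        index
        for index, (raw, encoding) in enumerate(documents)
        if needle in raw.decode(encoding)
    ]


# A query supplied in a third encoding still matches both Russian documents.
print(search("Москва".encode("iso8859_5"), "iso8859_5", DOCUMENTS))
```

The raw bytes of the three Russian strings all differ, yet after decoding they compare equal.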

Architecture

Autonomy Inc. One Market Plaza, 19th Floor, San Francisco, CA 94105 Tel: 415 243 9955 Fax: 415 243 9984 Email: info@us.autonomy.com

Autonomy Systems Ltd Cambridge Business Park Cowley Road Cambridge CB4 0WZ Tel: +44 (0) 1223 448 000 Fax: +44 (0) 1223 448 001 Email: autonomy@autonomy.com

Other Offices Autonomy has additional offices in Atlanta, Boston and New York, as well as in Amsterdam, Brussels, Copenhagen, Frankfurt, Madrid, Milan, Munich, Paris, Oslo, Stockholm, Singapore and Sydney.

Copyright 2004-2005 Autonomy Corp. All rights reserved. Other trademarks are registered trademarks and the properties of their respective owners. Product specifications and features are subject to change without notice. Use of Autonomy software is under license.

www.autonomy.com
