
The SUDA Project: Collaborative Web-based Translation

Raphael A. Finkel, Computer Science Department, University of Kentucky, Lexington, KY 40506-0046, raphael@cs.uky.edu
Ross Scaife, Classics Department, University of Kentucky, Lexington, KY 40506-0027, scaife@pop.uky.edu
Huar-En Ng, Computer Science Department, University of Kentucky, Lexington, KY 40506-0046, heng00@cs.uky.edu

Abstract
SOL (Suda On Line) is a collaborative internet-based project involving dozens of researchers to transform the Suda, a 10th-century Byzantine Greek historical encyclopedia, into a searchable electronic text in English translation. The text will include SGML tags to delimit content such as personal names, places, and treatises and will also contain bibliographic and other references, many of which will be hypertext links to other projects elsewhere. SOL allows us to identify translators, establish editorial control, allocate encyclopedia entries to translators, accept translations, let translators modify their accepted translations, have editors vet translations, present texts with Greek and English components, present overall translation status, and perform searches.

1. The SUDA
The Suda is a 10th-century Byzantine Greek historical encyclopedia of the ancient Mediterranean world, derived from the scholia to critical editions of canonical works and from compilations by yet earlier authors. As the Oxford Classical Dictionary notes, "in spite of its contradictions and other ineptitudes, [the Suda] is of the highest importance, since it preserves (however imperfectly) much that is ultimately derived from the earliest or best authorities in ancient scholarship, and includes material from many departments of Greek learning and civilization" [1].

The standard edition of the Suda was edited by Ada Adler and published in five volumes over the years 1928–1938 [2]. It is organized alphabetically, with entries numbered by the pair ⟨first letter, sequence number⟩. For example, the entry ⟨beta, 4⟩ is headed by a Greek word and spans only 2 lines in the Adler edition. Some entries, such as the entry for Homer, are much longer. Each entry has a head word, which is occasionally several words long, and occasionally an entire phrase.

The full text of the Suda has been entered into a 7.3MB computer file by the Thesaurus Linguae Graecae (TLG). The TLG is an electronic data bank of all extant ancient Greek literature from Homer (8th century BCE) to 600 CE, with historiographical, lexicographical and scholiastic texts from the period between 600 and 1453 CE. The TLG project is located at the University of California, Irvine and can be reached via http://www.tlg.uci.edu/tlg/index.html. The Greek text is encoded in Beta code, which is an Ascii markup that contains alphabetic notations (for instance, a denotes α) along with codes for switching between Greek and Roman fonts, punctuation, and formatting. Unfortunately, the collection of data is not perfectly uniform; minor imperfections exist in the TLG's text as well as in the original Adler edition.

2. Suda On Line (SOL)
To date, the Suda has never been translated into English, which restricts its use to scholars with a firm knowledge of Greek. Its computer-file version is not searchable by any means other than exact string match in Beta code. The goal of our project, called Suda On Line (SOL), is to translate the Suda into English and to make the resulting text searchable by topic as well as by Boolean combinations of words or phrases.

Because the translation task is so large, no one person or even small group is willing to undertake the full task. Therefore, the project involves coordinating a large group of translators. The World-Wide Web (here, simply the Web) along with its platform-independent browsers allows us to center the work at one site and let a distributed group of translators access that site interactively. The starting point for all SOL users is the SOL home page, http://www.uky.edu/ArtsSciences/Classics/sol.html.

We will describe the various functions available to participants in order of increasing engagement. Some of our description reflects current implementation; in some places, we expect the implementation to eventually function essentially as specified, even if the implementation is not yet
Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

0-7695-0001-3/99 $10.00 (c) 1999 IEEE

complete. We classify SOL participants into four categories: guest, translator, editor, and managing editor. Members of each category are able to access SOL facilities appropriate to that category and all the preceding categories.
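The cumulative nature of the four categories can be illustrated with a short Python sketch (the SOL scripts themselves are written in Perl). This is only one plausible way to model it: the facility names and the table of required levels are hypothetical and do not come from the SOL code.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four participant categories, in order of increasing engagement."""
    GUEST = 0
    TRANSLATOR = 1
    EDITOR = 2
    MANAGING_EDITOR = 3


# Hypothetical facility names, each mapped to the lowest category allowed to use it.
REQUIRED_LEVEL = {
    "search": Level.GUEST,
    "view_status": Level.GUEST,
    "submit_translation": Level.TRANSLATOR,
    "vet_translation": Level.EDITOR,
    "approve_registration": Level.MANAGING_EDITOR,
}


def may_use(level: Level, facility: str) -> bool:
    """A category may use its own facilities and those of all preceding categories."""
    return level >= REQUIRED_LEVEL[facility]
```

Because the categories are ordered, a single comparison expresses the rule; for instance, `may_use(Level.EDITOR, "submit_translation")` holds while `may_use(Level.TRANSLATOR, "vet_translation")` does not.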

3. Guests
Guests are allowed to search through parts of the Suda that have been translated and stored in the database so far. Text phrases, Boolean combinations of phrases, particular keywords, and various field-specific searches are all possible. In addition, guests may search for all entries with head words starting with a particular Greek letter.

SOL keeps two versions of the translation database, the test version and the production version. The test version allows translators to experiment with data entry and to see the results without committing their work as finished. This version is erased periodically. The production version is meant to be permanent.

Translations are stored in Ascii with an idiosyncratic markup scheme, but we intend to transform it to a TEI-conformant SGML/XML tag organization [3]. A sample markup can be found via a link from the SOL home page. We use tags to delimit head words, the English translation of the head word, the time the translation was entered, the identifier of the translator, the translation itself, notes on the translation, keywords appropriate to the translation, bibliographic citations of interest to the reader, pointers to related Web sites, and a quick classification of the entry.

When a particular entry is returned by the search engine, some of these fields are formatted and rearranged for readability; we call that appearance "output format". In output format, the identifier of the translator is converted to the translator's name and becomes a mailto link so the guest performing the search can contact the translator. Output format also includes the Greek head word in Beta code. Alternative viewing styles are also available, including Unicode [4], commonly used fonts of classical Greek, GIF images, and Java applets capable of displaying Greek. The head word is accompanied by a link to the online Greek dictionary maintained by the Perseus project [5].
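As a rough illustration of the conversion to output format described above, the following Python sketch turns a stored translator identifier into a name with a mailto link. The field names, the sample record, and the participant data are all hypothetical; the actual SOL markup and its Perl scripts are not reproduced here.

```python
from html import escape

# Hypothetical participant database: identifier -> name and e-mail address.
PARTICIPANTS = {
    "samuel": {"name": "Samuel Example", "email": "samuel@example.org"},
}

# Hypothetical stored record; the field names are illustrative only.
SAMPLE_RECORD = {
    "headword_beta": "LO/GOS",
    "headword_english": "word",
    "translation": "An illustrative translation.",
    "translator": "samuel",
}


def translator_link(translator_id, participants):
    """Replace the translator identifier by the translator's name as a mailto link."""
    person = participants[translator_id]
    return '<a href="mailto:{0}">{1}</a>'.format(
        escape(person["email"], quote=True), escape(person["name"])
    )


def output_format(record, participants):
    """Rearrange selected fields of a record for display to a guest."""
    lines = [
        "Head word (Beta code): " + escape(record["headword_beta"]),
        "Translated head word: " + escape(record["headword_english"]),
        "Translation: " + escape(record["translation"]),
        "Translated by: " + translator_link(record["translator"], participants),
    ]
    return "<br>\n".join(lines)
```

Calling `output_format(SAMPLE_RECORD, PARTICIPANTS)` yields HTML in which the translator appears by name rather than by identifier.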
Because the TLG text of the Suda is protected by copyright, we do not include the Greek text in the output format.

Guests may view the current state of the translation effort. A graphical picture is produced categorizing parts of the range for each letter as vetted (white), translated (green), assigned to be translated (red), unassigned (grey), or nonexistent (black). This picture is active; clicking in a white or green region directs the browser to the translation of the given entry. In addition to viewing the global state of the translation effort, guests can view the roster of registered participants, which is a list of names, affiliations, phone numbers, and e-mail addresses (represented as links), sorted by level of engagement.

A guest can seek to increase the level of engagement either by registering as a participant or, if already registered, by logging in. Registration involves filling out a form giving those particulars that we store in the participant database, including the full name, e-mail address, phone numbers, affiliation, desired level of engagement, the desired personal identifier, and an initial password. The software checks that the e-mail address is valid and that required fields are filled in. Registration does not automatically confer approval. Instead, it generates mail to the community of managing editors informing them that a new participant has asked to join. Until a managing editor accepts this registration, it remains pending. When a managing editor either accepts or rejects the registration, mail is generated informing the guest of the outcome of the registration request.

A guest logs in by presenting a participant identifier (like samuel) and a password. If login is successful, the participant is presented with choices appropriate for the recorded level of engagement for that participant. If the login fails because the password is incorrect, the participant is given the option of receiving registration confirmation (which includes the password) re-sent through e-mail to the address on record. Once registered, a participant may update the personal profile and change passwords.
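The colour coding of the status picture described earlier in this section can be sketched in Python as follows. The function names and the use of sets of entry numbers are assumptions made for illustration; the order in which the states are tested (vetted before translated before assigned) is likewise an assumption.

```python
# Colours named in the text for each state of an entry.
COLORS = {
    "vetted": "white",
    "translated": "green",
    "assigned": "red",
    "unassigned": "grey",
    "nonexistent": "black",
}


def entry_state(number, last_entry, assigned, translated, vetted):
    """Classify one entry number within a letter (sets hold entry numbers)."""
    if number < 1 or number > last_entry:
        return "nonexistent"
    if number in vetted:
        return "vetted"
    if number in translated:
        return "translated"
    if number in assigned:
        return "assigned"
    return "unassigned"


def status_strip(width, last_entry, assigned, translated, vetted):
    """Return one colour per position 1..width for a single letter."""
    return [
        COLORS[entry_state(n, last_entry, assigned, translated, vetted)]
        for n in range(1, width + 1)
    ]


def is_clickable(state):
    """Only vetted (white) and translated (green) regions lead to a translation."""
    return state in ("vetted", "translated")
```

A strip wider than the number of entries in a letter simply ends in black positions, which matches the "nonexistent" category.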

4. Translators
A translator needs to be assigned a set of Suda entries before entering translations. Translators request assignments either in the form of a range, such as ⟨gamma,100⟩–⟨gamma,149⟩, or in the form of a topic, such as Sparta. The request indicates whether the translator wants to work in the test or the production database. Requests generate mail to managing editors, who make and modify assignments.

Given an assignment, the translator may select one of the assigned entries that has not yet been translated. The translator sees a form containing the head word and the text of the Suda entry in Greek. The form has blanks for placing the translation, notes, bibliographic references, related Web pointers, keywords, and the gross type of entry (person / place / other). The input is checked for completeness. If the input contains all required fields, the translator sees the data in output format and is given an opportunity to make last changes before storing it in the database.

Later, a translator may modify a translation previously submitted. Minor changes are simply accepted (and timestamped). Major changes are mailed to the managing editors for approval; they could indicate a human problem that needs to be dealt with by non-automated means.
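The distinction between minor and major changes rests, as the implementation section explains, on comparing the modified translation with its previous version and checking the proportion of changed words against a threshold (20%, for instance). A minimal Python sketch of that idea follows; it uses the standard `difflib` module as a stand-in for the differencing program, and the exact way of counting changed words is an assumption.

```python
import difflib


def changed_fraction(old_text, new_text):
    """Fraction of words that differ between two versions (0.0 = identical)."""
    old_words, new_words = old_text.split(), new_text.split()
    if not old_words and not new_words:
        return 0.0
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    # Words that appear, in order, in both versions count as unchanged.
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - unchanged / max(len(old_words), len(new_words))


def is_major_change(old_text, new_text, threshold=0.20):
    """Flag a modification as excessive when too many words have changed."""
    return changed_fraction(old_text, new_text) > threshold
```

Replacing one word in a seven-word translation changes about 14% of it and would be accepted as minor under this measure, while a wholesale rewrite would be routed to the managing editors.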


Translators retain the right to reuse their own work. They can publish it again elsewhere themselves or grant other scholars and teachers the right to use it, just as they wish. However, it is essential to the long-term stability and viability of the SOL project that once materials have been submitted, reviewed, and entered into the database, they cannot be later withdrawn. Subsequent emendation or expansion can and should occur. Once an entry has been translated, the translator may still modify the translation, but any major removal is disallowed.

5. Editors
Editors are best seen as area consultants. Their job is to examine translations and correct the grammar, format, style, and content. We call this operation "vetting". Each editor has a personal area of expertise. A single translation may therefore be vetted by multiple editors. Scholars who retrieve a translation see not only who translated it, but also the list of editors who have vetted the translation. The scholar can then determine how much credence to give to the translation. When a translator modifies a vetted entry, the vetting is nullified. Editors can view the entire history of an entry, including all translations, retranslations, and vettings. However, their vetting only applies to the most recent version of the translation.

Editors can view the current progress of the SOL project, just as guests can. They can see which entries have been translated but not yet vetted or only vetted by other editors, and they can zoom the graphical display to pinpoint exactly which entry they wish to vet. From that display they can move to a vetting environment, in which they see the translation in both input and output format. They may then make changes to the input format. Minor changes are simply accepted. Major changes are sent as mail to the original translator so the editor and translator can discuss the changes and let the translator make the changes personally. After the editor is happy with the translation, the editor establishes that the text has been vetted, at which point a vetting notation is made in the translation database and mail is sent to the translator.
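The rule that vetting applies only to the most recent version, and is nullified by a later modification, can be modelled with a small history structure. The Python sketch below is an illustration under assumed names (`Version`, `Entry`, `submit`, `vet`); it does not describe how SOL actually stores its history.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Version:
    """One translation or retranslation, with the editors who vetted it."""
    text: str
    translator: str
    vetted_by: List[str] = field(default_factory=list)


@dataclass
class Entry:
    """Full history of one Suda entry, oldest version first."""
    history: List[Version] = field(default_factory=list)

    def submit(self, text: str, translator: str) -> None:
        # A new version starts with no vettings, so earlier vetting is nullified.
        self.history.append(Version(text, translator))

    def vet(self, editor: str) -> None:
        if not self.history:
            raise ValueError("nothing to vet: entry has no translation yet")
        current = self.history[-1]  # vetting applies to the latest version only
        if editor not in current.vetted_by:
            current.vetted_by.append(editor)

    def current_vetters(self) -> List[str]:
        return list(self.history[-1].vetted_by) if self.history else []
```

Earlier versions keep their own vetting lists, so the whole history remains available to editors even though only the latest version's vetters are shown to scholars.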

6. Managing editors
Managing editors control the entire effort in a distributed-but-equal fashion by consensus. Any managing editor can make decisions that affect the entire project. For this reason, the college of managing editors must be small and cohesive.

Managing editors receive mail whenever a guest tries to register for any level of engagement above that of permission to search the database. Through a Web form, managing editors can decide whether to accept the registration, to send further questions to the guest, or to deny registration. When a guest is allowed to register, the managing editor decides what level of engagement is appropriate.

It is not reasonable to remove anyone from the list of participants, especially after the participant has entered translations. Complete removal of a participant would make it impossible to present information about those translations. Instead, managing editors may set any participant to "inactive", which leaves the database record for that participant intact but prevents the participant from modifying the database further.

Managing editors respond to requests for regions of the text to allocate. All managing editors get mail whenever a translator requests an assignment. Managing editors use graphical tools to discover what regions of the text are still available for translation and assign accordingly. Often, assignments are contiguous ranges within a single alphabetic letter. Sometimes, assignments are based on content, and they may include entries scattered throughout the text. The software prevents new assignments from overlapping existing assignments.

The site administrator at the computer that hosts the SOL project has de facto privileges, of course, beyond even those of managing editors. The administrator can make any modification whatsoever to the databases, deleting or changing translations, participant engagement, and so forth. We therefore trust the administrator not to act hastily and to protect the data. All data are automatically backed up daily as part of the host site's ordinary maintenance.

7. Implementation method
The SOL package is written as CGI scripts invoked by Web forms. Typically, these scripts produce new forms as their output. The scripts are written in Perl [6]. Once the scripts are stable, they will be available at ftp://ftp.cs.uky.edu/cs/software/suda.tgz so others can pursue similar projects based on our code.

These forms often need to display Greek. Displaying Greek in Beta code is straightforward but not particularly friendly to the reader. Java-enabled browsers can run an applet that presents Beta code as Greek letters. Unicode-aware browsers with a classical Greek font can display Greek more directly [4]. There are also several freely available higher Ascii font sets for the display of polytonic Greek, and we may implement translation to one or more of these. Finally, GIF images containing pictures of Greek can be placed in the forms. The choice of display methods is made at the SOL server. The display preferences of a participant are elicited at login time and used throughout the session.

All databases are currently flat files in Ascii. This choice of formats makes the data machine-independent and allows for attribute values with arbitrary length. In this regard, the databases are like those of Qddb [7]. Individual attributes are separated either by newlines or by the | character. This flat format simplifies the database aspects of the project, including insertion, search, and modification.

As of September 18, 1998, the production database contains about 78K of data, with about 150 translated entries. As the translation grows, the production translation database will be subdivided into multiple files. When a translator or editor modifies a translation, the software removes the entire entry from its file and places a new copy at the end. The cost of these operations is linear in the size of the file involved. For this reason, we will limit individual database files to about 100KB. Search will require indices when the data become large; we will likely use Qddb for our database engine at that time.

The participant database records each registered participant, including guests, linking the translator identifier to the participant's full name, phone number, affiliation, e-mail address, and level of engagement. This database is used to produce the output format, which does not reveal translator identifiers. As of September 18, 1998, there are seven managing editors, one area editor, 16 translators, and six registered guests.

The assignment database has an entry for each translator listing the Suda entries that translator has been assigned. It is parallel to the completion database, which has an entry for each translator showing what has been completed. Both these databases use entry lists, which are comma-separated lists of ranges. A range is in the form gamma,14-18, indicating the letter and a contiguous set of numbered Suda entries within that letter's Suda entries. These databases are used to guide managing editors as they give new assignments and translators as they choose a Suda entry to translate. They are also used to create the graphics depicting the current status of the SOL effort.

In order to prevent translators and editors from massively changing a translation, the software compares the modified translation with its previous version using a differencing program.
If the number of words changed exceeds a threshold (20%, for instance), the modification is flagged as excessive.

Because it distributes copyrighted material, SOL needs to maintain at least minimal security. SOL does not use secure communication through the network. However, logged-in participants are identified on each form by hidden variables with their participant identifier and with an encrypted string derived from that identifier, the current month, and a master password. This information is checked by the scripts responding to each form. It is quite unlikely that an intruder could guess a string that would grant access. Even if an intruder collects such a string through snooping network traffic, it will only be good for the current month.
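The hidden-variable check can be illustrated with a short Python sketch. The text does not say how the encrypted string is computed, so a keyed hash (HMAC-SHA256 from the standard library) is swapped in here purely as an assumption; the function names and the year-and-month formatting are hypothetical as well.

```python
import hashlib
import hmac
from datetime import date


def session_token(participant_id, master_password, today):
    """Derive a string from the identifier, the current month, and a master password."""
    month = today.strftime("%Y-%m")  # the token changes when the month changes
    message = "{0}|{1}".format(participant_id, month).encode("utf-8")
    key = master_password.encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def token_is_valid(participant_id, token, master_password, today):
    """Check the hidden variables submitted with a form."""
    expected = session_token(participant_id, master_password, today)
    return hmac.compare_digest(expected, token)
```

A token issued on September 18, 1998 verifies for the rest of that month and fails from October 1 onward, which mirrors the one-month lifetime of a snooped string noted above. Such a scheme does not protect the string in transit; it only bounds how long a captured string remains useful.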

8. Experience, Limitations, and Generalizations


Because the SOL package is still under development, we don't yet have much experience on which to base an evaluation. Our approach has been to build the fundamental structure as quickly as possible and to improve it iteratively, adding features as we understand the problems better and see needs for enhancement. In this way, many limitations will be identified and, hopefully, removed, during the course of development.

We believe that the SOL tools are generally useful for collaborative projects. We hope to see them adopted and enhanced by other scholars in their own work. Our design does have some built-in limitations, however, which might stand in the way of using our tools for other cooperative projects.

• SOL is a translation effort. Collaborative work aimed at, for instance, cataloging paintings in museums is similar and could likely use the same sort of tools as SOL. Collaborative work for designing an airplane would most likely not fit into the SOL framework.

• SOL is text-based. Not only the raw material (Greek text) but also the derived result (annotated English text) can be represented in Ascii, albeit with some encoding for Greek. Generalizing, we could imagine a collaborative effort to convert a large database stored in any raw form into some other derived form. Raw forms are not necessarily limited to text. For example, a collaboration might try to summarize the important aspects of a database of paintings. Here, the raw form would be paintings, represented as graphics. Derived forms are not limited, either; a collaboration might produce performances of a database of musical scores. Here, the derived form could be an audio file. Our tools could be modified to handle graphical raw forms such as paintings and musical scores, but it is not easy to see how non-textual derived forms such as performances could be submitted by the translators. Web forms are designed to accept text, not graphics or audio. Furthermore, the fact that the derived form is text allows SOL to search through the translations based on content. If the raw form were text but the derived form were not, SOL could be modified to search the raw domain and present the associated derived material. If neither is text-based, searching becomes much harder, although methods are being studied for classifying multimedia data for search purposes [8].

• There is a one-to-one relation between raw and derived material. For every encyclopedia entry in the Suda, there is a single translation, albeit modified and enhanced by vetting. This correspondence allows us to display progress towards the final goal in a graphical way and to generate reasonable assignments. A collaborative effort to review movies, on the other hand, would admit multiple outputs for each input.

• The raw material is organized serially. The fact that the text is organized alphabetically leads us to treat contiguous ranges efficiently. If a translator wants to be assigned all entries having to do with, say, pre-Socratic philosophers, the appropriate assignment includes singleton ranges in many different letters. This assignment is inefficient to represent and awkward for the managing editor to specify. Unfortunately, translators with particular areas of interest or expertise do request exactly such assignments. Our current data structures are able to handle this situation, albeit awkwardly.

• Since this project involves people acting in a community, there are many non-technical human issues that can arise. For instance, we have little control over the quality of the translation. In order to provide some control, we have editors who can vet the translations. It remains to be seen how effectively this organization exercises proper control over the translations.
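To make the point about serial organization concrete, the following Python sketch parses entry lists of the form given in the implementation section (for example gamma,14-18) and checks two assignments for overlap, as the software is said to do before accepting a new assignment. Several details are assumptions: letters are spelled as lowercase names, a singleton range may be written without a hyphen, and the function names are hypothetical.

```python
import re

# One range: a letter name, a comma, and either "14" or "14-18".
RANGE = r"([a-z]+),(\d+)(?:-(\d+))?"


def parse_entry_list(text):
    """Parse e.g. 'gamma,14-18,delta,3' into [('gamma', 14, 18), ('delta', 3, 3)]."""
    text = text.replace(" ", "")
    if not re.fullmatch(RANGE + "(?:," + RANGE + ")*", text):
        raise ValueError("malformed entry list: " + repr(text))
    ranges = []
    for match in re.finditer(RANGE, text):
        letter = match.group(1)
        low = int(match.group(2))
        high = int(match.group(3) or match.group(2))  # singleton if no upper bound
        if high < low:
            raise ValueError("empty range in entry list: " + match.group(0))
        ranges.append((letter, low, high))
    return ranges


def overlaps(first, second):
    """True if any range in one entry list shares an entry with the other."""
    return any(
        a_letter == b_letter and a_low <= b_high and b_low <= a_high
        for a_letter, a_low, a_high in first
        for b_letter, b_low, b_high in second
    )
```

A topic-based assignment becomes a long list of singleton ranges spread over many letters, such as `alpha,17,gamma,204,theta,9`; the representation copes with it, but every singleton costs a full range, which is the inefficiency noted in the list above.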

References
[1] S. Hornblower and A. Spawforth, eds., The Oxford Classical Dictionary. Oxford University Press, 3rd ed., December 1996.
[2] A. Adler, ed., Suidae Lexicon. Stuttgart: Verlag Teubner, 1928–1938. In 5 volumes.
[3] Text Encoding Initiative, Guidelines for the Encoding and Interchange of Machine-Readable Texts. http://etext.virginia.edu/TEI.html, May 1994.
[4] The Unicode Consortium, The Unicode Standard, Version 2.0. Addison-Wesley, 1996.
[5] G. R. Crane, ed., The Perseus Project. http://www.perseus.tufts.edu, 1998.
[6] L. Wall and R. Schwartz, Programming Perl. O'Reilly and Associates, 1990.
[7] E. H. Herrin, II and R. A. Finkel, "Schema and tuple trees: An intuitive structure for representing relational data," Computing Systems, vol. 9, no. 2, 1996, pp. 93–118.
[8] R. Yavatkar, J. Griffioen, and R. Adams, "A Framework for Developing Content-Based Retrieval Systems," ch. 15 in Intelligent Multimedia Information Retrieval. AAAI/MIT Press, 1996.


