3 (2004) 165–180, Chinese Language Computer Society & World Scientific Publishing Company
Producing Algorithmically Standard Romanization of Arabic Names Using Hints from Non-Standards
FAWAZ S. AL-ANZI
Department of Computer Engineering, Kuwait University, P.O. Box 5969, Safat, Postal Code 13060, Kuwait alanzif@eng.kuniv.edu.kw
This article addresses the problem of producing the standard Romanization of Arabic names from their undiacritized Arabic forms and their corresponding non-standard Romanizations. The Romanization of Arabic names has long been studied and standardized. Nevertheless, a huge number of databases of non-standard Romanized Arabic names exist and are in use in many private and government agencies. Examples of such applications are passport holder name databases, phone directories, and geographic names databases. Dealing with such databases can be inefficient and can produce inconsistent results. Converting them to their standard Romanization can help solve these problems. In this paper, we present an efficient algorithmic software implementation that produces the standard Romanization of names written in the Arabic alphabet by utilizing the hints in the existing non-standard Romanized databases. The results of the software implementation are very promising.
Keywords: Arabic; Names; Romanization; Transliteration; Standard; Database; Search; Security; Geographic; Map.
1. Introduction
Handling Romanized Arabic names has many applications. Examples of such applications are passport holder name databases, phone directories, and geographic names databases. Processing such databases will not produce accurate and consistent results if the Romanized Arabic names stored in them are not consistent. This inconsistency arises from the practice of Romanizing names without referring to the standard procedures for the Romanization of Arabic names. Unfortunately, this practice has been going on for quite some time, and a huge number of databases have been generated and deployed as official sources of information in many private and government agencies.
This inconsistent representation of Romanized Arabic names can handicap many serious uses of these databases in the future. For example, consider the security hazard of different Romanizations of the same Arabic name appearing in passports, and how this hampers the ability of non-Arabic countries to keep track of such a person for security reasons. As another example, consider searching for a person's phone number in a phone directory using a non-standard Romanized Arabic name as the search key. Also, consider the difficulty of finding a Romanized geographic name on a map if you do not know the correct Romanized form of the Arabic place name; see Figure 1. Researchers have discussed and attempted many approaches to solve or reduce these problems. Among these attempts, promising solutions have emerged, such as phonetic representation and cross-language phonetic search [14]. However, the simplest and most accurate way to solve these problems is to produce a standard and uniform Romanization of Arabic names.
Arabic is written from right to left. As opposed to several other languages, uniform results in the Romanization (transliteration) of Arabic are difficult to obtain, since vowel points and some of the diacritical marks necessary for the certain identification of Arabic words are generally omitted from both handwritten and printed Arabic texts. It follows that the person doing the Romanization must be able to identify the words used in the names and must know their standard written Arabic spelling, their proper vowel pointing, and how to eliminate peculiarities resulting from dialectal and idiosyncratic variation. The problem gets more complicated when dealing with special names such as geographic names. This is because, in most Arabic-speaking regions, a large proportion of the common and geographic names are not available in Arabic script. This applies to Arabic names as well as to those of non-Arabic origin. Even when the Arabic script is available, it is not always possible to determine the proper vowels from dictionaries and other reference tools. The Romanization is generally reversible, though there are some ambiguous letter sequences (dh, kh, sh, th), which may also point to combinations of Arabic characters in addition to the respective single characters.
Arabic text is quasi-stenographic. It is usually presented without diacritical marks, which denote short vowels and geminated consonants. Relying on different types of linguistic and textual redundancy, the reader has to substitute for the missing diacritics [5]. Non-diacritization is a deep-seated property of Arabic orthography. Attempts to produce tools that generate diacritics for general text are underway [6–9]. Most of the tools developed in this area concentrate on understanding the text (sentence) and produce the proper diacritics of a word according to its position in the text. Such tools are still at an early stage, and it will be quite some time before an efficient general-purpose tool for diacritics generation is produced. A more narrowly scoped, although not necessarily easier, problem is to produce a tool for generating the diacritics of Arabic names. This would be an essential tool for producing an accurate Romanization of Arabic names written in the Arabic alphabet. Until such a tool is perfected, an alternative way to produce the Romanization of Arabic names must be explored.
In this paper, we present an efficient algorithmic implementation that produces a standard Romanization of names written in the Arabic alphabet by utilizing the hints in the existing non-standard Romanized databases. This implementation can be used to produce software that generates consistent results regardless of the skill of the Romanization personnel who use the system. This paper is divided into six sections. Section 1 is the introduction. Section 2 gives a historical background of the Romanization of Arabic names. Section 3 presents the model that utilizes hints in the non-standard Romanization of Arabic names to produce standard Romanization results. Section 4 presents the standardization of the BGN/PCGN-1956 System. Section 5 presents the results of testing the system on the phone directory of the Ministry of Communication in Kuwait. Finally, Section 6 presents some conclusions.
cannot be definitely ascertained which of the Arabic-speaking countries have adopted this system officially. Judging by the use of names in international cartographic products that rely mostly on national sources, it appears that the UN system is more or less current in Iraq, Kuwait, the Libyan Arab Jamahiriya, Saudi Arabia [13], the United Arab Emirates, Yemen, and some other countries (the system is often used without diacritical marks). For the geographical names of the Syrian Arab Republic, international maps favor the UN system, while local usage seems to prefer a French-oriented Romanization. In Egypt and Sudan, there are also local Romanization schemes or practices that are used side by side with the UN system. The geographical names of Algeria, Djibouti, Mauritania, Morocco and Tunisia are generally rendered in the traditional manner that conforms to the principles of French orthography.
Resolution 7 of the Seventh UN Conference on the Standardization of Geographical Names (1998) recommended that the League of Arab States should, through its specialized structures, continue its efforts to organize a conference with a view to considering the difficulties encountered in applying the amended Beirut system of 1972 for the Romanization of Arabic script, and submit, as soon as possible, a solution to the United Nations Group of Experts on Geographical Names. At the Eighth UN Conference on the Standardization of Geographical Names (2002), the Arabic Division of the UN Group of Experts announced that it had finalized the proposed modifications to the UN-recommended Romanization system. These proposals would be submitted to the League of Arab States for approval.
2.1. Other systems of Romanization
Some proposed changes to the UN system were agreed to by the Arab delegations at the Eighth UN Conference on the Standardization of Geographical Names in Berlin (2002) [14]. These include Romanizing the character ( ) as dh instead of z, and replacing the cedilla (,) with a sub-macron (_) in all characters with cedillas.
Some less well-known systems for the Romanization of Arabic names also exist. For the benefit of the reader, we mention some of them. The I.G.N. System 1973 (sometimes also called Variant B of the Amended Beirut System) uses an amended Beirut System [15]. In these systems, minor amendments are used to resolve local problems and some pronunciation considerations. The transliteration ISO 233:1984 gives every character and diacritical mark a unique equivalent; e.g., the long vowels in Arabic (ā, ī and ū) are consequently written as a, iy and uw, respectively, in the ISO transliteration.
The Royal Jordanian Geographic Centre (RJGC) System [16] is essentially the same as the amended Beirut system, except that the sub-macron is used instead of the cedilla. In the Survey of Egypt System (SES) of Romanization, the variants in parentheses are used depending on pronunciation and tradition. The article is always written as el- (El-Kafr el-Qadim, Sharm el-Sheikh). In Algeria, there is at present no official Romanization system; the prospects of establishing such a system are being discussed in the Permanent Commission for Taxonomy (CPST) at the National Council of Geographical Information (CNIG) [17]. A system that is used in Lebanon, close to the I.G.N. 1973 System, is mentioned in ISO 3166-2:1998 (Codes for the representation of names of countries and their subdivisions. Part 2: Country subdivision code): Principles for Romanization from Lebanese Arabic to Latin Characters (National Ministry of Defense of the Lebanese Republic, 1963). However, in 2002 Lebanon submitted a document in which all geographical names were Romanized using the UN system [18]. In Mauritania, the Romanized name forms in official maps edited since 1969 have been rendered in accordance with a simplified version of the I.G.N. system [19]. In Morocco, the official Romanization system for the Arabic script dates from June 17, 1932, although changes to it are being planned [20]. In Tunisia, the Directorate of Topography and Cartography officially adopted the amended Beirut system with minor modifications in 1983 (e.g., adding a letter 9 to the table).
Phase II: Algorithm II uses the diacritized Arabic names to generate the standard Romanization of the Arabic name.
In the model proposed in this paper, we differentiate between two types of strings: undiacritized and diacritized. Let $a = a_1 a_2 \ldots a_n$ denote an undiacritized letter sequence. A possible diacritization of $a$ is a diacritized letter sequence $a' = a'_1 a'_2 \ldots a'_n$ such that $a_i$ is the same as $a'_i$ with the diacritics removed. To convert a letter sequence from the second type to the first, we only have to remove all diacritics. We also use the following functions:
Mid(i,k,S) is a function that returns the k letters (with diacritics if available) starting at position i of a string S, in the proper order.
Length(S) is a function that returns the number of letters in a string S.
3.1. Algorithm I: Diacritizing Arabic alphabets using hints from non-standard Romanization
In this algorithm, the knowledge is stored in the non-standard Romanization of the Arabic name. The algorithm Diacritize(A,E) works on two strings: A is the non-diacritized Arabic name and E is the non-standard Romanization of the same name. The algorithm starts by mapping every non-vowel letter i in the Arabic name A to the proper corresponding letter in the Roman string E and storing its position in the hash L(i). The possible correspondences are given in Table 1. Next, we test the possible diacritics of a letter i by examining the substring M between that letter and the letter following it in the Roman string, i.e., between L(i) and L(i + 1) − 1 in the string E. The possible corresponding Roman letters for diacritics are given in Table 2. For example, consider the information pair of the name (A = محمد, E = mohammad), which represents the non-diacritized Arabic spelling and its non-standard Romanization.
We start by computing the L( ) hash of the Arabic alphabet representation as follows:
L(1) = 1, i.e., the corresponding letter of م in E is at position 1, letter m
L(2) = 3, i.e., the corresponding letter of ح in E is at position 3, letter h
L(3) = 5, i.e., the corresponding letter of م in E is at position 5, letter mm
L(4) = 8, i.e., the corresponding letter of د in E is at position 8, letter d
Since the letter at position 3, م, of the Arabic alphabets is matched with mm as a TASHDID letter in Table 1, the letter should be followed by the TASHDID diacritic. Next, we have to compute the M that follows every letter:
Table 1. Single and TASHDID Arabic letters and their possible match(es) in Romanization.

# | Arabic letter | Romanization possible match | TASHDID (doubled)
1 | ا | a, e, o, u, i |
2 | ب | b | bb
3 | ت | t | tt
4 | ث | th | tth
5 | ج | j, g | jj, gg
6 | ح | h | hh
7 | خ | k, kh | kk, kkh
8 | د | d | dd
9 | ذ | th | tth
10 | ر | r | rr
11 | ز | z | zz
12 | س | s | ss
13 | ش | sh | ssh
14 | ص | s | ss
15 | ض | d, dh, z | dd, ddh, zz
16 | ط | t | tt
17 | ظ | d, dh, z | dd, ddh, zz
18 | ع | a, e, o, u, i, ', ` |
19 | غ | g, gh, k | gg, ggh, kk
20 | ف | f | ff
21 | ق | g, q, k | gg, qq, kk
22 | ك | k | kk
23 | ل | l | ll
24 | م | m | mm
25 | ن | n | nn
26 | ه | h | hh
27 | ة | h, t | hh, tt
28 | و | w, o, u | ww, oo, ou, uu
29 | ي | y, e, i | yy, ee, ie
30 | | a, e, o, u, i, y, e, i | yy, ee, ie
Table 2. Arabic diacritic possible equivalency for Roman letters.

# | Roman letter | Arabic diacritic equivalency
1 | A | FATHAH
2 | E | KASRAH
3 | I | KASRAH
4 | U | FATHAH
5 | O | DHAMAH
6 | W | DHAMAH
7 | Y | KASRAH
M1 = o, i.e., the hint for the diacritic following the letter at position 1 is DHAMAH
M2 = a, i.e., the hint for the diacritic following the letter at position 2 is FATHAH
M3 = a, i.e., the hint for the diacritic following the letter at position 3 is FATHAH
M4 = (empty), i.e., the hint for the diacritic following the letter at position 4 is no diacritic
Hence, the output of the diacritization process for the given name is مُحَمَّد.
A formal description of the algorithm is given below:

Procedure Diacritize(A,E)
    p = 0
    For i = 1 to Length(A)
        L(i) = Match(p,i,A,E)
    Next i
    L(Length(A) + 1) = Length(E) + 1
    For i = 1 to Length(A)
        M = Mid(L(i), L(i + 1) − L(i) + 1, E)
        Add diacritics to the letter at position i in string A as Dk(M)
    Next i
End procedure

Procedure Match(p,i,A,E)
    Find the first location j of a possible match, as in Table 1, of the letter Mid(i,1,A) in the string E at positions greater than or equal to p
    Check if the possible match of the letter is doubled. If so, then add the diacritic TASHDID to the letter Mid(i,1,A) in string A
    If no match is found then print Error
    Set p = j + 1
    Return j
End procedure

Function Dk(M)
    Return the diacritic that is the first equivalent of the string M as in Table 2
End function
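The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: `TABLE1` holds only a few rows of Table 1, the doubled (TASHDID) form is assumed to be the first Roman letter repeated (which agrees with Table 1 for the rows used here), and the hint substring is taken strictly between consecutive matches, as in the worked example. The names `match`, `dk` and `diacritize` are hypothetical.

```python
# Sketch of Algorithm I (Diacritize); tables are partial and illustrative.
FATHAH, DHAMAH, KASRAH, TASHDID = "\u064E", "\u064F", "\u0650", "\u0651"

# Assumed subset of Table 1: Arabic letter -> possible single Roman matches.
TABLE1 = {
    "\u0645": ["m"],        # MIM
    "\u062D": ["h"],        # HA
    "\u062F": ["d"],        # DAL
    "\u062E": ["kh", "k"],  # KHA
    "\u0633": ["s"],        # SIN
    "\u0646": ["n"],        # NUN
}

# Table 2 as printed in the paper: Roman letter -> diacritic hint.
TABLE2 = {"a": FATHAH, "e": KASRAH, "i": KASRAH, "u": FATHAH,
          "o": DHAMAH, "w": DHAMAH, "y": KASRAH}


def match(letter, roman, p):
    """Return (start, end, doubled) for the first match of letter at or after p."""
    singles = sorted(TABLE1[letter], key=len, reverse=True)
    for j in range(p, len(roman)):
        for r in singles:  # doubled forms are tried before single forms
            if roman.startswith(r[0] + r, j):
                return j, j + len(r) + 1, True
        for r in singles:
            if roman.startswith(r, j):
                return j, j + len(r), False
    raise ValueError(f"no match for letter {letter!r} in {roman!r}")


def dk(hint):
    """Return the diacritic of the first letter of hint that appears in Table 2."""
    for ch in hint:
        if ch in TABLE2:
            return TABLE2[ch]
    return ""


def diacritize(arabic, roman):
    """Add diacritics to an undiacritized name using its Roman spelling as hints."""
    roman = roman.lower()
    spans, p = [], 0
    for letter in arabic:
        start, end, doubled = match(letter, roman, p)
        spans.append((start, end, doubled))
        p = end  # the next search starts after the matched Roman letters
    out = []
    for i, letter in enumerate(arabic):
        _, end, doubled = spans[i]
        nxt = spans[i + 1][0] if i + 1 < len(spans) else len(roman)
        out.append(letter + (TASHDID if doubled else "") + dk(roman[end:nxt]))
    return "".join(out)


name = "\u0645\u062D\u0645\u062F"  # undiacritized spelling from the example
print(diacritize(name, "mohammad"))
```

For the pair used in the worked example, the sketch yields the hints o, a, a and an empty hint, and marks the third letter with TASHDID, matching the values of L and M derived above.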
Table 3. Special rules for Arabic alphabets/diacritics transliteration.

Name | Condition | Output
FATHAH YA SUKUN | TRUE | ay
FATHAH WAW SUKUN | TRUE | aw
FATHAH ALIF | TRUE | ā
FATHAH ALIF MAQSURAH | TRUE | á
FATHAH | TRUE | a
DAMMAH WAW | TRUE | ū
TANWIN DAMMAH | TRUE | un
TANWIN KASRAH | TRUE | in
TANWIN FATHAH | TRUE | an
DAMMAH | TRUE | u
KASRAH YA | TRUE | ī
KASRAH | TRUE | i
SUKUN (JAZMAH) | TRUE | omit
TASHDID | Letter at i − 1 is a sun letter and the two letters ( , ) at positions i − 2 and i − 1 and a white space at position i − 3 | omit l and sun letter is doubled
TASHDID | ELSE |
ALIF MADDAH | i = 1 |
ALIF MADDAH | ELSE |
HAMZAT AL WASL | TRUE |
Table 4. Rules for Arabic alphabets Romanization.

Arabic alphabet | Name | Condition | Output
ء | HAMZAH | i = 1 | omit
ء | HAMZAH | ELSE | ’
ا | ALIF | TRUE | omit
ب | BA | TRUE | b
ت | TA | TRUE | t
ث | THA | TRUE | th
ج | JIM | TRUE | j
ح | HA | TRUE | ḩ
خ | KHA | TRUE | kh
د | DAL | TRUE | d
ذ | DHAL | TRUE | dh
ر | RA | TRUE | r
ز | ZAY | TRUE | z
س | SIN | TRUE | s
ش | SHIN | TRUE | sh
ص | SAD | TRUE | ş
ض | DAD | TRUE | ḑ
ط | TA | TRUE | ţ
ظ | ZA | TRUE | z̧
ع | AYN | TRUE | ‘
غ | GHAYN | TRUE | gh
ف | FA | TRUE | f
ق | QAF | TRUE | q
ك | KAF | TRUE | k
ل | LAM | TRUE | l
م | MIM | TRUE | m
ن | NUN | If letter at i − 1 is ب and letters at i − 2 and i + 1 are white spaces | Bin
ن | NUN | ELSE | n
ه | HA | If letter at i − 1 is any of the letters ( , , ) | ·h
ه | HA | ELSE | h
و | WAW | TRUE | w
ي | YA | TRUE | y
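As an illustration of how the rules in Tables 3 and 4 are applied in Phase II, the following Python sketch Romanizes a diacritized name using only a handful of rules: a few consonants, the three short vowels, SUKUN and TASHDID. It is a simplified sketch under stated assumptions, not the author's implementation. The conditional rules (sun letters, Bin, ALIF MADDAH, long vowels) are left out, TASHDID is assumed to follow its consonant directly, and the cedilla form of HA follows the BGN/PCGN convention. The name `romanize` and both tables are hypothetical.

```python
# Sketch of Phase II: diacritized Arabic -> Roman, using a small rule subset.
FATHAH, DHAMAH, KASRAH = "\u064E", "\u064F", "\u0650"
TASHDID, SUKUN = "\u0651", "\u0652"

# Assumed subset of the consonant rules (all with condition TRUE).
CONSONANTS = {
    "\u0628": "b",       # BA
    "\u062D": "\u1E29",  # HA, h with cedilla
    "\u062F": "d",       # DAL
    "\u0631": "r",       # RA
    "\u0633": "s",       # SIN
    "\u0644": "l",       # LAM
    "\u0645": "m",       # MIM
}

# Assumed subset of the diacritic rules (all with condition TRUE).
DIACRITICS = {FATHAH: "a", DHAMAH: "u", KASRAH: "i", SUKUN: ""}


def romanize(name):
    """Romanize a diacritized name made of the letters and marks listed above."""
    out = []
    for ch in name:
        if ch == TASHDID:
            # Assumes TASHDID comes right after its consonant: double it.
            if not out:
                raise ValueError("TASHDID without a preceding letter")
            out.append(out[-1])
        elif ch in DIACRITICS:
            out.append(DIACRITICS[ch])
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        else:
            raise ValueError(f"no rule in this sketch for {ch!r}")
    return "".join(out)


diacritized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
print(romanize(diacritized))
```

A full implementation would need to look at neighboring positions (i − 1, i + 1 and so on) to evaluate the conditional rows, which this character-by-character loop deliberately avoids.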
5. Results
In this section, we present our experience of applying the proposed model to produce the standard (uniform) Romanization of the Ministry of Communication phone directory in Kuwait. The directory consists of 50,885 Arabic names, each given in the non-diacritized Arabic alphabet together with a non-standard Romanization of the same name. The proposed model failed on only 1,439 names, which constitute about 2.83% of the total. In other words, the model succeeded in producing the correct standard Romanization for 97.17% of the directory. The rest of the directory needs to be processed manually. Most of the failures in the automatic Romanization were due to errors in the directory itself, in which the non-standard Romanization was missing a letter or had some letter positions interchanged. Table 5 shows examples of the results obtained using our model. Notice that the model successfully produced a standard Romanization of the non-diacritized Arabic names by utilizing the hints in the given non-standard Romanization of the names.
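As a quick check, the reported percentages follow directly from the two counts stated above:

```latex
\[
\frac{1439}{50885} \approx 0.0283 = 2.83\%,
\qquad
\frac{50885 - 1439}{50885} = \frac{49446}{50885} \approx 0.9717 = 97.17\%.
\]
```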
Table 5. Sample results of applying the proposed model.

# | Non-standard Romanization | Standard Romanization
1 | abalqelob | balqilub
2 | abdulgafour | bdalghafur
3 | gharzaldeen | gharzaldīn
4 | gharghani | garganī
5 | mohamad | muhamad
6 | mohamed | muhamid
7 | mohammad | muhammad
8 | mohammed | muhammid
9 | muhamed | muhamid
10 | yassain | yassn
6. Conclusions
This paper addressed the Romanization of Arabic names and its standardization. We presented an efficient algorithmic software implementation that produces the standard Romanization of names written in the Arabic alphabet by utilizing the hints in the existing non-standard Romanized databases. The model has been formally formulated and implemented, and the results of the software implementation are very promising. This research has a direct impact on Arabic-speaking countries, since its results can be applied in many Arabic text-processing databases. A huge number of databases of non-standard Romanized Arabic names exist and are in use in many private and government agencies. Dealing with such databases can be made more efficient and can produce more consistent results by applying our model. Converting such databases to their standard Romanization can help solve many problems in cases such as generating Roman names for passports, searching phone directories, and generating Roman-based maps from Arabic ones.
Acknowledgements
This research was supported by Kuwait University Research Administration project number EE 06/00.
References
[1] R. Kneser and H. Ney, Improved clustering techniques for class-based statistical language modeling, in Proc. European Conf. on Speech Technology, 1993, pp. 973–976.
[2] B. Merialdo, Tagging text with a probabilistic model, in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Toronto, 1991, pp. 809–812.
[3] E. G. Schukat-Talamazzini, H. Niemann, W. Eckert, T. Kuhn and S. Rieck, Acoustic modelling of subword units in the ISADORA speech recognizer, in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, San Francisco, 1992, pp. 577–580.
[4] P. Witschel and G. Niedermair, Experiments in dialogue context dependent language modeling, in G. Gorz, editor, KONVENS 92, Springer, Berlin, 1992, pp. 395–399.
[5] Abdoh, Dawood, Pupils' Weaknesses in Written Arabic Texts, Symposium of Arabic Language Problem at University Levels, Kuwait University, Kuwait, 1979, pp. 5–10.
[6] Ali, Nabil, Arabic Language and Computing, Arabization, Kuwait, 1988.
[7] Sadany, T. and Hashish, M., Semi-Automatic Vowelization of Arabic Verbs, 12th Computer Conference, Saudi Arabia, 1988.
[8] Ali, Nabil, Parsing and automatic diacritization of written Arabic: A breakthrough, Proceedings of 13th National Computer Conference, Riyadh, Vol. 28, Dec. 2, 1992.
[9] Saliba, Basel and Al-Danan, Abdullah, An approach to automatic vowelization of Arabic texts, Second Conference on Arabic Computational Linguistics, Kuwait, Nov. 26–29, 1989.
[10] Bahrain, Kuwait, Qatar, and United Arab Emirates Official Standard Names, United States Board on Geographic Names, Defense Mapping Agency Topographic Center, Washington, DC, March 1976.
[11] Report on the Current Status of United Nations Romanization Systems for Geographical Names, compiled by the UNGEGN Working Group on Romanization Systems, Version 2.2, January 2003.
[12] Second United Nations Conference on the Standardization of Geographical Names, London, 10–31 May 1972, Vol. II, Technical papers, p. 170.
[13] Geographic Names Transliteration in GDMS (Saudi Arabia), Eighth United Nations Conference on the Standardization of Geographical Names, Berlin, 27 August–5 September 2002, Document E/CONF.94/INF.77.
[14] Minutes of the meeting of the Arab Delegations at the Eighth United Nations Conference on the Standardization of Geographical Names, Berlin, 27 August–1 September 2002. [Signed by Dr. Abdul Hadi Tazi, Chief of the Arab Delegations. A copy was given to the Convener of the UNGEGN Working Group on Romanization Systems.]
[15] Présentation de la Variante B du Système de Translittération de l'arabe «Beyrouth amendé», UNGEGN, 17th Session, New York, 13–24 June 1994, WP No. 61.
[16] Activities in Jordan on the Standardization of Geographical Names, UNGEGN, 18th Session, Geneva, 12–23 August 1996, WP No. 86.
[17] Rapport de l'Algérie, Huitième Conférence des Nations Unies sur la normalisation des noms géographiques, Berlin, 27 August–5 September 2002, E/CONF.94/INF.37.
[18] Rapport sur la toponymie, la normalisation et la romanisation des noms géographiques au Liban, Huitième Conférence des Nations Unies sur la normalisation des noms géographiques, Berlin, 27 August–5 September 2002, E/CONF.94/INF.7.
[19] Report of the Working Group on a Single Romanization System for Each Non-Roman Writing System: Activities from 1 June 1972 to 16 August 1977, Third United Nations Conference on the Standardization of Geographical Names, Athens, 17 August–7 September 1977, Vol. II, Technical papers, pp. 402–403.
[20] Rapport national sur la toponymie (Maroc), Huitième Conférence des Nations Unies sur la normalisation des noms géographiques, Berlin, 27 August–5 September 2002, E/CONF.94/INF.76.