When it comes to name searching and matching, each language poses its own unique challenges. Arabic is one of the most complex languages for name-matching applications to analyze. This article will illuminate the complexity of Arabic names and the reasons why there is such an astonishing variety of spellings in English.
Here are some key concepts to understand:
- Transliteration: the systematic process of rewriting a word in another script, often using a character-to-character mapping or phonetic mapping
- Transcription: transliteration based on phonetics (pronunciation of words)
- Romanization: the process of rewriting text from a non-Latin script in Latin letters
- Translation: converting words into another language based on the meaning of the words
The issues include when to translate, when to transliterate, and why there is no single standard way to transliterate names.
Let’s illustrate these aspects with a story.
Imagine that a traveler with a Turkish passport, Abdülmecit Şerafettin, is applying for visas to visit Algeria (where French and Arabic are spoken) and Egypt (where the Egyptian dialect of Arabic is commonly spoken). In this Turkish name, both the given name and the surname are Arabic in origin.
The visa officers at both embassies begin by filling out our traveler’s name as “Abdulmecit Serafettin,” without any orthographic marks.
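Dropping the orthographic marks, as the visa officers do here, can be sketched in Python using Unicode decomposition. This is only a minimal illustration, and the function name `strip_marks` is made up for this example.

```python
import unicodedata


def strip_marks(name: str) -> str:
    """Remove combining marks (diaeresis, cedilla, ...) from a Latin-script name."""
    # NFD splits a letter such as "ü" into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", name)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_marks("Abdülmecit Şerafettin"))  # Abdulmecit Serafettin
```

Letters that have no Unicode decomposition, such as the Turkish dotless "ı", pass through unchanged, so a fuller version would need explicit rules for them.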
When Abdülmecit arrives at the Algerian Embassy, the visa officer is influenced by French spelling and her knowledge of Arabic. When Abdülmecit pronounces his name, she may phonetically transliterate it as one of these variations:
What about our traveler’s experience at the Egyptian Embassy? The Egyptian Embassy visa officer might try to use his knowledge of Arabic to transcribe the name, but in the Egyptian dialect of Arabic, pronunciations differ from Arabic spoken in Algeria. Here are some ways that the Egyptian visa officer might write “Abdulmecit Serafettin.” Note: Even within the Arab world, transliteration of Arabic names can vary widely!
Difficulties of Consistent Transliteration
While governments do publish transliteration standards (in fact, different U.S. government agencies have produced many different standards for Arabic to English), the average Arabic or English speaker is unlikely to know or use them, whether that speaker is a visa officer, a journalist, or a “person on the street.”
Dialects can also influence romanization, which is why “Qaddafi” is also written “Gadafi”: the character ق is pronounced as a “g” sound in the Libyan dialect and as a “k” in the Egyptian or Levantine dialects.
‘Too Many’ Sounds
The difficulty starts with English lacking the full range of Arabic sounds, so two Arabic characters may map to a single English character, or to more than one English character. The same issue arises when a French speaker does the romanization. Furthermore, the spelling conventions of the target language influence the spelling of the transliteration: English uses “sh” for the “sh” sound (as in “hush”), while French uses “ch” for the same sound.
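One rough way to cope with these dialect and spelling-convention differences is to reduce each romanized name to a comparison key before matching. The sketch below is an assumption-laden illustration: the rule tables and the name `match_key` are hypothetical, cover only the cases mentioned in this article, and are not how any particular product works.

```python
import re

# Hypothetical equivalence rules, for illustration only.
DIGRAPH_RULES = [("ch", "sh")]            # French "ch" ~ English "sh"
LETTER_RULES = str.maketrans("qg", "kk")  # dialect variants of the letter qaf


def match_key(name: str) -> str:
    """Reduce a romanized name to a rough key so variant spellings compare equal."""
    key = re.sub(r"[^a-z]", "", name.lower())  # drop spaces, hyphens, apostrophes
    for variant, canonical in DIGRAPH_RULES:
        key = key.replace(variant, canonical)
    key = key.translate(LETTER_RULES)
    # Collapse doubled letters: "dd" and "d" are often interchangeable.
    return re.sub(r"(.)\1+", r"\1", key)


print(match_key("Qaddafi") == match_key("Gadafi"))       # True
print(match_key("Ech-Chaab") == match_key("Esh-Shaab"))  # True
```

Keys like this trade precision for recall: merging every "g" with "k" will also match names that are genuinely different, so in practice such a key would only be a first filter.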
Arabic also has pronunciation rules under which certain consonants and vowels are silent or voiced depending on their context. A romanization based on pronunciation may therefore “miss” these characters, causing spelling differences.
Here are a few examples:
The definite article “Al” in Arabic is prefixed to a word. The “L” sound is silenced when followed by a specific set of letters (in Arabic they are called “the sun letters”). Thus the “L” is voiced in “Elhady,” but silent in “As Sukkar.”
The Tashdeed (also called Shadda) is a diacritic mark that indicates a doubled (lengthened) consonant. When it comes right after the definite article, the “L” sound of Al becomes silent, as in “Ech-Chaab.”
The letter Taa Marbuta always appears as the final letter of a word and is normally silent. However, when it takes part in a construct state (the Arabic possessive structure), the Taa Marbuta is pronounced like Taa (“t” in English), as in “Fatimatuzzahra.”
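The sun-letter rule for the definite article is mechanical enough to sketch in code. The example below works on undiacritized Arabic script; the function name is hypothetical, and the two Arabic spellings are my assumed originals of “As Sukkar” and “Elhady.”

```python
# The 14 "sun letters" of Arabic: the L of the definite article assimilates
# to them. All other letters are "moon letters" and leave the L voiced.
SUN_LETTERS = set("تثدذرزسشصضطظلن")
DEFINITE_ARTICLE = "ال"  # alef + lam


def article_l_is_silent(word: str) -> bool:
    """Return True if `word` starts with the definite article and its L is assimilated."""
    if not word.startswith(DEFINITE_ARTICLE) or len(word) < 3:
        return False
    # The letter right after the article decides whether the L is pronounced.
    return word[2] in SUN_LETTERS


print(article_l_is_silent("السكر"))   # True  (assumed spelling of "As Sukkar")
print(article_l_is_silent("الهادي"))  # False (assumed spelling of "Elhady")
```

This check is purely orthographic. A word whose root happens to begin with alef and lam would be misclassified, so a real system would need a lexicon or morphological analysis on top of it.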
Challenges for Name Matching and Translation Technology
Accidental Name Translations
Machine translation systems are often tripped up by names that should be transliterated rather than translated. Consider a news story about Hayrunnisa Gul, the first lady of Turkey. In Turkish, Hayrunnisa literally means “best woman,” just as “Johnson” once meant “son of John.” News outlets in different languages feature her name in their headlines. The English machine translations of those headlines translate not only the text but also her name, with disastrously incorrect results:
Let’s suppose, though, that the names have been isolated and identified as such, so the name translation system knows it just has to do a transliteration. Arabic is still tough because it is commonly written with all the short vowels omitted. Arabic-speaking people can fill in the vowels based on context, but there are always a few words that are tricky for a machine.
Consider Ghazi Mashal Ajil al-Yawar (Arabic: غازي مشعل عجيل الياور), an Iraqi political figure. Transliterated character by character, his romanized name would be Gäzē Mshal Ajil al-Yawr, while his romanized name with the vowels restored would be Ghazi Mashaal Ajil al-Yawar. A good name translation system needs algorithms and/or data to infer the missing vowels.
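To see why the short vowels go missing, here is a minimal sketch of character-by-character transliteration. The letter map is a deliberately tiny, hypothetical table covering only the letters in this example. It is not any published transliteration standard, so its output differs in detail from the spellings quoted above.

```python
# Hypothetical letter map for illustration; real standards are much larger
# and treat letters such as alef, waw and yaa in context.
LETTER_MAP = {
    "م": "m", "ش": "sh", "ع": "'", "ل": "l", "ج": "j", "ي": "y",
    "ا": "a", "و": "w", "ر": "r", "غ": "gh", "ز": "z",
}


def transliterate(word: str) -> str:
    """Map each Arabic letter to Latin; unknown characters become '?'."""
    return "".join(LETTER_MAP.get(ch, "?") for ch in word)


print(transliterate("مشعل"))    # msh'l
print(transliterate("الياور"))  # alyawr
```

The written Arabic carries no short vowels, so neither does the output. Turning “msh'l” into “Mashaal” requires knowledge that is not in the characters themselves.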
Here are some examples of where vowels may be ambiguous:
The letter Alef has different forms, expressed with diacritic marks, when appearing as the initial letter of a word. The different forms may affect the way it sounds. Since most diacritic marks are omitted in standard writing, the sound of the first Alef may be ambiguous.
Superscript Alef is a diacritic mark that indicates a long Alef vowel. Since it is usually omitted like other diacritic marks, there is an ambiguity as to how the vowel should sound.
There are different ways to indicate a long Alef vowel at the end of a word, leading to various ways of romanizing it.
Lastly, what if the name written in Arabic is actually a name whose origin is English? We’d really like to see “George Bush,” not “Jurj Bush,” as the translation of جورج بوش.
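A common way to handle such names is to consult a dictionary of known non-Arabic names before falling back to transliteration. The sketch below assumes a two-entry lookup table and a caller-supplied fallback function; both are hypothetical stand-ins.

```python
# Hypothetical lookup table of non-Arabic names as written in Arabic script.
KNOWN_FOREIGN_NAMES = {"جورج": "George", "بوش": "Bush"}


def translate_name(arabic_name: str, fallback) -> str:
    """Translate token by token: dictionary first, `fallback` transliterator otherwise."""
    tokens = arabic_name.split()
    return " ".join(KNOWN_FOREIGN_NAMES.get(t) or fallback(t) for t in tokens)


# A stand-in fallback that just flags tokens needing real transliteration.
print(translate_name("جورج بوش", fallback=lambda t: f"<{t}>"))  # George Bush
```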
What do you need in a name matching or name translation system for Arabic?
For name matching, it’s the ability to:
- Handle some variation among Arabic dialect pronunciations
- Handle the ways that Arabic is romanized by English speakers
- Handle the way that Arabic is romanized by non-English speakers of Latin-based languages, such as French
- Understand when Arabic characters may not be represented in English spellings due to grammatical exceptions
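Taken together, these requirements amount to comparing names tolerantly rather than exactly. As a very rough sketch of the idea, the example below scores spelling similarity with Python's standard `difflib` module. The threshold of 0.5 and the variant “Abdelmajid” are illustrative choices of mine, and a generic string-similarity measure like this knows nothing about Arabic.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity between two romanized names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query: str, candidates: list, threshold: float = 0.5):
    """Return the most similar candidate, or None if nothing reaches `threshold`."""
    best = max(candidates, key=lambda c: similarity(query, c), default=None)
    if best is None or similarity(query, best) < threshold:
        return None
    return best


print(best_match("Abdelmajid", ["Serafettin", "Abdulmecit"]))  # Abdulmecit
```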
For name translation, it’s the ability to:
- Guess which and where short vowels should appear in a word
- Recognize what is and is not a name when machine translating text
- Recognize when a name is not Arabic in origin and look it up in a dictionary to find the correct spelling (Jurj versus George)
These are all capabilities offered by Rosette’s name matching and name translation technology. For the end-user translator seeking to consistently machine translate names according to government transliteration standards from the comfort of Microsoft Word or Excel, there is Highlight, a plugin to Word or Excel that has Rosette running inside. Highlight will automatically translate columns of names in Excel (and highlight the ambiguous cases that require some human intervention), or recognize the names within a Microsoft Word document and offer name translations for human translators to review.
Whether it’s using the results for searching in a directory or against a watch list, or name translation for a record system, Rosette allows you to spend more time doing what you do best with the confidence that nothing will be overlooked. حظا سعيدا (good luck)!
This blog post is based on material from the June 2010 presentation by Zina Saadi at the Babel Street Government Users Conference.