Tackling the challenge of Arabic chat written in Latin script
The Arabic chat language, known as “Arabizi” or “Arabish”, is a casual form of written Arabic that emerged when Arabic speakers began using Western keyboards on mobile phones and computers to spell out their native language in the Roman alphabet. With the growth of digital communication via text messages and social networks, Arabizi has become one of the most widespread languages online. Formal Modern Standard Arabic (MSA), the variety most readily available for training text analytics systems, often does not match the language of modern social media and informal communication.
With as many as 420 million Arabic speakers in the world, coverage of Arabic, and by extension Arabizi, is necessary for any global text analytics system.
An evolving, multi-regional ‘language’
Arabizi poses a unique problem for text analysis because it is still evolving: writers do not follow any standard rules for spelling, grammar, or diction. Furthermore, writers from different regions not only use different spellings but also write in their local dialects and may code-switch (insert words from other languages such as English or French).
Social and conversational chat Arabic varies throughout the Arabic-speaking world, with some dialects being mutually unintelligible. Consider the example text below:
|English||One day Joha and his son were packing their things in preparation for travel to the nearby city, and they climbed onto the back of their donkey in order to start their trip.|
|MSA||في يوم من الأيام كان جحا وابنه يحزمون أمتعتهم إستعداداً للسفر إلى المدينة المجاورة، فركبا على ظهر الحمار لكي يبدأوا رحلتهم.|
|MSA transliteration||Fii yowm min al-ayaam kaana Joha wa ibnuhu yahzimuun amta’atahum isti’daadan lil-safar ila al-madiina al-mujaawira, fa rakibaa ‘ala dhahri al-himaar likay yabda’u rihlatahum.|
|Algerian transliteration||Qallek wa7ed ennhar kan Djou7a w wlido y7addro besh yro7o lwa7ed mdina, wkan 3andhom 7mar.|
|Egyptian transliteration||fi youm min el ayem, kan go7a we’bno bey7addaro 7aget-hom 3ashan yeroo7o el balad elli gambohom.|
[Example from The Economist]
Additionally, short-text formats such as tweets can be even more varied, having become almost a new language within every dialect.
Luckily, for organizations looking to work with online Arabic data and especially social data, most Arabizi script can be transformed to its underlying Arabic. Once the text has been transliterated, existing text analytics can be applied. Check out a few examples of transliterated text:
- ale waaah itne chaala meko khana hai
➔ ألي وآه اتني شعلة مقو خانة هي
- Nchallah tkoune mabsouta bi asda2 l Ghenniyyi, w ykoun hayda l naja7 li enti natrteh!
➔ نشالله تكون مبسوطة بأصداء الغينيي, ويكون هيدا النجاح لإنت نترته
- Leh lama badaye2 mesh bala2y el nas ely bab2a mawgouda ma3ahom we homa meday2in dol?
➔ ليه لم بديء مش بلائي الناس إلي ببئة موجودة معهم و هما مديان دول
Rosette chat transliteration
Rosette API can decode Arabizi because it takes a statistical approach to transliteration: it breaks the text message into phonemes and then ranks the possible conversions. The most probable mappings are used to convert the text into Arabic script. This statistical approach allows the software to adapt as Arabizi continues to evolve and common usage changes.
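The idea of splitting a word into units and ranking the candidate conversions can be sketched in a few lines of Python. This is a toy illustration only, not Rosette's implementation: the `TOY_TABLE` mappings and their probabilities are invented for the example, and simple Latin chunks stand in for phonemes. A real model would estimate such probabilities from a large corpus and take context into account.

```python
import math
from itertools import product

# Hypothetical toy table: Latin chunk -> [(Arabic candidate, probability)].
# The values are made up for illustration, not learned from data.
TOY_TABLE = {
    "7": [("ح", 0.9), ("ه", 0.1)],
    "3": [("ع", 1.0)],
    "sh": [("ش", 1.0)],
    "s": [("س", 1.0)],
    "h": [("ه", 1.0)],
    "m": [("م", 1.0)],
    "r": [("ر", 1.0)],
    "a": [("ا", 0.6), ("", 0.4)],  # short vowels are often left unwritten
}


def segment(word, table):
    """Greedy longest-match split of a Latin-script word into known chunks."""
    chunks, i = [], 0
    max_len = max(len(key) for key in table)
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if piece in table:
                chunks.append(piece)
                i += size
                break
        else:
            raise ValueError(f"no mapping for {word[i]!r}")
    return chunks


def rank_transliterations(word, table, top_n=3):
    """Score every combination of candidates and return the most probable."""
    chunks = segment(word.lower(), table)
    scored = []
    for combo in product(*(table[chunk] for chunk in chunks)):
        text = "".join(arabic for arabic, _ in combo)
        prob = math.prod(p for _, p in combo)
        scored.append((text, prob))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # "7mar" (donkey) appears in the Algerian example above.
    for text, prob in rank_transliterations("7mar", TOY_TABLE):
        print(text, round(prob, 2))
```

With this toy table, the top-ranked candidate for "7mar" is حمار, the same word for "donkey" used in the MSA sentence above. Enumerating every combination is only workable for short words; a production system would need a more efficient search.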
We built the statistical model for chat transliteration from more than 300 million Arabizi messages gathered from throughout the world. The database is updated regularly by an automated process that builds a new statistical model from the latest corpus. Each new release includes the latest version of the model, trained on the most recent collection of chat messages.
The results also carry metadata about the regional dialect used in the text message, which can help identify the writer's country of origin. Transliterating the chat alphabet into Arabic script extends the analyst's knowledge by suggesting possible sources of the message that may lie outside the analyst's core expertise. The encyclopedic nature of Rosette offers a deep set of options for analysts to grade, saving them time in identifying the source. This information is kept alongside the transliteration for analysts to study at all subsequent stages of processing.
Try it out
We added an Arabic chat transliteration endpoint to Rosette API in the 1.7 release. The /transliteration endpoint takes Arabizi and transliterates it into standard Arabic script. It can also transliterate standard Arabic script into Arabizi.
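As a rough illustration of calling the endpoint, the Python sketch below builds a POST request using only the standard library. Only the `/transliteration` path comes from this post. The base URL, the `X-RosetteAPI-Key` header, and the `content` field in the JSON body are assumptions about the REST interface, so check them against the current API documentation before relying on them.

```python
import json
import urllib.request

# Assumed base URL; only the /transliteration path is named in this post.
ROSETTE_URL = "https://api.rosette.com/rest/v1/transliteration"


def build_transliteration_request(text, api_key, url=ROSETTE_URL):
    """Build (but do not send) a POST request carrying the text to convert."""
    body = json.dumps({"content": text}).encode("utf-8")  # assumed field name
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "X-RosetteAPI-Key": api_key,  # assumed authentication header
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


def transliterate(text, api_key):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_transliteration_request(text, api_key)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `transliterate("Qallek wa7ed ennhar", "YOUR_API_KEY")` would send the request and requires a valid key. The shape of the response is not described in this post, so the sketch returns the parsed JSON unchanged.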