Political and linguistic challenges to accessing Persian data
Persian sentiment analysis debuted in Rosette 1.10.1. The feature joins Rosette’s existing Persian text analytics for base linguistics, entity extraction, and name matching and translation. With this release, Rosette offers the most comprehensive coverage of Persian text analytics on the market.
Why do so few text analytics providers support a language spoken by 110 million people? Despite the number of speakers, written Persian data is actually quite difficult to come by.
A political data drought
There are several ingredients to creating a machine-learned model for a language, whether that is for entity extraction, part-of-speech tagging, or sentiment analysis, but the essential ingredient is good, clean data for training and testing.
Thorough, well-annotated data is scarcer for Persian than for most other languages. Datasets of Persian text do exist, but the vast majority of them are owned by Iranian companies and organizations with which Western businesses are banned from doing business.
Instead, text analytics providers have to roll up their sleeves to get their hands on Persian text: first scraping it from public news sites and social media, then going through the arduous task of cleaning, deduplicating, and annotating that data themselves before they can begin training and developing models. This process takes enormous amounts of time and effort.
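The cleaning and deduplicating step described above can be sketched in a few lines of Python. This is a minimal illustration, not Rosette's actual pipeline: the function names, the URL-stripping rule, and the choice to fold Arabic yeh and kaf into their Persian counterparts are all assumptions made for the example. It only catches exact duplicates after normalization; near-duplicate detection would need more work.

```python
import hashlib
import re
import unicodedata

# Assumption for this sketch: scraped Persian text often mixes Arabic and
# Persian code points for the same letter, so we fold them together.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize(text):
    """Return a cleaned copy of one scraped document."""
    text = unicodedata.normalize("NFC", text)   # canonical Unicode form
    text = text.translate(CHAR_MAP)             # unify letter variants
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text


def deduplicate(docs):
    """Keep the first copy of each document, compared after normalization."""
    seen = set()
    unique = []
    for doc in docs:
        cleaned = normalize(doc)
        if not cleaned:
            continue  # skip documents that are empty after cleaning
        key = hashlib.sha1(cleaned.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

Two posts that differ only in spacing or in which yeh code point they use collapse to a single entry, which is the kind of duplication that scraping the same story from several sites tends to produce.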
Data quality concerns
Pulling training data from social media presents its own challenges. Social data is more subjective than news articles or encyclopedias. Data managers need to spend vast amounts of time cleaning the data or risk producing a highly biased and inaccurate model. Consider Norman, an AI bot that MIT researchers trained on a gruesome Reddit feed to demonstrate biased machine learning.
Spend more than a few minutes scrolling through a Twitter feed in any language, and you’ll quickly come to the conclusion that it’s a fairly negative site. Overwhelmingly, people take to Twitter to vent: anger at a poor customer service experience, frustration from sitting in traffic, outrage at political events, and more.
Correctly annotating training data for sentiment is a more complicated task than annotating data to train entity extraction or part-of-speech models. Sentiment analysis is in itself subjective in any language, not just Persian. Different annotators may disagree on the sentiment of a piece of text. Those conflicts must be resolved before training can begin.
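One common way to quantify and resolve annotator disagreement is to measure agreement with Cohen's kappa and then settle individual items by majority vote, sending ties to a human adjudicator. The sketch below illustrates that approach; it is an assumption about how such a workflow might look, not a description of the process used for Rosette, and the function names and the "pos"/"neg" labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def resolve(votes):
    """Return the majority label, or None when the top labels are tied."""
    ranked = Counter(votes).most_common()
    if not ranked:
        raise ValueError("no votes to resolve")
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: needs a human adjudicator
    return ranked[0][0]
```

A low kappa on a pilot batch suggests the annotation guidelines need tightening before more data is labeled, while `resolve` returning `None` flags the specific items that need a further opinion.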
A machine learning algorithm is only as valuable as its training data. Training a model on social data must be closely supervised by a native speaker of the target language in order to produce an accurate model.
For Persian specifically, aggregating training data is particularly difficult because it is such a widely spoken language with many dialects and even multiple alphabets. Dari, Farsi, and Tajik are all regional variations of the Persian language.
Persian has a large vocabulary to begin with, compounded by the addition of unique words in each dialect. There are also phonetic differences that result in different spellings and transliterations of the same word across dialects.
All languages have slang terms, idioms, colloquial expressions, and informal patterns of speech. This is a consistent hurdle that NLP providers are used to facing. For Persian, though, the challenge is multiplied by the geographic spread of its speakers and the number of dialects. It is not insurmountable, but it does require significant time and effort to overcome.