By Rebecca Hirschfield
What is lemmatization?
Heck, what’s a lemma? [Hint: not a cute, Arctic rodent]
Lemmatization is the act of reducing words to their most essential forms by stripping off their prefixes, suffixes, compounds, and indications of gender, number, tense, or case.
The “lemma” is the resulting word.
Consider the following sentences:
- The children kick the ball.
- The children are kicking the ball.
- The children kicked the ball.
What’s the lemma of the verbs in the sentences above?
For you, the answer is easy: “kick.” Computer programs find it harder to determine each word’s lemma.
Why does it matter?
Lemmas and natural language processing
Natural language processing (NLP) is a branch of computer science that deploys artificial intelligence to help computers understand spoken words and written text. Search functions, chatbots, voice-activated navigation systems, and machine translation all depend on NLP. NLP enables you to query Google using everyday language (“What is the cube root of 125?” “Who was the 30th president of the United States?”). It empowers you to ask a retail chatbot, “How long do I have to exchange Christmas gifts?”
Lemmatization supports these functions by linking words that are related to each other in meaning.
Take search applications as an example. Not everyone phrases a search in the same way. Suppose you want to know if the President is speaking at the United Nations today. You may ask, “Is the President speaking at the United Nations today?” or “Did the President speak at the United Nations today?” or “Will the President be speaking at the United Nations today?” or “Has the President spoken at the United Nations today?”
In order to return an appropriate answer, the search function needs to understand all the different forms of the verb “to speak.” For that to occur, words must first be morphologically analyzed, that is, broken down to their smallest possible meaningful forms. To learn what “speaking,” “spoke,” and “spoken” mean, a search application needs information on the verb “to speak.” It also needs information on the “inflections” that are often added to words: prefixes, suffixes, and more.
There are two common processes for finding the most essential form of words: stemming and lemmatization. Let’s take a closer look at each.
What is stemming?
Stemming is a rules-based, “brute force” process to find essential word forms by removing prefixes and suffixes. For example, a stemming rule may remove prefixes such as “un,” and suffixes such as “ing” and “ed.” “Unloved,” “loveable” and “loved” therefore all reduce to “love.” “Friendly” and “friendship” become “friend.”
And “ate” reduces to?
The answer is “at,” and this is the problem with stemming. “At” is not a word that relates to “eat,” just as “wand” does not relate to “wander.” In many cases, the stem isn’t even a valid word.
While stemming is a quicker and computationally cheaper process than lemmatization, it doesn’t handle linguistic inconsistencies and ambiguities well. This is true even in English, and other European languages raise additional complications.
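The behavior described above can be reproduced with a few lines of Python. This is a minimal sketch of a rules-based stemmer, not any real library’s algorithm: the rule lists `PREFIXES` and `SUFFIXES` and the function `naive_stem` are hypothetical names chosen for illustration, and production stemmers use far more careful rules.

```python
# Hypothetical, deliberately naive rules for illustration only.
PREFIXES = ("un",)
SUFFIXES = ("ship", "able", "ing", "ly", "ed", "er", "e")


def naive_stem(word: str) -> str:
    """Strip at most one known prefix and one known suffix, ignoring meaning."""
    word = word.lower()
    for prefix in PREFIXES:
        # Keep at least three letters after removing the prefix.
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        # Keep at least two letters after removing the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            word = word[:-len(suffix)]
            break
    return word


for w in ("friendly", "friendship", "unfriendly", "ate", "wander"):
    print(w, "->", naive_stem(w))
```

With these rules, “friendly,” “friendship,” and “unfriendly” all reduce to “friend,” while “ate” becomes “at” and “wander” becomes “wand.” The rules have no way of knowing that the last two results are unrelated words.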
Take French as an example. Because stemming knows nothing about the meaning of words, it cannot tell when the French word “bois” refers to the noun “woods,” or to the first person singular of the verb “to drink.” Both these words are spelled “bois,” but they have different lemmas (“bois” for “woods,” “boire” for “to drink”). In 14% of French word families, identically spelled words with different meanings create the same stems. Conversely, in 48% of French word families, differently spelled words create the same stem. For example, “cuire” (“cook”) and “cuir” (“leather”) both stem to “cuir.” These vagaries result in decreased precision in search functions and other NLP applications.
What is lemmatization?
Rather than linking words based on a superficial resemblance obtained by removing prefixes or suffixes, lemmatization is a process that links words based on meaning. In order to connect inflected words to their appropriate lemmas, lemmatization requires specialized dictionaries containing information about inflected words and their parts of speech.
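As a rough illustration of dictionary-based lookup, the sketch below maps an inflected word plus its part of speech to a lemma. The table `LEMMA_DICTIONARY`, the tag names `"VERB"` and `"NOUN"`, and the function `lemmatize` are assumptions made for this example; a real lemmatizer relies on a full dictionary for each language and on a tagger to supply the part of speech.

```python
# Tiny hand-written dictionary for illustration; real dictionaries hold
# many thousands of entries per language.
LEMMA_DICTIONARY = {
    ("kick", "VERB"): "kick",
    ("kicking", "VERB"): "kick",
    ("kicked", "VERB"): "kick",
    ("speaking", "VERB"): "speak",
    ("spoke", "VERB"): "speak",
    ("spoken", "VERB"): "speak",
    ("spoke", "NOUN"): "spoke",   # the part of a bicycle wheel
    ("ate", "VERB"): "eat",
    ("bois", "NOUN"): "bois",     # French: "woods"
    ("bois", "VERB"): "boire",    # French: "to drink"
}


def lemmatize(word: str, pos: str) -> str:
    """Look up (word, part of speech); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_DICTIONARY.get(key, word.lower())


print(lemmatize("ate", "VERB"))    # eat
print(lemmatize("spoke", "VERB"))  # speak
print(lemmatize("spoke", "NOUN"))  # spoke
print(lemmatize("bois", "VERB"))   # boire
```

Because the lookup key includes the part of speech, identically spelled words such as the two senses of “bois” can resolve to different lemmas, which a suffix-stripping rule cannot do.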
Why choose lemmatization over stemming?
Because it connects inflected words to their most essential forms, lemmatization outperforms stemming in search functions and other tasks.
Suppose you’re searching a database for mention of the word “ponies.” A search over text that has been stemmed will return results that mention “ponies,” but if stemming has reduced “ponies” to “poni,” the search may miss texts that refer to the animal only in the singular, “pony.”
Conversely, stemming can lead to a slew of irrelevant results. Imagine searching for the word “celebrities.” If text in the database has been subjected to stemming, the search might match any word that reduces to “celebr,” returning not only mentions of “celebrities” but “celebrations” and “celebrated” as well.
And, as noted in the discussion of the French language, stemming processes cannot handle homonyms. Lemmatization processes can — differentiating between someone searching for a replacement “spoke” for a bicycle, and someone searching for use of the past tense of the word “speak.”
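The “celebrities” over-matching described above can be sketched with a toy search. Everything here is hypothetical: the `STEMS` and `LEMMAS` tables are hand-written from the example in the text rather than produced by a real stemmer or lemmatizer, and `search` is a simple illustration, not a real search engine.

```python
# Hand-written normalization tables based on the example in the text.
STEMS = {
    "celebrities": "celebr",
    "celebrations": "celebr",
    "celebrated": "celebr",
}
LEMMAS = {
    "celebrities": "celebrity",
    "celebrations": "celebration",
    "celebrated": "celebrate",
}

DOCUMENTS = [
    "Celebrities attended the premiere",
    "The celebrations lasted all night",
    "She celebrated her birthday",
]


def search(query: str, documents: list[str], normalize: dict[str, str]) -> list[str]:
    """Return documents sharing a normalized term with the query."""
    key = normalize.get(query.lower(), query.lower())
    hits = []
    for doc in documents:
        terms = {normalize.get(tok, tok) for tok in doc.lower().split()}
        if key in terms:
            hits.append(doc)
    return hits


stem_hits = search("celebrities", DOCUMENTS, STEMS)
lemma_hits = search("celebrities", DOCUMENTS, LEMMAS)
print(len(stem_hits), "documents matched by stem")
print(len(lemma_hits), "document matched by lemma")
```

In this sketch the stem-based search returns all three documents, because every term collapses to “celebr,” while the lemma-based search returns only the document that actually mentions celebrities.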
Additional uses for lemmatization include:
- Improving chatbots and virtual assistants: These applications require a meaningful understanding of language to appropriately respond to user queries.
- Text mining: Lemmatization helps computers more accurately and efficiently extract themes from pieces of text.
- Sentiment analysis: Lemmatization helps computers better understand themes in customer feedback, especially feedback expressed on social media — including direct messages, reviews, and comments.
While more effective and precise than stemming, lemmatization processes typically take longer to implement. Text to be analyzed must first undergo a tagging process to determine parts of speech. Investment in dictionary data for each language to be studied is needed to help the computer link inflected words back to their lemmas. Still, for accurately linking semantically related words in search functions, text mining, or other applications, organizations often find lemmatization to be the better choice.
Real-world applications of lemmatization
- Learn how Virginia Tech applied lemmatization and other processes to unstructured social media data to forecast civil unrest events in Latin America, an average of one week before the event happens.
- Discover how the analytics company Luminoso digested large volumes of text-based customer feedback (including online reviews, surveys, and customer service interactions) to identify the key concepts, ideas, and sentiments that drive consumer choices. Lemmatization is part of the process.
- Read about how CareerBuilder used lemmatization to more precisely connect talent to jobs.
“Enabling High-Quality Search in European Languages.” 2013. https://www.basistech.com/whitepapers/Enabling-High-Quality-Search-in-European-Languages-EN.pdf