Why Tokenization Matters

Search is increasingly the first and most important way to access information. Given the ubiquity of Google, the depth and breadth of Amazon, and the growing volume of files and information stored on our personal devices, every user expects world-class search accuracy. Search is now essential for everything from big data to enterprise content, e-commerce to social media, and government intelligence to financial solutions, and the list goes on. Robust open source search options such as Elasticsearch now exist to support this explosive growth.

This need for fast, effective search applies to organizations all over the world, which must handle the unique challenges of their native language as well as the many other languages required to operate in the global economy. As you probably know, one key to better search is linguistic analysis, which enhances precision and boosts recall. At Babel Street, our linguistic plug-in for search is called Rosette Analyze Language. It includes many features that improve search speed and accuracy, such as tokenization, lemmatization, decompounding and part-of-speech tagging.

Tokenization, in particular, is a simple idea to understand but very difficult to implement. Often referred to as segmentation, it is the process of taking a block of unstructured text (no tags or embedded data) and identifying each word or grammatical element. The resulting “tokens” can then be indexed and matched against a user query.

For many languages, including English, this is a fairly easy problem. We use spaces to separate words for visual comprehension, so software can rely on the same spaces for simple segmentation. No big deal. But what happens when there are missing spaces, bad grammar, or missing punctuation? The answer calls for a deeper linguistic analysis of the words themselves, so that every word can be properly segmented even when a space is missing.
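To make the easy case concrete, here is a minimal sketch of space-based segmentation in Python. The function name `simple_tokenize` and the sample sentences are hypothetical, chosen only for illustration; this is not how any particular search engine or Rosette implements tokenization.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes.

    Spaces and punctuation act as separators, which is usually enough
    for well-formed English text.
    """
    return re.findall(r"[a-z0-9']+", text.lower())

# Well-formed input: every word is recovered.
print(simple_tokenize("Search is now essential for big data."))
# ['search', 'is', 'now', 'essential', 'for', 'big', 'data']

# A missing space silently fuses two words into one token.
print(simple_tokenize("Search is nowessential."))
# ['search', 'is', 'nowessential']
```

The second call shows the limitation described above: a query for "essential" would not match the fused token, because the splitter knows nothing about the words themselves.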

The real challenge comes with Asian languages such as Chinese, Japanese and Korean. These languages often do not put spaces between words or sentences, except for stylistic purposes, so a more advanced approach to segmentation is necessary. Some search engines employ a technique called “bigramming”. Essentially, this process moves through the text one character at a time and pairs each character with the next one, producing overlapping “bigrams”. To ensure that no word is missed, each character (apart from the first and last) is paired twice, once with its neighbor to the left and once with its neighbor to the right. Bigramming does produce tokens that can match queries, but because the bigrams overlap, it also indexes character pairs that are not words at all, along with real words that the text never intended. This not only greatly increases indexing time and index size, but also reduces precision.
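As a rough sketch of the bigramming idea, the Python below pairs each character with its right-hand neighbor. The function name `bigrams` is hypothetical, and the sample phrase 北京大学生物系 (Beijing University Biology Department) is used purely as illustration; production search engines typically add normalization and script detection that are omitted here.

```python
def bigrams(text):
    """Return every overlapping pair of adjacent characters in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

tokens = bigrams("北京大学生物系")
print(tokens)
# ['北京', '京大', '大学', '学生', '生物', '物系']
print(len(tokens))
# 6
```

A seven-character phrase yields six tokens, and every interior character appears in two of them, which is where the extra index size comes from.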

The more advanced approach, linguistically sensitive tokenization, uses statistical modeling and other algorithms to interpret each run of characters in the context of the characters around it. This method identifies the correct tokens without the side effects that bigramming introduces, resulting in a smaller index and better TF–IDF (term frequency–inverse document frequency) relevancy.
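To give a feel for word-aware segmentation, the sketch below uses forward maximum matching against a tiny hand-written lexicon. This is a deliberately simpler stand-in technique, not the statistical modeling described above and not Rosette's implementation; the function name `max_match` and the lexicon contents are assumptions made for illustration only.

```python
def max_match(text, lexicon):
    """Greedy forward maximum matching (a toy, dictionary-based segmenter).

    At each position, take the longest lexicon entry that matches;
    if nothing matches, emit a single character and move on.
    """
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical toy lexicon; a real system would rely on far richer data.
lexicon = {"北京", "大学", "北京大学", "生物", "生物系", "学生"}
print(max_match("北京大学生物系", lexicon))
# ['北京大学', '生物系']
```

Greedy matching can still pick the wrong boundary when a longer dictionary entry overlaps the intended word, which is one reason context-sensitive statistical approaches are preferred in practice.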

Let’s compare the two approaches for indexing the Chinese phrase 北京大学生物系 (Beijing University Biology Department). Bigramming produces six tokens, including two non-words and one additional incorrect word (学生, “student”). Morphological tokenization correctly segments the phrase into just two tokens, 北京大学 (Beijing University) and 生物系 (Biology Department), without any of the unwanted side effects. As the illustration shows, when a user searches for 学生 (“student”), the query correctly finds no match in the morphologically tokenized index, whereas it incorrectly matches the bigrammed index, where the word was introduced by mistake.

[Illustration: bigram vs. morphological tokenization of 北京大学生物系]
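The same comparison can be sketched in a few lines of Python. The helper `bigram_tokens` and the set-based "indexes" are hypothetical simplifications; the morphological tokens are copied from the example above rather than computed by a real analyzer.

```python
def bigram_tokens(text):
    """Return every overlapping pair of adjacent characters in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

phrase = "北京大学生物系"

# Two toy indexes for the same phrase, each modeled as a set of tokens.
bigram_index = set(bigram_tokens(phrase))
morph_index = {"北京大学", "生物系"}  # segmentation taken from the example above

query = "学生"  # "student"
print(query in bigram_index)  # True  (a false match)
print(query in morph_index)   # False (correctly no match)
print(len(bigram_index), len(morph_index))  # 6 2
```

Even in this tiny case the bigram index holds three times as many entries and returns a hit for a word the phrase never contained.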

Rosette Analyze Language, along with its advanced morphological tokenization, is available as a plug-in for many search engines.