
Duplicate Document Detection and Cross-lingual Search

How to automate mundane tasks and find relevant text using text embedding

Numbers are great because they are easy to compare, tabulate, and examine. Text? Not so much. But text embeddings let you manipulate and compare the meaning behind words and text as easily as numbers.

Basically, text embeddings convert words, phrases, or even whole documents into a mathematical vector that represents their meaning. Vectors that are numerically closer are closer in meaning. (For the long explanation of how text embeddings work, read our blog posts “Using Deep Learning to Power Multilingual Text Embeddings for Global Analysis” Part I and Part II.) A given word compared to itself will score 1.0 in similarity, but outside of that case, 0.8 is about as high a match as you will ever see.
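
The comparison itself is plain vector math. Here is a minimal sketch of the idea in Python with NumPy, using cosine similarity (a common way to compare embedding vectors) and toy three-dimensional vectors standing in for real embeddings, which have hundreds of dimensions; the numbers are made up purely for illustration.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: ~1.0 for vectors pointing the same way, near 0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings for illustration only; a real model assigns these automatically.
spy    = np.array([0.9, 0.1, 0.3])
espia  = np.array([0.8, 0.2, 0.35])   # Spanish "espía"
banana = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(spy, spy))     # ~1.0: a word compared with itself
print(cosine_similarity(spy, espia))   # high: close in meaning
print(cosine_similarity(spy, banana))  # low: unrelated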

Both cross-lingual query expansion (i.e., taking your English search term and generating the equivalent in a number of other languages) and duplicate document detection can be built using text embeddings. The main difference is that cross-lingual search looks for an equivalent phrase in a different language, while duplicate document detection is usually done within a single language.

Let’s see how this works.

CROSS-LINGUAL QUERY EXPANSION

Before we had access to text embeddings, monolingual English speakers would take a search term, drop it into Google Translate, and then copy the result into the search box. It’s laborious, and you may not even end up with the right term. That has all changed with the semantic similarity of terms available in Rosette version 1.12.1, which supports Arabic, English, Chinese, German, Japanese, Korean, Russian, and Spanish for this function.

Similar words or phrases can be discovered within a language or across languages. Given the word “spy”, Rosette returns these similar terms in Spanish, German, and Japanese.

Input: Spy
Spanish
{"term":"espía","similarity":0.61295485},
{"term":"cia","similarity":0.46201307},
{"term":"desertor","similarity":0.42849663},
{"term":"cómplice","similarity":0.36646274},
{"term":"subrepticiamente","similarity":0.36629659}
German
{"term":"Deckname","similarity":0.51391315},
{"term":"GRU","similarity":0.50809389},
{"term":"Spion","similarity":0.50051737},
{"term":"KGB","similarity":0.49981388},
{"term":"Informant","similarity":0.48774603}
Japanese
{"term":"スパイ","similarity":0.5544399},
{"term":"諜報","similarity":0.46903181},
{"term":"MI6","similarity":0.46344957},
{"term":"殺し屋","similarity":0.41098994},
{"term":"正体","similarity":0.40109193}

Rosette’s /semantics/similar endpoint returns similar terms from a term database compiled from Wikipedia and Gigaword.
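
If you want to try this yourself, the sketch below shows roughly what the call could look like from Python using the requests library. The base URL, API-key header, option name, and response field name are assumptions modeled on Rosette’s REST conventions, so check the API documentation for the exact values.

import requests

API_KEY = "your-rosette-api-key"                  # hypothetical placeholder
BASE_URL = "https://api.rosette.com/rest/v1"      # assumed Rosette Cloud base URL

# Ask for terms similar to "spy"; the options block for limiting result
# languages is an assumption -- consult the API reference for the real name.
response = requests.post(
    f"{BASE_URL}/semantics/similar",
    headers={"X-RosetteAPI-Key": API_KEY, "Content-Type": "application/json"},
    json={
        "content": "spy",
        "options": {"resultLanguages": ["spa", "deu", "jpn"]},
    },
)
response.raise_for_status()

# Assumed response shape: a mapping of language code to a list of
# {"term": ..., "similarity": ...} objects, as in the output above.
for lang, terms in response.json().get("similarTerms", {}).items():
    print(lang)
    for t in terms:
        print(f'  {t["term"]}  {t["similarity"]:.4f}')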

DUPLICATE DOCUMENT DETECTION

Text embeddings are also dead useful in areas such as e-discovery, where detecting nearly duplicate documents can save weeks of manual labor during discovery. Rosette accepts an entire document as input to its /semantics/vector endpoint and calculates its vector (a location in semantic space, represented as a vector of floating-point numbers). The resulting vectors for each document can then be compared.

For instance, a press release can be published on 100+ websites. Using semantic vectors, you can programmatically identify all of those copies as versions of the same article.
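
As a rough illustration, here is one way that comparison could look in Python, pairing the /semantics/vector endpoint with the same cosine-similarity math as above. The base URL, header name, response field name ("documentEmbedding"), file names, and the 0.9 near-duplicate cutoff are all assumptions for the sake of the sketch, not Rosette defaults.

import numpy as np
import requests

API_KEY = "your-rosette-api-key"                  # hypothetical placeholder
BASE_URL = "https://api.rosette.com/rest/v1"      # assumed Rosette Cloud base URL

def document_vector(text: str) -> np.ndarray:
    """Fetch a document-level embedding from the /semantics/vector endpoint."""
    resp = requests.post(
        f"{BASE_URL}/semantics/vector",
        headers={"X-RosetteAPI-Key": API_KEY, "Content-Type": "application/json"},
        json={"content": text},
    )
    resp.raise_for_status()
    # "documentEmbedding" is an assumed field name; check the API reference.
    return np.array(resp.json()["documentEmbedding"], dtype=float)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

original = open("press_release.txt", encoding="utf-8").read()
candidate = open("scraped_article.txt", encoding="utf-8").read()

score = cosine_similarity(document_vector(original), document_vector(candidate))
# 0.9 is an illustrative cutoff; tune it on your own data.
print("near-duplicate" if score > 0.9 else "distinct", round(score, 3))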

Curious to try this out? Sign up for a free Rosette Cloud trial account.
