By Rebecca Hirschfield
What is fuzzy matching? And why should you use it to connect names and resolve entities?
“Fuzzy matching” can be a hard concept to describe because we’re accustomed to looking at the world in a binary way.
Your living room lamp is on or off.
On high school tests, we label statements as “true” or “false.”
Traditionally, engineers built computing systems on this same all-or-nothing logic. It’s called Boolean logic (or “binary logic”), and it answers all questions with a “yes” or a “no.”
Has your computer ever overheated and turned itself off? Manufacturers often include this safety feature in their laptops, mobile phones, and other devices. The goal is to keep excessive heat from damaging the item’s central processing unit. Eighty-one degrees Celsius (177.8°F) is a common cutoff.
A binary process may be programmed this way:
- 80°C/not too hot/do nothing
- 81°C/too hot/turn off
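The binary rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the constant are assumptions, and the 81°C cutoff is simply the figure from the example.

```python
CUTOFF_C = 81  # cutoff taken from the example above


def should_shut_down(temp_c: float) -> bool:
    """Binary rule: the answer is only ever True or False."""
    return temp_c >= CUTOFF_C


print(should_shut_down(80))  # False: "not too hot", do nothing
print(should_shut_down(81))  # True: "too hot", turn off
```

Nothing in this rule can express "80°C is nearly too hot." One degree separates "do nothing" from "turn off."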
So, does that make it OK for your CPU to regularly run at 80°C?
No, it doesn’t. While short spurts of running at 80°C may be fine, long hours operating at that temperature can damage your device.
But binary logic requires a “true” or “false,” an “on” or “off,” a “not-too-hot” or a “too hot.” And that’s the problem. Boolean logic cannot accommodate the “maybes,” “probablys,” and “in these cases” endemic to name matching processes.
What is fuzzy matching? Why should you use it?
Fuzzy logic is a computing approach that improves upon Boolean processes by considering degrees of truth. What is fuzzy matching, then? Most commonly, it’s a technique that applies fuzzy logic to keyword search to account for typos and transpositions when returning results. You’ve likely seen this when you enter a misspelling into Google, and it asks “Did you mean…?” with the correct spelling.
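As a rough sketch of the "Did you mean…?" idea, Python's standard difflib module can score how similar two strings are on a scale from 0 to 1 and return the closest candidate. The word list and the 0.6 cutoff here are illustrative assumptions. Real search engines use far more sophisticated methods.

```python
from difflib import SequenceMatcher, get_close_matches
from typing import Optional

# Hypothetical dictionary of known keywords.
KNOWN_WORDS = ["receive", "recipe", "deceive"]


def similarity(a: str, b: str) -> float:
    """Degree of similarity between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()


def did_you_mean(word: str) -> Optional[str]:
    """Return the closest known word, or None if nothing is similar enough."""
    matches = get_close_matches(word, KNOWN_WORDS, n=1, cutoff=0.6)
    return matches[0] if matches else None


print(did_you_mean("recieve"))  # receive
```

The important difference from the Boolean rule is the return value of similarity: a degree rather than a yes or a no.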
Fuzzy matching for names, however, is a much more specialized problem that needs a specialized solution. It identifies pieces of text, appearing in separate records, that are similar but not identical. It then ranks the likelihood that these similar pieces of text match each other.
This capability is important because not every dataset is ordered similarly. A field for a person’s name may be ordered “first name, last name,” “first name, middle name, last name,” “first name, middle initial, last name,” or the reverse of any of these with last name shown first.
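One simple way to cope with field ordering is to put every name into the same token order before comparing. The sketch below assumes that a comma means "last name first" and that a middle initial is compatible with any middle name starting with that letter. Both function names are hypothetical, and the rules are deliberately simplistic.

```python
def canonical_tokens(name: str) -> list[str]:
    """Lowercase tokens in first-middle-last order."""
    if "," in name:
        # Assumption: "Last, First Middle" layout.
        last, _, rest = name.partition(",")
        name = f"{rest} {last}"
    return [token.strip(".").lower() for token in name.split()]


def compatible(a: str, b: str) -> bool:
    """True if first and last names agree and middle parts do not conflict."""
    ta, tb = canonical_tokens(a), canonical_tokens(b)
    if not ta or not tb or ta[0] != tb[0] or ta[-1] != tb[-1]:
        return False
    mid_a, mid_b = ta[1:-1], tb[1:-1]
    if not mid_a or not mid_b:
        return True  # a missing middle name is not a conflict
    return all(x.startswith(y) or y.startswith(x) for x, y in zip(mid_a, mid_b))


print(canonical_tokens("Jackson, James Thomas"))  # ['james', 'thomas', 'jackson']
print(compatible("James T. Jackson", "Jackson, James Thomas"))  # True
```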
Exact name matching is a binary process that cannot effectively manage these differences. Missing spaces and hyphens (Mary Ellen vs. Mary-Ellen), titles and honorifics, nicknames, semantically similar names (PennyLuck Pharmaceuticals vs. PennyLuck Drugs), use of diacritical marks (Raphael vs. Raphaël), typos, and other differences all foil exact name matching systems.
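Some of these differences can be reduced by normalizing names before they are compared. The following sketch strips diacritical marks, treats hyphens and periods as spaces, and drops a few titles. The title list is a hypothetical, incomplete example. Nicknames, missing spaces, and semantically similar names need other techniques.

```python
import re
import unicodedata

# Hypothetical, incomplete list of titles and honorifics.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}


def normalize(name: str) -> str:
    """Return a simplified form of a name for comparison purposes."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat hyphens and periods as spaces, lowercase, and drop titles.
    tokens = re.sub(r"[-.]", " ", stripped).lower().split()
    return " ".join(token for token in tokens if token not in TITLES)


print(normalize("Raphaël"))         # raphael
print(normalize("Dr. Mary-Ellen"))  # mary ellen
```

With this step, "Mary Ellen" and "Mary-Ellen" normalize to the same string, which an exact comparison of the raw values would miss.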
Consider the following scenario. Your name is James Thomas Jackson. This name appears on your birth certificate and your passport. But your checking account lists your name as James T. Jackson; your savings account as James Jackson; and your mortgage as Jackson, James Thomas. Your library card contains a typo: It lists your name as Jem Jackson.
Exact name matching systems may define you as “James Thomas Jackson,” and connect you only to your birth certificate and passport. They wouldn’t understand that the checking account, savings account, mortgage, and library card also link to you because different variations of your name were used for these accounts.
Fuzzy name matching systems are different. They examine available data to determine the level of probability that James Thomas Jackson and James T. Jackson are the same person. They accomplish this by appending additional identifiers — ages, addresses, and places of birth among them — to each name. They understand that the James Thomas Jackson, age 47, who lives at 123 Main Street is much more likely to match to the Jim Jackson, age 47, who lives at 123 Main Street than to the James Thomas Jackson, age 30, who lives at 456 Market Street.
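The idea of combining name similarity with other identifiers can be sketched as a weighted score. Everything specific here is an assumption made for illustration: the Record fields, the weights, and the use of difflib for name similarity. Production systems weigh many more signals and tune them against real data.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    name: str
    age: int
    address: str


def match_score(a: Record, b: Record) -> float:
    """Combine name, age, and address agreement into one score from 0 to 1."""
    name_sim = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
    age_sim = 1.0 if a.age == b.age else 0.0
    addr_sim = 1.0 if a.address.lower() == b.address.lower() else 0.0
    # Hypothetical weights; they sum to 1.0.
    return 0.5 * name_sim + 0.2 * age_sim + 0.3 * addr_sim


query = Record("James Thomas Jackson", 47, "123 Main Street")
same_household = Record("Jim Jackson", 47, "123 Main Street")
same_name_only = Record("James Thomas Jackson", 30, "456 Market Street")

print(match_score(query, same_household) > match_score(query, same_name_only))  # True
```

Under these weights, the record with a different spelling but the same age and address outranks the record with an identical name and nothing else in common.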
This type of clarity is vitally important to law enforcement agencies, financial institutions, national security organizations, and even sales and marketing departments. They all must link the right records to the right James Jackson. Otherwise, a James Jackson with great credit may be denied a car loan because another James Jackson has a terrible credit score. U.S. Customs and Border Protection may welcome a terrorist named James Jackson into the United States, while denying entry to the Jackson family that just wants to visit the top of the Empire State Building. Banks may allow money launderer James Jackson to operate freely, while freezing the accounts of James Jackson, law-abiding real estate agent.
Fuzzy name matching: the pros and cons
Despite their clear benefits, fuzzy matching systems are not infallible. They can make mistakes, incorrectly linking some entities and failing to coalesce others. They can be challenging to scale across large, disparate datasets. And they must be configured to match organizational needs.
Still, fuzzy name matching improves upon exact name matching systems in several ways. By taking into account real-world issues such as typos, misspellings, alternate spellings, and disordered data components, it is much more likely to accurately match names across two or more datasets. Its ability to provide probabilities for each match (a process called “match scoring”) enables users to determine which names can be matched automatically, and which require further investigation. If users find either too many false positives or too many missed matches, they can recalibrate fuzzy matching systems to find the right balance. All of these capabilities improve name matching.
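Match scoring and recalibration can be illustrated with two thresholds that sort each score into one of three outcomes. The threshold values and labels below are hypothetical. In practice they would be tuned against an organization's own data and tolerance for error.

```python
def triage(score: float, auto_match: float = 0.90, review: float = 0.70) -> str:
    """Sort a match score into an action using two adjustable thresholds."""
    if score >= auto_match:
        return "match automatically"
    if score >= review:
        return "send for review"
    return "no match"


print(triage(0.95))  # match automatically
print(triage(0.80))  # send for review
print(triage(0.40))  # no match

# Recalibrating: too many false positives? Raise the automatic threshold.
print(triage(0.95, auto_match=0.97))  # send for review
```

Raising auto_match reduces false positives at the cost of more manual review. Lowering review catches more potential matches at the cost of more noise.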