By Philip Blair
New research may improve entity disambiguation. It posits innovative ways for software to more easily and cost-effectively recognize and disambiguate entities that appear in disparate data streams. The goal is to connect those entities to known entries in an organization’s knowledge base. These connections are vitally important to anti-fraud efforts, government intelligence, law enforcement, and general business processes.
This post will examine the need for entity disambiguation, disambiguation challenges, and how research may help address those challenges.
Understanding the need for entity disambiguation
Imagine you own a private aviation company. You’re looking for market research information to determine where to open a secondary base of operations. You need to know if a mention of the name “Kennedy” in a data stream likely refers to President John F. Kennedy, actor/comic Jamie Kennedy, or John F. Kennedy International Airport — the “Kennedy” you actually care about. The process of determining the right “Kennedy” is called entity disambiguation.
Most entity disambiguation is handled by software. Sometimes, the process is fairly straightforward — particularly when comparing data sets that are consistent in vocabulary, syntax, and types of entities listed. Consider two different data sets: one covering every lesson taught in Mr. Smith’s fourth-grade history class, the other covering every lecture in Professor Martin’s “Political Crises of the 20th Century” course. In both cases, a mention of the name “Kennedy” is more likely to refer to the assassinated president than to the actor.
But entity disambiguation grows increasingly difficult when software must deal with data from a dissimilar data set. There’s a nuclear-powered aircraft carrier called the USS John F. Kennedy. Pentagon officials may feel quite concerned to see mention of “the Kennedy” in a Washington, D.C., municipal data stream — until they realize that “the Kennedy” in this context refers to the John F. Kennedy Center for the Performing Arts.
The modeling challenge
For a software system to automatically disambiguate entities, it must:
- Look at the context in which a mention of an entity appears
- Examine relevant fields in the knowledge base for each candidate entity
- Make an informed decision based on what it finds
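To make these steps concrete, here is a minimal sketch of that decision process. It uses a toy knowledge base and simple word overlap between a mention's context and each candidate's description; real systems rely on learned representations rather than word overlap, and the entity names and descriptions below are invented for illustration.

```python
# Toy knowledge base: candidate entities and short descriptions.
# In a real system these would be rich knowledge-base records.
KNOWLEDGE_BASE = {
    "John F. Kennedy (president)": "35th president of the United States",
    "Jamie Kennedy (actor)": "American actor and comedian film television",
    "John F. Kennedy International Airport": "airport serving New York City flights runways terminals",
}

def disambiguate(mention_context: str, kb: dict) -> str:
    """Pick the candidate whose description best overlaps the mention's context."""
    context_words = set(mention_context.lower().split())
    def score(name: str) -> int:
        description_words = set(kb[name].lower().split())
        return len(context_words & description_words)
    return max(kb, key=score)

# An aviation-flavored context should select the airport candidate.
print(disambiguate("flights delayed at Kennedy as terminals were crowded",
                   KNOWLEDGE_BASE))
```

Even this crude overlap score captures the core idea: the surrounding words, compared against knowledge-base fields, are what make one “Kennedy” more plausible than another.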
These capabilities require machine learning models: software components that connect these pieces of information to perform entity disambiguation. Engineers must train these models, “teaching” them through machine learning processes to interpret data and make those connections. Training models from scratch is expensive and time-consuming; a model typically needs tens of thousands of labeled entity mentions to learn that, in an aviation-centered data stream, names like Kennedy, Reagan, John Wayne, and John Lennon are more likely to refer to airports than to former presidents, film stars, or Beatles.
But suppose your aviation company wants to break into the jet rental field. For prospecting purposes, your system begins ingesting a sports-and-entertainment data stream. How can you more quickly and effectively teach your models that “Reagan” may now refer not to Ronald Reagan Washington National Airport but to American actress Reagan Gomez-Preston, and that “Lennon” may refer not to Liverpool John Lennon Airport but to British footballer Aaron Lennon?
New research hastens machine learning
To avoid the expensive, complicated process of training brand-new models, some organizations simply take an existing machine learning model and apply it to new data sets. This is called “zero-shot” learning, and its results are often suboptimal. However, fine-tuning an existing model with even a small amount of labeled data from the new stream often improves disambiguation considerably. This approach is called “few-shot” learning.
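The contrast between zero-shot and few-shot behavior can be illustrated with a deliberately simple nearest-prototype classifier over word counts. Real few-shot entity linking fine-tunes neural language models, but the workflow is analogous: a model built from plentiful general-domain examples improves markedly after seeing just a handful of new-domain ones. All example sentences here are invented.

```python
from collections import Counter

def prototype(texts):
    """Average word-count vector over a set of example mentions."""
    total = Counter()
    for t in texts:
        total.update(t.lower().split())
    return {w: c / len(texts) for w, c in total.items()}

def classify(text, prototypes):
    """Label a mention by its most similar class prototype (dot product)."""
    words = Counter(text.lower().split())
    def similarity(label):
        proto = prototypes[label]
        return sum(count * proto.get(w, 0.0) for w, count in words.items())
    return max(prototypes, key=similarity)

# "Pre-trained" prototypes from an aviation-centered stream, where a
# name like Lennon almost always means an airport.
prototypes = {
    "airport": prototype(["flights into Lennon were delayed",
                          "runway at Lennon closed for repairs"]),
    "person": prototype(["senator gave speech in parliament"]),
}

text = "Lennon scored twice against rivals"
zero_shot = classify(text, prototypes)  # no adaptation

# Few-shot step: fold in a handful of sports-domain mentions.
prototypes["person"] = prototype([
    "senator gave speech in parliament",
    "Lennon scored twice for the club",
    "Lennon signed for the club",
])
few_shot = classify(text, prototypes)
print(zero_shot, "->", few_shot)  # the few-shot model now prefers "person"
```

Without adaptation the aviation-trained prototypes pull “Lennon” toward the airport reading; two sports-domain examples are enough to flip the decision, which is the essence of few-shot learning.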
In our research, Babel Street Chief Scientist Kfir Bar and I sought to improve upon existing few-shot learning approaches. Specifically, we used language models that were pre-trained on significant amounts of text, and then applied linguistic classifiers to predict the meaning of words in context. Using this approach, Kfir and I adapted models trained on general-information articles to the medical field.
The results were encouraging. We found that while our method performed roughly on par with baseline approaches when matching entities in closely related data sets, it demonstrably outperformed them when linking entities across dissimilar data sets.