Hybrid entity extraction methods
Just as you would never use a screwdriver to insert a nail, each type of entity is most accurately extracted by a different approach. There are many ways to extract entities, but no one universal solution for all entities.
Different extraction methods are best suited to identify different entity types. For entities like credit card numbers that have a very specific pattern, a pattern-matching extraction approach is superior to training a statistical model. The best entity extraction software uses a hybrid of multiple methods to address the maximum number of entity variations:
- Statistical or deep neural network processors
- Exact match processors
- Pattern matching processors
Each of these processors is ideal for extracting one or several entity types, and each has comparative strengths and weaknesses.
Statistical or deep neural network processors
Statistical modeling is used to detect entities that cannot be exhaustively listed (such as people, locations, and organizations) or that lack uniqueness. Is Greenwood a person’s name, a city, or a company? Through machine learning (training on numerous examples of these entity types), the statistical model “learns” the context in which each entity type appears. Because it is aware of context, the statistical model can find entities that are misspelled or simply new to the system, and it can judge whether “Paris” refers to an organization, a person, or a city.
The best deep neural network or statistical extractors are trained on data that is carefully balanced for content and genre (news, blogs, product reviews, etc.). Entities in training data should be tagged by multiple human annotators, following careful guidelines, and the tags should be cross-checked between different annotators for consistency.
Although statistical models are perhaps the most advanced method of extracting entities, they are also the most complex and laborious to create and train. The time and labor of annotating data and training a model is not worth the effort if the entity type can be accurately identified by another method.
Pattern matching processors
Entity types that fit a predictable pattern are ideal for processors built on regular expressions. These entities include email addresses, URLs, monetary amounts, phone numbers, dates, times, personal IDs (e.g., social security numbers), distances, credit card numbers, latitude and longitude, and UTM coordinates.
Pattern matching processors are written to recognize certain structures within text that are indicative of an entity type. For example, a string of sixteen digits is very commonly a credit card number, while a string of ten digits (especially if it includes spaces, hyphens, parentheses, or periods) is typically a phone number. To reduce false positives, these regular expressions have to be very carefully constructed.
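As a rough illustration of the idea, here is a minimal Python sketch of a pattern-matching processor. The two regular expressions, the entity labels, and the function names are assumptions made for this example, and the phone pattern only covers ten-digit, North American style numbers. The Luhn checksum is not mentioned above; it is added here as one common way to cut false positives on sixteen-digit strings.

```python
import re

# Sixteen digits, optionally grouped in fours by spaces or hyphens (illustrative pattern).
CARD_RE = re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b")
# Ten-digit phone numbers such as (617) 555-0123 or 617.555.0123 (illustrative pattern).
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[ .-])\d{3}[ .-]\d{4}(?!\d)")


def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def extract_patterns(text: str):
    """Return (label, matched_text, start, end) tuples for cards, then phones."""
    entities = []
    for m in CARD_RE.finditer(text):
        if luhn_ok(m.group()):  # drop digit runs that cannot be valid card numbers
            entities.append(("CREDIT_CARD", m.group(), m.start(), m.end()))
    for m in PHONE_RE.finditer(text):
        entities.append(("PHONE", m.group(), m.start(), m.end()))
    return entities


if __name__ == "__main__":
    sample = "Call (617) 555-0123 or pay with 4111 1111 1111 1111, not 1234 5678 9012 3456."
    for entity in extract_patterns(sample):
        print(entity)
```

In the sample, the second sixteen-digit string matches the regular expression but fails the checksum, so it is not reported. Production patterns would need to handle many more formats than these two.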
Exact match processors
Entity types whose members are largely unambiguous and can be exhaustively listed in a “gazetteer” are best suited to an exact-match processor. These include entities like personal titles, nationalities, and religions. The weakness of exact-match processors is that they cannot catch spelling mistakes unless some kind of fuzzy matching is used.
At face value, exact-match extraction may seem simplistic compared to other matching methods. However, the strength of this approach is speed. Exact-match extraction is far faster than the other processors.
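A minimal Python sketch of a gazetteer lookup is shown below. The gazetteer entries, the labels, and the function names are invented for illustration. The sketch assumes case-insensitive matching on word tokens and prefers the longest matching entry, which is one reasonable design among several.

```python
import re

# Hypothetical gazetteer: entity type -> exhaustive list of names.
GAZETTEER = {
    "TITLE": ["Dr.", "Professor", "Prime Minister"],
    "NATIONALITY": ["French", "Canadian"],
    "RELIGION": ["Roman Catholic", "Christian", "Buddhist"],
}


def tokenize(text):
    """Split text into lowercased word tokens with their character offsets."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def build_index(gazetteer):
    """Map each entry (as a tuple of tokens) to its entity type for fast lookup."""
    index = {}
    for entity_type, names in gazetteer.items():
        for name in names:
            index[tuple(tok for tok, _, _ in tokenize(name))] = entity_type
    return index


def extract_exact(text, index):
    """Scan left to right, emitting the longest gazetteer entry at each position."""
    tokens = tokenize(text)
    max_len = max((len(key) for key in index), default=0)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in index:
                start, end = tokens[i][1], tokens[i + n - 1][2]
                found.append((index[key], text[start:end]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found


if __name__ == "__main__":
    index = build_index(GAZETTEER)
    print(extract_exact("Dr. Amari, a Canadian and a Buddhist, met the Prime Minister.", index))
```

Each candidate is a single dictionary lookup, which is where the speed comes from. A misspelling such as "Budhist" is simply not found, which is the weakness described above.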
A multiprocessor hybrid approach is best of breed
The best systems for entity extraction use a hybrid of the above approaches to maximize precision and recall for each entity type. Entities that would be missed by one extractor are identified by another, reducing the likelihood of missing an entity. At the same time, as the processors “compete” to find results, a final step called redaction is the last chance to “get it right.” Redactors judge which processor is correct when there are conflicting results from two or more processors. A good entity extractor will let the user tell the system which type of processor is most trustworthy in this case.
For example, the word “Christian” can be a name, a religion, or a high-end fashion label. An exact match processor and statistical processor differ in how they tag “Christian” and “Christian Dior” in this sentence:
Although he had a Roman Catholic upbringing, religion was not one of John Galliano’s main obsessions during his time at Christian Dior.
Results from statistical processor:
Results from exact matching processor:
A redaction processor reviews the entities returned by each technique and selects the best results based on context, degree of conflict, and model weighting:
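The conflict-resolution step can be sketched in Python as follows. This is a deliberately simplified illustration, not how any particular product implements redaction: it uses only a user-supplied trust weight per processor, whereas the redaction described here also weighs context and degree of conflict. The `Mention` class, the processor names, and the processor outputs for the sample sentence are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int
    end: int
    label: str
    source: str  # name of the processor that produced this mention


def overlaps(a: Mention, b: Mention) -> bool:
    """True if the two character spans share at least one character."""
    return a.start < b.end and b.start < a.end


def redact(mentions, trust):
    """Keep the most trusted mention wherever spans conflict.

    `trust` maps a processor name to a weight; higher wins.
    Ties are broken in favor of the longer span, then the earlier one.
    """
    ranked = sorted(
        mentions,
        key=lambda m: (-trust.get(m.source, 0), -(m.end - m.start), m.start),
    )
    kept = []
    for m in ranked:
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)


def mention(text, phrase, label, source):
    """Helper: build a Mention from the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Mention(start, start + len(phrase), label, source)


sentence = ("Although he had a Roman Catholic upbringing, religion was not one of "
            "John Galliano's main obsessions during his time at Christian Dior.")

# Hypothetical processor outputs, invented for illustration only.
mentions = [
    mention(sentence, "John Galliano", "PERSON", "statistical"),
    mention(sentence, "Christian Dior", "ORGANIZATION", "statistical"),
    mention(sentence, "Roman Catholic", "RELIGION", "exact"),
    mention(sentence, "Christian", "RELIGION", "exact"),
]

if __name__ == "__main__":
    for m in redact(mentions, trust={"statistical": 2, "exact": 1}):
        print(sentence[m.start:m.end], m.label, m.source)
```

With the statistical processor trusted more, the overlapping "Christian" (RELIGION) is discarded in favor of "Christian Dior" (ORGANIZATION), while the non-conflicting "Roman Catholic" from the exact-match list is kept. Swapping the weights reverses the outcome for the conflicting span, which shows why the choice of the most trustworthy processor is worth exposing to the user.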
Distinguishing between “identical” entities
Extraction alone only tells you what words within your text are entities, but it does not tell you who or what those entities are. Extraction also does not help you distinguish between two similarly named entities. To do that, the best entity extraction systems also link entities back to a knowledge base like Wikipedia or an internal database.
By looking at the context of each entity mention as well as customizable rules the user has set, an entity linking algorithm connects entities to a corresponding ID in a knowledge base. The most useful entity linking will provide a linking confidence score that balances the problems of ambiguity and variety.
Ambiguity: When one name can refer to two or more entities, entity linking uses context to decide which entity it is. For example, “Mars” could be:
- A candy company (entity type: organization)
- A planet (entity type: location)
If the surrounding context is about the solar system, then it’s probably the latter.
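To make the role of context concrete, here is a toy Python sketch that picks between the two candidates by counting context words. The knowledge base IDs, the keyword sets, and the scoring formula are all hypothetical. Real linkers use much richer context models, so this is only meant to show the shape of the decision and of a confidence score.

```python
import re

# Hypothetical knowledge base entries; IDs and keyword sets are invented.
CANDIDATES = {
    "mars": [
        {"id": "KB-0001", "name": "Mars (planet)", "type": "LOCATION",
         "keywords": {"planet", "orbit", "solar", "rover", "nasa"}},
        {"id": "KB-0002", "name": "Mars (candy company)", "type": "ORGANIZATION",
         "keywords": {"candy", "chocolate", "company", "brand", "snack"}},
    ]
}


def link(mention: str, context: str):
    """Return (best_candidate, confidence), or (None, 0.0) if context gives no signal.

    Confidence is the winning candidate's share of all keyword hits.
    """
    words = set(re.findall(r"\w+", context.lower()))
    scored = [(len(c["keywords"] & words), c)
              for c in CANDIDATES.get(mention.lower(), [])]
    total = sum(score for score, _ in scored)
    if total == 0:
        return None, 0.0
    score, best = max(scored, key=lambda pair: pair[0])
    return best, score / total


if __name__ == "__main__":
    entry, confidence = link("Mars", "The rover landed on Mars, the fourth planet of the solar system.")
    print(entry["name"], entry["type"], confidence)
```

When the context contains no keyword for either candidate, the sketch declines to link rather than guess, which is one way a low confidence score can be put to use.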
Variety: For entities with more than one name, entity linking groups them as “synonyms” and ensures all mentions of any synonym link back to the same ID. For example, New York City (Wikidata Q60) could be mentioned as:
- New York City
- The Big Apple
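A minimal Python sketch of the synonym grouping might look like this. The alias table holds only the two names listed above for Wikidata Q60. The table layout and function names are assumptions for the example.

```python
# Alias table: knowledge base ID -> known names for that entity.
ALIASES = {
    "Q60": ["New York City", "The Big Apple"],
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


# Invert the table once so each lookup is a single dictionary access.
ALIAS_TO_ID = {
    normalize(alias): kb_id
    for kb_id, names in ALIASES.items()
    for alias in names
}


def resolve(mention: str):
    """Return the knowledge base ID for a mention, or None if it is unknown."""
    return ALIAS_TO_ID.get(normalize(mention))


if __name__ == "__main__":
    for name in ["New York City", "the big apple", "Gotham"]:
        print(name, "->", resolve(name))
```

Every known name resolves to the same ID, so downstream counts and searches treat the mentions as one entity. Names missing from the table return None.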
Hybrid entity extraction in Rosette
Rosette’s entity extraction models use a balance of all three extraction methods, as well as a powerful redaction processor, to consistently return highly accurate results. Over years of work with entity extraction customers, Rosette’s algorithms have been continually stress-tested and improved, including the creation of unprecedented high-speed engines for exact match lists.
What’s more, Rosette delivers consistently high-quality results across a broad range of major languages out-of-the-box. On-premises, Rosette’s entity extraction also offers many options for users to customize results from any of the three processor types. Adaptability ranges from modifying regular expressions or adding new entity lists for new entity types all the way to retraining the statistical model with unannotated data or new, annotated data.