How it’s used and how it works
Entity extraction (also known as named entity recognition, or NER) is a type of natural language processing technology that enables computers to analyze text as it is naturally written. Specifically, it pulls out the most important data points (entities) in unstructured text (think news, webpages, text fields). Entities include names of people, places, organizations, and products, as well as dates, email addresses, and phone numbers. Extracted entities can populate a database record about the text. This structure enables higher-level analyses, such as finding relationships between entities, detecting events, and analyzing sentiment around entities.
What is Named Entity Recognition Used For?
Better Search for E-commerce, Business Research
Extracted entities make keyword search more accurate. Keywords only match words, whereas entity extraction uses context to know when, for example, “Paris” refers to a city, the name of a celebrity (“Paris Hilton”), or a nonentity (plaster of Paris). In e-commerce, extracting price, clothing features, size, and other product attributes from descriptions lets shoppers filter searches to refine 200 results to a browsable 20.
Brand Monitoring and Intelligence Gathering
Want to know what people are saying about a new product launch or their experience at your hotel? NER is an enabling technology for sentiment analysis, whether the goal is to track social media buzz or uncover new rivals. Intelligence agencies that track specific people and organizations of interest in message streams can distinguish between similarly named entities (e.g., Neil Armstrong the astronaut versus Neil Armstrong the hockey referee) by linking each mention to an entity knowledge base using the context surrounding the entity. (Does the text refer to space or hockey?)
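To make the "space or hockey?" idea concrete, here is a minimal sketch of context-based linking. The knowledge base, its entry IDs, and its context words are all hypothetical and invented for illustration; real entity linkers rely on much richer signals than simple word overlap.

```python
import re

# Hypothetical knowledge base: each candidate lists words that tend to appear near it.
KNOWLEDGE_BASE = {
    "Neil Armstrong": [
        {"id": "neil_armstrong_astronaut",
         "context": {"moon", "apollo", "nasa", "space", "astronaut"}},
        {"id": "neil_armstrong_referee",
         "context": {"hockey", "nhl", "referee", "ice", "game"}},
    ],
}

def link_entity(mention, text):
    """Return the ID of the candidate whose context words overlap most with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    candidates = KNOWLEDGE_BASE.get(mention, [])
    best = max(candidates, key=lambda c: len(c["context"] & words), default=None)
    if best is None or not (best["context"] & words):
        return None  # unknown mention, or no supporting context in the text
    return best["id"]

print(link_entity("Neil Armstrong", "Neil Armstrong walked on the moon during the Apollo program."))
print(link_entity("Neil Armstrong", "Neil Armstrong officiated the hockey game last night."))
```

Returning None when no context word matches is a deliberate choice in this sketch: it avoids guessing when the surrounding text offers no evidence either way.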
Knowledge Graphs, Event Extraction, Fact Extraction
Technologies built on NER are pushing the boundaries of what is possible:
- Knowledge graphs visualize the relationships between entities (who is affiliated with which organizations and locations).
- Fact extraction answers factual questions (What kills bacteria?).
- Event extraction finds who did what to whom, when, and where.
Especially for these advanced technologies, entity extraction must be highly accurate and must chain together different mentions of the same entity (e.g., Neil Armstrong, he, Armstrong, the astronaut). This chaining is known as coreference resolution.
How Entity Extraction Works
Different techniques are used to extract different types of entities.
Machine learning trains models to extract entities such as person, location, and organization where word meaning varies depending on context (e.g., Paris). A corpus of text containing thousands of examples of each entity type is annotated by humans. Then an algorithm trains a statistical model on that data to “learn rules” for predicting which words represent which entity type.
The accuracy of a machine learning model depends on the algorithm used and, even more so, on the quality of the training and test data. Deep learning models can be more accurate than traditional machine learning models, but are currently much slower. Optimizing the accuracy of a model means adapting the statistical model to the kind of data it will be run on.
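The annotate, train, and predict loop described above can be sketched with a toy model. Everything here is an assumption made for illustration: the five-sentence corpus, the label names (PER, LOC, O for "not an entity"), and the simple feature-voting scheme, which stands in for the far more capable statistical and deep learning models used in practice.

```python
from collections import Counter, defaultdict

# Hypothetical hand-annotated corpus: (token, label) pairs; "O" means "not an entity".
TRAINING_DATA = [
    [("She", "O"), ("flew", "O"), ("to", "O"), ("Paris", "LOC"), ("yesterday", "O")],
    [("Paris", "PER"), ("Hilton", "PER"), ("attended", "O"), ("the", "O"), ("gala", "O")],
    [("He", "O"), ("lives", "O"), ("in", "O"), ("London", "LOC"), ("now", "O")],
    [("They", "O"), ("traveled", "O"), ("to", "O"), ("Berlin", "LOC")],
    [("Anna", "PER"), ("Hilton", "PER"), ("spoke", "O"), ("first", "O")],
]

def features(tokens, i):
    """Describe token i by the word itself and its immediate neighbors."""
    prev_word = tokens[i - 1].lower() if i > 0 else "<START>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>"
    return [f"word={tokens[i].lower()}", f"prev={prev_word}", f"next={next_word}"]

def train(corpus):
    """Count how often each feature co-occurs with each label."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = [token for token, _ in sentence]
        for i, (_, label) in enumerate(sentence):
            for feature in features(tokens, i):
                counts[feature][label] += 1
    return counts

def predict(counts, tokens):
    """Give each token the label that its features vote for most strongly."""
    labels = []
    for i in range(len(tokens)):
        scores = Counter()
        for feature in features(tokens, i):
            seen = counts.get(feature)
            if not seen:
                continue  # feature never observed during training
            total = sum(seen.values())
            for label, n in seen.items():
                scores[label] += n / total  # each feature casts a fractional vote
        labels.append(max(scores, key=scores.get) if scores else "O")
    return labels

model = train(TRAINING_DATA)
print(predict(model, "We moved to Paris".split()))
print(predict(model, "Paris Hilton arrived".split()))
```

With this toy corpus, "Paris" is labeled LOC after "to" and PER before "Hilton", which is the context sensitivity that keyword lists lack. A corpus this small generalizes poorly, which is why real training sets contain thousands of annotated examples.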
The exact match method matches words against a list of entities for each entity type. This method is appropriate for entity types that are finite and unambiguous, such as nationalities. However, since exact match doesn’t consider context, it cannot distinguish between the nationality “Polish” and the common word “polish.”
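A minimal sketch of exact matching, assuming a tiny hypothetical list of nationalities and case-insensitive lookup, shows both the appeal and the limitation:

```python
import re

# Hypothetical, tiny entity list; a production list would be far longer.
ENTITY_LISTS = {
    "NATIONALITY": {"polish", "french", "japanese"},
}

def exact_match(text):
    """Tag every word found in an entity list, ignoring case and context."""
    hits = []
    for match in re.finditer(r"[A-Za-z]+", text):
        token = match.group()
        for entity_type, entries in ENTITY_LISTS.items():
            if token.lower() in entries:
                hits.append((entity_type, token))
    return hits

print(exact_match("The Polish team arrived."))       # a genuine nationality
print(exact_match("Please polish the silverware."))  # a false positive
```

Making the lookup case-sensitive would not fully solve the problem, since a sentence that begins with the verb ("Polish the silverware") would still be tagged as a nationality.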
Pattern matching is effective for finding entities that follow a particular pattern, such as email addresses, URLs, and phone numbers.
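As a rough illustration, pattern matching is often implemented with regular expressions. The patterns below are simplified assumptions (for example, the phone pattern only covers ten-digit numbers written as 555-123-4567); real-world formats vary far more widely.

```python
import re

# Illustrative patterns only; each one deliberately covers a narrow set of formats.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"https?://[^\s<>\"']+"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}

def extract_patterns(text):
    """Return (entity_type, matched_text, start_offset) tuples in order of appearance."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])

sample = "Contact sales@example.com or call 555-123-4567. Details: https://example.com/pricing"
for entity in extract_patterns(sample):
    print(entity)
```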
Applications that analyze big data to find insight from patterns and themes in unstructured text depend on entity extraction, and the number of such applications will only continue to grow.
Explore Entity Extraction More Deeply in Our Other Blog Posts
Advanced entity extraction features
- What is coreference resolution?
- What is entity linking (aka, entity resolution)?
- What is entity salience for distinguishing between important and unimportant entities?