What Does It Take to Build a Real Production-Ready Model for Entity Extraction in One Language? Here’s a Peek Through the Eyes of Our Linguistic Data Engineer
Swedish model building by the numbers
- 3 annotators
- 5 engineers and project managers
- 840 annotator hours
- 5 calendar months for the annotation period
- 7 calendar months from project start to end
- 3,824 documents
- 737,893 tokens (words)
- 13,775 person entity mentions
March 1, 2019
We are building a new entity extraction model for Swedish. Our English-fluent, Swedish partner is hiring the annotators and managing them for us. Finding native speakers in the target language who also communicate well in English is always tough. We have some Swedish examples in our master annotation guidelines, but we’ll need to get them checked out by the native speakers.
March 14, 2019
Our Swedish partner found us three annotators through a local university. We’re getting them familiarized with our guidelines. I’m really glad there are no new entities that we have to annotate. After training 21 languages we’ve covered most of the edge cases in our guidelines. As expected, we have spent a week doing several rounds with our annotators to fix old examples and add examples to our guidelines.
We’ll use the open source BRAT annotation tool. Our R&D department has a cool active learning annotation tool it’s developing that will help me select the most informative documents to tag first (thus minimizing the number of docs we have to tag), but it’s not yet ready. 😞
This active learning annotation tool will also have features for assigning, monitoring, and adjudicating work with multiple annotators. Right now that work takes up a huge chunk of my time!
March 21, 2019
We’ve started our pilot annotation, in which we ask all three annotators to tag the same 10-20 hand-picked documents. I’m not expecting this effort to produce enough data to train a model; it’s really meant to acclimate the annotators to the task and shake out any questions or confusion. Since all three of them are tagging, these docs will end up in the “gold standard” test dataset that we use to test the models we build. We keep these docs locked away so the engineers can’t see them. They would be tempted to fit the model to score well on these documents, and that is NOT good practice.
March 23, 2019
As expected, there have been questions from the annotators. For example, our English guidelines say not to tag the possessive ’s, but since the Swedish possessive is a plain s with no apostrophe, we’ve decided the s should be included in the tagged entity.
Preparing the data
March 31, 2019
I got the news documents from our partner that will become our corpus. It will take me a week or so to:
- Check for cleanliness (no stray navigation bars or other cruft)
- Look for duplicate documents. Duplicates are a real issue because they can:
  - Waste the time of the annotators
  - Contaminate our “gold standard” test dataset that we use to score our models. If we test on documents the model trained on, that completely invalidates the scores.
- Make sure we preserve any metadata we have about each document (date of publication, news category, etc.)
- Split the docs between training and evaluation datasets. As usual, it’ll be an 80/20 split.
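The duplicate check and the 80/20 split can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline we actually ran: it only catches duplicates that are identical after lowercasing and whitespace cleanup, and the function names, the seed, and the sample documents are all made up for the example.

```python
import hashlib
import random
import re


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(docs: dict) -> dict:
    """Keep only the first document seen for each fingerprint."""
    seen = set()
    unique = {}
    for doc_id, text in docs.items():
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            unique[doc_id] = text
    return unique


def split_train_eval(doc_ids, eval_fraction=0.2, seed=13):
    """Shuffle with a fixed seed, then set aside the evaluation share."""
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    n_eval = round(len(ids) * eval_fraction)
    return ids[n_eval:], ids[:n_eval]


# Hypothetical sample corpus; "b" duplicates "a" apart from case and spacing.
docs = {
    "a": "Stockholm är Sveriges huvudstad.",
    "b": "stockholm  är Sveriges huvudstad. ",
    "c": "Göteborg ligger på västkusten.",
    "d": "Malmö ligger i Skåne.",
    "e": "Uppsala har ett gammalt universitet.",
    "f": "Kiruna ligger långt norrut.",
}
unique_docs = drop_duplicates(docs)
train_ids, eval_ids = split_train_eval(unique_docs)
```

Deduplicating before the split is the important ordering: it guarantees that the same text cannot land on both the training side and the evaluation side. Near-duplicates (a wire story with one edited sentence, say) would need a fuzzier comparison than an exact hash.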
April 5, 2019
We’ve got excellent annotators. The inter-annotator agreement between the three on the pilot was quite high. I removed all the tokens that were marked NONE by all annotators and then calculated the inter-annotator agreement. Anything above .80 is considered good, and they are in the mid .80s.
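For readers who want to see the calculation, here is a rough sketch. The diary does not say which agreement statistic was used, so the mean pairwise Cohen's kappa below is an assumption, and the token labels are invented. The one step taken straight from the text is dropping tokens that every annotator marked NONE before scoring.

```python
from collections import Counter
from itertools import combinations


def drop_all_none(labels_by_annotator, none_label="NONE"):
    """Remove token positions that every annotator left untagged."""
    columns = zip(*labels_by_annotator)
    kept = [col for col in columns if any(lab != none_label for lab in col)]
    return [list(row) for row in zip(*kept)]


def cohen_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(labels_by_annotator):
    """Average kappa over every pair of annotators."""
    pairs = list(combinations(labels_by_annotator, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Invented token-level labels for three annotators on the same eight tokens.
ann1 = ["PERSON", "PERSON", "NONE", "LOCATION", "NONE", "NONE", "TITLE", "NONE"]
ann2 = ["PERSON", "PERSON", "NONE", "LOCATION", "NONE", "NONE", "NONE", "NONE"]
ann3 = ["PERSON", "NONE", "NONE", "LOCATION", "NONE", "ORGANIZATION", "TITLE", "NONE"]

filtered = drop_all_none([ann1, ann2, ann3])
agreement = mean_pairwise_kappa(filtered)
```

Dropping the all-NONE tokens matters because most tokens in news text are not entities. Leaving them in would let the annotators "agree" on thousands of untagged words and inflate the score.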
We’re ready to start tagging. We’ll start with a big chunk that all three will tag, so we can use it for our “gold standard” test dataset. After that they will all work on different docs so we can get through more docs in less time, but every 10th doc, I’ll give them the same doc to tag so that I can keep monitoring the inter-annotator agreement and be confident that they are still tagging reliably.
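The assignment scheme is simple enough to sketch. This is an illustration under assumptions: the round-robin rotation, the annotator names, and the document IDs are invented, and only the "every 10th doc goes to everyone" rule comes from the plan above.

```python
from itertools import cycle


def assign_documents(doc_ids, annotators, shared_every=10):
    """Give every Nth document to all annotators, the rest round-robin."""
    assignments = {name: [] for name in annotators}
    shared = []
    rotation = cycle(annotators)
    for position, doc_id in enumerate(doc_ids, start=1):
        if position % shared_every == 0:
            # Overlap document: everyone tags it so agreement can be tracked.
            shared.append(doc_id)
            for name in annotators:
                assignments[name].append(doc_id)
        else:
            assignments[next(rotation)].append(doc_id)
    return assignments, shared


# Hypothetical document IDs and annotator names.
doc_ids = [f"doc{i:03d}" for i in range(1, 31)]
assignments, shared = assign_documents(doc_ids, ["ann_a", "ann_b", "ann_c"])
```

With 30 documents and three annotators, each person tags 9 documents of their own plus the 3 shared ones, so the monitoring overhead is six extra taggings in total.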
July 1, 2019
We completed the initial tagging. Phew! We ran the initial stats and it looks like out of nearly 738,000 tokens, we have about 50,000 entities.
PERSON entities are the most numerous, which is no big surprise.
TITLE entities are somewhat sparse, but that is normal. What is worrisome is that products are very sparse (about 2,000), so the model won’t have enough mentions to train on.
Our options are to:
- Get more docs rich in product names to add to the training set, or
- Remove the products and retag them as needed. For example, where a book title such as “Oliver Twist” was tagged as “product,” it would have to be retagged as PER, since we have chosen to tag fictional people.
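Per-type counts like the ones that exposed the sparse product tags can be tallied straight from the annotation files. The sketch below assumes BRAT's standoff format, where each text-bound annotation is a line starting with a `T` ID, followed by a tab, then the type and character offsets, then another tab and the covered text. The sample annotations and the function name are invented.

```python
from collections import Counter


def count_entity_types(ann_text: str) -> Counter:
    """Count text-bound annotations (lines with a T id) by entity type."""
    counts = Counter()
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip notes, attributes, relations, and blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        entity_type = fields[1].split(" ", 1)[0]
        counts[entity_type] += 1
    return counts


# Invented annotations for "Astrid Lindgren föddes i Vimmerby."
sample_ann = (
    "T1\tPERSON 0 15\tAstrid Lindgren\n"
    "T2\tLOCATION 25 33\tVimmerby\n"
    "#1\tAnnotatorNotes T1\tauthor, real person\n"
)
type_counts = count_entity_types(sample_ann)
```

Running this over the whole training set and sorting by count makes an under-represented type obvious long before any model is trained.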
July 3, 2019
We’ve decided to drop products from the list of entities. Product management thinks it is relatively low priority, and given the time and resources it would take to tag enough product entities, dropping it makes sense. It still means work, though: we have to look at each product entity to see if it needs to be retagged.
I’m going to assign one annotator to review and retag the product entities. The other two will work on adjudication (i.e., looking at places where there was disagreement in tagging and deciding which tag is correct).
At the same time we are going to “sanity test” the annotated data by building a model. Of our annotators, the computational linguist seems to be the most reliable, so wherever there is a conflict, we will use his tags.
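The conflict rule for the sanity-test data amounts to "take the trusted annotator's tag, and remember where the others disagreed." A minimal sketch, with invented annotator names and labels, and with the simplifying assumption that annotations are compared token by token:

```python
def merge_for_sanity_model(token_labels, trusted):
    """Use the trusted annotator's label everywhere; report conflict positions."""
    merged = []
    conflicts = []
    for position, labels in enumerate(token_labels):
        merged.append(labels[trusted])
        if len(set(labels.values())) > 1:
            conflicts.append(position)  # candidate for human adjudication
    return merged, conflicts


# Invented labels from three annotators for three tokens.
token_labels = [
    {"ann_a": "PERSON", "ann_b": "PERSON", "ann_c": "PERSON"},
    {"ann_a": "NONE", "ann_b": "TITLE", "ann_c": "NONE"},
    {"ann_a": "LOCATION", "ann_b": "ORGANIZATION", "ann_c": "ORGANIZATION"},
]
merged, conflicts = merge_for_sanity_model(token_labels, trusted="ann_a")
```

Note that the trusted annotator wins even when outvoted two to one, as in the third token. That is fine for a quick sanity model, and the list of conflict positions is exactly what the adjudicators work through for the final data.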
Building the model
July 5, 2019
While I’ve been busy with annotation, the machine learning engineer has been scraping and cleaning data from Swedish Wikipedia. Now she is using her Swedish Wikipedia data plus a plain-text version of the training dataset for unsupervised training. This process creates word classes, which are essentially clusters of words that appear in similar contexts in the documents.
She will then run supervised training on the annotated data and the word classes to create the model itself.
July 7, 2019
The engineer reports very good results. For each entity type, against the “gold standard” test dataset, the model is scoring between the mid 70s and mid 80s, which is quite good. The outlier is the product entity type, which is scoring in the low to mid 30s. That entity type definitely needs to be removed from the model.
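For anyone unfamiliar with how per-type scores like these are produced, here is a minimal sketch. It assumes exact-match scoring, where a predicted entity counts only if its document, start offset, end offset, and type all match a gold entity. The diary does not say whether our scorer is that strict, and the entities below are invented.

```python
def score_by_type(gold, predicted):
    """Return {entity_type: (precision, recall, f1)} using exact span matches."""
    scores = {}
    for entity_type in sorted({e[3] for e in gold | predicted}):
        g = {e for e in gold if e[3] == entity_type}
        p = {e for e in predicted if e[3] == entity_type}
        true_positives = len(g & p)
        precision = true_positives / len(p) if p else 0.0
        recall = true_positives / len(g) if g else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[entity_type] = (precision, recall, f1)
    return scores


# Invented entities as (doc_id, start, end, type) tuples.
gold = {
    ("d1", 0, 5, "PERSON"),
    ("d1", 10, 19, "LOCATION"),
    ("d2", 3, 9, "PERSON"),
    ("d2", 20, 26, "PRODUCT"),
}
predicted = {
    ("d1", 0, 5, "PERSON"),
    ("d1", 10, 19, "LOCATION"),
    ("d2", 3, 8, "PERSON"),    # boundary off by one character: counts as a miss
    ("d2", 30, 35, "PRODUCT"),  # wrong span entirely
}
scores = score_by_type(gold, predicted)
```

Scoring each type separately is what makes an outlier visible. A single overall number would have hidden how badly one sparse type was doing behind the well-populated ones.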
August 1, 2019
It took a few weeks, but we’ve finished retagging and removing the product entities and adjudicating the disagreements. Yesterday our engineer built another model. The results weren’t very different from the first one, but we can’t really compare this against the sanity check model as the annotations changed.
The engineer reports that scores were good enough from the get-go, so she didn’t have to do any special feature tuning. We’re definitely benefiting from having trained models for other European languages. Although Swedish has a rich morphology, it’s typically European, with declensions, suffixes, etc. Furthermore, our Base Linguistics module already supports Swedish for tokenization. As a result, the common set of machine learning features is working well for Swedish, too.
Most of the time, we use gazetteers (entity lists) for the NATIONALITY and RELIGION entities, but because of Swedish’s unique morphology, these two entity types are scoring high with the statistical model. At the same time, because of Swedish’s rich morphology, generating gazetteers that list out all the morphological variations would be difficult and time-consuming. There are up to four different forms for any nationality because the word can be an adjective or a noun, each of which has its own morphology.
We’re building out the rest of the gazetteers we’ll need for entities: TITLE. The tricky part is that you only want to put in items that are unambiguous. Short words tend to be highly ambiguous, either between an entity and a non-entity or between different entity types. Take the English “lee,” which is a common noun and adjective, but also a place (“Lee, Florida”) and a person’s name (“Stan Lee”).
Our partner’s Swedish engineer is compiling these gazetteers and adding the necessary morphological variations to make them useful.
The engineers are also writing regular expressions to pattern match entities such as NUMBERS, government-issued PERSONAL IDENTIFIER numbers, and such. Our project manager and her Swedish counterpart are coming up with test cases so that we aren’t writing regular expressions that over- or under-generate pattern matches.
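To give a flavor of what such a pattern and its test cases look like, here is a sketch for the Swedish personal identity number (personnummer). It is not the expression our engineers wrote. It assumes the common written forms (an optional century, YYMMDD, an optional "-" or "+", then four digits) and the Luhn check digit over the 10-digit form. The sample number is only an illustrative value that passes the checksum.

```python
import re

# Optional century, then YYMMDD, optional separator, four trailing digits.
# The lookarounds stop the pattern from firing inside a longer digit run.
PERSONNUMMER = re.compile(
    r"(?<!\d)(?:\d{2})?(\d{2})(\d{2})(\d{2})[-+]?(\d{4})(?!\d)"
)


def luhn_ok(ten_digits: str) -> bool:
    """Luhn checksum over the 10-digit form (weights 2,1,2,1,...)."""
    total = 0
    for index, char in enumerate(ten_digits):
        value = int(char) * (2 if index % 2 == 0 else 1)
        total += value - 9 if value > 9 else value
    return total % 10 == 0


def find_personal_ids(text: str) -> list:
    """Return matches that also have a plausible date and a valid check digit."""
    hits = []
    for match in PERSONNUMMER.finditer(text):
        yy, mm, dd, tail = match.groups()
        if not (1 <= int(mm) <= 12 and 1 <= int(dd) <= 31):
            continue  # rejects impossible months and days
        if luhn_ok(yy + mm + dd + tail):
            hits.append(match.group(0))
    return hits


# Test cases in the spirit of "don't over- or under-generate".
should_match = ["Personnummer: 811228-9874.", "Född 19811228-9874 i Lund."]
should_not_match = [
    "Fel kontrollsiffra: 811228-9875",  # wrong check digit
    "Ring 08-123 456 78",               # phone number
    "Order 1234567890123",              # longer digit run
    "Månad 13: 811328-9874",            # impossible month
]
```

The regular expression on its own over-generates, since any ten digits in the right shape would match. The date and checksum filters are what bring precision back, and the two lists make it easy to check both failure directions whenever the pattern changes.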
September 30, 2019
We are feature complete for Swedish entity extraction. Our final model is quite good: F-scores, precision, and recall for the various entities range from the mid 70s to low 90s. We’re celebrating with a Swedish smörgåsbord. Skål!