What are the top three barriers to better machine learning models? Annotating data, annotating data, and annotating data.
Producing the quality training data that accurate models require takes up the lion’s share of the human labor and time in the entire process. This includes collecting and cleaning data, making sure it is balanced and representative, creating annotation guidelines, annotating the data, and checking for inter-annotator agreement.
That’s why we’re excited about Model Training Suite, which reduces both the human labor of annotating and the amount of data that must be tagged to train an accurate model. What’s the secret? Active learning and natural language processing.
Model Training Suite is a user-friendly GUI application for nontechnical users. After annotating a quantity of their own data, the annotation manager can push a button to create a new model and see the improved results tested against their gold standard evaluation dataset.
A Quick Primer on the Traditional Annotation Process
Unfamiliar with how training data is produced for supervised machine learning? (“Supervised” means that humans supervise the model by specifying the examples from which it will learn.) Take entity extraction, for example. Training a new model requires training data on the scale of hundreds of thousands of tokens (words), in which every occurrence of each entity type (e.g., person, location, organization) is consistently tagged by humans. And you will need enough examples of each type, so that your resulting machine-learned model has enough data to learn the contexts in which these entities are likely to appear.
Whatever the task may be, the accuracy of a machine-learned model depends heavily on the quality and quantity of the data it was trained on.
Collecting, cleaning, and annotating this data requires a tremendous amount of human labor. And the results of the annotation are not known for months. (Were there enough examples of the target entities? What will be the model’s accuracy after tagging 10,000 tokens? 50,000?)
For example, annotating 500,000 tokens for entity extraction takes eight person-months of annotator time (split between four annotators) plus a project manager. The fear is always that, if it turns out there were not enough examples, restarting the entire process (data collecting/cleaning and rehiring annotators) will be even more costly. As a result, the accepted best practice is to tag more data than is strictly needed, to be assured there is enough training data to produce a high-accuracy model.
How to Accelerate Annotation Without Sacrificing Accuracy
With “active learning” running in the annotation tool, the user prioritizes “informative” instances, thereby finding the signal in less time. Because Model Training Suite works in concert with a small model trained on the documents annotated “so far,” it can intelligently select the documents from the dataset that will be most “informative” (i.e., impactful or educational) for training the final model. When training for a categorization task (the basic formulation behind most NLP machine learning models), the small model classifies each document in the yet-unannotated dataset, providing a guess and a confidence score. The documents with the lowest confidence most likely contain examples that the model still needs to learn, so those are recommended for annotation. At the same time, to avoid creating a model that overfits to edge (or rare) cases, a quantity of randomly selected documents must also be annotated.
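The selection strategy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Model Training Suite’s actual implementation: the function name `select_for_annotation`, the `random_fraction` parameter, and its 20% default are all hypothetical, and the confidence scores are assumed to come from whatever interim model is available.

```python
import random


def select_for_annotation(confidences, batch_size, random_fraction=0.2, seed=0):
    """Pick the next batch of document ids to annotate.

    confidences: dict mapping document id -> the interim model's confidence
                 in its top guess for that document (0.0 to 1.0).
    random_fraction: share of the batch drawn at random (hypothetical knob)
                     so the final model does not overfit to edge cases.
    """
    rng = random.Random(seed)
    n_random = int(batch_size * random_fraction)
    n_uncertain = batch_size - n_random

    # Least-confident documents first: these are the "informative" ones.
    by_confidence = sorted(confidences, key=confidences.get)
    uncertain = by_confidence[:n_uncertain]

    # Fill the rest of the batch with a random sample of the remaining documents.
    remainder = by_confidence[n_uncertain:]
    random_picks = rng.sample(remainder, min(n_random, len(remainder)))
    return uncertain + random_picks


# Made-up confidence scores for six unannotated documents.
scores = {"doc_a": 0.97, "doc_b": 0.51, "doc_c": 0.88,
          "doc_d": 0.55, "doc_e": 0.99, "doc_f": 0.72}
batch = select_for_annotation(scores, batch_size=3, random_fraction=0.34)
```

Here the first two entries of `batch` are the two least-confident documents (`doc_b` and `doc_d`), and the third is drawn at random from the rest. After the annotator tags the batch, the interim model would be retrained and the confidences recomputed before the next selection.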
Model Training Suite accelerates annotation in these ways:
- AI-assisted data preprocessing — From the NLP features of Rosette, base linguistics segment documents into useful chunks for annotating; entity extraction informs higher order annotations; and semantic similarity measures the uniqueness of each document
- Iterative model evaluation — An interim model of documents tagged “so far” is continually retrained and evaluated as newly tagged documents are added, so tagging can be halted as soon as the model reaches the target accuracy or hits diminishing returns (i.e., it becomes clear that tagging more documents will not significantly increase model accuracy)
- Efficient annotation — Based on the interim model, active learning will recommend the most likely “informative” documents from the yet-to-be annotated dataset, so that they can be tagged first; as the model is rebuilt, the assessment of “informative” documents is continually re-evaluated
- Computer-assisted tagging — The interim model will pre-tag documents for the human annotator to accept/reject/correct; adjusting/correcting tags is much faster than tagging from scratch
- Faster domain adaptation — Natural language processing built into Rosette decreases time for adapting an existing model to a new domain (such as finance or product reviews) or adding an entity type.
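The iterative-evaluation step in the list above can be read as a stopping rule applied after each retraining round. The sketch below is one hedged way to express it; the function name `should_stop`, the thresholds, and the `patience` window are assumptions chosen for illustration, not settings taken from Model Training Suite.

```python
def should_stop(f1_history, target_f1=0.85, min_gain=0.005, patience=3):
    """Decide whether annotation can halt.

    f1_history: F1 scores of successive interim models, oldest first,
                each measured on the same gold-standard evaluation set.
    target_f1, min_gain, patience: hypothetical thresholds.
    """
    if not f1_history:
        return False

    # Case 1: the interim model has reached the target accuracy.
    if f1_history[-1] >= target_f1:
        return True

    # Case 2: diminishing returns. Compare the latest score with the score
    # from `patience` retraining rounds ago.
    if len(f1_history) > patience:
        gain = f1_history[-1] - f1_history[-1 - patience]
        return gain < min_gain

    return False


still_improving = should_stop([0.50, 0.60, 0.70, 0.75])
reached_target = should_stop([0.60, 0.80, 0.86])
plateaued = should_stop([0.70, 0.80, 0.801, 0.802, 0.803])
```

With these made-up histories, `still_improving` is `False`, while `reached_target` and `plateaued` are both `True`: the first because the latest score passes the target, the second because three more rounds of tagging added less than half a point of F1.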
Because Model Training Suite allows data annotation, training, and evaluation to be done in parallel, there is transparency into the progress of model building to answer the $100,000 question, “Are we there yet?” And the result is tagging fewer documents than before to achieve similarly accurate models.
Enable Training for “Low-signal” NLP Tasks
Active learning in Model Training Suite also makes it possible to train for “low-signal” tasks. One example is a tweet classification task, where we have tweets containing the word “iPhone” and we want to find the ones that express an explicit intent to buy an iPhone. However, most of the tweets do not express this intent. Instead, they contain commercial content, ads, and personal reviews. Such “low-signal” classification tasks are challenging because of the relatively large number of negative documents that need to be annotated before a single positive document is found.
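A little arithmetic shows why random selection is so costly here. If a fraction p of the documents is positive, the number of randomly drawn documents needed to find one positive follows a geometric distribution with mean 1/p. The sketch below uses a purely hypothetical positive rate of 2%; the source does not state the real rate for the tweet task.

```python
def expected_docs_per_positive(positive_rate):
    """Mean number of randomly drawn documents needed to find one positive.

    This is the mean of a geometric distribution with success probability
    `positive_rate`.
    """
    if not 0 < positive_rate <= 1:
        raise ValueError("positive_rate must be in (0, 1]")
    return 1 / positive_rate


def expected_positives(n_annotated, positive_rate):
    """Mean number of positives found after annotating n random documents."""
    return n_annotated * positive_rate


# Hypothetical rate: 2% of "iPhone" tweets express an intent to buy.
docs_per_positive = expected_docs_per_positive(0.02)
positives_in_1000 = expected_positives(1000, 0.02)
```

Under that assumed rate, an annotator would read about 50 tweets on average for each positive example, and 1,000 randomly chosen tweets would yield only about 20 positives.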
A Comparison of Annotating by Random Selection vs. Active Learning
The Babel Street R&D department compared the active learning approach with the traditional random selection approach for a sequence labeling task: named entity recognition in English using a common deep-learning architecture. On the horizontal axis, we see the number of words that were annotated so far, and on the vertical axis we have the accuracy (F1 measure) across all classes, measured over an independent evaluation set. The blue curve represents document selection that is based on active learning, while the orange one represents a random document selection approach. Other than changing the document selection approach, all things were equal for the two systems.
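For readers unfamiliar with the metric on the vertical axis, here is a minimal sketch of an F1 computation for entity extraction. It assumes micro-averaging over exact-match entity spans, which is one common convention; the text does not say which averaging scheme was used in the experiment, and the tuple layout and example data are hypothetical.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over entity spans.

    gold, predicted: sets of (doc_id, start, end, entity_type) tuples.
    A prediction counts as correct only if all four fields match a gold span.
    """
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up evaluation data: three of four predictions match the gold standard.
gold_spans = {
    ("doc1", 0, 2, "PERSON"),
    ("doc1", 7, 8, "LOCATION"),
    ("doc2", 3, 5, "ORGANIZATION"),
    ("doc2", 9, 10, "PERSON"),
}
predicted_spans = {
    ("doc1", 0, 2, "PERSON"),
    ("doc1", 7, 8, "LOCATION"),
    ("doc2", 3, 5, "ORGANIZATION"),
    ("doc2", 9, 10, "LOCATION"),  # right span, wrong type: counts as an error
}
score = micro_f1(gold_spans, predicted_spans)
```

In this example precision and recall are both 3/4, so `score` is 0.75.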
This figure clearly shows that taking an active learning approach cuts the size of the dataset required for learning by a large factor (roughly 2x). This conclusion aligns with findings from previous work in this field. Two of the most prominent studies were done by Burr Settles, for document classification tasks, and by Shen et al. (2017), for sequence labeling tasks.