Text Classification
Text classification models learn to assign one or more labels to text. You can use text classification over short pieces of text like sentences or headlines, or longer texts like paragraphs or even whole documents. One of our top tips for practical NLP is to break down complicated NLP tasks into text classification problems whenever possible. Text classification problems tend to be easier to annotate consistently, and the models need fewer examples to reach high accuracy.
Whether you’re doing intent detection, information extraction, semantic role labeling or sentiment analysis, Prodigy provides easy, flexible and powerful annotation options. Active learning keeps you efficient even if your classes are heavily imbalanced.
Quickstart
For balanced classes, the easiest way to get started is to use
textcat.manual
with a text source and one or more labels. See the docs on
manual annotation for examples. Setting the --exclusive
flag makes
the categories mutually exclusive, so you’ll only be able to select one label
option.
Once you’ve collected a dataset of maybe a few hundred annotations, you can run
training experiments to see if you’re on the right track. The
train
recipe takes one or more Prodigy datasets, trains a model and
outputs statistics and results. You can also use data-to-spacy
to export
data in spaCy’s format to use with
spacy train
, or db-out
to export your
annotations to use in any other process or application.
If your classes are imbalanced and you annotated an unbiased sample, your sample
would include very few examples that your label applies to, making it
difficult to train a reliable model. To make annotation more efficient, you can
use the textcat.teach
recipe to suggest the most relevant examples to
annotate. It uses match patterns of trigger phrases to collect enough positive
examples, and updates a model in the loop that suggests candidates it’s most
uncertain about. See this section for an example.
Annotation can be very efficient, because you only have to press
accept or reject. Once you’re done annotating, you can use
train
to update your model with the annotations.
If you have an existing text classification model trained with spaCy, you can
load it into the textcat.teach
recipe and give it feedback on the
predictions it’s most uncertain about. This means you’re focusing on annotating
examples that potentially make the biggest difference. The progress indicator in
the sidebar shows an estimate of how much you still need to annotate until
there’s nothing left to learn – or, phrased differently, an estimate of when the
loss is going to hit zero. This gives you an idea of when to stop. Once you’re
done annotating, you can use the train
recipe to update the model with
the new annotations.
If you’re not using a spaCy pipeline, you can write a
custom recipe that integrates your model, so you can use
it as part of the same textcat.teach
-style
active learning flow.
If you have existing annotations, you can convert them to Prodigy’s format and
use the db-in
command to import them to a new dataset. Each record should
have a "text"
and either a "label"
plus "answer"
(accept or reject) or a
list of "options"
and a list of selected labels as the "accept"
key. For
examples of the data formats, see the classification
UI (binary) and
choice
interface (manual). You can then run train
to train your
model, use textcat.manual
to add more annotations, or run the
review
recipe to correct mistakes and resolve conflicts.
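For example, imported records could look like this – an illustrative sample, assuming a binary INSULT dataset and a multiple-choice topics dataset (the texts and labels are made up):
annotations.jsonl
{"text": "Only an idiot would believe this", "label": "INSULT", "answer": "accept"}
{"text": "Thanks, that was really helpful!", "label": "INSULT", "answer": "reject"}
{"text": "Chip maker reports record quarterly profits", "options": [{"id": "Technology", "text": "Technology"}, {"id": "Economy", "text": "Economy"}], "accept": ["Technology", "Economy"], "answer": "accept"}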
If all you want to do is train and you don’t need to collect or correct any annotations, you might find it more efficient to just train with spaCy (or any other library) directly.
Choosing the right recipe and workflow
- Fully manual: This is the classical approach and a very reliable way to get all of your examples annotated with all classes. For each example, you select one or more categories from a list. At the end of the process, you export “gold-standard” data that you can train your model with. In Prodigy, you can use this workflow with the textcat.manual recipe, which displays the labels as options and lets you select one (mutually exclusive categories) or multiple (multilabel classification).
- Binary with suggestions from patterns, active learning and a model in the loop: This workflow can be helpful if your classes are very imbalanced and it’s not feasible to go through all texts in order. To help select more relevant examples, you can use patterns to describe trigger words and phrases of the categories you’re looking for. Instead of annotating every example, you can let the model suggest the most relevant examples to annotate and give it feedback on its predictions. There are many different ways to select the “best” examples, and a whole line of research is dedicated to exploring active learning techniques. Prodigy’s textcat.teach recipe implements simple uncertainty sampling. Based on your decisions, the model is updated in the loop and guided towards better predictions. Prodigy also includes utilities that let you implement custom workflows with a model in the loop.
Annotating whole documents vs. annotating sentences
If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.
You can always keep your documents in order, so that your annotators get to move through the document from start to finish. However, if you have an annotation task where the annotator really needs to see the whole document to make a decision, that’s often a sign that your text classification model might struggle. Current technologies struggle to put together information across sentences in complex ways. Often you can restructure tasks that require that much context into multiple labels applied at different points, plus a little bit of rule-based logic. This thread on the forum shows a few ideas for this for fact extraction from earnings news.
Fully manual annotation
To get started, you need a file with raw input text and one or more labels. The
following command will start the web server, stream in news headlines from
news_headlines.jsonl
and show the label options Technology
, Politics
,
Economy
and Entertainment
. Instead of passing in a list of comma-separated
labels, you can also point the --label
argument to a text file with one label
per line.
Recipe command
prodigy textcat.manual news_topics ./news_headlines.jsonl --label Technology,Politics,Economy,Entertainment
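The source file is expected to be newline-delimited JSON with a "text" key per record, and the label list can alternatively live in a separate text file. A minimal sketch of both files and the corresponding command (the contents are made up for illustration):
news_headlines.jsonl
{"text": "Streaming service announces price increase"}
{"text": "Parliament votes on new budget proposal"}

labels.txt
Technology
Politics
Economy
Entertainment

prodigy textcat.manual news_topics ./news_headlines.jsonl --label ./labels.txt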
By default, you’re able to select multiple categories. If your labels are
mutually exclusive and only one of them can apply, you can set the --exclusive
flag. You’re now only able to select one option and the answer is submitted
automatically when you make a selection.
Recipe command
prodigy textcat.manual language_identification ./web_dump.jsonl --label English,German,Other --exclusive
When you hit the accept, reject or ignore
buttons, your answer will be submitted and Prodigy will add an "answer"
key to
the annotation task dict – for example, "answer": "accept"
. When you’re
annotating manually with options, you typically only want to use accepted
answers. Ignoring an answer typically means that you want to skip it completely
and exclude it from everything – for example, because you don’t know the answer
or because the question is confusing or not representative. The
reject button is less relevant here, because there’s nothing to say
no to – however, you can use it to reject examples that have actual problems
that need fixing, like noisy preprocessing artifacts, HTML markup that wasn’t
cleaned properly, texts in other languages, and so on. When you view or export
your data later, e.g. with db-out
, you can then explicitly filter out
those examples and deal with them.
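For example, a minimal sketch of that kind of filtering, assuming you’ve exported the dataset to a file called annotations.jsonl with db-out:
import json

with open("annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

# Keep the examples you explicitly accepted and set the rejected ones aside,
# e.g. to fix preprocessing problems or remove foreign-language texts later
accepted = [eg for eg in examples if eg["answer"] == "accept"]
rejected = [eg for eg in examples if eg["answer"] == "reject"]
print(f"{len(accepted)} accepted, {len(rejected)} rejected")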
The “score” field in the bottom right corner of the annotation card shows you the score of the current suggestion. Even though the recipe tries to present you with the most uncertain scores, it can sometimes happen that you see very different scores instead. So why does this happen?
Streams are generators and only operate on one batch at a time. They can also stream from huge files or potentially infinite sources of data, so Prodigy can’t just load it all into memory and keep sorting the whole stream. Instead, it uses an exponential moving average to decide whether to send out a score, based on the distribution of previous scores. This also prevents it from getting stuck if the model suddenly produces higher or lower scores. If the scores are confusing and the model isn’t producing meaningful suggestions, try collecting some gold-standard data first before switching to the binary workflow.
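The snippet below is not Prodigy’s actual implementation – just a sketch of the general idea, with a made-up smoothing factor, to illustrate how a stream can be filtered by an exponential moving average of previous scores without sorting it globally:
def filter_by_moving_average(scored_stream, smoothing=0.1):
    # Illustrative only. Track how "uncertain" recent examples were:
    # 1.0 at a score of 0.5, 0.0 at a score of 0.0 or 1.0
    avg_uncertainty = 0.5
    for score, example in scored_stream:
        uncertainty = 1.0 - abs(score - 0.5) * 2
        # Send out examples that are at least as uncertain as the recent average
        if uncertainty >= avg_uncertainty:
            yield example
        # Update the moving average so the filter adapts if the scores drift
        avg_uncertainty = (1 - smoothing) * avg_uncertainty + smoothing * uncertainty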
When you annotate with a model in the loop, the model is also updated in the background. So why do you still need to train your model on the annotations afterwards, instead of just exporting the model that was updated in the loop? The main reason is that the model in the loop is only updated once with each new annotation. This is never going to be as effective as batch training a model on the whole dataset, making multiple passes over the data, shuffling on each epoch and using other deep learning tricks like dropout rates, compounding batch sizes and so on. If you batch train your model with the collected annotations afterwards, you should end up with the same model you had in the loop, just better.
When you stop the recipe, the model in the loop is discarded and you can use
train
to train a better version of it using your annotations. If you just
restart the recipe with the base model, it’ll start again at the beginning –
otherwise, Prodigy would have to first batch train it behind the scenes and you
might have to wait for quite a while until you can get started annotating. If
you want to start with the updated model, you can train it with your
annotations, output it to a directory and then initialize textcat.teach
with the updated model:
prodigy train ./batch-trained-model --textcat textcat_dataset --base-model en_core_web_sm
prodigy textcat.teach textcat_dataset ./batch-trained-model ./data.jsonl --label INSULT
To prevent unintended side-effects, you typically want to train the base model
from scratch using all annotations every time you train – for example, you
want to update en_core_web_sm
with all annotations from one or more
datasets and not update batch-trained-model
, save the result, update that
again and so on.
Manual annotations with binary labels
If you only provide a single label, the annotation decision becomes much
simpler: does the label apply or not? In this case, Prodigy will present the
question as a binary task using the classification
interface. You can
then hit accept or reject. Even if you have more than one
label, it can sometimes be very efficient to make several passes over the
data instead of selecting from a list of options. The annotators can focus on
one concept at a time, which can reduce the potential for human error –
especially when working with complicated texts and label schemes.
Recipe command
prodigy textcat.manual language_identification ./web_dump.jsonl --label English
Dealing with very large label sets or hierarchical labels
If you’re working on a task that involves more than 10 or 20 labels, it’s often better to break the annotation task up a bit more, so that annotators don’t have to remember the whole annotation scheme. Remembering and applying a complicated annotation scheme can slow annotation down a lot, and lead to much less reliable annotations. Because Prodigy is programmable, you don’t have to approach the annotations the same way you want your models to work. You can break up the work so that it’s easy to perform reliably, and then merge everything back later when it’s time to train your models.
If your annotation scheme is mutually exclusive (that is, texts receive
exactly one label), you’ll often want to organize your labels into a hierarchy,
grouping similar labels together. For instance, let’s say you’re working on a
chat bot that supports 200 different intents. Choosing between all 200 intents
will be very difficult, so you should do a first pass where you annotate much
more general categories. You’d then take all the texts annotated for some
general type, such as information
, and set up a new annotation task to sort
them into more specific subtypes. This lets the annotators study up on that part
of the annotation scheme, so they can make more reliable decisions.
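For instance, after a first coarse pass with textcat.manual and general labels like information, transaction and smalltalk (all dataset and label names here are made up for illustration), you could export the annotations with db-out and write a small script that produces the source file for the second, more specific pass:
import json

# Read the coarse-pass annotations exported with: prodigy db-out intents_coarse
with open("intents_coarse.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

# Keep the accepted examples whose selected option was "information" and
# write them out as the source for a second, more specific textcat.manual pass
with open("intents_information.jsonl", "w", encoding="utf8") as f:
    for eg in examples:
        if eg.get("answer") == "accept" and "information" in eg.get("accept", []):
            f.write(json.dumps({"text": eg["text"]}) + "\n")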
If your annotation scheme is not mutually exclusive (that is, texts can receive zero or more labels), it’s often faster to annotate one label at a time. This approach might seem inefficient, because you’ll have to make many more annotation passes over the data. However, if you’re annotating for just one label, you usually don’t need to read the text very closely – you can see immediately whether your label applies, letting you flash through the data at seconds per example.
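For example, instead of one session with four options, you could make one pass per label – with a single label, each pass uses the faster binary interface described above (the dataset names are just examples):
prodigy textcat.manual news_technology ./news_headlines.jsonl --label Technology
prodigy textcat.manual news_politics ./news_headlines.jsonl --label Politics
You can then pass all of the per-label datasets to train together once you’re ready to run experiments.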
Binary annotation with suggestions from patterns, active learning and a model in the loop
Annotation for text classification can get tricky if the classes you’re dealing
with are very imbalanced. For instance, let’s say you want to detect insults in
online comments. The majority of the comments you’ve extracted, e.g.
from Reddit (luckily) do not
contain any insults. If you annotated an unbiased sample, your sample would
include very few comments that your INSULT
label applies to, making it
difficult to train a reliable model.
The textcat.teach
recipe lets you take advantage of two cool NLP
techniques to collect a more representative data sample. When you start the
server, you’re shown binary questions and as you annotate, the model in the loop
is updated with your answers and guided towards better predictions. The
suggestions you see are the ones that the model is most uncertain about. In
the beginning, that’s pretty much everything. So to get over the cold start, you
can provide match patterns describing words and phrases that are
likely indicators of the given label – for instance, "idiot"
or "douchebag"
.
The pattern matches will be mixed in with the model suggestions. This ensures
that the model starts off with enough positive examples to make meaningful
suggestions.
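A patterns file for this can start out very simple – for example, something along these lines (see the section on patterns below for details on the format):
insults-patterns.jsonl
{"label": "INSULT", "pattern": [{"lower": "idiot"}]}
{"label": "INSULT", "pattern": "douchebag"}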
Download INSULT patterns Download annotated dataset
Recipe command
prodigy textcat.teach textcat_insults blank:en ./reddit-comments.jsonl --label INSULT --patterns ./insults-patterns.jsonl
The progress indicator in the sidebar shows an estimate of how much you still
need to annotate until there’s nothing left to learn – or, phrased
differently, an estimate of when the loss is going to hit zero. As you annotate
more examples, the model will slowly get a better sense of the INSULT
label
and will suggest more relevant examples.
The highlighted span above shows the pattern match that was responsible for suggesting this example for annotation. Of course, patterns can also produce false positives that you’d have to reject – but that’s also very helpful. You don’t just want your model to learn that “sentences containing ‘douchebag’ are always an insult”. Note that the highlighted span is only added to visualize the match – it’s not going to be used directly as a feature in the model. However, the words that occur in the text will obviously have an impact on the model either way.
Video tutorial: training an insults classifier
The following video shows an end-to-end workflow using terms.teach
to
quickly bootstrap a list of trigger phrases based on word vectors and
textcat.teach
to collect annotations with a model in the loop. It took
40 minutes to create over 830 annotations, including 20% evaluation
examples, which was enough to give 87% accuracy. You can download the
annotated dataset
from GitHub.
Working with patterns
Match patterns are typically provided as a JSONL (newline-delimited JSON) file and can be used to pre-select examples based on expressions they contain. This is especially useful to find positive candidates if your classes are very imbalanced. For instance, if you’re annotating whether a news headline is about a company sales or acquisition, you could define a condition like “contains any form of the verb ‘acquire’” or “includes this company name”. Prodigy supports two types of patterns:
patterns.jsonl
{"pattern": [{"lemma": "acquire"}, {"pos": "PROPN"}], "label": "COMPANY_SALE"}
{"pattern": "acquisition", "label": "COMPANY_SALE"}
- Token patterns: These patterns are lists of dictionaries, with one dictionary describing one token to match. The token attributes to match on can be the token’s "text" or lowercase form "lower", but also lexical attributes like "is_punct" or linguistic features like "lemma" or "pos". You can find more details in spaCy’s documentation on rule-based matching, and the sketch after this list shows how to try a token pattern with spaCy directly.
- String matches: If the pattern value is a string, it will be used for exact string matching. While {"lower": "berlin"} matches “Berlin”, “berlin” and so on, "Berlin" will only match “Berlin”. The advantage of string patterns is that you don’t have to worry about the tokenization and whether the patterns describe the correct tokens. They also make it easy to re-use existing word lists and dictionaries.
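To sanity-check a token pattern before annotating with it, you can run it through spaCy’s Matcher directly. A small sketch, assuming en_core_web_sm is installed and using an invented example sentence (spaCy’s documentation writes the attribute names in uppercase):
import spacy
from spacy.matcher import Matcher

# en_core_web_sm provides the lemmas and part-of-speech tags the pattern needs
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# The same token pattern as in patterns.jsonl, written with spaCy's conventions
matcher.add("COMPANY_SALE", [[{"LEMMA": "acquire"}, {"POS": "PROPN"}]])

doc = nlp("Hooli acquires Pied Piper for $250 million")
for match_id, start, end in matcher(doc):
    print(doc[start:end])  # "acquires Pied"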
More about Prodigy pattern files
Active learning with a custom model
You don’t need to use spaCy to annotate with a model in the loop. Custom recipes are Python functions that let you script annotation workflows by returning components like the stream or an update callback to update the model in the loop. Just make sure to pick a model implementation that supports updates in small batches and that’s sensitive enough to small updates (since you want your annotations to have an effect).
Step 1: Use the model to predict and score labels (pseudocode)
import copy

class Model:
    def __call__(self, stream):
        for eg in stream:
            # Get the model's predictions for the example text
            predictions = your_model(eg["text"])
            for score, label in predictions:
                # Create one task per predicted label, scored by the model
                example = copy.deepcopy(eg)
                example["label"] = label
                yield (score, example)
On their own, the scores and examples aren’t that interesting yet – you
typically want to use the scores to only select the most relevant examples for
annotation. Prodigy provides several
sorter functions that take a stream of
(score, example)
tuples and pick examples to send out for annotation. The
textcat.teach
recipe uses the prefer_uncertain
sorter, which selects
scores closest to 0.5
.
Step 2: Sort the stream by score (pseudocode)
from prodigy.components.sorters import prefer_uncertain

model = Model()
stream = model(stream)
stream = prefer_uncertain(stream)
Step 3: Update the model with answers (pseudocode)
class Model:
    def update(self, answers):
        # Split the annotated tasks by their answer and update the model
        accepted = [eg for eg in answers if eg["answer"] == "accept"]
        rejected = [eg for eg in answers if eg["answer"] == "reject"]
        update_your_model(accepted, rejected)
By default, Prodigy streams are generators and Prodigy will only ever ask for
the next batch from the stream. So as you annotate and update the model, future
batches will receive scores from your updated model in the loop. For a
simplified example of that loop, check out the
textcat_custom_model.py
recipe script. It uses a DummyModel
that “predicts” random numbers to
illustrate the idea – you’d obviously replace that with your own implementation
using a library like scikit-learn, PyTorch or TensorFlow.
Dummy text classification model (pseudocode)
import random

class DummyModel:
    def __init__(self, labels):
        # The model can keep arbitrary state – let's use a simple random float
        # to represent the current weights
        self.weights = random.random()
        self.labels = labels

    def __call__(self, stream):
        for eg in stream:
            # Score the example with respect to the current weights
            eg["label"] = random.choice(self.labels)
            score = (random.random() + self.weights) / 2
            yield (score, eg)

    def update(self, answers):
        # Update the model weights with the new answers
        self.weights = random.random()
Finally, you can put it all together in a recipe function using the
@prodigy.recipe
decorator.
Step 4: Putting it all together in a recipe (pseudocode)
import prodigy
from prodigy.components.stream import get_stream
from prodigy.components.sorters import prefer_uncertain

@prodigy.recipe("custom-textcat")
def custom_textcat_recipe(dataset, source):
    model = Model()
    stream = get_stream(source)        # load the data
    stream = model(stream)             # call custom predict function
    stream = prefer_uncertain(stream)  # sort to prefer uncertain scores
    return {
        "dataset": dataset,            # dataset to save annotations to
        "stream": stream,              # the incoming stream of examples
        "update": model.update,        # the update callback
        "view_id": "classification",   # annotation interface to use
    }
Command-line usage
prodigy custom-textcat textcat_dataset ./your_data.jsonl -F recipe.py
Optionally, you can also add pattern matching to pre-select examples based on
the matches they contain. Prodigy’s
PatternMatcher
wraps spaCy’s
Matcher
and PhraseMatcher
so
you can use both token-based patterns and string matches. Using the
combine_models
helper, you can create
one unified predict
function that gets model predictions and matches and
interleaves them, and a unified update
callback that updates both the model
and the pattern matcher.
Step 5: Add match patterns (optional, pseudocode)
import prodigy
from prodigy.components.stream import get_stream
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.matcher import PatternMatcher
from prodigy.util import combine_models
import spacy

@prodigy.recipe("custom-textcat")
def custom_textcat_recipe(dataset, source, patterns=None):
    model = Model()
    if patterns is None:
        predict = model
        update = model.update
    else:
        nlp = spacy.blank("en")
        matcher = PatternMatcher(nlp, label_span=False, label_task=True)
        matcher = matcher.from_disk(patterns)
        # Combine the textcat model with the PatternMatcher to annotate and
        # update with both
        predict, update = combine_models(model, matcher)
    stream = get_stream(source)        # load the data
    stream = predict(stream)           # call custom predict function
    stream = prefer_uncertain(stream)  # sort to prefer uncertain scores
    return {
        "dataset": dataset,            # dataset to save annotations to
        "stream": stream,              # the incoming stream of examples
        "update": update,              # the update callback
        "view_id": "classification",   # annotation interface to use
    }
Command-line usage
prodigy custom-textcat textcat_dataset ./your_data.jsonl ./patterns.jsonl -F recipe.py
Training text classification models
Once you’ve labelled some data with Prodigy, you can start your training
experiments. If you’ve collected annotations from different sources or multiple
annotators, it’s often a good idea to use the review
recipe to resolve
any conflicts and double-check the data. It’s also recommended to create a
separate, dedicated evaluation set that you can compare different approaches
against.
- Train a spaCy pipeline using Prodigy’s CLI. The train recipe is a wrapper around spaCy’s training API and optimized for training straight from Prodigy datasets and quick experiments. It reads from a dataset, holds back data for evaluation and outputs nicely-formatted results. This workflow is the best choice if you just want to get going or quickly check if you’re “on the right track” and your model is learning things.
- Train a pipeline with spaCy directly. Once you’re getting more serious, it often makes sense to train your model directly with the library you’re using – e.g. spaCy. This gives you more control over the training process and hyperparameters, and lets you train all model components at once. The data-to-spacy command lets you convert Prodigy datasets to spaCy’s format and auto-generates a config to use with the spacy train command. It’s recommended to use the review recipe on the different annotation types first to resolve conflicts properly. To check if your data is valid and contains no issues, you can run spaCy’s debug-data command.
- Train a model with any other implementation or framework. The db-out command exports annotations in a straightforward JSONL format. If you’ve collected binary annotations, each example will have a "label" and an "answer" that’s either "accept", "reject" or "ignore" (see here for the format). If you’ve collected multiple choice annotations, each example will have an "accept" key mapped to a list of selected label IDs. This should make it easy to convert the data and use it to train any model – see the sketch after this list for an example.
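As a concrete example, here’s a minimal sketch for turning exported binary annotations into (text, label) pairs for another library, assuming a single INSULT label and a file exported with db-out:
import json

texts, labels = [], []
with open("textcat_insults.jsonl", encoding="utf8") as f:
    for line in f:
        eg = json.loads(line)
        if eg.get("answer") == "ignore":
            continue  # skipped during annotation – leave it out entirely
        # A rejected binary suggestion is a negative example for the label
        texts.append(eg["text"])
        labels.append(1 if eg["answer"] == "accept" else 0)

# texts and labels can now be fed into scikit-learn, PyTorch, TensorFlow etc.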