Usage

Prodigy Plugins

Some Prodigy recipes require third-party libraries in order to work. To keep Prodigy lightweight, we’ve separated some of these recipes out into their own packages so that you can install them as plugins. These plugins always target the most recent version of Prodigy with regard to compatibility.

This section of the docs showcases these plugins. Note that you can also explore the recipes on Github to serve as a source of inspiration for further customisation.

• 🤗 Prodigy HF: Recipes that interact with the Hugging Face stack. Contains hf.train.ner, hf.correct.ner, hf.upload and more. (Github repo)
• 📄 Prodigy PDF: Recipes that help with the annotation of PDF files. Contains pdf.image.manual and pdf.ocr.correct. (Github repo)
• 🤫 Prodigy Whisper: Recipes that leverage OpenAI’s Whisper model for audio transcription. Contains whisper.audio.transcribe. (Github repo)
• 🍰 Prodigy Segment: Recipes that leverage Meta’s Segment Anything model for image segmentation. Contains segment.image.manual and more. (Github repo)
• 🏘 Prodigy ANN: Recipes that use approximate nearest neighbor techniques to help you annotate. Contains ann.text.index, ann.image.index, ann.text.fetch and more. (Github repo)
• 🌕 Prodigy Lunr: Recipes that use old-school string matching techniques to help you annotate. Contains lunr.text.index, lunr.text.fetch and more. (Github repo)
• 🦆 sense2vec: Recipes for fetching terms using phrase embeddings trained on Reddit. Contains sense2vec.teach, sense2vec.to-patterns and more. (Github repo)

🤗 Prodigy-HF

This plugin contains recipes that interact with the Hugging Face stack. Some recipes let you train transformer models directly on top of your annotations, while others let you upload artifacts to the Hugging Face Hub.

To use these recipes, you’ll first need to install the plugin.

Install prodigy-hf:
pip install "prodigy-hf @ git+https://github.com/explosion/prodigy-hf"

Once it is installed you can explore some of the new recipes.

Training Hugging Face models

The first recipe from this plugin that you may enjoy is the one that trains custom NER models.


prodigy hf.train.ner fashion,eval:fashion-eval hf-model-dir --epochs 10 --model-name distilbert-base-uncased

Once the model is done training, you can inspect the hf-model-dir folder to find the trained model and its checkpoints.
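If you want to sanity-check the trained model outside of Prodigy, you can load one of the checkpoints with the transformers library. A minimal sketch, assuming a checkpoint folder like hf-model-dir/checkpoint-20:

from transformers import pipeline

# Load one of the checkpoints written by hf.train.ner (path from the example above)
ner = pipeline(
    "token-classification",
    model="hf-model-dir/checkpoint-20",
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

preds = ner("The new denim jacket from Gucci sold out in a day.")
print(preds)  # list of dicts with entity_group, score, word, start and end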

You can also choose to re-use this trained model to help you annotate data. The plugin features a hf.ner.correct recipe that works similarly to ner.correct, except that here we get to use a Hugging Face model. This means you can also use models from the Hugging Face Hub. Internally, the recipe maps the predictions from the transformer model onto spaCy tokens.


prodigy hf.ner.correct fashion hf-model-dir/checkpoint-20 examples.jsonl --lang en

Note that this plugin also offers variants of these recipes for text classification. Check out the API docs for hf.train.textcat and hf.correct.textcat for more details.

Interacting with Hugging Face Hub

Alternatively, you can also use this plugin to upload your annotated datasets to the Hugging Face Hub.


prodigy hf.upload fashion,eval:fashion-eval username/reponame

✔ Upload completed! You should be able to view repo at
https://huggingface.co/datasets/username/reponame.

Internally this recipe will validate the dataset for consistency and will attempt to anonymise the annotators before uploading. You can turn this behaviour off with flags, and you can also upload the dataset as a private repository so that it doesn’t appear publicly.
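Once uploaded, the dataset can be pulled back down with the datasets library. A minimal sketch, using the placeholder repo name from the example above:

from datasets import load_dataset

# "username/reponame" is the placeholder repo id used in the example above
ds = load_dataset("username/reponame")
print(ds)  # DatasetDict; split names depend on how the data was uploaded
first_split = list(ds.keys())[0]
print(ds[first_split][0])  # one annotated example as a dictionary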

API

hf.train.ner command

  • Interface: terminal only
  • Use case: train huggingface models directly

Trains a Hugging Face model for NER directly on your annotated datasets.


prodigy hf.train.ner datasets out_dir --model-name --batch-size --eval-split --learning-rate --verbose
Argument | Type | Description | Default
datasets | positional | One or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation.
out_dir | positional | Folder to store trained model and checkpoints.
--model-name, -mn | option | Pick the model you’d like to use as a starting point for training. | "distilbert-base-uncased"
--batch-size, -bs | option | Batch size for training. | 8
--eval-split, -es | option | If no evaluation sets are provided for a component, this setting can be used to split off a percentage of the training examples for evaluation. If no evaluation split is given, the train set performance will be reported.
--learning-rate, -lr | option | Learning rate. | 2e-5
--verbose, -v | flag | Output all the logs/warnings from the Hugging Face libraries. | False

hf.correct.ner manual

  • Interface: ner_manual
  • Use case: Annotate NER with a model in the loop

Annotate NER data with a transformer model in the loop.


prodigy hf.correct.ner dataset model_name source --lang
Argument | Type | Description | Default
dataset | positional | Dataset to save annotations into.
model_name | positional | Path to transformer model. Can also point to a model on the Hugging Face Hub.
source | positional | Source file to annotate.
--lang, -l | option | Language to assume for the spaCy tokeniser. | "en"

hf.train.textcat command

  • Interface: terminal only
  • Use case: train huggingface models directly

Trains a Hugging Face model for text classification directly on your annotated datasets.


prodigy hf.train.textcat datasets out_dir --model-name --batch-size --eval-split --learning-rate --verbose
Argument | Type | Description | Default
datasets | positional | One or more (comma-separated) datasets for the text classifier. Use the eval: prefix for evaluation.
out_dir | positional | Folder to store trained model and checkpoints.
--model-name, -mn | option | The name of the model to be used as a starting point for training. | "distilbert-base-uncased"
--batch-size, -bs | option | Batch size for training. | 8
--eval-split, -es | option | If no evaluation sets are provided for a component, this setting can be used to split off a percentage of the training examples for evaluation. If no evaluation split is given, the train set performance will be reported.
--learning-rate, -lr | option | Learning rate. | 2e-5
--verbose, -v | flag | Output all the logs/warnings from the Hugging Face libraries. | False

hf.correct.textcat manual

  • Interface: choice
  • Use case: Annotate textcat data with a model in the loop

Annotate data for text classification with a transformer model in the loop.


prodigy hf.correct.textcat dataset model_name source
Argument | Type | Description | Default
dataset | positional | Dataset to save annotations into.
model_name | positional | Path to transformer model. Can also point to a model on the Hugging Face Hub.
source | positional | Source file to annotate.

hf.upload command

  • Interface: terminal only
  • Use case: upload annotations to the Hugging Face Hub

Upload your annotations to Hugging Face Hub.

You can use the same command multiple times to upload the most recent version of your data to the hub.


prodigy hf.upload datasets repo_id --keep-annotator-ids --patch_values --private
Argument | Type | Description | Default
datasets | positional | One or more (comma-separated) datasets to upload. Use the name: prefix to add keys to the dataset.
repo_id | positional | Name of the repo to upload to. Should be formatted as username/reponame.
--keep-annotator-ids, -k | flag | Don’t anonymise the annotators. | False
--patch_values, -nv | flag | If keys are missing between datasets, patch them with None values. | False
--private, -p | flag | Upload dataset as a private repository. | False

Prodigy-PDF

This plugin contains recipes that help you annotate PDF files by turning them into images first. This way, they can be annotated using the familiar image_manual interface. It also contains recipes for OCR. If you’re interested in a quick overview, you may appreciate this Youtube explainer.

To use these recipes, you’ll first need to install the plugin.

Install prodigy-pdf:
pip install "prodigy-pdf @ git+https://github.com/explosion/prodigy-pdf"

Then once it is installed, you can start annotating PDFs as images via pdf.image.manual.


prodigy pdf.image.manual papers path/pdfs --labels figure,footnote,paragraph


If you like, you can re-use the PDF annotations with the pdf.ocr.correct recipe to apply OCR to the annotated segments. This recipe uses pytesseract under the hood to generate suggestions that you can then correct.


prodigy pdf.ocr.correct ocr_images papers path/pdfs --labels paragraph --fold-dashes

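To get a feel for what happens under the hood, here is a rough sketch of OCR on a cropped page region using pdf2image and pytesseract; the file name and crop box are made-up examples, not the plugin’s actual code:

from pdf2image import convert_from_path
import pytesseract

# Render the first page of an example PDF as a PIL image (150 dpi is arbitrary)
page = convert_from_path("path/pdfs/example.pdf", dpi=150)[0]

# Crop a region roughly matching an annotated "paragraph" box
# (left, upper, right, lower) in pixels; the values are placeholders
segment = page.crop((100, 200, 900, 400))

# Run Tesseract on the cropped segment to get a text suggestion
print(pytesseract.image_to_string(segment))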

API

pdf.image.manual manual

Add annotations to a PDF by first converting it to an image.

In order for this recipe to work, you may need to install system dependencies for tesseract. These can usually be installed directly via:

# for mac
brew install tesseract

# for ubuntu
sudo apt install tesseract-ocr

prodigy pdf.image.manual dataset pdf_folder --labels --remove-base64
Argument | Type | Description | Default
dataset | str | Prodigy dataset to save annotations to.
pdf_folder | Path | Folder that contains your PDF files.
--labels, -l | str | Comma-delimited labels to attach.
--remove-base64, -R | bool | Don’t save the base64 images of the PDFs. | False

pdf.ocr.correct manual

  • Interface: text_input
  • Use case: Add OCR annotations to PDF segments.

Applies OCR to annotated segments from pdf.image.manual and gives a textbox for corrections.


prodigy pdf.ocr.correct dataset source --labels --scale --fold-dashes --remove-base64 --autofocus
Argument | Type | Description | Default
dataset | str | Prodigy dataset to save annotations to.
source | str | Source with PDF annotations.
--labels, -l | str | Labels to consider.
--scale, -s | int | Zoom scale. Increase above 3 to upscale the image for OCR. | 3
--remove-base64, -R | bool | Don’t save the base64 images of the PDFs. | False
--fold-dashes, -f | bool | Remove dashes at the end of a text line and fold them with the next term. | False
--autofocus, -af | bool | Autofocus on the transcript UI. | False

🤫 Prodigy-Whisper

OpenAI released an open model for audio transcription called Whisper. The model can be downloaded and run locally, it supports multiple languages, and you can pick from a selection of model sizes. It isn’t perfect, but when you’re transcribing audio it really helps to have such a model provide a starting point. The goal of this plugin is to help you get started with it right away.

To use this plugin, you’ll need to install it first.

Install prodigy-whisper:
pip install "prodigy-whisper @ git+https://github.com/explosion/prodigy-whisper"

In order to use the plugin you’ll also need to have ffmpeg installed. Most package managers have it available, so you should be able to use one of the following commands.

Install ffmpeg:
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

Once the plugin is installed you can use the whisper.audio.transcribe recipe. It is very similar to the audio.transcribe recipe that Prodigy provides, but it uses Whisper to provide an initial transcription.

Example

prodigy whisper.audio.transcribe transcripts ./recordings --model base

In its base form you can already see that Whisper does a pretty good job at transcription. But it can be easier to correct short pieces of audio instead of one long recording. This is where Whisper can help out as well: it is able to segment a long audio clip into shorter segments, and each of these segments can then be annotated in Prodigy.

To use this feature, you can add the --segment flag to the recipe call.

Example

prodigy whisper.audio.transcribe transcripts ./recordings --model base --segment

Now you can go through the segments one by one, and each segment will have metadata attached so that you can link it back to the timestamps in the original file.

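The segment metadata comes from Whisper’s Python API, which reports start and end timestamps per segment. A minimal sketch of that API, assuming a local file recordings/interview.mp3 (the path is a placeholder):

import whisper

model = whisper.load_model("base")
result = model.transcribe("recordings/interview.mp3")

print(result["text"])  # the full transcript
for segment in result["segments"]:
    # every segment carries timestamps that link back to the original audio
    print(segment["start"], segment["end"], segment["text"])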

API

whisper.audio.transcribe manual

  • Interface: blocks/ audio/ text_input
  • Saves: annotations to the database
  • Use case: Manually create transcriptions for audio with a Whisper model in the loop

Manually transcribe audio files by typing the transcript into a text field with the help of Whisper. The API is built on top of audio.transcribe and allows you to configure everything that the original recipe can. The only additional input is that this recipe also lets you select a Whisper model. The recipe uses the "base" model by default, but you should be able to pick any of the models shown here.


prodigy whisper.audio.transcribe dataset source --model --loader --autoplay --keep-base64 --fetch-media --playpause-key --text-rows --field-id --exclude
Argument | Type | Description | Default
dataset | str | Prodigy dataset to save annotations to.
source | str | Path to a directory containing audio files or pre-formatted JSONL file if --loader jsonl is set.
--model, -m | str | Name of OpenAI Whisper model to use. | base
--loader, -lo | str | Optional ID of source loader, e.g. audio or video. | audio
--autoplay, -A | bool | Autoplay the audio when a new task loads. | False
--keep-base64, -B | bool | If audio loader is used: don’t remove the base64-encoded audio data from the task before it’s saved to the database. | False
--fetch-media, -FM | bool | Convert local paths and URLs to base64. Can be enabled if you’re annotating a JSONL file with paths or for re-annotating an existing dataset. | False
--playpause-key, -pk | str | Alternative keyboard shortcuts to toggle play/pause so it doesn’t conflict with text input field. | "command+enter, option+enter, ctrl+enter"
--text-rows, -tr | int | Height of the text input field, in rows. | 6
--field-id, -fi | str | Add the transcript text to the data using this key, e.g. "transcript": "Text here". | "transcript"
--exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None

Prodigy-Segment

Sometimes you’re interested in selecting pixels from an image, as opposed to merely selecting a bounding box. Selecting the right pixels can be tedious work so you may want to use a model in the loop to help you. A good choice for such a model is Meta’s Segment Anything model, which we’ve integrated into Prodigy via the prodigy-segment plugin.

This model is able to take bounding box annotations from Prodigy to construct a pixel segmentation map under the hood. From the UI, that might look like this:

Using Prodigy-Segment

For a quick overview of the features, you may also enjoy this Youtube tutorial.

Before you can use the recipes, make sure you’ve downloaded the appropriate model checkpoint. You can check the available models here, but this tutorial will assume the “default” model type. The weights for this model can be downloaded via:

Download the weights for the `default` model-type:
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

Once the model is downloaded you can get started by running the segment.image.manual recipe.


prodigy segment.image.manual segment-cat-dog images sam_vit_h_4b8939.pth --model-type default --label cat,dog
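Under the hood, the recipe turns your bounding boxes into box prompts for Segment Anything, which returns a pixel mask per box. The sketch below shows that interaction with Meta’s segment_anything package directly; the image path and box coordinates are made-up examples, not the plugin’s actual code:

import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load the downloaded checkpoint with the "default" model type
sam = sam_model_registry["default"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Set the image once, then prompt with a bounding box in (x0, y0, x1, y1) format
image = np.array(Image.open("images/cat.jpg").convert("RGB"))
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    box=np.array([50, 40, 420, 380]),  # placeholder box, e.g. drawn around a cat
    multimask_output=False,
)
print(masks.shape)  # (1, height, width) boolean pixel map for the box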

When you run this recipe, you may notice that it’s fairly slow. This isn’t a big surprise given the size of the model, but it can be a serious burden, especially if your machine does not have a GPU. For a better experience, you may want to precompute the image features ahead of annotation time and cache the results to disk. Precomputing all the images may take a while, but once it’s done the annotation experience feels seamless and realtime again.

To precompute a cache, you can use the segment.fill-cache recipe.


prodigy segment.fill-cache images sam_vit_h_4b8939.pth --model-type default --cache segment-anything-cache

This will store all the features in a folder (configurable via the --cache flag) which the segment.image.manual recipe can immediately pick up.


prodigy segment.image.manual segment-cat-dog images sam_vit_h_4b8939.pth --model-type default --label cat,dog --cache segment-anything-cache

The pixel maps, once annotated, are stored under the spans key in your examples. You can explore these maps one by one in a Jupyter notebook using the script shown below.

Script to loop over all annotated examples:
import base64
from io import BytesIO
from PIL import Image

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset_examples("<dataset-name>")

def mask_to_pil(mask_str):
    indicator = "base64,"
    mask_str = mask_str[mask_str.find(indicator) + len(indicator):]
    bytes = BytesIO(base64.b64decode(mask_str))
    return Image.open(bytes)

# Loop over all the examples and display them.
for ex in examples:
    print(ex['path'])
    for span in ex.get("spans", []):
        # Use builtin `display` to view pixel map
        display(mask_to_pil(span['mask']))

From here you can use the Pillow library to either store these pixel maps in the format your pipeline requires, or stream them directly into a learning algorithm from Python.
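For example, building on the mask_to_pil helper and the examples list from the script above, you could convert each mask to a NumPy array or write it out as a PNG file (the output folder name is a placeholder):

import os
import numpy as np

os.makedirs("masks", exist_ok=True)
for ex_idx, ex in enumerate(examples):
    for span_idx, span in enumerate(ex.get("spans", [])):
        pil_mask = mask_to_pil(span["mask"])
        array = np.array(pil_mask)  # raw pixel values, if your pipeline needs them
        pil_mask.save(f"masks/example{ex_idx}_span{span_idx}.png")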

API

segment.image.manual manual

  • Interface: blocks/ image_manual
  • Saves: annotations to the database
  • Use case: Annotate pixels by drawing bounding boxes

Manually annotate pixels in images with Meta’s Segment Anything model under the hood.


prodigy segment.image.manual dataset source checkpoint --label --loader --exclude --width --darken --no-fetch --remove-base64 --model-type --cache
Argument | Type | Description | Default
dataset | str | Prodigy dataset to save annotations to.
source | str | Path to a directory containing image files or pre-formatted JSONL file if --loader jsonl is set.
checkpoint | Path | Path to a model checkpoint.
--label, -l | str / Path | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line.
--loader, -lo | str | Optional ID of source loader. | images
--exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None
--width, -w | int | Width of card and maximum image width in pixels. | 675
--darken, -D | bool | Darken image to make boxes stand out more. | False
--no-fetch, -NF | bool | Don’t fetch images as base64. Ideally requires a JSONL file as input, with --loader jsonl set and all images available as URLs. | False
--remove-base64, -R | bool | Remove base64-encoded image data before storing the example in the database and only keep the reference to the local file path. Caution: If enabled, make sure to keep the original files! | False
--model-type, -mt | str | Type of model to use. | default
--cache, -c | Path | Path to feature cache to speed up inference. | segment-anything-cache

segment.fill-cache command

  • Interface: terminal only
  • Saves: inference features into disk cache
  • Use case: Prepare images for segmented annotation

Prepares a local disk cache to speed up inference for segment.image.manual. This can cause a huge speedup if you’re running on a non-GPU device.


prodigy segment.fill-cache source checkpoint --loader --model-type --cache
Argument | Type | Description | Default
source | str | Path to a directory containing image files or pre-formatted JSONL file if --loader jsonl is set.
checkpoint | Path | Path to a model checkpoint.
--loader, -lo | str | Optional ID of source loader. | images
--model-type, -mt | str | Type of model to use. | default
--cache, -c | Path | Path to feature cache to speed up inference. | segment-anything-cache

Prodigy-ANN

Sometimes you may want to query your examples to find a relevant subset for annotation. A modern way of doing this is to represent texts as numeric vectors and use approximate nearest neighbor (ANN) techniques to fetch relevant examples. The goal is to spend more time looking at the examples that matter, such as examples similar to items that the model gets wrong. Curating these examples first can be a pragmatic way to steer the model in the right direction.

The general approach for the ANN recipes.

If you’re interested in a quick demo of Prodigy-ANN applied to a text dataset, you may appreciate this Prodigy short on Youtube.

To use this plugin, you’ll need to install it first.

Install prodigy-ann:
pip install "prodigy-ann @ git+https://github.com/explosion/prodigy-ann"

As a first step, you’ll need to generate an index with vector representations of your text. To encode the text, this library uses sentence-transformers, and it uses hnswlib as the index for these vectors.

To index your documents, you can run the ann.text.index recipe.


prodigy ann.text.index examples.jsonl examples.index

indexing: 100%|███████████████████████████| 2210/2210 [00:09<00:00, 243.64it/s]

Once the data is indexed you can use text queries to find and curate interesting subsets. A general method to prepare these subsets is to use ann.text.fetch. This will fetch the examples whose vectors are closest to the query in vector space and save them to disk. From there you can use any Prodigy recipe you like.


prodigy ann.text.fetch examples.jsonl examples.index subset.jsonl --query "this is an outrage!"
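Conceptually, the index and fetch steps boil down to something like the sketch below; the model name and example texts are placeholders, and this is not the plugin’s actual code:

import hnswlib
from sentence_transformers import SentenceTransformer

texts = [
    "this is an outrage!",
    "thanks for the quick delivery",
    "I want my money back",
]

# Encode the texts into dense vectors
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(texts)

# Build an approximate nearest neighbour index over the vectors
index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(texts))
index.add_items(vectors, list(range(len(texts))))

# Query with a new phrase and fetch the closest examples
labels, distances = index.knn_query(model.encode(["angry customer"]), k=2)
print([texts[i] for i in labels[0]])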

More interfaces

As a convenience this plugin also provides the textcat.ann.manual, ner.ann.manual and spans.ann.manual recipes so that you can query and annotate directly. These recipes take the same arguments as their native Prodigy counterparts textcat.manual, ner.manual and spans.manual, but add a --query parameter so that you can pass your query.

Interactive Queries

Sometimes you may want to update the stream while you’re annotating. You can do that without restarting the server by using the --allow-reset flag when you’re starting the textcat.ann.manual, ner.ann.manual or spans.ann.manual recipes.


prodigy textcat.ann.manual examples.jsonl examples.index --query "new academic dataset" --allow-reset

Here’s an example of what the experience might look like from the UI.

Retrieving Images

You can use these embedding retrieval techniques for images too. Models like CLIP embed images and text in the same space, which means that you can query images using text.

The approach for images is very similar to the approach for text. To get started, you’ll first want to run an indexing recipe over a folder of images via the ann.image.index recipe.


prodigy ann.image.index path/to/image_folder image.index

indexing: 100%|███████████████████████████| 210/210 [01:49<00:00]
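The shared image/text embedding space is what makes text queries over images possible. A small sketch using sentence-transformers’ CLIP model; the image path and query are placeholders, not the plugin’s actual code:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

clip = SentenceTransformer("clip-ViT-B-32")

# Images and text end up in the same vector space
image_embedding = clip.encode(Image.open("path/to/image_folder/desk.jpg"))
text_embedding = clip.encode("laptops")

# Higher cosine similarity means the image matches the query better
print(util.cos_sim(image_embedding, text_embedding))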

Once the index is built, you can query it to prepare a .jsonl file for later re-use via the ann.image.fetch recipe.


prodigy ann.image.fetch path/to/image_folder examples.index out.jsonl --query "laptops" --remove-base64 --n 100

Alternatively, the plugin also provides a wrapper around the familiar image.manual recipe. This will retrieve the images before passing them on to the image_manual interface. This interface also allows you to reset the stream via the --allow-reset flag.


prodigy image.ann.manual annotated_laptops path/to/image_folder examples.index --query "laptops" --remove-base64 --n 100 --labels laptop,phone --allow-reset

Here’s an example of what the experience might look like from the UI.

API

ann.text.index command

  • Interface: terminal only
  • Use case: Prepare an HNSWlib index.

Builds an HNSWlib index on example text data.


prodigy ann.text.index source examples.index
Argument | Type | Description | Default
source | Path | Path to source to index.
index_path | Path | Path of trained index.

ann.text.fetch command

  • Interface: terminal only
  • Use case: Query to get a subset of interest.

Fetch a relevant subset using a HNSWlib index.


prodigy ann.text.fetch source index_path out_path --query --n
Argument | Type | Description | Default
source | Path | Path to source to index.
index_path | Path | Path of trained index.
out_path | Path | Path to stored subset of interest.
--query, -q | str | Query to encode and pass to index.
--n, -n | int | Number of results to return from index. | 200

ann.image.index command

  • Interface: terminal only
  • Use case: Prepare an HNSWlib index.

Builds an HNSWlib index on example image data.


prodigy ann.image.index source examples.index
Argument | Type | Description | Default
source | Path | Path to source folder of images to index.
index_path | Path | Path of trained index.

ann.image.fetch command

  • Interface: terminal only
  • Use case: Query to get a subset of interest.

Fetch a relevant subset of images using a HNSWlib index.


prodigy ann.image.fetch source index_path out_path --query --n --remove-base64
Argument | Type | Description | Default
source | Path | Path to source folder of images for index.
index_path | Path | Path of trained index.
out_path | Path | Path to stored subset of interest.
--query, -q | str | Query to encode and pass to index.
--n, -n | int | Number of items to retrieve. | 200
--remove-base64, -R | bool | Don’t save the base64 images on disk. | False

Prodigy-Lunr

Instead of using semantic vectors with approximate nearest neighbors to find relevant subsets, you can also resort to “regular” search techniques. To accommodate these, we’ve added support for recipes that use lunr. These recipes are very similar to their ann.* counterparts, but rely on string matching techniques to retrieve relevant examples.

The general approach for the Lunr recipes.


To use this plugin, you’ll need to install it first.

Install prodigy-lunr:
pip install "prodigy-lunr @ git+https://github.com/explosion/prodigy-lunr"

To index your documents, you can run the lunr.text.index recipe. This will generate an index and serialize it to disk as a gzipped JSON file.


prodigy lunr.text.index examples.jsonl index.gz.json

indexing: 100%|███████████████████████████| 2210/2210 [00:09<00:00, 243.64it/s]

Once the data is indexed you can use text queries to find and curate interesting subsets. A general method to prepare these subsets is to use lunr.text.fetch. This will fetch the best-matching examples and save them to disk. From there you can use any Prodigy recipe you like.


prodigy lunr.text.fetch examples.jsonl index.gz.json subset.jsonl --query "outrage better service unhappy"
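The underlying search behaves like a classic keyword index. Here is a small sketch using the lunr Python package; the example documents are made up, and this is not necessarily how the plugin builds its index:

from lunr import lunr

docs = [
    {"id": "1", "text": "the support team never replied and I am unhappy"},
    {"id": "2", "text": "great service, quick delivery"},
]

# Build a lunr index over the "text" field
index = lunr(ref="id", fields=("text",), documents=docs)

# Plain string matching; results come back ranked by score
print(index.search("unhappy service"))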

More interfaces

As a convenience this plugin also provides the textcat.lunr.manual, ner.lunr.manual and spans.lunr.manual recipes so that you can query and annotate directly. These recipes take the same arguments as their native Prodigy counterparts textcat.manual, ner.manual and spans.manual, but add a --query parameter so that you can pass your query.

Interactive Queries

Sometimes you may want to update the stream while you’re annotating. You can do that without restarting the server by using the --allow-reset flag when you’re starting the textcat.lunr.manual, ner.lunr.manual or spans.lunr.manual recipes.


prodigy textcat.lunr.manual examples.jsonl index.gz.json --query "outrage better service unhappy" --allow-reset

API

lunr.text.index command

  • Interface: terminal only
  • Use case: Prepare a lunr index.

Builds a lunr index on example text data.


prodigy lunr.text.index source examples.index
Argument | Type | Description | Default
source | Path | Path to source to index.
index_path | Path | Path to stored lunr index.

lunr.text.fetch command

  • Interface: terminal only
  • Use case: Query to get a subset of interest.

Fetch a relevant subset using a lunr index.


prodigy lunr.text.fetch source index_path out_path --query
Argument | Type | Description | Default
source | Path | Path to source to index.
index_path | Path | Path to stored lunr index.
out_path | Path | Path to stored subset of interest.
--query, -q | str | Query to pass to the index.

Sense2vec

sense2vec (Trask et al., 2015) is a nice twist on word2vec that lets you learn more interesting and detailed word vectors. This library is a simple Python implementation for loading, querying and training sense2vec models. To explore the semantic similarities across all Reddit comments of 2015 and 2019, see the interactive demo. There are also more details in this blog post.

To see a demo on how to use this tool with Prodigy, you may enjoy this Youtube video where we use it to detect video games in text.

To use sense2vec, you’ll first need to install it.

python -m pip install sense2vec

To use the pre-trained vectors in Prodigy you’ll need to download the archive(s) and extract them. Large files have been split into multi-part downloads. All the available versions can be found below.

Vectors | Size | Description | Download Link (zipped)
s2v_reddit_2019_lg | 4 GB | Reddit comments 2019 (01-07) | part 1, part 2, part 3
s2v_reddit_2015_md | 573 MB | Reddit comments 2015 | part 1

To merge the multi-part archives, you can run the following:

cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz

Once downloaded (and merged) you should be able to unarchive via:

tar -xvf s2v_reddit_2019_lg.tar.gz

Now that the archive is extracted you can point the sense2vec.teach recipe to it. This will allow Prodigy to suggest similar terms based on the most similar phrases from sense2vec, and the suggestions will be adjusted as you annotate and accept similar phrases. For each seed term, the best matching sense according to the sense2vec vectors will be used.


prodigy sense2vec.teach video_game_yesno /path/to/s2v_reddit_2019_lg --seeds "mass effect,knights of the old republic,halo 3" --resume

Suggestions from Sense2Vec

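If you want to inspect suggestions outside of Prodigy, the sense2vec library exposes the same kind of similarity queries. A minimal sketch, assuming the extracted s2v_reddit_2019_lg vectors from above:

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2019_lg")

# Keys pair a phrase with a sense, e.g. "mass_effect|NOUN"
query = s2v.get_best_sense("mass effect")  # may be None if the phrase is unknown
print(s2v.most_similar(query, n=5))  # similar phrases with similarity scores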

After curating the generated examples, you can export the collected phrases as pattern files with the sense2vec.to-patterns recipe. These files can be used with spaCy’s EntityRuler or with recipes like ner.manual.


prodigy sense2vec.to-patterns video_game_yesno blank:en VIDEO_GAME patterns.jsonl

This will generate a patterns.jsonl file locally, with contents that may look like this:

{"label": "VIDEO_GAME", "pattern": [{"LOWER": "mass"}, {"LOWER": "effect"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "knights"}, {"LOWER": "of"}, {"LOWER": "the"}, {"LOWER": "old"}, {"LOWER": "republic"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "halo"}, {"LOWER": "3"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "jade"}, {"LOWER": "empire"}]}

More recipes

Sense2vec also provides the sense2vec.eval, sense2vec.eval-most-similar and sense2vec.eval-ab recipes. These may be useful if you want to evaluate a sense2vec model. For more information, check the README in the Github repository.

sense2vec.teach binary

  • Interface: html
  • Saves: annotations to the database
  • Use case: curate terminology phrases via sense2vec

Bootstrap a terminology list using sense2vec.


prodigy sense2vec.teach dataset vectors_path --seeds --threshold --n-similar --batch-size --case-sensitive --resume
Argument | Type | Description | Default
dataset | positional | Dataset to save annotations to.
vectors_path | positional | Path to pretrained sense2vec vectors.
--seeds, -s | option | One or more comma-separated seed phrases.
--threshold, -t | option | Similarity threshold. | 0.85
--n-similar, -n | option | Number of similar items to get at once. | 100
--batch-size, -b | option | Batch size for submitting annotations. | 5
--case-sensitive, -CS | flag | Show the same terms with different casing. | False
--resume, -R | flag | Resume from an existing phrases dataset. | False

sense2vec.to-patterns command

  • Interface: terminal only
  • Use case: generate pattern files

Convert a dataset of phrases collected with sense2vec.teach to token-based match patterns.


prodigy sense2vec.to-patterns dataset spacy_model label --output-file --case-sensitive --dry
Argument | Type | Description | Default
dataset | positional | Phrase dataset to convert.
spacy_model | positional | spaCy model for tokenization.
label | positional | Label to apply to all patterns.
--output-file, -o | option | Optional output file. Defaults to stdout.
--case-sensitive, -CS | flag | Make patterns case-sensitive. | False
--dry, -D | flag | Perform a dry run and don’t output anything. | False