Prodigy Plugins
Some Prodigy recipes require a third-party library in order to work. To keep Prodigy lightweight, we've separated these recipes into their own packages so that you can install them as plugins. These plugins always target the most recent version of Prodigy for compatibility.
This section of the docs showcases such plugins. Note that you can also explore these recipes on GitHub as a source of inspiration for further customisation.
Plugin | Description | Source |
---|---|---|
🤗 Prodigy HF | Recipes that interact with the Hugging Face stack. Contains hf.train.ner , hf.correct.ner , hf.upload and more. | Github repo |
📄 Prodigy PDF | Recipes that help with the annotation of PDF files. Contains pdf.image.manual and pdf.ocr.correct . | Github repo |
🤫 Prodigy Whisper | Recipes that leverage OpenAI's Whisper model for audio transcription. Contains whisper.audio.transcribe . | Github repo |
🍰 Prodigy Segment | Recipes that leverage Meta's Segment Anything model for image segmentation. Contains segment.image.manual and more. | Github repo |
🏘 Prodigy ANN | Recipes that allow you to use approximate nearest neighbor techniques to help you annotate. Contains ann.text.index , ann.image.index , ann.text.fetch and more. | Github repo |
🌕 Prodigy Lunr | Recipes that allow you to use old-school string matching techniques to help you annotate. Contains lunr.text.index , lunr.text.fetch and more. | Github repo |
🦆 sense2vec | Recipes that allow you to fetch terms using phrase embeddings trained on Reddit. Contains sense2vec.teach , sense2vec.to-patterns and more. | Github repo |
🤗 Prodigy-HF
This plugin contains recipes that interact with the Hugging Face stack. Some recipes let you train transformer models directly on top of your annotations, while others let you upload artifacts, such as annotated datasets, to the Hugging Face Hub.
To use these recipes, you’ll first need to install the plugin.
Install prodigy-hf
pip install "prodigy-hf @ git+https://github.com/explosion/prodigy-hf"
Once it is installed you can explore some of the new recipes.
Training Hugging Face models
The first recipe that you may enjoy from this plugin is the recipe to train custom NER models.
prodigy hf.train.ner fashion,eval:fashion-eval hf-model-dir --epochs 10 --model-name distilbert-base-uncased
Once the model is done training you’ll be able to inspect the hf-model-dir
folder to find all the trained state.
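If you'd like to sanity-check the trained weights outside of Prodigy, you can load that folder with the transformers library. This is a minimal sketch, assuming the output folder contains both the saved weights and the tokenizer; the example sentence is made up.
# Sketch: load the folder produced by hf.train.ner with the transformers pipeline.
# The path and the example sentence are illustrative.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="hf-model-dir",               # or a specific checkpoint subfolder
    aggregation_strategy="simple",      # merge word pieces into full entity spans
)

# Returns a list of dicts with entity_group, score, word and character offsets.
print(ner("The new denim jacket sold out within hours."))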
You can also choose to re-use this trained model to help you annotate data. The
plugin features a hf.correct.ner
recipe that works similarly to
ner.correct
except here we get to use a Hugging Face model. This means
that you can also use models from the
Hugging Face Hub.
This recipe will internally map the predictions from the transformer model to
spaCy tokens.
prodigy hf.correct.ner fashion hf-model-dir/checkpoint-20 examples.jsonl --lang en
Note that this plugin also offers variants of these recipes for text
classification. Check out the API docs for hf.train.textcat
and
hf.correct.textcat
for more details.
Interacting with Hugging Face Hub
Alternatively, you can also use this plugin to upload your annotated datasets to the Hugging Face Hub.
prodigy hf.upload fashion,eval:fashion-eval username/reponame
✔ Upload completed! You should be able to view repo at
https://huggingface.co/datasets/username/reponame.
Internally, this recipe will validate the dataset for consistency and will attempt to anonymise the annotators before uploading. You can turn this behavior off via flags, and you can also upload the dataset as a private repository so that it doesn't appear publicly.
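Once the upload has finished, you can pull the annotations back down with the datasets library. This is a small sketch; the repo name is the placeholder from the example above, and the split names depend on how you uploaded your datasets.
# Sketch: load the uploaded annotations with the Hugging Face datasets library.
# "username/reponame" is the placeholder repo from the example above.
from datasets import load_dataset

ds = load_dataset("username/reponame")
print(ds)              # shows the available splits and their sizes
print(ds["train"][0])  # a single annotated example as a plain dict (split name may differ)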
API
hf.train.ner
command
Trains a Hugging Face model for NER directly on your annotated datasets.
prodigy hf.train.ner datasets out_dir --model-name --batch-size --eval-split --learning-rate --verbose
Argument | Type | Description | Default |
---|---|---|---|
datasets | positional | One or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation | |
out_dir | positional | Folder to store trained model and checkpoints. | |
--model-name , -mn | option | Pick the model you’d like to use as a starting point for training. | “distilbert-base-uncased” |
--batch-size , -bs | option | Batch size for training. | 8 |
--eval-split , -es | option | If no evaluation sets are provided for a component, this setting can be used to split off a percentage of the training examples for evaluation. If no evaluation splits are given, the train set performance will be reported. | |
--learning-rate , -lr | option | Learning rate. | 2e-5 |
--verbose , -v | flag | Output all the logs/warnings from Huggingface libraries. | False |
hf.correct.ner
manual
Annotate NER data with a transformer model in the loop.
prodigy hf.correct.ner dataset model_name source --lang
Argument | Type | Description | Default |
---|---|---|---|
dataset | positional | Dataset to save annotations into | |
model_name | positional | Path to the transformer model. Can also point to a model on the Hugging Face Hub. | |
source | positional | Source file to annotate | |
--lang , -l | option | Language to assume for the spaCy tokeniser | “en” |
hf.train.textcat
command
Trains a Hugging Face model for text classification directly on your annotated datasets.
prodigy hf.train.textcat datasets out_dir --model-name --batch-size --eval-split --learning-rate --verbose
Argument | Type | Description | Default |
---|---|---|---|
datasets | positional | One or more (comma-separated) datasets for the text classifier. Use the eval: prefix for evaluation | |
out_dir | positional | Folder to store trained model and checkpoints. | |
--model-name , -mn | option | The name of the model to be used as a starting point for training. | “distilbert-base-uncased” |
--batch-size , -bs | option | Batch size for training. | 8 |
--eval-split , -es | option | If no evaluation sets are provided for a component, this setting can be used to split off a percentage of the training examples for evaluation. If no evaluation splits are given, the train set performance will be reported. | |
--learning-rate , -lr | option | Learning rate. | 2e-5 |
--verbose , -v | flag | Output all the logs/warnings from Huggingface libraries. | False |
hf.correct.textcat
manual
Annotate data for text classification with a transformer model in the loop.
prodigy hf.correct.textcat dataset model_name source
Argument | Type | Description | Default |
---|---|---|---|
dataset | positional | Dataset to save annotations into | |
model_name | positional | Path to the transformer model. Can also point to a model on the Hugging Face Hub. | |
source | positional | Source file to annotate |
hf.upload
command
Upload your annotations to Hugging Face Hub.
You can use the same command multiple times to upload the most recent version of your data to the hub.
prodigy hf.upload datasets repo_id --keep-annotator-ids --patch_values --private
Argument | Type | Description | Default |
---|---|---|---|
datasets | positional | One or more (comma-separated) datasets to upload. Use the name: prefix to add keys to the dataset. | |
repo_id | positional | Name of the repo to upload to. Should be formatted as username/reponame . | |
--keep-annotator-ids , -k | flag | Don’t anonymize the annotators. | False |
--patch_values , -nv | flag | If keys are missing between datasets, patch them with None values. | False |
--private , -p | flag | Upload dataset as a private repository. | False |
Prodigy-PDF
This plugin contains recipes that help you annotate PDF files by turning them
into images first. This way, they can be annotated using the familiar
image_manual
interface. It also contains recipes for OCR. If you're
interested in a quick overview, you may appreciate this YouTube explainer.
To use these recipes, you’ll first need to install the plugin.
Install prodigy-pdf
pip install "prodigy-pdf @ git+https://github.com/explosion/prodigy-pdf"
Then once it is installed, you can start annotating PDFs as images via pdf.image.manual.
prodigy pdf.image.manual papers path/pdfs --labels figure,footnote,paragraph
If you like, you can re-use the pdf annotations with the pdf.ocr.correct
recipe to apply OCR to the annotated segments. This recipe uses
pytesseract under the hood to give
suggestions that you can correct.
prodigy pdf.ocr.correct ocr_images papers path/pdfs --labels paragraph --fold-dashes
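If you're curious what the OCR step boils down to, pytesseract can be called directly on a cropped region of a page image. This is a simplified sketch rather than the plugin's exact implementation; the file path and box coordinates are made up.
# Sketch: crop an annotated region out of a rendered page and run OCR over it.
# The path and coordinates are illustrative.
from PIL import Image
import pytesseract

page = Image.open("page-1.png")            # a PDF page rendered as an image
box = (50, 100, 500, 300)                  # (left, upper, right, lower) of a "paragraph" span
crop = page.crop(box)

# This text is the kind of suggestion you would then correct in the Prodigy UI.
print(pytesseract.image_to_string(crop))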
API
pdf.image.manual
manual
Add annotations to a PDF by first converting it to an image.
In order for this recipe to work, you may need to install system dependencies for tesseract. These can usually be installed directly via:
# for mac
brew install tesseract
# for ubuntu
sudo apt install tesseract-ocr
prodigy pdf.image.manual dataset pdf_folder --labels --remove-base64
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
pdf_folder | Path | Folder that contains your pdf files | |
--labels , -l | str | Comma-delimited labels to attach | |
--remove-base64 , -R | bool | Don’t save the base64 images of the PDFs | False |
pdf.ocr.correct
manual
Applies OCR to annotated segments from pdf.image.manual
and gives a textbox
for corrections.
prodigy pdf.ocr.correct dataset source --labels --scale --fold-dashes --remove-base64 --autofocus
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Source with PDF Annotations | |
--labels , -l | str | Labels to consider | |
--scale , -s | int | Zoom scale. Increase above 3 to upscale the image for OCR. | 3 |
--remove-base64 , -R | bool | Don’t save the base64 images of the pdfs | False |
--fold-dashes , -f | bool | Removes dashes at the end of a textline and folds them with the next term. | False |
--autofocus , -af | bool | Autofocus on the transcript UI | False |
🤫 Prodigy-Whisper
OpenAI released an open model for audio transcription called Whisper. It’s a model that can be downloaded and run locally, it has support for multiple languages and you’re even able to pick from a selection of model sizes. The model isn’t perfect, but when you’re transcribing audio, it can really help to have such a model provide a starting point. The goal of this plugin is to help you get started with this right away.
To use this plugin, you’ll need to install it first.
Install prodigy-whisper
pip install "prodigy-whisper @ git+https://github.com/explosion/prodigy-whisper"
In order to use the plugin you’ll also need to have ffmpeg
installed. Most
package managers have it available, so you should be able to use one of
the following commands.
Install ffmpeg
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
Once the plugin is installed you can use the whisper.audio.transcribe
recipe.
It is very similar to the audio.transcribe
recipe that Prodigy provides, but
this recipe uses Whisper to provide an initial transcription.
Example
prodigy whisper.audio.transcribe transcripts ./recordings --model base
In the base form you can already see that Whisper does a pretty good job at transcription. But it may be easier to correct short pieces of audio instead of one long recording. This is where Whisper can help out as well. It is able to segment a long audio clip into shorter segments, and each of these segments can then be annotated in Prodigy.
To use this feature, you can add the --segment
flag to the recipe call.
Example
prodigy whisper.audio.transcribe transcripts ./recordings --model base --segment
Now, you can go through the segments one by one and each segment will have metadata attached so that you can link it back to the timestamps in the original file. This is what the first segment would look like.
This is what the second segment would look like.
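If you want to experiment with Whisper outside of Prodigy, the openai-whisper package exposes the same transcription and segmentation directly. This is a standalone sketch, not the plugin's internal code; the file name is made up.
# Sketch: transcribe a file with the openai-whisper package.
# "recording.mp3" is a made-up file name.
import whisper

model = whisper.load_model("base")
result = model.transcribe("recording.mp3")

print(result["text"])               # the full transcript
for seg in result["segments"]:      # segments come with start/end timestamps
    print(seg["start"], seg["end"], seg["text"])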
API
whisper.audio.transcribe
manual
Manually transcribe audio files by typing the transcript into a text field with
the help of Whisper. The API is built on top of audio.transcribe
and will
allow you to configure everything that the original recipe can. The only input
addition is that this recipe also allows you to select a Whisper model. The
recipe uses the "base"
model by default, but you should be able to pick any of
the models shown
here.
prodigy whisper.audio.transcribe dataset source --model --loader --autoplay --keep-base64 --fetch-media --playpause-key --text-rows --field-id --exclude
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Path to a directory containing audio files or pre-formatted JSONL file if --loader jsonl is set. | |
--model , -m | str | Name of OpenAI Whisper model to use. | base |
--loader , -lo | str | Optional ID of source loader, e.g. audio or video . | audio |
--autoplay , -A | bool | Autoplay the audio when a new task loads. | False |
--keep-base64 , -B | bool | If audio loader is used: don’t remove the base64-encoded audio data from the task before it’s saved to the database. | False |
--fetch-media , -FM | bool | Convert local paths and URLs to base64. Can be enabled if you’re annotating a JSONL file with paths or for re-annotating an existing dataset. | False |
--playpause-key , -pk | str | Alternative keyboard shortcuts to toggle play/pause so it doesn’t conflict with text input field. | "command+enter, option+enter, ctrl+enter" |
--text-rows , -tr | int | Height of the text input field, in rows. | 6 |
--field-id , -fi | str | Add the transcript text to the data using this key, e.g. "transcript": "Text here" . | "transcript" |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
Prodigy-Segment
Sometimes you’re interested in selecting pixels from an image, as opposed to
merely selecting a bounding box. Selecting the right pixels can be tedious work
so you may want to use a model in the loop to help you. A good choice for such a
model is Meta’s Segment Anything model, which
we’ve integrated into Prodigy via the
prodigy-segment
plugin.
This model is able to take bounding box annotations from Prodigy to construct a pixel segmentation map under the hood. From the UI, that might look like this:
Using Prodigy-Segment
For a quick overview of the features, you may also enjoy this Youtube tutorial.
Before you can use the recipes, make sure you’ve downloaded the appropriate model checkpoint. You can check the available models here, but this tutorial will assume the “default” model-type. The weights for this model can be downloaded via:
Download the weights for the `default` model-type
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
Once the model is downloaded you can get started by running the
segment.image.manual
recipe.
prodigy segment.image.manual segment-cat-dog images sam_vit_h_4b8939.pth --model-type default --label cat,dog
When you run this model, you may notice that it’s fairly slow. This isn’t a big surprise given the size of the model, but it can be a serious burden, especially if your machine does not have a GPU. For a better experience, you may want to pre-compute the features ahead of annotation time and cache those results to disk. It may take a while to precompute all the images, but once that is done the annotation experience feels seamless and real-time again.
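To get a feel for why the features are worth caching, here is a hedged sketch of box-prompted prediction with the segment-anything package; the plugin's internals may differ, and the image path and box coordinates are made up.
# Sketch: box-prompted segmentation with the segment-anything package.
# The image path and box coordinates are illustrative.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["default"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("images/cat.jpg").convert("RGB"))
predictor.set_image(image)          # the slow, cacheable feature-extraction step

masks, scores, _ = predictor.predict(
    box=np.array([100, 80, 400, 360]),   # x0, y0, x1, y1 from a bounding box annotation
    multimask_output=False,
)
print(masks.shape)                  # (1, H, W) boolean pixel map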
To precompute a cache, you can use the segment.fill-cache
recipe.
prodigy segment.fill-cache images sam_vit_h_4b8939.pth --model-type default --cache segment-anything-cache
This will store all the features in a folder (configurable via the --cache
flag) which the segment.image.manual
recipe can immediately pick up.
prodigy segment.image.manual segment-cat-dog images sam_vit_h_4b8939.pth --model-type default --label cat,dog --cache segment-anything-cache
The pixel maps, once annotated, are stored under the spans
key in your
examples. You can explore these maps one by one in a Jupyter notebook using the
script shown below.
Script to loop over all annotated examples
import base64
from io import BytesIO

from PIL import Image
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset_examples("<dataset-name>")

def mask_to_pil(mask_str):
    # The mask is stored as a base64-encoded image; strip the data-URI prefix first.
    indicator = "base64,"
    mask_str = mask_str[mask_str.find(indicator) + len(indicator):]
    buffer = BytesIO(base64.b64decode(mask_str))
    return Image.open(buffer)

# Loop over all the examples and display them.
for ex in examples:
    print(ex['path'])
    for span in ex.get("spans", []):
        # Use builtin `display` to view pixel map
        display(mask_to_pil(span['mask']))
From here you can use the Pillow library to store these pixel maps in whatever format your pipeline requires, or stream them directly into a learning algorithm from Python.
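For example, a minimal sketch that builds on the script above and writes each mask out as a PNG (the output folder name is made up):
# Sketch: save each decoded mask as a PNG, reusing mask_to_pil and examples
# from the script above. The "masks" folder name is made up.
import os
import numpy as np

os.makedirs("masks", exist_ok=True)
for i, ex in enumerate(examples):
    for j, span in enumerate(ex.get("spans", [])):
        mask_img = mask_to_pil(span["mask"])
        mask_arr = np.array(mask_img)    # e.g. to feed into a training pipeline
        mask_img.save(f"masks/example-{i}-{j}.png")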
API
segment.image.manual
manual
Manually annotate pixels in images with Meta’s Segment Anything model under the hood.
prodigy segment.image.manual dataset source checkpoint --label --loader --exclude --width --darken --no-fetch --remove-base64 --model-type --cache
Argument | Type | Description | Default |
---|---|---|---|
dataset | str | Prodigy dataset to save annotations to. | |
source | str | Path to a directory containing image files or pre-formatted JSONL file if --loader jsonl is set. | |
checkpoint | Path | Path to a model checkpoint. | |
--label , -l | str / Path | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. | |
--loader , -lo | str | Optional ID of source loader. | images |
--exclude , -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
--width , -w | int | Width of card and maximum image width in pixels. | 675 |
--darken , -D | bool | Darken image to make boxes stand out more. | False |
--no-fetch , -NF | bool | Don’t fetch images as base64. Ideally requires a JSONL file as input, with --loader jsonl set and all images available as URLs. | False |
--remove-base64 , -R | bool | Remove base64-encoded image data before storing example in the database and only keep the reference to the local file path. Caution: If enabled, make sure to keep original files! | False |
--model-type , -mt | str | Type of model to use. | default |
--cache , -c | Path | Path to feature cache to speed up inference. | segment-anything-cache |
segment.fill-cache
command
Prepares a local disk cache to speed up inference for segment.image.manual
.
This can cause a huge speedup if you’re running on a non-GPU device.
prodigy segment.fill-cache source checkpoint --loader --model-type --cache
Argument | Type | Description | Default |
---|---|---|---|
source | str | Path to a directory containing image files or pre-formatted JSONL file if --loader jsonl is set. | |
checkpoint | Path | Path to a model checkpoint. | |
--loader , -lo | str | Optional ID of source loader. | images |
--model-type , -mt | str | Type of model to use. | default |
--cache , -c | Path | Path to feature cache to speed up inference. | segment-anything-cache |
Prodigy-ANN
Sometimes you may want to query your examples to find a relevant subset for annotation. A modern method for doing this is to use numeric vectors to represent text, and you can use approximate nearest neighbor (ANN) techniques to fetch relevant examples. The goal is to spend more time looking at examples that matter, like examples similar to items that the model gets wrong. Curating these examples first might be a pragmatic method to steer the model in the right direction.
If you’re interested in seeing a quick demo of Prodigy-ANN applied to a text dataset, you may appreciate this Prodigy short on YouTube.
To use this plugin, you’ll need to install it first.
Install prodigy-ann
pip install "prodigy-ann @ git+https://github.com/explosion/prodigy-ann"
As a first step, you’ll need to generate an index with vector representations of your text. To encode the text, this library uses sentence-transformers, and it uses hnswlib as an index for these vectors.
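To make that combination concrete, here's a rough sketch of the encode-then-index idea. The model name, index parameters and example texts are assumptions for illustration, not necessarily what the plugin uses.
# Sketch: encode texts with sentence-transformers and index them with hnswlib.
# Model name, parameters and texts are illustrative.
import hnswlib
from sentence_transformers import SentenceTransformer

texts = ["this is an outrage!", "great service, thanks", "my order never arrived"]

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(texts)

index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(texts))
index.add_items(vectors, ids=list(range(len(texts))))

# Querying: embed the query and fetch the nearest examples.
labels, distances = index.knn_query(model.encode(["angry customer"]), k=2)
print([texts[i] for i in labels[0]])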
To index your documents, you can run the ann.text.index
recipe.
prodigy ann.text.index examples.jsonl examples.index
indexing: 100%|███████████████████████████| 2210/2210 [00:09<00:00, 243.64it/s]
Once it is indexed you can use text queries to find and curate interesting subsets.
A general method to prepare these subsets is to use ann.text.fetch
. This will
fetch a subset of vectors that are close in vector space and save the associated
examples on disk. From there you can use any Prodigy recipe you like.
prodigy ann.text.fetch examples.jsonl examples.index subset.jsonl --query "this is an outrage!"
More interfaces
As a convenience this plugin also provides the textcat.ann.manual
,
ner.ann.manual
and spans.ann.manual
so that you may query and annotate
directly. These recipes have the same arguments as their native Prodigy
textcat.manual
, ner.manual
and spans.manual
counterparts
but add a --query
parameter so that you may pass your query.
Interactive Queries
Sometimes you may want to update the stream while you’re annotating. You can do
that without restarting the server by using the --allow-reset
flag when you’re
starting the textcat.ann.manual
, ner.ann.manual
or spans.ann.manual
recipes.
prodigy textcat.ann.manual examples.jsonl examples.index --query "new academic dataset" --allow-reset
Here’s an example of what the experience might look like from the UI.
Retrieving Images
You can use these embedding retrieval techniques for images too. Models like CLIP allow you to embed images and text in the same space, which means that you can query the images by using text.
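As a rough illustration of the idea, sentence-transformers ships CLIP checkpoints that embed images and text into the same space. The model name and image paths below are assumptions; the plugin may use a different model.
# Sketch: text-to-image retrieval with a CLIP model from sentence-transformers.
# Model name and image paths are illustrative.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

image_paths = ["images/laptop.jpg", "images/phone.jpg"]
image_vecs = model.encode([Image.open(p) for p in image_paths])
query_vec = model.encode(["a photo of a laptop"])

scores = util.cos_sim(query_vec, image_vecs)[0]
print(sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]))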
The approach for images is very similar to the approach for text too. To get
started you’ll first want to run an indexing recipe over a folder of images via
the ann.image.index
recipe.
prodigy ann.image.index path/to/image_folder image.index
indexing: 100%|███████████████████████████| 210/210 [01:49<00:00]
Once the index is built, you can query it. You can choose to query it to prepare
a .jsonl
file to re-use later via the ann.image.fetch
recipe.
prodigy ann.image.fetch path/to/image_folder examples.index out.jsonl --query "laptops" --remove-base64 --n 100
Alternatively the plugin also provides a wrapper around the familiar
image.manual
recipe. This will retrieve the images before passing them on
to the image_manual
interface. This interface also allows you to reset
the stream via the --allow-reset
flag.
prodigy image.ann.manual annotated_laptops path/to/image_folder examples.index --query "laptops" --remove-base64 --n 100 --labels laptop,phone --allow-reset
Here’s an example of what the experience might look like from the UI.
API
ann.text.index
command
Builds an HNSWlib index on example text data.
prodigy ann.text.index source index_path
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source to index. | |
index_path | Path | Path of trained index |
ann.text.fetch
command
Fetch a relevant subset using a HNSWlib index.
prodigy ann.text.fetch source index_path out_path --query --n
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source to index. | |
index_path | Path | Path of trained index | |
out_path | Path | Path to stored subset of interest | |
--query , -q | str | Query to encode and pass to index | |
--n , -n | int | Number of results to return from index | 200 |
ann.image.index
command
Builds an HNSWlib index on example image data.
prodigy ann.image.index source index_path
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source folder of images to index. | |
index_path | Path | Path of trained index |
ann.image.fetch
command
Fetch a relevant subset of images using a HNSWlib index.
prodigy ann.image.fetch source index_path out_path --query --n --remove-base64
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source folder of images for index. | |
index_path | Path | Path of trained index | |
out_path | Path | Path to stored subset of interest | |
--query , -q | str | Query to encode and pass to index | |
--n , -n | int | Number of items to retrieve | 200 |
--remove-base64 , -R | bool | Don’t save the base64 images on disk | False |
Prodigy-Lunr
Instead of using semantic vectors with approximate nearest neighbors to find
relevant subsets you can also resort to the “regular” search techniques. To
accommodate these techniques we’ve added support for recipes that use
lunr. These recipes are very similar
to their ann.*
counterparts but will rely on string matching techniques to
retrieve relevant examples.
To use this plugin, you’ll need to install it first.
Install prodigy-lunr
pip install "prodigy-lunr @ git+https://github.com/explosion/prodigy-lunr"
To index your documents, you can run the lunr.text.index
recipe. This will
generate an index and serialize it to disk by writing it into a gzipped json
file.
prodigy lunr.text.index examples.jsonl index.gz.json
indexing: 100%|███████████████████████████| 2210/2210 [00:09<00:00, 243.64it/s]
Once it is indexed you can use text queries to find and curate interesting subsets.
A general method to prepare these subsets is to use lunr.text.fetch
. This will
fetch a subset of examples that match the query and save the associated
examples on disk. From there you can use any Prodigy recipe you like.
prodigy lunr.text.fetch examples.jsonl index.gz.json subset.jsonl --query "outrage better service unhappy"
More interfaces
As a convenience this plugin also provides the textcat.lunr.manual
,
ner.lunr.manual
and spans.lunr.manual
so that you may query and annotate
directly. These recipes have the same arguments as their native Prodigy
textcat.manual
, ner.manual
and spans.manual
counterparts
but add a --query
parameter so that you may pass your query.
Interactive Queries
Sometimes you may want to update the stream while you’re annotating. You can do
that without restarting the server by using the --allow-reset
flag when you’re
starting the textcat.lunr.manual
, ner.lunr.manual
or spans.lunr.manual
recipes.
prodigy textcat.lunr.manual examples.jsonl index.gz.json --query "outrage better service unhappy" --allow-reset
API
lunr.text.index
command
Builds a lunr index on example text data.
prodigy lunr.text.index source index_path
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source to index. | |
index_path | Path | Path to stored lunr index |
lunr.text.fetch
command
Fetch a relevant subset using a lunr index.
prodigy lunr.text.fetch source index_path out_path --query
Argument | Type | Description | Default |
---|---|---|---|
source | Path | Path to source to index. | |
index_path | Path | Path to stored lunr index | |
out_path | Path | Path to stored subset of interest | |
--query , -q | str | Query to encode and pass to index |
Sense2vec
sense2vec (Trask et al., 2015) is a nice twist on word2vec that lets you learn more interesting and detailed word vectors. This library is a simple Python implementation for loading, querying and training sense2vec models. To explore the semantic similarities across all Reddit comments of 2015 and 2019, see the interactive demo. There are also more details in this blog post.
To see a demo on how to use this tool with Prodigy, you may enjoy this Youtube video where we use it to detect video games in text.
To use sense2vec, you’ll first need to install it.
python -m pip install sense2vec
To use the pre-trained vectors in Prodigy you’ll need to download the archive(s) and extract them. Large files have been split into multi-part downloads. All the available versions can be found below.
Vectors | Size | Description | Download Link (zipped) |
---|---|---|---|
s2v_reddit_2019_lg | 4 GB | Reddit comments 2019 (01-07) | part 1, part 2, part 3 |
s2v_reddit_2015_md | 573 MB | Reddit comments 2015 | part 1 |
To merge the multi-part archives, you can run the following:
cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz
Once downloaded (and merged) you should be able to unarchive via:
tar -xvf s2v_reddit_2019_lg.tar.gz
Now that the archive is extracted you can point the sense2vec.teach
recipe to
it. This will allow Prodigy to suggest similar terms based on the most similar
phrases from sense2vec, and the suggestions will be adjusted as you annotate and
accept similar phrases. For each seed term, the best matching sense according to
the sense2vec vectors will be used.
prodigy sense2vec.teach video_game_yesno /path/to/s2v_reddit_2019_lg --seeds "mass effect,knights of the old republic,halo 3" --resume
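To get a feel for the kind of suggestions this surfaces, you can also query the vectors directly with the standalone sense2vec library. This is a sketch; the path is the extracted archive from above and the seed phrase is taken from the example.
# Sketch: query the sense2vec vectors directly.
# The path is the extracted archive; the phrase comes from the example above.
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2019_lg")

key = s2v.get_best_sense("mass effect")   # e.g. "mass_effect|NOUN", or None if unknown
print(s2v.most_similar(key, n=5))         # similar phrases with similarity scores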
After curating the generated examples you can choose to export the collected
phrases as pattern files which can be used with
spaCy’s EntityRuler
or recipes like ner.manual
by using the sense2vec.to-patterns
recipe.
prodigy sense2vec.to-patterns video_game_yesno blank:en VIDEO_GAME --output-file patterns.jsonl
This will generate a patterns.jsonl
file locally that has contents that may
look like:
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "mass"}, {"LOWER": "effect"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "knights"}, {"LOWER": "of"}, {"LOWER": "the"}, {"LOWER": "old"}, {"LOWER": "republic"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "halo"}, {"LOWER": "3"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "jade"}, {"LOWER": "empire"}]}
More recipes
Sense2vec also has the
sense2vec.eval
,
sense2vec.eval-most-similar
and
sense2vec.eval-ab
recipes. These may be useful if you want to evaluate a sense2vec
model. For more information on those, you can check the
README on
the Github repository.
sense2vec.teach
binary
Bootstrap a terminology list using sense2vec.
prodigy sense2vec.teach dataset vectors_path --seeds --threshold --n-similar --batch-size --case-sensitive --resume
Argument | Type | Description | Default |
---|---|---|---|
dataset | positional | Dataset to save annotations to. | |
vectors_path | positional | Path to pretrained sense2vec vectors. | |
--seeds , -s | option | One or more comma-separated seed phrases. | |
--threshold , -t | option | Similarity threshold. | 0.85 |
--n-similar , -n | option | Number of similar items to get at once. | 100 |
--batch-size , -b | option | Batch size for submitting annotations. | 5 |
--case-sensitive , -CS | option | Show the same terms with different casing. | False |
--resume , -R | flag | Resume from an existing phrases dataset. | False |
sense2vec.to-patterns
command
Convert a dataset of phrases collected with sense2vec.teach to token-based match patterns.
prodigy sense2vec.to-patterns dataset spacy_model label --output-file --case-sensitive --dry
Argument | Type | Description | Default |
---|---|---|---|
dataset | positional | Phrase dataset to convert. | |
spacy_model | positional | spaCy model for tokenization. | |
label | positional | Label to apply to all patterns. | |
--output-file , -o | option | Optional output file. Defaults to stdout. | |
--case-sensitive , -CS | flag | Make patterns case-sensitive. | False |
--dry , -D | flag | Perform a dry run and don’t output anything. | False |