By default, Prodigy uses SQLite to store annotations
in a simple database file in your Prodigy home directory. If you want to use the
default database with its default settings, no further configuration is
required and you can start using Prodigy straight away. Alternatively, you can
choose to use Prodigy with MySQL or
PostgreSQL, or write your own custom recipe to
plug in any other storage solution.
Prodigy uses the peewee package to
manage the database integration. This gives you a lot of flexibility in terms of
setup and debugging, and allows you to use more advanced features via the
Playhouse extension.
Prodigy uses the database to store all annotations by project and annotation
session. Even if you’re only using Prodigy with a model in the loop, you’ll
usually want a record of the collected annotations as a backup, or to use them
as evaluation data. Prodigy is also very powerful without a model in the loop –
for example to bootstrap word lists or collect feedback on the output of two
models and generate evaluation data for machine translation or image
classification.
No! The database only exists to store collected annotations. You can read in
the raw data straight from a file or a
custom source. As your data comes in,
Prodigy will assign hashes to the examples, so it’ll always be able to tell
whether an example has been annotated already or not.
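The deduplication described above can be sketched with plain Python. This is an illustrative toy, not Prodigy's actual hashing (Prodigy assigns separate `_input_hash` and `_task_hash` values via its own logic); it only shows the idea of content-based hashes letting a stream skip already-seen examples:

```python
import hashlib
import json

def task_hash(example: dict) -> int:
    # Deterministic hash over the example's content (illustration only)
    payload = json.dumps(example, sort_keys=True).encode("utf8")
    return int(hashlib.md5(payload).hexdigest()[:8], 16)

seen = set()
stream = [{"text": "hello"}, {"text": "world"}, {"text": "hello"}]
unseen = []
for eg in stream:
    h = task_hash(eg)
    if h not in seen:  # skip examples that were already annotated
        seen.add(h)
        unseen.append(eg)
```

Because the hash is computed from the content, the duplicate `{"text": "hello"}` is filtered out regardless of where in the stream it appears.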
If you have existing annotations, you can convert them to
Prodigy’s format for the given task and use the
db-in command to import them to the database.
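For example, existing annotations could be converted to JSONL like this. The field names follow Prodigy's task formats (`"text"`, `"label"`, `"answer"`); the labels, texts and output file name here are made up for illustration:

```python
import json

# Hypothetical pre-existing annotations, converted to Prodigy-style dicts
annotations = [
    {"text": "This movie was great", "label": "POSITIVE", "answer": "accept"},
    {"text": "Terrible plot", "label": "POSITIVE", "answer": "reject"},
]

# Write one JSON object per line (JSONL), ready for the db-in command
with open("converted.jsonl", "w", encoding="utf8") as f:
    for eg in annotations:
        f.write(json.dumps(eg) + "\n")
```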
The default database option stores all annotations in a flat
SQLite database file. Unless otherwise specified, the
database is created in the Prodigy home directory.
The settings for a PostgreSQL database can take any
psycopg2
connection parameters. Instead of providing the settings in your prodigy.json,
you can also set
environment variables.
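A `prodigy.json` entry along these lines would configure a PostgreSQL connection. The host, credentials and database name below are placeholders, and the connection keys follow psycopg2's standard connection parameters:

```json
{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "annotator",
      "password": "xxx",
      "host": "localhost",
      "port": 5432
    }
  }
}
```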
When setting up your database integration, you might not want to give your user
permission to perform all operations. Prodigy uses the following operations,
some of which are optional and only required for certain commands or initial
setup:
| Operation | Required | Details |
| --- | --- | --- |
| `SELECT` | yes | Retrieving datasets and annotations. |
| `INSERT` | yes | Adding datasets and annotations. |
| `UPDATE` | yes | Updating datasets and annotations. |
| `DELETE` | optional | Deleting datasets. Only used for the `prodigy drop` command, so permission can be omitted if you don't need this feature or prefer to delete records manually. |
| `CREATE` | optional | Creation of the tables `Dataset`, `Example` and `Link`. Not required if you create them manually. |
To test your database connection, you can also write a simple Python script that
connects to Prodigy’s database and performs the most important operations. For
more details, check out this thread on the
forum.
test_database.py

```python
from prodigy.components.db import connect

examples = [{"text": "hello world", "_task_hash": 123, "_input_hash": 456}]

db = connect()                                      # uses settings from prodigy.json
db.add_dataset("test_dataset")                      # add dataset
assert "test_dataset" in db                         # check that dataset was added
db.add_examples(examples, ["test_dataset"])         # add examples to dataset
examples = db.get_dataset_examples("test_dataset")  # retrieve a dataset's examples
assert len(examples) == 1                           # check that examples were added
```
Here are the tables Prodigy creates and how they map to the annotations you
collect. You typically shouldn’t have to interact with the database and its
tables directly.
| Table | Description |
| --- | --- |
| `Dataset` | The dataset / session IDs and meta information. |
| `Example` | The individual annotation examples. Each example is only added once, so if you add the same annotation to multiple datasets, it'll only have one record here. |
| `Link` | Example IDs linked to datasets. This is how Prodigy knows which examples belong to which datasets and sessions. |
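This is a classic normalized many-to-many layout. As a toy illustration (plain dicts, not Prodigy's actual schema), one annotated example added to two datasets yields a single `Example` row and two `Link` rows:

```python
# Toy stand-ins for the three tables
datasets = [{"id": 1, "name": "project_a"}, {"id": 2, "name": "project_b"}]
examples = [{"id": 10, "content": {"text": "hello", "answer": "accept"}}]
links = [
    {"example_id": 10, "dataset_id": 1},  # same example linked to
    {"example_id": 10, "dataset_id": 2},  # two different datasets
]

def examples_for(dataset_id):
    # Resolve a dataset's examples by following the links
    ids = {l["example_id"] for l in links if l["dataset_id"] == dataset_id}
    return [eg for eg in examples if eg["id"] in ids]
```

Both datasets resolve to the same single example record, which is why deleting or updating an example affects every dataset it's linked to.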
To use existing annotations collected with other tools in Prodigy, you can
import them via the db-in command. You can import data of all file
formats supported by Prodigy. However, JSON or JSONL is
usually recommended, as it gives you more flexibility. By default, all examples
will be set to "answer": "accept". You can specify a different answer using
the --answer argument on the command line.
```
prodigy db-in new_dataset /path/to/data.jsonl
Imported 1550 annotations to 'new_dataset'.
```
Prodigy provides a simple connection helper that takes care of connecting to one
of the built-in database options using the database ID and database settings. If
no database config is provided, it will be read from your prodigy.json settings,
defaulting to 'sqlite' with the standard settings.
| Argument | Type | Description |
| --- | --- | --- |
| `db_id` | str | ID of the database, i.e. `'sqlite'`, `'postgresql'` or `'mysql'`. Defaults to `'sqlite'`. |
| `db_settings` | dict | Database-specific settings. If not provided, settings will be read from the prodigy.json. |
| **RETURNS** | `Database` | The database. |
```python
from prodigy.components.db import connect

db = connect("sqlite", {"name": "my_db.db"})
```
| Argument | Type | Description |
| --- | --- | --- |
| `db` | `Database` | peewee database. Will be available as the `db` attribute of the database. |
| `display_id` | str | ID of the database, e.g. `'sqlite'`. Will be available as the `db_id` attribute of the database. For custom databases plugged in by the user, the ID will default to `'custom'`. |
| `display_name` | str | Human-readable name of the database, e.g. `'SQLite'`. Will be available as the `db_name` attribute. For custom database modules, the display name defaults to the function name, class name, or `repr(db)`. |
| **RETURNS** | `Database` | The database. |
To plug in a custom database, you can initialize the Database class with a
custom instance of peewee.Database or its extension package
Playhouse, for
example:
```python
import prodigy
from prodigy.components.db import Database
from playhouse.postgres_ext import PostgresqlExtDatabase

psql_db = PostgresqlExtDatabase("my_database", user="postgres")
db = Database(psql_db, "postgresql", "Custom PostgreSQL Database")

@prodigy.recipe("recipe-with-custom-db")
def recipe_with_custom_db():
    return {"db": db}
```
Reconnect to the database. Called on API requests to avoid timeout issues,
especially with MySQL. If the database connection is still open, it will be
closed before reconnecting.
Get all session datasets associated with a parent dataset. Finds all the session
datasets that have examples also associated with the parent dataset. Can be an
expensive query for large datasets.
Custom recipes let you return an optional "db"
component. If it’s not set, it will default to the database specified in your
prodigy.json or to "sqlite". The database plugged in via a custom recipe can
also be False (to not use any DB) or a custom class that follows Prodigy’s
Database API.
```python
@prodigy.recipe("custom-recipe")
def custom_recipe():
    return {"db": YourCustomDB}  # etc.
```
Essentially, all your custom class needs to do is expose methods to add and
retrieve datasets and annotated examples. For instance:
```python
class YourCustomDB:
    def __init__(self, *args, **kwargs):
        # initialize your custom database
        ...

    def get_dataset_examples(self, name, default=None):
        # get examples for a given dataset name
        ...

    # other methods and properties
```
How your database handler resolves those queries is entirely up to you. If your
database class implements the methods reconnect and close, Prodigy will call
those on each request to the REST API, allowing you to explicitly manage the
connection and prevent timeouts between requests.
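A minimal sketch of those optional hooks, assuming a hypothetical connection object (the `_open_connection` helper and its behavior are stand-ins for whatever client your backend uses):

```python
class YourCustomDB:
    def __init__(self, *args, **kwargs):
        self.conn = None

    def _open_connection(self):
        # Placeholder for e.g. your driver's connect(...) call
        return object()

    def reconnect(self):
        # Called by Prodigy on each REST API request: close any stale
        # connection first, then open a fresh one to avoid timeouts
        self.close()
        self.conn = self._open_connection()

    def close(self):
        # Called by Prodigy after handling a request
        if self.conn is not None:
            self.conn = None  # e.g. self.conn.close() with a real client
```

Because Prodigy only calls these methods if they exist, you can leave them out entirely for backends that manage their own connection pooling.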