Database schema

The application uses a lightweight SQLite database to track extraction progress and review outcomes. The schema consists of four core tables.

Specimens

Stores basic information about each specimen.

column	type	notes
specimen_id	TEXT	primary identifier
image	TEXT	path to specimen image
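
As a rough illustration, the table above could be declared as follows. This is a minimal sketch using Python's built-in `sqlite3` module, not the project's actual DDL: the `PRIMARY KEY` constraint is inferred from the "primary identifier" note, and the sample identifier and image path are invented for the example.

```python
import sqlite3

# Assumed DDL: column names and types follow the table above; the PRIMARY KEY
# constraint is an inference from the "primary identifier" note.
SPECIMENS_DDL = """
CREATE TABLE IF NOT EXISTS specimens (
    specimen_id TEXT PRIMARY KEY,
    image       TEXT
)
"""


def create_specimens_table(conn: sqlite3.Connection) -> None:
    """Create the specimens table if it does not already exist."""
    conn.execute(SPECIMENS_DDL)
    conn.commit()


# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
create_specimens_table(conn)

# Hypothetical sample row.
conn.execute(
    "INSERT INTO specimens (specimen_id, image) VALUES (?, ?)",
    ("SPEC-0001", "images/SPEC-0001.jpg"),
)
row = conn.execute("SELECT specimen_id, image FROM specimens").fetchone()
```

With a primary key in place, inserting the same `specimen_id` twice raises `sqlite3.IntegrityError`, which is one way to guard against duplicate specimens.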

Candidates

Holds the raw values produced by OCR engines. Each row carries an error flag, which is set when the engine fails to produce a reliable value.

column	type	notes
run_id	TEXT	identifier for the OCR run
image	TEXT	image filename
value	TEXT	extracted text
engine	TEXT	OCR engine name
confidence	REAL	engine confidence score
error	INTEGER	1 if engine flagged an error
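
The sketch below shows one way these columns might be queried to pick the most confident error-free reading for an image. The table name `candidates`, the `DEFAULT 0` on `error`, the helper `best_candidate`, and all sample rows (run IDs, filenames, engine names) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed DDL mirroring the columns listed above; the table name and the
# DEFAULT on `error` are assumptions.
conn.execute(
    """
    CREATE TABLE candidates (
        run_id     TEXT,
        image      TEXT,
        value      TEXT,
        engine     TEXT,
        confidence REAL,
        error      INTEGER DEFAULT 0
    )
    """
)

# Hypothetical rows: three engines read the same image and one flags an error.
conn.executemany(
    "INSERT INTO candidates VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("run-1", "sheet_001.jpg", "Quercus alba", "engine_a", 0.91, 0),
        ("run-1", "sheet_001.jpg", "Quercus aiba", "engine_b", 0.64, 0),
        ("run-1", "sheet_001.jpg", "", "engine_c", 0.99, 1),
    ],
)


def best_candidate(conn, image):
    """Return (value, engine, confidence) for the most confident clean row, or None."""
    return conn.execute(
        """
        SELECT value, engine, confidence
        FROM candidates
        WHERE image = ? AND error = 0
        ORDER BY confidence DESC
        LIMIT 1
        """,
        (image,),
    ).fetchone()


best = best_candidate(conn, "sheet_001.jpg")
```

Filtering on `error = 0` keeps a flagged row from winning even when its reported confidence is high, as with the hypothetical `engine_c` row here.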

Final values

Represents the final selected value for each metadata field.

column	type	notes
specimen_id	TEXT	links back to `specimens`
field	TEXT	metadata field name
value	TEXT	chosen value
module	TEXT	module that produced the value
confidence	REAL	confidence for the chosen value
error	INTEGER	1 if reviewers flagged an error
decided_at	TEXT	ISO timestamp of selection
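
To make the shape of a decision row concrete, here is a hedged sketch of recording a final value with an ISO timestamp. The table name `final_values`, the composite primary key on `(specimen_id, field)`, the helper `record_final_value`, and the sample field and module names are all assumptions. The source states only the columns.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# Assumed table name and key: one row per (specimen_id, field).
conn.execute(
    """
    CREATE TABLE final_values (
        specimen_id TEXT,
        field       TEXT,
        value       TEXT,
        module      TEXT,
        confidence  REAL,
        error       INTEGER DEFAULT 0,
        decided_at  TEXT,
        PRIMARY KEY (specimen_id, field)
    )
    """
)


def record_final_value(conn, specimen_id, field, value, module, confidence, error=False):
    """Store the chosen value, replacing any earlier decision for the same field."""
    decided_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO final_values VALUES (?, ?, ?, ?, ?, ?, ?)",
        (specimen_id, field, value, module, confidence, int(error), decided_at),
    )
    conn.commit()
    return decided_at


# Hypothetical decisions: a later review overrides the first selection.
record_final_value(conn, "SPEC-0001", "scientificName", "Quercus alba", "engine_a", 0.91)
record_final_value(conn, "SPEC-0001", "scientificName", "Quercus alba L.", "review", 1.0)
rows = conn.execute(
    "SELECT value, module, error, decided_at FROM final_values"
).fetchall()
```

Under the assumed key, the second call overwrites the first, so the table holds at most one decision per specimen and field. If the real schema keeps a history of decisions instead, a plain `INSERT` would be the closer match.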

Processing state

Tracks per-module processing state for each specimen.

column	type	notes
specimen_id	TEXT	specimen identifier
module	TEXT	module name
status	TEXT	e.g. `pending`, `done`, `failed`
confidence	REAL	optional confidence from the module
error	INTEGER	1 if the module reported an error
updated_at	TEXT	ISO timestamp of last update
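
A per-module state table like this is typically written with an upsert and read with a status filter. The following sketch assumes a table named `processing_state` keyed on `(specimen_id, module)`. The helpers `set_state` and `specimens_with_status` and the module name `ocr` are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# Assumed table name and key: one row per (specimen_id, module).
conn.execute(
    """
    CREATE TABLE processing_state (
        specimen_id TEXT,
        module      TEXT,
        status      TEXT,
        confidence  REAL,
        error       INTEGER DEFAULT 0,
        updated_at  TEXT,
        PRIMARY KEY (specimen_id, module)
    )
    """
)


def set_state(conn, specimen_id, module, status, confidence=None, error=False):
    """Insert or overwrite the state row for one specimen/module pair."""
    updated_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO processing_state VALUES (?, ?, ?, ?, ?, ?)",
        (specimen_id, module, status, confidence, int(error), updated_at),
    )
    conn.commit()


def specimens_with_status(conn, module, status):
    """List specimen IDs whose state for `module` equals `status`."""
    rows = conn.execute(
        "SELECT specimen_id FROM processing_state "
        "WHERE module = ? AND status = ? ORDER BY specimen_id",
        (module, status),
    )
    return [r[0] for r in rows]


# Hypothetical workflow: two specimens queued, one of them finishes.
set_state(conn, "SPEC-0001", "ocr", "pending")
set_state(conn, "SPEC-0002", "ocr", "pending")
set_state(conn, "SPEC-0001", "ocr", "done", confidence=0.91)
pending = specimens_with_status(conn, "ocr", "pending")
```

Because `confidence` is optional, the sketch stores `NULL` when a module reports none.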

Migrations

Run migrations using:

```python
from pathlib import Path

from io_utils.migrate import migrate_db

# Path to the SQLite database file to upgrade.
migrate_db(Path("candidates.db"))
```

This upgrades an older database by adding the columns and tables it is missing.
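
For readers curious how such an upgrade can work in SQLite, the sketch below shows a common additive pattern: inspect the table with `PRAGMA table_info` and add any missing columns with `ALTER TABLE ... ADD COLUMN`. It is not the implementation of `migrate_db`, whose internals are not described here. The helper `add_missing_columns` and the choice of which columns count as new are hypothetical.

```python
import sqlite3


def add_missing_columns(conn, table, columns):
    """Add each column in `columns` (name -> SQL type) that `table` lacks.

    Identifiers cannot be bound as SQL parameters, so `table` and the column
    definitions must come from trusted code, never from user input.
    """
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added


# Simulate an older database that predates three columns (hypothetical split).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (run_id TEXT, image TEXT, value TEXT)")
conn.execute("INSERT INTO candidates VALUES ('run-1', 'sheet_001.jpg', 'Quercus alba')")

wanted = {"engine": "TEXT", "confidence": "REAL", "error": "INTEGER DEFAULT 0"}
first = add_missing_columns(conn, "candidates", wanted)
second = add_missing_columns(conn, "candidates", wanted)  # nothing left to add
```

Checking for existing columns first makes the step safe to run repeatedly. Existing rows keep their data and read the declared default (or `NULL`) for the added columns.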
