Terminology Guide¶

The Problem¶

This project evolved from a simple OCR script to an enterprise data platform, creating terminology confusion that obscures the actual workflows. Terms like "import" suggest database operations when we're actually doing OCR extraction.

Clear Definitions¶

Primary Workflow Terms¶

Term	Definition	Usage	Examples
Extract	Getting data FROM images via OCR	`python cli.py process`	Images → Text data
Process	General term for OCR extraction	`python cli.py process`	Same as extract
Ingest	Adding data TO the system (any source)	General term	Images, CSV, manual entry
Import	Bringing external data INTO database	`python cli.py import`	CSV → Database
Export	Creating output FROM database	`python cli.py export`	Database → Darwin Core

Data Flow Terms¶

Term	What It Describes	Input → Output
OCR Pipeline	Image processing workflow	Images → Raw text
Extraction Job	Complete OCR processing task	Image batch → Structured data
Review Workflow	Quality control process	Raw data → Approved data
Archive Creation	Standards compliance export	Database → Darwin Core ZIP

Database Terms¶

Table/Concept	Purpose	Contains
specimens	Tracks OCR extraction jobs	Image files and their processing status
final_values	Curator-approved field values	Reviewed and corrected OCR results
processing_state	OCR job progress tracking	Success/failure status for each image
import_audit	External data import tracking	Records from CSV imports, not OCR

Common Confusions Fixed¶

"Import" Confusion¶

Before: Issue #193 talks about "import audit sign-off workflow" After: This should be split into: - Extraction Audit: Tracking OCR processing (images → data) - Import Audit: Tracking external data imports (CSV → database)

"Specimen" vs "Image" Confusion¶

Before: specimens table suggests biological specimens After: This tracks extraction jobs - each record represents processing one image file - One specimen (biological) might have multiple images - One image might show multiple specimens - The table tracks processing, not taxonomy

Review Workflow Confusion¶

Before: import_review.py suggests reviewing imports After: This should be extraction_review.py - reviewing OCR results

Recommended Refactoring¶

File Renames¶

# Current → Proposed
import_review.py → extraction_review.py
test_import_review.py → test_extraction_review.py

Function Renames¶

# Current → Proposed
import_review_selections() → review_extractions()
import_audit_trail() → extraction_audit_trail()

CLI Command Clarity¶

# Current (confusing)
python cli.py process --input images/  # What does "process" mean?

# Clearer
python cli.py extract --input images/  # OCR extraction from images
python cli.py import --input data.csv  # Import external data

Issue Terminology Updates¶

Issue #193: "Import audit sign-off workflow" Should be: "Extraction audit and import audit workflows"

Issue #194: "Spreadsheet pivot-table reporting" Context: This is about reviewing OCR results, not importing spreadsheets

Usage Examples with Clear Terminology¶

OCR Extraction (Primary Use Case)¶

# Extract data from herbarium images using OCR
python cli.py extract --input specimen_photos/ --output results/

# What happens:
# 1. Images are processed via Apple Vision OCR
# 2. Text data is extracted and structured
# 3. Results saved to results/occurrence.csv

Data Import (Secondary Use Case)¶

# Import external CSV data into the database
python cli.py import --input external_data.csv --output results/

# What happens:
# 1. CSV data is read and validated
# 2. Records are inserted into database
# 3. Audit trail records the import source

Review Workflow (Quality Control)¶

# Review extracted OCR results for accuracy
python review_web.py --db results/candidates.db --images specimen_photos/

# What happens:
# 1. Web interface shows side-by-side image and extracted data
# 2. Curator can edit/approve/reject each field
# 3. Approved data goes to final_values table

Export (Standards Compliance)¶

# Export approved data to Darwin Core format
python cli.py export --output results/ --version 1.0

# What happens:
# 1. Approved data from final_values table
# 2. Formatted according to Darwin Core standards
# 3. Packaged as GBIF-ready archive

Documentation Structure with Clear Terms¶

README.md Focus¶

# Quick Start: Extract Data from Specimen Images

1. python cli.py extract --input photos/ --output results/
2. python review_web.py --db results/candidates.db --images photos/
3. python cli.py export --output results/

ADVANCED.md for Complex Workflows¶

# Advanced: Multiple Data Sources

## OCR Extraction + Manual Data Entry + CSV Import
1. Extract from images: python cli.py extract ...
2. Import CSV data: python cli.py import ...
3. Manual entry via web interface
4. Review all sources together
5. Export to Darwin Core

Benefits of Clear Terminology¶

For New Users¶

Immediately understand that primary workflow is OCR extraction
Know when they need database features vs simple extraction
Clear mental model of data flow

For Developers¶

Functions and files clearly indicate their purpose
Separation between extraction and import logic
Easier to find relevant code

For Issues and Planning¶

Features can be categorized clearly (extraction vs import vs export)
Priorities become clearer (OCR accuracy vs audit compliance)
Less confusion about requirements

Migration Strategy¶

Phase 1: Documentation¶

✅ Create this terminology guide
✅ Update ARCHITECTURE.md with clear terms
Update issue descriptions to use consistent terminology

Phase 2: Code Comments¶

# Add clarifying comments to confusing functions
def import_review_selections():
    """Review OCR extraction results (not imports from external files)."""

Phase 3: Gradual Refactoring¶

Rename files and functions over multiple releases
Maintain backwards compatibility
Update CLI command names with aliases

The goal is conceptual clarity - users should immediately understand what each part of the system does without having to decode overloaded terminology.

[AAFC]: Agriculture and Agri-Food Canada [GBIF]: Global Biodiversity Information Facility [DwC]: Darwin Core [OCR]: Optical Character Recognition [API]: Application Programming Interface [CSV]: Comma-Separated Values [IPT]: Integrated Publishing Toolkit [TDWG]: Taxonomic Databases Working Group