Usage Modes
This system supports different levels of complexity depending on your needs. Choose the mode that fits your project requirements.
🚀 Quick Mode: Simple OCR Extraction
Perfect for: Individual researchers, small projects, immediate data needs
What you get:
- Direct OCR processing of images
- CSV output ready for immediate use
- No database complexity
- Fastest path from images to data
Workflow:
# 1. Process images with OCR
python cli.py process --input specimen_photos/ --output results/
# 2. Check your data (done!)
ls results/
# occurrence.csv <- Darwin Core data ready for GBIF
# raw.jsonl <- Raw OCR results with confidence scores
# manifest.json <- Processing metadata
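To sanity-check the three output files without opening them by hand, a short script can count records and list the fields that are actually present. This is a minimal sketch: it assumes only that `occurrence.csv` has a header row, that `raw.jsonl` holds one JSON object per line, and that `manifest.json` is a JSON object. The function name `summarize_results` is illustrative and not part of the tool.

```python
import csv
import json
from pathlib import Path


def summarize_results(results_dir):
    """Summarize the Quick Mode output files found in results_dir."""
    results = Path(results_dir)

    # occurrence.csv: count data rows and report the header columns.
    with open(results / "occurrence.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = reader.fieldnames or []

    # raw.jsonl: assumed to be one JSON object per non-empty line.
    raw_count = 0
    raw_keys = set()
    with open(results / "raw.jsonl", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                raw_keys.update(json.loads(line).keys())
                raw_count += 1

    # manifest.json: report top-level keys only, since its schema is not
    # described in this guide.
    with open(results / "manifest.json", encoding="utf-8") as f:
        manifest = json.load(f)
    manifest_keys = sorted(manifest) if isinstance(manifest, dict) else []

    return {
        "records": len(rows),
        "columns": columns,
        "raw_records": raw_count,
        "raw_keys": sorted(raw_keys),
        "manifest_keys": manifest_keys,
    }
```

Calling `summarize_results("results/")` after step 1 gives a quick overview; a mismatch between `records` and `raw_records` may be worth a closer look before submitting the CSV anywhere.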
Use Quick Mode when:
- ✅ You have fewer than 500 images to process
- ✅ You trust the OCR accuracy (Apple Vision: 95%)
- ✅ You don't need detailed review workflows
- ✅ CSV output meets your needs
🔬 Research Mode: Quality Control Workflow
Perfect for: Research projects, institutional collections, quality-focused work
What you get:
- OCR extraction with review interface
- Curator tools for data correction
- Confidence scoring and flagging
- Database tracking of corrections
Workflow:
# 1. Extract data with database tracking
python cli.py process --input specimen_photos/ --output results/
# 2. Review extraction results in web interface
python review_web.py --db results/candidates.db --images specimen_photos/
# Opens http://localhost:5000 for side-by-side review
# 3. Export approved data
python cli.py export --output results/ --version 1.0
# Creates dwca_v1.0.zip with reviewed data
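Before sending an exported archive anywhere, it can help to confirm that the zip opens cleanly and to see what it contains. The sketch below uses only Python's standard `zipfile` module. Darwin Core Archives commonly include a `meta.xml` descriptor, but whether this tool's export does is an assumption, so the check only reports its presence. `inspect_archive` is an illustrative name.

```python
import zipfile


def inspect_archive(path):
    """List the members of a zip archive and run a basic integrity check."""
    with zipfile.ZipFile(path) as archive:
        names = archive.namelist()
        # testzip() returns the name of the first corrupt member, or None.
        first_corrupt = archive.testzip()
    return {
        "members": names,
        "has_meta_xml": "meta.xml" in names,
        "first_corrupt_member": first_corrupt,
    }
```

For the export in step 3 this would be called with the path to `dwca_v1.0.zip` inside the output directory.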
Use Research Mode when:
- ✅ Data quality is critical
- ✅ Multiple people need to review results
- ✅ You want to track confidence scores
- ✅ GBIF submission requires quality control
🏛️ Production Mode: Enterprise Compliance
Perfect for: Museums, herbaria, institutional digitization programs
What you get:
- Full audit trails and compliance reporting
- Multiple data source integration
- User authentication and permissions
- Institutional-grade quality control
Workflow:
# 1. Process with audit tracking
python cli.py process --input specimen_photos/ --output results/ \
    --audit-user "curator@institution.edu"
# 2. Import additional data sources (optional)
python cli.py import --input external_data.csv --output results/ \
    --audit-user "datamanager@institution.edu"
# 3. Multi-user review workflow
python review_web.py --db results/candidates.db --images specimen_photos/ \
    --auth-required --user-tracking
# 4. Generate compliance reports
python cli.py audit-report --output compliance/ --format institutional
# 5. Export with full provenance
python cli.py export --output results/ --version 2.1 \
    --include-audit --include-provenance
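For repeated runs, the non-interactive steps above can be scripted. The following is a sketch, not a supported wrapper: it builds the same commands shown in steps 1, 4, and 5 as argument lists and runs them with `subprocess.run`, stopping at the first failure. The optional import (step 2) and the interactive review (step 3) are left out on purpose and would happen between the two groups. The function names `production_commands` and `run_steps` are hypothetical.

```python
import subprocess


def production_commands(images, results, user, version):
    """Build argv lists for the scripted steps, split around the manual review."""
    return {
        "before_review": [
            ["python", "cli.py", "process", "--input", images,
             "--output", results, "--audit-user", user],
        ],
        "after_review": [
            ["python", "cli.py", "audit-report", "--output", "compliance/",
             "--format", "institutional"],
            ["python", "cli.py", "export", "--output", results,
             "--version", version, "--include-audit", "--include-provenance"],
        ],
    }


def run_steps(commands, runner=subprocess.run):
    """Run each command in order; check=True raises on a non-zero exit code."""
    for argv in commands:
        runner(argv, check=True)
```

Passing the commands as lists rather than one shell string avoids quoting problems with values such as the audit user's email address.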
Use Production Mode when:
- ✅ Institutional compliance requirements exist
- ✅ Multiple curators/data managers involved
- ✅ Audit trails are legally required
- ✅ Long-term data management is critical
🔀 Hybrid Mode: Multiple Data Sources
Perfect for: Complex projects combining OCR, manual entry, and existing data
What you get:
- OCR extraction from images
- Manual data entry interface
- CSV/spreadsheet import capabilities
- Unified review and export workflow
Workflow:
# 1. Extract from images
python cli.py process --input new_photos/ --output project_db/
# 2. Import existing CSV data
python cli.py import --input historical_records.csv --output project_db/
# 3. Manual entry for problematic specimens
python review_web.py --db project_db/candidates.db \
    --images new_photos/ --enable-manual-entry
# 4. Review all data sources together
# Web interface shows OCR, imported, and manual data
# 5. Export unified dataset
python cli.py export --output project_db/ --version final \
    --include-all-sources
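This guide does not describe how the tool combines sources internally. To make the idea of a unified dataset concrete, here is an illustrative sketch only: it merges records keyed by the Darwin Core term `catalogNumber`, lets manual entries override imported values and imported values override OCR output, and remembers which source supplied each field. Both the precedence order and the record layout are assumptions made for the example.

```python
# Hypothetical precedence: sources later in this list override earlier ones.
PRECEDENCE = ["ocr", "imported", "manual"]


def merge_sources(records_by_source, key="catalogNumber"):
    """Merge per-source record lists into one dict keyed by `key`."""
    merged = {}
    for source in PRECEDENCE:
        for record in records_by_source.get(source, []):
            entry = merged.setdefault(
                record[key], {"fields": {}, "provenance": {}}
            )
            for field, value in record.items():
                if value in ("", None):
                    continue  # an empty value never overwrites real data
                entry["fields"][field] = value
                entry["provenance"][field] = source
    return merged
```

Keeping a per-field provenance map is one simple way to answer "where did this value come from?" during review.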
Use Hybrid Mode when:
- ✅ Combining new digitization with existing records
- ✅ Some specimens require manual data entry
- ✅ Multiple data sources need integration
- ✅ Historical data needs cleaning/standardization
🎯 Mode Selection Guide
| Your Situation | Recommended Mode | Key Benefits |
|---|---|---|
| "I just need data from these photos" | Quick Mode | Fastest, simplest |
| "Quality matters more than speed" | Research Mode | Review workflow |
| "This is for institutional archives" | Production Mode | Compliance, audit |
| "I have photos + existing records" | Hybrid Mode | Multiple sources |
📊 Feature Comparison
| Feature | Quick | Research | Production | Hybrid |
|---|---|---|---|---|
| OCR Processing | ✅ | ✅ | ✅ | ✅ |
| CSV Output | ✅ | ✅ | ✅ | ✅ |
| Database Storage | ❌ | ✅ | ✅ | ✅ |
| Web Review Interface | ❌ | ✅ | ✅ | ✅ |
| Confidence Scoring | Raw scores only | ✅ | ✅ | ✅ |
| Audit Trails | ❌ | ❌ | ✅ | ✅ |
| User Authentication | ❌ | ❌ | ✅ | Optional |
| Multiple Data Sources | ❌ | ❌ | ✅ | ✅ |
| Compliance Reporting | ❌ | ❌ | ✅ | ✅ |
| Manual Data Entry | ❌ | Limited | ✅ | ✅ |
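The comparison table can also be expressed as data, which makes it easy to ask which modes cover a given set of requirements. In this sketch only the ✅ cells count as support; cells marked "Optional" or "Limited" are treated as unsupported, which is a simplifying assumption. `FEATURES` and `modes_with` are illustrative names.

```python
BASE = {"OCR Processing", "CSV Output"}
REVIEW = BASE | {"Database Storage", "Web Review Interface", "Confidence Scoring"}
FULL = REVIEW | {
    "Audit Trails",
    "Multiple Data Sources",
    "Compliance Reporting",
    "Manual Data Entry",
}

# One entry per column of the table; only full (✅) support is recorded.
FEATURES = {
    "Quick": BASE,
    "Research": REVIEW,
    "Production": FULL | {"User Authentication"},
    "Hybrid": FULL,  # User Authentication is "Optional" in the table
}


def modes_with(*required):
    """Return the modes whose feature set includes every required feature."""
    needed = set(required)
    return [mode for mode, features in FEATURES.items() if needed <= features]
```

For example, `modes_with("Audit Trails", "Manual Data Entry")` narrows the choice to Production and Hybrid.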
🔧 Configuration Examples
Quick Mode Config
# config/quick.toml
[ocr]
preferred_engine = "vision"
confidence_threshold = 0.70
[export]
formats = ["csv"]
include_raw = false
Research Mode Config
# config/research.toml
[ocr]
preferred_engine = "vision"
confidence_threshold = 0.80
enable_fallbacks = true
[qc]
flag_low_confidence = true
require_review = true
[export]
formats = ["csv", "dwca"]
include_confidence = true
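One way to picture how `confidence_threshold` and `flag_low_confidence` might interact is sketched below. How the tool actually applies these settings is not stated here, so treat the logic as an assumption: records scoring below the threshold go to a review list, everything else is accepted. The `confidence` key on each record and the function name `flag_records` are hypothetical.

```python
# Values copied from the research config above.
CONFIG = {
    "ocr": {"confidence_threshold": 0.80},
    "qc": {"flag_low_confidence": True, "require_review": True},
}


def flag_records(records, config=CONFIG):
    """Split records into (flagged, accepted) using the configured threshold."""
    threshold = config["ocr"]["confidence_threshold"]
    flag_low = config["qc"]["flag_low_confidence"]
    flagged, accepted = [], []
    for record in records:
        # A score exactly at the threshold is accepted, not flagged.
        if flag_low and record["confidence"] < threshold:
            flagged.append(record)
        else:
            accepted.append(record)
    return flagged, accepted
```

Raising the threshold from the Quick Mode value of 0.70 to 0.80 sends more records to review, which is the trade-off Research Mode makes in favor of quality.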
Production Mode Config
# config/production.toml
[audit]
required = true
user_tracking = true
retain_days = 2555 # 7 years
[qc]
multi_user_review = true
sign_off_required = true
[export]
formats = ["csv", "dwca", "institutional"]
include_audit = true
include_provenance = true
🚀 Getting Started
- Choose your mode based on your needs
- Start with Quick Mode if unsure
- Upgrade to Research/Production as requirements grow
- All modes use the same core commands; only the options differ
The architecture is designed to grow with your needs: start simple and add complexity only when required.
[AAFC]: Agriculture and Agri-Food Canada
[GBIF]: Global Biodiversity Information Facility
[DwC]: Darwin Core
[OCR]: Optical Character Recognition
[API]: Application Programming Interface
[CSV]: Comma-Separated Values
[IPT]: Integrated Publishing Toolkit
[TDWG]: Taxonomic Databases Working Group