AAFC Herbarium Darwin Core Extraction¶
Production-ready toolkit for extracting Darwin Core metadata from herbarium specimen images
๐ View Full Documentation - Complete guides, tutorials, and API reference
๐ฏ What This Does¶
Automatically extracts structured biodiversity data from herbarium specimen photographs using OCR and AI:
- Reads labels (handwritten & printed) from specimen images
- Extracts Darwin Core fields (scientific name, location, date, collector, etc.)
- Outputs standardized data ready for GBIF publication
- Provides review tools for quality validation
Example Workflow¶
Input: Herbarium specimen image Output: Structured database record
catalogNumber,scientificName,eventDate,recordedBy,locality,stateProvince,country
"019121","Bouteloua gracilis (HBK.) Lag.","1969-08-14","J. Looman","Beaver River crossing","Saskatchewan","Canada"
๐ Quick Start¶
# Install
git clone https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025.git
cd aafc-herbarium-dwc-extraction-2025
./bootstrap.sh
# Process specimens
python cli.py process --input photos/ --output results/
# Review results (Quart web app)
python -m src.review.web_app --extraction-dir results/ --port 5002
๐ฆ Current Release: v2.0.0¶
Specimen-Centric Provenance Architecture
What's New in v2.0.0¶
๐ฌ Specimen Provenance System - Complete lineage tracking from raw images through all transformations - Automatic deduplication at (image_sha256, extraction_params) level - Multi-extraction aggregation for improved field candidates - Content-addressed storage with S3 integration
๐ Production-Ready Infrastructure - Async web framework (Quart) for high-performance review - Docker containerization for reproducible deployments - Clean 8MB repository (97% size reduction from v1.x) - Migration tools with full rollback capability
๐ฏ Quality & Efficiency - Confidence-weighted field aggregation across extraction runs - Review workflow with specimen-level tracking - Progressive publication: draft โ batches โ final - Full backward compatibility with v1.x data
๐ Documentation & Migration - Complete release plan: docs/RELEASE_2_0_PLAN.md - Migration guide with safety guarantees - GBIF validation integration roadmap (v2.1.0) - Specimen provenance architecture doc
Why This Matters¶
Architectural shift: - From: Image-centric processing (lost specimen identity) - To: Specimen-centric provenance (complete lineage tracking)
Research impact: - Enables reproducible extraction pipelines - Supports iterative improvement with safety - Production-ready data quality management - Foundation for GBIF-validated publication (v2.1.0)
See CHANGELOG.md for complete release notes.
๐ง Installation¶
Requirements¶
- Python 3.11+
- macOS (Apple Vision OCR) or Linux/Windows (cloud APIs)
Setup¶
# Clone repository
git clone https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025.git
cd aafc-herbarium-dwc-extraction-2025
# Install dependencies
./bootstrap.sh
# Check available OCR engines
python cli.py check-deps
macOS (Recommended)¶
โ Apple Vision API works out-of-the-box (FREE, no API keys)
Windows/Linux¶
Requires cloud API keys. Copy .env.example to .env and configure:
# OpenAI (GPT-4o-mini for direct extraction)
OPENAI_API_KEY="your-key-here"
# Optional: Anthropic Claude, Google Gemini
# ANTHROPIC_API_KEY=""
# GOOGLE_API_KEY=""
See API_SETUP_QUICK.md for detailed setup.
๐ก Core Features¶
Multi-Engine OCR Support¶
| Engine | Platform | Cost/1000* | Quality | Notes |
|---|---|---|---|---|
| Apple Vision | macOS | FREE | Medium | Best for macOS users |
| GPT-4o-mini | All | ~$3.70 | High | Layout-aware, 16 fields |
| Tesseract | All | FREE | Low | Fallback option |
| Azure Vision | All | ~$2.00 | Medium | Cloud alternative |
*Estimated from 500-specimen baseline ($1.85 actual = $3.70/1000)
Intelligent Pipeline Composition¶
Agent-managed optimization: - ๐ Zero budget: Vision API โ Rules engine (7 fields) - ๐ฐ Small budget: GPT-4o-mini direct (16 fields, ~$3.70/1000 specimens) - ๐ฌ Research-grade: Multi-engine ensemble voting (cost varies by providers)
See agents/pipeline_composer.py for decision logic.
Darwin Core Output¶
v1.0 Fields (7): - catalogNumber, scientificName, eventDate, recordedBy - locality, stateProvince, country
v2.0 Fields (16): All v1.0 fields plus: - habitat, minimumElevationInMeters, recordNumber - identifiedBy, dateIdentified, verbatimLocality - verbatimEventDate, verbatimElevation, associatedTaxa
Review & Validation Tools¶
Web interface (recommended):
python -m src.review.web_app --extraction-dir results/ --port 5002
# Access at http://127.0.0.1:5002
Terminal interface:
๐ Data Publication¶
Ready to publish extracted data to GBIF via Canadensys:
-
Export Darwin Core Archive:
-
Generate EML metadata:
-
Upload to Canadensys IPT (browser-based, no installation)
-
Automatic GBIF publication (24-48 hours)
See docs/DATA_PUBLICATION_GUIDE.md for complete workflow.
๐งช Quality & Accuracy¶
Phase 1 Baseline (500 Specimens)¶
OpenAI GPT-4o-mini: - scientificName coverage: 98.0% (490/500) - catalogNumber coverage: 95.4% (477/500) - Actual cost: $1.85 ($0.0037 per specimen) - Status: Production-quality baseline
OpenRouter FREE (20 Specimens): - scientificName coverage: 100% (20/20) - Cost: $0.00 - Status: Validates FREE models outperform paid baseline
v1.0 Apple Vision (2,885 Photos - DEPRECATED)¶
- ScientificName coverage: 5.5% (159/2,885) - FAILED
- Status: Replaced by GPT-4o-mini/OpenRouter approach
โ ๏ธ All extracted data should be manually reviewed before publication
๐ฏ Use Cases¶
โ When to Use This Tool¶
- Digitizing physical herbarium collections
- Creating GBIF-ready biodiversity datasets
- Batch processing specimen photographs
- Extracting structured data from label images
โ Not Suitable For¶
- Live plant identification (use iNaturalist)
- Specimens without readable labels
- Real-time field data collection
๐ Documentation¶
๐ View Full Documentation Site¶
Complete guides, tutorials, and reference: - ๐ Getting Started - Installation and quick start - ๐ User Guide - Processing workflows and GBIF export - ๐ฌ Research - Methodology and quality analysis - ๐ป Developer Guide - Architecture and API reference
Legacy documentation (being migrated to docs site): - Agent Orchestration Framework - Data Publication Strategy - Scientific Provenance Pattern โญ - API Setup Guide
๐ Processing Workflow¶
graph LR
A[Image] --> B[OCR Engine]
B --> C[Text Extraction]
C --> D[Rules Engine]
D --> E[Darwin Core]
E --> F[Review Interface]
F --> G[GBIF Export]
Step-by-Step¶
- Prepare images in a directory
- Run extraction:
python cli.py process --input photos/ --output results/ - Review results: Web or terminal interface
- Export data: Darwin Core CSV ready for GBIF
๐ค Contributing¶
Contributions welcome! See CONTRIBUTING.md for guidelines.
Development Setup¶
๐ System Requirements¶
- Python: 3.11 or higher
- Disk space: ~1GB for dependencies, ~5GB for image cache
- Memory: 4GB minimum (8GB recommended for large batches)
- OS: macOS (best), Linux, Windows
๐ Version History¶
Current: v2.0.0 (October 2025) - Specimen-centric provenance architecture Previous: v1.1.1 (October 2025) - Accessibility improvements and Quart migration Earlier: v1.0.0 (October 2025) - Production baseline with Apple Vision API
See CHANGELOG.md for full version history.
๐ License¶
MIT License - see LICENSE file for details.
๐ Support¶
- Issues: GitHub Issues
- Documentation: docs/
- Examples: docs/workflow_examples.md
๐ Project Status¶
Production Ready โ - โ v2.0.0 specimen provenance architecture released - โ 500-specimen baseline @ 98% quality validated - โ 2,885 photos ready for full-scale processing - โ Repository optimized (8MB, 97% size reduction) - โ Docker containerization and async review interface - ๐ Next: v2.1.0 GBIF validation integration - ๐ Next: Full dataset processing with validated pipeline
Built for Agriculture and Agri-Food Canada (AAFC) Enabling biodiversity data digitization at scale
[AAFC]: Agriculture and Agri-Food Canada [GBIF]: Global Biodiversity Information Facility [DwC]: Darwin Core [OCR]: Optical Character Recognition [API]: Application Programming Interface [CSV]: Comma-Separated Values [IPT]: Integrated Publishing Toolkit [TDWG]: Taxonomic Databases Working Group