# Documentation Architecture: Single Source of Truth

## The Problem: Documentation Drift
When documentation lives in multiple places, you get:

- **Duplication**: the same content in README.md and docs/index.md
- **Sync issues**: updates in code aren't reflected in the docs
- **Maintenance burden**: two places to update for every change
## Our Solution: Single Source of Truth

### 1. Root Files Are Canonical
These files live only in the repository root:
- README.md - GitHub landing page
- CHANGELOG.md - Version history
- CONTRIBUTING.md - Contribution guide
- LICENSE - Legal terms
### 2. Docs Site Includes Root Files
We use symlinks or pymdownx.snippets to include root files in the docs site:
```yaml
# mkdocs.yml
markdown_extensions:
  - pymdownx.snippets:
      base_path: ['.', 'docs']  # Search repository root and docs/
      check_paths: true
```
### 3. Include Syntax
In any docs file, include content from root:
<!-- Include entire file -->
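With the extension configured as above, a page pulls in a root file via the snippets "scissors" marker. A minimal sketch (the `docs/changelog.md` wrapper file is illustrative; the marker itself is standard `pymdownx.snippets` syntax):

```markdown
<!-- docs/changelog.md -->
--8<-- "CHANGELOG.md"
```

The changelog content that follows in this page is produced by this kind of include.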
# Changelog
## [Unreleased]
### Changed
- **CI/Type Checking**: Replaced mypy with Astral's ty type checker ([PR #223](https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/pull/223))
- Completes Astral toolchain: uv (package management) + ruff (linting) + ty (type checking)
- 100x+ faster than mypy, zero installation overhead (uvx)
- Phased rollout: CI integration complete, fixing remaining type issues incrementally
- See `[tool.ty]` in pyproject.toml for configuration and status
### Fixed
- **Type Safety**: Fixed 9 type safety issues found by ty
- `Image.LANCZOS` deprecation → `Image.Resampling.LANCZOS`
- Missing `List` import in dwc/archive.py
- OpenAI optional dependency shadowing
- Path type narrowing in cli.py
- **CI**: Fixed 22 ruff linting errors (unused variables, missing imports, boolean comparisons)
- **Dependencies**: Synced uv.lock to match pyproject.toml version 2.0.0
### Future Development
- 16 Darwin Core fields (9 additional: habitat, elevation, recordNumber, etc.)
- Layout-aware prompts (TOP vs BOTTOM label distinction)
- Ensemble voting for research-grade quality
## [2.0.0] - 2025-10-22
### Specimen-Centric Provenance Architecture
**Major Achievement:** Fundamental architectural shift from image-centric to specimen-centric data model, enabling full lineage tracking and production-scale data quality management.
#### Added - Specimen Provenance System
- **Specimen Index** (`src/provenance/specimen_index.py`)
- SQLite database tracking specimens through transformations and extraction runs
- Automatic deduplication at (image_sha256, extraction_params) level
- Multi-extraction aggregation per specimen for improved candidate fields
- Data quality flagging: catalog duplicates, malformed numbers, missing fields
- Full audit trail from original camera files to published DwC records
- **Deduplication Logic**
- Deterministic: same (image, params) = cached result, no redundant processing
- Intentional re-processing supported: different params aggregate to better candidates
- Prevents waste: identified 2,885 specimens extracted twice (5,770 → 2,885)
- Cost savings: eliminates duplicate API calls and processing time
- **Specimen-Centric Data Model**
- Specimen identity preserved through image transformations
- Provenance DAG: original files → transformations → extractions → review
- Content-addressed images linked to specimen records
- Support for multiple source formats per specimen (JPEG, NEF raw)
- **Data Quality Automation**
- Automatic detection of catalog number duplicates across specimens
- Pattern validation for malformed catalog numbers
- Perceptual hash detection for duplicate photography
- Missing required fields flagged for human review
- **Multi-Extraction Aggregation**
- Combines results from multiple extraction attempts per specimen
- Selects best candidate per field (highest confidence)
- Enables iterative improvement: reprocess with better models/preprocessing
- All extraction attempts preserved for audit trail
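The deduplication key described above can be sketched as follows; `dedup_key` is a hypothetical helper for illustration, not the specimen index's actual API:

```python
import hashlib
import json

def dedup_key(image_sha256: str, extraction_params: dict) -> str:
    """Deterministic deduplication key: identical (image, params) pairs
    always hash to the same key, so a cached result can be reused."""
    # Serialize with sorted keys so dict ordering cannot change the hash.
    canonical = json.dumps(extraction_params, sort_keys=True)
    params_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{image_sha256}:{params_hash}"

# Same params in a different order -> same key (cached, skipped).
# Different params -> different key (intentional re-processing, aggregated).
k1 = dedup_key("abc123", {"model": "gpt-4o-mini", "temperature": 0})
k2 = dedup_key("abc123", {"temperature": 0, "model": "gpt-4o-mini"})
k3 = dedup_key("abc123", {"model": "qwen2.5-vl-72b", "temperature": 0})
assert k1 == k2 and k1 != k3
```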
#### Added - Migration & Analysis Tools
- **Migration Script** (`scripts/migrate_to_specimen_index.py`)
- Analyzes existing raw.jsonl files from historical runs
- Populates specimen index without modifying original data
- Detects duplicate extractions and reports statistics
- Runs comprehensive data quality checks
- Example usage:
```bash
python scripts/migrate_to_specimen_index.py \
--run-dir full_dataset_processing/* \
--index specimen_index.db \
--analyze-duplicates \
--check-quality
```
- **Extraction Run Analysis** (`docs/extraction_run_analysis_20250930.md`)
- Documented root cause of duplicate extractions in run_20250930_181456
- ALL 5,770 extractions failed (missing OPENAI_API_KEY)
- Every specimen processed exactly twice (no deduplication)
- Provides recommendations for prevention
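The data quality checks amount to SQL invariant queries over the specimen index. A minimal sketch against a hypothetical two-column table (the real index uses a much richer schema):

```python
import sqlite3

# Hypothetical minimal table; the production index tracks far more.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimens (specimen_id TEXT, catalog_number TEXT)")
conn.executemany(
    "INSERT INTO specimens VALUES (?, ?)",
    [("s1", "019121"), ("s2", "019121"), ("s3", "019500")],
)

# Invariant: one catalog number identifies exactly one specimen.
# Any violation becomes a quality flag routed to human review.
duplicates = conn.execute(
    "SELECT catalog_number, COUNT(*) AS n FROM specimens"
    " GROUP BY catalog_number HAVING n > 1"
).fetchall()
assert duplicates == [("019121", 2)]
```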
#### Added - Production Infrastructure
- **Quart + Hypercorn Migration** (Async Review System)
- Migrated review web app from Flask to Quart for async performance
- All routes converted to async for better concurrency
- GBIF validation now non-blocking (async HTTP with aiohttp)
- Hypercorn ASGI server replaces Flask development server
- Production-ready async architecture
- **Docker Support** (`Dockerfile`, `docker-compose.yml`)
- Production-ready containerization with multi-stage builds
- Optimized Python 3.11-slim base image
- Health checks and restart policies
- Volume mounting for data persistence
- Port mapping for review UI (5002)
- **Monitor TUI Improvements**
- Fixed progress warnings from manifest.json/environment.json format detection
- Support for both old and new metadata formats
- Graceful fallback when metadata files missing
- Proper specimen count estimation from raw.jsonl
#### Documentation - Comprehensive Guides
- **Architecture Documentation** (`docs/specimen_provenance_architecture.md`)
- Complete specimen-centric data model specification
- Transformation provenance DAG design
- Extraction deduplication logic and examples
- Data quality invariants and flagging rules
- Full integration examples and migration patterns
- SQL schema and API documentation
- **Release Plan** (`docs/RELEASE_2_0_PLAN.md`)
- Three-phase migration strategy (preserve → populate → publish)
- Progressive publication workflow (draft → batches → final)
- Data safety guarantees and rollback procedures
- Review UI integration requirements
- Timeline and success criteria
#### Research Impact
**Architectural Foundation:**
- **From**: Image-centric, duplicates allowed, no specimen tracking
- **To**: Specimen-centric, automatic deduplication, full provenance
**Economic Impact:**
- Eliminates redundant extraction attempts (identified 2,885 duplicates)
- Prevents wasted API calls on already-processed specimens
- Enables cost-effective iterative improvement via aggregation
**Scientific Impact:**
- Full lineage tracking for reproducibility
- Cryptographic traceability (content-addressed images)
- Data quality automation (catalog validation, duplicate detection)
- Supports progressive publication with human review tracking
#### Technical Implementation
- **Database Schema**: 7 tables tracking specimens, transformations, extractions, aggregations, reviews, quality flags
- **Deduplication Key**: SHA256(extraction_params) for deterministic caching
- **Aggregation Strategy**: Multi-extraction results combined, best candidate per field selected
- **Quality Checks**: Automated SQL queries detect violations of expected invariants
- **Migration Safety**: Additive only, original data never modified, full rollback capability
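The aggregation strategy above can be sketched as a best-candidate-per-field merge; the `(value, confidence)` structure here is illustrative, not the stored schema:

```python
def aggregate(extractions: list[dict]) -> dict:
    """Combine multiple extraction attempts for one specimen, keeping the
    highest-confidence candidate per field. Each attempt maps a Darwin
    Core field name to a (value, confidence) pair."""
    best: dict[str, tuple[str, float]] = {}
    for attempt in extractions:
        for field, (value, conf) in attempt.items():
            if field not in best or conf > best[field][1]:
                best[field] = (value, conf)
    return {field: value for field, (value, _) in best.items()}

# Two runs with different strengths: the merge takes the best of each.
run_a = {"scientificName": ("Bouteloua gracilis", 0.92), "locality": ("Beaver R.", 0.40)}
run_b = {"scientificName": ("Bouteloua sp.", 0.55), "locality": ("Beaver River crossing", 0.88)}
assert aggregate([run_a, run_b]) == {
    "scientificName": "Bouteloua gracilis",
    "locality": "Beaver River crossing",
}
```

Because all attempts are preserved, re-running with a better model simply adds candidates and the merge improves monotonically.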
#### Backward Compatibility
✅ **Fully Backward Compatible**
- Existing extraction runs remain valid (no modification)
- Old workflow continues to work without migration
- New features opt-in via migration script
- No breaking changes to CLI interface
- Gradual adoption supported
#### Production Readiness
- ✅ Async web architecture (Quart + Hypercorn)
- ✅ Docker containerization with health checks
- ✅ Data quality automation
- ✅ Full provenance tracking
- ✅ Progressive publication workflow
- ✅ Safe migration with rollback capability
### Changed - Infrastructure
- Migrated review web app from Flask to Quart (async)
- Updated monitor TUI for manifest.json format support
- Enhanced error handling in review system
### Fixed
- Monitor TUI progress warnings (manifest/environment format detection)
- Review UI port already in use error handling
- Auto-detection priority (real data before test data)
- S3 image URL auto-detection from manifest.json
### Notes
Version 2.0.0 represents a fundamental architectural maturity milestone, transitioning from proof-of-concept extraction to production-scale specimen management with full provenance tracking, data quality automation, and human review workflows. This release sets the foundation for progressive data publication and long-term institutional deployment.
## [1.1.1] - 2025-10-11
### Added - Accessibility Enhancements
- **Constitutional Principle VI: Information Parity and Inclusive Design**
- Elevated accessibility to constitutional status (Core Principle VI)
- Cross-reference to meta-project pattern: `information-parity-design.md`
- Validation requirements: VoiceOver compatibility, keyboard-first, screen reader native
- **Keyboard-First Review Interface**
- Keyboard shortcuts with confirmation dialogs (a/r/f for approve/reject/flag)
- Double-press bypass (500ms window) for power users
- Prevents accidental actions during review workflow
- **Enhanced Image Interaction**
- Cursor-centered zoom (focal point under cursor stays stationary)
- Pan boundary constraints (prevents image escaping container)
- Safari drag-and-drop prevention (ondragstart blocking)
- **Status Filtering**
- Filter buttons for All/Critical/High/Pending/Approved/Flagged/Rejected statuses
- Quick access to specimens needing review
- Visual indication of current filter state
- **TUI Monitor Enhancements**
- iTerm2 inline specimen image rendering via rich-pixels
- Real-time image preview (60x40 terminal characters)
- 3-column layout: event stream + field quality | specimen image
- Automatic image updates as extraction progresses
### Changed
- Review interface improvements for keyboard-first navigation
- Enhanced TUI monitor with multi-panel layout
- Updated constitution to v1.1.0 with accessibility principle
### Documentation
- Added `docs/ACCESSIBILITY_REQUIREMENTS.md` - project-level implementation roadmap
- Phase 1-3 priorities: Critical fixes → Enhanced accessibility → Documentation
- Success metrics and testing requirements defined
### Notes
This patch release prepares the production baseline (v1.1.x-stable) before beginning v2.0.0 accessibility-first redesign. All changes are backward-compatible with v1.1.0.
## [1.1.0] - 2025-10-09
### Multi-Provider Extraction with FREE Tier Support
**Major Achievement:** Architectural shift to multi-provider extraction with zero-cost production capability
#### Added - OpenRouter Integration
- **Multi-Model Gateway** (`scripts/extract_openrouter.py`)
- Access to 400+ vision models via unified OpenRouter API
- FREE tier support (Qwen 2.5 VL 72B, Llama Vision, Gemini)
- Automatic retry with exponential backoff
- Rate limit handling with progress tracking
- Model selection interface with cost/quality trade-offs
- **Zero-Cost Production Pipeline**
- Qwen 2.5 VL 72B (FREE): 100% scientificName coverage
- Better quality than paid OpenAI baseline (98% coverage)
- Removes financial barrier to herbarium digitization
- Unlimited scale without queue constraints
#### Added - Scientific Provenance System
- **Reproducibility Framework** (`src/provenance.py`)
- Git-based version tracking for complete reproducibility
- SHA256 content-addressed data lineage
- Immutable provenance fragments
- Complete system metadata capture (Python, OS, dependencies)
- Graceful degradation for non-git environments
- **Pattern Documentation** (`docs/SCIENTIFIC_PROVENANCE_PATTERN.md`)
- Complete guide with real-world herbarium examples
- Best practices for scientific reproducibility
- Integration patterns with Content-DAG architecture
- Anti-patterns and evolution pathways
- Working examples: `examples/provenance_example.py`, `examples/content_dag_herbarium.py`
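The framework's core idea, a content address plus code version plus system metadata, can be sketched like this; `provenance_fragment` and its exact fields are assumptions for illustration, not the `src/provenance.py` API:

```python
import hashlib
import platform
import subprocess

def provenance_fragment(artifact: bytes) -> dict:
    """Immutable provenance record for one artifact: content address plus
    the code version and system that produced it (simplified sketch)."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = None  # graceful degradation outside a git checkout
    return {
        "content_sha256": hashlib.sha256(artifact).hexdigest(),
        "git_commit": commit,
        "python": platform.python_version(),
        "os": platform.system(),
    }

fragment = provenance_fragment(b'{"scientificName": "Bouteloua gracilis"}')
```

Because the record is derived entirely from content and environment, regenerating it on the same inputs yields the same fragment, which is what makes the lineage verifiable.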
#### Production Results
- **Quality Baseline & FREE Model Validation**
- Phase 1: 500 specimens @ 98% scientificName coverage (OpenAI GPT-4o-mini, $1.85)
- Validation: 20 specimens @ 100% coverage (OpenRouter FREE, $0.00)
- Dataset: 2,885 photos ready for full-scale processing
- Validates FREE models outperform paid baseline
- Complete provenance tracking for scientific publication
- **Evidence Committed**
- Phase 1 baseline statistics: `full_dataset_processing/phase1_baseline/extraction_statistics.json`
- OpenRouter validation results: `openrouter_test_20/raw.jsonl`
- Quality metrics documented for peer review
#### Technical Architecture
- **Provider Abstraction**
- Unified interface for multiple AI providers
- Clean separation: OpenAI, OpenRouter, future providers
- Transparent fallback and retry mechanisms
- No vendor lock-in or single point of failure
- **Performance Optimizations**
- Rate limit handling with automatic backoff
- Progress tracking with ETA calculation
- Efficient image encoding (base64)
- JSONL streaming for large datasets
- **Version Management System**
- Single source of truth: `pyproject.toml`
- Programmatic version access: `src/__version__.py`
- Automated consistency checking: `scripts/check_version_consistency.py`
- Prevents version drift across documentation
#### Research Impact
**Architectural shift:**
- **From**: Single provider, paid, queue-limited
- **To**: Multi-provider, FREE option, unlimited scale
**Economic impact:**
- Enables zero-cost extraction at production scale
- Removes financial barrier for research institutions
- Democratizes access to AI-powered digitization
**Scientific impact:**
- Full reproducibility for scientific publication
- Cryptographic traceability of research outputs
- Complete methodology documentation
- Sets new baseline for herbarium extraction quality
#### Changed - Documentation Updates
- Updated README.md with v1.1.0 features and results
- Added Scientific Provenance Pattern guide
- Enhanced with OpenRouter integration examples
- Version consistency across all public-facing docs
### Breaking Changes
None - fully backward compatible with v1.0.0
## [1.0.0] - 2025-10-06
### Production Release - AAFC Herbarium Dataset
**Major Achievement:** 2,885 specimen photos processed, quality baseline established
#### Added - v1.0 Deliverables
- **Production Dataset** (`deliverables/v1.0_vision_api_baseline.jsonl`)
- 2,885 herbarium photos processed with Apple Vision API
- **Quality: 5.5% scientificName coverage (FAILED - replaced in v1.1.0)**
- 7 Darwin Core fields attempted
- Apple Vision API (FREE) + rules engine
- Total cost: $0 (but unusable quality)
- ✅ **Ground Truth Validation** (`deliverables/validation/human_validation.jsonl`)
- 20 specimens manually validated
- Documented accuracy baselines
- Quality metrics calculated
- **Complete Documentation**
- Extraction methodology documented
- Quality limitations identified
- Upgrade path to v2.0 designed
#### Added - Agent Orchestration Framework
- **Pipeline Composer Agent** (`agents/pipeline_composer.py`)
- Cost/quality/deadline optimization
- Engine capability registry (6 engines)
- Intelligent routing: FREE-first with paid fallback
- Progressive enhancement strategies
- Ensemble voting support for research-grade quality
- **Data Publication Guide** (`docs/DATA_PUBLICATION_GUIDE.md`)
- GBIF/Canadensys publication workflow
- Darwin Core Archive export scripts
- CC0 licensing recommendations
- Deployment context strategies (Mac dev / Windows production)
- **Enhanced Configuration**
- `config/config.gpt4omini.toml` - GPT-4o-mini direct extraction
- Layout-aware prompts (`config/prompts/image_to_dwc_v2.*.prompt`)
- Expanded 16-field Darwin Core schema
#### Technical Improvements - v1.0
- **API Integration**
- Fixed OpenAI Chat Completions API format
- Prompt loading from files (system + user messages)
- JSON response format for structured extraction
- Model: gpt-4o-mini (cost-effective, layout-aware)
- **Architecture**
- Plugin registry pattern (additive-only, zero conflicts)
- Config override pattern (branch-specific configurations)
- Parallel development enabled (v2-extraction + agent-orchestration branches)
#### Quality Metrics - v1.0 Apple Vision (DEPRECATED)
- **ScientificName coverage:** 5.5% (159/2,885) - FAILED
- **Status:** Replaced by GPT-4o-mini/OpenRouter approach in v1.1.0
- **Exact matches:** 0% (on 20-specimen validation)
- **Partial matches:** ~10-15%
- **Known limitations:** OCR accuracy insufficient for production use
#### v2.0 Preview (In Progress)
- **16 Darwin Core fields** (9 additional: habitat, elevation, recordNumber, identifiedBy, etc.)
- **Layout-aware extraction** (TOP vs BOTTOM label distinction)
- **Expected quality:** ~70% accuracy (vs ~15% baseline)
- **Cost:** $1.60 total or FREE overnight (15-20 hours)
- **Agent-managed pipelines:** "Consider all means accessible in the world"
### Changed - Documentation Overhaul
- Updated README with v1.0 production status
- Reorganized docs for clarity
- Added deployment context considerations
- Improved API setup instructions
### Fixed
- OpenAI API endpoint (responses.create → chat.completions.create)
- Environment variable naming (OPENAI_KEY → OPENAI_API_KEY)
- Model config passthrough for gpt4omini
- Prompt loading in image_to_dwc engine
## [1.0.0-beta.2] - 2025-10-04
### Added - Storage Abstraction Layer
- **Storage Backend Architecture** – Pluggable storage layer decoupled from core extraction logic
- **ImageLocator Protocol** (`src/io_utils/locator.py`) – Storage-agnostic interface for image access
- **LocalFilesystemLocator** – Traditional directory-based storage backend
- **S3ImageLocator** – AWS S3 and S3-compatible storage (MinIO) backend
- **CachingImageLocator** – Transparent pass-through caching decorator with LRU eviction
- **Factory Pattern** – Configuration-driven backend instantiation (`locator_factory.py`)
- **Storage Backends Supported**
- **Local Filesystem** – Direct directory access (default, backward compatible)
- **AWS S3** – Cloud object storage with automatic credential handling
- **MinIO** – Self-hosted S3-compatible storage via custom endpoint
- **Future Ready** – Easy to add HTTP, Azure Blob, Google Cloud Storage
- **Transparent Caching System**
- **Automatic Caching** – Remote images cached locally on first access
- **LRU Eviction** – Configurable cache size limit with least-recently-used eviction
- **Cache Management** – Statistics (`get_cache_stats()`), manual clearing
- **SHA256 Keys** – Robust cache keys handling special characters and long names
- **Configuration Support**
- **TOML Configuration** – `[storage]` section in `config/config.default.toml`
- **Example Configs** – `config/config.s3-cached.toml` for S3 with caching
- **Backward Compatible** – Omit `[storage]` section to use local filesystem
- **Environment Aware** – AWS credentials via environment or explicit config
- **Comprehensive Testing**
- **18 Passing Tests** – `tests/unit/test_locators.py` covering all components
- **LocalFilesystemLocator** – 11 tests for local storage operations
- **CachingImageLocator** – 7 tests for caching behavior and eviction
- **Edge Cases** – Missing files, invalid paths, cache size limits
- **Complete Documentation**
- **Architecture Guide** – `docs/STORAGE_ABSTRACTION.md` with patterns and examples
- **Configuration Guide** – Storage backend configuration templates
- **Migration Guide** – Phase 1 complete (core abstractions), Phase 2 deferred (CLI integration)
- **Release Process** – `docs/RELEASE_PROCESS.md` for versioning and release guidelines
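A `[storage]` section might look like the following; all key names here are illustrative assumptions, not the shipped configuration schema (see `config/config.s3-cached.toml` for the real example):

```toml
# Sketch of a [storage] section; key names are illustrative.
[storage]
backend = "s3"                     # "local" (default) | "s3"
bucket = "herbarium-images"
endpoint_url = "https://minio.example.org"  # only for S3-compatible stores

[storage.cache]
enabled = true
max_size_mb = 2048                 # LRU eviction above this size
```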
### Technical Implementation - Storage Abstraction
- **Protocol-Based Design** – Duck typing via `Protocol`, not abstract base classes
- **Decorator Pattern** – Caching as transparent wrapper, not baked into backends
- **Strategy Pattern** – Pluggable backends selected at runtime
- **Lazy Imports** – boto3 only imported when the S3 backend is needed
- **Performance Optimized** – `get_local_path()` optimization for direct filesystem access
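The protocol-plus-decorator design can be sketched in a few lines; method names beyond the class names listed above are assumptions, and the real `ImageLocator` exposes more operations (e.g. `get_local_path()`):

```python
from pathlib import Path
from typing import Protocol

class ImageLocator(Protocol):
    """Structural interface: any object with this method qualifies as a
    backend; no abstract base class required."""
    def get_image(self, identifier: str) -> bytes: ...

class LocalFilesystemLocator:
    def __init__(self, root: Path) -> None:
        self.root = root

    def get_image(self, identifier: str) -> bytes:
        return (self.root / identifier).read_bytes()

class CachingImageLocator:
    """Decorator: wraps any ImageLocator and memoizes reads. The wrapped
    backend never knows caching exists."""
    def __init__(self, inner: ImageLocator) -> None:
        self.inner = inner
        self._cache: dict[str, bytes] = {}

    def get_image(self, identifier: str) -> bytes:
        if identifier not in self._cache:
            self._cache[identifier] = self.inner.get_image(identifier)
        return self._cache[identifier]
```

An S3 backend drops in by implementing the same method, and the cache wraps it unchanged, which is the "transparent pass-through" property described above.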
### Backward Compatibility
- ✅ **No Breaking Changes** – Existing local filesystem workflows unaffected
- ✅ **Optional Feature** – Storage abstraction activated via configuration
- ✅ **CLI Unchanged** – Current `cli.py` works perfectly with local filesystem
- ✅ **Deferred Integration** – CLI migration to ImageLocator deferred to future release
### Added - Modern UI/UX System (2025-09-26)
- **Rich Terminal User Interface (TUI)** – Professional interactive terminal experience
- Real-time progress tracking with animated progress bars and live statistics
- Interactive configuration wizards for easy setup
- Menu-driven navigation with keyboard support
- Visual error reporting and engine usage charts
- Built with Rich library for beautiful terminal displays
- **Modern Web Dashboard** – Real-time web interface with live updates
- WebSocket-based real-time progress updates
- Interactive charts and visual statistics (Chart.js integration)
- Modern responsive design with Tailwind CSS
- Multi-user support for team environments
- FastAPI backend with async WebSocket support
- **Unified Interface Launcher** – Single entry point for all UI options
- Interactive menu for interface selection
- Direct launch options via command-line flags (`--tui`, `--web`, `--cli`, `--trial`)
- Automatic dependency checking and installation guidance
- Comprehensive help system and documentation
- **Centralized Progress Tracking System** – Unified real-time updates
- Abstract progress tracker with multiple callback support
- Integration hooks in existing CLI processing pipeline
- Support for TUI, web, and file-based progress logging
- Async callback support for WebSocket broadcasting
- Comprehensive statistics tracking (engine usage, error reporting, timing)
### Enhanced
- **CLI Integration** – Enhanced existing command-line interface
- Added progress tracking hooks to `cli.py` processing pipeline
- Maintains backward compatibility with existing workflows
- Optional progress tracking (graceful fallback if tracker unavailable)
- Image counting and batch processing optimization
- **Testing Infrastructure** – Comprehensive UI testing framework
- Automated dependency checking and validation
- Integration tests for all UI components
- Progress tracking system validation
- Interface import and functionality testing
- Non-interactive demo system for CI/CD
### Technical Implementation
- **Dependencies Added**: `rich`, `fastapi`, `uvicorn`, `jinja2` for UI components
- **Architecture**: Modular design with interface abstraction
- **Performance**: Async processing to avoid blocking UI updates
- **Compatibility**: Graceful degradation when optional UI dependencies unavailable
- **Integration**: Seamless integration with existing processing pipeline
### User Experience Improvements
- **From**: Basic command-line non-interactive execution with text-only output
- **To**: Professional multi-interface system matching CLI agentic UX quality
- ✅ Real-time progress visualization with animated elements
- ✅ Interactive configuration wizards and guided setup
- ✅ Live error reporting and actionable feedback
- ✅ Multiple interface options for different user preferences
- ✅ Professional branding and consistent visual design
- ✅ Context-aware help and comprehensive documentation
## [0.3.0] - 2025-09-25
### Added - OCR Research Breakthrough
- **Comprehensive OCR Engine Analysis** – First definitive study of OCR performance for herbarium specimen digitization
- **Major Finding**: Apple Vision OCR achieves 95% accuracy vs Tesseract's 15% on real herbarium specimens
- **Economic Impact**: $1600/1000 specimens cost savings vs manual transcription
- **Production Impact**: Enables automated digitization with minimal manual review (5% vs 95%)
- **Research Infrastructure**: Complete testing framework for reproducible OCR evaluation
- **Documentation**: `docs/research/COMPREHENSIVE_OCR_ANALYSIS.md` with full methodology and findings
- **Advanced OCR Testing Infrastructure**
- Multi-engine comparison framework supporting Apple Vision, Claude Vision, GPT-4 Vision, Google Vision
- Comprehensive preprocessing evaluation with 10+ enhancement techniques
- Real specimen testing on AAFC-SRDC collection with statistical analysis
- Reproducible testing protocols and automated evaluation scripts
- **Production-Ready Apple Vision Integration**
- Native macOS OCR engine with 95% accuracy on herbarium specimens
- Zero API costs and no vendor lock-in for primary processing
- Enhanced vision_swift engine with macOS compatibility improvements
- Integration with existing CLI processing pipeline
- **Research Documentation System**
- `docs/research/` directory with comprehensive analysis and methodology
- Updated project documentation reflecting OCR findings
- Production deployment guidelines based on empirical testing
- Future research directions for vision API integration
### Changed
- **OCR Engine Recommendations**: Apple Vision now primary choice, Tesseract not recommended
- **Processing Pipeline**: Updated to use Apple Vision as default OCR engine
- **Documentation**: README, roadmap, and guides updated with research findings
- **Installation Guide**: OCR engine selection based on accuracy testing
### Technical Impact
- **Eliminates API dependency** for 95% of herbarium specimen processing
- **Reduces manual labor** from 95% to 5% of specimens requiring review
- **Enables production deployment** with enterprise-grade accuracy at zero marginal cost
- **Establishes evidence-based best practices** for institutional herbarium digitization
## [0.2.0] - 2024-09-24
### Added - Phase 1 Major Enhancements
- **Versioned DwC-A Export System** ([#158](https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/issues/158))
- Rich provenance tracking with semantic versioning, git integration, timestamps
- Configurable bundle formats ("rich" vs "simple")
- Embedded manifests with file checksums and comprehensive metadata
- New `cli.py export` command for streamlined export workflows
- **Official Schema Integration** ([#188](https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/issues/188))
- Automatic fetching of official DwC/ABCD schemas from TDWG endpoints
- Intelligent caching system with configurable update intervals
- Schema validation and compatibility checking
- `SchemaManager` class for high-level schema operations
- **Enhanced Mapping System**
- Fuzzy matching and similarity-based mapping suggestions
- Auto-generation of mappings from official schemas
- Configuration-driven mapping rules with dynamic updates
- Integration with existing mapper functionality
- **Enhanced GBIF Integration**
- Comprehensive GBIF API client with taxonomy and locality verification
- Configurable endpoints, retry logic, and rate limiting
- Enhanced error handling and metadata tracking
- Support for occurrence validation and fuzzy matching
- **Comprehensive Documentation**
- New documentation: API reference, user guide, workflow examples, FAQ, troubleshooting
- Schema mapping guide with practical examples
- Enhanced export and reporting documentation
- **Expanded Testing**
- New unit tests for schema management and enhanced mapping
- Integration tests for end-to-end workflows
- Enhanced prompt coverage testing harness
- Comprehensive test coverage for new functionality
### Enhanced
- **Configuration System**
- Extended configuration options for schema management, GBIF integration
- Export format preferences and behavior settings
- Enhanced validation and error reporting
- **CLI Improvements**
- Better error handling and user feedback
- Support for schema management operations
- Enhanced archive creation workflows
### Infrastructure
- **Schema Cache**: Official schemas cached locally for offline operation
- **Package Structure**: New modules for schema management and enhanced functionality
- **Performance**: Caching and optimization for schema operations
### Previous Changes
- :seedling: uv lockfile and bootstrap script for quick environment setup
- :label: expand mapping rules for collector numbers and field note vocabulary
- :dog: bootstrap script now runs linting and tests after syncing dependencies
- :bug: bootstrap script installs uv if missing
- :bug: avoid auto-registering unimplemented multilingual OCR engine
- :bug: normalize `[ocr].langs` for PaddleOCR, multilingual, and Tesseract engines so ISO 639-1/639-2 codes interoperate out of the box ([#138](https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/issues/138))
- :memo: outline testing and linting expectations in the development guide
## [0.1.4] - 2025-09-10
### Added
- adaptive threshold preprocessor with selectable Otsu or Sauvola binarization
- configurable GBIF endpoints via `[qc.gbif]` config section
- core Darwin Core field mappings and controlled vocabularies
- load custom Darwin Core term mappings via `[dwc.custom]` config section
- versioned Darwin Core Archive exports with run manifest
- taxonomy and locality verification against GBIF with graceful error handling
- track review bundle imports with audit entries
### Fixed
- normalize `typeStatus` citations to lowercase using vocabulary rules
- record review import audits in the main application database
### Docs
- document adaptive thresholding options in preprocessing and configuration guides
- document GBIF endpoint overrides in QC and configuration guides
- document custom term mappings and vocabulary examples
- describe versioned exports in README and export guide
## [0.1.3] - 2025-09-08
### Docs
- mark developer documentation milestone; refine roadmap and TODO priorities (non-breaking, optional upgrade)
## [0.1.2] - 2025-09-03
### Added
- support GPT image-to-Darwin Core extraction with default prompts
- :gear: configurable task pipeline via `pipeline.steps`
- :sparkles: interactive candidate review TUI using Textual
- :sparkles: lightweight web review server for OCR candidate selection
- :sparkles: export/import review bundles with manifest and semantic versioning
- :sparkles: spreadsheet utilities for Excel and Google Sheets review
- :sparkles: automatically open image files when reviews start with optional `--no-open` flag
### Fixed
- guard against non-dict GPT responses to avoid crashes
- handle multiple reviewer decisions per image when importing review bundles
### Changed
- :recycle: load role-based GPT prompts and pass messages directly to the API
### Docs
- outline review workflow for TUI, web, and spreadsheet interfaces
## [0.1.1] - 2025-09-02
### Added
- :recycle: Load Darwin Core fields from configurable schema files and parse URIs
- :card_file_box: Adopt SQLAlchemy ORM models for application storage
- :lock: Support `.env` secrets and configurable GPT prompt templates
### Changed
- :memo: Document configuration, rules and GPT setup
- :package: Move prompt templates under `config/prompts`
### Removed
- :fire: Legacy hard-coded prompt paths
## [0.1.0] - 2025-09-01
### Added
- :construction: project skeleton with CLI and configurable settings
- :package: wheel packaging with importlib-based config loading
- :sparkles: DWC schema mapper and GPT-based extraction modules
- :crystal_ball: Vision Swift and Tesseract OCR engines with pluggable registry
- :hammer_and_wrench: preprocessing pipeline, QC utilities, and GBIF verification stubs
- :card_file_box: SQLite database with resume support and candidate review CLI
- :memo: developer documentation, sample Darwin Core Archive, and comprehensive tests
### Changed
- :loud_sound: replace print statements with logging
### Fixed
- :bug: handle missing git commit metadata
- :bug: correct mapper schema override
[Unreleased]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v2.0.0...HEAD
[2.0.0]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v1.1.1...v2.0.0
[1.1.1]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v1.1.0...v1.1.1
[1.1.0]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v1.0.0...v1.1.0
[1.0.0]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v1.0.0-beta.2...v1.0.0
[1.0.0-beta.2]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v1.0.0-alpha.1...v1.0.0-beta.2
[1.0.0-alpha.1]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v0.3.0...v1.0.0-alpha.1
[0.3.0]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v0.2.0...v0.3.0
[0.2.0]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v0.1.4...v0.2.0
[0.1.4]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v0.1.3...v0.1.4
[0.1.3]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v0.1.2...v0.1.3
[0.1.2]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v0.1.1...v0.1.2
[0.1.1]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/compare/v0.1.0...v0.1.1
[0.1.0]: https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/releases/tag/v0.1.0
<!-- Include specific lines -->
---
> **📖 [View Full Documentation](https://aafc.devvyn.ca)** - Complete guides, tutorials, and API reference
---
## 🎯 What This Does
Automatically extracts structured biodiversity data from herbarium specimen photographs using OCR and AI:
- **Reads labels** (handwritten & printed) from specimen images
- **Extracts Darwin Core fields** (scientific name, location, date, collector, etc.)
- **Outputs standardized data** ready for GBIF publication
- **Provides review tools** for quality validation
### Example Workflow
**Input:** Herbarium specimen image
**Output:** Structured database record
```csv
catalogNumber,scientificName,eventDate,recordedBy,locality,stateProvince,country
"019121","Bouteloua gracilis (HBK.) Lag.","1969-08-14","J. Looman","Beaver River crossing","Saskatchewan","Canada"
```

## 🚀 Quick Start
```bash
# Install
git clone https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025.git
cd aafc-herbarium-dwc-extraction-2025
./bootstrap.sh

# Process specimens
python cli.py process --input photos/ --output results/

# Review results (Quart web app)
```
<!-- Include code from source -->
```python
import hashlib
import json
import logging
import sqlite3
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)


@dataclass
class OriginalFile:
    """Original camera file for a specimen."""
```
4. Symlink for Navigation¶
For files that need to appear in nav (like CHANGELOG):
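A minimal sketch of the symlink step (the paths here assume the repository root as the working directory and `docs/` as the docs source directory; adjust to the actual layout):

```shell
# Expose the root CHANGELOG inside docs/ so MkDocs can pick it up.
# The relative target (../CHANGELOG.md) resolves from inside docs/.
mkdir -p docs
ln -sf ../CHANGELOG.md docs/CHANGELOG.md
```

Note that `mkdocs build --strict` will flag the link if the target ever goes missing, which is the behavior the validation workflow relies on.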
Then reference in mkdocs.yml:
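The nav entry might look like this (a sketch; the surrounding `nav` items are assumptions, not the project's actual configuration):

```yaml
# mkdocs.yml (sketch) - surface the symlinked file in the site nav
nav:
  - Home: index.md
  - Changelog: CHANGELOG.md
```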
Benefits¶
- ✅ Single Source: Edit once, appears everywhere
- ✅ Always Synced: Docs automatically reflect latest code
- ✅ No Duplication: One canonical version of each file
- ✅ Code Examples: Include actual source code, not copy-paste
Examples in This Project¶
Including Code Snippets¶
Instead of copying code into docs:
<!-- BAD: Duplicated code -->

```python
from src.provenance.specimen_index import SpecimenIndex
```
<!-- GOOD: Include from source -->

```python
import hashlib
import json
import logging
import sqlite3
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)
```
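In the docs source, a block like the one above is produced by a one-line snippets directive rather than pasted code. A sketch of what that directive could look like (the source path is inferred from the import in the BAD example, and the docs filename is hypothetical):

```markdown
<!-- docs/some-page.md (hypothetical) -->
--8<-- "src/provenance/specimen_index.py"
```

This includes the whole file; recent pymdownx.snippets releases also accept section markers for including only part of a source file.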
Including Root Files¶
Instead of duplicating README content:
<!-- BAD: Copy-paste from README -->
# Project Overview
AAFC Herbarium digitization...
<!-- GOOD: Include from root -->
# AAFC Herbarium Darwin Core Extraction
**Production-ready toolkit for extracting Darwin Core metadata from herbarium specimen images**
[](https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025/releases/tag/v2.0.0)
[](LICENSE)
[](https://www.python.org/downloads/)
[](https://aafc.devvyn.ca)
Validation¶
The docs validation workflow checks:

1. All snippet includes resolve correctly
2. No broken symlinks
3. No duplicate content warnings
Run locally:
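The same strict build used in the migration steps (`mkdocs build --strict`) fails on unresolved includes. As a lighter-weight standalone check, something like the sketch below walks the docs tree and flags `--8<--` targets that resolve against none of the configured base paths (the function name and regex are illustrative, not part of the project):

```python
import re
from pathlib import Path

# Matches pymdownx.snippets directives such as: --8<-- "CHANGELOG.md"
SNIPPET_RE = re.compile(r'--8<--\s*"([^"]+)"')


def unresolved_snippets(docs_dir, base_paths=(".", "docs")):
    """Return (doc, target) pairs whose snippet target exists in no base path."""
    missing = []
    for doc in sorted(Path(docs_dir).rglob("*.md")):
        for target in SNIPPET_RE.findall(doc.read_text(encoding="utf-8")):
            if not any((Path(base) / target).is_file() for base in base_paths):
                missing.append((str(doc), target))
    return missing
```

The `base_paths` default mirrors the `base_path: ['.', 'docs']` setting shown in the mkdocs.yml excerpt above.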
Migration Guide¶
To consolidate duplicate docs:
1. Identify duplicates: Compare docs/index.md and README.md
2. Choose canonical source: Usually root for GitHub visibility
3. Replace with includes: Use `--8<--` syntax
4. Test build: Run `mkdocs build --strict`
5. Remove duplicates: Delete old files
Pattern: Documentation Quality Gates
Status: Implemented with pymdownx.snippets plugin
[AAFC]: Agriculture and Agri-Food Canada
[GBIF]: Global Biodiversity Information Facility
[DwC]: Darwin Core
[OCR]: Optical Character Recognition
[API]: Application Programming Interface
[CSV]: Comma-Separated Values
[IPT]: Integrated Publishing Toolkit
[TDWG]: Taxonomic Databases Working Group