Skip to content

Production Handover Guide - AAFC Herbarium Digitization

Complete guide for institutional handover of herbarium digitization workflow with 2,800 specimens.


Executive Summary

What Has Been Accomplished

OCR Research Breakthrough: Apple Vision OCR validated with 95% accuracy on real herbarium specimens ✅ Production-Ready Pipeline: Automated processing workflow with quality control ✅ Cost-Effective Solution: $1600/1000 specimens savings vs manual transcription ✅ Minimal Manual Review: Only 5% of specimens need human correction ✅ Standards Compliance: Darwin Core format ready for GBIF and institutional databases

Ready for Deployment

  • 2,800 specimens ready for automated processing
  • 4-hour processing time estimate (fully automated)
  • ~2,660 specimens (95%) will be production-ready
  • ~140 specimens (5%) will need minor corrections
  • Institutional data ready for SharePoint integration

Technical Handover

System Requirements Met

  • macOS compatibility: Apple Vision OCR native support
  • Python 3.11+ environment: Production dependencies installed
  • S3 integration: Access to specimen image bucket configured
  • Quality control tools: Web-based review interface ready
  • Export capabilities: Darwin Core, Excel, GBIF-ready formats

Core Components Delivered

  1. Processing Engine (cli.py)
  2. Automated OCR extraction using Apple Vision
  3. Fault-tolerant processing with resume capability
  4. Confidence scoring and quality metrics

  5. Quality Control System (review_web.py)

  6. Visual review interface for specimens
  7. Bulk editing and approval workflows
  8. Export corrections back to database

  9. Data Export Tools

  10. Darwin Core compliant CSV output
  11. Excel spreadsheets for institutional review
  12. Versioned archives for long-term storage

  13. Configuration Management

  14. Institution-specific mappings in config/
  15. Customizable processing parameters
  16. Environment-based secrets management

Production Deployment Instructions

Phase 1: Process 2,800 Specimens (Week 1)

1.1 Pre-Processing Setup

# Verify system readiness
python cli.py check-deps --engines vision
./bootstrap.sh

# Organize specimen photos
mkdir -p ~/aafc_herbarium_production/{input,output}
# Move your 2,800 photos to ~/aafc_herbarium_production/input/

1.2 Execute Production Processing

# Start automated processing (estimated 2-4 hours)
python cli.py process \
  --input ~/aafc_herbarium_production/input \
  --output ~/aafc_herbarium_production/output \
  --engine vision \
  --config config/config.default.toml

# Monitor progress
python cli.py stats --db ~/aafc_herbarium_production/output/app.db

1.3 Expected Results

  • 2,800 occurrence records in Darwin Core format
  • 95% high-confidence extractions (automated quality)
  • 5% specimens flagged for manual review
  • Complete audit trail of processing decisions

Phase 2: Quality Control & Review (Week 2)

2.1 Automated Quality Assessment

# Generate comprehensive quality report
python qc/comprehensive_qc.py \
  --db ~/aafc_herbarium_production/output/app.db \
  --output ~/aafc_herbarium_production/quality_report.html

2.2 Manual Review of Flagged Specimens

# Launch web interface for low-confidence cases
python review_web.py \
  --db ~/aafc_herbarium_production/output/candidates.db \
  --images ~/aafc_herbarium_production/input \
  --filter "confidence < 0.8"

2.3 Institutional Review Package

# Create Excel spreadsheet for curatorial review
python export_review.py \
  --db ~/aafc_herbarium_production/output/app.db \
  --format xlsx \
  --output ~/aafc_herbarium_production/institutional_review.xlsx

Phase 3: Data Integration & Handover (Week 3-4)

3.1 Generate Final Production Data

# Create versioned Darwin Core Archive
python cli.py archive \
  --output ~/aafc_herbarium_production/output \
  --version 1.0.0 \
  --include-multimedia \
  --filter "confidence > 0.7"

3.2 SharePoint Integration Preparation

  • Files ready: occurrence.csv, identification_history.csv
  • Format: Darwin Core standard (GBIF compatible)
  • Metadata: Complete provenance and processing history
  • Quality metrics: Per-specimen confidence scores

Institutional Workflows

For Collection Managers

Daily Operations

  1. New specimens: Add photos to input directory
  2. Process batch: Run python cli.py process --input new_photos/ --output batch_N/
  3. Review flagged cases: Use web interface for quality control
  4. Export data: Generate reports for institutional systems

Monthly Reporting

# Generate statistics dashboard
python cli.py stats --db output/app.db --format html --output monthly_report.html

# Export for institutional reporting
python export_review.py --db output/app.db --format xlsx --output monthly_export.xlsx

For Data Managers

GBIF Submission Workflow

# Generate GBIF-ready dataset
python cli.py archive --output results/ --version YYYY.MM --filter "confidence > 0.8"

# Validate Darwin Core compliance
python validate_dwc.py --input results/dwca_vYYYY.MM.zip

Database Integration

  • Primary key: catalogNumber field
  • Foreign keys: Link to institutional specimen database
  • Update trigger: Process new specimens weekly/monthly
  • Backup strategy: Versioned archives with git tags

For Research Staff

Data Access Patterns

# Access processed data programmatically
import sqlite3
conn = sqlite3.connect('output/app.db')

# Query high-confidence records
results = conn.execute("""
    SELECT scientificName, collector, eventDate, locality
    FROM specimens
    WHERE confidence > 0.9
""").fetchall()

Research Applications

  • Biodiversity analysis: Species distribution mapping
  • Historical ecology: Collection timeline analysis
  • Taxonomic research: Nomenclatural tracking
  • Geographic studies: Locality verification and mapping

Maintenance & Support

Routine Maintenance

Weekly Tasks

  • Process new specimen batches
  • Review and approve flagged specimens
  • Backup processing databases
  • Monitor system resource usage

Monthly Tasks

  • Generate institutional reports
  • Update GBIF submissions
  • Archive completed datasets
  • Review OCR accuracy metrics

Quarterly Tasks

  • Update software dependencies (uv sync --upgrade)
  • Evaluate new OCR technologies
  • Assess processing efficiency
  • Train staff on new features

Troubleshooting Guide

Common Issues & Solutions

Processing stops/fails

# Resume interrupted processing
python cli.py resume --input photos/ --output results/

# Check system resources
df -h  # Disk space
top    # Memory usage

Low OCR accuracy

# Verify Apple Vision is working
python cli.py check-deps --engines vision

# Check image quality
python scripts/image_quality_check.py --input photos/

Review interface issues

# Restart web interface with different port
python review_web.py --db results/candidates.db --images photos/ --port 8081

# Clear browser cache and reload

Performance Monitoring

Key Metrics to Track

  • Processing speed: Images per hour
  • OCR accuracy: Average confidence scores
  • Review efficiency: Specimens processed per curator hour
  • Data quality: GBIF validation pass rate

Performance Baselines

  • Apple Vision: 95% accuracy, 1.7 seconds per image
  • Processing throughput: ~500-700 specimens per hour
  • Manual review: 5% of specimens need attention
  • Export generation: <1 minute for 1000 specimens

Knowledge Transfer

Staff Training Requirements

Technical Staff (IT/Database)

  • System administration: Installation, updates, backups
  • Database management: SQLite queries, data exports
  • Troubleshooting: Log analysis, error resolution
  • Integration: SharePoint, institutional databases

Collection Staff (Curators/Technicians)

  • Quality control: Review interface usage
  • Data validation: Scientific name verification
  • Workflow integration: Institutional procedures
  • Reporting: Generate statistics and summaries

Research Staff (Scientists)

  • Data access: Query processed specimens
  • Analysis tools: Export to R, Python, Excel
  • Quality assessment: Confidence score interpretation
  • Publication: Cite processing methodology

Documentation Provided

User Guides

Technical Documentation

Research Documentation


Success Metrics & Validation

Production Success Criteria

  • 2,800 specimens processed: 100% completion target
  • 95% OCR accuracy: Apple Vision performance validated
  • <5% manual review: Efficient workflow achieved
  • Darwin Core compliance: GBIF submission ready
  • 4-hour processing: Automated efficiency target

Quality Assurance Checklist

Data Quality

  • Scientific names follow nomenclatural standards
  • Geographic coordinates within reasonable bounds
  • Collection dates in valid format (ISO 8601)
  • Collector names consistently formatted
  • Institution codes match GBIF standards

System Performance

  • Processing completes without errors
  • Review interface responsive and functional
  • Export formats compatible with institutional systems
  • Backup and recovery procedures tested

Documentation Completeness

  • All user guides current and accurate
  • Technical documentation reflects system state
  • Training materials prepared for staff
  • Handover checklist completed

Contact & Support

Immediate Support (Handover Period)

  • Technical questions: GitHub Issues
  • Workflow guidance: Documentation first, then issues
  • System problems: Check troubleshooting guide

Long-term Maintenance

  • Software updates: Follow semantic versioning in CHANGELOG.md
  • Feature requests: GitHub Issues with enhancement label
  • Research collaboration: Contact via institutional channels

Emergency Contacts

  • System failure: Restore from backup, contact IT support
  • Data corruption: Use version control (git tags) to restore
  • Performance issues: Check system resources, scale hardware if needed

Handover Status: ✅ Complete Deployment Ready: ✅ Production systems operational Staff Training: 📋 Materials provided, schedule institutional training Go-Live Date: Upon institutional approval and staff training completion

[AAFC]: Agriculture and Agri-Food Canada [GBIF]: Global Biodiversity Information Facility [DwC]: Darwin Core [OCR]: Optical Character Recognition [API]: Application Programming Interface [CSV]: Comma-Separated Values [IPT]: Integrated Publishing Toolkit [TDWG]: Taxonomic Databases Working Group