
User Guide - Herbarium Specimen Digitization

Step-by-step guide for institutional staff to digitize herbarium specimens using OCR automation.


Quick Reference

Basic Workflow

  1. Setup → Install software and organize photos
  2. Process → Automated OCR extraction (2-4 hours for 1000 specimens)
  3. Review → Quality control using web interface
  4. Export → Generate Darwin Core data for GBIF/databases

Common Commands

# Process specimens
python cli.py process --input photos/ --output results/ --engine vision

# Review results (Quart web app)
python -m src.review.web_app --extraction-dir results/ --port 5002

# Generate reports
python cli.py stats --db results/app.db --format html

Getting Started

First Time Setup

1. Install Software

# Clone and install (one-time setup)
git clone https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025.git
cd aafc-herbarium-dwc-extraction-2025
./bootstrap.sh

2. Organize Your Photos

Create a consistent directory structure:

mkdir -p ~/herbarium_work/batch_1/{input,output}

Copy your specimen photos to the input directory:

cp /path/to/your/photos/*.jpg ~/herbarium_work/batch_1/input/
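
Before processing, it can help to confirm the photos copied over as expected. A minimal sketch, assuming the batch layout shown above:

from pathlib import Path

# Count the specimen photos staged for this batch (path from the example above)
batch_input = Path.home() / "herbarium_work" / "batch_1" / "input"
photos = sorted(batch_input.glob("*.jpg"))
print(f"{len(photos)} photos ready for processing")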

3. Verify System Ready

# Check OCR engines available
python cli.py check-deps --engines vision,tesseract,gpt

# Expected on macOS: ✅ Apple Vision: Available

Processing Specimens

Standard Processing Workflow

Step 1: Start Processing

python cli.py process \
  --input ~/herbarium_work/batch_1/input \
  --output ~/herbarium_work/batch_1/output \
  --engine vision

What happens:

  • Each photo is analyzed using Apple Vision OCR
  • Text is extracted and identified (scientific names, collectors, dates)
  • Results are saved with confidence scores
  • Progress is shown: "Processing specimen 1/100: photo_001.jpg"

Step 2: Monitor Progress

# Check processing status
python cli.py stats --db ~/herbarium_work/batch_1/output/app.db

# See confidence distribution
python cli.py stats --db ~/herbarium_work/batch_1/output/app.db --show-confidence

Step 3: Handle Interruptions

If processing stops, resume where it left off:

python cli.py resume \
  --input ~/herbarium_work/batch_1/input \
  --output ~/herbarium_work/batch_1/output


Understanding Results

Confidence Scores

Interpretation Guide

  • 0.95-1.0: Excellent - minimal review needed
  • 0.85-0.94: Good - spot check recommended
  • 0.70-0.84: Fair - review recommended
  • Below 0.70: Poor - manual review required
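
To see where a whole batch falls on this scale without opening the review interface, a short script can bucket the scores in raw.jsonl. This is a sketch only: it assumes each line is a JSON object with a top-level "confidence" field, which may not match the actual schema; inspect one line of your own raw.jsonl first.

import json
from pathlib import Path

# Sketch: bucket per-record confidence scores from raw.jsonl
# Assumes each line is a JSON object with a top-level "confidence" field;
# adjust the key if your raw.jsonl uses a different schema.
results_file = Path.home() / "herbarium_work" / "batch_1" / "output" / "raw.jsonl"
buckets = {"excellent (0.95+)": 0, "good (0.85-0.94)": 0, "fair (0.70-0.84)": 0, "poor (<0.70)": 0}

for line in results_file.read_text().splitlines():
    if not line.strip():
        continue
    confidence = json.loads(line).get("confidence", 0.0)
    if confidence >= 0.95:
        buckets["excellent (0.95+)"] += 1
    elif confidence >= 0.85:
        buckets["good (0.85-0.94)"] += 1
    elif confidence >= 0.70:
        buckets["fair (0.70-0.84)"] += 1
    else:
        buckets["poor (<0.70)"] += 1

for label, count in buckets.items():
    print(f"{label}: {count}")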

Quality Expectations

Based on OCR research:

  • Apple Vision: 95% of specimens achieve 0.85+ confidence
  • Manual review needed: ~5% of specimens
  • High accuracy fields: Institution names, collector names
  • Lower accuracy fields: Handwritten notes, damaged labels

Data Fields Extracted

Primary Fields (High Accuracy)

  • scientificName: Taxonomic identification
  • collector: Person who collected specimen
  • eventDate: Collection date
  • locality: Collection location
  • catalogNumber: Institution specimen number
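
For orientation, the primary fields of a single extracted record might look like the example below. The values are invented purely for illustration, and your export may include additional Darwin Core columns.

# Hypothetical example record -- values are invented for illustration only
record = {
    "catalogNumber": "AAFC-012345",
    "scientificName": "Artemisia frigida Willd.",
    "collector": "J. Smith",
    "eventDate": "1987-07-14",
    "locality": "Swift Current, Saskatchewan, Canada",
}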

Quality Control & Review

Launch Review Interface

python -m src.review.web_app \
  --extraction-dir ~/herbarium_work/batch_1/output \
  --port 5002

Open browser to: http://localhost:5002

Review Features

  • Side-by-side view: Photo and extracted text
  • Confidence filtering: Focus on specimens needing attention
  • Bulk editing: Fix common patterns across specimens
  • Quick approval: One-click for high-confidence results

Focus on Problem Cases

Filter specimens by priority in the web interface:

  • Use the "Priority" dropdown to filter HIGH/CRITICAL priority specimens
  • Use the "Status" dropdown to filter PENDING specimens needing review
  • Sort by quality score to focus on the lowest-quality records first


Data Export & Integration

Generate Final Dataset

Darwin Core Export (GBIF Ready)

python cli.py archive \
  --output ~/herbarium_work/batch_1/output \
  --version 1.0.0 \
  --filter "confidence > 0.7" \
  --include-multimedia

Creates: dwca_v1.0.0.zip ready for GBIF submission
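
To sanity-check the archive before submission, list its contents. A Darwin Core Archive typically bundles a meta.xml descriptor, EML metadata, and the core occurrence data file; exact file names depend on the exporter, so treat this as a quick sketch:

import zipfile

# List the files packaged into the Darwin Core Archive
with zipfile.ZipFile("dwca_v1.0.0.zip") as archive:
    for name in archive.namelist():
        print(name)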

CSV Exports

Your processed data is automatically available:

  • output/occurrence.csv - Darwin Core records
  • output/identification_history.csv - Taxonomic determinations
  • output/raw.jsonl - Complete processing logs
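
To preview the exported records without opening a spreadsheet, a standard-library sketch (column names are assumed to match the primary fields listed earlier; adjust if your occurrence.csv differs):

import csv
from pathlib import Path

# Print a few primary fields from the first five Darwin Core records
occurrence_csv = Path.home() / "herbarium_work" / "batch_1" / "output" / "occurrence.csv"
with occurrence_csv.open(newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):
        if i >= 5:
            break
        print(row.get("catalogNumber"), row.get("scientificName"), row.get("eventDate"))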


Troubleshooting

Processing Issues

"No OCR engines available"

# Check what's installed
python cli.py check-deps --engines vision,tesseract,gpt

# On macOS: Ensure Apple Vision available
# On Linux/Windows: install the Tesseract engine first (e.g., apt install tesseract-ocr), then the Python wrapper
pip install pytesseract

Processing stops with errors

# Check disk space
df -h

# Resume processing
python cli.py resume --input photos/ --output results/

Poor OCR results

  1. Check image quality: Clear, well-lit photos work best
  2. Try different engines: --engine gpt for difficult specimens
  3. Adjust confidence threshold: --filter "confidence > 0.6"

Review Interface Issues

Web interface won't start

# Try different port
python -m src.review.web_app --extraction-dir results/ --port 5003

# Verify extraction directory has raw.jsonl
ls -la results/raw.jsonl

Best Practices

Photo Preparation

Optimal Image Quality

  • Resolution: 2-5 megapixels sufficient
  • Format: JPG or PNG
  • Lighting: Even lighting, avoid shadows
  • Focus: Ensure labels are in sharp focus
  • Angle: Straight-on view of labels
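
If you are unsure whether a batch meets these guidelines, a quick resolution check with Pillow (pip install pillow) can flag undersized images before a long processing run. This is an optional sketch, not part of the toolkit itself:

from pathlib import Path
from PIL import Image  # pip install pillow

# Flag photos below roughly 2 megapixels before starting a long OCR run
batch_input = Path.home() / "herbarium_work" / "batch_1" / "input"
for photo in sorted(batch_input.glob("*.jpg")):
    with Image.open(photo) as img:
        width, height = img.size
    megapixels = width * height / 1_000_000
    if megapixels < 2:
        print(f"{photo.name}: {megapixels:.1f} MP - consider re-photographing")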

Quality Control

Review Priorities

  1. Start with low-confidence records: Focus review effort where it is most needed
  2. Verify scientific names: Use taxonomic databases
  3. Check geographic data: Validate locality information
  4. Confirm dates: Ensure reasonable collection dates
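
For step 2, one option is to check extracted names against the GBIF species-match API. The sketch below uses the public endpoint https://api.gbif.org/v1/species/match and the requests library (pip install requests); it is a starting point, not part of this toolkit:

import requests  # pip install requests

def check_name(name: str) -> None:
    """Look up a scientific name against the GBIF backbone taxonomy."""
    response = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": name},
        timeout=10,
    )
    response.raise_for_status()
    match = response.json()
    print(name, "->", match.get("matchType"), match.get("scientificName"))

check_name("Artemisia frigida")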

Getting Help

Support Channels

  • GitHub Issues: Bug reports and feature requests
  • Documentation: Search docs first
  • Community: Share experiences with other users

[AAFC]: Agriculture and Agri-Food Canada
[GBIF]: Global Biodiversity Information Facility
[DwC]: Darwin Core
[OCR]: Optical Character Recognition
[API]: Application Programming Interface
[CSV]: Comma-Separated Values
[IPT]: Integrated Publishing Toolkit
[TDWG]: Taxonomic Databases Working Group