User Guide - Herbarium Specimen Digitization¶
Step-by-step guide for institutional staff to digitize herbarium specimens using OCR automation.
Quick Reference¶
Basic Workflow¶
- Setup → Install software and organize photos
- Process → Automated OCR extraction (2-4 hours for 1000 specimens)
- Review → Quality control using web interface
- Export → Generate Darwin Core data for GBIF/databases
Common Commands¶
# Process specimens
python cli.py process --input photos/ --output results/ --engine vision
# Review results (Quart web app)
python -m src.review.web_app --extraction-dir results/ --port 5002
# Generate reports
python cli.py stats --db results/app.db --format html
Getting Started¶
First Time Setup¶
1. Install Software¶
# Clone and install (one-time setup)
git clone https://github.com/devvyn/aafc-herbarium-dwc-extraction-2025.git
cd aafc-herbarium-dwc-extraction-2025
./bootstrap.sh
2. Organize Your Photos¶
Create a consistent directory structure:
Copy your specimen photos to the input directory:
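A minimal sketch of one possible layout, assuming the `~/herbarium_work/batch_1/input` and `~/herbarium_work/batch_1/output` paths used by the processing commands later in this guide. The helper name `prepare_batch` and the list of image extensions are illustrative and not part of the tool.

```python
import shutil
from pathlib import Path

# Assumption: only JPG and PNG photos are copied (the formats this guide recommends).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def prepare_batch(work_root: Path, batch: str, photo_source: Path) -> list[Path]:
    """Create <work_root>/<batch>/input and output, then copy photos into input."""
    input_dir = work_root / batch / "input"
    output_dir = work_root / batch / "output"
    input_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for photo in sorted(photo_source.iterdir()):
        # Skip subdirectories and non-image files such as notes or spreadsheets.
        if photo.is_file() and photo.suffix.lower() in IMAGE_SUFFIXES:
            copied.append(Path(shutil.copy2(photo, input_dir / photo.name)))
    return copied
```

For example, `prepare_batch(Path.home() / "herbarium_work", "batch_1", Path("/path/to/photos"))` would produce the directories that the later `--input` and `--output` arguments point at.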
3. Verify System Ready¶
# Check OCR engines available
python cli.py check-deps --engines vision,tesseract,gpt
# Expected on macOS: ✅ Apple Vision: Available
Processing Specimens¶
Standard Processing Workflow¶
Step 1: Start Processing¶
python cli.py process \
--input ~/herbarium_work/batch_1/input \
--output ~/herbarium_work/batch_1/output \
--engine vision
What happens:
- Each photo is analyzed using Apple Vision OCR
- Text is extracted and identified (scientific names, collectors, dates)
- Results are saved with confidence scores
- Progress is shown: "Processing specimen 1/100: photo_001.jpg"
Step 2: Monitor Progress¶
# Check processing status
python cli.py stats --db ~/herbarium_work/batch_1/output/app.db
# See confidence distribution
python cli.py stats --db ~/herbarium_work/batch_1/output/app.db --show-confidence
Step 3: Handle Interruptions¶
If processing stops, resume where it left off:
python cli.py resume \
--input ~/herbarium_work/batch_1/input \
--output ~/herbarium_work/batch_1/output
Understanding Results¶
Confidence Scores¶
Interpretation Guide¶
- 0.95-1.0: Excellent - minimal review needed
- 0.85-0.94: Good - spot check recommended
- 0.70-0.84: Fair - review recommended
- Below 0.70: Poor - manual review required
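The bands above can be expressed as a small lookup. This is a sketch rather than the tool's own logic: the function names and labels are illustrative, and each band is treated as "at or above its lower bound" so that a score such as 0.945 lands in "good" instead of falling between the printed ranges.

```python
from collections import Counter

# Lower bounds copied from the interpretation guide, highest first.
BANDS = [
    (0.95, "excellent"),
    (0.85, "good"),
    (0.70, "fair"),
]


def confidence_band(score: float) -> str:
    """Map a confidence score in [0, 1] to its review band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got {score}")
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return "poor"


def band_counts(scores) -> Counter:
    """Count how many scores fall into each band."""
    return Counter(confidence_band(score) for score in scores)
```

Calling `band_counts` on the scores from a batch gives a quick sense of how much manual review to plan for.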
Quality Expectations¶
Based on OCR research:
- Apple Vision: 95% of specimens achieve 0.85+ confidence
- Manual review needed: ~5% of specimens
- High accuracy fields: Institution names, collector names
- Lower accuracy fields: Handwritten notes, damaged labels
Data Fields Extracted¶
Primary Fields (High Accuracy)¶
- scientificName: Taxonomic identification
- collector: Person who collected specimen
- eventDate: Collection date
- locality: Collection location
- catalogNumber: Institution specimen number
Quality Control & Review¶
Web-Based Review (Recommended)¶
Launch Review Interface¶
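# Start the review web app (same command as in Common Commands)
python -m src.review.web_app --extraction-dir results/ --port 5002
Then open a browser to: http://localhost:5002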
Review Features¶
- Side-by-side view: Photo and extracted text
- Confidence filtering: Focus on specimens needing attention
- Bulk editing: Fix common patterns across specimens
- Quick approval: One-click for high-confidence results
Focus on Problem Cases¶
Filter specimens by priority in the web interface:
- Use "Priority" dropdown to filter HIGH/CRITICAL priority specimens
- Use "Status" dropdown to filter PENDING specimens needing review
- Sort by quality score to focus on lowest-quality records first
Data Export & Integration¶
Generate Final Dataset¶
Darwin Core Export (GBIF Ready)¶
python cli.py archive \
--output ~/herbarium_work/batch_1/output \
--version 1.0.0 \
--filter "confidence > 0.7" \
--include-multimedia
Creates: dwca_v1.0.0.zip ready for GBIF submission
CSV Exports¶
Your processed data is automatically available:
- output/occurrence.csv - Darwin Core records
- output/identification_history.csv - Taxonomic determinations
- output/raw.jsonl - Complete processing logs
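Before importing the export elsewhere, it can help to see how many records are missing each primary field. The sketch below assumes that `occurrence.csv` uses the field names from the "Primary Fields" list as column headers; that is an assumption, so check the header row of your own file. The function names are illustrative.

```python
import csv
from collections import Counter
from typing import Iterable, Mapping

# Assumption: these field names appear as column headers in occurrence.csv.
PRIMARY_FIELDS = ("scientificName", "collector", "eventDate", "locality", "catalogNumber")


def count_missing(rows: Iterable[Mapping[str, str]], fields=PRIMARY_FIELDS) -> Counter:
    """Count, per field, the rows where the value is absent or blank."""
    missing = Counter()
    for row in rows:
        for field in fields:
            if not (row.get(field) or "").strip():
                missing[field] += 1
    return missing


def count_missing_in_csv(path: str) -> Counter:
    """Apply count_missing to a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return count_missing(csv.DictReader(handle))
```

Fields with high counts are good candidates for the confidence and priority filters in the review interface.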
Troubleshooting¶
Processing Issues¶
"No OCR engines available"¶
# Check what's installed
python cli.py check-deps --engines vision,tesseract,gpt
# On macOS: Ensure Apple Vision available
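# On Linux/Windows: install the Tesseract binary first (for example, `sudo apt install tesseract-ocr` on Debian/Ubuntu), then the Python wrapper
pip install pytesseract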
Processing stops with errors¶
If a run stops partway, resume it with python cli.py resume as described in "Step 3: Handle Interruptions".
Poor OCR results¶
- Check image quality: Clear, well-lit photos work best
- Try different engines: --engine gpt for difficult specimens
- Adjust confidence threshold: --filter "confidence > 0.6"
Review Interface Issues¶
Web interface won't start¶
# Try different port
python -m src.review.web_app --extraction-dir results/ --port 5003
# Verify extraction directory has raw.jsonl
ls -la results/raw.jsonl
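A file can exist and still be empty or truncated, for instance after an interrupted run. The sketch below checks that each non-blank line of a JSON Lines file parses as JSON. It makes no assumption about which fields the records contain, and whether the review app tolerates malformed lines is not stated in this guide. The function name `check_jsonl` is illustrative.

```python
import json
from pathlib import Path


def check_jsonl(path: Path) -> tuple[int, list[int]]:
    """Return (number of valid records, 1-based line numbers that failed to parse)."""
    valid = 0
    bad_lines = []
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are ignored rather than counted as errors
            try:
                json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(number)
            else:
                valid += 1
    return valid, bad_lines
```

For example, `check_jsonl(Path("results/raw.jsonl"))` returning zero valid records would suggest that processing never completed for this directory.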
Best Practices¶
Photo Preparation¶
Optimal Image Quality¶
- Resolution: 2-5 megapixels sufficient
- Format: JPG or PNG
- Lighting: Even lighting, avoid shadows
- Focus: Ensure labels are in sharp focus
- Angle: Straight-on view of labels
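As a worked example of the resolution guideline, megapixels are the pixel width multiplied by the pixel height, divided by one million. The sketch below treats 2 megapixels as the lower bound from the list above; the function names are illustrative, and reading the dimensions from an image file is left to whatever imaging tool you already use.

```python
def megapixels(width_px: int, height_px: int) -> float:
    """Convert pixel dimensions to megapixels."""
    return width_px * height_px / 1_000_000


def meets_minimum(width_px: int, height_px: int, minimum_mp: float = 2.0) -> bool:
    """True when the image has at least the recommended minimum resolution."""
    return megapixels(width_px, height_px) >= minimum_mp
```

A 2048 x 1536 photo is about 3.1 megapixels and passes, while a 1024 x 768 photo is about 0.8 megapixels and does not.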
Quality Control¶
Review Priorities¶
- Start with low confidence: Focus effort where needed
- Verify scientific names: Use taxonomic databases
- Check geographic data: Validate locality information
- Confirm dates: Ensure reasonable collection dates
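The date check in the list above can be partly automated. This sketch assumes `eventDate` values are full ISO 8601 dates (YYYY-MM-DD); partial dates or date ranges fail the check and would need manual review. The function name and the default earliest year of 1700 are illustrative choices, not values from the tool.

```python
from datetime import date
from typing import Optional


def is_plausible_collection_date(
    value: str,
    earliest_year: int = 1700,
    today: Optional[date] = None,
) -> bool:
    """True when value is a full ISO date between earliest_year and today."""
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        return False  # not a full ISO date, e.g. "15/07/1969" or "1969-07"
    today = today or date.today()
    return date(earliest_year, 1, 1) <= parsed <= today
```

Dates in the future usually point to an OCR misread of the year, so records that fail this check are worth opening in the review interface.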
Getting Help¶
Documentation Resources¶
- FAQ: Common questions and answers
- Troubleshooting: Detailed problem solving
- Production Handover: Complete deployment guide
Support Channels¶
- GitHub Issues: Bug reports and feature requests
- Documentation: Search docs first
- Community: Share experiences with other users
Abbreviations:
- AAFC: Agriculture and Agri-Food Canada
- GBIF: Global Biodiversity Information Facility
- DwC: Darwin Core
- OCR: Optical Character Recognition
- API: Application Programming Interface
- CSV: Comma-Separated Values
- IPT: Integrated Publishing Toolkit
- TDWG: Taxonomic Databases Working Group