Apple Vision Deployment Guide - Process 2,800 Specimens¶
Quick deployment instructions for processing the captured herbarium specimens with Apple Vision OCR (about 95% accuracy on clear specimen labels).
Prerequisites¶
- macOS system (required for Apple Vision)
- 2,800 specimen photos organized in a directory
- Project installed (./bootstrap.sh completed)
- Sufficient disk space (estimate 500MB-1GB for output databases)
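The prerequisites above can be checked before a long run starts. The following is a minimal sketch, not part of the project: the function name `preflight` and its parameters are hypothetical, and the 1 GB threshold simply takes the upper end of the disk estimate.

```python
import platform
import shutil
from pathlib import Path


def preflight(photo_dir, expected_photos=2800, min_free_bytes=1_000_000_000):
    """Summarise whether the listed prerequisites appear to be met."""
    photo_dir = Path(photo_dir)
    # Count regular, non-hidden files only (skips .DS_Store and subdirectories).
    photo_count = sum(
        1 for p in photo_dir.iterdir()
        if p.is_file() and not p.name.startswith(".")
    )
    free_bytes = shutil.disk_usage(photo_dir).free
    return {
        "is_macos": platform.system() == "Darwin",  # Apple Vision needs macOS
        "photo_count": photo_count,
        "has_expected_photos": photo_count >= expected_photos,
        "free_bytes": free_bytes,
        "enough_space": free_bytes >= min_free_bytes,  # assumed 1 GB threshold
    }
```

Calling `preflight("/path/to/your/2800/photos")` returns a dictionary that can be printed or inspected before continuing with the deployment steps.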
Deployment Steps¶
1. Verify Apple Vision is Available¶
# Check OCR engines
python cli.py check-deps --engines vision
# Expected output:
# ✅ Apple Vision: Available (macOS native)
2. Organize Your 2,800 Photos¶
# Create consistent directory structure
mkdir -p ~/herbarium_processing/input
mkdir -p ~/herbarium_processing/output
# Move your 2,800 photos to input directory
# (adjust path to your actual photo location)
cp /path/to/your/2800/photos/* ~/herbarium_processing/input/
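If the source folder mixes photos with other files, a filtered copy may be safer than a wildcard `cp`. This is a sketch under assumptions: the helper `copy_photos` is hypothetical, and the list of image suffixes is a guess at what the specimen photos use.

```python
import shutil
from pathlib import Path

# Assumed image formats; adjust to match the actual specimen photos.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def copy_photos(source_dir, input_dir):
    """Copy image files into input_dir and return how many were copied."""
    source_dir, input_dir = Path(source_dir), Path(input_dir)
    input_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for path in sorted(source_dir.iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            shutil.copy2(path, input_dir / path.name)  # keeps timestamps
            copied += 1
    return copied
```

Comparing the returned count with 2,800 gives a quick confirmation that nothing was missed before processing starts.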
3. Start Apple Vision Processing¶
# Navigate to your local clone of the project (adjust the path below)
cd /Users/devvynmurphy/Documents/GitHub/aafc-herbarium-dwc-extraction-2025
# Start processing with Apple Vision
python cli.py process \
    --input ~/herbarium_processing/input \
    --output ~/herbarium_processing/output \
    --engine vision \
    --config config/config.default.toml
# Processing will show progress like:
# Processing specimen 1/2800: specimen_001.jpg
# Apple Vision confidence: 0.94
# Processing specimen 2/2800: specimen_002.jpg
# Apple Vision confidence: 0.96
Processing time estimate: 2-4 hours for 2,800 images (varies by image size)
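The estimate above works out to roughly 2.6 to 5.1 seconds per image. The sketch below shows that arithmetic and how a running ETA could be projected from observed throughput; both function names are illustrative and not part of the project.

```python
def seconds_per_image(total_hours, image_count=2800):
    """Convert a whole-run estimate into a per-image processing time."""
    return total_hours * 3600 / image_count


def eta_hours(images_done, elapsed_seconds, image_count=2800):
    """Project the total run time from the throughput observed so far."""
    if images_done <= 0:
        raise ValueError("images_done must be positive")
    return elapsed_seconds / images_done * image_count / 3600
```

For example, if the first 100 images take 300 seconds, `eta_hours(100, 300)` projects a total of about 2.3 hours, which falls inside the estimated range.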
4. Monitor Progress¶
# In another terminal, check progress
python cli.py stats --db ~/herbarium_processing/output/app.db
# View processing status
sqlite3 ~/herbarium_processing/output/app.db "SELECT status, COUNT(*) FROM specimens GROUP BY status;"
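The same status query can be run from Python, which is convenient for scripted monitoring. This sketch assumes only what the query above shows (a `specimens` table with a `status` column); the function name is hypothetical. It opens the database read-only so it cannot interfere with the running process.

```python
import sqlite3
from pathlib import Path


def status_counts(db_path):
    """Return {status: count} from the specimens table, opened read-only."""
    uri = Path(db_path).expanduser().resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT status, COUNT(*) FROM specimens GROUP BY status"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)
```

Summing the returned values and comparing the total with 2,800 shows how many specimens have been registered so far.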
5. Handle Interruptions (Resume if needed)¶
# If processing gets interrupted, resume from where it left off
python cli.py resume \
    --input ~/herbarium_processing/input \
    --output ~/herbarium_processing/output \
    --engine vision
Expected Results¶
Output Files Generated¶
After processing 2,800 specimens, you'll have:
~/herbarium_processing/output/
├── occurrence.csv # 2,800 Darwin Core records
├── identification_history.csv # Taxonomic data
├── raw.jsonl # Complete OCR results log
├── manifest.json # Processing metadata
├── candidates.db # SQLite database for review
├── app.db # Processing status database
└── images/ # Thumbnail cache
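A quick way to confirm the run produced everything listed above is to check the output directory against that list. The sketch below uses only the file names from the tree; the helper names are hypothetical, and it assumes `occurrence.csv` has a single header row.

```python
import csv
from pathlib import Path

# Names taken from the output tree above.
EXPECTED_OUTPUTS = [
    "occurrence.csv",
    "identification_history.csv",
    "raw.jsonl",
    "manifest.json",
    "candidates.db",
    "app.db",
    "images",
]


def missing_outputs(output_dir):
    """Return the expected output names that are absent from output_dir."""
    output_dir = Path(output_dir).expanduser()
    return [name for name in EXPECTED_OUTPUTS if not (output_dir / name).exists()]


def count_records(csv_path):
    """Count data rows in a CSV file, excluding the header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return sum(1 for _ in csv.DictReader(handle))
```

An empty list from `missing_outputs` and a `count_records` result of 2,800 for `occurrence.csv` would match the expectations stated here.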
Quality Expectations (Based on Research)¶
- 95% accuracy on clear specimen labels
- ~2,660 specimens (95%) will need minimal or no manual review
- ~140 specimens (5%) may need manual correction
- High confidence on institutional names, scientific names, collectors, dates
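The review figures above follow directly from the 95% rate applied to 2,800 specimens. A small sketch of that arithmetic, with a hypothetical helper name, makes it easy to rerun for a different batch size or rate:

```python
def review_split(total, accuracy):
    """Split a batch into (little or no review, likely manual correction)."""
    needs_review = round(total * (1 - accuracy))
    return total - needs_review, needs_review
```

`review_split(2800, 0.95)` returns `(2660, 140)`, matching the numbers in the list.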
Data Volume Estimates¶
- occurrence.csv: ~500KB-1MB (2,800 records)
- raw.jsonl: ~5-10MB (complete OCR logs)
- candidates.db: ~50-100MB (all OCR results)
- app.db: ~20-50MB (processing metadata)
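Adding up the per-file estimates gives a rough total for planning disk space. The sketch below sums the ranges listed above. The thumbnail cache in `images/` is not sized in the list, so it is left out, which means the real footprint will be larger than this total.

```python
# (low, high) estimates in MB, copied from the list above.
ESTIMATES_MB = {
    "occurrence.csv": (0.5, 1),
    "raw.jsonl": (5, 10),
    "candidates.db": (50, 100),
    "app.db": (20, 50),
}


def total_range_mb(estimates):
    """Sum the low and high ends of each per-file estimate."""
    low = sum(lo for lo, _ in estimates.values())
    high = sum(hi for _, hi in estimates.values())
    return low, high
```

`total_range_mb(ESTIMATES_MB)` returns `(75.5, 161)`, so the listed files alone stay well under the 500MB-1GB figure given in the prerequisites.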
Quality Control Workflow¶
1. Review High-Confidence Results¶
# Launch web review interface
python review_web.py \
    --db ~/herbarium_processing/output/candidates.db \
    --images ~/herbarium_processing/input \
    --port 8080
# Open browser to http://localhost:8080
2. Focus on Low-Confidence Cases¶
# Review only specimens needing attention (confidence < 80%)
python review_web.py \
    --db ~/herbarium_processing/output/candidates.db \
    --images ~/herbarium_processing/input \
    --filter "confidence < 0.8"
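The same 0.8 threshold can be applied outside the web interface, for example to build a worklist from `raw.jsonl`. This is a sketch under a loud assumption: the layout of `raw.jsonl` is not documented here, so the code assumes one JSON object per line with a numeric `confidence` key. Both the key name and the helper name are hypothetical.

```python
import json


def low_confidence(jsonl_path, threshold=0.8):
    """Return records below the threshold or with no confidence at all."""
    flagged = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            confidence = record.get("confidence")  # assumed key name
            if confidence is None or confidence < threshold:
                flagged.append(record)
    return flagged
```

Records without a confidence value are flagged as well, on the view that an unknown score deserves a look.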
3. Export for Institutional Review¶
# Create Excel file for curatorial review
python export_review.py \
    --db ~/herbarium_processing/output/app.db \
    --format xlsx \
    --output ~/herbarium_processing/institutional_review.xlsx
Production Handover Package¶
Generate Complete Dataset¶
# Create versioned Darwin Core Archive
python cli.py archive \
    --output ~/herbarium_processing/output \
    --version 1.0.0 \
    --include-multimedia \
    --filter "confidence > 0.7"
# Results in: ~/herbarium_processing/output/dwca_v1.0.0.zip
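Before handing the archive over, it may be worth confirming that the zip opens cleanly and contains what is expected. The sketch below uses Python's standard `zipfile` module. Darwin Core Archives conventionally carry a `meta.xml` descriptor, but whether this project places it at the archive root is an assumption, and the helper name is hypothetical.

```python
import zipfile


def archive_summary(zip_path):
    """List archive members and run a basic integrity check."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        corrupt = archive.testzip()  # first bad member name, or None
    return {
        "members": names,
        "has_meta_xml": "meta.xml" in names,  # assumed root-level descriptor
        "has_occurrence": "occurrence.csv" in names,
        "corrupt_member": corrupt,
    }
```

A `corrupt_member` of `None` means every member passed its CRC check.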
Quality Report¶
# Generate comprehensive quality report
python qc/comprehensive_qc.py \
    --db ~/herbarium_processing/output/app.db \
    --output ~/herbarium_processing/qc_report.html \
    --include-geographic-validation \
    --include-taxonomic-validation
Troubleshooting¶
Common Issues¶
Processing stops with errors:
# Check logs
tail -f ~/herbarium_processing/output/processing.log
# Resume processing
python cli.py resume \
    --input ~/herbarium_processing/input \
    --output ~/herbarium_processing/output \
    --engine vision
Low confidence results:
- Apple Vision typically achieves 95% accuracy
- If seeing lower confidence, check image quality
- Consider preprocessing for damaged/blurry specimens
Out of disk space:
# Check disk usage
df -h ~/herbarium_processing/
# Clean up intermediate files if needed
rm -rf ~/herbarium_processing/output/temp/
Success Metrics¶
- 2,800 specimens processed: 100% completion
- Average confidence > 0.90: Consistent with the 95% accuracy target
- < 5% manual review needed: ~140 specimens or fewer
- Darwin Core compliance: Ready for GBIF submission
- Processing time < 4 hours: Efficient automated workflow
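The measurable items in this list can be evaluated automatically once per-specimen confidence values are available. The sketch below is illustrative only. The function name is hypothetical, and it treats "manual review needed" as confidence below 0.8, borrowing the threshold from the low-confidence review step. Darwin Core compliance is not covered because it cannot be reduced to a single number here.

```python
def check_metrics(confidences, elapsed_hours, expected_total=2800,
                  review_threshold=0.8):
    """Evaluate the numeric success metrics for a finished run."""
    processed = len(confidences)
    mean_conf = sum(confidences) / processed if processed else 0.0
    # Assumed definition: anything under the review threshold needs a human.
    needs_review = sum(1 for c in confidences if c < review_threshold)
    review_share = needs_review / processed if processed else 1.0
    return {
        "all_processed": processed == expected_total,
        "mean_confidence_ok": mean_conf > 0.90,
        "review_share_ok": review_share < 0.05,
        "time_ok": elapsed_hours < 4,
    }
```

A run passes when every value in the returned dictionary is `True`.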
Next Steps After Processing¶
- Institutional Review: Use generated Excel files for curatorial review
- GBIF Submission: Submit dwca_v1.0.0.zip to GBIF
- Data Archival: Store complete results package
- Documentation: Update institutional procedures based on workflow
Contact: Open GitHub issue for deployment support.
[AAFC]: Agriculture and Agri-Food Canada
[GBIF]: Global Biodiversity Information Facility
[DwC]: Darwin Core
[OCR]: Optical Character Recognition
[API]: Application Programming Interface
[CSV]: Comma-Separated Values
[IPT]: Integrated Publishing Toolkit
[TDWG]: Taxonomic Databases Working Group