📊 Data Access Overview - What This Project Can Actually Do¶

For Stakeholders: Clear explanation of what data you can extract and access using this herbarium digitization toolkit.

🎯 What Data Can You Get From Your Specimen Images?¶

Input: Your herbarium specimen photographs¶

Output: Structured botanical data you can use¶

📸 From This Image → To This Data¶

Visual: Specimen photo with labels¶

Extracted: Structured information like:¶

{
  "scientificName": "Plantago major",
  "collector": "Smith, J.R.",
  "collectionNumber": "1234",
  "eventDate": "2023-07-15",
  "locality": "Ontario, Canada",
  "coordinates": "45.4215, -75.6919",
  "identifiedBy": "Dr. Wilson",
  "catalogNumber": "HERB-001234"
}

🔍 What Information Can Be Extracted?¶

✅ Apple Vision OCR - High Success Rate (95% accuracy)¶

Institution names ("REGINA RESEARCH STATION", "AGRICULTURE CANADA")
Scientific names (genus + species, including author citations)
Collection numbers (when clearly written)
Dates (in various formats, including handwritten)
Collector names (both printed and handwritten labels)
Geographic locations (countries, provinces, detailed localities)
Specimen types (holotype, isotype, etc.)

✅ Moderate Success Rate (80-90% accuracy)¶

Handwritten notes (legibility dependent)
Coordinates (when present on labels)
Complex locality descriptions (detailed geographic info)
Institution codes (herbarium identifiers)

⚠️ Challenging but Manageable (60-70% accuracy)¶

Faded or damaged labels (age-related deterioration)
Non-English text (may require specialized processing)
Heavily overlapped text (specimen material covering labels)

💾 Data Storage & Access Formats¶

Database Storage (SQLite)¶

-- View all processed specimens
SELECT scientific_name, collector, event_date
FROM specimens
WHERE confidence > 0.8;

-- Count specimens by collector
SELECT collector, COUNT(*)
FROM specimens
GROUP BY collector;

Export Formats Available¶

📊 CSV/Excel - For spreadsheet analysis
🌐 Darwin Core Archives - For GBIF publication
📋 JSON - For custom applications
📈 Summary Reports - For institutional reporting

🚀 Quick Data Access Commands¶

Process Your Images¶

# Process a folder of specimen photos
python cli.py process --input ./your_photos/ --output ./results/

# Check what was extracted
python cli.py status --db ./results/app.db

Export Your Data¶

# Export to Excel for review
python export_review.py --db ./results/app.db --format xlsx

# Create Darwin Core archive for GBIF
python cli.py export --output ./results/ --format dwca

Query Your Data¶

# See processing statistics
sqlite3 ./results/app.db "SELECT status, COUNT(*) FROM processing_state GROUP BY status;"

# Find specimens by collector
sqlite3 ./results/app.db "SELECT * FROM specimens WHERE collector LIKE '%Smith%';"

📈 Real-World Data Expectations¶

From 100 Specimen Photos, You Might Get:¶

Data Quality	Count	What This Means
Excellent	~40 photos	Complete, accurate data ready for database
Good	~35 photos	Mostly accurate, minor corrections needed
Needs Review	~20 photos	Partial data, requires human verification
Failed	~5 photos	Poor image quality, manual entry required

Typical Success Rates by Field:¶

Scientific Name: 85% success rate
Collector: 80% success rate
Date: 75% success rate
Location: 70% success rate
Collection Number: 90% success rate (when present)

🔧 Getting Started - Practical Steps¶

1. Test Run (Start Here!)¶

# Test with 5-10 images first
mkdir test_batch
cp your_best_5_photos/* test_batch/
python scripts/test_real_ocr_performance.py batch test_batch/ --summary

2. Review Results¶

Open the generated review interface
Check accuracy on your specific specimen types
Identify what works well vs. what needs improvement

3. Full Processing¶

# Process your complete collection
python cli.py process --input ./all_specimens/ --output ./final_results/

📊 Data You Can Access Right Away¶

Individual Specimen Data¶

All text extracted from each image
Confidence scores for each field
Processing timestamps and methods used
Image metadata and file information

Collection Statistics¶

Total specimens processed
Success/failure rates by processing method
Most common collectors, locations, species
Date ranges and geographic distribution

Quality Control Information¶

Which specimens need human review
Confidence scores for automated extractions
Comparison with GBIF taxonomic database
Flagged inconsistencies or errors

🎯 What This Means for Your Institution¶

Immediate Value¶

Searchable digital catalog of your specimens
Export-ready data for institutional databases
GBIF-compatible formats for biodiversity sharing
Quality reports for collection assessment

Time Savings¶

Automated data entry for ~75% of specimens
Structured review process for remaining 25%
Bulk export capabilities for institutional systems
Standardized formats reducing manual formatting

Research Benefits¶

Discoverable collections through online databases
Standardized metadata for research collaboration
Geographic/temporal analysis of collection patterns
Integration with global biodiversity networks

🚨 Important Limitations¶

What This Tool CANNOT Do¶

❌ Identify species from images (only extracts existing labels)
❌ Read severely damaged labels (some manual work required)
❌ Guarantee 100% accuracy (human review recommended)
❌ Replace taxonomic expertise (validation still needed)

What Still Requires Human Work¶

🤝 Quality control review of extracted data
🤝 Verification of scientific names against current taxonomy
🤝 Resolution of ambiguous handwriting
🤝 Geographic coordinate validation

📞 Getting Help with Your Data¶

Test Your Data First¶

# Run our practical test on your images
python scripts/test_real_ocr_performance.py batch ./your_samples/

Questions to Ask¶

"What's the accuracy on MY specimens?" - Run the test script
"How much manual work is required?" - Check the confidence scores
"What format do I need for my database?" - See export options
"How long will processing take?" - Depends on image count and review needs

This overview shows you exactly what data you can extract and access from your herbarium specimen collection using this toolkit! 🌿📊

[AAFC]: Agriculture and Agri-Food Canada [GBIF]: Global Biodiversity Information Facility [DwC]: Darwin Core [OCR]: Optical Character Recognition [API]: Application Programming Interface [CSV]: Comma-Separated Values [IPT]: Integrated Publishing Toolkit [TDWG]: Taxonomic Databases Working Group