Skip to content

πŸ“Š Data Access Overview - What This Project Can Actually Do

For Stakeholders: Clear explanation of what data you can extract and access using this herbarium digitization toolkit.

🎯 What Data Can You Get From Your Specimen Images?

Input: Your herbarium specimen photographs

Output: Structured botanical data you can use


πŸ“Έ From This Image β†’ To This Data

Visual: Specimen photo with labels

Extracted: Structured information like:

{
  "scientificName": "Plantago major",
  "collector": "Smith, J.R.",
  "collectionNumber": "1234",
  "eventDate": "2023-07-15",
  "locality": "Ontario, Canada",
  "coordinates": "45.4215, -75.6919",
  "identifiedBy": "Dr. Wilson",
  "catalogNumber": "HERB-001234"
}

πŸ” What Information Can Be Extracted?

βœ… Apple Vision OCR - High Success Rate (95% accuracy)

  • Institution names ("REGINA RESEARCH STATION", "AGRICULTURE CANADA")
  • Scientific names (genus + species, including author citations)
  • Collection numbers (when clearly written)
  • Dates (in various formats, including handwritten)
  • Collector names (both printed and handwritten labels)
  • Geographic locations (countries, provinces, detailed localities)
  • Specimen types (holotype, isotype, etc.)

βœ… Moderate Success Rate (80-90% accuracy)

  • Handwritten notes (legibility dependent)
  • Coordinates (when present on labels)
  • Complex locality descriptions (detailed geographic info)
  • Institution codes (herbarium identifiers)

⚠️ Challenging but Manageable (60-70% accuracy)

  • Faded or damaged labels (age-related deterioration)
  • Non-English text (may require specialized processing)
  • Heavily overlapped text (specimen material covering labels)

πŸ’Ύ Data Storage & Access Formats

Database Storage (SQLite)

-- View all processed specimens
SELECT scientific_name, collector, event_date
FROM specimens
WHERE confidence > 0.8;

-- Count specimens by collector
SELECT collector, COUNT(*)
FROM specimens
GROUP BY collector;

Export Formats Available

  1. πŸ“Š CSV/Excel - For spreadsheet analysis
  2. 🌐 Darwin Core Archives - For GBIF publication
  3. πŸ“‹ JSON - For custom applications
  4. πŸ“ˆ Summary Reports - For institutional reporting

πŸš€ Quick Data Access Commands

Process Your Images

# Process a folder of specimen photos
python cli.py process --input ./your_photos/ --output ./results/

# Check what was extracted
python cli.py status --db ./results/app.db

Export Your Data

# Export to Excel for review
python export_review.py --db ./results/app.db --format xlsx

# Create Darwin Core archive for GBIF
python cli.py export --output ./results/ --format dwca

Query Your Data

# See processing statistics
sqlite3 ./results/app.db "SELECT status, COUNT(*) FROM processing_state GROUP BY status;"

# Find specimens by collector
sqlite3 ./results/app.db "SELECT * FROM specimens WHERE collector LIKE '%Smith%';"

πŸ“ˆ Real-World Data Expectations

From 100 Specimen Photos, You Might Get:

Data Quality Count What This Means
Excellent ~40 photos Complete, accurate data ready for database
Good ~35 photos Mostly accurate, minor corrections needed
Needs Review ~20 photos Partial data, requires human verification
Failed ~5 photos Poor image quality, manual entry required

Typical Success Rates by Field:

  • Scientific Name: 85% success rate
  • Collector: 80% success rate
  • Date: 75% success rate
  • Location: 70% success rate
  • Collection Number: 90% success rate (when present)

πŸ”§ Getting Started - Practical Steps

1. Test Run (Start Here!)

# Test with 5-10 images first
mkdir test_batch
cp your_best_5_photos/* test_batch/
python scripts/test_real_ocr_performance.py batch test_batch/ --summary

2. Review Results

  • Open the generated review interface
  • Check accuracy on your specific specimen types
  • Identify what works well vs. what needs improvement

3. Full Processing

# Process your complete collection
python cli.py process --input ./all_specimens/ --output ./final_results/

πŸ“Š Data You Can Access Right Away

Individual Specimen Data

  • All text extracted from each image
  • Confidence scores for each field
  • Processing timestamps and methods used
  • Image metadata and file information

Collection Statistics

  • Total specimens processed
  • Success/failure rates by processing method
  • Most common collectors, locations, species
  • Date ranges and geographic distribution

Quality Control Information

  • Which specimens need human review
  • Confidence scores for automated extractions
  • Comparison with GBIF taxonomic database
  • Flagged inconsistencies or errors

🎯 What This Means for Your Institution

Immediate Value

  • Searchable digital catalog of your specimens
  • Export-ready data for institutional databases
  • GBIF-compatible formats for biodiversity sharing
  • Quality reports for collection assessment

Time Savings

  • Automated data entry for ~75% of specimens
  • Structured review process for remaining 25%
  • Bulk export capabilities for institutional systems
  • Standardized formats reducing manual formatting

Research Benefits

  • Discoverable collections through online databases
  • Standardized metadata for research collaboration
  • Geographic/temporal analysis of collection patterns
  • Integration with global biodiversity networks

🚨 Important Limitations

What This Tool CANNOT Do

  • ❌ Identify species from images (only extracts existing labels)
  • ❌ Read severely damaged labels (some manual work required)
  • ❌ Guarantee 100% accuracy (human review recommended)
  • ❌ Replace taxonomic expertise (validation still needed)

What Still Requires Human Work

  • 🀝 Quality control review of extracted data
  • 🀝 Verification of scientific names against current taxonomy
  • 🀝 Resolution of ambiguous handwriting
  • 🀝 Geographic coordinate validation

πŸ“ž Getting Help with Your Data

Test Your Data First

# Run our practical test on your images
python scripts/test_real_ocr_performance.py batch ./your_samples/

Questions to Ask

  1. "What's the accuracy on MY specimens?" - Run the test script
  2. "How much manual work is required?" - Check the confidence scores
  3. "What format do I need for my database?" - See export options
  4. "How long will processing take?" - Depends on image count and review needs

This overview shows you exactly what data you can extract and access from your herbarium specimen collection using this toolkit! πŸŒΏπŸ“Š

[AAFC]: Agriculture and Agri-Food Canada [GBIF]: Global Biodiversity Information Facility [DwC]: Darwin Core [OCR]: Optical Character Recognition [API]: Application Programming Interface [CSV]: Comma-Separated Values [IPT]: Integrated Publishing Toolkit [TDWG]: Taxonomic Databases Working Group