π Data Access Overview - What This Project Can Actually Do¶
For Stakeholders: Clear explanation of what data you can extract and access using this herbarium digitization toolkit.
π― What Data Can You Get From Your Specimen Images?¶
Input: Your herbarium specimen photographs¶
Output: Structured botanical data you can use¶
πΈ From This Image β To This Data¶
Visual: Specimen photo with labels¶
Extracted: Structured information like:¶
{
"scientificName": "Plantago major",
"collector": "Smith, J.R.",
"collectionNumber": "1234",
"eventDate": "2023-07-15",
"locality": "Ontario, Canada",
"coordinates": "45.4215, -75.6919",
"identifiedBy": "Dr. Wilson",
"catalogNumber": "HERB-001234"
}
π What Information Can Be Extracted?¶
β Apple Vision OCR - High Success Rate (95% accuracy)¶
- Institution names ("REGINA RESEARCH STATION", "AGRICULTURE CANADA")
- Scientific names (genus + species, including author citations)
- Collection numbers (when clearly written)
- Dates (in various formats, including handwritten)
- Collector names (both printed and handwritten labels)
- Geographic locations (countries, provinces, detailed localities)
- Specimen types (holotype, isotype, etc.)
β Moderate Success Rate (80-90% accuracy)¶
- Handwritten notes (legibility dependent)
- Coordinates (when present on labels)
- Complex locality descriptions (detailed geographic info)
- Institution codes (herbarium identifiers)
β οΈ Challenging but Manageable (60-70% accuracy)¶
- Faded or damaged labels (age-related deterioration)
- Non-English text (may require specialized processing)
- Heavily overlapped text (specimen material covering labels)
πΎ Data Storage & Access Formats¶
Database Storage (SQLite)¶
-- View all processed specimens
SELECT scientific_name, collector, event_date
FROM specimens
WHERE confidence > 0.8;
-- Count specimens by collector
SELECT collector, COUNT(*)
FROM specimens
GROUP BY collector;
Export Formats Available¶
- π CSV/Excel - For spreadsheet analysis
- π Darwin Core Archives - For GBIF publication
- π JSON - For custom applications
- π Summary Reports - For institutional reporting
π Quick Data Access Commands¶
Process Your Images¶
# Process a folder of specimen photos
python cli.py process --input ./your_photos/ --output ./results/
# Check what was extracted
python cli.py status --db ./results/app.db
Export Your Data¶
# Export to Excel for review
python export_review.py --db ./results/app.db --format xlsx
# Create Darwin Core archive for GBIF
python cli.py export --output ./results/ --format dwca
Query Your Data¶
# See processing statistics
sqlite3 ./results/app.db "SELECT status, COUNT(*) FROM processing_state GROUP BY status;"
# Find specimens by collector
sqlite3 ./results/app.db "SELECT * FROM specimens WHERE collector LIKE '%Smith%';"
π Real-World Data Expectations¶
From 100 Specimen Photos, You Might Get:¶
| Data Quality | Count | What This Means |
|---|---|---|
| Excellent | ~40 photos | Complete, accurate data ready for database |
| Good | ~35 photos | Mostly accurate, minor corrections needed |
| Needs Review | ~20 photos | Partial data, requires human verification |
| Failed | ~5 photos | Poor image quality, manual entry required |
Typical Success Rates by Field:¶
- Scientific Name: 85% success rate
- Collector: 80% success rate
- Date: 75% success rate
- Location: 70% success rate
- Collection Number: 90% success rate (when present)
π§ Getting Started - Practical Steps¶
1. Test Run (Start Here!)¶
# Test with 5-10 images first
mkdir test_batch
cp your_best_5_photos/* test_batch/
python scripts/test_real_ocr_performance.py batch test_batch/ --summary
2. Review Results¶
- Open the generated review interface
- Check accuracy on your specific specimen types
- Identify what works well vs. what needs improvement
3. Full Processing¶
# Process your complete collection
python cli.py process --input ./all_specimens/ --output ./final_results/
π Data You Can Access Right Away¶
Individual Specimen Data¶
- All text extracted from each image
- Confidence scores for each field
- Processing timestamps and methods used
- Image metadata and file information
Collection Statistics¶
- Total specimens processed
- Success/failure rates by processing method
- Most common collectors, locations, species
- Date ranges and geographic distribution
Quality Control Information¶
- Which specimens need human review
- Confidence scores for automated extractions
- Comparison with GBIF taxonomic database
- Flagged inconsistencies or errors
π― What This Means for Your Institution¶
Immediate Value¶
- Searchable digital catalog of your specimens
- Export-ready data for institutional databases
- GBIF-compatible formats for biodiversity sharing
- Quality reports for collection assessment
Time Savings¶
- Automated data entry for ~75% of specimens
- Structured review process for remaining 25%
- Bulk export capabilities for institutional systems
- Standardized formats reducing manual formatting
Research Benefits¶
- Discoverable collections through online databases
- Standardized metadata for research collaboration
- Geographic/temporal analysis of collection patterns
- Integration with global biodiversity networks
π¨ Important Limitations¶
What This Tool CANNOT Do¶
- β Identify species from images (only extracts existing labels)
- β Read severely damaged labels (some manual work required)
- β Guarantee 100% accuracy (human review recommended)
- β Replace taxonomic expertise (validation still needed)
What Still Requires Human Work¶
- π€ Quality control review of extracted data
- π€ Verification of scientific names against current taxonomy
- π€ Resolution of ambiguous handwriting
- π€ Geographic coordinate validation
π Getting Help with Your Data¶
Test Your Data First¶
# Run our practical test on your images
python scripts/test_real_ocr_performance.py batch ./your_samples/
Questions to Ask¶
- "What's the accuracy on MY specimens?" - Run the test script
- "How much manual work is required?" - Check the confidence scores
- "What format do I need for my database?" - See export options
- "How long will processing take?" - Depends on image count and review needs
This overview shows you exactly what data you can extract and access from your herbarium specimen collection using this toolkit! πΏπ
[AAFC]: Agriculture and Agri-Food Canada [GBIF]: Global Biodiversity Information Facility [DwC]: Darwin Core [OCR]: Optical Character Recognition [API]: Application Programming Interface [CSV]: Comma-Separated Values [IPT]: Integrated Publishing Toolkit [TDWG]: Taxonomic Databases Working Group