Frequently Asked Questions (FAQ)¶
Common questions and answers about the herbarium OCR to Darwin Core toolkit.
General Questions¶
What is this toolkit for?¶
The herbarium OCR to Darwin Core toolkit is designed for digitizing herbarium specimen collections. It processes images of pressed plant specimens, extracts text information using OCR (Optical Character Recognition), and maps the results to the Darwin Core biodiversity data standard for publication and sharing.
What makes this different from other OCR tools?¶
This toolkit is specifically designed for herbarium specimens and includes:

- Multiple OCR engines optimized for handwritten scientific labels
- Built-in Darwin Core field mapping
- GBIF taxonomic validation
- Quality control workflows for scientific data
- Support for multilingual historical specimens
- Export formats compatible with biodiversity databases
Who should use this toolkit?¶
- Herbarium managers and curators
- Biodiversity data managers
- Research institutions digitizing natural history collections
- GBIF data publishers
- Botanical researchers working with specimen data
Installation and Setup¶
What operating systems are supported?¶
The toolkit works on:

- macOS: Full support, including the Apple Vision framework
- Linux: Full support with all open-source engines
- Windows: Supported via WSL (Windows Subsystem for Linux)
Do I need programming experience?¶
Basic command-line familiarity is helpful, but extensive programming experience is not required. The toolkit provides:

- A simple command-line interface
- A web-based review interface
- Configuration files instead of code changes
- Comprehensive documentation and examples
What OCR engines are supported?¶
- Tesseract: Free, open-source, good for typed text
- Apple Vision: macOS only, excellent for handwritten text
- PaddleOCR: Free, supports 80+ languages
- GPT-4 Vision: Commercial API, best overall accuracy, but requires a paid OpenAI API account
How much does it cost to run?¶
- Free options: Tesseract, Apple Vision (macOS), PaddleOCR
- Commercial: GPT-4 Vision typically costs $0.01-0.05 per image depending on size and API pricing
For a collection of 1,000 specimens using GPT-4 Vision, expect costs of $10-50.
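To sanity-check a budget, the arithmetic can be scripted. The sketch below simply multiplies the per-image range quoted above by the batch size; the rates are illustrative and should be checked against current OpenAI API pricing.

# Rough cost estimate for GPT-4 Vision OCR, based on the
# $0.01-0.05 per-image range quoted above. Check current API
# pricing before budgeting a large batch.
def estimate_gpt_cost(num_images: int,
                      low_per_image: float = 0.01,
                      high_per_image: float = 0.05) -> tuple[float, float]:
    """Return a (low, high) USD estimate for a batch of images."""
    return num_images * low_per_image, num_images * high_per_image

low, high = estimate_gpt_cost(1000)
print(f"1,000 specimens with GPT-4 Vision: ${low:.2f}-${high:.2f}")
# -> 1,000 specimens with GPT-4 Vision: $10.00-$50.00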
Data and Processing¶
What image formats are supported?¶
- Supported: JPG, PNG, TIFF, BMP
- Recommended: High-resolution JPG (300+ DPI)
- File naming: Use consistent naming like INSTITUTION_BARCODE.jpg (a filename check is sketched below)
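If you want to enforce the naming convention before a run, a small script can flag files that do not match. This is only a sketch; the regular expression is an assumption about what INSTITUTION_BARCODE.jpg looks like at your institution, so adapt it to your own scheme.

# Flag image files that do not follow an INSTITUTION_BARCODE.jpg
# naming convention. The pattern below is an example only.
import re
from pathlib import Path

FILENAME_PATTERN = re.compile(
    r"^[A-Z]+_[A-Za-z0-9-]+\.(jpg|jpeg|png|tif|tiff|bmp)$", re.IGNORECASE
)

def check_filenames(image_dir: str) -> list[str]:
    """Return filenames that do not match the expected convention."""
    return [p.name for p in Path(image_dir).iterdir()
            if p.is_file() and not FILENAME_PATTERN.match(p.name)]

for name in check_filenames("./images"):   # directory path is illustrative
    print(f"Rename before processing: {name}")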
What image quality do I need?¶
Minimum requirements:

- Resolution: 150 DPI or higher
- Text readable when viewed at 100% zoom
- Good contrast between text and background

Recommended:

- Resolution: 300+ DPI
- Even, well-lit illumination
- Specimen label clearly visible
- Minimal glare or shadows

A quick way to pre-screen images for resolution is sketched below.
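The sketch below reads image dimensions and any embedded DPI metadata. It assumes Pillow is installed (pip install Pillow); the pixel-size fallback threshold is an arbitrary illustration, since DPI metadata is often missing from scans.

# Pre-flight resolution check using Pillow. DPI metadata is not
# always embedded, so fall back to raw pixel dimensions.
from PIL import Image

def check_image(path: str, min_dpi: int = 300, min_long_edge_px: int = 2000) -> None:
    with Image.open(path) as img:
        width, height = img.size
        dpi = img.info.get("dpi")        # e.g. (300, 300); may be absent
        if dpi and dpi[0] < min_dpi:
            print(f"{path}: {width}x{height} px at ~{dpi[0]:.0f} DPI - below recommended 300 DPI")
        elif not dpi and max(width, height) < min_long_edge_px:
            print(f"{path}: {width}x{height} px - may be too small for reliable OCR")
        else:
            print(f"{path}: {width}x{height} px - looks OK")

check_image("./images/EXAMPLE_000123.jpg")   # path is illustrative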
How accurate is the OCR?¶
Accuracy depends on several factors:
Text Type:

- Typed labels: 90-98% accuracy
- Clear handwriting: 80-95% accuracy
- Poor handwriting: 60-85% accuracy
- Historical faded labels: 40-80% accuracy

OCR Engine:

- GPT-4 Vision: Best overall, especially for handwriting
- Apple Vision: Excellent for handwriting on macOS
- Tesseract: Good for typed text, improving with v5
- PaddleOCR: Good multilingual support
How long does processing take?¶
Processing time per specimen:

- Tesseract: 1-5 seconds
- Apple Vision: 2-8 seconds
- PaddleOCR: 3-10 seconds
- GPT-4 Vision: 10-30 seconds (including API latency)

For 1,000 specimens:

- Tesseract only: 30 minutes - 2 hours
- Multiple engines: 2-8 hours
- GPT-4 Vision included: 4-12 hours
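For other batch sizes, the same per-specimen figures can be plugged into a simple estimate. The sketch below is a back-of-the-envelope calculation; the worker count and the midpoint timings are illustrative, and real throughput also depends on I/O and, for GPT-4 Vision, API rate limits.

# Convert per-specimen seconds (midpoints of the ranges above) into
# batch hours. Worker count and midpoints are illustrative.
def estimate_hours(num_specimens: int, seconds_per_specimen: float, workers: int = 1) -> float:
    return num_specimens * seconds_per_specimen / workers / 3600

for engine, secs in [("Tesseract", 3), ("Apple Vision", 5), ("PaddleOCR", 6), ("GPT-4 Vision", 20)]:
    hours = estimate_hours(1000, secs, workers=4)
    print(f"{engine}: ~{hours:.1f} h for 1,000 specimens with 4 workers")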
Can I process specimens in multiple languages?¶
Yes! The toolkit supports multilingual processing:
Supported languages (depending on engine):

- Latin script: English, French, German, Spanish, Italian, etc.
- Extended Latin: Danish, Swedish, Polish, Czech, etc.
- Cyrillic: Russian, Ukrainian, Bulgarian
- Asian languages: Chinese, Japanese (with PaddleOCR)
Configuration example:
[ocr]
langs = ["en", "fr", "de", "la"] # English, French, German, Latin
preferred_engine = "paddleocr"
[paddleocr]
lang = "latin"
Darwin Core and Data Standards¶
What is Darwin Core?¶
Darwin Core is the global standard for biodiversity data. It defines fields like:
- scientificName: Taxonomic name
- decimalLatitude/decimalLongitude: Geographic coordinates
- eventDate: Collection date
- recordedBy: Collector name
- basisOfRecord: Type of specimen record
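To make those terms concrete, a single herbarium occurrence record might look like the following; every value here is invented purely for illustration.

# One occurrence record expressed with the Darwin Core terms listed
# above. All values are invented for illustration only.
record = {
    "scientificName": "Carex aquatilis Wahlenb.",
    "decimalLatitude": 45.4215,
    "decimalLongitude": -75.6972,
    "eventDate": "1987-07-14",
    "recordedBy": "J. Smith",
    "basisOfRecord": "PreservedSpecimen",   # standard value for herbarium sheets
}

for term, value in record.items():
    print(f"{term}: {value}")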
What Darwin Core fields are extracted?¶
The toolkit extracts common specimen fields:
Taxonomic:

- Scientific name, family, genus, species
- Identification history and determiners

Geographic:

- Country, state/province, locality
- Coordinates (if present on label)

Temporal:

- Collection date, identification date

People:

- Collector names, determiner names

Administrative:

- Institution codes, catalog numbers
How do I customize field mapping?¶
Edit the mapping configuration:
[dwc.mappings]
# Map OCR text patterns to Darwin Core fields
collector = ["collected by", "leg.", "coll."]
locality = ["locality", "loc.", "site"]
coordinates = ["lat", "long", "°N", "°W"]
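To illustrate what such a mapping does, here is a minimal sketch of pattern-based extraction: it scans a line of OCR text for the cue words configured above and captures whatever follows. It is not the toolkit's internal implementation, just the general idea.

# Scan OCR'd label text for configured cue words and capture the
# text that follows them. Illustrative only.
import re

MAPPINGS = {
    "collector": ["collected by", "leg.", "coll."],
    "locality": ["locality", "loc.", "site"],
}

def map_line(line: str) -> dict[str, str]:
    fields = {}
    for field, cues in MAPPINGS.items():
        for cue in cues:
            match = re.search(re.escape(cue) + r"\s*:?\s*(.+)", line, re.IGNORECASE)
            if match:
                fields[field] = match.group(1).strip()
                break
    return fields

print(map_line("Leg. A. Brouillet, 12 Aug 1956"))
# -> {'collector': 'A. Brouillet, 12 Aug 1956'}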
Is the output compatible with GBIF?¶
Yes! The toolkit:

- Follows the Darwin Core standard structure
- Validates taxonomic names against the GBIF backbone
- Exports Darwin Core Archive (DwC-A) format
- Includes required GBIF metadata fields
Quality Control and Review¶
How do I ensure data quality?¶
The toolkit provides multiple quality control layers:
Automated:

- Confidence scoring for OCR results
- GBIF taxonomic validation
- Geographic coordinate validation
- Duplicate detection

Manual:

- Web-based review interface
- Flagging of low-confidence records
- Batch editing capabilities
- Expert review workflows
What confidence scores should I use?¶
Recommended thresholds:

- High confidence: >0.8 - minimal review needed
- Medium confidence: 0.5-0.8 - spot check recommended
- Low confidence: <0.5 - manual review required
Configuration:
[ocr]
confidence_threshold = 0.7 # Minimum for auto-acceptance
[qc]
manual_review_threshold = 0.5 # Flag for review below this
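These thresholds translate into review buckets. The sketch below only illustrates that logic; the cutoffs 0.8 and 0.5 come from the recommendations above, everything else is hypothetical.

# Translate a confidence score into a review bucket using the
# recommended cutoffs above. Structure is hypothetical.
def review_bucket(confidence: float, auto_accept: float = 0.8, manual_review: float = 0.5) -> str:
    if confidence >= auto_accept:
        return "accept"          # minimal review needed
    if confidence >= manual_review:
        return "spot-check"      # spot check recommended
    return "manual-review"       # manual review required

for score in (0.93, 0.62, 0.31):
    print(f"confidence {score:.2f} -> {review_bucket(score)}")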
How do I handle problematic specimens?¶
Common issues and solutions:
- Faded labels: Use contrast enhancement preprocessing
- Handwritten text: Enable GPT or Apple Vision engines
- Multiple languages: Configure language detection
- Damaged labels: Manual data entry may be required
- Ambiguous text: Flag for expert review
Reprocessing workflow:
# Identify problematic specimens
python qc/identify_problems.py --db ./output/app.db
# Reprocess with enhanced settings
python cli.py process \
--input ./problematic \
--config config/enhanced.toml \
--engine gpt
Export and Publishing¶
What export formats are available?¶
Standard formats:

- CSV: For spreadsheet analysis
- Darwin Core Archive (DwC-A): For GBIF publishing
- JSONL: For programmatic processing
- Excel: For manual review and editing
Custom exports:
# Export specific fields
python export_review.py \
--format csv \
--fields "scientificName,decimalLatitude,decimalLongitude" \
--output coordinates.csv
# Export with filters
python export_review.py \
--format dwca \
--filter "confidence > 0.8" \
--version 2.0.0
How do I publish to GBIF?¶
Publishing involves four steps:

- Prepare data: export a Darwin Core Archive (see the export examples above)
- Validate compliance: confirm required Darwin Core fields and metadata are present (a minimal structural check is sketched below)
- Upload to an IPT (Integrated Publishing Toolkit) instance
- Register the dataset with GBIF
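Before uploading, it can help to confirm the archive at least has the expected structure. The sketch below is a minimal check, assuming the export is a zip containing meta.xml, an eml.xml metadata file, and one or more text data files; file names other than meta.xml depend on how the archive was exported, and the path is illustrative.

# Minimal structural check of a Darwin Core Archive before IPT upload.
import zipfile

def check_dwca(path: str) -> None:
    with zipfile.ZipFile(path) as archive:
        names = set(archive.namelist())
        print("meta.xml present:", "meta.xml" in names)
        print("eml.xml present:", "eml.xml" in names)
        data_files = [n for n in names if n.endswith((".txt", ".csv"))]
        print("data files:", data_files or "none found")

check_dwca("./output/dwca_export.zip")   # path is illustrative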
Can I integrate with existing collection management systems?¶
Yes, through various approaches:
Data import/export:

- CSV import to Specify, Symbiota, etc.
- API integration with modern systems
- Custom field mapping for institutional schemas
Database integration:
# Example: Export to institutional database
python export_review.py \
--format sql \
--schema institutional \
--output import_statements.sql
Troubleshooting¶
The OCR results are poor. What can I improve?¶
Check image quality: confirm your images meet the resolution and lighting recommendations above (300+ DPI, even lighting, label clearly visible).

Try preprocessing adjustments:
[preprocess]
pipeline = ["grayscale", "contrast", "deskew", "binarize"]
contrast_factor = 1.5
binarize_method = "adaptive"
Use multiple engines:
python cli.py process \
--engine tesseract \
--engine vision \
--engine gpt \
--confidence-threshold 0.6
Processing is very slow. How can I speed it up?¶
Optimize for speed:
[preprocess]
max_dim_px = 1500 # Smaller images
pipeline = ["resize", "grayscale"] # Minimal preprocessing
[ocr]
preferred_engine = "tesseract" # Fastest engine
enabled_engines = ["tesseract"] # Single engine
Parallel processing:
# Process in batches
python scripts/parallel_process.py \
--input ./large_collection \
--workers 4 \
--batch-size 100
I'm getting API errors with GPT-4 Vision¶
Common solutions:

- Rate limiting: slow down requests or retry with exponential backoff (a sketch follows this list)
- API key issues: confirm the key is set in your configuration or environment and is still valid
- Network connectivity: check your connection and any institutional proxy or firewall settings
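For rate-limit errors specifically, wrapping the call in an exponential backoff usually helps. The sketch below is generic: call_gpt_vision is a hypothetical stand-in for however you invoke the GPT engine, and in practice you would catch the API client's specific rate-limit exception rather than Exception.

# Generic retry-with-exponential-backoff wrapper for rate-limited
# API calls. `call_gpt_vision` is a hypothetical stand-in.
import time

def with_backoff(func, *args, retries: int = 5, base_delay: float = 2.0, **kwargs):
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception as exc:            # narrow this to the API's rate-limit error
            if attempt == retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Request failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

# Usage (illustrative):
# result = with_backoff(call_gpt_vision, "images/EXAMPLE_000123.jpg")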
Where can I get help?¶
- Documentation: Check the docs/ directory
- GitHub Issues: Report bugs and request features
- Configuration Examples: See config/ directory
- Community: Join discussions in GitHub Discussions
Creating a support request:
# Generate diagnostic bundle
python scripts/create_support_bundle.py \
--output support_bundle.zip \
--include-config \
--include-logs
Include this bundle when requesting help to provide context about your setup and any errors encountered.
[AAFC]: Agriculture and Agri-Food Canada
[GBIF]: Global Biodiversity Information Facility
[DwC]: Darwin Core
[OCR]: Optical Character Recognition
[API]: Application Programming Interface
[CSV]: Comma-Separated Values
[IPT]: Integrated Publishing Toolkit
[TDWG]: Taxonomic Databases Working Group