Troubleshooting Guide
This guide helps diagnose and resolve common issues when using the herbarium OCR to Darwin Core toolkit.
Table of Contents
- Installation Issues
- OCR Engine Problems
- Image Processing Issues
- API and Network Issues
- Data Quality Problems
- Performance Issues
- Export and Format Issues
Installation Issues
Python Version Compatibility
Problem: Import errors or syntax issues
Solution:
# Check Python version
python --version
# Should be 3.11 or later
# If using older Python, install newer version
# macOS with Homebrew:
brew install python@3.11
# Update pip and install
pip install --upgrade pip
pip install -e ".[dev]"  # quotes keep zsh from treating [dev] as a glob pattern
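The same version check can be scripted, for example as a guard at the top of a setup script. This is a minimal sketch; the name python_is_supported is illustrative and not part of the toolkit.

```python
import sys

MIN_VERSION = (3, 11)  # minimum version stated in this guide


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old, install 3.11 or later"
    print(f"Python {current}: {status}")
```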
Dependency Installation Failures
Problem: Installation fails with compilation errors
Solution:
# Clear pip cache
pip cache purge
# Install with verbose output to identify issues
pip install -e ".[dev]" -v
# For M1/M2 Macs with compilation issues:
export ARCHFLAGS="-arch arm64"
pip install -e ".[dev]"
# Alternative: use conda for problematic packages
conda install tesseract pillow
Missing System Dependencies
Problem: ImportError (cannot import name 'X') when loading Tesseract or another OCR engine
macOS Solution:
# Install Tesseract
brew install tesseract
# Install additional language packs if needed
brew install tesseract-lang
# Verify installation
tesseract --version
Linux Solution:
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-fra tesseract-ocr-deu
# Verify installation
tesseract --version
OCR Engine Problems
Tesseract Not Found
Problem: pytesseract cannot locate the tesseract executable (raised as TesseractNotFoundError)
Solution:
# Check if tesseract is in PATH
which tesseract
# If not found, install and add to PATH
# Add to ~/.bashrc or ~/.zshrc:
export PATH="/opt/homebrew/bin:$PATH" # macOS with Homebrew
# Test configuration
python -c "import pytesseract; print(pytesseract.get_tesseract_version())"
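The PATH check can also be done from Python, which helps when the toolkit runs under a different environment than your interactive shell. The sketch below uses only the standard library; find_tesseract is an illustrative name, not a toolkit function.

```python
import shutil


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found on PATH; install it or extend PATH")
    else:
        print(f"tesseract found at {path}")
        # If PATH cannot be changed, pytesseract can be pointed at the binary:
        #   import pytesseract
        #   pytesseract.pytesseract.tesseract_cmd = path
```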
Poor OCR Quality
Problem: Low confidence scores, garbled text output
Diagnosis:
# Check image quality
python scripts/diagnose_images.py --input ./input/problematic/
# Test with different preprocessing
python cli.py process \
--input ./test-single-image \
--output ./test-output \
--config config/debug.toml \
--engine tesseract \
--debug
Solutions:
- Improve image preprocessing
- Adjust Tesseract parameters
- Use higher-resolution images
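As a hedged illustration of the last two points, the sketch below builds a Tesseract option string and computes how much a low-resolution scan would need to be upscaled. The helper names are hypothetical, and the 300 DPI target is the commonly cited Tesseract guideline rather than a toolkit setting.

```python
def build_tesseract_config(psm=6, oem=1):
    """Build a Tesseract option string.

    --psm selects the page segmentation mode (6 = a single uniform block of text);
    --oem selects the OCR engine mode (1 = LSTM engine only).
    """
    return f"--oem {oem} --psm {psm}"


def upscale_factor(current_dpi, target_dpi=300):
    """Return the factor needed to bring a scan up to target_dpi (never below 1.0)."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return max(1.0, target_dpi / current_dpi)


if __name__ == "__main__":
    print(build_tesseract_config())
    print(upscale_factor(150))
```

With pytesseract installed, the option string can be passed as pytesseract.image_to_string(image, config=build_tesseract_config()). Labels with sparse or scattered text may respond better to a different --psm value, so it is worth testing a few modes on a sample image.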
Apple Vision Framework Issues
Problem: Vision engine not working on macOS
Solution:
# Ensure you're running on macOS 10.15+
sw_vers
# Install PyObjC if missing
pip install pyobjc-framework-Vision
# Test Vision availability
python -c "import Vision; print('Vision available')"
PaddleOCR Installation Issues
Problem: PaddleOCR fails to install or run
Solution:
# Clear package cache
pip cache purge
# Install with specific versions
pip install paddlepaddle==2.4.2 paddleocr==2.6.1.3
# On Apple Silicon (M1/M2) Macs, install the CPU build of PaddlePaddle
# (the -i option points pip at the Tsinghua PyPI mirror)
pip install paddlepaddle -i https://pypi.tuna.tsinghua.edu.cn/simple/
# Test installation
python -c "from paddleocr import PaddleOCR; print('PaddleOCR ready')"
Image Processing Issues
Unsupported Image Formats
Problem: Images in a format the toolkit cannot read are skipped or cause errors
Solution:
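# Convert images to a supported format (requires ImageMagick);
# writes photo.jpg next to photo.tiff instead of photo.tiff.jpg
find ./input -name "*.tiff" -exec sh -c 'convert "$1" "${1%.tiff}.jpg"' _ {} \;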
# Check image integrity
python scripts/validate_images.py --input ./input/
# Supported formats: JPG, PNG, TIFF, BMP
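Before running a full batch, it can help to list files whose extension falls outside the supported formats. The following is a minimal sketch using only the standard library; the extension set is an assumption derived from the format list above, and find_unsupported_files is an illustrative name.

```python
from pathlib import Path

# Extensions assumed to correspond to the formats listed above (JPG, PNG, TIFF, BMP).
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def find_unsupported_files(input_dir):
    """Return files under input_dir whose extension is not in SUPPORTED_EXTENSIONS."""
    return sorted(
        path
        for path in Path(input_dir).rglob("*")
        if path.is_file() and path.suffix.lower() not in SUPPORTED_EXTENSIONS
    )


if __name__ == "__main__":
    for path in find_unsupported_files("./input"):
        print(f"unsupported: {path}")
```

This only inspects file names. A file with a supported extension can still be corrupt, which is what the integrity check above is for.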
Large Image Memory Issues
Problem: Processing very large images exhausts available memory
Solution:
[preprocess]
max_dim_px = 2000 # Reduce from default 4000
pipeline = ["resize", "grayscale", "binarize"] # Resize first
Alternative:
# Batch resize before processing
python scripts/batch_resize.py \
--input ./huge_images \
--output ./resized_images \
--max-dimension 2000
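To see what a given maximum dimension does to an image, the arithmetic can be sketched as follows. This illustrates aspect-preserving downscaling in general; it is not taken from the toolkit's resize step, and fit_within is a hypothetical name.

```python
def fit_within(width, height, max_dim_px=2000):
    """Scale (width, height) down so the longer side is at most max_dim_px."""
    longest = max(width, height)
    if longest <= max_dim_px:
        return width, height  # already small enough; never upscale
    scale = max_dim_px / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # An 8000 x 6000 scan shrinks to a quarter of its side lengths,
    # so it holds one sixteenth of the original pixel count.
    print(fit_within(8000, 6000))
```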
Preprocessing Pipeline Failures
Problem: Images fail during preprocessing
Diagnosis:
# Test individual preprocessing steps
python -c "
from preprocess.flows import preprocess_image
from pathlib import Path
result = preprocess_image(Path('problematic.jpg'), ['grayscale'])
print(f'Grayscale: {result is not None}')
"
Solution:
[preprocess]
# Start with minimal pipeline
pipeline = ["grayscale"]
# Add steps incrementally: "deskew", "binarize", "resize"
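Adding steps one at a time can be automated. The sketch below grows the pipeline step by step and reports the first combination that raises; first_failing_pipeline is a hypothetical helper, and run_pipeline stands for any callable that accepts a list of step names.

```python
def first_failing_pipeline(steps, run_pipeline):
    """Grow the pipeline one step at a time and return (pipeline, error) for
    the first prefix that raises, or (None, None) if every prefix succeeds."""
    for end in range(1, len(steps) + 1):
        pipeline = steps[:end]
        try:
            run_pipeline(pipeline)
        except Exception as error:  # report whatever the step raised
            return pipeline, error
    return None, None


if __name__ == "__main__":
    # Stand-in runner for demonstration: pretends that "binarize" is broken.
    def demo_runner(pipeline):
        if "binarize" in pipeline:
            raise ValueError("binarize failed")

    print(first_failing_pipeline(["grayscale", "deskew", "binarize", "resize"], demo_runner))
```

With the toolkit importable, the runner could wrap the function from the diagnosis step, for example lambda steps: preprocess_image(Path('problematic.jpg'), steps).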
API and Network Issues
OpenAI API Errors
Problem: Requests to the OpenAI API fail
Solutions:
- Rate limiting
- Authentication
- Network connectivity
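Two of these causes can be handled generically: confirm that an API key is set, and retry transient failures with exponential backoff. The sketch below uses only the standard library and makes no OpenAI calls. The helper names are illustrative, and OPENAI_API_KEY is the variable read by the official OpenAI client; whether the toolkit reads the same variable is an assumption.

```python
import os
import time


def api_key_present(env_var="OPENAI_API_KEY", environ=os.environ):
    """Return True when the variable is set to a non-blank value."""
    return bool(environ.get(env_var, "").strip())


def call_with_backoff(func, max_retries=3, base_delay=1.0,
                      sleep=time.sleep, retry_on=(Exception,)):
    """Call func(); after a failure wait base_delay * 2**attempt, then retry.

    Re-raises the last error once max_retries retries have been used.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    print("API key set:", api_key_present())
```

In practice retry_on should be narrowed to the rate-limit and connection error types raised by your client library, so that authentication errors fail immediately instead of being retried.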
GBIF API Timeouts
Problem: GBIF validation fails with timeouts
Solution:
[qc.gbif]
timeout = 30 # increase timeout
retry_delay = 5
max_retries = 3
batch_size = 10 # smaller batches
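To reason about these settings, it helps to see how batching and retries interact. The sketch below mirrors the four values above in a small dataclass and computes the worst-case wait for one batch. It assumes that max_retries counts retries after the first attempt and that retry_delay is a fixed pause between attempts; the toolkit's actual semantics may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GbifSettings:
    """Mirror of the [qc.gbif] values shown above (field meanings are assumed)."""
    timeout: float = 30.0
    retry_delay: float = 5.0
    max_retries: int = 3
    batch_size: int = 10


def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def worst_case_seconds_per_batch(settings):
    """Every attempt times out: (retries + 1) timeouts plus the pauses between them."""
    attempts = settings.max_retries + 1
    return attempts * settings.timeout + settings.max_retries * settings.retry_delay


if __name__ == "__main__":
    settings = GbifSettings()
    names = [f"name-{i}" for i in range(25)]
    print([len(batch) for batch in batched(names, settings.batch_size)])
    print(worst_case_seconds_per_batch(settings))
```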
Alternative - Offline Mode:
# Download GBIF backbone for offline use
python scripts/download_gbif_backbone.py --output ./data/gbif/
# Configure offline validation
python qc/gbif.py --offline --backbone ./data/gbif/backbone.csv
Data Quality Problems
Missing Required Darwin Core Fields
Problem: Export validation fails due to missing required fields
Diagnosis:
# Check field coverage
python qc/field_coverage.py \
--db ./output/collection/app.db \
--report ./reports/field_coverage.html
Solution:
[dwc]
strict_minimal_fields = false # Allow incomplete records
assume_country_if_missing = "Canada" # Set default country
default_basis_of_record = "PreservedSpecimen"
Invalid Coordinates
Problem: Geographic coordinates outside valid ranges
Solution:
# Run coordinate validation
python qc/coordinates.py \
--input ./output/occurrence.csv \
--fix-common-errors \
--output ./output/occurrence_fixed.csv
# Common fixes applied:
# - Swap lat/long if reversed
# - Convert degrees/minutes/seconds to decimal
# - Remove leading zeros
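Two of the fixes listed above are simple enough to show directly. The following is an illustrative sketch of DMS conversion and reversed-pair detection, not the logic in qc/coordinates.py; all function names are hypothetical.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds to decimal degrees (S and W are negative)."""
    value = abs(degrees) + minutes / 60 + seconds / 3600
    return -value if hemisphere.upper() in ("S", "W") else value


def is_valid(lat, lon):
    """Latitude must lie in [-90, 90] and longitude in [-180, 180]."""
    return -90 <= lat <= 90 and -180 <= lon <= 180


def swap_if_reversed(lat, lon):
    """Swap only when the pair is invalid as given but valid when swapped."""
    if not is_valid(lat, lon) and is_valid(lon, lat):
        return lon, lat
    return lat, lon


if __name__ == "__main__":
    print(dms_to_decimal(45, 30, 0, "N"))
    print(swap_if_reversed(-120.5, 45.0))
```

A range check can only detect a reversed pair when the value in the latitude position lies outside [-90, 90]. Pairs where both values fall within that range need a locality or country cross-check instead.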
Taxonomic Name Issues
Problem: Scientific names not recognized by GBIF
Diagnosis:
# Generate taxonomic report
python qc/taxonomy_report.py \
--db ./output/collection/app.db \
--output ./reports/taxonomy.xlsx
Solution:
# Use fuzzy matching for similar names
python qc/gbif.py \
--db ./output/collection/app.db \
--fuzzy-threshold 0.8 \
--update-names
# Manual review of unmatched names
python review_web.py \
--db ./output/collection/candidates.db \
--filter "gbif_match = false"
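To get a feel for what a 0.8 threshold accepts, the standard library's difflib can stand in for a fuzzy matcher. This is only an approximation for experimenting with thresholds; it is not the algorithm used by qc/gbif.py or by GBIF's name matching service, and best_name_match is a hypothetical helper.

```python
import difflib


def best_name_match(name, known_names, threshold=0.8):
    """Return the closest known name with similarity >= threshold, else None."""
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=threshold)
    return matches[0] if matches else None


if __name__ == "__main__":
    known = ["Quercus robur", "Acer rubrum"]
    print(best_name_match("Quercus robor", known))  # one-letter OCR error
    print(best_name_match("Pinus strobus", known))  # unrelated name
```

Lowering the threshold catches more OCR misreadings but also increases the chance of matching a different valid species, which is why unmatched and borderline names are better sent to manual review.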
Performance Issues
Slow Processing Speed
Problem: Processing takes much longer than expected
Diagnosis:
# Profile processing time
python cli.py process \
--input ./test-small \
--output ./test-output \
--profile \
--engine tesseract
# Check bottlenecks in log
grep "processing time" ./test-output/app.log
Solutions:
- Optimize OCR engine selection
- Reduce image size
- Batch processing
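Before choosing among these options, it is worth measuring which stage dominates. The following is a generic timing sketch built on the standard library; the stage names and the timed helper are illustrative and not part of the toolkit's --profile output.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_seconds = defaultdict(float)  # accumulated wall-clock time per stage


@contextmanager
def timed(stage, clock=time.perf_counter):
    """Add the elapsed time of the with-block to stage_seconds[stage]."""
    start = clock()
    try:
        yield
    finally:
        stage_seconds[stage] += clock() - start


def slowest_stage(totals):
    """Return the stage with the largest accumulated time, or None if empty."""
    return max(totals, key=totals.get) if totals else None


if __name__ == "__main__":
    with timed("preprocess"):
        time.sleep(0.01)  # stand-in for real preprocessing work
    with timed("ocr"):
        time.sleep(0.03)  # stand-in for the OCR call
    print(dict(stage_seconds), slowest_stage(stage_seconds))
```

If OCR dominates, changing the engine is the likely fix; if preprocessing dominates, reducing the image size usually helps more.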
High Memory Usage
Problem: Process uses excessive memory or crashes
Solution:
# Cap memory usage
python cli.py process \
--input ./test \
--output ./output \
--memory-limit 4GB
# Process images one at a time instead of in batches
python cli.py process \
--input ./large_collection \
--output ./output \
--sequential
Export and Format Issues
Invalid Darwin Core Archive
Problem: Generated DwC-A fails validation
Diagnosis:
# Validate archive structure
python qc/validate_dwca.py \
--input ./output/dwca_v1.0.0.zip \
--output ./validation_report.html
Solution:
# Regenerate with strict validation
python cli.py archive \
--output ./output/collection \
--version 1.0.1 \
--validate-strict \
--fix-encoding
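A quick structural check can rule out the most basic causes before running a full validation. The sketch below only verifies that the file is a readable zip with a meta.xml descriptor at its root, which Darwin Core Archives normally include. It is far less thorough than a real validator, and archive_problems is a hypothetical name.

```python
import zipfile


def archive_problems(archive_path):
    """Return a list of basic structural problems (empty if none are found)."""
    if not zipfile.is_zipfile(archive_path):
        return ["not a valid zip file"]
    problems = []
    with zipfile.ZipFile(archive_path) as archive:
        if "meta.xml" not in archive.namelist():
            problems.append("meta.xml descriptor missing from archive root")
        corrupt = archive.testzip()  # name of the first corrupt member, or None
        if corrupt is not None:
            problems.append(f"corrupt member: {corrupt}")
    return problems


if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        print(path, archive_problems(path) or "basic structure OK")
```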
CSV Export Encoding Issues
Problem: Special characters corrupted in CSV files
Solution:
# Export with UTF-8 BOM for Excel compatibility
python export_review.py \
--db ./output/app.db \
--format csv \
--encoding utf-8-sig \
--output ./exports/compatible.csv
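The utf-8-sig encoding prepends a byte order mark (the bytes EF BB BF), which Excel uses to detect UTF-8. If you post-process exports yourself, the same effect can be had with Python's csv module. This is a minimal sketch with an illustrative function name.

```python
import csv


def write_excel_friendly_csv(path, header, rows):
    """Write rows as UTF-8 with a leading BOM so Excel detects the encoding."""
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


if __name__ == "__main__":
    write_excel_friendly_csv(
        "compatible.csv",
        ["scientificName", "locality"],
        [["Quercus robur", "Île d'Orléans, Québec"]],
    )
```

Tools that do not expect a BOM may treat it as part of the first column name, so read such files back with encoding="utf-8-sig" as well.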
Large Export File Issues
Problem: Export files too large for downstream systems
Solution:
# Split large exports
python export_review.py \
--db ./output/app.db \
--format csv \
--split-size 10000 \
--output-prefix ./exports/batch_
# Compress exports
gzip ./exports/*.csv
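If an export has already been written as one large file, it can be split afterwards. The sketch below is a standard-library illustration that repeats the header in every part; split_csv and its numbering scheme are assumptions and do not describe how --split-size names its output.

```python
import csv
from pathlib import Path


def split_csv(source, output_prefix, rows_per_file=10000):
    """Split source into numbered CSV files, repeating the header in each part."""
    written = []
    out = None
    try:
        with open(source, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, None)
            if header is None:
                return written  # empty source: nothing to split
            for index, row in enumerate(reader):
                if index % rows_per_file == 0:
                    if out is not None:
                        out.close()
                    path = Path(f"{output_prefix}{len(written) + 1:04d}.csv")
                    out = open(path, "w", newline="", encoding="utf-8")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    written.append(path)
                writer.writerow(row)
    finally:
        if out is not None:
            out.close()
    return written
```

For example, split_csv("./exports/all.csv", "./exports/batch_") would write batch_0001.csv, batch_0002.csv, and so on, each holding at most 10000 data rows.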
Getting Additional Help
Debug Mode
Enable verbose logging for detailed diagnostics:
python cli.py process \
--input ./problematic \
--output ./debug-output \
--log-level DEBUG \
--save-intermediates
Generate Support Bundle
# Create comprehensive diagnostic report
python scripts/create_support_bundle.py \
--output ./support_bundle.zip \
--include-logs \
--include-config \
--include-sample-data
Community Resources
- GitHub Issues: Report bugs and feature requests
- Documentation: Check docs/ directory for detailed guides
- Configuration Examples: See config/ directory for working configurations
Configuration Validation
# Validate your configuration before processing
python scripts/validate_config.py --config ./config/custom.toml
# Test all engines
python scripts/test_engines.py --config ./config/custom.toml
Abbreviations:
- AAFC: Agriculture and Agri-Food Canada
- GBIF: Global Biodiversity Information Facility
- DwC: Darwin Core
- OCR: Optical Character Recognition
- API: Application Programming Interface
- CSV: Comma-Separated Values
- IPT: Integrated Publishing Toolkit
- TDWG: Taxonomic Databases Working Group