Reproducible Image Access for Herbarium Digitization
This guide explains how to set up and use reproducible image references for testing, documentation, and development of the herbarium digitization toolkit.
Overview
The toolkit provides a comprehensive system for managing test images that enables:
- Reproducible testing across different environments
- Consistent documentation with standard example images
- Quality stratification for realistic testing scenarios
- Public accessibility for team collaboration and community use
Setup Process
Step 1: Configure AWS Access
You have several options for AWS access:
Option A: Use Existing API Key
If you have an AWS API key from another repository:
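- Copy your AWS credentials into the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (see Troubleshooting below).
- Or create a credentials file at the standard location, ~/.aws/credentials.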
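To illustrate the two options, the sketch below reports which credential source is present on a machine. It assumes the standard AWS environment variable names and an INI-style `~/.aws/credentials` file with a `[default]` profile. The helper name `find_aws_credentials` is hypothetical and not part of the toolkit.

```python
import configparser
import os
from pathlib import Path
from typing import Mapping, Optional

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def find_aws_credentials(
    env: Optional[Mapping[str, str]] = None,
    credentials_path: Optional[Path] = None,
    profile: str = "default",
) -> Optional[str]:
    """Return "environment", "credentials_file", or None (hypothetical helper)."""
    env = os.environ if env is None else env
    # Option 1: both variables are exported in the shell.
    if all(env.get(key.upper()) for key in REQUIRED_KEYS):
        return "environment"

    # Option 2: an INI-style credentials file containing the requested profile.
    path = credentials_path or Path.home() / ".aws" / "credentials"
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is skipped without raising
    if parser.has_section(profile) and all(
        parser.get(profile, key, fallback="") for key in REQUIRED_KEYS
    ):
        return "credentials_file"
    return None
```

Calling `find_aws_credentials()` with no arguments inspects the current environment and home directory. A result of `None` suggests that neither option has been set up yet.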
Option B: Create New Claude-Specific Key
For dedicated access, create a new IAM user with S3 read permissions:
- AWS Console → IAM → Users → Create User
- Attach policy: AmazonS3ReadOnlyAccess
- Create access key for programmatic access
- Use credentials as in Option A
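AmazonS3ReadOnlyAccess grants read access to every bucket in the account. Where narrower access is preferred, a custom policy limited to one bucket is a possible alternative. The sketch below builds such a policy using the two actions named in the Troubleshooting section (s3:ListBucket and s3:GetObject). The function name `scoped_read_policy` and the bucket name are placeholders, not something the toolkit provides.

```python
import json


def scoped_read_policy(bucket: str) -> dict:
    """Build an IAM identity policy limited to listing and reading one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket applies to the bucket itself.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # s3:GetObject applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


# Print JSON that can be pasted into the IAM policy editor.
print(json.dumps(scoped_read_policy("your-herbarium-bucket"), indent=2))
```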
Step 2: Discover Your S3 Bucket
Use the setup script to find and explore your bucket:
# Install required dependency
pip install boto3
# List available buckets
python scripts/setup_s3_access.py --list-buckets
# Explore a specific bucket
python scripts/setup_s3_access.py --bucket your-herbarium-bucket --explore
# Update configuration with discovered images
python scripts/setup_s3_access.py --bucket your-herbarium-bucket --update-config
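For readers who want to see what bucket exploration can look like in code, here is a minimal sketch using the boto3 `list_objects_v2` paginator. It is not the implementation of setup_s3_access.py, which is not shown in this guide. The function name `iter_image_keys` and the suffix list are assumptions.

```python
from typing import Any, Iterator, Sequence

# Assumed set of image extensions; adjust to match your collection.
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".tif", ".tiff")


def iter_image_keys(
    s3_client: Any,
    bucket: str,
    prefix: str = "",
    suffixes: Sequence[str] = IMAGE_SUFFIXES,
) -> Iterator[str]:
    """Yield object keys in `bucket` whose names end with an image suffix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page holds no objects.
        for obj in page.get("Contents", []):
            if obj["Key"].lower().endswith(tuple(suffixes)):
                yield obj["Key"]


# Usage (requires boto3 and valid credentials):
#   import boto3
#   keys = list(iter_image_keys(boto3.client("s3"), "your-herbarium-bucket"))
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without network access.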
Step 3: Verify Configuration
After setup, verify your configuration works:
# List available image categories
python scripts/manage_test_images.py list-categories
# Validate that URLs are accessible
python scripts/manage_test_images.py validate-urls
# List available sample collections
python scripts/manage_test_images.py list-collections
Image Quality Stratification
The system organizes images into quality categories for realistic testing:
Readable Specimens (40% of test set)
- Characteristics: Clear, legible labels with good lighting
- Expected Accuracy: >95% with GPT processing
- Use Case: Demonstrating best-case performance
Minimal Text Specimens (25% of test set)
- Characteristics: Some readable text, acceptable quality
- Expected Accuracy: ~85% with hybrid triage
- Use Case: Testing OCR fallback scenarios
Unlabeled Specimens (20% of test set)
- Characteristics: No visible text labels, specimen only
- Expected Accuracy: ~30% (limited to specimen analysis)
- Use Case: Testing edge cases and failure modes
Poor Quality Specimens (15% of test set)
- Characteristics: Blurry, damaged, or difficult to process
- Expected Accuracy: ~15% (requires manual review)
- Use Case: Testing robustness and error handling
Multilingual Specimens (Variable)
- Characteristics: Labels in various languages
- Expected Accuracy: ~80% with multilingual OCR
- Use Case: Testing language detection and processing
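The four fixed shares above (40/25/20/15) do not always divide a bundle evenly. A 10-image demo set, for example, would need 2.5 minimal-text images. The sketch below shows one way to turn the percentages into whole-image counts using largest-remainder rounding. How manage_test_images.py handles this internally is not documented here, so treat this as an illustration only. Only the key `readable_specimens` appears elsewhere in this guide; the other category keys are assumed by analogy.

```python
# Shares of the test set, in percent, as listed above.
# Keys other than "readable_specimens" are assumed names.
CATEGORY_SHARES = {
    "readable_specimens": 40,
    "minimal_text_specimens": 25,
    "unlabeled_specimens": 20,
    "poor_quality_specimens": 15,
}


def allocate(total: int, shares: dict[str, int] = CATEGORY_SHARES) -> dict[str, int]:
    """Split `total` images by percentage using largest-remainder rounding."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {name: total * pct // 100 for name, pct in shares.items()}
    leftover = total - sum(counts.values())
    # Give the leftover images to the categories with the largest remainders.
    by_remainder = sorted(
        shares, key=lambda name: (total * shares[name]) % 100, reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


print(allocate(10))
# {'readable_specimens': 4, 'minimal_text_specimens': 3,
#  'unlabeled_specimens': 2, 'poor_quality_specimens': 1}
```

Ties are broken in listing order, so the split is deterministic for a given bundle size.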
Usage Examples
Create Test Bundles for Development
# Create a small demo bundle (10 images)
python scripts/manage_test_images.py create-bundle demo \
  --output ./test_images/demo \
  --download
# Create comprehensive validation set (100 images)
python scripts/manage_test_images.py create-bundle validation \
  --output ./test_images/validation \
  --download
# Create performance benchmark set (1000 images)
python scripts/manage_test_images.py create-bundle benchmark \
  --output ./test_images/benchmark
  # Note: --download omitted for large sets to use URLs directly
Generate Documentation URLs
# Get 3 URLs per category for documentation
python scripts/manage_test_images.py generate-doc-urls --count 3
Output example:
readable_specimens:
  https://your-bucket.s3.us-east-1.amazonaws.com/clear_specimen_001.jpg
  https://your-bucket.s3.us-east-1.amazonaws.com/readable_label_002.jpg
  https://your-bucket.s3.us-east-1.amazonaws.com/good_quality_003.jpg
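URLs of the form shown in the output can be assembled from a bucket name, a region, and an object key. The following sketch does so with the standard library, percent-encoding the key so that spaces and other special characters remain valid. The function name `s3_object_url` is hypothetical.

```python
from urllib.parse import quote


def s3_object_url(bucket: str, region: str, key: str) -> str:
    """Build a virtual-hosted-style S3 URL, percent-encoding the key."""
    # Keep "/" unescaped so keys with folder-like prefixes stay readable.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key, safe='/')}"


print(s3_object_url("your-bucket", "us-east-1", "clear_specimen_001.jpg"))
# https://your-bucket.s3.us-east-1.amazonaws.com/clear_specimen_001.jpg
```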
Validate Image Accessibility
# Check all categories
python scripts/manage_test_images.py validate-urls
# Check specific category
python scripts/manage_test_images.py validate-urls --category readable_specimens
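A URL accessibility check of this kind can be approximated with a HEAD request, which confirms that an object is reachable without downloading it. The sketch below is an assumption about what such a check might look like, not the behavior of validate-urls itself. The `urlopen` parameter exists only so the function can be exercised without network access.

```python
import urllib.request
from typing import Callable


def is_accessible(
    url: str,
    timeout: float = 10.0,
    urlopen: Callable = urllib.request.urlopen,
) -> bool:
    """Return True if a HEAD request for `url` succeeds with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError (e.g. 403 or 404), and socket timeouts
        # are all subclasses of OSError.
        return False
```

A 403 response from S3 usually points to a permissions issue rather than a missing object, so it may be worth logging the failure reason in a fuller implementation.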
Integration with Processing Scripts
Use with Hybrid Triage Processing
# Process a test bundle with the hybrid triage system
python scripts/process_with_hybrid_triage.py \
  --input ./test_images/validation \
  --output ./results/validation_test \
  --budget 5.00 \
  --openai-api-key your_key
Use with OCR Validation
# Run validation tests using stratified samples
python scripts/run_ocr_validation.py \
  --engines tesseract vision_swift multilingual \
  --test-bundle ./test_images/validation \
  --config config/test_validation.toml
Public Access Configuration
Making Images Publicly Accessible
To make images accessible to teammates and community members:
- S3 Bucket Policy (if using S3): grant public read access (s3:GetObject) to the image objects.
- CDN Setup (optional, for better performance): serve the bucket through a CDN endpoint.
- URL Templates: The system supports multiple URL patterns:
  - Direct S3: https://bucket.s3.region.amazonaws.com/key
  - CDN: https://cdn-endpoint/key
  - Custom domain: https://images.your-domain.com/key
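As an illustration of the bucket policy item, the sketch below generates a policy document that allows anonymous read access to objects in a bucket, optionally limited to a key prefix. The function name, the `Sid` value, and the bucket name are placeholders. Review any public policy carefully before applying it, since it exposes every matching object to the internet.

```python
import json


def public_read_policy(bucket: str, prefix: str = "") -> str:
    """Return a bucket policy (as JSON) granting anonymous s3:GetObject."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PublicReadTestImages",  # placeholder identifier
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                # Limit exposure to objects under `prefix` when one is given.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(public_read_policy("your-herbarium-bucket", prefix="test_images/"))
```

S3 rejects public bucket policies while the bucket's Block Public Access settings forbid them, so those settings may need adjusting first.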
File Structure
After setup, your repository will have:
config/
├── image_sources.toml          # Central configuration
└── test_validation.toml        # Testing parameters
scripts/
├── setup_s3_access.py          # Initial S3 configuration
└── manage_test_images.py       # Image management utilities
test_images/                    # Downloaded test bundles
├── demo/                       # Small demo set
├── validation/                 # Comprehensive validation set
└── benchmark/                  # Performance testing set
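A quick way to confirm that the layout above is in place is to check for the expected files. The sketch below does this with `pathlib`. The helper name `missing_paths` is hypothetical, and the test_images/ folders are left out because they only appear after bundles have been created.

```python
from pathlib import Path

# Files listed in the structure above.
EXPECTED_PATHS = (
    "config/image_sources.toml",
    "config/test_validation.toml",
    "scripts/setup_s3_access.py",
    "scripts/manage_test_images.py",
)


def missing_paths(repo_root: Path) -> list[str]:
    """Return the expected files that do not exist under `repo_root`."""
    return [rel for rel in EXPECTED_PATHS if not (repo_root / rel).is_file()]
```

Running `missing_paths(Path("."))` from the repository root returns an empty list when all four files are present.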
Troubleshooting
Common Issues
AWS credentials not found:
# Set environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
Bucket access denied:
- Verify IAM permissions include s3:ListBucket and s3:GetObject
- Check bucket policy allows your IAM user/role
Images not downloading:
Configuration not found:
Validation Commands
# Test AWS connection (aws s3 ls does not accept --max-items, so use s3api)
aws s3api list-objects-v2 --bucket your-bucket --max-items 5
# Test image accessibility
python scripts/manage_test_images.py validate-urls --category readable_specimens
# Verify bundle creation
python scripts/manage_test_images.py create-bundle demo --output ./test --download
Benefits for Team Collaboration
For Developers
- Consistent test data across development environments
- Reproducible benchmarks for performance comparisons
- Automated testing with realistic image diversity
For Documentation
- Standard example images for tutorials and guides
- Quality category examples for accuracy demonstrations
- Public URLs for easy sharing in documentation
For Scientific Users
- Realistic test scenarios matching real herbarium collections
- Quality expectations aligned with processing capabilities
- Reproducible workflows for institutional adoption
Next Steps
Once your reproducible image system is configured:
- Run validation tests to establish baseline performance
- Update documentation with your specific image examples
- Share public URLs with team members for collaboration
- Integrate with CI/CD for automated testing with real images
The system provides a solid foundation for reproducible, collaborative herbarium digitization workflows!