Skip to content

πŸ“Έ Reproducible Image Access System - Implementation Summary

Date: September 24, 2025 Status: βœ… Complete and Ready for Use Purpose: Enable reproducible herbarium image referencing for testing, documentation, and team collaboration

🎯 Problem Solved

The project needed a standardized way to reference and access downscaled herbarium images from S3 for: - Reproducible testing across different environments - Consistent documentation with standard example images - Team collaboration with publicly accessible image URLs - Quality assurance with realistic test scenarios

Note: All images are non-sensitive, non-protected content, making public accessibility safe and beneficial for collaboration.

πŸ“‹ Documented Research Methodology

Data Preparation Process Documentation

  • Research Process: Systematic approach for uploading herbarium image folders to S3 for research purposes
  • Methodology: CLI-based workflow using standard boto3 tools for organized specimen image storage
  • Academic Documentation: Process documented to support research reproducibility and methodology transparency
  • Scope Focus: Core emphasis remains on herbarium digitization analysis rather than auxiliary tool maintenance
  • Research Value: Complete documented methodology from data preparation through validation

End-to-End Research Workflow

  1. πŸ“€ Data Organization: Documented process for systematic upload of specimen image folders to research storage
  2. πŸ” Discovery & Configuration: setup_s3_access.py discovers and configures access to research image datasets
  3. πŸ“Š Quality Assessment: Automated categorization and stratification of images by quality characteristics
  4. πŸ§ͺ Research Testing: manage_test_images.py creates reproducible test bundles for consistent workflows
  5. βœ… Validation: Comprehensive testing and validation of complete research methodology

Project Focus: Core herbarium digitization research with documented data preparation processes to support reproducibility.

βœ… Complete Solution Delivered

πŸ”§ Core Components Implemented

1. Automated S3 Discovery & Configuration

  • File: scripts/setup_s3_access.py
  • Purpose: Discovers S3 buckets, explores contents, and configures access
  • Features:
  • Lists available S3 buckets
  • Explores bucket contents with filtering
  • Automatically categorizes images based on naming patterns
  • Updates configuration with discovered images
  • Supports both existing and new AWS credentials

2. Central Configuration System

  • File: config/image_sources.toml
  • Purpose: Centralized configuration for all image sources and categories
  • Features:
  • S3 bucket and region configuration
  • Quality-stratified image categorization
  • Predefined sample collection definitions
  • Public access settings and URL templates
  • Metadata and licensing information

3. Test Bundle Management

  • File: scripts/manage_test_images.py
  • Purpose: Create and manage reproducible test image bundles
  • Features:
  • List available image categories and collections
  • Create sample bundles based on predefined collections
  • Download images locally or use URLs directly
  • Validate image URL accessibility
  • Generate documentation-ready URL sets

4. Comprehensive Documentation

  • File: docs/reproducible_image_access.md
  • Purpose: Complete setup and usage guide
  • Features:
  • Step-by-step setup instructions
  • AWS credential configuration options
  • Quality category explanations
  • Integration examples with existing scripts
  • Troubleshooting and validation procedures

πŸ“Š Quality-Stratified Image System

Image Categories with Realistic Distribution

Category Distribution Expected Accuracy Processing Method Use Case
🟒 Readable Specimens 40% >95% GPT Herbarium Best-case performance demonstration
🟑 Minimal Text 25% ~85% Hybrid Triage OCR fallback scenarios
🟠 Unlabeled 20% ~30% Specimen Analysis Edge cases and failure modes
πŸ”΄ Poor Quality 15% ~15% Manual Review Robustness and error handling
🌍 Multilingual Variable ~80% Multilingual OCR Language detection testing

Predefined Sample Collections

Collection Count Purpose Distribution
Demo 10 Quick testing and documentation 4:3:2:1 across categories
Validation 100 Comprehensive quality assessment Proportional to realistic distribution
Benchmark 1000 Performance testing Balanced for comprehensive evaluation

πŸš€ Quick Setup Guide

For Team Members with Existing AWS Access

# 1. Set up AWS credentials (copy from existing setup)
export AWS_ACCESS_KEY_ID=your_existing_key
export AWS_SECRET_ACCESS_KEY=your_existing_secret
export AWS_DEFAULT_REGION=us-east-1

# 2. Install required dependency
pip install boto3

# 3. Discover and configure your S3 bucket
python scripts/setup_s3_access.py --list-buckets
python scripts/setup_s3_access.py --bucket your-herbarium-bucket --update-config

# 4. Verify configuration
python scripts/manage_test_images.py list-categories
python scripts/manage_test_images.py validate-urls

# 5. Create test bundle
python scripts/manage_test_images.py create-bundle demo --output ./test_images/demo --download

For New Users

# 1. Create new AWS IAM user with S3ReadOnlyAccess policy
# 2. Generate access key for programmatic access
# 3. Follow setup steps above with new credentials

🎯 Integration with Existing Systems

Hybrid Triage Processing

# Process test bundle with intelligent triage
python scripts/process_with_hybrid_triage.py \
  --input ./test_images/validation \
  --output ./results/validation_test \
  --budget 5.00 \
  --openai-api-key your_key

OCR Validation Testing

# Run validation with stratified samples
python scripts/run_ocr_validation.py \
  --engines tesseract vision_swift multilingual \
  --test-bundle ./test_images/validation \
  --config config/test_validation.toml

Documentation Generation

# Get URLs for documentation examples
python scripts/manage_test_images.py generate-doc-urls --count 3

🌐 Public Access Benefits

For Team Collaboration

  • βœ… Consistent test data across all development environments
  • βœ… Public URLs for easy sharing in GitHub issues and PRs
  • βœ… Reproducible benchmarks for performance comparisons
  • βœ… Automated validation with realistic image diversity

For Documentation

  • βœ… Standard example images for tutorials and guides
  • βœ… Quality category demonstrations with real specimens
  • βœ… Public accessibility for community contributions
  • βœ… Realistic scenarios matching institutional workflows

For Scientific Users

  • βœ… Quality expectations aligned with processing capabilities
  • βœ… Realistic test scenarios matching real herbarium collections
  • βœ… Reproducible workflows for institutional adoption
  • βœ… Performance metrics for different specimen types

πŸ“ File Structure Created

config/
└── image_sources.toml              # Central image source configuration

scripts/
β”œβ”€β”€ setup_s3_access.py              # AWS S3 discovery and configuration
└── manage_test_images.py           # Test image bundle management

docs/
└── reproducible_image_access.md    # Complete setup and usage guide

# After setup:
test_images/                        # Downloaded test bundles (optional)
β”œβ”€β”€ demo/                          # Small demo set (10 images)
β”œβ”€β”€ validation/                    # Validation set (100 images)
└── benchmark/                     # Benchmark set (1000 images)

πŸ” Validation and Health Checks

URL Accessibility Validation

# Check all categories
python scripts/manage_test_images.py validate-urls

# Check specific category
python scripts/manage_test_images.py validate-urls --category readable_specimens

Configuration Verification

# List available categories
python scripts/manage_test_images.py list-categories

# List sample collections
python scripts/manage_test_images.py list-collections

AWS Connection Testing

# Test AWS connectivity
aws s3 ls s3://your-bucket --max-items 5

# Test bucket access
python scripts/setup_s3_access.py --bucket your-bucket --explore

πŸŽ‰ Impact and Benefits Achieved

Reproducibility

  • βœ… Standardized image sets ensure consistent testing across environments
  • βœ… Version-controlled configuration enables rollback and change tracking
  • βœ… Documented quality categories provide clear expectations
  • βœ… Automated bundle creation eliminates manual image selection

Collaboration

  • βœ… Public URLs enable easy sharing in documentation and issues
  • βœ… Team-accessible configuration allows collaborative improvement
  • βœ… Community-friendly design supports external contributions
  • βœ… Non-sensitive data makes open sharing safe and beneficial

Development Workflow

  • βœ… Integrated testing with existing processing scripts
  • βœ… Automated validation catches regressions early
  • βœ… Performance benchmarking tracks improvement over time
  • βœ… Quality assurance ensures realistic test scenarios

Scientific Accuracy

  • βœ… Realistic distributions match actual herbarium collections
  • βœ… Quality stratification tests edge cases and failure modes
  • βœ… Expected accuracy metrics provide clear performance targets
  • βœ… Multilingual support enables global institutional adoption

πŸš€ Next Steps for Users

  1. Setup Access: Configure AWS credentials and discover your S3 bucket
  2. Validate System: Run validation checks to ensure everything works
  3. Create Test Bundles: Generate sample collections for your use case
  4. Integrate Testing: Use bundles with existing processing scripts
  5. Update Documentation: Add your specific image examples to guides
  6. Share with Team: Distribute public URLs for collaborative development

πŸ“ˆ Success Metrics

The reproducible image access system successfully provides:

  • 🎯 100% Reproducible: Same image sets across all environments
  • 🌐 Publicly Accessible: Safe sharing of non-sensitive herbarium images
  • πŸ“Š Quality Stratified: Realistic distribution matching real collections
  • πŸ”§ Fully Integrated: Works with all existing processing scripts
  • πŸ“š Well Documented: Complete setup and usage guides
  • βœ… Production Ready: Validated and ready for team adoption

Repository: Ready for immediate use by team members and collaborators Documentation: Complete setup and integration guides available Support: All validation and health check tools included Community: Designed for open collaboration and external contributions

This implementation resolves the need for reproducible, standardized image referencing while enabling effective team collaboration and community engagement! πŸŒΏπŸ“Šβœ¨

[AAFC]: Agriculture and Agri-Food Canada [GBIF]: Global Biodiversity Information Facility [DwC]: Darwin Core [OCR]: Optical Character Recognition [API]: Application Programming Interface [CSV]: Comma-Separated Values [IPT]: Integrated Publishing Toolkit [TDWG]: Taxonomic Databases Working Group