
Storage Abstraction Architecture

Version: 1.0 | Status: Implemented | Last Updated: 2025-10-04

Overview

The storage abstraction layer decouples the core extraction pipeline from storage implementation details, enabling the software to work with images from multiple sources:

  • Local filesystem - Traditional directory-based storage
  • AWS S3 - Cloud object storage
  • MinIO - Self-hosted S3-compatible storage
  • HTTP/HTTPS - Remote image fetching (planned)

Key Benefits

  1. Storage Independence: Core extraction logic doesn't know or care where images come from
  2. Transparent Caching: Remote images automatically cached locally via decorator pattern
  3. Configuration-Driven: Storage backend selected via TOML config, no code changes needed
  4. Performance: Direct filesystem access when available, efficient streaming when not
  5. Future-Proof: Easy to add new backends (Azure Blob, Google Cloud Storage, etc.)

Architecture Pattern

The implementation follows the Strategy pattern for pluggable backends, with the Decorator pattern providing optional caching:

┌─────────────────────────────────────────┐
│        Core Extraction Logic            │
│  (operates on ImageLocator interface)   │
└──────────────┬──────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────┐
│       CachingImageLocator               │
│    (optional transparent caching)       │
└──────────────┬──────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────┐
│         ImageLocator Backend             │
│  ┌────────────┬─────────────┬──────────┐ │
│  │   Local    │     S3      │  MinIO   │ │
│  └────────────┴─────────────┴──────────┘ │
└──────────────────────────────────────────┘

Components

ImageLocator Protocol (src/io_utils/locator.py)

Core interface defining storage operations:

from pathlib import Path
from typing import Iterator, Optional, Protocol

class ImageLocator(Protocol):
    def exists(self, identifier: str) -> bool:
        """Check if image exists"""

    def get_image(self, identifier: str) -> bytes:
        """Fetch image data"""

    def get_metadata(self, identifier: str) -> ImageMetadata:
        """Get image metadata (size, type, etc.)"""

    def list_images(self, prefix: Optional[str] = None) -> Iterator[str]:
        """List available images"""

    def get_local_path(self, identifier: str) -> Optional[Path]:
        """Get local path if available (optimization)"""

Backend Implementations

LocalFilesystemLocator (src/io_utils/locators/local.py)

Simplest backend for traditional directory-based storage:

locator = LocalFilesystemLocator(Path("/data/herbarium-images"))
image_data = locator.get_image("specimen_001.jpg")
# Reads from /data/herbarium-images/specimen_001.jpg

Features:

  • Direct filesystem access (no caching needed)
  • Recursive directory traversal
  • Standard image extension filtering
  • Fast metadata access via filesystem stats

S3ImageLocator (src/io_utils/locators/s3.py)

AWS S3 and S3-compatible storage backend:

locator = S3ImageLocator(
    bucket="my-herbarium-bucket",
    prefix="specimens/batch1/",
    region="us-east-1"
)
image_data = locator.get_image("IMG_001.jpg")
# Fetches s3://my-herbarium-bucket/specimens/batch1/IMG_001.jpg

Features:

  • Boto3-based S3 access
  • Optional AWS credentials (uses default chain if omitted)
  • Paginated listing for large buckets
  • Works with MinIO via custom endpoint configuration (see the sketch below)
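
Programmatic MinIO usage would look roughly like the following. This is a sketch: the endpoint_url, access_key, and secret_key parameter names are assumptions mirroring the [storage.minio] config keys and boto3's client arguments; check the S3ImageLocator signature in src/io_utils/locators/s3.py before relying on them.

from src.io_utils.locators.s3 import S3ImageLocator

# Hypothetical parameter names; verify against the actual constructor
locator = S3ImageLocator(
    bucket="herbarium",
    prefix="specimens/",
    endpoint_url="http://localhost:9000",  # MinIO server
    access_key="minioadmin",
    secret_key="minioadmin",
)
image_data = locator.get_image("IMG_001.jpg")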

CachingImageLocator Decorator (src/io_utils/caching.py)

Transparent pass-through caching wrapper:

from pathlib import Path

from src.io_utils.caching import CachingImageLocator
from src.io_utils.locators.s3 import S3ImageLocator

# Wrap any backend with caching
backend = S3ImageLocator(bucket="my-bucket")
cached = CachingImageLocator(
    backend,
    cache_dir=Path("/tmp/image-cache"),
    max_cache_size_mb=2000  # Optional size limit
)

# First access: cache miss, fetches from S3, saves to cache
data = cached.get_image("specimen_001.jpg")

# Second access: cache hit, returns from local filesystem (fast!)
data = cached.get_image("specimen_001.jpg")

Features:

  • SHA256-based cache keys (handles special chars, long names; see the sketch below)
  • LRU eviction when cache size limit exceeded
  • Cache statistics (get_cache_stats())
  • Manual cache management (clear_cache())
  • Transparent to caller - same ImageLocator interface
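
For illustration, a SHA256-based cache key can be derived from the identifier alone. This is a sketch of the general technique, not necessarily the exact scheme used in caching.py:

import hashlib
from pathlib import Path

def cache_key(identifier: str) -> str:
    # Hashing sidesteps slashes, unicode, and very long names in identifiers
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    # Keeping the original extension leaves cached files openable by image tools
    return f"{digest}{Path(identifier).suffix}"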

Factory Function (src/io_utils/locator_factory.py)

Configuration-driven instantiation:

from src.io_utils.locator_factory import create_image_locator

config = load_config(config_path)
locator = create_image_locator(config)
# Returns appropriate backend based on config
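
Internally the factory dispatches on the backend key. The following is a simplified sketch assuming the config layout shown under Configuration below; the real implementation in locator_factory.py also handles MinIO and additional options:

from pathlib import Path
from typing import Optional

from src.io_utils.caching import CachingImageLocator
from src.io_utils.locators.local import LocalFilesystemLocator
from src.io_utils.locators.s3 import S3ImageLocator

def create_image_locator(config: dict, input_path: Optional[Path] = None):
    storage = config.get("storage", {})
    backend = storage.get("backend", "local")

    if backend == "local":
        base_path = storage.get("base_path", input_path)
        if base_path is None:
            raise ValueError(
                "Local backend requires 'base_path' in config or input_path argument"
            )
        locator = LocalFilesystemLocator(Path(base_path))
    elif backend == "s3":
        s3_cfg = storage.get("s3", {})
        locator = S3ImageLocator(
            bucket=s3_cfg["bucket"],
            prefix=s3_cfg.get("prefix", ""),
            region=s3_cfg.get("region"),
        )
    else:
        raise ValueError(f"Unsupported storage backend: {backend}")

    # Optional transparent caching via the decorator
    if storage.get("cache_enabled"):
        locator = CachingImageLocator(locator, cache_dir=Path(storage["cache_dir"]))
    return locator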

Configuration

Example: Local Filesystem (Default)

# No [storage] section needed - uses --input directory

Example: S3 with Caching

[storage]
backend = "s3"
cache_enabled = true
cache_dir = "/tmp/herbarium-cache"
cache_max_size_mb = 2000

[storage.s3]
bucket = "my-herbarium-bucket"
prefix = "specimens/"
region = "us-east-1"
# AWS credentials optional (uses default chain)

Example: MinIO

[storage]
backend = "minio"
cache_enabled = true
cache_dir = "/tmp/cache"

[storage.minio]
endpoint = "http://localhost:9000"
bucket = "herbarium"
access_key = "minioadmin"
secret_key = "minioadmin"

Migration Guide

Phase 1: Core Abstractions (Completed ✅)

  • ✅ ImageLocator protocol defined
  • ✅ LocalFilesystemLocator implemented
  • ✅ S3ImageLocator implemented
  • ✅ CachingImageLocator decorator implemented
  • ✅ Factory function for config-based creation
  • ✅ Configuration support in default TOML
  • ✅ Comprehensive tests (18 passing)
  • ✅ Example configs for S3 with caching

Phase 2: CLI Integration (Future)

Current State: The CLI works with the local filesystem via the existing --input directory.

Future Enhancement: Optionally use ImageLocator when [storage] configured:

# In cli.py process_cli()
if "storage" in cfg:
    locator = create_image_locator(cfg)
    for identifier in iter_images_from_locator(locator):
        image_data = locator.get_image(identifier)
        # ... process image_data ...
else:
    # Legacy path: use --input directory
    for img_path in iter_images(input_dir):
        # ... process the file at img_path directly ...
        ...

Benefits of Deferred Integration:

  • No breaking changes to existing workflows
  • Architecture proven via tests and examples
  • CLI migration can happen gradually
  • Current local filesystem usage unaffected

Phase 3: Advanced Features (Future)

Potential enhancements:

  • HTTP/HTTPS backend for fetching images from web servers
  • Azure Blob Storage backend
  • Google Cloud Storage backend
  • Parallel download for remote backends
  • Cache warming - pre-download images before processing
  • Cache sharing - multiple runs share same cache
  • Compression - compress cached images to save disk space

Usage Examples

Basic Local Filesystem

from pathlib import Path

from src.io_utils.locators.local import LocalFilesystemLocator

locator = LocalFilesystemLocator(Path("/data/images"))
for identifier in locator.list_images():
    image_data = locator.get_image(identifier)
    # Process image_data...

S3 with Automatic Caching

from src.io_utils.locator_factory import create_image_locator

config = {
    "storage": {
        "backend": "s3",
        "cache_enabled": True,
        "cache_dir": "/tmp/cache",
        "s3": {
            "bucket": "my-bucket",
            "prefix": "images/"
        }
    }
}

locator = create_image_locator(config)
for identifier in locator.list_images():
    # First iteration: downloads from S3, caches locally
    # Subsequent iterations: reads from cache
    image_data = locator.get_image(identifier)

Direct S3 Access (No Caching)

from src.io_utils.locators.s3 import S3ImageLocator

locator = S3ImageLocator(
    bucket="my-bucket",
    prefix="specimens/"
)

# Always fetches from S3 (no caching)
image_data = locator.get_image("IMG_001.jpg")

Custom Cache Management

from src.io_utils.caching import CachingImageLocator

locator = CachingImageLocator(backend, cache_dir)  # backend and cache_dir as in the earlier S3 example

# Check cache statistics
stats = locator.get_cache_stats()
print(f"Cached files: {stats['num_files']}")
print(f"Cache size: {stats['total_size_mb']:.2f} MB")

# Clear cache if needed
locator.clear_cache()

Performance Characteristics

LocalFilesystemLocator

  • Listing: O(n) directory traversal, filesystem speed
  • Fetch: Direct file read, no overhead
  • Metadata: Filesystem stat() call, very fast

S3ImageLocator

  • Listing: Paginated API calls, ~100ms per 1000 keys
  • Fetch: Network latency + transfer time (~100-500ms per image)
  • Metadata: HEAD request, ~50-100ms

CachingImageLocator

  • Cache Hit: Same as LocalFilesystemLocator (filesystem speed)
  • Cache Miss: Backend speed + cache write overhead (~10-20ms)
  • Eviction: O(n log n) for LRU sorting when limit exceeded
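
These figures are rough estimates and depend heavily on network and disk. A quick way to check hit/miss latency in your own environment, assuming a caching locator configured as in the earlier examples (the variable name cached is reused from that example):

import time

def timed_get(locator, identifier: str) -> float:
    start = time.perf_counter()
    locator.get_image(identifier)
    return time.perf_counter() - start

# `cached` is the CachingImageLocator from the earlier example
first = timed_get(cached, "specimen_001.jpg")   # expected cache miss: backend fetch + cache write
second = timed_get(cached, "specimen_001.jpg")  # expected cache hit: local filesystem read
print(f"miss: {first * 1000:.1f} ms, hit: {second * 1000:.1f} ms")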

Testing

Comprehensive test suite in tests/unit/test_locators.py:

# Run all storage abstraction tests
uv run pytest tests/unit/test_locators.py -v

# Test specific component
uv run pytest tests/unit/test_locators.py::TestCachingImageLocator -v

Test Coverage:

  • ✅ LocalFilesystemLocator (11 tests)
  • ✅ CachingImageLocator (7 tests)
  • ✅ All edge cases (missing files, invalid paths, cache eviction)
  • ⏳ S3ImageLocator (requires AWS credentials or moto mocking; see the sketch below)
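
For the pending S3ImageLocator coverage, a moto-based test avoids real credentials. A sketch, assuming moto >= 5 (which provides mock_aws) and the constructor arguments shown earlier:

import boto3
from moto import mock_aws

from src.io_utils.locators.s3 import S3ImageLocator

@mock_aws
def test_s3_locator_fetches_object():
    # Arrange: create an in-memory bucket and object
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="test-bucket")
    s3.put_object(Bucket="test-bucket", Key="specimens/IMG_001.jpg", Body=b"fake-jpeg-bytes")

    # Act / Assert: identifiers resolve relative to the configured prefix
    locator = S3ImageLocator(bucket="test-bucket", prefix="specimens/", region="us-east-1")
    assert locator.exists("IMG_001.jpg")
    assert locator.get_image("IMG_001.jpg") == b"fake-jpeg-bytes"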

Design Principles

  1. Protocol over ABC: Use Protocol for duck typing, not abstract base classes
  2. Decorator Pattern: Caching is a wrapper, not baked into backends
  3. Fail Fast: Invalid config raises ValueError at startup, not during processing
  4. Lazy Import: Backend dependencies (boto3) only imported when needed (see the sketch after this list)
  5. Explicit Over Implicit: Configuration is explicit, no magic defaults
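
Principles 3 and 4 typically show up together in a backend constructor. A sketch of the pattern, reusing the boto3 error message from the Troubleshooting section; the surrounding details are illustrative, not the exact source:

from typing import Optional

class S3ImageLocator:
    def __init__(self, bucket: str, prefix: str = "", region: Optional[str] = None):
        try:
            import boto3  # lazy: local-only installs never need this dependency
        except ImportError as exc:  # fail fast, at construction time
            raise ImportError("S3 backend requires boto3: pip install boto3") from exc
        self._client = boto3.client("s3", region_name=region)
        self._bucket = bucket
        self._prefix = prefix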

Troubleshooting

"Invalid or missing configuration"

ValueError: Local backend requires 'base_path' in config or input_path argument

Fix: Provide either storage.base_path in config or input_path argument to factory.
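
For example, if backend = "local" is selected without a base_path, the directory can be supplied programmatically. The input_path keyword is taken from the error message above; confirm it against locator_factory.py:

from pathlib import Path
from src.io_utils.locator_factory import create_image_locator

config = {"storage": {"backend": "local"}}  # no base_path configured
locator = create_image_locator(config, input_path=Path("/data/herbarium-images"))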

"S3 backend requires boto3"

ImportError: S3 backend requires boto3: pip install boto3

Fix: Install boto3:

uv pip install boto3

"Access denied to S3 object"

PermissionError: Access denied to S3 object: specimen_001.jpg

Fix: Check AWS credentials and S3 bucket permissions.

Cache eviction too aggressive

Symptom: Cache constantly evicting files even with max_cache_size_mb=2000.

Fix: Increase cache size limit or check disk space:

df -h /tmp  # Check available space

References

  • Design Document: ~/Desktop/20251004160850-0600-storage-abstraction-architecture.md
  • Issue Discussion: Storage abstraction requirements and architecture
  • Example Config: config/config.s3-cached.toml
  • Tests: tests/unit/test_locators.py

Contributing

To add a new storage backend:

  1. Create backend class implementing ImageLocator protocol
  2. Add backend to locator_factory.py factory function
  3. Update config/config.default.toml with backend configuration
  4. Add tests to tests/unit/test_locators.py
  5. Update this documentation

Example stub for HTTP backend:

# src/io_utils/locators/http.py
class HTTPImageLocator:
    def __init__(self, base_url: str):
        self.base_url = base_url

    def exists(self, identifier: str) -> bool:
        # HEAD request to check existence
        ...

    def get_image(self, identifier: str) -> bytes:
        # GET request to fetch image
        ...

    # ... implement remaining methods
