Extraction Run Analysis: run_20250930_181456

Summary

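Analysis of full_dataset_processing/run_20250930_181456 revealed three compounding issues: every specimen was processed twice, every extraction failed because the OpenAI API key was missing, and the pipeline had no deduplication. The result was 5,770 entries in raw.jsonl for only 2,885 unique specimens, none of them usable.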

Findings

1. Duplicate Processing

Total extractions: 5,770
Unique specimens: 2,885
Ratio: Exactly 2.0 (every specimen processed twice)
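The counts above can be recomputed from raw.jsonl. The following is a minimal sketch, assuming each line is a JSON object that identifies its specimen under a key named `image`. That key name and the `duplicate_stats` helper are assumptions for illustration, not part of the pipeline. The histogram distinguishes "every specimen processed exactly twice" from other distributions that would also average 2.0.

```python
import json
from collections import Counter


def duplicate_stats(lines, id_key="image"):
    """Summarize how often each specimen appears in an iterable of JSONL lines.

    `id_key` is a hypothetical field name; adjust it to the real schema.
    """
    per_specimen = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        per_specimen[record[id_key]] += 1

    total = sum(per_specimen.values())
    unique = len(per_specimen)
    return {
        "total": total,
        "unique": unique,
        "ratio": total / unique if unique else 0.0,
        # Maps "times processed" -> "number of specimens with that count".
        "histogram": dict(Counter(per_specimen.values())),
    }
```

To run it against a file, pass the open file handle, for example `duplicate_stats(open(path, encoding="utf-8"))`. A histogram with a single key of 2 would confirm that each specimen was processed exactly twice.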

2. All Extractions Failed

Error: Missing OpenAI API key

Every single extraction in this run failed with:

{
  "errors": [
    "The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable"
  ],
  "dwc": {}
}

Impact: 5,770 failed extraction attempts, consuming processing time with zero usable results.
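A failure count like this can be checked mechanically. The sketch below assumes only the record shape shown above (an `errors` list and a `dwc` object); the helper names are hypothetical.

```python
import json


def is_failed(record):
    """Treat a record as failed if it reports errors or has no DwC fields."""
    return bool(record.get("errors")) or not record.get("dwc")


def failure_summary(lines):
    """Return (failed, total, distinct error messages) for JSONL lines."""
    failed = 0
    total = 0
    messages = set()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        total += 1
        if is_failed(record):
            failed += 1
            messages.update(record.get("errors") or [])
    return failed, total, messages
```

A single distinct message across all failed records, as in this run, points to a configuration problem rather than per-image failures.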

3. No Deduplication

The extraction pipeline did not check whether (image, extraction_params) combinations had already been processed, allowing the same image to be extracted multiple times with identical parameters.
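One way to make "identical parameters" checkable is to hash a canonical serialization of the parameters and pair it with the image hash. This is a minimal sketch of that idea, not the pipeline's actual method: `params_hash`, `dedup_key`, and `unique_jobs` are hypothetical helpers, and it assumes the parameters are JSON-serializable.

```python
import hashlib
import json


def params_hash(extraction_params):
    """Hash the parameters so that key order does not affect the result."""
    canonical = json.dumps(extraction_params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedup_key(image_sha256, extraction_params):
    """Build the (image, params) key used to detect repeated work."""
    return (image_sha256, params_hash(extraction_params))


def unique_jobs(jobs):
    """Yield each (image_sha256, params) job once, dropping exact repeats."""
    seen = set()
    for image_sha256, params in jobs:
        key = dedup_key(image_sha256, params)
        if key in seen:
            continue
        seen.add(key)
        yield image_sha256, params
```

Filtering a queue through `unique_jobs` before dispatch would have halved a run in which every image was queued twice with the same parameters.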

Root Causes

Missing environment variable: OPENAI_API_KEY not set during extraction run
No extraction-level deduplication: System didn't prevent re-processing identical (image, params) combinations
Unknown trigger for duplicate processing: Unclear why each image was queued twice

Recommendations

Implemented Solutions

Specimen Index (src/provenance/specimen_index.py)
    Tracks specimens through transformations and extraction runs
    Deduplication at (image_sha256, params_hash) level
    Prevents redundant extraction of identical combinations

Migration Tool (scripts/migrate_to_specimen_index.py)
    Analyzes existing runs to identify duplicates
    Populates specimen index from historical data
    Flags data quality issues

Architecture Documentation (docs/specimen_provenance_architecture.md)
    Specimen-centric data model
    Full provenance tracking: original files → transformations → extractions → review
    Data quality checks for catalog number violations
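To illustrate how deduplication at the (image_sha256, params_hash) level can be persisted, here is a minimal SQLite sketch. It is a hypothetical stand-in, not the contents of src/provenance/specimen_index.py: the class name `SketchIndex`, the table layout, and the `record_extraction` method are all assumptions, and `params_hash` is taken as a precomputed string.

```python
import sqlite3


class SketchIndex:
    """Hypothetical illustration only; the real SpecimenIndex may differ."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            """
            CREATE TABLE IF NOT EXISTS extractions (
                extraction_id INTEGER PRIMARY KEY,
                image_sha256  TEXT NOT NULL,
                params_hash   TEXT NOT NULL,
                UNIQUE (image_sha256, params_hash)
            )
            """
        )

    def should_extract(self, image_sha256, params_hash):
        """Return (True, None) if the pair is new, else (False, existing id)."""
        row = self.conn.execute(
            "SELECT extraction_id FROM extractions"
            " WHERE image_sha256 = ? AND params_hash = ?",
            (image_sha256, params_hash),
        ).fetchone()
        if row is None:
            return True, None
        return False, row[0]

    def record_extraction(self, image_sha256, params_hash):
        """Mark the pair as done; the UNIQUE constraint ignores repeats."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR IGNORE INTO extractions (image_sha256, params_hash)"
                " VALUES (?, ?)",
                (image_sha256, params_hash),
            )
```

One design point worth considering under this scheme: if failed attempts are recorded the same way as successful ones, a run like run_20250930_181456 would block later retries, so it may be preferable to record a pair only after a successful extraction.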

Usage

Check before extraction:

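import logging

from src.provenance.specimen_index import SpecimenIndex

logger = logging.getLogger(__name__)

index = SpecimenIndex("specimen_index.db")

extraction_params = {
    "ocr_engine": "vision",
    "model": "gpt-4o-mini",
    "prompt_version": "v2.1",
}

# SHA-256 digests of the images queued for this run
image_hashes = ["000e426d..."]

for image_sha256 in image_hashes:
    # Before processing an image, ask the index whether this
    # (image, params) combination has already been extracted.
    should_extract, existing_id = index.should_extract(
        image_sha256=image_sha256,
        extraction_params=extraction_params,
    )

    if not should_extract:
        logger.info("Skipping: already extracted (%s)", existing_id)
        continue

    # Proceed with extraction...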

Analyze existing runs:

python scripts/migrate_to_specimen_index.py \
    --run-dir full_dataset_processing/run_20250930_181456 \
    --index specimen_index.db \
    --analyze-duplicates \
    --check-quality

Benefits

Efficiency: Eliminate redundant extraction attempts
Cost savings: Avoid duplicate API calls
Data quality: Automatic detection of catalog number violations
Provenance: Full lineage from camera files to final DwC records
Aggregation: Multiple extraction attempts per specimen contribute to better candidate fields

Future Work

Original filename mapping: Link content-addressed images back to camera files (DSC_*.JPG)
Transformation tracking: Record preprocessing operations in specimen index
Review integration: Update review UI to show all extraction attempts and data quality flags
Quality gates: Prevent extraction runs from starting without required API keys
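The quality-gate item could take the form of a preflight check that runs before any image is queued. This is a minimal sketch under that assumption; `missing_env_vars` and `preflight` are hypothetical names, and the only variable checked by default is the `OPENAI_API_KEY` named in the error message above.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


def preflight(required=("OPENAI_API_KEY",), environ=None):
    """Abort before the run starts if any required variable is missing."""
    missing = missing_env_vars(required, environ)
    if missing:
        raise SystemExit(
            "Refusing to start extraction: missing environment variables: "
            + ", ".join(missing)
        )
```

Calling `preflight()` at the top of the extraction entry point would fail once, immediately, instead of producing one failed record per queued image.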

Architecture: docs/specimen_provenance_architecture.md
Implementation: src/provenance/specimen_index.py
Migration: scripts/migrate_to_specimen_index.py
Monitor TUI: scripts/monitor_tui.py
