Schema and Mapping Improvements
This document describes the enhancements made to address issues #188 "Parse official DwC and ABCD schemas" and #189 "Auto-generate Darwin Core term mappings".
Overview
The schema and mapping system has been significantly enhanced to provide:
- Official Schema Parsing: Direct fetching and parsing of Darwin Core and ABCD schemas from their canonical TDWG sources
- Automatic Mapping Generation: Dynamic generation of field mappings based on parsed schemas
- Enhanced Validation: Validation against official schema definitions with compatibility checking
- Dynamic Configuration: Runtime configuration of mapping rules and schema handling
- Improved Compatibility: Version compatibility checking between different schema versions
Key Components
1. Enhanced Schema Module (dwc/schema.py)
New Features:
- Official Schema URLs: Canonical sources for DwC and ABCD schemas
- Schema Fetching: Network-based retrieval of schemas with caching
- Detailed Parsing: Extraction of term metadata including descriptions and data types
- Version Compatibility: Cross-schema compatibility validation
Key Functions:
# Fetch official schemas from TDWG sources
schemas = fetch_official_schemas(use_cache=True)
# Load terms from official sources
terms = load_schema_terms_from_official_sources(['dwc_simple', 'abcd_206'])
# Configure terms from official sources
configure_terms_from_official_sources(['dwc_simple'])
# Validate term compatibility
compatibility = validate_schema_compatibility(terms, target_schemas)
Official Schema Sources:
- Darwin Core: http://rs.tdwg.org/dwc/xsd/tdwg_dwc_simple.xsd
- ABCD 2.06: https://abcd.tdwg.org/xml/ABCD_2.06.xsd
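To give a feel for what "Detailed Parsing" involves, the sketch below pulls element names, types, and descriptions out of an XSD document using only the standard library. The function name `extract_terms` and the inline sample schema are hypothetical stand-ins; the actual parser in dwc/schema.py may be structured differently.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Tiny stand-in for a downloaded schema; this is NOT the real tdwg_dwc_simple.xsd.
SAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="catalogNumber" type="xs:string">
    <xs:annotation>
      <xs:documentation>An identifier for the record.</xs:documentation>
    </xs:annotation>
  </xs:element>
  <xs:element name="decimalLatitude" type="xs:decimal"/>
</xs:schema>
"""


def extract_terms(xsd_text: str) -> dict:
    """Return {term name: {"type": ..., "description": ...}} for named elements."""
    root = ET.fromstring(xsd_text.strip())
    terms = {}
    for element in root.iter(f"{XS}element"):
        name = element.get("name")
        if name is None:  # skip references such as <xs:element ref="..."/>
            continue
        doc = element.find(f"{XS}annotation/{XS}documentation")
        description = (doc.text or "").strip() if doc is not None else ""
        terms[name] = {"type": element.get("type", ""), "description": description}
    return terms


terms = extract_terms(SAMPLE_XSD)
```

Elements that only carry a `ref` attribute are skipped here; resolving them would require loading the schema they point to.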
2. Enhanced Mapper Module (dwc/mapper.py)
New Features:
- Dynamic Mappings: Runtime-generated mappings based on official schemas
- Fuzzy Matching: Similarity-based field matching with configurable thresholds
- Automatic Generation: Schema-driven mapping rule generation
- Validation Integration: Built-in validation against target schemas
Key Functions:
# Generate automatic mappings
mappings = auto_generate_mappings_from_schemas(
    schema_names=['dwc_simple'],
    include_fuzzy=True,
    similarity_threshold=0.6
)
# Configure dynamic mappings
configure_dynamic_mappings(['dwc_simple'], include_fuzzy=True)
# Validate mapped records
validation = validate_mapping_against_schemas(record, ['dwc_simple'])
# Get mapping suggestions
suggestions = suggest_mapping_improvements(unmapped_fields)
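The fuzzy matching with a configurable threshold mentioned above can be approximated with the standard library's `difflib`. This is a minimal sketch, assuming a plain character-similarity ratio; `fuzzy_suggest` and `KNOWN_TERMS` are hypothetical names, and the similarity measure actually used by dwc/mapper.py is not specified in this document.

```python
from difflib import SequenceMatcher

# Hypothetical subset of Darwin Core terms, for illustration only.
KNOWN_TERMS = ["catalogNumber", "scientificName", "recordedBy", "decimalLatitude"]


def fuzzy_suggest(field, terms=KNOWN_TERMS, similarity_threshold=0.6):
    """Return (term, score) pairs at or above the threshold, best match first."""
    scored = [
        (term, SequenceMatcher(None, field.lower(), term.lower()).ratio())
        for term in terms
    ]
    matches = [(term, score) for term, score in scored if score >= similarity_threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


suggestions = fuzzy_suggest("catalog_num")
```

Raising `similarity_threshold` trades recall for precision: fewer unmapped fields receive a suggestion, but the suggestions that remain are closer matches.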
3. Schema Manager (dwc/schema_manager.py)
A comprehensive management system for schema handling:
Features:
- Centralized Management: Single interface for all schema operations
- Caching System: Local caching of downloaded schemas with update intervals
- Status Monitoring: Detailed status and health reporting
- Compatibility Analysis: Cross-schema compatibility reporting
Usage Example:
from pathlib import Path

from dwc import SchemaManager

# Initialize manager
manager = SchemaManager(
    cache_dir=Path("cache"),
    update_interval_days=30,
    preferred_schemas=["dwc_simple", "abcd_206"]
)
# Get schemas and generate mappings
schemas = manager.get_schemas()
mappings = manager.generate_mappings()
suggestions = manager.suggest_mappings(unmapped_fields)
# Compatibility analysis
report = manager.get_schema_compatibility_report("dwc_simple", ["abcd_206"])
Configuration Enhancements
Updated Configuration (config/config.default.toml)
New schema-related configuration options:
[dwc]
# Schema source configuration
use_official_schemas = false # Enable official schema fetching
preferred_official_schemas = ["dwc_simple", "abcd_206"]
schema_cache_enabled = true
schema_update_interval_days = 30
schema_compatibility_check = true
# Existing configuration...
schema = "dwc-abcd"
schema_uri = "http://rs.tdwg.org/dwc/terms/"
schema_files = ["dwc.xsd", "abcd.xsd"]
Mapping Improvements
1. Dynamic Mapping Generation
The system now automatically generates mappings based on:
- Case variations: catalognumber → catalogNumber
- Format variations: scientific_name → scientificName
- Common aliases: lat → decimalLatitude, collector → recordedBy
- Fuzzy matching: Similarity-based suggestions for unmapped fields
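The case, format, and alias variations listed above can all be handled by comparing names in a normalized form. The sketch below is one possible approach, not the project's implementation; `normalize`, `build_variant_index`, `map_field`, and the `ALIASES` table are hypothetical, with the two alias entries taken from the examples above.

```python
import re
from typing import Dict, List, Optional

# Hypothetical alias table; the two entries come from the examples above.
ALIASES = {"lat": "decimalLatitude", "collector": "recordedBy"}


def normalize(name: str) -> str:
    """Lowercase a name and drop separators so spelling variants compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_variant_index(terms: List[str]) -> Dict[str, str]:
    """Map each normalized spelling (and each alias) to its canonical term."""
    index = {normalize(term): term for term in terms}
    index.update({normalize(alias): target for alias, target in ALIASES.items()})
    return index


def map_field(field: str, index: Dict[str, str]) -> Optional[str]:
    """Return the canonical term for a source field, or None if unmapped."""
    return index.get(normalize(field))


index = build_variant_index(
    ["catalogNumber", "scientificName", "decimalLatitude", "recordedBy"]
)
```

Fields that return `None` here are the natural input for the fuzzy matching step.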
2. Enhanced Field Resolution
Mapping priority order:
1. Static rules (from config/rules/dwc_rules.toml)
2. Dynamic mappings (generated from schemas)
3. Custom mappings (from configuration)
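The priority order amounts to a first-match lookup across three layers. A minimal sketch, assuming each layer is a plain dictionary; the layer contents and the `resolve_field` helper are hypothetical:

```python
from typing import Mapping, Optional, Sequence


def resolve_field(field: str, layers: Sequence[Mapping[str, str]]) -> Optional[str]:
    """Return the target term from the first layer that knows the field."""
    for layer in layers:
        if field in layer:
            return layer[field]
    return None


# Hypothetical contents for the three layers, highest priority first.
static_rules = {"barcode": "catalogNumber"}
dynamic_mappings = {"barcode": "otherCatalogNumbers", "lat": "decimalLatitude"}
custom_mappings = {"lat": "verbatimLatitude", "coll": "recordedBy"}
layers = [static_rules, dynamic_mappings, custom_mappings]

resolved = {name: resolve_field(name, layers) for name in ("barcode", "lat", "coll")}
```

`collections.ChainMap(static_rules, dynamic_mappings, custom_mappings)` gives the same lookup behavior if a single mapping object is more convenient.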
3. Validation and Quality Control
- Schema compliance: Validate fields against official schema definitions
- Compatibility scoring: Quantitative compatibility assessment
- Error flagging: Automatic flagging of invalid or deprecated fields
- Suggestion system: Intelligent suggestions for unmapped fields
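As an illustration of schema compliance checking and compatibility scoring, the sketch below classifies a record's fields against a set of schema terms and reports the share that are valid. Defining the score as a simple fraction is an assumption made for this example; the document does not state how the project computes it, and `check_record` is a hypothetical helper.

```python
def check_record(record: dict, schema_terms: set) -> dict:
    """Split a record's fields into valid and invalid and score the result."""
    valid = sorted(field for field in record if field in schema_terms)
    invalid = sorted(field for field in record if field not in schema_terms)
    # Assumed scoring rule: fraction of fields that are known schema terms.
    score = len(valid) / len(record) if record else 1.0
    return {"valid": valid, "invalid": invalid, "score": score}


# Hypothetical record and term set for illustration.
schema_terms = {"catalogNumber", "scientificName", "recordedBy"}
record = {
    "catalogNumber": "A-001",
    "scientificName": "Poa annua",
    "recordedBy": "J. Smith",
    "colour": "green",
}
result = check_record(record, schema_terms)
```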
API Reference
New Imports Available:
from dwc import (
    SchemaManager,  # Schema management
    configure_terms_from_official_sources,  # Official schema configuration
    configure_dynamic_mappings,  # Dynamic mapping setup
    auto_generate_mappings_from_schemas,  # Mapping generation
    validate_mapping_against_schemas,  # Record validation
    validate_schema_compatibility,  # Term compatibility checking
    suggest_mapping_improvements,  # Mapping suggestions
    fetch_official_schemas,  # Schema fetching
)
Examples and Testing
Demo Script
Run the comprehensive demo to see all features in action:
Test Coverage
New test suites cover:
- Schema manager functionality (tests/unit/test_schema_manager.py)
- Enhanced mapping integration (tests/integration/test_enhanced_mapping.py)
Performance Considerations
Caching Strategy
- Local caching: Downloaded schemas cached with configurable update intervals
- Lazy loading: Schemas fetched only when needed
- Memory efficiency: Schemas cached in memory during session
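A file-based cache with an update interval could look roughly like the following. This is a sketch under stated assumptions, not the project's code: `get_cached` and `download` are hypothetical names, staleness is judged by file modification time, and the fetcher is injectable so the logic can be exercised without network access.

```python
import time
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def download(url: str, timeout: float = 10.0) -> bytes:
    """Default fetcher: a plain HTTP GET with a timeout."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def get_cached(
    url: str,
    cache_dir: Path,
    update_interval_days: int = 30,
    fetcher: Callable[[str], bytes] = download,
) -> bytes:
    """Return schema bytes, refetching only when the cached copy is stale."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / url.rsplit("/", 1)[-1]
    max_age_seconds = update_interval_days * 86400
    if cache_file.exists():
        age_seconds = time.time() - cache_file.stat().st_mtime
        if age_seconds < max_age_seconds:
            return cache_file.read_bytes()  # fresh enough: no network call
    data = fetcher(url)  # lazy: the fetch happens only on a miss or stale copy
    cache_file.write_bytes(data)
    return data
```

Passing `update_interval_days=0` forces a refetch on every call, which can be useful when testing against a schema that has just changed.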
Network Resilience
- Fallback behavior: Falls back to local schemas if official sources unavailable
- Timeout handling: Configurable timeouts for schema fetching
- Error recovery: Graceful degradation when official schemas can't be accessed
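The fallback behavior described above can be sketched as a try/except around the network fetch. The helper `load_with_fallback` and the logger name `dwc.schema` are assumptions for illustration; the project's actual error handling may differ.

```python
import logging
from pathlib import Path
from typing import Callable, Tuple

logger = logging.getLogger("dwc.schema")  # hypothetical logger name


def load_with_fallback(
    url: str, local_copy: Path, fetcher: Callable[[str], bytes]
) -> Tuple[bytes, str]:
    """Try the official source first; on a network error use the local file."""
    try:
        return fetcher(url), "official"
    except OSError as exc:
        # urllib's URLError and socket timeouts are both OSError subclasses.
        logger.warning("Could not fetch %s (%s); using %s", url, exc, local_copy)
        return local_copy.read_bytes(), "local"
```

Returning the source label alongside the data lets callers report whether a run used the official schema or the bundled copy.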
Migration Guide
For Existing Codebases
- No breaking changes: All existing functionality preserved
- Optional features: Enhanced features are opt-in via configuration
- Backward compatibility: Existing mapping rules continue to work
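To Enable Enhanced Features
- Update configuration: for example, set use_official_schemas = true in the [dwc] section of the configuration file.
- Use SchemaManager for advanced features such as caching, status monitoring, and compatibility reports.
- Leverage automatic mappings, for example by calling configure_dynamic_mappings or auto_generate_mappings_from_schemas.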
Future Enhancements
Planned Improvements
- ABCD 3.0 support: Integration with upcoming ABCD 3.0 specification
- Custom schema support: User-defined schema integration
- Machine learning: ML-based mapping suggestion improvements
- Real-time validation: Live validation during data entry
Community Integration
- GBIF integration: Enhanced GBIF backbone validation
- iDigBio compatibility: Support for iDigBio data standards
- Community schemas: Support for community-specific extensions
Troubleshooting
Common Issues
- Network connectivity: Official schemas require internet access
  - Solution: Enable caching and configure appropriate timeouts
- Schema parsing errors: Malformed or inaccessible schemas
  - Solution: System falls back to local schemas automatically
- Mapping conflicts: Multiple mappings for the same field
  - Solution: Clear priority order ensures consistent behavior
Debug Information
Enable detailed logging to troubleshoot:
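One way to do this with the standard `logging` module is sketched below. The logger name `dwc` is an assumption based on the package name used in the imports above; adjust it if the project logs under a different name.

```python
import logging

# Send DEBUG and above to the console with timestamps and logger names.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# "dwc" is an assumed logger name, following the package name.
dwc_logger = logging.getLogger("dwc")
dwc_logger.setLevel(logging.DEBUG)
```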
Conclusion
These enhancements significantly improve the schema and mapping capabilities of the herbarium DwC extraction system. The implementation provides:
- Standards compliance: Direct integration with official TDWG specifications
- Automation: Reduced manual configuration requirements
- Flexibility: Configurable and extensible mapping system
- Quality assurance: Enhanced validation and compatibility checking
- Future-proofing: Foundation for ongoing standards evolution
The improvements address the core requirements of issues #188 and #189 while maintaining backward compatibility and providing a path for future enhancements.