Mapping and vocabulary
During the mapping phase, OCR output is normalised before it is loaded into the
primary DwC+ABCD database. Field aliases are resolved using
dwc_rules.toml, while controlled vocabulary
values such as basisOfRecord and typeStatus are defined in
vocab.toml.
See the configuration README for an overview of all available rule files.
Field mapping example
Create a custom alias for barcode by adding a [dwc.custom] section to the
configuration:
With this alias configured, the mapping functions copy the barcode value into catalogNumber:
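from dwc import configure_mappings, map_custom_schema, map_ocr_to_dwc

# Register the custom alias: barcode -> catalogNumber.
configure_mappings({"barcode": "catalogNumber"})
# Both mapping functions apply the alias registered above.
record = map_ocr_to_dwc({"barcode": "ABC123"})
custom = map_custom_schema({"barcode": "XYZ"})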
Both record.catalogNumber and custom.catalogNumber are populated from the
barcode field. See issue #156 for background on configuration-based schema mapping.
The default rules already map common labels such as collector number to
recordNumber via dwc_rules.toml.
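To make the interplay between default rules and custom aliases concrete, here is a minimal, self-contained sketch of alias resolution. It does not use the dwc module: resolve_fields and DEFAULT_ALIASES are hypothetical names, the single default entry stands in for the contents of dwc_rules.toml, and the case-insensitive matching and "custom overrides default" precedence are assumptions rather than documented behaviour.

```python
from typing import Optional

# Hypothetical default rules, standing in for entries in dwc_rules.toml.
DEFAULT_ALIASES = {"collector number": "recordNumber"}


def resolve_fields(
    raw: dict[str, str], custom: Optional[dict[str, str]] = None
) -> dict[str, str]:
    """Rename raw label keys to Darwin Core terms; custom aliases override defaults."""
    aliases = {**DEFAULT_ALIASES, **(custom or {})}
    resolved = {}
    for label, value in raw.items():
        key = label.strip().lower()  # assumed: labels match case-insensitively
        # Unknown labels keep their original name so that no data is dropped.
        resolved[aliases.get(key, label)] = value
    return resolved


record = resolve_fields(
    {"Collector Number": "1234", "barcode": "ABC123"},
    custom={"barcode": "catalogNumber"},
)
print(record)  # {'recordNumber': '1234', 'catalogNumber': 'ABC123'}
```

Passing unknown labels through unchanged is one possible design choice; a stricter pipeline might instead collect them for review.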
Future work
Additional mapping rules will be added to config/rules/dwc_rules.toml and config/rules/vocab.toml (issue #157).
Vocabulary normalisation example
Controlled terms such as basisOfRecord are harmonised via
vocab.toml:
This call returns "PreservedSpecimen".
Passing "field note" instead normalises the value to "HumanObservation".
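The lookup behind this kind of normalisation can be sketched as a synonym table plus a small helper. Everything named here is hypothetical: normalise_basis_of_record and BASIS_OF_RECORD are not the project's API, the table stands in for a section of vocab.toml, and only the "field note" entry comes from the text above ("herbarium sheet" is an illustrative assumption, as is the case-insensitive matching).

```python
# Hypothetical synonym table standing in for a basisOfRecord section of vocab.toml.
# Only the "field note" entry is taken from the surrounding text; the
# "herbarium sheet" entry is an illustrative assumption.
BASIS_OF_RECORD = {
    "herbarium sheet": "PreservedSpecimen",
    "field note": "HumanObservation",
}


def normalise_basis_of_record(value: str) -> str:
    """Map a free-text value onto a controlled term, ignoring case and extra spaces."""
    key = " ".join(value.split()).lower()
    # Values without a known synonym pass through unchanged.
    return BASIS_OF_RECORD.get(key, value)


print(normalise_basis_of_record("Field  Note"))      # HumanObservation
print(normalise_basis_of_record("herbarium sheet"))  # PreservedSpecimen
```

Returning unrecognised values unchanged keeps the sketch lossless; whether the real normaliser does this, or flags such values, is not stated in this section.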