Mapping and vocabulary

During the mapping phase, OCR output is normalised before it is loaded into the primary DwC+ABCD database. Field aliases are resolved using dwc_rules.toml, while the controlled-vocabulary values for terms such as basisOfRecord and typeStatus are defined in vocab.toml.

See the configuration README for an overview of all available rule files.

Field mapping example

Create a custom alias for barcode by adding a [dwc.custom] section to the configuration:

[dwc.custom]
barcode = "catalogNumber"

With this alias configured, both mapping functions copy barcode values into catalogNumber:

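from dwc import configure_mappings, map_custom_schema, map_ocr_to_dwc

# Register the barcode alias (same mapping as the [dwc.custom] entry)
configure_mappings({"barcode": "catalogNumber"})
# Both mapping functions resolve barcode to catalogNumber
record = map_ocr_to_dwc({"barcode": "ABC123"})
custom = map_custom_schema({"barcode": "XYZ"})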

Both record.catalogNumber and custom.catalogNumber are populated from the barcode field ("ABC123" and "XYZ" respectively). See issue #156 for background on configuration-based schema mapping.

The default rules already map common labels such as collector number to recordNumber via dwc_rules.toml.
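
One way default and custom rules might be layered is sketched below. The names DEFAULT_RULES, CUSTOM_RULES, normalise_label, and resolve_field are hypothetical, and two behaviours are assumptions rather than documented facts: labels are matched case-insensitively with whitespace collapsed, and custom rules take precedence over defaults.

```python
from typing import Optional

# Illustrative rule tables; the real entries live in dwc_rules.toml and [dwc.custom].
DEFAULT_RULES = {"collector number": "recordNumber"}
CUSTOM_RULES = {"barcode": "catalogNumber"}


def normalise_label(label: str) -> str:
    """Lower-case the label and collapse runs of whitespace (assumed matching rule)."""
    return " ".join(label.lower().split())


def resolve_field(label: str) -> Optional[str]:
    """Return the Darwin Core term for a label, or None if no rule matches.

    Assumption: custom rules override defaults when both define the same label.
    """
    merged = {**DEFAULT_RULES, **CUSTOM_RULES}
    return merged.get(normalise_label(label))


print(resolve_field("Collector  Number"))  # recordNumber
print(resolve_field("Barcode"))            # catalogNumber
print(resolve_field("habitat"))            # None
```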

Future work

Additional mapping rules will be added to config/rules/dwc_rules.toml and config/rules/vocab.toml (tracked in issue #157).

Vocabulary normalisation example

Controlled terms such as basisOfRecord are harmonised via vocab.toml:

from dwc import normalize_vocab

normalize_vocab("herbarium sheet", "basisOfRecord")

This call returns "PreservedSpecimen".

Passing "field note" instead normalises the value to "HumanObservation".
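
The lookup behind this kind of normalisation can be sketched as a per-term synonym table. This is an illustration only, not the package's code: VOCAB and normalize_vocab_sketch are hypothetical names, the table holds just the two mappings described above, and returning unknown values unchanged is an assumption.

```python
# Hypothetical in-memory form of vocab.toml: {term: {synonym: canonical value}}.
VOCAB = {
    "basisOfRecord": {
        "herbarium sheet": "PreservedSpecimen",
        "field note": "HumanObservation",
    },
}


def normalize_vocab_sketch(value: str, term: str) -> str:
    """Map a free-text value to its controlled form for the given term.

    Assumptions: matching ignores case and extra whitespace, and values
    without a known synonym are returned unchanged.
    """
    synonyms = VOCAB.get(term, {})
    key = " ".join(value.lower().split())
    return synonyms.get(key, value)


print(normalize_vocab_sketch("Herbarium Sheet", "basisOfRecord"))  # PreservedSpecimen
print(normalize_vocab_sketch("field note", "basisOfRecord"))       # HumanObservation
```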

[AAFC]: Agriculture and Agri-Food Canada
[GBIF]: Global Biodiversity Information Facility
[DwC]: Darwin Core
[OCR]: Optical Character Recognition
[API]: Application Programming Interface
[CSV]: Comma-Separated Values
[IPT]: Integrated Publishing Toolkit
[TDWG]: Taxonomic Databases Working Group