Multilingual OCR engine

The engines.multilingual module wraps PaddleOCR to extract text from images in multiple languages. It is part of the OCR phase of the digitization pipeline and produces raw text and token confidences for downstream mapping.

Installation

pip install paddlepaddle paddleocr

Usage

from pathlib import Path
from engines import dispatch
import engines.multilingual  # noqa: F401 ensures engine registration

text, confidences = dispatch(
    "image_to_text",
    image=Path("specimen.jpg"),
    engine="multilingual",
    langs=["fr", "en"],
)

The engine accepts ISO 639-1 (two-letter) and ISO 639-2 (three-letter) codes. A mixed list such as ["eng", "fr", "la"] is normalized automatically before PaddleOCR is invoked, so the same configuration can drive both Tesseract and the multilingual engine without manual edits.
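To illustrate what this normalization might look like, here is a minimal sketch. It is not the module's actual implementation: the function name normalize_langs and the lookup table ISO_639_2_TO_1 are hypothetical, the table covers only a handful of codes, and the sketch assumes the target form is the two-letter ISO 639-1 code. The codes PaddleOCR itself expects may differ for some languages, so check its documentation before relying on this mapping.

```python
# Hypothetical lookup table from ISO 639-2 (three-letter) to ISO 639-1
# (two-letter) codes. Both the bibliographic and terminologic variants
# are listed where they differ (e.g. "fre" and "fra" for French).
ISO_639_2_TO_1 = {
    "eng": "en",
    "fra": "fr", "fre": "fr",
    "deu": "de", "ger": "de",
    "spa": "es",
    "rus": "ru",
    "ita": "it",
    "lat": "la",
}


def normalize_langs(langs):
    """Return lowercased two-letter codes, de-duplicated, in input order.

    Hypothetical helper: three-letter codes are translated through the
    table above; two-letter codes pass through unchanged.
    """
    normalized = []
    for code in langs:
        key = code.strip().lower()
        if len(key) == 3:
            if key not in ISO_639_2_TO_1:
                raise ValueError(f"Unknown ISO 639-2 code: {code!r}")
            key = ISO_639_2_TO_1[key]
        if key not in normalized:
            normalized.append(key)
    return normalized


print(normalize_langs(["eng", "fr", "la"]))  # ['en', 'fr', 'la']
```

Raising on an unknown three-letter code, rather than passing it through, is a design choice in this sketch: it surfaces configuration typos before the OCR step runs.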

Supported languages

PaddleOCR's multilingual model covers 80+ languages including en, fr, de, es, ru, and it. Refer to the PaddleOCR documentation for the full list.
