Normalizing Inconsistent Driver Input Fields

A Driver Vehicle Inspection Report is only as trustworthy as the fields the driver typed into it. When one driver records an odometer as 142,305, another as 142.3k, and a third as N/A, the raw payload cannot be compared, validated, or archived until it is reduced to a single deterministic representation. This page answers one focused question: how do you normalize free-form driver input — odometer readings, defect-severity words, and inspection timestamps — into a canonical, audit-defensible value in Python without silently corrupting the record? Get it wrong and a false 0-mile odometer defeats the mileage-interval logic that 49 CFR § 396.11(a) recordkeeping depends on, or a misread "minor" masks an out-of-service condition and produces a compliance gap a DOT auditor will treat as a violation.

Normalization sits immediately downstream of ingestion, inside the DVIR Field Mapping & Data Normalization stage of the DVIR Ingestion & Digital/Paper Parsing Workflows pipeline. Field mapping decides which canonical key a value belongs to; normalization decides what that value is. The two must stay separate: mapping is a routing concern, normalization is a value-coercion concern, and conflating them makes both untestable.

Prerequisites

Python 3.10+ — the code below uses match/case and PEP 604 X | None union syntax.
pydantic>=2.0 — the coercion stage returns a validated model, not a bare dict; in v2 a model_validator declared with mode="after" runs after field coercion.
The canonical DVIRRecord contract from Standardized DVIR JSON Schema Design, so normalized values land in typed fields.
The controlled defect vocabulary from Defect Taxonomy Mapping for Heavy Trucks — the target enum every free-text severity word must resolve to.
A quarantine sink (Redis list, SQS, or a table) for values that cannot be coerced, distinct from the hard-reject path.

Data schema: input variance and its canonical target

Normalization is defined by the gap between what arrives and what the schema requires. The three worst-offending driver fields, and the canonical form each must reach:

Field	Observed raw variants	Canonical type	Compliance tag
`odometer_reading`	`"142,305"`, `"142305 mi"`, `"142.3k"`, `"N/A"`	`int` (whole miles)	49 CFR § 396.11(a)
`defect_severity`	`"S"`, `"minor"`, `"cosmetic"`, `"OOS"`, `"out of service"`	`DefectSeverity` enum	49 CFR § 396.9(c)(2)
`inspection_ts`	`"2024-05-12 9:22am"`, `"5/12/24 09:22"`, epoch millis	`datetime` (UTC, tz-aware)	49 CFR § 396.11(a)(3)

The severity enum is the same one used everywhere the value is later graded, so it must not be redefined per page:

from enum import Enum


class DefectSeverity(Enum):
    NON_CRITICAL = "NON_CRITICAL"     # cosmetic / advisory, vehicle stays in service
    CRITICAL = "CRITICAL"             # repair required, not yet OOS
    OUT_OF_SERVICE = "OUT_OF_SERVICE" # meets an OOS criterion, vehicle is held

Step-by-step implementation

Normalization runs as three ordered stages (lexical cleaning, semantic mapping, and type coercion), each a pure function: identical input yields identical output, no I/O, no hidden state. Step 4 then packages the result for the audit trail. Keep the raw value on the record throughout: the original string is the audit artifact 49 CFR § 396.11(a) obliges the carrier to preserve, and you must never overwrite it in place.

Step 1: Lexical cleaning

Strip artifacts that carry no meaning — thousands separators, trailing units, stray whitespace — before any parsing decision is made. Do not lowercase yet; casing still matters to the semantic stage.

def lexical_clean(raw: str) -> str:
    """Remove display artifacts without interpreting the value."""
    return (
        raw.strip()
        .replace("\u00a0", " ")   # non-breaking spaces from mobile keyboards
        .replace(",", "")         # thousands separators
        .replace("–", "-")   # en-dash → hyphen (date ranges, negatives)
    )

Step 2: Odometer semantic mapping and type coercion

Detect explicit sentinels first, then compiled-regex extraction, then apply the multiplier. Coercion failures raise — they do not return a guess.

import re

# One compiled pattern, reused across every record (recompiling per call is a hot-loop regression).
# IGNORECASE is required: normalize_odometer upper-cases its input before matching,
# so a case-sensitive "mi" would reject "142305 MI".
_ODOMETER = re.compile(
    r"^(?P<value>\d+(?:\.\d+)?)\s*(?P<mult>[KM])?\s*(?P<unit>MI|KM|MILES|KILOMETERS)?$",
    re.IGNORECASE,
)

_SENTINELS = {"N/A", "NA", "UNKNOWN", "", "-"}

# Assumption: a metric reading is converted rather than stored as if it were miles.
_KM_TO_MILES = 0.621371


def normalize_odometer(raw: str) -> int | None:
    """Coerce a driver odometer string to whole miles, or None for a sentinel."""
    cleaned = lexical_clean(raw).upper()
    if cleaned in _SENTINELS:
        return None  # genuine "not recorded" — distinct from a parse failure

    m = _ODOMETER.match(cleaned)
    if not m:
        # Raise, never coerce to 0: a false 0 would defeat § 396.11 mileage-interval logic.
        raise ValueError(f"unparseable odometer: {raw!r}")

    value = float(m["value"])
    match (m["mult"] or "").upper():
        case "K":
            value *= 1_000
        case "M":
            value *= 1_000_000
    if (m["unit"] or "").upper().startswith("K"):
        value *= _KM_TO_MILES  # KM / KILOMETERS: convert before truncating
    return int(value)  # truncates toward zero; fractional miles are dropped
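Step 2 scales a float and truncates it with int(), which is fine for most readings but inherits binary floating-point error. The sketch below shows one possible alternative using decimal.Decimal. The helper name scale_reading, the _MULTIPLIERS table, and the round-half-up rule are illustrative assumptions, not part of the pipeline described on this page.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical multiplier table mirroring the K / M suffixes handled in Step 2.
_MULTIPLIERS = {"": Decimal(1), "K": Decimal(1_000), "M": Decimal(1_000_000)}


def scale_reading(value: str, mult: str = "") -> int:
    """Scale a decimal string exactly, then round half-up to whole miles."""
    scaled = Decimal(value) * _MULTIPLIERS[mult.upper()]
    # Assumption: half-up rounding; whatever rule you pick, document it once.
    return int(scaled.to_integral_value(rounding=ROUND_HALF_UP))


float_based = int(1.005 * 1000)              # 1004: the float product sits just below 1005
decimal_based = scale_reading("1.005", "k")  # 1005: exact decimal arithmetic
```

Parsing from the string rather than from a float is the point of the design: Decimal("1.005") is exact, whereas Decimal(1.005) would carry the float's error forward.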

Step 3: Severity semantic mapping

Map every colloquial variant to the canonical enum through an externalized dictionary — version-controlled YAML in production so compliance officers can add a synonym without a code deploy. An unrecognized word must quarantine, not default: silently mapping an unknown term to NON_CRITICAL is exactly how an out-of-service truck gets released.

# In production this dict is loaded from version-controlled YAML, not hard-coded.
_SEVERITY_ALIASES: dict[str, DefectSeverity] = {
    "s": DefectSeverity.NON_CRITICAL,
    "minor": DefectSeverity.NON_CRITICAL,
    "cosmetic": DefectSeverity.NON_CRITICAL,
    "advisory": DefectSeverity.NON_CRITICAL,
    "c": DefectSeverity.CRITICAL,
    "critical": DefectSeverity.CRITICAL,
    "major": DefectSeverity.CRITICAL,
    "oos": DefectSeverity.OUT_OF_SERVICE,
    "out of service": DefectSeverity.OUT_OF_SERVICE,
    "red tag": DefectSeverity.OUT_OF_SERVICE,
}


class QuarantineSignal(Exception):
    """Recoverable: preserve the record for human/reconciliation review."""


def normalize_severity(raw: str) -> DefectSeverity:
    key = lexical_clean(raw).lower()
    try:
        return _SEVERITY_ALIASES[key]
    except KeyError:
        # Quarantine the unknown term — do NOT default to NON_CRITICAL.
        # "from None" keeps the internal KeyError out of the quarantine traceback.
        raise QuarantineSignal(f"unmapped severity term: {raw!r}") from None
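The externalized alias file mentioned above has to be turned back into enum members at startup. A minimal sketch of that step follows, assuming the YAML has already been parsed into a plain dict of word to tier name (for example with PyYAML's yaml.safe_load, which is an assumption about your stack). The function name build_alias_table and the Tier enum are hypothetical stand-ins used only to keep the example self-contained.

```python
from enum import Enum


def build_alias_table(raw_aliases: dict[str, str], enum_cls: type[Enum]) -> dict[str, Enum]:
    """Resolve tier names from config into enum members, failing on any dangling target."""
    table: dict[str, Enum] = {}
    for word, target in raw_aliases.items():
        try:
            # enum_cls[name] looks a member up by name and raises KeyError if absent.
            table[word.strip().lower()] = enum_cls[target]
        except KeyError:
            raise ValueError(f"alias {word!r} points at unknown tier {target!r}") from None
    return table


class Tier(Enum):  # stand-in for the shared severity enum, for illustration only
    NON_CRITICAL = "NON_CRITICAL"
    OUT_OF_SERVICE = "OUT_OF_SERVICE"


aliases = build_alias_table({"Minor": "NON_CRITICAL", "red tag": "OUT_OF_SERVICE"}, Tier)
```

Failing at load time means a typo in the config stops the service from starting instead of surfacing later as a quarantine spike.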

Step 4: Assemble the dual-payload record

Every normalized field ships alongside its untouched raw value so the audit chain is reconstructable. A SHA-256 digest of the raw payload lets an auditor check that the stored input was not altered between ingest and archive, provided the digest itself is kept where an editor of the record cannot rewrite it.

import hashlib
from datetime import datetime, timezone

from pydantic import BaseModel


class NormalizedField(BaseModel):
    field_name: str
    raw_payload: str
    normalized_value: int | str | None
    compliance_rule: str
    transformation_sha256: str
    normalized_at_utc: datetime


def to_normalized_field(name: str, raw: str, value, rule: str) -> NormalizedField:
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return NormalizedField(
        field_name=name,
        raw_payload=raw,                       # preserved verbatim for § 396.11 audit
        normalized_value=value,
        compliance_rule=rule,
        transformation_sha256=digest,
        normalized_at_utc=datetime.now(timezone.utc),
    )
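At audit time the stored digest is only useful if something recomputes it. The following is a small sketch of that check using the standard library. The function name raw_payload_matches is hypothetical, and it assumes the digest was produced the same way as in to_normalized_field (UTF-8 bytes, hex-encoded SHA-256).

```python
import hashlib
import hmac


def raw_payload_matches(raw_payload: str, recorded_sha256: str) -> bool:
    """Recompute the digest of the archived raw string and compare it to the stored one."""
    recomputed = hashlib.sha256(raw_payload.encode("utf-8")).hexdigest()
    # compare_digest does a constant-time comparison of the two hex strings.
    return hmac.compare_digest(recomputed, recorded_sha256)


stored = hashlib.sha256("142,305".encode("utf-8")).hexdigest()
untouched = raw_payload_matches("142,305", stored)  # True
edited = raw_payload_matches("142305", stored)      # False: the comma was removed
```

Note that the check runs against the raw string, so even a cosmetic edit such as dropping a thousands separator is detected.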

A value that clears normalization can be trusted by every downstream stage: a coerced OUT_OF_SERVICE severity is what triggers the Critical vs Non-Critical Routing Logic hold, and a clean integer odometer is what the Severity Scoring Algorithms for DVIR Defects stage grades against maintenance intervals. Batches whose failure rate spikes are shunted to Async Batching for High-Volume Ingestion for isolated reprocessing rather than blocking the live stream.
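Because Steps 2 and 3 raise instead of guessing, some caller has to catch those exceptions and park the record. One way that caller could look is sketched below. BatchOutcome, normalize_fields, whole_miles, and the trailer_odometer field are hypothetical names, and the in-memory list stands in for whatever quarantine sink you actually use.

```python
from collections.abc import Callable
from dataclasses import dataclass, field


@dataclass
class BatchOutcome:
    """Hypothetical result container: accepted values plus quarantined inputs."""
    accepted: dict[str, object] = field(default_factory=dict)
    quarantined: list[tuple[str, str, str]] = field(default_factory=list)


def normalize_fields(
    raw_fields: dict[str, str],
    normalizers: dict[str, Callable[[str], object]],
    recoverable: tuple[type[Exception], ...] = (ValueError,),
) -> BatchOutcome:
    """Apply one normalizer per field; park recoverable failures with their raw input."""
    outcome = BatchOutcome()
    for name, raw in raw_fields.items():
        try:
            outcome.accepted[name] = normalizers[name](raw)
        except recoverable as exc:
            # Keep the untouched raw string so a reviewer sees what the driver typed.
            outcome.quarantined.append((name, raw, str(exc)))
    return outcome


def whole_miles(raw: str) -> int:
    """Stand-in normalizer for the demo; int() raises ValueError on bad input."""
    return int(raw.replace(",", ""))


result = normalize_fields(
    {"odometer_reading": "142,305", "trailer_odometer": "twelve thousand"},
    {"odometer_reading": whole_miles, "trailer_odometer": whole_miles},
)
```

With this page's functions you would pass normalize_odometer and normalize_severity as the normalizers and add QuarantineSignal to the recoverable tuple.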

Verification and testing

Normalization is a pure function, which makes it the most testable stage in the pipeline — assert on exact outputs, and assert that failures raise rather than return a guess.

import pytest


@pytest.mark.parametrize("raw,expected", [
    ("142,305",     142_305),
    ("142305 mi",   142_305),
    ("142.3k",      142_300),
    ("N/A",         None),
    (" 98000 ",     98_000),
])
def test_odometer_variants(raw, expected):
    assert normalize_odometer(raw) == expected


def test_unparseable_odometer_raises_not_zero():
    # The critical property: a bad value must NEVER become 0.
    with pytest.raises(ValueError):
        normalize_odometer("twelve thousand")


def test_unknown_severity_quarantines():
    with pytest.raises(QuarantineSignal):
        normalize_severity("kinda bad")


def test_normalization_is_idempotent():
    once = normalize_odometer("142.3k")
    assert normalize_odometer(str(once)) == once  # re-running a canonical value is a no-op

The idempotency assertion is the contract that lets you safely re-run normalization on already-processed records during a backfill without double-transforming them. Also add a schema contract test that every _SEVERITY_ALIASES value is a member of the shared DefectSeverity enum and that every enum member has at least one alias, so the day someone adds a new tier to the taxonomy, the untranslated alias table fails CI instead of quarantining live traffic.
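For the alias table to fail CI when a tier is added, the contract test has to check coverage in the enum-to-alias direction as well. A possible helper is sketched here. The name unaliased_members and the Tier enum are hypothetical, used so the example runs on its own.

```python
from enum import Enum


def unaliased_members(aliases: dict[str, Enum], enum_cls: type[Enum]) -> set[Enum]:
    """Return every enum member that no alias resolves to."""
    return set(enum_cls) - set(aliases.values())


class Tier(Enum):  # stand-in for the shared severity enum, for illustration only
    NON_CRITICAL = "NON_CRITICAL"
    CRITICAL = "CRITICAL"
    OUT_OF_SERVICE = "OUT_OF_SERVICE"


partial_table = {"minor": Tier.NON_CRITICAL, "oos": Tier.OUT_OF_SERVICE}
missing = unaliased_members(partial_table, Tier)  # {Tier.CRITICAL}
```

In the test suite for this page the assertion would be along the lines of `assert not unaliased_members(_SEVERITY_ALIASES, DefectSeverity)`.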

Common failure modes and gotchas

Coercing a parse failure to 0. The single most dangerous mistake: a try/except that returns 0 on failure turns an unreadable odometer into a plausible-looking zero, silently defeating § 396.11 mileage-interval scheduling. Raise and quarantine — a missing value (None) and an unparseable value (exception) are different states and must stay different.
Defaulting unmapped severity to NON_CRITICAL. An unknown term routed to the least-severe class can release an out-of-service vehicle. Quarantine unmapped words; the driver’s attestation is preserved for reconciliation instead of being flattened into a false pass.
142.3k rounding drift. 142.3 * 1000 evaluates to exactly 142300.0, but binary floats do not guarantee that for every reading: int(1.005 * 1000) is 1004, not 1005, because 1.005 is stored slightly below its decimal value and int() truncates. Where sub-thousand precision matters, parse the integer and fractional parts separately or use decimal.Decimal; document the rounding rule so two services never disagree on the same reading.
Naive local timestamps. "5/12/24 09:22" with no offset is ambiguous across terminals in different zones; a driver signing off at end-of-shift can straddle a UTC date boundary. Always attach the terminal’s zone at ingest and store UTC, or the § 396.11(a)(3) inspection date can land on the wrong day. Offline mobile submissions that reconcile hours later compound this — treat the device-supplied timestamp as authoritative only after zone resolution.
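The timestamp rule above (attach the terminal's zone, then store UTC) can be sketched as follows. The function name normalize_inspection_ts and the _TS_FORMATS list are assumptions, and the list covers only the variants shown in the schema table. The demo uses a fixed UTC-5 offset purely to avoid a time-zone database dependency; a real terminal zone such as a zoneinfo.ZoneInfo instance would also handle daylight-saving transitions.

```python
from datetime import datetime, timedelta, timezone, tzinfo

# Hypothetical format list; extend it from the variants your terminals actually emit.
_TS_FORMATS = ("%Y-%m-%d %I:%M%p", "%Y-%m-%d %H:%M", "%m/%d/%y %H:%M")


def normalize_inspection_ts(raw: str, terminal_tz: tzinfo) -> datetime:
    """Return a tz-aware UTC datetime, or raise ValueError if no format matches."""
    cleaned = raw.strip()
    if cleaned.isdigit() and len(cleaned) == 13:
        # Assumption: a 13-digit value is epoch milliseconds, which is already UTC-based.
        return datetime.fromtimestamp(int(cleaned) / 1000, tz=timezone.utc)
    for fmt in _TS_FORMATS:
        try:
            naive = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue
        # Attach the terminal's zone first, then convert; never assume the input is UTC.
        return naive.replace(tzinfo=terminal_tz).astimezone(timezone.utc)
    raise ValueError(f"unparseable inspection timestamp: {raw!r}")


terminal = timezone(timedelta(hours=-5))  # fixed offset, demo only
late_signoff = normalize_inspection_ts("5/12/24 21:30", terminal)
# 21:30 at UTC-5 is 02:30 UTC on May 13: the UTC date differs from the local one.
```

The late sign-off case is why the local inspection date should be derived from the stored UTC value and the terminal zone together, not from the UTC date alone.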

Frequently Asked Questions

Should an unparseable odometer become 0 or None?

Neither by default. None means “the driver genuinely did not record a value” — a legitimate, distinguishable state. An unparseable string ("twelve thousand", a smudged OCR read) must raise and route to quarantine, because coercing it to 0 fabricates a mileage that defeats § 396.11 interval logic and corrupts every downstream maintenance decision.

Why not default an unknown severity word to NON_CRITICAL?

Because the failure is asymmetric. Defaulting an unrecognized term to the least-severe class can release a truck that a driver actually flagged as out of service. Quarantine the unmapped word, alert, and let a human or a versioned alias update resolve it — the driver’s original attestation stays intact for reconciliation.

Where should the severity alias dictionary live?

In version-controlled YAML loaded at startup, not hard-coded in the module. Compliance officers add regional synonyms and legacy abbreviations by editing config that flows through CI, and a contract test asserts every alias target is a valid DefectSeverity member — so a taxonomy change can never leave a dangling mapping in production.

How do I keep the original driver input for a DOT audit?

Emit a dual payload: the untouched raw_payload string plus the normalized_value, bound together with a SHA-256 hash of the raw input and a UTC timestamp. The raw value is never overwritten in place, and recomputing the hash lets an auditor check that the raw input was not modified between ingest and archive, as long as the stored hash itself is protected from edits.

DVIR Field Mapping & Data Normalization — the parent stage that routes each value to its canonical key before this coercion runs.
Standardized DVIR JSON Schema Design — the typed DVIRRecord contract normalized values land in.
Defect Taxonomy Mapping for Heavy Trucks — the controlled DefectSeverity vocabulary the semantic stage resolves to.
Critical vs Non-Critical Routing Logic — where a coerced OUT_OF_SERVICE severity triggers an immediate hold.
Async Batching for High-Volume Ingestion — how quarantined and failed records are isolated and reprocessed at scale.
