PDF & Image OCR Pipeline Setup

A scanned Driver Vehicle Inspection Report is worthless to a DOT auditor if the optical extraction dropped the certification signature or misread the VIN — and under 49 CFR § 396.11(a)(3) the content of that report is what the carrier is legally obligated to retain and act on. Paper and image-only inspections are still submitted from maintenance bays, third-party carriers, and drivers who photograph a triplicate form with a cab-mounted camera, so the OCR pipeline is the stage inside the DVIR Ingestion & Digital/Paper Parsing Workflows architecture that turns those pixels into a field-typed record the rest of the system can trust. This page specifies a deterministic, auditable pipeline that routes native-text PDFs around OCR entirely, preprocesses genuine raster scans, extracts text under a hard confidence floor, and hands the result to the canonical schema — so that every inspection that happened on paper becomes exactly one immutable, reconstructable compliance record rather than a silent guess.

The engineering problem is that OCR is inherently probabilistic and § 396.11 is not. A VIN misread from 1HGCM82633A004352 to 1HGCM82633A0O4352 is not a cosmetic defect; it breaks the vehicle-to-report join that the entire certification chain depends on. The pipeline’s single job is therefore to make the probabilistic stage bounded and observable: emit a per-field confidence, refuse to admit any critical field below the floor, and preserve the original raster and its hash so an auditor can always re-derive the extraction.

Prerequisites and Environment Setup

Target Python 3.11+: the code below uses StrEnum, which was added to the standard library in 3.11, along with the X | Y union syntax. The pipeline deliberately favors deterministic, self-hosted tooling over managed OCR APIs so that extraction is reproducible during an audit; a comparison of the managed alternatives (AWS Textract, Azure Form Recognizer) belongs to a separate toolchain-selection decision and is out of scope here. The dependency stack is:

PyMuPDF (fitz) — probe a PDF for an embedded selectable-text layer and rasterize pages at a controlled DPI when OCR is required.
pdfplumber — extract AcroForm fields and table geometry from native digital exports without any optical step.
opencv-python — deterministic preprocessing: grayscale, adaptive thresholding, deskew, CLAHE, and grid-line removal.
Pillow — Lanczos resampling and colour-space handling for sub-threshold captures.
pytesseract (wrapping Tesseract 5.x with eng and equ traineddata) — the recognition engine and its TSV confidence output.
pydantic (v2) — coerce the extracted fields into the canonical DVIR contract and reject records that fail typing.

The pipeline does not own the field schema. Its typed output must satisfy the canonical extracted-field contract defined in the Standardized DVIR JSON Schema Design reference, and any defect codes it reads off the form must resolve against the controlled vocabulary in Defect Taxonomy Mapping for Heavy Trucks before they are trusted downstream. Native digital PDFs that arrive through the Mobile App DVIR Export Integration endpoint carry embedded text and must bypass OCR; only genuine raster scans enter the optical path described below.

Data Schema and Normalization

The pipeline emits a single OCR envelope per page that wraps evidentiary metadata (source hash, DPI, classification) around the extracted fields and their per-field confidence. The confidence metadata is what makes the probabilistic stage auditable — it is the input to the compliance gate later on. The controlled PageClass and ExtractionConfidence enumerations below are the contract every downstream stage reads.

Field	Type	Enumeration / Range	Compliance tag
`source_sha256`	`str` (64 hex)	—	Evidentiary anchor for § 396.11 retention
`page_class`	`PageClass`	`digital_native` \| `raster_scan` \| `rejected`	Routing decision
`capture_dpi`	`int`	≥ 150 (reject below)	Readability floor
`vin`	`str`	17-char, ISO 3779	Vehicle join key — critical field
`unit_id`	`str`	fleet-local	Vehicle join key
`driver_id`	`str`	carrier roster	§ 396.11(a) preparer identity
`inspection_ts_utc`	`datetime`	UTC, tz-aware	Preparation timestamp
`defect_codes`	`list[str]`	resolves to taxonomy	§ 396.11(a)(3) recorded defects
`certification_present`	`bool`	—	§ 396.11(c)(2) signature capture
`field_confidence`	`dict[str, float]`	0.0–1.0 per field	Gate input
`extraction_confidence`	`ExtractionConfidence`	`high` \| `review` \| `reject`	Admit / HITL / reject decision

from datetime import datetime
from enum import StrEnum
from pydantic import BaseModel, Field, field_validator

CONFIDENCE_FLOOR = 0.85  # per-field floor for critical fields (VIN, odometer, signature)

class PageClass(StrEnum):
    DIGITAL_NATIVE = "digital_native"   # embedded text layer — no OCR
    RASTER_SCAN = "raster_scan"         # image-only — full optical path
    REJECTED = "rejected"               # failed the validation gate

class ExtractionConfidence(StrEnum):
    HIGH = "high"       # all critical fields >= floor -> admit
    REVIEW = "review"   # a critical field below floor -> human-in-the-loop
    REJECT = "reject"   # unreadable / sub-threshold input -> reject payload

class OCREnvelope(BaseModel):
    source_sha256: str = Field(pattern=r"^[0-9a-f]{64}$")
    page_class: PageClass
    capture_dpi: int = Field(ge=150)  # pipeline readability floor; sub-150-DPI input is rejected
    vin: str | None = None
    unit_id: str | None = None
    driver_id: str | None = None
    inspection_ts_utc: datetime | None = None
    defect_codes: list[str] = Field(default_factory=list)
    certification_present: bool = False
    field_confidence: dict[str, float] = Field(default_factory=dict)
    extraction_confidence: ExtractionConfidence

    @field_validator("vin")
    @classmethod
    def vin_length(cls, v: str | None) -> str | None:
        if v is not None and len(v) != 17:
            raise ValueError("VIN must be 17 characters (ISO 3779)")
        return v
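
The length check above would still accept the misread 1HGCM82633A0O4352 from the introduction, because it is 17 characters long. A stricter structural check could catch it before the confidence gate, since VINs never contain the letters I, O, or Q. The sketch below is an assumption rather than part of the canonical contract: `is_plausible_vin` is a hypothetical helper, and the position-9 check digit is the North American convention, so the `require_check_digit` flag should be switched off for vehicles that do not follow it.

```python
import re

# VINs use digits and capital letters except I, O and Q.
_VIN_RE = re.compile(r"^[A-HJ-NPR-Z0-9]{17}$")

# Letter-to-number transliteration used by the North American check digit.
_TRANSLIT = dict(zip(
    "ABCDEFGHJKLMNPRSTUVWXYZ",
    [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 7, 9, 2, 3, 4, 5, 6, 7, 8, 9],
))
_WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]


def is_plausible_vin(vin: str, require_check_digit: bool = True) -> bool:
    """Hypothetical helper: character-set test plus optional check digit."""
    if not _VIN_RE.fullmatch(vin):
        return False  # wrong length, lowercase, or a forbidden I/O/Q
    if not require_check_digit:
        return True
    total = sum(
        (int(ch) if ch.isdigit() else _TRANSLIT[ch]) * weight
        for ch, weight in zip(vin, _WEIGHTS)
    )
    remainder = total % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected  # position 9 holds the check digit


print(is_plausible_vin("1HGCM82633A004352"))  # the VIN quoted in the introduction
print(is_plausible_vin("1HGCM82633A0O4352"))  # the misread with a letter O
```

A check like this complements the confidence floor rather than replacing it: it catches structurally impossible strings, while a misread that swaps one valid character for another can still slip past the character-set test and must be caught by the check digit or by the gate.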

Core Algorithm: Classification, Preprocessing, and Extraction

The pipeline runs as a strict, auditable state machine with the validation gate as its only entry point. Do not let any channel write a record without passing this gate.

1. Validation gate. Reject any file below 150 DPI or above 50 MB before touching it. A sub-150-DPI scan cannot yield a reliable VIN, and admitting it produces a plausible-looking but unauditable guess — reject the payload and return it to the client for re-capture rather than let it progress.
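
The gate can be sketched as a pure function over the two measurements it needs. This is a minimal illustration under stated assumptions: `validation_gate` and `GateResult` are hypothetical names, 50 MB is read as 50 × 1024 × 1024 bytes, and a capture whose DPI cannot be determined is treated as a rejection, which the text above does not specify.

```python
from __future__ import annotations

from dataclasses import dataclass

MIN_DPI = 150                 # readability floor from step 1
MAX_BYTES = 50 * 1024 * 1024  # 50 MB ceiling (assumes binary megabytes)


@dataclass(frozen=True)
class GateResult:
    admitted: bool
    reason: str


def validation_gate(size_bytes: int, capture_dpi: int | None) -> GateResult:
    """Hypothetical entry-point check: reject before any parsing happens."""
    if size_bytes > MAX_BYTES:
        return GateResult(False, "file_too_large")
    if capture_dpi is None:
        # Assumption: an unknown DPI is handled like a sub-floor capture.
        return GateResult(False, "dpi_unknown")
    if capture_dpi < MIN_DPI:
        return GateResult(False, "dpi_below_floor")
    return GateResult(True, "ok")
```

Returning a reason code alongside the boolean lets the client be told why a payload needs re-capture, and gives the audit log something concrete to record for each rejection.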

2. Classification. Probe for a selectable text layer. If extractable text covers more than 60% of the page area, classify the page as digital_native and route it straight to deterministic field extraction — never OCR text you can read directly, because optical recognition can only degrade a source that is already at 100% character fidelity. Otherwise, classify it as raster_scan.

import fitz  # PyMuPDF

def classify_page(pdf_bytes: bytes) -> PageClass:
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    page = doc[0]
    text = page.get_text("text")
    page_area = abs(page.rect.width * page.rect.height)
    # Sum the bounding-box area of every extracted word span.
    covered = sum(
        abs((x1 - x0) * (y1 - y0))
        for x0, y0, x1, y1, *_ in page.get_text("words")
    )
    density = covered / page_area if page_area else 0.0
    if text.strip() and density > 0.60:
        return PageClass.DIGITAL_NATIVE  # embedded text — bypass OCR
    return PageClass.RASTER_SCAN

3. Preprocessing (raster path only). Scans from maintenance bays exhibit skew, uneven illumination, JPEG artifacts, and pre-printed grid lines that Tesseract misreads as hyphens and underscores. Apply a deterministic sequence (grayscale, deskew from the minimum-area rectangle around the ink pixels, CLAHE for local-contrast recovery, and grid-line suppression) so the same input always yields the same raster. Upsample any sub-300-DPI page with Lanczos resampling before binarization to prevent character fragmentation.

import cv2
import numpy as np

def preprocess(raster: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(raster, cv2.COLOR_BGR2GRAY)

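    # Deskew from the minimum-area rectangle around the dark (ink) pixels.
    coords = np.column_stack(np.where(gray < 128)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # OpenCV >= 4.5.1 reports this angle in (0, 90]; older builds used [-90, 0).
    # Normalize both conventions to a small correction around zero.
    if angle < -45:
        angle = -(90 + angle)
    elif angle > 45:
        angle = 90 - angle
    else:
        angle = -angle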
    h, w = gray.shape
    m = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    gray = cv2.warpAffine(gray, m, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

    # Recover local contrast without eroding fine print.
    gray = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(gray)

    # Suppress pre-printed form grid lines so they are not read as glyphs.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, blockSize=15, C=2,
    )
    horiz = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    vert = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    lines = cv2.add(
        cv2.morphologyEx(binary, cv2.MORPH_OPEN, horiz),
        cv2.morphologyEx(binary, cv2.MORPH_OPEN, vert),
    )
    return cv2.bitwise_and(binary, cv2.bitwise_not(lines))

4. Recognition. Run Tesseract with --oem 3, the default engine mode, which selects the engine from the installed traineddata and resolves to the LSTM engine on a standard Tesseract 5 install. Machine-printed values such as the odometer recognize well; handwritten annotations are far less reliable and will often fall below the confidence floor. Request TSV output so every recognized token carries its own confidence; that per-token confidence is the raw material for the compliance gate. Region-specific page-segmentation tuning (--psm 6 for the header block, --psm 12 with ROI cropping for isolated fields) is covered in depth in Tesseract OCR Setup for Fleet Inspection Forms; consult the official Tesseract documentation for the full flag reference.

import pytesseract
from pytesseract import Output

def extract_fields(binary: np.ndarray) -> dict[str, float]:
    data = pytesseract.image_to_data(
        binary, output_type=Output.DICT,
        config="--oem 3 --psm 6",
    )
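    # Keep the lowest confidence seen for each recognized token (0.0-1.0).
    confidences: dict[str, float] = {}
    for text, conf in zip(data["text"], data["conf"]):
        token = text.strip()
        score = float(conf)  # Tesseract reports -1 for boxes that hold no text
        if token and score >= 0:
            confidences[token] = min(confidences.get(token, 1.0), score / 100.0)
    return confidences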
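
The compliance gate reads confidence by field name, while the function above keys it by token text. One way to bridge the two, sketched here as an assumption rather than the pipeline's actual mapping, is to take the lowest token confidence inside a field's region of interest. The `text`, `conf`, `left`, `top`, `width`, and `height` keys are the ones pytesseract returns for `Output.DICT`; the ROI coordinates and `field_confidence_in_roi` are hypothetical.

```python
from __future__ import annotations


def field_confidence_in_roi(
    data: dict[str, list],
    roi: tuple[int, int, int, int],  # (left, top, right, bottom) in pixels
) -> float:
    """Lowest token confidence (0.0-1.0) inside roi, or 0.0 if nothing was read."""
    left, top, right, bottom = roi
    scores = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])
        if not str(text).strip() or conf < 0:
            continue  # skip layout boxes, which carry a confidence of -1
        # Assign a token to the field when its box centre falls inside the ROI.
        cx = data["left"][i] + data["width"][i] / 2
        cy = data["top"][i] + data["height"][i] / 2
        if left <= cx <= right and top <= cy <= bottom:
            scores.append(conf / 100.0)
    return min(scores) if scores else 0.0


# Hand-built sample in the shape of pytesseract's Output.DICT.
sample = {
    "text": ["VIN", "1HGCM82633A004352", "", "Odometer"],
    "conf": [96, 88, -1, 91],
    "left": [10, 60, 0, 10],
    "top": [10, 10, 0, 50],
    "width": [40, 200, 0, 80],
    "height": [20, 20, 0, 20],
}
vin_conf = field_confidence_in_roi(sample, (50, 0, 300, 40))
```

Taking the minimum rather than the mean is the conservative choice: one weak token inside the VIN box is enough to send the page to review. An empty region scores 0.0, so a field that was never read cannot clear the floor by default.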

Compliance Thresholding and Routing

Confidence is not advisory — it is a hard gate mapped to an explicit compliance action. Enforce a per-field floor of 0.85 on the critical fields that the vehicle-to-report join and the certification chain depend on: vin, odometer, and certification_present. A record whose critical fields all clear the floor is admitted as high; a record with any critical field below the floor is routed to human verification as review — never silently admitted. An input that is unreadable end to end is rejected outright.

Extraction state	Condition	Compliance action
`high`	all critical fields ≥ 0.85	Admit to canonical schema; append to audit store
`review`	any critical field < 0.85	Route to human-in-the-loop queue; do not admit
`reject`	page below DPI floor / no legible text	Reject payload; return to client for re-capture

CRITICAL_FIELDS = ("vin", "odometer", "certification_present")

def gate(field_confidence: dict[str, float], legible: bool) -> ExtractionConfidence:
    if not legible:
        return ExtractionConfidence.REJECT  # reject the payload outright
    for name in CRITICAL_FIELDS:
        if field_confidence.get(name, 0.0) < CONFIDENCE_FLOOR:
            return ExtractionConfidence.REVIEW  # route to HITL, never guess
    return ExtractionConfidence.HIGH

This floor is what keeps an OCR misread from becoming a compliance defect. A VIN admitted at 0.71 confidence would produce a record that joins to the wrong vehicle, or to no vehicle, and under § 396.11(c)(2) the carrier could not then prove it certified the correct unit’s repair before dispatch. Routing the page to review costs a human thirty seconds; admitting the guess costs the carrier its audit defensibility.

Production Integration and Platform Synchronization

The OCR pipeline is a producer, not a terminal. Once a page clears the gate, hash the original raster and the extracted envelope with SHA-256, chain that hash into the append-only audit ledger, and emit the typed record onto the ingestion queue. Records that arrive in end-of-shift bursts must not run synchronously behind a single OCR worker; hand raster_scan payloads to the bounded worker pool described in Async Batching for High-Volume Ingestion so a terminal-wide upload spike cannot starve the parsers. Fields read off the form are then folded into the canonical record by the Automated Field Mapping & Data Normalization stage before any defect enters classification.
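
The hashing and chaining step might look like the sketch below. It uses only the standard library and makes several assumptions that the text above does not fix: the entry layout, the `GENESIS` placeholder for the first link, and the function names are all hypothetical, and the envelope is assumed to be a JSON-serializable dict (for example the output of Pydantic's `model_dump(mode="json")`).

```python
from __future__ import annotations

import hashlib
import json

GENESIS = "0" * 64  # assumed predecessor hash for the first ledger entry


def sha256_hex(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


def chain_entry(prev_hash: str, raster: bytes, envelope: dict) -> dict:
    """Build one ledger entry that commits to the raster, envelope and predecessor."""
    raster_hash = sha256_hex(raster)
    # Canonical JSON so the same envelope always produces the same hash.
    canonical = json.dumps(envelope, sort_keys=True, separators=(",", ":"))
    envelope_hash = sha256_hex(canonical.encode("utf-8"))
    entry_hash = sha256_hex(f"{prev_hash}{raster_hash}{envelope_hash}".encode("ascii"))
    return {
        "prev_hash": prev_hash,
        "source_sha256": raster_hash,
        "envelope_sha256": envelope_hash,
        "entry_hash": entry_hash,
    }


def verify_chain(entries: list[dict]) -> bool:
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in entries:
        expected = sha256_hex(
            f"{prev}{entry['source_sha256']}{entry['envelope_sha256']}".encode("ascii")
        )
        if entry["prev_hash"] != prev or entry["entry_hash"] != expected:
            return False
        prev = entry["entry_hash"]
    return True


first = chain_entry(GENESIS, b"scan-1", {"vin": "1HGCM82633A004352"})
second = chain_entry(first["entry_hash"], b"scan-2", {"vin": None})
```

Because each entry hash covers its predecessor, altering one stored envelope invalidates every later entry, which is the property that makes an append-only ledger useful as evidence.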

Emit the SHA-256 of the source raster as the idempotency key. The same photographed form re-uploaded after a flaky mobile connection must resolve to the same audit entry, not a duplicate compliance record — deduplicate on the source hash rather than create a second record. Any defect the pipeline reads is a candidate that the downstream classifier still scores; a defect the driver flagged as safety-critical must reach the routing engine intact so it can trigger the correct band. The severity bands the score maps to are the same on every page in this section: 0–34 minor, 35–69 major, 70–100 critical.
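
Deduplication on the source hash can be illustrated with an in-memory stand-in for the audit store. `IdempotentStore` is a hypothetical class for this sketch; a production store would enforce the same rule with a unique constraint on the hash column or an equivalent conditional write.

```python
from __future__ import annotations

import hashlib


class IdempotentStore:
    """In-memory stand-in for the audit store, keyed on the source-raster hash."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def ingest(self, raster: bytes, record: dict) -> tuple[str, bool]:
        """Return (idempotency_key, created); a repeat upload creates nothing."""
        key = hashlib.sha256(raster).hexdigest()
        if key in self._records:
            return key, False  # same bytes seen before: resolve to that entry
        self._records[key] = record
        return key, True

    def get(self, key: str) -> dict:
        return self._records[key]


store = IdempotentStore()
key_a, created_a = store.ingest(b"photographed-form", {"unit_id": "T-101"})
key_b, created_b = store.ingest(b"photographed-form", {"unit_id": "T-101"})
```

Note that a content hash only deduplicates byte-identical uploads. A form that is photographed a second time produces different bytes and therefore a different key, so that case needs a separate rule.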

Engineering Standards Checklist

Schema validation: every extracted envelope validates against the OCREnvelope Pydantic contract and the canonical DVIR schema before admission; a record that fails typing is rejected, never coerced.
Deterministic execution: classification, preprocessing, and gating produce identical output for identical input — no stochastic model decides admission, so any extraction is reproducible during an audit.
Confidence floor enforced: the 0.85 per-field floor on vin, odometer, and certification_present is applied on every raster record; sub-floor critical fields route to human verification, not admission.
Evidentiary hashing: the source raster and the extracted envelope are SHA-256 hashed and chained into the WORM audit ledger to satisfy § 396.11 retention.
Idempotent ingest: the source-raster hash is the idempotency key, so re-uploads deduplicate instead of creating duplicate compliance records.
OCR bypass verified: digital_native pages are asserted to skip the optical path entirely, preserving 100% character fidelity.
Audit logging: every admit / review / reject decision is logged with the correlation ID, the DPI, and the field-confidence map for reconstruction.
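
The audit-logging item can be made concrete with a one-line-per-decision JSON record. This is a sketch under assumptions: the key names, the `audit_line` helper, and the choice of JSON Lines are hypothetical, and only the logged fields themselves (correlation ID, DPI, field-confidence map, decision) come from the checklist.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone

DECISIONS = {"high", "review", "reject"}  # the ExtractionConfidence values


def audit_line(
    correlation_id: str,
    decision: str,
    capture_dpi: int,
    field_confidence: dict[str, float],
    source_sha256: str,
    decided_at: datetime | None = None,
) -> str:
    """Serialize one gate decision as a single JSON line."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    decided_at = decided_at or datetime.now(timezone.utc)
    return json.dumps(
        {
            "correlation_id": correlation_id,
            "decision": decision,
            "capture_dpi": capture_dpi,
            "field_confidence": field_confidence,
            "source_sha256": source_sha256,
            "decided_at_utc": decided_at.isoformat(),
        },
        sort_keys=True,  # stable key order keeps log lines diffable
    )


line = audit_line(
    "corr-001", "review", 200, {"vin": 0.71}, "a" * 64,
    decided_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Logging the full confidence map, rather than only the decision, is what lets a reviewer see which field sent a page to the human queue.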

Frequently Asked Questions

Why bypass OCR for native PDFs instead of running everything through Tesseract?

Native PDFs from mobile exports carry an embedded text layer at 100% character fidelity. Running OCR over readable text can only introduce errors — a misread VIN or odometer — while adding compute cost. Classify first, and route digital_native pages straight to deterministic field extraction so the optical stage only ever touches genuine image-only scans.

What confidence floor should critical fields require?

Enforce a per-field floor of 0.85 on the fields the certification chain depends on: VIN, odometer, and the certification signature. A record with all critical fields at or above the floor is admitted; any critical field below it routes the page to human verification rather than admitting a guess that could break the vehicle-to-report join required by § 396.11(c)(2).

How does the pipeline stay auditable when OCR is probabilistic?

Preserve the original raster and its SHA-256 hash, emit a per-field confidence with every extraction, and log every admit/review/reject decision with its correlation ID. Because classification, preprocessing, and gating are deterministic, an auditor can always re-run the exact extraction from the retained raster and reproduce the result.
