Invoice POC — Self-Healing Architecture & Semantic Validation

By Patrick McCurley · Created Mar 12, 2026

This document covers how the Invoice POC currently achieves self-correction, where the gaps are, and the proposed architecture to close the semantic validation gap.

Current Architecture: What "Self-Healing" Actually Means Today

The pipeline generates extraction scripts once per courier format, caches them in a KV store, and reuses them deterministically across subsequent invoice runs. The AI agent is only invoked on a cache miss — when a new section pattern is encountered for the first time.

flowchart TD
    A([Invoice PDF]) --> B[Extract text & tables\npdfExtract.ts]
    B --> C[Fingerprint each section\nfingerprintPage]
    C --> D{Pattern cached?}

    D -->|Cache Hit| E[Run cached script]
    D -->|Cache Miss| F[AI agent\nwrites new script]

    F --> G[run_script]
    G --> E

    E --> H{_meta.confidence?}
    H -->|ok| I[✓ Accept output]
    H -->|low| J[Vision spot-check]
    H -->|failed| K[Invalidate script\nremovePattern]

    J --> L{Vision agrees?}
    L -->|yes| I
    L -->|no| K

    K --> F

    style F fill:#e1f5fe
    style J fill:#f3e5f5
    style I fill:#e8f5e9
    style K fill:#fce4ec

Script Lifecycle

  1. First run — AI agent calls read_pdf_tables, read_pdf_text, writes a TypeScript extraction script, and calls run_script to validate it
  2. Script saved — stored as {courier}/{patternId}.ts in the KV store alongside a detector function
  3. Subsequent runs — detector fires on each page, matching script runs immediately with no AI call
  4. Self-correction — if the script self-reports failure (_meta.confidence = 'failed') or vision cross-check disagrees, removePattern deletes it and the next run triggers AI regeneration

The _meta Quality Contract

Scripts are required by the system prompt to self-report extraction quality via a _meta block:

const _meta = {
  rowCount: items.length,
  confidence: items.length === 0 ? 'failed'
            : items.length < 5  ? 'low'
            : 'ok',
  warnings: items.length < 5 ? ['Very few rows extracted — possible parsing issue'] : [],
};
return { lineItems: items, _meta };

This check is zero-cost on every business-as-usual (BAU) run — the script itself decides whether to escalate.
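The contract can be stated as a TypeScript type, inferred from the snippet above (only the fields shown there are assumed):

```typescript
// The _meta quality contract as a type, plus a helper matching the
// thresholds in the snippet above (0 rows = failed, <5 rows = low).
type MetaConfidence = "ok" | "low" | "failed";

interface ExtractionMeta {
  rowCount: number;
  confidence: MetaConfidence;
  warnings: string[];
}

function buildMeta(rowCount: number): ExtractionMeta {
  return {
    rowCount,
    confidence: rowCount === 0 ? "failed" : rowCount < 5 ? "low" : "ok",
    warnings:
      rowCount < 5 ? ["Very few rows extracted — possible parsing issue"] : [],
  };
}
```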


The Semantic Gap: What We Cannot Catch Today

The self-healing above addresses structural failures — scripts that extract nothing, crash, or produce obviously empty output. It does not address the harder class of failure: semantically wrong extractions.

flowchart LR
    subgraph Arithmetic Gate catches
        A1[Script returns empty array]
        A2[Script crashes / syntax error]
        A3[Total mismatch > 2%]
    end

    subgraph Semantic Gap — missed by arithmetic
        B1[Wrong field: VAT instead of net amount]
        B2[Missing rows: 12 extracted, 47 exist]
        B3[Wrong column: description in amount field]
        B4[Coincidental total match on partial data]
        B5[Misidentified service type]
        B6[Missing surcharge section entirely]
    end

    style B1 fill:#fce4ec
    style B2 fill:#fce4ec
    style B3 fill:#fce4ec
    style B4 fill:#fce4ec
    style B5 fill:#fce4ec
    style B6 fill:#fce4ec
    style A1 fill:#e8f5e9
    style A2 fill:#e8f5e9
    style A3 fill:#e8f5e9

The Coincidental Match Problem

Consider a DPD invoice with 47 consignments. An extraction script that only processes the first page of the consignment table would return 12 items. If those 12 items happen to sum to a subtotal visible elsewhere in the document, the arithmetic gate passes — and the error is silently cached.

The arithmetic gate is only valid when extractContentTotal finds a total line in the chunk. For most consignment pages (which contain no "Total: £X" line — only the summary/overview page does), extractContentTotal returns null and the check silently passes regardless of item count.
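The silent-pass behaviour can be made concrete. In this sketch, the regex inside extractContentTotal is illustrative rather than the production one, and the 2% tolerance mirrors the gate described earlier:

```typescript
// When no "Total: £X" line exists on the page, extractContentTotal returns
// null and the gate cannot fail, regardless of how many rows were extracted.
function extractContentTotal(pageText: string): number | null {
  const m = pageText.match(/Total:\s*£([\d,]+\.\d{2})/);
  return m ? Number(m[1].replace(/,/g, "")) : null;
}

function arithmeticGatePasses(pageText: string, extractedSum: number): boolean {
  const total = extractContentTotal(pageText);
  if (total === null) return true; // no total on page: gate silently passes
  return Math.abs(extractedSum - total) / total <= 0.02; // 2% tolerance
}
```

This is exactly the coincidental-match exposure: a consignment page with no total line passes whether the script extracted 12 rows or 47.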

Example Semantic Failures

| Failure | Arithmetic detects? | _meta detects? | Vision detects? |
| --- | --- | --- | --- |
| Script extracts VAT not net amount | Only if totals differ | No — rowCount looks fine | Yes — column header mismatch |
| Missing 35 of 47 consignments | Only on summary page | Yes — rowCount unexpectedly low | Yes — row count mismatch |
| Wrong service type mapped | Never | No | Possibly — if label differs |
| Entire surcharge section skipped | If total includes surcharge | Yes — if script marks low | Yes — section not in output |
| Column shifted one right | Only if totals differ | No | Yes — values in wrong fields |

Three Approaches to Semantic Self-Correction

Approach A: Vision Model Cross-Check

After a script runs, send a rendered page screenshot to a vision model (Claude vision, available via the read_pdf_vision tool) and ask it to compare what it sees against the extracted JSON.

sequenceDiagram
    participant S as Cached Script
    participant V as Vision Model
    participant KV as Pattern Store

    S->>S: Extract → JSON output
    S->>V: PDF page image + extracted JSON
    V->>V: Compare row count, column headers,\nvisible totals vs extracted values
    V-->>S: { ok: true/false, feedback: "..." }

    alt Vision agrees
        S->>KV: Cache validation result
        Note over KV: Subsequent runs skip vision
    else Vision flags issue
        S->>KV: removePattern
        Note over KV: Next run triggers AI regeneration\nwith vision feedback in prompt
    end

Pros: Catches column misidentification, missing sections, wrong item counts
Cons: Vision models can hallucinate; adds ~1s per new section (one-time cost); requires API key
Cost model: One vision call per new pattern, cached. Zero cost on repeat runs.
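The cross-check round trip can be sketched as below. The prompt wording and the `callVisionModel` signature are assumptions standing in for the real read_pdf_vision-backed call, not the production contract:

```typescript
// Approach A sketch: send the page image plus extracted JSON to a vision
// model and get back an { ok, feedback } verdict, as in the sequence diagram.
interface VisionVerdict {
  ok: boolean;
  feedback: string;
}

async function visionCrossCheck(
  pageImagePng: Uint8Array,
  extractedJson: unknown,
  callVisionModel: (prompt: string, image: Uint8Array) => Promise<VisionVerdict>,
): Promise<VisionVerdict> {
  const prompt =
    "Compare this invoice page against the extracted JSON. " +
    "Check row count, column headers, and visible totals. " +
    `Extracted: ${JSON.stringify(extractedJson)}`;
  return callVisionModel(prompt, pageImagePng);
}
```

On `ok: false`, the feedback string is what gets threaded into the regeneration prompt after removePattern.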

Approach B: Schema Confidence Scoring

After extraction, run a second cheap LLM call asking it to score confidence on each critical field.

Rate 0–100: how confident are you that lineItems[0].netAmount is the
correct net charge (excluding VAT, excluding fuel surcharge)?
Explain any ambiguity in the column structure.

Pros: Catches field-level semantic ambiguity; explicit reasoning helps prompt improvement
Cons: An LLM scoring its own output has limited independence; can miss visual layout cues
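A minimal sketch of how Approach B's scores would be consumed. The field-score shape and the 70 threshold are assumptions for illustration:

```typescript
// Build the scoring prompt for one field (wording taken from the example
// above) and filter the returned scores down to the ones worth escalating.
interface FieldScore {
  field: string;      // e.g. "lineItems[0].netAmount"
  confidence: number; // 0-100, as in the prompt
  ambiguity: string;
}

function scoringPrompt(field: string): string {
  return (
    `Rate 0-100: how confident are you that ${field} is the correct net charge ` +
    "(excluding VAT, excluding fuel surcharge)? " +
    "Explain any ambiguity in the column structure."
  );
}

function lowConfidenceFields(scores: FieldScore[], threshold = 70): FieldScore[] {
  return scores.filter((s) => s.confidence < threshold);
}
```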

Approach C: Ground Truth Seeding (Best Long-Term)

For each courier, provide one "known good" invoice with manually verified line items as a seed document. The AI is told the expected output upfront.

flowchart TD
    A[Seed invoice\nmanually verified] --> B[Stored as\ncourier/seed.json]
    B --> C{First-run prompt}
    C --> D["Write a script that produces\nthis output for this invoice:\n{seed data}"]
    D --> E[AI writes anchored script]
    E --> F[Compare extracted totals\nper service type vs seed pattern]
    F --> G{Matches seed pattern?}
    G -->|yes| H[Accept & cache]
    G -->|no| I[Flag for review]

    style A fill:#fff3e0
    style B fill:#fff3e0
    style H fill:#e8f5e9
    style I fill:#fce4ec

Pros: Encodes business domain knowledge — what "net amount" means for DPD specifically, what "fuel surcharge" looks like for Evri; most accurate long-term
Cons: Requires one-time manual verification per courier format; seed may drift if format changes
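The "matches seed pattern?" step in the diagram can be sketched as a per-service-type comparison, run against the seed invoice itself. The shape of seed.json and the 2% tolerance (borrowed from the arithmetic gate) are assumptions:

```typescript
// Approach C sketch: every service type in the manually verified seed must
// appear in the anchored script's output, with a total within tolerance.
type ServiceTotals = Record<string, number>;

function matchesSeedPattern(
  extracted: ServiceTotals,
  seed: ServiceTotals,
  tolerance = 0.02,
): boolean {
  return Object.entries(seed).every(([service, seedTotal]) => {
    const got = extracted[service];
    if (got === undefined) return false; // e.g. a missing surcharge section
    return Math.abs(got - seedTotal) / seedTotal <= tolerance;
  });
}
```

A missing service type fails outright, which is precisely the "entire surcharge section skipped" failure the arithmetic gate misses.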


What Actually Matters: Extraction Quality vs Document Parsing

The pipeline separates concerns well:

| Layer | Tool | Responsibility |
| --- | --- | --- |
| PDF → text/tables | pdfExtract.ts + pdfplumber | Raw content extraction |
| Text → structure | AI-written scripts | Semantic parsing |
| Structure → validation | _meta + vision | Quality assurance |

The real semantic risk is at the script-writing layer, not the PDF extraction layer. pdfplumber handles tabular DPD data correctly — the "wrong headers" problem is a table boundary issue where pdfplumber misidentifies the first data row as a header row.

This is fixable at the prompt level:

"When pdfplumber's table has a first row that looks like data rather than headers (numbers, not column names), look earlier in the extracted text for the actual column names."

Vision can also help here: rendering the page lets a model identify actual column headers visually, bypassing pdfplumber's layout inference entirely.
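The "first row looks like data rather than headers" test from the prompt can itself be sketched as a heuristic. The regex and the 50% cutoff here are illustrative choices, not the production logic:

```typescript
// Decide whether a pdfplumber table row looks like data (mostly numeric
// cells) rather than column names. If it does, the real headers should be
// searched for earlier in the extracted text, per the prompt fix above.
function looksLikeDataRow(cells: string[]): boolean {
  if (cells.length === 0) return false;
  const numericish = cells.filter((c) => /^[£$]?[\d.,]+%?$/.test(c.trim()));
  return numericish.length / cells.length > 0.5;
}
```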

The Fingerprinting Bug (Fixed)

An instructive example of structural self-healing working as intended: the fingerprintPage() function was using the first data row as the page fingerprint instead of the header row. This caused a 120-page DPD invoice to generate 120 unique section types instead of ~4–6, invoking the AI agent 120 times.

// BEFORE (broken) — first data row varies per page
for (let i = dataStart + 1; i < lines.length; i++) {  // +1 was the bug
  const t = lines[i].trim();
  if (t && !/^-+$/.test(t)) return t.slice(0, 80);
}

// AFTER (fixed) — header row is stable across all pages of same section type
// (COLUMN_HEADER_WORDS is a regex matching known column-header vocabulary)
for (const line of lines) {
  const t = line.trim();
  const matches = t.match(COLUMN_HEADER_WORDS);
  if (matches && matches.length >= 2) {
    return t.slice(0, 120).replace(/\s+/g, ' ');
  }
}

The fix reduced a 120-section run back to ~4–6 sections — the expected architecture.
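The fixed behaviour is easy to demonstrate: two pages of the same section type share a header row but differ in every data row, so they must produce the same fingerprint. The COLUMN_HEADER_WORDS vocabulary below is illustrative, not the production regex:

```typescript
// Runnable sketch of header-row fingerprinting: the fingerprint is the
// first line containing at least two known column-header words.
const COLUMN_HEADER_WORDS = /\b(Consignment|Service|Weight|Net|Amount|VAT|Date)\b/g;

function fingerprintPage(text: string): string | null {
  for (const line of text.split("\n")) {
    const t = line.trim();
    const matches = t.match(COLUMN_HEADER_WORDS);
    if (matches && matches.length >= 2) {
      return t.slice(0, 120).replace(/\s+/g, " ");
    }
  }
  return null; // no header row found on this page
}
```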


Recommended Hybrid: Arithmetic + Vision Spot Check

flowchart TD
    A([Run cached script]) --> B[Script outputs JSON\n+ _meta block]

    B --> C{_meta.confidence}
    C -->|failed| D[Immediate invalidate]
    C -->|low| E[Vision spot-check\nGemini via OpenRouter]
    C -->|ok| F{First run\nfor this pattern?}

    F -->|yes| E
    F -->|no| I[✓ Accept — skip vision]

    E --> G{Vision agrees?}
    G -->|yes| H[Cache vision result\nmark pattern as\nvision-validated]
    G -->|no| D

    D --> J[removePattern]
    J --> K[AI regeneration\nwith vision feedback\nin prompt]

    H --> I

    style E fill:#f3e5f5
    style I fill:#e8f5e9
    style D fill:#fce4ec
    style K fill:#e1f5fe

Cost Profile

| Event | Cost | Frequency |
| --- | --- | --- |
| Script runs on cache hit | Zero | Every run |
| _meta confidence check | Zero | Every run |
| Vision spot-check | ~1 Gemini call | First run per pattern only |
| AI regeneration | ~1 Claude call | Only on failure |

Once a pattern is vision-validated, it runs forever at zero AI cost until a format change triggers a new cache miss.
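As a back-of-envelope model of that table, AI calls scale with the number of distinct patterns (plus failures), not with invoice volume. The inputs here are hypothetical:

```typescript
// Rough cost model: vision calls are one per new pattern, regenerations are
// one per pattern plus one per failure, and BAU runs cost no AI calls.
function estimateAiCalls(opts: {
  invoices: number;
  distinctPatterns: number;
  regenerationFailures: number;
}): { visionCalls: number; regenerationCalls: number; bauRuns: number } {
  return {
    visionCalls: opts.distinctPatterns, // one spot-check per new pattern
    regenerationCalls: opts.distinctPatterns + opts.regenerationFailures,
    bauRuns: opts.invoices, // zero AI cost per run once cached
  };
}
```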

Current Test Results

Results below are from live test runs during the POC phase. Exact numbers will update as the pipeline stabilises.

Document Set

| Courier | Format | Count | Status |
| --- | --- | --- | --- |
| Evri | PDF | 3 | Tested |
| DPD | PDF | 4 | Tested (fingerprinting fix applied) |
| DPD | CSV | 1 | Tested |

Architecture Verdict

| Concern | Status |
| --- | --- |
| Structural correctness | Sound — _meta + invalidation works |
| Section deduplication | Fixed — fingerprinting now uses header row |
| Agent behaviour | Fixed — prompt enforces sequential tool use, no duplicate calls |
| Semantic correctness | Needs vision layer — arithmetic gate only fires on summary pages |
| Ongoing cost | Near-zero after first pattern cached |
| Format drift | Detected on next cache miss — triggers AI regeneration |

The current architecture is sound for structural correctness. The missing piece is a vision layer on first run to validate that the AI-written script correctly identifies the semantic meaning of each column — not just that the numbers add up.