Invoice POC — Self-Healing Architecture & Semantic Validation

By Patrick McCurley · Created Mar 12, 2026

This document covers how the Invoice POC currently achieves self-correction, where the gaps are, and the proposed architecture to close the semantic validation gap.

Current Architecture: What "Self-Healing" Actually Means Today

The pipeline generates extraction scripts once per courier format, caches them in a KV store, and reuses them deterministically across subsequent invoice runs. The AI agent is only invoked on a cache miss — when a new section pattern is encountered for the first time.

flowchart TD
    A([Invoice PDF]) --> B[Extract text & tables\npdfExtract.ts]
    B --> C[Fingerprint each section\nfingerprintPage]
    C --> D{Pattern cached?}

    D -->|Cache Hit| E[Run cached script]
    D -->|Cache Miss| F[AI agent\nwrites new script]

    F --> G[run_script]
    G --> E

    E --> H{_meta.confidence?}
    H -->|ok| I[✓ Accept output]
    H -->|low| J[Vision spot-check]
    H -->|failed| K[Invalidate script\nremovePattern]

    J --> L{Vision agrees?}
    L -->|yes| I
    L -->|no| K

    K --> F

    style F fill:#e1f5fe
    style J fill:#f3e5f5
    style I fill:#e8f5e9
    style K fill:#fce4ec

Script Lifecycle

  1. First run — AI agent calls read_pdf_tables, read_pdf_text, writes a TypeScript extraction script, and calls run_script to validate it
  2. Script saved — stored as {courier}/{patternId}.ts in the KV store alongside a detector function
  3. Subsequent runs — detector fires on each page, matching script runs immediately with no AI call
  4. Self-correction — if the script self-reports failure (_meta.confidence = 'failed') or vision cross-check disagrees, removePattern deletes it and the next run triggers AI regeneration

The _meta Quality Contract

Scripts are required by the system prompt to self-report extraction quality via a _meta block:

const _meta = {
  rowCount: items.length,
  confidence: items.length === 0 ? 'failed'
            : items.length < 5  ? 'low'
            : 'ok',
  warnings: items.length < 5 ? ['Very few rows extracted — possible parsing issue'] : [],
};
return { lineItems: items, _meta };

This check is zero-cost on every business-as-usual (BAU) run — the script itself decides whether to escalate.
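The contract can be stated as a TypeScript type, inferred from the snippet above (only the fields shown there are assumed):

```typescript
// The _meta quality contract as a type, plus a helper matching the
// thresholds in the snippet above (0 rows = failed, <5 rows = low).
type MetaConfidence = "ok" | "low" | "failed";

interface ExtractionMeta {
  rowCount: number;
  confidence: MetaConfidence;
  warnings: string[];
}

function buildMeta(rowCount: number): ExtractionMeta {
  return {
    rowCount,
    confidence: rowCount === 0 ? "failed" : rowCount < 5 ? "low" : "ok",
    warnings:
      rowCount < 5 ? ["Very few rows extracted — possible parsing issue"] : [],
  };
}
```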


The Semantic Gap: What We Cannot Catch Today

The self-healing above addresses structural failures — scripts that extract nothing, crash, or produce obviously empty output. It does not address the harder class of failure: semantically wrong extractions.

flowchart LR
    subgraph Arithmetic Gate catches
        A1[Script returns empty array]
        A2[Script crashes / syntax error]
        A3[Total mismatch > 2%]
    end

    subgraph Semantic Gap — missed by arithmetic
        B1[Wrong field: VAT instead of net amount]
        B2[Missing rows: 12 extracted, 47 exist]
        B3[Wrong column: description in amount field]
        B4[Coincidental total match on partial data]
        B5[Misidentified service type]
        B6[Missing surcharge section entirely]
    end

    style B1 fill:#fce4ec
    style B2 fill:#fce4ec
    style B3 fill:#fce4ec
    style B4 fill:#fce4ec
    style B5 fill:#fce4ec
    style B6 fill:#fce4ec
    style A1 fill:#e8f5e9
    style A2 fill:#e8f5e9
    style A3 fill:#e8f5e9

The Coincidental Match Problem

Consider a DPD invoice with 47 consignments. An extraction script that only processes the first page of the consignment table would return 12 items. If those 12 items happen to sum to a subtotal visible elsewhere in the document, the arithmetic gate passes — and the error is silently cached.

The arithmetic gate is only valid when extractContentTotal finds a total line in the chunk. For most consignment pages (which contain no "Total: £X" line — only the summary/overview page does), extractContentTotal returns null and the check silently passes regardless of item count.
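The silent-pass behaviour can be made concrete. In this sketch, the regex inside extractContentTotal is illustrative rather than the production one, and the 2% tolerance mirrors the gate described earlier:

```typescript
// When no "Total: £X" line exists on the page, extractContentTotal returns
// null and the gate cannot fail, regardless of how many rows were extracted.
function extractContentTotal(pageText: string): number | null {
  const m = pageText.match(/Total:\s*£([\d,]+\.\d{2})/);
  return m ? Number(m[1].replace(/,/g, "")) : null;
}

function arithmeticGatePasses(pageText: string, extractedSum: number): boolean {
  const total = extractContentTotal(pageText);
  if (total === null) return true; // no total on page: gate silently passes
  return Math.abs(extractedSum - total) / total <= 0.02; // 2% tolerance
}
```

This is exactly the coincidental-match exposure: a consignment page with no total line passes whether the script extracted 12 rows or 47.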

Example Semantic Failures

| Failure | Arithmetic detects? | _meta detects? | Vision detects? |
| --- | --- | --- | --- |
| Script extracts VAT not net amount | Only if totals differ | No — rowCount looks fine | Yes — column header mismatch |
| Missing 35 of 47 consignments | Only on summary page | Yes — rowCount unexpectedly low | Yes — row count mismatch |
| Wrong service type mapped | Never | No | Possibly — if label differs |
| Entire surcharge section skipped | If total includes surcharge | Yes — if script marks low | Yes — section not in output |
| Column shifted one right | Only if totals differ | No | Yes — values in wrong fields |

Three Approaches to Semantic Self-Correction

Approach A: Vision Model Cross-Check

After a script runs, send a rendered page screenshot to a vision model (Claude vision, available via the read_pdf_vision tool) and ask it to compare what it sees against the extracted JSON.

sequenceDiagram
    participant S as Cached Script
    participant V as Vision Model
    participant KV as Pattern Store

    S->>S: Extract → JSON output
    S->>V: PDF page image + extracted JSON
    V->>V: Compare row count, column headers,\nvisible totals vs extracted values
    V-->>S: { ok: true/false, feedback: "..." }

    alt Vision agrees
        S->>KV: Cache validation result
        Note over KV: Subsequent runs skip vision
    else Vision flags issue
        S->>KV: removePattern
        Note over KV: Next run triggers AI regeneration\nwith vision feedback in prompt
    end

Pros: Catches column misidentification, missing sections, wrong item counts
Cons: Vision models can hallucinate; adds ~1s per new section (one-time cost); requires API key
Cost model: One vision call per new pattern, cached. Zero cost on repeat runs.
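The cross-check round trip can be sketched as below. The prompt wording and the `callVisionModel` signature are assumptions standing in for the real read_pdf_vision-backed call, not the production contract:

```typescript
// Approach A sketch: send the page image plus extracted JSON to a vision
// model and get back an { ok, feedback } verdict, as in the sequence diagram.
interface VisionVerdict {
  ok: boolean;
  feedback: string;
}

async function visionCrossCheck(
  pageImagePng: Uint8Array,
  extractedJson: unknown,
  callVisionModel: (prompt: string, image: Uint8Array) => Promise<VisionVerdict>,
): Promise<VisionVerdict> {
  const prompt =
    "Compare this invoice page against the extracted JSON. " +
    "Check row count, column headers, and visible totals. " +
    `Extracted: ${JSON.stringify(extractedJson)}`;
  return callVisionModel(prompt, pageImagePng);
}
```

On `ok: false`, the feedback string is what gets threaded into the regeneration prompt after removePattern.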

Approach B: Schema Confidence Scoring

After extraction, run a second cheap LLM call asking it to score confidence on each critical field.

Rate 0–100: how confident are you that lineItems[0].netAmount is the
correct net charge (excluding VAT, excluding fuel surcharge)?
Explain any ambiguity in the column structure.

Pros: Catches field-level semantic ambiguity; explicit reasoning helps prompt improvement
Cons: An LLM scoring its own output has limited independence; can miss visual layout cues
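A minimal sketch of how Approach B's scores would be consumed. The field-score shape and the 70 threshold are assumptions for illustration:

```typescript
// Build the scoring prompt for one field (wording taken from the example
// above) and filter the returned scores down to the ones worth escalating.
interface FieldScore {
  field: string;      // e.g. "lineItems[0].netAmount"
  confidence: number; // 0-100, as in the prompt
  ambiguity: string;
}

function scoringPrompt(field: string): string {
  return (
    `Rate 0-100: how confident are you that ${field} is the correct net charge ` +
    "(excluding VAT, excluding fuel surcharge)? " +
    "Explain any ambiguity in the column structure."
  );
}

function lowConfidenceFields(scores: FieldScore[], threshold = 70): FieldScore[] {
  return scores.filter((s) => s.confidence < threshold);
}
```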

Approach C: Ground Truth Seeding (Best Long-Term)

For each courier, provide one "known good" invoice with manually verified line items as a seed document. The AI is told the expected output upfront.

flowchart TD
    A[Seed invoice\nmanually verified] --> B[Stored as\ncourier/seed.json]
    B --> C{First-run prompt}
    C --> D["Write a script that produces\nthis output for this invoice:\n{seed data}"]
    D --> E[AI writes anchored script]
    E --> F[Compare extracted totals\nper service type vs seed pattern]
    F --> G{Matches seed pattern?}
    G -->|yes| H[Accept & cache]
    G -->|no| I[Flag for review]

    style A fill:#fff3e0
    style B fill:#fff3e0
    style H fill:#e8f5e9
    style I fill:#fce4ec

Pros: Encodes business domain knowledge — what "net amount" means for DPD specifically, what "fuel surcharge" looks like for Evri; most accurate long-term
Cons: Requires one-time manual verification per courier format; seed may drift if format changes
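The "matches seed pattern?" step in the diagram can be sketched as a per-service-type comparison, run against the seed invoice itself. The shape of seed.json and the 2% tolerance (borrowed from the arithmetic gate) are assumptions:

```typescript
// Approach C sketch: every service type in the manually verified seed must
// appear in the anchored script's output, with a total within tolerance.
type ServiceTotals = Record<string, number>;

function matchesSeedPattern(
  extracted: ServiceTotals,
  seed: ServiceTotals,
  tolerance = 0.02,
): boolean {
  return Object.entries(seed).every(([service, seedTotal]) => {
    const got = extracted[service];
    if (got === undefined) return false; // e.g. a missing surcharge section
    return Math.abs(got - seedTotal) / seedTotal <= tolerance;
  });
}
```

A missing service type fails outright, which is precisely the "entire surcharge section skipped" failure the arithmetic gate misses.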


What Actually Matters: Extraction Quality vs Document Parsing

The pipeline separates concerns well:

| Layer | Tool | Responsibility |
| --- | --- | --- |
| PDF → text/tables | pdfExtract.ts + pdfplumber | Raw content extraction |
| Text → structure | AI-written scripts | Semantic parsing |
| Structure → validation | _meta + vision | Quality assurance |

The real semantic risk is at the script-writing layer, not the PDF extraction layer. pdfplumber handles tabular DPD data correctly — the "wrong headers" problem is a table boundary issue where pdfplumber misidentifies the first data row as a header row.

This is fixable at the prompt level:

"When pdfplumber's table has a first row that looks like data rather than headers (numbers, not column names), look earlier in the extracted text for the actual column names."

Vision can also help here: rendering the page lets a model identify actual column headers visually, bypassing pdfplumber's layout inference entirely.
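The "first row looks like data rather than headers" test from the prompt can itself be sketched as a heuristic. The regex and the 50% cutoff here are illustrative choices, not the production logic:

```typescript
// Decide whether a pdfplumber table row looks like data (mostly numeric
// cells) rather than column names. If it does, the real headers should be
// searched for earlier in the extracted text, per the prompt fix above.
function looksLikeDataRow(cells: string[]): boolean {
  if (cells.length === 0) return false;
  const numericish = cells.filter((c) => /^[£$]?[\d.,]+%?$/.test(c.trim()));
  return numericish.length / cells.length > 0.5;
}
```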

The Fingerprinting Bug (Fixed)

An instructive example of structural self-healing working as intended: the fingerprintPage() function was using the first data row as the page fingerprint instead of the header row. This caused a 120-page DPD invoice to generate 120 unique section types instead of ~4–6, invoking the AI agent 120 times.

// BEFORE (broken) — first data row varies per page
for (let i = dataStart + 1; i < lines.length; i++) {  // +1 was the bug
  const t = lines[i].trim();
  if (t && !/^-+$/.test(t)) return t.slice(0, 80);
}

// AFTER (fixed) — header row is stable across all pages of same section type
// (COLUMN_HEADER_WORDS is a regex matching known column-header vocabulary)
for (const line of lines) {
  const t = line.trim();
  const matches = t.match(COLUMN_HEADER_WORDS);
  if (matches && matches.length >= 2) {
    return t.slice(0, 120).replace(/\s+/g, ' ');
  }
}

The fix reduced a 120-section run back to ~4–6 sections — the expected architecture.
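The fixed behaviour is easy to demonstrate: two pages of the same section type share a header row but differ in every data row, so they must produce the same fingerprint. The COLUMN_HEADER_WORDS vocabulary below is illustrative, not the production regex:

```typescript
// Runnable sketch of header-row fingerprinting: the fingerprint is the
// first line containing at least two known column-header words.
const COLUMN_HEADER_WORDS = /\b(Consignment|Service|Weight|Net|Amount|VAT|Date)\b/g;

function fingerprintPage(text: string): string | null {
  for (const line of text.split("\n")) {
    const t = line.trim();
    const matches = t.match(COLUMN_HEADER_WORDS);
    if (matches && matches.length >= 2) {
      return t.slice(0, 120).replace(/\s+/g, " ");
    }
  }
  return null; // no header row found on this page
}
```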


Recommended Hybrid: Arithmetic + Vision Spot Check

flowchart TD
    A([Run cached script]) --> B[Script outputs JSON\n+ _meta block]

    B --> C{_meta.confidence}
    C -->|failed| D[Immediate invalidate]
    C -->|low| E[Vision spot-check\nGemini via OpenRouter]
    C -->|ok| F{First run\nfor this pattern?}

    F -->|yes| E
    F -->|no| I[✓ Accept — skip vision]

    E --> G{Vision agrees?}
    G -->|yes| H[Cache vision result\nmark pattern as\nvision-validated]
    G -->|no| D

    D --> J[removePattern]
    J --> K[AI regeneration\nwith vision feedback\nin prompt]

    H --> I

    style E fill:#f3e5f5
    style I fill:#e8f5e9
    style D fill:#fce4ec
    style K fill:#e1f5fe

Cost Profile

| Event | Cost | Frequency |
| --- | --- | --- |
| Script runs on cache hit | Zero | Every run |
| _meta confidence check | Zero | Every run |
| Vision spot-check | ~1 Gemini call | First run per pattern only |
| AI regeneration | ~1 Claude call | Only on failure |

Once a pattern is vision-validated, it runs forever at zero AI cost until a format change triggers a new cache miss.
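As a back-of-envelope model of that table, AI calls scale with the number of distinct patterns (plus failures), not with invoice volume. The inputs here are hypothetical:

```typescript
// Rough cost model: vision calls are one per new pattern, regenerations are
// one per pattern plus one per failure, and BAU runs cost no AI calls.
function estimateAiCalls(opts: {
  invoices: number;
  distinctPatterns: number;
  regenerationFailures: number;
}): { visionCalls: number; regenerationCalls: number; bauRuns: number } {
  return {
    visionCalls: opts.distinctPatterns, // one spot-check per new pattern
    regenerationCalls: opts.distinctPatterns + opts.regenerationFailures,
    bauRuns: opts.invoices, // zero AI cost per run once cached
  };
}
```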

Current Test Results

Results below are from live test runs during the POC phase. Exact numbers will update as the pipeline stabilises.

Document Set

| Courier | Format | Count | Status |
| --- | --- | --- | --- |
| Evri | PDF | 3 | Tested |
| DPD | PDF | 4 | Tested (fingerprinting fix applied) |
| DPD | CSV | 1 | Tested |

Architecture Verdict

| Concern | Status |
| --- | --- |
| Structural correctness | Sound — _meta + invalidation works |
| Section deduplication | Fixed — fingerprinting now uses header row |
| Agent behaviour | Fixed — prompt enforces sequential tool use, no duplicate calls |
| Semantic correctness | Needs vision layer — arithmetic gate only fires on summary pages |
| Ongoing cost | Near-zero after first pattern cached |
| Format drift | Detected on next cache miss — triggers AI regeneration |

The current architecture is sound for structural correctness. The missing piece is a vision layer on first run to validate that the AI-written script correctly identifies the semantic meaning of each column — not just that the numbers add up.