# Invoice POC — Self-Healing Architecture & Semantic Validation
This document covers how the Invoice POC currently achieves self-correction, where the gaps are, and the proposed architecture to close the semantic validation gap.
## Current Architecture: What "Self-Healing" Actually Means Today
The pipeline generates extraction scripts once per courier format, caches them in a KV store, and reuses them deterministically across subsequent invoice runs. The AI agent is only invoked on a cache miss — when a new section pattern is encountered for the first time.
```mermaid
flowchart TD
A([Invoice PDF]) --> B[Extract text & tables\npdfExtract.ts]
B --> C[Fingerprint each section\nfingerprintPage]
C --> D{Pattern cached?}
D -->|Cache Hit| E[Run cached script]
D -->|Cache Miss| F[AI agent\nwrites new script]
F --> G[run_script]
G --> E
E --> H{_meta.confidence?}
H -->|ok| I[✓ Accept output]
H -->|low| J[Vision spot-check]
H -->|failed| K[Invalidate script\nremovePattern]
J --> L{Vision agrees?}
L -->|yes| I
L -->|no| K
K --> F
style F fill:#e1f5fe
style J fill:#f3e5f5
style I fill:#e8f5e9
style K fill:#fce4ec
```

### Script Lifecycle
- **First run** — AI agent calls `read_pdf_tables` / `read_pdf_text`, writes a TypeScript extraction script, and calls `run_script` to validate it
- **Script saved** — stored as `{courier}/{patternId}.ts` in the KV store alongside a detector function
- **Subsequent runs** — detector fires on each page; the matching script runs immediately with no AI call
- **Self-correction** — if the script self-reports failure (`_meta.confidence = 'failed'`) or the vision cross-check disagrees, `removePattern` deletes it and the next run triggers AI regeneration
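The lifecycle above boils down to a cache-or-generate lookup. The following is an illustrative sketch, not the POC's actual code: the in-memory `Map`, `resolveScript`, and the placeholder for the agent call all stand in for the real KV store and AI-agent interfaces.

```typescript
// Stand-in KV store; the real pipeline persists scripts externally.
const store = new Map<string, string>();

// Key convention mirrors the {courier}/{patternId}.ts layout described above.
function scriptKey(courier: string, patternId: string): string {
  return `${courier}/${patternId}.ts`;
}

function resolveScript(
  courier: string,
  patternId: string,
): { source: string; cacheHit: boolean } {
  const key = scriptKey(courier, patternId);
  const cached = store.get(key);
  if (cached !== undefined) {
    // Cache hit: deterministic replay, no AI call.
    return { source: cached, cacheHit: true };
  }
  // Cache miss: this is where the AI agent would write a new script.
  const generated = `/* AI-written script for ${key} */`;
  store.set(key, generated);
  return { source: generated, cacheHit: false };
}

// Self-correction hook: invalidate a pattern so the next run regenerates it.
function removePattern(courier: string, patternId: string): void {
  store.delete(scriptKey(courier, patternId));
}
```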
### The `_meta` Quality Contract
Scripts are required by the system prompt to self-report extraction quality via a `_meta` block:
```typescript
const _meta = {
rowCount: items.length,
confidence: items.length === 0 ? 'failed'
: items.length < 5 ? 'low'
: 'ok',
warnings: items.length < 5 ? ['Very few rows extracted — possible parsing issue'] : [],
};
return { lineItems: items, _meta };
```

This is zero-cost on every BAU run — the script itself decides whether to escalate.
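A minimal sketch of how the caller might route on the `_meta` contract; `actionFor` and the `Action` type are hypothetical names that simply mirror the flowchart's three branches:

```typescript
type Confidence = 'ok' | 'low' | 'failed';

interface Meta {
  rowCount: number;
  confidence: Confidence;
  warnings: string[];
}

type Action = 'accept' | 'vision-check' | 'invalidate';

function actionFor(meta: Meta): Action {
  switch (meta.confidence) {
    case 'ok':
      return 'accept'; // zero-cost path: no AI call
    case 'low':
      return 'vision-check'; // escalate to a vision spot-check
    case 'failed':
      return 'invalidate'; // removePattern + AI regeneration on next run
  }
}
```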
## The Semantic Gap: What We Cannot Catch Today
The self-healing above addresses structural failures — scripts that extract nothing, crash, or produce obviously empty output. It does not address the harder class of failure: semantically wrong extractions.
```mermaid
flowchart LR
subgraph AG["Arithmetic Gate catches"]
A1[Script returns empty array]
A2[Script crashes / syntax error]
A3[Total mismatch > 2%]
end
subgraph SG["Semantic Gap — missed by arithmetic"]
B1[Wrong field: VAT instead of net amount]
B2[Missing rows: 12 extracted, 47 exist]
B3[Wrong column: description in amount field]
B4[Coincidental total match on partial data]
B5[Misidentified service type]
B6[Missing surcharge section entirely]
end
style B1 fill:#fce4ec
style B2 fill:#fce4ec
style B3 fill:#fce4ec
style B4 fill:#fce4ec
style B5 fill:#fce4ec
style B6 fill:#fce4ec
style A1 fill:#e8f5e9
style A2 fill:#e8f5e9
style A3 fill:#e8f5e9
```

### The Coincidental Match Problem
Consider a DPD invoice with 47 consignments. An extraction script that only processes the first page of the consignment table would return 12 items. If those 12 items happen to sum to a subtotal visible elsewhere in the document, the arithmetic gate passes — and the error is silently cached.
The arithmetic gate is only valid when `extractContentTotal` finds a total line in the chunk. Most consignment pages contain no "Total: £X" line — only the summary/overview page does — so `extractContentTotal` returns null and the check silently passes regardless of item count.
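The gap can be made concrete with a simplified stand-in for `extractContentTotal`; the regex and the 2% tolerance below are assumptions for illustration, not the real implementation:

```typescript
// Simplified stand-in: returns null whenever the chunk has no "Total: £X" line.
function extractContentTotal(chunkText: string): number | null {
  const match = chunkText.match(/Total:\s*£([\d,]+\.\d{2})/);
  return match ? parseFloat(match[1].replace(/,/g, '')) : null;
}

function arithmeticGatePasses(chunkText: string, extractedSum: number): boolean {
  const total = extractContentTotal(chunkText);
  // No total line on the page → nothing to compare → the gate passes
  // regardless of how many items were extracted. This is the gap.
  if (total === null) return true;
  return Math.abs(extractedSum - total) / total <= 0.02; // 2% tolerance
}
```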
### Example Semantic Failures
| Failure | Arithmetic detects? | `_meta` detects? | Vision detects? |
|---|---|---|---|
| Script extracts VAT not net amount | Only if totals differ | No — rowCount looks fine | Yes — column header mismatch |
| Missing 35 of 47 consignments | Only on summary page | Yes — rowCount unexpectedly low | Yes — row count mismatch |
| Wrong service type mapped | Never | No | Possibly — if label differs |
| Entire surcharge section skipped | If total includes surcharge | Yes — if script marks low | Yes — section not in output |
| Column shifted one right | Only if totals differ | No | Yes — values in wrong fields |
## Three Approaches to Semantic Self-Correction
### Approach A: Vision Model Cross-Check
After a script runs, send a rendered page screenshot to a vision model (Claude vision, available via the `read_pdf_vision` tool) and ask it to compare what it sees against the extracted JSON.
```mermaid
sequenceDiagram
participant S as Cached Script
participant V as Vision Model
participant KV as Pattern Store
S->>S: Extract → JSON output
S->>V: PDF page image + extracted JSON
V->>V: Compare row count, column headers,\nvisible totals vs extracted values
V-->>S: { ok: true/false, feedback: "..." }
alt Vision agrees
S->>KV: Cache validation result
Note over KV: Subsequent runs skip vision
else Vision flags issue
S->>KV: removePattern
Note over KV: Next run triggers AI regeneration\nwith vision feedback in prompt
end
```

**Pros:** Catches column misidentification, missing sections, wrong item counts.
**Cons:** Vision models can hallucinate; adds ~1s per new section (one-time cost); requires an API key.
**Cost model:** One vision call per new pattern, cached. Zero cost on repeat runs.
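A sketch of what the cross-check exchange might look like, assuming a generic prompt-plus-JSON-reply interface; `buildVisionPrompt` and `parseVerdict` are hypothetical helpers, not the actual `read_pdf_vision` API:

```typescript
interface VisionVerdict {
  ok: boolean;
  feedback: string;
}

// Assemble the comparison request sent alongside the rendered page image.
function buildVisionPrompt(extractedJson: string): string {
  return [
    'Compare the attached invoice page image against this extracted JSON:',
    extractedJson,
    'Check: row count, column headers vs field names, visible totals.',
    'Reply with JSON only: { "ok": boolean, "feedback": string }',
  ].join('\n');
}

function parseVerdict(modelReply: string): VisionVerdict {
  try {
    const parsed = JSON.parse(modelReply);
    return { ok: parsed.ok === true, feedback: String(parsed.feedback ?? '') };
  } catch {
    // Fail closed: an unparseable reply is treated as a flagged issue,
    // which feeds back into removePattern + regeneration.
    return { ok: false, feedback: 'unparseable vision reply' };
  }
}
```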
### Approach B: Schema Confidence Scoring
After extraction, run a second cheap LLM call asking it to score confidence on each critical field.
```text
Rate 0–100: how confident are you that lineItems[0].netAmount is the
correct net charge (excluding VAT, excluding fuel surcharge)?
Explain any ambiguity in the column structure.
```

- Scores stored alongside the cached script in the KV store
- Critical fields below threshold trigger regeneration on next run
- Model: Haiku (fast, cheap) — not Sonnet
**Pros:** Catches field-level semantic ambiguity; explicit reasoning helps prompt improvement.
**Cons:** An LLM scoring its own output has limited independence; it can miss visual layout cues.
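A sketch of how per-field scores could gate regeneration; the `FieldScore` shape and the threshold of 70 are assumptions, not values from the POC:

```typescript
interface FieldScore {
  field: string;
  score: number; // 0–100 from the cheap scoring call
}

// Return the fields whose confidence falls below the threshold;
// any non-empty result would trigger regeneration on the next run.
function fieldsNeedingRegeneration(
  scores: FieldScore[],
  threshold = 70,
): string[] {
  return scores.filter((s) => s.score < threshold).map((s) => s.field);
}
```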
### Approach C: Ground Truth Seeding (Best Long-Term)
For each courier, provide one "known good" invoice with manually verified line items as a seed document. The AI is told the expected output upfront.
```mermaid
flowchart TD
A[Seed invoice\nmanually verified] --> B[Stored as\ncourier/seed.json]
B --> C{First-run prompt}
C --> D["Write a script that produces\nthis output for this invoice:\n{seed data}"]
D --> E[AI writes anchored script]
E --> F[Compare extracted totals\nper service type vs seed pattern]
F --> G{Matches seed pattern?}
G -->|yes| H[Accept & cache]
G -->|no| I[Flag for review]
style A fill:#fff3e0
style B fill:#fff3e0
style H fill:#e8f5e9
style I fill:#fce4ec
```

**Pros:** Encodes business domain knowledge — what "net amount" means for DPD specifically, what "fuel surcharge" looks like for Evri; most accurate long-term.
**Cons:** Requires one-time manual verification per courier format; the seed may drift if the format changes.
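The seed-comparison step might look like the following sketch; `matchesSeed`, the `ServiceTotals` shape, and the tolerance value are all illustrative:

```typescript
type ServiceTotals = Record<string, number>;

// Compare per-service-type totals from a fresh extraction against the
// manually verified seed. The tolerance is an illustrative choice.
function matchesSeed(
  extracted: ServiceTotals,
  seed: ServiceTotals,
  tolerance = 0.005,
): boolean {
  const seedServices = Object.keys(seed);
  // A service type in the seed but missing from the extraction suggests a
  // skipped section (e.g. a missing fuel-surcharge block).
  if (Object.keys(extracted).length !== seedServices.length) return false;
  return seedServices.every((service) => {
    const value = extracted[service];
    if (value === undefined) return false;
    return Math.abs(value - seed[service]) <= Math.abs(seed[service]) * tolerance;
  });
}
```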
## What Actually Matters: Extraction Quality vs Document Parsing
The pipeline separates concerns well:
| Layer | Tool | Responsibility |
|---|---|---|
| PDF → text/tables | `pdfExtract.ts` + pdfplumber | Raw content extraction |
| Text → structure | AI-written scripts | Semantic parsing |
| Structure → validation | `_meta` + vision | Quality assurance |
The real semantic risk is at the script-writing layer, not the PDF extraction layer. pdfplumber handles tabular DPD data correctly — the "wrong headers" problem is a table boundary issue where pdfplumber misidentifies the first data row as a header row.
This is fixable at the prompt level:
"When pdfplumber's table has a first row that looks like data rather than headers (numbers, not column names), look earlier in the extracted text for the actual column names."
Vision can also help here: rendering the page lets a model identify actual column headers visually, bypassing pdfplumber's layout inference entirely.
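The heuristic behind that prompt fix can be expressed directly; `looksLikeDataRow` and its 50% numeric threshold are illustrative, not the POC's code:

```typescript
// Decide whether pdfplumber's first table row is probably data rather than
// headers: a row of mostly numbers or currency amounts is data, so the real
// column names should be sought earlier in the extracted text.
function looksLikeDataRow(cells: string[]): boolean {
  if (cells.length === 0) return false;
  const numericish = cells.filter((c) =>
    /^[£$€]?-?[\d,]+(\.\d+)?$/.test(c.trim()),
  );
  return numericish.length / cells.length > 0.5;
}
```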
## The Fingerprinting Bug (Fixed)
An instructive example of structural self-healing working as intended: the `fingerprintPage()` function was using the first data row as the page fingerprint instead of the header row. This caused a 120-page DPD invoice to generate 120 unique section types instead of ~4–6, invoking the AI agent 120 times.
```typescript
// BEFORE (broken) — first data row varies per page
for (let i = dataStart + 1; i < lines.length; i++) { // +1 was the bug
const t = lines[i].trim();
if (t && !/^-+$/.test(t)) return t.slice(0, 80);
}
// AFTER (fixed) — header row is stable across all pages of same section type
for (const line of lines) {
const t = line.trim();
const matches = t.match(COLUMN_HEADER_WORDS);
if (matches && matches.length >= 2) {
return t.slice(0, 120).replace(/\s+/g, ' ');
}
}
```

The fix reduced a 120-section run back to ~4–6 sections — the expected architecture.
## Recommended Hybrid: Arithmetic + Vision Spot Check
```mermaid
flowchart TD
A([Run cached script]) --> B[Script outputs JSON\n+ _meta block]
B --> C{_meta.confidence}
C -->|failed| D[Immediate invalidate]
C -->|low| E[Vision spot-check\nGemini via OpenRouter]
C -->|ok| F{First run\nfor this pattern?}
F -->|yes| E
F -->|no| I[✓ Accept — skip vision]
E --> G{Vision agrees?}
G -->|yes| H[Cache vision result\nmark pattern as\nvision-validated]
G -->|no| D
D --> J[removePattern]
J --> K[AI regeneration\nwith vision feedback\nin prompt]
H --> I
style E fill:#f3e5f5
style I fill:#e8f5e9
style D fill:#fce4ec
style K fill:#e1f5fe
```

### Cost Profile
| Event | Cost | Frequency |
|---|---|---|
| Script runs on cache hit | Zero | Every run |
| `_meta` confidence check | Zero | Every run |
| Vision spot-check | ~1 Gemini call | First run per pattern only |
| AI regeneration | ~1 Claude call | Only on failure |
Once a pattern is vision-validated, it runs forever at zero AI cost until a format change triggers a new cache miss.
## What This Gives You
- **Structural self-healing** (arithmetic / `_meta` confidence) — catches empty output, crashes, row-count anomalies
- **Semantic self-healing** (vision spot-check) — catches wrong columns, missing sections, misidentified headers
- **Near-zero ongoing cost** — AI is only invoked on first run or on detected failure
## Current Test Results
Results below are from live test runs during the POC phase. Exact numbers will update as the pipeline stabilises.
### Document Set
| Courier | Format | Count | Status |
|---|---|---|---|
| Evri | PDF | 3 | Tested |
| DPD | PDF | 4 | Tested (fingerprinting fix applied) |
| DPD | CSV | 1 | Tested |
### Architecture Verdict
| Concern | Status |
|---|---|
| Structural correctness | Sound — `_meta` + invalidation works |
| Section deduplication | Fixed — fingerprinting now uses header row |
| Agent behaviour | Fixed — prompt enforces sequential tool use, no duplicate calls |
| Semantic correctness | Needs vision layer — arithmetic gate only fires on summary pages |
| Ongoing cost | Near-zero after first pattern cached |
| Format drift | Detected on next cache miss — triggers AI regeneration |
The current architecture is sound for structural correctness. The missing piece is a vision layer on first run to validate that the AI-written script correctly identifies the semantic meaning of each column — not just that the numbers add up.