
ETL Pipeline — Bug Investigation Report

By Patrick McCurley · Created Mar 17, 2026

Date: 2026-03-17 | Context: Full test run of 12 invoices uncovered 3 bugs, 2 critical

The test run (see the full results) processed all 12 invoices end to end, but two produced obviously wrong totals. This document explains what went wrong, why, and how to fix it.

How the Pipeline Works (Quick Recap)

Before diving into the bugs, two pieces of pipeline context matter: the fan-out ensures duplicate tables are all processed, and the cache lets future invoices skip the LLM call entirely when the table structure matches a previously seen invoice.

Both of these mechanisms have bugs.


Bug 1: Fuzzy Recovery Double-Counts Duplicate Tables

Affected invoice: DPD 116154.I16806945.pdf (120 pages, 125 tables)
Symptom: Extracted £35,183 vs invoice total £29,579 — overcounting by £5,604 (19%)

What should happen

Each distinct table structure should be extracted exactly once: the fan-out already handles duplicate tables, so fuzzy recovery should only pick up tables whose structure the fan-out did not cover.

What actually happens

Fuzzy recovery ignores what the fan-out already covered. It re-extracts uncovered tables even when their structure duplicates one that was already processed, so those line items are counted twice, inflating the total by £5,604.

The fix

Before fuzzy recovery runs, build a set of header keys that are already covered. Skip any uncovered table whose header key is already in that set. This is a 5-line change:

// Build set of header keys already covered by fan-out
const coveredHeaderKeys = new Set<string>();
for (const idx of itemsByTable.keys()) {
  const t = tables.find(t => t.index === idx);
  if (t) coveredHeaderKeys.add(headerKey(t));
}

// Filter uncovered tables: skip duplicates of covered structures
// (coveredIndices is the existing set of table indices handled by fan-out)
const uncoveredTables = tables.filter(t =>
  !coveredIndices.has(t.index) && !coveredHeaderKeys.has(headerKey(t))
);
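As a self-contained illustration of the filter's behaviour (toy types; this `headerKey` and `filterUncovered` are stand-ins for the real helpers, not the code in parse.routes.ts):

```typescript
// Toy model of the dedupe filter. Types and names are illustrative stand-ins.
type Table = { index: number; headers: string[] };

// Normalised structural key: lowercase headers, sorted, pipe-joined.
const headerKey = (t: Table): string =>
  t.headers.map(h => h.toLowerCase()).sort().join('|');

function filterUncovered(tables: Table[], coveredIndices: Set<number>): Table[] {
  // Header keys of tables already handled by fan-out
  const coveredHeaderKeys = new Set(
    tables.filter(t => coveredIndices.has(t.index)).map(headerKey)
  );
  // Keep only tables that are uncovered AND structurally new
  return tables.filter(
    t => !coveredIndices.has(t.index) && !coveredHeaderKeys.has(headerKey(t))
  );
}

const tables: Table[] = [
  { index: 0, headers: ["Date", "Price"] },  // covered by fan-out
  { index: 1, headers: ["date", "price"] },  // duplicate structure: skipped
  { index: 2, headers: ["Ref", "Weight"] },  // genuinely new: recovered
];
const recovered = filterUncovered(tables, new Set([0]));
// recovered contains only the table at index 2
```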

Expected result: DPD 116154 goes from 119% → ~100% accuracy.


Bug 2: Cached Mapper Produces Nonsense Total With No Safety Net

Affected invoice: DPD 451806.I16806375.pdf (49 pages, 55 tables)
Symptom: Cache hit from a different DPD invoice's mapper produced £16.8 million — obviously wrong for a courier invoice

How caching works

When the LLM generates mapping rules for an invoice, those rules are saved (cached) with a fingerprint — a hash of the table column headers. Next time an invoice arrives with the same fingerprint, the cached rules are applied without calling the LLM.
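As a sketch of this lookup flow (the names `mapperCache`, `MappingRules`, and `getOrGenerateMapper` are illustrative, not the real identifiers):

```typescript
// Hypothetical sketch of the fingerprint-keyed mapper cache.
type MappingRules = { columnMap: Record<string, string> };

const mapperCache = new Map<string, MappingRules>();

function getOrGenerateMapper(
  fp: string,                         // fingerprint of the table headers
  generateWithLLM: () => MappingRules // expensive LLM call
): { rules: MappingRules; cacheHit: boolean } {
  const cached = mapperCache.get(fp);
  if (cached) return { rules: cached, cacheHit: true }; // skip the LLM entirely
  const rules = generateWithLLM();
  mapperCache.set(fp, rules);                           // save for next time
  return { rules, cacheHit: false };
}
```

The second invoice with the same fingerprint never reaches the LLM, which is exactly why a fingerprint collision is dangerous.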

Why the fingerprint matched

The fingerprint is computed from column headers only:

headers: ["Date", "Tracking", "Service", "Price", ...]
         → lowercase, sort, join with |
         → SHA-256 hash → first 16 chars

Both DPD invoices have the same column names (they're from the same courier system) — so they produce identical fingerprints despite having completely different data and even different numbers of tables (125 vs 55).
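A minimal demonstration of the collision, using the header-only scheme described above (headers abbreviated to the four shown; `headerFingerprint` is a hypothetical name):

```typescript
import crypto from 'crypto';

// Header-only fingerprint as described: lowercase, sort, join with |,
// SHA-256, first 16 hex chars.
function headerFingerprint(headers: string[]): string {
  const joined = headers.map(h => h.toLowerCase().trim()).sort().join('|');
  return crypto.createHash('sha256').update(joined).digest('hex').slice(0, 16);
}

// Two DPD invoices: same column names, completely different data.
const invoiceA = ["Date", "Tracking", "Service", "Price"];
const invoiceB = ["Price", "Service", "Tracking", "Date"]; // order is irrelevant after sorting

// Identical fingerprints, so the second invoice silently reuses the
// first invoice's cached mapper.
```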

Why validation didn't catch it

When a cached mapper is applied, validateMappingResult() runs 4 checks:

Check | What it validates | Would it catch £16.8M?
Items exist | At least 1 line item produced | No — 1,995 items were produced
Row coverage | >20% of input rows mapped | No — mapping worked structurally
Descriptions | >80% of items have descriptions | No — descriptions were populated
Non-zero amounts | At least 1 item has an amount | No — amounts were present

The critical gap: There is no check on whether the total amount is reasonable. The mapper mapped the wrong column to totalAmount (a column containing large reference numbers instead of prices), producing £16.8M. Since items existed and had descriptions, all checks passed.
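A sketch of why all four checks pass, with illustrative shapes rather than the real validateMappingResult signature:

```typescript
// Illustrative model of the four existing checks. Note that nothing
// bounds the summed total.
type LineItem = { description?: string; totalAmount?: number };

function passesCurrentChecks(
  items: LineItem[],
  inputRows: number,
  mappedRows: number
): boolean {
  const itemsExist = items.length >= 1;
  const rowCoverage = mappedRows / inputRows > 0.2;
  const withDesc =
    items.filter(i => i.description).length / items.length > 0.8;
  const nonZero = items.some(i => (i.totalAmount ?? 0) > 0);
  return itemsExist && rowCoverage && withDesc && nonZero;
}

// 1,995 items whose "amounts" are actually large reference numbers,
// not prices, still sail through every check.
const bogus: LineItem[] = Array.from({ length: 1995 }, (_, i) => ({
  description: `parcel ${i}`,
  totalAmount: 8421, // mis-mapped reference number
}));
```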

The fix (two parts)

Part A — Sanity check on cached results:

Add a total amount ceiling to validateMappingResult(). A courier invoice exceeding £500K is almost certainly wrong:

const totalAmount = allItems.reduce((s, i) => s + (Number(i.totalAmount) || 0), 0);
if (totalAmount > 500_000) {
  return { valid: false, reason: `Total £${totalAmount.toFixed(0)} exceeds sanity threshold` };
}

Part B — Include row counts in fingerprint:

Make the fingerprint more discriminating so 125-table and 55-table invoices don't collide:

import crypto from 'crypto';  // Node builtin; assumed already imported in parse.routes.ts

function fingerprint(tables: DoclingTable[]): string {
  const parts = tables.map(t => {
    const headers = t.headers.map(h => h.toLowerCase().trim()).sort().join('|');
    return `${headers}::${t.numRows}`;  // include row count
  }).sort();
  return crypto.createHash('sha256').update(parts.join(';;')).digest('hex').slice(0, 16);
}
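A quick check of the effect, with DoclingTable abbreviated to the two fields the function reads (table shapes below are invented for illustration):

```typescript
import crypto from 'crypto';

type DoclingTable = { headers: string[]; numRows: number };

// Row-count-aware fingerprint from Part B above.
function fingerprint(tables: DoclingTable[]): string {
  const parts = tables.map(t => {
    const headers = t.headers.map(h => h.toLowerCase().trim()).sort().join('|');
    return `${headers}::${t.numRows}`;  // include row count
  }).sort();
  return crypto.createHash('sha256').update(parts.join(';;')).digest('hex').slice(0, 16);
}

// Same column names, different shapes: fingerprints no longer collide.
const big = Array.from({ length: 125 }, () => ({
  headers: ["Date", "Price"], numRows: 40,
}));
const small = Array.from({ length: 55 }, () => ({
  headers: ["Date", "Price"], numRows: 12,
}));
```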

Note: changing the fingerprint function invalidates all existing cached mappers. This is fine — they'll be regenerated on next use.


How Both Bugs Interact

These two bugs can compound. Consider what happens if DPD 451806 is processed first (before 116154): 451806 generates and caches its own mapper; when 116154 later arrives, the fingerprint collision hands it 451806's cached mapper, no total-amount check exists to catch a bad mapping, and fuzzy recovery then double-counts 116154's duplicate tables on top of whatever that mapper produces.

This is why both bugs need to be fixed together — fixing just one still leaves the system vulnerable.


Fix Priority

# | Bug | Severity | Fix Effort | Impact
1 | Fuzzy recovery double-counts duplicates | CRITICAL | 5 lines | DPD 116154: 119% → ~100%
2 | No total amount sanity check on cache | CRITICAL | 4 lines | Blocks £16.8M false positives
3 | Fingerprint is header-only (collisions) | HIGH | 3 lines (invalidates cache) | Prevents wrong mapper being used at all

All three fixes are in the same file (parse.routes.ts) and total ~12 lines of code.