ETL Pipeline — Bug Investigation Report
Date: 2026-03-17 | Context: Full test run of 12 invoices uncovered 3 bugs, 2 critical
The test run (see full results) completed on all 12 invoices without errors, but two of them produced obviously wrong totals. This document explains what went wrong, why, and how to fix it.
How the Pipeline Works (Quick Recap)
Before diving into the bugs, two pipeline mechanisms are relevant: the fan-out, which ensures that duplicate tables are all processed, and the mapper cache, which lets future invoices skip the LLM call entirely when the table structure matches a previous one.
Both of these mechanisms have bugs.
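As a rough sketch of how these two mechanisms fit together (all names here are illustrative, not the real pipeline code): tables are grouped by a normalized header key, one LLM call is planned per structure not already cached, and the resulting mapper fans out to every duplicate of that structure.

```typescript
// Hypothetical sketch of the fan-out + cache dispatch described above.
type Table = { index: number; headers: string[] };

// Normalized header key: lowercase, sort, join (assumed convention).
const headerKey = (t: Table) =>
  t.headers.map(h => h.toLowerCase().trim()).sort().join('|');

function planLlmCalls(
  tables: Table[],
  cache: Set<string>, // header keys with a cached mapper
): { llmCalls: string[]; fanOut: Map<string, number[]> } {
  // Group table indices by structure so duplicates share one mapper.
  const fanOut = new Map<string, number[]>();
  for (const t of tables) {
    const key = headerKey(t);
    const group = fanOut.get(key) ?? [];
    group.push(t.index);
    fanOut.set(key, group);
  }
  // One LLM call per structure that is not already cached.
  const llmCalls = [...fanOut.keys()].filter(k => !cache.has(k));
  return { llmCalls, fanOut };
}
```

With an empty cache, two tables sharing a structure cost one LLM call; with that structure cached, they cost zero.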
Bug 1: Fuzzy Recovery Double-Counts Duplicate Tables
Affected invoice: DPD 116154.I16806945.pdf (120 pages, 125 tables)
Symptom: Extracted £35,183 vs invoice total £29,579 — overcounting by £5,604 (19%)
What should happen
Each table structure should be counted once: the fan-out handles duplicates of structures the LLM has already mapped, and fuzzy recovery should only pick up tables with genuinely new structures.
What actually happens
Fuzzy recovery selects uncovered tables by index alone, without checking whether their header structure was already handled by the fan-out. Duplicates of covered tables are therefore extracted a second time, and their amounts are added to the total twice.
The fix
Before fuzzy recovery runs, build a set of header keys that are already covered. Skip any uncovered table whose header key is already in that set. This is a 5-line change:
```typescript
// Build set of header keys already covered by fan-out
// (coveredIndices, the set of table indices handled by fan-out, is assumed
// to already exist in scope at this point in the function)
const coveredHeaderKeys = new Set<string>();
for (const idx of itemsByTable.keys()) {
  const t = tables.find(t => t.index === idx);
  if (t) coveredHeaderKeys.add(headerKey(t));
}

// Filter uncovered tables: skip duplicates of covered structures
const uncoveredTables = tables.filter(t =>
  !coveredIndices.has(t.index) && !coveredHeaderKeys.has(headerKey(t))
);
```
Expected result: DPD 116154 goes from 119% → ~100% accuracy.
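To see the filter in action, here is a self-contained toy run (the table shape, headerKey, and data are simplified stand-ins for the real types):

```typescript
type Table = { index: number; headers: string[] };
// Simplified stand-in for the pipeline's headerKey helper.
const headerKey = (t: Table) =>
  t.headers.map(h => h.toLowerCase().trim()).sort().join('|');

const tables: Table[] = [
  { index: 0, headers: ['Date', 'Tracking', 'Price'] }, // covered by fan-out
  { index: 1, headers: ['Date', 'Tracking', 'Price'] }, // duplicate structure, uncovered
  { index: 2, headers: ['Surcharge', 'Amount'] },       // genuinely new structure
];
// Fan-out produced items for table 0 only.
const itemsByTable = new Map<number, unknown[]>([[0, []]]);
const coveredIndices = new Set(itemsByTable.keys());

const coveredHeaderKeys = new Set<string>();
for (const idx of itemsByTable.keys()) {
  const t = tables.find(t => t.index === idx);
  if (t) coveredHeaderKeys.add(headerKey(t));
}
const uncoveredTables = tables.filter(t =>
  !coveredIndices.has(t.index) && !coveredHeaderKeys.has(headerKey(t))
);
// Table 1 is a duplicate of a covered structure, so it is skipped;
// only table 2 reaches fuzzy recovery.
```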
Bug 2: Cached Mapper Produces Nonsense Total With No Safety Net
Affected invoice: DPD 451806.I16806375.pdf (49 pages, 55 tables)
Symptom: Cache hit from a different DPD invoice's mapper produced £16.8 million — obviously wrong for a courier invoice
How caching works
When the LLM generates mapping rules for an invoice, those rules are saved (cached) with a fingerprint — a hash of the table column headers. Next time an invoice arrives with the same fingerprint, the cached rules are applied without calling the LLM.
Why the fingerprint matched
The fingerprint is computed from column headers only:
```
headers: ["Date", "Tracking", "Service", "Price", ...]
→ lowercase, sort, join with |
→ SHA-256 hash → first 16 chars
```
Both DPD invoices have the same column names (they're from the same courier system), so they produce identical fingerprints despite having completely different data and even different numbers of tables (125 vs 55).
Why validation didn't catch it
When a cached mapper is applied, validateMappingResult() runs 4 checks:
| Check | What it validates | Would it catch £16.8M? |
|---|---|---|
| Items exist | At least 1 line item produced | No — 1,995 items were produced |
| Row coverage | >20% of input rows mapped | No — mapping worked structurally |
| Descriptions | >80% of items have descriptions | No — descriptions were populated |
| Non-zero amounts | At least 1 item has an amount | No — amounts were present |
The critical gap: There is no check on whether the total amount is reasonable. The mapper mapped the wrong column to totalAmount (a column containing large reference numbers instead of prices), producing £16.8M. Since items existed and had descriptions, all checks passed.
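The gap can be demonstrated with toy data (simplified versions of the checks; the item shape and magnitudes are illustrative, chosen so the total lands near £16.8M):

```typescript
type Item = { description: string; totalAmount: number };

// Toy line items where a reference-number column was mapped to totalAmount.
const items: Item[] = Array.from({ length: 1995 }, (_, i) => ({
  description: `Parcel ${i}`,
  totalAmount: 8_400, // reference-like magnitude, not a price
}));

// Simplified versions of the existing checks. All pass on this nonsense
// (row coverage, not modeled here, also passes because the mapping is
// structurally fine):
const itemsExist = items.length > 0;                                               // true
const descriptionsOk = items.filter(i => i.description).length / items.length > 0.8; // true
const nonZeroOk = items.some(i => i.totalAmount > 0);                              // true

// The missing check: is the grand total plausible for a courier invoice?
const total = items.reduce((s, i) => s + i.totalAmount, 0); // £16,758,000
const totalSane = total <= 500_000; // false: this is what Part A of the fix adds
```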
The fix (two parts)
Part A — Sanity check on cached results:
Add a total amount ceiling to validateMappingResult(). A courier invoice exceeding £500K is almost certainly wrong:
```typescript
const totalAmount = allItems.reduce((s, i) => s + (Number(i.totalAmount) || 0), 0);
if (totalAmount > 500_000) {
  return { valid: false, reason: `Total £${totalAmount.toFixed(0)} exceeds sanity threshold` };
}
```
Part B — Include row counts in fingerprint:
Make the fingerprint more discriminating so 125-table and 55-table invoices don't collide:
```typescript
function fingerprint(tables: DoclingTable[]): string {
  const parts = tables.map(t => {
    const headers = t.headers.map(h => h.toLowerCase().trim()).sort().join('|');
    return `${headers}::${t.numRows}`; // include row count
  }).sort();
  return crypto.createHash('sha256').update(parts.join(';;')).digest('hex').slice(0, 16);
}
```
Note: changing the fingerprint function invalidates all existing cached mappers. This is fine — they'll be regenerated on next use.
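With row counts folded in, the same header set no longer collides across invoices of different shapes. A self-contained check (the function body mirrors Part B; the table data is illustrative):

```typescript
import { createHash } from 'crypto';

type DoclingTable = { headers: string[]; numRows: number };

// Same logic as Part B, reproduced here so the example runs standalone.
function fingerprint(tables: DoclingTable[]): string {
  const parts = tables.map(t => {
    const headers = t.headers.map(h => h.toLowerCase().trim()).sort().join('|');
    return `${headers}::${t.numRows}`; // include row count
  }).sort();
  return createHash('sha256').update(parts.join(';;')).digest('hex').slice(0, 16);
}

const headers = ['Date', 'Tracking', 'Service', 'Price'];
// Same columns, but different table counts and row counts per invoice.
const invoiceA = Array.from({ length: 3 }, () => ({ headers, numRows: 40 }));
const invoiceB = Array.from({ length: 2 }, () => ({ headers, numRows: 25 }));
// Different shapes now yield different fingerprints, so no cache collision.
```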
How Both Bugs Interact
These two bugs can compound. Consider what happens if DPD 451806 is processed first (before 116154): 451806's mapper is cached under the shared header-only fingerprint, so when 116154 arrives it gets a cache hit and reuses a mapper generated for a different invoice (Bug 2). Any tables that mapper fails to cover then fall through to fuzzy recovery, which double-counts duplicate structures on top (Bug 1).
This is why both bugs need to be fixed together — fixing just one still leaves the system vulnerable.
Fix Priority
| # | Bug | Severity | Fix Effort | Impact |
|---|---|---|---|---|
| 1 | Fuzzy recovery double-counts duplicates | CRITICAL | 5 lines | DPD 116154: 119% → ~100% |
| 2 | No total amount sanity check on cache | CRITICAL | 4 lines | Blocks £16.8M false positives |
| 3 | Fingerprint is header-only (collisions) | HIGH | 3 lines (invalidates cache) | Prevents wrong mapper being used at all |
All three fixes are in the same file (parse.routes.ts) and total ~12 lines of code.