
Invoice Processing Pipeline — How We Get to 100% Accuracy

By Patrick McCurley · Created Mar 20, 2026 · Updated Mar 22, 2026

Courier invoices arrive in many formats — from clean spreadsheets to dense, multi-page PDFs with merged cells and multi-row records. The pipeline tries the cheapest approach first and only escalates when accuracy is measurably insufficient.

End-to-End Flow

Extract

Spreadsheets (CSV, Excel) are parsed natively. Wide CSVs (250+ columns, as in UPS exports) have their empty columns stripped automatically before the data is sent to the LLM.
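
As a rough illustration of the empty-column stripping, here is a minimal Python sketch. The function names (`strip_empty_columns`, `parse_csv`) and the sample data are hypothetical, and treating whitespace-only cells as empty is an assumption rather than something the pipeline is known to do.

```python
import csv
import io

def strip_empty_columns(headers, rows):
    """Drop every column whose cells are all empty or whitespace-only."""
    keep = [
        i for i in range(len(headers))
        if any(i < len(row) and row[i].strip() for row in rows)
    ]
    new_headers = [headers[i] for i in keep]
    # Short rows are padded with "" so every output row has the same width.
    new_rows = [[row[i] if i < len(row) else "" for i in keep] for row in rows]
    return new_headers, new_rows

def parse_csv(text):
    """Parse CSV text into (headers, rows) with empty columns removed."""
    reader = csv.reader(io.StringIO(text))
    headers = next(reader)
    rows = list(reader)
    return strip_empty_columns(headers, rows)

# Illustrative data only: the middle column never carries a value.
sample = "Tracking,Unused,Net Amount\n1Z001,,4.20\n1Z002, ,5.10\n"
headers, rows = parse_csv(sample)
```

On a 250+ column export, a pass like this shrinks the table before any tokens are spent on it.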

PDFs go through Docling (IBM open-source) in 10-20 page chunks. Docling's memory use accumulates per document (~100-600MB) with no way to reclaim it, so the pipeline monitors RSS via a health endpoint and proactively restarts Docling when it exceeds 3GB. The check runs before every chunk and every page render.
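
The chunk-and-check loop might look roughly like the sketch below. It is not the pipeline's actual code: `get_rss`, `restart`, and `extract_chunk` are hypothetical callables standing in for the health-endpoint query, the Docling restart, and the per-chunk extraction, and reading "3GB" as 3 GiB is an assumption.

```python
RSS_LIMIT_BYTES = 3 * 1024**3  # "3GB" from the text, assumed here to mean 3 GiB

def chunk_pages(page_count, chunk_size=10):
    """Yield page ranges of at most chunk_size pages."""
    for start in range(0, page_count, chunk_size):
        yield range(start, min(start + chunk_size, page_count))

def process_pdf(page_count, get_rss, restart, extract_chunk, chunk_size=10):
    """Extract tables chunk by chunk, restarting the extractor when RSS is too high.

    get_rss:       returns the extractor's current RSS in bytes (hypothetical)
    restart:       restarts the extractor process (hypothetical)
    extract_chunk: takes a range of page indices, returns a list of tables
    """
    tables = []
    for pages in chunk_pages(page_count, chunk_size):
        # Check memory before every chunk, as described above.
        if get_rss() > RSS_LIMIT_BYTES:
            restart()
        tables.extend(extract_chunk(pages))
    return tables
```

Passing the three operations in as callables keeps the loop testable without a running Docling instance.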

Both produce identical output: a list of tables with column headers and rows.

Tier 1: Text Mapping

The LLM receives column headers, sample rows (8 for large tables, all rows for small tables ≤20 rows), and a column activity summary. It returns:
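
A minimal sketch of how the prompt inputs could be prepared is shown below. The source does not say which 8 rows are sampled or how the column activity summary is computed, so taking the first 8 rows and reporting the fraction of non-empty cells per column are both assumptions, and the function names are hypothetical.

```python
def select_sample_rows(rows, small_table_max=20, large_sample=8):
    """Return all rows for small tables, otherwise a fixed-size sample."""
    if len(rows) <= small_table_max:
        return list(rows)
    # Assumption: the sample is simply the first rows of the table.
    return list(rows[:large_sample])

def column_activity(headers, rows):
    """Fraction of non-empty cells per column (hypothetical summary format)."""
    total = len(rows) or 1  # avoid dividing by zero on empty tables
    return {
        header: sum(
            1 for row in rows if i < len(row) and str(row[i]).strip()
        ) / total
        for i, header in enumerate(headers)
    }
```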

Column mappings — which column maps to which schema field
Classification rules — how to categorise services from text patterns
Skip rules — which rows to ignore (totals only — never empty cells, which would break multi-row formats)
Summary items — charges extracted directly from small summary tables
Invoice total — the pre-VAT total found in summary tables (or from repeated-value columns in CSVs)
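
One way to picture the response is as a small data structure plus a function that applies it to a table. Everything in this sketch is hypothetical: the `MappingResult` fields mirror the five items listed above, but the regex-based rule format, the schema field names (`description`, `amount`), and the `"other"` default category are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MappingResult:
    column_mappings: dict        # source column -> schema field
    classification_rules: list   # (regex, category) pairs, first match wins
    skip_rules: list             # (source column, regex) pairs marking totals rows
    summary_items: list = field(default_factory=list)
    invoice_total: Optional[float] = None

def apply_mapping(result, headers, rows):
    """Turn table rows into line items using the LLM-provided rules."""
    items = []
    for row in rows:
        record = dict(zip(headers, row))
        # Skip rules match explicit totals text only, never empty cells.
        if any(
            re.search(pattern, record.get(col, ""), re.IGNORECASE)
            for col, pattern in result.skip_rules
        ):
            continue
        item = {
            schema_field: record.get(col, "")
            for col, schema_field in result.column_mappings.items()
        }
        item["category"] = next(
            (
                category
                for pattern, category in result.classification_rules
                if re.search(pattern, item.get("description", ""), re.IGNORECASE)
            ),
            "other",
        )
        items.append(item)
    return items
```

Because the skip rules look for totals text rather than blank cells, continuation rows in multi-row formats still reach the output.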

CSVs and PDFs get different prompts. Each courier has optional hints:

Courier	Hint
UPS	Skip rollup rows (empty Charge Description). Use "Net Amount" only.
DPD	CSV: expand surcharge columns. PDF: base charges only in "Amount".
DHL	Use "Total amount (incl. VAT)" — don't map component columns.
Evri	No hint needed — clean format.
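
The hint table could be represented as a simple lookup keyed by courier and file format, as in the sketch below. The dictionary layout, the `"*"` wildcard key, and the function names are assumptions. The hint text is taken from the table above.

```python
# "*" marks a hint that applies to every file format (hypothetical convention).
COURIER_HINTS = {
    "UPS": {"*": 'Skip rollup rows (empty Charge Description). Use "Net Amount" only.'},
    "DPD": {
        "csv": "Expand surcharge columns.",
        "pdf": 'Base charges only in "Amount".',
    },
    "DHL": {"*": 'Use "Total amount (incl. VAT)"; don\'t map component columns.'},
    # Evri has no entry: its format is clean enough to need no hint.
}

def hint_for(courier, file_format):
    """Return the hint for a courier and format, or None if there is none."""
    hints = COURIER_HINTS.get(courier.upper(), {})
    return hints.get(file_format.lower(), hints.get("*"))

def build_prompt(base_prompt, courier, file_format):
    """Append the courier hint to the base prompt when one exists."""
    hint = hint_for(courier, file_format)
    if hint is None:
        return base_prompt
    return f"{base_prompt}\n\nCourier hint: {hint}"
```

Returning `None` for unknown couriers keeps hints optional, so a new courier goes through the same prompt without special handling.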

Self-Healing (85-115% threshold)

After applying rules, the pipeline checks whether the extracted total matches the LLM-detected invoice total. If accuracy (the extracted total as a percentage of the invoice total) is outside 85-115%, it retries with feedback: which tables overcounted, which column it chose, and the per-table breakdown. It makes up to 3 attempts, each with more sample rows.

The 85-115% band is tight enough to catch real errors (the LLM mapping a wrong column at 77%) while loose enough to accept minor gaps from surcharges on summary pages (~1-5%).
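
The retry loop can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: `map_and_extract` is a hypothetical callable wrapping the LLM mapping and rule application, the feedback dictionary is a simplified stand-in for the real feedback, and growing the sample by 8 rows per attempt is an assumed schedule (the source only says each attempt gets more sample rows).

```python
def self_heal(map_and_extract, invoice_total, max_attempts=3,
              base_samples=8, low=0.85, high=1.15):
    """Retry mapping until extracted/invoice total falls inside [low, high].

    map_and_extract(sample_rows, feedback) -> (extracted_total, per_table_totals)
    Returns (accuracy, attempts_used, accepted).
    """
    if not invoice_total:
        raise ValueError("invoice_total must be a non-zero number")
    feedback = None
    accuracy = None
    for attempt in range(1, max_attempts + 1):
        sample_rows = base_samples * attempt  # assumed growth schedule
        extracted_total, per_table = map_and_extract(sample_rows, feedback)
        accuracy = extracted_total / invoice_total
        if low <= accuracy <= high:
            return accuracy, attempt, True
        # Tell the next attempt what went wrong on this one.
        feedback = {
            "accuracy": accuracy,
            "per_table_totals": per_table,
            "previous_sample_rows": sample_rows,
        }
    return accuracy, max_attempts, False
```

With this band, a mapping that lands at 77% is rejected and retried, while one that lands at 96% because of a few summary-page surcharges is accepted.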

Tier 2: Guided Text Extraction

Only triggers when two conditions are met:

Text mapping accuracy is below 95%
The extraction ratio is below 15% — fewer than 15% of input rows became line items

The extraction ratio gate prevents this tier from triggering on DPD invoices at 94% accuracy with thousands of items (the format is understood; the gap comes from surcharges). It only activates for genuinely complex formats like DHL's multi-row shipment records (22 items from 600 rows ≈ 3.7%).
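
The two-condition gate reduces to a short predicate, sketched here with hypothetical names. The thresholds come from the text above. Returning `False` for a table with no input rows is an assumption.

```python
def extraction_ratio(line_items, input_rows):
    """Share of input rows that became line items."""
    return line_items / input_rows if input_rows else 0.0

def should_use_guided_text(accuracy, line_items, input_rows,
                           accuracy_floor=0.95, ratio_ceiling=0.15):
    """Escalate to Tier 2 only when accuracy AND extraction ratio are both low."""
    if input_rows == 0:
        return False  # assumption: nothing to extract, so nothing to escalate
    ratio = extraction_ratio(line_items, input_rows)
    return accuracy < accuracy_floor and ratio < ratio_ceiling
```

For the DHL example, `extraction_ratio(22, 600)` is about 0.037, well under the 0.15 ceiling, so the gate opens whenever accuracy is also below 95%. A flat-table invoice where most rows become line items stays in Tier 1 even at 94% accuracy.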

Tier 3: Vision Fallback

Pages are rendered as high-resolution PNG images on demand (2 at a time, discarded after extraction) and read visually by AI models. This bypasses text extraction entirely. It is the last resort when the Docling text output is too damaged or the format too complex for even guided text extraction.
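
The render-two-then-discard pattern can be sketched as below. `render_png` and `read_images` are hypothetical callables standing in for the page renderer and the vision model call. Only the batch size of 2 comes from the text.

```python
def batched(pages, size=2):
    """Yield consecutive slices of at most `size` pages."""
    for i in range(0, len(pages), size):
        yield pages[i:i + size]

def extract_with_vision(page_numbers, render_png, read_images, batch_size=2):
    """Render pages in small batches so only a few PNGs are in memory at once.

    render_png(page_number) -> bytes    (hypothetical renderer)
    read_images(list_of_bytes) -> list  (hypothetical vision-model call)
    """
    items = []
    for batch in batched(list(page_numbers), batch_size):
        images = [render_png(page) for page in batch]
        items.extend(read_images(images))
        # Drop the references so the PNG bytes can be garbage-collected
        # before the next batch is rendered.
        del images
    return items
```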

Tested Results

Tested on 12 invoices from 4 couriers. UPS CSV was previously non-deterministic at 77%. This was fixed by tightening the self-healing threshold from 50-150% to 85-115%, which lets the pipeline catch and correct the wrong column mapping.

Complexity	Accuracy	Cost	Speed	Method
CSVs (UPS, DPD)	100%	~$0.03	11-16s	Text mapping + courier hints + self-healing
Small PDFs (Evri, DHL 2pg)	100%	~$0.03-0.06	8-48s	Text mapping or vision
Medium PDFs (DPD 9-49pg)	96-100%	~$0.03	36-187s	Text mapping
Large PDFs (DPD 120pg)	95.4%	~$0.09	384s	Text mapping + chunked extraction
Complex PDFs (DHL 32pg)	94-105%	~$0.08-0.21	347-1278s	Guided text or vision

Average accuracy: 98.3%. Average cost: ~$0.05/invoice.

Summary

What matters	How it's achieved
Accuracy	Three self-correcting tiers. Text mapping self-heals at 85-115%. Guided text self-heals via Sonnet feedback loop. Vision reads images directly.
Cost	$0.03 for most invoices. Guided text ~$0.08 only for complex formats. Vision ~$0.18 as last resort.
Speed	CSVs: 11-16s. Small PDFs: under a minute. Large PDFs: 2-7 min. Complex formats: 6-21 min.
Smart escalation	Extraction ratio gate prevents expensive guided text tier from triggering on flat-table invoices where the gap is from surcharges, not format complexity.
Courier agnostic	Tested on DPD, DHL, Evri, UPS. Courier hints are optional — the pipeline works without them, just better with them.
