Invoice ETL Pipeline — How It Works
An overview of how the invoice data extraction pipeline turns courier invoices into structured, queryable pricing data. Built as a proof of concept, tested against real DPD and Evri invoices.
What It Does
The pipeline takes a courier invoice (PDF, CSV, or Excel) and extracts every individual line item — tracking numbers, service types, weights, amounts, surcharges — into a single normalised format. Whether it's a 49-page DPD invoice with 2,000 consignments or a single-page Evri summary, the output is the same structured data.
The Three Phases
Every invoice goes through three distinct phases: Extract, Transform, Load.
Extract — Reading the Document
A machine learning model (Docling) reads the invoice and identifies every table in the document. It doesn't just read text — it understands table structure, detecting headers, rows, column boundaries, and which page each table is on.
For a 49-page DPD invoice, this produces around 55 separate tables. Many of these have identical column structures repeated across pages (the same line-item format on every page), while a few are different (surcharge summaries, tax breakdowns, payment details).
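The output of this phase can be pictured as a flat list of tables, each carrying its page number, headers, and rows. A minimal sketch of that shape (the class and field names here are illustrative, not Docling's actual API):

```python
from dataclasses import dataclass

@dataclass
class ExtractedTable:
    """One table as reported by the extraction layer."""
    page: int              # page the table appears on
    headers: list[str]     # detected column headers
    rows: list[list[str]]  # cell text, one list per row

# A 49-page invoice yields dozens of these, most sharing one structure:
tables = [
    ExtractedTable(page=1,
                   headers=["Consignment", "Service", "Weight", "Amount"],
                   rows=[["15501234567", "Next Day", "2.1", "5.40"]]),
    ExtractedTable(page=2,
                   headers=["Consignment", "Service", "Weight", "Amount"],
                   rows=[["15501234568", "Standard", "0.8", "3.10"]]),
]
```

The repeated header structure across pages is exactly what the Transform phase exploits next.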
Transform — Understanding the Format
This is where the intelligence lives. Rather than asking an AI to process every single row (slow, expensive, and inconsistent), the system takes a smarter approach:
- Groups identical tables — those 55 tables collapse to just 3-4 unique structures
- Shows the AI a small sample — just the column headers and 8 representative rows from across the entire invoice
- The AI returns rules, not data — "Column A is the tracking number, Column B is the service type, if the description contains 'Scottish Highlands' then the destination region is Scotland"
- The server applies those rules to every row — deterministically, instantly, consistently
The result: the AI call takes about 12 seconds and costs a fraction of a penny. But the rules it produces can process thousands of rows in milliseconds. And because the rules are cached, the next invoice from the same courier is processed instantly — no AI call needed.
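The grouping and sampling steps above can be sketched in a few lines: tables are keyed by a normalised header signature, so the AI only ever sees one small sample per unique structure (illustrative code, not the POC's actual implementation):

```python
from collections import defaultdict

def group_by_structure(tables):
    """Collapse tables with identical headers into one group each."""
    groups = defaultdict(list)
    for t in tables:
        signature = tuple(h.strip().lower() for h in t["headers"])
        groups[signature].append(t)
    return groups

def sample_rows(group, n=8):
    """Take up to n representative rows spread across the whole group."""
    all_rows = [row for t in group for row in t["rows"]]
    step = max(1, len(all_rows) // n)
    return all_rows[::step][:n]

# 55 tables collapse to 3-4 signatures: one AI call per signature,
# never one per row.
```

Because the signature ignores case and whitespace, the same line-item layout repeated on every page lands in a single group.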
Load — Structured Output
Every line item is normalised into a common format regardless of which courier it came from. A DPD consignment and an Evri parcel end up with the same fields: tracking number, service type, amount, category, weight, destination.
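A sketch of that common shape, with field names assumed from the descriptions in this document rather than taken from the POC's actual schema:

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    """Courier-agnostic line item: same fields for DPD and Evri."""
    tracking_number: "str | None"  # not every courier exposes one per row
    service_type: str              # e.g. "Next Day", "Standard"
    amount: float                  # line amount in invoice currency
    category: str                  # e.g. "Packet", "Parcel"
    weight_kg: "float | None"
    destination: str               # e.g. "UK Mainland", "Scottish Highlands"

dpd = LineItem("15501234567", "Next Day", 5.40, "Parcel", 2.1, "UK Mainland")
evri = LineItem(None, "Standard", 3.10, "Packet", 0.8, "Scottish Highlands")
```

Downstream queries and aggregations only ever see this shape, never the courier-specific columns.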
How It Learns Invoice Formats
The system doesn't need to be pre-configured for each courier. When it encounters a new invoice format, the AI analyses the table structure and creates a complete set of rules.
The AI produces four types of rules:
- Column Mappings — which source column maps to which output field (e.g. "Consignment Number" becomes `trackingNumber`)
- Classification Rules — pattern-matching to derive service type, region, category from text
- Skip Rules — identify rows to ignore (totals, subtotals, headers, metadata)
- Default Values — static fields like carrier name and currency
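Put together, a ruleset for one table structure might look like the following, along with the deterministic step that applies it to each row. This is a hypothetical shape for illustration; the POC's actual rule schema may differ:

```python
rules = {
    "column_mappings": {"Consignment Number": "trackingNumber",
                        "Service": "serviceType",
                        "Amount": "amount"},
    "classification_rules": [
        {"source": "Description", "contains": "Scottish Highlands",
         "field": "destinationRegion", "value": "Scottish Highlands"},
    ],
    "skip_rules": [{"source": "Consignment Number", "contains": "Total"}],
    "defaults": {"carrier": "DPD", "currency": "GBP"},
}

def apply_rules(row, rules):
    """Deterministically map one raw row (dict) to the output shape."""
    # Skip rules: totals, subtotals, repeated headers, metadata
    if any(r["contains"] in str(row.get(r["source"], ""))
           for r in rules["skip_rules"]):
        return None
    out = dict(rules["defaults"])
    for src, dst in rules["column_mappings"].items():
        if src in row:
            out[dst] = row[src]
    for r in rules["classification_rules"]:
        if r["contains"] in str(row.get(r["source"], "")):
            out[r["field"]] = r["value"]
    return out
```

Because this step is plain data transformation, it runs in milliseconds over thousands of rows and produces identical output every time.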
Classification Rules — The Secret Sauce
This is where the system gets particularly clever. Many courier invoices embed rich information inside free-text descriptions rather than in separate columns.
For example, an Evri invoice has a single "Description" column with entries like:
"Scottish Highlands & Islands Packet Despatch A (1g to 2000g)"
The AI recognises the naming convention and creates pattern-matching rules that extract:
- Region — Scottish Highlands & Islands
- Category — Packet
- Weight band — 0-2kg (1g to 2000g)
- Service — Despatch A (Standard)
These rules are applied to every row automatically. If the description says "Northern Ireland Parcel Despatch A (2001g to 15000g)", the same rules classify it as Northern Ireland, Parcel (heavy), 2-15kg, Standard.
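A rule for descriptions like these could be sketched with a regular expression (illustrative only; the AI-generated rules in the POC may use a different representation):

```python
import re

# Parses descriptions like
#   "Scottish Highlands & Islands Packet Despatch A (1g to 2000g)"
#   "Northern Ireland Parcel Despatch A (2001g to 15000g)"
DESC = re.compile(
    r"(?P<region>Scottish Highlands & Islands|Northern Ireland)?\s*"
    r"(?P<category>Packet|Parcel) Despatch (?P<service>[A-Z]) "
    r"\((?P<lo>\d+)g to (?P<hi>\d+)g\)"
)

def classify(description):
    m = DESC.search(description)
    if not m:
        return None
    return {
        "region": m["region"] or "UK Mainland",  # no prefix = mainland
        "category": m["category"],
        "weight_band": f"{(int(m['lo']) - 1) // 1000}-{int(m['hi']) // 1000}kg",
    }
```

One compiled pattern covers every weight band and region variant in the convention, which is why a handful of rules can classify thousands of rows.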
Caching — Learn Once, Apply Forever
Once the system has seen a courier's invoice format, it never needs to call the AI again for that format. The mapping rules are cached and reused:
- Same courier, same month — instant (cache hit, ~20 milliseconds)
- Same courier, different month — usually instant (same format month-to-month)
- New courier or new format — one AI call (~12 seconds), then cached
The cache automatically invalidates if the output schema changes, ensuring stale rules don't produce incorrect data.
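Schema-based invalidation can fall out naturally from the cache key: fold the output schema version (and the table's header signature) into the key, so any schema change misses automatically. A sketch, assuming the cache is keyed per courier format:

```python
import hashlib
import json

SCHEMA_VERSION = "v1"  # bump whenever the normalised output schema changes

def cache_key(courier: str, header_signature: "tuple[str, ...]") -> str:
    """Key mapping rules by courier, table structure, and schema version."""
    payload = json.dumps([courier, list(header_signature), SCHEMA_VERSION])
    return hashlib.sha256(payload.encode()).hexdigest()

# Same courier + same headers + same schema -> same key -> cache hit.
# Bumping SCHEMA_VERSION changes every key, so stale rules never apply.
```

This keeps invalidation passive: no sweep or TTL logic, just keys that stop matching.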
How It Handles Different Couriers
Each courier structures their invoices differently. The system adapts to whatever structure it finds.
Despite their very different layouts, both couriers produce the same normalised output — every line item has a tracking number (where available), service type, amount, category, and confidence score.
What We've Proven
The POC has been tested against real invoices from DPD and Evri.
| Metric | DPD (49 pages) | Evri (2 pages) |
|---|---|---|
| Line items extracted | 1,998 of ~2,015 | All |
| Amount accuracy | 99.2% (£17,655 of £17,807) | 100% |
| First-run processing | ~2.5 minutes (extraction + AI) | ~15 seconds |
| Subsequent runs | ~20 milliseconds | ~20 milliseconds |
| Service types detected | 4 (Standard, Next Day, Express 10:30, Saturday) | 3 (Courier, Standard, Next Day) |
| Regions classified | UK Mainland | UK Mainland, Scottish Highlands, Northern Ireland, Channel Islands, Isle of Man |
The 0.8% gap in DPD data comes from a known limitation: when a table row spans a page boundary in the PDF, the ML extraction occasionally splits it into two partial rows. This is an extraction-layer issue, not a mapping issue.
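One way to close that gap would be a post-extraction repair pass that detects the split pattern (a row whose trailing cells are empty, followed by a row whose leading cells are empty) and merges the two halves. A hypothetical heuristic, not implemented in the POC:

```python
def merge_split_rows(rows):
    """Merge a row whose trailing cell is empty with a following row
    whose leading cell is empty — the typical page-split pattern."""
    merged, i = [], 0
    while i < len(rows):
        row = rows[i]
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        if nxt and row[-1] == "" and nxt[0] == "":
            # Combine cell-by-cell, taking whichever half is non-empty
            row = [a or b for a, b in zip(row, nxt)]
            i += 1  # consume the second half
        merged.append(row)
        i += 1
    return merged
```

Whether this heuristic is safe in practice (e.g. against genuinely empty cells) is one of the things a production version would need to validate against footer totals.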
Production Architecture Direction
The POC validates the approach. For production, the plan is to build on Claimit's existing infrastructure:
Key decisions for production
Temporal for orchestration — Claimit already runs Temporal in production. Invoice processing is a natural fit: long-running, needs retry logic, needs progress visibility, has distinct phases with different failure modes.
Separate Docling worker — The ML extraction is CPU-heavy and takes minutes. It runs as its own container on a dedicated task queue, just like Claimit's existing browser workers run separately from email processing workers.
Azure PostgreSQL for storage — Normalised line items go into a relational database for querying, aggregation, and cross-invoice analysis. The schema is simple: an `invoices` table and a `line_items` table.
Mapper cache stays lightweight — Cached mapping rules don't need a database. Redis or blob storage is sufficient for what amounts to a few KB of JSON per courier format.
Open Questions
These are the decisions to make before building the production system:
| Question | Options | Consideration |
|---|---|---|
| Docling scaling | Always-on vs scale-to-zero | Invoices arrive monthly in batches. 30-second cold start acceptable per batch? |
| LLM provider | OpenRouter / Direct API / Self-hosted | Cost vs reliability vs latency trade-offs at volume |
| Accuracy monitoring | Manual review vs automated reconciliation | Can we auto-compare ETL totals against invoice footer totals? |
| Multi-tenant | Shared vs isolated databases | Different clients have different couriers and access requirements |
| Historical data | Backfill vs forward-only | Do we need to process archived invoices or just new ones? |