Invoice ETL Pipeline — How It Works

By Patrick McCurley · Created Mar 16, 2026

An overview of how the invoice data extraction pipeline turns courier invoices into structured, queryable pricing data. Built as a proof of concept, tested against real DPD and Evri invoices.

What It Does

The pipeline takes a courier invoice (PDF, CSV, or Excel) and extracts every individual line item — tracking numbers, service types, weights, amounts, surcharges — into a single normalised format. Whether it's a 49-page DPD invoice with 2,000 consignments or a single-page Evri summary, the output is the same structured data.

The Three Phases

Every invoice goes through three distinct steps: Extract, Transform, Load.

Extract — Reading the Document

A machine learning model (Docling) reads the invoice and identifies every table in the document. It doesn't just read text — it understands table structure, detecting headers, rows, column boundaries, and which page each table is on.

For a 49-page DPD invoice, this produces around 55 separate tables. Many of these have identical column structures repeated across pages (the same line-item format on every page), while a few are different (surcharge summaries, tax breakdowns, payment details).

Transform — Understanding the Format

This is where the intelligence lives. Rather than asking an AI to process every single row (slow, expensive, and inconsistent), the system takes a smarter approach:

  1. Groups identical tables — those 55 tables collapse to just 3-4 unique structures
  2. Shows the AI a small sample — just the column headers and 8 representative rows from across the entire invoice
  3. The AI returns rules, not data — "Column A is the tracking number, Column B is the service type, if the description contains 'Scottish Highlands' then the destination region is Scotland"
  4. The server applies those rules to every row — deterministically, instantly, consistently
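
A minimal sketch of steps 1 and 4, assuming each extracted table arrives as a header list plus rows (the dict shapes and helper names here are illustrative, not the pipeline's real API):

```python
from collections import defaultdict

def signature(table):
    """Tables with identical headers share one structure."""
    return tuple(h.strip().lower() for h in table["headers"])

def group_tables(tables):
    """Collapse the ~55 per-page tables into a handful of unique structures."""
    groups = defaultdict(list)
    for t in tables:
        groups[signature(t)].append(t)
    return groups

def apply_rules(rules, row):
    """Apply the AI-returned column mapping to one row -- no AI call needed."""
    return {field: row[col] for field, col in rules["columns"].items()}

# Two pages with the same line-item layout, plus one surcharge summary:
tables = [
    {"headers": ["Tracking No", "Service", "Amount"], "rows": [["A1", "Next Day", "4.20"]]},
    {"headers": ["Tracking No", "Service", "Amount"], "rows": [["B2", "Standard", "3.10"]]},
    {"headers": ["Surcharge", "Total"], "rows": [["Fuel", "12.00"]]},
]
groups = group_tables(tables)  # 3 tables -> 2 unique structures
rules = {"columns": {"tracking_number": 0, "service_type": 1, "amount": 2}}
line_items = [apply_rules(rules, row)
              for t in groups[("tracking no", "service", "amount")]
              for row in t["rows"]]
```

Because the AI only ever sees one sample per unique structure, the cost of the call is independent of how many pages the invoice has.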

The result: the AI call takes about 12 seconds and costs a fraction of a penny. But the rules it produces can process thousands of rows in milliseconds. And because the rules are cached, the next invoice from the same courier is processed instantly — no AI call needed.

Load — Structured Output

Every line item is normalised into a common format regardless of which courier it came from. A DPD consignment and an Evri parcel end up with the same fields: tracking number, service type, amount, category, weight, destination.
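
One way to express that common format is a single record type every courier's rows map into (the field names mirror the ones listed above; the class itself and the sample values are illustrative):

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional

@dataclass
class LineItem:
    """Courier-agnostic line item -- identical fields for DPD and Evri."""
    tracking_number: Optional[str]   # not every row has one (e.g. surcharges)
    service_type: str
    amount: Decimal
    category: str                    # e.g. "delivery", "surcharge"
    weight_kg: Optional[Decimal]
    destination: Optional[str]

# A DPD consignment and an Evri parcel normalise to the same shape:
dpd = LineItem("15501234567890", "Next Day", Decimal("4.20"),
               "delivery", Decimal("1.5"), "UK Mainland")
evri = LineItem("H0012345678", "Standard", Decimal("2.85"),
                "delivery", Decimal("0.8"), "Scottish Highlands")
```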

How It Learns Invoice Formats

The system doesn't need to be pre-configured for each courier. When it encounters a new invoice format, the AI analyses the table structure and creates a complete set of rules.

The AI produces four types of rules.

Classification Rules — The Secret Sauce

This is where the system gets particularly clever. Many courier invoices embed rich information inside free-text descriptions rather than in separate columns.

For example, an Evri invoice has a single "Description" column with entries like:

"Scottish Highlands & Islands Packet Despatch A (1g to 2000g)"

The AI recognises the naming convention and creates pattern-matching rules that extract the destination region ("Scottish Highlands & Islands"), the package type ("Packet"), the weight band ("1g to 2000g"), and the service level from that one description column.

These rules are applied to every row automatically. If the description says "Northern Ireland Parcel Despatch A (2001g to 15000g)", the same rules classify it as Northern Ireland, Parcel (heavy), 2-15kg, Standard.
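
A sketch of what such a rule can look like once the AI has returned it, run against the two example descriptions above (the exact regex and field names are illustrative, not the rules the POC produces):

```python
import re

# A pattern-matching rule produced once by the AI,
# then applied deterministically to every row.
DESCRIPTION_RE = re.compile(
    r"^(?P<region>.+?)\s+(?P<package>Packet|Parcel)\s+Despatch\s+(?P<service>\w)"
    r"\s+\((?P<min_g>\d+)g\s+to\s+(?P<max_g>\d+)g\)$"
)

def classify(description):
    m = DESCRIPTION_RE.match(description)
    if m is None:
        return None  # would fall through to a lower-confidence rule in practice
    return {
        "region": m.group("region"),
        "package": m.group("package"),
        "weight_band": f'{m.group("min_g")}g-{m.group("max_g")}g',
        "service": m.group("service"),
    }

classify("Scottish Highlands & Islands Packet Despatch A (1g to 2000g)")
# region "Scottish Highlands & Islands", package "Packet", band "1g-2000g"
classify("Northern Ireland Parcel Despatch A (2001g to 15000g)")
# region "Northern Ireland", package "Parcel", band "2001g-15000g"
```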

Caching — Learn Once, Apply Forever

Once the system has seen a courier's invoice format, it never needs to call the AI again for that format. The mapping rules are cached and reused.

The cache automatically invalidates if the output schema changes, ensuring stale rules don't produce incorrect data.
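
One way to get that invalidation behaviour is to fold a hash of the output schema into the cache key, so a schema change simply produces a cache miss (the key construction below is an assumption for illustration, not the actual implementation):

```python
import hashlib
import json

OUTPUT_FIELDS = ["tracking_number", "service_type", "amount",
                 "category", "weight", "destination"]

def schema_hash(fields):
    """Stable digest of the output schema; changes whenever fields change."""
    return hashlib.sha256(json.dumps(sorted(fields)).encode()).hexdigest()[:12]

def cache_key(courier, table_signature, fields=OUTPUT_FIELDS):
    return f"{courier}:{table_signature}:{schema_hash(fields)}"

rule_cache = {}
key = cache_key("evri", "desc|qty|amount")
rule_cache[key] = {"columns": {"amount": 2}}   # rules from the one-off AI call

# Same courier and format: cache hit, no AI call.
assert cache_key("evri", "desc|qty|amount") in rule_cache
# Schema gains a field: the key changes, so stale rules can never match.
assert cache_key("evri", "desc|qty|amount",
                 OUTPUT_FIELDS + ["vat"]) not in rule_cache
```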

How It Handles Different Couriers

Each courier structures their invoices differently. The system adapts to whatever structure it finds.

Despite DPD and Evri structuring their invoices very differently, both produce the same normalised output: every line item has a tracking number (where available), service type, amount, category, and confidence score.

What We've Proven

The POC has been tested against real invoices from DPD and Evri.

| Metric | DPD (49 pages) | Evri (2 pages) |
| --- | --- | --- |
| Line items extracted | 1,998 of ~2,015 | All |
| Amount accuracy | 99.2% (£17,655 of £17,807) | 100% |
| First-run processing | ~2.5 minutes (extraction + AI) | ~15 seconds |
| Subsequent runs | ~20 milliseconds | ~20 milliseconds |
| Service types detected | 4 (Standard, Next Day, Express 10:30, Saturday) | 3 (Courier, Standard, Next Day) |
| Regions classified | UK Mainland | UK Mainland, Scottish Highlands, Northern Ireland, Channel Islands, Isle of Man |

The 0.8% gap in DPD data comes from a known limitation: when a table row spans a page boundary in the PDF, the ML extraction occasionally splits it into two partial rows. This is an extraction-layer issue, not a mapping issue.
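
One possible extraction-layer repair, assuming a split row shows up as a fragment holding the tracking number followed by a fragment holding the amount (purely a sketch of the heuristic, not code from the POC):

```python
def merge_split_rows(rows):
    """Re-join rows the extractor split across a page boundary.

    Heuristic: a fragment with a tracking number but no amount, followed
    by a fragment with an amount but no tracking number, is one broken row.
    """
    merged, pending = [], None
    for row in rows:
        if pending is not None and not row.get("tracking_number"):
            pending.update({k: v for k, v in row.items() if v})
            merged.append(pending)
            pending = None
        elif row.get("tracking_number") and not row.get("amount"):
            pending = dict(row)      # first half; wait for the remainder
        else:
            merged.append(dict(row))
    if pending is not None:
        merged.append(pending)       # unmatched fragment kept as-is
    return merged

rows = [
    {"tracking_number": "A1", "service": "Next Day", "amount": "4.20"},
    {"tracking_number": "B2", "service": "Standard", "amount": ""},  # page n
    {"tracking_number": "", "service": "", "amount": "3.10"},        # page n+1
]
fixed = merge_split_rows(rows)  # 3 fragments -> 2 complete rows
```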

Production Architecture Direction

The POC validates the approach. For production, the plan is to build on Claimit's existing infrastructure.

Open Questions

These are the decisions to make before building the production system:

| Question | Options | Consideration |
| --- | --- | --- |
| Docling scaling | Always-on vs scale-to-zero | Invoices arrive monthly in batches. 30-second cold start acceptable per batch? |
| LLM provider | OpenRouter / Direct API / Self-hosted | Cost vs reliability vs latency trade-offs at volume |
| Accuracy monitoring | Manual review vs automated reconciliation | Can we auto-compare ETL totals against invoice footer totals? |
| Multi-tenant | Shared vs isolated databases | Different clients have different couriers and access requirements |
| Historical data | Backfill vs forward-only | Do we need to process archived invoices or just new ones? |