Invoice ETL Pipeline — How It Works
An overview of how the invoice data extraction pipeline turns courier invoices into structured, queryable pricing data. Built as a proof of concept, tested against real DPD and Evri invoices.
What It Does
The pipeline takes a courier invoice (PDF, CSV, or Excel) and extracts every individual line item — tracking numbers, service types, weights, amounts, surcharges — into a single normalised format. Whether it's a 49-page DPD invoice with 2,000 consignments or a single-page Evri summary, the output is the same structured data.
The Three Phases
Every invoice goes through three distinct phases: Extract, Transform, Load.
Extract — Reading the Document
A machine learning model (Docling) reads the invoice and identifies every table in the document. It doesn't just read text — it understands table structure, detecting headers, rows, column boundaries, and which page each table is on.
For a 49-page DPD invoice, this produces around 55 separate tables. Many of these have identical column structures repeated across pages (the same line-item format on every page), while a few are different (surcharge summaries, tax breakdowns, payment details).
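The output of this phase can be pictured as a flat list of tables, each carrying its page number, headers, and rows. A minimal sketch of that shape (the class and field names here are illustrative, not Docling's actual API):

```python
from dataclasses import dataclass

@dataclass
class ExtractedTable:
    """One table as reported by the extraction layer."""
    page: int              # page the table appears on
    headers: list[str]     # detected column headers
    rows: list[list[str]]  # cell text, one list per row

# A 49-page invoice yields dozens of these, most sharing one structure:
tables = [
    ExtractedTable(page=1,
                   headers=["Consignment", "Service", "Weight", "Amount"],
                   rows=[["15501234567", "Next Day", "2.1", "5.40"]]),
    ExtractedTable(page=2,
                   headers=["Consignment", "Service", "Weight", "Amount"],
                   rows=[["15501234568", "Standard", "0.8", "3.10"]]),
]
```

The repeated header structure across pages is exactly what the Transform phase exploits next.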
Transform — Understanding the Format
This is where the intelligence lives. Rather than asking an AI to process every single row (slow, expensive, and inconsistent), the system takes a smarter approach:
- Groups identical tables — those 55 tables collapse to just 3-4 unique structures
- Shows the AI a small sample — just the column headers and 8 representative rows from across the entire invoice
- The AI returns rules, not data — "Column A is the tracking number, Column B is the service type, if the description contains 'Scottish Highlands' then the destination region is Scotland"
- The server applies those rules to every row — deterministically, instantly, consistently
The result: the AI call takes about 12 seconds and costs a fraction of a penny. But the rules it produces can process thousands of rows in milliseconds. And because the rules are cached, the next invoice from the same courier is processed instantly — no AI call needed.
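The grouping and sampling steps above can be sketched in a few lines: tables are keyed by a normalised header signature, so the AI only ever sees one small sample per unique structure (illustrative code, not the POC's actual implementation):

```python
from collections import defaultdict

def group_by_structure(tables):
    """Collapse tables with identical headers into one group each."""
    groups = defaultdict(list)
    for t in tables:
        signature = tuple(h.strip().lower() for h in t["headers"])
        groups[signature].append(t)
    return groups

def sample_rows(group, n=8):
    """Take up to n representative rows spread across the whole group."""
    all_rows = [row for t in group for row in t["rows"]]
    step = max(1, len(all_rows) // n)
    return all_rows[::step][:n]

# 55 tables collapse to 3-4 signatures: one AI call per signature,
# never one per row.
```

Because the signature ignores case and whitespace, the same line-item layout repeated on every page lands in a single group.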
Load — Structured Output
Every line item is normalised into a common format regardless of which courier it came from. A DPD consignment and an Evri parcel end up with the same fields: tracking number, service type, amount, category, weight, destination.
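A sketch of that common shape, with field names assumed from the descriptions in this document rather than taken from the POC's actual schema:

```python
from dataclasses import dataclass

@dataclass
class LineItem:
    """Courier-agnostic line item: same fields for DPD and Evri."""
    tracking_number: "str | None"  # not every courier exposes one per row
    service_type: str              # e.g. "Next Day", "Standard"
    amount: float                  # line amount in invoice currency
    category: str                  # e.g. "Packet", "Parcel"
    weight_kg: "float | None"
    destination: str               # e.g. "UK Mainland", "Scottish Highlands"

dpd = LineItem("15501234567", "Next Day", 5.40, "Parcel", 2.1, "UK Mainland")
evri = LineItem(None, "Standard", 3.10, "Packet", 0.8, "Scottish Highlands")
```

Downstream queries and aggregations only ever see this shape, never the courier-specific columns.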
How It Learns Invoice Formats
The system doesn't need to be pre-configured for each courier. When it encounters a new invoice format, the AI analyses the table structure and creates a complete set of rules.
The AI produces four types of rules:
- Column Mappings — which source column maps to which output field (e.g. "Consignment Number" becomes `trackingNumber`)
- Classification Rules — pattern-matching to derive service type, region, category from text
- Skip Rules — identify rows to ignore (totals, subtotals, headers, metadata)
- Default Values — static fields like carrier name and currency
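Put together, a ruleset for one table structure might look like the following, along with the deterministic step that applies it to each row. This is a hypothetical shape for illustration; the POC's actual rule schema may differ:

```python
rules = {
    "column_mappings": {"Consignment Number": "trackingNumber",
                        "Service": "serviceType",
                        "Amount": "amount"},
    "classification_rules": [
        {"source": "Description", "contains": "Scottish Highlands",
         "field": "destinationRegion", "value": "Scottish Highlands"},
    ],
    "skip_rules": [{"source": "Consignment Number", "contains": "Total"}],
    "defaults": {"carrier": "DPD", "currency": "GBP"},
}

def apply_rules(row, rules):
    """Deterministically map one raw row (dict) to the output shape."""
    # Skip rules: totals, subtotals, repeated headers, metadata
    if any(r["contains"] in str(row.get(r["source"], ""))
           for r in rules["skip_rules"]):
        return None
    out = dict(rules["defaults"])
    for src, dst in rules["column_mappings"].items():
        if src in row:
            out[dst] = row[src]
    for r in rules["classification_rules"]:
        if r["contains"] in str(row.get(r["source"], "")):
            out[r["field"]] = r["value"]
    return out
```

Because this step is plain data transformation, it runs in milliseconds over thousands of rows and produces identical output every time.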
Classification Rules — The Secret Sauce
This is where the system gets particularly clever. Many courier invoices embed rich information inside free-text descriptions rather than in separate columns.
For example, an Evri invoice has a single "Description" column with entries like:
"Scottish Highlands & Islands Packet Despatch A (1g to 2000g)"
The AI recognises the naming convention and creates pattern-matching rules that extract:
- Region — Scottish Highlands & Islands
- Category — Packet
- Weight band — 0-2kg (1g to 2000g)
- Service — Despatch A (Standard)
These rules are applied to every row automatically. If the description says "Northern Ireland Parcel Despatch A (2001g to 15000g)", the same rules classify it as Northern Ireland, Parcel (heavy), 2-15kg, Standard.
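A rule for descriptions like these could be sketched with a regular expression (illustrative only; the AI-generated rules in the POC may use a different representation):

```python
import re

# Parses descriptions like
#   "Scottish Highlands & Islands Packet Despatch A (1g to 2000g)"
#   "Northern Ireland Parcel Despatch A (2001g to 15000g)"
DESC = re.compile(
    r"(?P<region>Scottish Highlands & Islands|Northern Ireland)?\s*"
    r"(?P<category>Packet|Parcel) Despatch (?P<service>[A-Z]) "
    r"\((?P<lo>\d+)g to (?P<hi>\d+)g\)"
)

def classify(description):
    m = DESC.search(description)
    if not m:
        return None
    return {
        "region": m["region"] or "UK Mainland",  # no prefix = mainland
        "category": m["category"],
        "weight_band": f"{(int(m['lo']) - 1) // 1000}-{int(m['hi']) // 1000}kg",
    }
```

One compiled pattern covers every weight band and region variant in the convention, which is why a handful of rules can classify thousands of rows.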
Caching — Learn Once, Apply Forever
Once the system has seen a courier's invoice format, it never needs to call the AI again for that format. The mapping rules are cached and reused:
- Same courier, same month — instant (cache hit, ~20 milliseconds)
- Same courier, different month — usually instant (same format month-to-month)
- New courier or new format — one AI call (~12 seconds), then cached
The cache automatically invalidates if the output schema changes, ensuring stale rules don't produce incorrect data.
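Schema-based invalidation can fall out naturally from the cache key: fold the output schema version (and the table's header signature) into the key, so any schema change misses automatically. A sketch, assuming the cache is keyed per courier format:

```python
import hashlib
import json

SCHEMA_VERSION = "v1"  # bump whenever the normalised output schema changes

def cache_key(courier: str, header_signature: "tuple[str, ...]") -> str:
    """Key mapping rules by courier, table structure, and schema version."""
    payload = json.dumps([courier, list(header_signature), SCHEMA_VERSION])
    return hashlib.sha256(payload.encode()).hexdigest()

# Same courier + same headers + same schema -> same key -> cache hit.
# Bumping SCHEMA_VERSION changes every key, so stale rules never apply.
```

This keeps invalidation passive: no sweep or TTL logic, just keys that stop matching.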
How It Handles Different Couriers
Each courier structures their invoices differently. The system adapts to whatever structure it finds.
Despite their very different layouts, both couriers produce the same normalised output — every line item has a tracking number (where available), service type, amount, category, and confidence score.
What We've Proven
The POC has been tested against real invoices from DPD and Evri.
| Metric | DPD (49 pages) | Evri (2 pages) |
|---|---|---|
| Line items extracted | 1,998 of ~2,015 | All |
| Amount accuracy | 99.2% (£17,655 of £17,807) | 100% |
| First-run processing | ~2.5 minutes (extraction + AI) | ~15 seconds |
| Subsequent runs | ~20 milliseconds | ~20 milliseconds |
| Service types detected | 4 (Standard, Next Day, Express 10:30, Saturday) | 3 (Courier, Standard, Next Day) |
| Regions classified | UK Mainland | UK Mainland, Scottish Highlands, Northern Ireland, Channel Islands, Isle of Man |
The 0.8% gap in DPD data comes from a known limitation: when a table row spans a page boundary in the PDF, the ML extraction occasionally splits it into two partial rows. This is an extraction-layer issue, not a mapping issue.
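One way to close that gap would be a post-extraction repair pass that detects the split pattern (a row whose trailing cells are empty, followed by a row whose leading cells are empty) and merges the two halves. A hypothetical heuristic, not implemented in the POC:

```python
def merge_split_rows(rows):
    """Merge a row whose trailing cell is empty with a following row
    whose leading cell is empty — the typical page-split pattern."""
    merged, i = [], 0
    while i < len(rows):
        row = rows[i]
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        if nxt and row[-1] == "" and nxt[0] == "":
            # Combine cell-by-cell, taking whichever half is non-empty
            row = [a or b for a, b in zip(row, nxt)]
            i += 1  # consume the second half
        merged.append(row)
        i += 1
    return merged
```

Whether this heuristic is safe in practice (e.g. against genuinely empty cells) is one of the things a production version would need to validate against footer totals.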
Production Architecture Direction
The POC validates the approach. For production, the plan is to build on Claimit's existing infrastructure:
Key decisions for production
Temporal for orchestration — Claimit already runs Temporal in production. Invoice processing is a natural fit: long-running, needs retry logic, needs progress visibility, has distinct phases with different failure modes.
Separate Docling worker — The ML extraction is CPU-heavy and takes minutes. It runs as its own container on a dedicated task queue, just like Claimit's existing browser workers run separately from email processing workers.
Azure PostgreSQL for storage — Normalised line items go into a relational database for querying, aggregation, and cross-invoice analysis. The schema is simple: an `invoices` table and a `line_items` table.
Mapper cache stays lightweight — Cached mapping rules don't need a database. Redis or blob storage is sufficient for what amounts to a few KB of JSON per courier format.
Open Questions
These are the decisions to make before building the production system:
| Question | Options | Consideration |
|---|---|---|
| Docling scaling | Always-on vs scale-to-zero | Invoices arrive monthly in batches. 30-second cold start acceptable per batch? |
| LLM provider | OpenRouter / Direct API / Self-hosted | Cost vs reliability vs latency trade-offs at volume |
| Accuracy monitoring | Manual review vs automated reconciliation | Can we auto-compare ETL totals against invoice footer totals? |
| Multi-tenant | Shared vs isolated databases | Different clients have different couriers and access requirements |
| Historical data | Backfill vs forward-only | Do we need to process archived invoices or just new ones? |