Vision Model Benchmark — Full Pipeline Results

By Patrick McCurley · Created Mar 19, 2026 · Updated 16 days ago

End-to-end benchmark of the vision extraction pipeline against a 36-page DPD invoice (1,407 line items, £13,653.10 expected total). Tests the complete architecture: Docling table extraction, LLM page classification, vision-based line item extraction, and Sonnet summary extraction for surcharges.

Pipeline Architecture

The pipeline splits work between a cheap model (line items on data pages) and Sonnet (surcharges on summary pages). An LLM classifier decides which pages go where.

Why Two Models?

Pages 2-32 contain individual shipment rows. The "Amount" column shows base charges only (before surcharges). Page 33 contains aggregate surcharge totals (Fuel 12.15%, Third Party Collection, etc.) that only exist as summary figures — never per-item. A cheap model handles the bulk data extraction, while Sonnet reads the complex summary page.
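The routing step above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: `classify_page` stands in for the LLM classifier, and the keyword heuristic and model labels are placeholders.

```python
# Sketch of the two-model split: a classifier routes each page either to
# the cheap line-item model or to Sonnet. All names here are illustrative.

def classify_page(page_text: str) -> str:
    """Toy stand-in for the LLM classifier: summary pages mention
    aggregate surcharges rather than individual shipment rows."""
    keywords = ("fuel and energy", "third party collection", "invoice total")
    return "summary" if any(k in page_text.lower() for k in keywords) else "data"

def route_pages(pages: dict[int, str]) -> dict[str, list[int]]:
    """Split pages between the cheap line-item model and Sonnet."""
    routes: dict[str, list[int]] = {"cheap_model": [], "sonnet": []}
    for num, text in sorted(pages.items()):
        if classify_page(text) == "summary":
            routes["sonnet"].append(num)       # aggregate surcharge totals
        else:
            routes["cheap_model"].append(num)  # individual shipment rows
    return routes
```

In the real pipeline the classification is itself an LLM call; the point is only that routing happens before extraction, so each model sees only the page type it is good at.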

Full Invoice Results

All four models ran against all 36 pages with identical pipeline settings: 2-page batches, sequential processing, and the same Docling tables and header corrections. Summary extraction (Sonnet, £1,856.63) is constant across all runs.

Summary Extraction (Constant Across All Runs)

Sonnet reads pages 1, 33, 34, 35 and extracts 14 items totalling £1,856.63 in surcharges:

Surcharge                             Amount
Fuel and Energy (3 service tiers)     £1,548.83
Third Party Collection                £187.50
Nothing to Collect (3 entries)        £37.50
ROI Two Day Parcels (2 shipments)     £34.40
Fourth Party Collection               £15.00
Clearance (2 entries)                 £20.00
Non Coms Handling                     £6.75
Congestion                            £6.65

These surcharges are percentage-based fees applied to the invoice total — they never appear in per-item "Amount" columns on pages 2-32. This is why raw line-item extraction tops out at ~86% (£11,796) before summary extraction adds the remaining ~14%.
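The arithmetic behind the ~86% ceiling checks out against the figures above:

```python
# Reconcile the stated figures: the surcharge rows sum to £1,856.63, and
# the remaining line-item base works out to roughly 86% of the invoice.
surcharges = [1548.83, 187.50, 37.50, 34.40, 15.00, 20.00, 6.75, 6.65]
surcharge_total = sum(surcharges)             # £1,856.63

expected_total = 13653.10                     # full invoice
line_item_base = expected_total - surcharge_total   # ≈ £11,796.47
coverage = line_item_base / expected_total          # ≈ 0.864, i.e. ~86%
```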

Why GPT-5.4-mini Fails at Scale

GPT-5.4-mini was the 2-page spot test champion (92/92 items, fastest, perfect accuracy). At full-invoice scale it non-deterministically drops items.

The failure mode is early stopping — the model simply stops generating the JSON array partway through, returning valid JSON for the items it did produce. This happens randomly across batches and is not reproducible on the same batch. For production use, this is disqualifying.
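Because the truncated output is still valid JSON, it parses silently; the only reliable signal is an item count below what the batch should contain. One way to guard against this, sketched here with placeholder names (`extract_batch` and the expected row count would come from the pipeline's Docling tables):

```python
# Hedged sketch of a truncation guard: compare extracted item count to the
# row count known for the batch, and retry when the model stopped short.
import json

def extract_with_retry(extract_batch, batch, expected_rows: int,
                       max_retries: int = 3) -> list:
    items: list = []
    for _ in range(max_retries):
        # The early-stopping failure returns valid JSON for a partial array,
        # so json.loads succeeds even when items are missing.
        items = json.loads(extract_batch(batch))
        if len(items) >= expected_rows:
            return items
    raise RuntimeError(
        f"extraction truncated: {len(items)}/{expected_rows} rows "
        f"after {max_retries} attempts")
```

Since the failure is non-deterministic, a retry often succeeds; the hard part is knowing the expected row count, which is why having Docling's table structure alongside the vision extraction pays off.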

Cost Comparison

Costs below are actual measured figures from the full 36-page benchmark (line-item extraction plus the shared Sonnet summary).

Gemini 3 Flash delivers 100% accuracy at $0.18 — that's 21x cheaper than Sonnet with identical results.
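A back-of-envelope check of the stated ratios. Only the $0.18 Gemini figure is a measured number from the benchmark; the other costs here are merely what the quoted 21x and 6.5x multiples imply:

```python
# Implied per-run costs from the stated ratios (not measured figures).
gemini_cost = 0.18                   # measured: Gemini 3 Flash, full invoice
sonnet_implied = gemini_cost * 21    # "21x cheaper than Sonnet" → ≈ $3.78
mini_implied = gemini_cost * 6.5     # "6.5x more" for GPT-5.4-mini → ≈ $1.17
```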

Recommendation

Do not use GPT-5.4-mini for production extraction. Despite strong 2-page performance, it has a non-deterministic early stopping bug that silently drops items at scale. It costs 6.5x more than Gemini 3 Flash for worse accuracy.

Summary Extraction Is Non-Negotiable

The line-item model (any of them) can only extract base charges from per-item rows. Surcharges like Fuel & Energy (12.15%) exist only as aggregate totals on summary pages. Without the Sonnet summary step, every model tops out at ~86% regardless of how accurately it reads individual rows.

Key learning: The 86% accuracy ceiling was never a model quality issue — it was a missing pipeline step. Adding summary extraction took all models from ~86% to 95-100%.