Spot test comparing 5 vision models on 2 pages of a DPD invoice, plus a Claude native PDF reading baseline. Goal: find the best price/accuracy trade-off for production vision-based extraction.
Test: Pages 5-6 of DPD invoice 3006995.I61645007.pdf — 92 line items of dense tabular shipment data.
## Test Setup
All models received identical input: 2 PNG images rendered at 1.5× scale from the docling server's /render-pages endpoint, plus the same structured extraction prompt targeting the MASTER_SCHEMA_PROMPT format.
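The render step can be sketched as below. The `/render-pages` endpoint and the 1.5× scale come from the setup above; the server address, field names, and response shape (`{"pages": ["<base64 png>", ...]}`) are assumptions about the docling server's contract, not its documented API.

```python
import base64
import json
from urllib import request

DOCLING_URL = "http://localhost:8000"  # assumed docling server address

def build_render_payload(pages: list[int], scale: float = 1.5) -> dict:
    """Query payload for /render-pages. Field names ("pages", "scale")
    are assumptions about the endpoint's contract."""
    return {"pages": ",".join(str(p) for p in pages), "scale": scale}

def render_pages(pdf_bytes: bytes, pages: list[int], scale: float = 1.5) -> list[bytes]:
    """POST the PDF to /render-pages and decode the returned PNGs.
    Assumed response shape: {"pages": ["<base64 png>", ...]}."""
    payload = build_render_payload(pages, scale)
    url = f"{DOCLING_URL}/render-pages?pages={payload['pages']}&scale={payload['scale']}"
    req = request.Request(url, data=pdf_bytes, headers={"Content-Type": "application/pdf"})
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return [base64.b64decode(p) for p in body["pages"]]
```

The same two decoded PNGs would then be attached to each model's request alongside the shared extraction prompt, keeping the input identical across all five models.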
## Results
Three models produced identical results (92 items, £652.55) — a strong consensus that this is the correct answer. Two models fell short.
## Baseline Comparison
A Claude Sonnet subagent reading the same pages via native PDF vision (not rendered PNGs) reported 92 items, £598.60 — a £53.95 shortfall. The 3-model consensus at £652.55 is more trustworthy: three independently trained models (Anthropic, OpenAI, Qwen) receiving identical PNG input all agree.
## Analysis
### The Winners
GPT-5.4-mini is the standout: perfect accuracy, fastest (42s), and moderate cost ($0.075/batch). Projected for a full 36-page invoice:
| Metric | Sonnet 4.6 (proven) | GPT-5.4-mini (projected) | Savings |
|---|---|---|---|
| Accuracy | £13,653.10 exact | Expected same | — |
| Time | ~10.5 min | ~3.5 min | 3× faster |
| Cost | $3.45 | ~$1.15 | 67% cheaper |
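The savings column follows directly from the projected totals; a quick check of the ratios:

```python
# Projected full-invoice figures from the comparison table above.
sonnet_cost, mini_cost = 3.45, 1.15   # cost in USD
sonnet_min, mini_min = 10.5, 3.5      # time in minutes

cost_saving = 1 - mini_cost / sonnet_cost   # fraction cheaper
speedup = sonnet_min / mini_min             # times faster

print(f"{cost_saving:.0%} cheaper, {speedup:.0f}x faster")  # 67% cheaper, 3x faster
```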
Qwen 3.5-35b is the budget option: perfect accuracy at $0.014/batch (cheapest of all), but 2.4× slower than GPT-5.4-mini.
### The Failures
GPT-5.4-nano skipped 13 of the 92 rows, capturing only 86% of items. The model is too small for dense tabular vision; not viable.
Qwen 3.5-flash found all 92 rows but misread amounts, ending up £109 short (83% accuracy). The speed-optimized variant sacrifices precision on small numbers.
## Why 3-Model Consensus Matters
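The agreement check used above can be sketched as a simple majority vote over the extracted totals. The model names and the £652.55 consensus come from this test; the Qwen 3.5-flash figure is inferred from "£109 short" and is approximate, and treating GPT-5.4-nano's truncated output as non-comparable is a judgment call.

```python
from collections import Counter

# Extracted grand totals per model from this spot test (GBP).
totals = {
    "claude-sonnet-4.6": 652.55,
    "gpt-5.4-mini": 652.55,
    "qwen-3.5-35b": 652.55,
    "qwen-3.5-flash": 543.55,   # approx: ~£109 short of the consensus
    "gpt-5.4-nano": None,       # skipped 13 rows; total not comparable
}

def consensus(values, min_agree=3):
    """Return the most common value if at least min_agree models
    reported it, else None (no trustworthy consensus)."""
    counts = Counter(v for v in values if v is not None)
    value, n = counts.most_common(1)[0]
    return value if n >= min_agree else None

print(consensus(totals.values()))  # → 652.55
```

Because the three agreeing models were trained by three different vendors on identical input, a shared systematic misreading is unlikely, which is what makes the vote meaningful.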
## Cost Projection — Full Invoice (36 pages)
At 6 pages per batch (6 batches total), projected costs for the full DPD invoice:
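As a sketch of the projection arithmetic, using the Qwen 3.5-35b per-batch cost measured above; note that applying a 2-page-batch cost to 6-page batches is an assumption, so treat the result as a floor rather than a precise figure.

```python
import math

PAGES = 36
PAGES_PER_BATCH = 6
batches = math.ceil(PAGES / PAGES_PER_BATCH)   # 6 batches for the full invoice

qwen_cost_per_batch = 0.014                    # USD, measured in the spot test
projected_cost = batches * qwen_cost_per_batch
print(batches, f"${projected_cost:.2f}")       # 6 $0.08
```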
## Recommendation
For production: Use GPT-5.4-mini as the primary vision model. It's the sweet spot — perfect accuracy, fastest response time, and 67% cheaper than Sonnet.
Fallback: Keep Claude Sonnet 4.6 in reserve for cases where OpenAI has availability issues. Same accuracy, just slower and pricier.
Budget option: Qwen 3.5-35b at $0.08/invoice is remarkable value, but at 101s per batch the 6 batches take roughly 10 minutes of sequential processing per invoice. Could work for background/batch processing where latency doesn't matter.
Next step: Run GPT-5.4-mini on the full 36-page invoice to confirm the projection holds at scale before integrating into the transform endpoint.