Vision Model Benchmark — Multi-Model Comparison

Spot test comparing 5 vision models on 2 pages of a DPD invoice, plus a Claude native PDF reading baseline. Goal: find the best price/accuracy trade-off for production vision-based extraction.

Test: Pages 5-6 of DPD invoice 3006995.I61645007.pdf — 92 line items of dense tabular shipment data.

Test Setup

All models received identical input: 2 PNG images rendered at 1.5× scale from the docling server's /render-pages endpoint, plus the same structured extraction prompt targeting the MASTER_SCHEMA_PROMPT format.
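Concretely, every model saw the same request body. A minimal sketch of how such a payload can be assembled, assuming an OpenAI-style chat format with inline base64 images (the helper name and message shape are illustrative, not the pipeline's actual code):

```python
import base64

def build_vision_messages(pngs: list[bytes], prompt: str) -> list[dict]:
    """Pack the extraction prompt plus rendered page images into one
    user message, OpenAI-style. Every model in the spot test received
    the same two PNGs and the same MASTER_SCHEMA_PROMPT-format prompt."""
    content: list[dict] = [{"type": "text", "text": prompt}]
    for png in pngs:
        b64 = base64.b64encode(png).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return [{"role": "user", "content": content}]
```

Holding this payload constant across models is what makes the comparison fair: any difference in output is attributable to the model, not the input.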

Results

Three models produced identical results (92 items, £652.55) — a strong consensus that this is the correct answer. Two models fell short.

Baseline Comparison

A Claude Sonnet subagent reading the same pages via native PDF vision (not rendered PNGs) reported 92 items, £598.60 — a £53.95 shortfall. The 3-model consensus at £652.55 is more trustworthy: three independently-trained models (Anthropic, OpenAI, Qwen) receiving identical PNG input all agree.

Analysis

The Winners

GPT-5.4-mini is the standout: perfect accuracy, fastest (42s), and moderate cost ($0.075/batch). Projected for a full 36-page invoice:

Metric     Sonnet 4.6 (proven)    GPT-5.4-mini (projected)    Savings
Accuracy   £13,653.10 exact       Expected same               n/a
Time       ~10.5 min              ~3.5 min                    3× faster
Cost       $3.45                  ~$1.15                      67% cheaper
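The Savings column is simple arithmetic on the two middle columns:

```python
def speedup(baseline_minutes: float, candidate_minutes: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_minutes / candidate_minutes

def cost_saving_pct(baseline_cost: float, candidate_cost: float) -> float:
    """Percentage saved by switching from the baseline to the candidate."""
    return (1 - candidate_cost / baseline_cost) * 100

speedup(10.5, 3.5)           # 3.0  -> "3× faster"
cost_saving_pct(3.45, 1.15)  # 66.7 -> "67% cheaper"
```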

Qwen 3.5-35b is the budget option: perfect accuracy at $0.014/batch (cheapest of all), but 2.4× slower than GPT-5.4-mini.

The Failures

GPT-5.4-nano skipped 13 rows (79 of 92 items, 86%) — too small a model for dense tabular vision. Not viable.

Qwen 3.5-flash found all 92 rows but misread amounts, ending up £109 short (83% accuracy). The speed-optimized variant sacrifices precision on small numbers.

Why 3-Model Consensus Matters

Three independently trained models from three different vendors (Anthropic, OpenAI, Qwen) received identical PNG input and converged on the same 92 items and £652.55 total. Unrelated models are unlikely to make matching errors, so agreement is strong evidence the extraction is correct; a single dissenting reading, like the native-PDF baseline's £598.60, is the likelier error.

Cost Projection — Full Invoice (36 pages)

At 6 pages per batch (6 batches total), the per-invoice figures cited in the recommendation below scale directly from each model's measured per-batch cost.
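A sketch of that scaling, assuming the measured per-batch cost carries over to 6-page batches (token usage will not scale perfectly linearly, so treat these as rough projections, not quotes):

```python
import math

def batch_count(total_pages: int, pages_per_batch: int = 6) -> int:
    """Number of vision calls needed; a partial final batch still counts."""
    return math.ceil(total_pages / pages_per_batch)

def project_invoice_cost(per_batch_cost: float, total_pages: int = 36) -> float:
    """Naive linear projection from per-batch cost to the full invoice."""
    return per_batch_cost * batch_count(total_pages)

project_invoice_cost(0.014)  # ~$0.08/invoice for Qwen 3.5-35b
```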

Recommendation

For production: Use GPT-5.4-mini as the primary vision model. It's the sweet spot — perfect accuracy, fastest response time, and 67% cheaper than Sonnet.

Fallback: Keep Claude Sonnet 4.6 on standby for OpenAI availability issues. Same accuracy, just slower and pricier.

Budget option: Qwen 3.5-35b at $0.08/invoice is remarkable value, but the 101s latency per batch means ~17 minutes for a full invoice. Could work for background/batch processing where latency doesn't matter.
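The primary/fallback/budget tiers above amount to a priority chain. A hypothetical sketch (the model IDs and `call_model` signature are illustrative, not the transform endpoint's real interface):

```python
# Priority order: primary, fallback, budget.
MODEL_CHAIN = ["gpt-5.4-mini", "claude-sonnet-4.6", "qwen-3.5-35b"]

def extract_with_fallback(call_model, pages):
    """Try each model in priority order; return (model_used, result).

    call_model(model_id, pages) should return a parsed extraction or
    raise on availability/rate-limit errors.
    """
    failures = {}
    for model in MODEL_CHAIN:
        try:
            return model, call_model(model, pages)
        except Exception as exc:
            failures[model] = exc
    raise RuntimeError(f"all vision models failed: {failures}")
```

The chain only pays the latency and cost of a lower tier when a higher tier actually fails, so the common case stays on the cheapest accurate model.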

Next step: Run GPT-5.4-mini on the full 36-page invoice to confirm the projection holds at scale before integrating into the transform endpoint.
