Vision Model Benchmark — Multi-Model Comparison

Spot test comparing 5 vision models on 2 pages of a DPD invoice, plus a Claude native PDF reading baseline. Goal: find the best price/accuracy trade-off for production vision-based extraction.

Test: Pages 5-6 of DPD invoice 3006995.I61645007.pdf — 92 line items of dense tabular shipment data.

Test Setup

All models received identical input: 2 PNG images rendered at 1.5× scale from the docling server's /render-pages endpoint, plus the same structured extraction prompt targeting the MASTER_SCHEMA_PROMPT format.
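Concretely, every model saw the same request body. A minimal sketch of how such a payload can be assembled, assuming an OpenAI-style chat format with inline base64 images (the helper name and message shape are illustrative, not the pipeline's actual code):

```python
import base64

def build_vision_messages(pngs: list[bytes], prompt: str) -> list[dict]:
    """Pack the extraction prompt plus rendered page images into one
    user message, OpenAI-style. Every model in the spot test received
    the same two PNGs and the same MASTER_SCHEMA_PROMPT-format prompt."""
    content: list[dict] = [{"type": "text", "text": prompt}]
    for png in pngs:
        b64 = base64.b64encode(png).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    return [{"role": "user", "content": content}]
```

Holding this payload constant across models is what makes the comparison fair: any difference in output is attributable to the model, not the input.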

Results

Three models produced identical results (92 items, £652.55) — a strong consensus that this is the correct answer. Two models fell short.

Baseline Comparison

A Claude Sonnet subagent reading the same pages via native PDF vision (not rendered PNGs) reported 92 items, £598.60 — a £53.95 shortfall. The 3-model consensus at £652.55 is more trustworthy: three independently-trained models (Anthropic, OpenAI, Qwen) receiving identical PNG input all agree.

Analysis

The Winners

GPT-5.4-mini is the standout: perfect accuracy, fastest (42s), and moderate cost ($0.075/batch). Projected for a full 36-page invoice:

Metric     Sonnet 4.6 (proven)    GPT-5.4-mini (projected)    Savings
Accuracy   £13,653.10 exact       Expected same               n/a
Time       ~10.5 min              ~3.5 min                    3× faster
Cost       $3.45                  ~$1.15                      67% cheaper
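The Savings column is simple arithmetic on the two middle columns:

```python
def speedup(baseline_minutes: float, candidate_minutes: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_minutes / candidate_minutes

def cost_saving_pct(baseline_cost: float, candidate_cost: float) -> float:
    """Percentage saved by switching from the baseline to the candidate."""
    return (1 - candidate_cost / baseline_cost) * 100

speedup(10.5, 3.5)           # 3.0  -> "3× faster"
cost_saving_pct(3.45, 1.15)  # 66.7 -> "67% cheaper"
```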

Qwen 3.5-35b is the budget option: perfect accuracy at $0.014/batch (cheapest of all), but 2.4× slower than GPT-5.4-mini.

The Failures

GPT-5.4-nano skipped 13 rows (79 of 92 items, 86%) — too small a model for dense tabular vision. Not viable.

Qwen 3.5-flash found all 92 rows but misread amounts, ending up £109 short (83% accuracy). The speed-optimized variant sacrifices precision on small numbers.

Why 3-Model Consensus Matters

Three independently trained models from three different vendors (Anthropic, OpenAI, Qwen) received identical PNG input and converged on the same 92 items and £652.55 total. Unrelated models are unlikely to make matching errors, so agreement is strong evidence the extraction is correct; a single dissenting reading, like the native-PDF baseline's £598.60, is the likelier error.

Cost Projection — Full Invoice (36 pages)

At 6 pages per batch (6 batches total), the per-invoice figures cited in the recommendation below scale directly from each model's measured per-batch cost.
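A sketch of that scaling, assuming the measured per-batch cost carries over to 6-page batches (token usage will not scale perfectly linearly, so treat these as rough projections, not quotes):

```python
import math

def batch_count(total_pages: int, pages_per_batch: int = 6) -> int:
    """Number of vision calls needed; a partial final batch still counts."""
    return math.ceil(total_pages / pages_per_batch)

def project_invoice_cost(per_batch_cost: float, total_pages: int = 36) -> float:
    """Naive linear projection from per-batch cost to the full invoice."""
    return per_batch_cost * batch_count(total_pages)

project_invoice_cost(0.014)  # ~$0.08/invoice for Qwen 3.5-35b
```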

Recommendation

For production: Use GPT-5.4-mini as the primary vision model. It's the sweet spot — perfect accuracy, fastest response time, and 67% cheaper than Sonnet.

Fallback: Keep Claude Sonnet 4.6 on standby for OpenAI availability issues. Same accuracy, just slower and pricier.

Budget option: Qwen 3.5-35b at $0.08/invoice is remarkable value, but the 101s latency per batch means ~17 minutes for a full invoice. Could work for background/batch processing where latency doesn't matter.
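The primary/fallback/budget tiers above amount to a priority chain. A hypothetical sketch (the model IDs and `call_model` signature are illustrative, not the transform endpoint's real interface):

```python
# Priority order: primary, fallback, budget.
MODEL_CHAIN = ["gpt-5.4-mini", "claude-sonnet-4.6", "qwen-3.5-35b"]

def extract_with_fallback(call_model, pages):
    """Try each model in priority order; return (model_used, result).

    call_model(model_id, pages) should return a parsed extraction or
    raise on availability/rate-limit errors.
    """
    failures = {}
    for model in MODEL_CHAIN:
        try:
            return model, call_model(model, pages)
        except Exception as exc:
            failures[model] = exc
    raise RuntimeError(f"all vision models failed: {failures}")
```

The chain only pays the latency and cost of a lower tier when a higher tier actually fails, so the common case stays on the cheapest accurate model.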

Next step: Run GPT-5.4-mini on the full 36-page invoice to confirm the projection holds at scale before integrating into the transform endpoint.
