Vision Pipeline Findings — Root Cause & Fix

By Patrick McCurley · Created Mar 19, 2026

The Root Cause

The 300-460% overcount was NOT model hallucination. It was summary pages being extracted as line items.

What Was Happening

The 36-page DPD invoice has three types of pages:

Pages  | Type       | Content                                            | Count
2-32   | Line items | Individual shipment charges (£4-25 each)           | ~1,400 rows
1      | Summary    | Invoice overview: "Consignments £29,579"           | Totals only
33-35  | Summary    | Surcharge breakdown, VAT analysis, payment summary | Aggregate totals
36     | Empty      | Blank page                                         | Nothing
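The "distinct table layouts" referenced later can be derived by grouping pages on their extracted header row. A minimal sketch, assuming headers have already been pulled from each page (the function and data shapes are assumptions, not the pipeline's actual code):

```python
from collections import defaultdict

def group_by_layout(page_headers: dict[int, tuple[str, ...]]) -> dict[tuple[str, ...], list[int]]:
    """Group pages whose extracted header row is identical, so the
    classifier only needs one representative per distinct layout."""
    layouts: dict[tuple[str, ...], list[int]] = defaultdict(list)
    for page, headers in sorted(page_headers.items()):
        layouts[headers].append(page)
    return dict(layouts)
```

With 36 pages this collapses to a handful of layouts, which is what keeps the classification call cheap.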

When all pages were sent to the vision model, pages 1 and 33-35 produced items like:

"Fuel and Energy Charge"    → totalAmount: £1,385.32  (surcharge TOTAL, not a line item)
"Consignments"              → totalAmount: £29,579.00 (invoice TOTAL)
"VAT"                       → totalAmount: £7,540.06  (VAT TOTAL)

These aggregate amounts were summed alongside the real per-shipment charges (£4.42, £5.20, etc.), inflating the total by 3-5×.
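A toy calculation using the amounts quoted above makes the failure mode concrete: just three leaked summary rows swamp the real per-shipment charges.

```python
# Real per-shipment charges (from the examples above)
line_items = [4.42, 5.20]

# Summary-page rows wrongly extracted as line items:
# fuel surcharge TOTAL, invoice TOTAL, VAT TOTAL
summary_rows = [1385.32, 29579.00, 7540.06]

true_total = sum(line_items)
inflated_total = sum(line_items + summary_rows)

print(f"true: £{true_total:.2f}, inflated: £{inflated_total:.2f}")
# The aggregate rows dominate the sum entirely.
```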

The Fix: LLM Page Classification

One cheap LLM call (~$0.001) classifies each distinct table layout as "line_items" or "summary":

What the LLM receives

I have a courier invoice PDF with 36 pages. Here are the distinct table layouts:

Layout 1 (pages 2-32, 1420 rows):
  Headers: Collection Date | Consignment Number | Reference | ...

Layout 2 (page 1, 11 rows):
  Headers: Current Charges | Invoice Number 61645007

Layout 3 (page 33, 4 rows):
  Headers: Code | Description | Surcharge Rate Code | ...

Layout 4 (page 34, 2 rows):
  Headers: Number | Carriage | Miscellaneous Charges | ...

Layout 5 (page 35, 2 rows):
  Headers: Payment Reference | Document Type | ...

Classify each as "line_items" or "summary".

What it returns

[
  {"layout": 1, "type": "line_items"},
  {"layout": 2, "type": "summary"},
  {"layout": 3, "type": "summary"},
  {"layout": 4, "type": "summary"},
  {"layout": 5, "type": "summary"}
]

Pages 1, 33, 34, 35 excluded. 32 data pages re-batched into 16 batches of 2.
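In code, the classify-then-filter step might look like the following sketch. The `call_llm` callable, the layout bookkeeping, and the JSON reply shape are assumptions modeled on the prompt and response shown above, not the pipeline's actual interfaces:

```python
import json

def classify_layouts(call_llm, layout_summaries: str) -> dict[int, str]:
    """One cheap LLM call mapping layout id -> 'line_items' | 'summary'."""
    prompt = (
        "I have a courier invoice PDF. Here are the distinct table layouts:\n"
        f"{layout_summaries}\n"
        'Classify each as "line_items" or "summary". '
        'Reply as JSON: [{"layout": 1, "type": "line_items"}, ...]'
    )
    raw = call_llm(prompt)
    return {entry["layout"]: entry["type"] for entry in json.loads(raw)}

def data_pages(pages_by_layout: dict[int, list[int]], kinds: dict[int, str]) -> list[int]:
    """Keep only pages whose layout was classified as line_items."""
    keep = [page for layout_id, pages in pages_by_layout.items()
            if kinds.get(layout_id) == "line_items" for page in pages]
    return sorted(keep)

def rebatch(pages: list[int], size: int = 2) -> list[list[int]]:
    """Re-batch the surviving data pages for the vision model."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]
```

Summary pages never reach the vision model, so their aggregate rows can no longer leak into the extracted line items.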

Updated Vision Pipeline Architecture

Results After Fix

Metric             | Before (all pages)       | After (data pages only)
Batches processed  | 18                       | 16
Pages excluded     | 0                        | 4 (pages 1, 33, 34, 35)
Total items        | 1,432                    | 1,407
Total amount       | £76,335                  | £11,796
Accuracy           | 467%                     | 86.4%
Per-batch amounts  | ✗ 3 batches wildly wrong | ✓ All batches correct

Per-Batch Breakdown (after fix)

All batches now show reasonable per-item averages:

Pages  | Items | Total   | Avg/item
2-3    | 86    | £1,414  | £16.44 (By 10:30 service)
4-5    | 92    | £908    | £9.87
6-7    | 92    | £613    | £6.66
8-9    | 92    | £588    | £6.39
10-29  | 920   | ~£6,500 | ~£7
30-31  | 86    | £793    | £9.22 (offshore mix)
32+36  | 39    | £811    | £20.79 (offshore/tail)
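This sanity check is cheap to automate: flag any batch whose per-item average falls outside the expected per-shipment range. A sketch, where the £1-£50 bounds are assumptions to tune per carrier:

```python
def batch_looks_sane(total: float, n_items: int,
                     lo: float = 1.0, hi: float = 50.0) -> bool:
    """True if the batch's per-item average is in the plausible
    per-shipment range; a summary-page leak blows past the upper bound."""
    if n_items == 0:
        return False
    return lo <= total / n_items <= hi
```

Run against the per-batch figures above, every batch passes; a leaked invoice total (£29,579 as one "item") would fail immediately.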

Remaining Gap: 86% → 100%

The 14% undercount is NOT from wrong amounts — it's from the model extracting fewer items than exist:

  1. Early stopping — some batches return 50 items when 92 are visible. The model decides it's "done" mid-page and halts cleanly (finish_reason: stop at only 6,606 tokens).
  2. Missing offshore/ROI items — pages 31-32 have different service types (offshore surcharges) that the model sometimes skips.
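Early stopping is at least cheap to detect, because the row count per page is already known from layout detection. A minimal guard (the 0.9 threshold is an assumption):

```python
def needs_retry(extracted_items: int, expected_rows: int,
                threshold: float = 0.9) -> bool:
    """Flag a batch for retry when the model returned noticeably fewer
    items than the rows counted on its pages during layout detection."""
    return extracted_items < threshold * expected_rows
```

A batch that returns 50 of 92 visible rows gets flagged; small discrepancies from headers or continuation rows pass through.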

These are model-capability issues, not architectural problems. Sonnet 4.6 doesn't exhibit them (it previously hit 100% on this invoice). The architecture is now sound; the remaining question is whether to accept 86% from GPT-5.4-mini or spend $3.45 on Sonnet.

Summary

The "model hallucination" was actually a page classification bug. The model extracted amounts correctly — but summary/total pages containing aggregate values (£29K, £48K) were processed alongside individual shipment rows (£4-25), inflating the total by 4-5×.

The fix was one cheap LLM call (~$0.001) to classify which page layouts are summaries vs line items, then excluding summary pages from extraction batches. Total added cost: negligible.