Vision Extraction Debug — Where GPT-5.4-mini Fails
A detailed investigation into why GPT-5.4-mini produces correct results in isolated tests but overcounts totals by 300-460% when run through the full pipeline on the 36-page DPD invoice.
The Paradox
Individual batch tests return perfect results. The full pipeline overcounts by 4-5×. Same model, same prompt template, same headers.
The Exact Prompt (pages 5-6 batch)
This is the verbatim prompt sent to GPT-5.4-mini for the pages 5-6 batch, captured from the production pipeline:
You are extracting line items from a DPD courier invoice (pages 5-6).
The tables on these pages:
Columns: Collection Date | Consignment Number | Reference | Collection Details |
Delivery Details | Service | Parcels | Surcharge | Weight (Kg) | Vat Code | Amount
Field mapping: trackingNumber ← 'Consignment Number' (column 2);
shipmentDate ← 'Collection Date' (column 1);
totalAmount ← 'Amount' (column 11, rightmost monetary values);
serviceType ← 'Service' (column 6);
description ← 'Delivery Details' (column 5)
Extract EVERY line item visible in these invoice pages.
Each row in the tables represents a shipment or charge.
Return a JSON array of objects matching this schema:
{
description: string,
quantity: number,
totalAmount: number, // Total cost for this line
category: "base_service" | "surcharge" | "collection" | "tax" | "discount" | "regional",
carrier?: string,
trackingNumber?: string,
shipmentDate?: string, // YYYY-MM-DD
subcategory?: string,
serviceType?: string,
weightBand?: string,
destinationCountry?: string,
destinationRegion?: string,
isSurcharge: boolean,
isDiscount: boolean,
confidence: number // 0-1
}
Rules:
- Extract ALL rows — do not skip any, do not summarize
- Map table columns to schema fields based on their names
- totalAmount should be the charge/amount column (not weight, not quantity)
- If a page is a summary page or header page with no line items, return []
Return ONLY the JSON array, no markdown fences, no explanation.
The prompt includes:
- Correct column headers (extracted via vision header correction — one cheap call per layout)
- Explicit field mappings (totalAmount ← 'Amount', column 11, rightmost monetary values)
- Full 13-field MASTER_SCHEMA_PROMPT
- Clear rules about what totalAmount means
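To make the structure concrete, the per-batch prompt could be assembled roughly like this. This is a sketch: the interface and function names are illustrative assumptions, not the production pipeline's actual identifiers.

```typescript
// Sketch of per-batch prompt assembly from vision-corrected layout info.
// All names here are assumptions, not the pipeline's real code.
interface LayoutInfo {
  headers: string[];                 // vision-corrected column names
  fieldMap: Record<string, string>;  // schema field -> source column name
}

function buildBatchPrompt(layout: LayoutInfo, pageRange: string, schema: string): string {
  const mappings = Object.entries(layout.fieldMap)
    .map(([field, col]) => `${field} <- '${col}'`)
    .join("; ");
  return [
    `You are extracting line items from a DPD courier invoice (pages ${pageRange}).`,
    `Columns: ${layout.headers.join(" | ")}`,
    `Field mapping: ${mappings}`,
    "Extract EVERY line item visible in these invoice pages.",
    schema,
    "Return ONLY the JSON array, no markdown fences, no explanation.",
  ].join("\n");
}
```

The point of isolating this as a pure function is that the captured prompt can be byte-compared between the isolated test and the pipeline run, which is how the "identical prompts" claim below was verified.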
What the Model Sees
The model receives 2 PNG images of invoice pages rendered at 1.5× scale. Each page shows a dense table with ~46 rows. The table has these visible columns:
Collection Consignment Reference Collection Delivery Service Parcels Surcharge Weight Vat Amount
Date Number Details Details (Kg) Code
────────────────────────────────────────────────────────────────────────────────────────────────────────────
26/10/2025 6951848404/0 #771901 Ipswich KT20 NEXT DAY 1 AS 4.0 S 4.42
26/10/2025 6951848489/0 #772061 Ipswich NG5 NEXT DAY 1 AS 4.0 S 5.47
26/10/2025 6951848533/0 #771954 Ipswich PO20 NEXT DAY 1 AS 4.0 S 5.20
The key challenge: Amount (£4.42) and Weight (4.0) are adjacent numeric columns with similar-looking values.
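Because the two columns are this easy to confuse, a cheap post-extraction sanity check is to flag rows where the extracted amount exactly equals the weight value, which is a strong signal of column slippage. A sketch, assuming the schema fields above; the check itself is a suggestion, not something the pipeline currently does:

```typescript
interface ExtractedItem {
  trackingNumber?: string;
  totalAmount: number;
  weightBand?: string; // e.g. "4.0"
}

// Flag items whose totalAmount matches the parsed weight value --
// a likely sign the model read the Weight column as Amount.
function flagWeightConfusion(items: ExtractedItem[]): ExtractedItem[] {
  return items.filter((it) => {
    const weight = parseFloat(it.weightBand ?? "");
    return Number.isFinite(weight) && Math.abs(it.totalAmount - weight) < 0.005;
  });
}
```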
What It Returns (pages 5-6, isolated test)
Correct. Every item matches the Sonnet 4.6 POC ground truth:
| Tracking | Mini Amount | POC Amount | Match |
|---|---|---|---|
| 6951848404/0 | £24.37 | £24.37 | ✓ |
| 6951848489/0 | £5.47 | £5.47 | ✓ |
| 6951848533/0 | £5.20 | £5.20 | ✓ |
| 6951848541/0 | £6.55 | £6.55 | ✓ |
| 6951848549/0 | £5.74 | £5.74 | ✓ |
| ... | ... | ... | all ✓ |
92 items, £652.55 — penny-perfect match.
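The penny-perfect comparison can be reproduced with a small diff in integer pence, which avoids floating-point noise when comparing totals. A sketch, with the item shape following the schema above:

```typescript
interface LineItem {
  trackingNumber?: string;
  totalAmount: number;
}

// Compare two extractions by tracking number, in integer pence.
// Returns the tracking numbers whose amounts disagree.
function diffExtractions(a: LineItem[], b: LineItem[]): string[] {
  const pence = (x: number) => Math.round(x * 100);
  const byTracking = new Map(
    b.filter((i) => i.trackingNumber).map((i) => [i.trackingNumber!, pence(i.totalAmount)]),
  );
  return a
    .filter((i) => i.trackingNumber && byTracking.get(i.trackingNumber!) !== pence(i.totalAmount))
    .map((i) => i.trackingNumber!);
}
```

Running this between the mini extraction and the Sonnet 4.6 POC ground truth is what "penny-perfect" means above: an empty diff.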
What Goes Wrong at Scale
When the same prompt runs across 18 batches (36 pages ÷ 2), the pipeline logs show:
Every batch: headers=yes, mapping=yes
Pages 5-6 response: 92 items, £652.55 ← CORRECT
Full pipeline total: 1,432 items, £76,335 ← 467% overcount
The per-batch header injection is confirmed working. Every batch receives correct headers and field mappings. Yet the total is 4.7× too high.
Diagnosing the Overcount
Isolated batch tests on the last pages:
| Pages | Items | Total | Avg/item | Status |
|---|---|---|---|---|
| 29-30 | 92 | £808 | £8.78 | ✓ Normal |
| 31-32 | 77 | £1,131 | £14.69 | ✓ Offshore (higher rates) |
| 33-34 | 14 | £1,857 | £132.62 | ⚠ Surcharge summaries |
| 35-36 | 0 | £0 | — | ✓ Correctly empty |
Page 33 is a known problem — the model reads surcharge summary totals (e.g. "Fuel and Energy Charge: £1,385.32") as individual line items. But that only accounts for ~£1,900 of overcounting.
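One mitigation for the page-33 failure mode is an outlier guard: amounts far above the batch median are far more likely to be summary totals than shipment rows. A minimal sketch; the 10× threshold is an assumption chosen for illustration, not a tuned value:

```typescript
// Flag probable summary-row misreads: amounts far above the batch median.
// On pages 5-6 the per-item average is ~£7, so a £1,385.32 "item" stands out.
function flagSummaryOutliers(amounts: number[], factor = 10): number[] {
  const sorted = [...amounts].sort((x, y) => x - y);
  const median = sorted[Math.floor(sorted.length / 2)];
  return amounts.filter((a) => a > median * factor);
}
```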
The remaining £61,000 of overcount has no visible source in isolated tests. Every batch tested individually returns correct or near-correct results.
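A diagnostic worth running on the full-pipeline output is a duplicate check: if different batches re-extract the same rows (overlapping page ranges, provider-side retries), the same tracking numbers will appear more than once, inflating both the item count and the total. A sketch:

```typescript
// Count surplus items that share a tracking number with an earlier item.
// Duplicated rows across batches would inflate both item count and total.
function countDuplicates(trackingNumbers: string[]): number {
  const seen = new Map<string, number>();
  for (const t of trackingNumbers) seen.set(t, (seen.get(t) ?? 0) + 1);
  let dupes = 0;
  for (const n of seen.values()) if (n > 1) dupes += n - 1;
  return dupes;
}
```

If the 1,432 extracted items contain heavy tracking-number repetition, the missing £61,000 has a mechanical explanation rather than a hallucination one.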
The Core Mystery
All Configurations Tested
* "Overfitted manual" means hardcoded DPD-specific column names in the prompt — not generalizable to other couriers.
What Works vs What Doesn't
Hypothesis: Concurrent Request Inconsistency
The most likely explanation is that GPT-5.4-mini via OpenRouter behaves differently under concurrent load. Evidence:
- Sequential isolated test (1 batch at a time): Perfect accuracy
- Pipeline with concurrency=3 (3 batches in flight): 300-460% overcount
- Both use identical prompts — captured and verified
OpenRouter routes to multiple provider backends. Under concurrent load, different batches may hit different model instances (different quantization levels, different hardware). A single sequential test always hits the same warm instance.
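To test this hypothesis cleanly, the batch runner needs concurrency as a parameter so the identical pipeline can run with 1 or 3 batches in flight. A minimal, dependency-free limiter sketch (the function name is an assumption, not the pipeline's existing runner):

```typescript
// Run async tasks with at most `limit` in flight, preserving result order.
// With limit=1 this degrades to the sequential isolated-test behavior.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: single-threaded event loop, no await before increment
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

Running all 18 batches through `mapWithConcurrency(batches, 1, extract)` versus `limit = 3` with everything else held constant would isolate the concurrency variable directly.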
Recommendations
- For production accuracy: Use Sonnet 4.6 for vision fallback ($3.45, proven 100%)
- For cost optimization: Rely on the table path with Phase 3 header correction (95-99% on DPD, $0.08)
- Vision header correction is valuable regardless — it produces perfect column names and field mappings that benefit any downstream model
- Test concurrency=1 with GPT-5.4-mini before ruling it out — the issue may be purely a concurrent routing problem
Next Steps
- Test GPT-5.4-mini with concurrency=1 (sequential batches) through the pipeline
- Test Sonnet 4.6 with correct headers + field mappings (should be even cheaper than the POC since headers reduce confusion)
- Consider a hybrid: GPT-5.4-mini for header correction, Sonnet for extraction