Vision Extraction Debug — Where GPT-5.4-mini Fails

By Patrick McCurley · Created Mar 19, 2026

A detailed investigation into why GPT-5.4-mini produces correct results in isolated tests but inflates extracted amounts by 300-460% when run through the full pipeline on the 36-page DPD invoice.

The Paradox

Individual batch tests return perfect results. The full pipeline returns 4-5× overcounting. Same model, same prompt template, same headers.

The Exact Prompt (pages 5-6 batch)

This is the verbatim prompt sent to GPT-5.4-mini for the pages 5-6 batch, captured from the production pipeline:

You are extracting line items from a DPD courier invoice (pages 5-6).

The tables on these pages:
Columns: Collection Date | Consignment Number | Reference | Collection Details |
         Delivery Details | Service | Parcels | Surcharge | Weight (Kg) | Vat Code | Amount
Field mapping: trackingNumber ← 'Consignment Number' (column 2);
               shipmentDate ← 'Collection Date' (column 1);
               totalAmount ← 'Amount' (column 11, rightmost monetary values);
               serviceType ← 'Service' (column 6);
               description ← 'Delivery Details' (column 5)

Extract EVERY line item visible in these invoice pages.
Each row in the tables represents a shipment or charge.

Return a JSON array of objects matching this schema:
{
  description: string,
  quantity: number,
  totalAmount: number,          // Total cost for this line
  category: "base_service" | "surcharge" | "collection" | "tax" | "discount" | "regional",
  carrier?: string,
  trackingNumber?: string,
  shipmentDate?: string,        // YYYY-MM-DD
  subcategory?: string,
  serviceType?: string,
  weightBand?: string,
  destinationCountry?: string,
  destinationRegion?: string,
  isSurcharge: boolean,
  isDiscount: boolean,
  confidence: number            // 0-1
}

Rules:
- Extract ALL rows — do not skip any, do not summarize
- Map table columns to schema fields based on their names
- totalAmount should be the charge/amount column (not weight, not quantity)
- If a page is a summary page or header page with no line items, return []

Return ONLY the JSON array, no markdown fences, no explanation.
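The "JSON array only" contract still needs defensive parsing on the pipeline side, since models occasionally wrap output in markdown fences despite the instruction. A minimal parser for this contract might look like this (the helper name is my own, not from the pipeline):

```python
import json

def parse_line_items(raw: str) -> list[dict]:
    """Parse the model's response into a list of line-item dicts.

    The prompt asks for a bare JSON array, but models sometimes wrap
    output in markdown fences anyway, so strip them defensively.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Drop an opening fence like ```json, then a trailing ```
        text = text.split("\n", 1)[1] if "\n" in text else ""
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of line items")
    return items
```

A parse failure here surfaces as a hard error rather than a silently skipped batch, which matters when you are reconciling totals across 18 batches.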

The prompt includes:

  1. Correct column headers (extracted via vision header correction — one cheap call per layout)
  2. Explicit field mappings (totalAmount ← 'Amount' column 11, rightmost monetary values)
  3. Full 13-field MASTER_SCHEMA_PROMPT
  4. Clear rules about what totalAmount means
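The four ingredients above suggest the per-batch prompt is assembled from the cached header-correction output plus static schema and rules text. A sketch of that assembly, assuming the pipeline works roughly this way (function and parameter names are mine):

```python
def build_batch_prompt(page_start: int, page_end: int,
                       headers: list[str],
                       field_map: dict[str, str],
                       schema_prompt: str,
                       rules: str) -> str:
    """Assemble the per-batch extraction prompt.

    `headers` and `field_map` come from the one-off vision header
    correction call and are reused for every batch of the same layout.
    """
    mapping_lines = "\n".join(
        f"{field} <- '{column}'" for field, column in field_map.items()
    )
    return (
        f"You are extracting line items from a DPD courier invoice "
        f"(pages {page_start}-{page_end}).\n\n"
        f"Columns: {' | '.join(headers)}\n"
        f"Field mapping:\n{mapping_lines}\n\n"
        f"{schema_prompt}\n\n{rules}"
    )
```

Because the headers and mappings are injected identically into every batch, prompt construction can be ruled out as the source of per-batch divergence.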

What the Model Sees

The model receives 2 PNG images of invoice pages rendered at 1.5× scale. Each page shows a dense table with ~46 rows. The table has these visible columns:

Collection  Consignment  Reference  Collection  Delivery  Service    Parcels  Surcharge  Weight  Vat   Amount
Date        Number                  Details     Details                                  (Kg)    Code
────────────────────────────────────────────────────────────────────────────────────────────────────────────
26/10/2025  6951848404/0 #771901    Ipswich     KT20      NEXT DAY   1        AS         4.0     S     4.42
26/10/2025  6951848489/0 #772061    Ipswich     NG5       NEXT DAY   1        AS         4.0     S     5.47
26/10/2025  6951848533/0 #771954    Ipswich     PO20      NEXT DAY   1        AS         4.0     S     5.20

The key challenge: Amount (£4.42) and Weight (4.0) are adjacent numeric columns with similar-looking values.

What It Returns (pages 5-6, isolated test)

Correct. Every item matches the Sonnet 4.6 POC ground truth:

Tracking        Mini Amount   POC Amount   Match
6951848404/0    £24.37        £24.37       ✓
6951848489/0    £5.47         £5.47        ✓
6951848533/0    £5.20         £5.20        ✓
6951848541/0    £6.55         £6.55        ✓
6951848549/0    £5.74         £5.74        ✓
...             ...           ...          all ✓

92 items, £652.55 — penny-perfect match.
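A "penny-perfect" claim is only as good as the arithmetic behind it: summing binary floats over ~90 items can drift by a penny and either mask or fake a mismatch. A sketch of the reconciliation check using Decimal (helper name is mine):

```python
from decimal import Decimal

def totals_match(items: list[dict], expected_total: str) -> bool:
    """Check a batch's extracted amounts against a ground-truth total.

    Amounts are summed as Decimal so the comparison is exact to the
    penny; float accumulation error would make it unreliable.
    """
    total = sum(Decimal(str(item["totalAmount"])) for item in items)
    return total == Decimal(expected_total)
```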

What Goes Wrong at Scale

When the same prompt runs across 18 batches (36 pages ÷ 2), the pipeline logs show:

Every batch: headers=yes, mapping=yes
Pages 5-6 response: 92 items, £652.55     ← CORRECT
Full pipeline total: 1,432 items, £76,335  ← 467% overcount

The per-batch header injection is confirmed working. Every batch receives correct headers and field mappings. Yet the total is 4.7× too high.

Diagnosing the Overcount

Isolated batch tests on the last pages:

Pages   Items   Total    Avg/item   Status
29-30   92      £808     £8.78      ✓ Normal
31-32   77      £1,131   £14.69     ✓ Offshore (higher rates)
33-34   14      £1,857   £132.62    ⚠ Surcharge summaries
35-36   0       £0       —          ✓ Correctly empty

Page 33 is a known problem — the model reads surcharge summary totals (e.g. "Fuel and Energy Charge: £1,385.32") as individual line items. But that only accounts for ~£1,900 of overcounting.
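Summary rows like this are filterable after extraction: a row with no tracking number whose description matches a summary phrase is almost certainly a restated total, not a shipment. A sketch of that post-filter (the pattern list is a guess based on the DPD surcharge page and would need extending per courier):

```python
import re

# Phrases that mark invoice-level summary rows rather than shipments.
# Assumed labels based on the DPD surcharge summary page; extend as
# new summary phrasings appear.
SUMMARY_PATTERNS = re.compile(
    r"fuel and energy charge|total|summary|carried forward",
    re.IGNORECASE,
)

def drop_summary_rows(items: list[dict]) -> list[dict]:
    """Filter out rows that restate page or invoice totals.

    Keep a row if it has a tracking number, or if its description does
    not match any known summary phrase.
    """
    return [
        item for item in items
        if item.get("trackingNumber")
        or not SUMMARY_PATTERNS.search(item.get("description", ""))
    ]
```

Applied to the pages 33-34 batch, this would recover roughly the £1,900 of identifiable overcount, but it cannot touch the rest.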

The remaining £61,000 of overcount has no visible source in isolated tests. Every batch tested individually returns correct or near-correct results.
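One cheap diagnostic for the missing £61,000 is a cross-batch duplicate check: if batches were somehow re-extracting overlapping pages, the same consignment numbers would appear more than once; if every count is 1, duplication is ruled out and the inflation must live in the amounts themselves. A sketch (helper name is mine):

```python
from collections import Counter

def find_duplicate_tracking(batches: list[list[dict]]) -> dict[str, int]:
    """Count tracking numbers across all batches; return any seen twice.

    An empty result means no cross-batch duplication, so overcounting
    must come from inflated amounts rather than repeated rows.
    """
    counts = Counter(
        item["trackingNumber"]
        for batch in batches
        for item in batch
        if item.get("trackingNumber")
    )
    return {tn: n for tn, n in counts.items() if n > 1}
```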

The Core Mystery

All Configurations Tested

* "Overfitted manual" means hardcoded DPD-specific column names in the prompt — not generalizable to other couriers.

What Works vs What Doesn't

Hypothesis: Concurrent Request Inconsistency

The most likely explanation is that GPT-5.4-mini via OpenRouter behaves differently under concurrent load. Evidence:

  1. Sequential isolated test (1 batch at a time): Perfect accuracy
  2. Pipeline with concurrency=3 (3 batches in flight): 300-460% overcount
  3. Both use identical prompts — captured and verified

OpenRouter routes to multiple provider backends. Under concurrent load, different batches may hit different model instances (different quantization levels, different hardware). A single sequential test always hits the same warm instance.
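The hypothesis is directly testable by running the same 18 batches with everything held constant except the concurrency cap. A minimal sketch using an asyncio semaphore (`extract` is a placeholder for whatever async call the pipeline makes per batch):

```python
import asyncio

async def run_batches(batches, extract, concurrency: int = 1):
    """Run batch extractions with a hard cap on in-flight requests.

    concurrency=1 reproduces the sequential isolated-test conditions;
    concurrency=3 reproduces the pipeline conditions, so the two runs
    can be diffed with the prompts and pages held identical.
    """
    sem = asyncio.Semaphore(concurrency)

    async def guarded(batch):
        async with sem:
            return await extract(batch)

    return await asyncio.gather(*(guarded(b) for b in batches))
```

If concurrency=1 over all 36 pages lands near the ground-truth total while concurrency=3 reproduces the 4.7× inflation, that isolates the fault to concurrent routing rather than the model or prompt.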

Recommendations

  1. For production accuracy: Use Sonnet 4.6 for vision fallback ($3.45, proven 100%)
  2. For cost optimization: Rely on the table path with Phase 3 header correction (95-99% on DPD, $0.08)
  3. Vision header correction is valuable regardless — it produces perfect column names and field mappings that benefit any downstream model
  4. Test concurrency=1 with GPT-5.4-mini before ruling it out — the issue may be purely a concurrent routing problem

Next Steps