Vision Extraction Debug — Where GPT-5.4-mini Fails
A detailed investigation into why GPT-5.4-mini produces correct results in isolated tests but overcounts totals by 300-460% when run through the full pipeline on the 36-page DPD invoice.
The Paradox
Individual batch tests return perfect results. The full pipeline overcounts by 4-5×. Same model, same prompt template, same headers.
The Exact Prompt (pages 5-6 batch)
This is the verbatim prompt sent to GPT-5.4-mini for the pages 5-6 batch, captured from the production pipeline:
You are extracting line items from a DPD courier invoice (pages 5-6).
The tables on these pages:
Columns: Collection Date | Consignment Number | Reference | Collection Details |
Delivery Details | Service | Parcels | Surcharge | Weight (Kg) | Vat Code | Amount
Field mapping: trackingNumber ← 'Consignment Number' (column 2);
shipmentDate ← 'Collection Date' (column 1);
totalAmount ← 'Amount' (column 11, rightmost monetary values);
serviceType ← 'Service' (column 6);
description ← 'Delivery Details' (column 5)
Extract EVERY line item visible in these invoice pages.
Each row in the tables represents a shipment or charge.
Return a JSON array of objects matching this schema:
{
description: string,
quantity: number,
totalAmount: number, // Total cost for this line
category: "base_service" | "surcharge" | "collection" | "tax" | "discount" | "regional",
carrier?: string,
trackingNumber?: string,
shipmentDate?: string, // YYYY-MM-DD
subcategory?: string,
serviceType?: string,
weightBand?: string,
destinationCountry?: string,
destinationRegion?: string,
isSurcharge: boolean,
isDiscount: boolean,
confidence: number // 0-1
}
Rules:
- Extract ALL rows — do not skip any, do not summarize
- Map table columns to schema fields based on their names
- totalAmount should be the charge/amount column (not weight, not quantity)
- If a page is a summary page or header page with no line items, return []
Return ONLY the JSON array, no markdown fences, no explanation.
The prompt includes:
- Correct column headers (extracted via vision header correction — one cheap call per layout)
- Explicit field mappings (totalAmount ← 'Amount', column 11, rightmost monetary values)
- Full 13-field MASTER_SCHEMA_PROMPT
- Clear rules about what totalAmount means
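To make the structure concrete, the per-batch prompt could be assembled roughly like this. This is a sketch: the interface and function names are illustrative assumptions, not the production pipeline's actual identifiers.

```typescript
// Sketch of per-batch prompt assembly from vision-corrected layout info.
// All names here are assumptions, not the pipeline's real code.
interface LayoutInfo {
  headers: string[];                 // vision-corrected column names
  fieldMap: Record<string, string>;  // schema field -> source column name
}

function buildBatchPrompt(layout: LayoutInfo, pageRange: string, schema: string): string {
  const mappings = Object.entries(layout.fieldMap)
    .map(([field, col]) => `${field} <- '${col}'`)
    .join("; ");
  return [
    `You are extracting line items from a DPD courier invoice (pages ${pageRange}).`,
    `Columns: ${layout.headers.join(" | ")}`,
    `Field mapping: ${mappings}`,
    "Extract EVERY line item visible in these invoice pages.",
    schema,
    "Return ONLY the JSON array, no markdown fences, no explanation.",
  ].join("\n");
}
```

The point of isolating this as a pure function is that the captured prompt can be byte-compared between the isolated test and the pipeline run, which is how the "identical prompts" claim below was verified.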
What the Model Sees
The model receives 2 PNG images of invoice pages rendered at 1.5× scale. Each page shows a dense table with ~46 rows. The table has these visible columns:
Collection Consignment Reference Collection Delivery Service Parcels Surcharge Weight Vat Amount
Date Number Details Details (Kg) Code
────────────────────────────────────────────────────────────────────────────────────────────────────────────
26/10/2025 6951848404/0 #771901 Ipswich KT20 NEXT DAY 1 AS 4.0 S 4.42
26/10/2025 6951848489/0 #772061 Ipswich NG5 NEXT DAY 1 AS 4.0 S 5.47
26/10/2025 6951848533/0 #771954 Ipswich PO20 NEXT DAY 1 AS 4.0 S 5.20
The key challenge: Amount (£4.42) and Weight (4.0) are adjacent numeric columns with similar-looking values.
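Because the two columns are this easy to confuse, a cheap post-extraction sanity check is to flag rows where the extracted amount exactly equals the weight value, which is a strong signal of column slippage. A sketch, assuming the schema fields above; the check itself is a suggestion, not something the pipeline currently does:

```typescript
interface ExtractedItem {
  trackingNumber?: string;
  totalAmount: number;
  weightBand?: string; // e.g. "4.0"
}

// Flag items whose totalAmount matches the parsed weight value --
// a likely sign the model read the Weight column as Amount.
function flagWeightConfusion(items: ExtractedItem[]): ExtractedItem[] {
  return items.filter((it) => {
    const weight = parseFloat(it.weightBand ?? "");
    return Number.isFinite(weight) && Math.abs(it.totalAmount - weight) < 0.005;
  });
}
```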
What It Returns (pages 5-6, isolated test)
Correct. Every item matches the Sonnet 4.6 POC ground truth:
| Tracking | Mini Amount | POC Amount | Match |
|---|---|---|---|
| 6951848404/0 | £24.37 | £24.37 | ✓ |
| 6951848489/0 | £5.47 | £5.47 | ✓ |
| 6951848533/0 | £5.20 | £5.20 | ✓ |
| 6951848541/0 | £6.55 | £6.55 | ✓ |
| 6951848549/0 | £5.74 | £5.74 | ✓ |
| ... | ... | ... | all ✓ |
92 items, £652.55 — penny-perfect match.
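The penny-perfect comparison can be reproduced with a small diff in integer pence, which avoids floating-point noise when comparing totals. A sketch, with the item shape following the schema above:

```typescript
interface LineItem {
  trackingNumber?: string;
  totalAmount: number;
}

// Compare two extractions by tracking number, in integer pence.
// Returns the tracking numbers whose amounts disagree.
function diffExtractions(a: LineItem[], b: LineItem[]): string[] {
  const pence = (x: number) => Math.round(x * 100);
  const byTracking = new Map(
    b.filter((i) => i.trackingNumber).map((i) => [i.trackingNumber!, pence(i.totalAmount)]),
  );
  return a
    .filter((i) => i.trackingNumber && byTracking.get(i.trackingNumber!) !== pence(i.totalAmount))
    .map((i) => i.trackingNumber!);
}
```

Running this between the mini extraction and the Sonnet 4.6 POC ground truth is what "penny-perfect" means above: an empty diff.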
What Goes Wrong at Scale
When the same prompt runs across 18 batches (36 pages ÷ 2), the pipeline logs show:
Every batch: headers=yes, mapping=yes
Pages 5-6 response: 92 items, £652.55 ← CORRECT
Full pipeline total: 1,432 items, £76,335 ← 467% overcount
The per-batch header injection is confirmed working. Every batch receives correct headers and field mappings. Yet the total is 4.7× too high.
Diagnosing the Overcount
Isolated batch tests on the last pages:
| Pages | Items | Total | Avg/item | Status |
|---|---|---|---|---|
| 29-30 | 92 | £808 | £8.78 | ✓ Normal |
| 31-32 | 77 | £1,131 | £14.69 | ✓ Offshore (higher rates) |
| 33-34 | 14 | £1,857 | £132.62 | ⚠ Surcharge summaries |
| 35-36 | 0 | £0 | — | ✓ Correctly empty |
Page 33 is a known problem — the model reads surcharge summary totals (e.g. "Fuel and Energy Charge: £1,385.32") as individual line items. But that only accounts for ~£1,900 of overcounting.
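One mitigation for the page-33 failure mode is an outlier guard: amounts far above the batch median are far more likely to be summary totals than shipment rows. A minimal sketch; the 10× threshold is an assumption chosen for illustration, not a tuned value:

```typescript
// Flag probable summary-row misreads: amounts far above the batch median.
// On pages 5-6 the per-item average is ~£7, so a £1,385.32 "item" stands out.
function flagSummaryOutliers(amounts: number[], factor = 10): number[] {
  const sorted = [...amounts].sort((x, y) => x - y);
  const median = sorted[Math.floor(sorted.length / 2)];
  return amounts.filter((a) => a > median * factor);
}
```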
The remaining £61,000 of overcount has no visible source in isolated tests. Every batch tested individually returns correct or near-correct results.
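A diagnostic worth running on the full-pipeline output is a duplicate check: if different batches re-extract the same rows (overlapping page ranges, provider-side retries), the same tracking numbers will appear more than once, inflating both the item count and the total. A sketch:

```typescript
// Count surplus items that share a tracking number with an earlier item.
// Duplicated rows across batches would inflate both item count and total.
function countDuplicates(trackingNumbers: string[]): number {
  const seen = new Map<string, number>();
  for (const t of trackingNumbers) seen.set(t, (seen.get(t) ?? 0) + 1);
  let dupes = 0;
  for (const n of seen.values()) if (n > 1) dupes += n - 1;
  return dupes;
}
```

If the 1,432 extracted items contain heavy tracking-number repetition, the missing £61,000 has a mechanical explanation rather than a hallucination one.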
The Core Mystery
All Configurations Tested
* "Overfitted manual" means hardcoded DPD-specific column names in the prompt — not generalizable to other couriers.
What Works vs What Doesn't
Hypothesis: Concurrent Request Inconsistency
The most likely explanation is that GPT-5.4-mini via OpenRouter behaves differently under concurrent load. Evidence:
- Sequential isolated test (1 batch at a time): Perfect accuracy
- Pipeline with concurrency=3 (3 batches in flight): 300-460% overcount
- Both use identical prompts — captured and verified
OpenRouter routes to multiple provider backends. Under concurrent load, different batches may hit different model instances (different quantization levels, different hardware). A single sequential test always hits the same warm instance.
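To test this hypothesis cleanly, the batch runner needs concurrency as a parameter so the identical pipeline can run with 1 or 3 batches in flight. A minimal, dependency-free limiter sketch (the function name is an assumption, not the pipeline's existing runner):

```typescript
// Run async tasks with at most `limit` in flight, preserving result order.
// With limit=1 this degrades to the sequential isolated-test behavior.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: single-threaded event loop, no await before increment
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

Running all 18 batches through `mapWithConcurrency(batches, 1, extract)` versus `limit = 3` with everything else held constant would isolate the concurrency variable directly.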
Recommendations
- For production accuracy: Use Sonnet 4.6 for vision fallback ($3.45, proven 100%)
- For cost optimization: Rely on the table path with Phase 3 header correction (95-99% on DPD, $0.08)
- Vision header correction is valuable regardless — it produces perfect column names and field mappings that benefit any downstream model
- Test concurrency=1 with GPT-5.4-mini before ruling it out — the issue may be purely a concurrent routing problem
Next Steps
- Test GPT-5.4-mini with concurrency=1 (sequential batches) through the pipeline
- Test Sonnet 4.6 with correct headers + field mappings (should be even cheaper than the POC since headers reduce confusion)
- Consider a hybrid: GPT-5.4-mini for header correction, Sonnet for extraction