Vision Trace Forensics — Per-Batch Analysis
Sequential trace of GPT-5.4-mini processing all 18 batches (2 pages each) of DPD invoice 3006995.I61645007.pdf. Every batch ran one at a time — no concurrency. Full prompt, full response, all items logged.
Summary: 3 Types of Failure
Out of 12 completed batches, we found three distinct failure modes:
Key finding: The amounts are almost always correct. Only 7 out of ~1,009 items had wrong amounts (0.7% error rate), and those were just swaps with adjacent rows. The real problems are incomplete extraction and misread tracking numbers.
Per-Batch Results
| Batch | Pages | Items | Total | Comp. Tokens | Finish | Amount Errors | Tracking Errors |
|---|---|---|---|---|---|---|---|
| 1 | 1-2 | 40 | £756.40 | 5,285 | stop | 0 | 0 |
| 2 | 3-4 | 91 | £1,188.29 | 12,198 | stop | 0 | 1 |
| 3 | 5-6 | 50 | £386.90 | 6,606 | stop | 0 | 0 |
| 4 | 7-8 | 92 | £614.96 | 11,873 | stop | 0 | 1 |
| 5 | 9-10 | 92 | £592.34 | 11,597 | stop | 0 | 3 |
| 6 | 11-12 | 92 | £652.82 | 12,414 | stop | 3 | 14 |
| 7 | 13-14 | 92 | £660.11 | 11,420 | stop | 0 | 0 |
| 8 | 15-16 | 92 | £600.44 | 11,231 | stop | 0 | 22 |
| 9 | 17-18 | 92 | £667.67 | 11,818 | stop | 0 | 1 |
| 10 | 19-20 | 92 | £765.68 | 12,103 | stop | 0 | 0 |
| 11 | 21-22 | 92 | £656.87 | 11,505 | stop | 0 | 1 |
| 12 | 23-24 | 92 | £651.47 | 11,768 | stop | 4 | 1 |
Running total after 12 batches: 1,009 items, £8,193.95 (60% of expected £13,653.10)
Projected full run: ~1,500 items, ~£12,300 — roughly 90% accuracy. This is an undercount, not an overcount.
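The running total and projection arithmetic can be checked with a quick sketch (figures copied from the table above; the 18-batch projection is a straight linear scale-up, which assumes the remaining batches look like batches 4-12):

```python
# Per-batch (items, total £) from the table above.
batches = [
    (40, 756.40), (91, 1188.29), (50, 386.90), (92, 614.96),
    (92, 592.34), (92, 652.82), (92, 660.11), (92, 600.44),
    (92, 667.67), (92, 765.68), (92, 656.87), (92, 651.47),
]

items = sum(b[0] for b in batches)      # 1,009 items after 12 batches
total = sum(b[1] for b in batches)      # £8,193.95

EXPECTED_TOTAL = 13_653.10              # POC ground-truth total
coverage = total / EXPECTED_TOTAL       # ~60%

# Naive linear projection to all 18 batches.
proj_items = round(items / 12 * 18)     # ~1,514
proj_total = total / 12 * 18            # ~£12,291

print(items, round(total, 2), f"{coverage:.0%}", proj_items, round(proj_total, 2))
```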
Failure Mode 1: Early Stop (Batch 3, Pages 5-6)
The model extracted 50 items then stopped, despite 42 more rows being clearly visible on page 6. finish_reason: stop — the model chose to end, not a token limit.
What the model saw
Pages 5-6 each show ~46 rows of shipment data in a clear table format. All rows are "Next Day" service with standard amounts (£4.42-£8.85).
What it returned
50 items with correct amounts — every extracted item matches the POC ground truth. It just stopped early.
Item 1: 6951848404/0 | £24.37 | Next Day | KT20 ← correct
Item 2: 6951848489/0 | £5.47 | Next Day | NG5 ← correct
...
Item 50: 6939648211/0 | £6.55 | Next Day | TW12 ← correct, then STOPPED
Last item is tracking 6939648211 on page 6. The remaining ~42 rows on page 6 were not extracted.
Token analysis
- Completion tokens: 6,606 (vs ~11,500 for successful 92-item batches)
- Response chars: 21,098 (vs ~36,000 for full batches)
- The model used ~57% of the tokens a full extraction would need
Diagnosis: The model hit some internal threshold and decided it had extracted "enough". With 13 fields × 50 items = 650 field values, it may have estimated it was near the end of the table. This is a commonly reported failure mode for smaller models on long, repetitive structured output — they "get bored" and stop generating before the data runs out.
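A cheap guard against this failure mode is to flag batches whose item count and completion-token count are both well below what a full extraction costs (~125 tokens/item in the healthy batches). The function and thresholds below are an illustrative sketch, not part of the pipeline:

```python
def flag_early_stop(items: int, completion_tokens: int,
                    expected_items: int = 92,
                    tokens_per_item: float = 125.0) -> bool:
    """Heuristic: a batch that returned far fewer items AND far fewer
    tokens than a full extraction likely stopped early of its own accord
    (finish_reason is still 'stop', so it can't be caught that way)."""
    under_count = items < 0.8 * expected_items
    under_tokens = completion_tokens < 0.8 * expected_items * tokens_per_item
    return under_count and under_tokens

# Batch 3 (50 items, 6,606 tokens) is flagged; Batch 7 (92, 11,420) is not.
print(flag_early_stop(50, 6606), flag_early_stop(92, 11420))  # True False
```

A flagged batch could simply be retried, since the same pages returned 92 items in isolated tests.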
Failure Mode 2: Truncated Tracking Numbers (Batch 6, Pages 11-12)
The model extracted all 92 items but misread 14 tracking numbers by dropping a digit.
Example
| PDF shows | Model returned |
|---|---|
| 6939772770 | 693977277/0 |
| 6939773740 | 693977374/0 |
| 6939774950 | 693977495/0 |
| 6939775280 | 693977528/0 |
The pattern: the model reads 6939772770 as 693977277 + /0 — it absorbs the trailing digit 0 into the /0 suffix. The actual format in the PDF is 6939772770/0 (10 digits + /0), but the model reads only 9 digits and appends /0.
Impact
These items have correct amounts but wrong tracking numbers. The dedup key uses tracking numbers, so these appear as "new" items that don't match the POC reference. In the pipeline, they wouldn't cause overcounting (the amounts are right) — but they'd prevent dedup from catching genuine duplicates at batch boundaries.
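Since every consignment number in this invoice follows the same shape (10 digits plus a /0 suffix), the truncation signature is mechanically detectable before dedup. A minimal validator sketch (`check_tracking` is a hypothetical helper, not pipeline code):

```python
import re

# Format observed in this invoice: 10-digit consignment number, "/N" suffix.
VALID = re.compile(r"^\d{10}/\d$")
TRUNCATED = re.compile(r"^\d{9}/\d$")  # last digit absorbed into the suffix

def check_tracking(raw: str) -> str:
    """Classify an extracted consignment number. A 9-digit body is the
    truncation signature from Failure Mode 2; anything else is 'other'."""
    if VALID.match(raw):
        return "ok"
    if TRUNCATED.match(raw):
        return "truncated"
    return "other"

print(check_tracking("6939772770/0"))  # ok
print(check_tracking("693977277/0"))   # truncated
```

Items flagged "truncated" could be excluded from the dedup key (or matched on amount + date instead) rather than being treated as new shipments.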
Failure Mode 3: Adjacent Amount Swaps (Batch 6 & 12)
A small number of items (7 total across all batches) have their amounts swapped with an adjacent row.
Examples from Batch 6
| Tracking | Trace Amount | POC Amount | Diff |
|---|---|---|---|
| 6939830996/0 | £5.47 | £5.20 | +£0.27 |
| 6939831065/0 | £5.20 | £5.47 | -£0.27 |
| 6939831137/0 | £5.47 | £5.20 | +£0.27 |
The errors cancel out — the total per batch is correct even though individual items are wrong. This is the model confusing which row gets £5.20 vs £5.47 when they're adjacent in a dense table.
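Because a swapped pair contributes zero net error to the batch total, these errors are invisible to total-based reconciliation but easy to detect pairwise. A sketch of such a check (`find_adjacent_swaps` is hypothetical; the example uses the first two Batch 6 rows above):

```python
def find_adjacent_swaps(trace_amounts, poc_amounts, tol=0.005):
    """Return indices i where rows i and i+1 have exchanged amounts:
    trace[i] matches poc[i+1] and vice versa, but trace[i] != poc[i]."""
    swaps = []
    for i in range(len(poc_amounts) - 1):
        if (abs(trace_amounts[i] - poc_amounts[i + 1]) < tol
                and abs(trace_amounts[i + 1] - poc_amounts[i]) < tol
                and abs(trace_amounts[i] - poc_amounts[i]) > tol):
            swaps.append(i)
    return swaps

# Rows 0 and 1 swapped £5.47/£5.20 relative to the POC ground truth.
print(find_adjacent_swaps([5.47, 5.20], [5.20, 5.47]))  # [0]
```

This is why Batch 6's total still matched despite three wrong items: every swap nets out to zero in the sum.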
Good Batch for Comparison (Batch 7, Pages 13-14)
92 items, £660.11, zero errors, all tracking numbers correct. 11,420 completion tokens, finish=stop.
Item 1: 6939928644/0 | £5.20 | Next Day | EX33
Item 2: 6939928757/0 | £7.09 | Next Day | BS40
...
Item 92: 6940005922/0 | £5.74 | Next Day | CF11
This is what every batch should look like. The model CAN do it — it's just inconsistent.
The Prompt (identical for all batches)
You are extracting line items from a DPD courier invoice (pages {X}-{Y}).
The tables on these pages:
Columns: Collection Date | Consignment Number | Reference |
Collection Details | Delivery Details | Service |
Parcels | Surcharge | Weight (Kg) | Vat Code | Amount
Field mapping: trackingNumber ← 'Consignment Number' (column 2);
shipmentDate ← 'Collection Date' (column 1);
totalAmount ← 'Amount' (column 11, rightmost monetary
charge column; NOT weight or parcel count);
serviceType ← 'Service' (column 6);
description ← 'Delivery Details' (column 5)
Extract EVERY line item visible in these invoice pages.
Return a JSON array of objects matching this schema:
{
description: string,
quantity: number,
totalAmount: number, // Total cost for this line
category: "base_service" | "surcharge" | ...,
carrier?: string,
trackingNumber?: string,
shipmentDate?: string, // YYYY-MM-DD
subcategory?: string,
serviceType?: string,
weightBand?: string,
destinationCountry?: string,
destinationRegion?: string,
isSurcharge: boolean,
isDiscount: boolean,
confidence: number // 0-1
}
Rules:
- Extract ALL rows — do not skip any, do not summarize
- Map table columns to schema fields based on their names
- totalAmount should be the charge/amount column (not weight, not quantity)
- If a page is a summary page, return []
Return ONLY the JSON array.
Why the Pipeline Shows 400%+ When the Trace Shows 90%
This sequential trace produced ~90% accuracy (undercounting). But the pipeline consistently reports 300-460% (overcounting). The difference:
- The pipeline runs 3 batches concurrently — same prompt, same model, but concurrent requests through OpenRouter may be routed to different model instances
- Non-determinism: Even sequentially, batch 3 got 50 items while the same pages returned 92 items in isolated tests. The model gives different answers on different calls
- The overcounting in the pipeline isn't amounts being wrong — it might be the `quantity` field being used as a multiplier somewhere, since the model fills `quantity` with parcel counts (e.g. qty=11 for a multi-parcel shipment), and something downstream might multiply `totalAmount × quantity`
Open question: Is there code in the pipeline that multiplies `totalAmount × quantity`? The `logResultSummary` function just sums `totalAmount` directly. But the `vSummary.totalAmount` is what gets reported — need to verify this isn't being inflated.
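If such a multiplication exists anywhere downstream, the inflation would look like this. A toy illustration (item values are hypothetical, loosely based on the rows and the qty=11 multi-parcel example above; this is not the pipeline's code):

```python
# Hypothetical extracted items: the model puts the parcel count in
# "quantity", while "totalAmount" is already the full line charge.
items = [
    {"trackingNumber": "6951848404/0", "totalAmount": 24.37, "quantity": 11},
    {"trackingNumber": "6951848489/0", "totalAmount": 5.47,  "quantity": 1},
]

# Correct: totalAmount is already the line total, so just sum it.
correct = sum(i["totalAmount"] for i in items)

# Buggy: treating quantity as a multiplier inflates multi-parcel lines.
buggy = sum(i["totalAmount"] * i["quantity"] for i in items)

print(round(correct, 2), round(buggy, 2), f"{buggy / correct:.0%}")
```

Even a handful of multi-parcel lines treated this way would push a batch total into the 300-460% range the pipeline reports.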
Conclusions
- Amounts are correct — GPT-5.4-mini reads the Amount column accurately (0.7% error rate on values)
- Item coverage is inconsistent — sometimes 92/92, sometimes 50/92 on the same pages
- Tracking numbers are occasionally truncated — drops the last digit before /0
- The 300-460% pipeline overcount is NOT from the model returning wrong amounts — it must be from how results are processed, or from non-deterministic behaviour producing wildly different results on concurrent runs