Vision Trace Forensics — Per-Batch Analysis

By Patrick McCurley · Created Mar 19, 2026

Sequential trace of GPT-5.4-mini processing DPD invoice 3006995.I61645007.pdf in 18 batches of 2 pages each, of which 12 had completed at the time of this analysis. Every batch ran one at a time, with no concurrency. Full prompt, full response, and all items were logged.

Summary: 3 Types of Failure

Across the 12 completed batches, we found three distinct failure modes: early stops (the model ends before the table does), truncated tracking numbers (a digit dropped before the /0 suffix), and adjacent amount swaps.

Key finding: The amounts are almost always correct. Only 7 out of ~1,009 items had wrong amounts (0.7% error rate), and those were just swaps with adjacent rows. The real problems are incomplete extraction and misread tracking numbers.

Per-Batch Results

Batch  Pages  Items  Total      Comp. Tokens  Finish  Amount Errors  Tracking Errors
1      1-2    40     £756.40    5,285         stop    0              0
2      3-4    91     £1,188.29  12,198        stop    0              1
3      5-6    50     £386.90    6,606         stop    0              0
4      7-8    92     £614.96    11,873        stop    0              1
5      9-10   92     £592.34    11,597        stop    0              3
6      11-12  92     £652.82    12,414        stop    3              14
7      13-14  92     £660.11    11,420        stop    0              0
8      15-16  92     £600.44    11,231        stop    0              22
9      17-18  92     £667.67    11,818        stop    0              1
10     19-20  92     £765.68    12,103        stop    0              0
11     21-22  92     £656.87    11,505        stop    0              1
12     23-24  92     £651.47    11,768        stop    4              1

Running total after 12 batches: 1,009 items, £8,193.95 (60% of expected £13,653.10)

Projected full run: ~1,500 items totalling ~£12,300, roughly 90% of the expected amount. This is an undercount, not an overcount.
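The projection follows directly from the 12-batch running total; a minimal sketch of the arithmetic (figures taken from the table above):

```typescript
// Project the full 18-batch run from the 12 completed batches.
const completedItems = 1009;
const completedTotal = 8193.95; // £, sum of the 12 batch totals
const completedBatches = 12;
const totalBatches = 18;
const expectedTotal = 13653.10; // £, POC reference total

const projectedItems = Math.round(completedItems * totalBatches / completedBatches); // 1514, i.e. ~1,500
const projectedTotal = completedTotal * totalBatches / completedBatches;             // ≈ £12,290.93, i.e. ~£12,300
const coverage = projectedTotal / expectedTotal;                                     // ≈ 0.90
```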

Failure Mode 1: Early Stop (Batch 3, Pages 5-6)

The model extracted 50 items then stopped, despite 42 more rows being clearly visible on page 6. finish_reason: stop — the model chose to end, not a token limit.

What the model saw

Pages 5-6 each show ~46 rows of shipment data in a clear table format. All rows are "Next Day" service with standard amounts (£4.42-£8.85).

What it returned

50 items with correct amounts — every extracted item matches the POC ground truth. It just stopped early.

Item 1:  6951848404/0 | £24.37 | Next Day | KT20  ← correct
Item 2:  6951848489/0 | £5.47  | Next Day | NG5   ← correct
...
Item 50: 6939648211/0 | £6.55  | Next Day | TW12  ← correct, then STOPPED

Last item is tracking 6939648211 on page 6. The remaining ~42 rows on page 6 were not extracted.

Token analysis

Batch 3 finished at 6,606 completion tokens, roughly half of the 11,200-12,400 tokens that full 92-item batches consume, so this was not a context or output limit.

Diagnosis: the model hit some internal threshold and decided it had extracted "enough". With 13 fields × 50 items = 650 field values, it may have estimated it was near the end of the table. This is a well-documented issue with smaller models on repetitive structured output: they "get bored" and stop generating.
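One cheap mitigation (a sketch, not something the current pipeline does): flag batches whose item count falls well below the expected row density and queue them for a retry. The threshold and the rows-per-page estimate below are assumptions drawn from these pages:

```typescript
// Hypothetical early-stop guard: these invoice pages show ~46 rows each,
// so a 2-page batch returning far fewer than ~92 items is suspect.
interface BatchResult {
  batch: number;
  pages: [number, number]; // inclusive page range
  items: unknown[];
}

const EXPECTED_ROWS_PER_PAGE = 46; // observed density (assumption)
const MIN_FILL_RATIO = 0.8;        // tunable threshold (assumption)

function isSuspectedEarlyStop(r: BatchResult): boolean {
  const pageCount = r.pages[1] - r.pages[0] + 1;
  const expected = pageCount * EXPECTED_ROWS_PER_PAGE;
  return r.items.length < expected * MIN_FILL_RATIO;
}

// Batch 3 returned 50 items where ~92 were expected → flagged for retry.
const batch3Suspect = isSuspectedEarlyStop({ batch: 3, pages: [5, 6], items: new Array(50) }); // true
```

Legitimately short batches (like batch 1's 40 items on the opening pages) would also trip this, so a flagged batch means "re-check", not "definitely wrong".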

Failure Mode 2: Truncated Tracking Numbers (Batch 6, Pages 11-12)

The model extracted all 92 items but misread 14 tracking numbers by dropping a digit.

Example

PDF shows       Model returned
6939772770/0    693977277/0
6939773740/0    693977374/0
6939774950/0    693977495/0
6939775280/0    693977528/0

The pattern: the model reads 6939772770 as 693977277 + /0 — it splits the last digit 0 into the /0 suffix. The actual format in the PDF is 6939772770/0 (10 digits + /0), but the model reads 9 digits and appends /0.

Impact

These items have correct amounts but wrong tracking numbers. The dedup key uses tracking numbers, so these appear as "new" items that don't match the POC reference. In the pipeline, they wouldn't cause overcounting (the amounts are right) — but they'd prevent dedup from catching genuine duplicates at batch boundaries.
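Because the truncation has a fixed signature (9 digits instead of 10, same /0 suffix), it is mechanically detectable before dedup runs. A hypothetical validator, assuming the consignment format described above:

```typescript
// Consignment numbers on this invoice are 10 digits followed by "/0".
// A 9-digit prefix matches the truncation signature where the model
// folds the final digit of the number into the suffix.
const FULL = /^\d{10}\/0$/;
const TRUNCATED = /^\d{9}\/0$/;

function classifyTracking(t: string): "ok" | "truncated" | "other" {
  if (FULL.test(t)) return "ok";
  if (TRUNCATED.test(t)) return "truncated";
  return "other";
}

console.log(classifyTracking("6939772770/0")); // "ok"
console.log(classifyTracking("693977277/0"));  // "truncated"
```

Flagged items could be excluded from the tracking-number dedup key (or matched fuzzily) rather than treated as new shipments.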

Failure Mode 3: Adjacent Amount Swaps (Batch 6 & 12)

A small number of items (7 total across all batches) have their amounts swapped with an adjacent row.

Examples from Batch 6

Tracking      Trace Amount  POC Amount  Diff
6939830996/0  £5.47         £5.20       +£0.27
6939831065/0  £5.20         £5.47       -£0.27
6939831137/0  £5.47         £5.20       +£0.27

The errors cancel out — the total per batch is correct even though individual items are wrong. This is the model confusing which row gets £5.20 vs £5.47 when they're adjacent in a dense table.
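Since the diffs cancel in pairs, swaps can be distinguished from genuine misreads when a reference is available. A sketch of such a detector (the values mirror the Batch 6 rows above; the function itself is illustrative, not pipeline code):

```typescript
// Pair up consecutive rows whose amount errors against the POC reference
// cancel out: a non-zero diff followed by its negation is a row swap,
// not a misread value.
interface TraceItem { tracking: string; amount: number; }

function findSwapPairs(trace: TraceItem[], poc: Map<string, number>): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i + 1 < trace.length; i++) {
    const d1 = trace[i].amount - (poc.get(trace[i].tracking) ?? trace[i].amount);
    const d2 = trace[i + 1].amount - (poc.get(trace[i + 1].tracking) ?? trace[i + 1].amount);
    if (d1 !== 0 && Math.abs(d1 + d2) < 0.005) {
      pairs.push([trace[i].tracking, trace[i + 1].tracking]);
    }
  }
  return pairs;
}

const poc = new Map([["6939830996/0", 5.20], ["6939831065/0", 5.47]]);
const swaps = findSwapPairs(
  [{ tracking: "6939830996/0", amount: 5.47 }, { tracking: "6939831065/0", amount: 5.20 }],
  poc,
); // one pair: the +£0.27 / -£0.27 rows
```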

Good Batch for Comparison (Batch 7, Pages 13-14)

92 items, £660.11, zero errors, all tracking numbers correct. 11,420 completion tokens, finish=stop.

Item 1:  6939928644/0 | £5.20  | Next Day | EX33
Item 2:  6939928757/0 | £7.09  | Next Day | BS40
...
Item 92: 6940005922/0 | £5.74  | Next Day | CF11

This is what every batch should look like. The model CAN do it — it's just inconsistent.

The Prompt (identical for all batches)

You are extracting line items from a DPD courier invoice (pages {X}-{Y}).

The tables on these pages:
Columns: Collection Date | Consignment Number | Reference |
         Collection Details | Delivery Details | Service |
         Parcels | Surcharge | Weight (Kg) | Vat Code | Amount
Field mapping: trackingNumber ← 'Consignment Number' (column 2);
               shipmentDate ← 'Collection Date' (column 1);
               totalAmount ← 'Amount' (column 11, rightmost monetary
               charge column; NOT weight or parcel count);
               serviceType ← 'Service' (column 6);
               description ← 'Delivery Details' (column 5)

Extract EVERY line item visible in these invoice pages.

Return a JSON array of objects matching this schema:
{
  description: string,
  quantity: number,
  totalAmount: number,         // Total cost for this line
  category: "base_service" | "surcharge" | ...,
  carrier?: string,
  trackingNumber?: string,
  shipmentDate?: string,       // YYYY-MM-DD
  subcategory?: string,
  serviceType?: string,
  weightBand?: string,
  destinationCountry?: string,
  destinationRegion?: string,
  isSurcharge: boolean,
  isDiscount: boolean,
  confidence: number           // 0-1
}

Rules:
- Extract ALL rows — do not skip any, do not summarize
- Map table columns to schema fields based on their names
- totalAmount should be the charge/amount column (not weight, not quantity)
- If a page is a summary page, return []

Return ONLY the JSON array.
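The schema in the prompt maps naturally to a TypeScript interface, which downstream code could use to sanity-check parsed responses. A sketch (the category union is elided with "…" in the prompt, so it is left open here; the guard checks only a few load-bearing fields):

```typescript
// TypeScript mirror of the extraction schema from the prompt above.
interface ExtractedLineItem {
  description: string;
  quantity: number;
  totalAmount: number;          // total cost for this line
  category: string;             // "base_service" | "surcharge" | … (union elided in the prompt)
  carrier?: string;
  trackingNumber?: string;
  shipmentDate?: string;        // YYYY-MM-DD
  subcategory?: string;
  serviceType?: string;
  weightBand?: string;
  destinationCountry?: string;
  destinationRegion?: string;
  isSurcharge: boolean;
  isDiscount: boolean;
  confidence: number;           // 0-1
}

// Minimal runtime guard for one parsed item (illustrative, not exhaustive).
function isLineItem(x: any): x is ExtractedLineItem {
  return typeof x?.description === "string"
    && typeof x?.totalAmount === "number"
    && typeof x?.isSurcharge === "boolean";
}
```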

Why the Pipeline Shows 400%+ When the Trace Shows 90%

This sequential trace produced ~90% accuracy (undercounting). But the pipeline consistently reports 300-460% (overcounting). The difference:

  1. The pipeline runs 3 batches concurrently — same prompt, same model, but concurrent requests through OpenRouter may be routed to different model instances
  2. Non-determinism: Even sequentially, batch 3 got 50 items while the same pages returned 92 items in isolated tests. The model gives different answers on different calls
  3. The overcounting in the pipeline isn't amounts being wrong — it might be the quantity field being used as a multiplier somewhere, since the model fills quantity with parcel counts (e.g. qty=11 for a multi-parcel shipment), and something downstream might multiply totalAmount × quantity

Open question: Is there code in the pipeline that multiplies totalAmount × quantity? The logResultSummary function just sums totalAmount directly. But the vSummary.totalAmount is what gets reported — need to verify this isn't being inflated.
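The suspected inflation path is easy to demonstrate in isolation. The quantity field holds parcel counts, and totalAmount is already the full line charge, so any downstream sum of totalAmount × quantity overcounts multi-parcel rows (amounts below are illustrative; qty=11 echoes the example above):

```typescript
// Summing totalAmount directly vs. multiplying by quantity first.
interface Line { totalAmount: number; quantity: number; }

const lines: Line[] = [
  { totalAmount: 5.47, quantity: 1 },
  { totalAmount: 24.37, quantity: 11 }, // multi-parcel shipment: qty = parcel count
  { totalAmount: 6.55, quantity: 1 },
];

const summedDirectly = lines.reduce((s, l) => s + l.totalAmount, 0);             // ≈ £36.39 (correct)
const summedWithQty = lines.reduce((s, l) => s + l.totalAmount * l.quantity, 0); // ≈ £280.09 (inflated)
// One 11-parcel row pushes the "total" to ~7.7× the real figure —
// the same shape of blow-up as the 300-460% pipeline readings.
```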

Conclusions

  1. Amounts are correct — GPT-5.4-mini reads the Amount column accurately (0.7% error rate on values)
  2. Item coverage is inconsistent — sometimes 92/92, sometimes 50/92 on the same pages
  3. Tracking numbers are occasionally truncated — drops the last digit before /0
  4. The 300-460% pipeline overcount is NOT from the model returning wrong amounts — it must be from how results are processed, or from non-deterministic behaviour producing wildly different results on concurrent runs