Vision Trace Forensics — Per-Batch Analysis

By Patrick McCurley · Created Mar 19, 2026

Sequential trace of GPT-5.4-mini processing DPD invoice 3006995.I61645007.pdf in 18 batches of 2 pages each, of which 12 had completed at the time of this analysis. Every batch ran one at a time, with no concurrency. Full prompt, full response, and all items were logged.

Summary: 3 Types of Failure

Across the 12 completed batches, we found three distinct failure modes: early stops (the model ends before the table does), truncated tracking numbers (a digit dropped before the /0 suffix), and adjacent amount swaps.

Key finding: The amounts are almost always correct. Only 7 out of ~1,009 items had wrong amounts (0.7% error rate), and those were just swaps with adjacent rows. The real problems are incomplete extraction and misread tracking numbers.

Per-Batch Results

Batch  Pages  Items  Total      Comp. Tokens  Finish  Amount Errors  Tracking Errors
1      1-2    40     £756.40    5,285         stop    0              0
2      3-4    91     £1,188.29  12,198        stop    0              1
3      5-6    50     £386.90    6,606         stop    0              0
4      7-8    92     £614.96    11,873        stop    0              1
5      9-10   92     £592.34    11,597        stop    0              3
6      11-12  92     £652.82    12,414        stop    3              14
7      13-14  92     £660.11    11,420        stop    0              0
8      15-16  92     £600.44    11,231        stop    0              22
9      17-18  92     £667.67    11,818        stop    0              1
10     19-20  92     £765.68    12,103        stop    0              0
11     21-22  92     £656.87    11,505        stop    0              1
12     23-24  92     £651.47    11,768        stop    4              1

Running total after 12 batches: 1,009 items, £8,193.95 (60% of expected £13,653.10)

Projected full run: ~1,500 items totalling ~£12,300, roughly 90% of the expected amount. This is an undercount, not an overcount.
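The projection follows directly from the 12-batch running total; a minimal sketch of the arithmetic (figures taken from the table above):

```typescript
// Project the full 18-batch run from the 12 completed batches.
const completedItems = 1009;
const completedTotal = 8193.95; // £, sum of the 12 batch totals
const completedBatches = 12;
const totalBatches = 18;
const expectedTotal = 13653.10; // £, POC reference total

const projectedItems = Math.round(completedItems * totalBatches / completedBatches); // 1514, i.e. ~1,500
const projectedTotal = completedTotal * totalBatches / completedBatches;             // ≈ £12,290.93, i.e. ~£12,300
const coverage = projectedTotal / expectedTotal;                                     // ≈ 0.90
```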

Failure Mode 1: Early Stop (Batch 3, Pages 5-6)

The model extracted 50 items then stopped, despite 42 more rows being clearly visible on page 6. finish_reason: stop — the model chose to end, not a token limit.

What the model saw

Pages 5-6 each show ~46 rows of shipment data in a clear table format. All rows are "Next Day" service with standard amounts (£4.42-£8.85).

What it returned

50 items with correct amounts — every extracted item matches the POC ground truth. It just stopped early.

Item 1:  6951848404/0 | £24.37 | Next Day | KT20  ← correct
Item 2:  6951848489/0 | £5.47  | Next Day | NG5   ← correct
...
Item 50: 6939648211/0 | £6.55  | Next Day | TW12  ← correct, then STOPPED

Last item is tracking 6939648211 on page 6. The remaining ~42 rows on page 6 were not extracted.

Token analysis

Batch 3 finished at 6,606 completion tokens, roughly half of the 11,200-12,400 tokens that full 92-item batches consume, so this was not a context or output limit.

Diagnosis: the model hit some internal threshold and decided it had extracted "enough". With 13 fields × 50 items = 650 field values, it may have estimated it was near the end of the table. This is a well-documented issue with smaller models on repetitive structured output: they "get bored" and stop generating.
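One cheap mitigation (a sketch, not something the current pipeline does): flag batches whose item count falls well below the expected row density and queue them for a retry. The threshold and the rows-per-page estimate below are assumptions drawn from these pages:

```typescript
// Hypothetical early-stop guard: these invoice pages show ~46 rows each,
// so a 2-page batch returning far fewer than ~92 items is suspect.
interface BatchResult {
  batch: number;
  pages: [number, number]; // inclusive page range
  items: unknown[];
}

const EXPECTED_ROWS_PER_PAGE = 46; // observed density (assumption)
const MIN_FILL_RATIO = 0.8;        // tunable threshold (assumption)

function isSuspectedEarlyStop(r: BatchResult): boolean {
  const pageCount = r.pages[1] - r.pages[0] + 1;
  const expected = pageCount * EXPECTED_ROWS_PER_PAGE;
  return r.items.length < expected * MIN_FILL_RATIO;
}

// Batch 3 returned 50 items where ~92 were expected → flagged for retry.
const batch3Suspect = isSuspectedEarlyStop({ batch: 3, pages: [5, 6], items: new Array(50) }); // true
```

Legitimately short batches (like batch 1's 40 items on the opening pages) would also trip this, so a flagged batch means "re-check", not "definitely wrong".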

Failure Mode 2: Truncated Tracking Numbers (Batch 6, Pages 11-12)

The model extracted all 92 items but misread 14 tracking numbers by dropping a digit.

Example

PDF shows       Model returned
6939772770/0    693977277/0
6939773740/0    693977374/0
6939774950/0    693977495/0
6939775280/0    693977528/0

The pattern: the model reads 6939772770 as 693977277 + /0 — it splits the last digit 0 into the /0 suffix. The actual format in the PDF is 6939772770/0 (10 digits + /0), but the model reads 9 digits and appends /0.

Impact

These items have correct amounts but wrong tracking numbers. The dedup key uses tracking numbers, so these appear as "new" items that don't match the POC reference. In the pipeline, they wouldn't cause overcounting (the amounts are right) — but they'd prevent dedup from catching genuine duplicates at batch boundaries.
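Because the truncation has a fixed signature (9 digits instead of 10, same /0 suffix), it is mechanically detectable before dedup runs. A hypothetical validator, assuming the consignment format described above:

```typescript
// Consignment numbers on this invoice are 10 digits followed by "/0".
// A 9-digit prefix matches the truncation signature where the model
// folds the final digit of the number into the suffix.
const FULL = /^\d{10}\/0$/;
const TRUNCATED = /^\d{9}\/0$/;

function classifyTracking(t: string): "ok" | "truncated" | "other" {
  if (FULL.test(t)) return "ok";
  if (TRUNCATED.test(t)) return "truncated";
  return "other";
}

console.log(classifyTracking("6939772770/0")); // "ok"
console.log(classifyTracking("693977277/0"));  // "truncated"
```

Flagged items could be excluded from the tracking-number dedup key (or matched fuzzily) rather than treated as new shipments.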

Failure Mode 3: Adjacent Amount Swaps (Batch 6 & 12)

A small number of items (7 total across all batches) have their amounts swapped with an adjacent row.

Examples from Batch 6

Tracking      Trace Amount  POC Amount  Diff
6939830996/0  £5.47         £5.20       +£0.27
6939831065/0  £5.20         £5.47       -£0.27
6939831137/0  £5.47         £5.20       +£0.27

The errors cancel out — the total per batch is correct even though individual items are wrong. This is the model confusing which row gets £5.20 vs £5.47 when they're adjacent in a dense table.
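Since the diffs cancel in pairs, swaps can be distinguished from genuine misreads when a reference is available. A sketch of such a detector (the values mirror the Batch 6 rows above; the function itself is illustrative, not pipeline code):

```typescript
// Pair up consecutive rows whose amount errors against the POC reference
// cancel out: a non-zero diff followed by its negation is a row swap,
// not a misread value.
interface TraceItem { tracking: string; amount: number; }

function findSwapPairs(trace: TraceItem[], poc: Map<string, number>): [string, string][] {
  const pairs: [string, string][] = [];
  for (let i = 0; i + 1 < trace.length; i++) {
    const d1 = trace[i].amount - (poc.get(trace[i].tracking) ?? trace[i].amount);
    const d2 = trace[i + 1].amount - (poc.get(trace[i + 1].tracking) ?? trace[i + 1].amount);
    if (d1 !== 0 && Math.abs(d1 + d2) < 0.005) {
      pairs.push([trace[i].tracking, trace[i + 1].tracking]);
    }
  }
  return pairs;
}

const poc = new Map([["6939830996/0", 5.20], ["6939831065/0", 5.47]]);
const swaps = findSwapPairs(
  [{ tracking: "6939830996/0", amount: 5.47 }, { tracking: "6939831065/0", amount: 5.20 }],
  poc,
); // one pair: the +£0.27 / -£0.27 rows
```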

Good Batch for Comparison (Batch 7, Pages 13-14)

92 items, £660.11, zero errors, all tracking numbers correct. 11,420 completion tokens, finish=stop.

Item 1:  6939928644/0 | £5.20  | Next Day | EX33
Item 2:  6939928757/0 | £7.09  | Next Day | BS40
...
Item 92: 6940005922/0 | £5.74  | Next Day | CF11

This is what every batch should look like. The model CAN do it — it's just inconsistent.

The Prompt (identical for all batches)

You are extracting line items from a DPD courier invoice (pages {X}-{Y}).

The tables on these pages:
Columns: Collection Date | Consignment Number | Reference |
         Collection Details | Delivery Details | Service |
         Parcels | Surcharge | Weight (Kg) | Vat Code | Amount
Field mapping: trackingNumber ← 'Consignment Number' (column 2);
               shipmentDate ← 'Collection Date' (column 1);
               totalAmount ← 'Amount' (column 11, rightmost monetary
               charge column; NOT weight or parcel count);
               serviceType ← 'Service' (column 6);
               description ← 'Delivery Details' (column 5)

Extract EVERY line item visible in these invoice pages.

Return a JSON array of objects matching this schema:
{
  description: string,
  quantity: number,
  totalAmount: number,         // Total cost for this line
  category: "base_service" | "surcharge" | ...,
  carrier?: string,
  trackingNumber?: string,
  shipmentDate?: string,       // YYYY-MM-DD
  subcategory?: string,
  serviceType?: string,
  weightBand?: string,
  destinationCountry?: string,
  destinationRegion?: string,
  isSurcharge: boolean,
  isDiscount: boolean,
  confidence: number           // 0-1
}

Rules:
- Extract ALL rows — do not skip any, do not summarize
- Map table columns to schema fields based on their names
- totalAmount should be the charge/amount column (not weight, not quantity)
- If a page is a summary page, return []

Return ONLY the JSON array.
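The schema in the prompt maps naturally to a TypeScript interface, which downstream code could use to sanity-check parsed responses. A sketch (the category union is elided with "…" in the prompt, so it is left open here; the guard checks only a few load-bearing fields):

```typescript
// TypeScript mirror of the extraction schema from the prompt above.
interface ExtractedLineItem {
  description: string;
  quantity: number;
  totalAmount: number;          // total cost for this line
  category: string;             // "base_service" | "surcharge" | … (union elided in the prompt)
  carrier?: string;
  trackingNumber?: string;
  shipmentDate?: string;        // YYYY-MM-DD
  subcategory?: string;
  serviceType?: string;
  weightBand?: string;
  destinationCountry?: string;
  destinationRegion?: string;
  isSurcharge: boolean;
  isDiscount: boolean;
  confidence: number;           // 0-1
}

// Minimal runtime guard for one parsed item (illustrative, not exhaustive).
function isLineItem(x: any): x is ExtractedLineItem {
  return typeof x?.description === "string"
    && typeof x?.totalAmount === "number"
    && typeof x?.isSurcharge === "boolean";
}
```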

Why the Pipeline Shows 400%+ When the Trace Shows 90%

This sequential trace produced ~90% accuracy (undercounting). But the pipeline consistently reports 300-460% (overcounting). The difference:

  1. The pipeline runs 3 batches concurrently — same prompt, same model, but concurrent requests through OpenRouter may be routed to different model instances
  2. Non-determinism: Even sequentially, batch 3 got 50 items while the same pages returned 92 items in isolated tests. The model gives different answers on different calls
  3. The overcounting in the pipeline isn't amounts being wrong — it might be the quantity field being used as a multiplier somewhere, since the model fills quantity with parcel counts (e.g. qty=11 for a multi-parcel shipment), and something downstream might multiply totalAmount × quantity

Open question: Is there code in the pipeline that multiplies totalAmount × quantity? The logResultSummary function just sums totalAmount directly. But the vSummary.totalAmount is what gets reported — need to verify this isn't being inflated.
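The suspected inflation path is easy to demonstrate in isolation. The quantity field holds parcel counts, and totalAmount is already the full line charge, so any downstream sum of totalAmount × quantity overcounts multi-parcel rows (amounts below are illustrative; qty=11 echoes the example above):

```typescript
// Summing totalAmount directly vs. multiplying by quantity first.
interface Line { totalAmount: number; quantity: number; }

const lines: Line[] = [
  { totalAmount: 5.47, quantity: 1 },
  { totalAmount: 24.37, quantity: 11 }, // multi-parcel shipment: qty = parcel count
  { totalAmount: 6.55, quantity: 1 },
];

const summedDirectly = lines.reduce((s, l) => s + l.totalAmount, 0);             // ≈ £36.39 (correct)
const summedWithQty = lines.reduce((s, l) => s + l.totalAmount * l.quantity, 0); // ≈ £280.09 (inflated)
// One 11-parcel row pushes the "total" to ~7.7× the real figure —
// the same shape of blow-up as the 300-460% pipeline readings.
```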

Conclusions

  1. Amounts are correct — GPT-5.4-mini reads the Amount column accurately (0.7% error rate on values)
  2. Item coverage is inconsistent — sometimes 92/92, sometimes 50/92 on the same pages
  3. Tracking numbers are occasionally truncated — drops the last digit before /0
  4. The 300-460% pipeline overcount is NOT from the model returning wrong amounts — it must be from how results are processed, or from non-deterministic behaviour producing wildly different results on concurrent runs