Email Backoff Investigation — Why Unhandled Emails Get Reprocessed Every 15 Minutes

By Patrick McCurley · Created Mar 10, 2026

The 3-Tier Architecture

graph TD
    S[Schedule: every 15 min] --> EP[EmailPollerWorkflow]
    EP -->|fire-and-forget per courier| C1[EmailPollerCourierWorkflow DPD]
    EP -->|fire-and-forget per courier| C2[EmailPollerCourierWorkflow Asendia]
    EP -->|fire-and-forget per courier| C3[EmailPollerCourierWorkflow ...]

    C1 -->|awaits each email sequentially| P1[ProcessEmailWorkflow email-abc]
    C1 -->|awaits each email sequentially| P2[ProcessEmailWorkflow email-def]
    C1 -->|awaits each email sequentially| P3[ProcessEmailWorkflow email-ghi]

    style S fill:#fff3e0
    style EP fill:#e1f5fe
    style C1 fill:#e8f5e9
    style C2 fill:#e8f5e9
    style C3 fill:#e8f5e9
    style P1 fill:#f3e5f5
    style P2 fill:#f3e5f5
    style P3 fill:#f3e5f5

How a Single Email Gets Processed

sequenceDiagram
    participant Batch as EmailPollerCourierWorkflow
    participant PEW as ProcessEmailWorkflow
    participant Act as ClassifyAndProcessEmail Activity
    participant LLM as LLM Classification
    participant Graph as MS Graph API

    Batch->>PEW: ExecuteChildWorkflowAsync (AWAITS result)
    Note over Batch: Blocked until child completes

    PEW->>Act: Classify email
    Act->>Graph: Fetch email body
    Graph-->>Act: Email content
    Act->>LLM: Classify rejection reason
    LLM-->>Act: Reason 50 (Delivered to safe place), 99% confidence

    Act-->>PEW: WasMarkedAsRead=false, UnhandledReason="not in enabled list"

    alt SkipUnhandledDelay = false (normal)
        PEW->>PEW: Wait 4 hours (UnhandledRetryDelay)
        Note over PEW: Workflow stays ALIVE for 4h
        Note over PEW: Same workflow ID blocks re-processing
    else SkipUnhandledDelay = true (batch mode — current)
        PEW-->>Batch: Complete immediately
        Note over PEW: Workflow DONE — ID is free
        Note over Batch: Continues to next email
    end
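
The branch at the end of the diagram can be sketched in plain C#. This is a minimal illustration, not the actual ProcessEmailWorkflow code: the class name, method name, and the injected delay parameter are hypothetical, and the assumption that handled emails complete without waiting is inferred from the diagrams. In the real workflow the delay would presumably be a Temporal timer such as Workflow.DelayAsync.

```csharp
using System;
using System.Threading.Tasks;

public static class UnhandledBackoff
{
    // 4 hours, per the diagram. Where the real constant lives is not shown in the source.
    public static readonly TimeSpan UnhandledRetryDelay = TimeSpan.FromHours(4);

    // Returns true if the workflow held its ID open by waiting out the backoff.
    // The delay is injected so the logic can run outside a workflow (hypothetical design).
    public static async Task<bool> ApplyAsync(
        bool wasMarkedAsRead,
        bool skipUnhandledDelay,
        Func<TimeSpan, Task> delay)
    {
        if (wasMarkedAsRead || skipUnhandledDelay)
        {
            // Complete immediately: the workflow ID becomes free for the next poll.
            return false;
        }

        // Stay alive for the backoff period: the workflow ID stays taken.
        await delay(UnhandledRetryDelay);
        return true;
    }
}
```

Only the combination of an unhandled email and SkipUnhandledDelay = false reaches the wait, which is the one path on which the workflow ID stays occupied.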

The Backoff Mechanism (When It Works)

The 4-hour backoff relies on Temporal workflow ID dedup:

stateDiagram-v2
    [*] --> Running: Poll starts ProcessEmailWorkflow (email-abc)
    Running --> Unhandled: Email classified but not actionable
    Unhandled --> Waiting4h: Wait 4 hours (workflow stays alive)

    Waiting4h --> Waiting4h: Next poll tries same ID → WorkflowAlreadyStartedException → SKIP

    Waiting4h --> Completed: 4 hours elapsed
    Completed --> [*]: Next poll can start fresh workflow

While the workflow is alive (waiting), any attempt to start another workflow with the same deterministic ID (email-{sha256(messageId)}) throws WorkflowAlreadyStartedException, and the batch skips that email.
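
Deduplication only works because the same message always maps to the same workflow ID. A minimal sketch of such a derivation follows. The source gives only the pattern email-{sha256(messageId)}, so the UTF-8 encoding, lowercase hex output, and the EmailWorkflowId class name are assumptions.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class EmailWorkflowId
{
    // Same message ID in, same workflow ID out.
    // Encoding and hex casing are assumptions; the real code may differ.
    public static string For(string messageId)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(messageId));
        return "email-" + Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```

Because the ID depends only on the message ID, two polls that see the same unread email will always attempt to start the same workflow ID.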

Why It's Broken in Batch Mode

The batch courier workflow (EmailPollerCourierWorkflow) awaits each child sequentially. If a child waited 4 hours, the entire batch would be blocked.

To avoid this, the batch sets SkipUnhandledDelay = true:

// EmailPollerCourierWorkflow.cs — lines 113-136
var emailRequest = new ProcessSingleEmailWorkflowRequest
{
    // ... other fields ...
    SkipUnhandledDelay = true  // ← THIS IS THE PROBLEM
};

await Workflow.ExecuteChildWorkflowAsync(          // ← Awaits child
    (ProcessEmailWorkflow wf) => wf.RunAsync(emailRequest),
    new ChildWorkflowOptions
    {
        Id = workflowId,
        ExecutionTimeout = TimeSpan.FromMinutes(2),          // ← 2 min timeout
        ParentClosePolicy = ParentClosePolicy.Terminate,     // ← Kill child if parent dies
        IdReusePolicy = WorkflowIdReusePolicy.AllowDuplicate // ← Allow reuse after completion
    });

This creates the infinite loop:

graph TD
    A[Poll runs every 15 min] --> B[Batch fetches unread emails]
    B --> C[Start ProcessEmailWorkflow email-abc]
    C --> D[LLM classifies: reason 50, not actionable]
    D --> E{SkipUnhandledDelay?}
    E -->|true batch mode| F[Complete immediately]
    F --> G[Workflow ID is free]
    G --> H[Next poll: 15 min later]
    H --> B

    E -->|false normal| I[Wait 4 hours]
    I --> J[ID blocks re-processing]
    J --> K[After 4h: complete]
    K --> H

    style F fill:#fce4ec
    style G fill:#fce4ec
    style I fill:#e8f5e9
    style J fill:#e8f5e9
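
The two branches of the loop can be compared with a small model. This is an illustrative sketch, not production code: ReprocessingModel and CountRuns are hypothetical names, and the model assumes the email stays unread and that every poll picks it up.

```csharp
using System;

public static class ReprocessingModel
{
    // Counts classification runs for one unhandled email over a time window.
    // A backoff of TimeSpan.Zero models SkipUnhandledDelay = true.
    public static int CountRuns(TimeSpan window, TimeSpan pollInterval, TimeSpan backoff)
    {
        var runs = 0;
        var idBusyUntil = TimeSpan.Zero; // until when the workflow ID is still held

        for (var now = TimeSpan.Zero; now < window; now += pollInterval)
        {
            if (now < idBusyUntil)
            {
                continue; // duplicate start is rejected, so the poll skips the email
            }

            runs++;
            idBusyUntil = now + backoff;
        }

        return runs;
    }
}
```

Under these assumptions, a 4-hour window polled every 15 minutes gives 16 runs with no backoff and 1 run with the 4-hour backoff.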

The Cost

Each 15-minute cycle repeats the following work for every unhandled email:

  1. Graph API call to fetch email body
  2. LLM call to classify the email category
  3. LLM call to classify the rejection reason
  4. Slack notification (every time!)
  5. Temporal workflow overhead

For the example claim 155052273934731, this amounted to 160 events (about 53 LLM calls) over 8 days, with no change in outcome.

The Fix: Fire-and-Forget Children

Switch from ExecuteChildWorkflowAsync (blocks parent) to StartChildWorkflowAsync (fire-and-forget):

sequenceDiagram
    participant Batch as EmailPollerCourierWorkflow
    participant P1 as ProcessEmailWorkflow (email-abc)
    participant P2 as ProcessEmailWorkflow (email-def)

    Note over Batch: Start children WITHOUT awaiting results

    Batch->>P1: StartChildWorkflowAsync (fire-and-forget)
    Note over Batch: Returns immediately
    Batch->>P2: StartChildWorkflowAsync (fire-and-forget)
    Note over Batch: Returns immediately
    Batch->>Batch: All started, batch completes

    Note over P1: ParentClosePolicy = Abandon
    Note over P1: Children continue independently

    P1->>P1: Classify → Unhandled
    P1->>P1: Wait 4 hours (backoff works!)
    Note over P1: Workflow alive → ID blocks reuse

    P2->>P2: Classify → Processed
    P2->>P2: Complete immediately
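
In the batch workflow, the diagram above might translate into something like the following. This is a hedged sketch against the Temporal .NET SDK, not the actual change: ChildStarter and TryStartAsync are hypothetical names, and the request and workflow types are cut-down stand-ins for the real ones. The 2-minute ExecutionTimeout from the earlier excerpt is left out, since a child that waits 4 hours would presumably need it raised or removed.

```csharp
using System;
using System.Threading.Tasks;
using Temporalio.Api.Enums.V1;
using Temporalio.Exceptions;
using Temporalio.Workflows;

// Stand-ins for the real types, reduced to what this sketch needs.
public record ProcessSingleEmailWorkflowRequest(string MessageId, bool SkipUnhandledDelay);

[Workflow]
public class ProcessEmailWorkflow
{
    [WorkflowRun]
    public Task RunAsync(ProcessSingleEmailWorkflowRequest request) => Task.CompletedTask;
}

public static class ChildStarter
{
    // Must be called from workflow code.
    // Returns false when the ID is still held by a child that is waiting out its backoff.
    public static async Task<bool> TryStartAsync(
        string workflowId, ProcessSingleEmailWorkflowRequest request)
    {
        try
        {
            // Awaiting here waits only for the child to START, not to finish.
            await Workflow.StartChildWorkflowAsync(
                (ProcessEmailWorkflow wf) => wf.RunAsync(request),
                new ChildWorkflowOptions
                {
                    Id = workflowId,
                    // Let the child outlive the batch workflow.
                    ParentClosePolicy = ParentClosePolicy.Abandon,
                    // Allow a fresh run once the previous one has completed.
                    IdReusePolicy = WorkflowIdReusePolicy.AllowDuplicate,
                });
            return true;
        }
        catch (WorkflowAlreadyStartedException)
        {
            return false; // backoff still in progress: skip this email
        }
    }
}
```

The caller would pass SkipUnhandledDelay = false so that the child performs its own 4-hour wait. The start call is still awaited because the parent should not complete before its children have actually been started.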

Key changes: