Email Backoff Investigation — Why Unhandled Emails Get Reprocessed Every 15 Minutes
The 3-Tier Architecture
graph TD
S[Schedule: every 15 min] --> EP[EmailPollerWorkflow]
EP -->|fire-and-forget per courier| C1[EmailPollerCourierWorkflow DPD]
EP -->|fire-and-forget per courier| C2[EmailPollerCourierWorkflow Asendia]
EP -->|fire-and-forget per courier| C3[EmailPollerCourierWorkflow ...]
C1 -->|awaits each email sequentially| P1[ProcessEmailWorkflow email-abc]
C1 -->|awaits each email sequentially| P2[ProcessEmailWorkflow email-def]
C1 -->|awaits each email sequentially| P3[ProcessEmailWorkflow email-ghi]
style S fill:#fff3e0
style EP fill:#e1f5fe
style C1 fill:#e8f5e9
style C2 fill:#e8f5e9
style C3 fill:#e8f5e9
style P1 fill:#f3e5f5
style P2 fill:#f3e5f5
style P3 fill:#f3e5f5
How a Single Email Gets Processed
sequenceDiagram
participant Batch as EmailPollerCourierWorkflow
participant PEW as ProcessEmailWorkflow
participant Act as ClassifyAndProcessEmail Activity
participant LLM as LLM Classification
participant Graph as MS Graph API
Batch->>PEW: ExecuteChildWorkflowAsync (AWAITS result)
Note over Batch: Blocked until child completes
PEW->>Act: Classify email
Act->>Graph: Fetch email body
Graph-->>Act: Email content
Act->>LLM: Classify rejection reason
LLM-->>Act: Reason 50 (Delivered to safe place), 99% confidence
Act-->>PEW: WasMarkedAsRead=false, UnhandledReason="not in enabled list"
alt SkipUnhandledDelay = false (normal)
PEW->>PEW: Wait 4 hours (UnhandledRetryDelay)
Note over PEW: Workflow stays ALIVE for 4h
Note over PEW: Same workflow ID blocks re-processing
else SkipUnhandledDelay = true (batch mode — current)
PEW-->>Batch: Complete immediately
Note over PEW: Workflow DONE — ID is free
Note over Batch: Continues to next email
end
The Backoff Mechanism (When It Works)
The 4-hour backoff relies on Temporal workflow ID deduplication: a workflow ID cannot be started again while a workflow with that ID is still running.
stateDiagram-v2
[*] --> Running: Poll starts ProcessEmailWorkflow (email-abc)
Running --> Unhandled: Email classified but not actionable
Unhandled --> Waiting4h: Wait 4 hours (workflow stays alive)
Waiting4h --> Waiting4h: Next poll tries same ID → WorkflowAlreadyStartedException → SKIP
Waiting4h --> Completed: 4 hours elapsed
Completed --> [*]: Next poll can start fresh workflow
While the workflow is alive (waiting), any attempt to start a new workflow with the same deterministic ID (email-{sha256(messageId)}) throws WorkflowAlreadyStartedException, and the batch skips it.
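The deterministic ID is what makes the dedup possible: the same message always maps to the same workflow ID. A minimal sketch of such a derivation follows. The class and method names are hypothetical, and the UTF-8 encoding and lowercase hex output are assumptions; the real code may encode the hash differently.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class EmailWorkflowId
{
    // Builds "email-" followed by the hex SHA-256 digest of the message ID.
    // Assumption: UTF-8 input bytes and lowercase hex output.
    public static string For(string messageId)
    {
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes(messageId));
        return "email-" + Convert.ToHexString(hash).ToLowerInvariant();
    }
}
```

Calling EmailWorkflowId.For twice with the same message ID yields the same 70-character string ("email-" plus 64 hex characters), so a second start attempt collides with the first for as long as that workflow is running.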
Why It's Broken in Batch Mode
The batch courier workflow (EmailPollerCourierWorkflow) awaits each child sequentially. If a child waited 4 hours, the entire batch would be blocked.
To avoid this, the batch sets SkipUnhandledDelay = true:
// EmailPollerCourierWorkflow.cs — lines 113-136
var emailRequest = new ProcessSingleEmailWorkflowRequest
{
    // ... other fields ...
    SkipUnhandledDelay = true // ← THIS IS THE PROBLEM
};
await Workflow.ExecuteChildWorkflowAsync( // ← Awaits child
    (ProcessEmailWorkflow wf) => wf.RunAsync(emailRequest),
    new ChildWorkflowOptions
    {
        Id = workflowId,
        ExecutionTimeout = TimeSpan.FromMinutes(2), // ← 2 min timeout
        ParentClosePolicy = ParentClosePolicy.Terminate, // ← Kill child if parent dies
        IdReusePolicy = WorkflowIdReusePolicy.AllowDuplicate // ← Allow reuse after completion
    });
This creates the infinite loop:
graph TD
A[Poll runs every 15 min] --> B[Batch fetches unread emails]
B --> C[Start ProcessEmailWorkflow email-abc]
C --> D[LLM classifies: reason 50, not actionable]
D --> E{SkipUnhandledDelay?}
E -->|true batch mode| F[Complete immediately]
F --> G[Workflow ID is free]
G --> H[Next poll: 15 min later]
H --> B
E -->|false normal| I[Wait 4 hours]
I --> J[ID blocks re-processing]
J --> K[After 4h: complete]
K --> H
style F fill:#fce4ec
style G fill:#fce4ec
style I fill:#e8f5e9
style J fill:#e8f5e9
The Cost
Every 15-minute cycle for an unhandled email:
- Graph API call to fetch email body
- LLM call to classify the email category
- LLM call to classify the rejection reason
- Slack notification (every time!)
- Temporal workflow overhead
For the example claim 155052273934731: 160 events = ~53 LLM calls over 8 days, with no change in outcome.
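To see how much the missing backoff inflates the work, the loop can be modelled without Temporal at all. The sketch below is an idealised model, not the production logic. It assumes one permanently unhandled email, a poll that fires exactly on schedule, and a workflow that blocks its ID for a fixed lifetime. PollSimulation and its parameters are hypothetical names, and the model is not meant to reproduce the 160 events measured above.

```csharp
using System;

public static class PollSimulation
{
    // Counts how often one unhandled email is classified within the horizon,
    // given how long each workflow run keeps its ID occupied.
    public static int CountClassifications(
        TimeSpan horizon, TimeSpan pollInterval, TimeSpan workflowLifetime)
    {
        int classifications = 0;
        TimeSpan aliveUntil = TimeSpan.Zero; // nothing is running at the start

        for (TimeSpan now = TimeSpan.Zero; now < horizon; now += pollInterval)
        {
            if (now < aliveUntil)
            {
                continue; // same ID still running: the start is rejected and the poll skips
            }

            classifications++;                  // Graph fetch, LLM calls, Slack message
            aliveUntil = now + workflowLifetime;
        }

        return classifications;
    }
}
```

With a 15-minute poll over 8 days, a lifetime of TimeSpan.Zero (immediate completion, as in batch mode) gives 768 classification runs, while a lifetime of 4 hours gives 48.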
The Fix: Fire-and-Forget Children
Switch from ExecuteChildWorkflowAsync (blocks parent) to StartChildWorkflowAsync (fire-and-forget):
sequenceDiagram
participant Batch as EmailPollerCourierWorkflow
participant P1 as ProcessEmailWorkflow (email-abc)
participant P2 as ProcessEmailWorkflow (email-def)
Note over Batch: Start children WITHOUT awaiting results
Batch->>P1: StartChildWorkflowAsync (fire-and-forget)
Note over Batch: Returns immediately
Batch->>P2: StartChildWorkflowAsync (fire-and-forget)
Note over Batch: Returns immediately
Batch->>Batch: All started, batch completes
Note over P1: ParentClosePolicy = Abandon
Note over P1: Children continue independently
P1->>P1: Classify → Unhandled
P1->>P1: Wait 4 hours (backoff works!)
Note over P1: Workflow alive → ID blocks reuse
P2->>P2: Classify → Processed
P2->>P2: Complete immediately
Key changes:
- ExecuteChildWorkflowAsync → StartChildWorkflowAsync (don't await result)
- ParentClosePolicy.Terminate → ParentClosePolicy.Abandon (children survive parent)
- Remove SkipUnhandledDelay = true (let the 4h delay work)
- Remove ExecutionTimeout: 2 min (children need up to 4.5h)
- Tradeoff: batch loses processed/failed counters (only knows started/skipped)
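Applied to the snippet from EmailPollerCourierWorkflow.cs, the changes might look roughly like the fragment below. It is a sketch rather than the actual patch: it assumes the usings, workflowId, and surrounding loop of the original file. It also assumes SkipUnhandledDelay defaults to false and that a rejected child start surfaces as WorkflowAlreadyStartedException, as described for the backoff mechanism. The startedCount and skippedCount counters are hypothetical.

```csharp
// Fragment only: relies on the usings and variables of EmailPollerCourierWorkflow.cs.
var emailRequest = new ProcessSingleEmailWorkflowRequest
{
    // ... other fields ...
    // SkipUnhandledDelay is left unset so the child applies its 4-hour wait
};

try
{
    // Completes once the child has started; the child's result is never awaited
    await Workflow.StartChildWorkflowAsync(
        (ProcessEmailWorkflow wf) => wf.RunAsync(emailRequest),
        new ChildWorkflowOptions
        {
            Id = workflowId,
            // No ExecutionTimeout: the child must survive the whole backoff window
            ParentClosePolicy = ParentClosePolicy.Abandon,       // child outlives the batch
            IdReusePolicy = WorkflowIdReusePolicy.AllowDuplicate // reuse only after completion
        });
    startedCount++; // hypothetical counter
}
catch (WorkflowAlreadyStartedException)
{
    // A workflow for this email is still waiting out its backoff
    skippedCount++; // hypothetical counter
}
```

The two counters reflect the tradeoff noted above: the batch can report how many children it started or skipped, but not how each one ended.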