# Retries and Backoff
Jobs fail.
The email service returns a 503. The PDF generator runs out of memory. The inventory API times out. Jobs fail for the same reasons HTTP requests fail — transient errors, service outages, network issues.
The difference: an HTTP request must respond to the user immediately. A job can wait and try again. This makes retries natural for background jobs.
## How retries work in the queue
The `failJob` function from the Database-Backed Queues lesson already implements retries:

```ts
export function failJob(jobId: string, error: string): void {
  db.prepare(
    `
    UPDATE jobs
    SET status = CASE
          WHEN attempts + 1 >= max_attempts THEN 'failed'
          ELSE 'pending'
        END,
        attempts = attempts + 1,
        last_error = ?,
        locked_by = NULL,
        locked_at = NULL,
        updated_at = datetime('now'),
        scheduled_at = CASE
          WHEN attempts + 1 >= max_attempts THEN scheduled_at
          ELSE datetime('now', '+' || (attempts + 1) * 30 || ' seconds')
        END
    WHERE id = ?
    `,
  ).run(error, jobId);
}
```

When a job fails, one of two things happens, depending on the attempt count:
- **Retries remaining** (`attempts + 1 < max_attempts`): the status goes back to `'pending'` and `scheduled_at` is set to a future time. The worker will pick it up again later.
- **No retries left** (`attempts + 1 >= max_attempts`): the status becomes `'failed'` permanently. The job will not be retried.
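To make the two outcomes concrete, here is a minimal sketch. It assumes the `db` handle and `failJob` from the Database-Backed Queues lesson, and a pending job whose `max_attempts` is 2; the job id is a placeholder.

```ts
const jobId = "11111111-1111-1111-1111-111111111111"; // placeholder id of a pending job with max_attempts = 2

failJob(jobId, "503 from email service");
let row = db
  .prepare("SELECT status, attempts, scheduled_at FROM jobs WHERE id = ?")
  .get(jobId);
// First failure: attempts + 1 (= 1) < max_attempts, so status is back to
// 'pending' and scheduled_at is roughly 30 seconds in the future.

failJob(jobId, "503 from email service");
row = db
  .prepare("SELECT status, attempts, last_error FROM jobs WHERE id = ?")
  .get(jobId);
// Second failure: attempts + 1 (= 2) >= max_attempts, so status is 'failed'
// and the job will not be picked up again.
```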
## Backoff strategy
The delay formula `(attempts + 1) * 30` seconds creates linear backoff (shown here with `max_attempts = 5`):

| Attempt | Delay before retry | Total elapsed |
|---|---|---|
| 1 | 30 seconds | 30s |
| 2 | 60 seconds | 1.5 min |
| 3 | 90 seconds | 3 min |
| 4 | 120 seconds | 5 min |
| 5 | Failed | — |
> [!NOTE]
> The Error Handling course's Retries lesson used exponential backoff for HTTP requests (1s, 2s, 4s, 8s). For background jobs, linear backoff is often sufficient because the worker is not blocking a user request. The delays are longer (30s+) because background jobs deal with outages, not brief blips.
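To see how the two strategies differ, here is a quick sketch (not part of the queue code) that prints the delay after each failure, using a 30-second base for both formulas:

```ts
// Delay in seconds after each failed attempt, for both strategies.
// `attempts` is the value stored in the row before the failing attempt is counted.
for (let attempts = 0; attempts < 4; attempts++) {
  const linear = (attempts + 1) * 30;      // 30, 60, 90, 120 — matches failJob
  const exponential = 30 * 2 ** attempts;  // 30, 60, 120, 240 — the HTTP-style curve
  console.log(`failure ${attempts + 1}: linear ${linear}s, exponential ${exponential}s`);
}
```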
## Choosing `max_attempts`
- **Too few (1-2):** a transient error permanently fails the job. The email is never sent.
- **Too many (50+):** a job with a permanent error (invalid email address, deleted record) retries for hours, wasting resources.

Good defaults: notifications (5 attempts), document generation (3 attempts), external API sync (5-10 attempts), payment processing (3 attempts, then escalate to a human).
## Per-job retry configuration
enqueue("send_email", payload, { maxAttempts: 10 }); // Emails recover often
enqueue("generate_invoice", payload, { maxAttempts: 3 }); // PDF failures are usually bugs
enqueue("charge_card", payload, { maxAttempts: 3 }); // Payments need careful retries Tracking failures
## Tracking failures

The `last_error` column stores the most recent error message:
```sql
SELECT type, attempts, last_error, scheduled_at
FROM jobs
WHERE status = 'pending' AND attempts > 0;
```

You can see which jobs are struggling, why they are failing, and when they will be retried.
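Jobs that have exhausted their retries no longer match that query, because their status is `'failed'`. To review them, a similar lookup works (same schema assumptions as above):

```ts
// Jobs that ran out of attempts and will not run again.
const failedJobs = db
  .prepare(
    `
    SELECT type, attempts, last_error, updated_at
    FROM jobs
    WHERE status = 'failed'
    ORDER BY updated_at DESC
    `,
  )
  .all();
```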
## Exercises
**Exercise 1:** Enqueue a job that always fails. Set `max_attempts = 3`. Watch the worker attempt it three times, with increasing delays between attempts.

**Exercise 2:** Query the jobs table during the retries. Verify that `attempts` increases and `scheduled_at` moves forward.

**Exercise 3:** Change the backoff formula from linear to exponential. Compare the retry timing.
Why does a failed job go back to 'pending' status instead of a separate 'retry' status?