# Retries and Backoff
Jobs fail.
The email service returns a 503. The PDF generator runs out of memory. The inventory API times out. Jobs fail for the same reasons HTTP requests fail — transient errors, service outages, network issues.
The difference: an HTTP request must respond to the user immediately. A job can wait and try again. This makes retries natural for background jobs.
## How retries work in the queue
The `failJob` function from the Database-Backed Queues lesson already implements retries:

```ts
export function failJob(jobId: string, error: string): void {
  db.prepare(
    `
    UPDATE jobs
    SET status = CASE
          WHEN attempts + 1 >= max_attempts THEN 'failed'
          ELSE 'pending'
        END,
        attempts = attempts + 1,
        last_error = ?,
        locked_by = NULL,
        locked_at = NULL,
        updated_at = datetime('now'),
        scheduled_at = CASE
          WHEN attempts + 1 >= max_attempts THEN scheduled_at
          ELSE datetime('now', '+' || (attempts + 1) * 30 || ' seconds')
        END
    WHERE id = ?
    `,
  ).run(error, jobId);
}
```

When a job fails, one of two things happens, depending on the attempt count:
- **Retries remaining** (`attempts + 1 < max_attempts`): the status goes back to `'pending'` and `scheduled_at` is set to a future time. The worker will pick it up again later.
- **No retries left** (`attempts + 1 >= max_attempts`): the status becomes `'failed'` permanently. The job will not be retried.
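To make the two outcomes concrete, here is a minimal sketch. It assumes the `db` handle and `failJob` from the Database-Backed Queues lesson, and a pending job whose `max_attempts` is 2; the job id is a placeholder.

```ts
const jobId = "11111111-1111-1111-1111-111111111111"; // placeholder id of a pending job with max_attempts = 2

failJob(jobId, "503 from email service");
let row = db
  .prepare("SELECT status, attempts, scheduled_at FROM jobs WHERE id = ?")
  .get(jobId);
// First failure: attempts + 1 (= 1) < max_attempts, so status is back to
// 'pending' and scheduled_at is roughly 30 seconds in the future.

failJob(jobId, "503 from email service");
row = db
  .prepare("SELECT status, attempts, last_error FROM jobs WHERE id = ?")
  .get(jobId);
// Second failure: attempts + 1 (= 2) >= max_attempts, so status is 'failed'
// and the job will not be picked up again.
```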
## Backoff strategy
The delay formula `(attempts + 1) * 30` seconds creates linear backoff (shown here with `max_attempts = 5`):

| Attempt | Delay before retry | Total elapsed |
|---|---|---|
| 1 | 30 seconds | 30s |
| 2 | 60 seconds | 1.5 min |
| 3 | 90 seconds | 3 min |
| 4 | 120 seconds | 5 min |
| 5 | Failed | — |
> [!NOTE]
> The Error Handling course's Retries lesson used exponential backoff for HTTP requests (1s, 2s, 4s, 8s). For background jobs, linear backoff is often sufficient because the worker is not blocking a user request. The delays are longer (30s+) because background jobs deal with outages, not brief blips.
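To see how the two strategies differ, here is a quick sketch (not part of the queue code) that prints the delay after each failure, using a 30-second base for both formulas:

```ts
// Delay in seconds after each failed attempt, for both strategies.
// `attempts` is the value stored in the row before the failing attempt is counted.
for (let attempts = 0; attempts < 4; attempts++) {
  const linear = (attempts + 1) * 30;      // 30, 60, 90, 120 — matches failJob
  const exponential = 30 * 2 ** attempts;  // 30, 60, 120, 240 — the HTTP-style curve
  console.log(`failure ${attempts + 1}: linear ${linear}s, exponential ${exponential}s`);
}
```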
## Choosing `max_attempts`
- **Too few (1-2):** a transient error permanently fails the job. The email is never sent.
- **Too many (50+):** a job with a permanent error (invalid email address, deleted record) retries for hours, wasting resources.

Good defaults: notifications (5 attempts), document generation (3 attempts), external API sync (5-10 attempts), payment processing (3 attempts, then escalate to a human).
## Per-job retry configuration
enqueue("send_email", payload, { maxAttempts: 10 }); // Emails recover often
enqueue("generate_invoice", payload, { maxAttempts: 3 }); // PDF failures are usually bugs
enqueue("charge_card", payload, { maxAttempts: 3 }); // Payments need careful retries Tracking failures
## Tracking failures

The `last_error` column stores the most recent error message:
```sql
SELECT type, attempts, last_error, scheduled_at
FROM jobs
WHERE status = 'pending' AND attempts > 0;
```

You can see which jobs are struggling, why they are failing, and when they will be retried.
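Jobs that have exhausted their retries no longer match that query, because their status is `'failed'`. To review them, a similar lookup works (same schema assumptions as above):

```ts
// Jobs that ran out of attempts and will not run again.
const failedJobs = db
  .prepare(
    `
    SELECT type, attempts, last_error, updated_at
    FROM jobs
    WHERE status = 'failed'
    ORDER BY updated_at DESC
    `,
  )
  .all();
```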
## Exercises
**Exercise 1:** Enqueue a job that always fails. Set `max_attempts = 3`. Watch the worker attempt it three times, with increasing delays between attempts.

**Exercise 2:** Query the jobs table during the retries. Verify that `attempts` increases and `scheduled_at` moves forward.

**Exercise 3:** Change the backoff formula from linear to exponential. Compare the retry timing.
Why does a failed job go back to 'pending' status instead of a separate 'retry' status?