hectoday

Error Handling and Resilience with @hectoday/http

The problem with errors

  • Why error handling matters
  • Project setup

Error fundamentals

  • JavaScript error types
  • Try-catch and error propagation
  • Async errors

Structured error handling

  • Custom error classes
  • A global error handler
  • Operational vs programmer errors

Resilience patterns

  • Retries
  • Timeouts
  • Circuit breakers
  • Fallbacks and degradation

Server lifecycle

  • Graceful shutdown
  • Uncaught exceptions and unhandled rejections
  • Health checks under failure

Putting it all together

  • Error handling checklist
  • Capstone: resilient e-commerce API

Retries

Transient failures

Our error handling system can catch errors and return proper responses. But some failures are temporary. A payment gateway times out because of a brief network blip. A database query fails because of a momentary lock. An API returns 503 because the service is restarting after a deployment.

These are called transient failures. The system was not broken. It was momentarily unavailable. If you try the exact same request a second later, it works.

Right now, our app treats every failure the same way: catch it, return an error, done. But for transient failures, that is wasteful. The user sees an error and has to click “try again” manually. If we had just waited a moment and retried automatically, the request would have succeeded.

A simple retry

The idea is straightforward. Try the operation. If it fails, try again. If it fails again, try one more time. Only give up after a certain number of attempts.

Code along
// src/retry.ts
async function withRetry<T>(fn: () => Promise<T>, maxRetries: number = 3): Promise<T> {
  let lastError: Error | undefined;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      console.log(`Attempt ${attempt}/${maxRetries} failed: ${lastError.message}`);
    }
  }

  throw lastError;
}

// Usage: retry a database read (safe because reads are idempotent)
const product = await withRetry(() => fetchProduct("prod-1"));

Let’s walk through what happens. withRetry takes a function fn and a maximum number of attempts (despite the name, maxRetries counts total attempts, including the first). It runs a loop from 1 to maxRetries. On each iteration, it calls fn() and awaits the result. If the call succeeds, withRetry returns immediately. If it throws, the error is saved in lastError, a message is logged, and the loop continues to the next attempt.

If all attempts fail, the function throws lastError, the error from the final attempt. The caller sees the error as if there had been no retry at all, but the operation had multiple chances to succeed.
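To make that control flow concrete, here is a self-contained sketch. It repeats the simple withRetry from above so the snippet runs standalone, and uses a flaky operation that fails twice before succeeding:

```typescript
// The simple retry loop from above, repeated so this snippet runs standalone.
async function withRetry<T>(fn: () => Promise<T>, maxRetries: number = 3): Promise<T> {
  let lastError: Error | undefined;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
    }
  }
  throw lastError;
}

// A flaky operation: fails on the first two calls, succeeds on the third.
let calls = 0;
const flaky = async (): Promise<string> => {
  calls++;
  if (calls < 3) throw new Error(`transient failure #${calls}`);
  return "ok";
};

const result = await withRetry(flaky);
// result is "ok" after three attempts; the caller never sees the two failures
```

With maxRetries set to 2 instead, the same flaky function would exhaust its attempts and the caller would see the last transient error.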

This works, but there is a problem. If the payment service is overloaded and struggling to respond, what are we doing? Immediately sending more requests. That makes the overload worse.

Exponential backoff

Instead of retrying immediately, we wait between attempts. And we wait longer each time. The first retry waits 1 second. The second waits 2 seconds. The third waits 4 seconds. This gives the failing service time to recover instead of piling more load on it.

Replace the simple version above with this improved implementation in src/retry.ts:

Code along
// src/retry.ts
export async function withRetry<T>(
  fn: () => Promise<T>,
  options: { maxRetries?: number; baseDelayMs?: number; maxDelayMs?: number } = {},
): Promise<T> {
  const { maxRetries = 3, baseDelayMs = 1000, maxDelayMs = 30000 } = options;
  let lastError: Error | undefined;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));

      if (attempt === maxRetries) break; // No more retries

      // Exponential backoff: 1s, 2s, 4s, 8s, ... capped at maxDelayMs
      const delay = Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs);

      // Add jitter: randomize +/-25% to prevent thundering herd
      const jitter = delay * (0.75 + Math.random() * 0.5);

      console.log(
        `Attempt ${attempt}/${maxRetries} failed. Retrying in ${Math.round(jitter)}ms...`,
      );
      await new Promise((r) => setTimeout(r, jitter));
    }
  }

  throw lastError;
}

The delay calculation is baseDelayMs * Math.pow(2, attempt - 1). With a base delay of 1000ms, that gives us 1000ms, 2000ms, 4000ms, 8000ms, and so on. The Math.min caps it at maxDelayMs so the delays do not grow forever.

But there is one more thing: jitter. The line delay * (0.75 + Math.random() * 0.5) randomizes the delay by plus or minus 25%. Why? Imagine 100 clients all fail at the same time. Without jitter, they all retry at exactly 1 second, then exactly 2 seconds, then exactly 4 seconds. They hit the service in synchronized waves. This is called the thundering herd problem. Jitter spreads the retries over a window, so the recovering service gets a steady trickle of requests instead of sudden bursts.
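The schedule and the jitter window can be checked in isolation. This small sketch extracts the two calculations from the withRetry above into standalone functions:

```typescript
// Nominal exponential backoff delay for a given attempt, capped at maxDelayMs.
function backoffDelay(attempt: number, baseDelayMs = 1000, maxDelayMs = 30000): number {
  return Math.min(baseDelayMs * Math.pow(2, attempt - 1), maxDelayMs);
}

// Jitter: randomize to somewhere in [75%, 125%) of the nominal delay.
function withJitter(delayMs: number): number {
  return delayMs * (0.75 + Math.random() * 0.5);
}

const schedule = [1, 2, 3, 4, 5, 6].map((attempt) => backoffDelay(attempt));
// schedule is [1000, 2000, 4000, 8000, 16000, 30000] -- the sixth delay hits the cap

const jittered = withJitter(1000); // somewhere in [750, 1250)
```

Note that the cap applies to the nominal delay before jitter, so a jittered delay can slightly exceed maxDelayMs; that is usually acceptable, but you can apply Math.min after the jitter instead if you need a hard ceiling.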

[!NOTE] The Real-Time APIs course’s WebSocket reconnection uses the same exponential backoff pattern. The principle is the same whether you are retrying an HTTP call or reconnecting a WebSocket.

Which operations are safe to retry

Here is a question that trips up a lot of developers: can you retry any operation?

No. Absolutely not.

Safe to retry (idempotent): GET requests, PUT requests, DELETE requests, database reads. These are idempotent, meaning doing them twice produces the same result as doing them once. Reading a product twice gives you the same product. Deleting a record twice leaves it deleted.

Not safe to retry (non-idempotent): POST requests that create resources, payment charges, sending emails. Retrying these might create duplicates. Charge a credit card twice? Send two confirmation emails? Create two orders? None of that is acceptable.

// SAFE: GET is idempotent, retrying reads the same data
const product = await withRetry(() => fetchProduct("prod-1"));

// DANGEROUS: POST might create a duplicate order
const order = await withRetry(() => createOrder(items)); // Two orders?

// SAFE IF: the service supports idempotency keys (from the REST course)
const charge = await withRetry(() => chargeCard(amount, token, { idempotencyKey: orderId }));

That last example is interesting. An idempotency key makes a non-idempotent operation safe to retry. The payment service sees the same key on the second request and says “I already processed this, here is the original result.” The REST API Design course covers idempotency keys in detail.

[!NOTE] The REST API Design course’s idempotency lesson explains why PUT and DELETE are safe to retry (they converge to the same state) and POST is not (each call creates a new resource). Idempotency keys make non-idempotent operations safe to retry.
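To illustrate what the payment service is doing on its side, here is a hypothetical sketch of idempotency-key handling. ChargeResult, the in-memory Map, and chargeWithIdempotencyKey are all illustrative names, not part of @hectoday/http or any real payment API:

```typescript
// Hypothetical service-side view of an idempotency key: the first request with
// a given key is processed; any replay returns the stored original result.
type ChargeResult = { chargeId: string; amount: number };

const processed = new Map<string, ChargeResult>();

function chargeWithIdempotencyKey(amount: number, idempotencyKey: string): ChargeResult {
  const existing = processed.get(idempotencyKey);
  if (existing) return existing; // Replay detected: return the original result

  const result: ChargeResult = { chargeId: `ch_${processed.size + 1}`, amount };
  processed.set(idempotencyKey, result);
  return result;
}

const first = chargeWithIdempotencyKey(4999, "order-42");
const retried = chargeWithIdempotencyKey(4999, "order-42");
// retried.chargeId === first.chargeId: the retry did not create a second charge
```

A production service would persist the key-to-result mapping and expire old keys, but the core contract is the same: one key, one charge, however many times the request arrives.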

Retry budgets

Do not retry forever. If the service is truly down (not a transient blip), retries just delay the inevitable failure. Set a budget:

// Retry budget: at most 3 retries, with individual delays capped at 5 seconds
const charge = await withRetry(() => chargeCard(amount, token, { idempotencyKey: orderId }), {
  maxRetries: 3,
  baseDelayMs: 1000,
  maxDelayMs: 5000,
});

Keep maxRetries low, somewhere between 2 and 5. If three retries do not work, a fourth probably will not either. Let the error propagate so the caller can try a different strategy (like queuing for later, which we will cover soon).
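One way to make a total-time budget explicit is to track a deadline alongside the attempt cap. This is a sketch, not part of the withRetry above; the totalBudgetMs option is illustrative:

```typescript
// Sketch: retry with both an attempt cap and a total-time budget. Once the
// deadline passes, we stop sleeping and let the last error propagate.
async function withRetryBudget<T>(
  fn: () => Promise<T>,
  { maxRetries = 3, baseDelayMs = 1000, totalBudgetMs = 10000 } = {},
): Promise<T> {
  const deadline = Date.now() + totalBudgetMs;
  let lastError: Error | undefined;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      if (attempt === maxRetries) break;

      // Never sleep past the deadline; if the budget is spent, give up now.
      const delay = Math.min(baseDelayMs * Math.pow(2, attempt - 1), deadline - Date.now());
      if (delay <= 0) break;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }

  throw lastError;
}

const value = await withRetryBudget(async () => 42, { maxRetries: 2, baseDelayMs: 10 });
// value is 42: the first attempt succeeded, so no delay was needed
```

A refinement left out here for brevity: the budget only bounds the waiting between attempts, not the attempts themselves, which is exactly why the next lesson's timeouts matter.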

Exercises

Exercise 1: Implement the withRetry function with exponential backoff. Call the simulated chargeCard function (which fails 20% of the time). Verify it succeeds after retries.

Exercise 2: Set maxRetries = 1. Call chargeCard repeatedly. How often does it fail now vs with 3 retries?

Exercise 3: Log the delay between retries. Verify the delays roughly double each time (about 1s, then 2s, then 4s before jitter; raise maxRetries if you want to see more than two delays).

Retries handle transient failures, but they still wait for each attempt to complete (or time out). What happens when an attempt takes 30 seconds? The next lesson covers timeouts, which make sure we never wait forever.

Why do retries use exponential backoff instead of a fixed delay?


© 2026 hectoday. All rights reserved.