Monitoring and Observability

Knowing if your queue is healthy

A job queue is invisible. Users do not see it. Developers do not see it (unless they query the database). Without monitoring, problems hide: jobs pile up, failures go unnoticed, the dead letter queue grows silently.

Key metrics

Queue depth: How many jobs are pending. If this number grows over time, workers are not keeping up.

Processing rate: How many jobs complete per minute. If this drops while jobs are still pending, workers have stalled, slowed down, or crashed.

Failure rate: What percentage of jobs fail. A spike usually means a downstream service is down or a bug was deployed.

Oldest pending job: How long the oldest job has been waiting. If it is hours old, the queue is stuck or workers are badly behind.

DLQ size: How many permanently failed jobs need attention.
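Processing rate and failure rate are derived values rather than raw counts. A minimal sketch of the arithmetic, assuming you already have completed and failed counts for the same time window (the `WindowCounts` shape and `computeRates` name are hypothetical, not part of the course code):

```typescript
// Hypothetical input: job counts observed over one time window.
interface WindowCounts {
  completed: number; // jobs that finished successfully in the window
  failed: number; // jobs that failed in the window
  windowMinutes: number; // length of the window in minutes
}

interface QueueRates {
  processedPerMinute: number;
  failureRatePercent: number;
}

function computeRates(counts: WindowCounts): QueueRates {
  const finished = counts.completed + counts.failed;
  return {
    // Guard against a zero-length window to avoid dividing by zero.
    processedPerMinute:
      counts.windowMinutes > 0 ? counts.completed / counts.windowMinutes : 0,
    // Failure rate is measured against jobs that finished, not jobs still pending.
    failureRatePercent: finished > 0 ? (counts.failed * 100) / finished : 0,
  };
}

const lastHour = computeRates({ completed: 540, failed: 60, windowMinutes: 60 });
console.log(lastHour); // { processedPerMinute: 9, failureRatePercent: 10 }
```

Both counts need to cover the same window for the percentage to mean anything. A total failed count compared against completions from the last hour would overstate the failure rate.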

A monitoring endpoint

route.get("/admin/queue/stats", {
  resolve: () => {
    const stats = db
      .prepare(
        `
      SELECT
        COUNT(*) FILTER (WHERE status = 'pending') AS pending,
        COUNT(*) FILTER (WHERE status = 'processing') AS processing,
        COUNT(*) FILTER (WHERE status = 'completed' AND completed_at > datetime('now', '-1 hour')) AS completed_last_hour,
        COUNT(*) FILTER (WHERE status = 'failed') AS failed,
        MIN(CASE WHEN status = 'pending' THEN scheduled_at END) AS oldest_pending
      FROM jobs
    `,
      )
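      .get() as {
        pending: number;
        processing: number;
        completed_last_hour: number;
        failed: number;
        oldest_pending: string | null; // NULL when no job is pending
      };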

    const dlqCount = db
      .prepare("SELECT COUNT(*) AS count FROM dead_letter_jobs WHERE reviewed = 0")
      .get() as { count: number };

    // Job counts grouped by type and status
    const byType = db
      .prepare(
        `
      SELECT type, status, COUNT(*) AS count
      FROM jobs
      GROUP BY type, status
      ORDER BY type
    `,
      )
      .all();

    return Response.json({
      queue: {
        pending: stats.pending,
        processing: stats.processing,
        completedLastHour: stats.completed_last_hour,
        failed: stats.failed,
        oldestPending: stats.oldest_pending,
      },
      deadLetterQueue: {
        unreviewed: dlqCount.count,
      },
      byType,
    });
  },
});

[!NOTE] The Error Handling course’s Health Checks lesson built dependency health into /health. This monitoring endpoint extends that pattern to the job queue — exposing operational health that the /health endpoint can reference.

Alerting thresholds

Set alerts based on the metrics:

function checkQueueHealth(): { status: "healthy" | "degraded" | "unhealthy"; issues: string[] } {
  const issues: string[] = [];

  const stats = db
    .prepare(
      `
    SELECT
      COUNT(*) FILTER (WHERE status = 'pending') AS pending,
      MIN(CASE WHEN status = 'pending' THEN scheduled_at END) AS oldest
    FROM jobs
  `,
    )
    .get() as { pending: number; oldest: string | null };

  const dlq = db
    .prepare("SELECT COUNT(*) AS count FROM dead_letter_jobs WHERE reviewed = 0")
    .get() as { count: number };

  if (stats.pending > 1000) issues.push(`Queue depth: ${stats.pending} pending jobs`);

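  if (stats.oldest) {
    // SQLite's datetime() text ("YYYY-MM-DD HH:MM:SS") is UTC with no zone marker,
    // so convert it to ISO 8601 with "Z" before parsing; otherwise Date may read it
    // as local time. Assumes scheduled_at is stored in that format or as ISO 8601.
    const iso = stats.oldest.includes("T") ? stats.oldest : stats.oldest.replace(" ", "T") + "Z";
    const ageMs = Date.now() - new Date(iso).getTime();
    if (ageMs > 30 * 60 * 1000) {
      issues.push(`Oldest pending job is ${Math.round(ageMs / 60000)} minutes old`);
    }
  }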

  if (dlq.count > 10) issues.push(`${dlq.count} unreviewed dead letter jobs`);

  return {
    status: issues.length === 0 ? "healthy" : issues.length > 2 ? "unhealthy" : "degraded",
    issues,
  };
}
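The threshold logic can also be separated from the database queries, which makes the limits configurable and the status rules testable without a database. A sketch under that assumption; `QueueSnapshot`, `Thresholds`, and `evaluateQueue` are hypothetical names, and the defaults mirror the values hard-coded in checkQueueHealth:

```typescript
type QueueStatus = "healthy" | "degraded" | "unhealthy";

// Hypothetical snapshot of the numbers the SQL queries would return.
interface QueueSnapshot {
  pending: number;
  oldestPendingAgeMs: number | null; // null when nothing is pending
  unreviewedDlq: number;
}

interface Thresholds {
  maxPending: number;
  maxOldestAgeMs: number;
  maxUnreviewedDlq: number;
}

// Defaults copied from checkQueueHealth: 1000 jobs, 30 minutes, 10 DLQ entries.
const DEFAULT_THRESHOLDS: Thresholds = {
  maxPending: 1000,
  maxOldestAgeMs: 30 * 60 * 1000,
  maxUnreviewedDlq: 10,
};

function evaluateQueue(
  snapshot: QueueSnapshot,
  thresholds: Thresholds = DEFAULT_THRESHOLDS,
): { status: QueueStatus; issues: string[] } {
  const issues: string[] = [];

  if (snapshot.pending > thresholds.maxPending) {
    issues.push(`Queue depth: ${snapshot.pending} pending jobs`);
  }
  if (
    snapshot.oldestPendingAgeMs !== null &&
    snapshot.oldestPendingAgeMs > thresholds.maxOldestAgeMs
  ) {
    const minutes = Math.round(snapshot.oldestPendingAgeMs / 60000);
    issues.push(`Oldest pending job is ${minutes} minutes old`);
  }
  if (snapshot.unreviewedDlq > thresholds.maxUnreviewedDlq) {
    issues.push(`${snapshot.unreviewedDlq} unreviewed dead letter jobs`);
  }

  // Same rule as checkQueueHealth: no issues is healthy, all three is unhealthy.
  const status: QueueStatus =
    issues.length === 0 ? "healthy" : issues.length > 2 ? "unhealthy" : "degraded";
  return { status, issues };
}

const result = evaluateQueue({ pending: 1500, oldestPendingAgeMs: 5 * 60 * 1000, unreviewedDlq: 0 });
console.log(result.status); // degraded
```

Note that with three checks, "unhealthy" is only reported when every check fails at once. Whether a single severe issue should escalate on its own is a design choice this sketch leaves unchanged.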

Integrating with health checks

// In your health check from the Error Handling course
const queueHealth = checkQueueHealth();
checks.jobQueue = {
  status: queueHealth.status === "healthy" ? "up" : "warning",
  message: queueHealth.issues.join("; ") || undefined,
};

Cleanup: purging old completed jobs

Completed jobs accumulate indefinitely unless something deletes them, and every stats query has to scan a larger table as a result. Purge them periodically:

registerCron("cleanup_completed_jobs", "0 3 * * *"); // Daily at 3 AM

// Handler
cleanup_completed_jobs: async () => {
  const result = db.prepare(
    "DELETE FROM jobs WHERE status = 'completed' AND completed_at < datetime('now', '-7 days')"
  ).run();
  console.log(`[CLEANUP] Deleted ${result.changes} completed jobs older than 7 days`);
},

[!NOTE] This uses the cron scheduling from the Recurring Jobs lesson — a background job that cleans up other background jobs. The queue maintains itself.
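If the retention period should be configurable rather than fixed at 7 days in the SQL string, one option is to compute the cutoff in application code and bind it as a query parameter. A sketch, assuming completed_at is stored in SQLite's UTC text format ("YYYY-MM-DD HH:MM:SS"); `toSqliteUtc` and `retentionCutoff` are hypothetical helper names:

```typescript
// Format a Date in SQLite's UTC text format: "YYYY-MM-DD HH:MM:SS".
function toSqliteUtc(date: Date): string {
  // toISOString() yields "YYYY-MM-DDTHH:MM:SS.sssZ"; keep the first 19 characters.
  return date.toISOString().slice(0, 19).replace("T", " ");
}

// Hypothetical helper: the timestamp before which completed jobs are purged.
function retentionCutoff(now: Date, retentionDays: number): string {
  const msPerDay = 24 * 60 * 60 * 1000;
  return toSqliteUtc(new Date(now.getTime() - retentionDays * msPerDay));
}

const cutoff = retentionCutoff(new Date("2024-01-08T03:00:00Z"), 7);
console.log(cutoff); // 2024-01-01 03:00:00
```

The resulting string could then replace the datetime('now', '-7 days') expression, for example with `completed_at < ?` in the DELETE statement and `.run(cutoff)` to bind it. Text comparison only orders correctly when both sides use the same format.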

Exercises

Exercise 1: Build the monitoring endpoint. Enqueue 50 jobs. Check the queue depth before and after the worker processes them.

Exercise 2: Enqueue jobs that fail. Check the failure rate and DLQ size on the monitoring endpoint.

Exercise 3: Register a daily cleanup cron job. Insert old completed jobs. Run the cleanup. Verify they are deleted.

What is the most important metric for queue health?
