Monitoring and Observability
Knowing if your queue is healthy
A job queue is invisible. Users do not see it. Developers do not see it (unless they query the database). Without monitoring, problems hide: jobs pile up, failures go unnoticed, the dead letter queue grows silently.
Key metrics
Queue depth: How many jobs are pending. If this number grows over time, workers are not keeping up.
Processing rate: How many jobs complete per minute. If this drops while jobs are still pending, workers are stalled or slowed.
Failure rate: What percentage of jobs fail. A spike usually means a downstream service is down or a bug was deployed.
Oldest pending job: How long the oldest job has been waiting. If it is hours old, the queue is stuck.
DLQ size: How many permanently failed jobs need attention.
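The metrics above can be derived from a handful of raw counts. The following is a minimal sketch, not the lesson's implementation: `QueueSample`, `deriveMetrics`, and `parseSqliteUtc` are hypothetical names, the failure rate is taken as failed / (completed + failed) within one sampling window (one common definition), and timestamps are assumed to be SQLite UTC text in the form "YYYY-MM-DD HH:MM:SS".

```typescript
// Hypothetical shape: raw counts sampled from the jobs table over one window.
interface QueueSample {
  pending: number;
  completedInWindow: number;
  failedInWindow: number;
  windowMinutes: number;
  oldestPendingAt: string | null; // assumed SQLite UTC text, "YYYY-MM-DD HH:MM:SS"
}

interface QueueMetrics {
  queueDepth: number;
  processingRatePerMinute: number;
  failureRatePercent: number;
  oldestPendingAgeMinutes: number | null;
}

// SQLite's datetime() text is UTC but carries no zone marker. Adding "T" and "Z"
// turns it into an ISO 8601 string that JavaScript parses as UTC.
function parseSqliteUtc(text: string): number {
  return Date.parse(text.replace(" ", "T") + "Z");
}

function deriveMetrics(sample: QueueSample, nowMs: number): QueueMetrics {
  const finished = sample.completedInWindow + sample.failedInWindow;
  return {
    queueDepth: sample.pending,
    // Guard against a zero-length window to avoid dividing by zero.
    processingRatePerMinute:
      sample.windowMinutes > 0 ? sample.completedInWindow / sample.windowMinutes : 0,
    // With no finished jobs in the window there is nothing to fail, so report 0.
    failureRatePercent: finished === 0 ? 0 : (sample.failedInWindow / finished) * 100,
    oldestPendingAgeMinutes:
      sample.oldestPendingAt === null
        ? null
        : Math.floor((nowMs - parseSqliteUtc(sample.oldestPendingAt)) / 60000),
  };
}
```

Keeping the arithmetic in a pure function like this makes it easy to unit test without a database; the raw counts would come from queries against the jobs table.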
A monitoring endpoint
route.get("/admin/queue/stats", {
  resolve: () => {
    // Aggregate FILTER clauses require SQLite 3.30 or newer.
    const stats = db
      .prepare(
        `
        SELECT
          COUNT(*) FILTER (WHERE status = 'pending') AS pending,
          COUNT(*) FILTER (WHERE status = 'processing') AS processing,
          COUNT(*) FILTER (WHERE status = 'completed' AND completed_at > datetime('now', '-1 hour')) AS completed_last_hour,
          COUNT(*) FILTER (WHERE status = 'failed') AS failed,
          MIN(CASE WHEN status = 'pending' THEN scheduled_at END) AS oldest_pending
        FROM jobs
        `,
      )
      .get() as {
        pending: number;
        processing: number;
        completed_last_hour: number;
        failed: number;
        oldest_pending: string | null; // NULL when nothing is pending
      };
    // Dead letter jobs that nobody has reviewed yet
    const dlqCount = db
      .prepare("SELECT COUNT(*) AS count FROM dead_letter_jobs WHERE reviewed = 0")
      .get() as { count: number };
    // Jobs per type
    const byType = db
      .prepare(
        `
        SELECT type, status, COUNT(*) AS count
        FROM jobs
        GROUP BY type, status
        ORDER BY type
        `,
      )
      .all();
    return Response.json({
      queue: {
        pending: stats.pending,
        processing: stats.processing,
        completedLastHour: stats.completed_last_hour,
        failed: stats.failed,
        oldestPending: stats.oldest_pending,
      },
      deadLetterQueue: {
        unreviewed: dlqCount.count,
      },
      byType,
    });
  },
});
[!NOTE] The Error Handling course’s Health Checks lesson built dependency health into /health. This monitoring endpoint extends that pattern to the job queue, exposing operational health that the /health endpoint can reference.
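The `byType` query returns one flat row per (type, status) pair, which is awkward to read on a dashboard. One way to consume it is to group the rows by job type on the client or in the endpoint. This is only a sketch: `TypeStatusRow` and `pivotByType` are hypothetical names, and the row shape is assumed to match the columns selected in the query (`type`, `status`, `count`).

```typescript
// Assumed row shape, matching the columns of the byType query.
interface TypeStatusRow {
  type: string;
  status: string;
  count: number;
}

// Groups flat rows into { [type]: { [status]: count } }.
function pivotByType(rows: TypeStatusRow[]): Record<string, Record<string, number>> {
  const result: Record<string, Record<string, number>> = {};
  for (const row of rows) {
    // Create the per-type bucket the first time a type is seen.
    const perStatus = result[row.type] ?? (result[row.type] = {});
    perStatus[row.status] = row.count;
  }
  return result;
}

// Example rows, invented for illustration only.
const exampleRows: TypeStatusRow[] = [
  { type: "email", status: "pending", count: 3 },
  { type: "email", status: "failed", count: 1 },
  { type: "report", status: "completed", count: 7 },
];
console.log(pivotByType(exampleRows));
```

Grouped this way, a job type whose `failed` count dominates its other statuses stands out immediately.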
Alerting thresholds
Set alerts based on the metrics:
function checkQueueHealth(): { status: "healthy" | "degraded" | "unhealthy"; issues: string[] } {
  const issues: string[] = [];
  const stats = db
    .prepare(
      `
      SELECT
        COUNT(*) FILTER (WHERE status = 'pending') AS pending,
        MIN(CASE WHEN status = 'pending' THEN scheduled_at END) AS oldest
      FROM jobs
      `,
    )
    .get() as { pending: number; oldest: string | null };
  const dlq = db
    .prepare("SELECT COUNT(*) AS count FROM dead_letter_jobs WHERE reviewed = 0")
    .get() as { count: number };
  // Threshold 1: queue depth
  if (stats.pending > 1000) issues.push(`Queue depth: ${stats.pending} pending jobs`);
  // Threshold 2: age of the oldest pending job.
  // Assumes scheduled_at carries an explicit UTC marker (for example ISO 8601 with "Z").
  // SQLite's "YYYY-MM-DD HH:MM:SS" text has none, so Date typically reads it as local time.
  if (stats.oldest) {
    const ageMs = Date.now() - new Date(stats.oldest).getTime();
    if (ageMs > 30 * 60 * 1000) {
      issues.push(`Oldest pending job is ${Math.round(ageMs / 60000)} minutes old`);
    }
  }
  // Threshold 3: unreviewed dead letter jobs
  if (dlq.count > 10) issues.push(`${dlq.count} unreviewed dead letter jobs`);
  // One or two issues: degraded. All three: unhealthy.
  return {
    status: issues.length === 0 ? "healthy" : issues.length > 2 ? "unhealthy" : "degraded",
    issues,
  };
}
Integrating with health checks
// In your health check from the Error Handling course
const queueHealth = checkQueueHealth();
checks.jobQueue = {
  status: queueHealth.status === "healthy" ? "up" : "warning",
  message: queueHealth.issues.join("; ") || undefined,
};
Cleanup: purging old completed jobs
Completed jobs accumulate forever. Purge them periodically:
registerCron("cleanup_completed_jobs", "0 3 * * *"); // Daily at 3 AM
// Handler
cleanup_completed_jobs: async () => {
  const result = db.prepare(
    "DELETE FROM jobs WHERE status = 'completed' AND completed_at < datetime('now', '-7 days')"
  ).run();
  console.log(`[CLEANUP] Deleted ${result.changes} completed jobs older than 7 days`);
},
[!NOTE] This uses the cron scheduling from the Recurring Jobs lesson: a background job that cleans up other background jobs. The queue maintains itself.
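The 7-day window in the cleanup handler is hard-coded in SQL. If retention needs to differ between environments, one option is to compute the cutoff in application code and pass it in as a bound parameter. The sketch below assumes `completed_at` is stored as SQLite UTC text ("YYYY-MM-DD HH:MM:SS"), which the `datetime('now', ...)` comparison suggests but the lesson does not state; `retentionCutoff` and `isPurgeable` are hypothetical helpers.

```typescript
// Returns the cutoff as SQLite-style UTC text so it compares correctly,
// as a plain string, against values written by datetime('now').
function retentionCutoff(now: Date, retentionDays: number): string {
  const cutoffMs = now.getTime() - retentionDays * 24 * 60 * 60 * 1000;
  // toISOString() yields "YYYY-MM-DDTHH:MM:SS.sssZ"; keep the first 19
  // characters and swap "T" for a space to match SQLite's format.
  return new Date(cutoffMs).toISOString().slice(0, 19).replace("T", " ");
}

// Mirrors the WHERE clause of the cleanup query for use in unit tests.
function isPurgeable(
  job: { status: string; completedAt: string | null },
  cutoff: string,
): boolean {
  return job.status === "completed" && job.completedAt !== null && job.completedAt < cutoff;
}

const cutoff = retentionCutoff(new Date(), 7);
console.log(`Jobs completed before ${cutoff} (UTC) are eligible for cleanup`);
```

If the SQLite driver in use supports bound parameters, the cutoff could replace the inline expression, with the query reading `completed_at < ?`. The string comparison is only valid while both sides use the same fixed-width format.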
Exercises
Exercise 1: Build the monitoring endpoint. Enqueue 50 jobs. Check the queue depth before and after the worker processes them.
Exercise 2: Enqueue jobs that fail. Check the failure rate and DLQ size on the monitoring endpoint.
Exercise 3: Register a daily cleanup cron job. Insert old completed jobs. Run the cleanup. Verify they are deleted.
What is the most important metric for queue health?