Benchmarking with wrk
Up to this point every number in this course has come from measurements on a specific machine. Now it is your turn. In this lesson you run the same benchmarks against the same four servers, on your own hardware, and fill in the numbers yourself. If the story we have been telling is true, the shape of your results will look the same as ours, even if the exact numbers are different.
The point is not to chase our numbers. Hardware varies and your laptop is not our Mac mini. The point is to see which approaches stay flat as the dataset grows and which one falls off a cliff, and to make sure you can reproduce what you read.
Installing wrk
wrk is a small HTTP benchmarking tool written in C. It is the right tool here because it is asynchronous, very lightweight, and uses almost no CPU relative to the load it generates. A few threads are enough to saturate anything we have built.
On macOS:
```
brew install wrk
```

On Debian or Ubuntu:

```
sudo apt-get install wrk
```

Verify it works:

```
wrk --version
```

Why wrk and not something Node-based
You might reach for autocannon or k6 first. Both are fine tools for their own reasons, but they are the wrong choice here.
autocannon runs on Node, which means it is a JavaScript event loop hammering a JavaScript server. When both sit on the same machine, the load generator and the server compete for the same CPU. Your throughput numbers end up partly measuring the load generator’s cost, not the server’s capacity. wrk is written in C and uses kqueue or epoll under the hood. It generates load without stealing CPU from what you are trying to measure.
k6 is excellent for realistic user-flow scripts with multiple steps, think time, and conditional logic. But here we are asking one simple question: how many GET /users/:id requests per second can this server handle? For that, wrk is lighter and simpler.
A wrk script that picks random ids
If you pointed wrk at a single URL, every request would hit the same user over and over. The OS page cache would warm up almost immediately, the Map.get call would return the same object every time, and you would end up measuring how fast your CPU can serve the hottest possible case. Not realistic.
What we want is every request to hit a different random id, drawn from the actual seeded dataset. wrk supports this through a Lua script. You point it at your script with -s and it uses the script to build each request.
```lua
-- random_ids.lua
math.randomseed(os.time())

local ids = {}
local f = io.open("ids.json", "r")
local content = f:read("*all")
f:close()

-- Crude JSON array parse: grab every double-quoted string
for id in content:gmatch('"([^"]+)"') do
  table.insert(ids, id)
end
print("loaded " .. #ids .. " ids")

request = function()
  local id = ids[math.random(1, #ids)]
  return wrk.format("GET", "/users/" .. id)
end
```

When wrk starts, each of its threads runs the top of the script once. We seed the random number generator with the current time so runs are not identical. We open ids.json (the file our seed script wrote alongside the data), read all of it, and close the file.
The regex-looking line is a very simple JSON array parser. We look for every double-quoted string in the file and push it onto our ids table. That is all ids.json contains: a list of strings in a bracketed array, so this works.
The request function is what wrk calls for every request. It picks a random id from the loaded list and returns a GET /users/<that-id> request. That is the request that goes out over the wire.
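One assumption worth calling out: the script expects ids.json to sit in the directory you run wrk from. If your seed script does not already emit it, a few lines can rebuild it from the seeded data. A sketch, using the file names from this course's setup; make-ids.ts is a hypothetical helper, not a course file:

```ts
// make-ids.ts -- a hypothetical helper: rebuild ids.json from
// users.jsonl, assuming one JSON object per line with a string "id".
import { readFileSync, writeFileSync } from "node:fs";

const ids = readFileSync("users.jsonl", "utf8")
  .split("\n")
  .filter((line) => line.trim() !== "")
  .map((line) => (JSON.parse(line) as { id: string }).id);

// The Lua script only looks for double-quoted strings, so a plain
// JSON array of strings is exactly the shape it expects.
writeFileSync("ids.json", JSON.stringify(ids));
console.log(`wrote ${ids.length} ids`);
```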
The benchmark command
Four worker threads, fifty concurrent connections, ten seconds of load.
```
wrk -t4 -c50 -d10s -s random_ids.lua http://localhost:8081
```

The flags:

- -t4: four threads. More than this on a few-core machine and the threads start contending with each other for CPU.
- -c50: fifty connections kept open across the run. With HTTP/1.1 keepalive this means up to 50 in-flight requests at any moment.
- -d10s: run for ten seconds. Longer runs smooth out warmup noise, but ten seconds is enough for these servers to stabilize.
- -s random_ids.lua: the Lua script above.
A typical output (in-memory map at 1M records):

```
Running 10s test @ http://localhost:8081
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   711.17us    1.23ms   45.12ms   98.45%
    Req/Sec    18.12k     1.42k    21.03k    82.30%
Requests/sec:  72073.67
Transfer/sec:     18.31MB
```

Two numbers matter most for us. Requests/sec is the overall throughput across all threads. Latency Avg is the average response time. The two are consistent with each other: with 50 connections kept busy, throughput is roughly 50 divided by the average latency, and 50 / 711µs ≈ 70,000 req/s, close to the measured 72,074. In production you would also care about p95 and p99 latency (the worst tail; wrk prints the full distribution if you add the --latency flag), but for comparing storage strategies, the average is enough.
A small shell script to keep runs consistent
You are going to run this a lot. Three dataset sizes, four approaches, twelve runs total. A small shell script keeps things reproducible.
```bash
#!/usr/bin/env bash
# bench.sh
set -euo pipefail

SIZE=${1:-10000}
APPROACH=${2:-map}

echo "=== Seeding $SIZE records ==="
node seed.ts $SIZE users.jsonl

if [ "$APPROACH" = "sqlite" ]; then
  rm -f users.db users.db-wal users.db-shm
  node import-jsonl.ts users.jsonl
fi

if [ "$APPROACH" = "binary" ]; then
  node build-index.ts
fi

echo "=== Starting server ($APPROACH) ==="
APPROACH=$APPROACH node src/server.ts &
SERVER_PID=$!

# Give the server a moment to warm up and load any in-memory state
sleep 2

echo "=== Running wrk ==="
wrk -t4 -c50 -d10s -s random_ids.lua http://localhost:8081

kill $SERVER_PID
wait $SERVER_PID 2>/dev/null || true
echo "=== Done ==="
```

It takes two arguments, the dataset size and the approach: ./bench.sh 100000 sqlite, for example. It seeds fresh data, runs the one-time setup for SQLite or binary search if needed (re-import, rebuild the index), starts the server in the background, waits two seconds for warmup, then runs wrk. Make it executable with chmod +x bench.sh before the first run.
To switch approaches without four separate server files, we branch on an env var at the top of src/server.ts. While we are in there, we will also read the port from an env var so we can run two servers side by side later without editing code.
```ts
// src/server.ts
import { serve } from "srvx";
import { route, setup } from "@hectoday/http";
import { z } from "zod/v4";

const approach = process.env.APPROACH ?? "map";
const port = Number(process.env.PORT ?? 8081);

const store = await import(`./store-${approach}.ts`);
const { findUser, createUser } = store;

// ...rest unchanged, but at the bottom:
// serve({ port, fetch: app.fetch });
```

And we keep one file per strategy inside src/: store-linear.ts, store-map.ts, store-binary.ts, store-sqlite.ts. Each exports the same findUser and createUser functions. The dynamic import at the top of src/server.ts picks the right one based on the APPROACH environment variable. PORT defaults to 8081 but can be overridden for multi-server tests later in section 2.
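For concreteness, here is roughly the shape one of those modules could take. This is a sketch rather than the course's exact file: it assumes users.jsonl holds one JSON object per line, each with a string id.

```ts
// store-map.ts -- a sketch of the shared store interface, assuming
// users.jsonl has one JSON object per line with a string "id" field.
import { readFileSync } from "node:fs";

type User = { id: string } & Record<string, unknown>;

// Load everything into a Map once at startup: O(n) to build,
// O(1) for every lookup afterwards.
const users = new Map<string, User>();
for (const line of readFileSync("users.jsonl", "utf8").split("\n")) {
  if (line.trim() === "") continue;
  const user = JSON.parse(line) as User;
  users.set(user.id, user);
}

export function findUser(id: string): User | undefined {
  return users.get(id);
}

export function createUser(user: User): User {
  users.set(user.id, user);
  return user;
}
```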
Running the full matrix
Three sizes, four approaches, twelve runs.
```bash
for size in 10000 100000 1000000; do
  for approach in linear map binary sqlite; do
    echo ">>> $approach @ $size"
    ./bench.sh $size $approach
  done
done
```

This takes a few minutes. The 1M linear-scan run is the slowest because it is, well, doing linear scans. At around 5 requests per second, it completes only a few dozen requests in a 10-second window, and wrk mostly reports spiky latency numbers. Be patient.
Recording your numbers
The simplest thing is to jot down the Requests/sec line for each run. Here is the table you are filling in:
| Records | Linear | Map | Binary | SQLite |
|---|---|---|---|---|
| 10k | | | | |
| 100k | | | | |
| 1M | | | | |
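If you would rather not copy numbers by hand, a small wrapper can run the whole matrix and pull the Requests/sec line out of each run. A sketch that shells out to bench.sh; run-matrix.ts is a hypothetical helper, not a course file:

```ts
// run-matrix.ts -- drive bench.sh across the full matrix and print
// a CSV of throughput numbers. Assumes bench.sh is executable and
// sits in the current directory.
import { execFileSync } from "node:child_process";

console.log("records,approach,req_per_sec");
for (const size of [10_000, 100_000, 1_000_000]) {
  for (const approach of ["linear", "map", "binary", "sqlite"]) {
    const out = execFileSync("./bench.sh", [String(size), approach], {
      encoding: "utf8",
    });
    // wrk prints a line like "Requests/sec:  72073.67"
    const match = out.match(/Requests\/sec:\s+([\d.]+)/);
    console.log(`${size},${approach},${match ? match[1] : "n/a"}`);
  }
}
```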
For comparison, here are the numbers we measured on an Apple Silicon Mac mini running Node 24. Your numbers will move by a factor of a few depending on hardware, but the shape should reproduce cleanly.
| Records | Linear | Map | Binary | SQLite |
|---|---|---|---|---|
| 10k | 474 | 66,573 | 26,187 | 52,258 |
| 100k | 49 | 65,466 | 25,278 | 47,436 |
| 1M | 5 | 72,074 | 26,448 | 50,307 |
A couple of sanity checks to make on your own numbers. First, the three index-style approaches (Map, Binary, SQLite) should all be roughly flat across the scales. If any of them degrades by more than about 30 percent going from 10k to 1M, something is off in your setup. Second, SQLite and the in-memory map should be within a small constant factor of each other, somewhere between 1.3x and 2x. If SQLite comes out ten times slower than the map in your run, check that PRAGMA journal_mode = WAL actually took effect on the database you are hitting.
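One way to check is to open the database and read the pragma back. A sketch using node:sqlite, the built-in module that ships with recent Node releases (still marked experimental):

```ts
// check-wal.ts -- verify the journal mode on the benchmark database.
import { DatabaseSync } from "node:sqlite";

const db = new DatabaseSync("users.db");
// Should print { journal_mode: 'wal' } if the pragma took effect.
console.log(db.prepare("PRAGMA journal_mode").get());
db.close();
```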
A few things that can go wrong
Numbers vary wildly between runs. If your machine is doing something in the background (Spotlight indexing, a build, a video call), expect noise. Close other apps and run each benchmark a few times, keeping the best run.
Latency spikes in the first second. Servers warm up. V8's JIT compiler inlines hot paths after a few thousand iterations, and the OS allocates and faults in pages on first access. The two-second sleep in bench.sh gives the server some time to settle, but you can ignore the first second of any run if you are looking at fine-grained latency.
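If the fixed two-second sleep bothers you, a more deliberate warmup is to fire a few thousand throwaway requests before wrk starts measuring. A sketch, assuming the server is already listening on port 8081 and ids.json exists:

```ts
// warmup.ts -- hit the server a few thousand times so V8 has seen
// the hot paths before the measured run begins. A sketch.
import { readFileSync } from "node:fs";

const ids: string[] = JSON.parse(readFileSync("ids.json", "utf8"));

for (let i = 0; i < 5000; i++) {
  const id = ids[Math.floor(Math.random() * ids.length)];
  await fetch(`http://localhost:8081/users/${id}`);
}
console.log("warmup done");
```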
Linear scan does not feel slow at 10,000 records. The file is small enough that after the first request the whole thing is in the OS page cache, so every subsequent read is effectively a memory scan. A few hundred requests per second sounds bad, but for a lot of internal tools it is fine. The collapse only becomes brutal as the file grows past what the page cache can comfortably hold and the OS has to actually hit the disk on every request.
SQLite shows SQLITE_BUSY errors under heavy concurrent writes. Our benchmark is read-only, so this does not happen here. If you send a pile of POST traffic at high concurrency, you will see writes serialize. That is by design, and section 2 unpacks why.
What you have measured so far
A req/s number on its own is not actionable. You cannot look at “50,000 req/s” and know whether you need to scale up, scale out, or do nothing. You need to translate it into something product-oriented. How many users does this actually support?
That is the next lesson.