Approach 3: binary search on disk
The in-memory map is extremely fast, but it has one clear cost. Every user in the file has to live in RAM. For a million small records that is fine. For ten million, or for records with big text fields, you are suddenly thinking about server memory. And every time the server restarts, you pay the full load cost before you can serve the first request.
So here is the question. What if the data is too big to fit in memory, but we still want fast lookups? Can we get something close to a Map.get without actually loading everything?
Yes. The trick has been around for decades. We sort the data, build a small index file next to it, and do something called binary search directly on disk. This lesson is denser than the previous two, so we are going to take it in small steps.
The idea
We end up with two files.
First, users.jsonl, the same JSONL data we had before, except now it is sorted by id.
Second, a new file called users.idx. This is a binary index. One entry per user, and every entry is exactly the same number of bytes long. Here is what one entry looks like:
<36-character id>:<20-digit byte offset>\n

Let us count that. 36 bytes for the id (a UUID), 1 byte for the colon, 20 bytes for a zero-padded number, 1 byte for the newline. That is 58 bytes, always. The zero-padding matters: 42 becomes 00000000000000000042. Every entry, no matter the id or the offset, takes up the same space.
The byte offset is where the full user record lives inside users.jsonl. If an entry says offset 12345, we can jump directly to byte 12345 of the data file, read one line, and we have our user.
Why bother with fixed-width entries? Because when every entry is the same size, we can find entry number N by simple multiplication. Entry 1000 lives at byte 1000 * 58. There is no need to scan the file to find line boundaries. We can jump anywhere by doing arithmetic.
That property is what makes binary search practical here. Binary search needs random access: “give me the middle entry, now the middle of the left half, now the middle of the right half.” With fixed-width entries on disk, random access is a single read at a computed offset. Without that property, we would have to scan from the beginning to count lines.
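To make the arithmetic concrete, here is a minimal standalone sketch (not part of this lesson's files; readEntry is an illustrative name) of fetching entry n with a single positional read:

// Illustrative sketch only: random access into a fixed-width index.
// Entry n starts at byte n * 58, so one positional read fetches it.
import { openSync, readSync, closeSync } from "node:fs";

function readEntry(n: number): string {
  const fd = openSync("users.idx", "r");
  const buf = Buffer.alloc(58);
  readSync(fd, buf, 0, 58, n * 58); // read 58 bytes starting at byte n * 58
  closeSync(fd);
  return buf.toString("utf8"); // "<id>:<zero-padded offset>\n"
}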
Where we are starting from
Current project layout:
do-you-need-a-database/
├── package.json
├── seed.ts
├── users.jsonl ← data file from the last two lessons
├── ids.json ← seeded ids for benchmark scripts
└── src/
├── server.ts ← unchanged
    └── store.ts     ← currently the in-memory map implementation

What this lesson adds or changes:
- New file at the project root: build-index.ts. One-off script that sorts users.jsonl and writes a new file users.idx next to it. Runs once, the same way seed.ts does.
- src/store.ts gets replaced. Its findUser does binary search against users.idx. Its createUser throws, on purpose.
- New file at the root: users.idx. Produced by build-index.ts, consumed by src/store.ts.
- src/server.ts does not change. The new findUser is synchronous, and that is what the current handler already expects.
Target layout at the end of this lesson:
do-you-need-a-database/
├── package.json
├── seed.ts
├── build-index.ts ← NEW: sort users.jsonl + emit users.idx
├── users.jsonl ← now sorted by id
├── users.idx ← NEW: fixed-width index
├── ids.json
└── src/
├── server.ts ← unchanged
    └── store.ts     ← REPLACED: binary search on disk

Re-seed fresh so we have a clean unsorted file to build from. The --watch dev server can stay on; we will rebuild the index and replace the store in place.

node seed.ts 10000 users.jsonl

Step 1: Create build-index.ts at the project root
Create a new file called build-index.ts at the root of the project (alongside seed.ts, not inside src/). It is a one-off script, not application code, so it lives with the other scripts.
We will build its contents in three pieces. Everything in Step 1, Step 2, and Step 3 goes into this single build-index.ts file in the order shown.
Part 1 of build-index.ts: imports, constants, read the data
// build-index.ts
import { readFileSync, openSync, writeSync, closeSync } from "node:fs";
const DATA_FILE = "users.jsonl";
const INDEX_FILE = "users.idx";
const ENTRY_LEN = 58;
interface Row {
  id: string;
}

const lines = readFileSync(DATA_FILE, "utf8")
  .split("\n")
  .filter((l) => l.trim());

Three constants. DATA_FILE and INDEX_FILE are the filenames we will read from and write to. ENTRY_LEN is 58, the fixed width of each index entry we worked out above.
Row is a tiny interface. It only has id, because for sorting and indexing that is all we need to see. The full user record is stored back in the data file; we do not need the parsed form here.
Then we read the whole data file into memory and split on newlines, skipping empty lines. Yes, we load the whole file. For building the index, loading it once is fine, even at a million records. This is a one-off build step, not a per-request operation.
Part 2 of build-index.ts: parse the ids for sorting
Add these lines directly below the ones you just wrote.
// build-index.ts (continued)
const records = lines.map((line) => ({
  id: (JSON.parse(line) as Row).id,
  line,
}));

records.sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));

Each element of records is a little object with two fields: the parsed id (for the sort) and the original raw line (so we do not have to re-serialize the record when writing it back).
The sort call is a standard comparator. Return a negative number when a comes first, positive when b comes first, zero when equal. Plain string comparison works here because UUIDs sort lexicographically.
Part 3 of build-index.ts: write the sorted data and the index together
Add this last chunk at the bottom of build-index.ts. This is the heart of the script.
// build-index.ts (continued)
const dataFd = openSync(DATA_FILE, "w");
const idxFd = openSync(INDEX_FILE, "w");
const idxBuf = Buffer.alloc(ENTRY_LEN);

let offset = 0;
for (const { id, line } of records) {
  const bytes = Buffer.from(line + "\n", "utf8");
  writeSync(dataFd, bytes);

  idxBuf.fill(0x20); // space-pad
  idxBuf.write(id, 0, 36, "utf8");
  idxBuf[36] = 0x3a; // ":"
  idxBuf.write(String(offset).padStart(20, "0"), 37, 20, "utf8");
  idxBuf[57] = 0x0a; // "\n"
  writeSync(idxFd, idxBuf, 0, ENTRY_LEN);

  offset += bytes.length;
}

closeSync(dataFd);
closeSync(idxFd);

console.log(`indexed ${records.length} records`);

Let us walk this loop in slow motion, because every step matters.
Before the loop, we open both output files for writing. openSync returns a file descriptor, which is just a small integer the OS gives us so we can refer to an open file later. Buffer.alloc(ENTRY_LEN) allocates a 58-byte buffer we will reuse for every index entry. Reusing one buffer (instead of allocating a new one per entry) matters at a million records.
offset starts at 0. This is the running byte position in users.jsonl where the next record will land.
Inside the loop, for each record we do four things.
First, write the record to the data file. Buffer.from(line + "\n", "utf8") converts the line plus a trailing newline into a byte buffer. writeSync(dataFd, bytes) writes those bytes at the current end of the file. We will use bytes.length at the bottom of the loop to advance offset.
Second, build the index entry in the 58-byte buffer, one region at a time.
- idxBuf.fill(0x20) fills the buffer with spaces (0x20 is ASCII for space). This is a safety measure. We are going to overwrite every byte we care about, but starting from a known state avoids bugs if we miscount.
- idxBuf.write(id, 0, 36, "utf8") writes the 36-character id starting at byte 0.
- idxBuf[36] = 0x3a sets byte 36 to a colon (: is 0x3a).
- idxBuf.write(String(offset).padStart(20, "0"), 37, 20, "utf8") zero-pads the current byte offset to 20 characters ("42" becomes "00000000000000000042") and writes it at byte 37.
- idxBuf[57] = 0x0a sets the last byte to a newline (\n is 0x0a).
Third, write the 58-byte buffer to the index file.
Fourth, advance offset by the number of bytes we just wrote to the data file. The next record will land at this new offset.
After the loop, close both file descriptors. That flushes any OS-level buffering.
The easy bug to make
Read this closely, because it is the one thing that is really important:
Sort before writing. If we sort the records but write them using the original offsets from the unsorted file, every lookup will find the wrong record. The index has to point at the locations where records land in the new sorted file, not where they were in the old one. That is why we let offset advance as we write the sorted data, rather than reading it from the original lines.
In a real system you would also write to a temporary file and atomically rename it over the original so that a crash in the middle does not leave you with a half-rewritten file. We skip that here for clarity.
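For the index file, that pattern is nothing more than a write followed by a rename. A minimal sketch, assuming the index bytes are already assembled in memory (writeIndexAtomically is an illustrative name, not part of this lesson's script):

// Illustrative only: write-then-rename so readers never see a half-written index.
import { writeFileSync, renameSync } from "node:fs";

function writeIndexAtomically(indexBytes: Buffer) {
  const tmp = "users.idx.tmp";      // hypothetical temp path
  writeFileSync(tmp, indexBytes);   // write the whole new index to a temp file
  renameSync(tmp, "users.idx");     // atomic replace on the same filesystem
}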
Step 2: Run build-index.ts and verify the output
You should now have a complete build-index.ts at the project root. Run it.
node build-index.ts

Expected output:

indexed 10000 records

Two files should now exist at the root. Verify them.

ls -la users.jsonl users.idx

You should see both. users.jsonl is still there (it has been rewritten in sorted order). users.idx is new.
Sanity-check the index file size. For 10,000 records at 58 bytes per entry, users.idx should be exactly 10000 * 58 = 580000 bytes.
wc -c < users.idx
# 580000

If the number matches, the index is well-formed. If it is off, one of the writes in Part 3 produced a different number of bytes than we expect, and the binary-search lookups will land on garbage.
Step 3: Replace src/store.ts with the binary-search implementation
Open src/store.ts. It currently holds the in-memory map implementation from the last lesson. Delete the entire contents and rebuild it in pieces. Every code block in this step goes into this one file in the order shown.
Part 1 of src/store.ts: imports, constants, User interface
// src/store.ts
import { closeSync, existsSync, openSync, readSync, statSync } from "node:fs";
const DATA_FILE = "users.jsonl";
const INDEX_FILE = "users.idx";
const ENTRY_LEN = 58;
export interface User {
  id: string;
  name: string;
  email: string;
  created_at: string;
}

Nothing exciting here. We pull in the filesystem helpers for positional reads (readSync, which reads bytes from a specific offset), file descriptor management (openSync, closeSync), and file metadata (statSync, so we can check how big the index is).
The constants mirror the ones in build-index.ts because both files need to agree on the filenames and the entry width. In a bigger project you would move these into a shared constants.ts module and import them here and there, so there is one source of truth. For this lesson, duplicating three lines keeps each file self-contained and easy to follow.
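If you did want that single source of truth, it would be nothing more than a tiny module along these lines (a hypothetical constants.ts, not created in this lesson):

// constants.ts (hypothetical shared module)
export const DATA_FILE = "users.jsonl";
export const INDEX_FILE = "users.idx";
export const ENTRY_LEN = 58;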
The User interface is the same shape we have used in every lesson.
Part 2 of src/store.ts: the function signature and bail-out
Add this below the User interface.
// src/store.ts (continued)
export function findUser(id: string): User | null {
  if (!existsSync(INDEX_FILE)) return null;

  const idxFd = openSync(INDEX_FILE, "r");
  const dataFd = openSync(DATA_FILE, "r");

The function is synchronous. There is no stream to wait on. We take an id, return a User or null.
If the index does not exist yet, we bail out with null. That happens on a fresh server run where nobody has called build-index.ts yet.
Otherwise we open both files for reading. Each call gives us back a file descriptor. We will use them with readSync below.
Notice this function does not close yet. We will add a try / finally in the next part so that whatever happens in the search, we always release the descriptors.
Part 3 of src/store.ts: the try block and search bounds
Still inside findUser, continuing right after the two openSync lines.
// src/store.ts (continued, still inside findUser)
  try {
    const idxSize = statSync(INDEX_FILE).size;
    const total = Math.floor(idxSize / ENTRY_LEN);
    const buf = Buffer.alloc(ENTRY_LEN);

    let lo = 0;
    let hi = total - 1;

We open a try block. The matching finally comes at the end and closes both file descriptors.
statSync gives us the size of the index file in bytes. Divide by ENTRY_LEN (58) and you get the number of entries.
buf is a 58-byte buffer we will read one index entry into, per iteration of the search loop. One allocation, reused on every iteration.
lo and hi are the bounds of our current search range. Binary search works by repeatedly cutting that range in half, keeping the half that could contain the target, until the range collapses. We start with the full range (0 to total - 1).
Part 4 of src/store.ts: the binary-search loop
Next, inside the try block.
// src/store.ts (continued, still inside the try block)
    while (lo <= hi) {
      const mid = Math.floor((lo + hi) / 2);
      readSync(idxFd, buf, 0, ENTRY_LEN, mid * ENTRY_LEN);
      const entryId = buf.toString("utf8", 0, 36).trimEnd();

      const cmp = entryId < id ? -1 : entryId > id ? 1 : 0;

      if (cmp === 0) {
        const offset = parseInt(buf.toString("utf8", 37, 57), 10);
        const dataBuf = Buffer.alloc(1024);
        const bytesRead = readSync(dataFd, dataBuf, 0, 1024, offset);
        const line = dataBuf.subarray(0, bytesRead).toString("utf8").split("\n")[0];
        return JSON.parse(line) as User;
      }

      if (cmp < 0) lo = mid + 1;
      else hi = mid - 1;
    }

    return null;

This is the full search. Let us walk through it once.
The while (lo <= hi) loop keeps going as long as the range still has at least one element.
On each iteration we compute the midpoint index mid. Then we call readSync with five arguments: (fd, targetBuffer, bufferOffset, length, filePosition). In English: “read 58 bytes from idxFd, starting at byte offset mid * ENTRY_LEN in the file, into buf starting at position 0.” Because every entry is the same size, that multiplication gives us a byte-exact location. No scanning, just arithmetic.
We pull the id out of the first 36 bytes of the buffer (buf.toString("utf8", 0, 36)). The .trimEnd() peels off any space-padding left over from idxBuf.fill(0x20) in the builder.
We compare entryId to the target id. Three outcomes.
If they match (cmp === 0), we have found the right index entry. Now we go read the actual user record from users.jsonl. Bytes 37 through 57 of the entry are the zero-padded offset. We parse them as a base-10 integer, then use it to read a 1024-byte chunk from the data file at that offset. We convert the chunk to a string, take everything up to the first newline, and parse it as JSON. That is our user. Return it.
If entryId < id, our target is somewhere in the upper half, so we set lo = mid + 1. Next iteration looks at the upper half.
If entryId > id, the target is in the lower half, so hi = mid - 1.
If the loop exits with no match, we fall through and return null. The handler will turn that into a 404.
Part 5 of src/store.ts: the finally block and createUser stub
Last piece. Add this at the bottom of src/store.ts.
// src/store.ts (continued)
  } finally {
    closeSync(idxFd);
    closeSync(dataFd);
  }
}

export function createUser(_name: string, _email: string): User {
  throw new Error(
    "createUser not supported by binary-search-on-disk; new writes would break the sort order",
  );
}

This chunk closes the try block we opened in Part 3 and the function body we opened in Part 2. The finally always runs (success, early return on a match, thrown error, or falling through to null), and it is the only thing that releases the file descriptors. Leaving it out would leak handles over time.
The final closing } matches the findUser function declaration.
Then we add createUser. It throws, on purpose. Adding a user would append to users.jsonl, but that new line would land at the end with a random UUID that is almost certainly not in sorted order. The index would then point at the wrong records. Section 2 will walk through exactly how real systems solve this (and you will rediscover LSM-trees in the process). For this lesson, we just refuse and move on.
When you save this file, your dev server (if npm run dev is still running) will restart and start serving the binary-search implementation.
Step 4: Try a lookup
If it is not already running, start the server.
npm run dev

In another terminal, fetch a random user from the seeded set.
curl http://localhost:8081/users/$(jq -r '.[999]' ids.json)

This should return a user. Linear scan would have had to walk through nearly all 10,000 records to find it. Binary search does it in about log2(10000) ≈ 13 seeks.
Try a user that does not exist.
curl -i http://localhost:8081/users/00000000-0000-0000-0000-000000000000

You get a 404. The binary search narrowed the range until lo > hi without a match, fell through to return null, and the handler turned that into a 404.
Try POSTing.
curl -i -X POST http://localhost:8081/users \
  -H 'Content-Type: application/json' \
  -d '{"name":"New","email":"new@example.com"}'

You get a 500 with a message about new writes breaking the sort order. That is the stub createUser we wrote on purpose. We will revisit this in section 2.
Why the OS page cache makes this fast
Here is the part that is genuinely delightful once you see it.
After the server has been running for a few seconds, the OS page cache starts holding recently-read file pages in memory. The kernel does this automatically. You did not ask for it.
Now think about what binary search does. Every search starts by reading the middle entry of the index. Every single one. The second read is always in the middle of one of the halves. And so on.
So the same handful of “middle” pages of the index get hit on every lookup, regardless of which id you are looking for. Those pages stay hot in the OS cache. The first few seeks of every binary search are effectively memory reads, not disk reads. Only the last few seeks (the deep parts of the tree, into pages specific to a particular id) might actually touch disk. And even those often land on pages the OS has already prefetched.
This is why throughput stays flat as the dataset grows. Adding more records does not slow down a lookup, because the new records live in pages the OS does not bother caching until someone actually asks for them. The OS is doing exactly what the in-memory map does, except automatically and only for the parts that matter.
For one million records, the search visits about 20 entries. log2(1_000_000) is roughly 20. Twenty entries of 58 bytes is 1160 bytes of index reads per lookup, plus one read from the data file. That is the whole budget.
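If you want that arithmetic in one place, here is the back-of-envelope as code (illustrative only; the function names are not part of the lesson's files):

// Back-of-envelope: index bytes touched per lookup for n records.
const seeks = (n: number) => Math.ceil(Math.log2(n));  // ~14 at 10k, ~20 at 1M
const indexBytes = (n: number) => seeks(n) * 58;       // entries read * entry width
console.log(indexBytes(1_000_000)); // ≈ 1160 bytes of index reads, plus one data-file read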
What writes look like in a real system
We made createUser throw on writes. A production version of this would have to handle them. You have two ways out, and neither is free.
Rebuild the index periodically. Append new writes to the data file as normal. Mark the index as stale. Every N writes or every N minutes, re-sort the data file and rebuild the index. Reads against a stale index either check an in-memory write buffer first or fall back to a linear scan for recent entries.
Keep an unsorted write-ahead buffer. New writes go into a small in-memory buffer and a separate write-ahead log file. Reads check the buffer first, then fall through to the binary-search path. Every so often, you merge the buffer into the sorted index.
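As a rough illustration of the second option, here is a minimal sketch of the read path, assuming a Map-backed in-memory buffer in front of the on-disk search; writeBuffer and findUserWithBuffer are illustrative names, not part of this lesson's code.

// Illustrative sketch: check recent writes in memory before the binary-search path.
import { findUser, type User } from "./store";

const writeBuffer = new Map<string, User>(); // records not yet merged into the sorted index

export function findUserWithBuffer(id: string): User | null {
  const buffered = writeBuffer.get(id);
  if (buffered) return buffered; // recent, unmerged write wins
  return findUser(id);           // otherwise fall through to users.idx
}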
If those patterns sound familiar, it is because they are exactly what an LSM-tree (log-structured merge tree) does. RocksDB, LevelDB, and Cassandra all use variants of this idea. We are not going to implement an LSM-tree right now. The “writes break the index” lesson later in section 2 is where we will work through the details. The point for now is that the moment you need fast random reads on disk and fast writes, you have rediscovered the database from first principles.
The numbers
Measured on Node 24, Apple Silicon Mac mini.
| Records | Requests/sec | Avg latency |
|---|---|---|
| 10k | 26,187 | 1.9ms |
| 100k | 25,278 | 1.9ms |
| 1M | 26,448 | 1.9ms |
Read the shape here, not the absolute numbers. Throughput barely moves as the dataset grows a hundred times. That is the magic of log n plus a hot page cache. The work per lookup is logarithmic in dataset size, and most of the seeks are already in memory.
Compare to the in-memory map. Binary search on disk is slower per request because every findUser here opens two file descriptors, reads bytes through the OS, and closes them again. Even when the pages are already cached, those syscalls have a cost. The in-memory map does none of that. It just dereferences a pointer.
Binary search on disk is not winning on throughput. It is winning on memory footprint. It uses only a few hundred KB of hot cache pages even for a million records, and there is no startup cost. For a dataset in the tens of millions where the full map would not fit in RAM, this is genuinely the right tool.
Why does binary search on disk stay flat as the dataset grows from 10k to 1M, despite reading from disk on every request?
When this is the right answer
You want this approach in a narrow but real case. The dataset is too big to fit in RAM, you only need lookups by primary key, and you read much more than you write.
That covers things like:
- A static asset CDN where each entry maps a URL to file metadata.
- An offline geocoding lookup where each entry maps a ZIP code to coordinates, with millions of entries.
- A read-mostly knowledge base that gets rebuilt nightly.
- An index embedded alongside a data file in a build artifact you ship to clients.
For most web applications, you never actually reach this case. You hit one of the other limits first (needing more than one query shape, needing writes, needing transactions), and at that point you reach for SQLite. Which is exactly what we do in the next lesson.