Writes break the index
Back in the binary search lesson we handled writes with a single hand-wavy sentence: “appending new records breaks the sort, so in a real system you would rebuild the index periodically or keep an unsorted write-ahead buffer and merge it in.”
That sentence is doing a lot of work. It is, in fact, the entire reason database storage engines are as complicated as they are. This lesson unpacks it, piece by piece, and by the end you will have re-invented something called an LSM-tree from scratch.
What breaks
Recall the binary search setup. Two files on disk. users.jsonl is the data, sorted by id. users.idx is a fixed-width index of <UUID>:<byte offset> entries, also sorted by id. Every lookup does a binary search on the index (O(log n) reads) and then one read from the data file.
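Concretely, with ids, offsets, and field values invented for illustration, the pair of files looks something like this:

```
# users.jsonl (one JSON record per line, sorted by id)
{"id":"1b9d6bcd-...","name":"Alice","email":"alice@example.com","created_at":"..."}
{"id":"7c9e6679-...","name":"Bob","email":"bob@example.com","created_at":"..."}

# users.idx (one fixed-width <UUID>:<byte offset> entry per record, same order)
1b9d6bcd-...:00000000000000000000
7c9e6679-...:00000000000000000083
```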
Now imagine a POST /users comes in. The new user has a freshly generated UUID. That UUID almost certainly sorts somewhere in the middle of the existing ids.
We have three problems, and none of them are small.
- Where do we put the new record in the data file? If we append it to the end, the file is no longer sorted. If we insert it in the middle, we have to shift every byte after the insertion point. For a million-record file, that is potentially gigabytes of disk writes for a single user.
- What do we do with the index? Same story. The new entry needs to land at the right position to keep the file sorted. Inserting in the middle means rewriting everything after it.
- How do we serve reads while this is happening? A binary search assumes the index is consistent at the moment we read it. If a write is partway through shuffling bytes around, a concurrent read can see the index in a half-updated state and return nonsense.
The naive fix is “rewrite both files on every write.” That turns out to be O(n) work per write. At one million records, every POST takes on the order of a second. We have just rebuilt linear scan, but for writes instead of reads. This is not a fix.
There is no clever tweak to the fixed-width sorted-array structure that makes writes O(log n). The structure is simply wrong for workloads that mix reads and writes. We need a different shape.
The fix everyone ends up with
The idea is to stop trying to do it all with one structure. Instead, use two layers. A small layer that absorbs writes quickly. A large layer that serves reads efficiently. Periodically merge the small one into the large one.
- Reads: buffer first (small, O(n)), then sorted index (large, O(log n))
- Writes: always go to the buffer, O(1)
- Merge: periodically fold the buffer into the sorted index, O(n + m)

A read checks the buffer first. The buffer is small (say up to 10,000 entries), so even a linear scan or a Map.get is effectively instant. If the id is in the buffer, we return it. If it is not, we fall through to the binary search on the big sorted index.
Writes go straight to the buffer. No sorting needed. No file rewriting. It is essentially free.
When the buffer gets big enough, you merge. Combine the buffer with the existing sorted index, sort the whole thing, write out a fresh index, and swap it into place atomically. This operation is expensive, but it happens rarely, and you can do it in the background.
We are going to build this one piece at a time. The full code is a bit long, but each piece on its own is simple.
Step 1: Imports and shared state
Create a new file called src/store-binary-buffered.ts. Start with the imports and the constants that every piece below will reference.
```ts
// src/store-binary-buffered.ts
import {
  openSync,
  readSync,
  writeSync,
  statSync,
  closeSync,
  existsSync,
  appendFileSync,
  readFileSync,
} from "node:fs";
import { randomUUID } from "node:crypto";

const DATA_FILE = "users.jsonl";
const INDEX_FILE = "users.idx";
const BUFFER_FILE = "users.wal"; // write-ahead log

const ENTRY_LEN = 58;
const MERGE_THRESHOLD = 10_000;

export interface User {
  id: string;
  name: string;
  email: string;
  created_at: string;
}
```

Three file paths. DATA_FILE and INDEX_FILE are the same sorted data and index files from the binary search lesson. BUFFER_FILE is new: a durable log for writes that have not been merged into the sorted files yet.
MERGE_THRESHOLD = 10_000 is the cutoff. Once the buffer has 10,000 entries, we trigger a merge. That is a deliberate number. Small enough that scanning the buffer on every read is still fast; large enough that merges are rare.
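ENTRY_LEN = 58 is worth decoding too. The arithmetic only works out one way for the <UUID>:<byte offset> format: a 36-character UUID, a colon, a zero-padded 20-digit offset, and a newline, so 36 + 1 + 20 + 1 = 58 bytes. A sketch of writing one entry under that assumption (formatIndexEntry is our name for it, not part of the lesson's code):

```ts
// Hypothetical helper, assuming the fixed-width layout described above:
// 36-char UUID + ":" + 20-digit zero-padded offset + "\n" = 58 bytes.
function formatIndexEntry(id: string, byteOffset: number): string {
  const entry = `${id}:${String(byteOffset).padStart(20, "0")}\n`;
  if (entry.length !== ENTRY_LEN) throw new Error("unexpected entry width");
  return entry;
}
```

Fixed width is what makes binary search on the file possible: entry number k always starts at byte k * ENTRY_LEN.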
Step 2: The in-memory buffer
```ts
// In-memory buffer of recent writes
const buffer = new Map<string, User>();
```

One Map. Every new user lands here first. Every read checks here first. It is identical in shape to the in-memory map store from section 1, but it only holds users that have not yet made it into the big sorted index.
Step 3: Recover the buffer from disk on startup
An in-memory buffer by itself is not durable. If the process crashes, every pending write is gone. So we mirror the buffer into a write-ahead log file on disk. On startup, we replay the log back into the buffer.
```ts
// Load buffer from WAL on startup (in case of restart)
if (existsSync(BUFFER_FILE)) {
  const contents = readFileSync(BUFFER_FILE, "utf8");
  for (const line of contents.split("\n")) {
    if (!line.trim()) continue;
    const u = JSON.parse(line) as User;
    buffer.set(u.id, u);
  }
}
```

Very similar to the loadUsers function from the in-memory map lesson. Read the whole file. Split on newlines. Parse every line as JSON. Drop it into the map.
The key move is that this runs at module load, before serve() ever accepts a request. By the time the HTTP server is ready, the buffer reflects every write that happened before the last shutdown.
Step 4: findUser, two layers
```ts
export function findUser(id: string): User | null {
  // Check the buffer first (recent writes)
  const fromBuffer = buffer.get(id);
  if (fromBuffer) return fromBuffer;

  // Fall through to the sorted index
  return binarySearchIndex(id);
}
```

Check the buffer first, because that is where the most recent writes live. If the id is in the buffer, return the user and stop. We do not need to touch the sorted index at all.
If the id is not in the buffer, fall through to the binary search on the big sorted index. We will reuse the lookup code from the binary search lesson.
Why buffer first? Because a read wants the newest version of a record. If you write a user, then immediately read them, the write is in the buffer but not yet in the sorted index. Checking the buffer first makes sure recent writes are visible.
Step 5: createUser, writing to both layers
```ts
export function createUser(name: string, email: string): User {
  const user: User = {
    id: randomUUID(),
    name,
    email,
    created_at: new Date().toISOString(),
  };

  // Append to WAL for durability
  appendFileSync(BUFFER_FILE, JSON.stringify(user) + "\n");

  // Add to in-memory buffer
  buffer.set(user.id, user);

  // Trigger a merge if the buffer gets too big
  if (buffer.size >= MERGE_THRESHOLD) {
    mergeBufferIntoIndex();
  }

  return user;
}
```

Three steps on every write.
Append the user to the WAL file on disk. This is what makes the write durable: if the process crashes right after this line, the record is still there on restart, and Step 3’s recovery code puts it back in the buffer.
Add the user to the in-memory buffer. This is what makes the write visible to subsequent reads.
Check the buffer size. If we have accumulated MERGE_THRESHOLD entries, trigger a merge. Most calls do not trigger a merge. On the rare ones that do, we pay the merge cost inline. (A real production implementation would run the merge in the background, not block the request. We will point at that below.)
The WAL write has to happen before the in-memory buffer update. If we updated the buffer first and the process crashed between the two lines, the write would be reflected in memory but not on disk, and on restart we would lose it.
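To see the two layers cooperating, here is the kind of round trip we now support (a usage sketch; the name and email are invented):

```ts
// Write, then read immediately. The read is served entirely from the
// buffer; the sorted index has never heard of this id yet.
const u = createUser("Ada Lovelace", "ada@example.com");
console.log(findUser(u.id)?.name); // "Ada Lovelace"
```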
Step 6: The binary search stub
```ts
function binarySearchIndex(id: string): User | null {
  // ...same code as the binary search lesson
  // (omitted for brevity)
  return null;
}
```

This is the exact same function from the binary-search lesson: open both files, walk the sorted index with binary search, read the record from the data file on hit. We are not rewriting it here because nothing about it has changed.
In a real implementation you would paste in that function or factor it into its own module and import it.
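If you do not have that lesson's code handy, here is a plausible reconstruction of the stub's body, built on the same entry-format assumption as before (36-char UUID, colon, 20-digit offset, newline). Treat it as a sketch, not the lesson's exact code:

```ts
// Sketch of the lookup, assuming the fixed-width entry format above.
function binarySearchIndex(id: string): User | null {
  if (!existsSync(INDEX_FILE)) return null;
  const idxFd = openSync(INDEX_FILE, "r");
  const dataFd = openSync(DATA_FILE, "r");
  try {
    const entryCount = statSync(INDEX_FILE).size / ENTRY_LEN;
    const entryBuf = Buffer.alloc(ENTRY_LEN);
    let lo = 0;
    let hi = entryCount - 1;
    while (lo <= hi) {
      const mid = (lo + hi) >> 1;
      // Read exactly one 58-byte entry at its fixed position
      readSync(idxFd, entryBuf, 0, ENTRY_LEN, mid * ENTRY_LEN);
      const midId = entryBuf.toString("utf8", 0, 36);
      if (midId === id) {
        const offset = parseInt(entryBuf.toString("utf8", 37, ENTRY_LEN - 1), 10);
        // Records are short; a real version would loop if one could exceed 4 KiB
        const recBuf = Buffer.alloc(4096);
        const bytes = readSync(dataFd, recBuf, 0, recBuf.length, offset);
        const line = recBuf.toString("utf8", 0, bytes).split("\n")[0];
        return JSON.parse(line) as User;
      }
      if (midId < id) lo = mid + 1;
      else hi = mid - 1;
    }
    return null;
  } finally {
    closeSync(idxFd);
    closeSync(dataFd);
  }
}
```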
Step 7: The merge (sketched)
And now the hard part.
```ts
function mergeBufferIntoIndex() {
  // 1. Read all entries from the existing sorted data file + index
  // 2. Combine with everything in the buffer (buffer wins on conflicts)
  // 3. Sort the combined set by id
  // 4. Write a new data file and index file
  // 5. Atomically swap (rename new files over old)
  // 6. Clear the buffer and truncate the WAL
  // ...details left as an exercise
}
```

The comments describe the happy path. The happy path is the easy part. Getting this right in the presence of failures is what takes months.
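For concreteness, here is one way the happy path might go. This is a sketch under loud assumptions: it blocks the caller, it ignores every failure mode in the list below, it reuses the zero-padded entry format assumed in Step 1, and it needs writeFileSync and renameSync added to the imports. It carries a Sketch suffix to keep it distinct from the stub above.

```ts
import { writeFileSync, renameSync } from "node:fs";

function mergeBufferIntoIndexSketch() {
  // 1 + 2. Combine old sorted data with the buffer; buffer wins because newer
  const merged = new Map<string, User>();
  if (existsSync(DATA_FILE)) {
    for (const line of readFileSync(DATA_FILE, "utf8").split("\n")) {
      if (!line.trim()) continue;
      const u = JSON.parse(line) as User;
      merged.set(u.id, u);
    }
  }
  for (const [id, u] of buffer) merged.set(id, u);

  // 3. Sort the combined set by id
  const users = [...merged.values()].sort((a, b) => (a.id < b.id ? -1 : 1));

  // 4. Build a fresh data file and index side by side
  let data = "";
  let index = "";
  let offset = 0;
  for (const u of users) {
    const line = JSON.stringify(u) + "\n";
    index += `${u.id}:${String(offset).padStart(20, "0")}\n`; // assumed entry format
    data += line;
    offset += Buffer.byteLength(line);
  }
  writeFileSync(DATA_FILE + ".tmp", data);
  writeFileSync(INDEX_FILE + ".tmp", index);

  // 5. Swap. Each rename is atomic on its own, but the pair is not: a crash
  // between the two renames leaves mismatched files. Problem one below.
  renameSync(DATA_FILE + ".tmp", DATA_FILE);
  renameSync(INDEX_FILE + ".tmp", INDEX_FILE);

  // 6. Clear the buffer and truncate the WAL
  buffer.clear();
  writeFileSync(BUFFER_FILE, "");
}
```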
Things we are waving at:
- What happens if the process crashes halfway through step 4? We now have a half-written new index file on disk, plus the old ones, plus the WAL. Recovery needs to figure out which set of files is authoritative.
- What happens if a read comes in while step 5 is in progress? Readers need a consistent view. Either they see the old files or the new ones, never a mix.
- How do we handle deletes? The buffer has no entry for a deleted id, but the sorted index still does. You need a “tombstone” record in the buffer that says “this id is deleted,” so the merge can drop it. (A sketch of the idea follows this list.)
- How do we do step 1 efficiently? Re-reading the entire sorted file every merge is wasteful. Real LSM-trees use streaming iterators that walk the old sorted file and the buffer in parallel, writing the merged output as they go.
Each of those is its own subproject.
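To make the deletes bullet concrete, here is the promised tombstone sketch. Everything in it is our invention layered on the lesson's types: deleted is not a field on User, so the buffer's value type has to widen.

```ts
// Hypothetical tombstone scheme (not part of the lesson's code). The buffer
// and the WAL now hold either a live user or a deletion marker.
type Tombstone = { id: string; deleted: true };
type BufferEntry = User | Tombstone;

// Would replace the Map<string, User> from Step 2
const bufferWithDeletes = new Map<string, BufferEntry>();

function deleteUser(id: string): void {
  const tombstone: Tombstone = { id, deleted: true };
  // WAL first, as always: the delete has to survive a crash too
  appendFileSync(BUFFER_FILE, JSON.stringify(tombstone) + "\n");
  bufferWithDeletes.set(id, tombstone);
}
```

findUser then treats a tombstone hit as “not found” instead of falling through to the index, and the merge drops any id whose newest entry is a tombstone.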
Step 8: Count the moving parts
Look at what we have just built.
A buffer in memory. A WAL file on disk, in sync with the buffer. The sorted index file. The data file underneath the index. Recovery logic that rebuilds the buffer from the WAL on startup. Atomic file swaps so readers never see half-merged state. A merge process running in the background or when the buffer overflows.
This is half a database engine. We are still missing a fair bit.
What we have just invented
This architecture has a name. It is called a log-structured merge tree, or LSM-tree for short. It was published as a paper in 1996 and is the storage engine behind a long list of real databases:
- LevelDB (the canonical modern implementation)
- RocksDB (Facebook’s fork of LevelDB, used everywhere)
- Cassandra
- HBase
- ScyllaDB
- BadgerDB
- The default storage engines in CockroachDB and TiDB
All of them share the basic shape we just sketched. A small fast layer for writes, a large sorted layer for reads, periodic compaction between them. The real implementations solve the hard problems we waved at. They use multiple sorted levels (L0, L1, L2, and so on) so merges happen in small chunks rather than one giant spike. They use bloom filters (a small probabilistic data structure) to skip levels that definitely do not contain a given key. They use tombstones to represent deletes. They tune compaction strategies for different workload shapes.
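Of those tricks, the bloom filter is the easiest to see in miniature. Here is a toy version, ours alone and nothing like a production implementation: a bit array plus a few cheap hashes that can answer “definitely not here” without touching disk.

```ts
// Toy bloom filter. "Maybe present" can be wrong (false positives);
// "definitely absent" never is. Real engines use better hash functions
// and carefully sized bit arrays.
class BloomFilter {
  private bits: Uint8Array;

  constructor(private size = 1 << 16, private hashes = 3) {
    this.bits = new Uint8Array(size);
  }

  private hash(key: string, seed: number): number {
    let h = seed;
    for (let i = 0; i < key.length; i++) {
      h = (h * 31 + key.charCodeAt(i)) >>> 0;
    }
    return h % this.size;
  }

  add(key: string): void {
    for (let s = 1; s <= this.hashes; s++) this.bits[this.hash(key, s)] = 1;
  }

  mightContain(key: string): boolean {
    for (let s = 1; s <= this.hashes; s++) {
      if (!this.bits[this.hash(key, s)]) return false; // definitely absent
    }
    return true; // maybe present
  }
}
```

An LSM read consults a level's filter before opening its files; a negative answer skips the level entirely, which is what keeps multi-level reads affordable.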
Getting those details right is a career.
What SQLite does instead
SQLite does not use an LSM-tree. It uses a B+ tree, which is the other major family of disk-based ordered indexes. B-trees handle in-place updates more gracefully than the flat sorted arrays we were working with. You can insert into the middle of a B-tree without rewriting everything, because the tree is structured in nodes that can be updated locally.
The two approaches have different strengths.
| Property | B-tree (SQLite, Postgres) | LSM-tree (RocksDB, Cassandra) |
|---|---|---|
| Random reads | Fast | Slower (have to check multiple layers) |
| Sequential writes | Slower | Very fast |
| Random writes | OK | Very fast (treated like sequential) |
| Disk space | Compact | Larger (compaction debt) |
| Read amplification | Low | Higher |
| Write amplification | Higher | Lower-ish (with good compaction) |
Neither is universally better. B-trees have been the default for decades because most workloads are read-heavy or balanced, and B-trees are simpler to reason about. LSM-trees have been taking over for specifically write-heavy workloads: analytics, time series, message queues. Some databases even let you mix. Postgres has B-tree indexes by default but offers GIN and BRIN for specific access patterns. ScyllaDB is an LSM with multiple compaction strategies you can pick between.
For the purposes of this lesson, the important takeaway is this. Once you need fast writes against a sorted index, you have rediscovered the database. Whether you land on an LSM-tree or a B-tree, you end up with multiple layers, durable write-ahead logs, atomic swaps, and background maintenance. All of which you could write yourself, and most of which someone has already written much better than you probably will.
When you would actually do this
Almost never. If your reads are slow because the index does not fit in RAM, the answer is SQLite (or a real LSM database for write-heavy workloads), not a hand-rolled merge buffer. The engineering cost of getting this right (handling crashes during a merge, reasoning about read consistency during the atomic swap, debugging a corrupted index in production at 3am) is vastly higher than the cost of just running SQLite.
The reason this lesson exists is to make you appreciate what a database is actually doing for you. When SQLite gives you several thousand fully-durable inserts per second with consistent reads during writes, it is solving exactly this problem, across multiple layers, with decades of refinement. The fact that it does it in a single static library you can drop into any project is one of the more impressive engineering feats in software.
The full list of concerns
For completeness, here is the list of things a writeable sorted index has to handle. Our sketch handled some, hand-waved on others, and ignored a few entirely.
- Crash recovery. What happens if the process dies mid-merge?
- Concurrent reads during writes. Can a reader see a half-merged state?
- Deletes. How do you remove an entry that still exists in a sealed lower layer?
- Range scans. How do you efficiently iterate rows in created_at order, filtered by created_at > ?, across the in-memory buffer and the sorted index together?
- Multiple writers. Can two writers add to the buffer at the same time?
- Disk space management. Old data has to be reclaimed without blocking reads.
- Read amplification. How many disk reads, worst case, to find a key?
- Write amplification. How many bytes does the disk see per byte you actually wrote?
Every database makes different choices on all of these, and those choices are what give each database its personality. Postgres fsyncs the WAL on every commit by default. SQLite has multiple WAL modes and pragmas. RocksDB has tunable compaction strategies. You do not need to make these choices yourself. That is what databases are for.
In the next lesson we look at what happens when more than one process tries to write at the same time, which is where flat files genuinely fall apart and where SQLite reveals an unexpected weakness of its own.