hectoday

When you actually need a database

We have spent the whole course showing how far you can get without a database. In the last lesson the math said a single Node process with an in-memory map could, in principle, handle more daily active users than most products will ever have. Great. But that is not the whole story.

Flat files, with or without an in-memory index, are not always the right answer. There are specific structural problems they cannot solve, no matter how fast your hardware is. This lesson is about those problems. If you know what they look like, you can recognize them the moment they show up in your own codebase. Before that, a database is overkill. After that, it is what lets you keep shipping without writing a database from scratch.

Here are the five constraints.

1. Your dataset does not fit in RAM

The in-memory map gave us around 72,000 req/s with sub-millisecond latency at every size we tested. There is one qualifier hiding in that sentence. At every size that fits in memory.

Our 1M-record benchmark used tiny records, about 100 bytes each. The whole map fit in maybe 200MB of RAM, including V8’s overhead for the hash table. On a modest server with 8GB of RAM you could load tens of millions of these records without running into trouble.

But records are not always 100 bytes. If you are storing articles with full text, blog posts with markdown, or orders with line items embedded as JSON, each record might be 10KB. Do the math. At 10KB per record, 1 million records is 10GB. At 10 million records, you are at 100GB, which is more memory than most servers have.

Once your working set exceeds RAM, you need a way to keep most of the data on disk and only pull pages into memory when they are actually needed. Every disk-based database is built around this idea. The data structures it uses (B-trees, LSM-trees) exist precisely because they work well when the data is too big to load into memory.

Building this yourself is a project. A serious one. That is what a database is.

The signal: process.memoryUsage().heapUsed right after loadUsers() starts looking uncomfortable on the smallest machine you want to deploy on.

2. You need to query by more than one field

Right now the only fast operation is findUser(id). That is fast because we built the data structure (a Map, a sorted file, or a B-tree) specifically around the id field.

What happens when product asks for “find all users registered in the last 7 days”? Or “find all users with @example.com emails”? Or “find all orders for user X with a total over $100”?

You would have to scan the whole map. Or maintain a second map keyed by date. Or a third map keyed by email domain. Every new query pattern is another data structure to maintain, and another consistency invariant to keep correct when writes happen. A couple of these are tractable. Three or four and you have built a query engine, badly, and every bug in it is a data-correctness bug in your app.
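
A sketch of what that second map costs in practice. The names here are invented; the point is that every write path now has to update both structures, and forgetting one of them corrupts your hand-rolled index silently.

```javascript
// Illustrative sketch: each extra query pattern becomes another index
// structure that every write must keep in sync by hand.
const usersById = new Map();
const usersByEmailDomain = new Map();   // domain -> Set of user ids

function addUser(user) {
  usersById.set(user.id, user);
  const domain = user.email.split("@")[1];
  if (!usersByEmailDomain.has(domain)) usersByEmailDomain.set(domain, new Set());
  usersByEmailDomain.get(domain).add(user.id);
}

function deleteUser(id) {
  const user = usersById.get(id);
  if (!user) return;
  usersById.delete(id);
  // Forget this block and the two "indexes" silently disagree forever.
  const domain = user.email.split("@")[1];
  usersByEmailDomain.get(domain).delete(id);
}

addUser({ id: 1, email: "a@example.com" });
addUser({ id: 2, email: "b@example.com" });
deleteUser(1);
console.log([...usersByEmailDomain.get("example.com")]); // [ 2 ]
```

Two access paths means two structures; a database's CREATE INDEX hands this maintenance to the engine.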

A relational database lets you create indexes on any field, combine them freely in queries, and have a query planner figure out which ones to use. A single CREATE INDEX users_email_idx ON users(email) statement gives you a fast lookup by email, with the database handling all the ongoing maintenance for you. That is an enormous amount of leverage for one line of SQL.

The signal: you find yourself adding a second Map<string, User> keyed by something other than id, and writing code to keep it in sync with the primary map.

3. You need joins

Our users.jsonl, products.jsonl, and orders.jsonl files store related data. Orders reference users and products by id. To return “an order with the user’s name and the product’s name” in one response, you have to load three records from three files and assemble them in application code.

For a single order this is cheap. For a list of 100 orders, you are now doing 200 extra lookups (one per user, one per product, before any deduplication). For a complex page that stitches together five kinds of related records, you can end up doing thousands of lookups per request.

Every one of those lookups is application code you wrote and have to maintain. Every one is a potential place for a bug. Every one is a separate read you issued by hand, one that a database would have planned for you.

A SQL JOIN lets the database do all of this in a single query, picking the best index access pattern for you. The performance characteristics are very different. A B-tree-driven join inside the database is cheaper than the equivalent N+1 lookups in application code, often by a lot.
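
As a sketch, the 100-order page above collapses into one query. The table and column names here are assumed from the JSONL files, not taken from the course code:

```sql
SELECT o.id, u.name AS user_name, p.name AS product_name, o.total
FROM orders o
JOIN users    u ON u.id = o.user_id
JOIN products p ON p.id = o.product_id
ORDER BY o.created_at DESC
LIMIT 100;
```

One round trip, and the planner decides which indexes to walk.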

The signal: your handlers are starting to look like sequences of findX(id) calls, each one using the result of the previous one. That is the classic N+1 pattern.

4. Multiple processes need to write at the same time

This one is structural. The moment you cross into it, the in-memory map approach falls apart completely.

Inside a single Node process, the JavaScript runtime is single-threaded, so no two handlers ever execute at the same instant. As long as each handler updates the Map and appends to the file synchronously, there is only ever one writer at a time, and the two stay consistent.

The moment you run two Node processes, for any reason (redundancy, horizontal scaling, a load balancer, a rolling deploy with overlap), they each have their own in-memory map. When process A appends to users.jsonl, process B does not know. Process B’s map is stale until it restarts. Process A and process B can return different answers for the exact same findUser(id) call. Your app becomes non-deterministic, which is a fancy way of saying “your users get different data depending on which server they hit.”
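
A toy model of that divergence, with each process's cache as a plain Map (nothing here is real server code; the array stands in for users.jsonl):

```javascript
// Toy model: two server processes, each with its own in-memory map loaded
// from the same file at startup. After process A writes, B is stale.
const fileOnDisk = [{ id: 1, name: "Ada" }];        // stands in for users.jsonl
const load = () => new Map(fileOnDisk.map(u => [u.id, u]));

const processA = load();
const processB = load();

// Process A handles a signup: appends to the file and updates its own map.
const newUser = { id: 2, name: "Grace" };
fileOnDisk.push(newUser);
processA.set(newUser.id, newUser);

// The load balancer sends the next GET /users/2 to process B.
console.log(processA.get(2)); // { id: 2, name: 'Grace' }
console.log(processB.get(2)); // undefined — stale until B reloads
```

Same file, same request, different answers depending on which process you hit.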

You can work around this. File watchers, polling, a pub/sub channel for invalidation messages. But you are now building a distributed cache-coherence protocol, which is one of the hardest problems in distributed systems and the source of an endless supply of production incidents.

The clean answer is to have a single source of truth that every process talks to over the network. Postgres, MySQL, Redis. They handle the coordination so you do not have to.

A small note on SQLite, which sits in the middle. Multiple processes can read a SQLite file at the same time just fine. Multiple writers serialize on a single file lock. That is good enough for a single-server deployment with one writer process and several readers. It is not the right answer for “five servers behind a load balancer all serving traffic.”

The signal: the moment you imagine running more than one instance of your server. Or the moment you autoscale. Or the moment you do a rolling deploy.

5. You need atomic writes across multiple records

Imagine a classic order endpoint. A customer places an order, which means inserting one orders record and decrementing the inventory count for each product in the order. Both writes need to succeed, or both need to fail. If the order is created but the inventory is not decremented, you have oversold a product. If the inventory is decremented but the order is not created, the customer has been charged for nothing.

With three separate JSONL files, there is no way to make these two writes atomic. The process can crash in the middle. The disk can fail after one write but before the other. You are exposed to inconsistency, and the exposure is not hypothetical. It is waiting for the exact wrong moment.

You could implement a transaction log yourself. Write a “pending operation” record first, then do both writes, then write a “completed” marker. On startup, replay any pending operations that did not complete. This is durable. It is also most of a database.
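
A sketch of that hand-rolled log, mostly to show how much machinery it takes. Every name here is invented, the "files" are plain in-memory objects, and real recovery would also have to worry about fsync ordering and partially written log lines:

```javascript
// Sketch of a hand-rolled intent log for a two-record write. A real version
// would append JSONL to a log file and fsync between steps.
const log = [];                      // stands in for a transaction log file
const orders = [];
const inventory = { "sku-1": 5 };

function placeOrder(op, { crashAfterFirstWrite = false } = {}) {
  log.push({ status: "pending", op });       // 1. record intent first
  orders.push(op.order);                     // 2. first write
  if (crashAfterFirstWrite) return;          //    simulated crash here
  inventory[op.sku] -= op.qty;               // 3. second write
  log.push({ status: "done", opId: op.id }); // 4. mark complete
}

// On startup, replay any pending op that has no "done" marker. This naive
// replay assumes the second write never ran; real recovery must be idempotent.
function recover() {
  const done = new Set(log.filter(e => e.status === "done").map(e => e.opId));
  for (const e of log) {
    if (e.status === "pending" && !done.has(e.op.id)) {
      if (!orders.includes(e.op.order)) orders.push(e.op.order);
      inventory[e.op.sku] -= e.op.qty;
      log.push({ status: "done", opId: e.op.id });
    }
  }
}

placeOrder({ id: 1, order: { sku: "sku-1" }, sku: "sku-1", qty: 1 },
           { crashAfterFirstWrite: true });
recover();
console.log(orders.length, inventory["sku-1"]); // 1 4 — both writes applied
```

Even this toy needs an intent record, a completion marker, and a replay step. That is the "most of a database" part.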

Databases solve this with ACID transactions:

```sql
BEGIN;
INSERT INTO orders (...) VALUES (...);
UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ?;
COMMIT;
```

Either both statements take effect, or neither does. The database is responsible for atomicity and durability. You are responsible for writing the SQL. That is a much better division of labor.

The signal: you are writing comments like // TODO: what happens if the second write fails? in your handler code. The answer is “you have a bug.” The fix is transactions.

What this course was really about

It is easy to read this lesson and conclude “okay, I should just use Postgres for everything.” That is not the point.

The point is that the choice should be deliberate. When you reach for Postgres on day one of a new project, you are paying real costs. Another service to deploy and monitor. A connection pool to tune. A schema migration tool to pick and configure. A backup strategy. A development environment that includes a database container. Those costs are worth it when you actually need them. They are not free, and they should not be paid by default.

For most applications, the order of operations looks like this:

  1. Start with a flat file or an in-memory map. Ship something.
  2. When you have a real query the map cannot answer, switch to SQLite. The migration is easy because JSONL imports trivially.
  3. When SQLite stops being enough (multiple writer processes, servers on more than one machine, need for replication), move to Postgres.
  4. When Postgres stops being enough (you have outgrown one machine), move to sharding, read replicas, or a distributed database.

You can spend years happily at any of these steps. Most successful products spend years at step 3. (Instagram is a famous example.) Many spend years at step 2, because SQLite is genuinely good. Many small things spend years at step 1 and never graduate.

The question is not “what database should I use.” The question is “what is my actual constraint right now, and what is the smallest thing that solves it?”

Coming up: writes

This whole section was about reads. We hammered each server with GET /users/:id and looked at the numbers. The story for writes is genuinely different, and most of the constraints we just listed (atomic multi-record writes, multiple writers, durability) only really show up when you try to put data into the system.

Section 2 is about writes. We will benchmark appendFileSync against node:sqlite inserts, look at what fsync actually costs (about a thousand times more than people expect), see why writes break the binary search index (and rediscover LSM-trees in the process), and put real numbers on what happens when two server processes try to write to the same SQLite file at the same time.

A preview. SQLite with batched transactions can hit close to 70,000 inserts per second. Naive appendFileSync with an fsync on every write collapses to barely a thousand. The gap between “fast” and “durable” is much wider than the read benchmarks suggested, and that gap is where most of the interesting engineering lives.

© 2026 hectoday. All rights reserved.