Aggregation

JOINs are great at combining rows, but plenty of real questions are not about rows at all. They are about summaries. “How many books do we have?” “What is the average rating?” “Which genre has the most books?” You ask these questions constantly, in dashboards, in admin panels, and in reports. To answer them, you need to take many rows and reduce them down to a few numbers. That is what aggregation does.

In this lesson we will go through the core aggregate functions, then learn how GROUP BY turns them into per-group summaries, and finally how HAVING lets us filter on those summaries. We will also clear up a subtle trap around COUNT and NULL values.

[!NOTE] The examples below assume a slightly richer dataset than the three books you inserted back in the one-to-many lesson. To follow along exactly, insert a few more books across different genres (some with a description, some without) and a handful of reviews with different ratings. The exact numbers in the output blocks are illustrative; what matters is the shape of each query.

Aggregate functions

Let’s start with the basics. Aggregate functions take a column (or all rows) and return a single value.

COUNT(*) counts the number of rows:

Code along

SELECT COUNT(*) FROM books;  -- How many books?
-- 6

That query says “look at every row in books and tell me how many there are.”

COUNT(column) counts the number of non-NULL values in a specific column:

Code along

SELECT COUNT(description) FROM books;  -- How many books have a description?
-- 4 (books without a description are not counted)

Notice the difference already. COUNT(*) gave us 6 because there are 6 rows. COUNT(description) gives us 4 because two of those rows have a NULL in their description column. We will come back to this distinction at the end of the lesson because it trips people up.

SUM(column) adds up all values (imagine we had an order_items table for demonstration):

SELECT SUM(price) FROM order_items WHERE order_id = 1;
-- 45.99

[!NOTE] Our migrations use INTEGER PRIMARY KEY AUTOINCREMENT throughout, so the examples here use plain integer IDs like 1 or 2. In production apps, resources that get exposed in URLs (orders, books, reviews) typically use UUIDs for the reasons we covered in the primary keys lesson. The capstone at the end of the course shows that UUID approach in a production-style schema.

AVG(column) calculates the average:

Code along

SELECT AVG(rating) FROM reviews WHERE book_id = 1;
-- 4.5

MIN(column) and MAX(column) find the smallest and largest values:

Code along

SELECT MIN(published_at), MAX(published_at) FROM books;
-- 1929-09-27, 2013-05-14

So far so good. But running one aggregate over an entire table is only the beginning. The real power shows up when we split the data into groups first.

GROUP BY

GROUP BY splits rows into buckets based on a column, then runs the aggregate function on each bucket separately:

Code along

SELECT genre, COUNT(*) AS book_count
FROM books
GROUP BY genre;

| genre              | book_count |
|--------------------|------------|
| fiction            | 3          |
| science-fiction    | 1          |
| fantasy            | 1          |
| historical-fiction | 1          |

Here is what happens under the hood. SQLite scans the books table, groups rows that share the same genre, and then runs COUNT(*) on each group. You end up with one row per distinct genre, which is exactly what you want for a summary.

GROUP BY with JOINs

GROUP BY becomes even more useful once you combine it with JOINs:

Code along

SELECT books.title, AVG(reviews.rating) AS avg_rating, COUNT(reviews.id) AS review_count
FROM books
LEFT JOIN reviews ON books.id = reviews.book_id
GROUP BY books.id;

| title                    | avg_rating | review_count |
|--------------------------|------------|--------------|
| The Old Man and the Sea  | 4.5        | 2            |
| A Farewell to Arms       | NULL       | 0            |
| The Left Hand of ...     | 5.0        | 1            |

Read this carefully. We LEFT JOIN reviews to books so that every book shows up, even if it has zero reviews. Then we GROUP BY books.id so SQLite gives us one row per book. For each group, AVG(reviews.rating) gives the average rating, and COUNT(reviews.id) gives the number of reviews.

Notice “A Farewell to Arms” has avg_rating of NULL and review_count of 0. That is the LEFT JOIN doing its job: it kept the book in the result even though no reviews exist.

HAVING

Now a new problem. What if you only want genres that have more than one book? Your first instinct might be to write WHERE COUNT(*) > 1. That will not work, and it is worth understanding why.

WHERE filters individual rows before grouping happens. At that stage the count does not even exist yet, because groups have not been formed. So WHERE COUNT(*) > 1 is meaningless to SQLite at that point.

HAVING is the answer. It filters groups after aggregation:

Code along

SELECT genre, COUNT(*) AS book_count
FROM books
GROUP BY genre
HAVING book_count > 1;

The mental model is: WHERE runs first (before grouping), then GROUP BY forms the buckets, then the aggregates are computed, and only then does HAVING filter the resulting groups.

Combining WHERE and HAVING

You often want both. WHERE trims the rows up front, and HAVING trims the groups at the end:

Code along

-- Of fiction books only, which authors have more than 1?
SELECT authors.name, COUNT(books.id) AS fiction_count
FROM authors
JOIN books ON books.author_id = authors.id
WHERE books.genre = 'fiction'          -- filter rows BEFORE grouping
GROUP BY authors.id
HAVING fiction_count > 1;              -- filter groups AFTER grouping

Let’s trace through this query out loud. First, WHERE books.genre = 'fiction' throws away any book that is not fiction. Then GROUP BY authors.id groups the surviving rows by author. COUNT(books.id) computes how many fiction books each author has. Finally, HAVING fiction_count > 1 keeps only the authors with more than one fiction book.

This WHERE-then-HAVING pattern comes up constantly in real applications. Dashboards, analytics queries, and admin searches almost always have both kinds of filters.

COUNT(*) vs COUNT(column)

This one deserves its own section because it trips up so many developers. The difference between COUNT(*) and COUNT(column) only matters when NULLs or LEFT JOINs are involved, and those come up all the time.

Code along

-- COUNT(*) counts ALL rows, including those with NULLs
SELECT COUNT(*) FROM books;  -- 6

-- COUNT(description) counts only rows where description is NOT NULL
SELECT COUNT(description) FROM books;  -- 4 (2 books have no description)

-- COUNT with LEFT JOIN
SELECT authors.name, COUNT(books.id) AS book_count
FROM authors
LEFT JOIN books ON books.author_id = authors.id
GROUP BY authors.id;
-- COUNT(books.id) is 0 for authors with no books (books.id is NULL)
-- COUNT(*) would be 1 (the author row itself exists)

That last example is the big one. When you LEFT JOIN and an author has no books, SQLite still keeps the author in the result, and the books.id column comes back as NULL for that row. If you wrote COUNT(*), you would get 1 because a row exists. But COUNT(books.id) correctly gives 0 because the book id is NULL.

The rule is simple once you see it. Use COUNT(*) to count rows, and COUNT(column) to count non-NULL values in a column. When you are doing a LEFT JOIN and want zeros for unmatched rows, count a column from the right table.

Exercises

Exercise 1: Count books per author. Include authors with 0 books.

Exercise 2: Calculate the average rating per book. Include books with no reviews.

Exercise 3: Find genres with more than 1 book using HAVING.

Exercise 4: Find the author who has written the most books.

What is the difference between WHERE and HAVING?

JOINs and aggregation handle most querying needs. But some questions are layered. “Which books have more reviews than average?” needs you to compute the average first, then filter by it. That is where subqueries and CTEs come in, and that is the topic of the next lesson.

Access Required

Aggregation

Aggregate functions

GROUP BY

GROUP BY with JOINs

HAVING

Combining WHERE and HAVING

COUNT(*) vs COUNT(column)

Exercises