16.2 GROUP BY: Collapsing Rows into Groups

Alright, let’s get down to brass tacks. You’ve got a big ol’ table of data, and you want to summarize it. You don’t want to see every single row; you want to see the shape of the forest, not count every tree. That’s what GROUP BY is for. It’s the SQL clause that collapses rows into meaningful groups and lets you apply aggregate functions (SUM, AVG, COUNT, etc.) to each group individually. It’s the difference between a pile of individual receipts and a monthly financial statement.

Think of it like this: GROUP BY is the ultimate “sort and stack” command. You tell the database, “Hey, take all these rows, find the ones with the same value in this column, and squish them together into a single summary row for each unique value.”

The Mechanics: It’s a Two-Phase Operation

Understanding how it works will save you from countless headaches. A SELECT statement with a GROUP BY clause is executed in a very specific order:

FROM & WHERE: First, the database grabs all the rows from the table (and any joins) and filters them with the WHERE clause. This is your initial raw dataset.
GROUP BY: Next, it takes that filtered dataset and sorts all the rows into piles, or “groups,” based on the unique combinations of the columns you specified. All rows in a pile are identical in those grouped columns.
Aggregation: Then, for each individual pile, it calculates the aggregate functions (COUNT(pile), SUM(pile.salary), etc.). This is crucial: the aggregation happens per group.
HAVING: It then filters these already-grouped results based on the aggregated values. (We’ll get to HAVING in the next section; it’s the WHERE clause for your groups).
SELECT: Finally, it projects the columns you asked for. And this is the most important part…

The Golden Rule of GROUP BY

Here’s the rule, and it’s non-negotiable: Every column in your SELECT list that is not inside an aggregate function MUST appear in your GROUP BY clause.

The reason is beautifully logical. After grouping, what is a column like employees.name supposed to represent? You have a group of ten employees all in the ‘Engineering’ department. Which one of the ten names should it show? It can’t decide, so it makes you be explicit. You can only select the thing that defines the group (department) and calculations about the group (COUNT(*)).

Let’s see this in action. Imagine a sales table with sale_id, salesperson, region, and amount.

-- This works: We're selecting the grouped column and an aggregate.
SELECT salesperson, SUM(amount) as total_sales
FROM sales
GROUP BY salesperson;

-- This will FAIL in any respectable database:
-- 'region' is not grouped and not aggregated. Which region for the salesperson?
SELECT salesperson, region, SUM(amount) as total_sales
FROM sales
GROUP BY salesperson;

-- To fix it, you must add 'region' to the GROUP BY, creating groups for each unique (salesperson, region) combo.
SELECT salesperson, region, SUM(amount) as total_sales
FROM sales
GROUP BY salesperson, region;

Grouping on Multiple Columns: The Power Combo

You can group by more than one column to get increasingly specific summaries. The database creates a group for every unique combination of the values in those columns. The order you list them in the GROUP BY clause doesn’t matter for the result set, but it can affect the sort order if you don’t use an ORDER BY clause (which you always should, because being explicit is for winners).

-- How much did each salesperson sell in each region?
-- Groups are made for e.g. ('Alice', 'West'), ('Alice', 'East'), ('Bob', 'West'), etc.
SELECT salesperson, region, SUM(amount) as regional_sales
FROM sales
GROUP BY salesperson, region
ORDER BY salesperson, regional_sales DESC;

The NULL Pile: The Group You Didn’t Ask For

Here’s a fun edge case: what happens if you group by a column that contains NULL values? Turns out, NULL gets its own group. Because, in the eyes of GROUP BY, NULL is treated as a distinct value (or distinct lack thereof). All rows where the grouped column is NULL will be collapsed into a single summary row. This is usually the behavior you want, but it’s good to know it’s there so you aren’t surprised when a NULL group shows up in your results.

The Pitfall of SELECT DISTINCT

I see this all the time. A beginner wants a list of unique values and writes a SELECT DISTINCT a, b, c FROM table;. Then later, they need the count for each group and try to just slap a COUNT(*) onto that query. It doesn’t work. SELECT DISTINCT is about removing duplicates from the final result after all processing. GROUP BY is about creating groups for aggregation during processing. They solve different problems. If you need aggregates, you need GROUP BY. Don’t try to hack it with DISTINCT; you’ll just end up with a messy query and a confused optimizer.

-- This is a clumsy way to just get unique names.
SELECT DISTINCT salesperson FROM sales;

-- This is the right tool if you want to know *how many sales* each person made.
SELECT salesperson, COUNT(*) -- Counts rows in each group
FROM sales
GROUP BY salesperson;

In short, GROUP BY is your fundamental tool for transforming raw data into insightful summaries. Master its two-phase operation and the golden rule, and you’ve unlocked one of the most powerful features of SQL.