14.5 DISTINCT and DISTINCT ON: Deduplication Patterns
Right, so you’ve fetched some data and it looks… repetitive. Maybe you asked for a list of departments and got back 10,000 rows because you pulled it from the employees table. This is where DISTINCT waltzes in, ready to clean up the mess. It’s the SQL keyword for “get unique values,” and it does exactly what it says on the tin. But like any simple tool, it has hidden depths and a slightly more powerful, albeit quirky, cousin: DISTINCT ON. Let’s get into it.
The Basics of DISTINCT
DISTINCT is applied to the entire row. It looks at the complete set of columns you’ve selected and removes any rows that are exact duplicates across all of them. The syntax is brain-dead simple, which is why I love it.
-- Let's say you want to know which departments actually exist in your company.
-- This would give you a row for every single employee, repeating the department names.
SELECT department FROM employees;
-- This gives you the unique list. One row per department.
SELECT DISTINCT department FROM employees;
The key thing to remember: it operates on the combination of all selected columns. So if you select two columns, it will return unique pairs.
-- This tells you every unique manager/department combination.
SELECT DISTINCT manager_id, department FROM employees;
Here’s the first pitfall: DISTINCT is often used as a quick fix for a poorly written query that’s producing duplicates. Sometimes that’s valid (like our department list). Often, it’s a sign you forgot a JOIN condition or are grouping incorrectly. If you’re getting 100,000 duplicates and slap a DISTINCT on it to get 10 rows, you’re papering over a massive performance problem. The database still had to generate and then deduplicate all 100,000 rows. It’s a filter, not a magic wand.
DISTINCT ON: The Postgres Power-Up
Now for the good stuff, the thing that makes PostgreSQL developers smug: DISTINCT ON. It’s like DISTINCT but with a specific mission. It allows you to say, “For each unique value in this specific column or set of columns, give me back exactly one row. And oh, by the way, let me choose which row to return for each group.”
This is incredibly useful for solving the “fetch the most recent record for each group” problem, which is a classic database headache.
The syntax is particular and demands an ORDER BY clause to work correctly. You specify the columns to be distinct on in parentheses, and then you use ORDER BY to control which row is chosen from each group. The chosen row is the first one based on your ordering.
Let’s say you have a product_updates table and you want the latest update for each product.
-- The table might look like this:
-- | id | product_id | update_text | updated_at |
-- |----|------------|-------------|------------|
-- | 1 | 100 | 'Launched' | 2023-01-01 |
-- | 2 | 100 | 'Fixed bug' | 2023-01-05 | <-- We want this one for product 100
-- | 3 | 101 | 'Beta' | 2023-01-03 |
-- The wrong way (a common beginner mistake):
SELECT DISTINCT ON (product_id) *
FROM product_updates;
-- This would pick a row arbitrarily. Probably the first one it finds. Useless.
-- The correct, powerful way:
SELECT DISTINCT ON (product_id) *
FROM product_updates
ORDER BY product_id, updated_at DESC;
Let’s break down why this works. DISTINCT ON (product_id) tells PostgreSQL, “Form groups based on unique product_id values.” The ORDER BY product_id, updated_at DESC then does two things: it satisfies the requirement that the ORDER BY must contain the DISTINCT ON columns first, and it sorts the rows within each product_id group from most recent to oldest. DISTINCT ON then plucks the first row from each of these sorted groups—the one with the most recent updated_at.
The Quirks and Rough Edges
The designers were mostly brilliant here, but they left a few traps.
The ORDER BY is Mandatory (and Fussy): The
ORDER BYin aDISTINCT ONquery isn’t just for sorting the final output; it’s a control mechanism. It must start with the same expressions as theDISTINCT ONclause. If youDISTINCT ON (product_id), you mustORDER BY product_id, .... If you don’t, Postgres will throw a fit. This is actually a good thing—it forces you to think about how you’re selecting the row.It’s Not Standard SQL: This is a PostgreSQL extension. If you get used to it and then have to write for MySQL or SQL Server, you’ll be sad and have to rewrite your query using window functions (
ROW_NUMBER()) which is far more verbose. Consider this your reward for choosing a superior database.Performance Isn’t Free: While it’s often more efficient than other methods for this specific task, it still requires a sort. On huge datasets, you’ll need an index on
(product_id, updated_at DESC)to make it scream.
So, when do you use which? Use DISTINCT for getting unique values or value combinations. Use DISTINCT ON when you need to select a single row per group based on a specific order. It turns a complex correlated subquery or a window function into an elegant, one-line solution. And elegance, in the trenches of SQL, is a rare and beautiful thing.