18.3 Recursive CTEs: The WITH RECURSIVE Pattern
Alright, buckle up. We’re about to dive into the part of SQL that feels a little bit like magic and a whole lot like a potential foot-gun. Recursive CTEs. The name is intimidating, but the concept is actually quite elegant once you pull back the curtain. It’s SQL’s way of saying, “Fine, you want to do loops? Here’s your one and only loop. Don’t make me regret this.”
At its core, a recursive CTE is just a CTE that references itself. It’s used primarily for querying hierarchical or tree-structured data—think organizational charts, bill-of-materials explosions, or forum comment threads—directly in your relational database. It feels a bit like a party trick, but it’s an incredibly powerful one.
The basic structure is always the same, and it looks a little weird at first. You have a WITH RECURSIVE clause (yes, you need that keyword, even in PostgreSQL), and the CTE itself is made up of two distinct queries unioned together.
The Anatomy of a Recursive CTE
Here’s the canonical pattern. Stare at it for a second. I’ll explain each part.
WITH RECURSIVE my_recursive_cte AS (
-- Anchor Member: The starting point.
SELECT ...
FROM ...
WHERE ...
UNION ALL
-- Recursive Member: The part that references itself.
SELECT ...
FROM my_recursive_cte
JOIN ...
WHERE ...
)
-- Final SELECT that uses the result
SELECT * FROM my_recursive_cte;
Let’s break this down. The CTE is built from two separate SELECT statements glued together with a UNION ALL.
- The Anchor Member: This is your starting point, your base case. It’s a normal, non-recursive query that seeds the entire operation with the initial row(s). If you’re building an org chart, this is the query that finds the CEO.
- The Recursive Member: This is where the magic (and danger) happens. This
SELECTquery references the CTE itself (my_recursive_cte). It uses the results from the previous iteration to produce the next set of rows. In our org chart, this is the part that finds each manager’s direct reports. It joins the CTE to the actual employee table on the manager ID. The database takes the result from the anchor, runs the recursive member to get new rows, then takes those rows and runs the recursive member again… and again… until the recursive member returns an empty set.
The UNION ALL is crucial. You almost always want ALL because you’re building a set of distinct steps in a path, and duplicates are a sign of a problem in your data or logic. Using just UNION would needlessly deduplicate and murder your performance.
A Concrete Example: Walking a Tree
Let’s make this real. Imagine a simple table for a comment thread:
CREATE TABLE comments (
id SERIAL PRIMARY KEY,
parent_id INT REFERENCES comments(id),
author TEXT NOT NULL,
body TEXT NOT NULL
);
INSERT INTO comments (id, parent_id, author, body) VALUES
(1, NULL, 'Alice', 'Why is the sky blue?'),
(2, 1, 'Bob', 'Because of Rayleigh scattering!'),
(3, 1, 'Carol', 'It just is.'),
(4, 2, 'Dave', 'Rayleigh who?'),
(5, NULL, 'Erin', 'I like cats.');
Now, let’s write a recursive CTE to get the entire thread starting from comment #1, including its depth.
WITH RECURSIVE comment_thread AS (
-- Anchor: Find the root of the thread
SELECT
id,
parent_id,
author,
body,
0 AS depth -- Start depth at 0
FROM comments
WHERE id = 1
UNION ALL
-- Recursive: Find direct replies to the current set of comments
SELECT
c.id,
c.parent_id,
c.author,
c.body,
ct.depth + 1 -- Increment the depth
FROM comments c
INNER JOIN comment_thread ct ON c.parent_id = ct.id
)
SELECT id, parent_id, author, depth, body
FROM comment_thread
ORDER BY depth, id;
This will return:
id | parent_id | author | depth | body
---+-----------+--------+-------+-----------------------------------
1 | null | Alice | 0 | Why is the sky blue?
2 | 1 | Bob | 1 | Because of Rayleigh scattering!
3 | 1 | Carol | 1 | It just is.
4 | 2 | Dave | 2 | Rayleigh who?
See how it works? The anchor grabs comment #1. The recursive member then joins comments to the current comment_thread result (which is just comment #1) to find its children (comments #2 and #3). Those get added. On the next iteration, it joins to the new result set (comments #2 and #3) to find their children (comment #4). The next iteration finds no children for comment #4, so it stops.
The Pitfalls and the Gotchas
This is where I earn my keep. Recursive CTEs are brilliant but come with some serious “handle with care” labels.
- Infinite Loops: This is the big one. If your data has a cycle—a comment that somehow points to itself or a parent higher up the chain—your recursive member will never return an empty set. It will joyfully loop until it hits a system-defined limit. PostgreSQL will throw a
ERROR: maximum recursion depth exceededand politely shut you down. You can sometimes work around this by checking for visited nodes or usingUNIONto discard duplicates, but the real fix is to have sane data constraints. - Performance: This is essentially a loop in SQL. It’s not going to be as fast as a well-designed set-based operation. For very deep hierarchies, it can get slow. Always, always have an index on the foreign key column (
parent_idin our example). Without it, every single recursive step is a full table scan. Don’t be that person. - The Final SELECT is Key: Remember, the recursive CTE builds a result set. You almost always need to filter or order in the final
SELECTthat uses it. Putting aWHERE depth < 5clause in the finalSELECTis much more efficient than putting it in the recursive member, because the recursive member has to run to completion first. The finalSELECTjust trims the already-built result.
The designers gave us one loop. It’s a weird, set-based, declarative loop, but it’s a loop. Use it wisely. It’s the perfect tool for traversing graphs and hierarchies that you can’t—or shouldn’t—flatten out elsewhere. Just remember to check for cycles and index your keys, unless you enjoy watching query planners cry.