18.5 Cycle Detection in Recursive CTEs (PostgreSQL 14+)
Right, so you’ve built a recursive CTE to traverse a tree or a graph. It’s beautiful. It’s elegant. It works perfectly on your test data. You deploy it. A week later, your phone explodes at 3 AM because some smart aleck added a loop in the data, and your perfect query is now spinning in an infinite recursion, burning through CPU cycles like a crypto miner on a free electricity plan.
This, my friend, is why we have cycle detection. And as of PostgreSQL 14, we don’t have to hand-roll this tedious and error-prone safety check anymore. The database engine will do it for us, and it’s about time.
How the CYCLE Clause Actually Works
Think of the CYCLE clause as your query’s built-in immune system. It remembers every path it’s taken and slaps a big “DO NOT ENTER” sign on any path that would force it to re-tread old ground. Here’s the syntax. It looks a bit verbose at first, but you’ll learn to love its explicitness.
WITH RECURSIVE employee_paths AS (
-- Anchor member: Find the top-level boss
SELECT
id,
name,
manager_id,
ARRAY[id] AS path, -- We start the path with the initial ID
1 AS depth,
false AS is_cycle -- Initially, no cycle
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive member: Walk down the tree
SELECT
e.id,
e.name,
e.manager_id,
ep.path || e.id, -- Append the new employee's ID to the path
ep.depth + 1,
(e.id = ANY(ep.path)) AS is_cycle -- The manual, pre-PG14 way
FROM employees e
JOIN employee_paths ep ON e.manager_id = ep.id
WHERE NOT ep.is_cycle -- Filter out cycles manually
)
SELECT * FROM employee_paths;
The old way, shown above, was a pain. You had to build an array (path) to track your journey, manually check if the new ID was in that array (e.id = ANY(ep.path)), and then use that boolean to filter out subsequent rows. It works, but it’s clunky and it’s on you to get it right.
PostgreSQL 14 said, “Nah, we got this.” Behold:
WITH RECURSIVE employee_paths AS (
SELECT
id,
name,
manager_id,
1 AS depth
FROM employees
WHERE manager_id IS NULL
UNION ALL
SELECT
e.id,
e.name,
e.manager_id,
ep.depth + 1
FROM employees e
JOIN employee_paths ep ON e.manager_id = ep.id
)
CYCLE id SET is_cycle USING path
SELECT * FROM employee_paths;
The magic is in that CYCLE id SET is_cycle USING path line. Here’s the translation:
CYCLE id: “Hey PostgreSQL, use thisidcolumn to uniquely identify a row for the purposes of cycle detection.”SET is_cycle: “Create a new boolean output column calledis_cycle. It will betruefor any row that represents a cycle.”USING path: “To track the path you’re taking, create an implicit array column calledpath. You don’t have to manage it; I’ll handle it.”
The engine internally does what we used to do manually. It tracks the path of id values. When processing a new row in the recursive term, it checks if the new id is already in the accumulated path. If it is, it marks is_cycle as TRUE and, crucially, does not use that row as a basis for further recursion. It’s a dead end. This is the key—it doesn’t just identify cycles; it prevents infinite recursion by stopping at them.
Why Your Column Choice for CYCLE Matters
You might think, “I’ll just CYCLE id everywhere and call it a day.” Not so fast. The column you specify must be the one that uniquely defines a node in the context of the graph you’re traversing. The id column (a primary key) is usually a perfect choice.
But consider a more complex graph. What if you’re traversing a table of flight segments between airports?
WITH RECURSIVE flight_paths AS (
SELECT
departure_airport,
arrival_airport,
ARRAY[departure_airport] AS path
FROM flights
WHERE departure_airport = 'JFK'
UNION ALL
SELECT
f.departure_airport,
f.arrival_airport,
fp.path || f.arrival_airport
FROM flights f
JOIN flight_paths fp ON f.departure_airport = fp.arrival_airport
)
CYCLE arrival_airport SET is_cycle USING path
SELECT * FROM flight_paths;
Spot the problem? If you CYCLE arrival_airport, you’re saying a cycle occurs only if you land at the same airport twice. But what if you have a route JFK -> ATL -> LAX -> ATL? You’ve landed at ATL twice, which is a cycle. This works.
But what if your data uses airport codes that aren’t unique per airport? Or if your graph nodes are defined by a combination of columns? You must CYCLE on a set of columns that, together, form a unique identifier for a node in your specific graph. Sometimes you might even need to create a synthetic key for this purpose. The engine isn’t psychic; it can only detect cycles based on the exact values you tell it to care about.
The Invisible, Implicit PATH Column
The USING path part of the clause creates a hidden column. It’s not truly hidden—you can SELECT * and see it—but it’s implicit in the CTE’s definition. This is brilliantly convenient. You don’t have to declare it in the anchor member, you don’t have to manage the array appending logic in the recursive member. It’s all done for you.
This is a huge win for readability and maintenance. The query intent—the actual traversal logic—is no longer obfuscated by boilerplate cycle-detection code. The CYCLE clause declaratively states your intent, and the engine handles the imperative junk under the hood. This is how modern SQL should work.
Best Practices and the One Big “Gotcha”
First, the practice: Always use it. There is virtually no downside. The performance overhead is minimal compared to the cost of an infinite loop. It’s your cheapest insurance policy.
Now, the “gotcha”: The CYCLE clause only prevents infinite recursion within the CTE itself. It will not save you from a recursive CTE that is simply processing a very large, but acyclic, graph. If your company’s reporting hierarchy has 500,000 employees, your recursive CTE will still build a massive result set and potentially consume vast amounts of memory. Cycle detection stops loops; it doesn’t replace careful query design with LIMIT, WHERE clauses, or other filters to keep your result set sane. Don’t confuse cycle detection with a license to be reckless with your graph sizes.