17.2 Correlated Subqueries: Referencing the Outer Query
Right, let’s talk about correlated subqueries. This is where the magic happens, and also where you can accidentally summon a performance demon that will devour your server’s soul. The core idea is simple, but the implications are huge.
Unlike its simpler cousin, the scalar subquery, which executes once and returns a single value, a correlated subquery is a little gossip. It can’t do its job without information from the outside world—specifically, the current row being evaluated by the outer (or “main”) query. It’s executed not once, but once for every single row the outer query considers. Let that sink in. If your outer query returns 10,000 rows, your subquery runs 10,000 times. This is why we get the big bucks.
The mechanism for this gossipy exchange is the WHERE clause inside the subquery that references a column from the outer query. It’s what creates the correlation.
How the Correlation Actually Works
Think of it like a nested loop in code. For each row the outer query pulls from its table, the DB engine pauses, takes a value from that row (like employee_id), and shouts down into the subquery: “Hey! Here’s an employee_id! Now go find me the average salary for this guy’s department and tell me if his salary is above it!”
Here’s a classic example. Let’s find all employees who earn more than the average salary in their own department. A non-correlated subquery can’t do this because the average is different for each department.
SELECT employee_id,
first_name,
last_name,
department_id,
salary
FROM employees e_outer
WHERE salary > (
SELECT AVG(salary)
FROM employees e_inner
WHERE e_inner.department_id = e_outer.department_id -- This is the correlation!
);
See that e_outer.department_id reference inside the subquery? That’s the linchpin. The database can’t resolve that subquery until it has a specific row from e_outer to pull the department_id from. For a row from the Engineering department, it calculates the average salary for Engineering. For a row from Marketing, it calculates the average for Marketing. It’s a brutal, row-by-row grind.
The Performance Elephant in the Room
I hope you’re mentally screaming “N+1 PROBLEM!” because you should be. This is the SQL embodiment of it. The performance is often atrocious for large datasets. The optimizer can sometimes work miracles, but you should never bet on it. If you find yourself writing a correlated subquery on a table with millions of rows, stop. Just stop. Go get a coffee and reconsider your life choices. There’s almost always a better way, usually involving a JOIN to a derived table or a window function.
-- This is USUALLY a better approach than the correlated subquery above.
WITH dept_avgs AS (
SELECT department_id,
AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
)
SELECT e.employee_id,
e.first_name,
e.last_name,
e.department_id,
e.salary
FROM employees e
JOIN dept_avgs d ON e.department_id = d.department_id
WHERE e.salary > d.avg_salary;
This runs the aggregation (AVG and GROUP BY) exactly once, which is infinitely more efficient than once-per-row.
But Sometimes, It’s the Right Tool
Despite the dire warnings, correlated subqueries aren’t evil. They’re just sharp. There are moments where their expressiveness is exactly what you need, especially when the logic is inherently row-by-row.
A perfect use case is with the EXISTS operator. Checking for existence is often very efficient because the database can stop the subquery the moment it finds a single matching row (it’s a “semi-join”). It doesn’t have to process the entire related table each time.
Let’s find all departments that actually have employees in them.
SELECT department_id,
department_name
FROM departments d
WHERE EXISTS (
SELECT 1
FROM employees e
WHERE e.department_id = d.department_id -- Correlation!
);
The SELECT 1 is a convention. It doesn’t matter what you select because EXISTS only cares about whether any rows are returned. This is efficient; for each department, it checks for at least one employee and bails immediately upon finding one.
The NULL Pitfall
Remember, if the value you’re correlating on is NULL, the entire subquery might return an empty set. For example, if you have an employee with a NULL department_id, the subquery WHERE e_inner.department_id = e_outer.department_id will effectively be WHERE e_inner.department_id = NULL. Since nothing equals NULL (not even another NULL), that subquery will return no rows. This can lead to unexpected results, as if that row never existed for the comparison. Always know your data and how NULL values will be handled in your logic.
The takeaway? Correlated subqueries are a powerful feature for expressing certain problems with beautiful, intuitive clarity. But you must wield them with respect for the computational cost they incur. Understand the nested loop happening under the hood. Test with realistic data volumes. And for the love of all that is holy, have an index on the column you’re correlating on inside the subquery—it can turn a full table scan per row into a quick index seek per row, which is the difference between a query that runs in seconds and one that runs until your cloud bill bankrupts the company.