17.4 IN vs EXISTS: Semantics, NULL Handling, and Performance
Right, let’s settle this. The IN vs. EXISTS debate is one of those classic database arguments that can turn a friendly lunch into a heated theological debate. It’s not about which one is universally “better”—that’s a rookie take. It’s about understanding their different DNA: how they think, how they handle the weird stuff (looking at you, NULL), and when one will run circles around the other.
The core of the issue is a difference in intent. IN is fundamentally about checking a value against a list. EXISTS is about checking for the existence of any row that meets a condition. This semantic difference dictates everything that follows.
The Logical Difference: List Membership vs. Existence
When you use IN, you’re effectively creating a list (from your subquery) and asking, “Is my value in this list?” The database often materializes this list (in memory, as a temp table, etc.) to perform the check.
-- IN: "Find me employees whose department_id is in the list of
-- all department_ids in the 'East Coast' region."
SELECT *
FROM employees e
WHERE e.department_id IN (
SELECT d.id
FROM departments d
WHERE d.region = 'East Coast'
);
EXISTS, on the other hand, doesn’t care about a list. It creates a correlated subquery and asks, “For this specific employee, does at least one row exist in the departments table that matches my condition?” It returns TRUE the microsecond it finds that first matching row. It’s a signal, not a scan.
-- EXISTS: "For each employee, check if there exists at least one
-- department row with a matching id *and* the 'East Coast' region."
SELECT *
FROM employees e
WHERE EXISTS (
SELECT 1 -- It's a signal. SELECT 1, SELECT id, SELECT 'blah'... it doesn't matter.
FROM departments d
WHERE d.id = e.department_id -- This is the correlation!
AND d.region = 'East Coast'
);
See the correlation? The EXISTS subquery is useless without that link to the outer table (d.id = e.department_id). This isn’t a standalone list; it’s a per-row question.
The NULL Gotcha: Where IN Implodes
This is the part where IN reveals its fatal flaw. NULL means “unknown value.” So what happens when your beautiful list of values contains a NULL? Logic breaks down.
The truth table for IN is brutal: value IN (a, b, NULL) is NULL (unknown) if the value isn’t found in the non-NULL part of the list. And since WHERE clauses only care about TRUE, your row gets filtered out. It vanishes. Poof.
-- Let's say our subquery returns (10, 22, NULL)
SELECT 'Found' WHERE 5 IN (10, 22, NULL); -- Returns nothing (NULL is not TRUE)
SELECT 'Found' WHERE 22 IN (10, 22, NULL); -- Returns 'Found'
EXISTS sidesteps this entire circus. It doesn’t compare values directly; it just checks for a row. If a row exists, it’s TRUE. If not, it’s FALSE. NULL values within the rows it’s checking don’t automatically sabotage the entire operation like they do with IN. The existence check is far more robust.
Performance: It’s About the Joins (and the Optimizer)
Here’s the dogma you’ve probably heard: “Always use EXISTS; it’s faster.” It’s often true, but it’s not a law of physics. The modern query optimizer is a terrifyingly smart beast. For simple, equivalent queries, it will frequently produce the exact same execution plan for IN and EXISTS. Don’t believe me? Run an EXPLAIN and see for yourself.
The performance difference becomes crucial in two main scenarios:
When the subquery result is huge:
INmight try to build that massive list in memory and then probe it.EXISTS, being a correlated nested loop, can often be more efficient because it doesn’t materialize the entire set upfront. It asks its simple yes/no question for each outer row and moves on.When the outer table is huge and the inner table is small: This is where a correlated
EXISTScan actually be a liability. For every single row in your massive outer table, it’s firing off a query against the inner table. If the inner table has great indexes to support the correlation (e.g., an index ondepartments(id, region)for our example), it’s lightning fast. If it doesn’t, it’s a disaster. AnINwith a small, pre-computed list might win here.
The best practice isn’t to guess. It’s to write the query that most clearly expresses your intent (90% of the time, this is EXISTS for existence checks) and then check the execution plan if performance is critical. Let the database tell you what it wants to do. Your job is to give it the best question to answer.