22.4 Sequential Scan vs Index Scan vs Index-Only Scan
Alright, let’s get our hands dirty. You’ve fired off a query and it’s taking longer than a Monday morning. You know you need to look at the query plan, but staring at that tree of nested loops and scans can feel like reading tea leaves. I’m here to teach you how to read them. The three big players you’ll see are Sequential Scans, Index Scans, and Index-Only Scans. Knowing the difference is the difference between guessing and knowing.
At its heart, it’s a story about I/O. Your database is constantly making a trade-off: is it cheaper to read a little bit from a lot of places, or to just read a whole lot from one place? Let’s break it down.
The Humble (and Brutish) Sequential Scan
This one’s simple. A sequential scan is your database saying, “Screw it, we’re reading the entire table.” It starts at the first page (or block) of the table and reads every single row, checking to see if it matches your WHERE clause.
EXPLAIN SELECT * FROM users WHERE last_login < '2020-01-01';
QUERY PLAN
----------------------------------------------------------
Seq Scan on users (cost=0.00..18334.00 rows=500000 width=164)
Filter: (last_login < '2020-01-01'::date)
See that? It’s going to read all 18,334 pages (or whatever the cost units represent) to find your half-million old users. The Filter line means it’s reading every row and then applying the condition. This is the default, the fallback, the “well, we have no better option” move. It’s efficient in exactly one scenario: when you need a large majority of the rows in the table. If you’re reading more than roughly 5-10% of a table, the planner will often say, “Forget the index, it’s cheaper to just read the whole thing sequentially.” The cost of jumping around via an index (random I/O) becomes higher than just reading the data in order (sequential I/O).
The Index Scan: The Clever Shortcut
An index scan is your database using, you know, an index. It’s smarter. It first goes to the index structure (usually a B-tree) to quickly find the physical locations of the rows you care about. Then it goes to those specific locations in the table (“heap”) to fetch the actual row data.
Let’s add an index and see the magic.
CREATE INDEX ON users(last_login);
EXPLAIN SELECT * FROM users WHERE last_login < '2020-01-01';
QUERY PLAN
----------------------------------------------------------
Index Scan using users_last_login_idx on users (cost=0.42..16764.42 rows=500000 width=164)
Index Cond: (last_login < '2020-01-01'::date)
Notice the change from Filter to Index Cond. The condition is now being applied inside the index scan itself; it only bothers to go to the heap for rows that we know will pass the condition. This is fantastic for highly selective queries. But here’s the hidden cost: an index scan still has to do a “heap fetch” for every single row it finds in the index. This is a random I/O operation. If your query is selecting a huge number of rows, the planner might still choose a sequential scan to avoid thousands of random I/O operations. This is a common point of confusion—just because you have an index doesn’t mean the planner will use it. It’s not stupid; it’s doing math.
The Index-Only Scan: The Holy Grail
This is where we win. An index-only scan is the database showing off. It happens when everything your query needs is already in the index. It never has to go back to the main table (the heap). It just reads the index, which is often a much smaller file, and returns the data. This is stupidly fast because it’s all sequential I/O on a compact structure.
Let’s see it in action. Notice we’re only selecting last_login and the primary key id, which is automatically included in the index.
EXPLAIN SELECT last_login, id FROM users WHERE last_login < '2020-01-01';
QUERY PLAN
----------------------------------------------------------
Index Only Scan using users_last_login_idx on users (cost=0.42..15664.42 rows=500000 width=12)
Index Cond: (last_login < '2020-01-01'::date)
Boom. Index Only Scan. But wait, there’s a catch—a classic PostgreSQL “gotcha.” For this to work, the index must be “visible.” PostgreSQL uses Multi-Version Concurrency Control (MVCC), which means it needs to know if a row is visible to your current transaction. The visibility information is stored only in the heap. So how can an index-only scan work without touching the heap?
It cheats. It uses a thing called the Visibility Map. As tables are vacuumed, PostgreSQL tracks which data pages contain only rows that are visible to all transactions. If the index entry points to a page marked all-visible in the map, it can skip the heap fetch. If not, it has to go back to the heap to check visibility. This is why your VACUUM strategy is so critically important. A poorly vacuumed database will see far fewer index-only scans, as the planner can’t trust the visibility map. It’s a thing of beauty when it works, but you have to hold up your end of the bargain with maintenance.