22.6 Planner Statistics: pg_statistic and ANALYZE

Right, let’s talk about the secret sauce that makes EXPLAIN anything more than a collection of hopeful guesses: the planner statistics. Without this, the query planner is just a person trying to navigate a pitch-black room by shouting “Marco!” and hoping the “Polo!” they get back is from the exit and not a brick wall. The pg_statistic system catalog and the ANALYZE command are the lights in that room.

When you run ANALYZE (or VACUUM ANALYZE), you’re not just “updating stats.” You’re sending a tiny, overworked statistician scurrying through your tables. This statistician doesn’t look at every single row—that would take forever on a large table. Instead, it takes a sample. It builds a profile of your data: what values are most common, how values are distributed, and how many distinct values there are. This profile is stored, in a highly condensed and somewhat esoteric form, in pg_statistic. You don’t usually query pg_statistic directly; its friendlier cousin, pg_stats, is a view that translates this arcane data into something vaguely human-readable.

Let’s see what our friendly neighborhood statistician has been up to.

-- Let's see what the planner knows about my 'users' table
SELECT attname, n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'users';

-- attname   | n_distinct | most_common_vals | most_common_freqs | histogram_bounds
-------------+------------+------------------+-------------------+-----------------
-- country   |        12 | {US,GB,DE,FR}    | {0.4,0.2,0.1,0.1} | {AU,CA,CN,...,ZA}
-- is_active |         2 | {t}              | {0.95}            | NULL

See that? For the country column, the planner knows there are about 12 distinct values. It knows that 40% of rows are ‘US’, 20% are ‘GB’, and so on. The histogram_bounds show the ranges it’s created to estimate things like WHERE country > 'M'. For the boolean is_active column, it knows that true is overwhelmingly common (95% of rows). This is everything for the planner. If it thinks a condition will match 95% of a table, it will almost always choose a Sequential Scan. If it thinks it will match 0.1%, it will choose an Index Scan. Your job is to make sure it’s right.

Why Your Planner’s Guesses Are Sometimes Terrible

The sampling method is brilliant, but it’s not clairvoyant. It can be spectacularly wrong. The most common culprit? Data skew. If you have a status column where 99.9% of rows are 'processed' and a few are 'failed', the statistician might, by pure bad luck, sample a data page that’s all 'failed'. It will then think 'failed' is the common value and suggest a catastrophic Sequential Scan for a query fetching a couple of 'failed' records. The solution isn’t to scream at Postgres; it’s to run ANALYZE more often on volatile tables or increase the statistics target for that column.

-- Tell the statistician to take a bigger sample for this critical, skewed column
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
-- Then re-run ANALYZE
VACUUM ANALYZE orders;

The default default_statistics_target is 100. Crank it up to 1000 (the max) for problem columns. This makes ANALYZE sample more rows, building a more accurate—but still sampled—profile.

The Perils of Correlation and Physical Row Order

Here’s a subtle one that bites everyone eventually. The planner cares about the physical order of your data on disk. Why? For a WHERE id > 1000000 query on a sorted integer column, it can look at the histogram, see that 1000000 is in the upper quartile, and correctly guess it’s fetching roughly 25% of the table. But if you’re searching on a range for a column that is correlated with the physical row order (like a created_at timestamp on a table that’s mostly inserted-to in order), the planner can make even better guesses. The correlation statistic in pg_stats measures this (-1 to 1, 1 is perfectly correlated).

-- Check the correlation for a timestamp column
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'events' AND attname = 'created_at';

-- attname    | correlation
-------------+-------------
-- created_at |    0.998712

A value that high tells the planner, “This table is sorted almost perfectly by this column.” This makes index scans on range queries even more attractive. If you see a correlation of near 0, it means the values are randomly scattered across the table, and the planner knows a range query might have to jump all over the disk, making an index scan less appealing. This is why a CLUSTER operation on an index can sometimes speed things up; it physically reorders the table to match the index, driving the correlation for that column toward 1.

The Ghost of Data Past: Stale Statistics

This is the big one. The statistician only comes when you call it (ANALYZE). If you bulk-load a million new rows into your orders table, the planner’s knowledge is frozen in time. It’s still making plans based on the old, tiny table. It might look at a query for WHERE order_date > 'today' and think, “Hmm, that’s a very rare value, I’ll use the index,” when in reality, you just loaded a million orders from today and a full table scan would be the right choice. The autovacuum daemon usually takes care of this for you, but if you’ve just done a massive, non-rolling batch update, you need to manually run ANALYZE your_table to bring your planner out of the dark ages. Never trust an EXPLAIN on a table with stale statistics.