39.3 VACUUM ANALYZE: Updating Table Statistics
Right, let’s talk about VACUUM ANALYZE. You’ve probably heard of VACUUM for cleaning up dead rows, but ANALYZE is its brainy, data-obsessed sibling. While VACUUM is the janitor, ANALYZE is the statistician who tells the janitor where to find the biggest messes and, more importantly, tells the query planner how to navigate the entire building.
Here’s the deal: when you run ANALYZE (or VACUUM ANALYZE), you’re not cleaning up data. You’re commandeering a small sample of your table’s rows, running them through a statistical model, and updating the pg_statistic system catalog. This isn’t about exact counts; it’s about building a profile of your data’s distribution. The query planner eats this profile for breakfast. Without fresh stats, the planner is basically guessing, and it’s terrible at guessing. It might think your 100-million-row table has only 100 rows, leading it to choose a tragically naive nested loop join that will still be running when your coffee gets cold.
Why You Can’t Just Rely on Autovacuum
PostgreSQL’s autovacuum daemon is a fantastic piece of engineering, but it’s designed for the average database, not your database. It wakes up periodically based on a formula involving the number of inserted, updated, or deleted tuples (n_dead_tuples). For ANALYZE, the trigger is based on the number of rows inserted or updated.
The problem? This might not be frequent enough for a highly volatile table. If you have a large table where the data distribution changes dramatically with a new batch load every hour—say, a fact table in a data warehouse—waiting for autovacuum to get around to it could mean hours of painfully slow queries. You need to take matters into your own hands.
-- The classic one-two punch. This does the full cleanup AND updates stats.
VACUUM ANALYZE my_giant_table;
-- Sometimes you just need the stats. This is non-blocking for reads and much faster.
-- Use this after a major data load that doesn't create many dead tuples.
ANALYZE my_giant_table;
The Art of the Manual ANALYZE
Running a simple ANALYZE is fine, but you can be a connoisseur about it. The command accepts parameters to fine-tune its behavior.
-- Analyze just a few columns if you know only they've changed.
-- Saves time on wide tables with many inert columns.
ANALYZE my_table (column_a, column_b);
-- Drastically increase the sample size for a more accurate stat profile.
-- Useful for enormous tables where the standard sample might miss important skew.
ANALYZE VERBOSE my_table;
The VERBOSE option is your friend here. It’ll tell you exactly what it’s doing, which is great for peace of mind when you kick off a long-running ANALYZE on a multi-terabyte table.
When Statistics Go Bad: The Perils of Skewed Data
This is where the real fun begins. ANALYZE takes a random sample of your table. For most tables, this is statistically sound. But if your data is extremely skewed, the random sample can miss important outliers.
Imagine a users table where 99.9% of users are from ‘US’, but a few thousand are from ‘TL’ (Timor-Leste). If the random sample doesn’t grab any ‘TL’ rows, the planner will think there are zero users from Timor-Leste. A query filtering on country_code = 'TL' will be given a catastrophically low cost estimate. The planner might decide it’s fine to do a sequential scan for just a few rows right in the middle of your complex 8-table join, bringing the whole operation to its knees.
There’s no magic bullet here. If you have known, important outliers, you might need to increase the statistics target for that specific column. This tells ANALYZE to collect a more detailed histogram for it.
-- First, check the current target (default is 100)
SELECT attname, attstattarget FROM pg_attribute
WHERE attrelid = 'my_table'::regclass AND attname = 'problem_column';
-- Crank it up to 500 for more detailed stats on this specific column.
ALTER TABLE my_table ALTER COLUMN problem_column SET STATISTICS 500;
-- You MUST re-analyze for the new target to take effect.
ANALYZE my_table;
This isn’t a decision to make lightly—it makes ANALYZE work harder on that column—but for critical, highly skewed columns, it’s an essential tool.
Best Practices: Don’t Be a Hero
Your instinct might be to ANALYZE everything, all the time. Resist it. On a large, busy OLTP system, a full ANALYZE can contend with other queries for resources. The best practice is a pragmatic one: let autovacuum handle the steady-state workload for most tables. Intervene manually for your known problem children—the huge, fast-changing tables—immediately after major data loads or batch updates. Schedule these ANALYZE operations during a maintenance window if you can. And for the love of all that is holy, monitor the last_analyze and last_autoanalyze columns in pg_stat_all_tables to make sure your most important tables aren’t running on stale intelligence. Your query planner will thank you with faster, smarter execution plans.