39.6 pg_repack: Online Bloat Removal Without Full Locks
Alright, let’s talk about the big gun in your anti-bloat arsenal: pg_repack. You’ve met VACUUM, you know its limitations. It cleans up dead rows, but it leaves that gaping, reusable free space behind, like a teenager’s bedroom that’s been “tidied” by just shoving everything under the bed. The table and its indexes are still physically large on disk, consuming precious I/O and cache memory. To truly reclaim that space and get the table physically compact again, you historically had to use CLUSTER or VACUUM FULL, both of which require a full table lock that brings your writes to a grinding, screaming halt. Not ideal for a 24/7 application.
Enter pg_repack. This brilliant piece of external software (you’ll need to install it) performs nothing short of black magic. It rebuilds a table or an index completely, compacting it down to its minimal possible size, while still allowing concurrent reads and writes. It’s the equivalent of rebuilding the engine on your car while you’re still driving down the highway. The designers, I’m convinced, are wizards.
Here’s the high-level trick: pg_repack creates a completely new, pristine copy of the table (or index) in the background, all while a special trigger installed on the original table faithfully copies over every single change that happens during the operation. Once the new table is built and perfectly synced, it swiftly swaps the old physical files for the new ones in a transaction lock, which is so brief you’ll barely notice it. It’s a thing of beauty.
Installation and Setup
First, you need to get this tool on your machine. On Ubuntu/Debian, it’s a simple apt-get install postgresql-XX-pg-repack (replace XX with your PostgreSQL major version). For others, check the official repo. The crucial step is adding it to a database. Connect to the database you want to repack and run:
CREATE EXTENSION pg_repack;
This installs the necessary functions and, most importantly, the repack trigger function that makes the whole concurrent magic possible. You don’t run the tool as a SQL function; you run it as a command-line utility.
Basic Table Repacking
The most common operation is repacking a specific table to remove bloat. The command is straightforward. Let’s say our users table has ballooned after a massive soft-deletion operation.
pg_repack -h my-db-host.example.com -d my_database -t users
This will connect to my_database on the given host (using your default or supplied credentials) and rebuild the users table. While it’s running, your application can continue to SELECT, INSERT, UPDATE, and DELETE from the users table. It’s not a free lunch—there is overhead from the trigger copying changes, so you might see a slight performance dip—but it’s a world away from a complete write lock.
Handling Indexes and Full Databases
pg_repack isn’t just for tables. You can repack specific indexes, which is fantastic for indexes that get heavily modified.
# Repack a single index
pg_repack -h my-host -d my_db --index users_pkey
# Repack EVERYTHING in a database (use with extreme caution)
pg_repack -h my-host -d my_db
The “repack everything” command is powerful but dangerous. It will hammer your system with the overhead of the triggers on every single table simultaneously. Only do this during a designated maintenance window or on a replica. The designers gave you a chainsaw; please don’t try to juggle with it.
The Gotchas and The Glory
No tool is perfect, and pg_repack has its rough edges. First, it requires enough free disk space to essentially make a full copy of the object it’s repacking. If you’re already at 99% disk usage, you’re out of luck. Second, it doesn’t work on tables that lack a primary key or a usable unique index. It needs something to reliably identify rows for the sync process. This is the one design choice that feels a bit arbitrary; surely a CTID-based approach could be a fallback? But no, it insists. Always check first:
SELECT relname FROM pg_class WHERE relkind = 'r' AND relnamespace::regnamespace::text = 'public'
AND relname NOT IN (
SELECT DISTINCT table_name FROM information_schema.table_constraints WHERE constraint_type = 'PRIMARY KEY'
);
Also, be aware of long-running transactions. If a transaction holds a snapshot open for longer than the pg_repack operation takes, the repack can’t clean up its temporary files until that old snapshot is released. This will cause the repack to appear to “hang” at the very end. It’s not hung; it’s just being a good citizen and waiting. Always monitor for old backends (SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction';) before starting a major repack job.
The glory, however, far outweighs the gotchas. The feeling of watching a 300GB table get compacted down to 80GB without dropping a single connection is a form of database administrator enlightenment. Use it wisely, but use it you must.