20.6 BRIN: Block Range Indexes for Naturally Ordered Large Tables

Right, BRIN indexes. Let’s talk about the index you use when you’ve given up on being a hero. You’ve got a table so monstrously large that the mere thought of building a standard B-tree index on it makes your wallet physically cringe and your maintenance window manager burst into tears. This is where BRIN, or Block Range INdexes, saunters in. It’s the index of last resort, and when it works, it feels like absolute black magic.

The core idea is brilliantly lazy: instead of indexing every single row (like a B-tree does), a BRIN index summarizes ranges of table blocks. A table block is typically 8KB. So, the index might say, “For blocks 1 through 128, the created_at timestamp falls between 2023-01-01 and 2023-01-02.” If your query is looking for 2023-01-01 14:00:00, the BRIN index knows it might be in that range of blocks. It then goes and scans all 128 of those blocks to find the exact rows. This is its superpower and its fatal flaw, all in one.

Why This “Dumb” Idea is So Damn Smart

The trade-off is simple: massive size savings for more table scanning. A BRIN index is often orders of magnitude smaller than a B-tree on the same column. We’re talking kilobytes instead of gigabytes. This means it builds in seconds, not hours. The magic happens when your data is physically ordered on disk by the column you’re indexing. If your billions_of_events table is loaded in event_time order, then the “summary” for each range of blocks is incredibly precise. A query for a specific time range might only need to read a handful of block ranges, making it stunningly fast. If your data is randomly scattered, the summary for each range is useless—“the values in these blocks are between 1 and 9,999,999”—and the index will cause a full table scan, but slower because it bothered to consult the index first.

Building a BRIN Index: The Basics

Let’s say we have a table of sensor readings that’s appended to by time. It’s a textbook case.

-- Create our fact table. Note the lack of a primary key; we're at this scale now.
CREATE TABLE sensor_data (
    sensor_id integer,
    reading_time timestamptz NOT NULL,
    reading_value double precision
);

-- Load a few hundred million rows in, in chronological order.

-- Now, the magic. The 'pages_per_range' is the key knob to turn.
CREATE INDEX idx_sensor_data_time ON sensor_data USING BRIN (reading_time);

By default, pages_per_range is 128. This means it summarizes 128 blocks (128 * 8KB = 1MB of data). This is a good starting point. If your data is very tightly correlated to its on-disk order, you can make the ranges larger for an even smaller index. If the correlation isn’t perfect, you might make it smaller for more precision.

Tuning the pages_per_range

This is the most important tuning parameter. You’re balancing index size against the number of false positives (blocks it has to scan that don’t contain your data).

-- For super well-ordered data, maybe we can summarize 256 pages at a time?
CREATE INDEX idx_sensor_data_time_large_range ON sensor_data USING BRIN (reading_time) WITH (pages_per_range=256);

-- Or if the data is a bit messy, let's get more granular with 32 pages.
CREATE INDEX idx_sensor_data_time_small_range ON sensor_data USING BRIN (reading_time) WITH (pages_per_range=32);

How do you know what to pick? Query the pg_stats view to see the correlation of your column (a value of 1.0 is perfectly correlated to its physical order, -1.0 is perfectly inverse). Then, test! See which setting gives you the best query performance for a typical range query.

SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'sensor_data';

The Maintenance Gotcha: Autosummarization

Here’s the part the manual often glosses over. When you insert new data, the BRIN index doesn’t instantly summarize it. A background process called autovacuum eventually comes along and does this in a process called “summarization.” This means your spanking new data is not in the index until that happens. Your queries for the latest data will miss it. This is, frankly, a bit absurd for a time-series append pattern.

You can manually nudge it, though it’s a bit of a hack. The brin_summarize_new_values function will summarize the last few ranges.

-- Summarize the latest unsummarized ranges for our index
SELECT brin_summarize_new_values('idx_sensor_data_time');

For a high-insert system, you might need to schedule this more aggressively than autovacuum does. It’s a trade-off: more frequent summarization means your index is always fresh, but it adds overhead.

When to Use BRIN (And When to Run Away)

Use BRIN when:

The table is large. Don’t bother for a million rows; a B-tree is fine.
The data has a strong natural correlation with its physical order (e.g., timestamp-based append-only logs, a serial primary key).
You’re doing range queries (WHERE date > X AND date < Y) that align with that natural order.

Avoid BRIN like the plague when:

Your data is randomly ordered on disk. You’ll just get a terrible, terrible plan.
You need point queries (WHERE id = 1234). A BRIN will still scan a huge swath of the table. Use a B-tree.
You need to return results in a sorted order. BRIN doesn’t help with ORDER BY.

BRIN isn’t a silver bullet. It’s a scalpel for a very specific, data-warehouse-shaped problem. But when your data is ordered right and your queries are right, it performs miracles with a fraction of the resources. It’s the index that acknowledges that sometimes, being a little bit lazy and approximate is the most brilliant strategy of all.