Right, let’s talk sort keys. This isn’t some academic exercise; this is where your multi-million-row table goes from “agonizingly slow” to “blisteringly fast” or, if you get it wrong, “somehow even slower than before.” A sort key is how Redshift physically organizes your data on disk, and getting it right is the single biggest lever you can pull for performance. Think of it like the index in a massive reference book. If it’s sorted by topic, finding “quantum entanglement” is trivial. If it’s sorted by the number of times the letter ‘z’ appears on the page, you’re in for a long night.

The basic idea is simple: rows with similar sort key values are stored together in the same data blocks. When a query has a WHERE clause filter on the sort key, Redshift can use a range scan to skip entire chunks of data it knows it doesn’t need to read. It’s a superpower. But you’ve got two main ways to build this index: the reliable workhorse (Compound) and the high-maintenance diva (Interleaved). Choose wisely.

The Workhorse: Compound Sort Keys

A compound key is what it sounds like: you define an ordered list of columns. Redshift sorts the data by the first column, then the second within the first, then the third within the second, and so on. It’s a hierarchical sort, just like you’d do in an ORDER BY clause.

-- A classic example for an orders table
CREATE TABLE fact_orders (
    order_id BIGINT,
    order_date DATE,
    customer_id INTEGER,
    region VARCHAR(50),
    amount DECIMAL(18,2)
)
COMPOUND SORTKEY (order_date, region);

Why it works: This structure is brutally efficient for range-restricted queries on the prefix of the sort key. A query filtering on order_date is a home run. A query filtering on order_date and region is also a home run. A query filtering only on region? That’s a strikeout. The sort order is (date, region), so the region values are scattered across all the different dates. The query can’t skip any blocks and will result in a full table scan.

Best Practices & Pitfalls:

  1. Order Matters: Your most common, most selective filter should be first. SORTKEY (high_cardinality_column, low_cardinality_column) is usually better than the reverse.
  2. Don’t Get Greedy: Adding more than 3-4 columns to a compound sortkey often has diminishing returns. You’re unlikely to filter on the 5th column in the key often enough to justify the sorting overhead during COPY or INSERT.
  3. The VACUUM Tax: When you DELETE or UPDATE rows, Redshift marks them for deletion. To reclaim space and re-establish the sort order, you must run VACUUM. It’s non-optional for maintaining compound sort performance, and on large tables, it’s a resource-intensive operation. Schedule it during off-hours.

The Diva: Interleaved Sort Keys

An interleaved sort key is the “have your cake and eat it too” option, but the cake is made of dynamite and might blow up in your face. You still define multiple columns, but Redshift gives each column in the key equal weight. It uses a complex mathematical distribution to interleave the values, so a query filtering on any one of the key columns can, in theory, effectively skip blocks.

-- The same table, but now we want to be able to filter efficiently on EITHER date or region
CREATE TABLE fact_orders (
    order_id BIGINT,
    order_date DATE,
    customer_id INTEGER,
    region VARCHAR(50),
    amount DECIMAL(18,2)
)
INTERLEAVED SORTKEY (order_date, region);

Why it works (and why it often doesn’t): The magic is in the interleaving. The data isn’t sorted A->B->C; it’s sorted in a way that values for A, B, and C are all highly localized. This is fantastic for ad-hoc analytics where you have no idea which column the next query will filter on.

Best Practices & Pitfalls (a.k.a. The Warning Label):

  1. The VACUUM Problem is a Nightmare: Interleaved sorts are far more sensitive to data changes. The VACUUM operation required to maintain them is exponentially slower and more expensive than for a compound key. I’m not exaggerating. It can take days on a large table. If your data is updated frequently, an interleaved key will degrade so fast it’ll make your head spin.
  2. The Law of Diminishing Returns: The query performance benefit for any single column in an interleaved key is usually not as good as the performance of that same column as the first column in a compound key. You trade peak performance for flexibility.
  3. Cardinality is King: Interleaved sorting works best on columns with high cardinality (many distinct values). Putting a low-cardinality column (like a status flag with 3 values) in an interleaved key is a waste of a slot and will hurt overall performance.

So, Which One Do I Pick?

Here’s the direct, from-the-trenches advice:

  • Default to Compound. Always. It’s predictable, robust, and easier to maintain. If 80% of your queries filter on a date or timestamp column, that’s your first sort key. Don’t overthink it.
  • Use Interleaved extremely sparingly. Only consider it for large, mostly static tables that serve truly unpredictable query patterns, like a table powering a user-facing dashboard with dozens of filter options. And even then, monitor VACUUM times like a hawk.
  • You can analyze your choice. After loading data, use the SVV_INTERLEAVED_COLUMNS system view to check the “interleaved skew” for your key. A high value means the distribution is ineffective, and you should reconsider your column choices.
-- See how effective your interleaved sort really is
SELECT tbl_id, table_name, column_name, interleaved_skew
FROM svv_interleaved_columns
WHERE table_name = 'fact_orders'
ORDER BY interleaved_skew DESC;

The designers gave us two tools. One is a trusty hammer. The other is a laser-guided plasma cutter that requires a PhD to operate and might set your workshop on fire. Most of the time, you just need to hit the nail. Use the hammer.