21.3 Distribution Styles: EVEN, KEY, ALL
Alright, let’s talk about how Redshift physically arranges your data across its compute nodes. This isn’t some abstract concept; it’s the absolute bedrock of performance. Get this wrong, and you’ll be pouring money into a cluster that spends 90% of its time shuffling data around like a confused intern. We call this the distribution style.
Think of your Redshift cluster as a team of workers (the nodes). You have a massive table (a list of every sale your company has ever made) and you need to split it among them. How you do that—the distribution style—determines whether these workers can operate independently or if they’re constantly on the intercom asking each other for data. There are three ways to do this: EVEN, KEY, and ALL. Your job is to pick the right one.
The Default (And Often Worst) Choice: EVEN
Redshift will default to EVEN distribution if you don’t specify a style or key. It does exactly what it says: it spreads the table rows across all slices in the cluster in a round-robin fashion. It’s “fair” in a mindless, egalitarian way.
CREATE TABLE events_even (
event_id INTEGER,
user_id INTEGER,
event_date DATE,
payload VARCHAR(MAX)
) DISTSTYLE EVEN;
Why would you use this? Honestly, not often. It’s a decent last resort for standalone fact tables that don’t join to much else or for staging tables you’re just going to rewrite. The problem is joins. When you join this events_even table to, say, a users table distributed on user_id, Redshift has to physically relocate rows from both tables so that matching user_id values are on the same slice. This “redistribution” step is a massive performance killer. Avoid EVEN for large tables that are central to your join patterns.
The Workhorse: KEY Distribution
This is where the magic happens. You specify a column (or columns), and Redshift uses a hash algorithm on that column’s value to determine exactly which slice to store each row on. The genius part? Any other table distributed on the same key will have its matching rows on the same slice.
CREATE TABLE sales (
sale_id INTEGER,
customer_id INTEGER,
product_id INTEGER,
sale_date DATE,
amount DECIMAL(12,2)
) DISTSTYLE KEY DISTKEY (customer_id);
CREATE TABLE customers (
customer_id INTEGER DISTKEY, -- Also works as part of the column definition
name VARCHAR(255),
signup_date DATE
) DISTSTYLE KEY;
Now, when you run a join between sales and customers on customer_id, the executor node knows it can find all the data for any given customer locally on its slice. No inter-node chatter. No redistribution. It’s a thing of beauty. This is called collocation.
The Pitfall: The cardinality and skew of your distribution key matter. A lot. If you choose a DISTKEY like country_code where 90% of your rows have the value ‘US’, you’ve just created a “hot slice.” One node will do 90% of the work while the others twiddle their thumbs. You want a column with high cardinality (many unique values) and relatively uniform distribution. Primary keys like user_id or order_id are usually perfect candidates.
The Heavy Artillery: ALL Distribution
ALL distribution does something gloriously brute-force: it copies the entire table to every single compute node in the cluster.
CREATE TABLE dim_date (
date_id INTEGER,
full_date DATE,
day_of_week INTEGER,
is_holiday BOOLEAN
) DISTSTYLE ALL;
Why on earth would you do this? For small, critical dimension tables. Think dates, countries, product categories—tables that are a few megabytes at most but are joined to every massive fact table you have. By distributing ALL, you ensure that every node has a local copy of this dimension. When a join happens, there’s zero redistribution needed for this table; each node can just join its local fact rows to its local copy of the dimension. It’s a huge win.
The Massive Warning: Do NOT do this to a large table. I once saw a client DISTSTYLE ALL on a 500GB table. They effectively turned their 500GB cluster into a (nodes * 500GB) cluster, instantly consuming all storage and crashing the whole thing. It was a spectacular, expensive lesson. Use ALL only for genuine small dimensions. If you’re unsure, check the size with SELECT pg_table_size('<schema>.<table_name>') / (1024*1024) as size_mb; first.
How to Decide and What to Watch For
Your goal is to minimize data movement. The query planner in Redshift’s console is your best friend here. Run an EXPLAIN plan on your big, ugly queries and look for the dreaded “DS_DIST_ALL_INNER” or “DS_BCAST_INNER” labels. This is Redshift politely screaming that it’s redistributing or broadcasting entire tables to complete your join.
The golden rule? Model your largest fact tables with a DISTKEY on the column most commonly used to join them to other large tables. Make your small dimensions ALL. And for the love of all that is holy, never let Redshift silently default your billion-row table to EVEN. Always, always declare your distribution strategy explicitly. It’s the single most impactful design choice you’ll make.