21.2 Node Types: RA3 with Managed Storage vs DC2
Right, let’s settle the great Redshift node debate: RA3 versus DC2. This isn’t just a choice of hardware; it’s a fundamental decision about how you want to pay for and manage your data’s most expensive real estate: its storage. Get this wrong, and you’ll be writing a very large check to AWS for a service you’re not using efficiently. Get it right, and you look like a wizard.
The core distinction is beautifully simple: with DC2 nodes, you’re paying for both compute and the attached storage. It’s the old-school way. You buy the whole pizza. With RA3 nodes, you pay for the compute and then separately for the managed storage you actually use. You buy slices. This isn’t just a billing nicety; it’s an architectural revolution that dictates how you’ll scale.
The Legacy Workhorse: DC2 (Dense Compute)
Think of DC2 nodes as the dedicated, high-performance sports car of Redshift. When you provision a dc2.large or dc2.8xlarge node, you get a fixed amount of insanely fast, locally attached SSD storage. The bigger the node type, the more vCPUs, memory, and storage you get. They’re a packaged deal.
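# This creates a 4-node dc2.large cluster.
# You get: 4 nodes * 0.16 TB SSD = 0.64 TB of total storage.
# You pay for all of it, 24/7, whether you use it or not.
# Clusters are provisioned through the AWS API/CLI, not SQL.
# Replace the placeholder credentials before running.
aws redshift create-cluster \
  --cluster-identifier my-dc2-cluster \
  --node-type dc2.large \
  --number-of-nodes 4 \
  --master-username awsuser \
  --master-user-password '<your-password>'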
The upside? Raw performance. The data is right there on the local SSDs of the compute nodes. There’s no network hop to get it. For workloads that are intensely compute-bound and need to scan data at ludicrous speed, DC2 can still scream. The downside is brutal inflexibility. If you need more storage, you have to add more compute nodes. You can’t just have a 2 TB table on a 4-node dc2.large cluster; it physically won’t fit. You’re forced to scale compute to get storage, which is like having to buy a new, more powerful engine just to get a bigger gas tank.
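To put a number on that coupling, here is a minimal sizing sketch. It assumes the 0.16 TB of SSD per dc2.large node from the example above and counts raw capacity only; a real cluster also needs free space for temporary tables, sorting, and vacuuming, so treat the result as a floor. The function name is made up for illustration.

```python
import math

# Per-node SSD for dc2.large, taken from the example above (0.16 TB).
DC2_LARGE_TB_PER_NODE = 0.16


def dc2_nodes_for_storage(data_tb: float,
                          tb_per_node: float = DC2_LARGE_TB_PER_NODE) -> int:
    """Smallest node count whose combined local SSD can hold data_tb.

    Raw capacity only: no headroom for temp space, sorts, or vacuum.
    """
    if data_tb <= 0:
        raise ValueError("data_tb must be positive")
    # round() guards against floating-point noise in the division
    return math.ceil(round(data_tb / tb_per_node, 9))


for size_tb in (0.5, 0.64, 2.0):
    nodes = dc2_nodes_for_storage(size_tb)
    print(f"{size_tb} TB -> at least {nodes} dc2.large nodes")
```

By this arithmetic the 2 TB table needs at least 13 dc2.large nodes, so you would more than triple the compute of the 4-node cluster purely to buy disk.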
The Modern Standard: RA3 (Managed Storage)
RA3 nodes are where Redshift gets clever. Here, you pay for the compute (vCPUs, memory) separately from the storage. The local SSD on an RA3 node is just a massive, smart cache. Your durable data lives in a separate, managed storage layer backed by S3. Redshift managed storage decides automatically what to keep on the fast local SSDs (the “hot” data) and what to leave in S3 (the “cold” data), using signals such as data block temperature, data block age, and workload patterns. (Don’t confuse this with Automatic Table Optimization, a separate feature that tunes sort and distribution keys.) It’s constantly analyzing your queries and caching the right columnar blocks to make subsequent queries fly.
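# This creates a 2-node ra3.xlplus cluster.
# You get: the compute power of 2 ra3.xlplus nodes.
# You pay for that compute, PLUS a separate charge per GB
# for the managed storage in S3 that you actually use.
# Clusters are provisioned through the AWS API/CLI, not SQL.
# Replace the placeholder credentials before running.
aws redshift create-cluster \
  --cluster-identifier my-ra3-cluster \
  --node-type ra3.xlplus \
  --number-of-nodes 2 \
  --master-username awsuser \
  --master-user-password '<your-password>'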
The magic is in the scaling. Need more storage? Just… use it. You’re no longer capped by the size of your local SSDs; the ceiling becomes a per-node managed storage quota (32 TB per ra3.xlplus node at the time of writing, and more on the larger RA3 sizes), so a small cluster can hold far more data than its local disks ever could. Need more compute power to run complex queries faster? Add RA3 nodes. Your storage cost remains unchanged. This separation is a game-changer for the vast majority of use cases. You’re no longer held hostage by your storage needs.
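A toy cost model makes the decoupling concrete. This is only a sketch: the function names are invented, and the prices passed in at the bottom are made-up round numbers, not AWS pricing. Plug in the current rates for your region before drawing any conclusions.

```python
HOURS_PER_MONTH = 730  # rough monthly average, good enough for estimates


def coupled_monthly_cost(nodes: int, node_hourly_usd: float) -> float:
    """DC2-style billing: storage is bundled into the node price."""
    return nodes * node_hourly_usd * HOURS_PER_MONTH


def decoupled_monthly_cost(nodes: int, node_hourly_usd: float,
                           stored_gb: float,
                           storage_usd_per_gb_month: float) -> tuple[float, float]:
    """RA3-style billing: returns (compute_cost, storage_cost) separately."""
    compute = nodes * node_hourly_usd * HOURS_PER_MONTH
    storage = stored_gb * storage_usd_per_gb_month
    return compute, storage


# Hypothetical prices, for illustration only.
NODE_HOURLY = 1.00
STORAGE_GB_MONTH = 0.02

for nodes in (2, 4):
    compute, storage = decoupled_monthly_cost(nodes, NODE_HOURLY, 5000, STORAGE_GB_MONTH)
    print(f"{nodes} nodes: compute ${compute:,.2f}, storage ${storage:,.2f}")
```

Doubling the node count doubles the compute line and leaves the storage line alone; doubling the stored gigabytes does the reverse. In the coupled model there is only one knob, and both costs move with it.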
So, Which One Do You Pick?
The answer is almost always RA3. Seriously. Unless you have a very specific, known workload that is a perfect fit for DC2, choosing RA3 is the default sane choice. It’s more cost-effective for most scenarios and provides operational flexibility that DC2 can only dream of.
Choose DC2 only if:
- Your entire dataset is small (under ~1 TB) and fits comfortably on a few DC2 nodes with room to grow.
- Your workload lives or dies on the lowest-latency scan performance and you’ve measured that fetching uncached blocks from RA3’s managed storage is a bottleneck. (Spoiler: for most people, it isn’t.)
Choose RA3 if:
- Your data is growing unpredictably (so, always).
- You have a mix of hot and cold data (so, always).
- You want to scale compute and storage independently (so, always).
- You enjoy not setting money on fire.
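If you like your checklists executable, the rules above collapse into a few lines. This is one reading of them, not an official sizing rule: it assumes both DC2 conditions must hold at once, and the function and parameter names are invented for the sketch.

```python
def pick_node_family(dataset_tb: float,
                     growth_is_predictable: bool,
                     measured_cache_miss_bottleneck: bool) -> str:
    """Return "DC2" only when every DC2 condition from the checklist holds."""
    small_and_stable = dataset_tb < 1.0 and growth_is_predictable
    if small_and_stable and measured_cache_miss_bottleneck:
        return "DC2"
    return "RA3"  # the default for everything else


print(pick_node_family(0.4, True, True))    # small, stable, proven bottleneck
print(pick_node_family(0.4, True, False))   # small, but no measured problem
print(pick_node_family(12.0, False, False)) # large and growing
```

Note how hard it is to reach the DC2 branch: you need a small dataset, predictable growth, and a measurement in hand, not a hunch.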
The one “gotcha” with RA3 is that you need to trust its caching intelligence. If you have a truly random access pattern where you almost never query the same data twice, the cache hit rate will be poor and performance will suffer as it fetches everything from S3. But let’s be honest, that’s a wildly unusual workload. For everyone else, RA3’s caching is brilliantly effective. It’s the present and the future of Redshift, and your default button.
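You can get a feel for that gotcha with a small simulation. The sketch below uses a plain LRU cache as a stand-in; Redshift’s real tiering policy is not public in this form and is certainly smarter, so read the numbers as an illustration of the effect, not a prediction. The block counts, cache size, and 90/10 skew are arbitrary assumptions.

```python
import random
from collections import OrderedDict


def lru_hit_rate(accesses, cache_blocks):
    """Fraction of accesses served from an LRU cache of cache_blocks entries."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for block in accesses:
        total += 1
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True            # "fetch from S3" and cache it
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used
    return hits / total if total else 0.0


def uniform_accesses(rng, total_blocks, n):
    """Every block is equally likely: the truly random workload."""
    return [rng.randrange(total_blocks) for _ in range(n)]


def skewed_accesses(rng, total_blocks, hot_blocks, hot_fraction, n):
    """hot_fraction of accesses hit the first hot_blocks block ids."""
    out = []
    for _ in range(n):
        if rng.random() < hot_fraction:
            out.append(rng.randrange(hot_blocks))
        else:
            out.append(rng.randrange(hot_blocks, total_blocks))
    return out


rng = random.Random(42)
TOTAL, CACHE, N = 1000, 100, 20000  # cache holds 10% of all blocks
print("uniform:", lru_hit_rate(uniform_accesses(rng, TOTAL, N), CACHE))
print("skewed: ", lru_hit_rate(skewed_accesses(rng, TOTAL, 50, 0.9, N), CACHE))
```

With a cache that holds a tenth of the blocks, the uniform workload hits roughly a tenth of the time, while the skewed one is served from cache for the large majority of accesses. Most warehouse workloads look far more like the second case than the first.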