Data-Warehouse | mikePietsch.com

21.8 Redshift Data Sharing: Cross-Cluster and Cross-Account Queries

Right, so you’ve got your data loaded, your queries are humming along, and you’re feeling pretty good about your Redshift cluster. Then someone from the marketing team (bless their hearts) asks for direct, live access to your sales data. Your first instinct is to scream. Your second is to build a fragile pipeline of nightly extracts, which is just a different kind of scream. Enter Redshift data sharing, which is basically the database equivalent of saying, “Fine, here’s a live read-only feed, but you break it, you bought it.”

21.7 Loading Data: COPY Command from S3, Kinesis, and DMS

Right, let’s talk about getting your data into Redshift. This is where the rubber meets the road, and where many a well-intentioned data warehouse project goes to die a slow, painful death of timeouts and malformed data. I’m here to make sure that doesn’t happen to you. The COPY command is Redshift’s workhorse for bulk data ingestion. Forget INSERT for large datasets; that’s for chumps and small dimension tables. COPY is a massively parallel operation, pushing data directly to the compute nodes. It’s the difference between carrying a sofa up a flight of stairs by yourself versus having a team of movers with a pulley system. You want the team.

21.6 Redshift Serverless: Pay-Per-Query Without Cluster Management

Right, so you’re tired of babysitting a Redshift cluster. You’ve spent nights wondering if you over-provisioned for the quarterly report and under-provisioned for Black Friday, all while paying for the privilege of that anxiety. I get it. Enter Redshift Serverless: the “just leave me alone and let me run my queries” option. The promise is simple: you point your data at it, you query that data, and AWS charges you based on the amount of data scanned. No more choosing node types, no more counting cores, no more frantic scaling operations. It’s a consumption model, like your electricity bill. You don’t buy a power plant for your house; you just pay for the kilowatts you use. Redshift Serverless applies that same logic to petabyte-scale data warehousing, which is both brilliant and slightly terrifying when you think about your CFO seeing the bill after a data scientist accidentally joins a fact table to itself.

21.5 Redshift Spectrum: Querying S3 Data from Redshift

Alright, let’s talk about Redshift Spectrum. You’ve got your nice, shiny Redshift cluster humming along, full of your most precious, frequently-queryed data. But then you remember: you’ve got petabytes of ancient log files, a zillion CSV dumps from third parties, and a whole data lake sitting in S3. The thought of ETL-ing all that junk into Redshift proper makes your wallet physically ache. Enter Spectrum. This is the feature that lets your Redshift cluster, the prissy aristocrat, send its servants out to the messy, wild data lake (S3) to fetch data for it, so it doesn’t get its hands dirty. You don’t load the S3 data into Redshift; you query it directly from S3. The key thing to understand is the division of labor: your Redshift cluster is the brain that plans the query and aggregates the final results, but the grunt work of actually reading the raw data from S3 is done by a vast, invisible fleet of Amazon’s compute resources outside of your cluster. Your cluster’s size determines the brainpower for the final join and sort, not the raw S3 scanning power. This is why it can feel like magic.

21.4 Sort Keys: Compound vs Interleaved

Right, let’s talk sort keys. This isn’t some academic exercise; this is where your multi-million-row table goes from “agonizingly slow” to “blisteringly fast” or, if you get it wrong, “somehow even slower than before.” A sort key is how Redshift physically organizes your data on disk, and getting it right is the single biggest lever you can pull for performance. Think of it like the index in a massive reference book. If it’s sorted by topic, finding “quantum entanglement” is trivial. If it’s sorted by the number of times the letter ‘z’ appears on the page, you’re in for a long night.

21.3 Distribution Styles: EVEN, KEY, ALL

Alright, let’s talk about how Redshift physically arranges your data across its compute nodes. This isn’t some abstract concept; it’s the absolute bedrock of performance. Get this wrong, and you’ll be pouring money into a cluster that spends 90% of its time shuffling data around like a confused intern. We call this the distribution style. Think of your Redshift cluster as a team of workers (the nodes). You have a massive table (a list of every sale your company has ever made) and you need to split it among them. How you do that—the distribution style—determines whether these workers can operate independently or if they’re constantly on the intercom asking each other for data. There are three ways to do this: EVEN, KEY, and ALL. Your job is to pick the right one.

21.2 Node Types: RA3 with Managed Storage vs DC2

Right, let’s settle the great Redshift node debate: RA3 versus DC2. This isn’t just a choice of hardware; it’s a fundamental decision about how you want to pay for and manage your data’s most expensive real estate: its storage. Get this wrong, and you’ll be writing a very large check to AWS for a service you’re not using efficiently. Get it right, and you look like a wizard. The core distinction is beautifully simple: with DC2 nodes, you’re paying for both compute and the attached storage. It’s the old-school way. You buy the whole pizza. With RA3 nodes, you pay for the compute and then separately for the managed storage you actually use. You buy slices. This isn’t just a billing nicety; it’s an architectural revolution that dictates how you’ll scale.

21.1 Redshift Architecture: Leader Node, Compute Nodes, and Slices

Right, let’s get under the hood. You can’t effectively use Redshift—or troubleshoot its special brand of weirdness—without understanding its architecture. It’s not some magical black box; it’s a collection of machines with specific jobs, and when you know who does what, the whole system makes a lot more sense. Forget the marketing fluff; we’re here to talk about the actual metal and software. At its core, a Redshift cluster is a shared-nothing MPP (Massively Parallel Processing) database. This is a fancy way of saying it’s a team of computers working together on one problem, and no single computer shares its memory or disk with the others. They have to talk over the network. Your cluster has two types of players: the Leader Node and the Compute Nodes.