21.7 Loading Data: COPY Command from S3, Kinesis, and DMS

Right, let’s talk about getting your data into Redshift. This is where the rubber meets the road, and where many a well-intentioned data warehouse project goes to die a slow, painful death of timeouts and malformed data. I’m here to make sure that doesn’t happen to you.

The COPY command is Redshift’s workhorse for bulk data ingestion. Forget INSERT for large datasets; that’s for chumps and small dimension tables. COPY is a massively parallel operation, pushing data directly to the compute nodes. It’s the difference between carrying a sofa up a flight of stairs by yourself versus having a team of movers with a pulley system. You want the team.

The Basic S3 COPY (Your New Best Friend)

Ninety percent of your data loading will probably come from S3. It’s the classic. The syntax is straightforward, but the devil is in the details. Here’s the simplest, most vanilla way to do it:

COPY sales
FROM 's3://my-awesome-bucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
DELIMITER '|'
IGNOREHEADER 1;

Let’s break this down because it’s important:

FROM: The path to your files. A best practice is to point to a prefix (folder) and let Redshift pull all files within it. This allows for parallelization.
IAM_ROLE: This is how Redshift gets permission to read your S3 bucket. You must create an IAM role that trusts Redshift and has an S3 read policy attached. Don’t even think about using access keys and secrets; that’s legacy nonsense.
DELIMITER: Specifies the character that separates fields in your file. CSV uses a comma, but I’m a big fan of the pipe |—it’s less likely to appear in your actual data than a comma, saving you countless hours of escaping-hell.
IGNOREHEADER: Tells Redshift to skip the first n lines. Essential for CSVs with column headers.

Now, here’s where the designers made a choice I find…questionable. Redshift, by default, will happily COPY your data and then, if it encounters errors, just log them and keep going, leaving you with a partially loaded table. This is a terrible default. Always, and I mean always, use the MAXERROR option to avoid this silent failure.

COPY sales
FROM 's3://my-awesome-bucket/sales_data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
DELIMITER '|'
IGNOREHEADER 1
MAXERROR 10; -- Fails the entire load if > 10 errors are found

The Kinesis Data Firehose Pipe

Sometimes you need near-real-time ingestion. For this, you use Kinesis Data Firehose. But here’s the kicker: you don’t use a COPY command. Firehose is the one pushing data to Redshift. You set up a Firehose delivery stream that buffers your streaming data and then periodically executes a pre-configured COPY command on your behalf.

Why is this brilliant? Because it turns a continuous stream of tiny inserts (a performance nightmare) into periodic bulk COPY operations (a performance dream). It’s batching under the hood, which is exactly what a columnar data warehouse wants.

The pitfall? Latency. Your data isn’t available the millisecond it hits Kinesis. It’s available once the buffer fills up or the buffer time elapses (whichever comes first). You’re trading a little latency for a massive win in throughput.

The DMS (Data Migration Service) Illusion

AWS Database Migration Service (DMS) is marketed as the easy button for replicating data from, say, an OLTP database like PostgreSQL or MySQL into Redshift. And it can be. You set up a replication instance, define source and target endpoints, and create a task. DMS then uses COPY internally to load data.

But I need to be direct with you: DMS for ongoing replication (CDC) to Redshift can be a fiddly beast. It works by writing changes to files in S3 and then triggering a COPY. The latency is higher than Firehose, and the setup is more complex.

The real value of DMS is in one-time full-load migrations. For ongoing changes, unless you have a very high tolerance for latency and complexity, I often recommend building your own pipeline using a change-data-capture tool (like Debezium) writing to S3 and then using a scheduled COPY. You get more control, and control is what you want when things inevitably go sideways at 2 AM.

Best Practices from the Trenches

Manifest Files are Your Safety Net: If you point COPY at a folder, it loads all files it finds. What if you have a bad file? What if your process accidentally drops a half-written file in that folder? Use a manifest file. It’s a JSON file that explicitly lists every single file to be loaded. It prevents those “oops, wrong file” moments.
```
{
  "entries": [
    {"url":"s3://my-bucket/alpha.txt", "mandatory":true},
    {"url":"s3://my-bucket/beta.txt", "mandatory":true},
    {"url":"s3://my-bucket/gamma.txt", "mandatory":false}
  ]
}
```
```
COPY sales FROM 's3://my-bucket/manifest.manifest' IAM_ROLE 'arn:aws:iam...' MANIFEST;
```
VACUUM and ANALYZE are Not Optional: A COPY appends new rows, which creates new blocks on disk. This leads to unsorted, “skewed” data, which murders query performance. After a large COPY, you must run VACUUM to re-sort the table and ANALYZE to update statistics. Automate this.
Watch Your Distribution Style: COPYing data into a table distributed on a key? The incoming rows are hashed and sent to the appropriate node. This is network traffic. Loading a billion rows? That’s a lot of network traffic. It’s efficient, but it’s a cost in time. Choose your distribution keys wisely before you load terabytes of data.

The COPY command is powerful, but it demands respect. Use it wisely, wrap it in robust error handling, and never, ever assume it worked correctly without checking the STL_LOAD_ERRORS system table. Trust, but verify.