21.5 Redshift Spectrum: Querying S3 Data from Redshift

Alright, let’s talk about Redshift Spectrum. You’ve got your nice, shiny Redshift cluster humming along, full of your most precious, frequently-queryed data. But then you remember: you’ve got petabytes of ancient log files, a zillion CSV dumps from third parties, and a whole data lake sitting in S3. The thought of ETL-ing all that junk into Redshift proper makes your wallet physically ache.

Enter Spectrum. This is the feature that lets your Redshift cluster, the prissy aristocrat, send its servants out to the messy, wild data lake (S3) to fetch data for it, so it doesn’t get its hands dirty. You don’t load the S3 data into Redshift; you query it directly from S3. The key thing to understand is the division of labor: your Redshift cluster is the brain that plans the query and aggregates the final results, but the grunt work of actually reading the raw data from S3 is done by a vast, invisible fleet of Amazon’s compute resources outside of your cluster. Your cluster’s size determines the brainpower for the final join and sort, not the raw S3 scanning power. This is why it can feel like magic.

The Absolute Non-Negotiables: IAM and the External Schema

Before you write a single line of SQL, you have to sort out the permissions. This is the number one “it doesn’t work!” moment for everyone. Spectrum needs permission to read from your S3 bucket, and the only way to do that is through IAM roles. You can’t use access keys. Your Redshift cluster must assume an IAM role that has the s3:GetObject and s3:ListBucket permissions on your bucket.

Once that’s done, you don’t point your queries directly at the S3 bucket. You create an EXTERNAL SCHEMA in Redshift. This is just a metadata wrapper that tells Redshift “hey, when someone queries a table in this schema, the data is actually over in S3, and here’s the IAM role to use to get it.”

-- First, create the database in the Glue Data Catalog (or use an existing one)
CREATE EXTERNAL SCHEMA my_spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

Defining Your External Tables

This is where you make or break your performance. You’re not creating a table; you’re creating a definition of a table that describes the data living in S3. Get this wrong, and your queries will be slow and expensive.

You must specify the STORED AS file format (e.g., PARQUET, AVRO, ORC, TEXTFILE) and, crucially, the LOCATION. The location is the S3 path prefix, not a single file. Redshift Spectrum will read every file under that prefix.

CREATE EXTERNAL TABLE my_spectrum_schema.page_views (
  user_id INT,
  url VARCHAR(100),
  time TIMESTAMP
)
PARTITIONED BY (date DATE) -- This is a huge deal for performance
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetSerde'
WITH SERDEPROPERTIES ('serialization.format' = '1')
STORED AS PARQUET
LOCATION 's3://my-data-lake/page_views/'
TABLE PROPERTIES ('classification'='parquet');

Notice the PARTITIONED BY clause. This is not optional for any serious workload. It allows Spectrum to skip vast swaths of data in S3. If your query has WHERE date = '2023-10-01', Spectrum will only read the data in the s3://my-data-lake/page_views/date=2023-10-01/ path. Without partitioning, it does a full scan of every file under the prefix. You will cry. Your CFO will cry. Amazon’s shareholders will rejoice.

The Glue Data Catalog (Or Hive Metastore)

You have a choice here: let Redshift manage the table definitions itself (as in the first example) or use the AWS Glue Data Catalog. Using Glue is the modern, smarter choice. It means your table definitions are shared. You can define a table in Glue and query it from Redshift Spectrum, Athena, and EMR without duplicating the DDL. It’s the closest thing we have to a universal metastore in the AWS ecosystem.

-- This creates the table definition in the Glue Data Catalog
CREATE EXTERNAL TABLE my_spectrum_schema.page_views
...

Actually Querying the Stuff

Now for the fun part. You just… query it. Like any other table. It’s brilliantly simple.

SELECT
  url,
  COUNT(*) as view_count
FROM my_spectrum_schema.page_views
WHERE date BETWEEN '2023-10-01' AND '2023-10-07'
  AND user_id IS NOT NULL
GROUP BY url
ORDER BY view_count DESC
LIMIT 100;

Redshift’s query planner figures out the most efficient way to get the job done, pushing predicates down to the Spectrum layer to minimize the amount of data scanned. You can even join your internal, hot Redshift tables with your massive external S3 tables. This is the killer feature. Join your current user table (in Redshift) with years of historical page views (in S3) without moving a byte.

Pitfalls and “Oh, C’mon!” Moments

Cost Surprises: Remember, you pay per byte of data scanned by Spectrum. A sloppy SELECT * on a petabyte dataset could cost you thousands of dollars. Always include filters, use columnar formats like Parquet (which are columnar, so it only reads the columns you select), and PARTITION YOUR DATA.
File Size Matters: Spectrum hates tiny files. If you have a million 10KB JSON files, the overhead of opening each one will murder performance. Use your ETL process to coalesce data into larger files (aim for 100MB to 1GB+). The ideal is large, columnar, compressed files.
Schema Evolution is on You: If the schema of your data in S3 changes (e.g., a new column is added), you must update your external table definition with ALTER TABLE ... ADD COLUMN. It won’t auto-discover like some fancy-pants systems. This is the “direct” part of my voice: it’s a bit of a pain.
Data Type Mapping: Be hyper-aware of how your S3 data types map to Redshift types. Get this wrong in your DDL, and you’ll get silent NULLs or spectacular failures. A string that’s too long will be truncated. A number in a string column will become NULL. Test with small samples first.

Spectrum is arguably the most powerful feature of Redshift. It bridges the gap between the high-performance data warehouse and the infinitely scalable data lake. Use it wisely, partition everything, and for the love of all that is holy, mind your costs.