21.8 Redshift Data Sharing: Cross-Cluster and Cross-Account Queries
Right, so you’ve got your data loaded, your queries are humming along, and you’re feeling pretty good about your Redshift cluster. Then someone from the marketing team (bless their hearts) asks for direct, live access to your sales data. Your first instinct is to scream. Your second is to build a fragile pipeline of nightly extracts, which is just a different kind of scream. Enter Redshift data sharing, which is basically the database equivalent of saying, “Fine, here’s a live read-only feed, but you break it, you bought it.”
Data sharing lets you do something that feels a bit like magic: grant seamless, secure, read-only access to a specific database in your cluster (the producer) to other Redshift clusters (the consumers). The killer feature? This works whether the consumer is in the same AWS account or a completely different one. There’s no data movement, no copying, no S3 buckets acting as awkward middlemen. The consumer cluster queries the data directly from the producer’s disks, over a high-speed private network. It’s not a sync; it’s a live, managed connection.
The Core Concepts: Namespaces and Shares
Before you start slinging SQL, you need to wrap your head around the lingo. A datashare is a named object you create on the producer cluster. Think of it as a gift basket. You then add specific database objects (schemas, tables, views) to this basket. Finally, you grant access to this basket for a specific consumer cluster or an entire AWS account.
The consumer side sees this shared basket as a special kind of database, pointedly not with the same name as your source database to avoid utter confusion. This is where namespaces come in. The producer cluster has a namespace (a globally unique identifier, something like a1234567-b12c-34de-5678-f123456789ab). When you create a share, you’re essentially creating a named pointer from that namespace.
-- On your PRODUCER cluster
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.sales, public.customers;
-- Now, grant usage to a specific consumer namespace
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE 'a9876543-c21b-43dc-8765-a987654321ab';
Consuming the Shared Data
Over on the consumer cluster, the admin needs to create a local database that’s actually a reference to the remote share. Notice how you need the producer’s namespace and the exact name of the share. This is where you avoid naming collisions.
-- On your CONSUMER cluster
CREATE DATABASE marketing_db FROM DATASHARE sales_share OF NAMESPACE 'a1234567-b12c-34de-5678-f123456789ab';
Now, anyone with permissions on the marketing_db database can query marketing_db.public.sales just like any other table. The query planning happens on the consumer cluster, but the heavy lifting of scanning the data is done on the producer. You’re billed for the compute on the cluster that runs the query—the consumer. The producer just serves the bytes.
The Security Model: It’s Actually Good
This is where AWS got it right. The sharing is built on the bedrock of Redshift’s existing role-based security. The consumer cluster’s users and roles have no inherent permissions on the shared data. None. The producer admin must explicitly grant SELECT on the shared tables to a special PUBLIC role within the context of the datashare. This PUBLIC doesn’t mean “everyone on the internet”; it means “every user of the consumer database.”
-- Back on the PRODUCER cluster
GRANT SELECT ON TABLE public.sales TO PUBLIC;
The consumer admin then grants usage on the newly created marketing_db to their own users or roles. This two-layer permission system is elegant: the producer controls what can be accessed, and the consumer controls who on their side gets to access it.
The Rough Edges and “Questionable Choices”
It’s not all rainbows. The biggest limitation is that you can only share base tables and views. You cannot share SQL user-defined functions (UDFs), stored procedures, or materialized views. This is a massive pain if your data model relies on UDFs for critical transformations. You’ll have to replicate that logic on the consumer side, which defeats the purpose of a single source of truth.
Also, while the performance is generally excellent, it’s not quite the same as querying local data. Network latency between the two clusters (even in the same region) adds overhead. The planner on the consumer side doesn’t have the same deep statistics for the shared tables as it does for local ones. For complex, multi-join queries, you might find it more performant to periodically materialize the results into a local table on the consumer cluster. It’s a trade-off: real-time access vs. raw speed.
The most common pitfall? Forgetting that GRANT SELECT step on the producer. You’ll set everything up perfectly, the consumer admin will create the database, and their users will get a frustrating Permission denied error when they try to query. It’s the first thing I check when it “doesn’t work.” The second is ensuring both clusters are in the same AWS region. Because, no, you cannot share data across regions. They designed it that way to keep the network latency low and the billing simple, but it’s a real constraint you have to architect around.