31.1 Kinesis Data Streams: Shards, Records, Partition Keys, and Sequence Numbers
Right, let’s talk about Kinesis Data Streams. Think of it as Amazon’s answer to “what if we built a super-scalable, durable log, but put it on a credit card and made you pay for every single byte that moves through it?” It’s a fantastic service, but you need to understand its moving parts or you’ll either overpay, underperform, or accidentally lose data. And I refuse to let that happen to you.
At its heart, a Kinesis Data Stream is just a massively scalable, ordered sequence of data records. Your job is to put records into it (we call that putting), and your consumer’s job is to read records from it (shockingly, we call that getting). The magic, and the complexity, lies in how it achieves that massive scalability: by splitting your stream into shards.
What in the World is a Shard?
A shard is the fundamental unit of throughput and cost in Kinesis. It’s not just a partition; it’s a contract with AWS. Each shard gives you a specific, fixed capacity:
- 1 MB/second of data input.
- 2 MB/second of data output (across all consumers! More on that later).
- 1000 records per second for writes.
Think of a shard as a single, ordered conveyor belt of data. You can’t change a shard’s capacity. If you need more throughput, you add more conveyor belts—more shards. This is called resharding. You can merge shards (decrease your throughput) or split shards (increase your throughput). It’s a bit clunky to manage, but it’s how the system maintains its performance guarantees.
The number of shards you start with is the single most important decision you’ll make. Start with too few, and your producers will get throttled, throwing ProvisionedThroughputExceededExceptions at you like confetti at a very sad parade. Start with too many, and you’ll be writing a heartfelt letter to your CFO explaining the AWS bill.
Partition Keys: Directing Traffic
You don’t just toss a record onto a stream and hope it finds a home. You provide a partition key with every record. Kinesis takes this key, hashes it into an MD5 soup, and uses the resulting value to assign the record to a specific, deterministic shard.
This is crucial for ordering. Records with the same partition key are guaranteed to be written to the same shard and will be read in the exact order they were written. If you need all records for a specific user (user_id: 12345) or a specific device (device_id: abcdef) to be processed in order, you must use the same partition key for all of them.
The most common rookie mistake is using a random partition key for everything. This evenly distributes your data, which is great for throughput, but it completely destroys any hope of ordered processing for related records. The other common mistake is using too few distinct partition keys, causing a “hot shard” where one shard is overloaded while others sit idle. You need a strategy. Maybe use user_id % 100 to get a good distribution, or a composite key like {country_code}-{user_id}.
Here’s how you put a record using the AWS SDK for JavaScript. Notice the PartitionKey.
const { KinesisClient, PutRecordCommand } = require("@aws-sdk/client-kinesis");
const kinesisClient = new KinesisClient({ region: "us-east-1" });
const params = {
StreamName: "my-orders-stream",
Data: Buffer.from(JSON.stringify({
orderId: "ord_12345",
userId: "user_67890",
amount: 99.99
})), // Data must be a Buffer or Blob
PartitionKey: "user_67890" // All orders for this user go to the same shard
};
const putRecord = async () => {
try {
const data = await kinesisClient.send(new PutRecordCommand(params));
console.log("Record sequenced with:", data.SequenceNumber);
} catch (err) {
console.error("Oh no, a failure!", err);
}
};
putRecord();
Records, Sequence Numbers, and That One Weird Trick
The actual data you send is a record. It’s a blob of data up to 1 MB in size (before Base64 encoding, which adds overhead, so keep it under ~900 KB to be safe). Along with your data and the partition key, Kinesis slaps two critical pieces of metadata on it:
- Sequence Number: A unique identifier for the record within its shard. This is how consumers track their position. It’s big, it’s ugly, and you should never try to parse it. Just treat it as an opaque string. It’s your bookmark.
- ApproximateArrivalTimestamp: When the stream received the record.
Now, the “weird trick”: the SequenceNumber you get back from a PutRecord call is different from the one attached to the record when it’s read. The write operation gives you a sequence number for the put request itself. When the record is read, it has the sequence number within the shard’s log. They’re both unique identifiers, but they serve different purposes. It’s confusing by design, I’m convinced.
Reading Data: It’s All About the Iterator
Consumers don’t just say “give me data.” They have to ask for data from a specific position in a specific shard using a shard iterator. This iterator is a token that represents your current place in the shard’s log. You can get an iterator from the tip of the shard (LATEST) or from the beginning (TRIM_HORIZON), or even from a specific sequence number you’ve stored yourself, which is the basis for checkpointing.
Here’s a simplified example of a consumer reading from a shard. Real applications use the KCL because managing this yourself is a part-time job.
const { KinesisClient, GetShardIteratorCommand, GetRecordsCommand } = require("@aws-sdk/client-kinesis");
const kinesisClient = new KinesisClient({ region: "us-east-1" });
// First, get an iterator starting from the beginning of the shard
const getIteratorParams = {
StreamName: "my-orders-stream",
ShardId: "shardId-000000000000", // You need to know which shard to read from!
ShardIteratorType: "TRIM_HORIZON"
};
const readRecords = async () => {
try {
const iteratorData = await kinesisClient.send(new GetShardIteratorCommand(getIteratorParams));
const shardIterator = iteratorData.ShardIterator;
// Now use the iterator to get the actual records
const recordsParams = { ShardIterator: shardIterator };
const recordsData = await kinesisClient.send(new GetRecordsCommand(recordsParams));
recordsData.Records.forEach(record => {
const data = JSON.parse(Buffer.from(record.Data).toString());
console.log(`Seq: ${record.SequenceNumber}, Data:`, data);
});
// The next call would use recordsData.NextShardIterator to continue
} catch (err) {
console.error("Error reading records", err);
}
};
readRecords();
The absolute most critical thing to remember about reading is the 2 MB/second output limit per shard is shared across all consumers. If you have five applications reading from the same shard, they are all collectively limited to 2 MB/s. If one greedy consumer hogs it, the others get throttled. This is the number one reason people unexpectedly need to add more shards. Your consumers are scaling, but your stream can’t keep up. Plan for it.