31.4 Kinesis Data Firehose: Managed Delivery to S3, Redshift, OpenSearch, Splunk

Right, so you’ve got data streaming in, and you need to get it somewhere for storage or analysis. Kinesis Data Streams is the raw firehose; Kinesis Data Firehose is the attachment that aims it for you. Think of it as the difference between a pile of lumber and a pre-fab IKEA bookshelf. One gives you ultimate flexibility (and a lot of work), the other gets the job done quickly, albeit with some… interesting design choices.

Firehose (renamed Amazon Data Firehose in 2024) is AWS’s fully managed service for loading streaming data into destinations like S3, Amazon Redshift, OpenSearch Service, and even third-party spots like Splunk or Datadog. Its primary job is to reliably capture, transform, batch, compress, and deliver your data. The key word here is managed. You don’t provision shards, and you don’t worry about scaling. You just point your data at it and tell it where to go. It’s brilliant for use cases like loading data lakes, feeding analytics warehouses, or streaming application logs.
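To make "point your data at it" concrete, here is a minimal producer sketch. It assumes you pass in a boto3 Firehose client (boto3.client("firehose")); the function names are made up for illustration. It splits events into chunks of at most 500 records, the per-call limit of PutRecordBatch, but does not enforce the separate per-call size limit and does not retry rejected records.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(events):
    """Serialize dicts as newline-delimited JSON, one Firehose record each."""
    return [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]


def send_events(client, stream_name, events):
    """Send events in chunks; return how many records Firehose rejected.

    `client` is assumed to be a boto3 Firehose client (or a stand-in
    exposing the same put_record_batch method).
    """
    records = to_firehose_records(events)
    failed = 0
    for start in range(0, len(records), MAX_RECORDS_PER_CALL):
        chunk = records[start:start + MAX_RECORDS_PER_CALL]
        response = client.put_record_batch(
            DeliveryStreamName=stream_name, Records=chunk
        )
        # A call can succeed overall while some records are rejected.
        failed += response.get("FailedPutCount", 0)
    return failed
```

A real producer would look at the per-record entries in the response and resend the ones that failed; this sketch only counts them.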

Why You’d Use Firehose Over a Data Stream

You use a raw Data Stream when you need multiple consumers processing the same data at different rates, or when your consumers need to see records within a second or so of arrival. You use Firehose when you have one ultimate destination and you’d really rather not be the person managing the consumer application that does the batching, retrying, and uploading. It’s the “set it and forget it” option, assuming you set it up correctly. The trade-off is latency: delivery isn’t instant, because a batch only goes out once the buffer fills or the buffer interval elapses.

The Core Configuration: Buffering and Batching

This is where you’ll make your most important choices. Firehose doesn’t send every single record the moment it arrives. That would be horrifically inefficient and expensive. Instead, it buffers records.

You control this with two parameters: Buffer size (e.g., 128 MiB) and Buffer interval (e.g., 300 seconds). Firehose will deliver a batch whichever condition is met first. This is a classic throughput vs. latency trade-off.

Need lower latency? Set a small buffer size (like 1 MiB) and a short interval (60 seconds). You’ll pay for it in more frequent S3 PUT requests and a bucket full of small objects.
Want better compression and cost efficiency? Crank the buffer size to the max (128 MiB) and the interval to 900 seconds. Your data will be delivered in bigger, less frequent batches that compress better.
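The "whichever condition is met first" rule is easier to see in a toy model. The class below is an illustration only, not how Firehose is implemented: BufferSim and its fields are invented names, and it only checks the clock when a record arrives, whereas the real service flushes on its own timer.

```python
class BufferSim:
    """Toy model of size-or-time buffering (illustrative, not Firehose itself)."""

    def __init__(self, size_limit_bytes, interval_seconds):
        self.size_limit = size_limit_bytes
        self.interval = interval_seconds
        self.buffered = 0
        self.opened_at = None  # arrival time of the first record in the buffer
        self.flushes = []      # list of (time, bytes, reason)

    def _flush(self, now, reason):
        self.flushes.append((now, self.buffered, reason))
        self.buffered = 0
        self.opened_at = None

    def add(self, now, nbytes):
        # Time check first: a stale buffer is flushed before new data joins it.
        if self.opened_at is not None and now - self.opened_at >= self.interval:
            self._flush(now, "interval")
        if self.opened_at is None:
            self.opened_at = now
        self.buffered += nbytes
        # Size check second: a full buffer goes out immediately.
        if self.buffered >= self.size_limit:
            self._flush(now, "size")
```

Feed it a burst of large records and every flush is triggered by size; feed it a trickle and every flush is triggered by the interval. That is the throughput vs. latency dial in miniature.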

Here’s a quick CloudFormation snippet showing a basic S3 delivery stream. Note the BufferingHints:

# MyDataBucket and FirehoseDeliveryRole are assumed to be defined elsewhere in the template.
Resources:
  MyFirehoseToS3:
    Type: AWS::KinesisFirehose::DeliveryStream
    Properties:
      DeliveryStreamName: "my-app-logs-stream"
      DeliveryStreamType: DirectPut
      ExtendedS3DestinationConfiguration:
        BucketARN: !GetAtt MyDataBucket.Arn
        RoleARN: !GetAtt FirehoseDeliveryRole.Arn
        BufferingHints:
          SizeInMBs: 64
          IntervalInSeconds: 300
        CompressionFormat: GZIP
        Prefix: "raw-logs/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
        ErrorOutputPrefix: "error-logs/!{firehose:error-output-type}/"

The “Oh, Right, That” Pitfall: The Prefix

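See that Prefix line? This is the number one “I got burned by this” feature. Firehose uses the prefix to organize objects in your S3 bucket. If you just set it to logs/, Firehose appends its default UTC YYYY/MM/dd/HH/ path. That keeps objects apart, but it isn’t the Hive-style key=value layout that query engines (Athena, Spark) use to discover partitions, so queries tend to scan far more data than they need and the engines weep. Use the !{timestamp:...} expressions in a custom prefix to lay the data out by year, month, day, or hour in a form your query engine understands. (Don’t confuse this with Firehose’s separate dynamic partitioning feature, which partitions on values inside the records.) Once the Prefix contains expressions, you also need to set ErrorOutputPrefix, and it is equally crucial for debugging: that is where records from failed transformations end up.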
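To preview the object keys a prefix like the one in the snippet produces, you can mimic the substitution locally. This is a rough sketch, not Firehose’s own logic: expand_prefix is a made-up helper, it handles only the four pattern letters used in this section, and it assumes the timestamp is evaluated in UTC.

```python
import re
from datetime import datetime, timezone

# Only the patterns used in this section. The real service accepts a much
# wider range of Java DateTimeFormatter patterns.
_PATTERN_TO_STRFTIME = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}


def expand_prefix(prefix, when):
    """Replace each !{timestamp:<pattern>} expression with a formatted value."""
    def replace(match):
        # Raises KeyError for patterns this sketch does not know about.
        return when.strftime(_PATTERN_TO_STRFTIME[match.group(1)])
    return re.sub(r"!\{timestamp:([A-Za-z]+)\}", replace, prefix)


prefix = "raw-logs/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
print(expand_prefix(prefix, datetime(2024, 3, 7, 15, tzinfo=timezone.utc)))
# prints raw-logs/year=2024/month=03/day=07/
```

The year=, month=, and day= segments are what make the layout Hive-style, so a query engine can prune by date instead of listing the whole bucket.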

Data Transformation with Lambda

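This is Firehose’s killer feature. You can specify a Lambda function to transform your records on the fly before they’re delivered. Need to reshape JSON? Add a new field? Obfuscate PII? Your Lambda function receives a batch of records and must return every one of them with the same recordId, marked as Ok (transformed), Dropped (filtered out on purpose), or ProcessingFailed. (Converting JSON to Parquet or ORC is better left to Firehose’s built-in record format conversion; the Lambda’s job there is to get the records into clean JSON first.)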

But a word of warning: this Lambda is invoked synchronously by Firehose. If your function is slow or error-prone, it will backpressure the entire delivery stream. Keep it lean and mean. Here’s the structure your Lambda needs to adhere to:

import base64
import json

def lambda_handler(event, context):
    output = []
    
    for record in event['records']:
        # Decode the data from base64
        payload = base64.b64decode(record['data'])
        # Your transformation logic here. Let's just add a field.
        transformed_data = json.loads(payload)
        transformed_data['processed_by'] = 'firehose_lambda'
        
        # Re-encode and mark as successful
        encoded_data = base64.b64encode(json.dumps(transformed_data).encode('utf-8'))
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': encoded_data.decode('utf-8')
        }
        output.append(output_record)
    
    return {'records': output}
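The handler above marks every record Ok. Here is a hedged variant that also uses the other two result values, Dropped and ProcessingFailed. The DEBUG filter is an invented rule purely for illustration, and the trailing newline is one common way to keep records separable once Firehose concatenates them into a single S3 object.

```python
import base64
import json


def transform(record):
    """Return one output record with result Ok, Dropped, or ProcessingFailed."""
    try:
        doc = json.loads(base64.b64decode(record["data"]))
        if not isinstance(doc, dict):
            raise ValueError("expected a JSON object")
    except ValueError:
        # Unparseable input: hand the original bytes back, flagged as failed.
        return {"recordId": record["recordId"], "result": "ProcessingFailed",
                "data": record["data"]}

    if doc.get("level") == "DEBUG":  # hypothetical filter rule
        return {"recordId": record["recordId"], "result": "Dropped",
                "data": record["data"]}

    doc["processed_by"] = "firehose_lambda"
    line = json.dumps(doc) + "\n"  # newline keeps records separable in S3
    return {"recordId": record["recordId"], "result": "Ok",
            "data": base64.b64encode(line.encode("utf-8")).decode("utf-8")}


def lambda_handler(event, context):
    # Every input record must come back, with its recordId unchanged.
    return {"records": [transform(r) for r in event["records"]]}
```

Catching the parse error per record matters: an unhandled exception fails the whole invocation, not just the one bad record.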

The Redshift Delivery Quirk

Sending data directly to Redshift is fantastic, but it has a weird implementation detail that feels like a leftover from 2012. Firehose doesn’t write directly to your Redshift tables. No, that would be too straightforward. Instead, it first lands the data in an S3 bucket you provide, then issues a Redshift COPY command to load it from S3 into the final table.

This two-step process is actually robust (it leverages Redshift’s optimized bulk load), but it’s just… clunky. You must pre-create the table, and you’re on the hook for managing the S3 staging bucket. It works beautifully, but don’t be surprised when you see it. It’s a design choice you simply have to accept.
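The two-step design shows up directly in the configuration: a Redshift destination carries both a COPY command and a nested S3 staging configuration. Below is a sketch of the dictionary you would pass as RedshiftDestinationConfiguration to boto3’s create_delivery_stream. The helper name and its arguments are assumptions for illustration, and the COPY options assume gzipped JSON in the staging bucket.

```python
def redshift_destination(jdbc_url, table, role_arn, staging_bucket_arn,
                         username, password):
    """Build a RedshiftDestinationConfiguration dict (illustrative helper)."""
    return {
        "RoleARN": role_arn,
        "ClusterJDBCURL": jdbc_url,
        "Username": username,
        "Password": password,
        # Step 2: the COPY that Firehose issues against the pre-created table.
        "CopyCommand": {
            "DataTableName": table,
            "CopyOptions": "JSON 'auto' GZIP",
        },
        # Step 1: where the data lands before the COPY runs.
        "S3Configuration": {
            "RoleARN": role_arn,
            "BucketARN": staging_bucket_arn,
            "CompressionFormat": "GZIP",
            "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        },
    }
```

The compression format of the staging files and the COPY options have to agree; GZIP in one and not the other is a classic way to end up with a stream that "delivers" to S3 and loads nothing.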

Monitoring: Don’t Fly Blind

You absolutely must monitor two key CloudWatch metrics for any production Firehose stream that delivers to S3: DeliveryToS3.Success and DeliveryToS3.DataFreshness. The first tells you if deliveries are succeeding. The second is your de facto latency metric: it’s the age of the oldest record in the buffer. If this number is climbing, your stream can’t keep up with the incoming data rate, and you need to adjust your buffering or check your transformation Lambda. Ignoring these is a recipe for a 3 AM wake-up call.
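As a starting point for that check, here is a small sketch in two halves: one function builds the arguments for CloudWatch’s get_metric_statistics, the other decides whether the stream looks like it is falling behind. The function names, the 15-minute window, and the "twice the buffer interval" threshold are all assumptions to tune, not AWS recommendations.

```python
from datetime import datetime, timedelta, timezone


def freshness_query(stream_name, now, window_minutes=15):
    """Keyword arguments for a CloudWatch get_metric_statistics call."""
    return {
        "Namespace": "AWS/Firehose",
        "MetricName": "DeliveryToS3.DataFreshness",
        "Dimensions": [{"Name": "DeliveryStreamName", "Value": stream_name}],
        "StartTime": now - timedelta(minutes=window_minutes),
        "EndTime": now,
        "Period": 60,
        "Statistics": ["Maximum"],
    }


def is_falling_behind(datapoints, buffer_interval_seconds, slack=2.0):
    """True if the worst freshness exceeds slack x the buffer interval."""
    if not datapoints:
        return False  # no data is not proof of health; alarm on that separately
    worst = max(dp["Maximum"] for dp in datapoints)
    return worst > slack * buffer_interval_seconds
```

With a boto3 CloudWatch client this would be used roughly as cloudwatch.get_metric_statistics(**freshness_query("my-app-logs-stream", datetime.now(timezone.utc)))["Datapoints"], with the result passed to is_falling_behind along with your configured buffer interval. In production a CloudWatch alarm on the same metric does this job without any polling code.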
