31.5 Kinesis Data Analytics: SQL and Apache Flink on Streaming Data

Right, so you’ve got a Kinesis Data Stream humming along, dutifully shoveling data into Firehose or maybe an S3 bucket. That’s fine. It’s the data equivalent of putting everything in a big box to sort through later. But what if you need to know what’s in the box now? Not in five minutes, not after a Lambda runs, but right this second. That’s where Kinesis Data Analytics (KDA) comes in. Think of it as your SQL-speaking, caffeine-addled analyst who can look at a firehose of data and tell you the running average, the top trending items, or an emerging anomaly, all in real-time. It’s SQL (or Flink Java/Scala) on live data, and it’s shockingly powerful once you get your head around it.

The Two Flavors: SQL and Apache Flink

Let’s be clear: KDA is actually two different services hiding under one name, and the choice you make here is a big one.

First, there’s Kinesis Data Analytics for SQL Applications. This is the “quick and (somewhat) dirty” option. You write standard ANSI SQL, point it at a stream, and it gives you another stream back. It’s fantastic for straightforward aggregations, filtering, and enrichment. The magic here is that it manages the entire stateful streaming engine for you. You just write the query.

Then there’s Kinesis Data Analytics for Apache Flink Applications. This is the “I need a lightsaber, not a Swiss Army knife” option. You’re actually deploying a full Apache Flink application (in Java or Scala) onto a managed Flink cluster. This gives you immense power: complex event processing, sophisticated windowing, custom stateful functions, and co-processed streams. The trade-off? It’s vastly more complex. You’re not just writing a query; you’re writing an application. For now, let’s focus on the SQL side, because it’s the gateway drug.

The Core Concept: In-Application Streams and Pumps

This is the part everyone gets wrong at first. You don’t just write a SELECT * FROM my_stream. KDA SQL introduces two key constructs you must grok:

In-Application Streams: Think of these as internal, continuously updating tables that represent your streaming source. You create one for each source or intermediate step.
Pumps: This is the engine. A pump is essentially a SELECT statement that runs continuously, pumping its results into another in-application stream. It’s what makes the whole thing stream.

Here’s how you set up a basic job to read from a Kinesis Data Stream and pump results into another one.

-- Create a stream that represents our source. We must define the schema.
CREATE OR REPLACE STREAM "SOURCE_SQL_STREAM_001" (
    "ticker_symbol" VARCHAR(4),
    "sector" VARCHAR(16),
    "change" DOUBLE,
    "price" DOUBLE,
    "event_time" TIMESTAMP
);

-- Create a stream to hold our output. This is a best practice.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM_001" (
    "ticker_symbol" VARCHAR(4),
    "rolling_avg_price" DOUBLE,
    "event_time" TIMESTAMP
);

-- Now, the pump. This runs forever, calculating a 1-minute rolling average.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM_001"
    SELECT STREAM
        "ticker_symbol",
        AVG("price") OVER (PARTITION BY "ticker_symbol" RANGE INTERVAL '1' MINUTE PRECEDING) AS "rolling_avg_price",
        "event_time"
    FROM "SOURCE_SQL_STREAM_001";

Notice the WINDOWED clause? That’s the secret sauce. A ROW-based window looks at the last N rows, while a RANGE-based window (used here) looks back in time. This is where you define your “real-time” logic.

The Gotchas: Where They Get You

The designers made some… interesting choices. Here’s what you need to watch for:

Schema Discovery is a Lie: That cute “Schema Discovery” button in the console? It’s basically a random guess. It will sample a few records from your stream and infer a schema, which is almost guaranteed to be wrong for a real-world, schemaless data stream like JSON. You must manually define your source stream’s schema in the SQL code. Do not trust the automation here.
Timestamp Madness: Your data records have a timestamp in the payload (event_time), and Kinesis attaches its own ingestion timestamp (approximate_arrival_timestamp). Which one should you use for your windows? If you care about event-time processing (e.g., handling late-arriving data), you must use the timestamp from your payload and configure your application to use it as the ROWTIME field. Otherwise, you’re just processing data in the order it arrived at AWS, which might not be what you want.
Error Handling is… Abrupt: By default, if your SQL query encounters a malformed piece of data it can’t parse, the entire job stops. It doesn’t skip the bad record; it just gives up and fails. You must implement an EXCEPTION column in your CREATE STREAM statement to catch these errors and decide what to do with them, or your job will be incredibly fragile.

A More Realistic Example: Filtering and Enrichment

Let’s say you want to filter for only ‘TECH’ sector stocks and enrich the data with a static lookup table containing company names. This is a classic use case.

-- 1. Create our source stream (from the Kinesis stream)
CREATE OR REPLACE STREAM "SOURCE_STREAM" ( ... ); -- schema omitted for brevity

-- 2. Create a static reference table that we'll load from S3
CREATE OR REPLACE TABLE "COMPANY_REFERENCE_TABLE" (
    "ticker_symbol" VARCHAR(4),
    "company_name" VARCHAR(64)
);
-- You'd then configure the job to load this table from a pre-built file in S3.

-- 3. The pump that does the join and filter
CREATE OR REPLACE PUMP "MAIN_PUMP" AS
   INSERT INTO "OUTPUT_STREAM"
   SELECT STREAM
        s."ticker_symbol",
        r."company_name", -- enrichment from the static table
        s."price",
        s."event_time"
   FROM "SOURCE_STREAM" AS s
   JOIN "COMPANY_REFERENCE_TABLE" AS r
        ON s."ticker_symbol" = r."ticker_symbol"
   WHERE s."sector" = 'TECH'; -- filter

This demonstrates the true power of KDA SQL: joining a fast-moving stream with slow-moving or static reference data. It’s incredibly useful for things like adding user details to a stream of click events.

In the end, KDA SQL is a brilliant tool for a specific niche: continuous SQL on streaming data without the overhead of managing clusters. Just go in with your eyes open to its quirks, define your schemas manually, and for heaven’s sake, set up those exception streams. Your future self, who isn’t debugging a failed job at 2 AM, will thank you.