29.3 Streaming Responses
Right, let’s talk about streaming. You’ve probably already built a simple call to the Chat Completions API. You send a request, you wait, you get a whole response back. It works, but it feels… clunky. Like waiting for a fax machine to spit out the entire page before you can read the first sentence. We can do better.
Streaming is how you make your application feel like it’s thinking with you, rather than going silent for a few seconds and then dumping a result. It’s the difference between a monologue and a conversation. The core idea is simple: instead of waiting for the entire completion to be generated on OpenAI’s servers, we have them send us each token (roughly, a word or part of a word) the moment it’s ready. The first words can reach your user in hundreds of milliseconds instead of several seconds, a massive win for perceived performance even though the total generation time is about the same.
The Basic Mechanics: It’s Just an HTTP Stream
Under the hood, this isn’t some magical WebSocket connection. It’s an ordinary HTTP response delivered as Server-Sent Events (SSE): the server keeps the connection open and writes a series of data: lines, typically over chunked transfer encoding. When you set "stream": true in your request body, you’re telling the API, “Hey, don’t wait until the whole completion is done. Send me each piece as its own small JSON object as you go.” The connection stays open until the generation is complete.
Each chunk you receive isn’t the full message you’re used to. It’s a smaller JSON object that looks like this:
{"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"delta":{"content":"Why"},"index":0,"finish_reason":null}]}
See the key? It’s not choices[0].message; it’s choices[0].delta. The delta object holds only what is new since the previous chunk. The first chunk usually carries the role ("assistant"); after that, most chunks contain just a content field with the next piece of text. The last JSON chunk has a finish_reason (like "stop") and an empty delta, and the stream then closes with a literal data: [DONE] line.
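To make the chunk format concrete, here is a minimal sketch that reassembles a message from raw stream lines. The sample lines are hand-written to mirror the shape shown above, not captured from a real response, and assemble_message is a hypothetical helper name.

```python
import json

def assemble_message(lines):
    """Concatenate the content deltas from a list of raw 'data: ...' lines."""
    parts = []
    finish_reason = None
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separator lines and anything unexpected
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        choice = chunk["choices"][0]
        # A delta may carry a role, a piece of content, or nothing at all
        parts.append(choice["delta"].get("content") or "")
        if choice["finish_reason"] is not None:
            finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason

# Hand-written sample lines (illustrative only)
sample_lines = [
    'data: {"choices":[{"delta":{"role":"assistant","content":""},"index":0,"finish_reason":null}]}',
    '',
    'data: {"choices":[{"delta":{"content":"Why"},"index":0,"finish_reason":null}]}',
    '',
    'data: {"choices":[{"delta":{"content":" not?"},"index":0,"finish_reason":null}]}',
    '',
    'data: {"choices":[{"delta":{},"index":0,"finish_reason":"stop"}]}',
    '',
    'data: [DONE]',
]

text, reason = assemble_message(sample_lines)
print(text, reason)
```

Joining the pieces at the end, rather than concatenating strings inside the loop, keeps the helper cheap even for long responses.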
Your First Streaming Code (Python)
Talking about it is one thing. Let’s see the code. Here’s the absolute bare minimum in Python, handling the raw HTTP stream. No fancy abstractions yet.
import json
import requests
# Your API key and endpoint
api_key = "YOUR_API_KEY"
url = "https://api.openai.com/v1/chat/completions"
# Headers
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}
# The payload, with streaming enabled
data = {
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Tell me a 3-sentence story about a robot with a sense of humor."}],
    "stream": True,  # This is the magic switch
}
# Make the request, streaming the response.
# timeout=(connect, read): give up if no data arrives for 30 seconds.
response = requests.post(url, headers=headers, json=data, stream=True, timeout=(5, 30))
response.raise_for_status()  # Fail fast on 4xx/5xx instead of parsing an error body
# Iterate through each line as it arrives
for line in response.iter_lines():
    if not line:
        continue  # Blank lines separate events
    # Each line is a byte string. Decode it and check for the 'data: ' prefix
    decoded_line = line.decode("utf-8")
    if not decoded_line.startswith("data: "):
        continue
    # Strip the prefix to get the JSON string
    json_str = decoded_line[len("data: "):]
    # The stream ends with a literal [DONE] sentinel
    if json_str == "[DONE]":
        break
    try:
        chunk_data = json.loads(json_str)
    except json.JSONDecodeError:
        print(f"\nError decoding chunk: {json_str}")
        continue
    # Extract the content delta (some chunks carry no choices or no content)
    choices = chunk_data.get("choices") or []
    content = choices[0]["delta"].get("content") if choices else None
    if content:
        print(content, end="", flush=True)  # Print it as it comes
print()  # Finish with a newline
Run this. You’ll see the story unfold word by word. It feels alive. This is the foundation everything else is built on.
Why You Want an SDK (and the Gotcha There)
Now, handling those chunks manually is educational, but it’s a pain. You have to worry about decoding, the data: prefix, the [DONE] signal, and error handling. This is where the official OpenAI Python SDK (and others) earn their keep.
from openai import OpenAI
client = OpenAI()  # Reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing in 50 words."}],
    stream=True,
)
for chunk in response:
    # The SDK handles all the parsing for you. Much cleaner.
    content = chunk.choices[0].delta.content
    if content is not None:
        print(content, end="", flush=True)
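The Major Gotcha: The SDK is fantastic, but it abstracts away a critical reality: network connections can fail. If the connection drops mid-stream, the for loop can sit there waiting for a chunk that’s never coming until a timeout fires, and the SDK’s default request timeout is generous (10 minutes at the time of writing). You must implement timeouts and error handling. This is non-negotiable for production code. Pass a tighter timeout when you construct the client (for example, OpenAI(timeout=30.0)), wrap the loop in a try/except, and decide what to do with a half-finished response: retry, show the partial text, or surface an error. You might also use a library that handles retries, or implement a heartbeat timer.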
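One way to keep a dropped connection from silently losing work is to consume the stream inside a try/except and hold on to whatever arrived before the failure. The sketch below is library-agnostic: consume_stream and flaky_stream are hypothetical names, and the broad except Exception is a stand-in for the specific timeout and connection errors your HTTP client or SDK actually raises.

```python
from typing import Callable, Iterable, Tuple

def consume_stream(tokens: Iterable[str], on_token: Callable[[str], None]) -> Tuple[str, bool]:
    """Forward each token to on_token; return (text_so_far, completed)."""
    parts = []
    try:
        for token in tokens:
            parts.append(token)
            on_token(token)
    except Exception as exc:  # assumption: narrow this to your client's error types
        print(f"\n[stream interrupted: {exc}]")
        return "".join(parts), False
    return "".join(parts), True

def flaky_stream():
    """Simulated stream that fails after two tokens (for demonstration only)."""
    yield "Hello"
    yield ", "
    raise ConnectionError("connection dropped")

received = []
text, completed = consume_stream(flaky_stream(), received.append)
```

The caller gets the partial text plus a completed flag, so it can choose between showing what arrived and starting over. Keep in mind that a retry is a fresh request: the model generates a new completion from the beginning rather than resuming the old one.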
Beyond the Terminal: Streaming in a Web App
Printing to a terminal is easy. The real payoff is in a web application. The pattern is similar, but your server relays each chunk to the browser over Server-Sent Events (SSE) or a WebSocket connection. The client-side JavaScript then appends each new token to the DOM.
The beauty here is that the entire response isn’t stuck behind your server’s processing time. The first token flies from OpenAI -> Your Server -> User’s Browser almost instantly. This is why ChatGPT feels so fast.
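On the server side, the relay step can be as small as a generator that wraps each token in an SSE frame. This is a framework-agnostic sketch, not a complete endpoint: to_sse is a hypothetical helper, the "content" key is an arbitrary choice, and the closing [DONE] frame simply mirrors OpenAI’s convention so the browser knows when to stop listening.

```python
import json
from typing import Iterable, Iterator

def to_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a Server-Sent Events frame."""
    for token in tokens:
        # JSON-encode the token so a newline inside it cannot break the framing
        payload = json.dumps({"content": token})
        yield f"data: {payload}\n\n"
    # Assumption: the client treats this sentinel as end-of-stream
    yield "data: [DONE]\n\n"

frames = list(to_sse(["Hi", " there\n"]))
```

Most web frameworks can send a generator like this as a streaming response body, provided the response’s content type is text/event-stream.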
When Not to Stream (Yes, There’s a Catch)
Streaming isn’t a universal good. Don’t use it if:
- You need the entire response for post-processing: If you’re going to parse the JSON, validate a function call, or run the text through another tool, waiting for the stream to complete just to reassemble it is pointless overhead. Just get the whole thing at once.
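- You’re billing per token and can’t handle usage arriving last: Some systems need the entire usage count (total_tokens) to calculate cost before doing anything else. By default, streamed chunks carry no usage data; you get it in the non-streaming response, or in one extra final chunk if you set stream_options to {"include_usage": true} in the request.
- The response is very short: For a one-word answer, the overhead of setting up the stream might negate the benefit.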
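If token accounting is the main thing keeping you from streaming, note that the Chat Completions API can be asked to report usage on a stream via stream_options={"include_usage": True}, which appends one extra chunk with an empty choices list and a populated usage field. The sketch below works on already-decoded chunk dictionaries; the sample data is hand-written, the token counts are made up, and split_text_and_usage is a hypothetical helper.

```python
def split_text_and_usage(chunks):
    """Return (assembled_text, usage_dict_or_None) from decoded stream chunks."""
    parts = []
    usage = None
    for chunk in chunks:
        # Only the final usage chunk has a non-null "usage" field
        if chunk.get("usage"):
            usage = chunk["usage"]
        # The usage chunk has an empty "choices" list, so this loop skips it
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts), usage

# Hand-written sample chunks (illustrative values only)
sample_chunks = [
    {"choices": [{"index": 0, "delta": {"content": "Hi"}, "finish_reason": None}], "usage": None},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}], "usage": None},
    {"choices": [], "usage": {"prompt_tokens": 9, "completion_tokens": 1, "total_tokens": 10}},
]

text, usage = split_text_and_usage(sample_chunks)
```

Because the usage chunk arrives last, this only helps when cost can be computed after the response has been shown. It also means any code that indexes choices[0] on every chunk needs a guard for the empty list.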
So, the rule of thumb: stream for user-facing features where perception is key; use the standard API for backend processing and batch jobs. It’s that simple. Now go make your apps feel less like a fax machine and more like a conversation.