Why Streaming?
LLMs can take several seconds to generate a complete response. Streaming sends tokens to the user as they're generated, dramatically improving perceived performance. Users see the response appear word-by-word instead of waiting for the entire answer.
⚡ Streaming Benefits
Without streaming: User waits 5-10 seconds, then sees the full response
With streaming: First token appears in ~200ms, response builds progressively
Basic Model Streaming
import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
  modelName: "gpt-4",
  streaming: true,
});

// Stream tokens one by one
const stream = await model.stream("Explain React hooks in 3 sentences.");

for await (const chunk of stream) {
  process.stdout.write(chunk.content as string);
  // Outputs: "React" " hooks" " are" " functions" ...
}
Streaming with Chains
When streaming a chain, you can stream just the final output (shown below) or inspect intermediate steps with the chain's streamEvents() method.
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

const prompt = ChatPromptTemplate.fromTemplate(
  "Write a short tutorial about {topic}"
);
const model = new ChatOpenAI({ modelName: "gpt-4" });
const parser = new StringOutputParser();

// Build a chain
const chain = prompt.pipe(model).pipe(parser);

// Stream the chain output
const stream = await chain.stream({ topic: "Docker volumes" });

let fullResponse = "";
for await (const chunk of stream) {
  process.stdout.write(chunk);
  fullResponse += chunk;
}
console.log("\n\nFull response length:", fullResponse.length);
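The print-while-accumulating pattern above is generic, so it can be factored into a small helper. This is a sketch (drain is a hypothetical name, not a LangChain export) that works with any async iterable of strings, including the iterator returned by chain.stream():

```typescript
// Hypothetical helper: forward each streamed piece to a callback while
// also accumulating the full text, as the loop above does inline.
async function drain(
  stream: AsyncIterable<string>,
  onPiece: (piece: string) => void
): Promise<string> {
  let full = "";
  for await (const piece of stream) {
    onPiece(piece);
    full += piece;
  }
  return full;
}

// Works with any async iterable, e.g. a mock stream for testing:
async function* mockStream() {
  yield "Docker ";
  yield "volumes";
}
```

With a real chain you would call `await drain(await chain.stream({ topic }), (t) => process.stdout.write(t))`.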
Streaming with Callbacks
import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
  modelName: "gpt-4",
  streaming: true,
  callbacks: [
    {
      handleLLMNewToken(token: string) {
        // Called for each token
        process.stdout.write(token);
      },
      handleLLMEnd() {
        console.log("\n[Stream complete]");
      },
      handleLLMError(error: Error) {
        console.error("Stream error:", error);
      },
    },
  ],
});

await model.invoke("Explain WebSockets in simple terms");
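The same hooks can do more than print. Below is a sketch (makeStreamMetrics is a hypothetical helper, not part of LangChain) of a handler object that records time-to-first-token and token count; an object like this could be passed in the callbacks array above:

```typescript
// Hypothetical metrics recorder built on the same callback hook names.
function makeStreamMetrics() {
  const start = Date.now();
  let firstTokenMs: number | null = null;
  let tokenCount = 0;
  return {
    handleLLMNewToken(_token: string) {
      // Record latency to the very first token, then count the rest
      if (firstTokenMs === null) firstTokenMs = Date.now() - start;
      tokenCount++;
    },
    handleLLMEnd() {
      console.log(`first token: ${firstTokenMs}ms, tokens: ${tokenCount}`);
    },
    stats() {
      return { firstTokenMs, tokenCount };
    },
  };
}
```

This is a convenient way to verify the ~200ms time-to-first-token claim for your own deployment.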
Server-Sent Events (SSE) with Next.js
The most common pattern for streaming AI responses in web applications.
// app/api/chat/route.ts
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

export async function POST(req: Request) {
  const { message } = await req.json();

  const prompt = ChatPromptTemplate.fromMessages([
    ["system", "You are a helpful coding assistant."],
    ["human", "{input}"],
  ]);
  const model = new ChatOpenAI({ modelName: "gpt-4", streaming: true });
  const parser = new StringOutputParser();
  const chain = prompt.pipe(model).pipe(parser);

  // Create a readable stream
  const stream = await chain.stream({ input: message });

  // Convert to SSE format
  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          controller.enqueue(
            encoder.encode(`data: ${JSON.stringify({ text: chunk })}\n\n`)
          );
        }
        controller.enqueue(encoder.encode("data: [DONE]\n\n"));
        controller.close();
      } catch (err) {
        // Propagate mid-stream failures to the client instead of hanging
        controller.error(err);
      }
    },
  });

  return new Response(readable, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}
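The manual ReadableStream above can also be expressed as a TransformStream that turns text chunks into SSE frames (sseTransform is a hypothetical name; TransformStream and TextEncoder are standard Web APIs available in Next.js route handlers and Node 18+):

```typescript
// Sketch: convert a stream of text chunks into SSE-encoded bytes.
function sseTransform(): TransformStream<string, Uint8Array> {
  const encoder = new TextEncoder();
  return new TransformStream({
    transform(chunk, controller) {
      // One SSE frame per chunk, same wire format as the route above
      controller.enqueue(
        encoder.encode(`data: ${JSON.stringify({ text: chunk })}\n\n`)
      );
    },
    flush(controller) {
      // Emitted once the upstream text stream ends
      controller.enqueue(encoder.encode("data: [DONE]\n\n"));
    },
  });
}
```

If the chain's stream is a web ReadableStream (LangChain's IterableReadableStream extends it), the route body should reduce to `return new Response(stream.pipeThrough(sseTransform()), { headers })`.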
React Client for SSE
"use client";
import { useState } from "react";
export default function Chat() {
const [response, setResponse] = useState("");
const [isStreaming, setIsStreaming] = useState(false);
async function handleSubmit(message: string) {
setResponse("");
setIsStreaming(true);
const res = await fetch("/api/chat", {
method: "POST",
body: JSON.stringify({ message }),
});
const reader = res.body!.getReader();
const decoder = new TextDecoder();
while (true) {
const { done, value } = await reader.read();
if (done) break;
const text = decoder.decode(value);
const lines = text.split("\n");
for (const line of lines) {
if (line.startsWith("data: ") && line !== "data: [DONE]") {
const data = JSON.parse(line.slice(6));
setResponse((prev) => prev + data.text);
}
}
}
setIsStreaming(false);
}
return (
<div>
<button onClick={() => handleSubmit("Explain Docker")}>
Ask
</button>
<div>{response}{isStreaming && "▊"}</div>
</div>
);
}
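One subtlety in the client above: a single read() is not guaranteed to deliver whole SSE frames, so a data: line can arrive split across two network reads and break JSON.parse. A minimal fix is to buffer partial frames between reads. This is a sketch (createSSEParser is a hypothetical helper, not a library API):

```typescript
// Buffered SSE parser: accumulates partial frames across reads and only
// processes complete "data: ..." lines (frames end with a blank line).
function createSSEParser(onText: (text: string) => void) {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    const frames = buffer.split("\n\n");
    buffer = frames.pop() ?? ""; // keep the trailing partial frame for later
    for (const frame of frames) {
      for (const line of frame.split("\n")) {
        if (line.startsWith("data: ") && line !== "data: [DONE]") {
          onText(JSON.parse(line.slice(6)).text);
        }
      }
    }
  };
}
```

In the component, you would create the parser once per request and call it with each decoded chunk instead of splitting inline.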
Streaming RAG Responses
import { ChatOpenAI } from "@langchain/openai";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";

// Assumes ragPrompt and vectorStore were set up earlier
const model = new ChatOpenAI({ modelName: "gpt-4", streaming: true });

const chain = await createRetrievalChain({
  combineDocsChain: await createStuffDocumentsChain({
    llm: model,
    prompt: ragPrompt,
  }),
  retriever: vectorStore.asRetriever(),
});

// Stream the RAG chain: the answer arrives incrementally, sources included
const stream = await chain.stream({
  input: "How do I configure Docker Compose?",
});

for await (const chunk of stream) {
  if (chunk.answer) {
    process.stdout.write(chunk.answer);
  }
}
// Chunks that carry context hold the retrieved source documents:
// chunk.context = [Document, Document, ...]
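Since each streamed chunk may carry part of the answer and/or the context array, assembling the final result is a simple fold. This is a sketch; the RagChunk type is an assumption about createRetrievalChain's streamed output shape, not an official export:

```typescript
// Assumed chunk shape from a streamed retrieval chain.
type RagChunk = { answer?: string; context?: unknown[] };

// Merge streamed chunks into the final answer plus its source documents.
function accumulateRag(chunks: Iterable<RagChunk>) {
  let answer = "";
  let context: unknown[] = [];
  for (const c of chunks) {
    if (c.answer) answer += c.answer; // answer text arrives in pieces
    if (c.context) context = c.context; // context arrives as a whole array
  }
  return { answer, context };
}
```

In practice you would run the same logic inside the for await loop, displaying answer fragments immediately and rendering the sources once context arrives.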
💡 Key Takeaways
- Streaming dramatically improves perceived performance (~200ms to first token vs. a 5-10s wait)
- Use .stream() on any chain to get an async iterator of chunks
- Server-Sent Events (SSE) is the standard pattern for streaming in web apps
- RAG chains can stream the answer while providing source documents
- Always handle errors and connection drops in streaming clients
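For the last point, a common client-side pattern is retrying a dropped stream with exponential backoff. A minimal sketch (withRetry is a hypothetical helper; you would pass your fetch-and-stream logic as fn):

```typescript
// Retry an async operation with exponential backoff on failure.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: 500ms, 1000ms, 2000ms, ...
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
  throw lastError;
}
```

Note that retrying restarts the stream from the beginning; if partial output was already rendered, clear it before retrying (as the client's setResponse("") does).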