TechLead
Lesson 12 of 18
5 min read
LangChain

LangChain Streaming

Stream LLM responses in real time for better UX. Learn token streaming, chain streaming, and server-sent events.

Why Streaming?

LLMs can take several seconds to generate a complete response. Streaming sends tokens to the user as they're generated, dramatically improving perceived performance. Users see the response appear word-by-word instead of waiting for the entire answer.

⚡ Streaming Benefits

Without streaming: User waits 5-10 seconds, then sees the full response
With streaming: First token appears in ~200ms, response builds progressively

Basic Model Streaming

import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
  modelName: "gpt-4",
  streaming: true,
});

// Stream tokens one by one
const stream = await model.stream("Explain React hooks in 3 sentences.");

for await (const chunk of stream) {
  process.stdout.write(chunk.content as string);
  // Outputs: "React" " hooks" " are" " functions" ...
}
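The `for await` loop works with any async iterable, which makes it easy to exercise streaming consumers without an API key. A minimal sketch with a hypothetical `fakeStream` generator standing in for `model.stream()` (it yields chunks with the same `{ content }` shape, so the consuming loop is unchanged):

```typescript
// Hypothetical stand-in for model.stream() — yields { content } chunks
// so downstream code can be tested without calling the API.
async function* fakeStream(text: string): AsyncGenerator<{ content: string }> {
  for (const word of text.split(" ")) {
    // Simulate network latency between tokens
    await new Promise((r) => setTimeout(r, 5));
    yield { content: word + " " };
  }
}

async function consume(): Promise<string> {
  let out = "";
  for await (const chunk of fakeStream("React hooks are functions")) {
    out += chunk.content; // same loop body you'd use with the real stream
  }
  return out.trimEnd();
}
```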

Streaming with Chains

When streaming a chain, you can consume the final output as it is produced, or tap into the intermediate steps.

import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

const prompt = ChatPromptTemplate.fromTemplate(
  "Write a short tutorial about {topic}"
);
const model = new ChatOpenAI({ modelName: "gpt-4" });
const parser = new StringOutputParser();

// Build a chain
const chain = prompt.pipe(model).pipe(parser);

// Stream the chain output
const stream = await chain.stream({ topic: "Docker volumes" });

let fullResponse = "";
for await (const chunk of stream) {
  process.stdout.write(chunk);
  fullResponse += chunk;
}
console.log("\n\nFull response length:", fullResponse.length);
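To observe intermediate steps rather than only the final string, LangChain exposes `chain.streamEvents(input, { version: "v2" })`, which yields a typed event per step of the chain. A sketch of the filtering logic, using a simulated event stream in place of a real chain (event shapes are simplified here; real events carry more fields):

```typescript
type StreamEvent = { event: string; data?: { chunk?: { content: string } } };

// Simulated events — a real chain.streamEvents() call produces these,
// interleaved with prompt/parser lifecycle events.
async function* fakeEvents(): AsyncGenerator<StreamEvent> {
  yield { event: "on_prompt_end" };
  yield { event: "on_chat_model_stream", data: { chunk: { content: "Docker " } } };
  yield { event: "on_chat_model_stream", data: { chunk: { content: "volumes" } } };
  yield { event: "on_chain_end" };
}

async function collectModelTokens(events: AsyncIterable<StreamEvent>): Promise<string> {
  let text = "";
  for await (const ev of events) {
    // Only model token events contribute to the visible output
    if (ev.event === "on_chat_model_stream" && ev.data?.chunk) {
      text += ev.data.chunk.content;
    }
  }
  return text;
}
```

The same filter works on a real chain: swap `fakeEvents()` for `chain.streamEvents({ topic: "Docker volumes" }, { version: "v2" })`.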

Streaming with Callbacks

import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
  modelName: "gpt-4",
  streaming: true,
  callbacks: [
    {
      handleLLMNewToken(token: string) {
        // Called for each token
        process.stdout.write(token);
      },
      handleLLMEnd() {
        console.log("\n[Stream complete]");
      },
      handleLLMError(error: Error) {
        console.error("Stream error:", error);
      },
    },
  ],
});

await model.invoke("Explain WebSockets in simple terms");

Server-Sent Events (SSE) with Next.js

The most common pattern for streaming AI responses in web applications.

// app/api/chat/route.ts
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

export async function POST(req: Request) {
  const { message } = await req.json();

  const prompt = ChatPromptTemplate.fromMessages([
    ["system", "You are a helpful coding assistant."],
    ["human", "{input}"],
  ]);

  const model = new ChatOpenAI({ modelName: "gpt-4", streaming: true });
  const parser = new StringOutputParser();
  const chain = prompt.pipe(model).pipe(parser);

  // Create a readable stream
  const stream = await chain.stream({ input: message });

  // Convert to SSE format
  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ text: chunk })}\n\n`)
        );
      }
      controller.enqueue(encoder.encode("data: [DONE]\n\n"));
      controller.close();
    },
  });

  return new Response(readable, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    },
  });
}
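One detail worth isolating is the SSE wire format itself: each event is a `data:` line terminated by a blank line, and a malformed frame silently breaks clients. A small helper (hypothetical name `sseFrame`) keeps the framing consistent, and a try/catch around the loop surfaces mid-stream failures to the client instead of dropping the connection — a sketch of one way to structure the route, not the only one:

```typescript
// Encode one SSE frame: a "data:" line terminated by a blank line.
function sseFrame(payload: unknown): string {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Inside the route's ReadableStream, the streaming loop could become:
//
//   try {
//     for await (const chunk of stream) {
//       controller.enqueue(encoder.encode(sseFrame({ text: chunk })));
//     }
//   } catch (err) {
//     // Send the failure as an explicit event the client can render
//     controller.enqueue(encoder.encode(sseFrame({ error: String(err) })));
//   } finally {
//     controller.enqueue(encoder.encode("data: [DONE]\n\n"));
//     controller.close();
//   }
```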

React Client for SSE

"use client";
import { useState } from "react";

export default function Chat() {
  const [response, setResponse] = useState("");
  const [isStreaming, setIsStreaming] = useState(false);

  async function handleSubmit(message: string) {
    setResponse("");
    setIsStreaming(true);

    const res = await fetch("/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message }),
    });

    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    let buffer = "";

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // stream: true avoids splitting multi-byte characters across reads
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n");
      // Keep a possibly incomplete trailing line for the next read
      buffer = lines.pop() ?? "";

      for (const line of lines) {
        if (line.startsWith("data: ") && line !== "data: [DONE]") {
          const data = JSON.parse(line.slice(6));
          setResponse((prev) => prev + data.text);
        }
      }
    }

    setIsStreaming(false);
  }

  return (
    <div>
      <button onClick={() => handleSubmit("Explain Docker")}>
        Ask
      </button>
      <div>{response}{isStreaming && "▊"}</div>
    </div>
  );
}
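Because `reader.read()` chunks are not guaranteed to align with SSE frame boundaries — a `data:` line can be split across two reads — the parsing is worth factoring into a small buffering helper (hypothetical name `parseSSE`) that is easy to unit test on its own:

```typescript
// Split an incoming text chunk into complete SSE data payloads,
// carrying any partial trailing line over to the next call.
function parseSSE(buffer: string, chunk: string): { events: string[]; rest: string } {
  const lines = (buffer + chunk).split("\n");
  // The last element may be an incomplete line — keep it for next time.
  const rest = lines.pop() ?? "";
  const events = lines
    .filter((l) => l.startsWith("data: ") && l !== "data: [DONE]")
    .map((l) => l.slice(6));
  return { events, rest };
}
```

In the client loop, keep `rest` between reads and `JSON.parse` each returned payload.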

Streaming RAG Responses

import { ChatOpenAI } from "@langchain/openai";
import { createRetrievalChain } from "langchain/chains/retrieval";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";

const model = new ChatOpenAI({ modelName: "gpt-4", streaming: true });

// Assumes ragPrompt (a prompt with {context} and {input}) and
// vectorStore are defined elsewhere
const chain = await createRetrievalChain({
  combineDocsChain: await createStuffDocumentsChain({
    llm: model,
    prompt: ragPrompt,
  }),
  retriever: vectorStore.asRetriever(),
});

// Stream RAG chain — includes sources!
const stream = await chain.stream({
  input: "How do I configure Docker Compose?",
});

for await (const chunk of stream) {
  if (chunk.answer) {
    process.stdout.write(chunk.answer);
  }
}
// One of the streamed chunks carries the retrieved source documents:
// chunk.context = [Document, Document, ...]
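The answer tokens and the source documents arrive in different chunks, so a client typically accumulates both as the stream progresses. A sketch with a hypothetical `fakeRagStream` standing in for `chain.stream()` (chunk shapes simplified):

```typescript
type RagChunk = { answer?: string; context?: { pageContent: string }[] };

// Simulated retrieval-chain stream: sources arrive in one chunk,
// answer tokens in the others.
async function* fakeRagStream(): AsyncGenerator<RagChunk> {
  yield { context: [{ pageContent: "Compose config lives in docker-compose.yml" }] };
  yield { answer: "Define volumes " };
  yield { answer: "under the volumes key." };
}

async function collectRag(stream: AsyncIterable<RagChunk>) {
  let answer = "";
  let sources: { pageContent: string }[] = [];
  for await (const chunk of stream) {
    if (chunk.answer) answer += chunk.answer;   // token pieces
    if (chunk.context) sources = chunk.context; // retrieved documents
  }
  return { answer, sources };
}
```

With a real chain, pass the result of `chain.stream({ input: ... })` to the same accumulator.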

💡 Key Takeaways

  • Streaming dramatically improves perceived performance (200ms vs 5-10s)
  • Use .stream() on any chain to get an async iterator of chunks
  • Server-Sent Events (SSE) is the standard pattern for web streaming
  • RAG chains can stream the answer while providing source documents
  • Always handle errors and connection drops in streaming clients
