TechLead
Lesson 15 of 18
LangChain

Multi-Modal AI (Images + Text)

Build applications that process images, describe visuals, and combine vision with text using LangChain's multi-modal support

What Is Multi-Modal AI?

Multi-modal AI models can process multiple types of input — text, images, audio, and video. With LangChain, you can build applications that analyze images, generate descriptions, extract data from screenshots, and combine visual understanding with text reasoning.

👁️ Multi-Modal Use Cases

  • Image Analysis: Describe, classify, or extract data from images
  • Screenshot to Code: Convert UI screenshots to React components
  • Document OCR: Read text from images of documents
  • Visual Q&A: Answer questions about images
  • Content Moderation: Check images for policy compliance

Sending Images to Chat Models

import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";

const model = new ChatOpenAI({ modelName: "gpt-4o" }); // Vision-capable model

// Send image via URL
const response = await model.invoke([
  new HumanMessage({
    content: [
      { type: "text", text: "What's in this image? Describe it in detail." },
      {
        type: "image_url",
        image_url: {
          url: "https://example.com/photo.jpg",
          detail: "high", // "low", "high", or "auto"
        },
      },
    ],
  }),
]);

console.log(response.content);
// "The image shows a modern web application dashboard with..."

Using Base64 Images

Send local images or uploaded files as base64-encoded strings.

import { readFileSync } from "fs";

// Read local image and convert to base64
const imageBuffer = readFileSync("./screenshot.png");
const base64Image = imageBuffer.toString("base64");

const response = await model.invoke([
  new HumanMessage({
    content: [
      { type: "text", text: "Describe this UI design. What framework does it look like?" },
      {
        type: "image_url",
        image_url: {
          url: `data:image/png;base64,${base64Image}`,
        },
      },
    ],
  }),
]);

Comparing Multiple Images

const response = await model.invoke([
  new HumanMessage({
    content: [
      { type: "text", text: "Compare these two UI designs. Which one has better UX and why?" },
      {
        type: "image_url",
        image_url: { url: "https://example.com/design-a.png" },
      },
      {
        type: "image_url",
        image_url: { url: "https://example.com/design-b.png" },
      },
    ],
  }),
]);

console.log(response.content);
// Detailed comparison of both designs

Screenshot to Code

One of the most powerful use cases — converting UI screenshots into working code.

import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";

const model = new ChatOpenAI({ modelName: "gpt-4o" });

const response = await model.invoke([
// screenshotBase64: a base64-encoded PNG, produced as shown in the base64 section above
  new HumanMessage({
    content: [
      {
        type: "text",
        text: `Convert this UI screenshot to a React component using Tailwind CSS.
               Use TypeScript and functional components with proper types.
               Include responsive design for mobile.`,
      },
      {
        type: "image_url",
        image_url: {
          url: `data:image/png;base64,${screenshotBase64}`,
          detail: "high",
        },
      },
    ],
  }),
]);

// Returns complete React + Tailwind component code!
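In practice the reply usually arrives as markdown with the component wrapped in a fenced code block, so you'll want to pull the code out before writing it to a file. A minimal sketch, assuming the model uses standard triple-backtick fences (`extractCodeBlock` is our own helper, not a LangChain API):

```typescript
// Extract the first fenced code block from a markdown reply.
// Returns null if the reply contains no fence.
export function extractCodeBlock(markdown: string): string | null {
  const match = markdown.match(/```(?:\w+)?\n([\s\S]*?)```/);
  return match ? match[1].trimEnd() : null;
}
```

For example, `extractCodeBlock(String(response.content))` would yield just the component source, ready to save as a `.tsx` file.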

Extracting Structured Data from Images

import { z } from "zod";

const ReceiptSchema = z.object({
  store: z.string(),
  date: z.string(),
  items: z.array(z.object({
    name: z.string(),
    quantity: z.number(),
    price: z.number(),
  })),
  total: z.number(),
  tax: z.number(),
});

const model = new ChatOpenAI({ modelName: "gpt-4o" })
  .withStructuredOutput(ReceiptSchema);

const receipt = await model.invoke([
// receiptBase64: a base64-encoded JPEG of the receipt, prepared as in the base64 section
  new HumanMessage({
    content: [
      { type: "text", text: "Extract all data from this receipt image." },
      {
        type: "image_url",
        image_url: { url: `data:image/jpeg;base64,${receiptBase64}` },
      },
    ],
  }),
]);

console.log(receipt);
// { store: "Walmart", date: "2026-02-08", items: [...], total: 47.93, tax: 3.84 }
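Vision OCR can misread digits, so it's worth sanity-checking the extracted numbers before trusting them. One simple check (our own suggestion, with hypothetical types mirroring the Zod schema above): verify the line items plus tax roughly add up to the total.

```typescript
type ReceiptItem = { name: string; quantity: number; price: number };
type Receipt = { store: string; date: string; items: ReceiptItem[]; total: number; tax: number };

// Sanity check: sum of line items plus tax should match the total,
// within a small tolerance for rounding or OCR noise
export function receiptAddsUp(r: Receipt, tolerance = 0.05): boolean {
  const itemSum = r.items.reduce((sum, i) => sum + i.quantity * i.price, 0);
  return Math.abs(itemSum + r.tax - r.total) <= tolerance;
}
```

If the check fails, you might retry with `detail: "high"` or ask the model to re-read the totals.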

Next.js API Route with Image Upload

// app/api/analyze-image/route.ts
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";

export async function POST(req: Request) {
  const formData = await req.formData();
  const file = formData.get("image") as File | null;
  const question = formData.get("question") as string | null;

  if (!file) {
    return Response.json({ error: "No image provided" }, { status: 400 });
  }

  // Convert to base64
  const bytes = await file.arrayBuffer();
  const base64 = Buffer.from(bytes).toString("base64");
  const mimeType = file.type;

  const model = new ChatOpenAI({ modelName: "gpt-4o" });

  const response = await model.invoke([
    new HumanMessage({
      content: [
        { type: "text", text: question || "Describe this image" },
        {
          type: "image_url",
          image_url: {
            url: `data:${mimeType};base64,${base64}`,
          },
        },
      ],
    }),
  ]);

  return Response.json({ analysis: response.content });
}

💡 Key Takeaways

  • Use vision-capable models like GPT-4o or Claude 3 for image understanding
  • Images can be sent as URLs or base64-encoded strings
  • Combine withStructuredOutput() with images to extract typed data
  • Screenshot-to-code is a powerful practical application
  • Use detail: "high" for images that need precise analysis

Continue Learning