What Is Multi-Modal AI?
Multi-modal AI models can process multiple types of input — text, images, audio, and video. With LangChain, you can build applications that analyze images, generate descriptions, extract data from screenshots, and combine visual understanding with text reasoning.
👁️ Multi-Modal Use Cases
- Image Analysis: Describe, classify, or extract data from images
- Screenshot to Code: Convert UI screenshots to React components
- Document OCR: Read text from images of documents
- Visual Q&A: Answer questions about images
- Content Moderation: Check images for policy compliance
Sending Images to Chat Models
```typescript
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";

const model = new ChatOpenAI({ modelName: "gpt-4o" }); // Vision-capable model

// Send image via URL
const response = await model.invoke([
  new HumanMessage({
    content: [
      { type: "text", text: "What's in this image? Describe it in detail." },
      {
        type: "image_url",
        image_url: {
          url: "https://example.com/photo.jpg",
          detail: "high", // "low", "high", or "auto"
        },
      },
    ],
  }),
]);

console.log(response.content);
// "The image shows a modern web application dashboard with..."
```
Using Base64 Images
Send local images or uploaded files as base64-encoded strings.
```typescript
import { readFileSync } from "fs";

// Read local image and convert to base64
const imageBuffer = readFileSync("./screenshot.png");
const base64Image = imageBuffer.toString("base64");

const response = await model.invoke([
  new HumanMessage({
    content: [
      { type: "text", text: "Describe this UI design. What framework does it look like?" },
      {
        type: "image_url",
        image_url: {
          url: `data:image/png;base64,${base64Image}`,
        },
      },
    ],
  }),
]);
```
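If you send local files often, a small helper that infers the MIME type from the file extension keeps call sites tidy. This is a sketch of our own (the `toDataUrl` name and the extension map are not part of LangChain):

```typescript
import { readFileSync } from "fs";
import { extname } from "path";

// Map common image extensions to MIME types
const MIME_TYPES: Record<string, string> = {
  ".png": "image/png",
  ".jpg": "image/jpeg",
  ".jpeg": "image/jpeg",
  ".gif": "image/gif",
  ".webp": "image/webp",
};

// Read a local image and return it as a data URL ready for an image_url block
export function toDataUrl(path: string): string {
  const mime = MIME_TYPES[extname(path).toLowerCase()];
  if (!mime) throw new Error(`Unsupported image type: ${path}`);
  const base64 = readFileSync(path).toString("base64");
  return `data:${mime};base64,${base64}`;
}
```

The call site then shrinks to `image_url: { url: toDataUrl("./screenshot.png") }`.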
Comparing Multiple Images
```typescript
const response = await model.invoke([
  new HumanMessage({
    content: [
      { type: "text", text: "Compare these two UI designs. Which one has better UX and why?" },
      {
        type: "image_url",
        image_url: { url: "https://example.com/design-a.png" },
      },
      {
        type: "image_url",
        image_url: { url: "https://example.com/design-b.png" },
      },
    ],
  }),
]);

console.log(response.content);
// Detailed comparison of both designs
```
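When the number of images varies at runtime, you can build the content array programmatically. A minimal sketch (the helper name and types are ours, not LangChain's):

```typescript
type TextBlock = { type: "text"; text: string };
type ImageBlock = { type: "image_url"; image_url: { url: string } };

// Build the multi-modal content array from a prompt and any number of image URLs
export function imageComparisonContent(
  prompt: string,
  urls: string[]
): (TextBlock | ImageBlock)[] {
  return [
    { type: "text", text: prompt },
    ...urls.map((url): ImageBlock => ({ type: "image_url", image_url: { url } })),
  ];
}
```

Pass the result straight to the message: `new HumanMessage({ content: imageComparisonContent(prompt, urls) })`.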
Screenshot to Code
One of the most powerful use cases — converting UI screenshots into working code.
```typescript
import { readFileSync } from "fs";
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";

const model = new ChatOpenAI({ modelName: "gpt-4o" });

// Load the screenshot and encode it as base64
const screenshotBase64 = readFileSync("./screenshot.png").toString("base64");

const response = await model.invoke([
  new HumanMessage({
    content: [
      {
        type: "text",
        text: `Convert this UI screenshot to a React component using Tailwind CSS.
Use TypeScript and functional components with proper types.
Include responsive design for mobile.`,
      },
      {
        type: "image_url",
        image_url: {
          url: `data:image/png;base64,${screenshotBase64}`,
          detail: "high",
        },
      },
    ],
  }),
]);

// response.content holds the complete React + Tailwind component code
```
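Models typically wrap generated code in a markdown fence, so you usually want to strip that before writing the result to a file. A small hypothetical helper (not part of LangChain):

```typescript
// A triple-backtick string, built with repeat() to avoid literal fences in source
const FENCE = "`".repeat(3);

// Matches the first triple-backtick fenced block, with an optional language tag
const BLOCK_RE = new RegExp(`${FENCE}[\\w-]*\\n([\\s\\S]*?)${FENCE}`);

// Pull the code out of a markdown response; fall back to the raw text
export function extractCodeBlock(markdown: string): string {
  const match = markdown.match(BLOCK_RE);
  return match ? match[1].trimEnd() : markdown.trim();
}
```

With this, `extractCodeBlock(String(response.content))` yields just the component source.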
Extracting Structured Data from Images
```typescript
import { readFileSync } from "fs";
import { z } from "zod";
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";

const ReceiptSchema = z.object({
  store: z.string(),
  date: z.string(),
  items: z.array(
    z.object({
      name: z.string(),
      quantity: z.number(),
      price: z.number(),
    })
  ),
  total: z.number(),
  tax: z.number(),
});

const model = new ChatOpenAI({ modelName: "gpt-4o" }).withStructuredOutput(ReceiptSchema);

// Load the receipt image and encode it as base64
const receiptBase64 = readFileSync("./receipt.jpg").toString("base64");

const receipt = await model.invoke([
  new HumanMessage({
    content: [
      { type: "text", text: "Extract all data from this receipt image." },
      {
        type: "image_url",
        image_url: { url: `data:image/jpeg;base64,${receiptBase64}` },
      },
    ],
  }),
]);

console.log(receipt);
// { store: "Walmart", date: "2026-02-08", items: [...], total: 47.93, tax: 3.84 }
```
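Vision models occasionally misread digits, so it can be worth cross-checking the extracted numbers before trusting them. A minimal sketch, assuming the receipt's total includes tax (adjust for how your receipts report it); the helper name and tolerance are our own:

```typescript
type ReceiptItem = { name: string; quantity: number; price: number };

// Verify that line items plus tax sum to the stated total, within a cent
export function totalsMatch(items: ReceiptItem[], total: number, tax: number): boolean {
  const itemSum = items.reduce((sum, item) => sum + item.quantity * item.price, 0);
  return Math.abs(itemSum + tax - total) < 0.01;
}
```

If the check fails, you can re-prompt the model or flag the receipt for manual review.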
Next.js API Route with Image Upload
```typescript
// app/api/analyze-image/route.ts
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage } from "@langchain/core/messages";

export async function POST(req: Request) {
  const formData = await req.formData();
  const file = formData.get("image") as File | null;
  const question = formData.get("question") as string | null;

  if (!file) {
    return Response.json({ error: "No image provided" }, { status: 400 });
  }

  // Convert the upload to base64
  const bytes = await file.arrayBuffer();
  const base64 = Buffer.from(bytes).toString("base64");
  const mimeType = file.type;

  const model = new ChatOpenAI({ modelName: "gpt-4o" });

  const response = await model.invoke([
    new HumanMessage({
      content: [
        { type: "text", text: question || "Describe this image" },
        {
          type: "image_url",
          image_url: {
            url: `data:${mimeType};base64,${base64}`,
          },
        },
      ],
    }),
  ]);

  return Response.json({ analysis: response.content });
}
```
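On the client, the route is called with a multipart `FormData` body. A sketch, assuming the route above is deployed at `/api/analyze-image` (the helper names are ours):

```typescript
// Build the multipart body the route expects
export function buildAnalyzeForm(image: Blob, question: string): FormData {
  const form = new FormData();
  form.append("image", image);
  form.append("question", question);
  return form;
}

// POST the image and question to the API route and return the analysis text
export async function analyzeImage(image: Blob, question: string): Promise<string> {
  const res = await fetch("/api/analyze-image", {
    method: "POST",
    body: buildAnalyzeForm(image, question),
  });
  if (!res.ok) throw new Error(`Analyze failed: ${res.status}`);
  const { analysis } = await res.json();
  return analysis;
}
```

In a browser you would pass a `File` from an `<input type="file">` directly, since `File` is a `Blob`.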
💡 Key Takeaways
- Use vision-capable models like GPT-4o or Claude 3 for image understanding
- Images can be sent as URLs or base64-encoded strings
- Combine `withStructuredOutput()` with images to extract typed data
- Screenshot-to-code is a powerful practical application
- Use `detail: "high"` for images that need precise analysis