Why Evaluate LLM Applications?
Unlike traditional software, where tests check exact outputs, LLM applications produce non-deterministic responses. You need specialized evaluation strategies to ensure your AI app is accurate, helpful, and safe before it reaches production.
⚠️ What Can Go Wrong
- Hallucination: The model makes up facts that sound convincing
- Relevance Drift: Answers don't match the question asked
- Context Ignored: RAG retrieves documents but the model ignores them
- Inconsistency: Same question gets wildly different answers
- Safety: Model produces harmful or biased content
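Inconsistency in particular is easy to probe without any judge model: send the same question several times and measure how much the answers overlap. A minimal sketch using token-level Jaccard similarity (the `jaccardSimilarity` and `consistencyScore` helpers are our own illustration, not a library API):

```typescript
// Token-level Jaccard similarity between two answers (0 = disjoint, 1 = identical).
function jaccardSimilarity(a: string, b: string): number {
  const tokens = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  const intersection = [...setA].filter((t) => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : intersection / union;
}

// Average pairwise similarity across repeated runs of the same prompt.
function consistencyScore(answers: string[]): number {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < answers.length; i++) {
    for (let j = i + 1; j < answers.length; j++) {
      total += jaccardSimilarity(answers[i], answers[j]);
      pairs++;
    }
  }
  return pairs === 0 ? 1 : total / pairs;
}
```

A low score suggests the model answers the same question very differently across runs; the exact threshold is something to tune for your domain.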
Manual Evaluation with Test Cases
```ts
interface TestCase {
  input: string;
  expectedTopics: string[]; // Topics the answer should cover
  shouldNotContain: string[]; // Things to avoid
}

const testCases: TestCase[] = [
  {
    input: "What is React?",
    expectedTopics: ["UI library", "components", "JavaScript"],
    shouldNotContain: ["Angular", "Vue"],
  },
  {
    input: "Explain Docker volumes",
    expectedTopics: ["persistent storage", "containers", "mount"],
    shouldNotContain: ["virtual machines"],
  },
];

async function runEvaluation(chain: any) {
  const results = [];
  for (const testCase of testCases) {
    const response = await chain.invoke({ input: testCase.input });
    const answer = response.answer || response.content;
    const lowerAnswer = answer.toLowerCase();
    const topicsCovered = testCase.expectedTopics.filter(
      (topic) => lowerAnswer.includes(topic.toLowerCase())
    );
    const unwantedFound = testCase.shouldNotContain.filter(
      (term) => lowerAnswer.includes(term.toLowerCase())
    );
    results.push({
      input: testCase.input,
      passed:
        topicsCovered.length === testCase.expectedTopics.length &&
        unwantedFound.length === 0,
      coverage: `${topicsCovered.length}/${testCase.expectedTopics.length}`,
      unwantedTerms: unwantedFound,
    });
  }
  return results;
}

const results = await runEvaluation(myChain);
console.table(results);
```
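For CI gating it helps to collapse the per-case results into a single pass rate. A small helper, assuming the result shape produced by `runEvaluation` above (the `summarize` name is ours):

```typescript
interface EvalResult {
  input: string;
  passed: boolean;
  coverage: string;
  unwantedTerms: string[];
}

// Roll per-case results into a pass rate plus the list of failing inputs.
function summarize(results: EvalResult[]): { passRate: number; failures: string[] } {
  const failures = results.filter((r) => !r.passed).map((r) => r.input);
  const passRate =
    results.length === 0 ? 1 : (results.length - failures.length) / results.length;
  return { passRate, failures };
}
```

You could then fail the build whenever `passRate` drops below a threshold you choose.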
LLM-as-Judge Evaluation
Use a powerful LLM to evaluate the quality of another LLM's responses.
```ts
import { ChatOpenAI } from "@langchain/openai";
import { z } from "zod";

const EvalSchema = z.object({
  relevance: z.number().min(1).max(5)
    .describe("How relevant is the answer to the question"),
  accuracy: z.number().min(1).max(5)
    .describe("How factually accurate is the answer"),
  completeness: z.number().min(1).max(5)
    .describe("How thorough is the answer"),
  reasoning: z.string().describe("Brief explanation of the scores"),
});

const judge = new ChatOpenAI({ modelName: "gpt-4", temperature: 0 })
  .withStructuredOutput(EvalSchema);

async function evaluateResponse(question: string, answer: string, context?: string) {
  const prompt = `
You are an expert evaluator. Score this AI response on a 1-5 scale.

Question: ${question}
${context ? `Context provided: ${context}` : ""}
AI Response: ${answer}

Score each dimension 1-5:
- Relevance: Does the answer address the question?
- Accuracy: Are the facts correct?
- Completeness: Is the answer thorough?
`;
  return await judge.invoke(prompt);
}

// Evaluate your chain's responses
const question = "How do Docker volumes work?";
const response = await myChain.invoke({ input: question });
const answer = response.answer || response.content; // shape depends on your chain
const evaluation = await evaluateResponse(question, answer);
console.log(evaluation);
// { relevance: 5, accuracy: 4, completeness: 4, reasoning: "..." }
```
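Single-example judge scores are noisy; in practice you would run the judge over a dataset and look at averages. The aggregation itself is plain arithmetic over the schema's numeric fields (the `averageScores` helper name is ours):

```typescript
interface JudgeScores {
  relevance: number;
  accuracy: number;
  completeness: number;
}

// Mean of each judge dimension across a batch of evaluations.
function averageScores(evals: JudgeScores[]): JudgeScores {
  const sum = evals.reduce(
    (acc, e) => ({
      relevance: acc.relevance + e.relevance,
      accuracy: acc.accuracy + e.accuracy,
      completeness: acc.completeness + e.completeness,
    }),
    { relevance: 0, accuracy: 0, completeness: 0 }
  );
  const n = Math.max(evals.length, 1);
  return {
    relevance: sum.relevance / n,
    accuracy: sum.accuracy / n,
    completeness: sum.completeness / n,
  };
}
```

Tracking these averages over time also surfaces regressions when you change prompts or models.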
RAG-Specific Evaluation
```ts
const RAGEvalSchema = z.object({
  faithfulness: z.number().min(1).max(5)
    .describe("Does the answer only use information from the context?"),
  relevance: z.number().min(1).max(5)
    .describe("Did the retriever find relevant documents?"),
  answerRelevance: z.number().min(1).max(5)
    .describe("Does the answer address the original question?"),
  hallucination: z.boolean()
    .describe("Does the answer contain information NOT in the context?"),
});

async function evaluateRAG(
  question: string,
  retrievedDocs: string[],
  answer: string
) {
  const judge = new ChatOpenAI({ modelName: "gpt-4", temperature: 0 })
    .withStructuredOutput(RAGEvalSchema);

  return await judge.invoke(`
Evaluate this RAG (Retrieval-Augmented Generation) response.

Question: ${question}

Retrieved Context:
${retrievedDocs.map((d, i) => `[${i + 1}] ${d}`).join("\n")}

Generated Answer: ${answer}

Evaluate faithfulness (uses only context), retrieval relevance,
answer relevance, and whether hallucination occurred.
`);
}
```
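An LLM judge costs tokens and latency on every call. As a cheap first pass, you can flag answers whose vocabulary barely overlaps the retrieved context before escalating to the judge. This token-overlap heuristic is our own illustration, not a substitute for the LLM evaluation above:

```typescript
// Fraction of answer tokens that also appear in the retrieved context.
// Very low overlap is a cheap hallucination red flag worth sending to an LLM judge.
function contextOverlap(answer: string, retrievedDocs: string[]): number {
  // Drop short stop-word-like tokens; threshold of 3 chars is an arbitrary choice.
  const tokenize = (s: string) =>
    s.toLowerCase().split(/\W+/).filter((t) => t.length > 3);
  const contextTokens = new Set(retrievedDocs.flatMap(tokenize));
  const answerTokens = tokenize(answer);
  if (answerTokens.length === 0) return 0;
  const hits = answerTokens.filter((t) => contextTokens.has(t)).length;
  return hits / answerTokens.length;
}
```

Overlap near 1 does not prove faithfulness (the answer could still recombine facts wrongly), but overlap near 0 is a strong signal something is off.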
Automated Test Suite
```ts
// evaluation.test.ts
import { describe, it, expect } from "vitest";

describe("AI Chain Quality", () => {
  it("should answer React questions accurately", async () => {
    const result = await myChain.invoke({ input: "What are React hooks?" });
    const answer = result.answer || result.content; // shape depends on your chain
    const evaluation = await evaluateResponse("What are React hooks?", answer);
    expect(evaluation.relevance).toBeGreaterThanOrEqual(4);
    expect(evaluation.accuracy).toBeGreaterThanOrEqual(4);
  });

  it("should not hallucinate in RAG responses", async () => {
    const result = await ragChain.invoke({ input: "Company refund policy?" });
    const ragEval = await evaluateRAG(
      "Company refund policy?",
      result.sourceDocuments.map((d: any) => d.pageContent),
      result.answer
    );
    expect(ragEval.hallucination).toBe(false);
    expect(ragEval.faithfulness).toBeGreaterThanOrEqual(4);
  });

  it("should handle edge cases gracefully", async () => {
    const result = await myChain.invoke({ input: "" });
    const answer = result.answer || result.content;
    expect(answer).toBeDefined();
    expect(String(answer).length).toBeGreaterThan(0);
  });
});
```
LangSmith for Production Monitoring
```bash
# Set up LangSmith tracing
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your-langsmith-api-key
export LANGCHAIN_PROJECT=my-project
```
```ts
// LangSmith automatically traces all LangChain calls.
// No code changes needed — just set the env variables.

// You can also add custom metadata:
const result = await myChain.invoke(
  { input: "What is Docker?" },
  {
    metadata: {
      userId: "user-123",
      sessionId: "session-456",
      environment: "production",
    },
    tags: ["production", "docker-questions"],
  }
);

// View traces, latency, token usage, and errors at smith.langchain.com
```
💡 Key Takeaways
- LLM apps need specialized evaluation — traditional unit tests aren't enough
- LLM-as-Judge uses a powerful model to evaluate another model's output
- RAG evaluation checks faithfulness (no hallucination) and retrieval quality
- Build automated test suites that run evaluation on every deploy
- LangSmith provides production tracing, monitoring, and debugging