Why Evaluate LLM Applications?
Unlike traditional software, where tests check exact outputs, LLM applications produce non-deterministic responses. You need specialized evaluation strategies to ensure your AI app is accurate, helpful, and safe before it reaches production.
⚠️ What Can Go Wrong
- Hallucination: The model makes up facts that sound convincing
- Relevance Drift: Answers don't match the question asked
- Context Ignored: RAG retrieves documents but the model ignores them
- Inconsistency: Same question gets wildly different answers
- Safety: Model produces harmful or biased content
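Inconsistency in particular is easy to probe without any judge model: send the same question several times and measure how much the answers overlap. A minimal sketch using token-level Jaccard similarity (the `jaccardSimilarity` and `consistencyScore` helpers are our own illustration, not a library API):

```typescript
// Token-level Jaccard similarity between two answers (0 = disjoint, 1 = identical).
function jaccardSimilarity(a: string, b: string): number {
  const tokens = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  const intersection = [...setA].filter((t) => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : intersection / union;
}

// Average pairwise similarity across repeated runs of the same prompt.
function consistencyScore(answers: string[]): number {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < answers.length; i++) {
    for (let j = i + 1; j < answers.length; j++) {
      total += jaccardSimilarity(answers[i], answers[j]);
      pairs++;
    }
  }
  return pairs === 0 ? 1 : total / pairs;
}
```

A low score suggests the model answers the same question very differently across runs; the exact threshold is something to tune for your domain.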
Manual Evaluation with Test Cases
```ts
interface TestCase {
  input: string;
  expectedTopics: string[]; // Topics the answer should cover
  shouldNotContain: string[]; // Things to avoid
}

const testCases: TestCase[] = [
  {
    input: "What is React?",
    expectedTopics: ["UI library", "components", "JavaScript"],
    shouldNotContain: ["Angular", "Vue"],
  },
  {
    input: "Explain Docker volumes",
    expectedTopics: ["persistent storage", "containers", "mount"],
    shouldNotContain: ["virtual machines"],
  },
];

async function runEvaluation(chain: any) {
  const results = [];
  for (const testCase of testCases) {
    const response = await chain.invoke({ input: testCase.input });
    const answer = response.answer || response.content;
    const lowerAnswer = answer.toLowerCase();
    const topicsCovered = testCase.expectedTopics.filter(
      (topic) => lowerAnswer.includes(topic.toLowerCase())
    );
    const unwantedFound = testCase.shouldNotContain.filter(
      (term) => lowerAnswer.includes(term.toLowerCase())
    );
    results.push({
      input: testCase.input,
      passed:
        topicsCovered.length === testCase.expectedTopics.length &&
        unwantedFound.length === 0,
      coverage: `${topicsCovered.length}/${testCase.expectedTopics.length}`,
      unwantedTerms: unwantedFound,
    });
  }
  return results;
}

const results = await runEvaluation(myChain);
console.table(results);
```
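For CI gating it helps to collapse the per-case results into a single pass rate. A small helper, assuming the result shape produced by `runEvaluation` above (the `summarize` name is ours):

```typescript
interface EvalResult {
  input: string;
  passed: boolean;
  coverage: string;
  unwantedTerms: string[];
}

// Roll per-case results into a pass rate plus the list of failing inputs.
function summarize(results: EvalResult[]): { passRate: number; failures: string[] } {
  const failures = results.filter((r) => !r.passed).map((r) => r.input);
  const passRate =
    results.length === 0 ? 1 : (results.length - failures.length) / results.length;
  return { passRate, failures };
}
```

You could then fail the build whenever `passRate` drops below a threshold you choose.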
LLM-as-Judge Evaluation
Use a powerful LLM to evaluate the quality of another LLM's responses.
```ts
import { ChatOpenAI } from "@langchain/openai";
import { z } from "zod";

const EvalSchema = z.object({
  relevance: z.number().min(1).max(5)
    .describe("How relevant is the answer to the question"),
  accuracy: z.number().min(1).max(5)
    .describe("How factually accurate is the answer"),
  completeness: z.number().min(1).max(5)
    .describe("How thorough is the answer"),
  reasoning: z.string().describe("Brief explanation of the scores"),
});

const judge = new ChatOpenAI({ modelName: "gpt-4", temperature: 0 })
  .withStructuredOutput(EvalSchema);

async function evaluateResponse(question: string, answer: string, context?: string) {
  const prompt = `
You are an expert evaluator. Score this AI response on a 1-5 scale.

Question: ${question}
${context ? `Context provided: ${context}` : ""}
AI Response: ${answer}

Score each dimension 1-5:
- Relevance: Does the answer address the question?
- Accuracy: Are the facts correct?
- Completeness: Is the answer thorough?
`;
  return await judge.invoke(prompt);
}

// Evaluate your chain's responses
const question = "How do Docker volumes work?";
const response = await myChain.invoke({ input: question });
const answer = response.answer || response.content; // shape depends on your chain
const evaluation = await evaluateResponse(question, answer);
console.log(evaluation);
// { relevance: 5, accuracy: 4, completeness: 4, reasoning: "..." }
```
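Single-example judge scores are noisy; in practice you would run the judge over a dataset and look at averages. The aggregation itself is plain arithmetic over the schema's numeric fields (the `averageScores` helper name is ours):

```typescript
interface JudgeScores {
  relevance: number;
  accuracy: number;
  completeness: number;
}

// Mean of each judge dimension across a batch of evaluations.
function averageScores(evals: JudgeScores[]): JudgeScores {
  const sum = evals.reduce(
    (acc, e) => ({
      relevance: acc.relevance + e.relevance,
      accuracy: acc.accuracy + e.accuracy,
      completeness: acc.completeness + e.completeness,
    }),
    { relevance: 0, accuracy: 0, completeness: 0 }
  );
  const n = Math.max(evals.length, 1);
  return {
    relevance: sum.relevance / n,
    accuracy: sum.accuracy / n,
    completeness: sum.completeness / n,
  };
}
```

Tracking these averages over time also surfaces regressions when you change prompts or models.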
RAG-Specific Evaluation
```ts
const RAGEvalSchema = z.object({
  faithfulness: z.number().min(1).max(5)
    .describe("Does the answer only use information from the context?"),
  relevance: z.number().min(1).max(5)
    .describe("Did the retriever find relevant documents?"),
  answerRelevance: z.number().min(1).max(5)
    .describe("Does the answer address the original question?"),
  hallucination: z.boolean()
    .describe("Does the answer contain information NOT in the context?"),
});

async function evaluateRAG(
  question: string,
  retrievedDocs: string[],
  answer: string
) {
  const judge = new ChatOpenAI({ modelName: "gpt-4", temperature: 0 })
    .withStructuredOutput(RAGEvalSchema);

  return await judge.invoke(`
Evaluate this RAG (Retrieval-Augmented Generation) response.

Question: ${question}

Retrieved Context:
${retrievedDocs.map((d, i) => `[${i + 1}] ${d}`).join("\n")}

Generated Answer: ${answer}

Evaluate faithfulness (uses only context), retrieval relevance,
answer relevance, and whether hallucination occurred.
`);
}
```
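An LLM judge costs tokens and latency on every call. As a cheap first pass, you can flag answers whose vocabulary barely overlaps the retrieved context before escalating to the judge. This token-overlap heuristic is our own illustration, not a substitute for the LLM evaluation above:

```typescript
// Fraction of answer tokens that also appear in the retrieved context.
// Very low overlap is a cheap hallucination red flag worth sending to an LLM judge.
function contextOverlap(answer: string, retrievedDocs: string[]): number {
  // Drop short stop-word-like tokens; threshold of 3 chars is an arbitrary choice.
  const tokenize = (s: string) =>
    s.toLowerCase().split(/\W+/).filter((t) => t.length > 3);
  const contextTokens = new Set(retrievedDocs.flatMap(tokenize));
  const answerTokens = tokenize(answer);
  if (answerTokens.length === 0) return 0;
  const hits = answerTokens.filter((t) => contextTokens.has(t)).length;
  return hits / answerTokens.length;
}
```

Overlap near 1 does not prove faithfulness (the answer could still recombine facts wrongly), but overlap near 0 is a strong signal something is off.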
Automated Test Suite
```ts
// evaluation.test.ts
import { describe, it, expect } from "vitest";

describe("AI Chain Quality", () => {
  it("should answer React questions accurately", async () => {
    const result = await myChain.invoke({ input: "What are React hooks?" });
    const answer = result.answer || result.content; // shape depends on your chain
    const evaluation = await evaluateResponse("What are React hooks?", answer);
    expect(evaluation.relevance).toBeGreaterThanOrEqual(4);
    expect(evaluation.accuracy).toBeGreaterThanOrEqual(4);
  });

  it("should not hallucinate in RAG responses", async () => {
    const result = await ragChain.invoke({ input: "Company refund policy?" });
    const ragEval = await evaluateRAG(
      "Company refund policy?",
      result.sourceDocuments.map((d: any) => d.pageContent),
      result.answer
    );
    expect(ragEval.hallucination).toBe(false);
    expect(ragEval.faithfulness).toBeGreaterThanOrEqual(4);
  });

  it("should handle edge cases gracefully", async () => {
    const result = await myChain.invoke({ input: "" });
    const answer = result.answer || result.content;
    expect(answer).toBeDefined();
    expect(String(answer).length).toBeGreaterThan(0);
  });
});
```
LangSmith for Production Monitoring
```bash
# Set up LangSmith tracing
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your-langsmith-api-key
export LANGCHAIN_PROJECT=my-project
```
```ts
// LangSmith automatically traces all LangChain calls.
// No code changes needed — just set the env variables.

// You can also add custom metadata:
const result = await myChain.invoke(
  { input: "What is Docker?" },
  {
    metadata: {
      userId: "user-123",
      sessionId: "session-456",
      environment: "production",
    },
    tags: ["production", "docker-questions"],
  }
);

// View traces, latency, token usage, and errors at smith.langchain.com
```
💡 Key Takeaways
- LLM apps need specialized evaluation — traditional unit tests aren't enough
- LLM-as-Judge uses a powerful model to evaluate another model's output
- RAG evaluation checks faithfulness (no hallucination) and retrieval quality
- Build automated test suites that run evaluation on every deploy
- LangSmith provides production tracing, monitoring, and debugging