TechLead
Lesson 16 of 18
5 min read
LangChain

Evaluating & Testing LLM Apps

Test and evaluate your LLM applications for quality, accuracy, and reliability using LangSmith and custom evaluators

Why Evaluate LLM Applications?

Unlike traditional software where tests check exact outputs, LLM applications produce non-deterministic responses. You need specialized evaluation strategies to ensure your AI app is accurate, helpful, and safe before deploying to production.

⚠️ What Can Go Wrong

  • Hallucination: The model makes up facts that sound convincing
  • Relevance Drift: Answers don't match the question asked
  • Context Ignored: RAG retrieves documents but the model ignores them
  • Inconsistency: Same question gets wildly different answers
  • Safety: Model produces harmful or biased content
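
The inconsistency failure mode can be caught cheaply before any model-graded evaluation: run the same prompt several times and score how similar the answers are. Here is a minimal sketch; the `pairwiseConsistency` helper and its token-overlap metric are illustrative choices, not part of LangChain:

```typescript
// Score answer consistency: average pairwise Jaccard overlap of word sets.
// 1.0 means identical wording; values near 0 mean wildly different answers.
function pairwiseConsistency(answers: string[]): number {
  const tokenSets = answers.map(
    (a) => new Set(a.toLowerCase().split(/\W+/).filter(Boolean))
  );
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < tokenSets.length; i++) {
    for (let j = i + 1; j < tokenSets.length; j++) {
      const inter = [...tokenSets[i]].filter((t) => tokenSets[j].has(t)).length;
      const union = new Set([...tokenSets[i], ...tokenSets[j]]).size;
      total += union === 0 ? 1 : inter / union;
      pairs++;
    }
  }
  return pairs === 0 ? 1 : total / pairs;
}

// Example: three runs of "What is React?" — the third answer is an outlier
const score = pairwiseConsistency([
  "React is a JavaScript library for building user interfaces.",
  "React is a JavaScript UI library for building interfaces.",
  "React is a database engine.",
]);
console.log(score < 0.8); // a low score flags the inconsistent batch
```

Word overlap is crude (paraphrases score low), but it requires no extra API calls and makes a useful smoke test before the heavier evaluations below.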

Manual Evaluation with Test Cases

interface TestCase {
  input: string;
  expectedTopics: string[];   // Topics the answer should cover
  shouldNotContain: string[]; // Things to avoid
}

const testCases: TestCase[] = [
  {
    input: "What is React?",
    expectedTopics: ["UI library", "components", "JavaScript"],
    shouldNotContain: ["Angular", "Vue"],
  },
  {
    input: "Explain Docker volumes",
    expectedTopics: ["persistent storage", "containers", "mount"],
    shouldNotContain: ["virtual machines"],
  },
];

async function runEvaluation(chain: any) {
  const results = [];

  for (const testCase of testCases) {
    const response = await chain.invoke({ input: testCase.input });
    const answer = response.answer || response.content;
    const lowerAnswer = answer.toLowerCase();

    const topicsCovered = testCase.expectedTopics.filter(
      (topic) => lowerAnswer.includes(topic.toLowerCase())
    );

    const unwantedFound = testCase.shouldNotContain.filter(
      (term) => lowerAnswer.includes(term.toLowerCase())
    );

    results.push({
      input: testCase.input,
      passed: topicsCovered.length === testCase.expectedTopics.length 
              && unwantedFound.length === 0,
      coverage: `${topicsCovered.length}/${testCase.expectedTopics.length}`,
      unwantedTerms: unwantedFound,
    });
  }

  return results;
}

const results = await runEvaluation(myChain);
console.table(results);
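
To track quality over time, it helps to collapse the per-case results into one number. A small illustrative helper (not part of LangChain), assuming the result shape produced by `runEvaluation` above:

```typescript
interface EvalResult {
  input: string;
  passed: boolean;
  coverage: string;
  unwantedTerms: string[];
}

// Collapse per-case results into an overall pass rate plus the failing
// inputs, so a CI job can assert a single threshold.
function summarize(results: EvalResult[]) {
  const failures = results.filter((r) => !r.passed).map((r) => r.input);
  const passRate =
    results.length === 0 ? 1 : (results.length - failures.length) / results.length;
  return { passRate, failures };
}

const summary = summarize([
  { input: "What is React?", passed: true, coverage: "3/3", unwantedTerms: [] },
  { input: "Explain Docker volumes", passed: false, coverage: "2/3", unwantedTerms: ["virtual machines"] },
]);
console.log(summary); // { passRate: 0.5, failures: ["Explain Docker volumes"] }
```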

LLM-as-Judge Evaluation

Use a powerful LLM to evaluate the quality of another LLM's responses.

import { ChatOpenAI } from "@langchain/openai";
import { z } from "zod";

const EvalSchema = z.object({
  relevance: z.number().min(1).max(5).describe("How relevant is the answer to the question"),
  accuracy: z.number().min(1).max(5).describe("How factually accurate is the answer"),
  completeness: z.number().min(1).max(5).describe("How thorough is the answer"),
  reasoning: z.string().describe("Brief explanation of the scores"),
});

const judge = new ChatOpenAI({ modelName: "gpt-4", temperature: 0 })
  .withStructuredOutput(EvalSchema);

async function evaluateResponse(question: string, answer: string, context?: string) {
  const prompt = `
You are an expert evaluator. Score this AI response on a 1-5 scale.

Question: ${question}
${context ? `Context provided: ${context}` : ""}
AI Response: ${answer}

Score each dimension 1-5:
- Relevance: Does the answer address the question?
- Accuracy: Are the facts correct?
- Completeness: Is the answer thorough?
`;

  return await judge.invoke(prompt);
}

// Evaluate your chain's responses
const question = "How do Docker volumes work?";
const response = await myChain.invoke({ input: question });
const answer = response.answer ?? response.content; // extract the text, as in runEvaluation


const evaluation = await evaluateResponse(question, answer);
console.log(evaluation);
// { relevance: 5, accuracy: 4, completeness: 4, reasoning: "..." }
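
Raw judge scores are most useful once reduced to a pass/fail decision, for example by requiring a minimum score on every dimension. A minimal sketch; the `meetsBar` helper and its threshold of 4 are illustrative choices, not a LangChain API:

```typescript
interface JudgeScores {
  relevance: number;
  accuracy: number;
  completeness: number;
  reasoning: string;
}

// Pass only if every scored dimension clears the minimum bar.
// A single weak dimension (e.g. accuracy 2) should fail the response
// even if the average looks acceptable.
function meetsBar(scores: JudgeScores, minScore = 4): boolean {
  return (
    scores.relevance >= minScore &&
    scores.accuracy >= minScore &&
    scores.completeness >= minScore
  );
}

const ok = meetsBar({ relevance: 5, accuracy: 4, completeness: 4, reasoning: "..." });
console.log(ok); // true
```

A per-dimension minimum is stricter than averaging, which is usually what you want: an irrelevant but complete answer should still fail.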

RAG-Specific Evaluation

const RAGEvalSchema = z.object({
  faithfulness: z.number().min(1).max(5)
    .describe("Does the answer only use information from the context?"),
  relevance: z.number().min(1).max(5)
    .describe("Did the retriever find relevant documents?"),
  answerRelevance: z.number().min(1).max(5)
    .describe("Does the answer address the original question?"),
  hallucination: z.boolean()
    .describe("Does the answer contain information NOT in the context?"),
});

async function evaluateRAG(
  question: string,
  retrievedDocs: string[],
  answer: string
) {
  const judge = new ChatOpenAI({ modelName: "gpt-4", temperature: 0 })
    .withStructuredOutput(RAGEvalSchema);

  return await judge.invoke(`
    Evaluate this RAG (Retrieval-Augmented Generation) response.
    
    Question: ${question}
    
    Retrieved Context:
    ${retrievedDocs.map((d, i) => `[${i + 1}] ${d}`).join("\n")}
    
    Generated Answer: ${answer}
    
    Evaluate faithfulness (uses only context), retrieval relevance,
    answer relevance, and whether hallucination occurred.
  `);
}
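
Because judge calls cost tokens, a cheap lexical check can pre-filter obvious hallucinations before invoking `evaluateRAG`. This heuristic (illustrative, not a LangChain API) measures how many of the answer's content words actually appear in the retrieved context; it is crude, but catches blatant fabrication:

```typescript
// Fraction of content words in the answer that also appear somewhere in the
// retrieved context. Low values suggest the answer drew on outside knowledge.
function contextOverlap(answer: string, retrievedDocs: string[]): number {
  const tokenize = (s: string) =>
    s.toLowerCase().split(/\W+/).filter((w) => w.length > 3); // skip short stopword-ish words
  const contextWords = new Set(retrievedDocs.flatMap(tokenize));
  const answerWords = tokenize(answer);
  if (answerWords.length === 0) return 1;
  const grounded = answerWords.filter((w) => contextWords.has(w)).length;
  return grounded / answerWords.length;
}

const overlap = contextOverlap(
  "Volumes persist container data on the host filesystem.",
  ["Docker volumes provide persistent storage for container data on the host."]
);
console.log(overlap > 0.5); // mostly grounded in the context
```

Answers scoring below a chosen cutoff can be sent straight to the LLM judge for a closer look, while clearly grounded ones skip the extra call.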

Automated Test Suite

// evaluation.test.ts
import { describe, it, expect } from "vitest";

describe("AI Chain Quality", () => {
  it("should answer React questions accurately", async () => {
    const result = await myChain.invoke({ input: "What are React hooks?" });
    const answer = result.answer ?? result.content ?? result; // chains may return an object or a string
    const evaluation = await evaluateResponse("What are React hooks?", answer);
    
    expect(evaluation.relevance).toBeGreaterThanOrEqual(4);
    expect(evaluation.accuracy).toBeGreaterThanOrEqual(4);
  });

  it("should not hallucinate in RAG responses", async () => {
    const result = await ragChain.invoke({ input: "Company refund policy?" });
    const ragEval = await evaluateRAG(
      "Company refund policy?",
      result.sourceDocuments.map((d: any) => d.pageContent),
      result.answer
    );
    
    expect(ragEval.hallucination).toBe(false);
    expect(ragEval.faithfulness).toBeGreaterThanOrEqual(4);
  });

  it("should handle edge cases gracefully", async () => {
    const result = await myChain.invoke({ input: "" });
    expect(result).toBeDefined();
    const text = String(result.answer ?? result.content ?? result);
    expect(text.length).toBeGreaterThan(0);
  });
});

LangSmith for Production Monitoring

# Set up LangSmith tracing
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=your-langsmith-api-key
export LANGCHAIN_PROJECT=my-project

LangSmith automatically traces all LangChain calls once these environment variables are set; no code changes are needed.

You can also attach custom metadata and tags to individual calls:

// Add custom metadata and tags to a traced call
const result = await myChain.invoke(
  { input: "What is Docker?" },
  {
    metadata: {
      userId: "user-123",
      sessionId: "session-456",
      environment: "production",
    },
    tags: ["production", "docker-questions"],
  }
);

// View traces, latency, token usage, and errors at smith.langchain.com

💡 Key Takeaways

  • LLM apps need specialized evaluation; traditional unit tests aren't enough
  • LLM-as-Judge uses a powerful model to evaluate another model's output
  • RAG evaluation checks faithfulness (no hallucination) and retrieval quality
  • Build automated test suites that run evaluations on every deploy
  • LangSmith provides production tracing, monitoring, and debugging

Continue Learning