TechLead
Intermediate
20 min
Full Guide

RAG (Retrieval-Augmented Generation)

Give LLMs access to your data — embeddings, vector databases, chunking strategies, and building RAG pipelines

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that makes LLMs smarter by connecting them to external knowledge. Instead of relying solely on training data, a RAG system retrieves relevant documents and includes them in the LLM's prompt, enabling accurate answers about your specific data.

💡 Why RAG?

LLMs have a knowledge cutoff and can't know about your private documents, codebase, or internal docs. RAG bridges this gap without expensive fine-tuning.

How RAG Works

Step 1: Indexing (Offline)

Load documents → split into chunks → generate embeddings → store in vector database

Step 2: Retrieval (At Query Time)

Embed the user's query → search vector DB for similar chunks → return top-K most relevant chunks

Step 3: Generation (At Query Time)

Inject retrieved chunks into the LLM prompt as context → LLM generates an answer grounded in the retrieved data

Embeddings: The Foundation

Embeddings convert text into numerical vectors that capture semantic meaning. Similar text produces similar vectors.

// Generate embeddings using OpenAI
async function getEmbedding(text) {
  const response = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'text-embedding-3-small',
      input: text,
    }),
  });
  if (!response.ok) {
    throw new Error(`Embedding request failed: ${response.status}`);
  }
  const data = await response.json();
  return data.data[0].embedding; // 1536-dimension vector for text-embedding-3-small
}

// Cosine similarity: how "similar" are two vectors?
function cosineSimilarity(a, b) {
  let dotProduct = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Find most relevant documents
async function findRelevant(query, documents, topK = 3) {
  const queryEmbedding = await getEmbedding(query);

  // Cosine similarity is synchronous, so a plain map is all we need
  const scored = documents.map((doc) => ({
    doc,
    score: cosineSimilarity(queryEmbedding, doc.embedding),
  }));

  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(s => s.doc);
}

Chunking Strategies

How you split documents dramatically affects retrieval quality:

📄 Fixed-size Chunks

Split every N characters/tokens. Simple but may break mid-sentence. Use overlap (e.g., 200 chars) to preserve context.
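A minimal fixed-size chunker with overlap might look like this (character-based for simplicity; a token-based version works the same way, just counting tokens instead of characters):

```javascript
// Split text into fixed-size chunks, each overlapping the previous one
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  const step = chunkSize - overlap; // how far the window advances each iteration
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk.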

📝 Semantic Chunking

Split at natural boundaries: paragraphs, sections, headings. Preserves meaning but varies in size.

🔀 Recursive Splitting

Try splitting by \n\n, then \n, then sentence, then word. Falls back through progressively finer boundaries.
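A toy recursive splitter, in the spirit of LangChain's RecursiveCharacterTextSplitter (this is a simplified sketch, not the library's API — it drops the separators and doesn't re-merge small pieces, which a production splitter would do):

```javascript
// Recursively split text, trying coarser separators before finer ones
function recursiveSplit(text, maxLen = 1000, separators = ['\n\n', '\n', '. ', ' ']) {
  if (text.length <= maxLen) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: fall back to hard character splits
    const pieces = [];
    for (let i = 0; i < text.length; i += maxLen) pieces.push(text.slice(i, i + maxLen));
    return pieces;
  }
  // Split on the current separator, then recurse into pieces that are still too long
  return text
    .split(sep)
    .filter(p => p.length > 0)
    .flatMap(p => recursiveSplit(p, maxLen, rest));
}
```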

🎯 Best Practice

Chunk size of 500-1000 tokens with 100-200 token overlap works well for most use cases. Test and iterate.

Vector Databases

Specialized databases optimized for storing and searching embeddings:

Pinecone

Fully managed, serverless. Best for production with no infrastructure management. Free tier available.

Supabase pgvector

PostgreSQL extension for vector similarity search. Great if you already use Supabase for your database.

Chroma / Weaviate

Open-source vector databases. Chroma is great for prototyping; Weaviate for hybrid search (vector + keyword).

Complete RAG Pipeline

// Complete RAG pipeline with Next.js API route
import { OpenAI } from 'openai';

const openai = new OpenAI();

export async function POST(req) {
  const { question } = await req.json();

  // 1. Embed the question
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });

  // 2. Search vector database for relevant chunks
  // (vectorDB is assumed to be an initialized vector store client,
  //  e.g. a Pinecone index — set it up outside the request handler)
  const relevantChunks = await vectorDB.query({
    vector: queryEmbedding.data[0].embedding,
    topK: 5,
    includeMetadata: true,
  });

  // 3. Build context from retrieved chunks
  const context = relevantChunks.matches
    .map(m => m.metadata.text)
    .join('\n\n---\n\n');

  // 4. Generate answer using retrieved context
  const completion = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      {
        role: 'system',
        content: `Answer questions based on the following context. If the answer isn't in the context, say "I don't have enough information."

Context:
${context}`
      },
      { role: 'user', content: question }
    ],
    temperature: 0.3, // Lower temperature for factual answers
  });

  return Response.json({
    answer: completion.choices[0].message.content,
    sources: relevantChunks.matches.map(m => m.metadata.source),
  });
}

Advanced RAG Techniques

Hybrid Search

Combine vector similarity with keyword search (BM25). Catches exact matches that embedding search might miss.
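One common way to fuse the two result lists is Reciprocal Rank Fusion (RRF). A sketch, assuming each list is already sorted best-first and items carry a stable `id`:

```javascript
// Reciprocal Rank Fusion: merge ranked lists without comparing raw scores,
// which avoids normalizing incompatible score scales (cosine vs. BM25)
function rrfFuse(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((item, rank) => {
      // 1 / (k + rank): items near the top of any list earn the most credit
      scores.set(item.id, (scores.get(item.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

An item ranked highly in both lists (like a chunk that matches semantically *and* contains the exact keyword) naturally rises to the top.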

Query Rewriting

Use an LLM to rephrase the user's query into a better search query before retrieval.

Re-ranking

Retrieve more chunks than needed (top-20), then use a cross-encoder to re-rank and select the best (top-5).
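The flow itself is model-agnostic. Here's a sketch with a pluggable `scorePair` function standing in for the cross-encoder, which you'd typically call through an inference API (`scorePair` is a hypothetical name, not a library function):

```javascript
// Re-rank candidates with a (query, passage) scorer, keep only the best topK
async function rerank(query, candidates, scorePair, topK = 5) {
  const scored = await Promise.all(
    candidates.map(async (c) => ({ c, score: await scorePair(query, c.text) }))
  );
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(s => s.c);
}
```

Because the cross-encoder sees query and passage together, it scores relevance more accurately than embedding similarity alone — but it's slower, which is why it only runs on the small retrieved set.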

Contextual Compression

After retrieval, use an LLM to extract only the relevant sentences from each chunk, reducing noise.

🔑 Key Takeaways

  • RAG = retrieve relevant docs + inject them into the LLM prompt. No fine-tuning needed
  • Embeddings capture semantic meaning; similar texts have similar vectors
  • Chunk size (500-1000 tokens) and overlap (100-200 tokens) matter significantly
  • Always include source attribution in RAG responses
  • Hybrid search (vector + keyword) often outperforms vector-only search