RAG (Retrieval-Augmented Generation)
Give LLMs access to your data — embeddings, vector databases, chunking strategies, and building RAG pipelines
What is RAG?
Retrieval-Augmented Generation (RAG) is a technique that grounds LLM responses in external knowledge. Instead of relying solely on its training data, a RAG system retrieves relevant documents at query time and includes them in the LLM's prompt, enabling accurate answers about your specific data.
💡 Why RAG?
LLMs have a knowledge cutoff and know nothing about your private data: internal documents, codebases, or proprietary content. RAG bridges this gap without expensive fine-tuning.
How RAG Works
Step 1: Indexing (Offline)
Load documents → split into chunks → generate embeddings → store in vector database
Step 2: Retrieval (At Query Time)
Embed the user's query → search vector DB for similar chunks → return top-K most relevant chunks
Step 3: Generation (At Query Time)
Inject retrieved chunks into the LLM prompt as context → LLM generates an answer grounded in the retrieved data
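The offline indexing step can be sketched in a few lines. This is a minimal illustration, not a real implementation: `chunkText`, `embed`, and `store` are hypothetical stand-ins for your chunker, embedding API call, and vector database client.

```javascript
// Sketch of Step 1 (indexing). The `chunkText`, `embed`, and `store`
// arguments are placeholders for your own chunker, embedding call,
// and vector store.
async function indexDocuments(documents, { chunkText, embed, store }) {
  for (const doc of documents) {
    for (const chunk of chunkText(doc.text)) {
      const embedding = await embed(chunk);
      // Keep the original text and source alongside the vector,
      // so retrieved chunks can be injected into the prompt and cited
      store.push({ text: chunk, source: doc.source, embedding });
    }
  }
  return store;
}
```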
Embeddings: The Foundation
Embeddings convert text into numerical vectors that capture semantic meaning. Similar text produces similar vectors.
// Generate embeddings using OpenAI's embeddings endpoint
async function getEmbedding(text) {
  const response = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: 'text-embedding-3-small',
      input: text,
    }),
  });
  if (!response.ok) throw new Error(`Embedding request failed: ${response.status}`);
  const data = await response.json();
  return data.data[0].embedding; // A 1536-dimension vector by default
}
// Cosine similarity: how "similar" are two vectors?
function cosineSimilarity(a, b) {
  let dotProduct = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Find most relevant documents (assumes each doc already has an embedding)
async function findRelevant(query, documents, topK = 3) {
  const queryEmbedding = await getEmbedding(query);
  return documents
    .map(doc => ({
      doc,
      score: cosineSimilarity(queryEmbedding, doc.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(s => s.doc);
}
Chunking Strategies
How you split documents dramatically affects retrieval quality:
📄 Fixed-size Chunks
Split every N characters/tokens. Simple but may break mid-sentence. Use overlap (e.g., 200 chars) to preserve context.
📝 Semantic Chunking
Split at natural boundaries: paragraphs, sections, headings. Preserves meaning but varies in size.
🔀 Recursive Splitting
Try splitting by \n\n, then \n, then sentence, then word. Falls back through progressively finer boundaries.
🎯 Best Practice
Chunk size of 500-1000 tokens with 100-200 token overlap works well for most use cases. Test and iterate.
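A fixed-size chunker with overlap is only a few lines. This sketch counts characters rather than tokens for simplicity; in practice you would count tokens with a tokenizer matching your embedding model.

```javascript
// Fixed-size chunking with overlap. Sizes are in characters here;
// for real use, measure in tokens instead.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    // Step forward by (chunkSize - overlap) so consecutive chunks
    // share `overlap` characters of context
    start += chunkSize - overlap;
  }
  return chunks;
}
```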
Vector Databases
Specialized databases optimized for storing and searching embeddings:
Pinecone
Fully managed, serverless. Best for production with no infrastructure management. Free tier available.
Supabase pgvector
PostgreSQL extension for vector similarity search. Great if you already use Supabase for your database.
Chroma / Weaviate
Open-source vector databases. Chroma is great for prototyping; Weaviate for hybrid search (vector + keyword).
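Whatever database you pick, the interface is roughly the same: upsert vectors with metadata, then query by similarity. This toy in-memory store illustrates that interface; real databases replace the linear scan with an approximate nearest-neighbor index (e.g. HNSW) to scale to millions of vectors.

```javascript
// A toy in-memory vector store sketching the upsert/query interface
// that products like Pinecone or pgvector expose. For illustration only:
// it does a brute-force linear scan.
class InMemoryVectorStore {
  constructor() {
    this.records = [];
  }
  upsert(id, vector, metadata) {
    this.records = this.records.filter(r => r.id !== id);
    this.records.push({ id, vector, metadata });
  }
  query(vector, topK = 5) {
    return this.records
      .map(r => ({ ...r, score: cosine(vector, r.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```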
Complete RAG Pipeline
// Complete RAG pipeline with Next.js API route
import { OpenAI } from 'openai';

const openai = new OpenAI();
// `vectorDB` is assumed to be an initialized vector database client
// (e.g. a Pinecone index) exposing a query() method

export async function POST(req) {
  const { question } = await req.json();

  // 1. Embed the question
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  });

  // 2. Search vector database for relevant chunks
  const relevantChunks = await vectorDB.query({
    vector: queryEmbedding.data[0].embedding,
    topK: 5,
    includeMetadata: true,
  });

  // 3. Build context from retrieved chunks
  const context = relevantChunks.matches
    .map(m => m.metadata.text)
    .join('\n\n---\n\n');

  // 4. Generate answer using retrieved context
  const completion = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      {
        role: 'system',
        content: `Answer questions based on the following context. If the answer isn't in the context, say "I don't have enough information."

Context:
${context}`,
      },
      { role: 'user', content: question },
    ],
    temperature: 0.3, // Lower temperature for factual answers
  });

  return Response.json({
    answer: completion.choices[0].message.content,
    sources: relevantChunks.matches.map(m => m.metadata.source),
  });
}
Advanced RAG Techniques
Hybrid Search
Combine vector similarity with keyword search (BM25). Catches exact matches that embedding search might miss.
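One common way to merge the two rankings is Reciprocal Rank Fusion (RRF): each list contributes a score of 1 / (k + rank) per document, so items ranked highly by either method rise to the top. The sketch below assumes you already have the two ranked ID lists; k = 60 is a conventional default.

```javascript
// Reciprocal Rank Fusion: merge ranked result lists (e.g. one from BM25
// keyword search, one from vector search) into a single ranking.
function reciprocalRankFusion(rankings, k = 60) {
  const scores = new Map();
  for (const ranking of rankings) {
    ranking.forEach((docId, rank) => {
      // rank is 0-based, so rank + 1 gives the usual 1-based position
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```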
Query Rewriting
Use an LLM to rephrase the user's query into a better search query before retrieval.
Re-ranking
Retrieve more chunks than needed (top-20), then use a cross-encoder to re-rank and select the best (top-5).
Contextual Compression
After retrieval, use an LLM to extract only the relevant sentences from each chunk, reducing noise.
🔑 Key Takeaways
- RAG = retrieve relevant docs + inject into the LLM prompt; no fine-tuning needed
- Embeddings capture semantic meaning; similar texts have similar vectors
- Chunk size (500-1000 tokens) and overlap (100-200 tokens) matter significantly
- Always include source attribution in RAG responses
- Hybrid search (vector + keyword) often outperforms vector-only search