Document Loaders
Document Loaders are LangChain components that load data from various sources and convert
them into Document objects. Each Document has pageContent (the text) and
metadata (source info, page numbers, etc.).
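A Document is just a plain object, so you can also construct one yourself, which is handy for quick tests. A minimal sketch (the content and metadata values here are made up for illustration):
import { Document } from "@langchain/core/documents";

const doc = new Document({
  pageContent: "LangChain loaders turn raw files into Documents.",
  metadata: { source: "manual-example", page: 1 },
});

console.log(doc.pageContent); // The text
console.log(doc.metadata); // { source: "manual-example", page: 1 }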
Supported Sources
Files
PDF, TXT, CSV, JSON, Markdown, DOCX
Web
Web pages, sitemaps, GitHub repos
Apps & Services
Notion, Confluence, Google Docs
Loading Text Files
import { TextLoader } from "langchain/document_loaders/fs/text";
const loader = new TextLoader("./docs/readme.txt");
const documents = await loader.load();
console.log(documents[0].pageContent); // File contents
console.log(documents[0].metadata); // { source: "./docs/readme.txt" }
Loading PDFs
PDF loading is one of the most common use cases for RAG applications.
npm install pdf-parse
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
// Load with one document per page (the default)
const loader = new PDFLoader("./docs/report.pdf");
const docs = await loader.load();
console.log(docs.length); // Number of pages
console.log(docs[0].metadata);
// { source: "./docs/report.pdf", pdf: { ... }, loc: { pageNumber: 1 } }

// Load the entire PDF as a single document
const loaderWholeFile = new PDFLoader("./docs/report.pdf", {
  splitPages: false,
});
const wholeDoc = await loaderWholeFile.load();
console.log(wholeDoc.length); // 1
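If you prefer per-page documents but still want the full text as one string, you can join the pages yourself. This is plain JavaScript over the docs array from above, not a loader feature:
const fullText = docs.map((page) => page.pageContent).join("\n\n");
console.log(fullText.length); // Total characters across all pages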
Loading Web Pages
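CheerioWebBaseLoader parses the fetched HTML with the cheerio package, so install it first:
npm install cheerio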
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
// Load a web page
const loader = new CheerioWebBaseLoader(
  "https://www.frontendtechlead.com/learn-langchain"
);
const docs = await loader.load();
console.log(docs[0].pageContent.slice(0, 200));
// Load multiple pages
const urls = [
  "https://example.com/page1",
  "https://example.com/page2",
];

const allDocs = [];
for (const url of urls) {
  const loader = new CheerioWebBaseLoader(url);
  const docs = await loader.load();
  allDocs.push(...docs);
}
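The loop above fetches pages one at a time. Since the requests are independent, you can also load them in parallel; here is a sketch using Promise.all (the variable names are just illustrative):
const loaders = urls.map((url) => new CheerioWebBaseLoader(url));
const docsPerUrl = await Promise.all(loaders.map((loader) => loader.load()));
const allDocsParallel = docsPerUrl.flat();
console.log(`Loaded ${allDocsParallel.length} documents from ${urls.length} URLs`);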
Loading CSV Files
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
// Load a CSV file: each row becomes a document
const loader = new CSVLoader("./data/products.csv");
const docs = await loader.load();
console.log(docs[0].pageContent);
// "name: Widget A\nprice: 29.99\ncategory: Tools"
console.log(docs[0].metadata);
// { source: "./data/products.csv", line: 1 }
// Load specific columns only
const loader2 = new CSVLoader("./data/products.csv", {
  column: "description", // Only load this column
});
Loading JSON
import { JSONLoader } from "langchain/document_loaders/fs/json";
// Load all text values from JSON
const loader = new JSONLoader("./data/faq.json");
const docs = await loader.load();
// Load specific JSON paths
const loader2 = new JSONLoader(
  "./data/faq.json",
  ["/questions/*/answer"] // JSON Pointer paths
);
Directory Loader
Load all files from a directory, automatically selecting the right loader based on file extension.
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { JSONLoader } from "langchain/document_loaders/fs/json";
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";
const loader = new DirectoryLoader("./docs", {
  ".txt": (path) => new TextLoader(path),
  ".json": (path) => new JSONLoader(path),
  ".csv": (path) => new CSVLoader(path),
});
const docs = await loader.load();
console.log(`Loaded ${docs.length} documents`);
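Every loaded document keeps its originating file in metadata.source, so you can quickly check what the directory load picked up. A small sketch over the docs array returned above (the output shown is illustrative):
const countsBySource = {};
for (const doc of docs) {
  const source = doc.metadata.source;
  countsBySource[source] = (countsBySource[source] ?? 0) + 1;
}
console.log(countsBySource);
// e.g. { "./docs/readme.txt": 1, "./docs/products.csv": 42 }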
Text Splitters
After loading documents, you need to split them into smaller chunks for two reasons:
- LLMs have context window limits, so you can't pass entire books
- Smaller chunks improve retrieval accuracy by letting you find the exact relevant paragraph
RecursiveCharacterTextSplitter (Recommended)
Splits by paragraphs first, then sentences, then words, preserving semantic coherence.
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000, // Max characters per chunk
  chunkOverlap: 200, // Overlap between chunks for context
  separators: ["\n\n", "\n", ". ", " ", ""], // Split hierarchy
});
const text = "Your very long document content here...";
const chunks = await splitter.createDocuments([text]);
console.log(`Split into ${chunks.length} chunks`);
console.log(chunks[0].pageContent.length); // ~1000 chars
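Because of chunkOverlap, the tail of one chunk reappears near the start of the next. Assuming your input is long enough to produce at least two chunks, you can inspect this directly; the overlap is approximate because the splitter prefers to break at separators:
console.log(chunks[0].pageContent.slice(-100)); // End of the first chunk
console.log(chunks[1].pageContent.slice(0, 100)); // Start of the second chunk, shares text with the line above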
Splitting Code
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
// Language-aware splitting for code
const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 500,
  chunkOverlap: 50,
});

const code = `
function hello() {
  console.log("Hello!");
}

class UserService {
  async getUser(id) {
    return await db.users.find(id);
  }
}
`;
const chunks = await splitter.createDocuments([code]);
// Splits at function/class boundaries
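fromLanguage also accepts other languages, including "python", "html", and "markdown", so the same idea works for prose with headings. A short sketch for Markdown (the sample string is made up):
const mdSplitter = RecursiveCharacterTextSplitter.fromLanguage("markdown", {
  chunkSize: 500,
  chunkOverlap: 50,
});
const mdChunks = await mdSplitter.createDocuments([
  "# Setup\nInstall the package...\n\n## Usage\nImport the loader...",
]);
// Prefers to split at heading and paragraph boundaries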
Complete Pipeline: Load → Split → Store
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";
// 1. Load
const loader = new PDFLoader("./docs/manual.pdf", { splitPages: true });
const rawDocs = await loader.load();
console.log(`Loaded ${rawDocs.length} pages`);
// 2. Split
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
const splitDocs = await splitter.splitDocuments(rawDocs);
console.log(`Split into ${splitDocs.length} chunks`);
// 3. Store in vector database
const vectorStore = await MemoryVectorStore.fromDocuments(
  splitDocs,
  new OpenAIEmbeddings()
);

// 4. Search!
const results = await vectorStore.similaritySearch(
  "How do I configure the settings?",
  3
);

results.forEach((doc) => {
  console.log(doc.pageContent.slice(0, 100));
  console.log(doc.metadata);
});
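If you also want to see how strong each match is, vector stores (including MemoryVectorStore) expose similaritySearchWithScore, which returns [document, score] pairs. A sketch continuing from the pipeline above:
const scored = await vectorStore.similaritySearchWithScore(
  "How do I configure the settings?",
  3
);
for (const [doc, score] of scored) {
  console.log(score.toFixed(3), doc.pageContent.slice(0, 80));
}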
Chunk Size Guidelines
- Small chunks (200-500): Better precision, more results needed
- Medium chunks (500-1000): Best balance for most RAG apps
- Large chunks (1000-2000): More context per result, fewer results needed
- Overlap (10-20%): Prevents losing context at chunk boundaries (see the example below)
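Putting those numbers together, a medium-chunk setup with roughly 15% overlap could look like this; the exact values are illustrative, not prescriptive:
const ragSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800, // medium chunks: a balance of precision and context
  chunkOverlap: 120, // ~15% of chunkSize, so boundary sentences land in both chunks
});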
Key Takeaways
- Document loaders convert files, web pages, and APIs into LangChain Documents
- Always split large documents before storing in a vector database
- RecursiveCharacterTextSplitter is the best default choice
- Use chunk overlap to prevent losing context at boundaries
- The pipeline is always: Load → Split → Embed → Store → Retrieve