TechLead
Lesson 10 of 18
5 min read
LangChain

Document Loaders & Text Splitters

Load data from PDFs, web pages, CSVs, and more, then split the resulting documents into optimal chunks for RAG

Document Loaders

Document Loaders are LangChain components that load data from various sources and convert them into Document objects. Each Document has pageContent (the text) and metadata (source info, page numbers, etc.).
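Conceptually, a Document is just a plain object pairing text with metadata. A minimal sketch (illustrative only, not the real LangChain class):

```javascript
// Illustrative sketch of the Document shape (not the actual LangChain class)
function makeDocument(pageContent, metadata = {}) {
  return { pageContent, metadata };
}

const doc = makeDocument("LangChain makes RAG easy.", {
  source: "./docs/readme.txt",
});
console.log(doc.pageContent);     // "LangChain makes RAG easy."
console.log(doc.metadata.source); // "./docs/readme.txt"
```

Every loader in this lesson produces arrays of objects with exactly this shape, which is why splitters and vector stores can consume any loader's output interchangeably.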

πŸ“ Supported Sources

📄 Files

PDF, TXT, CSV, JSON, Markdown, DOCX

🌐 Web

Web pages, sitemaps, GitHub repos

πŸ—„οΈ Databases

Notion, Confluence, Google Docs

Loading Text Files

import { TextLoader } from "langchain/document_loaders/fs/text";

const loader = new TextLoader("./docs/readme.txt");
const documents = await loader.load();

console.log(documents[0].pageContent);  // File contents
console.log(documents[0].metadata);     // { source: "./docs/readme.txt" }

Loading PDFs

PDF loading is one of the most common use cases for RAG applications.

First install the PDF parsing dependency:

npm install pdf-parse
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

// Default: one document per page
const loader = new PDFLoader("./docs/report.pdf");
const pages = await loader.load();
console.log(pages.length); // Number of pages
console.log(pages[0].metadata);
// { source: "./docs/report.pdf", pdf: { ... }, loc: { pageNumber: 1 } }

// Load the entire PDF as a single document
const singleDocLoader = new PDFLoader("./docs/report.pdf", {
  splitPages: false,
});
const docs = await singleDocLoader.load();
console.log(docs.length); // 1

Loading Web Pages

import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";

// Load a web page
const loader = new CheerioWebBaseLoader(
  "https://www.frontendtechlead.com/learn-langchain"
);
const docs = await loader.load();
console.log(docs[0].pageContent.slice(0, 200));

// Load multiple pages
const urls = [
  "https://example.com/page1",
  "https://example.com/page2",
];

const allDocs = [];
for (const url of urls) {
  const loader = new CheerioWebBaseLoader(url);
  const docs = await loader.load();
  allDocs.push(...docs);
}
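The loop above loads pages one at a time. Since each load is network-bound, you can fan the requests out with Promise.all instead; a sketch, where `loadPage` is a hypothetical per-URL loader (with LangChain it would wrap `new CheerioWebBaseLoader(url).load()`):

```javascript
// Load many URLs concurrently; loadPage is a hypothetical per-URL loader
async function loadAll(urls, loadPage) {
  const perUrl = await Promise.all(urls.map((url) => loadPage(url)));
  return perUrl.flat(); // each loadPage call returns an array of documents
}
```

Concurrent loading is much faster for long URL lists, but be mindful of rate limits on the target site when fanning out many requests at once.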

Loading CSV Files

import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";

// Load CSV: each row becomes a document
const loader = new CSVLoader("./data/products.csv");
const docs = await loader.load();

console.log(docs[0].pageContent);
// "name: Widget A\nprice: 29.99\ncategory: Tools"
console.log(docs[0].metadata);
// { source: "./data/products.csv", line: 1 }

// Load specific columns only
const loader2 = new CSVLoader("./data/products.csv", {
  column: "description", // Only load this column
});
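The row-to-text conversion shown above can be sketched as a simple "key: value" join (illustrative; this mirrors the output format, not CSVLoader's actual source):

```javascript
// Sketch: how a parsed CSV row becomes pageContent, one "key: value" per line
function rowToPageContent(row) {
  return Object.entries(row)
    .map(([key, value]) => `${key}: ${value}`)
    .join("\n");
}

console.log(rowToPageContent({ name: "Widget A", price: "29.99", category: "Tools" }));
// name: Widget A
// price: 29.99
// category: Tools
```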

Loading JSON

import { JSONLoader } from "langchain/document_loaders/fs/json";

// Load all text values from JSON
const loader = new JSONLoader("./data/faq.json");
const docs = await loader.load();

// Load specific JSON paths
const loader2 = new JSONLoader(
  "./data/faq.json",
  ["/questions/*/answer"] // JSONPointer paths
);

Directory Loader

Load all files from a directory, automatically selecting the right loader based on file extension.

import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { JSONLoader } from "langchain/document_loaders/fs/json";
import { CSVLoader } from "@langchain/community/document_loaders/fs/csv";

const loader = new DirectoryLoader("./docs", {
  ".txt": (path) => new TextLoader(path),
  ".json": (path) => new JSONLoader(path),
  ".csv": (path) => new CSVLoader(path),
});

const docs = await loader.load();
console.log(`Loaded ${docs.length} documents`);
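The extension-based dispatch that DirectoryLoader performs can be sketched like this (simplified; the real loader also walks subdirectories and handles unknown extensions per its options):

```javascript
// Sketch: pick a loader factory by file extension (simplified dispatch logic)
function pickLoader(filePath, loaders) {
  const ext = filePath.slice(filePath.lastIndexOf("."));
  const factory = loaders[ext];
  if (!factory) throw new Error(`No loader registered for ${ext}`);
  return factory(filePath);
}

// Demo with a stub factory instead of a real loader class
const stubLoaders = { ".txt": (p) => ({ kind: "text", path: p }) };
console.log(pickLoader("./docs/readme.txt", stubLoaders));
// { kind: "text", path: "./docs/readme.txt" }
```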

Text Splitters

After loading documents, you need to split them into smaller chunks for two reasons:

  • LLMs have context window limits, so you can't pass entire books
  • Smaller chunks improve retrieval accuracy, helping find the exact relevant paragraph

RecursiveCharacterTextSplitter (Recommended)

Splits by paragraphs first, then sentences, then words, preserving semantic coherence.

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,     // Max characters per chunk
  chunkOverlap: 200,   // Overlap between chunks for context
  separators: ["\n\n", "\n", ". ", " ", ""], // Split hierarchy
});

const text = "Your very long document content here...";
const chunks = await splitter.createDocuments([text]);

console.log(`Split into ${chunks.length} chunks`);
console.log(chunks[0].pageContent.length); // ~1000 chars
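To see what chunkSize and chunkOverlap mean mechanically, here is a stripped-down sketch of fixed-size chunking with overlap (it ignores the recursive separator hierarchy the real splitter uses):

```javascript
// Sketch: fixed-size chunking with overlap, no separator awareness
function chunkWithOverlap(text, chunkSize, chunkOverlap) {
  const step = chunkSize - chunkOverlap; // new characters contributed per chunk
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}

const chunks = chunkWithOverlap("a".repeat(2500), 1000, 200);
console.log(chunks.length);    // 3
console.log(chunks[1].length); // 1000
```

The last 200 characters of each chunk reappear at the start of the next, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.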

Splitting by Code

import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

// Language-aware splitting for code
const splitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 500,
  chunkOverlap: 50,
});

const code = `
function hello() {
  console.log("Hello!");
}

class UserService {
  async getUser(id) {
    return await db.users.find(id);
  }
}
`;

const chunks = await splitter.createDocuments([code]);
// Splits at function/class boundaries

Complete Pipeline: Load → Split → Store

import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

// 1. Load
const loader = new PDFLoader("./docs/manual.pdf", { splitPages: true });
const rawDocs = await loader.load();
console.log(`Loaded ${rawDocs.length} pages`);

// 2. Split
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});
const splitDocs = await splitter.splitDocuments(rawDocs);
console.log(`Split into ${splitDocs.length} chunks`);

// 3. Store in vector database
const vectorStore = await MemoryVectorStore.fromDocuments(
  splitDocs,
  new OpenAIEmbeddings()
);

// 4. Search!
const results = await vectorStore.similaritySearch(
  "How do I configure the settings?", 3
);
results.forEach((doc) => {
  console.log(doc.pageContent.slice(0, 100));
  console.log(doc.metadata);
});

πŸ“ Chunk Size Guidelines

  • Small chunks (200-500): Better precision, more results needed
  • Medium chunks (500-1000): Best balance for most RAG apps
  • Large chunks (1000-2000): More context per result, fewer results needed
  • Overlap (10-20%): Prevents losing context at chunk boundaries

💡 Key Takeaways

  • Document loaders convert files, web pages, and APIs into LangChain Documents
  • Always split large documents before storing in a vector database
  • RecursiveCharacterTextSplitter is the best default choice
  • Use chunk overlap to prevent losing context at boundaries
  • The pipeline is always: Load → Split → Embed → Store → Retrieve
