Chunking in RAG — Breaking Text the Right Way


Guide to text chunking strategies, why they matter, and how to pick the right one for your RAG project.

10 min read

What's inside

• What is Chunking?
• Why Do We Need It?
• The Lost-in-the-Middle Problem
• Choosing a Chunking Strategy
• Chunking Methods

What is Chunking?

Let's keep it simple. You've got a big document — maybe a 200-page PDF, a long article, or an entire codebase. You can't just shove the whole thing into an LLM and hope for the best. That's where chunking comes in.

Chunking is the process of breaking down large text into smaller, manageable pieces (called "chunks") that can be embedded, stored in a vector database, and retrieved when needed. Think of it like slicing a pizza — you need the right size slices so people can actually eat them.

In short

Chunking = taking big text → making small, meaningful pieces → so your RAG system can actually find and use the right information.

Why Do We Need It?

Two big reasons:

Embedding models have token limits. Most embedding models work best with a certain input size. Feed them too much text and the quality of the embeddings drops. Feed them too little and you lose context. You need that sweet spot.
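To make that sweet spot concrete, here is a tiny pre-flight check. The 4-characters-per-token ratio is only a rough rule of thumb for English text, not an exact count; in practice you'd use your model's actual tokenizer:

```python
# Rough sanity check before embedding. The ~4 chars/token ratio is a
# heuristic for English; swap in your model's real tokenizer for exact counts.
def estimated_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_embedding_model(chunk: str, max_tokens: int = 512) -> bool:
    return estimated_tokens(chunk) <= max_tokens
```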

LLMs have context windows. Even though context windows are getting bigger (we're talking 100K+ tokens now), that doesn't mean you should dump everything in there. More context doesn't mean better answers — it often means worse ones. Which brings us to...

The Lost-in-the-Middle Problem

This one's a big deal and a lot of people overlook it.

Research (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts") has shown that when you give an LLM a long context, it pays the most attention to the beginning and the end. The stuff in the middle? It kinda gets ignored. The model literally "loses" information that sits in the middle of a long context window.

Why this matters for RAG

If you retrieve 20 chunks and stuff them all into the prompt, the model will mostly use the first few and last few. Chunks 8 through 15? Might as well not be there. This is why good chunking (and good retrieval ranking) is critical — you want fewer, better chunks rather than a massive wall of text.
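One common mitigation, similar in spirit to LangChain's LongContextReorder transformer, is to reorder retrieved chunks so the highest-ranked ones sit at the edges of the prompt, where the model actually looks. A minimal sketch, assuming the input list is already sorted best-first:

```python
# Place the best-ranked chunks at the start and end of the context window,
# pushing the weakest ones into the middle where attention is lowest.
def reorder_for_context(chunks):
    # chunks: sorted best-first by retrieval score
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

reorder_for_context(["c1", "c2", "c3", "c4", "c5"])
# → ['c1', 'c3', 'c5', 'c4', 'c2']  (best chunks at the edges)
```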

So chunking isn't just about breaking text apart. It's about making sure each piece is meaningful enough to stand on its own when retrieved, so you don't need to dump 30 chunks into context and hope for the best.

Choosing a Chunking Strategy

Before you start splitting text like a madman, ask yourself these four questions:

  1. What kind of data am I working with? A 500-page legal document is very different from a bunch of short FAQ answers. Long, structured docs need smart splitting. Short docs might not need chunking at all — if they fit comfortably in context, just use them as-is.
  2. Which embedding model am I using? Different models are optimized for different input lengths. Some work great with 256 tokens, others prefer 512 or more. Check your model's docs and match your chunk size accordingly.
  3. What do user queries look like? Short keyword searches? Full-sentence questions? Multi-paragraph prompts? Your chunk size should roughly mirror the granularity of the queries. Short queries → shorter chunks tend to match better. Detailed questions → slightly larger chunks with more context.
  4. How are retrieved chunks being used? Are they being fed directly into an LLM prompt? Displayed to a user? Used for citation? Each use case has different requirements for chunk size, overlap, and structure.

Chunking Methods

Alright, let's get into the actual techniques. We'll go from simple to sophisticated.

Fixed-Size Chunking

The most basic approach. You pick a number — say 500 characters — and just split the text every 500 characters. Maybe you add some overlap (like 50 characters) so chunks share a little context at the edges.

fixed_size_chunking.py
python
# Dead simple approach
chunk_size = 500
overlap = 50
chunks = []
for i in range(0, len(text), chunk_size - overlap):
    chunks.append(text[i:i + chunk_size])

It's fast, it's simple, it works in a pinch. But here's the problem — it's dumb. It doesn't care about sentence boundaries, paragraphs, or meaning. You'll end up with chunks that start mid-sentence and end mid-thought. The embeddings for those chunks will be noisy and retrieval quality suffers.

This is exactly why we need "Content-aware" chunking. Instead of blindly chopping text at arbitrary character counts, content-aware methods understand the structure of what they're splitting — sentences, paragraphs, headings, code blocks. The result? Chunks that actually make semantic sense, which means better embeddings and better retrieval.

Pros: ✅ Easy to implement, fast, predictable chunk sizes
Cons: ❌ Breaks mid-sentence, loses context, poor embedding quality

Now let's look at the content-aware methods that actually respect your text's structure.

Sentence-Based Chunking

The idea is straightforward — split text along natural language boundaries like sentences and paragraphs. Each chunk is a complete thought, not a fragment.

There are a few ways to do this:

Just split on periods. It works... until it doesn't. Think about "Dr. Smith went to Washington D.C. on Jan. 5th." — that's one sentence, but naive splitting chops it into five fragments. Not great.

naive_split.py
python
# Don't do this in production
chunks = text.split(".")
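A middle ground before reaching for a full NLP library is a regex with an abbreviation guard. Everything here is a simplified sketch — the regex and the abbreviation list are illustrative assumptions, not production-ready parsing:

```python
import re

# Split on ., !, ? followed by whitespace and a capital letter, then
# re-join fragments that ended with a known abbreviation. The list is
# illustrative; a real one would be much longer.
ABBREVIATIONS = {"Dr", "Mr", "Mrs", "Ms", "Prof", "Jan", "Feb", "vs"}

def split_sentences(text):
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    sentences = []
    for part in parts:
        prev = sentences[-1].rstrip(".").rsplit(" ", 1)[-1] if sentences else ""
        if prev in ABBREVIATIONS:
            sentences[-1] += " " + part  # undo a false split after an abbreviation
        else:
            sentences.append(part)
    return sentences
```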

NLTK's sentence tokenizer is way smarter. It uses a pre-trained model (Punkt) that understands abbreviations, decimals, and other tricky edge cases.

nltk_chunking.py
python
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(text)
# Group sentences into chunks of roughly 500 characters
chunks = []
chunk = []
for sent in sentences:
    chunk.append(sent)
    if len(" ".join(chunk)) > 500:
        chunks.append(" ".join(chunk))
        chunk = []
if chunk:  # don't drop the trailing partial chunk
    chunks.append(" ".join(chunk))

spaCy takes it up another notch. It doesn't just find sentence boundaries — it builds a full linguistic model of your text. Slightly heavier, but the sentence detection is rock solid.

spacy_chunking.py
python
import spacy

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
Pros: ✅ Preserves complete thoughts, better embeddings, natural boundaries
Cons: ❌ Uneven chunk sizes, needs NLP libraries, slower than fixed-size

Recursive Character Splitting

This is probably the most popular method in the RAG world right now, and for good reason. LangChain's RecursiveCharacterTextSplitter is the go-to implementation.

The core idea is clever: try to split on the most meaningful boundary first, then fall back to less meaningful ones. The default separator hierarchy is:

separators.py
python
separators = ["\n\n", "\n", " ", ""]

# Translation:
# 1. First, try splitting on double newlines (paragraphs)
# 2. If chunks are still too big, split on single newlines
# 3. Still too big? Split on spaces (words)
# 4. Last resort: split on individual characters

So you set a target chunk size (say 1000 characters), and the splitter works its way down the separator list until each chunk fits. This means paragraphs stay intact when possible, sentences stay together when paragraphs are too long, and you only break words as an absolute last resort.

recursive_splitter.py
python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)

chunks = splitter.split_text(long_document)

The beauty of this approach is balance — you get roughly consistent chunk sizes (which embedding models love) while still respecting text structure (which retrieval quality loves).
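To demystify what's happening under the hood, here's a stripped-down sketch of the recursive fallback logic. It is not LangChain's actual implementation (for one thing, it drops the separators instead of keeping them, and it skips overlap), but the control flow is the same idea:

```python
def recursive_split(text, separators, chunk_size=1000):
    """Toy recursive splitter: try each separator in order, recursing
    with the next one whenever a piece is still too big."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators if separators else [""]
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:
                # This piece alone is too big: fall back to the next separator
                chunks.extend(recursive_split(piece, rest, chunk_size))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```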

Pros: ✅ Great balance of size consistency and semantic meaning, easy to use, highly configurable
Cons: ❌ Still character-based at its core, doesn't understand document structure (headers, tables)

Document Structure-Based Chunking

Now we're getting smart. Instead of just looking at characters and sentences, this approach actually understands the structure of your document.

Think about it — real documents aren't just walls of text. They have:

  • PDFs — headers, sub-headers, tables, figures, footers, page numbers
  • HTML pages — <h1> through <h6> tags, <p> paragraphs, <table> elements, <ul> lists
  • Markdown — # headings, code blocks, bullet lists
  • Code files — functions, classes, imports, comments

Structure-based chunking uses these natural boundaries to create chunks. A section under an <h2> header becomes one chunk. A table stays together. A code function isn't split in half.

html_splitter.py
python
# For HTML documents
from langchain.text_splitter import HTMLHeaderTextSplitter

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(html_content)

The real power here is that each chunk comes with metadata. You don't just get the text — you know which section it came from, what heading it was under, what page it was on. This makes retrieval way more precise.
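The same idea works for any structured format. Here's a toy version for markdown that just illustrates carrying the heading along as metadata (LangChain's real MarkdownHeaderTextSplitter is considerably more capable):

```python
# Toy markdown splitter: one chunk per section, with the heading kept as
# metadata so retrieval results can say where they came from.
def split_markdown(md):
    chunks, heading, lines = [], None, []
    for line in md.splitlines():
        if line.lstrip().startswith("#"):
            if lines:
                chunks.append({"heading": heading, "text": "\n".join(lines).strip()})
                lines = []
            heading = line.lstrip("# ").strip()
        else:
            lines.append(line)
    if lines:
        chunks.append({"heading": heading, "text": "\n".join(lines).strip()})
    return chunks
```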

Pros: ✅ Preserves document logic, rich metadata, great for structured docs
Cons: ❌ Format-specific (need different parsers for PDF vs HTML vs Markdown), more complex setup

Here's the deal — there's no single "best" chunking strategy. It depends on your data, your embedding model, your queries, and your use case. But here's a rough mental model:

  • Just prototyping? → Fixed-size or recursive character splitting. Get something working fast.
  • Building for production with unstructured text? → Recursive character splitting with tuned parameters. It's the sweet spot for most use cases.
  • Working with structured documents (PDFs, HTML, docs)? → Document structure-based chunking. Preserve that structure — it's free metadata.
  • Need maximum retrieval quality? → Combine structure-based chunking with sentence-level splitting inside each section. Best of both worlds.
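As a sketch of that last option, here's a toy structure-plus-sentence pipeline: split on markdown headings first, then break any oversized section on sentence boundaries. Both regexes are simplified assumptions, not production parsers:

```python
import re

def split_sections(md):
    # Structure pass: split on markdown headings (simplified)
    return [s.strip() for s in re.split(r"(?m)^#+ .*$", md) if s.strip()]

def split_long(section, max_chars=200):
    # Sentence pass: only sub-split sections that are too big
    if len(section) <= max_chars:
        return [section]
    sentences = re.split(r"(?<=[.!?])\s+", section)
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

def chunk_document(md, max_chars=200):
    return [c for sec in split_sections(md) for c in split_long(sec, max_chars)]
```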

Final Advice

Start simple, measure your retrieval quality, and iterate. Chunking is one of those things where a small improvement in strategy can lead to a huge improvement in your RAG system's output quality. Don't sleep on it. Once you have your chunking strategy dialed in, learn how to make your RAG pipeline faster and measurable with [**Semantic Caching & RAGAS Evaluation**](/blog/semantic-caching-ragas-evaluation).

Happy chunking. Go build something cool. 🚀
