What is a Chunking Strategy for RAG?
A chunking strategy is the approach used to split source documents into smaller segments before encoding them into vectors for a RAG (Retrieval-Augmented Generation) pipeline. The size, overlap, and boundary logic of chunks directly affect retrieval quality: chunks that are too large compress too much information into one vector, while chunks that are too small lose context. The right strategy depends on your document type and retrieval requirements.
Why does chunking matter so much for RAG?
Embedding models encode each input into a single fixed-size vector. If a chunk contains five unrelated paragraphs, the vector averages over all of them, diluting the signal for any individual topic. If a chunk is a single sentence, it may lack the context needed to represent its meaning correctly.
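A toy calculation makes the dilution concrete. The sketch below assumes, purely for illustration, that a chunk's vector behaves like the mean of its paragraphs' vectors and that the five topics are mutually orthogonal. Real embedding models pool information in more complicated ways, so treat the number as an intuition rather than a measurement.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Assumption: five unrelated paragraphs, modeled as orthogonal unit vectors.
paragraph_vectors = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]

# Assumption: the chunk vector is approximated by the mean of its paragraphs.
chunk_vector = [sum(col) / len(paragraph_vectors) for col in zip(*paragraph_vectors)]

# A query that matches the first paragraph exactly.
query = paragraph_vectors[0]

focused = cosine(query, paragraph_vectors[0])  # chunk holding only that paragraph
diluted = cosine(query, chunk_vector)          # chunk holding all five paragraphs
print(f"focused chunk: {focused:.3f}, five-topic chunk: {diluted:.3f}")
```

Under these assumptions the similarity falls from 1.0 to 1/√5 ≈ 0.447, even though the relevant paragraph is still inside the chunk.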
Chunking is also one of the most impactful things you can still change after deployment. The embedding model and vector DB tend to be fixed infrastructure, whereas a chunking change only requires re-chunking and re-indexing the corpus. Re-indexing still has a cost, so getting chunking right before launch avoids paying it repeatedly.
Main chunking strategies
Fixed-size chunking
Split documents every N tokens (or characters), with optional overlap:
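
    # Assumes `tokenizer` is any object with encode()/decode(), e.g. a Hugging Face tokenizer.
    def fixed_size_chunks(text, chunk_size=512, overlap=64):
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be >= 0 and smaller than chunk_size")
        tokens = tokenizer.encode(text)
        chunks = []
        step = chunk_size - overlap
        for i in range(0, len(tokens), step):
            chunks.append(tokenizer.decode(tokens[i:i + chunk_size]))
            if i + chunk_size >= len(tokens):
                break  # this window already reached the end; avoid a redundant tail chunk
        return chunks

Pros: Simple, predictable, easy to implement.
Cons: Splits sentences and paragraphs mid-thought, losing semantic coherence.
Best for: Homogeneous documents (e.g. database records, product descriptions).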
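To see how the overlap plays out, here is a self-contained sketch that applies the same sliding-window idea to a plain list of tokens. Whitespace splitting stands in for a real tokenizer, and `window_chunks` is a hypothetical helper name; both are assumptions made only for illustration.

```python
def window_chunks(tokens, chunk_size, overlap):
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already covers the end of the input
    return chunks

# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = "a b c d e f g h i j".split()
chunks = window_chunks(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With a chunk size of 4 and an overlap of 1, the window advances three tokens at a time, so each chunk starts with the last token of the previous one.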
Sentence-based chunking
Split on sentence boundaries, then group sentences until reaching a token limit:

    # sent_tokenize needs NLTK's Punkt tokenizer data to be downloaded first.
    from nltk.tokenize import sent_tokenize

    def sentence_chunks(text, max_tokens=256):
        sentences = sent_tokenize(text)
        chunks, current, count = [], [], 0
        for sent in sentences:
            n = len(tokenizer.encode(sent))
            # Close the current chunk when adding this sentence would exceed the limit.
            # A single sentence longer than max_tokens still becomes its own oversized chunk.
            if count + n > max_tokens and current:
                chunks.append(" ".join(current))
                current, count = [], 0
            current.append(sent)
            count += n
        if current:
            chunks.append(" ".join(current))
        return chunks

Pros: Preserves sentence-level coherence.
Cons: Chunk sizes vary; may miss multi-sentence context.
Best for: General prose documents, articles, reports.
Recursive / structure-aware chunking
Respect the document’s natural hierarchy — split on headings first, then paragraphs, then sentences:

    # LangChain's RecursiveCharacterTextSplitter approach
    separators = ["\n\n", "\n", ". ", " ", ""]
    # Split on paragraph breaks first; fall back to finer splits only when needed

Pros: Preserves document structure and meaning.
Cons: More complex to implement; structure varies across documents.
Best for: Structured documents with clear headings (wikis, documentation, legal contracts).
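The separator list above is only configuration. As a rough illustration of the fallback logic, here is a simplified, dependency-free sketch. It is not LangChain's implementation: it measures length in characters rather than tokens, it drops the separator wherever it cuts, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split `text` into pieces of at most `max_chars` characters,
    trying coarse separators first and finer ones only when needed."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut every max_chars characters.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too large: recurse with the finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks

doc = "Intro paragraph.\n\nSecond paragraph is longer. It has two sentences.\n\nEnd."
pieces = recursive_split(doc, max_chars=30)
for piece in pieces:
    print(repr(piece))
```

Here the short paragraphs survive intact, and only the middle paragraph, which exceeds the limit, is broken down further at its sentence boundary.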
Semantic chunking
Use an embedding model to find natural breakpoints — split where the semantic similarity between adjacent sentences drops significantly:

    import numpy as np
    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("http://localhost:8080")

    def cosine_similarity(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def semantic_chunks(sentences, threshold=0.7):
        if not sentences:
            return []
        emb_results = client.encode("BAAI/bge-m3", [Item(text=s) for s in sentences])
        embeddings = [r["dense"] for r in emb_results]
        chunks, current = [], [sentences[0]]
        for i in range(1, len(sentences)):
            similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
            # A drop below the threshold marks a topic shift: start a new chunk.
            if similarity < threshold:
                chunks.append(" ".join(current))
                current = []
            current.append(sentences[i])
        chunks.append(" ".join(current))
        return chunks

Pros: Most semantically coherent chunks.
Cons: Requires an extra encoding pass over every sentence at index time; the threshold needs tuning.
Best for: Long heterogeneous documents where topics shift unpredictably.
Chunk size recommendations by document type
| Document type | Recommended chunk size | Overlap (tokens) |
|---|---|---|
| Short product descriptions | 128–256 tokens | 0–32 |
| News articles / blog posts | 256–512 tokens | 32–64 |
| Technical documentation | 512 tokens | 64–128 |
| Legal / financial documents | 512–1024 tokens | 128–256 |
| Research papers | 256–512 per section | 32–64 |
| Chat transcripts | Per turn or 256 tokens | 0 |
When in doubt, start with 512 tokens and a 64-token overlap, which works well for most document types.
Parent-child chunking
A powerful pattern for long documents: index small chunks for retrieval precision, but return larger parent chunks as context to the LLM:
- Split document into large parent chunks (e.g. 2048 tokens)
- Split each parent into small child chunks (e.g. 256 tokens)
- Index only child chunks in the vector DB
- At retrieval time: find relevant child chunks, then return the full parent chunk as LLM context
This gives the precision of small-chunk retrieval with the context richness of large chunks.
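The four steps above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: whitespace-separated words stand in for tokens, the vector search itself is left out (the `hits` list pretends to be its result), and all function names are hypothetical.

```python
def split_words(words, size):
    """Split a list of words into consecutive groups of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def build_parent_child_index(text, parent_size=2048, child_size=256):
    """Return (parents, children). Each child records the id of its parent.

    Assumption: whitespace-separated words stand in for tokens.
    """
    parents, children = {}, []
    for parent_id, parent_words in enumerate(split_words(text.split(), parent_size)):
        parents[parent_id] = " ".join(parent_words)
        for child_words in split_words(parent_words, child_size):
            # Only these child records would be embedded and stored in the vector DB.
            children.append({"parent_id": parent_id, "text": " ".join(child_words)})
    return parents, children

def parents_for_hits(hits, parents):
    """Map retrieved child records to their parents, deduplicated, in rank order."""
    seen, context = set(), []
    for hit in hits:
        pid = hit["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
    return context

# Tiny sizes so the structure is easy to inspect.
text = " ".join(f"w{i}" for i in range(12))
parents, children = build_parent_child_index(text, parent_size=6, child_size=2)

# Pretend the vector search returned these child chunks, best match first.
hits = [children[4], children[3], children[0]]
print(parents_for_hits(hits, parents))
```

Deduplicating by parent matters because several matching children often share one parent, and sending the same parent to the LLM twice wastes context.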
How chunking interacts with SIE
SIE encodes whatever text you pass. Better chunking = better input to the embedding model = better vectors = better retrieval:

    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    client = SIEClient("http://localhost:8080")

    # `recursive_chunk`, `document`, `vector_db`, and `doc_id` are placeholders
    # for your own chunking function, input text, vector DB client, and document ID.

    # Encode chunks; SIE handles batching efficiently
    chunks = recursive_chunk(document, max_tokens=512)
    encode_results = client.encode("BAAI/bge-m3", [Item(text=c) for c in chunks])
    vectors = [r["dense"] for r in encode_results]

    # Store chunks + vectors in your vector DB
    for chunk, vector in zip(chunks, vectors):
        vector_db.upsert(text=chunk, vector=vector, metadata={"doc_id": doc_id})

Frequently asked questions
What chunk size should I start with?
512 tokens with a 64-token overlap is a safe default for most document types. Evaluate with recall@k metrics on a sample of real queries before committing to a chunking strategy.
Does chunk size affect LLM context window usage?
Yes. Larger chunks consume more of the LLM’s context window. For GPT-4 or similar, you can fit roughly 5–10 chunks of 512 tokens, which is about 2,500–5,000 tokens of retrieved context. Smaller chunks let you include more retrieved results, but with less context per chunk.
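The trade-off is simple arithmetic. The sketch below estimates how many retrieved chunks fit in a prompt. The window size, the prompt overhead, and the space reserved for the answer are illustrative assumptions, not properties of any particular model, and `max_chunks_in_context` is a hypothetical helper.

```python
def max_chunks_in_context(context_window, chunk_tokens, prompt_tokens=1000, answer_tokens=2000):
    """Number of whole chunks that fit after reserving room for the prompt and the answer."""
    available = context_window - prompt_tokens - answer_tokens
    return max(available // chunk_tokens, 0)

# Assumed 8,192-token window, compared for 512-token and 256-token chunks.
print(max_chunks_in_context(8192, 512))
print(max_chunks_in_context(8192, 256))
```

With these assumed reservations, 5,192 tokens remain for retrieved text, which holds 10 chunks of 512 tokens or 20 chunks of 256 tokens.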
What is the best way to evaluate my chunking strategy?
Build a small evaluation set of (query, expected document) pairs from your domain. Measure recall@10 and recall@100 for each chunking strategy. The strategy with the highest recall is usually the best starting point.
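Recall@k over such an evaluation set takes only a few lines to compute. In this sketch `retrieve` is a hypothetical callable that returns ranked document IDs for a query (in practice it would encode the query, search the vector DB, and map the retrieved chunks back to their source documents), and each query is assumed to have exactly one expected document.

```python
def recall_at_k(eval_pairs, retrieve, k):
    """Fraction of queries whose expected document appears in the top-k results."""
    if not eval_pairs:
        raise ValueError("eval_pairs must not be empty")
    hits = sum(1 for query, expected in eval_pairs if expected in retrieve(query)[:k])
    return hits / len(eval_pairs)

# Toy stand-in for a real retriever: canned rankings per query.
canned = {
    "reset password": ["doc_7", "doc_2", "doc_9"],
    "refund policy": ["doc_4", "doc_1", "doc_3"],
}

def retrieve(query):
    return canned[query]

eval_pairs = [("reset password", "doc_2"), ("refund policy", "doc_8")]
print(recall_at_k(eval_pairs, retrieve, k=1))
print(recall_at_k(eval_pairs, retrieve, k=3))
```

Running the same evaluation set against an index built with each chunking strategy gives directly comparable numbers.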