
What is a Chunking Strategy for RAG?

A chunking strategy is the approach used to split source documents into smaller segments before encoding them into vectors for a RAG (Retrieval-Augmented Generation) pipeline. The size, overlap, and boundary logic of chunks directly affect retrieval quality: chunks that are too large compress too much information into one vector, while chunks that are too small lose context. The right strategy depends on your document type and retrieval requirements.


Why does chunking matter so much for RAG?

An embedding model encodes each input, whatever its length up to the model's limit, into a single fixed-size vector. If a chunk contains five unrelated paragraphs, that one vector has to represent all of them, which dilutes the signal for any individual topic. If a chunk is a single sentence, it may lack the context needed to represent its meaning correctly.
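
A toy illustration of the dilution effect. It assumes the chunk vector behaves like the mean of its paragraph vectors and that the five topics are orthogonal unit vectors; both are simplifications made for this sketch, since real embedding models are not plain averages:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical setup: five unrelated topics as orthogonal unit vectors.
topics = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
query = topics[0]  # a query about the first topic only

focused_chunk = topics[0]          # chunk that covers one topic
mixed_chunk = mean_vector(topics)  # chunk that covers all five topics

focused_score = cosine(query, focused_chunk)
mixed_score = cosine(query, mixed_chunk)
print(round(focused_score, 3), round(mixed_score, 3))
```

Under these assumptions the mixed chunk scores 1/sqrt(5) against the query, less than half the score of the focused chunk, even though it contains the same relevant paragraph.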

Chunking is also one of the most impactful things you can change after deployment. The embedding model and vector DB tend to be fixed infrastructure, whereas chunking logic can be updated and the corpus re-indexed relatively quickly. Re-indexing still costs compute, though, so getting chunking right before launch saves significant re-indexing cost.


Main chunking strategies

Fixed-size chunking

Split documents every N tokens (or characters), with optional overlap:

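# `tokenizer` is assumed to be any tokenizer object exposing encode() and decode()
def fixed_size_chunks(text, chunk_size=512, overlap=64):
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    tokens = tokenizer.encode(text)
    step = chunk_size - overlap  # advance by less than chunk_size so neighbours overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokenizer.decode(tokens[i:i + chunk_size]))
        if i + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks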

Pros: Simple, predictable, easy to implement. Cons: Splits sentences and paragraphs mid-thought, losing semantic coherence. Best for: Homogeneous documents (e.g. database records, product descriptions).
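
To see how chunk size and overlap interact, the sketch below computes only the token offsets of each window, without any tokenizer. The helper name `window_spans` is made up for this illustration, and it assumes windows stop as soon as one reaches the end of the document:

```python
def window_spans(num_tokens, chunk_size=512, overlap=64):
    """Return (start, end) token offsets for each fixed-size window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive window starts
    spans = []
    start = 0
    while start < num_tokens:
        end = min(start + chunk_size, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break  # this window already covers the tail of the document
        start += step
    return spans

spans = window_spans(1200, chunk_size=512, overlap=64)
print(spans)  # [(0, 512), (448, 960), (896, 1200)]
```

Each window starts 448 tokens after the previous one, so consecutive chunks share 64 tokens and the last chunk is shorter than the rest.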


Sentence-based chunking

Split on sentence boundaries, then group sentences until reaching a token limit:

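from nltk.tokenize import sent_tokenize  # requires NLTK's Punkt sentence tokenizer data

# `tokenizer` is assumed to be the same tokenizer your embedding model uses
def sentence_chunks(text, max_tokens=256):
    sentences = sent_tokenize(text)
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(tokenizer.encode(sent))
        # Close the current chunk before it would exceed the token limit
        if count + n > max_tokens and current:
            chunks.append(" ".join(current))
            current, count = [], 0
        # Note: a single sentence longer than max_tokens still becomes its own chunk
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks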

Pros: Preserves sentence-level coherence. Cons: Chunk sizes vary; may miss multi-sentence context. Best for: General prose documents, articles, reports.


Recursive / structure-aware chunking

Respect the document’s natural hierarchy: split on the coarsest boundaries first (headings, then paragraphs) and fall back to sentences and words only when a piece is still too large:

# LangChain's RecursiveCharacterTextSplitter approach
separators = ["\n\n", "\n", ". ", " ", ""]
# Split on paragraph breaks first; fall back to finer splits only when needed

Pros: Preserves document structure and meaning. Cons: More complex to implement; structure varies across documents. Best for: Structured documents with clear headings (wikis, documentation, legal contracts).


Semantic chunking

Use an embedding model to find natural breakpoints — split where the semantic similarity between adjacent sentences drops significantly:

import numpy as np
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences, threshold=0.7):
    if not sentences:
        return []
    emb_results = client.encode("BAAI/bge-m3", [Item(text=s) for s in sentences])
    embeddings = [r["dense"] for r in emb_results]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # A drop in similarity between neighbouring sentences marks a topic shift
        similarity = cosine_similarity(embeddings[i - 1], embeddings[i])
        if similarity < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

Pros: Most semantically coherent chunks. Cons: Requires an extra embedding pass over individual sentences at index time, and the threshold needs tuning. Best for: Long heterogeneous documents where topics shift unpredictably.
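
A fixed threshold such as 0.7 may not transfer between embedding models, because similarity ranges differ. One possible way to tune it, shown here as an assumption rather than something the article prescribes, is to derive the threshold from the document itself and split at the lowest share of adjacent-sentence similarities. The function names and the similarity values below are made up for the example:

```python
import math

def percentile_threshold(similarities, percentile=20):
    """Return the similarity value at the given percentile (nearest-rank method)."""
    if not similarities:
        raise ValueError("need at least one similarity score")
    ordered = sorted(similarities)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

def split_points(similarities, threshold):
    """Sentence indices at which a new chunk starts.

    similarities[i] is the similarity between sentence i and sentence i + 1,
    so a low value at position i means a new chunk starts at sentence i + 1.
    """
    return [i + 1 for i, s in enumerate(similarities) if s <= threshold]

# Made-up adjacent-sentence similarities for an 11-sentence document
sims = [0.82, 0.79, 0.41, 0.85, 0.80, 0.38, 0.77, 0.83, 0.81, 0.78]
threshold = percentile_threshold(sims, percentile=20)
starts = split_points(sims, threshold)
print(threshold, starts)  # 0.41 [3, 6]
```

Note that this sketch splits on `<=` so that the boundary value itself counts as a breakpoint; the percentile is still a parameter to tune, but it adapts to each document's similarity range.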


Chunk size recommendations by document type

Document type                 Recommended chunk size        Overlap (tokens)
Short product descriptions    128–256 tokens                0–32
News articles / blog posts    256–512 tokens                32–64
Technical documentation       512 tokens                    64–128
Legal / financial documents   512–1024 tokens               128–256
Research papers               256–512 tokens per section    32–64
Chat transcripts              Per turn or 256 tokens        0

When in doubt, start with 512-token chunks and a 64-token overlap. This is a reasonable default for most document types, which you can then adjust based on retrieval metrics.


Parent-child chunking

A powerful pattern for long documents: index small chunks for retrieval precision, but return larger parent chunks as context to the LLM:

  1. Split document into large parent chunks (e.g. 2048 tokens)
  2. Split each parent into small child chunks (e.g. 256 tokens)
  3. Index only child chunks in the vector DB
  4. At retrieval time: find relevant child chunks, then return the full parent chunk as LLM context

This gives the precision of small-chunk retrieval with the context richness of large chunks.
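
The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace-separated words stand in for model tokens, the "index" is a plain list of dicts rather than a vector DB, and all function names are hypothetical:

```python
def split_tokens(tokens, size):
    """Cut a token list into consecutive pieces of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def build_parent_child_index(text, parent_size=2048, child_size=256):
    """Return (parents, children); only the children would be embedded and indexed."""
    tokens = text.split()  # assumption: words stand in for real tokens
    parents, children = {}, []
    for parent_id, parent_tokens in enumerate(split_tokens(tokens, parent_size)):
        parents[parent_id] = " ".join(parent_tokens)
        for child_tokens in split_tokens(parent_tokens, child_size):
            # Each child remembers which parent it came from
            children.append({"text": " ".join(child_tokens), "parent_id": parent_id})
    return parents, children

def expand_to_parents(matched_children, parents):
    """Map ranked child hits to their parent chunks, dropping duplicate parents."""
    seen, context = set(), []
    for child in matched_children:
        parent_id = child["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            context.append(parents[parent_id])
    return context

text = " ".join(f"w{i}" for i in range(20))
parents, children = build_parent_child_index(text, parent_size=8, child_size=4)

# Pretend the vector search returned these child chunks, best match first
hits = [children[3], children[2], children[0]]
context = expand_to_parents(hits, parents)
print(len(parents), len(children), len(context))  # 3 5 2
```

Deduplicating parents matters because several child hits often point at the same parent, and sending that parent to the LLM more than once wastes context window.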


How chunking interacts with SIE

SIE encodes whatever text you pass, so chunk quality carries straight through: better chunks give the embedding model better input, which yields better vectors and better retrieval:

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

# `recursive_chunk`, `document`, `doc_id`, and `vector_db` are placeholders for
# your own chunking function, input text, document ID, and vector DB client.
chunks = recursive_chunk(document, max_tokens=512)

# Encode chunks (SIE handles batching efficiently)
encode_results = client.encode("BAAI/bge-m3", [Item(text=c) for c in chunks])
vectors = [r["dense"] for r in encode_results]

# Store chunks + vectors in your vector DB
for chunk, vector in zip(chunks, vectors):
    vector_db.upsert(text=chunk, vector=vector, metadata={"doc_id": doc_id})

Frequently asked questions

What chunk size should I start with? 512 tokens with 64 token overlap is a safe default for most document types. Evaluate with recall@k metrics on a sample of real queries before committing to a chunking strategy.

Does chunk size affect LLM context window usage? Yes. Larger chunks consume more of the LLM’s context window. For a model with an 8K-token window, such as the original GPT-4, 5–10 chunks of 512 tokens (about 2.5K–5K tokens) still leave room for the prompt and the response. Smaller chunks let you include more retrieved results, but with less context per chunk.
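
The budget arithmetic behind that answer can be sketched as a small helper. The function name and the prompt and response reservations are illustrative assumptions, not figures from any specific model:

```python
def max_chunks_in_context(context_window, chunk_tokens, prompt_tokens=0, response_tokens=0):
    """How many retrieved chunks fit after reserving room for prompt and response."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    available = context_window - prompt_tokens - response_tokens
    return max(0, available // chunk_tokens)

# Hypothetical budget: 8192-token window, 1024 for the prompt, 2048 for the answer
large = max_chunks_in_context(8192, 512, prompt_tokens=1024, response_tokens=2048)
small = max_chunks_in_context(8192, 256, prompt_tokens=1024, response_tokens=2048)
print(large, small)  # 10 20
```

Halving the chunk size doubles the number of results that fit, at the price of less context per result.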

What is the best way to evaluate my chunking strategy? Build a small evaluation set of (query, expected document) pairs from your domain. Measure recall@10 and recall@100 with different chunking strategies. The strategy with the highest recall is usually the best starting point.
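
A minimal sketch of the recall@k measurement, assuming each query has exactly one expected document and that retrieved chunks have already been mapped back to their source document IDs. The function name and the data are invented for the example:

```python
def recall_at_k(retrieved, expected, k):
    """Fraction of queries whose expected document appears in the top-k results.

    retrieved: dict mapping query -> ranked list of document IDs
    expected:  dict mapping query -> the single expected document ID
    """
    if not expected:
        raise ValueError("evaluation set is empty")
    hits = sum(
        1 for query, doc_id in expected.items()
        if doc_id in retrieved.get(query, [])[:k]
    )
    return hits / len(expected)

# Made-up evaluation set and retrieval results for one chunking strategy
expected = {"q1": "docA", "q2": "docB", "q3": "docC"}
retrieved = {
    "q1": ["docA", "docX"],
    "q2": ["docY", "docB"],
    "q3": ["docZ", "docW"],
}
r1 = recall_at_k(retrieved, expected, k=1)
r2 = recall_at_k(retrieved, expected, k=2)
print(round(r1, 3), round(r2, 3))
```

Running the same evaluation set against each chunking strategy, with everything else held constant, gives directly comparable numbers.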

