---
title: What is a Chunking Strategy for RAG?
description: A chunking strategy is the approach used to split source documents into smaller segments before encoding them into vectors for a RAG (Retrieval-Augmented Generation) pipeline. The size, overlap, and boundary logic of chunks directly affect retrieval quality; chunks that are too large compress too much information ...
canonical_url: https://superlinked.com/glossary/what-is-a-chunking-strategy-for-rag
last_updated: 2026-06-02
---

# What is a Chunking Strategy for RAG?

A chunking strategy is the approach used to split source documents into smaller segments before encoding them into vectors for a RAG (Retrieval-Augmented Generation) pipeline. The size, overlap, and boundary logic of chunks directly affect retrieval quality: chunks that are too large compress too much information into one vector, while chunks that are too small lose context. The right strategy depends on your document type and retrieval requirements.

---

## Why does chunking matter so much for RAG?

An embedding model encodes each input, up to its maximum sequence length, into a single fixed-size vector. If a chunk contains five unrelated paragraphs, that one vector has to represent all of them at once, diluting the signal for any individual topic. If a chunk is a single sentence, it may lack the context needed to correctly represent its meaning.
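
The dilution effect can be illustrated with a toy calculation. This is a sketch under loud simplifying assumptions: a chunk's vector is treated as the mean of its paragraph vectors, and unrelated topics are modelled as orthogonal unit vectors. Real embedding models do not literally average, so treat the numbers as intuition only; `cosine` and `mean_vector` are hypothetical helper names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Assumption: five unrelated topics modelled as orthogonal unit vectors.
topics = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]

single_topic_chunk = topics[0]        # a chunk about one topic only
mixed_chunk = mean_vector(topics)     # a chunk covering all five topics

query = topics[0]                     # a query about the first topic
focused_score = cosine(query, single_topic_chunk)
diluted_score = cosine(query, mixed_chunk)
```

Under these assumptions the focused chunk scores 1.0 against the query, while the mixed chunk scores about 0.45 (1/√5), even though it contains the same relevant paragraph.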

Chunking is also one of the most impactful things you can change after deployment. The embedding model and vector DB are usually treated as fixed infrastructure, whereas the chunking logic can be changed without replacing either. Any change still means re-chunking and re-embedding the affected documents, so getting it right before launch saves significant re-indexing cost.

---

## Main chunking strategies

### Fixed-size chunking
Split documents every N tokens (or characters), with optional overlap:

```python
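def fixed_size_chunks(text, tokenizer, chunk_size=512, overlap=64):
    # `tokenizer` is any object exposing encode(str) -> token ids and decode(ids) -> str
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokens = tokenizer.encode(text)
    step = chunk_size - overlap  # each window starts `step` tokens after the previous one
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(tokenizer.decode(tokens[i:i + chunk_size]))
        if i + chunk_size >= len(tokens):
            break  # this window already reaches the end; skip a redundant tail chunk
    return chunks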
```

**Pros:** Simple, predictable, easy to implement.
**Cons:** Splits sentences and paragraphs mid-thought, losing semantic coherence.
**Best for:** Homogeneous documents (e.g. database records, product descriptions).
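
To make the role of overlap concrete, here is a minimal sketch that computes only the token offsets of each window, with no tokenizer involved. The function name `window_spans` and the 1,200-token document length are illustrative assumptions.

```python
def window_spans(n_tokens, chunk_size=512, overlap=64):
    """Return (start, end) token offsets for fixed-size windows with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - overlap  # distance between consecutive window starts
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last window reaches the end of the document
        start += stride
    return spans

# Hypothetical 1,200-token document with the default settings.
spans = window_spans(1200, chunk_size=512, overlap=64)
```

Here `spans` is `[(0, 512), (448, 960), (896, 1200)]`: each window starts 448 tokens after the previous one, so consecutive windows share 64 tokens, and the final window is shorter than `chunk_size`.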

---

### Sentence-based chunking
Split on sentence boundaries, then group sentences until reaching a token limit:

```python
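# Requires NLTK's Punkt sentence data (nltk.download("punkt"), or "punkt_tab" on recent NLTK versions)
from nltk.tokenize import sent_tokenize

def sentence_chunks(text, tokenizer, max_tokens=256):
    # `tokenizer` is any object exposing encode(str) -> list of token ids
    sentences = sent_tokenize(text)
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(tokenizer.encode(sent))
        # Flush the current chunk before it would exceed the token budget
        if count + n > max_tokens and current:
            chunks.append(" ".join(current))
            current, count = [], 0
        # A single sentence longer than max_tokens still becomes its own (oversized) chunk
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks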
```

**Pros:** Preserves sentence-level coherence.
**Cons:** Chunk sizes vary; may miss multi-sentence context.
**Best for:** General prose documents, articles, reports.

---

### Recursive / structure-aware chunking
Respect the document's natural hierarchy — split on headings first, then paragraphs, then sentences:

```python
# LangChain's RecursiveCharacterTextSplitter approach
separators = ["\n\n", "\n", ". ", " ", ""]
# Split on paragraph breaks first; fall back to finer splits only when needed
```

**Pros:** Preserves document structure and meaning.
**Cons:** More complex to implement; structure varies across documents.
**Best for:** Structured documents with clear headings (wikis, documentation, legal contracts).
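
The recursive fallback described above can be sketched in plain Python. This is not LangChain's implementation, only a minimal illustration of the idea: it measures length in characters rather than tokens, and the function name `recursive_split` and the `max_chars` parameter are assumptions made for this example.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard-cut into max_chars-sized pieces.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]

doc = "First paragraph about setup.\n\nSecond paragraph. It has two sentences.\n\nThird."
chunks = recursive_split(doc, max_chars=40)
```

With a 40-character limit, each of the three paragraphs fits on its own but no two fit together, so the paragraph separator alone is enough and no sentence is cut. A piece that contains none of the separators falls through to the hard cut.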

---

### Semantic chunking
Use an embedding model to find natural breakpoints — split where the semantic similarity between adjacent sentences drops significantly:

```python
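import numpy as np
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

def cosine_similarity(a, b):
    # Cosine of the angle between two dense vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences, threshold=0.7):
    if not sentences:
        return []
    emb_results = client.encode("BAAI/bge-m3", [Item(text=s) for s in sentences])
    embeddings = [r["dense"] for r in emb_results]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(embeddings[i-1], embeddings[i])
        # A drop below the threshold marks a topic shift: close the current chunk
        if similarity < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks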
```

**Pros:** Most semantically coherent chunks.
**Cons:** Requires an extra embedding pass over every sentence at index time (extra compute); the threshold needs tuning for your corpus and model.
**Best for:** Long heterogeneous documents where topics shift unpredictably.

---

## Chunk size recommendations by document type

| Document type | Recommended chunk size | Overlap (tokens) |
|---|---|---|
| Short product descriptions | 128–256 tokens | 0–32 |
| News articles / blog posts | 256–512 tokens | 32–64 |
| Technical documentation | 512 tokens | 64–128 |
| Legal / financial documents | 512–1024 tokens | 128–256 |
| Research papers | 256–512 tokens per section | 32–64 |
| Chat transcripts | Per turn or 256 tokens | 0 |

When in doubt, start with 512 tokens and a 64-token overlap. This is a reasonable baseline for most document types; tune it against your own queries from there.

---

## Parent-child chunking

A powerful pattern for long documents: index small chunks for retrieval precision, but return larger parent chunks as context to the LLM:

1. Split document into large parent chunks (e.g. 2048 tokens)
2. Split each parent into small child chunks (e.g. 256 tokens)
3. Index only child chunks in the vector DB
4. At retrieval time: find relevant child chunks, then return the full parent chunk as LLM context

This gives the precision of small-chunk retrieval with the context richness of large chunks.
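
The four steps above can be sketched as follows. This is a minimal illustration, not a production indexer: it uses whitespace-separated words as a stand-in for tokens, keeps everything in memory instead of a vector DB, and all function names are hypothetical.

```python
def split_words(words, size):
    """Group a list of words into consecutive windows of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def build_parent_child_index(text, parent_size=2048, child_size=256):
    """Return (parents, children); each child records the id of its parent."""
    parents, children = {}, []
    for parent_id, parent_words in enumerate(split_words(text.split(), parent_size)):
        parents[parent_id] = " ".join(parent_words)
        for child_words in split_words(parent_words, child_size):
            # Only these child records would be embedded and stored in the vector DB.
            children.append({"parent_id": parent_id, "text": " ".join(child_words)})
    return parents, children

def parents_for_hits(hit_children, parents):
    """Map retrieved child chunks to parent texts, de-duplicated, in rank order."""
    seen, context = set(), []
    for child in hit_children:
        pid = child["parent_id"]
        if pid not in seen:
            seen.add(pid)
            context.append(parents[pid])
    return context

# Tiny example: 10 "tokens", parents of 4, children of 2.
text = " ".join(f"w{i}" for i in range(10))
parents, children = build_parent_child_index(text, parent_size=4, child_size=2)

# Pretend the vector search returned these children, best match first.
hits = [children[3], children[2], children[0]]
context = parents_for_hits(hits, parents)
```

Two of the three hits share a parent, so `context` holds only two parent chunks (`"w4 w5 w6 w7"` and `"w0 w1 w2 w3"`). De-duplicating parents this way avoids sending the same passage to the LLM twice.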

---

## How chunking interacts with SIE

SIE encodes whatever text you pass. Better chunking = better input to the embedding model = better vectors = better retrieval:

```python
from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

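# recursive_chunk, document, vector_db and doc_id are placeholders for your own
# chunker, input text, vector store client, and document id.
# Encode chunks; SIE handles batching efficiently
chunks = recursive_chunk(document, max_tokens=512)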
encode_results = client.encode("BAAI/bge-m3", [Item(text=c) for c in chunks])
vectors = [r["dense"] for r in encode_results]

# Store chunks + vectors in your vector DB
for chunk, vector in zip(chunks, vectors):
    vector_db.upsert(text=chunk, vector=vector, metadata={"doc_id": doc_id})
```

---

## Frequently asked questions

**What chunk size should I start with?**
512 tokens with a 64-token overlap is a safe default for most document types. Evaluate with recall@k on a sample of real queries before committing to a chunking strategy.

**Does chunk size affect LLM context window usage?**
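Yes. Larger chunks consume more of the LLM's context window. With an 8k-token window, such as the original GPT-4's, 5–10 chunks of 512 tokens (about 2.5k–5k tokens) fit while leaving room for the prompt and the answer; models with larger windows can take more. Smaller chunks let you include more retrieved results but with less context per chunk.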
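
The budget arithmetic behind that kind of estimate can be sketched as below. The 8,192-token window and the 3,000 tokens reserved for the system prompt, question, and answer are illustrative assumptions, and `max_chunks_in_context` is a hypothetical helper name.

```python
def max_chunks_in_context(context_window, chunk_tokens, reserved_tokens):
    """Number of whole chunks that fit after reserving tokens for prompt and answer."""
    available = context_window - reserved_tokens
    return max(available // chunk_tokens, 0)

# Assumed budget: 8,192-token window, 3,000 tokens reserved for everything else.
chunks_512 = max_chunks_in_context(8192, chunk_tokens=512, reserved_tokens=3000)
chunks_256 = max_chunks_in_context(8192, chunk_tokens=256, reserved_tokens=3000)
```

With these numbers, 10 chunks of 512 tokens or 20 chunks of 256 tokens fit. Halving the chunk size doubles how many results you can include.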

**What is the best way to evaluate my chunking strategy?**
Build a small evaluation set of (query, expected document) pairs from your domain. Index the same corpus with each chunking strategy and measure recall@10 and recall@100 on that set. The strategy with the highest recall at the k you actually pass to the LLM is usually the best starting point.
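
A minimal sketch of that measurement, assuming retrieval returns chunks that you map back to their source document ids. The data structures, document ids, and the name `recall_at_k` are illustrative; in practice `results` would come from querying your vector DB once per chunking strategy.

```python
def recall_at_k(results, expected, k):
    """Fraction of queries whose expected document appears in the top-k results.

    results:  {query: [doc_id, ...]} doc ids of retrieved chunks, best first
    expected: {query: doc_id} the document each query should surface
    """
    if not expected:
        return 0.0
    hits = sum(
        1 for query, doc_id in expected.items()
        if doc_id in results.get(query, [])[:k]
    )
    return hits / len(expected)

# Hypothetical evaluation set and retrieval output for one chunking strategy.
expected = {"q1": "docA", "q2": "docB", "q3": "docC", "q4": "docD"}
results = {
    "q1": ["docA", "docX"],
    "q2": ["docX", "docB"],
    "q3": ["docX", "docY"],
    "q4": ["docD"],
}

recall_1 = recall_at_k(results, expected, k=1)
recall_2 = recall_at_k(results, expected, k=2)
```

For this toy data `recall_1` is 0.5 and `recall_2` is 0.75. Running the same function over the results of each chunking strategy gives directly comparable numbers.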

---

## Related resources

- [What is RAG?](/glossary/what-is-rag)
- [Regulatory Intelligence RAG example](/docs/examples/regulatory-intelligence-rag)
- [What is a text embedding model?](/glossary/what-is-a-text-embedding-model)
- [What is semantic search?](/glossary/what-is-semantic-search)
- [Browse embedding models on SIE](/models)
