What is RAG (Retrieval-Augmented Generation)?

Retrieval-Augmented Generation (RAG) is an AI architecture that combines a retrieval system with a large language model (LLM). Instead of relying solely on the LLM’s training data, RAG first retrieves relevant documents from a knowledge base and passes them as context to the LLM, producing answers that are grounded in up-to-date, domain-specific information.

Why does RAG matter?

LLMs have two fundamental limitations when used alone:

Knowledge cutoff: they only know what was in their training data
Hallucination: they generate plausible-sounding but incorrect information when they don’t know the answer

RAG addresses both. By retrieving real documents at query time and passing them to the LLM, you ground the model’s output in your actual knowledge base. Instead of answering from memory alone, the model works from information you’ve given it, which reduces hallucination but does not eliminate it.

This makes RAG a widely used architecture for enterprise AI applications: customer support bots, document Q&A, legal research tools, and internal knowledge assistants.

How does RAG work?

A RAG pipeline has three main stages:

1. Indexing (offline)

Split documents into chunks
Encode each chunk into a vector embedding using an embedding model
Store chunks and their vectors in a vector database
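
The indexing steps above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than SIE’s API: `toy_embed` is a hypothetical stand-in for a real embedding model (it only hashes words into a fixed-size vector), and a plain Python list stands in for the vector database.

```python
import hashlib
import math


def split_into_chunks(text: str, chunk_size: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(documents: list[str], chunk_size: int = 50) -> list[dict]:
    """Chunk every document, embed each chunk, and store text and vector together."""
    index = []
    for doc_id, doc in enumerate(documents):
        for chunk in split_into_chunks(doc, chunk_size):
            index.append({"doc_id": doc_id, "text": chunk, "vector": toy_embed(chunk)})
    return index


# Tiny chunk size so the example produces more than one chunk per document.
index = build_index(
    [
        "RAG retrieves documents before generating an answer.",
        "Fine-tuning changes the weights of the model.",
    ],
    chunk_size=5,
)
```

In a real pipeline, the embedding call and the list would be replaced by your embedding model and vector database; the shape of the loop stays the same.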

2. Retrieval (at query time)

Encode the user’s query into a vector
Search the vector database for the most relevant chunks
Optionally rerank the top-k results for higher precision
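
To make the retrieval step concrete, here is a small sketch of a brute-force top-k search using cosine similarity. The three-dimensional vectors are hand-written for illustration only; production vector databases typically use approximate nearest-neighbour indexes rather than scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query_vector: list[float],
    index: list[tuple[str, list[float]]],
    top_k: int = 3,
) -> list[tuple[float, str]]:
    """Score every stored vector against the query and keep the best `top_k`."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Illustrative (text, vector) pairs; real vectors come from an embedding model.
index = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.0]),
    ("office locations", [0.0, 0.1, 0.9]),
]
results = search([1.0, 0.2, 0.0], index, top_k=2)
```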

3. Generation

Pass the retrieved chunks + the user’s query to an LLM as context
The LLM generates an answer grounded in the retrieved documents
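
The generation step mostly comes down to how the prompt is assembled. The sketch below shows one plausible layout; the wording of the instruction and the `build_prompt` helper are assumptions for illustration, and the resulting string would be sent to whichever LLM you use.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user's question into a single prompt."""
    # Number the chunks so the model (and the reader) can refer back to them.
    numbered = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 3 to 5 days."],
)
```

Telling the model to admit when the context is insufficient is a common way to discourage it from falling back on guesses.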

User query
    │
    ▼
[Embedding model] ──► query vector
    │
    ▼
[Vector DB] ──► top-k relevant chunks
    │
    ▼
[Optional: Reranker] ──► reordered chunks
    │
    ▼
[LLM] ──► grounded answer

How does SIE fit into a RAG pipeline?

SIE handles the inference-heavy steps: embedding and reranking. Both are computationally expensive at scale, and per-token pricing from managed APIs adds up quickly as volume grows.

from sie_sdk import SIEClient
from sie_sdk.types import Item

client = SIEClient("http://localhost:8080")

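# Assumes document_chunks (list of str), user_query (str), and vector_db
# (your vector database client) are defined elsewhere.

# Indexing: encode document chunks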
chunk_results = client.encode(
    "BAAI/bge-m3",
    [Item(text=c) for c in document_chunks],
)
chunk_vectors = [r["dense"] for r in chunk_results]

# Retrieval: encode query
query_vector = client.encode("BAAI/bge-m3", Item(text=user_query), is_query=True)["dense"]

# Retrieve 50 candidates from the vector DB (each assumed to expose .text),
# then rerank them against the query
candidates = vector_db.search(query_vector, top_k=50)
score_result = client.score(
    "BAAI/bge-reranker-v2-m3",
    Item(text=user_query),
    [Item(id=str(i), text=c.text) for i, c in enumerate(candidates)],
)

Because SIE is self-hosted in your AWS or GCP environment, your documents never leave your infrastructure during embedding and reranking, which is critical for regulated industries.

RAG vs fine-tuning

Aspect                 RAG                        Fine-tuning
Knowledge updates      Real-time (re-index)       Requires retraining
Domain adaptation      Via retrieval              Baked into weights
Cost to update         Low                        High
Handles private data   ✓                          Risky
Best for               Dynamic knowledge bases    Consistent style/behaviour

For most enterprise use cases, RAG is the right choice. Fine-tuning is complementary for adapting model tone or behaviour, not for keeping knowledge current.

Common RAG failure modes

Poor retrieval quality: if the wrong chunks are retrieved, the LLM has bad context. Fix with better embedding models, hybrid search, or reranking.
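
One common way to implement hybrid search is to run a keyword search and a vector search separately and merge the two rankings. The sketch below uses reciprocal rank fusion (RRF) for the merge; this is one technique among several, the chunk IDs are made up, and `k = 60` is simply the commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs; each list contributes 1 / (k + rank) per item."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


keyword_hits = ["c3", "c1", "c7"]  # hypothetical keyword-search ranking
vector_hits = ["c1", "c4", "c3"]   # hypothetical vector-search ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Chunks that appear near the top of both lists rise to the top of the fused ranking, which is why `c1` ends up ahead of `c3` here.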

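Chunks too large or too small: large chunks dilute the relevant passage with unrelated text, while small chunks lose the surrounding context needed to interpret them. Overlapping chunks of 256 to 512 tokens work well for most document types, but experiment with the chunking strategy for your domain.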
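
A sliding window is a simple way to produce overlapping chunks. The sketch below works on a pre-tokenised list; the default sizes and the `chunk_with_overlap` helper are illustrative assumptions, and a real pipeline would count tokens with the tokenizer of its embedding model.

```python
def chunk_with_overlap(
    tokens: list[str], chunk_size: int = 256, overlap: int = 32
) -> list[list[str]]:
    """Split tokens into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Small numbers so the overlap is easy to see.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
```

With a chunk size of 4 and an overlap of 1, each chunk starts with the last token of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk more often.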

Context window overflow: passing too many chunks can exceed the LLM’s context limit. Reranking lets you keep only the most relevant chunks.
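
Once chunks are ranked, staying inside the context window is a budgeting problem. The sketch below keeps the highest-ranked chunks until a token budget is used up; `estimate_tokens` is a deliberately crude assumption (whitespace word count) and should be swapped for your LLM’s tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: number of whitespace-separated words."""
    return len(text.split())


def select_within_budget(ranked_chunks: list[str], max_tokens: int) -> list[str]:
    """Keep the highest-ranked chunks, in order, until the token budget is used up."""
    selected = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected


ranked = ["a b c", "d e f g", "h i"]  # best chunk first
selected = select_within_budget(ranked, max_tokens=8)
```

Stopping at the first chunk that does not fit preserves the ranking order; skipping it and trying smaller chunks further down is a reasonable alternative when chunk sizes vary a lot.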

Outdated index: re-index frequently updated documents on a schedule to keep retrieval fresh.
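
A scheduled re-index does not have to re-embed everything. One possible approach, sketched below, is to store a content hash per document and only re-embed documents whose hash has changed; the document IDs and the `plan_reindex` helper are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    documents: dict[str, str], indexed_hashes: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Return (IDs to embed again or for the first time, IDs to delete from the index)."""
    to_index = sorted(
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in documents)
    return to_index, to_delete


# Hashes recorded at the last indexing run (hypothetical documents).
indexed_hashes = {
    "faq": content_hash("old faq text"),
    "pricing": content_hash("pricing v1"),
    "legacy": content_hash("retired page"),
}
# Current state of the knowledge base.
documents = {"faq": "new faq text", "pricing": "pricing v1", "handbook": "fresh content"}
to_index, to_delete = plan_reindex(documents, indexed_hashes)
```

Here only the changed `faq` and the new `handbook` need embedding, while `legacy` should be removed so that stale chunks are no longer retrieved.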

Frequently asked questions

What’s the difference between RAG and semantic search? Semantic search retrieves relevant documents. RAG takes that a step further by passing those documents to an LLM to generate a synthesised answer.

Do I need a GPU to run RAG? It depends on the stage. The LLM step may run on CPU or require a GPU, depending on the size of the model and how it is served. For the embedding and reranking steps, SIE uses GPUs efficiently via batching, and supports spot GPUs to reduce cost.

What vector databases work with SIE for RAG? SIE integrates with the vector databases Qdrant, Weaviate, and Chroma, and with the Haystack framework, out of the box.