What is RAG (Retrieval-Augmented Generation)?
Retrieval-Augmented Generation (RAG) is an AI architecture that combines a retrieval system with a large language model (LLM). Instead of relying solely on the LLM’s training data, RAG first retrieves relevant documents from a knowledge base and passes them as context to the LLM — producing answers that are grounded in up-to-date, domain-specific information.
Why does RAG matter?
LLMs have two fundamental limitations when used alone:
- Knowledge cutoff — they only know what was in their training data
- Hallucination — they generate plausible-sounding but incorrect information when they don’t know the answer
RAG mitigates both. By retrieving real documents at query time and passing them to the LLM, you ground the model’s output in your actual knowledge base. Instead of relying only on what it memorised during training, the model works from information you’ve given it, although it can still misread or ignore that context.
This makes RAG the standard architecture for enterprise AI applications: customer support bots, document Q&A, legal research tools, and internal knowledge assistants.
How does RAG work?
A RAG pipeline has three main stages:
1. Indexing (offline)
- Split documents into chunks
- Encode each chunk into a vector embedding using an embedding model
- Store chunks and their vectors in a vector database
2. Retrieval (at query time)
- Encode the user’s query into a vector
- Search the vector database for the most relevant chunks
- Optionally rerank the top-k results for higher precision
3. Generation
- Pass the retrieved chunks + the user’s query to an LLM as context
- The LLM generates an answer grounded in the retrieved documents
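The three stages above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not a production pipeline: `embed` is a hypothetical stand-in (normalised word counts) for a real embedding model, the "vector database" is an in-memory list, and `build_prompt` only assembles the text that would be sent to an LLM. All function names are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: L2-normalised word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two normalised sparse vectors (equals cosine similarity)."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def index(chunks: list[str]) -> list[tuple[str, dict[str, float]]]:
    """Stage 1 (offline): store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, store: list[tuple[str, dict[str, float]]], top_k: int = 2) -> list[str]:
    """Stage 2 (query time): rank stored chunks by similarity to the query."""
    query_vector = embed(query)
    scored = [(cosine(query_vector, vector), chunk) for chunk, vector in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Stage 3: assemble the context and question that would be sent to an LLM."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


store = index([
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Shipping to Europe takes 3 to 5 business days.",
])
question = "How long do refunds take?"
top_chunks = retrieve(question, store, top_k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In a real system, `embed` would call an embedding model and `store` would be a vector database; the control flow stays the same.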

    User query
        │
        ▼
    [Embedding model] ──► query vector
        │
        ▼
    [Vector DB] ──► top-k relevant chunks
        │
        ▼
    [Optional: Reranker] ──► reordered chunks
        │
        ▼
    [LLM] ──► grounded answer

How does SIE fit into a RAG pipeline?
SIE handles the inference-heavy steps: embedding and reranking. Both are computationally expensive at scale, and managed API pricing (per-token) becomes costly quickly.

    from sie_sdk import SIEClient
    from sie_sdk.types import Item

    # document_chunks, user_query and vector_db are assumed to be defined elsewhere
    client = SIEClient("http://localhost:8080")

    # Indexing: encode document chunks
    chunk_results = client.encode(
        "BAAI/bge-m3",
        [Item(text=c) for c in document_chunks],
    )
    chunk_vectors = [r["dense"] for r in chunk_results]

    # Retrieval: encode query
    query_vector = client.encode("BAAI/bge-m3", Item(text=user_query), is_query=True)["dense"]

    # Retrieve from vector DB, then rerank
    candidates = vector_db.search(query_vector, top_k=50)
    score_result = client.score(
        "BAAI/bge-reranker-v2-m3",
        Item(text=user_query),
        [Item(id=str(i), text=c.text) for i, c in enumerate(candidates)],
    )

Because SIE is self-hosted in your AWS or GCP environment, your documents never leave your infrastructure, which is critical for regulated industries.
RAG vs fine-tuning
| | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Real-time (re-index) | Requires retraining |
| Domain adaptation | Via retrieval | Baked into weights |
| Cost to update | Low | High |
| Handles private data | ✓ | Risky |
| Best for | Dynamic knowledge bases | Consistent style/behaviour |
For most enterprise use cases, RAG is the right choice. Fine-tuning is complementary for adapting model tone or behaviour, not for keeping knowledge current.
Common RAG failure modes
Poor retrieval quality — if the wrong chunks are retrieved, the LLM has bad context. Fix with better embedding models, hybrid search, or reranking.
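Hybrid search needs a way to merge a keyword ranking with a vector ranking. One common option is reciprocal rank fusion (RRF), sketched below. The function name, the document ids, and the constant `k=60` (a conventional default) are assumptions for illustration; the source does not prescribe a particular fusion method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of ids; each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical keyword (e.g. BM25) ranking
vector_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical embedding-search ranking
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear near the top of both lists rise to the top of the fused list, without needing to compare raw scores from the two systems.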
Chunks too large or too small — overlapping chunks of 256–512 tokens work well for most document types. Experiment with chunking strategy for your domain.
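A minimal sketch of overlapping chunking, assuming the text has already been split into tokens. The function name and the default overlap of 32 tokens are assumptions for the example; a real pipeline would count tokens with the embedding model's own tokenizer rather than a plain list.

```python
def chunk_tokens(tokens: list[str], size: int = 256, overlap: int = 32) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some tokens twice.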
Context window overflow — passing too many chunks exceeds the LLM’s context limit. Reranking lets you select only the most relevant chunks.
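One way to apply this is to keep the highest-scoring chunks until a token budget is used up. The sketch below assumes relevance scores are available as a list of floats aligned with the chunks, and it estimates token counts by splitting on whitespace; both are simplifications, and the function name is hypothetical.

```python
def select_within_budget(chunks: list[str], scores: list[float], max_tokens: int) -> list[str]:
    """Keep the highest-scoring chunks whose combined size fits in max_tokens."""
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    selected: list[str] = []
    used = 0
    for _, chunk in ranked:
        cost = len(chunk.split())  # crude token estimate: whitespace-separated words
        if used + cost > max_tokens:
            continue  # this chunk does not fit, but a smaller one further down may
        selected.append(chunk)
        used += cost
    return selected


chunks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota", "kappa"]
scores = [0.2, 0.9, 0.7, 0.4]
context = select_within_budget(chunks, scores, max_tokens=5)
print(context)
```

Leave headroom in `max_tokens` for the system prompt, the user's question, and the answer the model will generate.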
Outdated index — re-index frequently updated documents on a schedule to keep retrieval fresh.
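A scheduled re-index job does not have to re-embed everything. A minimal sketch, assuming you store a content hash per document at indexing time: compare the stored hashes with the current text and re-embed only what changed. The function names and the example documents are hypothetical, and handling of deleted documents is left out.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(documents: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or whose text changed since indexing."""
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )


# Hashes recorded the last time the index was built
indexed_hashes = {
    "faq": content_hash("Returns are accepted within 30 days."),
    "pricing": content_hash("Plans start at 10 USD per month."),
}
# Current state of the knowledge base
documents = {
    "faq": "Returns are accepted within 60 days.",   # edited
    "pricing": "Plans start at 10 USD per month.",   # unchanged
    "terms": "These terms apply to all customers.",  # new
}
stale = find_stale(documents, indexed_hashes)
print(stale)
```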
Frequently asked questions
What’s the difference between RAG and semantic search? Semantic search retrieves relevant documents. RAG takes that a step further by passing those documents to an LLM to generate a synthesised answer.
Do I need a GPU to run RAG? It depends on the stage. The LLM step may run on CPU or need a GPU, depending on the model and where it is hosted. For the embedding and reranking steps, SIE uses GPUs efficiently via batching and supports spot GPUs to reduce cost.
What vector databases work with SIE for RAG? SIE integrates out of the box with the vector databases Qdrant, Weaviate, and Chroma, as well as the Haystack framework.