Building an Agentic NLQ System for Real Estate Search

SIE embeddings and Qdrant retrieval behind a GPT-4 router: cross-encoder reranking, hard filters, and five agent tools for natural language real estate search.

Improving RAG with RAPTOR

How hierarchical cluster-embedding chunking with RAPTOR improves RAG retrieval over vanilla chunking, with a step-by-step implementation and a note on serving embeddings in production with SIE.

Evaluating Retrieval Augmented Generation using RAGAS

Part two of our RAG evaluation series: building synthetic eval datasets with RAGAS, interpreting faithfulness and retrieval metrics, and mapping results to inference and serving concerns.

An Evaluation of Retrieval Chunking Methods for Inference Systems

We benchmarked LlamaIndex and LangChain chunkers, MTEB embedding models, ColBERT v2, and rerankers on HotpotQA, SQuAD, and QuAC, and discuss what the results mean for inference-heavy retrieval stacks.

Semantic Chunking

Explore semantic chunking for RAG: embedding similarity, hierarchical clustering, and LLM-based methods, with code, evaluation on HotpotQA and SQuAD, and BAAI/bge-small-en-v1.5.
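To make the embedding-similarity approach concrete, here is a minimal, self-contained sketch: group consecutive sentences and start a new chunk wherever similarity to the previous sentence drops below a threshold. The `embed` function below is a deliberately simple bag-of-words stand-in so the example runs anywhere; a real pipeline would substitute a sentence-embedding model such as bge-small-en-v1.5, and the threshold value is illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors (dicts)."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def embed(sentence):
    # Stand-in embedding: bag-of-words counts. Swap in a real
    # sentence-embedding model for production use.
    vec = {}
    for word in sentence.lower().split():
        word = word.strip(".,!?")
        vec[word] = vec.get(word, 0) + 1
    return vec

def semantic_chunks(sentences, threshold=0.2):
    """Split a list of sentences into chunks at similarity dips."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(sent)) < threshold:
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks

sents = [
    "Vector databases store embeddings.",
    "Embeddings are dense vector representations.",
    "The cat slept on the warm windowsill.",
]
chunks = semantic_chunks(sents)
# The two related sentences stay together; the unrelated one
# starts a new chunk.
```

The same split-on-dip logic carries over unchanged when the bag-of-words vectors are replaced by dense model embeddings; only `embed` and the threshold need tuning.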

A Practical Guide for Choosing a Vector Database

Key considerations and trade-offs for picking a vector database that fits your architecture, scale, and operational limits.

Optimizing RAG with Hybrid Search & Reranking

How combining keyword search, vector search, and semantic reranking improves RAG retrieval precision and recall.
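One common way to combine a keyword ranking with a vector ranking, before any semantic reranking, is reciprocal rank fusion (RRF). The sketch below is illustrative, not the post's exact method: document IDs are hypothetical, and `k=60` is the conventional damping constant from the RRF literature.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    rankings: list of lists, each ordered best-first.
    k: damping constant; 60 is a common default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1 / (k + position); documents
            # ranked highly by multiple retrievers accumulate score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for one query from two retrievers:
bm25_hits = ["d1", "d2", "d3"]   # keyword (BM25) ranking
vector_hits = ["d3", "d1", "d4"]  # vector-search ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# → ["d1", "d3", "d2", "d4"]
```

The fused list would then typically be passed to a cross-encoder reranker, which rescores only this small candidate set rather than the whole corpus.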

Vector Embeddings in the Browser

Build AI apps that generate and compare vector embeddings directly in your browser using TensorFlow.js. No backend required.

Self-hosted inference for search & document processing

Cut API costs by 50x, boost quality with 85+ SOTA models, and keep your data in your own cloud.


Contact us

Tell us about your use case and we'll get back to you shortly.