Pure vector/semantic search often "dilutes" precise identifiers such as acronyms, codes, and names into the surrounding semantic context of the embedding, so exact-match queries fail.
- Embedding Models Prioritize Meaning Over Precision -
Semantic embeddings treat "AWS S3" and "cloud storage" as similar concepts rather than distinct entities, and product codes like "SKU-1234" carry no semantic content for the model to distinguish them by.
- Tokenization Breaks Apart Important Terms -
Acronyms and codes get split unpredictably ("COVID-19" → ["COVID", "-", "19"]), causing mismatches. Different tokenization between indexing and querying means exact terms simply won't be found.
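To see the fragmentation concretely, here is a minimal sketch using a Hugging Face tokenizer; the exact sub-word splits shown in the comment are an assumption and vary by model vocabulary:

```python
# Sketch: how sub-word tokenizers fragment identifiers and codes.
# Requires the `transformers` package; splits differ per model vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for term in ["COVID-19", "SKU-1234", "AWS S3"]:
    # Terms outside the vocabulary break into sub-word pieces,
    # e.g. something like ["co", "##vid", "-", "19"], not one token.
    print(term, "->", tokenizer.tokenize(term))
```

If the query side tokenizes differently than the indexing side, those pieces never line up and the exact term becomes unfindable.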
- Wrong Tool for the Job -
RAG systems excel at semantic similarity, but acronyms and identifiers need exact lexical matching. Searching for "Dr. John" shouldn't return "Dr. Jones" just because they're semantically similar as doctors.
The production fix is hybrid search, which blends both signals into one weighted score (a runnable sketch follows this list):
- H = (1 − α)·K + α·V, where
- K is the keyword (BM25) score,
- V is the vector (semantic) score, and
- α sets the blend: use values between 0.3 and 0.5 when your corpus contains many technical terms, abbreviations, or unique identifiers. Normalize K and V to a common range (e.g. min-max per query) before blending, since raw BM25 scores are unbounded while cosine similarities are not.
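Here is a minimal sketch of the blended scorer, assuming `rank_bm25` for the keyword side and precomputed embeddings for the vector side; `hybrid_scores`, `minmax`, and their parameters are hypothetical names for illustration, not a specific library API:

```python
# Sketch: hybrid scoring H = (1 - alpha) * K + alpha * V.
# Assumes `rank_bm25` and `numpy`; embeddings are supplied by the caller.
import numpy as np
from rank_bm25 import BM25Okapi

def minmax(scores: np.ndarray) -> np.ndarray:
    """Rescale to [0, 1] so BM25 and cosine scores are comparable."""
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def hybrid_scores(query: str, docs: list[str], doc_vecs: np.ndarray,
                  query_vec: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """Return one blended score per document."""
    # K: exact lexical matching via BM25 over simple whitespace tokens.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    k = minmax(bm25.get_scores(query.lower().split()))
    # V: cosine similarity between the query and document embeddings.
    v = doc_vecs @ query_vec
    v /= np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    v = minmax(v)
    return (1 - alpha) * k + alpha * v
```

Rebuilding the BM25 index on every query is fine for a sketch; in production you would index the corpus once and reuse it.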
Stack Overflow discovered this when their semantic-only search missed the exact code snippets users were looking for; their hybrid approach now surfaces both the exact code AND semantically similar solutions. For domains heavy in technical jargon or product codes, weight keyword search higher (α = 0.3) so precise matches don't get lost, as in the hypothetical usage below.
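A hypothetical usage of the sketch above, with random placeholder embeddings so it runs end to end (a real system would use model embeddings):

```python
# Hypothetical usage: a low alpha lets exact lexical matches dominate.
docs = ["Upload objects to AWS S3 buckets",
        "Cloud storage pricing overview",
        "SKU-1234 restock procedure"]
# Placeholder vectors stand in for a real embedding model.
rng = np.random.default_rng(0)
doc_vecs, query_vec = rng.normal(size=(3, 8)), rng.normal(size=8)

scores = hybrid_scores("SKU-1234", docs, doc_vecs, query_vec, alpha=0.3)
print(docs[int(np.argmax(scores))])  # the exact SKU match wins at alpha=0.3
```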