Retrieval-augmented generation (RAG) has shown great promise for powering conversational AI. However, in most RAG systems today, a single model handles the full workflow of query analysis, passage retrieval, contextual ranking, summarization, and prompt augmentation. This results in suboptimal relevance, latency, and coherence. A multi-agent architecture that factors responsibilities across specialized retrieval, ranking, reading, and orchestration agents, operating asynchronously, allows each agent to focus on its specialized capability using custom models and data. Multi-agent RAG is thus able to improve relevance, latency, and coherence overall.
While multi-agent RAG is not a panacea – for simpler conversational tasks a single RAG agent may suffice – it outperforms single-agent RAG when your use case requires reasoning over diverse information sources. This article explores a multi-agent RAG architecture and the benefits it delivers over a single-agent approach.
Retrieval-augmented generation faces several key challenges that limit its performance in real-world applications.
First, existing retrieval mechanisms struggle to identify the most relevant passages from corpora containing millions of documents. Simple similarity functions often return superfluous or tangential results. When retrieval fails to return the most relevant information, it leads to suboptimal prompting.
Second, retrieving supplementary information introduces latency; if the database is large, this latency can be prohibitive. Searching terabytes of text with complex ranking creates wait times that are too long for consumer applications.
In addition, current RAG systems fail to appropriately weight the original prompt and retrieved passages. Without dynamic contextual weighting, the model can become over-reliant on retrieved passages, losing control and adaptability when generating meaningful responses.
Specialized agents with divided responsibilities can help address the challenges that plague single-agent architectures, and unlock RAG's full potential. By factoring RAG into separable subtasks executed concurrently by collaborative and specialized query understanding, retriever, ranker, reader, and orchestrator agents, multi-agent RAG can mitigate single-agent RAG's relevance, scalability, and latency limitations. This allows RAG to scale efficiently to enterprise workloads.
Let's break multi-agent RAG into its parts:
First, a query understanding / parsing agent interprets the query, breaking it down into sub-queries.
Then several retriever agents, each utilizing optimized indices, focus solely on efficient passage retrieval from the document corpus, based on the sub-queries. These retriever agents employ vector similarity search or knowledge graph retrieval to quickly find potentially relevant passages, minimizing latency even when document corpora are large.
The ranker agent evaluates the relevance of the retrieved passages using additional ranking signals like source credibility, passage specificity, and lexical overlap. This provides a relevance-based filtering step. This agent might, for example, use an ontology to rerank retrieved information.
The reader agent summarizes lengthy retrieved passages into succinct snippets containing only the most salient information. This distills the context down to key facts.
Finally, the orchestrator agent dynamically adjusts the relevance weighting and integration of the prompt and filtered, ranked context passages (i.e., prompt hybridization) to maximize coherence in the final augmented prompt.
Let's look at an implementation of multi-agent RAG, and then look under the hood of the agents that make it up, examining their logic, sequence, and possible optimizations.
Before going into the code snippets below, which use the Microsoft AutoGen library, some explanation of terms is in order:
AssistantAgent: The AssistantAgent is given a name, a system message, and a configuration object (llm_config). The system message is a string that describes the role of the agent. The llm_config object is a dictionary that contains functions for the agent to perform its role.
UserProxyAgent: A UserProxyAgent is given a name and several configuration options. The is_termination_msg option is a function that determines when the user wants to terminate the conversation. The human_input_mode option controls whether the agent asks a human for input (for example, "NEVER" means it never does). The max_consecutive_auto_reply option caps how many consecutive messages the agent will answer automatically without human input (for example, 10). The code_execution_config option contains configuration for executing code; it can be set to False to disable code execution, as in the example below.
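The llm_config used in the snippet below maps function names to mock implementations. A minimal sketch of what such placeholder functions might look like (all names, signatures, and return shapes here are illustrative assumptions, not AutoGen APIs):

```python
# Hypothetical placeholder implementations for the functions wired into llm_config.
# In a real system these would call a query parser, a vector index, a reranker, etc.

def mock_understand_query(query: str) -> list[str]:
    # Split the user query into sub-queries (illustrative only).
    return [q.strip() for q in query.split(" and ")]

def mock_retrieve_passages(subquery: str) -> list[str]:
    # Return candidate passages for a sub-query (stubbed).
    return [f"Passage about {subquery} (1)", f"Passage about {subquery} (2)"]

def mock_rank_passages(passages: list[str]) -> list[str]:
    # Order passages by a relevance heuristic (here: simply length).
    return sorted(passages, key=len, reverse=True)

def mock_summarize_passages(passages: list[str]) -> str:
    # Condense the top passages into a short snippet.
    return " ".join(p[:100] for p in passages[:3])

def mock_adjust_weighting(prompt: str, context: str) -> str:
    # Combine the original prompt with retrieved context (fixed template here).
    return f"{prompt}\n\nRelevant context:\n{context}"
```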
```python
import autogen

llm_config = {
    "understand_query": mock_understand_query,
    "retrieve_passages": mock_retrieve_passages,
    "rank_passages": mock_rank_passages,
    "summarize_passages": mock_summarize_passages,
    "adjust_weighting": mock_adjust_weighting,
}

# termination_msg is assumed to be defined elsewhere: a function that detects
# when the conversation should stop.
boss = autogen.UserProxyAgent(
    name="Boss",
    is_termination_msg=termination_msg,
    human_input_mode="TERMINATE",
    system_message="The boss who asks questions and gives tasks.",
    code_execution_config=False,  # we don't want to execute code in this case.
)
```
```python
# QueryUnderstandingAgent
query_understanding_agent = autogen.AssistantAgent(
    name="query_understanding_agent",
    system_message="You must use X function. You are only here to understand queries. You intervene first.",
    llm_config=llm_config,
)
```
The goal of the QueryUnderstandingAgent is to check each subquery and determine which retriever agent is best suited to handle it, based on database schema matching. For example, some subqueries may be better served by a vector database, and others by a knowledge graph database.
To implement the QueryUnderstandingAgent, we can create a SubqueryRouter component, which takes in two retriever agents — a VectorRetrieverAgent and a KnowledgeGraphRetrieverAgent.
When a subquery needs to be routed, the SubqueryRouter will check to see if the subquery matches the schema of the vector database using some keyword or metadata matching logic. If there is a match, it will return the VectorRetrieverAgent to handle the subquery. If there is no match for the vector database, the SubqueryRouter will next check if the subquery matches the schema of the knowledge graph database. If so, it will return the KnowledgeGraphRetrieverAgent instead.
The SubqueryRouter acts like a dispatcher, distributing subquery work to the optimal retriever agent. This way, each retriever agent can focus on efficiently retrieving results from its respective databases without worrying about handling all subquery types.
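Here is a minimal sketch of such a SubqueryRouter, assuming hypothetical VectorRetrieverAgent and KnowledgeGraphRetrieverAgent instances, with simple keyword matching standing in for real schema matching:

```python
class SubqueryRouter:
    """Dispatches each subquery to the retriever agent whose data source fits best."""

    def __init__(self, vector_agent, kg_agent):
        self.vector_agent = vector_agent  # e.g. a VectorRetrieverAgent
        self.kg_agent = kg_agent          # e.g. a KnowledgeGraphRetrieverAgent
        # Illustrative keyword hints standing in for real schema/metadata matching.
        self.vector_keywords = {"describe", "summary", "similar", "about"}
        self.kg_keywords = {"who", "when", "related to", "how many"}

    def route(self, subquery: str):
        text = subquery.lower()
        if any(k in text for k in self.vector_keywords):
            return self.vector_agent
        if any(k in text for k in self.kg_keywords):
            return self.kg_agent
        # Default to vector search when nothing matches.
        return self.vector_agent
```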
This multi-agent modularity makes it easy to add more specialized retriever agents as needed for different databases or data sources.
We can create multiple retriever agents, each focused on efficient retrieval from a specific data source or using a particular technique. For example: a vector database retriever, a knowledge graph retriever, and a SQL retriever, as defined in the code below.
When subqueries are generated, we assign each one to the optimal retriever agent based on its content and the agents' capabilities. For example, a fact-based subquery may go to the KnowledgeGraphRetriever, while a broader subquery could use the VectorDBRetrieverAgent.
```python
retriever_agent_vector = autogen.AssistantAgent(
    name="retriever_agent_vector",
    system_message="You must use Y function. You are only here to retrieve passages using vector search. You intervene at the same time as other Retriever agents.",
    llm_config=llm_config_vector,
)

retriever_agent_kg = autogen.AssistantAgent(
    name="retriever_agent_kg",
    system_message="You must use Z function. You are only here to retrieve passages using knowledge graph. You intervene at the same time as other Retriever agents.",
    llm_config=llm_config_kg,
)

retriever_agent_sql = autogen.AssistantAgent(
    name="retriever_agent_sql",
    system_message="You must use A function. You are only here to retrieve passages using SQL. You intervene at the same time as other Retriever agents.",
    llm_config=llm_config_sql,
)
```
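The asyncio snippet below references an assign_agent helper. A possible sketch, in the spirit of the SubqueryRouter above (the routing keywords are illustrative assumptions):

```python
def assign_agent(subquery: str):
    """Pick the retriever agent best suited to a subquery (illustrative heuristic)."""
    text = subquery.lower()
    if any(k in text for k in ("who", "when", "related to")):
        return retriever_agent_kg       # entity/relationship questions -> knowledge graph
    if text.startswith(("select", "how many", "average")):
        return retriever_agent_sql      # aggregate/tabular questions -> SQL
    return retriever_agent_vector       # everything else -> vector similarity search
```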
To enable asynchronous retrieval, we use Python’s asyncio framework. When subqueries are available, we create asyncio tasks to run the assigned retriever agent for each subquery concurrently.
For example:
```python
import asyncio

async def retrieve_all(subqueries):
    retrieval_tasks = []
    for subquery in subqueries:
        agent = assign_agent(subquery)  # route the subquery to the best-suited retriever
        task = asyncio.create_task(agent.retrieve(subquery))
        retrieval_tasks.append(task)
    # Run all retriever agents concurrently and collect their results.
    return await asyncio.gather(*retrieval_tasks)
```
This allows all retriever agents to work in parallel instead of waiting for each one to finish before moving on to the next. Asynchronous retrieval returns passages far more quickly than single-agent retrieval.
The results from each agent can then be merged and ranked for the next stages.
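As a small illustration, the gathered results could be merged and de-duplicated with a helper like the following (a sketch; merge_results and the dedup-by-text strategy are assumptions, and other merge strategies are possible):

```python
def merge_results(results_per_agent):
    """Flatten per-agent result lists and drop duplicate passages, preserving order."""
    seen, merged = set(), []
    for passages in results_per_agent:
        for passage in passages:
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return merged

# e.g. merged = merge_results(await retrieve_all(subqueries))
```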
The ranker agents in a multi-agent retrieval system can be specialized using different ranking tools and techniques, such as lexical overlap scoring, source credibility and passage specificity signals, or ontology-based reranking.
```python
# RankerAgent
ranker_agent = autogen.AssistantAgent(
    name="ranker_agent",
    system_message="You must use B function. You are only here to rank passages. You intervene in third position.",
    llm_config=llm_config,
)
```
To optimize your ranker agents for accuracy, speed, and customization, identify which specialized techniques enhance ranking performance in which scenarios, then configure your agents accordingly.
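As an illustration of wiring such signals together, here is a hedged sketch of a scoring function that combines lexical overlap with a source-credibility prior; the weights, the credibility table, and the passage format are made up for the example:

```python
def rank_passages(query: str, passages: list[dict]) -> list[dict]:
    """Rank passages by a weighted mix of lexical overlap and source credibility."""
    # Hypothetical credibility prior per source; in practice this could come from
    # an ontology or a curated source list.
    credibility = {"internal_wiki": 1.0, "docs": 0.9, "forum": 0.5}
    query_terms = set(query.lower().split())

    def score(passage: dict) -> float:
        terms = set(passage["text"].lower().split())
        overlap = len(query_terms & terms) / max(len(query_terms), 1)
        return 0.7 * overlap + 0.3 * credibility.get(passage.get("source"), 0.3)

    return sorted(passages, key=score, reverse=True)
```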
To optimize your reader agent, we recommend that you:
By taking the above steps, you can ensure that:
```python
# ReaderAgent
reader_agent = autogen.AssistantAgent(
    name="reader_agent",
    system_message="You must use C function. You are only here to summarize passages. You intervene in fourth position.",
    llm_config=llm_config,
)
```
```python
# OrchestratorAgent
orchestrator_agent = autogen.AssistantAgent(
    name="orchestrator_agent",
    system_message="You must use D function. You are only here to adjust weighting. You intervene in last position.",
    llm_config=llm_config,
)
```
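To make the orchestrator's prompt-hybridization role more concrete, here is a minimal sketch of how the weighting and integration of the prompt and ranked context might work; the threshold, weights, and function name are illustrative assumptions, not a prescribed method:

```python
def build_augmented_prompt(user_prompt: str, ranked_snippets: list[tuple[str, float]]) -> str:
    """Blend the original prompt with ranked context, dropping low-relevance snippets."""
    # Keep only snippets above a relevance threshold; decide how much context to include
    # based on the average relevance of what survives (illustrative heuristic).
    kept = [(snippet, rel) for snippet, rel in ranked_snippets if rel >= 0.5]
    if not kept:
        return user_prompt  # fall back to the bare prompt rather than adding noise
    avg_relevance = sum(rel for _, rel in kept) / len(kept)
    max_snippets = 2 if avg_relevance < 0.7 else 5
    context = "\n".join(snippet for snippet, _ in kept[:max_snippets])
    return (
        f"{user_prompt}\n\n"
        f"Use the following context where it is relevant, but do not ignore the question:\n"
        f"{context}"
    )
```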
The OrchestratorAgent can leverage structured knowledge and symbolic methods to complement LLM reasoning where appropriate and produce answers that are highly accurate, contextual, and explainable. We recommend that you:
Finally, to facilitate communication and interactions among the participating agents, you need to create a group chat:
```python
# Create a group chat with all agents
groupchat = autogen.GroupChat(
    agents=[boss, query_understanding_agent, retriever_agent_vector, retriever_agent_kg,
            retriever_agent_sql, ranker_agent, reader_agent, orchestrator_agent],
    messages=[],
)

# Run the chat: the manager coordinates turn-taking among the agents,
# and the boss kicks things off with the user's question.
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
boss.initiate_chat(manager, message="Your question here")  # replace with the actual user query
```
Compared to single-agent RAG systems, the proposed multi-agent RAG architecture delivers significant benefits for conversational AI: more relevant retrieval, lower latency, and more coherent augmented prompts.
Overall, the multi-agent factored RAG system demonstrates substantial improvements in appropriateness, coherence, reasoning, and correctness over single-agent RAG baselines.