2026
Conference Paper
Title
Multi-agent Retrieval-Augmented Generation for Enhancing Answer Generation and Knowledge Retrieval
Abstract
Large language models (LLMs) have shown remarkable capabilities in natural language processing but often exhibit factual inconsistencies when applied to knowledge-intensive tasks, with hallucination rates as high as 30% in open-domain question answering. Retrieval-Augmented Generation (RAG) has emerged as a promising solution by coupling language generation with evidence retrieval. However, conventional RAG systems frequently suffer from noisy document retrieval, limited context coverage, and decreased faithfulness in generated outputs. To address these limitations, this paper introduces a novel architecture, Multi-Agent Retrieval-Augmented Generation (MA-RAG), which decomposes the reasoning process into a set of specialized agents responsible for query reformulation, iterative retrieval refinement, hallucination detection, and answer validation. The modular design enables dynamic coordination and layered decision-making across the retrieval and generation pipeline. We evaluate MA-RAG on three widely used QA benchmarks: SQuAD v1.1, SQuAD v2.0, and HotpotQA, under both realistic large-scale retrieval conditions and idealized filtered settings. System performance is assessed using five retrieval- and generation-focused metrics derived from the RAGAS framework: context precision, context recall, faithfulness, answer relevancy, and answer correctness. Additionally, we complement this evaluation with span-based metrics, including Exact Match (EM), F1, and BLEU scores, to capture surface-level overlap and fluency. MA-RAG consistently outperforms both Traditional RAG and Ensemble RAG across all datasets. Compared to Traditional RAG, it achieves up to 29.2% improvement in recall, 25.6% in precision, and 22.7% in correctness. Against Ensemble RAG, gains reach 25.9% in precision, 15.6% in recall, and 7.2% in correctness. On average, MA-RAG improves F1 by over 9 percentage points and BLEU by more than 11%, while nearly doubling EM scores on SQuAD v1.1.
These improvements highlight the robustness of the agentic framework under both noisy and clean retrieval environments. The empirical findings suggest that MA-RAG provides a scalable and interpretable pathway toward building more trustworthy and accurate AI systems for question answering and other knowledge-centric NLP applications.
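To make the agentic decomposition described in the abstract concrete, the sketch below shows one way such a pipeline could be wired together: separate agents for query reformulation, retrieval, answer generation, and hallucination/grounding checks, coordinated sequentially over a shared state. This is a minimal illustrative toy, not the paper's implementation; every agent here uses placeholder lexical logic (stand-ins for the LLM-driven components), and all names (`PipelineState`, `run_pipeline`, etc.) are hypothetical.

```python
# Illustrative sketch of an MA-RAG-style multi-agent pipeline.
# Each "agent" is a function that reads and updates a shared state;
# real agents would wrap LLM calls and a dense retriever.
from dataclasses import dataclass, field


@dataclass
class PipelineState:
    query: str
    documents: list = field(default_factory=list)
    answer: str = ""
    grounded: bool = False  # set by the hallucination-detection agent


def reformulate_query(state: PipelineState) -> PipelineState:
    # Toy reformulation agent: normalize casing and whitespace.
    state.query = " ".join(state.query.lower().split())
    return state


def retrieve(state: PipelineState, corpus: list) -> PipelineState:
    # Toy retrieval agent: keep documents sharing any query term.
    terms = set(state.query.split())
    state.documents = [d for d in corpus if terms & set(d.lower().split())]
    return state


def generate_answer(state: PipelineState) -> PipelineState:
    # Placeholder generation agent: echo the top retrieved document.
    state.answer = state.documents[0] if state.documents else "no answer found"
    return state


def detect_hallucination(state: PipelineState) -> PipelineState:
    # Toy validation agent: the answer must appear verbatim in the evidence.
    state.grounded = any(state.answer in doc for doc in state.documents)
    return state


def run_pipeline(query: str, corpus: list) -> PipelineState:
    # Simple sequential coordinator; a full system could loop the
    # retrieval and validation agents until the answer is grounded.
    state = PipelineState(query=query)
    state = reformulate_query(state)
    state = retrieve(state, corpus)
    state = generate_answer(state)
    state = detect_hallucination(state)
    return state
```

The sequential controller here stands in for the dynamic coordination the paper describes; the key design point it illustrates is that retrieval refinement and answer validation are separate, inspectable stages rather than one monolithic generate step.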
Conference