Under Copyright

Title: ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
Authors: Tran, Duong T.; Tran, Trung-Kien; Hauswirth, Manfred; Phuoc, Danh Le
Dates: 2025-08-05; 2025-07-28
URI: https://publica.fraunhofer.de/handle/publica/490174
DOI: https://doi.org/10.24406/publica-4990
DOI: 10.48550/arXiv.2507.16403
arXiv: 2507.16403v2
Language: en
Type: paper

Abstract: In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a low-cost framework capable of generating complex, multi-hop questions. We evaluated state-of-the-art VQA models on ReasonVQA, and the empirical results demonstrate that ReasonVQA poses significant challenges to these models, highlighting its potential for benchmarking and advancing the field of VQA. Additionally, our dataset can be easily scaled with respect to input images; the current version surpasses the largest existing datasets requiring external knowledge by more than an order of magnitude.