Increasing Interpretability in Outside Knowledge Visual Question Answering

Upravitelev, Max; Krauss, Christopher; Kuhlmann, Isabelle

doi:10.1007/978-3-031-63269-3_24

June 22, 2024

Conference Paper

Abstract

The field of Visual Question Answering (VQA) bridges vision- and language-based reasoning by combining scene understanding with the answering of arbitrary questions about a given image. The range of answerable questions is limited by the visual information contained in the image, but it can be expanded by drawing on external knowledge from different sources. Recently, the Outside Knowledge Visual Question Answering (OK-VQA) task was introduced to facilitate research in this field. Several current state-of-the-art solutions for this task incorporate Graph Neural Networks (GNNs). Like other neural network-based architectures, GNNs usually behave as black boxes. Interpretability of the reasoning behind GNN predictions is nevertheless a desirable property. Especially in the context of Knowledge Management within organizations, it can be important (and in some cases is required by law) to know how decisions made with the help of GNNs were reached. However, increasing interpretability can come at the cost of reduced overall model performance. This investigation concludes that this does not have to be the case in every scenario. It evaluates a GNN-based model developed for the OK-VQA task together with a selection of proposed updates to this model that are based on the attention mechanism. Furthermore, potential interpretation techniques that focus on the attention values are explored.
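
As a rough illustration of what interpreting a model through its attention values can look like, the following minimal sketch computes softmax-normalised dot-product attention from a question node to a few knowledge-graph neighbours and ranks the neighbours by weight. It is not the model evaluated in the paper: the function names (`attention_weights`, `explain`), the toy vectors, and the node labels are all hypothetical, and plain dot-product scoring is assumed purely for simplicity.

```python
import math


def attention_weights(query, neighbors):
    """Return softmax-normalised dot-product scores of `query` against each neighbour."""
    scores = {
        name: sum(q * v for q, v in zip(query, vec))
        for name, vec in neighbors.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exps = {name: math.exp(s - max_score) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def explain(query, neighbors, top_k=2):
    """Rank neighbours by attention weight; the top entries serve as a simple explanation."""
    weights = attention_weights(query, neighbors)
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical toy embeddings (assumption: 3-dimensional, hand-picked values).
question_node = [1.0, 0.0, 1.0]
knowledge_nodes = {
    "banana": [0.9, 0.1, 0.8],
    "yellow": [0.2, 0.9, 0.1],
    "fruit": [0.7, 0.0, 0.9],
}

weights = attention_weights(question_node, knowledge_nodes)
ranking = explain(question_node, knowledge_nodes)
```

Under these assumptions, the neighbours with the largest weights are the ones the aggregation step relied on most, which is the kind of signal an attention-based interpretation technique would inspect.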