Coverage of LLM Trustworthiness Metrics in the Current Tool Landscape

Authors: Helmer, Lennard; Stein, Benny Jörg; Ufer, Tim; Fernandes, Elanton; Abdelwahab, Hammam; Pareek, Abhinav; Woll, Joshua

Type: conference paper
Published: 2025-12-16 (available online: 2025-12-18)
ISSN: 1613-0073
URL: https://publica.fraunhofer.de/handle/publica/502219
DOI: 10.24406/publica-6891 (https://doi.org/10.24406/publica-6891)
License: CC BY 4.0
Language: English

Abstract: The increasing prevalence of AI systems built with Large Language Model (LLM) components raises the need for a dedicated tool stack to monitor such systems, covering training, development, and inference environments. Besides technical performance metrics such as latency and throughput, regulations like the EU AI Act require the monitoring of trustworthiness-related metrics such as fairness and transparency during operation. In this paper, we describe the results of an investigation conducted to gain an overview of the current landscape of LLM trustworthiness metrics and their coverage in monitoring tools. Based on an in-depth analysis of available catalogs and additional research, we identified 43 metrics and 23 tools. Furthermore, we highlight existing gaps and potential areas for further research. The results support practitioners and researchers in making informed decisions about the most appropriate tech stack for their AI systems.

Keywords: Large Language Models; Artificial Intelligence; Generative AI; Trustworthy AI; Responsible AI; MLOps