License: CC BY 4.0
Authors: Gana, Bady; Palma, Wenceslao; Lucay, Freddy A.; Missana, Cristóbal; Abarza, Carlos; Allende-Cid, Héctor
Dates: 2025-11-24; 2025-10-20
Handle: https://publica.fraunhofer.de/handle/publica/499582
DOI: https://doi.org/10.24406/publica-6483; 10.3390/math13213472; 10.24406/publica-6483
Scopus ID: 2-s2.0-105021564767

Abstract: The exponential growth of scientific literature demands scalable methods to evaluate large-language-model outputs beyond surface-level fluency. We present a two-phase framework that separates generation from evaluation: a retrieval-augmented generation system first produces candidate abstracts, which are then embedded into semantic co-occurrence graphs and assessed using seven robustness metrics from complex network theory. Two experiments were conducted. The first varied the model, embedding, and prompt configurations, revealing clear differences in performance; the best-performing family combined gemma-2b-it, a prompt inspired by Chain-of-Thought reasoning, and all-mpnet-base-v2, achieving the highest graph-based robustness. The second experiment refined the temperature setting for this family, identifying τ = 0.2 as optimal, which stabilized results (sd = 0.12) and improved robustness relative to retrieval baselines (ΔE_G = +0.08, Δρ = +0.55). While human evaluation was limited to a small set of abstracts, the results revealed a partial convergence between graph-based robustness and expert judgments of coherence and importance. Our approach contrasts with methods like GraphRAG and establishes a reproducible, model-agnostic pathway for the scalable quality control of LLM-generated scientific content.

Language: en
Keywords: RAG; complex networks; semantic graphs; weighted kappa; graph robustness
Title: Measuring Semantic Coherence of RAG-Generated Abstracts Through Complex Network Metrics
Type: journal article
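The evaluation pipeline the abstract describes — embedding generated text into a semantic co-occurrence graph and scoring it with network metrics — can be sketched as follows. This is a minimal illustration using networkx, not the paper's implementation; the window size and the two metrics shown (global efficiency and edge density) are illustrative stand-ins for the seven robustness metrics the authors use.

```python
import networkx as nx

def cooccurrence_graph(text, window=3):
    # Build a word co-occurrence graph: tokens are nodes, and an edge
    # links two tokens that appear within `window` positions of each other.
    tokens = text.lower().split()
    g = nx.Graph()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if v != w:
                g.add_edge(w, v)
    return g

def robustness_metrics(g):
    # Two simple robustness-style indicators from complex network theory:
    # global efficiency (average inverse shortest-path length, a proxy for
    # how well-connected the graph stays) and edge density.
    return {
        "global_efficiency": nx.global_efficiency(g),
        "density": nx.density(g),
    }

g = cooccurrence_graph("retrieval augmented generation produces candidate abstracts")
m = robustness_metrics(g)
```

Higher scores on such metrics indicate a more tightly connected semantic graph; the paper compares these scores between generated abstracts and retrieval baselines (e.g. the reported ΔE_G = +0.08).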