January 2026
Journal Article
Title

A benchmark of text embedding models for semantic harmonization of Alzheimer's disease cohorts

Abstract
Background: Harmonizing diverse healthcare datasets is challenging because naming conventions are inconsistent across sources. Manual harmonization is time- and resource-intensive, which limits scalability for multi-cohort Alzheimer’s disease research. Large language models, and text-embedding models in particular, offer a promising solution, but their rapid development necessitates continuous, domain-specific benchmarking, especially since established general-purpose benchmarks lack clinical data harmonization use cases.
Objectives: To evaluate how well different text-embedding models perform at harmonizing clinical variables.
Design and setting: We created a novel benchmark to assess how well embeddings from different language models can be used to harmonize cohort study metadata with an in-house Common Data Model that includes cohort-to-cohort mappings for a wide range of Alzheimer’s disease cohorts. We evaluated five state-of-the-art text-embedding models on seven datasets in the context of Alzheimer’s disease.
Participants: No patient data were used in any of the analyses, as the evaluation was based solely on semantic harmonization of cohort metadata.
Measurements: The analyses included text descriptions of variables from four modalities: clinical, lifestyle, demographic, and imaging.
Results: Our benchmark favored different models than general-purpose benchmarks do. This suggests that models fine-tuned for generic tasks may not translate well to real-world data harmonization, particularly in Alzheimer’s disease. We propose guidelines for formatting metadata to facilitate manual or model-assisted data harmonization. We introduce an open-source library (https://github.com/SCAI-BIO/ADHTEB) and an interactive leaderboard (https://adhteb.scai.fraunhofer.de) to aid future model benchmarking.
Conclusions: Our findings highlight the importance of domain-specific benchmarks for clinical data harmonization in the field of Alzheimer’s disease and motivate standards for naming conventions that may support semi-automated mapping applications in the future.
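As a rough illustration of the kind of task the abstract describes, the sketch below matches cohort variable descriptions to Common Data Model concepts by embedding similarity and scores the result. It is a minimal sketch and not the ADHTEB implementation: the `embed` function is a toy character-trigram stand-in for a real text-embedding model, top-1 accuracy is an assumed metric chosen for illustration, and all variable names, concept names, and descriptions are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a text-embedding model: counts of character trigrams."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top1_accuracy(cohort_vars: dict, cdm_concepts: dict, gold: dict) -> float:
    """Share of cohort variables whose most similar concept is the gold mapping."""
    concept_vecs = {name: embed(desc) for name, desc in cdm_concepts.items()}
    hits = 0
    for var, desc in cohort_vars.items():
        vec = embed(desc)
        best = max(concept_vecs, key=lambda name: cosine(vec, concept_vecs[name]))
        hits += best == gold[var]
    return hits / len(cohort_vars)


# Hypothetical Common Data Model concepts and cohort variable descriptions.
cdm_concepts = {
    "AGE": "Age of the participant at baseline",
    "MMSE": "Mini-Mental State Examination total score",
    "EDU": "Years of education",
}
cohort_vars = {
    "PTAGE": "participant age at baseline visit",
    "MMSCORE": "MMSE total score",
    "PTEDUCAT": "education in years",
}
# Hypothetical manually curated mappings used as ground truth.
gold = {"PTAGE": "AGE", "MMSCORE": "MMSE", "PTEDUCAT": "EDU"}

accuracy = top1_accuracy(cohort_vars, cdm_concepts, gold)
print(f"top-1 accuracy: {accuracy:.2f}")
```

Swapping `embed` for a call to an actual embedding model would keep the rest of the evaluation loop unchanged, which is what makes it possible to compare models on the same set of curated mappings.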
Author(s)
Adams, Tim  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Salimi, Yasamin  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Ay, Mehmet  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Valderrama Nino, Diego Felipe
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Jacobs, Marc  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Fröhlich, Holger  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Journal
The Journal of Prevention of Alzheimer's Disease  
Project(s)
Synthetic Data Generation Framework for Integrated Validation of Use Cases and AI Healthcare Applications  
Funder
European Commission  
Open Access
Rights
CC BY 4.0: Creative Commons Attribution
DOI
10.1016/j.tjpad.2025.100420
10.24406/publica-6643
Language
English
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Keyword(s)
  • Harmonization
  • Alzheimer’s disease
  • Text-embeddings
  • Large language models