January 2026
Journal Article
Title
A benchmark of text embedding models for semantic harmonization of Alzheimer's disease cohorts
Abstract
Background: Harmonizing diverse healthcare datasets is challenging because naming conventions are inconsistent across sources. Manual harmonization is time- and resource-intensive, which limits scalability for multi-cohort Alzheimer’s disease research. Large language models, and text-embedding models in particular, offer a promising solution, but their rapid development necessitates continuous, domain-specific benchmarking, especially since established general-purpose benchmarks lack clinical data harmonization use cases.
Objectives: To evaluate how different text-embedding models perform for the harmonization of clinical variables.
Design and setting: We created a novel benchmark to assess how well embeddings from different language models can be used to harmonize cohort study metadata with an in-house Common Data Model that includes cohort-to-cohort mappings for a wide range of Alzheimer’s disease cohorts. We evaluated five state-of-the-art text-embedding models on seven data sets in the context of Alzheimer’s disease.
Participants: No patient data were used in any of the analyses, as the evaluation was based on semantic harmonization of cohort metadata only.
Measurements: Text descriptions of variables from different modalities were included in the analyses, namely clinical, lifestyle, demographic, and imaging variables.
Results: Our benchmark favored different models than general-purpose benchmarks do. This suggests that models fine-tuned for generic tasks may not translate well to real-world data harmonization, particularly in Alzheimer’s disease. We propose guidelines for formatting metadata to facilitate manual or model-assisted data harmonization. We introduce an open-source library (https://github.com/SCAI-BIO/ADHTEB) and an interactive leaderboard (https://adhteb.scai.fraunhofer.de) to aid future model benchmarking.
Conclusions: Our findings highlight the importance of domain-specific benchmarks for clinical data harmonization in the field of Alzheimer’s disease and motivate standards for naming conventions that may support semi-automated mapping applications in the future.
Author(s)
Valderrama Nino, Diego Felipe
Open Access
Rights
CC BY 4.0: Creative Commons Attribution
Language
English