2024
Conference Paper
Title

INDEX: the Intelligent Data Steward Toolbox

Title Supplement
Utilizing Large Language Model Embeddings for Automated Data Harmonization
Abstract
The data steward, who oversees data management, plays a pivotal role in evidence-based medicine by ensuring the quality, integrity, and accessibility of data throughout its lifecycle. Managing medical data is challenging, however: it involves diverse structured and unstructured data from various sources in different formats, and curating it demands significant time and resources. To ease this burden and make data stewards more efficient, we introduce a novel data stewardship tool and curation workflow that uses Large Language Models (LLMs). We evaluated our approach on automatic pairwise cohort harmonization using the data dictionaries of 6 Parkinson’s Disease (PD) studies and 13 Alzheimer’s Disease (AD) studies, and on a mapping task covering more than 38,000 ICD-10 codes with code descriptions obtained from UK Biobank. Compared with a string-matching baseline that does not capture the context of variable descriptions, mappings based on Generative Pre-trained Transformer (GPT) embeddings performed significantly better, reaching a best average accuracy of 82% for the automated initial closest match in PD cohort harmonization. Although differences in formulation and wording meant that descriptions could not be matched automatically in all cases, we are confident that our tool can significantly ease the data steward's work in a semi-automatic fashion.
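The "automated initial closest match" described in the abstract can be illustrated with a minimal sketch. This is not the INDEX implementation. The three-dimensional vectors are hand-made stand-ins for real LLM embeddings, the variable names and descriptions are invented for illustration, and `difflib.SequenceMatcher` is only one possible string-matching baseline, since the abstract does not say which one the authors used. All function names are hypothetical.

```python
import math
from difflib import SequenceMatcher


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_by_embedding(source_vec, target_vecs):
    """Return the target variable whose embedding is most similar to source_vec."""
    return max(
        target_vecs,
        key=lambda name: cosine_similarity(source_vec, target_vecs[name]),
    )


def closest_by_string(source_text, target_texts):
    """Baseline: return the target variable whose description is the closest
    character-level match, ignoring case."""
    return max(
        target_texts,
        key=lambda name: SequenceMatcher(
            None, source_text.lower(), target_texts[name].lower()
        ).ratio(),
    )


def accuracy(predicted, expected):
    """Fraction of source variables whose predicted match equals the expected one."""
    hits = sum(1 for key in expected if predicted.get(key) == expected[key])
    return hits / len(expected)


# Hypothetical target data dictionary (variable name -> description).
TARGET_TEXTS = {
    "AGE": "Age at baseline visit",
    "SEX": "Biological sex of participant",
}

# Toy vectors standing in for embeddings of the descriptions above (assumption).
TARGET_VECS = {
    "AGE": [1.0, 0.0, 0.1],
    "SEX": [0.0, 1.0, 0.0],
}

# A source variable worded differently from its counterpart in the target.
SOURCE_TEXT = "Years since birth at enrollment"
SOURCE_VEC = [0.9, 0.1, 0.0]  # toy embedding placed close to "AGE"

if __name__ == "__main__":
    print("embedding match:", closest_by_embedding(SOURCE_VEC, TARGET_VECS))
    print("string match:   ", closest_by_string(SOURCE_TEXT, TARGET_TEXTS))
```

In a semi-automatic workflow of the kind the abstract describes, the closest match would be a suggestion for the data steward to confirm or correct, and `accuracy` would be computed against a manually curated mapping.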
Author(s)
Adams, Tim  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Aborageh, Mohamed
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Salimi, Yasamin  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Fröhlich, Holger  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Jacobs, Marc  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Mainwork
SWAT4HCLS 2024, 15th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences  
Conference
International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences 2024  
Open Access
DOI
10.24406/publica-4577
File(s)
Download (598.29 KB)
Rights
CC BY 4.0: Creative Commons Attribution
Language
English
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Keyword(s)
  • data stewardship
  • large language models
  • embeddings
  • semantic mappings
  • common data model
