2024
Conference Paper
Title

INDEX: the Intelligent Data Steward Toolbox Utilizing Large Language Model Embeddings for Automated Data Harmonization

Abstract
The data steward, responsible for overseeing data management, plays a pivotal role in evidence-based medicine by ensuring the quality, integrity, and accessibility of data throughout its lifecycle. However, managing medical data poses challenges, including handling diverse structured and unstructured data from various sources in different formats. This data curation process demands significant time and resources. To alleviate these challenges and enhance the efficiency of data stewards, we introduce a novel data stewardship tool and curation workflow utilizing Large Language Models (LLMs). We evaluated our approach by performing automatic pairwise cohort harmonization using data dictionaries of 6 different Parkinson’s Disease (PD) studies and 13 different studies in the context of Alzheimer’s Disease (AD), as well as a mapping task of over 38,000 ICD-10 codes using code descriptions obtained from UKBioBank. Compared with a string-matching baseline that does not capture the context of variable descriptions, mappings based on Generative Pre-trained Transformer (GPT) embeddings performed significantly better: for PD cohort harmonization, the automated initial closest match reached a best average accuracy of 82%. Although differences in formulation and wording meant that descriptions could not be matched automatically in all cases, we are confident that our data steward tool can significantly facilitate the work of the data steward in a semi-automatic fashion.
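The comparison described in the abstract, a closest match chosen by embedding similarity versus a string-matching baseline, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the variable names, and the three-dimensional vectors are hypothetical stand-ins, and in practice the vectors would come from an embedding model applied to the variable descriptions.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def closest_by_embedding(source_vecs, target_vecs):
    """Map each source variable to the target variable with the highest cosine similarity."""
    return {
        src: max(target_vecs, key=lambda tgt: cosine(vec, target_vecs[tgt]))
        for src, vec in source_vecs.items()
    }


def closest_by_string(source_desc, target_desc):
    """Baseline: map each source variable to the target with the most similar description string."""
    def ratio(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    return {
        src: max(target_desc, key=lambda tgt: ratio(desc, target_desc[tgt]))
        for src, desc in source_desc.items()
    }


def accuracy(predicted, gold):
    """Fraction of source variables whose predicted closest match equals the gold mapping."""
    return sum(predicted[src] == tgt for src, tgt in gold.items()) / len(gold)


# Hypothetical toy data: two source variables, two target variables.
# The vectors are hand-made placeholders, NOT real model embeddings.
source_vecs = {"age": [1.0, 0.0, 0.1], "gender": [0.1, 1.0, 0.0]}
target_vecs = {"AGE_YRS": [0.9, 0.1, 0.0], "SEX": [0.0, 1.0, 0.0]}
source_desc = {"age": "Age of participant at baseline", "gender": "Gender of participant"}
target_desc = {"AGE_YRS": "Participant age in years", "SEX": "Biological sex"}
gold = {"age": "AGE_YRS", "gender": "SEX"}

embedding_matches = closest_by_embedding(source_vecs, target_vecs)
string_matches = closest_by_string(source_desc, target_desc)
print("embedding accuracy:", accuracy(embedding_matches, gold))
print("string accuracy:", accuracy(string_matches, gold))
```

In a semi-automatic workflow of the kind the abstract describes, the closest match would serve as an initial suggestion for the data steward to confirm or correct rather than as a final mapping.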
Author(s)
Adams, Tim  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Aborageh, Mohamed
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Salimi, Yasamin  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Fröhlich, Holger F.  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Jacobs, Marc  
Fraunhofer-Institut für Algorithmen und Wissenschaftliches Rechnen SCAI  
Mainwork
Ceur Workshop Proceedings
Conference
15th International Conference on Semantic Web Applications and Tools for Health Care and Life Sciences, SWAT4HCLS 2024
Language
English
Keyword(s)
  • common data model
  • data stewardship
  • embeddings
  • large language models
  • semantic mappings
