  • Publication
    A study on interoperability between two Personal Health Train infrastructures in leukodystrophy data analysis
    (2024)
    Welten, Sascha; Arruda Botelho Herr, Marius De; Hempel, Lars; Hieber, David; Placzek, Peter; Graf, Michael; Weber, Sven; Neumann, Laurenz; Jugl, Maximilian; Tirpitz, Liam; Kindermann, Karl; Bonino da Silva Santos, Luiz Olavo; Pfeifer, Nico; Kohlbacher, Oliver; Kirsten, Toralf
    The development of platforms for distributed analytics has been driven by a growing need to comply with various governance-related or legal constraints. Among these platforms, the so-called Personal Health Train (PHT) is one representative that has emerged in recent years. However, in projects that require data from sites featuring different PHT infrastructures, institutions face challenges arising from the combination of multiple PHT ecosystems, including data governance, regulatory compliance, or the modification of existing workflows. In these scenarios, making the platforms interoperable is preferable to operating them side by side. In this work, we introduce a conceptual framework for the technical interoperability of the PHT covering five essential requirements: data integration, unified station identifiers, mutual metadata, aligned security protocols, and business logic. We evaluated our concept in a feasibility study involving two distinct PHT infrastructures: PHT-meDIC and PADME. We analyzed data on leukodystrophy from patients at the University Hospitals of Tübingen and Leipzig, and patients with differential diagnoses at the University Hospital Aachen. The results of our study demonstrate the technical interoperability of these two PHT infrastructures, allowing researchers to perform analyses across the participating institutions. Our method is more space-efficient than the multi-homing strategy and shows only a minimal time overhead.
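    As a rough illustration of the five requirements named in the abstract (the paper itself does not publish code here), the following Python sketch models a station description as a comparable data structure. All class and field names (StationDescriptor, is_interoperable, the example stations) are hypothetical and only illustrate the idea of unified station identifiers and mutual metadata; they are not the authors' implementation.

      # Hypothetical sketch: the five interoperability requirements
      # (data integration, unified station identifiers, mutual metadata,
      #  aligned security protocols, business logic) as a comparable descriptor.
      from dataclasses import dataclass, field

      @dataclass
      class StationDescriptor:
          station_id: str                                        # unified station identifier
          data_formats: set = field(default_factory=set)         # data integration
          metadata_keys: set = field(default_factory=set)        # mutual metadata
          security_protocols: set = field(default_factory=set)   # aligned security protocols
          workflow_engine: str = ""                              # business logic

      def is_interoperable(a: StationDescriptor, b: StationDescriptor) -> bool:
          """Toy check: two stations can take part in the same analysis if they
          share a data format, a metadata vocabulary entry and a security
          protocol, and use the same workflow engine."""
          return bool(
              a.data_formats & b.data_formats
              and a.metadata_keys & b.metadata_keys
              and a.security_protocols & b.security_protocols
              and a.workflow_engine == b.workflow_engine
          )

      if __name__ == "__main__":
          medic = StationDescriptor("station-tuebingen", {"FHIR"}, {"dcat:Dataset"}, {"mTLS"}, "docker")
          padme = StationDescriptor("station-aachen", {"FHIR", "CSV"}, {"dcat:Dataset"}, {"mTLS"}, "docker")
          print(is_interoperable(medic, padme))   # True in this toy example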
  • Publication
    A Knowledge Graph for Query-Induced Analyses of Hierarchically Structured Time Series Information
    This paper introduces the concept of a knowledge graph for time series data, which allows for structured management and propagation of characteristic time series information and supports query-driven data analyses. We gradually link and enrich knowledge obtained from domain experts or previously performed analyses by representing globally and locally occurring time series insights as individual graph nodes. Supported by techniques from automated knowledge discovery and machine learning, a recursive integration of analytical query results is used to generate a spectral representation of linked and successively condensed information. Besides a time-series-to-graph mapping, we provide an ontology describing a classification of maintained knowledge and the affiliated analysis methods for knowledge generation. After a discussion of gradual knowledge enrichment, we illustrate the concept of knowledge propagation based on an application of state-of-the-art methods for time series analysis.
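    The core idea of representing insights as graph nodes can be illustrated with a small Python/networkx toy example. None of this code is from the paper; all node labels, attributes and relation names are invented placeholders.

      # Toy sketch (not from the paper): time series insights as graph nodes.
      import networkx as nx

      G = nx.DiGraph()

      # One node for the series itself, plus nodes for derived insights.
      G.add_node("series:turbine_42", kind="time_series", unit="rpm")
      G.add_node("insight:global_trend", kind="global_insight",
                 description="upward trend over the full horizon")
      G.add_node("insight:local_anomaly_w07", kind="local_insight",
                 description="spike in calendar week 7")

      # Link insights to the series they describe; further analysis results
      # can be integrated recursively by adding more nodes and edges.
      G.add_edge("insight:global_trend", "series:turbine_42", relation="describes")
      G.add_edge("insight:local_anomaly_w07", "series:turbine_42", relation="describes")

      # A query-induced analysis could start from the series node and collect
      # everything that is already known about it:
      print(list(G.predecessors("series:turbine_42")))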
  • Publication
    pFedV: Mitigating Feature Distribution Skewness via Personalized Federated Learning with Variational Distribution Constraints
    (2023)
    Mou, Yongli; Geng, Jiahui; Zhou, Feng; Beyan, Oya Deniz; Rong, Chunming
    Statistical heterogeneity among distributed data, especially feature distribution skewness, is a common phenomenon in practice and a challenging problem in federated learning, as it can degrade the performance of the aggregated global model. In this paper, we introduce pFedV, a novel approach that takes a variational inference perspective by incorporating a variational distribution into neural networks. During training, we add a KL-divergence term to the loss function to constrain the output distribution of the feature extraction layers, and we personalize the final layer of the models. The experimental results demonstrate the effectiveness of our approach in mitigating distribution shift in feature space in federated learning.
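    A minimal PyTorch sketch of the general idea described above: a feature extractor that outputs a distribution, with a KL-divergence term against a standard normal prior added to the training loss, and a final layer that would be kept personalized per client. This is not the authors' pFedV implementation; layer sizes, the prior and the weighting factor beta are assumptions made purely for illustration.

      # Illustrative sketch only (not the pFedV reference implementation).
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class VariationalExtractor(nn.Module):
          def __init__(self, in_dim=784, hidden=128, num_classes=10):
              super().__init__()
              self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
              self.mu = nn.Linear(hidden, hidden)          # mean of the feature distribution
              self.logvar = nn.Linear(hidden, hidden)      # log-variance of the feature distribution
              self.head = nn.Linear(hidden, num_classes)   # personalized final layer

          def forward(self, x):
              h = self.backbone(x)
              mu, logvar = self.mu(h), self.logvar(h)
              z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
              return self.head(z), mu, logvar

      def loss_fn(logits, target, mu, logvar, beta=1e-3):
          ce = F.cross_entropy(logits, target)
          # KL divergence between N(mu, sigma^2) and the standard normal prior,
          # constraining the output distribution of the feature extractor.
          kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
          return ce + beta * kl

      # Usage on one client, with toy data:
      model = VariationalExtractor()
      x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
      logits, mu, logvar = model(x)
      loss = loss_fn(logits, y, mu, logvar)
      loss.backward()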
  • Publication
    Semantics in Dataspaces: Origin and Future Directions
    (2023)
    Kocher, Max; Paulus, Alexander; Pomp, André; Curry, Edward
    The term dataspace was coined two decades ago and has evolved since then. Definitions range from (i) an abstraction for data management in an identifiable scope, via (ii) a multi-sided data platform connecting participants in an ecosystem, to (iii) the interlinking of data towards loosely connected (global) information. Many implementations and scientific notions follow different interpretations of the term dataspace, but agree on some use of semantic technologies. For example, dataspaces such as the European Open Science Cloud and the German National Research Data Infrastructure are committed to applying the FAIR principles. Dataspaces built on top of Gaia-X use semantic methods for service Self-Descriptions. This paper investigates ongoing dataspace efforts and aims to provide insights on the definition of the term dataspace, the usage of semantics and FAIR principles, and future directions for the role of semantics in dataspaces.
  • Publication
    Explainable AI for Bioinformatics: Methods, Tools and Applications
    (2023)
    Karim, Md. Rezaul; Islam, Tanhim; Shajalal, Md; Beyan, Oya; Lange, Christoph; Cochez, Michael; Rebholz-Schuhmann, Dietrich
    Artificial intelligence (AI) systems utilizing deep neural networks and machine learning (ML) algorithms are widely used for solving critical problems in bioinformatics, biomedical informatics and precision medicine. However, complex ML models that are often perceived as opaque and black-box methods make it difficult to understand the reasoning behind their decisions. This lack of transparency can be a challenge for both end-users and decision-makers, as well as AI developers. In sensitive areas such as healthcare, explainability and accountability are not only desirable properties but also legally required for AI systems that can have a significant impact on human lives. Fairness is another growing concern, as algorithmic decisions should not show bias or discrimination towards certain groups or individuals based on sensitive attributes. Explainable AI (XAI) aims to overcome the opaqueness of black-box models and to provide transparency in how AI systems make decisions. Interpretable ML models can explain how they make predictions and identify factors that influence their outcomes. However, the majority of the state-of-the-art interpretable ML methods are domain-agnostic and have evolved from fields such as computer vision, automated reasoning or statistics, making direct application to bioinformatics problems challenging without customization and domain adaptation. In this paper, we discuss the importance of explainability and algorithmic transparency in the context of bioinformatics. We provide an overview of model-specific and model-agnostic interpretable ML methods and tools and outline their potential limitations. We discuss how existing interpretable ML methods can be customized and fit to bioinformatics research problems. Further, through case studies in bioimaging, cancer genomics and text mining, we demonstrate how XAI methods can improve transparency and decision fairness. Our review aims at providing valuable insights and serving as a starting point for researchers wanting to enhance explainability and decision transparency while solving bioinformatics problems. GitHub: https://github.com/rezacsedu/XAI-for-bioinformatics.
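    As a concrete example of the model-agnostic methods discussed above, the following sketch applies SHAP to a tree-based classifier trained on synthetic "gene expression" data. The dataset, feature names and the choice of explainer are illustrative assumptions; the paper's own case studies (bioimaging, cancer genomics, text mining) are not reproduced here.

      # Hedged example: model-agnostic explanation of a classifier with SHAP.
      # Synthetic data only; "gene_i" are placeholder feature names.
      import numpy as np
      import shap
      from sklearn.ensemble import RandomForestClassifier

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 5))                   # 200 samples, 5 "genes"
      y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)   # label driven by gene_0 and gene_2

      model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

      # Explain the predicted probability of the positive class with a
      # model-agnostic explainer.
      def predict_pos(data):
          return model.predict_proba(data)[:, 1]

      explainer = shap.Explainer(predict_pos, X)
      sv = explainer(X)                               # per-sample, per-feature attributions

      # Simple global ranking: mean absolute SHAP value per feature.
      importance = np.abs(sv.values).mean(axis=0)
      for i, score in enumerate(importance):
          print(f"gene_{i}: {score:.3f}")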
  • Publication
    What prevents us from reusing medical real-world data in research
    (2023)
    Gehrmann, Julia; Herczog, Edit; Beyan, Oya Deniz
  • Publication
    Will it run?—A proof of concept for smoke testing decentralized data analytics experiments
    (2023)
    Welten, Sascha; Weber, Sven; Holt, Adrian; Beyan, Oya Deniz
    The growing interest in data-driven medicine, in conjunction with the formation of initiatives such as the European Health Data Space (EHDS) has demonstrated the need for methodologies that are capable of facilitating privacy-preserving data analysis. Distributed Analytics (DA) as an enabler for privacy-preserving analysis across multiple data sources has shown its potential to support data-intensive research. However, the application of DA creates new challenges stemming from its distributed nature, such as identifying single points of failure (SPOFs) in DA tasks before their actual execution. Failing to detect such SPOFs can, for example, result in improper termination of the DA code, necessitating additional efforts from multiple stakeholders to resolve the malfunctions. Moreover, these malfunctions disrupt the seamless conduct of DA and entail several crucial consequences, including technical obstacles to resolve the issues, potential delays in research outcomes, and increased costs. In this study, we address this challenge by introducing a concept based on a method called Smoke Testing, an initial and foundational test run to ensure the operability of the analysis code. We review existing DA platforms and systematically extract six specific Smoke Testing criteria for DA applications. With these criteria in mind, we create an interactive environment called Development Environment for AuTomated and Holistic Smoke Testing of Analysis-Runs (DEATHSTAR), which allows researchers to perform Smoke Tests on their DA experiments. We conduct a user-study with 29 participants to assess our environment and additionally apply it to three real use cases. The results of our evaluation validate its effectiveness, revealing that 96.6% of the analyses created and (Smoke) tested by participants using our approach successfully terminated without any errors. Thus, by incorporating Smoke Testing as a fundamental method, our approach helps identify potential malfunctions early in the development process, ensuring smoother data-driven research within the scope of DA. Through its flexibility and adaptability to diverse real use cases, our solution enables more robust and efficient development of DA experiments, which contributes to their reliability.
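    DEATHSTAR itself is not reproduced here; the snippet below only sketches the underlying Smoke Testing idea using the Docker SDK for Python: run the packaged analysis once against a small synthetic dataset and check that it terminates cleanly before it is shipped to the real stations. The image name, mount path and timeout are placeholders, not values from the paper.

      # Minimal smoke-test sketch (not DEATHSTAR): run the analysis image once
      # on synthetic data and check that it exits without errors.
      import docker

      IMAGE = "registry.example.org/analysis:latest"   # placeholder image name

      def smoke_test(image: str, synthetic_data_dir: str, timeout: int = 300) -> bool:
          client = docker.from_env()
          container = client.containers.run(
              image,
              detach=True,
              volumes={synthetic_data_dir: {"bind": "/data", "mode": "ro"}},
              environment={"DATA_PATH": "/data"},
          )
          try:
              result = container.wait(timeout=timeout)          # block until the run ends
              logs = container.logs().decode(errors="replace")
              ok = result.get("StatusCode", 1) == 0
              if not ok:
                  print("Smoke test failed, container logs:\n", logs)
              return ok
          finally:
              container.remove(force=True)

      if __name__ == "__main__":
          print("passed" if smoke_test(IMAGE, "/tmp/synthetic-data") else "failed")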
  • Publication
    Property cardinality analysis to extract truly tabular query results from Wikidata
    (2022-11-03)
    Fahl, Wolfgang; Holzheim, Tim; Westerinen, Andrea; Lange, Christoph
    Tabular views of data with tables, columns and rows as the key concepts are still a popular basis for data analysis and storage, used in relational database management systems and spreadsheet software. Graph based approaches are a superset of the tabular view and use vertices and edges/properties as the key concepts to manage the data. A common way to store graph data is using subject, predicate and object triples in a "triple store". For quite a few use cases, transforming the triple store data to a tabular view is needed since tabular systems are still widespread. The straightforward approach to generating such data using a "naive" query will, however, create unexpected results or even fail because of conceptual differences between the relational and the graph approaches regarding the handling of the cardinality/multiplicity of properties. This work shows a systematic approach to analyze the property cardinalities of the graph data in an RDF/SPARQL triple store and to extract "truly tabular" data (with cardinalities of 1 in each column) by automatically generating appropriate queries. We propose a SPARQL query builder that simplifies the generation of queries that limit the result set to such "truly tabular" data.
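    To make the cardinality problem concrete, here is a small self-contained example (not the authors' query builder) using SPARQLWrapper against the public Wikidata endpoint: it counts how many values one property has per item, which shows whether a naive tabular query on that property would duplicate rows. The chosen property (P69, "educated at") and items are arbitrary examples.

      # Hedged example, not the paper's query generator: inspect the cardinality
      # of one property to see whether it can serve as a "truly tabular" column.
      from SPARQLWrapper import SPARQLWrapper, JSON

      sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                             agent="cardinality-example/0.1")
      sparql.setReturnFormat(JSON)

      # How many "educated at" (P69) values do these three people have?
      sparql.setQuery("""
      SELECT ?item (COUNT(?value) AS ?count) WHERE {
        VALUES ?item { wd:Q937 wd:Q1035 wd:Q7186 }   # Einstein, Darwin, Curie
        ?item wdt:P69 ?value .
      }
      GROUP BY ?item
      """)

      for row in sparql.query().convert()["results"]["bindings"]:
          print(row["item"]["value"], "->", row["count"]["value"], "value(s)")
      # Any count > 1 means a naive query joining on P69 multiplies the result rows,
      # so the column is not "truly tabular" without further restriction.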
  • Publication
    Getting and hosting your own copy of Wikidata
    (2022-11-03)
    Fahl, Wolfgang; Holzheim, Tim; Westerinen, Andrea; Lange, Christoph
    Wikidata is a very large, crowd-sourced, general knowledge graph that is backed by a worldwide community. Its original purpose was to link different versions of Wikipedia articles across multiple languages. Access to Wikidata is provided by the non-profit Wikimedia Foundation and recently also by Wikimedia Enterprise as a commercial service. Query access via the public Wikidata Query Service (WDQS) has limits that make larger queries with millions of results next to impossible, due to a one-minute timeout restriction. Beyond addressing the timeout restriction, hosting a copy of Wikidata may be desirable in order to have a more reliable service, quicker response times, less user load, and better control over the infrastructure. It is not easy, but it is possible to get and host your own copy of Wikidata. The data and software needed to run a complete Wikidata instance are available as open source or accessible via free licenses. In this paper, we report on both successful and failed attempts to get and host your own copy of Wikidata, using different triple store servers. We share recommendations for the needed hardware and software, provide documented scripts to semi-automate the procedures, and document things to avoid.
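    The paper's scripts are referenced but not shown here; as a rough illustration of why a local copy helps, the snippet below sends the same heavy query to the public WDQS endpoint (subject to the one-minute timeout) and to an assumed self-hosted SPARQL endpoint. The local URL is a placeholder and depends on the triple store actually used.

      # Illustration only: compare the public endpoint with an assumed local copy.
      from SPARQLWrapper import SPARQLWrapper, JSON

      QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"   # far too heavy for WDQS

      ENDPOINTS = {
          "public WDQS": "https://query.wikidata.org/sparql",
          "local copy":  "http://localhost:3030/wikidata/sparql",  # placeholder URL
      }

      for name, url in ENDPOINTS.items():
          sparql = SPARQLWrapper(url, agent="local-copy-example/0.1")
          sparql.setReturnFormat(JSON)
          sparql.setQuery(QUERY)
          try:
              result = sparql.query().convert()
              count = result["results"]["bindings"][0]["triples"]["value"]
              print(f"{name}: {count} triples")
          except Exception as exc:                  # timeouts, connection errors, ...
              print(f"{name}: query failed ({exc.__class__.__name__})")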