  • Publication
    Efficient computation of comprehensive statistical information of large OWL datasets: A scalable approach
    (2023)
    Mohamed, H.; Fathalla, S.; Jabeen, H.
    Computing dataset statistics is crucial for exploring the structure of a dataset, but it becomes challenging for large-scale datasets. Such statistics have several key benefits, including link target identification, vocabulary reuse, quality analysis, big data analytics, and coverage analysis. In this paper, we present the first attempt at developing a distributed approach (OWLStats) for collecting comprehensive statistics over large-scale OWL datasets. OWLStats is a distributed in-memory approach that computes 50 statistical criteria for OWL datasets using Apache Spark. We have successfully integrated OWLStats into the SANSA framework. Experimental results show that OWLStats scales linearly in terms of both node and data scalability.
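    To give a flavor of the kind of distributed statistics collection the abstract describes, here is a minimal PySpark sketch computing one criterion (class usage counts) over an N-Triples file. The file path and the naive line parser are illustrative assumptions; OWLStats itself lives in the Scala-based SANSA stack.
```python
# Minimal PySpark sketch of one statistical criterion (class usage counts)
# over an OWL/RDF dataset serialized as N-Triples. The file path and the
# simplistic parser are illustrative assumptions, not OWLStats code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("owl-stats-sketch").getOrCreate()

triples = (
    spark.sparkContext.textFile("hdfs:///data/dataset.nt")  # hypothetical path
    .filter(lambda line: line.strip() and not line.startswith("#"))
    .map(lambda line: line.rstrip(" .\n").split(" ", 2))  # -> (s, p, o)
)

# Criterion: how often each class is instantiated via rdf:type.
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
class_usage = (
    triples.filter(lambda spo: spo[1] == RDF_TYPE)
    .map(lambda spo: (spo[2], 1))
    .reduceByKey(lambda a, b: a + b)  # distributed aggregation per class
)

for cls, count in class_usage.takeOrdered(10, key=lambda kv: -kv[1]):
    print(cls, count)
```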
  • Publication
    Survey on English Entity Linking on Wikidata: Datasets and approaches
    (2022-01-27)
    Möller, Cedric; Usbeck, Ricardo
    Wikidata is a frequently updated, community-driven, and multilingual knowledge graph. Hence, Wikidata is an attractive basis for Entity Linking, which is evident from the recent increase in published papers. This survey focuses on four subjects: (1) Which Wikidata Entity Linking datasets exist, how widely used are they, and how are they constructed? (2) Do the characteristics of Wikidata matter for the design of Entity Linking datasets, and if so, how? (3) How do current Entity Linking approaches exploit the specific characteristics of Wikidata? (4) Which Wikidata characteristics are unexploited by existing Entity Linking approaches? This survey reveals that current Wikidata-specific Entity Linking datasets do not differ in their annotation scheme from schemes used for other knowledge graphs such as DBpedia. Thus, the potential for multilingual and time-dependent datasets, for which Wikidata is naturally suited, remains untapped. Furthermore, we show that most Entity Linking approaches use Wikidata in the same way as any other knowledge graph, missing the chance to leverage Wikidata-specific characteristics to increase quality. Almost all approaches employ specific properties such as labels, and sometimes descriptions, but ignore characteristics such as the hyper-relational structure. Hence, there is still room for improvement, for example by including hyper-relational graph embeddings or type information. Many approaches also include information from Wikipedia, which is easily combined with Wikidata and provides valuable textual information that Wikidata lacks.
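    As a concrete illustration of the labels and descriptions that most surveyed approaches rely on, the following sketch retrieves both for an arbitrary entity (Q42) from the public Wikidata SPARQL endpoint; it is not code from any surveyed system.
```python
# Fetch the English label and description of a Wikidata entity, the two
# properties the survey notes almost all approaches employ. Q42 is an
# arbitrary example entity. Requires the 'SPARQLWrapper' package.
from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
    SELECT ?label ?description WHERE {
        wd:Q42 rdfs:label ?label .
        wd:Q42 schema:description ?description .
        FILTER(LANG(?label) = "en" && LANG(?description) = "en")
    }
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["label"]["value"], "-", row["description"]["value"])
```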
  • Publication
    Spatial concept learning and inference on geospatial polygon data
    (2022)
    Grubenmann, T.; Bin, S.; Bühmann, L.
    Geospatial knowledge has always been an essential driver for many societal aspects, in particular urban planning and urban growth management. To gain insights from geospatial data and guide decisions, authoritative and open data sources are usually used, combined with user or citizen sensing data. However, we see great potential for improving geospatial analytics by combining geospatial data with rich terminological knowledge, e.g., as provided by the Linked Open Data Cloud. Given semantically explicit, integrated geospatial and terminological knowledge, expressed by means of established vocabularies and ontologies, cross-domain spatial analytics can be performed. One analytics technique working on terminological knowledge is inductive concept learning, an approach that learns classifiers expressed as logical concept descriptions. In this paper, we extend inductive concept learning to infer and make use of the spatial context of entities in spatio-terminological data. We propose a formalism for extracting spatial relations and making them explicit such that they can be exploited to learn spatial concept descriptions, enabling ‘spatially aware’ concept learning. We further provide an implementation of this formalism and demonstrate its capabilities in different evaluation scenarios.
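    The formalism itself is not given in the abstract, but the underlying idea of making spatial relations explicit can be sketched with a few toy polygons; the ex: predicates and the geometries below are illustrative assumptions, not the authors' vocabulary.
```python
# Sketch of deriving explicit spatial relations from polygon data, in the
# spirit of the paper's formalism (predicates and geometries are invented
# for illustration). Requires the 'shapely' package.
from shapely.geometry import Polygon

regions = {
    "ex:park": Polygon([(0, 0), (4, 0), (4, 4), (0, 4)]),
    "ex:pond": Polygon([(1, 1), (2, 1), (2, 2), (1, 2)]),
    "ex:road": Polygon([(3, -1), (5, -1), (5, 5), (3, 5)]),
}

# Emit pairwise relations as (subject, predicate, object) triples that a
# concept learner could consume as background knowledge.
triples = []
for name_a, geom_a in regions.items():
    for name_b, geom_b in regions.items():
        if name_a == name_b:
            continue
        if geom_a.contains(geom_b):
            triples.append((name_a, "ex:contains", name_b))
        elif geom_a.intersects(geom_b):
            triples.append((name_a, "ex:intersects", name_b))

for t in triples:
    print(t)  # e.g., ('ex:park', 'ex:contains', 'ex:pond')
```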
  • Publication
    Bringing Light Into the Dark: A large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework
    (2022)
    Berrendorf, Max; Hoyt, Charles Tapley; Vermue, Laurent; Sharifzadeh, Sahand; Fischer, Asja; Tresp, Volker
    The heterogeneity in recently published knowledge graph embedding models' implementations, training, and evaluation has made fair and thorough comparisons difficult. To assess the reproducibility of previously published results, we re-implemented and evaluated 21 models in the PyKEEN software package. In this paper, we outline which results could be reproduced with their reported hyper-parameters, which could only be reproduced with alternate hyper-parameters, and which could not be reproduced at all, and we provide insight into why this might be the case. We then performed large-scale benchmarking on four datasets with several thousand experiments and 24,804 GPU hours of computation time. We present insights gained into best practices, the best configurations for each model, and where improvements could be made over previously published best configurations. Our results highlight that the combination of model architecture, training approach, loss function, and the explicit modeling of inverse relations is crucial for a model's performance, which is not determined by its architecture alone. We provide evidence that several architectures can obtain results competitive with the state of the art when configured carefully. We have made all code, experimental configurations, results, and analyses available at https://github.com/pykeen/pykeen and https://github.com/pykeen/benchmarking.
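    PyKEEN's entry point is its pipeline function, which wraps the train-evaluate loop used throughout the benchmarking. A minimal runnable example (model, dataset, and epoch count chosen arbitrarily for illustration, not the paper's best configurations):
```python
# Minimal PyKEEN example (pip install pykeen): train and evaluate one of
# the 21 re-implemented models on a small built-in benchmark dataset.
from pykeen.pipeline import pipeline

result = pipeline(
    dataset="Nations",   # small built-in dataset, for illustration only
    model="TransE",      # one of the re-implemented interaction models
    training_kwargs=dict(num_epochs=100),
    random_seed=42,      # fixed seed for reproducibility
)

print(result.get_metric("hits@10"))
result.save_to_directory("transe_nations")  # persists model and metrics
```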
  • Publication
    SGPT: A Generative Approach for SPARQL Query Generation From Natural Language Questions
    (2022)
    Rony, Md Rashad Al Hasan; Kumar, U.; Kovriguina, Liubov
    SPARQL query generation from natural language questions is complex because it requires an understanding of both the question and the underlying knowledge graph (KG) patterns. Most SPARQL query generation approaches are template-based, tailored to a specific knowledge graph, and require pipelines with multiple steps, including entity and relation linking. Template-based approaches are also difficult to adapt to new KGs and require manual effort from domain experts to construct query templates. To overcome this hurdle, we propose a new approach, dubbed SGPT, that combines the benefits of end-to-end and modular systems and leverages recent advances in large-scale language models. Specifically, we devise a novel embedding technique that encodes linguistic features from the question, which enables the system to learn complex question patterns. In addition, we propose training techniques that allow the system to implicitly embed graph-specific information (i.e., entities and relations) into the language model's parameters and generate SPARQL queries accurately. Finally, we introduce a strategy to adapt standard automatic metrics for evaluating SPARQL query generation. A comprehensive evaluation demonstrates the effectiveness of SGPT over state-of-the-art methods across several benchmark datasets.
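    Since SGPT's own code is not shown here, the following is only a hedged sketch of the generic generate-SPARQL-with-a-language-model step using the Hugging Face transformers API; the checkpoint name is hypothetical, and a stock model would need fine-tuning on question/query pairs first.
```python
# Illustrative sketch of generative SPARQL construction with a sequence-to-
# sequence language model (NOT the SGPT implementation). The checkpoint name
# below is hypothetical; without fine-tuning, a stock model will not emit
# valid SPARQL.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "your-finetuned-sparql-t5"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = "Who is the mayor of Berlin?"
inputs = tokenizer("translate question to sparql: " + question,
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```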
  • Publication
    CLEP: A hybrid data- and knowledge-driven framework for generating patient representations
    (2021-05-08)
    Ali, Mehdi; Hoyt, Charles Tapley; Domingo-Fernández, Daniel
    As machine learning and artificial intelligence find an increasing number of applications in the biomedical domain, their utility ultimately depends on the data used to train them. Due to the complexity and high dimensionality of biomedical data, there is a need for approaches that combine prior knowledge about known biological interactions with patient data. Here, we present CLinical Embedding of Patients (CLEP), a novel approach that generates new patient representations by leveraging both prior knowledge and patient-level data. First, given a patient-level dataset and a knowledge graph containing relations across features that can be mapped to the dataset, CLEP incorporates patients into the knowledge graph as new nodes connected to their most characteristic features. Next, CLEP employs knowledge graph embedding models to generate new patient representations that can ultimately be used for a variety of downstream tasks, ranging from clustering to classification. We demonstrate that the new patient representations generated by CLEP significantly improve performance in classifying patients against healthy controls for a variety of machine learning models, compared to using the original transcriptomics data. Furthermore, we show how incorporating patients into a knowledge graph can foster the interpretation and identification of biological features characteristic of a specific disease or patient subgroup. Finally, we release CLEP as an open-source Python package together with examples and documentation.
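    CLEP's first step, connecting patients to their most characteristic features, can be sketched under simplified assumptions (a toy feature graph and an invented z-score cutoff; this is not the package's actual API):
```python
# Sketch of CLEP's first step under simplified assumptions: patients become
# new nodes in a feature knowledge graph, linked to the features on which
# they deviate most (here: |z-score| above a threshold). Data and threshold
# are invented for illustration.
import networkx as nx
import numpy as np

# Toy knowledge graph over features (e.g., genes) with known relations.
kg = nx.Graph([("GeneA", "GeneB"), ("GeneB", "GeneC")])

patients = {
    "patient_1": {"GeneA": 2.7, "GeneB": 0.1, "GeneC": -3.1},
    "patient_2": {"GeneA": -0.2, "GeneB": 1.9, "GeneC": 0.4},
}

Z_THRESHOLD = 1.5  # illustrative cutoff for "characteristic" features
for patient, zscores in patients.items():
    kg.add_node(patient, kind="patient")
    for feature, z in zscores.items():
        if abs(z) >= Z_THRESHOLD:
            kg.add_edge(patient, feature, sign=float(np.sign(z)))

print(list(kg.edges(data=True)))
# The augmented graph would then be fed to a KG embedding model to obtain
# patient representations for clustering or classification.
```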
  • Publication
    Discover Relations in the Industry 4.0 Standards Via Unsupervised Learning on Knowledge Graph Embeddings
    Industry 4.0 (I4.0) standards and standardization frameworks provide a unified way to describe smart factories. Standards specify the main components, systems, and processes inside a smart factory and the interactions among all of them. Furthermore, standardization frameworks classify standards according to their functions into layers and dimensions. Albeit informative, frameworks can categorize similar standards differently. As a result, interoperability conflicts arise whenever smart factories are described with misclassified standards. Approaches like ontologies and knowledge graphs enable the integration of standards and frameworks in a structured way. They also encode the meaning of the standards, known relations among them, and their classification according to existing frameworks. This structured modeling of the I4.0 landscape using a graph data model provides the basis for graph-based analytical methods to uncover alignments among standards. This paper contributes to analyzing the relatedness among standards and frameworks; it presents an unsupervised approach for discovering links among standards. The proposed method resorts to knowledge graph embeddings to determine relatedness among standards based on similarity metrics, and it is agnostic to both the technique used to create the embeddings and the similarity measure. Building on the similarity values, community detection algorithms can automatically create communities of highly similar standards. Our approach follows the homophily principle and assumes that related standards fall into the same community. Thus, alignments across standards are predicted and interoperability issues across them are resolved. We empirically evaluate our approach on a knowledge graph of 249 I4.0 standards using the Trans* family of embedding models for knowledge graph entities. Our results are promising and suggest that relations among standards can be detected accurately.
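    The pipeline the abstract describes (embeddings, a similarity measure, then community detection under the homophily assumption) can be sketched as follows; the random vectors stand in for trained Trans* embeddings, and the threshold is an arbitrary illustration.
```python
# Sketch of the link-discovery idea: build a similarity graph over standard
# embeddings and detect communities of related standards. Random vectors
# are placeholders for trained Trans* embeddings.
import networkx as nx
import numpy as np
from networkx.algorithms.community import greedy_modularity_communities

rng = np.random.default_rng(0)
standards = [f"std_{i}" for i in range(6)]
embeddings = {s: rng.normal(size=32) for s in standards}  # placeholders

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Connect standards whose embedding similarity exceeds a chosen threshold.
SIM_THRESHOLD = 0.1  # illustrative; the approach is agnostic to the measure
g = nx.Graph()
g.add_nodes_from(standards)
for i, a in enumerate(standards):
    for b in standards[i + 1:]:
        if cosine(embeddings[a], embeddings[b]) > SIM_THRESHOLD:
            g.add_edge(a, b)

# By homophily, standards in the same community are predicted to be related.
for community in greedy_modularity_communities(g):
    print(sorted(community))
```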
  • Publication
    PyKEEN 1.0: A python library for training and evaluating knowledge graph embeddings
    (2021)
    Berrendorf, Max; Hoyt, Charles Tapley; Vermue, Laurent; Sharifzadeh, Sahand; Tresp, Volker
    Recently, knowledge graph embeddings (KGEs) have received significant attention, and several software libraries have been developed for training and evaluating them. While each of these addresses specific needs, we report on a community effort to re-design and re-implement PyKEEN, one of the early KGE libraries. PyKEEN 1.0 enables users to compose knowledge graph embedding models from a wide range of interaction models, training approaches, and loss functions, and it permits the explicit modeling of inverse relations. It allows users to measure each component's individual influence on the model's performance. In addition, automatic memory optimization has been implemented in order to optimally exploit the provided hardware. Through the integration of Optuna, extensive hyper-parameter optimization (HPO) functionality is provided.
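    The compositional design can be seen directly in the pipeline API, where the interaction model, loss function, training approach, and inverse-relation modeling are independent choices; the specific values below are arbitrary illustrations, and the Optuna-backed HPO entry point is shown at the end.
```python
# Minimal sketch of PyKEEN 1.0's compositional design: each component is
# selected independently. Hyper-parameter values here are arbitrary.
from pykeen.pipeline import pipeline

result = pipeline(
    dataset="Nations",
    dataset_kwargs=dict(create_inverse_triples=True),  # explicit inverses
    model="DistMult",           # interaction model
    loss="marginranking",       # loss function
    training_loop="sLCWA",      # negative-sampling training approach
    training_kwargs=dict(num_epochs=50),
)
print(result.get_metric("mean_reciprocal_rank"))

# Optuna-backed hyper-parameter optimization is exposed the same way:
from pykeen.hpo import hpo_pipeline

hpo_result = hpo_pipeline(n_trials=10, dataset="Nations", model="DistMult")
```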
  • Publication
    Analysing the evolution of computer science events leveraging a scholarly knowledge graph: A scientometrics study of top-ranked events in the past decade
    (2021)
    Lackner, A.; Fathalla, Said; Nayyeri, M.; Behrend, A.; Manthey, R.; Vahdati, Sahar
    The publish-or-perish culture of scholarly communication results in quality and relevance being subordinate to quantity. Scientific events such as conferences play an important role in scholarly communication and knowledge exchange. Researchers in many fields, such as computer science, often need to search for events at which to publish their research results, establish connections for collaboration with other researchers, and stay up to date with recent work. Researchers need a meta-research understanding of the quality of scientific events in order to publish in high-quality venues. However, there are many diverse and complex criteria to be explored for the evaluation of events, so finding events that meet quality-related criteria becomes a time-consuming task for researchers and often results in an experience-based, subjective evaluation. OpenResearch.org is a crowd-sourcing platform that provides features to explore previous and upcoming events in computer science, based on a knowledge graph. In this paper, we devise an ontology representing scientific event metadata. Furthermore, we present an analytical study of the evolution of computer science events leveraging the OpenResearch.org knowledge graph. We identify common characteristics of these events, formalize them, and combine them into a group of metrics that potential authors can use to identify high-quality events. On top of the improved ontology, we analyze the metadata of renowned conferences in various computer science communities, such as VLDB, ISWC, ESWC, WIMS, and SEMANTiCS, in order to inspect their potential as event metrics.
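    As a purely hypothetical illustration of such a metric, the sketch below computes an acceptance rate from invented submission counts; neither the numbers nor the field names come from the OpenResearch.org knowledge graph.
```python
# Hypothetical illustration of one simple event-quality metric of the kind
# the paper formalizes. All field names and values are invented.
import pandas as pd

events = pd.DataFrame({
    "event": ["Conf A 2019", "Conf A 2020", "Conf B 2020"],
    "submissions": [283, 253, 166],  # invented numbers
    "accepted": [74, 65, 52],        # invented numbers
})

# Acceptance rate as one comparable, quality-related criterion.
events["acceptance_rate"] = events["accepted"] / events["submissions"]
print(events.sort_values("acceptance_rate"))
```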
  • Publication
    Introduction to neural network-based question answering over knowledge graphs
    Question answering has emerged as an intuitive way of querying structured data sources and has seen significant advances over the years. A large body of recent work on question answering over knowledge graphs (KGQA) employs neural network-based systems. In this article, we provide an overview of these neural network-based methods for KGQA. We introduce readers to the formalism and the challenges of the task, discuss the different paradigms and approaches, highlight notable advancements, and outline emerging trends in the field. Through this article, we aim to provide newcomers to the field with a suitable entry point to semantic parsing for KGQA, and to ease their process of making informed decisions while creating their own QA systems.
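    Whatever the neural architecture, a KGQA system ultimately executes a formal query against the knowledge graph. A minimal sketch of that execution step against the public DBpedia endpoint, for the question "Who wrote The Hobbit?":
```python
# Execute a (generated) SPARQL query against DBpedia and print the answer
# entities. Requires the 'SPARQLWrapper' package; the query corresponds to
# the question "Who wrote The Hobbit?".
from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?author WHERE {
        <http://dbpedia.org/resource/The_Hobbit> dbo:author ?author .
    }
""")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["author"]["value"])
```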