  • Publication
    Deep Dynamic Language Models
    This thesis investigates the domain of deep dynamic language models, focusing on the integration of temporal dynamics to enhance language modeling and its application in various tasks, such as text generation, recommendation systems, and predicting post popularity. Temporal content change, i.e., trends and themes that change over time in document collections such as academic journals, news articles, and social media, makes traditional static language models (LMs) a suboptimal solution. To address this limitation, several approaches to developing dynamic LMs are proposed and explored in this thesis. Initially, the impact of incorporating temporal information is explored, specifically in the context of modeling online communities. For the analysis of temporal content change in Yelp - a crowd-sourced review platform - an instantaneous language model is proposed. This model combines a temporal point process (TPP) for modeling review creation times and an LM to capture textual aspects. Empirical evaluations demonstrate that this model significantly improves the performance of LMs in terms of both language modeling and prediction of review creation time. Building upon the success of the instantaneous LM, the research in this thesis is extended to more application-oriented tasks, such as recommender systems. Recognizing that user preferences and item reviews change over time, the proposed model here leverages users’ reviews to enhance rating predictions. By developing time-interval-aware representations, the proposed model outperforms several state-of-the-art recommender system models on real-world datasets. Additionally, the integration of dynamic topic models into LMs is explored. First, the problem of skewed topic distributions in topic modeling is addressed, which can cause models to learn more general topics present in the majority of documents rather than rare topics present in only a few documents. A neural dynamic focused topic model is proposed as a solution, which decouples topic activities from topic proportions in documents using sequences of Bernoulli random variables. Experimental evaluations show that this model outperforms state-of-the-art topic models in generalization tasks while employing a comparable number of parameters and converging twice as fast. Furthermore, the performance of large pre-trained language models (LPLMs) in dynamic environments is explored. The empirical analysis on Reddit datasets reveals significant performance drops when predicting the popularity of future posts due to temporal distribution shifts in the data. To mitigate this issue, a model is proposed that combines neural variational dynamic topic models and attention mechanisms to infer temporal LM representations. The proposed model exhibits improved performance while utilizing only a fraction of the parameters of LPLMs and provides interpretable representations that offer insights into real-world events. In summary, this thesis emphasizes the significance of incorporating temporal dynamics into LMs and explores their application in various tasks.
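    The sketch below illustrates the general shape of an instantaneous language model as described above: a toy GRU language model whose final hidden state also parameterizes a simple exponential temporal point process over the time until the next review, so that text and creation times are scored jointly. The architecture, dimensions, and exponential intensity are illustrative assumptions, not the thesis's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstantaneousLM(nn.Module):
    # Hypothetical architecture: a GRU language model whose final hidden state
    # also parameterizes an exponential temporal point process (TPP) over the
    # time until the next review.
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)
        # Maps the text representation to a positive event intensity.
        self.intensity_head = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Softplus())

    def forward(self, tokens, inter_arrival_times):
        # tokens: (batch, seq_len) token ids; inter_arrival_times: (batch,) time gaps
        h, _ = self.rnn(self.embed(tokens))
        lm_logits = self.lm_head(h[:, :-1])                      # predict next token
        lm_loss = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                                  tokens[:, 1:].reshape(-1))
        lam = self.intensity_head(h[:, -1]).squeeze(-1)          # intensity per review
        # Negative log-likelihood of an exponential TPP: -log(lambda) + lambda * dt
        tpp_loss = (-torch.log(lam + 1e-8) + lam * inter_arrival_times).mean()
        return lm_loss + tpp_loss

model = InstantaneousLM(vocab_size=5000)
tokens = torch.randint(0, 5000, (8, 20))   # a mini-batch of toy reviews
gaps = torch.rand(8)                       # time since the previous review
loss = model(tokens, gaps)
loss.backward()
```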
  • Publication
    Informed Machine Learning: Integrating Prior Knowledge into Data-Driven Learning Systems
    (2023-11-14)
    Garcke, Jochen
    Machine Learning is an important method in Artificial Intelligence (AI). It has shown great success in building models for tasks like prediction or image recognition by learning from patterns in large amounts of data. However, it can reach its limits when dealing with insufficient training data. A potential solution is the additional integration of prior knowledge, such as physical laws, logic rules, or knowledge graphs. This leads to the notion of Informed Machine Learning (Informed ML). However, the field is so application-driven that general analyses are rare. The goal of this PhD thesis is the unification of Informed ML through general, systematic frameworks. In particular, the following research questions are answered: 1) What is the fundamental concept of Informed ML, and how can existing approaches be structurally classified? 2) Is it possible to integrate prior knowledge in a universal way? 3) How can the benefits of Informed ML be quantified, and what are the requirements for the injected knowledge? First, a concept for Informed ML is proposed, which defines it as learning from a hybrid information source that consists of data and prior knowledge. A taxonomy that serves as a structured classification framework for existing or potential approaches is presented. It considers the knowledge source, its representation type, and the integration stage into the ML pipeline. The concept of Informed ML is further extended to the combination of ML and simulation towards Hybrid AI. Then, two new methods for universal knowledge integration are developed. The first method, Informed Pre-Training, initializes neural networks with prototypes derived from prior knowledge. Experiments show that it improves generalization, especially for small data, and increases robustness. An analysis of the individual neural network layers shows that the improvements come from transferring the deeper layers, which confirms the transfer of semantic knowledge (Informed Transfer Learning). The second method, Geo-Informed Validation, checks models for their conformity with knowledge from street maps. It is developed in the application context of autonomous driving, where it can help to prevent potential prediction errors, e.g., in semantic segmentations of traffic scenes. Finally, a catalogue of relevant metrics for quantifying the benefits of knowledge injection is defined. Among others, it includes in-distribution accuracy, out-of-distribution robustness, and knowledge conformity, and a new metric that combines performance improvement and data reduction is introduced. Furthermore, a theoretical framework that represents prior knowledge in a function space and relates it to data representations is presented. It reveals that the distances between knowledge and data influence potential model improvements, which is confirmed in a systematic experimental study. All in all, these frameworks support the unification of Informed ML, which makes it more accessible and usable - and helps to achieve trustworthy AI.
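    A minimal sketch of the Informed Pre-Training idea, assuming the prior knowledge has already been distilled into labeled prototype examples; the model, optimizer settings, and prototype construction here are placeholders, not the thesis's setup.

```python
import torch
import torch.nn as nn

def informed_pretrain(model, prototypes, labels, epochs=50, lr=1e-3):
    """Pre-train on knowledge-derived prototypes before fine-tuning on real data."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(prototypes), labels)
        loss.backward()
        opt.step()
    return model  # weights are now initialized from prior knowledge

# Usage: the prototypes could be idealized class examples derived from rules.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 3))
prototypes = torch.randn(3, 8)          # stand-in for knowledge prototypes
labels = torch.tensor([0, 1, 2])
model = informed_pretrain(model, prototypes, labels)
# ...then continue with standard supervised fine-tuning on the real dataset.
```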
  • Publication
    Investigating Graph Representation Learning Methods for Link Prediction in Knowledge Graphs
    (2023-11-03)
    Lehmann, Jens
    Knowledge graphs (KGs) have become a fundamental approach to representing structured data and are employed in academic and industrial applications. KGs are used in various machine learning applications, such as question answering, dialogue systems, and recommendation systems. Although real-world KGs contain up to billions of links, they are usually still incomplete, which can severely impact downstream applications. Link prediction in KGs is the task of predicting missing links and can be performed in a transductive or inductive setting. In the past, a wide range of link prediction approaches have been proposed, encompassing rule-based and machine learning-based approaches. One promising line of research has been link prediction based on graph representation learning methods. In particular, a large number of knowledge graph embedding models (KGEMs) have been proposed, and recently, graph neural network (GNN)-based approaches have also been used for link prediction within KGs. Despite the intensive research efforts in KGEMs, their capabilities are often not transparent. It has been shown that baseline models can obtain results competitive with state-of-the-art models when configured appropriately, indicating that the performance of a KGEM may not merely depend on its model architecture, but on the interplay of various components. Link prediction within KGs has mainly been investigated in the transductive setting, prohibiting inference over unseen entities. However, lately, inductive link prediction approaches have received increased attention since they are capable of predicting links involving unseen entities. In this thesis, we propose an extensive ecosystem for investigating the performance of KGEM-based link prediction. We used the developed ecosystem to first perform a reproducibility study in which we investigated the reproducibility crisis of KGEM-based link prediction experiments. Second, we performed the most extensive KGEM-based link prediction study, in which we investigated whether incremental performance improvements reported for KGEMs can solely be attributed to the model architectures or to the combination of the KGEM's components. After providing an in-depth analysis of transductive link prediction within triple-based KGs, we focus on inductive link prediction within hyper-relational KGs. We bridge the concepts of inductive link prediction and hyper-relational KGs and demonstrate that hyper-relational information improves semi- and fully-inductive link prediction. Finally, we demonstrate the effectiveness of knowledge graph representation learning for addressing biomedical applications.
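    As a concrete example of the KGEM family discussed above, the following sketch implements the classic TransE scoring function and uses it to rank candidate tails for a link prediction query; TransE is shown only as a representative model from the literature, and the entity and relation counts are arbitrary.

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    # A classic KGEM: plausible triples (h, r, t) should satisfy h + r ≈ t.
    def __init__(self, num_entities, num_relations, dim=100):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)
        nn.init.xavier_uniform_(self.ent.weight)
        nn.init.xavier_uniform_(self.rel.weight)

    def score(self, h, r, t):
        # Lower score = more plausible triple (L1 distance).
        return torch.norm(self.ent(h) + self.rel(r) - self.ent(t), p=1, dim=-1)

# Link prediction ranks all candidate tails for a query (h, r, ?):
model = TransE(num_entities=1000, num_relations=50)
h = torch.tensor([3])
r = torch.tensor([7])
candidates = torch.arange(1000)
scores = model.score(h.expand(1000), r.expand(1000), candidates)
ranked_tails = torch.argsort(scores)  # best completions first
```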
  • Publication
    Advancing Knowledge-Enhanced Conversational Systems Leveraging Language Models
    (2023-09-18)
    Rony, Md Rashad Al Hasan
    Lehmann, Jens
    Large language models empowering recent conversational systems such as Alexa and Siri require external knowledge to generate informative and accurate dialogues. The knowledge may be provided in structured or unstructured forms, such as knowledge graphs, documents, and databases. Typically, language models face several issues when attempting to incorporate knowledge for conversational question answering: 1) they are unable to capture the relationships between facts in structured knowledge, 2) they lack the capability of handling dynamic knowledge in a multi-domain conversational setting, 3) because of the scarcity of unsupervised approaches for question answering over knowledge graphs (KGQA), systems often require a large amount of training data, and 4) because of the complexities and dependencies involved in the KGQA process, it is difficult to generate a formal query for question answering. All of these issues result in uninformative and incorrect answers. Furthermore, an evaluation metric that can capture various aspects of the system response, such as semantic, syntactic, and grammatical acceptability, is necessary to ensure the quality of such conversational question answering systems. Addressing these shortcomings, in this thesis we propose techniques for incorporating structured and unstructured knowledge into pre-trained language models to improve conversational question answering systems. First, we propose a novel task-oriented dialogue system that introduces structure-aware knowledge embedding and knowledge graph-weighted attention masking strategies to facilitate a language model in selecting relevant facts from a KG for informative dialogue generation. Experimental results on the benchmark datasets demonstrate significant improvements over previous baselines. Next, we introduce an unsupervised KGQA system, leveraging several pre-trained language models to improve the essential components (i.e., entity and relation linking) of KGQA. The system further introduces a novel tree-based algorithm for extracting the answer entities from a KG. The proposed techniques relax the need for training data to improve KGQA performance. Then, we introduce a generative system that combines the benefits of end-to-end and modular systems and leverages a GPT-2 language model to learn graph-specific information (i.e., entities and relations) in its parameters to generate SPARQL queries for extracting answer entities from a KG. The proposed system encodes linguistic features of a question to understand complex question patterns and generate accurate SPARQL queries. Afterward, we develop a system demonstrator for question answering over unstructured documents about climate change. Pre-trained language models are leveraged to index unstructured text documents into a dense space for document retrieval and question answering. Finally, we propose an automatic evaluation metric incorporating several core aspects of natural language understanding (language competence, syntactic and semantic variation). A comprehensive evaluation exhibits the effectiveness of our proposed metric over the state-of-the-art approaches. Overall, our contributions show that the effective incorporation of external knowledge into a language model significantly improves the performance of conversational question answering. We have made all the resources and code used in the proposed systems publicly available.
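    The following is a hedged sketch of the general idea behind knowledge graph-weighted attention masking: a per-fact relevance weight derived from the KG is folded into the attention logits so that irrelevant facts are effectively masked out. The function name, tensor shapes, and the log-weight formulation are assumptions for illustration, not the thesis's exact layer.

```python
import torch

def kg_weighted_attention(q, k, v, kg_relevance):
    """Attention over candidate KG facts, biased by KG-derived relevance weights."""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5                    # (batch, q_len, k_len)
    # Adding log-relevance down-weights (near zero relevance ~ masks out) facts.
    logits = logits + torch.log(kg_relevance + 1e-8).unsqueeze(1)
    attn = torch.softmax(logits, dim=-1)
    return attn @ v

# Example: 4 candidate KG facts, two of them marked irrelevant (zero weight).
q = torch.randn(1, 1, 16)
k = torch.randn(1, 4, 16)
v = torch.randn(1, 4, 16)
kg_relevance = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
context = kg_weighted_attention(q, k, v, kg_relevance)   # attends only to facts 0 and 2
```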
  • Publication
    Robust Information Extraction From Unstructured Documents
    In computer science, robustness can be thought of as the ability of a system to handle erroneous or nonstandard input during execution. This thesis studies the robustness of methods that extract structured information from unstructured documents containing human language texts. Unfortunately, these methods usually suffer from various problems that prevent achieving robustness to the nonstandard inputs encountered during system execution in real-world scenarios. Throughout the thesis, the key components of the information extraction workflow are analyzed and several novel techniques and enhancements that lead to improved robustness of this process are presented. Firstly, a deep learning-based text recognition method, which can be trained almost exclusively using synthetically generated documents, and a novel data augmentation technique, which improves the accuracy of text recognition on low-quality documents, are presented. Moreover, a novel noise-aware training method that encourages neural network models to build a noise-resistant latent representation of the input is introduced. This approach is shown to improve the accuracy of sequence labeling performed on misrecognized and mistyped text. Further improvements in robustness are achieved by applying noisy language modeling to learn a meaningful representation of misrecognized and mistyped natural language tokens. Furthermore, for the restoration of structural information from documents, a holistic table extraction system is presented. It exhibits high recognition accuracy in a scenario where raw documents are used as input and the target information is contained in tables. Finally, this thesis introduces a novel evaluation method for the table recognition process that works in a scenario where the exact location of table objects on a page is not available in the ground-truth annotations. Experimental results are presented on optical character recognition, named entity recognition, part-of-speech tagging, syntactic chunking, and table recognition and interpretation, demonstrating the advantages and the utility of the presented approaches. Moreover, the code and the resources from most of the experiments have been made publicly available to facilitate future research on improving the robustness of information extraction systems.
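    To make the noise-aware training idea above concrete, the toy augmenter below injects common OCR-style character confusions into clean text; both the clean sentence and its noisy variant would then be fed to the sequence labeler so it learns a noise-resistant representation. The confusion table and probabilities are invented, and the thesis's noise model and training objective are more involved.

```python
import random

# Common OCR-style confusions; purely illustrative, not the thesis's noise model.
CONFUSIONS = {"rn": "m", "o": "0", "l": "1", "e": "c"}

def add_ocr_noise(text, p=0.1):
    """Randomly applies character confusions to simulate misrecognized text."""
    out, i = [], 0
    while i < len(text):
        for src, tgt in CONFUSIONS.items():
            if text.startswith(src, i) and random.random() < p:
                out.append(tgt)
                i += len(src)
                break
        else:
            out.append(text[i])   # no confusion applied; keep the character
            i += 1
    return "".join(out)

random.seed(0)
print(add_ocr_noise("modern electronic records", p=0.5))
```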
  • Publication
    Robust Speech Recognition via Adaptation for German Oral History Interviews
    Automatic speech recognition systems often achieve remarkable performance when trained on thousands of hours of manually annotated and time-aligned speech. However, when applied in other conditions and domains than they were trained on, the systems' recognition quality often deteriorates, substantially limiting their real-world application. One of these applications is the automatic transcription of oral history interviews, i.e., interviews with witnesses of historical events. For the past twenty years, oral history interviews have been among the most challenging use cases for speech recognition due to a lack of representative training data, diverse and often poor recording conditions, and the spontaneous and occasionally colloquial nature of the speech. This thesis proposes and studies the combination of different domain adaptation approaches to overcome the lack of representative training data and cope with the unpredictability of oral history interviews. We employ and investigate data augmentation to adapt broadcast training data to cover the challenging recording conditions of oral history interviews. We compare data augmentation approaches to conventional speech enhancement. To improve the system's performance further, we study domain adaptation via fine-tuning, adapting acoustic models trained robustly on thousands of hours of annotated speech using a minimal amount of manually transcribed oral history interviews. We employ automatic transcript alignment to generate adaptation data from transcribed but not time-aligned interviews and investigate the influence of different adaptation data sizes on domain overfitting and generalization. We reduce domain overfitting and improve the generalization of the adapted models by employing cross-lingual adaptation in a multi-stage setup to leverage the vast availability of English speech corpora. Additionally, in this thesis, a human word error rate for German oral history interviews recorded under clean conditions is experimentally estimated to study and highlight the challenges of transcription even for humans and to put current results of automatic transcription into perspective. The proposed methods are evaluated on a representative oral history test set for the target domain and several additional German test sets from different domains. With this evaluation, we ensure high robustness, obtain a reliable estimate of the real-world performance for conditions not seen in training, and avoid selecting models that suffer from domain overfitting. Overall, we halved the word error rate compared to the baseline using the proposed methods, simultaneously improving the recognition performance on the other domains by a substantial margin.
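    A minimal sketch of the kind of data augmentation described above: mixing a noise recording into clean broadcast speech at a chosen signal-to-noise ratio to roughly simulate poorer interview recording conditions. The signals here are synthetic stand-ins, and the thesis's augmentation pipeline is more elaborate.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix noise into clean speech at the requested signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)            # loop/trim the noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that p_speech / (scale^2 * p_noise) = 10^(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example with synthetic signals (real use would load waveforms instead).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)       # 1 s of stand-in "speech" at 16 kHz
babble = rng.standard_normal(8000)       # shorter stand-in noise clip
augmented = mix_at_snr(clean, babble, snr_db=10)
```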
  • Publication
    Knowledge Extraction Methods for the Analysis of Contractual Agreements
    The ubiquitous availability of the Internet results in a massive number of apps, software, and online services with accompanying contractual agreements in the form of end-user license agreements and privacy policies. Often the textual documents describing rights, policies, and conditions comprise many pages and cannot reasonably be assumed to be read and understood by humans. Although everyone is exposed to such consent forms, the majority tend to ignore them due to their length and complexity. However, the cost of ignoring terms and conditions is not always negligible, and occasionally people have to pay (money or other means) as a result of their oversight. In this thesis, we focus on the interpretation of contractual agreements for the benefit of end-users. Contractual agreements encompass both the privacy policies and the general terms and conditions related to software and services. The main characteristics of such agreements are their use of legal terminology and limited vocabulary. This feature has pros and cons. On one hand, the clear structure and legal language facilitate the mapping between human-readable agreements and machine-processable concepts. On the other hand, the legal terminology makes the contractual agreement complex, subjective, and, therefore, open to interpretation. This thesis addresses the problem of contractual agreement analysis from both perspectives. In order to provide a structured presentation of contractual agreements, we apply text mining and semantic technologies to develop approaches that extract important information from the agreements and retrieve helpful links and resources for better comprehension. Our approaches are based on ontology-based information extraction, machine learning, and semantic similarity, and aim to deliver tedious consent forms in a user-friendly and visualized format. The ontology-based information extraction approach processes the human-readable license agreement guided by a domain ontology to extract deontic modalities and presents a summarized output to the end-user. In the extraction phase, we focus on three key rights and conditions: permission, prohibition, and duty. We then cluster the extracted excerpts according to their similarities. The clustering is based on semantic similarity, employing a distributional semantics approach on a large word-embeddings database. The machine learning method employs deep neural networks to classify a privacy policy's paragraphs into pre-defined categories. Since the prediction results of the trained model are promising, we further use the predicted classes to assign risk colors (Green, Yellow, Red) to five privacy icons (Expected Use, Expected Collection, Precise Location, Data Retention, and Children Privacy). Furthermore, given that any contractual agreement must comply with the relevant legislation, we utilize text semantic similarity to map an agreement's content to regulatory documents. The semantic similarity-based approach finds candidate sentences in an agreement that are potentially related to specific articles in the regulation. Then, for each candidate sentence, the relevant article and provision are found according to their semantic similarity. The achieved results from our proposed approaches allow us to conclude that although semi-automatic approaches lead to information loss, they save time and effort by producing instant results and facilitate end-users' understanding of legal texts.
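    The snippet below sketches the semantic-similarity mapping from agreement sentences to regulation articles. It uses TF-IDF vectors and cosine similarity as a simplified stand-in for the word-embedding-based distributional semantics used in the thesis; the example sentences and article texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

agreement_sentences = [
    "We may share your usage data with third-party advertising partners.",
    "You can request deletion of your account at any time.",
]
regulation_articles = [
    "Article 17: The data subject shall have the right to erasure of personal data.",
    "Article 44: Transfers of personal data to third parties are subject to conditions.",
]

# Represent both sides in a common vector space and compare by cosine similarity.
vectorizer = TfidfVectorizer().fit(agreement_sentences + regulation_articles)
sims = cosine_similarity(
    vectorizer.transform(agreement_sentences),
    vectorizer.transform(regulation_articles),
)
for i, sentence in enumerate(agreement_sentences):
    best_match = regulation_articles[sims[i].argmax()]
    print(sentence, "->", best_match)
```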
  • Publication
    Discovering structure in speech recordings: Unsupervised learning of word and phoneme like units for automatic speech recognition
    While speech recordings can be created easily, transcribing these recordings can be very expensive and time-consuming. Methods for automatically creating such transcriptions for unannotated data can therefore help simplify the training of speech recognizers for languages for which little or no annotated training data is available. This thesis investigates and presents methods for automatically learning transcriptions from audio recordings alone. Algorithms are presented for learning phonemes, the smallest units of speech, and words. These methods can be used to automatically train a speech recognizer from unannotated data. The thesis examines the methods for learning phonemes and words separately. The main focus of this work is the unsupervised learning of words in hierarchical models consisting of phoneme and word transcriptions. Three commonly used approaches are investigated: heuristic methods on the one hand, and two variants of statistical model-based methods on the other. The first variant is based on a probabilistic pronunciation lexicon, while the second relies on the segmentation of word lattices. Finally, a fully unsupervised system combining unsupervised phoneme learning and unsupervised word segmentation is presented. The thesis concludes with the integration of unsupervised phoneme and word learning into semantic inference, demonstrating the usability of unsupervisedly learned phonemes and words in a higher-level system as well as their ability to improve recognition results.
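    As a toy illustration of segmenting a phoneme stream into word-like units, the sketch below runs a Viterbi search over a small probabilistic pronunciation lexicon; the lexicon, the word-length limit, and the unigram scoring are assumptions for illustration, and the thesis's Bayesian models are far richer.

```python
import math

def segment_phonemes(phonemes, lexicon):
    """Viterbi segmentation of a phoneme sequence into words under a unigram lexicon."""
    n = len(phonemes)
    best = [0.0] + [float("-inf")] * n      # best log-prob of segmenting the prefix
    back = [None] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - 6), end):   # limit words to 6 phonemes
            word = tuple(phonemes[start:end])
            if word in lexicon and best[start] + math.log(lexicon[word]) > best[end]:
                best[end] = best[start] + math.log(lexicon[word])
                back[end] = start
    # Reconstruct the best segmentation from the backpointers.
    words, end = [], n
    while end and back[end] is not None:
        words.append(tuple(phonemes[back[end]:end]))
        end = back[end]
    return list(reversed(words))

lexicon = {("h", "e", "l", "o"): 0.4, ("w", "er", "l", "d"): 0.4, ("h", "e"): 0.2}
print(segment_phonemes(["h", "e", "l", "o", "w", "er", "l", "d"], lexicon))
```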
  • Publication
    Towards Complex Question Answering over Knowledge Graphs
    Over the past decade, Knowledge Graphs (KG) have emerged as a prominent repository for storing facts about the world in a linked data architecture. Providing machines with the capability of exploring such Knowledge Graphs and answering natural language questions over them has been an active area of research. The purpose of this work is to delve further into the research of retrieving information stored in KGs based on the natural language questions posed by the user. Knowledge Graph Question Answering (KGQA) aims to produce a concise answer to a user question, such that the user is exempt from using KG vocabulary and the overhead of learning a formal query language. Existing KGQA systems have achieved excellent results over Simple Questions, where the information required is limited to a single triple and a single formal query pattern. Our motivation is to improve the performance of KGQA over Complex Questions, where formal query patterns vary significantly and a single triple cannot contain all the required information. Complex KGQA poses several challenges, such as understanding the semantics and syntactic structure of questions, Entity Linking, Relation Linking, and Answer Representation. The lack of suitable datasets for complex question answering further adds to the research gaps. Hence, in this thesis, we focus on the research objective of laying the foundations for the advancement of the state of the art in Complex Question Answering over Knowledge Graphs, by providing techniques to solve various challenges and resources to fill the research gaps. First, we propose the Normalized Query Structure (NQS), a linguistic analyzer module that helps the QA system detect inputs and intents and the relations between them in the user's question. NQS acts as an intermediate language between natural language questions and formal expressions to ease the process of query formulation for complex questions. We then developed a framework named LC-QuAD to generate a large-scale question answering dataset by reversing the process of question answering, thereby generating natural language questions from formal queries using intermediate templates. Our goal is to use this framework to cover high variation in query patterns and create a large dataset with minimal human effort. The first version of the dataset consists of 5,000 complex questions. By extending the LC-QuAD framework to support Reified KGs and crowd-sourcing, we published the second version of the dataset as LC-QuAD 2.0, consisting of 30,000 questions with their paraphrases, with higher complexity and new variations in the questions. To overcome the problem of Entity Linking and Relation Linking in KGQA, we develop EARL, a module performing these two tasks as a single joint task for complex question answering. We develop two approaches for this module: the first formalizes the task as an instance of the Generalized Traveling Salesman Problem (GTSP), and the second uses machine learning to exploit the connection density between nodes in the Knowledge Graph. Lastly, we create another large-scale dataset for answer verbalization and provide results for multiple baseline systems on it. The verbalization dataset is introduced to make the system's response more human-like. The NQS-based KGQA system was second only to the best system in terms of accuracy on the QALD-5 dataset. We empirically prove that NQS is robust in tackling paraphrases of the questions.
    EARL achieves state-of-the-art results in Entity Linking and Relation Linking for question answering on several KGQA datasets. The datasets curated in this thesis have helped the research community move forward in improving the accuracy of complex question answering, as other researchers have also developed several KGQA systems and modules around these published datasets. With the large-scale datasets, we have encouraged the use of large-scale machine learning and deep learning and the emergence of new techniques to advance the state of the art in complex question answering over knowledge graphs. We further developed core components for the KGQA pipeline to overcome the challenges of Question Understanding, Entity-Relation Linking, and Answer Verbalization, and thus achieve our research objective. All the proposed approaches mentioned in this thesis and the published resources are available at https://github.com/AskNowQA and are released under the umbrella project AskNow.
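    The snippet below sketches the template-based generation idea behind LC-QuAD: a SPARQL template and a matching natural-language template are instantiated with the same entity and relation, yielding a question-query pair that can later be corrected or paraphrased by humans. The templates and the DBpedia example are illustrative choices, not the framework's actual template set.

```python
# One simple template pair; LC-QuAD uses many templates of varying complexity.
sparql_template = "SELECT ?x WHERE {{ <{e}> <{r}> ?x }}"
question_template = "What is the {r_label} of {e_label}?"

def generate_pair(entity, relation, e_label, r_label):
    """Instantiate both templates with the same entity/relation to get a QA pair."""
    return (
        question_template.format(e_label=e_label, r_label=r_label),
        sparql_template.format(e=entity, r=relation),
    )

question, query = generate_pair(
    entity="http://dbpedia.org/resource/Bonn",
    relation="http://dbpedia.org/ontology/populationTotal",
    e_label="Bonn",
    r_label="population",
)
print(question)   # "What is the population of Bonn?"
print(query)
```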
  • Publication
    Mining Frequent Itemsets from Transactional Data Streams with Probabilistic Error Bounds
    Frequent itemset mining is a classical data mining task with a broad range of applications, including fraud discovery and product recommendation. The enumeration of frequent itemsets has two main benefits for such applications: First, frequent itemsets provide a human-understandable representation of knowledge. This is crucial as human experts are involved in designing systems for these applications. Second, many efficient algorithms are known for mining frequent itemsets. This is essential as many of today's real-world applications produce ever-growing data streams. Examples of these are online shopping, electronic payment, or phone call transactions. With limited physical main memory, the analysis of data streams can, in general, only be approximate. State-of-the-art algorithms for frequent itemset mining from such streams bound their error by processing the transactions in blocks of fixed size, either each transaction individually or in mini-batches. In theory, single-transaction-based updates provide the most up-to-date result after each transaction, but this enumeration is inefficient in practice as the number of frequent itemsets for a single transaction can be exponential in its cardinality. Mini-batch-based algorithms are faster but can only produce a new result at the end of each batch. In this thesis, the binary choice between up-to-date results and speed is eliminated. To provide more flexibility, we develop new algorithms with a probabilistic error bound that can process an arbitrary number of transactions in each batch. State-of-the-art algorithms mining frequent itemsets from data streams with mini-batches derive the size of the mini-batch from a user-defined error parameter and hence couple their error bound to the size of the update. By introducing a dynamic error bound that adapts to the length of the data stream, the error is decoupled from the size of the update. The benefits of this approach are twofold: First, the dynamic error bound is independent of the size of the update. Hence, an arbitrary number of transactions can be processed without losing the error bound. Second, the bound becomes tighter as more transactions arrive and thus the tolerated error decreases, in contrast to algorithms with static thresholds. Our approach is extensively compared to the state-of-the-art in an empirical evaluation. The results confirm that the dynamic approach is not only more flexible but also outperforms the state-of-the-art in terms of F-score for a large number of data streams. As it is easier for experts to extract knowledge from a smaller collection, we also consider mining a compact pattern set. Especially useful are parameterized pattern classes for which the expert can regulate the size of the output. An example of such a parameterized pattern class is strongly closed itemsets. Additionally, they are stable against small changes in the data stream. We present an algorithm for mining strongly closed itemsets from data streams. It builds on reservoir sampling and is thus capable of producing a result after any number of transactions, once the initial sample is complete. The high approximation quality of the algorithm is empirically demonstrated, and the potential of strongly closed patterns for two stream mining tasks is shown: concept drift detection and product configuration recommendation.
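    The sketch below illustrates a dynamic, stream-length-dependent error bound in the spirit described above: a Hoeffding-style epsilon that tightens as more transactions arrive, so that approximate frequent itemsets can be reported after any number of transactions. The bound, the brute-force counting, and the parameters are simplifications for illustration, not the thesis's algorithm.

```python
import math
from collections import Counter
from itertools import combinations

def epsilon(n, delta=0.05):
    """Hoeffding-style bound that tightens as the number of transactions n grows."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def approximate_frequent_itemsets(stream, minsup=0.3, max_size=2, delta=0.05):
    counts, n = Counter(), 0
    for transaction in stream:
        n += 1
        # Brute-force counting of all itemsets up to max_size (illustration only).
        for k in range(1, max_size + 1):
            for itemset in combinations(sorted(set(transaction)), k):
                counts[itemset] += 1
        # Report itemsets whose observed support clears minsup minus the bound.
        yield n, {i for i, c in counts.items() if c / n >= minsup - epsilon(n, delta)}

stream = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"], ["a", "b"]]
for n, frequent in approximate_frequent_itemsets(stream):
    print(n, sorted(frequent))   # the tolerated error shrinks as n grows
```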