  • Publication
    Using Probabilistic Soft Logic to Improve Information Extraction in the Legal Domain
    (2020); Schmude, Timothée; Völkening, Malte; Rostalski, Frauke
    Extracting information from court process documents to populate a knowledge base produces data valuable to legal faculties, publishers and law firms. A challenge lies in the fact that the relevant information is interdependent and structured by numerous semantic constraints of the legal domain. Ignoring these dependencies leads to inferior solutions. Hence, the objective of this paper is to demonstrate how the extraction pipeline can be improved by the use of probabilistic soft logic rules that reflect both legal and linguistic knowledge. We propose a probabilistic rule model for the overall extraction pipeline, which makes it possible both to map dependencies between local extraction models and to integrate additional domain knowledge in the form of logical constraints. We evaluate the performance of the model on a corpus of German court sentences.
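    A minimal sketch of the soft-logic mechanism mentioned in the abstract: in Probabilistic Soft Logic, a weighted rule is relaxed via the Łukasiewicz t-norm and penalized by its distance to satisfaction. The predicates and the legal-domain rule below are hypothetical illustrations, not the rules used in the paper.

```python
# Sketch of how a PSL rule is relaxed into a soft penalty; the predicates and
# the legal-domain rule are hypothetical, not taken from the paper.

def luk_and(a: float, b: float) -> float:
    """Lukasiewicz t-norm: soft conjunction of two truth values in [0, 1]."""
    return max(0.0, a + b - 1.0)

def distance_to_satisfaction(body: float, head: float) -> float:
    """Hinge-loss distance for the implication body -> head."""
    return max(0.0, body - head)

# Hypothetical grounding: a local extractor believes a span denotes a defendant
# (0.8) and a linker believes the span refers to person P (0.9); a domain rule
# states that such spans should yield the relation hasRole(P, defendant).
body = luk_and(0.8, 0.9)   # soft truth of the rule body
head = 0.4                 # current belief in hasRole(P, defendant)
weight = 3.0               # rule weight reflecting domain confidence

penalty = weight * distance_to_satisfaction(body, head)
print(f"body={body:.2f}, penalty={penalty:.2f}")
# MAP inference in PSL minimizes the weighted sum of such penalties.
```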
  • Publication
    Improving Word Embeddings Using Kernel PCA
    Word-based embedding approaches such as Word2Vec capture the meaning of words and relations between them particularly well when trained with large text collections; however, they fail to do so with small datasets. Extensions such as fastText slightly reduce the amount of data needed; however, the joint task of learning meaningful morphological, syntactic and semantic representations still requires a lot of data. In this paper, we introduce a new approach to warm-start embedding models with morphological information in order to reduce training time and enhance their performance. We use word embeddings generated with both the word2vec and fastText models and enrich them with morphological information about words, derived from kernel principal component analysis (KPCA) of word similarity matrices. This can be seen as explicitly feeding the network morphological similarities and letting it learn semantic and syntactic similarities. Evaluating our models on word similarity and analogy tasks in English and German, we find that they not only achieve higher accuracies than the original skip-gram and fastText models but also require significantly less training data and time. Another benefit of our approach is that it can generate high-quality representations of infrequent words such as those found in very recent news articles with rapidly changing vocabularies. Lastly, we evaluate the different models on a downstream sentence classification task in which a CNN model is initialized with our embeddings, and find promising results.
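    The following sketch illustrates, under stated assumptions, how morphological vectors could be derived from a word similarity matrix with kernel PCA and then used to warm-start an embedding model; the character n-gram similarity and all parameters are illustrative, not the paper's exact setup.

```python
# Sketch: derive morphological features via kernel PCA from a word similarity
# matrix; the n-gram Jaccard similarity and all settings are assumptions.
import numpy as np
from sklearn.decomposition import KernelPCA

def ngram_set(word: str, n: int = 3) -> set:
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

vocab = ["walk", "walked", "walking", "talk", "talked", "jumping"]
grams = [ngram_set(w) for w in vocab]

# Symmetric word-by-word morphological similarity matrix (used as the kernel).
K = np.array([[jaccard(gi, gj) for gj in grams] for gi in grams])

# Project words into a low-dimensional "morphology space".
kpca = KernelPCA(n_components=4, kernel="precomputed")
morph_vecs = kpca.fit_transform(K)

# These vectors could seed (or be concatenated with) the input vectors of a
# skip-gram / fastText model before training, warm-starting it with morphology.
print(dict(zip(vocab, np.round(morph_vecs, 3))))
```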
  • Publication
    Noise Reduction in Distant Supervision for Relation Extraction Using Probabilistic Soft Logic
    The performance of modern relation extraction systems depends to a great degree on the size and quality of the underlying training corpus and in particular on the labels. Since generating these labels by human annotators is expensive, Distant Supervision has been proposed to automatically align entities in a knowledge base with a text corpus to generate annotations. However, this approach suffers from introducing noise, which negatively affects the performance of relation extraction systems. To tackle this problem, we propose a probabilistic graphical model which simultaneously incorporates different sources of knowledge, such as domain experts' knowledge about the context and linguistic knowledge about the sentence structure, in a principled way. The model is defined using the declarative language provided by Probabilistic Soft Logic. Experimental results show that the proposed approach, compared to the original distantly supervised set, improves not only the quality of the generated training data sets but also the performance of the final relation extraction model.
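    A minimal sketch of the distant supervision labeling step the abstract refers to; the toy knowledge base and sentences are invented for illustration, and the PSL-based noise filtering is only indicated in comments.

```python
# Sketch of distant supervision labeling; the tiny KB and sentences are made up.

knowledge_base = {("Berlin", "Germany"): "capital_of",
                  ("Paris", "France"): "capital_of"}

sentences = [
    "Berlin is the capital of Germany.",
    "Paris is far larger than most other cities in France.",  # true pair, wrong context
]

def distant_labels(sentences, kb):
    labels = []
    for sent in sentences:
        for (e1, e2), rel in kb.items():
            if e1 in sent and e2 in sent:
                # Distant supervision assumes every co-occurrence expresses rel,
                # which is exactly the source of label noise.
                labels.append((sent, e1, e2, rel))
    return labels

noisy = distant_labels(sentences, knowledge_base)
# A PSL model would now combine signals (context keywords, sentence structure,
# expert rules) to assign each noisy label a soft truth value and keep only the
# credible ones for training the relation extractor.
for example in noisy:
    print(example)
```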
  • Publication
    Making Efficient Use of a Domain Expert's Time in Relation Extraction
    Scarcity of labeled data is one of the most frequent problems faced in machine learning. This is particularly true for relation extraction in text mining, where large corpora of texts exist in many application domains, while labeling text data requires an expert to invest much time reading the documents. Overall, state-of-the-art models, like the convolutional neural network used in this paper, achieve great results when trained on sufficiently large amounts of labeled data. However, from a practical point of view the question arises whether this is the most efficient approach when one takes the manual effort of the expert into account. In this paper, we report on an alternative approach where we first construct a relation extraction model using distant supervision, and only later make use of a domain expert to refine the results. Distant supervision provides a means of labeling data given known relations in a knowledge base, but it suffers from noisy labeling. We introduce an active-learning-based extension that allows our neural network to incorporate expert feedback, and report first results on a complex data set.
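    A minimal sketch of an uncertainty-based active learning loop of the kind described above; the selection criterion and the dummy classifier outputs are assumptions for illustration, not the paper's exact CNN setup.

```python
# Sketch of uncertainty sampling on top of distantly supervised predictions.
import numpy as np

def select_for_expert(probabilities: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` least-confident predictions.

    probabilities: (n_instances, n_relations) softmax outputs of the current model.
    """
    confidence = probabilities.max(axis=1)   # confidence of the top relation
    return np.argsort(confidence)[:budget]   # least confident first

# Dummy scores standing in for a relation classifier trained on noisy labels.
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=np.ones(5), size=100)   # 100 instances, 5 relations

query_idx = select_for_expert(probs, budget=10)
# The expert relabels only these 10 instances; the corrected labels are merged
# back into the training set and the network is fine-tuned, iterating the loop.
print(query_idx)
```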
  • Publication
    Secure Top-k Subgroup Discovery
    Supervised descriptive rule discovery techniques like subgroup discovery are quite popular in applications like fraud detection or clinical studies. Compared with other descriptive techniques, like classical support/confidence association rules, subgroup discovery has the advantage that it comes up with only the top-k patterns, and that it makes use of a quality function that avoids patterns uncorrelated with the target. If these techniques are to be applied in privacy-sensitive scenarios involving distributed data, precise guarantees are needed regarding the amount of information leaked during the execution of the data mining process. Unfortunately, adapting secure multi-party protocols for classical support/confidence association rule mining to the task of subgroup discovery is impossible for fundamental reasons. The cause lies in the different quality function and the restriction to a fixed number of patterns, i.e., exactly the desired features of subgroup discovery. In this paper, we present a new protocol which allows distributed subgroup discovery while avoiding the disclosure of the individual databases. We analyze the properties of the protocol, describe a prototypical implementation and present experiments that demonstrate the feasibility of the approach.
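    A minimal sketch of (non-secure, single-site) top-k subgroup discovery with the weighted relative accuracy (WRAcc) quality function; WRAcc is a common choice and an assumption here, and the secure multi-party layer of the paper is omitted.

```python
# Sketch of top-k subgroup discovery with WRAcc on a toy dataset (assumed setup).
from itertools import combinations

# Toy dataset: each record is (attribute dict, binary target).
data = [({"age": "young", "area": "city"}, 1),
        ({"age": "young", "area": "rural"}, 1),
        ({"age": "old", "area": "city"}, 0),
        ({"age": "old", "area": "rural"}, 0),
        ({"age": "young", "area": "city"}, 0)]

def wracc(subgroup, data):
    """Weighted relative accuracy: coverage * (target share in subgroup - overall share)."""
    covered = [t for attrs, t in data if all(attrs.get(k) == v for k, v in subgroup)]
    if not covered:
        return 0.0
    p0 = sum(t for _, t in data) / len(data)
    return (len(covered) / len(data)) * (sum(covered) / len(covered) - p0)

# Candidate subgroups: all conjunctions of up to two attribute=value conditions.
conditions = sorted({(k, v) for attrs, _ in data for k, v in attrs.items()})
candidates = [c for r in (1, 2) for c in combinations(conditions, r)]

top_k = sorted(candidates, key=lambda sg: wracc(sg, data), reverse=True)[:3]
for sg in top_k:
    print(sg, round(wracc(sg, data), 3))
```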
  • Publication
    Enabling the reuse of data mining processes in healthcare by integrating data semantics
    (2011); Anguita, A.
    Biomedical researchers today deal with analyzing clinical and genomic data. However, such analysis scenarios are typically non-standardized and not easily reusable. Data mining patterns guide the application of data mining solutions to new practical problems. For the reuse of data mining solutions it is important that the new data set shares the semantics of the original one. This is particularly important in the medical domain, where data is often semantically heterogeneous. A formal representation of requirements and prerequisites would improve the efficiency of the reuse process. This addresses in particular the most time-consuming phases of a data mining project, namely data understanding and data preparation. We show how the integration of semantic information into data mining patterns enables the formal checking of data requirements in analysis scenarios. Our approach is based on encoding data requirements in a query targeted at a semantically annotated data source, and thus allows reusing concepts of semantic mediation.
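    A minimal sketch of encoding a data requirement as a query against a semantically annotated source, here as a SPARQL ASK query issued via rdflib; the vocabulary URIs and the requirement itself are hypothetical, not those used in the paper.

```python
# Sketch: check a data mining pattern's requirements against an annotated source.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/clinical#")   # hypothetical vocabulary

# Tiny annotated "data source": one dataset described by the variables it provides.
g = Graph()
g.add((EX.cohortA, RDF.type, EX.Dataset))
g.add((EX.cohortA, EX.providesVariable, EX.TumorStage))
g.add((EX.cohortA, EX.providesVariable, EX.Age))

# Requirement of a reusable pattern: the dataset must expose tumor stage and age.
requirement = """
PREFIX ex: <http://example.org/clinical#>
ASK {
    ?ds a ex:Dataset ;
        ex:providesVariable ex:TumorStage ;
        ex:providesVariable ex:Age .
}
"""

result = g.query(requirement)
print("pattern applicable:", result.askAnswer)  # True if the requirements are met
```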
  • Publication
    Towards an environment for data mining based analysis processes in bioinformatics & personalized medicine
    (2011); Rossi, Simona; Buffa, Francesca; Delorenzi, Mauro
    Bioinformatics and data mining are joining forces to implement and evaluate tools and procedures for the prediction of disease recurrence and progression, response to treatment, as well as new insights into various oncogenic pathways [1], [2], [3], [4], taking into account user needs and their heterogeneity. Based on these advances, medicine is undergoing a revolution that is even transforming the nature of health care from reactive to proactive [5]. The p-medicine (www.p-medicine.eu) consortium is creating a biomedical platform to facilitate the translation from current practice to a predictive, personalized, preventive, participatory and psycho-cognitive medicine by integrating VPH models, clinical practice, imaging and genomics data. In this paper, we present the challenges for data mining based analysis in bio- and medical informatics and our approach towards a data mining environment addressing these requirements within the p-medicine platform.