  • Publication
    Addressing a new Paradigm Shift: An Empirical Study on Novel Project Characteristics for Foundation Model Projects
    In recent years, data science and machine learning (ML) have become common across sectors and industries. Project methodologies aim to support such projects and try to catch up with ML trends and paradigm shifts. However, they are rarely successful: still, 80% of data science projects never reach deployment. The latest paradigm shift in ML - the trend toward generative AI and foundation models - changes the nature of data science projects and is not yet addressed by existing project methodologies. In this work, we present novel requirements that arise from real-world projects incorporating foundation models, based on 29 case studies from the NLU domain. Furthermore, we assess existing data science methodologies and identify their shortcomings. Finally, we provide guidance on adapting projects to address the new challenges in the development and operation of foundation model-based solutions.
  • Publication
    Natural Language Processing in der Medizin. Whitepaper
    Artificial intelligence (AI) has arrived in medicine and is already indispensable. Together with digitalization, AI accelerates the spread of data-driven, personalized patient care. Especially in hospitals, AI can help support staff, improve treatment outcomes, and reduce costs. AI applications are now capable of evaluating radiological imaging, supporting therapy decisions, and transcribing dictated speech. In particular, text processing has been revolutionized by Natural Language Processing (NLP) algorithms - AI concerned with natural language, that is, with reading, understanding, and writing texts such as medical reports, documentation, or clinical guidelines.
  • Publication
    Using Probabilistic Soft Logic to Improve Information Extraction in the Legal Domain
    (2020) Schmude, Timothée; Völkening, Malte; Rostalski, Frauke
    Extracting information from court process documents to populate a knowledge base produces data valuable to legal faculties, publishers, and law firms. A challenge lies in the fact that the relevant information is interdependent and structured by numerous semantic constraints of the legal domain; ignoring these dependencies leads to inferior solutions. Hence, the objective of this paper is to demonstrate how the extraction pipeline can be improved by the use of probabilistic soft logic rules that reflect both legal and linguistic knowledge. We propose a probabilistic rule model for the overall extraction pipeline, which enables us both to map dependencies between local extraction models and to integrate additional domain knowledge in the form of logical constraints. We evaluate the performance of the model on a corpus of German court sentences.
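    The mechanics behind rules like those in this abstract can be illustrated in a few lines: probabilistic soft logic relaxes Boolean rules to truth values in [0, 1] via the Łukasiewicz t-norm and penalizes a ground rule by its distance to satisfaction. This is a generic sketch, not the authors' implementation; the predicate names and truth values are invented.

    ```python
    # Illustrative sketch of how probabilistic soft logic (PSL) scores rules.
    # Predicate names and confidences below are hypothetical examples,
    # not taken from the paper.

    def luk_and(a: float, b: float) -> float:
        """Lukasiewicz t-norm: relaxed logical AND over soft truth values in [0, 1]."""
        return max(0.0, a + b - 1.0)

    def luk_implies(body: float, head: float) -> float:
        """Relaxed implication body -> head; 1.0 means fully satisfied."""
        return min(1.0, 1.0 - body + head)

    def distance_to_satisfaction(body: float, head: float) -> float:
        """PSL penalizes a ground rule by how far it is from being satisfied."""
        return max(0.0, body - head)

    # Soft truth values, e.g. confidences from local extraction models:
    mentions_judge = 0.9   # "Meier" is mentioned in a judge context
    signs_verdict = 0.8    # "Meier" signs the verdict
    is_judge = 0.4         # current belief that "Meier" holds the judge role

    # Rule: mentions_judge(x) AND signs_verdict(x) -> is_judge(x)
    body = luk_and(mentions_judge, signs_verdict)
    print(round(distance_to_satisfaction(body, is_judge), 2))  # prints 0.3
    ```

    Inference in a PSL system would then adjust the open truth values (here `is_judge`) to minimize the total weighted distance to satisfaction across all ground rules, which is how dependencies between local extraction decisions get reconciled.
    
    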
  • Publication
    Improving Word Embeddings Using Kernel PCA
    Word-based embedding approaches such as Word2Vec capture the meaning of words and the relations between them particularly well when trained on large text collections; however, they fail to do so with small datasets. Extensions such as fastText slightly reduce the amount of data needed; still, the joint task of learning meaningful morphological, syntactic, and semantic representations requires a lot of data. In this paper, we introduce a new approach to warm-start embedding models with morphological information in order to reduce training time and enhance their performance. We use word embeddings generated with both Word2Vec and fastText models and enrich them with morphological information about words, derived from kernel principal component analysis (KPCA) of word similarity matrices. This can be seen as explicitly feeding the network morphological similarities and letting it learn semantic and syntactic similarities. Evaluating our models on word similarity and analogy tasks in English and German, we find that they not only achieve higher accuracies than the original skip-gram and fastText models but also require significantly less training data and time. Another benefit of our approach is that it can generate high-quality representations of infrequent words, as found, for example, in very recent news articles with rapidly changing vocabularies. Lastly, we evaluate the different models on a downstream sentence classification task in which a CNN model is initialized with our embeddings, and find promising results.
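    The KPCA step described in this abstract can be sketched as follows: center a word-similarity kernel matrix and project the vocabulary onto its leading eigenvectors to obtain morphology-aware features. The toy vocabulary and the trigram-Jaccard similarity are illustrative stand-ins, not the similarity measure or data used in the paper.

    ```python
    import numpy as np

    # Toy vocabulary; the character-trigram Jaccard similarity below is a
    # hypothetical stand-in for the paper's morphological word similarity.
    words = ["play", "playing", "played", "run", "running"]

    def ngrams(w: str, n: int = 3) -> set:
        w = f"<{w}>"
        return {w[i:i + n] for i in range(len(w) - n + 1)}

    def similarity(a: str, b: str) -> float:
        """Jaccard overlap of character trigrams as a crude morphological similarity."""
        ga, gb = ngrams(a), ngrams(b)
        return len(ga & gb) / len(ga | gb)

    # Kernel (similarity) matrix over the vocabulary
    K = np.array([[similarity(a, b) for b in words] for a in words])

    # Kernel PCA: double-center K, then eigendecompose
    n = len(words)
    ones = np.full((n, n), 1.0 / n)
    Kc = K - ones @ K - K @ ones + ones @ K @ ones
    eigvals, eigvecs = np.linalg.eigh(Kc)

    # Take the top-2 components (eigh returns ascending order) as word features
    top = eigvecs[:, -2:] * np.sqrt(np.clip(eigvals[-2:], 0, None))
    features = {w: top[i] for i, w in enumerate(words)}
    # Morphologically related words ("play", "playing", "played") receive
    # similar features; such vectors could warm-start an embedding model.
    ```

    In the approach the abstract describes, features of this kind are injected into the embedding model before training, so the network starts from morphological structure and only has to learn the semantic and syntactic part.
    
    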
  • Publication
    Robust End-User-Driven Social Media Monitoring for Law Enforcement and Emergency Monitoring
    Nowadays, social media mining is broadly used in the security sector to support law enforcement and to reduce response time in emergency situations. One approach to going beyond manual inspection is to use text mining technologies to extract latent topics, analyze their geospatial distribution, and identify the sentiment of posts. Although widely used, this approach has proven technically difficult for end users: the language used on social media platforms changes rapidly, and the domain varies with the use case. This paper presents a monitoring architecture that analyzes streams from social media, combines different machine learning approaches, and can easily be adapted and enriched with user knowledge without the need for complex tuning. The framework is modeled on the requirements of two H2020 projects in the areas of community policing and emergency response.
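    An architecture of the kind this abstract outlines - pluggable analyzers over a post stream, adaptable through end-user knowledge rather than model retraining - might be sketched like this. All class names, keyword lists, and the lexicon are invented for illustration and are not the paper's components.

    ```python
    # Hypothetical sketch of a modular monitoring pipeline: each analyzer
    # annotates a post dict in place; end users adapt the system by editing
    # keyword lists instead of retuning models.

    class KeywordTopicAnalyzer:
        """Assigns coarse topics from an end-user-editable keyword list."""
        def __init__(self, topics):
            self.topics = topics  # e.g. {"flood": ["flood", "water level"]}

        def __call__(self, post):
            text = post["text"].lower()
            post["topics"] = [t for t, kws in self.topics.items()
                              if any(k in text for k in kws)]
            return post

    class LexiconSentimentAnalyzer:
        """Crude lexicon-based sentiment stand-in for a trained model."""
        POS, NEG = {"safe", "thanks", "helped"}, {"danger", "trapped", "fire"}

        def __call__(self, post):
            words = set(post["text"].lower().split())
            post["sentiment"] = len(words & self.POS) - len(words & self.NEG)
            return post

    def pipeline(posts, analyzers):
        """Run each post through all analyzers; new analyzers plug in freely."""
        for post in posts:
            for analyze in analyzers:
                post = analyze(post)
            yield post

    stream = [{"text": "Water level rising, people trapped near the bridge"}]
    analyzers = [KeywordTopicAnalyzer({"flood": ["flood", "water level"]}),
                 LexiconSentimentAnalyzer()]
    for post in pipeline(stream, analyzers):
        print(post["topics"], post["sentiment"])  # prints ['flood'] -1
    ```

    The design choice worth noting is that adaptation happens in the data each analyzer consumes (keyword lists, lexicons), which is what makes the system usable by non-technical end users.
    
    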
  • Publication
    Making Efficient Use of a Domain Expert's Time in Relation Extraction
    Scarcity of labeled data is one of the most frequent problems faced in machine learning. This is particularly true for relation extraction in text mining, where large corpora of texts exist in many application domains, while labeling text data requires an expert to invest much time reading the documents. Overall, state-of-the-art models, like the convolutional neural network used in this paper, achieve great results when trained on large enough amounts of labeled data. However, from a practical point of view, the question arises whether this is the most efficient approach once the manual effort of the expert is taken into account. In this paper, we report on an alternative approach in which we first construct a relation extraction model using distant supervision, and only later make use of a domain expert to refine the results. Distant supervision provides a means of labeling data given known relations in a knowledge base, but it suffers from noisy labels. We introduce an active-learning-based extension that allows our neural network to incorporate expert feedback, and report first results on a complex dataset.
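    The distant supervision step this abstract relies on can be sketched in a few lines: any sentence mentioning both entities of a known knowledge-base relation is (noisily) labeled with that relation. The knowledge-base entries and sentences below are invented examples; the paper's actual data and CNN model are not reproduced.

    ```python
    # Minimal sketch of distant supervision for relation extraction.
    # Entities, relations, and sentences are invented for illustration.

    knowledge_base = {
        ("Siemens", "Munich"): "headquartered_in",
        ("BMW", "Munich"): "headquartered_in",
    }

    sentences = [
        "Siemens announced record results at its Munich headquarters.",
        "BMW unveiled a new model in Paris.",
        "The Munich offices of BMW were expanded last year.",
    ]

    def distant_labels(sentences, kb):
        """Return (sentence, entity_pair, relation) triples via distant supervision."""
        labeled = []
        for sent in sentences:
            for (e1, e2), rel in kb.items():
                if e1 in sent and e2 in sent:
                    labeled.append((sent, (e1, e2), rel))
        return labeled

    train = distant_labels(sentences, knowledge_base)
    for sent, pair, rel in train:
        print(pair, rel)
    # Note the noise: the third sentence mentions BMW and Munich without
    # asserting the relation -- exactly the kind of wrong label that the
    # domain expert later corrects via active learning.
    ```

    An active-learning extension would then rank these noisy examples by model uncertainty and show only the most informative ones to the expert, which is how the approach economizes on the expert's time.
    
    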