  • Publication
    Anonymization of German financial documents using neural network-based language models with contextual word representations
    The automation and digitalization of business processes have increased the need for efficient information extraction from business documents. However, financial and legal documents are often not utilized effectively by text processing or machine learning systems, partly due to the presence of sensitive information in these documents, which restricts their usage beyond authorized parties and purposes. To overcome this limitation, we develop an anonymization method for German financial and legal documents using state-of-the-art natural language processing methods based on recurrent neural nets and transformer architectures. We present a web-based application to anonymize financial documents and a large-scale evaluation of different deep learning techniques.
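As a rough illustration of the anonymization step described in this abstract, the sketch below replaces tagged entity spans with placeholders. It assumes a tagger has already produced character spans; the `anonymize` helper, the `PER`/`ORG` labels, and the example sentence are hypothetical and not taken from the publication.

```python
from typing import List, Tuple

# (start, end, label) with an exclusive end offset; labels are illustrative.
Span = Tuple[int, int, str]


def anonymize(text: str, spans: List[Span]) -> str:
    """Replace each tagged span with a placeholder such as [PER]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip a span that overlaps one already replaced
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def find_span(text: str, surface: str, label: str) -> Span:
    """Stand-in for a neural tagger: locate a known surface form."""
    start = text.index(surface)
    return (start, start + len(surface), label)


example = "Herr Max Mustermann überwies 500 EUR an die Beispiel GmbH."
spans = [
    find_span(example, "Max Mustermann", "PER"),
    find_span(example, "Beispiel GmbH", "ORG"),
]
anonymized = anonymize(example, spans)
```

In a real system the spans would come from the sequence-tagging model rather than from `find_span`, which only stands in for it here.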
  • Publication
    ALiBERT: Improved automated list inspection (ALI) with BERT
    (2021-08-16) Stenzel, Marc Robin; Khameneh, Tim Dilmaghani; Warning, Ulrich; Kliem, Bernd; Loitz, Rüdiger
    We consider Automated List Inspection (ALI), a content-based text recommendation system that assists auditors in matching relevant text passages from notes in financial statements to specific law regulations. ALI follows a ranking paradigm in which a fixed number of requirements per textual passage are shown to the user. Despite achieving impressive ranking performance, the user experience can still be improved by showing a dynamic number of recommendations. Besides, existing models rely on a feature-based language model that needs to be pre-trained on a large corpus of domain-specific datasets. Moreover, they cannot be trained in an end-to-end fashion by jointly optimizing with language model parameters. In this work, we alleviate these concerns by considering a multi-label classification approach that predicts dynamic requirement sequences. We base our model on pre-trained BERT that allows us to fine-tune the whole model in an end-to-end fashion, thereby avoiding the need for training a language representation model. We conclude by presenting a detailed evaluation of the proposed model on two German financial datasets.
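The difference between a fixed-size ranking and a dynamic number of recommendations can be sketched as follows. This is a minimal illustration, assuming per-requirement scores in [0, 1] and a simple cutoff; the requirement names, the scores, and the 0.5 threshold are hypothetical, and the abstract does not state the decision rule actually used.

```python
from typing import Dict, List


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Ranking paradigm: always return the k best-scoring requirements."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def above_threshold(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label paradigm: return every requirement scoring at or above the cutoff."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [label for label in ranked if scores[label] >= threshold]


# Hypothetical scores for one text passage.
scores = {"req_A": 0.91, "req_B": 0.62, "req_C": 0.08, "req_D": 0.03}

fixed = top_k(scores, k=3)         # always three items, even if one is weak
dynamic = above_threshold(scores)  # only the requirements that clear the cutoff
```

With the fixed-size list the low-scoring `req_C` is still shown to the user, whereas the thresholded list adapts its length to the passage.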
  • Publication
    Automatic Indexing of Financial Documents via Information Extraction
    (2021) Bell, Thiago; Gebauer, Michael; Ulusay, Bilge; Uedelhoven, Daniel; Dilmaghani, Tim; Loitz, Rüdiger
    The problem of extracting information from large volumes of unstructured documents is pervasive in the financial domain. Enterprises and investors need automatic methods that can extract information from these documents, particularly for indexing and efficiently retrieving information. To this end, we present a scalable end-to-end document processing system for indexing and information retrieval from large volumes of financial documents. While we show that our system works for the use case of financial document processing, the system itself is agnostic to the document type and the machine learning model type. Thus, it can be applied to any large-scale document processing task involving domain-specific extractors.
  • Publication
    Utilizing Representation Learning for Robust Text Classification Under Datasetshift
    Within One-vs-Rest (OVR) classification, a classifier differentiates a single class of interest (COI) from the rest, i.e., any other class. By extending the scope of the rest class to corruptions (dataset shift), aspects of outlier detection gain relevance. In this work, we show that adversarially trained autoencoders (ATA), representative of autoencoder-based outlier detection methods, yield tremendous robustness improvements over traditional neural network methods such as multi-layer perceptrons (MLP) and common ensemble methods, while maintaining competitive classification performance. In contrast, our results also reveal that deep learning methods optimized solely for classification tend to fail completely when exposed to dataset shift.
  • Publication
    Tackling Contradiction Detection in German Using Machine Translation and End-to-End Recurrent Neural Networks
    Natural Language Inference, and specifically Contradiction Detection, is still an unexplored topic with respect to German text. In this paper, we apply Recurrent Neural Network (RNN) methods to learn contradiction-specific sentence embeddings. Our data set for evaluation is a machine-translated version of the Stanford Natural Language Inference (SNLI) corpus. The results are compared to a baseline using unsupervised vectorization techniques, namely tf-idf and Flair, as well as state-of-the-art transformer-based (MBERT) methods. We find that the end-to-end models outperform the models trained on unsupervised embeddings, which makes them the better choice in an empirical use case. The RNN methods also outperform MBERT on the translated data set.
  • Publication
    Toxicity Detection in Online Comments with Limited Data: A Comparative Analysis
    We present a comparative study on toxicity detection, focusing on the problem of identifying toxicity types of low prevalence and possibly even unobserved at training time. For this purpose, we train our models on a dataset that contains only a weak type of toxicity, and test whether they are able to generalize to more severe toxicity types. We find that representation learning and ensembling exceed the classification performance of simple classifiers on toxicity detection, while also providing significantly better generalization and robustness. All models benefit from a larger training set size, which even extends to the toxicity types unseen during training.
  • Publication
    A Community Detection Based Approach for Exploring Patterns in Player Reviews
    Optimizing player retention and engagement by providing game content tailored to the audience remains a challenging task for game developers. Tracking and analyzing player engagement data, both in-game behavioral data and out-of-game data such as online text reviews or social media postings, is crucial for identifying user concerns and capturing user preferences. In particular, studying and understanding user reviews has become an integral component of any game development process and is an active research area. In this paper, we are interested in extracting latent and influential topics by analyzing text reviews on a popular game community website. To this end, we present an exploratory analysis that applies a hierarchical community detection-based hybrid algorithm to extract topics from a given corpus of game reviews. Our analysis reveals interesting topics and sub-topics which can be used for further downstream analysis.
  • Publication
    Fraunhofer IAIS at FinCausal 2020, Tasks 1 & 2: Using Ensemble Methods and Sequence Tagging to Detect Causality in Financial Documents
    The FinCausal 2020 shared task aims to detect causality in financial news and to identify those parts of the causal sentences related to the underlying cause and effect. We apply ensemble-based and sequence tagging methods for identifying causality and extracting causal subsequences. Our models yield promising results on both sub-tasks, with the prospect of further improvement given more time and computing resources. With respect to task 1, we achieved an F1 score of 0.9429 on the evaluation data, and a corresponding ranking of 12/14. For task 2, we were ranked 6/10, with an F1 score of 0.76 and an ExactMatch score of 0.1912.
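For readers unfamiliar with the two metrics reported here, the sketch below shows how a binary F1 score and an exact-match rate are commonly computed. It is an illustration only: the shared task's official scorer and averaging scheme are not described in this abstract, and the toy labels and spans are invented.

```python
from typing import List, Sequence


def f1_score(gold: Sequence[int], pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # also avoids division by zero when nothing is predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def exact_match(gold: List[str], pred: List[str]) -> float:
    """Share of predictions that are identical to the gold annotation."""
    matches = sum(1 for g, p in zip(gold, pred) if g == p)
    return matches / len(gold)


# Toy data: 1 marks a causal sentence, 0 a non-causal one.
gold_labels = [1, 0, 1, 1, 0]
pred_labels = [1, 0, 0, 1, 1]
f1 = f1_score(gold_labels, pred_labels)

gold_spans = ["profits fell", "rates rose"]
pred_spans = ["profits fell", "interest rates rose"]
em = exact_match(gold_spans, pred_spans)
```

Exact match is much stricter than F1 because a predicted span that is off by a single token counts as a miss, which is one plausible reason an exact-match score can sit far below the F1 score on the same task.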
  • Publication
    Leveraging Contextual Text Representations for Anonymizing German Financial Documents
    (2020) Fürst, Benedikt; Ismail, H.; Stenzel, Robin; Khameneh, Tim Dilmaghani; Krapp, V.; Huseynov, I.; Schlums, J.; Stoll, U.; Warning, U.; Kliem, B.
    Despite the high availability of financial and legal documents, they are often not utilized by text processing or machine learning systems, even though the need for automated processing and extraction of useful patterns from these documents is increasing. This is partly due to the presence of sensitive entities in these documents, which restricts their usage beyond authorized parties and purposes. To overcome this limitation, we consider the task of anonymization in financial and legal documents using state-of-the-art natural language processing methods. To this end, we present a web-based application to anonymize financial documents and a large-scale evaluation of different deep learning techniques.
  • Publication
    Towards Automated Auditing with Machine Learning
    (2019) Stenzel, Robin; Bell, Thiago; Warning, U.; Fürst, Benedikt; Khameneh, Tim Dilmaghani; Thom, D.; Huseynov, I.; Kahlert, R.; Schlums, J.; Ismail, H.; Kliem, B.; Loitz, Rüdiger
    We present the Automated List Inspection (ALI) tool, which combines methods from machine learning and natural language processing with domain expert knowledge to automate financial statement auditing. ALI is a content-based, context-aware recommender system that matches relevant text passages from the notes to the financial statement to specific law regulations. In this paper, we present the architecture of the recommender tool, which includes text mining, language modeling, and unsupervised and supervised methods that range from binary classification models to deep recurrent neural networks. Alongside our main findings, we present quantitative and qualitative comparisons of the algorithms as well as concepts for further extending the functionality of the tool.