  • Publication
    Anonymization of German financial documents using neural network-based language models with contextual word representations
    The automation and digitalization of business processes have led to a growing need for efficient information extraction from business documents. However, financial and legal documents are often not utilized effectively by text processing or machine learning systems, partly due to the presence of sensitive information in these documents, which restricts their usage beyond authorized parties and purposes. To overcome this limitation, we develop an anonymization method for German financial and legal documents using state-of-the-art natural language processing methods based on recurrent neural networks and transformer architectures. We present a web-based application to anonymize financial documents and a large-scale evaluation of different deep learning techniques.
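    The abstract describes anonymization built on contextual named-entity recognition. As an illustration only, the sketch below redacts entities flagged by a generic Hugging Face token-classification pipeline; the model name is a placeholder, not the model developed in the paper.

```python
# Illustrative sketch: entity-based anonymization with a token-classification model.
# "some-german-ner-model" is a placeholder, not the model developed in the paper.
from transformers import pipeline

ner = pipeline("token-classification",
               model="some-german-ner-model",   # placeholder model name
               aggregation_strategy="simple")

def anonymize(text: str) -> str:
    """Replace every detected entity span with its entity-type tag."""
    entities = ner(text)
    # Replace from the end of the string so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(anonymize("Die Muster GmbH überwies 50.000 EUR an Max Mustermann."))
```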
  • Publication
    Decoupling Autoencoders for Robust One-vs-Rest Classification
    One-vs-Rest (OVR) classification aims to distinguish a single class of interest from other classes. The concepts of novelty detection and robustness to dataset shift become crucial in OVR when the scope of the rest class extends from the classes observed during training to unseen and possibly unrelated classes. In this work, we propose a novel architecture, namely the Decoupling Autoencoder (DAE), to tackle the lack of robustness to out-of-distribution samples that is prevalent in classifiers such as multi-layer perceptrons (MLP) and ensemble architectures. Experiments on plain classification, outlier detection, and dataset shift tasks show that DAE achieves robust performance across these tasks compared to the baselines, which tend to fail completely when exposed to dataset shift. While DAE and the baselines yield rather uncalibrated predictions on the outlier detection and dataset shift tasks, we found that DAE calibration is more stable across all tasks. Therefore, calibration measures applied to the classification task could also improve the calibration of the outlier detection and dataset shift scenarios for DAE.
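    The DAE architecture itself is not detailed in the abstract. For orientation, the sketch below shows the reconstruction-error principle that autoencoder-based OVR methods build on: an autoencoder trained only on the class of interest assigns high reconstruction error to "rest" samples. Layer sizes, names, and the threshold are illustrative, not the DAE from the paper.

```python
# Illustrative sketch: reconstruction-error-based one-vs-rest scoring.
# Architecture, sizes, and the threshold are placeholders, not the DAE from the paper.
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def ovr_score(model: TinyAE, x: torch.Tensor) -> torch.Tensor:
    """Per-sample reconstruction error; high values suggest a 'rest' sample."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)

# Train the autoencoder on class-of-interest samples only, then threshold the score:
# is_rest = ovr_score(model, x_test) > tau
```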
  • Publication
    Utilizing Representation Learning for Robust Text Classification Under Dataset Shift
    Within One-vs-Rest (OVR) classification, a classifier differentiates a single class of interest (COI) from the rest, i.e., any other class. By extending the scope of the rest class to corruptions (dataset shift), aspects of outlier detection gain relevance. In this work, we show that adversarially trained autoencoders (ATA), representative of autoencoder-based outlier detection methods, yield substantial robustness improvements over traditional neural network methods such as multi-layer perceptrons (MLP) and common ensemble methods, while maintaining competitive classification performance. In contrast, our results also reveal that deep learning methods optimized solely for classification tend to fail completely when exposed to dataset shift.
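    As a toy illustration of the dataset-shift setting described above, the sketch below corrupts test texts at the character level and compares accuracy on clean versus corrupted inputs. The TF-IDF plus logistic regression classifier and the tiny dataset are stand-ins, not models or data from the paper.

```python
# Illustrative sketch: accuracy on clean vs. character-corrupted inputs as a crude
# proxy for dataset shift. Classifier and data are stand-ins, not from the paper.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def corrupt(text: str, p: float = 0.1) -> str:
    """Randomly drop characters with probability p to simulate noisy inputs."""
    return "".join(c for c in text if random.random() > p)

texts = ["good service", "terrible product", "great value", "awful support"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                    LogisticRegression())
clf.fit(texts, labels)

shifted = [corrupt(t) for t in texts]
print("clean accuracy:  ", accuracy_score(labels, clf.predict(texts)))
print("shifted accuracy:", accuracy_score(labels, clf.predict(shifted)))
```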
  • Publication
    Toxicity Detection in Online Comments with Limited Data: A Comparative Analysis
    We present a comparative study on toxicity detection, focusing on the problem of identifying toxicity types of low prevalence and possibly even unobserved at training time. For this purpose, we train our models on a dataset that contains only a weak type of toxicity, and test whether they are able to generalize to more severe toxicity types. We find that representation learning and ensembling exceed the classification performance of simple classifiers on toxicity detection, while also providing significantly better generalization and robustness. All models benefit from a larger training set size, which even extends to the toxicity types unseen during training.
  • Publication
    Leveraging Contextual Text Representations for Anonymizing German Financial Documents
    (2020) Fürst, Benedikt; Ismail, H.; Stenzel, Robin; Khameneh, Tim Dilmaghani; Krapp, V.; Huseynov, I.; Schlums, J.; Stoll, U.; Warning, U.; Kliem, B.
    Despite the high availability of financial and legal documents, they are often not utilized by text processing or machine learning systems, even though the need for automated processing and extraction of useful patterns from these documents is increasing. This is partly due to the presence of sensitive entities in these documents, which restricts their usage beyond authorized parties and purposes. To overcome this limitation, we consider the task of anonymization in financial and legal documents using state-of-the-art natural language processing methods. Towards this, we present a web-based application to anonymize financial documents and a large-scale evaluation of different deep learning techniques.
  • Publication
    Guided Reinforcement Learning via Sequence Learning
    Applications of Reinforcement Learning (RL) suffer from high sample complexity due to sparse reward signals and inadequate exploration. Novelty Search (NS) can serve as an auxiliary task in this regard, encouraging exploration towards unseen behaviors. However, NS suffers from critical drawbacks concerning scalability and generalizability, since it is based on instance learning. Addressing these challenges, we previously proposed a generic approach that uses unsupervised learning to learn representations of agent behaviors and uses reconstruction losses as novelty scores. However, it considered only fixed-length sequences and did not exploit the sequential information of behaviors. We therefore extend this approach with sequential autoencoders to incorporate sequential dependencies. Experimental results on benchmark tasks show that this sequence learning aids exploration, outperforming previous novelty search methods.
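    A minimal sketch of the reconstruction-error-as-novelty idea with a sequence autoencoder, assuming agent behaviors are represented as state sequences; the LSTM architecture, sizes, and names are illustrative rather than the exact model from the paper.

```python
# Illustrative sketch: reconstruction error of a sequence autoencoder as a novelty
# score over agent behaviors (state sequences). Sizes and layers are placeholders.
import torch
import torch.nn as nn

class SeqAE(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.LSTM(state_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, state_dim)

    def forward(self, seq):                              # seq: (batch, time, state_dim)
        _, (h, _) = self.encoder(seq)                     # summarize the behavior
        z = h[-1].unsqueeze(1).repeat(1, seq.size(1), 1)  # repeat code per time step
        dec, _ = self.decoder(z)
        return self.out(dec)

def novelty_score(model: SeqAE, seq: torch.Tensor) -> torch.Tensor:
    """High reconstruction error = behavior unlike those seen so far = more novel."""
    with torch.no_grad():
        return ((model(seq) - seq) ** 2).mean(dim=(1, 2))

# The score can be added to the sparse environment reward as an exploration bonus.
```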
  • Publication
    From Imbalanced Classification to Supervised Outlier Detection Problems: Adversarially Trained Auto Encoders
    Imbalanced datasets pose severe challenges for training well-performing classifiers. This problem is also prevalent in the domain of outlier detection, since outliers occur infrequently and are generally treated as minorities. One simple yet powerful approach is to train autoencoders on majority samples and then classify samples based on the reconstruction loss. However, this approach fails whenever the reconstruction errors of minorities overlap with those of majorities. To overcome this limitation, we propose an adversarial loss function that maximizes the loss of minorities while minimizing the loss for majorities. This way, we obtain well-separated reconstruction error distributions that facilitate classification. We show that this approach is robust in a wide variety of settings, such as imbalanced data classification or outlier and novelty detection.
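    The abstract states the core idea directly: minimize reconstruction error for majority samples while maximizing it for minority samples. The sketch below shows one straightforward way such a loss can be written; the weighting and exact formulation in the paper may differ.

```python
# Illustrative sketch of the stated idea: minimize reconstruction error for the
# majority class while maximizing it for the minority class. The exact weighting
# and formulation in the paper may differ.
import torch

def adversarial_recon_loss(x: torch.Tensor,
                           x_hat: torch.Tensor,
                           is_minority: torch.Tensor,
                           alpha: float = 1.0) -> torch.Tensor:
    """x, x_hat: (batch, dim); is_minority: (batch,) boolean mask.
    Assumes each batch contains samples from both classes."""
    per_sample = ((x_hat - x) ** 2).mean(dim=1)
    majority_loss = per_sample[~is_minority].mean()
    minority_loss = per_sample[is_minority].mean()
    # Pull majority errors down while pushing minority errors up.
    return majority_loss - alpha * minority_loss
```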
  • Publication
    Towards Automated Auditing with Machine Learning
    (2019) Stenzel, Robin; Bell, Thiago; Warning, U.; Fürst, Benedikt; Khameneh, Tim Dilmaghani; Thom, D.; Huseynov, I.; Kahlert, R.; Schlums, J.; Ismail, H.; Kliem, B.; Loitz, Rüdiger
    We present the Automated List Inspection (ALI) tool, which utilizes methods from machine learning and natural language processing, combined with domain expert knowledge, to automate financial statement auditing. ALI is a content-based, context-aware recommender system that matches relevant text passages from the notes to the financial statement to specific law regulations. In this paper, we present the architecture of the recommender tool, which includes text mining, language modeling, and unsupervised and supervised methods that range from binary classification models to deep recurrent neural networks. Alongside our main findings, we present quantitative and qualitative comparisons of the algorithms as well as concepts for how to further extend the functionality of the tool.
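    As an illustration of the content-based matching step only, the sketch below ranks regulation descriptions against a notes passage by TF-IDF cosine similarity. The texts are invented placeholders, and the real ALI system uses language models and supervised classifiers rather than this bare-bones retrieval.

```python
# Illustrative sketch: rank candidate regulations for a notes passage by TF-IDF
# cosine similarity. Texts are placeholders; not the models or data used by ALI.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

regulations = [
    "Disclosure of accounting policies for fixed assets",
    "Reporting requirements for contingent liabilities",
    "Valuation rules for inventories",
]
passage = ["Inventories are measured at the lower of cost and net realizable value."]

vectorizer = TfidfVectorizer().fit(regulations + passage)
scores = cosine_similarity(vectorizer.transform(passage),
                           vectorizer.transform(regulations))[0]

# Print regulations from best to worst match for the given passage.
for reg, score in sorted(zip(regulations, scores), key=lambda t: -t[1]):
    print(f"{score:.2f}  {reg}")
```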