Now showing 1 - 10 of 31
  • Publication
    Anonymization of German financial documents using neural network-based language models with contextual word representations
    The automatization and digitalization of business processes have led to an increase in the need for efficient information extraction from business documents. However, financial and legal documents are often not utilized effectively by text processing or machine learning systems, partly due to the presence of sensitive information in these documents, which restrict their usage beyond authorized parties and purposes. To overcome this limitation, we develop an anonymization method for German financial and legal documents using state-of-the-art natural language processing methods based on recurrent neural nets and transformer architectures. We present a web-based application to anonymize financial documents and a large-scale evaluation of different deep learning techniques.
  • Publication
    Utilizing Representation Learning for Robust Text Classification Under Datasetshift
    Within One-vs-Rest (OVR) classification, a classifier differentiates a single class of interest (COI) from the rest, i.e. any other class. By extending the scope of the rest class to corruptions (dataset shift), aspects of outlier detection gain relevancy. In this work, we show that adversarially trained autoencoders (ATA) representative of autoencoder-based outlier detection methods, yield tremendous robustness improvements over traditional neural network methods such as multi-layer perceptrons (MLP) and common ensemble methods, while maintaining a competitive classification performance. In contrast, our results also reveal that deep learning methods solely optimized for classification, tend to fail completely when exposed to dataset shift.
  • Publication
    Combining Machine Learning and Simulation to a Hybrid Modelling Approach: Current and Future Directions
    In this paper, we describe the combination of machine learning and simulation towards a hybrid modelling approach. Such a combination of data-based and knowledge-based modelling is motivated by applications that are partly based on causal relationships, while other effects result from hidden dependencies that are represented in huge amounts of data. Our aim is to bridge the knowledge gap between the two individual communities from machine learning and simulation to promote the development of hybrid systems. We present a conceptual framework that helps to identify potential combined approaches and employ it to give a structured overview of different types of combinations using exemplary approaches of simulation-assisted machine learning and machine-learning assisted simulation. We also discuss an advanced pairing in the context of Industry 4.0 where we see particular further potential for hybrid systems. In this paper, we describe the combination of machine learning and simulation towards a hybrid modelling approach. Such a combination of data-based and knowledge-based modelling is motivated by applications that are partly based on causal relationships, while other effects result from hidden dependencies that are represented in huge amounts of data. Our aim is to bridge the knowledge gap between the two individual communities from machine learning and simulation to promote the development of hybrid systems. We present a conceptual framework that helps to identify potential combined approaches and employ it to give a structured overview of different types of combinations using exemplary approaches of simulation-assisted machine learning and machine-learning assisted simulation. We also discuss an advanced pairing in the context of Industry 4.0 where we see particular further potential for hybrid systems.
  • Publication
    Towards Map-Based Validation of Semantic Segmentation Masks
    ( 2020) ; ;
    Hueger, Fabian
    ;
    Schneider, Jan David
    ;
    Artificial intelligence for autonomous driving must meet strict requirements on safety and robustness. We propose to validate machine learning models for self-driving vehicles not only with given ground truth labels, but also with additional a-priori knowledge. In particular, we suggest to validate the drivable area in semantic segmentation masks using given street map data. We present first results, which indicate that prediction errors can be uncovered by map-based validation.
  • Publication
    Two Attempts to Predict Author Gender in Cross-Genre Settings in Dutch
    ( 2019)
    Brito, Eduardo
    ;
    ;
    This paper describes the systems designed by the FraunhoferIAIS team at the CLIN29 shared task on cross-genre gender detection in Dutch. We show two alternative classification approaches: a rather standard one consisting of feature engineering and a random forest classifier; and an alternative one involving a LSTM classifier. Both are enhanced by a LDA model trained on stems. We considered various features such as frequency of function words, parts-of-speech and sentiment among others. We achieved 53.77% average accuracy in the cross-genre settings.
  • Publication
    Triple Classification Using Regions and Fine-Grained Entity Typing
    ( 2019)
    Dong, Tiansi
    ;
    Wang, Zhigang
    ;
    Li, Juanzi
    ;
    ;
    Cremers, Armin B.
    A Triple in knowledge-graph takes a form that consists of head, relation, tail. Triple Classification is used to determine the truth value of an unknown Triple. This is a hard task for 1-to-N relations using the vector-based embedding approach. We propose a new region-based embedding approach using fine-grained type chains. A novel geometric process is presented to extend the vectors of pre-trained entities into n-balls (n-dimensional balls) under the condition that head balls shall contain their tail balls. Our algorithm achieves zero energy loss, therefore, serves as a case study of perfectly imposing tree structures into vector space. An unknown Triple (h, r, x) will be predicted as true, when x's n-ball is located in the r-subspace of h's n-ball, following the same construction of know n tails of h. The experiments are based on large datasets derived from the benchmark datasets WN11, FB13, and WN18. Our results show that the performance of the new method is related to the length of the type chain and the quality of pre-trained entity-embeddings, and that performances of long chains with well-trained entity-embeddings outperform other methods in the literature.
  • Publication
    Towards Shortest Paths via Adiabatic Quantum Computing
    Since first working quantum computers are now available, accelerated developments of this technology may be expected. This will likely impact graph- or network analysis because quantum computers promise fast solutions for many problems in these areas. In this paper, we explore the use of adiabatic quantum computing in finding shortest paths. We devise an Ising energy minimization formulation for this task and discuss how to set up a system of quantum bits to find minimum energy states of the model. In simulation experiments, we numerically solve the corresponding Schrödinger equations and observe our approach to work well. This evidences that shortest path computation can at least be assisted by quantum computers.
  • Publication
    A Comparison of Methods for Player Clustering via Behavioral Telemetry
    The analysis of user behavior in digital games has been aided by the introduction of user telemetry in game development, which provides unprecedented access to quantitative data on user behavior from the installed game clients of the entire population of players. Player behavior telemetry datasets can be exceptionally complex, with features recorded for a varying population of users over a temporal segment that can reach years in duration. Categorization of behaviors, whether through descriptive methods (e.g. segmention) or unsupervised/supervised learning techniques, is valuable for finding patterns in the behavioral data, and developing profiles that are actionable to game developers. There are numerous methods for unsupervised clustering of user behavior, e.g. k-means/c-means, Non-negative Matrix Factorization, or Principal Component Analysis. Although all yield behavior categorizations, interpretation of the resulting categories in terms of actual play behavior can be difficult if not impossible. In this paper, a range of unsupervised techniques are applied together with Archetypal Analysis to develop behavioral clusters from playtime data of 70,014 World of Warcraft players, covering a five year interval. The techniques are evaluated with respect to their ability to develop actionable behavioral profiles from the dataset.
  • Publication
    Integrating lateral swaying of pedestrians into simulations
    Traditionally, pedestrian simulations are a standard tool in public space design, crowd management, and evacuation management. In particular, when minimizing evacuation times or identifiying hazardous locations, it is of vital importance that simulations are as accurate and realistic as possible. Although today's pedestrian simulation models give satisfying results in many cases, they are not realistic in highly crowded scenes. In this paper, we describe a characteristic motion pattern that is commonly observed in areas of high pedestrian density and that has not been taken into account in state-of-the-art pedestrian models. Hence, we extend an existing pedestrian model by integrating this characteristic motion pattern and show that our proposed model gives more realistic trajectories.
  • Publication
    Early drought stress detection in cereals: Simplex volume maximization for hyperspectral image analysis
    ( 2012)
    Römer, Christoph
    ;
    ;
    Ballvora, Agim
    ;
    Pinto, Francisco
    ;
    Rossini, Micol
    ;
    Cinzia, Panigada
    ;
    Behmann, Jan
    ;
    Léon, Jens
    ;
    ; ; ;
    Rascher, Uwe
    ;
    Plümer, Lutz
    Early water stress recognition is of great relevance in precision plant breeding and production. Hyperspectral imaging sensors can be a valuable tool for early stress detection with high spatio-temporal resolution. They gather large, high dimensional data cubes posing a significant challenge to data analysis. Classical supervised learning algorithms often fail in applied plant sciences due to their need of labelled datasets, which are difficult to obtain. Therefore, new approaches for unsupervised learning of relevant patterns are needed. We apply for the first time a recent matrix factorisation technique, simplex volume maximisation (SiVM), to hyperspectral data. It is an unsupervised classification approach, optimised for fast computation of massive datasets. It allows calculation of how similar each spectrum is to observed typical spectra. This provides the means to express how likely it is that one plant is suffering from stress. The method was tested for drought stress, applied to potted barley plants in a controlled rain-out shelter experiment and to agricultural corn plots subjected to a two factorial field setup altering water and nutrient availability. Both experiments were conducted on the canopy level. SiVM was significantly better than using a combination of established vegetation indices. In the corn plots, SiVM clearly separated the different treatments, even though the effects on leaf and canopy traits were subtle.