Now showing 1 - 9 of 9
  • Publication
    Learning Weakly Convex Sets in Metric Spaces
    ( 2021-09-10)
    Stadtländer, Eike
    ;
    ;
    We introduce the notion of weak convexity in metric spaces, a generalization of ordinary convexity commonly used in machine learning. It is shown that weakly convex sets can be characterized by a closure operator and have a unique decomposition into a set of pairwise disjoint connected blocks. We give two generic efficient algorithms, an extensional and an intensional one for learning weakly convex concepts and study their formal properties. Our experimental results concerning vertex classification clearly demonstrate the excellent predictive performance of the extensional algorithm. Two non-trivial applications of the intensional algorithm to polynomial PAC-learnability are presented. The first one deals with learning k-convex Boolean functions, which are already known to be efficiently PAC-learnable. It is shown how to derive this positive result in a fairly easy way by the generic intensional algorithm. The second one is concerned with the Euclidean space equipped with the Manhattan distance. For this metric space, weakly convex sets form a union of pairwise disjoint axis-aligned hyperrectangles. We show that a weakly convex set that is consistent with a set of examples and contains a minimum number of hyperrectangles can be found in polynomial time. In contrast, this problem is known to be NP-complete if the hyperrectangles may be overlapping.
  • Publication
    A Novel Regression Loss for Non-Parametric Uncertainty Optimization
    ( 2021)
    Sicking, Joachim
    ;
    ;
    Pintz, Maximilian
    ;
    ;
    Fischer, Asja
    ;
    Quantification of uncertainty is one of the most promising approaches to establish safe machine learning. Despite its importance, it is far from being generally solved, especially for neural networks. One of the most commonly used approaches so far is Monte Carlo dropout, which is computationally cheap and easy to apply in practice. However, it can underestimate the uncertainty. We propose a new objective, referred to as second-moment loss (SML), to address this issue. While the full network is encouraged to model the mean, the dropout networks are explicitly used to optimize the model variance. We intensively study the performance of the new objective on various UCI regression datasets. Comparing to the state-of-the-art of deep ensembles, SML leads to comparable prediction accuracies and uncertainty estimates while only requiring a single model. Under distribution shift, we observe moderate improvements. As a side result, we introduce an intuitive Wasserstein distance-based uncertainty measure that is non-saturating and thus allows to resolve quality differences between any two uncertainty estimates.
  • Publication
    Mining Tree Patterns with Partially Injective Homomorphisms
    ( 2019)
    Schulz, Till Hendrik
    ;
    ;
    Welke, Pascal
    ;
    One of the main differences between inductive logic programming (ILP) and graph mining lies in the pattern matching operator applied: While it is mainly defined by relational homomorphism (i.e., subsumption) in ILP, subgraph isomorphism is the most common pattern matching operator in graph mining. Using the fact that subgraph isomorphisms are injective homomorphisms, we bridge the gap between ILP and graph mining by considering a natural transition from homomorphisms to subgraph isomorphisms that is defined by partially injective homomorphisms, i.e., which require injectivity only for subsets of the vertex pairs in the pattern. Utilizing positive complexity results on deciding homomorphisms from bounded tree-width graphs, we present an algorithm mining frequent trees from arbitrary graphs w.r.t. partially injective homomorphisms. Our experimental results show that the predictive performance of the patterns obtained is comparable to that of ordinary frequent subgraphs. Thus, by preserving much from the advantageous properties of homomorphisms and subgraph isomorphisms, our approach provides a trade-off between efficiency and predictive power.
  • Publication
    Support Estimation in Frequent Itemset Mining by Locality Sensitive Hashing
    The main computational effort in generating all frequent itemsets in a transactional database is in the step of deciding whether an itemset is frequent, or not. We present a method for estimating itemset supports with two-sided error. In a preprocessing step our algorithm first partitions the database into groups of similar transactions by using locality sensitive hashing and calculates a summary for each of these groups. The support of a query itemset is then estimated by means of these summaries. Our preliminary empirical results indicate that the proposed method results in a speed-up of up to a factor of 50 on large datasets. The F-measure of the output patterns varies between 0.83 and 0.99.
  • Publication
    Pedestrian quantity estimation with trajectory patterns
    In street-based mobility mining, traffic volume estimation receives increasing attention as it provides important applications such as emergency support systems, quality-of-service evaluation and billboard placement. In many real world scenarios, empirical measurements are usually sparse due to some constraints. On the other hand, pedestrians generally show some movement preferences, especially in closed environments, e.g., train stations. We propose a Gaussian process regression based method for traffic volume estimation, which incorporates topological information and prior knowledge on preferred trajectories with a trajectory pattern kernel. Our approach also enables effectively finding most informative sensor placements. We evaluate our method with synthetic German train station pedestr ian data and real-world episodic movement data from the zoo of Duisburg. The empirical analysis demonstrates that incorporating trajectory patterns can largely improve the traffic prediction accuracy, especially when traffic networks are sparsely monitored.
  • Publication
    Text mining and multimedia search in a large content repository
    ( 2009)
    Paaß, Gerhard
    ;
    ;
    Methods of acquiring, seeking and processing knowledge are a strategically vital issue in the context of globalized competition. One of the main subjects currently being researched is the development of semantic technologies that are capable of recognizing and classifying the content and meaning of information (words, pictures or sounds). In the context of the joint project CONTENTUS we show how different text mining techniques in a workflow are able to extract useful semantic information from text. In a comprehensive multimedia search engine these annotations together with text, metadata, and semantic clues extracted from multimedia documents (speech, music, video) may be used to give more focused access to information.
  • Publication
    Toolkit-based high-performance data mining of large data on MapReduce clusters
    The enormous growth of data in a variety of applications has increased the need for high performance data mining based on distributed environments. However, standard data mining toolkits per se do not allow the usage of computing clusters. The success of MapReduce for analyzing large data has raised a general interest in applying this model to other, data intensive applications. Unfortunately current research has not lead to an integration of GUI based data mining toolkits with distributed file system based MapReduce systems. This paper defines novel principles for modeling and design of the user interface, the storage model and the computational model necessary for the integration of such systems. Additionally, it introduces a novel system architecture for interactive GUI based data mining of large data on clusters based on MapReduce that overcomes the limitations of data mining toolkits. As an empirical demonstration we show an implementation based on Weka and Hadoop.
  • Publication
    Context-based clustering of image search results
    In this work we propose to cluster image search results based on the textual contents of the referring webpages. The natural ambiguity and context-dependence of human languages lead to problems that plague modern image search engines: A user formulating a query usually has in mind just one topic, while the results produced to satisfy this query may (and usually do) belong to the different topics. Therefore, only part of the search results are relevant for a user. One of the possible ways to improve the user's experience is to cluster the results according to the topics they belong to and present the clustered results to the user. As opposed to the clustering based on visual features, an approach utilising the text information in the webpages containing the image is less computationally intensive and provides the resulting clusters with semantically meaningful names.