  • Publication
    Learning Weakly Convex Sets in Metric Spaces
    (2021-09-10)
    Stadtländer, Eike
    We introduce the notion of weak convexity in metric spaces, a generalization of ordinary convexity commonly used in machine learning. It is shown that weakly convex sets can be characterized by a closure operator and have a unique decomposition into a set of pairwise disjoint connected blocks. We give two generic efficient algorithms, an extensional and an intensional one for learning weakly convex concepts and study their formal properties. Our experimental results concerning vertex classification clearly demonstrate the excellent predictive performance of the extensional algorithm. Two non-trivial applications of the intensional algorithm to polynomial PAC-learnability are presented. The first one deals with learning k-convex Boolean functions, which are already known to be efficiently PAC-learnable. It is shown how to derive this positive result in a fairly easy way by the generic intensional algorithm. The second one is concerned with the Euclidean space equipped with the Manhattan distance. For this metric space, weakly convex sets form a union of pairwise disjoint axis-aligned hyperrectangles. We show that a weakly convex set that is consistent with a set of examples and contains a minimum number of hyperrectangles can be found in polynomial time. In contrast, this problem is known to be NP-complete if the hyperrectangles may be overlapping.
  • Publication
    Maximum Margin Separations in Finite Closure Systems
    (2021)
    Seiffarth, Florian
    Monotone linkage functions provide a measure for proximities between elements and subsets of a ground set. Combining this notion with Vapnik's idea of support vector machines, we extend the concepts of maximal closed set and half-space separation in finite closure systems to those with maximum margin. In particular, we define the notion of margin for finite closure systems by means of monotone linkage functions and give a greedy algorithm computing a maximum margin closed set separation for two sets efficiently. The output closed sets are maximum margin half-spaces, i.e., form a partitioning of the ground set if the closure system is Kakutani. We have empirically evaluated our approach on different synthetic datasets. In addition to binary classification of finite subsets of the Euclidean space, we also considered the problem of vertex classification in graphs. Our experimental results provide clear evidence that maximal closed set separation with maximum margin results in a much better predictive performance than that with arbitrary maximal closed sets.
  • Publication
    Effective approximation of parametrized closure systems over transactional data streams
    Strongly closed itemsets, defined by a parameterized closure operator, are a generalization of ordinary closed itemsets. Depending on the strength of closedness, the family of strongly closed itemsets typically forms a tiny subfamily of ordinary closed itemsets that is stable against changes in the input. In this paper we consider the problem of mining strongly closed itemsets from transactional data streams. Utilizing their algebraic and algorithmic properties, we propose an algorithm based on reservoir sampling for approximating this type of itemsets in the landmark streaming setting, prove its correctness, and show empirically that it yields a considerable speed-up over a straightforward naive algorithm without any significant loss in precision and recall. We motivate the problem setting considered by two practical applications. In particular, we first experimentally demonstrate that the above properties, i.e., compactness and stability, make strongly closed itemsets an excellent indicator of certain types of concept drifts in transactional data streams. As a second application we consider computer-aided product configuration, a real-world problem raised by an industrial project. For this problem, which is essentially exact concept identification, we propose a learning algorithm based on a certain type of subset queries formed by strongly closed itemsets and show on real-world datasets that it requires significantly fewer query evaluations than a naive algorithm based on membership queries.
  • Publication
    Maximal Closed Set and Half-Space Separations in Finite Closure Systems
    (2020)
    Seiffarth, Florian
    Motivated by various binary classification problems in structured data (e.g., graphs or other relational and algebraic structures), we investigate some algorithmic properties of closed set and half-space separation in abstract closure systems. Assuming that the underlying closure system is finite and given by the corresponding closure operator, we formulate some negative and positive complexity results for these two separation problems. In particular, we prove that deciding half-space separability in abstract closure systems is NP-complete in general. On the other hand, for the relaxed problem of maximal closed set separation we propose a simple greedy algorithm and show that it is efficient and has the best possible lower bound on the number of closure operator calls. As a second direction to overcome the negative result above, we consider Kakutani closure systems and show first that our greedy algorithm provides an algorithmic characterization of this kind of set systems. As one of the major potential application fields, we then focus on Kakutani closure systems over graphs and generalize a fundamental characterization result based on the Pasch axiom to graph structure partitioning of finite sets. Though the primary focus of this work is on the generality of the results obtained, we experimentally demonstrate the practical usefulness of our approach on vertex classification in different graph datasets.
  • Publication
    Support Estimation in Frequent Itemset Mining by Locality Sensitive Hashing
    The main computational effort in generating all frequent itemsets in a transactional database is in the step of deciding whether or not an itemset is frequent. We present a method for estimating itemset supports with two-sided error. In a preprocessing step our algorithm first partitions the database into groups of similar transactions by using locality sensitive hashing and calculates a summary for each of these groups. The support of a query itemset is then estimated by means of these summaries. Our preliminary empirical results indicate that the proposed method results in a speed-up of up to a factor of 50 on large datasets. The F-measure of the output patterns varies between 0.83 and 0.99.
  • Publication
    Mining Tree Patterns with Partially Injective Homomorphisms
    (2019)
    Schulz, Till Hendrik
    Welke, Pascal
    One of the main differences between inductive logic programming (ILP) and graph mining lies in the pattern matching operator applied: While it is mainly defined by relational homomorphism (i.e., subsumption) in ILP, subgraph isomorphism is the most common pattern matching operator in graph mining. Using the fact that subgraph isomorphisms are injective homomorphisms, we bridge the gap between ILP and graph mining by considering a natural transition from homomorphisms to subgraph isomorphisms that is defined by partially injective homomorphisms, i.e., which require injectivity only for subsets of the vertex pairs in the pattern. Utilizing positive complexity results on deciding homomorphisms from bounded tree-width graphs, we present an algorithm mining frequent trees from arbitrary graphs w.r.t. partially injective homomorphisms. Our experimental results show that the predictive performance of the patterns obtained is comparable to that of ordinary frequent subgraphs. Thus, by preserving much from the advantageous properties of homomorphisms and subgraph isomorphisms, our approach provides a trade-off between efficiency and predictive power.
  • Publication
    Probabilistic and exact frequent subtree mining in graphs beyond forests
    (2019)
    Welke, Pascal
    Motivated by the impressive predictive power of simple patterns, we consider the problem of mining frequent subtrees in arbitrary graphs. Although the restriction of the pattern language to trees does not resolve the computational complexity of frequent subgraph mining, in a recent work we have shown that it gives rise to an algorithm generating probabilistic frequent subtrees, a random subset of all frequent subtrees, from arbitrary graphs with polynomial delay. It is based on replacing each transaction graph in the input database with a forest formed by a random subset of its spanning trees. This simple technique turned out to be quite powerful on molecule classification tasks. It has, however, the drawback that the number of sampled spanning trees must be bounded by a polynomial of the size of the transaction graphs, resulting in less impressive recall even for slightly more complex structures beyond molecular graphs. To overcome this limitation, in this work we propose an algorithm mining probabilistic frequent subtrees also with polynomial delay, but by replacing each graph with a forest formed by an exponentially large implicit subset of its spanning trees. We demonstrate the superiority of our algorithm over the simple one on threshold graphs used e.g. in spectral clustering. In addition, providing sufficient conditions for the completeness and efficiency of our algorithm, we obtain a positive complexity result on exact frequent subtree mining for a novel, practically and theoretically relevant graph class that is orthogonal to all graph classes defined by some constant bound on monotone graph properties.
  • Publication
    Probabilistic frequent subtrees for efficient graph classification and retrieval
    (2018)
    Welke, Pascal
    Frequent subgraphs proved to be powerful features for graph classification and prediction tasks. Their practical use is, however, limited by the computational intractability of pattern enumeration and that of graph embedding into frequent subgraph feature spaces. We propose a simple probabilistic technique that resolves both limitations. In particular, we restrict the pattern language to trees and relax the demand on the completeness of the mining algorithm, as well as on the correctness of the pattern matching operator by replacing transaction and query graphs with small random samples of their spanning trees. In this way we consider only a random subset of frequent subtrees, called probabilistic frequent subtrees, that can be enumerated efficiently. Our extensive empirical evaluation on artificial and benchmark molecular graph datasets shows that probabilistic frequent subtrees can be listed in practically feasible time and that their predictive and retrieval performance is very close even to those of complete sets of frequent subgraphs. We also present different fast techniques for computing the embedding of unseen graphs into (probabilistic frequent) subtree feature spaces. These algorithms utilize the partial order on tree patterns induced by subgraph isomorphism and, as we show empirically, require far fewer evaluations of subtree isomorphism than the standard brute-force algorithm. We also consider partial embeddings, i.e., when only a part of the feature vector has to be calculated. In particular, we propose a highly effective practical algorithm that significantly reduces the number of pattern matching evaluations required by the classical min-hashing algorithm approximating Jaccard-similarities.
  • Publication
    A logic-based approach to relation extraction from texts
    (2010)
    Paaß, Gerhard
    Reichartz, F.
    In recent years, text mining has moved far beyond the classical problem of text classification with an increased interest in more sophisticated processing of large text corpora, such as, for example, evaluations of complex queries. This and several other tasks are based on the essential step of relation extraction. This problem becomes a typical application of learning logic programs by considering the dependency trees of sentences as relational structures and examples of the target relation as ground atoms of a target predicate. In this way, each example is represented by a definite first-order Horn-clause. We show that an adaptation of Plotkin's least general generalization (LGG) operator can effectively be applied to such clauses and propose a simple and effective divide-and-conquer algorithm for listing a certain set of LGGs. We use these LGGs to generate binary features and compute the hypothesis by applying SVM to the feature vectors obtained. Empirical results on the ACE-2003 benchmark dataset indicate that the performance of our approach is comparable to state-of-the-art kernel methods.
  • Publication
    Frequent subgraph mining in outerplanar graphs
    In recent years there has been an increased interest in frequent pattern discovery in large databases of graph structured objects. While the frequent connected subgraph mining problem for tree datasets can be solved in incremental polynomial time, it becomes intractable for arbitrary graph databases. Existing approaches have therefore resorted to various heuristic strategies and restrictions of the search space, but have not identified a practically relevant tractable graph class beyond trees. In this paper, we consider the class of outerplanar graphs, a strict generalization of trees, develop a frequent subgraph mining algorithm for outerplanar graphs, and show that it works in incremental polynomial time for the practically relevant subclass of well-behaved outerplanar graphs, i.e., which have only polynomially many simple cycles. We evaluate the algorithm empirically on chemo- and bioinformatics applications.
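
Several entries in this list concern separating two sets by closed sets in a finite closure system. As a rough illustration of that idea, the sketch below grows two disjoint closed sets greedily, using only calls to a closure operator. It is a minimal sketch under stated assumptions and not the authors' published algorithm: the names `greedy_separation` and `interval_closure` are hypothetical, and the example closure system (integer intervals over a finite ground set) is chosen only because it is easy to check by hand.

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

# A closure operator maps a subset of the ground set to its closure.
Closure = Callable[[FrozenSet[int]], FrozenSet[int]]


def greedy_separation(
    ground: Iterable[int],
    closure: Closure,
    a: Iterable[int],
    b: Iterable[int],
) -> Optional[Tuple[FrozenSet[int], FrozenSet[int]]]:
    """Greedily grow two disjoint closed sets, one containing a, one containing b.

    Returns None if the closures of a and b already intersect.
    (Hypothetical helper written for illustration only.)
    """
    h1, h2 = closure(frozenset(a)), closure(frozenset(b))
    if h1 & h2:
        return None  # a and b cannot be separated by closed sets
    for e in ground:
        if e in h1 or e in h2:
            continue
        # Try to add e to the first set; keep it only if disjointness survives.
        c1 = closure(h1 | {e})
        if not (c1 & h2):
            h1 = c1
            continue
        # Otherwise try the second set.
        c2 = closure(h2 | {e})
        if not (c2 & h1):
            h2 = c2
    return h1, h2


def interval_closure(ground: Iterable[int]) -> Closure:
    """Example closure operator (an assumption): smallest integer interval
    within the ground set that contains the given subset."""
    ground_set = frozenset(ground)

    def closure(s: FrozenSet[int]) -> FrozenSet[int]:
        if not s:
            return frozenset()
        lo, hi = min(s), max(s)
        return frozenset(x for x in ground_set if lo <= x <= hi)

    return closure


if __name__ == "__main__":
    ground = list(range(10))
    rho = interval_closure(ground)
    print(greedy_separation(ground, rho, {2}, {7}))
```

With this interval closure, the two output sets happen to partition the ground set, which matches the half-space behaviour the abstracts describe for Kakutani closure systems. For other closure operators the sketch only guarantees two disjoint closed sets.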