Now showing 1 - 6 of 6
  • Publication
    A generalized Weisfeiler-Lehman graph kernel
    ( 2022-04-27)
    Schulz, Till Hendrik
    ;
    ;
    Welke, Pascal
    ;
    After more than one decade, Weisfeiler-Lehman graph kernels are still among the most prevalent graph kernels due to their remarkable predictive performance and time complexity. They are based on a fast iterative partitioning of vertices, originally designed for deciding graph isomorphism with one-sided error. The Weisfeiler-Lehman graph kernels retain this idea and compare such labels with respect to equality. This binary valued comparison is, however, arguably too rigid for defining suitable graph kernels for certain graph classes. To overcome this limitation, we propose a generalization of Weisfeiler-Lehman graph kernels which takes into account a more natural and finer grade of similarity between Weisfeiler-Lehman labels than equality. We show that the proposed similarity can be calculated efficiently by means of the Wasserstein distance between certain vectors representing Weisfeiler-Lehman labels. This and other facts give rise to the natural choice of partitioning the vertices with the Wasserstein k-means algorithm. We empirically demonstrate on the Weisfeiler-Lehman subtree kernel, which is one of the most prominent Weisfeiler-Lehman graph kernels, that our generalization significantly outperforms this and other state-of-the-art graph kernels in terms of predictive performance on datasets which contain structurally more complex graphs beyond the typically considered molecular graphs.
  • Publication
    Decision Snippet Features
    ( 2021-05-05)
    Welke, Pascal
    ;
    Alkhoury, Fouad
    ;
    ;
    Decision trees excel at interpretability of their prediction results. To achieve required prediction accuracies, however, often large ensembles of decision trees random forests are considered, reducing interpretability due to large size. Additionally, their size slows down inference on modern hardware and restricts their applicability in low-memory embedded devices. We introduce Decision Snippet Features, which are obtained from small subtrees that appear frequently in trained random forests. We subsequently show that linear models on top of these features achieve comparable and sometimes even better predictive performance than the original random forest, while reducing the model size by up to two orders of magnitude.
  • Publication
    HOPS: Probabilistic Subtree Mining for Small and Large Graphs
    ( 2020)
    Welke, Pascal
    ;
    Seiffahrt, Florian
    ;
    ;
    Frequent subgraph mining, i.e., the identification of relevant patterns in graph databases, is a well-known data mining problem with high practical relevance, since next to summarizing the data, the resulting patterns can also be used to define powerful domain-specific similarity functions for prediction. In recent years, significant progress has been made towards subgraph mining algorithms that scale to complex graphs by focusing on tree patterns and probabilistically allowing a small amount of incompleteness in the result. Nonetheless, the complexity of the pattern matching component used for deciding subtree isomorphism on arbitrary graphs has significantly limited the scalability of existing approaches. In this paper, we adapt sampling techniques from mathematical combinatorics to the problem of probabilistic subtree mining in arbitrary databases of many small to medium-size graphs or a single large graph. By restricting on tree patterns, we provide an algorithm tha t approximately counts or decides subtree isomorphism for arbitrary transaction graphs in sub-linear time with one-sided error. Our empirical evaluation on a range of benchmark graph datasets shows that the novel algorithm substantially outperforms state-of-the-art approaches both in the task of approximate counting of embeddings in single large graphs and in probabilistic frequent subtree mining in large databases of small to medium sized graphs.
  • Publication
    Probabilistic frequent subtree kernels
    ( 2016)
    Welke, Pascal
    ;
    ;
    We propose a new probabilistic graph kernel. It is defined by the set of frequent subtrees generated from a small random sample of spanning trees of the transaction graphs. In contrast to the ordinary frequent subgraph kernel it can be computed efficiently for any arbitrary graphs. Due to its probabilistic nature, the embedding function corresponding to our graph kernel is not always correct. Our empirical results on artificial and real-world chemical datasets, however, demonstrate that the graph kernel we propose is much faster than other frequent pattern based graph kernels, with only marginal loss in predictive accuracy.
  • Publication
    Min-hashing for probabilistic frequent subtree feature spaces
    ( 2016)
    Welke, Pascal
    ;
    ;
    We propose a fast algorithm for approximating graph similarities. For its advantageous semantic and algorithmic properties, we define the similarity between two graphs by the Jaccard-similarity of their images in a binary feature space spanned by the set of frequent subtrees generated for some training dataset. Since the feature space embedding is computationally intractable, we use a probabilistic subtree isomorphism operator based on a small sample of random spanning trees and approximate the Jaccard-similarity by min-hash sketches. The partial order on the feature set defined by subgraph isomorphism allows for a fast calculation of the min-hash sketch, without explicitly performing the feature space embedding. Experimental results on real-world graph datasets show that our technique results in a fast algorithm. Furthermore, the approximated similarities are well-suited for classification and retrieval tasks in large graph datasets.
  • Publication
    On the complexity of frequent subtree mining in very simple structures
    ( 2015)
    Welke, Pascal
    ;
    ;
    We study the complexity of frequent subtree mining in very simple graphs beyond forests. We show for d-tenuous outerplanar graphs that frequent subtrees can be listed with polynomial delay if the cycle degree, i.e., the maximum number of blocks that share a common vertex, is bounded by some constant. The crucial step in the proof of this positive result is a polynomial time algorithm deciding subgraph isomorphism from trees into d-tenuous outerplanar graphs of bounded cycle degree. We obtain this algorithm by generalizing the algorithm of Shamir and Tsur that decides subgraph isomorphism between trees. Our results may also be of some interest to algorithmic graph theory, as they indicate that even for very simple structures, the cycle degree is a crucial parameter for the tractability of subgraph isomorphism. We also discuss some interesting problems towards generalizing the positive result of this work.