Now showing 1 - 4 of 4
  • Publication
    Listing closed sets of strongly accessible set systems with applications to data mining
    We study the problem of listing all closed sets of a closure operator a that is a partial function on the power set of some finite ground set E, i.e., sigma : F -> F with F subset of P(E). A very simple divide-and-conquer algorithm is analyzed that correctly solves this problem if and only if the domain of the closure operator is a strongly accessible set system. Strong accessibility is a strict relaxation of greedoids as well as of independence systems. This algorithm turns out to have delay O (vertical bar E vertical bar (T-F + T-sigma + vertical bar E vertical bar)) and space O (vertical bar E vertical bar + S-F + S-sigma), where T-F, S-F, T-sigma, and S-sigma are the time and space complexities of checking membership in F and computing a, respectively. In contrast, we show that the problem becomes intractable for accessible set systems. We relate our results to the data mining problem of listing all support-closed patterns of a dataset and show that there is a corresponding closure operator for all datasets if and only if the set system satisfies a certain confluence property.
  • Publication
    Efficient discovery of interesting patterns based on strong closedness
    Finding patterns that are interesting to a user in a certain application context is one of the central goals of data mining research. Regarding all patterns above a certain frequency threshold as interesting is one way of defining interestingness. In this paper, however, we argue that in many applications, a different notion of interestingness is required in order to be able to capture "long", and thus particularly informative, patterns that are correspondingly of low frequency. To identify such patterns, our proposed measure of interestingness is based on the degree or strength of closedness of the patterns. We show that (i) indeed this definition selects long interesting patterns that are difficult to identify with frequency-based approaches, and (ii) that it selects patterns that are robust against noise and/or dynamic changes. We prove that the family of interesting patterns proposed here forms a closure system and use the corresponding closure operator to design a mining algorithm listing these patterns in amortized quadratic time. In particular, for nonsparse datasets its time complexity is O(nm) per pattern, where n denotes the number of items and m the size of the database. This is equal to the best known time bound for listing ordinary closed frequent sets, which is a special case of our problem. We also report empirical results with real-world datasets.