  • Publication
    An enhanced relevance criterion for more concise supervised pattern discovery
    Supervised local pattern discovery aims to find subsets of a database with a high statistical unusualness in the distribution of a target attribute. Local pattern discovery is often used to generate a human-understandable representation of the most interesting dependencies in a data set. Hence, the more crisp and concise the output is, the better. Unfortunately, standard algorithms often produce very large and redundant outputs. In this paper, we introduce delta-relevance, a stricter criterion of relevance. It allows us to significantly reduce the output space while guaranteeing that every local pattern has a delta-relevant representative which is almost as good in a clearly defined sense. We show empirically that delta-relevance leads to a considerable reduction in the number of returned patterns. We also demonstrate that, in a top-k setting, removing patterns that are not delta-relevant improves the quality of the result set.
  • Publication
    Secure Top-k subgroup discovery
    Supervised descriptive rule discovery techniques like subgroup discovery are quite popular in applications like fraud detection or clinical studies. Compared with other descriptive techniques, like classical support/confidence association rules, subgroup discovery has the advantage that it returns only the top-k patterns, and that it makes use of a quality function that avoids patterns uncorrelated with the target. If these techniques are to be applied in privacy-sensitive scenarios involving distributed data, precise guarantees are needed regarding the amount of information leaked during the execution of the data mining. Unfortunately, the adaptation of secure multi-party protocols for classical support/confidence association rule mining to the task of subgroup discovery is impossible for fundamental reasons. The cause is the different quality function and the restriction to a fixed number of patterns, i.e. exactly the desired features of subgroup discovery. In this paper, we present a new protocol which allows distributed subgroup discovery while avoiding the disclosure of the individual databases. We analyze the properties of the protocol, describe a prototypical implementation and present experiments that demonstrate the feasibility of the approach.
  • Publication
    Secure distributed subgroup discovery in horizontally partitioned data
    Supervised descriptive rule discovery techniques like subgroup discovery are quite popular in applications like fraud detection or clinical studies. Compared with other descriptive techniques, like classical support/confidence association rules, subgroup discovery has the advantage that it returns only the top-k patterns, and that it makes use of a quality function that avoids patterns uncorrelated with the target. If these techniques are to be applied in privacy-sensitive scenarios involving distributed data, precise guarantees are needed regarding the amount of information leaked during the execution of the data mining. Unfortunately, the adaptation of secure multi-party protocols for classical support/confidence association rule mining to the task of subgroup discovery is impossible for fundamental reasons. The cause is the different quality function and the restriction to a fixed number of patterns, i.e. exactly the desired features of subgroup discovery. In this paper, we present new protocols which allow distributed subgroup discovery while avoiding the disclosure of the individual databases. We analyze the properties of the protocols, describe a prototypical implementation and present experiments that demonstrate the feasibility of the approach.
  • Publication
    On subgroup discovery in numerical domains
    Subgroup discovery is a knowledge discovery task that aims at finding subgroups of a population with high generality and distributional unusualness. While several subgroup discovery algorithms have been presented in the past, they focus on databases with nominal attributes or make use of discretization to get rid of the numerical attributes. In this paper, we illustrate why the replacement of numerical attributes by nominal attributes can lead to suboptimal results. Thereafter, we present a new subgroup discovery algorithm that prunes large parts of the search space by exploiting bounds between related numerical subgroup descriptions. The same algorithm can also be applied to ordinal attributes. In an experimental section, we show that the use of our new pruning scheme results in a huge performance gain when more than just a few split-points are considered for the numerical attributes.
  • Publication
    Integrated Web services platform for the facilitation of fraud detection in health care e-government services
    (2009) Tagaris, A.; Konnis, G.; Benetou, X.; Dimakopoulos, T.; Kassis, K.; Athanasiadis, N.; [...]; Koutsouris, D.
    Public healthcare is a basic service provided by governments to citizens which is increasingly coming under pressure as the European population ages and the ratio of working to elderly persons falls. A way to make public spending on healthcare more efficient is to ensure that the money is spent on legitimate causes. This paper presents the work of the iWebCare project, in which a flexible, on-line fraud detection web services platform was designed and developed. It aims to help those in the healthcare business minimize the loss of funds to fraud. The platform is able to detect erroneous or suspicious records in submitted health care data sets, ensuring homogeneity and consistency and promoting awareness and harmonization of fraud detection practices across health care systems in the EU. Critical objectives included the development of an ontology of health care data associated with semantic rules, implementation and initial population of an ontology and rules repository, development of a fraud detection engine and implementation of a data mining module. The potential impact of this work can be substantial. More money for healthcare means better healthcare. Living conditions and the trust of citizens in public healthcare will be improved.
  • Publication
    Optimistic estimate pruning strategies for fast exhaustive subgroup discovery
    (Fraunhofer IAIS, 2008) [...]; Shabaani, N.; [...]
    Subgroup discovery is the task of finding subgroups of a population which exhibit both distributional unusualness and high generality at the same time. Since the corresponding evaluation functions are not monotonic, the standard pruning techniques from monotonic problems such as frequent set discovery cannot be used. In this paper, we show that optimistic estimate pruning, previously considered only in a very simple and heuristic way, can be developed into a sound and highly effective pruning approach for subgroup discovery. We present and prove new optimistic estimates for several commonly used subgroup quality functions, describe a subgroup discovery algorithm with novel exploration strategies based on optimistic estimates, and show that this algorithm outperforms previous algorithms by an order of magnitude or more.
  • Publication
    Tight optimistic estimates for fast subgroup discovery
    Subgroup discovery is the task of finding subgroups of a population which exhibit both distributional unusualness and high generality. Due to the non-monotonicity of the corresponding evaluation functions, standard pruning techniques cannot be used for subgroup discovery, requiring the use of optimistic estimate techniques instead. So far, however, optimistic estimate pruning has only been considered for the extremely simple case of a binary target attribute, and no attempt has been made to move beyond suboptimal heuristic optimistic estimates. In this paper, we show that optimistic estimate pruning can be developed into a sound and highly effective pruning approach for subgroup discovery. Based on a precise definition of optimality, we show that previous estimates have been tight only in special cases. Thereafter, we present tight optimistic estimates for the most popular binary and multi-class quality functions, and present a family of increasingly efficient approximations to these optimal functions. As we show in empirical experiments, the use of our newly proposed optimistic estimates can lead to a speedup of an order of magnitude compared to previous approaches.
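
As a rough illustration of the optimistic estimate pruning idea mentioned in the abstracts above, the following minimal sketch searches conjunctions of attribute=value selectors for a binary target. It is not taken from the listed papers. It assumes the Piatetsky-Shapiro quality n * (p - p0) and bounds it by tp * (1 - p0), the quality a refinement would reach if it kept exactly the positive records of the current subgroup. The function names, the toy data, and the depth-first search order are all hypothetical choices made for this example.

```python
from fractions import Fraction

def ps_quality(n, tp, p0):
    """Piatetsky-Shapiro quality n * (p - p0), rewritten as tp - n * p0."""
    return tp - n * p0

def optimistic_estimate(tp, p0):
    """Upper bound for all refinements: a subset holding only the tp positives."""
    return tp * (1 - p0)

def best_subgroup(rows, labels, selectors):
    """Depth-first top-1 search; returns ((quality, description), nodes evaluated)."""
    p0 = Fraction(sum(labels), len(labels))  # exact share of positives overall
    best = (Fraction(-len(rows)), ())        # below any achievable quality
    visited = 0

    def search(description, covered, start):
        nonlocal best, visited
        for i in range(start, len(selectors)):
            attr, value = selectors[i]
            sub = [j for j in covered if rows[j][attr] == value]
            if not sub:
                continue  # empty subgroups are skipped
            visited += 1
            tp = sum(labels[j] for j in sub)
            quality = ps_quality(len(sub), tp, p0)
            refined = description + ((attr, value),)
            if quality > best[0]:
                best = (quality, refined)
            # Refine only if some descendant could still beat the current best.
            if optimistic_estimate(tp, p0) > best[0]:
                search(refined, sub, i + 1)

    search((), list(range(len(rows))), 0)
    return best, visited

# Hypothetical toy data: two binary attributes, 2 positives out of 6 records.
rows = [
    {"a": 1, "b": 1}, {"a": 1, "b": 1}, {"a": 1, "b": 0},
    {"a": 0, "b": 1}, {"a": 0, "b": 0}, {"a": 0, "b": 0},
]
labels = [1, 1, 0, 0, 0, 0]
selectors = [("a", 1), ("a", 0), ("b", 1), ("b", 0)]

best, visited = best_subgroup(rows, labels, selectors)
print(best, visited)
```

On this toy data the branches starting with a=0, b=1 and b=0 are never refined, because their bounds cannot exceed the quality already found for the subgroup a=1 and b=1. Exact fractions are used so that the comparison between bound and best quality is not affected by floating-point rounding.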