
Lies and propaganda

2007, Mehta, B., Hofmann, T., Fankhauser, P.

Statistical relationship determination in automatic thesaurus construction

2005, Chen, L., Fankhauser, P., Thiel, U., Kamps, T.

A precise blocking method for record linkage

2005, Lehti, P., Fankhauser, P.

Identifying approximately duplicate records between databases requires the costly computation of distances between their attributes. Duplicate detection is therefore usually performed in two phases: an efficient blocking phase that selects a small set of candidate duplicates based on simple criteria, followed by a second phase that compares the candidate duplicates in depth. This paper introduces and evaluates a precise and efficient approach for the blocking phase, which requires only standard indices, but performs as well as other approaches based on special-purpose indices, and outperforms other approaches based on standard indices. The key idea of the approach is to use a comparison window whose size depends dynamically on a maximum distance, rather than a window of fixed size.
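The window idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `dynamic_window_pairs`, the use of numeric blocking keys (for example a birth year), and the plain sorted list standing in for a standard index are all assumptions made for illustration.

```python
from bisect import bisect_right

def dynamic_window_pairs(keys, max_distance):
    """Return candidate pairs (i, j), i < j, whose keys differ by at most max_distance.

    Hypothetical helper: `keys` holds one numeric blocking key per record.
    """
    # Sort record indices by key; a sorted list stands in for a standard index.
    order = sorted(range(len(keys)), key=keys.__getitem__)
    sorted_keys = [keys[i] for i in order]
    pairs = []
    for pos, key in enumerate(sorted_keys):
        # The window ends where the key distance exceeds max_distance,
        # so its size varies per record instead of being a fixed count.
        end = bisect_right(sorted_keys, key + max_distance)
        for other in range(pos + 1, end):
            a, b = order[pos], order[other]
            pairs.append((min(a, b), max(a, b)))
    return pairs

if __name__ == "__main__":
    years = [1990, 1985, 1991, 2005, 1986]
    print(sorted(dynamic_window_pairs(years, 1)))
```

With a fixed window of size 2 over the same sorted keys, the records keyed 1986 and 1990 would also be paired although they are four years apart; bounding the window by distance avoids such candidates in dense and sparse key regions alike.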

Beyond webservices - conceptual modelling for service oriented architectures

2004, Fankhauser, P.

Unsupervised duplicate detection using sample non-duplicates

2006, Lehti, P., Fankhauser, P.

The problem of identifying objects in databases that refer to the same real-world entity is known, among other names, as duplicate detection or record linkage. Because of errors and missing data, objects may be duplicates even though they are not identical. Typical current methods require a deep understanding of the application domain or a good representative training set, which entails significant costs. In this paper we present an unsupervised, domain-independent approach to duplicate detection that starts with a broad alignment of potential duplicates, and analyses the distribution of observed similarity values among these potential duplicates and among representative sample non-duplicates to improve the initial alignment. Additionally, the presented approach is not limited to aligning flat records, but also makes use of related objects, which may significantly increase the alignment accuracy. Evaluations show that our approach outperforms other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain-dependent approaches.
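One way to picture how similarity distributions among potential duplicates and sample non-duplicates might be compared is the following Python sketch. It is an illustrative assumption, not the method of the paper: the function `duplicate_probability`, the equal-width binning, and the parameter `nondup_share` (an assumed share of non-duplicates among the potential duplicates) are all hypothetical. It treats the candidate distribution as a mixture, P(s | candidate) = (1 - q) P(s | duplicate) + q P(s | non-duplicate), and solves for P(duplicate | s) per bin.

```python
from collections import Counter

def duplicate_probability(candidate_sims, nonduplicate_sims, nondup_share, bins=5):
    """Estimate P(duplicate | similarity bin) for similarities in [0, 1].

    Hypothetical helper; both input lists are assumed to be non-empty.
    """
    def histogram(values):
        # Relative frequency per equal-width bin; 1.0 falls into the last bin.
        counts = Counter(min(int(v * bins), bins - 1) for v in values)
        return [counts[b] / len(values) for b in range(bins)]

    cand = histogram(candidate_sims)
    nondup = histogram(nonduplicate_sims)
    probs = []
    for c, n in zip(cand, nondup):
        if c == 0:
            probs.append(None)  # no candidate pairs observed in this bin
        else:
            # Subtract the share of the bin explained by non-duplicates.
            probs.append(max(0.0, 1.0 - nondup_share * n / c))
    return probs

if __name__ == "__main__":
    candidates = [0.1, 0.1, 0.9, 0.9]
    sample_nonduplicates = [0.1, 0.1, 0.1, 0.1]
    print(duplicate_probability(candidates, sample_nonduplicates, 0.5, bins=2))
```

Bins where candidates look like the sample non-duplicates receive a low probability, and pairs falling in those bins could be dropped from the initial alignment.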

Process-annotated service discovery facilitated by an n-gram-based index

2005, Mahleko, B., Wombacher, A., Fankhauser, P.

SWQL - A query language for data integration based on OWL

2005, Lehti, P., Fankhauser, P.

The Web Ontology Language (OWL) has been advocated as a suitable model for semantic data integration. Data integration requires expressive means to map between heterogeneous OWL schemas. This paper introduces SWQL (Semantic Web Query Language), a strictly typed query language for OWL, and shows how it can be used for mapping between heterogeneous schemas. In contrast to existing RDF query languages, which focus on selection and navigation, SWQL also supports construction and user-defined functions, so that integrated global schemas can be instantiated in OWL.

Probabilistic iterative duplicate detection

2005, Lehti, P., Fankhauser, P.

The problem of identifying approximately duplicate records between databases is known, among other names, as duplicate detection or record linkage. To this end, typically either rules or a weighted aggregation of distances between the individual attributes of potential duplicates is used. However, choosing the appropriate rules, distance functions, weights, and thresholds requires a deep understanding of the application domain or, for supervised learning approaches, a good representative training set. In this paper we present an unsupervised, domain-independent approach that starts with a broad alignment of potential duplicates, and analyses the distribution of observed distances among potential duplicates and among non-duplicates to iteratively refine the initial alignment. Evaluations show that this approach outperforms other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain-dependent approaches.

A grammar-based index for matching business processes

2005, Mahleko, B., Wombacher, A., Fankhauser, P.

Overview on decentralized establishment of consistent multi-lateral collaborations based on asynchronous communication

2005, Wombacher, A., Fankhauser, P., Aberer, K.