  • Publication
    Efficient entity resolution for large heterogeneous information spaces
    (2011)
    Papadakis, G.; Ioannou, E.; Niederée, C.; Fankhauser, P.
    We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data impose new challenges on entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.
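The attribute-agnostic mechanism the abstract alludes to can be illustrated with a token-blocking sketch: every token of every attribute value becomes a blocking key, so no schema knowledge is required. This is a minimal illustration of the idea, not the paper's actual implementation; the records and attribute names below are invented.

```python
from collections import defaultdict

def token_blocking(records):
    """Attribute-agnostic blocking: every token from every attribute
    value becomes a blocking key, ignoring the schema entirely."""
    blocks = defaultdict(set)
    for rid, record in records.items():
        for value in record.values():
            for token in str(value).lower().split():
                blocks[token].add(rid)
    # Only blocks with at least two records yield comparisons.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}

records = {
    1: {"name": "John Smith", "city": "Berlin"},
    2: {"fullName": "J. Smith", "location": "Berlin"},
    3: {"title": "Acme Corp"},
}
blocks = token_blocking(records)
# Records 1 and 2 share the blocks "smith" and "berlin" even though
# their attribute names differ.
```

Because blocking keys come from values rather than attribute names, records 1 and 2 land in common blocks despite their different schemata, which is what makes the scheme robust to heterogeneous data.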
  • Publication
    Lies and propaganda
    (2007)
    Mehta, B.; Hofmann, T.; Fankhauser, P.
  • Publication
    Unsupervised duplicate detection using sample non-duplicates
    (2006)
    Lehti, P.; Fankhauser, P.
    The problem of identifying objects in databases that refer to the same real-world entity is known, among others, as duplicate detection or record linkage. Objects may be duplicates even though they are not identical, due to errors and missing data. Typical current methods require deep understanding of the application domain or a good representative training set, which entails significant costs. In this paper we present an unsupervised, domain-independent approach to duplicate detection that starts with a broad alignment of potential duplicates, and analyses the distribution of observed similarity values among these potential duplicates and among representative sample non-duplicates to improve the initial alignment. Additionally, the presented approach is not only able to align flat records, but also makes use of related objects, which may significantly increase the alignment accuracy. Evaluations show that our approach outperforms other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain-dependent approaches.
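The refinement step using sample non-duplicates can be sketched as follows: the similarities observed among known non-duplicates bound how similar unrelated objects can get, and candidate pairs must exceed that bound to survive. This is a deliberately simplified stand-in (a max-based threshold over a difflib string similarity), not the paper's statistical model; all names below are invented.

```python
import difflib

def similarity(a, b):
    # Ratio in [0, 1] based on longest matching subsequences.
    return difflib.SequenceMatcher(None, a, b).ratio()

def refine_alignment(candidates, sample_non_duplicates):
    """Keep only candidate pairs whose similarity exceeds every
    similarity observed among the sample non-duplicates."""
    threshold = max(similarity(a, b) for a, b in sample_non_duplicates)
    return [pair for pair in candidates if similarity(*pair) > threshold]

candidates = [("John Smith", "Jon Smith"), ("John Smith", "Mary Jones")]
# Known non-duplicates show how similar unrelated strings can be.
non_duplicates = [("John Smith", "Johan Schmidt"), ("Berlin", "Bermuda")]
kept = refine_alignment(candidates, non_duplicates)
# Only the genuinely similar pair survives the learned threshold.
```

The point of the sample non-duplicates is that the threshold is learned from the data rather than set by a domain expert.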
  • Publication
    Cross system personalization by factor analysis
    (2006)
    Mehta, B.; Hofmann, T.; Fankhauser, P.
    Today, personalization in information systems occurs separately within each system that one interacts with. However, such isolated approaches have several drawbacks: investments of users in personalizing a system, either through explicit provision of information or through long and regular use, are not transferable to other systems. Moreover, users have little or no control over their profile, since it is deeply buried in personalization engines. Cross-system personalization, i.e. personalization that shares personal information across different systems in a user-centric way, overcomes these problems. User profiles, which are originally scattered across multiple systems, are combined to obtain maximum leverage. This paper discusses an approach in support of cross-system personalization, where a large number of users cross from one system to another, carrying their user profiles with them. These sets of corresponding profiles can be used to learn a mapping between the user profiles of the two systems. In this work, we present and evaluate the use of factor analysis for the purpose of computing recommendations for a new user crossing over from one system to another.
  • Publication
    SWQL - A query language for data integration based on OWL
    (2005)
    Lehti, P.; Fankhauser, P.
    The Web Ontology Language OWL has been advocated as a suitable model for semantic data integration. Data integration requires expressive means to map between heterogeneous OWL schemas. This paper introduces SWQL (Semantic Web Query Language), a strictly typed query language for OWL, and shows how it can be used for mapping between heterogeneous schemas. In contrast to existing RDF query languages which focus on selection and navigation, SWQL also supports construction and user-defined functions to allow for instantiating integrated global schemas in OWL.
  • Publication
    Matchmaking for business processes based on conjunctive finite state automata
    (2005)
    Wombacher, A.; Fankhauser, P.; Mahleko, B.; Neuhold, E.
    Web services have the potential to enhance B2B e-commerce over the Internet by allowing companies and organisations to publish their business processes on service directories where potential trading partners can find them. This can give rise to new business paradigms based on ad-hoc trading relations, as companies, particularly small to medium scale, can cheaply and flexibly enter into fruitful contracts, e.g. through subcontracting from big companies. However, more business process support by the web service infrastructure is needed before such a paradigm change can materialise. The current infrastructure does not provide sufficient support for searching and matching business processes. We believe that such a service is needed and will enable companies and organisations to establish ad-hoc business relations without relying on manually negotiated frame contracts like RosettaNet PIPs. This paper gives a formal semantics to business process matchmaking and an operational description for matchmaking.
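The core matchmaking criterion, that two processes are compatible if their automata accept at least one common message sequence, can be sketched as an emptiness check on the product automaton. The paper's conjunctive finite state automata are richer than the plain deterministic FSAs used here; states, message names, and processes below are invented for illustration.

```python
from collections import deque

def processes_match(trans_a, start_a, finals_a, trans_b, start_b, finals_b):
    """Breadth-first search over the product automaton: the two
    processes match iff some jointly reachable pair of states is
    accepting in both automata (i.e. they share a common word)."""
    seen = {(start_a, start_b)}
    queue = deque(seen)
    while queue:
        a, b = queue.popleft()
        if a in finals_a and b in finals_b:
            return True
        for message, next_a in trans_a.get(a, {}).items():
            next_b = trans_b.get(b, {}).get(message)
            if next_b is not None and (next_a, next_b) not in seen:
                seen.add((next_a, next_b))
                queue.append((next_a, next_b))
    return False

# A buyer that orders then pays, a seller that accepts both paths,
# and a seller that insists on invoicing (no common run with the buyer).
buyer = {0: {"order": 1}, 1: {"pay": 2}}
seller = {0: {"order": 1}, 1: {"invoice": 2, "pay": 3}}
mismatched = {0: {"order": 1}, 1: {"invoice": 2}}
```

With these toy processes, `processes_match(buyer, 0, {2}, seller, 0, {3})` finds the common run order/pay, whereas the mismatched seller shares no accepted sequence with the buyer.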
  • Publication
    A precise blocking method for record linkage
    (2005)
    Lehti, P.; Fankhauser, P.
    Identifying approximately duplicate records between databases requires the costly computation of distances between their attributes. Thus duplicate detection is usually performed in two phases, an efficient blocking phase that determines few potential candidate duplicates based on simple criteria, followed by a second phase performing an in-depth comparison of the candidate duplicates. This paper introduces and evaluates a precise and efficient approach for the blocking phase, which requires only standard indices, but performs as well as other approaches based on special purpose indices, and outperforms other approaches based on standard indices. The key idea of the approach is to use a comparison window with a size that depends dynamically on a maximum distance, rather than using a window with fixed size.
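The dynamically sized comparison window can be sketched over a sorted list of blocking keys: instead of pairing each record with a fixed number of successors, pairing stops as soon as the key distance exceeds the maximum. A minimal sketch with an invented numeric key, assuming the key distance grows monotonically along the sort order (exact for numeric keys, only approximate for string keys):

```python
def dynamic_window_blocking(records, key, distance, max_distance):
    """Sorted-neighbourhood blocking with a dynamic window: each record
    is paired with the records that follow it in sort order only while
    the key distance stays within max_distance."""
    ordered = sorted(records, key=key)
    candidates = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:]:
            if distance(key(rec), key(other)) > max_distance:
                break  # keys are sorted, so later records are farther away
            candidates.append((rec, other))
    return candidates

pairs = dynamic_window_blocking(
    [3, 5, 20, 21],
    key=lambda r: r,
    distance=lambda a, b: abs(a - b),
    max_distance=3,
)
# pairs == [(3, 5), (20, 21)]
```

A fixed window of size 2 would have generated the useless pair (5, 20) as well; the distance bound prunes it while still finding every pair within the maximum distance.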
  • Publication
    Statistical relationship determination in automatic thesaurus construction
    (2005)
    Chen, L.; Fankhauser, P.; Thiel, U.; Kamps, T.
  • Publication
    Probabilistic iterative duplicate detection
    (2005)
    Lehti, P.; Fankhauser, P.
    The problem of identifying approximately duplicate records between databases is known, among others, as duplicate detection or record linkage. To this end, typically either rules or a weighted aggregation of distances between the individual attributes of potential duplicates is used. However, choosing the appropriate rules, distance functions, weights, and thresholds requires deep understanding of the application domain or a good representative training set for supervised learning approaches. In this paper we present an unsupervised, domain-independent approach that starts with a broad alignment of potential duplicates, and analyses the distribution of observed distances among potential duplicates and among non-duplicates to iteratively refine the initial alignment. Evaluations show that this approach outperforms other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain-dependent approaches.
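The iterative refinement loop can be caricatured as a one-dimensional two-means split over observed pair distances: classify pairs with the current threshold, then move the threshold to the midpoint of the two groups' mean distances, and repeat until it stabilises. This is not the paper's probabilistic model, only a compact illustration of re-estimating a decision boundary from the distance distribution; the distances below are invented.

```python
def iterative_threshold(distances, start_threshold, rounds=10):
    """Iteratively refine a duplicate/non-duplicate threshold: split the
    observed pair distances at the current threshold, then move the
    threshold to the midpoint of the two groups' means."""
    threshold = start_threshold
    for _ in range(rounds):
        dups = [d for d in distances if d <= threshold]
        nons = [d for d in distances if d > threshold]
        if not dups or not nons:
            break  # degenerate split; keep the current threshold
        new = (sum(dups) / len(dups) + sum(nons) / len(nons)) / 2
        if abs(new - threshold) < 1e-9:
            break  # converged
        threshold = new
    return threshold

# Even a poor initial guess of 0.3 migrates to the gap between the
# low-distance (duplicate) and high-distance (non-duplicate) clusters.
t = iterative_threshold([0.1, 0.15, 0.2, 0.8, 0.85, 0.9], 0.3)
```

The appeal of the iterative scheme is the same as in the paper: no labelled training pairs are needed, only the shape of the distance distribution.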