Publication

Efficient entity resolution for large heterogeneous information spaces

2011 , Papadakis, G. , Ioannou, E. , Niederée, C. , Fankhauser, P.

We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data impose new challenges to entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient. Copyright 2011 ACM.
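As a rough illustration of what an attribute-agnostic blocking step can look like, the sketch below places two entity profiles in the same block whenever any of their attribute values share a token, ignoring attribute names entirely. It is a minimal sketch under that assumption and not the paper's implementation: the function names `token_blocking` and `candidate_pairs` and the sample profiles are hypothetical, and the block processing techniques mentioned in the abstract (boosting, propagation, preemption) are not shown.

```python
from collections import defaultdict
from itertools import combinations
import re


def token_blocking(profiles):
    """Group profile ids by every token found in any attribute value.

    Attribute names are never inspected, so profiles with different
    schemata can still end up in the same block.
    """
    blocks = defaultdict(set)
    for pid, attributes in profiles.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(pid)
    # A block with a single profile produces no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct profile pairs that co-occur in some block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical profiles with unrelated schemata.
profiles = {
    "a": {"name": "Jane Smith", "org": "Example Lab"},
    "b": {"fullName": "J. Smith"},
    "c": {"title": "Entity Resolution"},
}
blocks = token_blocking(profiles)
pairs = candidate_pairs(blocks)
```

Only the pairs in `pairs` would be passed on to the (expensive) matching step, which is what keeps the overall task from comparing every profile with every other one.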

Publication

The missing links: Discovering hidden same-as links among a billion of triples

2010 , Papadakis, G. , Demartini, G. , Fankhauser, P. , Kärger, P.

The Semantic Web is constantly gaining momentum, as more and more Web sites and content providers adopt its principles. At the core of these principles lies the Linked Data movement, which demands that data on the Web shall be annotated and linked among different sources, instead of being isolated in data silos. In order to materialize this vision of a web of semantics, existing resource identifiers should be reused and shared between different Web sites. This is not always the case with the current state of the Semantic Web, since multiple identifiers are, more often than not, redundantly introduced for the same resources. In this paper we introduce a novel approach to automatically detect redundant identifiers solely by matching the URIs of information resources. The approach, based on a common pattern among Semantic Web URIs, provides a simple and practical method for duplicate detection. We apply this method on a large snapshot of the current Semantic Web comprising 1.15 billion statements and estimate the number of hidden duplicates in it. The outcomes of our experiments confirm the effectiveness as well as the efficiency of our method, and suggest that URI matching can be used as a scalable filter for discovering implicit same-as links.

Publication

Unsupervised duplicate detection using sample non-duplicates

2006 , Lehti, P. , Fankhauser, P.

The problem of identifying objects in databases that refer to the same real-world entity is known, among others, as duplicate detection or record linkage. Objects may be duplicates even though they are not identical, due to errors and missing data. Typical current methods require deep understanding of the application domain or a good representative training set, which entails significant costs. In this paper we present an unsupervised, domain-independent approach to duplicate detection that starts with a broad alignment of potential duplicates, and analyses the distribution of observed similarity values among these potential duplicates and among representative sample non-duplicates to improve the initial alignment. Additionally, the presented approach is not only able to align flat records, but also makes use of related objects, which may significantly increase the alignment accuracy. Evaluations show that our approach outperforms other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain-dependent approaches.

Publication

Probabilistic iterative duplicate detection

2005 , Lehti, P. , Fankhauser, P.

The problem of identifying approximately duplicate records between databases is known, among others, as duplicate detection or record linkage. To this end, typically either rules or a weighted aggregation of distances between the individual attributes of potential duplicates is used. However, choosing the appropriate rules, distance functions, weights, and thresholds requires deep understanding of the application domain or a good representative training set for supervised learning approaches. In this paper we present an unsupervised, domain-independent approach that starts with a broad alignment of potential duplicates, and analyses the distribution of observed distances among potential duplicates and among non-duplicates to iteratively refine the initial alignment. Evaluations show that this approach outperforms other unsupervised approaches and reaches almost the same accuracy as even fully supervised, domain-dependent approaches.

Publication

Language models & topic models for personalizing tag recommendation

2010 , Krestel, R. , Fankhauser, P.

More and more content on the Web is generated by users. To organize this information and make it accessible via current search technology, tagging systems have gained tremendous popularity. Especially for multimedia content, they allow users to annotate resources with keywords (tags), which opens the door for classic text-based information retrieval. To support the user in choosing the right keywords, tag recommendation algorithms have emerged. In this setting, not only the content is decisive for recommending relevant tags but also the user's preferences. In this paper we introduce an approach to personalized tag recommendation that combines a probabilistic model of tags from the resource with tags from the user. As models we investigate simple language models as well as Latent Dirichlet Allocation. Extensive experiments on a real-world dataset crawled from a big tagging system show that personalization improves tag recommendation, and our approach significantly outperforms state-of-the-art approaches.
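One simple way to picture combining a resource's tag model with a user's tag model is linear interpolation of two unigram tag distributions. The sketch below assumes exactly that: the mixing weight `lam`, the function names `tag_model` and `recommend`, and the sample tags are all hypothetical, and the abstract does not state that the paper combines its models this way. The Latent Dirichlet Allocation variant is not covered.

```python
from collections import Counter


def tag_model(tags):
    """Maximum-likelihood unigram model: relative frequency of each tag."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


def recommend(resource_tags, user_tags, lam=0.5, k=3):
    """Rank tags by lam * P(tag | resource) + (1 - lam) * P(tag | user)."""
    p_res = tag_model(resource_tags)
    p_user = tag_model(user_tags)
    scores = {
        tag: lam * p_res.get(tag, 0.0) + (1 - lam) * p_user.get(tag, 0.0)
        for tag in set(p_res) | set(p_user)
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda tag: (-scores[tag], tag))[:k]


# Hypothetical tag histories for one resource and one user.
resource_tags = ["python", "python", "tutorial", "web"]
user_tags = ["python", "ml", "ml"]
top = recommend(resource_tags, user_tags, lam=0.5, k=2)
```

With `lam=1.0` the ranking ignores the user entirely, and with `lam=0.0` it ignores the resource, so the weight controls how strongly the recommendation is personalized.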

Publication

Lies and propaganda

2007 , Mehta, B. , Hofmann, T. , Fankhauser, P.

Publication

A grammar-based index for matching business processes

2005 , Mahleko, B. , Wombacher, A. , Fankhauser, P.

Publication

DivQ: Diversification for keyword search over structured databases

2010 , Demidova, E. , Fankhauser, P. , Zhou, X. , Nejdl, W.

Keyword queries over structured databases are notoriously ambiguous. No single interpretation of a keyword query can satisfy all users, and multiple interpretations may yield overlapping results. This paper proposes a scheme to balance the relevance and novelty of keyword search results over structured databases. Firstly, we present a probabilistic model which effectively ranks the possible interpretations of a keyword query over structured data. Then, we introduce a scheme to diversify the search results by re-ranking query interpretations, taking into account redundancy of query results. Finally, we propose α-nDCG-W and WS-recall, adaptations of the α-nDCG and S-recall metrics that take into account graded relevance of subtopics. Our evaluation on two real-world datasets demonstrates that search results obtained using the proposed diversification algorithms better characterize possible answers available in the database than the results of the initial relevance ranking.
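For readers unfamiliar with the baseline metric, the sketch below computes the standard, unnormalized α-DCG on which α-nDCG is built: each result earns gain for every subtopic it covers, and that gain is multiplied by (1 - α) for each earlier result that already covered the same subtopic. This is a sketch of the standard metric only, assuming binary subtopic relevance; it does not implement the graded α-nDCG-W or WS-recall variants proposed in the paper, and the function name `alpha_dcg` and the sample rankings are hypothetical. Normalization by an ideal ranking is omitted.

```python
import math


def alpha_dcg(ranking, alpha=0.5):
    """Unnormalized alpha-DCG for a ranked list of subtopic sets.

    ranking: list where element k is the set of subtopics covered by
             the result at rank k + 1 (binary relevance assumed).
    alpha:   redundancy penalty in [0, 1]; 0 means no penalty.
    """
    times_seen = {}
    score = 0.0
    for rank, subtopics in enumerate(ranking, start=1):
        # Repeated subtopics contribute geometrically less gain.
        gain = sum((1 - alpha) ** times_seen.get(s, 0) for s in subtopics)
        score += gain / math.log2(rank + 1)
        for s in subtopics:
            times_seen[s] = times_seen.get(s, 0) + 1
    return score


# Same three results in two orders: the second surfaces a new subtopic earlier.
redundant_first = [{"x"}, {"x"}, {"y"}]
diverse_first = [{"x"}, {"y"}, {"x"}]
score_redundant = alpha_dcg(redundant_first)
score_diverse = alpha_dcg(diverse_first)
```

Here `score_diverse` exceeds `score_redundant`, which is the behaviour a diversification-aware metric is meant to reward.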

Publication

Cross system personalization by factor analysis

2006 , Mehta, B. , Hofmann, T. , Fankhauser, P.

Today, personalization in information systems occurs separately within each system that one interacts with. However, there are several potential improvements w.r.t. such isolated approaches. Thus, investments of users in personalizing a system, either through explicit provision of information, or through long and regular use are not transferable to other systems. Moreover, users have little or no control over their profile, since it is deeply buried in personalization engines. Cross-system personalization, i.e. personalization that shares personal information across different systems in a user-centric way, overcomes these problems. User profiles, which are originally scattered across multiple systems, are combined to obtain maximum leverage. This paper discusses an approach in support of cross-system personalization, where a large number of users cross from one system to another, carrying their user profiles with them. These sets of corresponding profiles can be used to learn a mapping between the user profiles of the two systems. In this work, we present and evaluate the use of factor analysis for the purpose of computing recommendations for a new user crossing over from one system to another.

Publication

Statistical relationship determination in automatic thesaurus construction

2005 , Chen, L. , Fankhauser, P. , Thiel, U. , Kamps, T.