Fraunhofer-Gesellschaft

Publica

Here you will find scientific publications from the Fraunhofer Institutes.

Author clustering using compression-based dissimilarity scores: Notebook for PAN at CLEF 2017

 
Authors: Halvani, O.; Graner, L.


Cappellato, L.:
CLEF 2017, Conference and Labs of the Evaluation Forum. Working Notes. Online resource : Dublin, Ireland, September 11-14, 2017
Dublin, 2017 (CEUR Workshop Proceedings 1866)
http://ceur-ws.org/Vol-1866/
Paper 59, 11 pp.
Conference and Labs of the Evaluation Forum (CLEF) <2017, Dublin>
English
Conference Paper, Electronic Publication
Fraunhofer SIT

Abstract
The PAN 2017 Author Clustering task examines two application scenarios: complete author clustering and authorship-link ranking. In the first scenario, one must identify the number k of different authors within a document collection and assign each document to exactly one of the k clusters, where each cluster corresponds to a different author. In the second scenario, one must establish authorship links between documents in a cluster and provide a list of document pairs, ranked according to a confidence score. We present a simple scheme to handle both scenarios. To group the documents by author, we use k-Medoids, where the optimal k is determined by computing silhouettes. To determine links between the documents in each cluster, we apply a predefined compressor together with a dissimilarity measure. The resulting compression-based dissimilarity scores are then used to rank all document pairs. The proposed scheme requires no text preprocessing, feature engineering, or hyperparameter optimization, which are often necessary in author clustering and related fields. However, the achieved results indicate that there is room for improvement.
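
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract names neither the compressor nor the dissimilarity measure, so zlib and the normalized compression distance (NCD) are used here purely as stand-ins. The function names, the naive medoid seeding, the candidate range for k, and the conversion of a dissimilarity into a confidence score (1 minus the dissimilarity) are likewise assumptions made for illustration.

```python
import itertools
import zlib


def ncd(x: str, y: str) -> float:
    """Normalized compression distance; zlib is an assumed stand-in compressor."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = len(zlib.compress(bx, 9)), len(zlib.compress(by, 9))
    cxy = len(zlib.compress(bx + by, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def dissimilarity_matrix(docs):
    """Symmetric matrix of pairwise dissimilarities with a zero diagonal."""
    n = len(docs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        d[i][j] = d[j][i] = ncd(docs[i], docs[j])
    return d


def k_medoids(d, k, max_iter=100):
    """Alternating k-medoids on a precomputed dissimilarity matrix."""
    n = len(d)
    medoids = list(range(k))  # naive seeding (assumption): the first k documents

    def assign():
        # Each document joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign()
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            # New medoid: the member with the smallest total dissimilarity
            # to its cluster; an empty cluster keeps its old medoid.
            updated.append(
                min(members, key=lambda m: sum(d[m][j] for j in members))
                if members else medoids[c]
            )
        if updated == medoids:
            break
        medoids = updated
    return assign()


def mean_silhouette(d, labels):
    """Average silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            continue
        a = sum(d[i][j] for j in own) / (len(own) - 1)
        b = min(
            sum(d[i][j] for j in other) / len(other)
            for oc, other in clusters.items() if oc != c
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def cluster_documents(d):
    """Try k = 2 .. n-1 and keep the labeling with the best mean silhouette."""
    n = len(d)
    candidates = (k_medoids(d, k) for k in range(2, n))
    return max(candidates, key=lambda lab: mean_silhouette(d, lab), default=[0] * n)


def rank_links(d, labels):
    """Document pairs within a cluster, most confident (least dissimilar) first."""
    pairs = [
        (i, j, 1.0 - d[i][j])
        for i, j in itertools.combinations(range(len(d)), 2)
        if labels[i] == labels[j]
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Given a list of document strings `docs`, one would call `d = dissimilarity_matrix(docs)`, then `labels = cluster_documents(d)`, and finally `rank_links(d, labels)`. Because every step works on the precomputed matrix, swapping in a different compressor or dissimilarity measure only requires changing `ncd`.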

URL: http://publica.fraunhofer.de/documents/N-502815.html