Here you will find scientific publications from the Fraunhofer Institutes.

Authorship verification via k-nearest neighbor estimation. Notebook for PAN at CLEF 2013

Halvani, O.; Steinebach, M.; Zimmermann, R.


Forner, P.:
CLEF 2013. Working notes. Online resource : Valencia, Spain, September 23-26, 2013
La Clusaz: CEUR, 2013 (CEUR Workshop Proceedings 1179)
ISSN: 1613-0073
URN: urn:nbn:de:0074-1179-2
CLEF Initiative (International Conference) <4, 2013, Valencia>
Conference Paper, Electronic Publication
Fraunhofer SIT

In this paper we describe our k-Nearest Neighbor (k-NN) based Authorship Verification method for the Author Identification (AI) task of the PAN 2013 challenge. The method follows an ensemble classification technique based on the combination of suitable feature categories. For each chosen feature category we apply a k-NN classifier to calculate a style deviation score between the training documents of the true author A and a document whose author claims to be A. Depending on the score and a given threshold, a decision for or against the alleged author is generated and stored in a list. Afterwards, the final decision regarding the alleged authorship is determined through a majority vote among all decisions within this list. The method provides a number of benefits, for instance its independence of linguistic resources such as ontologies, thesauri or even language models. A further benefit is its language independence across different Indo-European languages, as the approach is applicable to languages such as Spanish, English, Greek or German. Another benefit is the low runtime of the method, since there is no need for deep linguistic processing such as POS tagging, chunking or parsing. Moreover, the method can be extended or modified, for instance by replacing the classification function, the threshold or the underlying features including their parameters (e.g. n-gram sizes or the number of feature frequencies). On the PAN 2013 AI training corpus we achieved an overall accuracy of 80%; we also evaluated the algorithm on our own dataset, where it reached an accuracy of 77.50%.
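The pipeline described in the abstract (one deviation score per feature category, a threshold decision per category, then a majority vote) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the features, the distance function, or the parameter values. The sketch assumes character n-gram frequency profiles as the feature categories, Manhattan distance, the mean distance to the k nearest known documents as the deviation score, and a threshold of 1.0. All function names and defaults are hypothetical.

```python
from collections import Counter


def char_ngram_profile(text, n):
    """Relative frequencies of character n-grams (assumed feature category)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}


def manhattan(p, q):
    """Manhattan distance between two sparse frequency profiles (assumed metric)."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def deviation_score(known_docs, unknown_doc, n, k):
    """Mean distance from the unknown document to its k nearest known documents."""
    if not known_docs:
        raise ValueError("at least one known document of author A is required")
    unknown = char_ngram_profile(unknown_doc, n)
    distances = sorted(
        manhattan(char_ngram_profile(doc, n), unknown) for doc in known_docs
    )
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


def verify(known_docs, unknown_doc, ngram_sizes=(2, 3, 4), k=1, threshold=1.0):
    """Collect one accept/reject decision per feature category, then take a majority vote."""
    decisions = [
        deviation_score(known_docs, unknown_doc, n, k) <= threshold
        for n in ngram_sizes
    ]
    accepted = sum(decisions) > len(decisions) / 2
    return accepted, decisions


known = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps while the quick fox runs",
]
accepted, decisions = verify(known, "the quick fox jumps while the lazy dog sleeps")
print(accepted, decisions)
```

Using an odd number of feature categories avoids ties in the vote. Swapping the distance function, the threshold, or the n-gram sizes only requires changing the arguments or replacing `manhattan`, which reflects the extensibility the abstract mentions.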