2012
Conference Paper
Title
Encoplot - tuned for high recall (also proposing a new plagiarism detection score)
Abstract
This article describes the latest changes to our plagiarism detection system, Encoplot. We submitted the modified system to the PAN@CLEF 2012 automatic plagiarism detection challenge, where it ranked 2nd by F-measure and 3rd by the "plagdet" score, which we had previously shown to be flawed to some extent. The main changes concern the heuristic that tries to recognize clusters of N-gram matches as matching passages in the pair of documents examined. We aimed for high recall under difficult conditions (sparse matches), which are typical of real-life rephrasing by people. The evaluation on the PAN 2012 training and test corpora shows that we achieved our goal of improving the performance of this component of the Encoplot plagiarism detection system. In the final part of this article we analyze the anomalies of the plagdet scoring method, show that they are not negligible, and propose a modified version of plagdet that reduces them.
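For readers unfamiliar with the two rankings mentioned above, the following is a minimal sketch of the plagdet score as it is commonly defined for the PAN evaluations: the F-measure divided by the base-2 logarithm of one plus the granularity. The function and parameter names are illustrative assumptions, and the sketch does not reproduce the modified score proposed in the paper, which the abstract does not specify.

```python
import math


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Commonly used PAN plagdet score: F1 / log2(1 + granularity).

    Granularity is the average number of detections reported per
    detected plagiarism case; it is at least 1 by definition.
    Names here are hypothetical, chosen for illustration only.
    """
    if granularity < 1:
        raise ValueError("granularity must be >= 1")
    return f_measure(precision, recall) / math.log2(1 + granularity)
```

With a granularity of 1 the score equals the F-measure; any fragmentation of detections (granularity above 1) lowers it, which is why a system can rank differently under the two measures.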