March 4, 2025
Conference Paper
Title

Evaluation of Document Deduplication Algorithms for Large Text Corpora

Abstract
The performance of large language models (LLMs) is correlated with diverse and high-quality training data. One aspect of data quality is the presence of duplicate documents in the training data. In this paper, we evaluate five algorithms for deduplicating large data sets, namely MinHash/LSH, Exact Hashes, SimHash, Scalable Bloom filter, and Suffix Array. We report on their precision, recall, memory requirements, and runtime when deduplicating OpenSubtitles and OSCAR data for five languages (EN, DE, ES, FR, IT). We find that the best overall performance is achieved by MinHash/LSH, but other options such as Scalable Bloom filter can be more suitable in resource-critical situations. While precision varies between 0.833 and 0.985 across algorithms, recall varies between 0.247 and 0.989, indicating different levels of aggressiveness. We conclude that MinHash/LSH is the most suitable algorithm to deduplicate pretraining data for LLMs.
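To make the MinHash/LSH idea named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the implementation evaluated in the paper: the function names, the character 5-gram shingling, the 128 hash functions, and the 32-band split are all illustrative assumptions chosen here, and a production pipeline would typically rely on an optimized library instead.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of a lowercased, whitespace-normalized text.

    The shingle size k=5 is an illustrative assumption, not a value from the paper.
    """
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_perm=128):
    """Build a MinHash signature: one minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_perm):
        # Each seed acts as a different hash function via the BLAKE2b salt.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_band_keys(signature, bands=32):
    """Split a signature into bands; documents sharing any band key are candidate duplicates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


def is_candidate_pair(sig_a, sig_b, bands=32):
    """Return True if the two signatures collide in at least one LSH band."""
    return bool(lsh_band_keys(sig_a, bands) & lsh_band_keys(sig_b, bands))
```

In this sketch, two documents would first be compared only if `is_candidate_pair` returns True, and then kept or dropped based on `estimated_jaccard` against a chosen threshold. Fewer rows per band make the filter more aggressive (higher recall, lower precision), which is the kind of trade-off the reported precision and recall ranges reflect.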
Author(s)
Leveling, Johannes
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Helmer, Lennard
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Stein, Benny Jörg
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Wegener, Dennis
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Sheikh, Zoha
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Fernandes, Elanton
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Abdelwahab, Hammam
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Mainwork
Machine Learning, Optimization, and Data Science. 10th International Conference, LOD 2024. Pt.I  
Project(s)
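Aufbau eines Gaia-X Knotens für große KI-Sprachmodelle und innovative Sprachapplikations-Services (Establishing a Gaia-X node for large AI language models and innovative language application services)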
Funder
Bundesministerium für Wirtschaft und Klimaschutz (German Federal Ministry for Economic Affairs and Climate Action)
Conference
International Conference on Machine Learning, Optimization, and Data Science (LOD) 2024
DOI
10.1007/978-3-031-82481-4_27
Language
English
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Keyword(s)
  • Text Deduplication
  • Data Preprocessing
  • LLM Training Data Preparation
  • Large Text Corpora
  • MinHash/LSH
  • Bloom filter
  • Exact Hashes
  • SimHash
  • Suffix Array