Machine Learning-Based Detection of AI-Generated Text via Stylistic and Statistical Feature Modeling

Schäfer, Karla; Steinebach, Martin

doi:10.1109/Trustcom66490.2025.00134

November 14, 2025

Conference Paper

Abstract

Advances in large language models (LLMs) make it easy to create AI-generated text. However, these tools can also pose a threat, e.g. through the creation of disinformation. In this work, we analysed texts from the CUDRT dataset generated by three LLMs: GPT-3.5, LLaMA3, and Qwen. We extracted 220 stylistic and statistical features from human-written and AI-generated text using the LFTK library. First, we analysed the features using the Pearson correlation. Second, we trained five machine learning models and tested the classifiers on detecting texts that were entirely AI-generated, polished, or rewritten by AI, as well as AI-created summaries. We obtained an F1-score of 90% or higher for text generated entirely by AI, depending on the LLM used. We found that AI-generated texts, independent of the LLM, can be identified through a high Kuperman age, i.e. high word complexity, whereas human-written texts show higher lexical variation and richness. We provide an explanation for the classification results and a comparison with a fine-tuned RoBERTa model.
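The two analysis steps named in the abstract, correlating features with the human/AI label and scoring a classifier with F1, can be illustrated with a minimal sketch. It is not the authors' pipeline: the paper extracts 220 features with LFTK and trains five models, whereas the code below uses only the Python standard library and two hand-written stand-in features. The function names `type_token_ratio`, `mean_word_length`, `pearson`, and `f1_score` are hypothetical helpers introduced for illustration.

```python
import math


def type_token_ratio(text):
    """Stand-in lexical-variation feature: unique tokens / total tokens.
    (Assumption: a crude proxy, not an LFTK feature.)"""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_word_length(text):
    """Stand-in word-complexity feature: average characters per token.
    (Assumption: a crude proxy, not the Kuperman age used in the paper.)"""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Convention chosen for this sketch: a constant column has no linear signal.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

In use, each feature would be computed per text, collected into a column, and passed to `pearson` together with the 0/1 labels (1 for AI-generated); features with a large absolute correlation are the ones most linearly associated with the label. `f1_score` would then be applied to the predictions of whichever classifier is trained on those features.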

Author(s)

Schäfer, Karla

Fraunhofer-Institut für Sichere Informationstechnologie SIT

Steinebach, Martin

Fraunhofer-Institut für Sichere Informationstechnologie SIT

Mainwork

IEEE 24th International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2025. Proceedings

Conference

International Conference on Trust, Security and Privacy in Computing and Communications 2025
