Calibrating POI-based Synthetic Speech Detection

Le Roux, Thomas; Cuccovillo, Luca; Aichroth, Patrick

doi:10.1145/3733567.3735565

June 30, 2025

Conference Paper

Abstract

Recent advances in deep learning have yielded increasingly sophisticated speech generation systems (Text-To-Speech or Voice Conversion algorithms) capable of producing realistic synthetic speech material that is often indistinguishable from human voices. Although these technologies support a wide range of legitimate applications, they also facilitate malicious uses, including impersonation and misinformation, thereby posing significant societal threats. As a result, synthetic speech detection has emerged as an urgent research focus. Despite numerous proposed methods, a persistent generalization problem remains: detectors struggle to classify out-of-domain samples unseen during training, and therefore fail to adapt and remain consistent in real-world scenarios.
We tackle this limitation with a Person-of-Interest framework that exploits speaker-specific characteristics for Synthetic Speech Detection, thereby enhancing generalizability across diverse generators. Specifically, we introduce an ensemble approach that addresses a previously unstudied calibration problem: the system uses only recording-level statistics to self-calibrate, leveraging the abstraction capabilities of a large-scale, pre-trained audio model. Experiments demonstrate that our method achieves strong performance, high generalizability, and robustness across various datasets.

Author(s)

Le Roux, Thomas

Fraunhofer-Institut für Digitale Medientechnologie IDMT

Cuccovillo, Luca

Fraunhofer-Institut für Digitale Medientechnologie IDMT

Aichroth, Patrick

Fraunhofer-Institut für Digitale Medientechnologie IDMT

Mainwork

MAD 2025, 4th ACM International Workshop on Multimedia AI against Disinformation. Proceedings

Conference

International Workshop on Multimedia AI against Disinformation 2025
