Audio Transformer for Synthetic Speech Detection via Formant Magnitude and Phase Analysis

Authors: Cuccovillo, Luca; Gerhardt, Milica; Aichroth, Patrick
Date: 2024-05-08
Year: 2024
URL: https://publica.fraunhofer.de/handle/publica/467719
DOI: 10.1109/ICASSP48485.2024.10445932
Scopus ID: 2-s2.0-85195419268
Language: en
Type: conference paper
Keywords: Media Forensics; Signal processing; Transformers; Multitasking; Acoustics; Trajectory; Speech synthesis; synthetic speech detection; audio deepfakes; audio transformer; voice formants

Abstract: This paper introduces a novel multi-task transformer for synthetic speech detection. The network encodes magnitude and phase of the input speech with a feature bottleneck, used to autoencode the input magnitude, to predict the trajectory of the fundamental frequency (f0), and to discern if the input speech is synthetic or natural. The approach achieves state-of-the-art performance on the ASVspoof 2019 LA dataset while still retaining interpretability, with an AUC score of 0.910.
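
The abstract names three tasks that share one feature bottleneck (magnitude autoencoding, f0 trajectory prediction, and synthetic/natural classification) but does not say how they are combined during training. As a minimal sketch only, one common way to train such a multi-task model is a weighted sum of per-task losses. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the choice of mean squared error for the two regression tasks, binary cross-entropy for the classifier, and the default weights.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(prob_synthetic, is_synthetic, eps=1e-12):
    """Cross-entropy for one utterance; prob_synthetic is the predicted P(synthetic)."""
    p = min(max(prob_synthetic, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if is_synthetic else -math.log(1.0 - p)


def multitask_loss(mag_pred, mag_true, f0_pred, f0_true,
                   prob_synthetic, is_synthetic, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combined objective: weighted sum of the three task losses.

    The weights and the individual loss choices are assumptions of this
    sketch, not values reported by the authors.
    """
    w_rec, w_f0, w_cls = weights
    return (w_rec * mse(mag_pred, mag_true)          # magnitude autoencoding
            + w_f0 * mse(f0_pred, f0_true)           # f0 trajectory prediction
            + w_cls * binary_cross_entropy(prob_synthetic, is_synthetic))
```

With perfect reconstruction and f0 prediction, only the classification term remains, so an uncertain classifier output of 0.5 yields a loss of ln 2 under these assumptions.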