2024
Conference Paper
Title
Audio Spectrogram Transformer for Synthetic Speech Detection via Speech Formant Analysis
Abstract
In this paper, we address the challenge of synthetic speech detection, which has become increasingly important due to recent advances in text-to-speech and voice conversion technologies. We propose a novel multi-task neural network architecture, designed to be interpretable and specifically tailored for audio signals. The architecture includes a feature bottleneck used to autoencode the input spectrogram, predict the fundamental frequency (f0) trajectory, and classify the speech as synthetic or natural. The synthesis detection can thus be considered a byproduct of attending to the energy distribution among vocal formants, providing a clear understanding of which characteristics of the input signal influence the final outcome. Our evaluation on the ASVspoof 2019 LA partition indicates better performance than the current state of the art, with an AUC score of 0.900.
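The abstract describes a shared bottleneck feeding three task heads: spectrogram reconstruction, f0 regression, and a synthetic/natural classifier. The following is a minimal NumPy sketch of that multi-task forward pass; all dimensions, weight names, and the use of simple linear layers are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 128 spectrogram bins,
# 100 time frames, and a 16-dim bottleneck shared by all three tasks.
N_BINS, N_FRAMES, BOTTLENECK = 128, 100, 16

# Random weights stand in for trained parameters.
W_enc = rng.standard_normal((BOTTLENECK, N_BINS)) * 0.01   # encoder
W_dec = rng.standard_normal((N_BINS, BOTTLENECK)) * 0.01   # spectrogram decoder
W_f0  = rng.standard_normal((1, BOTTLENECK)) * 0.01        # f0 regression head
W_cls = rng.standard_normal((1, BOTTLENECK)) * 0.01        # synthetic/natural head

def forward(spec):
    """Multi-task forward pass over a (N_BINS, N_FRAMES) spectrogram."""
    z = np.tanh(W_enc @ spec)                   # shared bottleneck features
    recon = W_dec @ z                           # autoencoded spectrogram
    f0 = (W_f0 @ z).ravel()                     # per-frame f0 trajectory
    logit = float((W_cls @ z).mean())           # time-pooled classification logit
    p_synthetic = 1.0 / (1.0 + np.exp(-logit))  # sigmoid probability
    return recon, f0, p_synthetic

spec = rng.standard_normal((N_BINS, N_FRAMES))
recon, f0, p = forward(spec)
print(recon.shape, f0.shape)  # (128, 100) (100,)
```

At training time, each head would contribute its own loss term (e.g. reconstruction error, f0 regression error, and cross-entropy for the classifier), so the bottleneck is pushed to encode formant-relevant structure rather than only discriminative shortcuts.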