The AudioLabs System for the Blizzard Challenge 2023

Zalkow, Frank; Sani, Paolo; Fast, Michael; Bauer, Judith; Joshaghani, Mohammad; Kayyar Lakshminarayana, Kishor; Habets, Emanuel; Dittmar, Christian

doi:10.21437/Blizzard.2023-8

2023

Conference Paper

Abstract

In this paper, we describe our contribution to the Blizzard Challenge 2023. The challenge aims to understand and compare research techniques for building corpus-based speech synthesizers on the same data. The 2023 edition focuses on the French language and low-resource settings. Our text-to-speech (TTS) synthesis system consists of three main building blocks. First, a non-autoregressive acoustic model converts symbolic input sequences (phonemes) into mel-scaled speech spectrograms. Second, a post-processing model based on a generative adversarial network (GAN) enhances the predicted mel spectrograms. Third, the GAN-based neural vocoder StyleMelGAN converts the enhanced mel spectrogram into a time-domain speech waveform.
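
The three-stage data flow described in the abstract can be sketched as a simple composition of functions. This is only an illustrative outline, not the authors' implementation: the stage functions are dummy stand-ins that return zeros, and the names `TTSPipeline`, `dummy_acoustic_model`, `dummy_enhancer`, and `dummy_vocoder`, as well as the values `N_MELS`, `HOP_LENGTH`, and `FRAMES_PER_PHONEME`, are assumptions chosen for the example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Mel = List[List[float]]   # mel spectrogram as frames x mel bins
Waveform = List[float]    # time-domain samples

# All three constants are assumed values for illustration only.
N_MELS = 80               # number of mel bins per frame
HOP_LENGTH = 256          # waveform samples generated per mel frame
FRAMES_PER_PHONEME = 5    # fixed placeholder duration per phoneme


def dummy_acoustic_model(phonemes: Sequence[str]) -> Mel:
    """Stand-in for stage 1: phonemes -> mel spectrogram.

    A non-autoregressive model emits all frames at once; here that is
    mimicked by allocating the whole output in a single step.
    """
    n_frames = len(phonemes) * FRAMES_PER_PHONEME
    return [[0.0] * N_MELS for _ in range(n_frames)]


def dummy_enhancer(mel: Mel) -> Mel:
    """Stand-in for stage 2: mel -> enhanced mel of the same shape."""
    return [list(frame) for frame in mel]


def dummy_vocoder(mel: Mel) -> Waveform:
    """Stand-in for stage 3: mel -> waveform, HOP_LENGTH samples per frame."""
    return [0.0] * (len(mel) * HOP_LENGTH)


@dataclass
class TTSPipeline:
    """Chains the three building blocks in the order given in the abstract."""
    acoustic_model: Callable[[Sequence[str]], Mel]
    enhancer: Callable[[Mel], Mel]
    vocoder: Callable[[Mel], Waveform]

    def synthesize(self, phonemes: Sequence[str]) -> Waveform:
        mel = self.acoustic_model(phonemes)   # stage 1
        enhanced = self.enhancer(mel)         # stage 2
        return self.vocoder(enhanced)         # stage 3


pipeline = TTSPipeline(dummy_acoustic_model, dummy_enhancer, dummy_vocoder)
# Illustrative phoneme symbols; the paper's actual symbol set is not given here.
waveform = pipeline.synthesize(["b", "o~", "Z", "u", "R"])
print(len(waveform))  # 5 phonemes * 5 frames * 256 samples = 6400
```

Keeping the stages as interchangeable callables mirrors the modular structure the abstract describes: the enhancer operates purely in the mel domain, so it could in principle be swapped or bypassed without touching the acoustic model or the vocoder.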