2024
Journal Article
Title
Audio-visual speech synthesis using vision transformer-enhanced autoencoders with ensemble of loss functions
Abstract
Audio-visual speech synthesis (AVSS) has garnered attention in recent years for its utility in audio-visual learning. AVSS transforms one speaker’s speech into another speaker’s audio-visual stream while retaining the linguistic content. As in existing AVSS methods, the task proceeds in two stages: vocal features are first mapped from the source to the target speaker, akin to voice conversion (VC), and the audio-visual stream is then synthesized for the target speaker, termed audio-visual synthesis (AVS). In this work, a novel AVSS approach is proposed using vision transformer (ViT)-based autoencoders (AEs), trained with an ensemble of cycle-consistency and reconstruction loss functions to enhance synthesis quality. Leveraging ViT’s attention mechanism, the method effectively captures spectral and temporal features from the input speech. The combination of cycle-consistency and reconstruction losses improves synthesis quality and helps preserve essential information. The proposed framework is trained and tested on benchmark datasets and compared extensively with state-of-the-art (SOTA) methods. The experimental results demonstrate the superiority of the proposed approach over existing SOTA models in terms of quality and intelligibility for AVSS, indicating its potential for real-world applications.
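The abstract names the two loss terms but not their exact formulation or weighting. The sketch below is a minimal, hypothetical illustration of how a cycle-consistency term and a reconstruction term are commonly combined for source-to-target conversion; the function names (enc, dec_src2tgt, dec_tgt2src), the L1 distance, and the weighting factor lambda_cyc are all assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of an ensemble of cycle-consistency and reconstruction
# losses, as commonly used in conversion models. Not the paper's exact method.
import torch
import torch.nn.functional as F

def avss_loss(x_src, x_tgt, enc, dec_src2tgt, dec_tgt2src, lambda_cyc=10.0):
    """Combined reconstruction + cycle-consistency objective (assumed form)."""
    # Forward conversion: source speech features -> target-style features.
    x_fake_tgt = dec_src2tgt(enc(x_src))

    # Reconstruction loss: decoding a real target example from its own latent
    # code should reproduce it, preserving essential content.
    x_rec_tgt = dec_src2tgt(enc(x_tgt))
    loss_rec = F.l1_loss(x_rec_tgt, x_tgt)

    # Cycle-consistency loss: mapping source -> target -> source should
    # return (approximately) the original source features.
    x_cyc_src = dec_tgt2src(enc(x_fake_tgt))
    loss_cyc = F.l1_loss(x_cyc_src, x_src)

    # Weighted sum; lambda_cyc is an assumed hyperparameter.
    return loss_rec + lambda_cyc * loss_cyc
```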
Author(s)