Authors: Ghosh, Subhayu; Sarkar, Snehashis; Ghosh, Sovan; Zalkow, Frank; Jana, Nanda Dulal
Date: 2024-03-27
Year: 2024
Handle: https://publica.fraunhofer.de/handle/publica/464537
DOI: 10.1007/s10489-024-05380-7

Abstract: Audio-visual speech synthesis (AVSS) has garnered attention in recent years for its utility in the realm of audio-visual learning. AVSS transforms one speaker's speech into another's audio-visual stream while retaining linguistic content. This approach extends existing AVSS methods by first modifying vocal features from the source to the target speaker, akin to voice conversion (VC), and then synthesizing the audio-visual stream for the target speaker, termed audio-visual synthesis (AVS). In this work, a novel AVSS approach is proposed using vision transformer (ViT)-based autoencoders (AEs), enriched with a combination of cycle consistency and reconstruction loss functions, with the aim of enhancing synthesis quality. Leveraging ViT's attention mechanism, this method effectively captures spectral and temporal features from input speech. The combination of cycle consistency and reconstruction loss improves synthesis quality and aids in preserving essential information. The proposed framework is trained and tested on benchmark datasets, and compared extensively with state-of-the-art (SOTA) methods. The experimental results demonstrate the superiority of the proposed approach over existing SOTA models in terms of quality and intelligibility for AVSS, indicating the potential for real-world applications.

Language: en
Title: Audio-visual speech synthesis using vision transformer-enhanced autoencoders with ensemble of loss functions
Type: journal article
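As an illustration of the loss combination mentioned in the abstract, the sketch below shows how a cycle-consistency term and a reconstruction term could be summed for two autoencoder mappings between a source and a target speaker. This is not the authors' implementation: the mappings G and F, the use of an L1 distance, and the weights lambda_cyc and lambda_rec are assumptions chosen for the example.

```python
# Illustrative sketch (assumed, not the paper's code) of combining a
# cycle-consistency loss with a reconstruction loss for two mappings
# between speaker domains, e.g. ViT-based autoencoders operating on
# speech spectrogram tensors: G maps source -> target, F maps target -> source.
import torch
import torch.nn as nn

l1 = nn.L1Loss()

def combined_loss(G, F, x_src, x_tgt, lambda_cyc=10.0, lambda_rec=5.0):
    """Cycle-consistency + reconstruction objective (weights are illustrative)."""
    # Cross-domain mappings.
    fake_tgt = G(x_src)          # source speaker -> target speaker
    fake_src = F(x_tgt)          # target speaker -> source speaker

    # Cycle consistency: mapping forth and back should recover the input.
    cyc_src = F(fake_tgt)        # source -> target -> source
    cyc_tgt = G(fake_src)        # target -> source -> target
    loss_cyc = l1(cyc_src, x_src) + l1(cyc_tgt, x_tgt)

    # Reconstruction term (one plausible reading): each autoencoder should
    # leave inputs from its own output domain largely unchanged.
    loss_rec = l1(G(x_tgt), x_tgt) + l1(F(x_src), x_src)

    return lambda_cyc * loss_cyc + lambda_rec * loss_rec
```

In a full AVSS pipeline, a scalar of this kind would typically be added to the model's other training objectives and back-propagated through both autoencoders.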