2025
Journal Article
Title
Enhanced Audio-Visual Speech Synthesis Via Multi-Discriminative Learning
Abstract
Audio-visual speech synthesis (AVSS) aims to produce an audio-visual stream that conveys a target speaker's speech. In this study, the AVSS system takes the input speech of a source speaker and generates the audio-visual stream of the target speaker while preserving the linguistic content of the source speech. The process involves two main components: voice conversion (VC), which adapts the vocal features from the source to the target speaker, and audio-visual synthesis (AVS), which generates the synchronized audio-visual stream from the transformed speech. This paper presents a novel generative framework based on multi-discriminative learning to enhance the realism and quality of AVSS outputs. The proposed approach integrates multiple discriminators, including capsule networks, co-occurrence neural networks, and vision transformers (ViTs), within the VC model to leverage their unique strengths in capturing diverse speech features. Additionally, the AVS model incorporates a co-occurrence neural network to improve video quality and achieve better temporal alignment between audio and visual data. Experimental evaluations on standard benchmarks demonstrate that the proposed method achieves significant improvements in both audio and video quality, offering a substantial advancement in AVSS technology.
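To make the multi-discriminative idea concrete, the sketch below shows one way a generator (here, a voice-conversion model) can be trained against several discriminators whose adversarial losses are summed. This is an illustrative assumption only: the tiny generator and discriminator modules are placeholders, not the paper's capsule network, co-occurrence network, or ViT discriminators, and the least-squares GAN objective is assumed rather than taken from the paper.

# Minimal sketch of multi-discriminative adversarial training (illustrative only).
# TinyDiscriminator and TinyGenerator are stand-ins; the paper's capsule,
# co-occurrence, and ViT discriminators are assumed, not reproduced here.
import torch
import torch.nn as nn

class TinyDiscriminator(nn.Module):
    """Placeholder discriminator over speech features."""
    def __init__(self, feat_dim: int = 80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x)

class TinyGenerator(nn.Module):
    """Placeholder voice-conversion generator: source features -> target-style features."""
    def __init__(self, feat_dim: int = 80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

    def forward(self, x):
        return self.net(x)

def generator_loss(fake_feats, discriminators):
    """Sum least-squares adversarial losses from all discriminators."""
    loss = torch.zeros(())
    for d in discriminators:
        loss = loss + torch.mean((d(fake_feats) - 1.0) ** 2)
    return loss

if __name__ == "__main__":
    gen = TinyGenerator()
    discs = [TinyDiscriminator() for _ in range(3)]  # e.g. capsule, co-occurrence, ViT
    src = torch.randn(4, 80)                          # batch of source-speaker features
    fake = gen(src)                                   # converted (target-style) features
    print(generator_loss(fake, discs))

In this arrangement each discriminator contributes an independent gradient signal, which is the intuition behind combining architecturally diverse discriminators: each one penalizes a different class of artifacts in the converted speech.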
Author(s)