2025
Conference Paper
Title
Personalized Speech Synthesis for Zero-Shot Keyword Spotting
Abstract
Keyword spotting (KWS) systems can usually detect only the specific keywords they were trained on. Moreover, each keyword requires a sufficiently large number of spoken samples, which may be impractical, particularly in dynamic applications. In this work, we present a methodology for generating synthetic speech samples to enhance KWS models, specifically to adapt them to unseen words associated with known speakers. We first refine the SoundStream neural encoder to achieve high-quality encoding and decoding of the target speaker's voice. Subsequently, we adapt the SpearTTS model to create phonetically diverse sentences through a use-case generator module. The generated sentences are then strongly labeled to capture individual words. In experiments, we trained a template-based KWS model on this synthetic dataset and evaluated its performance on real-world data. Our findings demonstrate the efficacy of synthetic data in improving KWS adaptability to new vocabularies.
Conference