Personalized Speech Synthesis for Zero-Shot Keyword Spotting

Gökgöz, Fahrettin; Cornaggia-Urrigshardt, Alessia; Wilkinghoff, Kevin

2025

Conference Paper

Abstract

Keyword spotting (KWS) systems can usually detect only the keywords they were trained on. Moreover, each keyword requires a sufficiently large number of spoken samples, which may be impractical, particularly in dynamic applications. In this work, we present a methodology for generating synthetic speech samples to enhance KWS models, specifically to adapt them to unseen words associated with known speakers. We first refine the SoundStream neural encoder to achieve high-quality encoding and decoding of the target speaker's voice. Subsequently, we adapt the SpearTTS model to create phonetically diverse sentences through a use-case generator module. The generated sentences are then strongly labeled to capture individual words. In experiments, we train a template-based KWS model on this synthetic dataset and evaluate it on real-world data. Our findings demonstrate the efficacy of synthetic data in improving KWS adaptability to new vocabularies.
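
As a rough illustration of what a template-based KWS decision can look like, the sketch below compares a query feature sequence against stored keyword templates using dynamic time warping (DTW) and accepts the closest template if its cost falls under a threshold. The abstract does not state which matching method, features, or threshold the authors use, so DTW, the Euclidean frame distance, and the names `dtw_distance`, `spot_keyword`, and `threshold` are all assumptions made for this example. In the setting described above, the templates would come from word segments cut out of the synthetic sentences.

```python
import math


def dtw_distance(a, b):
    """Length-normalized DTW cost between two sequences of feature vectors.

    Assumption: frames are compared with Euclidean distance; the paper's
    actual features and distance measure are not given in the abstract.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def spot_keyword(query, templates, threshold):
    """Return the label of the closest template, or None if none is close enough.

    templates: list of (label, feature_sequence) pairs (hypothetical layout).
    threshold: maximum normalized DTW cost for a detection (hypothetical value).
    """
    best_label, best_cost = None, float("inf")
    for label, template in templates:
        c = dtw_distance(query, template)
        if c < best_cost:
            best_label, best_cost = label, c
    return best_label if best_cost <= threshold else None
```

With toy one-dimensional features, a time-stretched copy of a template aligns to it at zero cost, while a query far from every template is rejected by the threshold. Real systems would use frame-level acoustic features and a threshold tuned on held-out data.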

Author(s)

Gökgöz, Fahrettin

Fraunhofer-Institut für Kommunikation, Informationsverarbeitung und Ergonomie FKIE

Cornaggia-Urrigshardt, Alessia

Fraunhofer-Institut für Kommunikation, Informationsverarbeitung und Ergonomie FKIE

Wilkinghoff, Kevin

Aalborg University

Mainwork

Speech Communication. 16th ITG Conference 2025

Conference

Conference on Speech Communication 2025
