2025
Conference Paper
Title
Discrete Audio Representations from SoundStream: A Dual Approach to Efficient Transmission and Speech Detection
Abstract
This paper investigates the application of SoundStream, a state-of-the-art neural audio codec, to efficient audio transmission and speech detection in resource-constrained environments. We analyze SoundStream's architecture, emphasizing its use of Residual Vector Quantization (RVQ) to create compact, discrete audio representations while preserving essential audio features. We also introduce a novel voice activity detection (VAD) algorithm designed to identify relevant speech segments within the transmitted audio. Our evaluation employs objective metrics: Deep Noise Suppression Mean Opinion Score (DNSMOS), Non-Intrusive Speech Quality Assessment (NISQA), Short-Time Objective Intelligibility (STOI), and the density distribution of speech over codebooks. We assess the performance of the VAD using the F-measure. The results demonstrate that SoundStream maintains high audio fidelity and intelligibility across varying encoding stages, while the VAD algorithm detects speech effectively. This study highlights the potential of these methods to enhance audio processing in diverse applications, particularly where bandwidth and clarity are critical. This paper was originally presented at the NATO Science and Technology Organization Symposium (ICMCIS) organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-209-RSY, held in Oeiras, Portugal, 13-14 May 2025.
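The residual vector quantization named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation or SoundStream's learned quantizer: the two tiny codebooks and the names `nearest`, `rvq_encode`, `rvq_decode`, and `CODEBOOKS` are hypothetical, chosen only to show the general technique. Each stage quantizes the residual left by the previous stage, so a vector is represented by one small index per stage.

```python
from typing import List, Sequence

Vector = List[float]
Codebook = List[Vector]

# Hypothetical toy codebooks (2-D, four codewords each). A real codec
# learns its codebooks from data; these are hand-picked for illustration.
CODEBOOKS: List[Codebook] = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]],      # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],   # fine stage
]


def nearest(codebook: Codebook, x: Sequence[float]) -> int:
    """Return the index of the codeword closest to x (squared Euclidean)."""
    def sq_dist(c: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(x: Sequence[float], codebooks: List[Codebook]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    indices = []
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        # Subtract the chosen codeword so the next stage sees what is left.
        residual = [r - c for r, c in zip(residual, cb[i])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: List[Codebook]) -> Vector:
    """Sum the selected codewords; fewer indices give a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for i, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[i])]
    return out


codes = rvq_encode([1.2, 0.9], CODEBOOKS)
full = rvq_decode(codes, CODEBOOKS)        # reconstruction using both stages
coarse = rvq_decode(codes[:1], CODEBOOKS)  # reconstruction using stage 1 only
```

Decoding from only the first indices, as in `coarse`, shows why the number of stages trades bitrate against reconstruction error: each dropped stage saves one index per frame at the cost of a coarser approximation.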