2025
Conference Paper
Title
Real-Time Single-Channel Speaker-Conditioned Target Speaker Extraction Using TCN-Conformer with Efficient Self-Attention Mechanisms
Abstract
Speaker-conditioned target speaker extraction systems aim to extract the target speaker from a mixture of speakers by utilizing auxiliary information about the target speaker. Typically, such systems consist of a speaker embedder network and a speaker separator network. While self-attention mechanisms have demonstrated remarkable performance in speech processing tasks, including target speaker extraction, their high memory usage and computational complexity pose challenges for real-time applications. To address these limitations, we integrate a linear self-attention mechanism into the separator network, significantly reducing memory and computational costs, and thereby making the system more suitable for real-time applications. Furthermore, we evaluate the performance of this linear self-attention-based speaker extraction system against a system using memory-efficient self-attention. Experimental results on two-speaker, three-speaker, and noisy two-speaker mixtures show that linear self-attention not only improves speaker extraction performance compared to both traditional and memory-efficient self-attention but also significantly reduces the real-time factor and computational cost.
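To illustrate why a linear self-attention mechanism can reduce memory and computational cost, the following is a minimal sketch of one common kernelized formulation. It is an assumption for illustration only: the abstract does not state which linear attention variant the paper uses, and the `elu(x) + 1` feature map and the function names (`feature_map`, `linear_attention`, `quadratic_attention`) are hypothetical choices made here. The sketch accumulates two fixed-size sums over the keys and values, so the cost grows linearly with the sequence length T rather than with T squared.

```python
import math


def feature_map(x):
    """elu(x) + 1, a positive feature map (assumed here, not taken from the paper)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(queries, keys, values, eps=1e-6):
    """Non-causal kernelized attention; time is linear in the sequence length."""
    d, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d)]  # sum over time of phi(k) v^T, size d x d_v
    z = [0.0] * d                         # sum over time of phi(k), size d
    for k, v in zip(keys, values):
        fk = [feature_map(x) for x in k]
        for i in range(d):
            z[i] += fk[i]
            for j in range(d_v):
                kv[i][j] += fk[i] * v[j]
    out = []
    for q in queries:
        fq = [feature_map(x) for x in q]
        norm = sum(fq[i] * z[i] for i in range(d)) + eps
        out.append([sum(fq[i] * kv[i][j] for i in range(d)) / norm
                    for j in range(d_v)])
    return out


def quadratic_attention(queries, keys, values, eps=1e-6):
    """Same kernel, but with explicit T x T weights, shown only for comparison."""
    d_v = len(values[0])
    out = []
    for q in queries:
        fq = [feature_map(x) for x in q]
        # One weight per (query, key) pair: this is the quadratic part.
        w = [sum(a * feature_map(b) for a, b in zip(fq, k)) for k in keys]
        norm = sum(w) + eps
        out.append([sum(wt * v[j] for wt, v in zip(w, values)) / norm
                    for j in range(d_v)])
    return out
```

Both functions compute the same output up to floating-point rounding. The difference is that `linear_attention` never forms the T x T weight matrix, which is the property that makes this family of mechanisms attractive for real-time use. Note that this kernelized form is not numerically equal to conventional softmax attention; it replaces the softmax similarity with a different one.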
Conference