Real-Time Single-Channel Speaker-Conditioned Target Speaker Extraction Using TCN-Conformer with Efficient Self-Attention Mechanisms

Sinha, Ragini; Rollwage, Christian; Doclo, Simon

doi:10.23919/EUSIPCO63237.2025.11226272

2025

Conference Paper

Abstract

Speaker-conditioned target speaker extraction systems aim to extract the speech of a target speaker from a mixture of speakers by utilizing auxiliary information about the target speaker. Typically, such systems consist of a speaker embedder network and a speaker separator network. While self-attention mechanisms have demonstrated remarkable performance in speech processing tasks, including target speaker extraction, their high memory usage and computational complexity pose challenges for real-time applications. To address these limitations, we integrate a linear self-attention mechanism into the separator network, significantly reducing memory and computational costs and thereby making the system more suitable for real-time applications. Furthermore, we compare this linear self-attention-based speaker extraction system against a system using memory-efficient self-attention. Experimental results on two-speaker, three-speaker, and noisy two-speaker mixtures show that linear self-attention not only improves speaker extraction performance compared to both traditional and memory-efficient self-attention but also significantly reduces the real-time factor and computational cost.
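
Illustrative sketch (not from the paper): the abstract does not state which linear self-attention formulation is used, so the following assumes one common kernel-based variant in which the softmax is replaced by a positive feature map (here elu(x) + 1). All function names, shapes, and the feature map are assumptions chosen for illustration. The sketch shows why the cost can drop: standard attention builds a T x T weight matrix for T frames, whereas the kernel form first summarizes keys and values in a small d x d_v matrix and never forms the T x T matrix.

```python
import numpy as np


def feature_map(x):
    # Assumed positive feature map: elu(x) + 1 (x + 1 for x > 0, exp(x) otherwise).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def softmax_attention(q, k, v):
    # Standard scaled dot-product attention; materializes a (T, T) matrix.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_attention(q, k, v, eps=1e-6):
    # Kernel-based linear attention (non-causal); cost grows linearly with T.
    qf, kf = feature_map(q), feature_map(k)
    kv = kf.T @ v                      # (d, d_v) summary of keys and values
    norm = qf @ kf.sum(axis=0)         # (T,) per-query normalizer
    return (qf @ kv) / (norm[:, None] + eps)


def linear_attention_quadratic(q, k, v, eps=1e-6):
    # Same result as linear_attention, computed via the explicit (T, T) matrix.
    # Included only to check that the reordering of products is exact.
    qf, kf = feature_map(q), feature_map(k)
    weights = qf @ kf.T
    return (weights @ v) / (weights.sum(axis=-1, keepdims=True) + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, d_v = 200, 16, 16            # hypothetical frame count and dimensions
    q = rng.standard_normal((T, d))
    k = rng.standard_normal((T, d))
    v = rng.standard_normal((T, d_v))
    print(softmax_attention(q, k, v).shape)
    print(np.allclose(linear_attention(q, k, v),
                      linear_attention_quadratic(q, k, v)))
```

Note that the kernel form is a different attention function from softmax attention, not an approximation guaranteed to match it numerically; the two linear variants above agree with each other because they differ only in the order of matrix products. A streaming system would additionally need a causal formulation, which this sketch does not cover.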