Fraunhofer-Gesellschaft
2022
Conference Paper
Title

Speech Separation for an Unknown Number of Speakers Using Transformers with Encoder-Decoder Attractors

Abstract
This paper considers speaker-independent speech separation in the waveform domain for single-channel mixtures with an unknown number of speakers. To deal with the unknown number of sources, we incorporate an encoder-decoder attractor (EDA) module into a speech separation network. The neural network architecture consists of a trainable encoder-decoder pair and a masking network. The masking network is inspired by the transformer-based SepFormer separation system and contains a dual-path block and a triple-path block, each modeling both short-term and long-term dependencies in the signal. The EDA module first summarises the dual-path block output using an LSTM encoder and then generates one attractor vector per speaker in the mixture using an LSTM decoder. The attractors are combined with the dual-path block output to generate per-speaker channels, which are processed jointly by the triple-path block to predict the masks. Further, a linear-sigmoid layer, with the attractors as input, predicts a binary output that serves as a stopping criterion for attractor generation. The proposed approach is evaluated on the WSJ0-mix dataset with mixtures of up to five speakers and obtains state-of-the-art results in both separation quality and speaker counting for all mixtures.
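The encoder-decoder attractor flow described in the abstract can be sketched schematically. The following minimal NumPy sketch is an illustration, not the authors' implementation: the hand-rolled LSTM cell, the weight initialisation, and all dimensions are hypothetical. It shows the control flow only: an LSTM encoder summarises the frame sequence, an LSTM decoder emits one attractor per step, and a linear-sigmoid head on each attractor decides when to stop.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal single-step LSTM cell (input, forget, cell, output gates)."""
    def __init__(self, in_dim, hid_dim):
        # One stacked weight matrix for all four gates; random, untrained.
        self.W = rng.standard_normal((4 * hid_dim, in_dim + hid_dim)) * 0.1
        self.b = np.zeros(4 * hid_dim)
        self.hid_dim = hid_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)      # new cell state
        h = o * np.tanh(c)              # new hidden state
        return h, c

def eda(frames, enc, dec, stop_w, stop_b, max_spk=6):
    """Encoder-decoder attractor sketch: summarise the frame sequence,
    then emit one attractor per speaker until the stop head fires."""
    D = enc.hid_dim
    h = c = np.zeros(D)
    for x in frames:                    # LSTM encoder consumes all frames
        h, c = enc.step(x, h, c)
    attractors = []
    zero = np.zeros(D)                  # decoder is fed a zero input each step
    for _ in range(max_spk):
        h, c = dec.step(zero, h, c)     # LSTM decoder emits one attractor
        p_exist = sigmoid(stop_w @ h + stop_b)   # linear-sigmoid stop head
        if p_exist < 0.5:               # stopping criterion: no more speakers
            break
        attractors.append(h)
    return np.stack(attractors) if attractors else np.empty((0, D))
```

In the paper the encoder input is the dual-path block output rather than raw frames, and the attractors are then combined with that output to form per-speaker channels for the triple-path block; the sketch above covers only the attractor-generation and stopping logic.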
Author(s)
Chetupalli, Srikanth Raj
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Habets, Emanuël Anco Peter
Mainwork
Interspeech 2022  
Conference
International Speech Communication Association (INTERSPEECH Annual Conference) 2022  
DOI
10.21437/Interspeech.2022-10849
Language
English
Fraunhofer-Institut für Integrierte Schaltungen IIS  
Keyword(s)
  • attractors
  • source separation
  • speaker counting
  • transformers
