Authors: Mimilakis, S.I.; Drossos, K.; Santos, J.F.; Virtanen, T.; Bengio, Y.; Schuller, G.
Title: Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask
Type: conference paper
Year: 2018
Record dates: 2022-03-14, 2024-04-15
Handle: https://publica.fraunhofer.de/handle/publica/403230
DOI: 10.1109/ICASSP.2018.8461822
Scopus ID: 2-s2.0-85054258844
Language: en
Keywords: automatic music analysis

Abstract: Singing voice separation based on deep learning relies on time-frequency masking. In many cases the masking process is not a learnable function, or it is not encapsulated in the deep learning optimization. Consequently, most existing methods rely on a post-processing step using generalized Wiener filtering. This work proposes a method that learns and optimizes a source-dependent mask during training and does not need the aforementioned post-processing step. We introduce a recurrent inference algorithm, a sparse transformation step to improve the mask generation process, and a learned denoising filter. The obtained results show an increase of 0.49 dB in signal-to-distortion ratio (SDR) and 0.30 dB in signal-to-interference ratio (SIR), compared to previous state-of-the-art approaches for monaural singing voice separation.
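
The core idea in the abstract, applying a source-dependent time-frequency mask directly to the mixture spectrogram rather than in a separate Wiener post-filter, amounts to an element-wise (skip-filtering) product between the mixture magnitude and a predicted mask. A minimal NumPy sketch of that masking step; the function name, shapes, and random "predicted" mask are illustrative assumptions, not the paper's actual network:

```python
import numpy as np

def skip_filter(mix_mag, mask):
    """Estimate the target-source magnitude spectrogram by an
    element-wise product of the mixture magnitude with a predicted
    time-frequency mask (a skip-filtering connection).

    Illustrative sketch only: mask values are clipped to [0, 1] so the
    estimate never exceeds the mixture magnitude in any bin.
    """
    mask = np.clip(mask, 0.0, 1.0)
    return mask * mix_mag

# Toy data: a non-negative "mixture magnitude" and a hypothetical mask
# standing in for a network's output (frequency bins x time frames).
rng = np.random.default_rng(0)
mix = np.abs(rng.standard_normal((4, 8)))
mask = rng.uniform(size=(4, 8))

voice_est = skip_filter(mix, mask)
```

In the paper's setting this product sits inside the network and is optimized end-to-end, which is what removes the need for the generalized Wiener filtering post-processing step mentioned in the abstract.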