Probabilistic spatial filter estimation for signal enhancement in multi-channel automatic speech recognition

Authors: Kayser, Hendrik; Moritz, Niko; Anemüller, Jörn


International Speech Communication Association -ISCA-:
Understanding speech processing in humans and machines. Vol.4 : 17th Annual Conference of the International Speech Communication Association (INTERSPEECH 2016); San Francisco, California, USA, 8-12 September 2016
Red Hook, NY: Curran, 2016
ISBN: 978-1-5108-3313-5
International Speech Communication Association (Interspeech Annual Conference) <17, 2016, San Francisco/Calif.>
Fraunhofer IDMT

Speech recognition in multi-channel environments requires target speaker localization, multi-channel signal enhancement and robust speech recognition. We propose a system that addresses these three problems. Localization is performed with a recently introduced probabilistic method that is based on support vector machine learning of GCC-PHAT weights and that estimates a spatial source probability map. The main contribution of the present work is a probabilistic approach to (re-)estimation of location-specific steering vectors: observed inter-channel phase differences are weighted with the spatial source probability map derived in the localization step. Subsequent speech recognition is carried out with a DNN-HMM system using amplitude modulation filter bank (AMFB) acoustic features, which are robust to spectral distortions introduced during spatial filtering.
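The idea of weighting inter-channel phase differences with a source probability can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: it handles a single frequency bin, takes channel 0 as the phase reference, and uses one probability value per frame as the weight. The function names `estimate_steering_vector` and `delay_and_sum`, the toy phase offsets, and the weights are all hypothetical. The abstract does not specify the exact weighting rule, so a plain weighted average of unit-magnitude phase factors is assumed here.

```python
import cmath


def estimate_steering_vector(frames, weights):
    """Probability-weighted average of inter-channel phase factors.

    frames  : list of frames, each a list of M complex STFT values (one bin)
    weights : assumed per-frame probability that the target location is active
    Returns M unit-magnitude complex factors relative to channel 0.
    """
    num_channels = len(frames[0])
    acc = [0j] * num_channels
    for frame, w in zip(frames, weights):
        ref = frame[0]
        for m, x in enumerate(frame):
            cross = x * ref.conjugate()  # phase of x relative to the reference
            if abs(cross) > 0.0:
                acc[m] += w * cross / abs(cross)  # keep phase, drop magnitude
    # Normalize so that only the averaged phase remains.
    return [a / abs(a) if abs(a) > 0.0 else 1 + 0j for a in acc]


def delay_and_sum(frame, steering):
    """Phase-align the channels with the steering vector and average them."""
    return sum(d.conjugate() * x for d, x in zip(steering, frame)) / len(frame)


# Toy data (hypothetical): three channels, one frequency bin.
true_phase = [0.0, 0.5, -1.0]          # target's inter-channel phase offsets
interferer_phase = [0.0, -2.0, 2.5]    # offsets of a competing source

target_frames = [[a * cmath.exp(1j * p) for p in true_phase]
                 for a in (1.0, 0.8j, -1.2)]
interferer_frames = [[a * cmath.exp(1j * p) for p in interferer_phase]
                     for a in (0.9, -0.7j)]

frames = target_frames + interferer_frames
weights = [0.9, 0.95, 0.9, 0.05, 0.1]  # high where the target is likely active

steering = estimate_steering_vector(frames, weights)
estimated_phase = [cmath.phase(d) for d in steering]

out_target = delay_and_sum(target_frames[0], steering)
out_interferer = delay_and_sum(interferer_frames[0], steering)
```

Because frames dominated by the interferer receive low weights, the estimated phases stay close to the target's offsets, and the delay-and-sum output attenuates the interferer frame relative to the target frame. An MVDR beamformer would additionally need a noise covariance estimate, which this sketch leaves out.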
The system has been evaluated on the CHiME-3 multi-channel ASR dataset. Recognition was carried out with and without probabilistic steering vector re-estimation, using either MVDR or delay-and-sum beamforming. Results indicate that, on real-world evaluation data, the system attains a relative improvement of 31.98% over the baseline and of 21.44% over a modified baseline. We note that this improvement is achieved without exploiting oracle knowledge about speech/non-speech intervals for noise covariance estimation (which is, however, assumed for baseline processing).
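For readers unfamiliar with the term, a relative improvement is conventionally computed as below. This assumes the underlying metric is an error rate such as word error rate, which the abstract does not name explicitly. The symbols $E_{\text{base}}$ and $E_{\text{sys}}$ are introduced here for illustration only.

```latex
\[
  \text{relative improvement}
  = \frac{E_{\text{base}} - E_{\text{sys}}}{E_{\text{base}}} \times 100\,\%
\]
```

Under this reading, the reported 31.98% would mean that the proposed system's error rate is roughly 68% of the baseline's error rate on the real-world evaluation data.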