Multi-Channel Speech Enhancement and Amplitude Modulation Analysis for Noise Robust Automatic Speech Recognition
This paper describes an automatic speech recognition (ASR) system benchmarked on data from the 3rd CHiME challenge, a dataset comprising distant-microphone recordings of noisy acoustic scenes in public environments. The proposed ASR system employs several methods to increase recognition accuracy and noise robustness. Two different multi-channel speech enhancement techniques are used to eliminate interfering sounds from the audio stream. The first aims at separating the target speaker's voice from background sources by non-negative matrix factorization (NMF), using variational Bayesian (VB) inference to estimate the NMF parameters. The second is a time-varying minimum variance distortionless response (MVDR) beamformer that exploits spatial information to suppress sound signals not arriving from the desired direction. Prior to speech enhancement, a microphone channel failure detector is applied that cross-compares channels using a modulation-spectral representation of the speech signal. ASR feature extraction employs an amplitude modulation filter bank (AMFB), which incorporates prior knowledge about speech to analyze its temporal dynamics. In conjunction with a deep neural network (DNN) based ASR system, AMFB features outperform the commonly used frame splicing of filter bank features, which represents an analogous data-driven approach to extracting modulation-spectral information. In addition, features are speaker-adapted, a recurrent neural network (RNN) is employed for language modeling, and the hypotheses of different ASR systems are combined to further improve recognition accuracy. The proposed ASR system achieves a word error rate (WER) of 5.67% on the real evaluation test data, which is 0.16% (absolute) lower than the best score reported within the 3rd CHiME challenge.
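To make the MVDR beamforming step concrete, the following is a minimal sketch of the standard (time-invariant, per-frequency-bin) MVDR weight computation, not the paper's time-varying implementation. The noise covariance matrix and steering vector here are toy stand-ins; in the actual system they would be estimated from the multi-channel recordings.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR beamformer weights for one frequency bin.

    noise_cov: (M, M) Hermitian noise spatial covariance matrix.
    steering:  (M,) complex steering vector toward the desired direction.
    Returns the (M,) weight vector w = Phi^{-1} d / (d^H Phi^{-1} d),
    which satisfies the distortionless constraint w^H d = 1.
    """
    inv_d = np.linalg.solve(noise_cov, steering)   # Phi^{-1} d
    return inv_d / (steering.conj() @ inv_d)       # normalize by d^H Phi^{-1} d

# Toy example: 4 microphones with a synthetic Hermitian noise covariance.
rng = np.random.default_rng(0)
M = 4
A = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
noise_cov = A @ A.conj().T + np.eye(M)                       # Hermitian, positive definite
steering = np.exp(-1j * 2 * np.pi * np.arange(M) * 0.1)      # plane-wave phase delays

w = mvdr_weights(noise_cov, steering)
print(np.abs(w.conj() @ steering))  # ~1.0: target direction passes undistorted
```

The distortionless constraint is what makes MVDR attractive for ASR front ends: noise power is minimized while the response toward the desired direction is held at unity.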
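For comparison, the frame splicing baseline that AMFB features are measured against can be sketched as simple context stacking of per-frame filter bank features. The context width of +/-5 frames below is an illustrative choice, not a value taken from the paper.

```python
import numpy as np

def splice_frames(feats, context=5):
    """Stack each frame with its +/- `context` neighbors.

    feats: (T, D) array of per-frame filter bank features.
    Edge frames are padded by repeating the first/last frame.
    Returns a (T, D * (2*context + 1)) array, the typical DNN input
    for capturing temporal dynamics in a purely data-driven way.
    """
    T, _ = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# Toy example: 6 frames of 2-dimensional features.
x = np.arange(12, dtype=float).reshape(6, 2)
y = splice_frames(x, context=2)
print(y.shape)  # (6, 10)
```

Splicing lets the DNN learn temporal modulation structure from data, whereas the AMFB imposes that structure through fixed, speech-motivated modulation filters.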