Combination strategy based on relative performance monitoring for multi-stream reverberant speech recognition

Xiong, F.; Goetze, S.; Meyer, B.T.

2017

Conference Paper

Abstract

A multi-stream framework with deep neural network (DNN) classifiers is applied to improve automatic speech recognition (ASR) in environments with different reverberation characteristics. We propose a room parameter estimation model to establish a reliable combination strategy that operates on either DNN posterior probabilities or word lattices. The model is implemented by training a multilayer perceptron that incorporates auditory-inspired features in order to distinguish between, and generalize to, various reverberant conditions. The model output is shown to be highly correlated with ASR performance across multiple streams, i.e., relative performance monitoring, in contrast to conventional performance monitoring for a single stream based on mean temporal distance. Compared to traditional multi-condition training, the proposed combination strategies achieve average relative word error rate improvements of 7.7% and 9.4% when operating on posteriors and lattices, respectively. The multi-stream ASR system was tested in known and unknown simulated reverberant environments as well as in realistically recorded conditions taken from the REVERB Challenge evaluation set.

Host publication

IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017. Proceedings

Conference

International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2017
