Feature denoising using CNNs for noisy speech recognition
Automatic speech recognition (ASR) systems are becoming increasingly popular, especially in smartphones and smart home devices. If these systems reached the level of human hearing, they could, for example, be used to control intelligent devices remotely or to create live subtitles for television. Among other things, such systems would vastly improve the living standards of deaf people and revolutionize human-computer interaction. One of the biggest problems for ASR is background noise such as car sounds, conversations, or wind. In the absence of such noise, state-of-the-art ASR systems already approach the proficiency of human hearing. A common approach to handling noise is the use of speech enhancement (SE) systems. In this thesis we examined a speech denoising system based on convolutional neural networks (CNNs). CNNs have already been used successfully for speech recognition, SE, and image denoising in previous studies, so applying them to speech denoising is a natural step. The goal of this thesis was to create a CNN-based speech enhancer for use in the Fraunhofer speech recognition pipeline. The network presented consists solely of convolutional layers and maps noisy filter bank features onto clean filter bank features. The speech recognizer from the Wall Street Journal (WSJ) example of the Eesen toolkit served as the foundation. In our experiments we found that the CNN denoising network can decrease the word error rate (WER) of a speech recognition system by more than 20 percentage points (absolute) in an environment with a moderate signal-to-noise ratio (SNR). At the same time, the WER for speech recorded at high SNRs increased by only one percent. Additionally, the denoising system was shown to generalize to multiple noise types and to real-world data.
The results of our study show that denoising audio with a CNN at the feature level is possible and can significantly improve state-of-the-art speech recognition systems in noisy environments, while only slightly decreasing performance on clean speech.
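Because the results above are stated as absolute changes in WER, a short worked example may help make the metric concrete. The following is a minimal sketch, assuming the standard definition of WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are hypothetical illustrations and are not taken from the thesis or from the Eesen toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, invented purely for illustration.
reference = "turn on the kitchen light"
noisy_hypothesis = "turn on a chicken light"        # two substitutions
denoised_hypothesis = "turn on the kitchen lights"  # one substitution

wer_noisy = word_error_rate(reference, noisy_hypothesis)
wer_denoised = word_error_rate(reference, denoised_hypothesis)

# Absolute reduction is a difference of rates (percentage points);
# relative reduction is that difference divided by the starting rate.
absolute_reduction = wer_noisy - wer_denoised
relative_reduction = absolute_reduction / wer_noisy
```

In this toy case the WER falls from 40% to 20%, which is an absolute reduction of 20 percentage points but a relative reduction of 50%. The distinction matters when comparing reported improvements across systems with different baseline error rates.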
Düsseldorf, Univ., Master Thesis, 2017