Prediction of perceived speech quality using deep machine listening
Subjective ratings of speech quality (SQ) are essential for evaluating algorithms for speech transmission and enhancement. In this paper we explore a non-intrusive model for SQ prediction based on the output of the deep neural network (DNN) of a conventional automatic speech recognizer. The degradation of the phoneme probabilities obtained from the network is quantified with the mean temporal distance, proposed earlier for multi-stream ASR. The SQ predicted with this method is compared with average subjective ratings from the TCD-VoIP speech quality database, which covers several SQ degradations that can occur in VoIP applications, such as clipping, packet loss, echo, background noise, and competing speakers. Our approach is tailored to speech and is therefore not applicable when quality is degraded by a competing speaker, which is reflected in an insignificant correlation between model output and subjective SQ. In all other conditions mentioned above, the model reaches an average correlation of r = 0.87, which is higher than the correlations achieved with the baseline ITU-T P.563 (r = 0.71) and the American National Standard ANIQUE+ (r = 0.75). Since the most robust ASR system is not necessarily the best model for predicting SQ, we also investigate the effect of the amount of training data on quality prediction.
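The core quantity, the mean temporal distance over DNN phoneme posteriors, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the choice of the symmetric Kullback-Leibler divergence as the frame-to-frame distance are assumptions for the sketch, and the posteriors here are random toy data rather than ASR output.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-10):
    """Symmetric Kullback-Leibler divergence between two posterior vectors."""
    p = p + eps  # avoid log(0) for near-zero probabilities
    q = q + eps
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))

def mean_temporal_distance(posteriors, lag):
    """Average divergence between phoneme posterior vectors `lag` frames apart.

    posteriors: (T, K) array; each row is a probability distribution
    over K phoneme classes at one time frame.
    """
    T = posteriors.shape[0]
    dists = [symmetric_kl(posteriors[t], posteriors[t + lag])
             for t in range(T - lag)]
    return float(np.mean(dists))

# Toy example: softmax-normalized random "posteriors",
# 100 frames over 40 phoneme classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 40))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
m = mean_temporal_distance(post, lag=10)
```

The intuition behind using this measure for quality prediction is that degraded speech flattens or destabilizes the posterior trajectories, which lowers the distance between frames that should belong to different phonemes.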