Detecting double-talk (overlapping speech) in conversations using deep learning
The work presented in this thesis aims to automatically detect double-talk (overlapping speech) in audio recordings of natural conversations using a deep convolutional neural network. In doing so, it avoids the manual engineering of problem-specific acoustic features that is prevalent in classical approaches. The characteristic challenges arising from the ephemeral nature of natural double-talk, in addition to the standard issues faced in developing a pattern recognition system, are each addressed with a dedicated method. In particular, careful rebalancing of the training data to tackle the inherent class imbalance, pre-removal of silence, and two standard normalization procedures for reducing the mismatch between training and testing conditions are each systematically evaluated for their respective impact. Furthermore, the shortcoming of the proposed neural network in modelling long-term temporal dependencies is documented, and an attempt to remedy it with Viterbi decoding is reported. Satisfactory results are achieved on a large and representative test set, and several avenues for future work are identified.
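To illustrate the kind of post-processing the abstract refers to, the following is a minimal sketch of Viterbi decoding applied to frame-wise class posteriors, such as those a neural network might output for "no overlap" versus "overlap". It is not the thesis's implementation: the function name `viterbi_smooth`, the `stay_prob` parameter, the uniform initial distribution, and the use of raw posteriors as emission scores are all assumptions made for illustration.

```python
import numpy as np


def viterbi_smooth(posteriors, stay_prob=0.9):
    """Return the most likely state sequence for per-frame class posteriors.

    posteriors: array of shape (T, K), K >= 2, one row of class
                probabilities per frame (hypothetical network output).
    stay_prob:  assumed probability of remaining in the same state between
                consecutive frames; higher values give smoother output.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    T, K = posteriors.shape

    # Assumption: posteriors are used directly as emission scores.
    log_obs = np.log(posteriors + 1e-12)

    # Assumed transition model: stay_prob on the diagonal, the remainder
    # shared equally among the other states.
    trans = np.full((K, K), (1.0 - stay_prob) / (K - 1))
    np.fill_diagonal(trans, stay_prob)
    log_trans = np.log(trans)

    delta = np.empty((T, K))            # best log-score ending in each state
    back = np.zeros((T, K), dtype=int)  # best predecessor of each state
    delta[0] = np.log(1.0 / K) + log_obs[0]  # uniform initial distribution

    for t in range(1, T):
        # scores[i, j]: best score of being in state i at t-1, then moving to j
        scores = delta[t - 1][:, None] + log_trans
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(K)] + log_obs[t]

    # Trace the best path backwards from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path


# Usage: a single-frame flicker towards class 1 is smoothed away.
flicker = np.array([[0.9, 0.1]] * 5 + [[0.4, 0.6]] + [[0.9, 0.1]] * 5)
print(viterbi_smooth(flicker).tolist())  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

A frame-wise argmax over `flicker` would label the sixth frame as class 1, whereas the decoded path keeps it at class 0 because the cost of two transitions outweighs the weak evidence from one frame. A sustained change in the posteriors over several frames would still be followed.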