Mapping representations of speaker characteristics using deep learning
An automatic model-based system is proposed to estimate the corner vowel formant frequencies, and the acoustic measure known as the triangular Vowel Space Area (tVSA), directly from unlabeled natural speech, without phonetic or vowel transcriptions. i-Vector features are used as the representation of speaker characteristics, from which regression models estimate the formant frequencies of the speaker's corner vowels. Two regression models are investigated in this thesis: Deep Neural Networks (DNN) and Support Vector Regression (SVR). The best configuration uses SVR, which predicts the formant frequencies of the test speakers with a coefficient of determination R² of up to 0.56719 and a correlation coefficient rho of up to 0.76485.
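To make the tVSA measure concrete, the following minimal sketch computes the area of the triangle spanned by the three corner vowels in the F1-F2 plane using the shoelace formula. It assumes the corner vowels are /a/, /i/ and /u/ and that formants are given in Hz. The function name `triangle_vsa` and the formant values in the usage example are illustrative assumptions, not taken from the thesis, and the sketch does not reproduce the i-Vector regression itself.

```python
from typing import Tuple

# A corner vowel is represented by its first two formants (F1, F2) in Hz.
Formants = Tuple[float, float]


def triangle_vsa(a: Formants, i: Formants, u: Formants) -> float:
    """Return the area (in Hz^2) of the triangle whose vertices are the
    (F1, F2) points of the corner vowels /a/, /i/ and /u/.

    Uses the shoelace formula; the absolute value makes the result
    independent of the order in which the vertices are traversed.
    """
    (f1_a, f2_a), (f1_i, f2_i), (f1_u, f2_u) = a, i, u
    signed_twice_area = (
        f1_i * (f2_a - f2_u)
        + f1_a * (f2_u - f2_i)
        + f1_u * (f2_i - f2_a)
    )
    return 0.5 * abs(signed_twice_area)


if __name__ == "__main__":
    # Hypothetical formant values, chosen only to illustrate the call.
    area = triangle_vsa(a=(750.0, 1300.0), i=(300.0, 2300.0), u=(350.0, 900.0))
    print(area)  # 290000.0
```

In the proposed system the six input values would be the formant frequencies predicted by the regression model rather than values measured from labeled vowels.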