Fraunhofer-Gesellschaft

Publica

Here you can find scientific publications from the Fraunhofer Institutes.

Phonotactic language identification for singing

 
Author: Kruspe, Anna M.

Published in:

International Speech Communication Association (ISCA):
Understanding speech processing in humans and machines. Vol.5 : 17th Annual Conference of the International Speech Communication Association (INTERSPEECH 2016); San Francisco, California, USA, 8-12 September 2016
Red Hook, NY: Curran, 2016
ISBN: 978-1-5108-3313-5
pp. 3319-3323
International Speech Communication Association (Interspeech Annual Conference) <17, 2016, San Francisco/Calif.>
English
Conference Paper
Fraunhofer IDMT
vocal analysis; phoneme recognition; language identification; lyrics alignment

Abstract
In the past decades, many successful approaches for language identification have been published. However, almost none of these approaches were developed with singing in mind. Singing has many characteristics that differ from speech, such as a wider variance of fundamental frequencies and phoneme durations, vibrato, pronunciation differences, and different semantic content.
We present a new phonotactic language identification system for singing based on phoneme posteriorgrams. These posteriorgrams were extracted using acoustic models trained on English speech (TIMIT) and on an unannotated English-language a cappella singing dataset (DAMP). SVM models were then trained on phoneme statistics derived from these posteriorgrams.
The models are evaluated on a set of amateur singing recordings from YouTube, and, for comparison, on the OGI Multilanguage corpus.
While the results on a cappella singing are somewhat worse than those previously obtained using i-vector extraction, this approach is easier to implement. Phoneme posteriorgrams need to be extracted for many applications anyway, and can easily be reused for language identification with this approach. The results on singing improve significantly when the utilized acoustic models have also been trained on singing. Interestingly, the best results on the OGI speech corpus are also obtained when acoustic models trained on singing are used.
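To illustrate the idea of the abstract, the following is a minimal sketch of how phonotactic statistics can be derived from a phoneme posteriorgram before feeding a classifier. All names are illustrative assumptions, not the authors' actual implementation; the feature choices (mean posteriors plus bigram counts of the frame-wise most likely phonemes) are one plausible instance of "phoneme statistics".

```python
def posteriorgram_features(posteriorgram):
    """Compute simple phonotactic statistics from a phoneme posteriorgram.

    posteriorgram: list of frames, each frame a list of posterior
    probabilities (one per phoneme class).
    Returns: mean posterior per phoneme class, followed by normalized
    bigram counts of the frame-wise most likely phoneme sequence.
    """
    n_frames = len(posteriorgram)
    n_phones = len(posteriorgram[0])

    # Mean posterior per phoneme class over all frames.
    means = [sum(frame[p] for frame in posteriorgram) / n_frames
             for p in range(n_phones)]

    # Frame-wise most likely phoneme, with consecutive repeats collapsed.
    seq = [max(range(n_phones), key=frame.__getitem__)
           for frame in posteriorgram]
    collapsed = [seq[0]] + [p for prev, p in zip(seq, seq[1:]) if p != prev]

    # Normalized phoneme bigram counts (a crude phonotactic statistic).
    bigrams = [0.0] * (n_phones * n_phones)
    for a, b in zip(collapsed, collapsed[1:]):
        bigrams[a * n_phones + b] += 1.0
    total = sum(bigrams) or 1.0

    return means + [c / total for c in bigrams]
```

In a setup like the paper's, one such feature vector per recording would then be passed to SVM models (e.g. an off-the-shelf implementation such as scikit-learn's `SVC`) trained to discriminate between languages.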

URL: http://publica.fraunhofer.de/documents/N-435949.html