Fraunhofer-Gesellschaft
2020
Conference Paper
Title

Using Automatic Speech Recognition in Spoken Corpus Curation

Abstract
The newest generation of speech technology has caused a huge increase in audio-visual data being enhanced with orthographic transcripts, such as automatic subtitling on online platforms. Research data centers and archives contain a range of new and historical data, which are currently only partially transcribed and therefore only partially accessible for systematic querying. Automatic Speech Recognition (ASR) is one option for making that data accessible. This paper tests the usability of a state-of-the-art ASR system on a historical (from the 1960s) but regionally balanced corpus of spoken German, and on a relatively new corpus (from 2012) recorded in a narrow area. We observed a regional bias of the ASR system, with higher recognition scores for the north of Germany and lower scores for the south. A detailed analysis of the narrow-region data revealed, despite relatively high ASR confidence, some specific word errors due to a lack of regional adaptation. These findings need to be considered in decisions on further data processing and the curation of corpora, e.g. whether to correct transcripts or transcribe from scratch. Such geography-dependent analyses also have the potential to support ASR development in making targeted data selection for training/adaptation and in increasing the sensitivity towards varieties of pluricentric languages.
Author(s)
Gorisch, Jan
Institut für Deutsche Sprache
Gref, Michael  
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  
Schmidt, Thomas
Institut für Deutsche Sprache
Mainwork
12th Language Resources and Evaluation Conference, LREC 2020. Proceedings. Online resource  
Conference
Language Resources and Evaluation Conference (LREC) 2020  
Open Access
DOI
10.24406/publica-fhg-408179
File(s)
N-592788.pdf (415.73 KB)
Rights
CC BY-NC 4.0: Creative Commons Attribution-NonCommercial
Language
English
Keyword(s)
  • oral corpora
  • automatic transcription
  • ASR
  • corpus curation
  • pluricentric
  • spoken German
  • Ripuarian
