2020
Conference Paper
Title

Departamento de Nosotros: How machine translated corpora affects language models in MRC tasks

Abstract
Pre-training large-scale language models (LMs) requires huge text corpora. LMs for English enjoy ever-growing corpora of diverse language resources. However, less-resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case is to machine-translate English corpora into the target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing leads to improved performance and overall LM robustness. In the empirical evaluation, we compare directly translated against curated Spanish SQuAD datasets on both user and system levels. Further experimental results on XQuAD and MLQA downstream transfer-learning question answering tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.
Author(s)
Khvalchik, M.
Galkin, Mikhail  
Mainwork
HI4NLP 2020, Hybrid Intelligence for Natural Language Processing Tasks 2020. Online resource  
Conference
Workshop on Hybrid Intelligence for Natural Language Processing Tasks (HI4NLP) 2020  
European Conference on Artificial Intelligence (ECAI) 2020  
Language
English
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  