Fraunhofer-Gesellschaft

Publica

Here you will find scientific publications from the Fraunhofer Institutes.

Departamento de Nosotros: How machine translated corpora affects language models in MRC tasks

 
Authors: Khvalchik, M.; Galkin, M.


Gamallo, P.:
HI4NLP 2020, Hybrid Intelligence for Natural Language Processing Tasks 2020. Online resource : Proceedings of the Workshop on Hybrid Intelligence for Natural Language Processing Tasks (HI4NLP 2020) co-located with 24th European Conference on Artificial Intelligence (ECAI 2020), Santiago de Compostela, Spain, August 29, 2020
La Clusaz: CEUR, 2020 (CEUR Workshop Proceedings 2693)
http://ceur-ws.org/Vol-2693/
ISSN: 1613-0073
pp. 29-33
Workshop on Hybrid Intelligence for Natural Language Processing Tasks (HI4NLP) <2020, Online>
European Conference on Artificial Intelligence (ECAI) <24, 2020, Online>
English
Conference paper, electronic publication
Fraunhofer IAIS

Abstract
Pre-training large-scale language models (LMs) requires huge amounts of text corpora. LMs for English enjoy ever growing corpora of diverse language resources. However, less resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case implies using machine translation of English corpora to a target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing lead to improved performance and overall LMs robustness. In the empirical evaluation, we perform a comparison of directly translated against curated Spanish SQuAD datasets on both user and system levels. Further experimental results on XQuAD and MLQA downstream transfer-learning question answering tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.

URL: http://publica.fraunhofer.de/dokumente/N-614705.html