Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks

    1. [1] Dresden University of Technology, Dresden, Germany
    2. [2] Semantic Web Company, Austria
  • Published in: Proceedings of the Workshop on Hybrid Intelligence for Natural Language Processing Tasks (HI4NLP 2020), co-located with the 24th European Conference on Artificial Intelligence (ECAI 2020): Santiago de Compostela, Spain, August 29, 2020 / coordinated by Pablo Gamallo Otero, Marcos García González, Patricia Martin-Rodilla, Martín Pereira Fariña, 2020, pp. 29-33
  • Language: English
  • Abstract
    • Pre-training large-scale language models (LMs) requires huge amounts of text corpora. LMs for English enjoy ever growing corpora of diverse language resources. However, less resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case implies using machine translation of English corpora to a target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing lead to improved performance and overall LMs robustness. In the empirical evaluation, we perform a comparison of directly translated against curated Spanish SQuAD datasets on both user and system levels. Further experimental results on XQuAD and MLQA downstream transfer-learning question answering tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.
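The exact match (EM) score cited in the abstract is the standard SQuAD-style metric: a prediction counts as correct only if, after normalization, it equals one of the gold answers. As an illustrative sketch (not the paper's own evaluation code), the usual SQuAD v1.1 normalization lowercases, strips punctuation and English articles, and collapses whitespace; Spanish evaluations typically adapt the article list (e.g. el/la/los/las):

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation, remove English articles, collapse whitespace
    (standard SQuAD v1.1 normalization; article list is language-specific)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold_answers: list[str]) -> int:
    """1 if the normalized prediction equals any normalized gold answer, else 0."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(g) for g in gold_answers))

# A translation artifact (e.g. a spurious article or stray punctuation) that
# survives normalization still costs the full point, which is why EM is a
# sensitive probe for machine-translated QA data.
print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # → 1
print(exact_match("Torre Eiffel", ["Eiffel Tower"]))       # → 0
```

Because EM is all-or-nothing at the span level, even small untranslated or mis-aligned answer spans in a machine-translated dataset depress the score sharply, which matches the abstract's use of EM to measure resilience to translation artifacts.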

