Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks

    1. [1] Dresden University of Technology, Dresden, Germany
    2. [2] Semantic Web Company, Austria
  • Published in: Proceedings of the Workshop on Hybrid Intelligence for Natural Language Processing Tasks (HI4NLP 2020), co-located with the 24th European Conference on Artificial Intelligence (ECAI 2020): Santiago de Compostela, Spain, August 29, 2020 / coordinated by Pablo Gamallo Otero, Marcos García González, Patricia Martin-Rodilla, Martín Pereira Fariña, 2020, pp. 29-33
  • Language: English
  • Abstract
    • Pre-training large-scale language models (LMs) requires huge amounts of text corpora. LMs for English enjoy ever growing corpora of diverse language resources. However, less resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case implies using machine translation of English corpora to a target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing lead to improved performance and overall LMs robustness. In the empirical evaluation, we perform a comparison of directly translated against curated Spanish SQuAD datasets on both user and system levels. Further experimental results on XQuAD and MLQA downstream transfer-learning question answering tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.
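The exact match (EM) score cited in the abstract is the standard SQuAD-style metric: a prediction counts as correct only if, after normalization, it equals one of the gold answers. As an illustrative sketch (not the paper's own evaluation code), the usual SQuAD v1.1 normalization lowercases, strips punctuation and English articles, and collapses whitespace; Spanish evaluations typically adapt the article list (e.g. el/la/los/las):

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation, remove English articles, collapse whitespace
    (standard SQuAD v1.1 normalization; article list is language-specific)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold_answers: list[str]) -> int:
    """1 if the normalized prediction equals any normalized gold answer, else 0."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(g) for g in gold_answers))

# A translation artifact (e.g. a spurious article or stray punctuation) that
# survives normalization still costs the full point, which is why EM is a
# sensitive probe for machine-translated QA data.
print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # → 1
print(exact_match("Torre Eiffel", ["Eiffel Tower"]))       # → 0
```

Because EM is all-or-nothing at the span level, even small untranslated or mis-aligned answer spans in a machine-translated dataset depress the score sharply, which matches the abstract's use of EM to measure resilience to translation artifacts.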

