Injection of linguistic knowledge into neural text generation models

  • Author: Noé Casas Manzanares
  • Thesis supervisors: José Adrián Rodríguez Fonollosa (supervisor), Marta Ruiz Costa-Jussà (co-supervisor)
  • Defense: Universitat Politècnica de Catalunya (UPC), Spain, 2020
  • Language: Spanish
  • Full text not available
  • Abstract
    • Language is an organic construct. It emanates from the need for communication and changes over time, influenced by multiple factors. The resulting language structures are a mix of regular syntactic and morphological constructions together with divergent irregular elements. Linguistics aims at formalizing these structures, providing a rationalization of the underlying phenomena. However, linguistic information alone is not enough to fully characterize the structures in language: they are intrinsically tied to meaning, which constrains and modulates the applicability of linguistic phenomena, and also to context and domain.

      Classical machine translation approaches, such as rule-based systems, relied entirely on linguistic formalisms. Hundreds of morphological and grammatical rules were wired together to analyze the input text and translate it into the target language, attempting to account for the semantic load it carried. While this kind of processing can satisfactorily address most low-level language structures, many meaning-dependent structures were not analyzed correctly.

      On the other hand, the dominant neural language processing systems are trained on raw textual data, which they handle as a sequence of discrete tokens. These tokens are normally defined by searching for reusable word pieces identified statistically from the data, as in byte-pair encoding (see the first sketch after the abstract). Throughout the training process there is no explicit notion of linguistic knowledge: no morphemes, no morphological information, no relationships among words, and no hierarchical groupings.

      This thesis aims to bridge the gap between neural systems and linguistics-based systems, devising systems that combine the flexibility and strong results of the former with a foundation in linguistic formalisms, with the twin purposes of improving quality where data alone cannot and of imposing human-understandable working dynamics on otherwise black-box neural systems. To this end, we propose techniques to fuse statistical subwords with word-level linguistic information (see the second sketch after the abstract), to remove subwords altogether and rely solely on lemmas and the morphological traits of words, and to drive the text generation process by the ordering defined by syntactic dependencies.

      The main results of the proposed methods are the improvements in translation quality obtained by injecting morphological information into NMT systems when testing on out-of-domain data for morphologically rich languages, and the control over the generated text gained by linking the generation order to the syntactic structure.
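
      As a concrete illustration of the statistical subword extraction mentioned above, the following is a minimal Python sketch of byte-pair encoding (BPE), a standard technique of this kind; the toy corpus and merge count are illustrative, and this is not the thesis's exact pipeline.

          from collections import Counter

          def learn_bpe(corpus, num_merges):
              """Learn BPE merge operations from a whitespace-tokenized corpus."""
              # Represent each word as a sequence of characters plus an end marker.
              vocab = Counter()
              for word in corpus.split():
                  vocab[tuple(word) + ("</w>",)] += 1

              merges = []
              for _ in range(num_merges):
                  # Count all adjacent symbol pairs, weighted by word frequency.
                  pairs = Counter()
                  for symbols, freq in vocab.items():
                      for a, b in zip(symbols, symbols[1:]):
                          pairs[(a, b)] += freq
                  if not pairs:
                      break
                  best = max(pairs, key=pairs.get)
                  merges.append(best)
                  # Apply the chosen merge everywhere it occurs.
                  new_vocab = Counter()
                  for symbols, freq in vocab.items():
                      merged, i = [], 0
                      while i < len(symbols):
                          if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                              merged.append(symbols[i] + symbols[i + 1])
                              i += 2
                          else:
                              merged.append(symbols[i])
                              i += 1
                      new_vocab[tuple(merged)] += freq
                  vocab = new_vocab
              return merges

          print(learn_bpe("low lower lowest newer newest", 8))
          # [('w', 'e'), ('l', 'o'), ('lo', 'we'), ...]

      Note that the learned merges carry no morphological notion: frequent character sequences are fused regardless of whether they align with morphemes, which is precisely the gap the thesis addresses.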
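      As a sketch of the first proposed technique, fusing statistical subwords with word-level linguistic information, one common realization is a factored input embedding in which each subword is paired with the lemma and morphological tag of the word it belongs to. The PyTorch module below is a minimal illustration under assumed vocabulary sizes and an assumed concatenation scheme; it is not the thesis's exact architecture.

          import torch
          import torch.nn as nn

          class FactoredEmbedding(nn.Module):
              """Concatenate subword, lemma, and morphology embeddings per token."""

              def __init__(self, n_subwords, n_lemmas, n_morph_tags, dim=512):
                  super().__init__()
                  # Split the model dimension across the three input factors.
                  self.subword = nn.Embedding(n_subwords, dim // 2)
                  self.lemma = nn.Embedding(n_lemmas, dim // 4)
                  self.morph = nn.Embedding(n_morph_tags, dim // 4)

              def forward(self, subword_ids, lemma_ids, morph_ids):
                  # Each position carries its subword id plus the lemma and
                  # morphological tag of the enclosing word; the concatenation
                  # feeds the encoder as one embedding per token.
                  return torch.cat(
                      [self.subword(subword_ids),
                       self.lemma(lemma_ids),
                       self.morph(morph_ids)],
                      dim=-1,
                  )

          emb = FactoredEmbedding(n_subwords=32000, n_lemmas=20000, n_morph_tags=300)
          ids = torch.zeros(2, 7, dtype=torch.long)  # dummy (batch, length) indices
          print(emb(ids, ids, ids).shape)            # torch.Size([2, 7, 512])

      The second proposed technique goes further in the same direction, dropping the subword factor and representing words solely by their lemmas and morphological traits.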

