Disambiguating Spanish se constructions with machine learning techniques

Nuria Aldama García

Author: Nuria Aldama García
Thesis supervisor: Antonio Moreno Sandoval
Defended: at the Universidad Autónoma de Madrid (Spain) in 2021
Language: English
Number of pages: 219
Parallel titles:
- Desambiguación de construcciones con se con aprendizaje automático
Thesis committee: Cristina Sánchez López (chair), Jordi Porta Zamorano (secretary), Pablo Gamallo Otero (member)
Doctoral programme: Programa de Doctorado en Filosofía y Ciencias del Lenguaje, Universidad Autónoma de Madrid
Subjects:
- Linguistics
  - Applied linguistics
    - Computational linguistics
Links
- Open-access thesis at: Biblos-e Archivo
Abstract
- Spanish se constructions are a linguistic phenomenon that challenges Natural Language Processing (NLP) tasks such as part-of-speech tagging and dependency relation tagging. Three main reasons make se a difficult topic for NLP: first, the high frequency of se in Spanish; second, the nine different syntactic constructions in which se appears, contributing information of a diverse nature depending on the context; third, the lack of gender and number features on se, which hinders se-type disambiguation. The main goal of this thesis is to improve on state-of-the-art results in automatic morphosyntactic analysis of se on the basis of two hypotheses: the grouping hypothesis (GH) and the subcategorization frame hypothesis (SFH). The thesis proposes a new annotation scheme for se that connects the different constructions through a transitivity gradient (Moreno Cabrera, 2004). The new annotation scheme is applied to the SE-corpus, a European Spanish corpus of 3,100 sentences containing the word se. The SE-corpus is drawn from the news, leisure and daily life domain of CORPES XXI (Real Academia Española, 2018) and was manually annotated as part of this research. The SE-corpus is used to train different models with UDPipe 1.2 to test whether the new annotation scheme can be learnt by the neural networks that underlie the dependency parser. The resulting models are evaluated on an additional gold-standard test corpus of 100 sentences containing the form se, also obtained from CORPES XXI.
  
  The best model yields a LAS F-score of 86.97 points and a UAS F-score of 89.65 points. For the analysis of se specifically, the best model yields a LAS F-score of 82.55 points and a UAS F-score of 98.16 points. The main contributions of this thesis are: a new annotation scheme for se adapted to the Universal Dependencies guidelines, manual annotation guidelines for Spanish se disambiguation, the raw and annotated versions of the SE-corpus, and the best resulting model.
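To make the reported metrics concrete, here is a minimal sketch of how UAS (Unlabeled Attachment Score) and LAS (Labeled Attachment Score) can be computed. It assumes that gold and predicted analyses share the same tokenization, in which case precision and recall coincide and the F-score reduces to plain accuracy. The function name `attachment_scores`, the toy sentence and its labels are illustrative assumptions; they are not taken from the thesis, and the labels are standard Universal Dependencies subtypes rather than the annotation scheme the thesis proposes.

```python
from typing import List, Tuple

# One token = (index of its head, dependency label); 0 marks the root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for token-aligned analyses."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must be token-aligned")
    if not gold:
        raise ValueError("cannot score an empty analysis")
    # UAS: the predicted head matches the gold head.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the dependency label match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# Hypothetical toy sentence: "Se venden pisos" (tokens 1..3).
gold = [(2, "expl:pass"), (0, "root"), (2, "nsubj:pass")]
# A parser that attaches every token correctly but mislabels two relations.
pred = [(2, "expl:impers"), (0, "root"), (2, "obj")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # prints "UAS=100.00 LAS=33.33"
```

The toy case shows why a se-specific UAS can sit well above the corresponding LAS: attaching se to the right verb is usually easy, while choosing the right construction label is the actual disambiguation problem.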
