Low-rank regularization for high-dimensional sparse conjunctive feature spaces in information extraction

  • Authors: Audi Primadhanty
  • Thesis supervisors: Xavier Carreras (supervisor), Ariadna Quattoni (supervisor), Horacio Rodríguez Hontoria (tutor)
  • Defense: At the Universitat Politècnica de Catalunya (UPC) (Spain) in 2017
  • Language: Spanish
  • Thesis examination committee: Horacio Saggion (chair), Lluís Padró Cirera (secretary), Andreas Vlachos (member)
  • Doctoral program: Doctoral Program in Artificial Intelligence, Universidad Politécnica de Catalunya
  • Links
    • Open-access thesis at: TDX
  • Abstract
    • One of the challenges in Natural Language Processing (NLP) is the unstructured nature of text, in which useful information is not easily identifiable. Information Extraction (IE) aims to alleviate this problem by automatically extracting structured information from such text sources. The resulting structured information makes it easier to query, organize, and analyze data from texts.

      In this thesis, we are interested in two IE-related tasks: (i) named entity classification and (ii) template filling. Specifically, this thesis examines the problem of learning classifiers of text spans and explores its application to extracting named entities and template slot-fillers.

      In general, our goal is to construct a method to learn classifiers that: (i) require less supervision, (ii) work well with high-dimensional sparse feature spaces and (iii) are able to classify unseen items (i.e. named entities/slot-fillers not observed in training data).

      The key idea of our contribution is the utilization of unseen conjunctive features. A conjunctive feature is a combination of features from different feature sets. For example, to classify a phrase, one might have one feature set for the context and another set for the phrase itself. When learning a classifier, only a fraction of these conjunctive features will be observed in the training set, leaving the rest (i.e., the unseen features) unusable for predicting items at test time. We hypothesize that utilizing such unseen conjunctions helps address all three aspects of this goal.

      We develop a general regularization framework specifically designed for sparse conjunctive feature spaces. Our strategy is based on employing tensors to represent the conjunctive feature space, and forcing the model to induce low-dimensional embeddings of the feature vectors via low-rank regularization on the tensor parameters. Such a compressed representation helps prediction by generalizing to novel examples in which most of the conjunctions were not seen in the training set.

      We conduct experiments on learning named entity classifiers and template filling, focusing on extracting unseen items. We show that when learning classifiers under minimal supervision, our approach is more effective in controlling model capacity than standard techniques for linear classification.
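
The idea of conjunctive features and low-rank parameters described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: the feature names, the toy weights, and the helper functions (`conjoin`, `sparse_score`, `embed`, `low_rank_score`) are all hypothetical. It also simplifies in two ways: it uses a matrix over two feature sets instead of a tensor, and it fixes the rank by writing the weights as a product of two embedding tables instead of applying a low-rank regularizer during learning. It only shows why a low-rank parameterization can assign a non-zero score to a conjunction that never appeared in training, while a model with one weight per observed conjunction cannot.

```python
from itertools import product

def conjoin(context_feats, phrase_feats):
    """Return every (context feature, phrase feature) pair."""
    return set(product(context_feats, phrase_feats))

def sparse_score(context_feats, phrase_feats, weights):
    """Linear score with one weight per conjunction; unseen pairs weigh 0."""
    return sum(weights.get(pair, 0.0)
               for pair in conjoin(context_feats, phrase_feats))

def embed(feats, table, k):
    """Sum the k-dimensional embeddings of the active features."""
    vec = [0.0] * k
    for f in feats:
        for i, v in enumerate(table.get(f, [0.0] * k)):
            vec[i] += v
    return vec

def low_rank_score(context_feats, phrase_feats, ctx_emb, phr_emb, k):
    """Bilinear score with a rank-k weight matrix, as a dot product of embeddings."""
    u = embed(context_feats, ctx_emb, k)
    v = embed(phrase_feats, phr_emb, k)
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy data: two conjunctions observed in training.
seen = conjoin({"prev=Mr."}, {"shape=Xx"}) | conjoin({"prev=Dr."}, {"suffix=son"})
weights = {pair: 1.0 for pair in seen}

# Hypothetical rank-1 embeddings that reproduce the weights of the seen pairs.
ctx_emb = {"prev=Mr.": [1.0], "prev=Dr.": [1.0]}
phr_emb = {"shape=Xx": [1.0], "suffix=son": [1.0]}

# A test item whose only conjunction was never observed in training.
unseen_ctx, unseen_phr = {"prev=Dr."}, {"shape=Xx"}
sparse = sparse_score(unseen_ctx, unseen_phr, weights)
dense = low_rank_score(unseen_ctx, unseen_phr, ctx_emb, phr_emb, k=1)
```

Here `sparse` is 0.0 because the pair ("prev=Dr.", "shape=Xx") has no weight of its own, while `dense` is 1.0 because each feature contributes through its embedding. In a learned model the embeddings would be induced from data, so the score given to an unseen conjunction depends on what the regularized model learns.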

