Dialnet
Information extraction from heterogeneous handwritten documents

  • Authors: Juan Ignacio Toledo Testa
  • Thesis supervisors: Alicia Fornés Bisquerra (supervisor), Josep Llados Canet (co-supervisor)
  • Defended: at the Universitat Autònoma de Barcelona (Spain) in 2019
  • Language: English
  • Thesis examination committee: Véronique Églin (chair), Oriol Ramos Terrades (secretary), Andreas Fischer (member)
  • Doctoral programme: PhD Programme in Computer Science, Universidad Autónoma de Barcelona
  • Links
    • Open-access thesis at: TESEO
  • Abstract
    • Despite the unstoppable trend towards a fully digital, paperless world, there is still an abundance of totally or partially handwritten documents that need to be automatically processed, from historical demographic records to more recent form-like documents. The most common approach is to apply Document Image Analysis and Recognition (DIAR) techniques to digital images of those documents, acquired with scanners or digital cameras.

      Over the years, as technology has advanced, different approaches have been proposed to access the information contained in document images. In this thesis we explore the whole information extraction process, illustrated with different types of documents.

      The first step towards extracting information from any document is actually understanding the document and the most common techniques required to process it. For that reason, in the first chapter of this thesis after the introduction, we provide an in-depth study of electoral documents, which had not yet drawn much attention from the community. We discuss how the interpretation of document images is never as simple as it seems; even deciding whether a mark is present at a given position can be challenging, depending on the legal requirements for our system. In this chapter we also give a brief overview of some of the most common DIAR techniques that can be applied to different kinds of electoral documents. Ultimately, electoral documents can be seen as a special kind of form document, where the position within the form determines the semantics of each element. For example, the word 'John' could be the name of a candidate or the name of an officer certifying the results for a particular polling place.
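      In its most stripped-down form, the mark-detection decision mentioned above can be reduced to measuring the ink density inside a known checkbox region of a binarized page image and comparing it against a threshold. A minimal sketch with NumPy, where the 5% threshold is a hypothetical value that would in practice be tuned to the legal requirements of the deployment:

```python
import numpy as np

def mark_present(cell: np.ndarray, threshold: float = 0.05) -> bool:
    """Decide whether a checkbox cell contains a mark.

    `cell` is a grayscale (0-255) image crop of the checkbox region;
    pixels darker than 128 are counted as ink, and a mark is declared
    when the ink fraction exceeds `threshold` (illustrative default).
    """
    ink_ratio = (cell < 128).mean()
    return bool(ink_ratio > threshold)

# Example: an empty white cell vs. one with a stroke drawn across it.
empty = np.full((20, 20), 255, dtype=np.uint8)
marked = empty.copy()
marked[8:12, :] = 0  # simulate a horizontal pen stroke
```

The threshold trades false positives for false negatives; a stricter legal standard for what counts as a valid mark would push it up or require a learned classifier instead.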

      If, in the simplest case, the semantic entities are determined by their spatial position, we should devote some of our efforts to transcription, that is, to "reading" the information in each field. We therefore devote a chapter to exploring two new approaches to handwriting recognition. Both assume that handwriting can be modeled as a series of features sequentially extracted from the images. In the first approach, we use variational autoencoders to derive descriptive features from unlabeled text images, whereas in the second approach we use attribute embeddings, which allow us to derive a much more discriminative set of features by making use of the transcription information.
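      The attribute-embedding idea can be illustrated with the PHOC (Pyramidal Histogram of Characters) representation commonly used in handwritten word spotting: the transcription is turned into a binary attribute vector recording which characters occur in which region of the word, and a network is trained to regress this vector from the word image. A minimal NumPy-only sketch of the text-side embedding; the pyramid levels and alphabet chosen here are illustrative, not the thesis's exact configuration:

```python
import string
import numpy as np

def phoc(word: str, levels=(1, 2, 3),
         alphabet=string.ascii_lowercase) -> np.ndarray:
    """Build a simplified PHOC vector for a transcription.

    For each pyramid level, the word is split into `level` equal regions,
    and for each region we flag which alphabet characters have their
    midpoint inside it. The result concatenates all region histograms.
    """
    word = word.lower()
    n = len(word)
    total_regions = sum(levels)
    if n == 0:
        return np.zeros(total_regions * len(alphabet))
    parts = []
    for level in levels:
        for region in range(level):
            lo, hi = region / level, (region + 1) / level
            flags = np.zeros(len(alphabet))
            for i, ch in enumerate(word):
                mid = (i + 0.5) / n  # midpoint of character i in [0, 1)
                if lo <= mid < hi and ch in alphabet:
                    flags[alphabet.index(ch)] = 1.0
            parts.append(flags)
    return np.concatenate(parts)
```

Because the embedding is defined purely from the transcription, two images of the same word share a target vector, which is what makes the learned image features discriminative across writers.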

      After learning about the factors to take into consideration when interpreting documents and the techniques required to transcribe handwritten text, we can now extract information from highly structured types of documents, such as forms, but we are interested in going one step further. A good challenge is posed by historical handwritten birth or marriage records. These documents are not exactly forms, but they share a similar structure: we can think of them as divided into records, each with a given set of fields whose positions are not fixed. We are interested in processing these documents as if they were forms. To study this problem, we created a new benchmark and organized an international competition for the community. We devote another chapter to describing the dataset, the required tasks, and the evaluation metric in detail.

      Finally, in the last chapter, we propose a full information extraction approach, with two variants, based on a combination of convolutional and recurrent neural networks, that can deal with loosely structured documents such as those in our proposed benchmark. This approach uses the better-performing of the two HTR approaches described in the earlier chapter to obtain the transcriptions, while deriving the semantic information directly from the word images. As a first step, we show that classifying isolated handwritten word images into semantic classes is feasible, and we then explore two alternatives for leveraging contextual, record-level information to improve the accuracy of the system.
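      That first step, assigning isolated word images to semantic classes, can be sketched at its simplest as a softmax classifier over fixed-length feature vectors extracted from the word images. In the thesis this role is played by a convolutional network; the linear model, training loop, and class names below are purely illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical semantic classes for a demographic record.
CLASSES = ["name", "surname", "occupation", "location", "other"]

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

class WordClassifier:
    """Linear softmax classifier over word-image feature vectors,
    trained by batch gradient descent on cross-entropy loss."""

    def __init__(self, n_features: int, n_classes: int):
        self.W = rng.normal(0.0, 0.01, (n_features, n_classes))
        self.b = np.zeros(n_classes)

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        return softmax(X @ self.W + self.b)

    def fit(self, X: np.ndarray, y: np.ndarray,
            lr: float = 0.5, epochs: int = 200) -> None:
        Y = np.eye(len(self.b))[y]  # one-hot targets
        for _ in range(epochs):
            P = self.predict_proba(X)
            self.W -= lr * (X.T @ (P - Y)) / len(X)
            self.b -= lr * (P - Y).mean(axis=0)
```

The contextual variants mentioned above would then refine these per-word predictions using the surrounding words of the same record, for instance with a recurrent layer over the record's word sequence.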

