From its early stages, the Pattern Recognition and Computer Vision community has recognized the importance of leveraging structural information when understanding images. Graphs have usually been chosen as the framework for representing this kind of information because of their flexibility and representational power: they can encode both the components, objects, or entities and their pairwise relationships. Even though graphs have been successfully applied to a wide variety of tasks, their symbolic and relational nature has always imposed some limitations compared to statistical approaches. Indeed, some trivial mathematical operations have no equivalent in the graph domain. For instance, at the core of many pattern recognition applications lies the need to compare two objects. This operation, which is trivial for feature vectors, is not properly defined for graphs.
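The contrast between comparing vectors and comparing graphs can be illustrated with a minimal sketch. It is not the method developed in this thesis: the function names and the brute-force edge-mismatch measure are assumptions chosen only to show that a graph comparison has to search over node correspondences, whereas a vector comparison is a single formula.

```python
import math
from itertools import permutations


def vector_distance(u, v):
    """Euclidean distance: one closed-form expression for fixed-length vectors."""
    return math.dist(u, v)


def graph_distance(n, edges_a, edges_b):
    """Illustrative distance between two unlabeled, undirected graphs on n nodes.

    Counts the edges that disagree under the best node correspondence.
    Hypothetical helper: it tries all n! correspondences, so it is only
    usable for very small graphs.
    """
    a = {frozenset(e) for e in edges_a}
    best = None
    for perm in permutations(range(n)):
        # Relabel the nodes of the second graph according to this correspondence.
        b = {frozenset((perm[x], perm[y])) for x, y in edges_b}
        cost = len(a ^ b)  # edges present in exactly one of the two graphs
        if best is None or cost < best:
            best = cost
    return best


path = [(0, 1), (1, 2)]
relabelled_path = [(0, 2), (2, 1)]  # the same path with its nodes renamed
triangle = [(0, 1), (1, 2), (0, 2)]

same = graph_distance(3, path, relabelled_path)
different = graph_distance(3, path, triangle)
```

Here `same` is 0 because the two paths differ only in node naming, while `different` is 1 because the triangle has one extra edge under every correspondence. The factorial search is what makes the operation costly and motivates approximate or embedded comparisons.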
Throughout this dissertation, the main application domain has been Document Image Analysis and Recognition, a subfield of Computer Vision that aims at understanding images of documents. In this context, structure, and graph representations in particular, provides a complementary dimension to the raw image contents.
In computer vision, the first challenge is how to build a meaningful graph representation that encodes the relevant characteristics of a given image. This representation should strike a trade-off between simplicity and the flexibility needed to represent the deformations that appear in each application domain. We applied our proposal to word spotting, where strokes are divided into graphemes, the smallest units of a handwritten alphabet.
We have investigated different approaches to speed up graph comparison so that word spotting or, more generally, a retrieval application can handle large collections of documents. On the one hand, a graph indexing framework combined with a voting scheme at the node level is able to quickly prune unlikely results. On the other hand, using hierarchical graph representations, we are able to perform coarse-to-fine matching, in which most comparisons are carried out on a reduced graph representation. Moreover, the hierarchical graph representation proved to be more robust than the original graph: the additional information it provides handles noise and deformations elegantly. We therefore propose to exploit this information in a hierarchical graph embedding, which allows the use of classical statistical techniques.
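The idea of pruning with a node-level voting scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the indexing framework of the thesis: each graph is reduced to a list of discrete node descriptors, and the descriptor labels (`"loop"`, `"arc"`, and so on), the function names, and the `keep` parameter are all hypothetical.

```python
from collections import Counter, defaultdict


def build_index(collection):
    """Map each node descriptor to the ids of the graphs that contain it."""
    index = defaultdict(set)
    for graph_id, node_descriptors in collection.items():
        for descriptor in node_descriptors:
            index[descriptor].add(graph_id)
    return index


def vote(index, query_descriptors, keep):
    """Let every query node vote for the graphs sharing its descriptor.

    Returns the ids of the `keep` most voted graphs. Graphs that receive
    no vote are pruned without any graph-to-graph comparison.
    """
    votes = Counter()
    for descriptor in query_descriptors:
        for graph_id in index.get(descriptor, ()):
            votes[graph_id] += 1
    # Most votes first; ties are broken by id to keep the result deterministic.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [graph_id for graph_id, _ in ranked[:keep]]


# Toy collection: each word graph is described only by its node descriptors.
collection = {
    "word_a": ["loop", "ascender", "arc"],
    "word_b": ["loop", "arc", "dot"],
    "word_c": ["descender", "dot"],
}
index = build_index(collection)
candidates = vote(index, ["loop", "arc", "ascender"], keep=2)
```

In this toy case `candidates` contains `word_a` and `word_b`, and `word_c` is discarded because it shares no descriptor with the query. Only the surviving candidates would then be passed to a more expensive graph comparison, which is where the saving on large collections comes from.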
Recently, advances in geometric deep learning, which has emerged as a generalization of deep learning methods to non-Euclidean domains such as graphs and manifolds, have renewed attention to these representation schemes. Taking advantage of these new developments while using traditional methodologies as a guideline, we proposed a graph metric learning framework that obtains state-of-the-art results on different tasks.
Finally, the contributions of this thesis have been validated in real industrial use cases. For instance, an industrial collaboration has resulted in a table detection framework for anonymized administrative documents containing sensitive data. In particular, the company is interested in automatic information extraction from invoices. In this scenario, graph neural networks have proved able to detect repetitive patterns which, after an aggregation process, constitute a table.