Real-time localization of multi-oriented text in natural scene images

  • Authors: Xavier Gironés Sancho
  • Thesis supervisors: Carme Julià (supervisor)
  • Defense: At the Universitat Rovira i Virgili (Spain) in 2021
  • Language: Spanish
  • Thesis committee: Ernest Valveny Llobet (chair), Hatem Abd El-Latif Fatahallah Ibrahim Mahmoud Rashwan (secretary), Àgata Lapedriza Garcia (member)
  • Doctoral program: Doctoral Program in Computer Engineering and Mathematics of Security, Universitat Rovira i Virgili
  • Links
    • Open-access thesis at: TDX
  • Abstract
    • Text in natural scene images carries high-level semantic information useful to interpret the environment. However, while the topic of Optical Character Recognition (OCR) on scanned documents has reached a degree of maturity, the problem of text localization and recognition in unconstrained images remains a challenge. Factors present in natural images, such as background clutter, poor lighting conditions, perspective distortion, blurring, as well as variations in font, scale, and orientation, make the task of text extraction in the wild more complicated than the typical OCR operation. This difficulty has sparked the interest of the scientific community, and the subject of scene text extraction has become an active research area. However, while recent deep learning-based approaches report excellent results in terms of accuracy, they are generally very compute-intensive and therefore not suitable for low-powered hardware architectures or applications with hard real-time constraints.

      This thesis focuses on the problem of text localization in natural scene images from the perspective of time efficiency. To this end, it introduces a multi-oriented text localization method suitable for real-time processing of high-definition video on portable and mobile devices. The proposed method is based on the connected component (CC) approach: first, CCs are isolated by convolving a multi-scale pyramid with a specifically designed linear spatial filter, followed by hysteresis thresholding. Next, non-textual CCs are pruned by a cascade of local classifiers fed with increasingly extended feature vectors, where the stroke width feature is estimated in linear time by computing the maximal inscribed squares in the CCs. Candidate CCs and their neighbors are subsequently checked with a context-aware classifier that takes into account the target CCs and their vicinity. Lastly, text sequences are extracted in all pyramid levels and fused using dynamic programming. The proposed method is capable of processing 1080p HD video at nearly 30 frames per second on a standard laptop without requiring a GPU. Furthermore, when benchmarked on the ICDAR 2013 Robust Reading and the ICDAR 2015 Incidental Scene Text datasets, it ran more than twice as fast as the state of the art while still delivering competitive results in terms of precision and recall.
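
The linear-time stroke width idea mentioned above can be illustrated with a small sketch. It uses the textbook dynamic program that finds, for every foreground pixel, the largest all-foreground square ending at that pixel, visiting each pixel once. The abstract does not say how the thesis turns the inscribed squares into a width estimate, so taking the largest side is an assumption made here for illustration, and the function names `largest_square_sizes` and `estimate_stroke_width` are hypothetical.

```python
from typing import List


def largest_square_sizes(mask: List[List[int]]) -> List[List[int]]:
    """For each pixel, return the side of the largest all-foreground square
    whose bottom-right corner is that pixel (0 for background pixels)."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    size = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue  # background pixel: no square ends here
            if r == 0 or c == 0:
                size[r][c] = 1  # border pixels can only hold a 1x1 square
            else:
                # A square of side k needs squares of side k-1 above,
                # to the left, and diagonally up-left.
                size[r][c] = 1 + min(size[r - 1][c],
                                     size[r][c - 1],
                                     size[r - 1][c - 1])
    return size


def estimate_stroke_width(mask: List[List[int]]) -> int:
    """Assumed heuristic (not from the thesis): use the side of the largest
    inscribed square as the stroke width of the component."""
    size = largest_square_sizes(mask)
    return max((v for row in size for v in row), default=0)


# An "L"-shaped component drawn with a 2-pixel-wide stroke.
l_shape = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
print(estimate_stroke_width(l_shape))
```

Each pixel is processed once with a constant amount of work, so the cost grows linearly with the number of pixels in the component's bounding box.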

      Another contribution of this thesis is a new family of rational approximations of the arctangent function valid in the [0, π/2] range, which can be easily extended to two and four quadrants. Compared to state-of-the-art approximations, the proposed third-order function outperformed the existing ones in both accuracy and speed. Specifically, it achieved about 15% more accuracy and an evaluation time almost 1.5 times faster than the second-best option. In turn, the second-order approximation performed 2.6 and 3 times faster than the best alternative compared, for two and four quadrants, respectively. Its low evaluation cost, combined with a relatively high accuracy of 0.1620°, makes it valuable for image processing tasks.
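
To make the quadrant extension concrete, here is a minimal sketch of how a first-quadrant arctangent approximation can be extended to two and four quadrants by symmetry. The abstract does not give the thesis's coefficients, so the core below uses a widely quoted textbook rational approximation, atan(z) ≈ z / (1 + 0.28125 z²) for 0 ≤ z ≤ 1, purely as a stand-in. All function names are hypothetical.

```python
import math


def atan_unit(z: float) -> float:
    """Stand-in rational approximation of atan(z) for 0 <= z <= 1.
    This is a textbook formula, not the one proposed in the thesis."""
    return z / (1.0 + 0.28125 * z * z)


def atan_first_quadrant(y: float, x: float) -> float:
    """Angle in [0, pi/2] for x >= 0 and y >= 0 (not both zero)."""
    if y <= x:
        return atan_unit(y / x)
    # Use atan(y/x) = pi/2 - atan(x/y) so the argument stays in [0, 1].
    return math.pi / 2 - atan_unit(x / y)


def atan2_approx(y: float, x: float) -> float:
    """Four-quadrant angle in [-pi, pi], built from the first-quadrant core."""
    if x == 0.0 and y == 0.0:
        return 0.0  # convention chosen here for the undefined case
    angle = atan_first_quadrant(abs(y), abs(x))
    if x < 0:
        angle = math.pi - angle  # reflect into the second quadrant
    return -angle if y < 0 else angle  # mirror for the lower half-plane


print(atan2_approx(1.0, -1.0), math.atan2(1.0, -1.0))
```

Only the core function would change if a different first-quadrant approximation were substituted; the reflections stay the same.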

      Finally, a new technique for vehicle license plate localization in unconstrained environments is presented as a practical use case leveraging the text localization system described in this research. The proposed method comprises four stages: First, the text groups in the image are quickly localized employing the fast text detector previously introduced. Second, a neural network is used to (i) determine whether the detected text groups correspond to license plates or not, and (ii) estimate the bounding boxes of the license plates. Third, low-resolution patches are extracted from the image areas corresponding to the bounding boxes, and fed to another neural network that regresses the coefficients of tight parallelograms enclosing the license plates. Lastly, a third neural network refines the coefficients obtained in the previous step by focusing on the regions centered on the edges of the parallelograms, which are sampled from high-resolution patches. This work also introduces a new scale-space block that aggregates features computed at multiple scales from a Gaussian pyramid, and a novel loss function that directly optimizes the Intersection-over-Union metric between parallelograms. Experimental results showed that the proposed method was competitive with the state-of-the-art in terms of classification performance, and also achieved high segmentation accuracy. Furthermore, its low computational complexity makes it suitable for real-time applications, even on devices not equipped with a GPU.
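
The Intersection-over-Union metric between two parallelograms, which the loss function above targets, can be computed as in the sketch below. It is not the thesis's differentiable loss: it only evaluates the metric, and it assumes Sutherland-Hodgman clipping of convex polygons plus the shoelace area formula. The function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def signed_area(poly: List[Point]) -> float:
    """Shoelace formula; positive for counter-clockwise vertex order."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return total / 2.0


def to_ccw(poly: List[Point]) -> List[Point]:
    """Return the polygon with counter-clockwise vertex order."""
    return list(poly) if signed_area(poly) >= 0 else list(reversed(poly))


def clip_convex(subject: List[Point], clip: List[Point]) -> List[Point]:
    """Sutherland-Hodgman clipping: intersection of two convex polygons."""
    output = to_ccw(subject)
    clip = to_ccw(clip)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]

        def side(p: Point) -> float:
            # Positive when p lies to the left of edge a->b (inside for CCW).
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

        current, output = output, []
        for j in range(len(current)):
            p, q = current[j], current[(j + 1) % len(current)]
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)  # keep vertices inside or on the edge
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)  # where segment p->q crosses the edge
                output.append((p[0] + t * (q[0] - p[0]),
                               p[1] + t * (q[1] - p[1])))
        if not output:
            break  # the polygons do not overlap
    return output


def parallelogram_iou(a: List[Point], b: List[Point]) -> float:
    """IoU of two convex quadrilaterals given as lists of four vertices."""
    inter = abs(signed_area(clip_convex(a, b)))
    union = abs(signed_area(a)) + abs(signed_area(b)) - inter
    return inter / union if union > 0 else 0.0


box_a = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
box_b = [(1.0, 0.0), (3.0, 0.0), (3.0, 2.0), (1.0, 2.0)]
print(parallelogram_iou(box_a, box_b))
```

A training loss would need a differentiable formulation of the same quantity; this sketch is only suitable for evaluation.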

