OCR Challenges for a Latvian Pronunciation Dictionary

    1. [1] University of Latvia, Institute of Mathematics and Computer Science
  • Source: Human Language Technologies – The Baltic Perspective: Proceedings of the Ninth International Conference Baltic HLT 2020 / edited by Andrius Utka, Jurgita Vaičenonienė, Jolanta Kovalevskaitė, Danguolė Kalinauskaitė, 2024, ISBN 978-1-64368-116-0, pp. 199-206
  • Language: English
  • Abstract
    • This paper covers the development of a custom OCR solution, based on the open-source Tesseract engine, for digitizing a Latvian pronunciation dictionary in which pronunciation is described using a large variety of diacritic markings not supported by standard OCR solutions. We describe our efforts to train a model for these symbols without the support of preexisting dictionaries and illustrate how word error rate (WER) and character error rate (CER) are affected by changes in dataset content and size. We also provide an error analysis and postulate possible causes for common pitfalls. The resulting model achieved a CER of 2.07%, making it suitable for digitizing the whole dictionary in combination with heuristic post-processing and proofreading, resulting in a useful resource for further development of speech technology for Latvian.
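
The abstract reports results as character error rate (CER) and word error rate (WER). As a rough illustration of how such metrics are commonly computed, the sketch below uses Levenshtein edit distance divided by the reference length. It is not taken from the paper: the function names, the NFC normalization step, and the treatment of each Unicode code point as one character are assumptions made for this example, and the authors' evaluation may count base letters and combining diacritics differently.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from the reference
                            curr[j - 1] + 1,      # insertion into the reference
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _normalize(text):
    # Assumption: compare texts in NFC so a precomposed letter and its
    # decomposed equivalent (base letter + combining mark) are treated alike.
    return unicodedata.normalize("NFC", text)


def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    ref, hyp = _normalize(reference), _normalize(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref = _normalize(reference).split()
    hyp = _normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # A single lost macron counts as one character error and one word error.
    print(cer("māja", "maja"))
    print(wer("liela māja", "liela maja"))
```

Under this definition, one missed diacritic in a short word has a small effect on CER but marks the entire word as wrong for WER, which is why the two metrics can diverge sharply on diacritic-heavy text.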

