
Abstract of Mining historical texts for diachronic spelling variants

Filip Gralinski, Krzysztof Jassem

  • The paper describes a method for finding diachronic spelling variants in a corpus of historical and modern Polish texts. The procedure uses the Levenshtein distance together with a similarity measure derived from a Word2vec model. The method was applied to both words and sub-word units. A sample of the spelling variants was manually evaluated and compared against an existing morphological analyser for Polish historical texts. The resulting lists of spelling variants and spelling modernisation rules were used in a text modernisation tool, and their contribution was evaluated. The paper also presents an analogous method for finding spelling variants that result from erroneous OCR. The resulting lists of OCR variants and rules may be used to correct OCR output.
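
The general idea of pairing an edit-distance filter with an embedding-similarity filter can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the thresholds `max_dist` and `min_sim`, and the three-dimensional toy vectors are all assumptions made for the example. In practice the vectors would come from a Word2vec model trained on the corpus.

```python
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def variant_candidates(historical, modern, vectors, max_dist=2, min_sim=0.8):
    """Pair historical and modern forms that are close in spelling AND
    close in embedding space. Both thresholds are hypothetical."""
    pairs = []
    for h in historical:
        for m in modern:
            if h == m or levenshtein(h, m) > max_dist:
                continue
            if cosine(vectors[h], vectors[m]) >= min_sim:
                pairs.append((h, m))
    return pairs


# Made-up vectors for illustration only; real ones would be learned.
TOY_VECTORS = {
    "kray": [0.90, 0.10, 0.00],
    "kraj": [0.88, 0.12, 0.01],
    "kram": [0.00, 0.20, 0.95],
}

print(variant_candidates(["kray"], ["kraj", "kram"], TOY_VECTORS))
```

With these toy vectors, "kray" and "kram" are one edit apart but dissimilar in the embedding space, so only the pair ("kray", "kraj") passes both filters. The edit-distance test alone would accept both, which is the reason for adding a distributional check.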

