Filip Gralinski, Krzysztof Jassem
The paper describes a method for finding diachronic spelling variants in a corpus that consists of historical and modern Polish texts. The procedure applies the Levenshtein distance and the similarity measure determined with a Word2vec model. The method was applied for both words and sub-word units. A sample of spelling variants was manually evaluated and compared against an existing morphological analyser for Polish historical texts. The resulting lists of spelling variants and spelling modernisation rules were used in a text modernisation tool and their contribution was evaluated. The paper also presents an analogous method for finding spelling variants that result from erroneous OCR. The obtained lists of OCR variants and rules may serve for the correction of OCR output.
© 2001-2024 Fundación Dialnet · Todos los derechos reservados