Document Similarity by Word Clustering with Semantic Distance

Toshinori Deguchi; Naohiro Ishii

Toshinori Deguchi [1]; Naohiro Ishii [2]
[1] National Institute of Technology, Japan
[2] Advanced Institute of Industrial Technology, Japan
Published in: Hybrid Artificial Intelligent Systems: 16th International Conference, HAIS 2021, Bilbao, Spain, September 22–24, 2021, Proceedings / edited by Hugo Sanjurjo González, Iker Pastor López, Pablo García Bringas, Héctor Quintián Pardo, Emilio Santiago Corchado Rodríguez, 2021, ISBN 978-3-030-86271-8, pp. 3-14
Language: English
Abstract
In information retrieval, Latent Semantic Analysis (LSA) is a method for handling large, sparse document vectors. LSA reduces the dimension of document vectors by statistically deriving a set of topics related to the documents and terms. It therefore needs a certain number of words and takes no account of the semantic relations between words. In this paper, words are clustered using their semantic distances, and the dimension of document vectors is reduced to the number of word clusters. Word distances can be calculated using WordNet or Word2Vec. This method does not depend on the number of words or documents. For especially small documents, we use the dictionary definitions of words and calculate the similarities between documents. To demonstrate the method in standard cases, we classify the BBC dataset and evaluate the accuracy of the document clusters produced by LSA, word clustering with WordNet, and word clustering with Word2Vec.
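The idea in the abstract, reducing each document vector to counts over word clusters, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the two-dimensional `TOY_VECTORS` are hypothetical stand-ins for Word2Vec embeddings, cosine distance is assumed as the word distance, and average-linkage agglomerative clustering is an assumption, since the abstract does not name a clustering algorithm. All function names are made up for this example.

```python
import math
from collections import Counter

# Hypothetical 2-D word vectors standing in for real Word2Vec embeddings.
TOY_VECTORS = {
    "goal": (0.9, 0.1),
    "match": (0.8, 0.2),
    "team": (0.85, 0.15),
    "stock": (0.1, 0.9),
    "market": (0.2, 0.8),
    "bank": (0.15, 0.85),
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def cluster_words(vectors, n_clusters):
    """Average-linkage agglomerative clustering (an assumed choice).

    Returns a dict mapping each word to a cluster id in range(n_clusters).
    """
    clusters = [[word] for word in sorted(vectors)]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    return {word: k for k, members in enumerate(clusters) for word in members}


def document_vector(tokens, word_to_cluster, n_clusters):
    """Count how many tokens of a document fall into each word cluster."""
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    return [counts.get(k, 0) for k in range(n_clusters)]


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    word_to_cluster = cluster_words(TOY_VECTORS, n_clusters=2)
    sports = document_vector(["goal", "team", "match"], word_to_cluster, 2)
    finance = document_vector(["stock", "market"], word_to_cluster, 2)
    print(word_to_cluster)
    print(cosine_similarity(sports, finance))
```

With this layout the document vector has one dimension per word cluster rather than one per vocabulary word, so its size is set by the chosen number of clusters and not by the corpus. Swapping in WordNet-based distances would only require replacing `cosine_distance` and the vector table with a pairwise word-distance lookup.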