Abstract of Document Similarity by Word Clustering with Semantic Distance

In information retrieval, Latent Semantic Analysis (LSA) is a method for handling large, sparse document vectors. LSA reduces the dimension of document vectors by statistically deriving a set of topics related to the documents and terms. It therefore needs a sufficient number of words and takes no account of the semantic relations between words. In this paper, words are clustered by their semantic distances, and the dimension of the document vectors is reduced to the number of word clusters. Word distances can be calculated using WordNet or Word2Vec. This method does not depend on the number of words or documents. For especially small documents, we use the dictionary definitions of words to calculate the similarities between documents. To demonstrate the method in standard cases, we use the classification problem on the BBC dataset and evaluate the accuracies of document clusters produced by LSA, by word clustering with WordNet, and by word clustering with Word2Vec.
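
The idea of reducing document vectors to word-cluster counts can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors in TOY_VECTORS are hypothetical stand-ins for Word2Vec embeddings, the single-linkage threshold clustering is a simple substitute for whatever clustering the paper uses, and all function names are chosen here for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy embeddings standing in for Word2Vec vectors (not real data).
TOY_VECTORS = {
    "football": (1.0, 0.1),
    "soccer": (0.9, 0.2),
    "goal": (0.8, 0.3),
    "market": (0.1, 1.0),
    "stocks": (0.2, 0.9),
    "shares": (0.3, 0.8),
}


def cosine_distance(u, v):
    """Semantic distance between two word vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def cluster_words(vectors, threshold):
    """Single-linkage clustering: words closer than `threshold` are merged.

    Returns a dict mapping each word to an integer cluster id.
    """
    words = sorted(vectors)
    parent = {w: w for w in words}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine_distance(vectors[a], vectors[b]) < threshold:
                parent[find(a)] = find(b)

    roots = sorted({find(w) for w in words})
    return {w: roots.index(find(w)) for w in words}


def document_vector(tokens, word_to_cluster, n_clusters):
    """Count how many tokens of a document fall into each word cluster."""
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    return [counts.get(c, 0) for c in range(n_clusters)]


def cosine_similarity(u, v):
    """Cosine similarity of two document vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


word_to_cluster = cluster_words(TOY_VECTORS, threshold=0.1)
n_clusters = len(set(word_to_cluster.values()))

doc_a = document_vector(["football", "goal"], word_to_cluster, n_clusters)
doc_b = document_vector(["soccer"], word_to_cluster, n_clusters)
doc_c = document_vector(["stocks", "shares"], word_to_cluster, n_clusters)

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

In this toy setup the first two documents share no word, yet they score as similar because their words fall into the same cluster, which is the effect a purely statistical term-document method cannot capture on very small inputs. The document vectors have one dimension per word cluster rather than one per vocabulary word.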
