A major challenge in text classification is performing classification on large-scale, high-dimensional text corpora in the presence of imbalanced class distributions and many irrelevant or noisy term features. A number of techniques have been proposed to handle this challenge with varying degrees of success. In this paper, by combining the strengths of two widely used text classification techniques, K-Nearest-Neighbor (KNN) and centroid-based (Centroid) classifiers, we propose a scalable and effective flat classifier, called CenKNN, to cope with this challenge. CenKNN projects high-dimensional documents (often hundreds of thousands of dimensions) into a low-dimensional space (normally a few dozen dimensions) spanned by the class centroids, and then uses the k-d tree structure to find the K nearest neighbors efficiently. Owing to the strong representation power of class centroids, CenKNN overcomes two issues that affect existing KNN text classifiers: sensitivity to imbalanced class distributions and to irrelevant or noisy term features. By working on the projected low-dimensional data, CenKNN substantially reduces the expensive computation required by KNN. CenKNN also works better than Centroid, since it uses all the class centroids to define similarity and handles complex data well, i.e., non-linearly separable data and data with local patterns within each class. A series of experiments on both English and Chinese, benchmark and synthetic corpora demonstrates that although CenKNN works in a significantly lower-dimensional space, it performs substantially better than KNN and its five variants, as well as existing scalable classifiers, including Centroid and Rocchio. CenKNN is also empirically preferable to another well-known classifier, support vector machines, on highly imbalanced corpora with a small number of classes.
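The pipeline the abstract describes can be sketched as follows: compute one centroid per class, project every document onto its cosine similarities with those centroids (reducing the dimensionality to the number of classes), build a k-d tree over the projected training documents, and classify a test document by majority vote among its K nearest neighbors. This is a minimal illustration of the idea only, not the authors' implementation; the cosine projection, the choice of `scipy.spatial.cKDTree`, and the majority-vote tie-breaking are assumptions for the sketch.

```python
import numpy as np
from scipy.spatial import cKDTree


def cenknn_fit_predict(X_train, y_train, X_test, K=3):
    """Sketch of the CenKNN idea: KNN in a class-centroid projection space.

    X_train, X_test: dense (n_docs, n_terms) document-term matrices.
    y_train: integer class labels for the training documents.
    """
    classes = np.unique(y_train)
    # One centroid per class: the mean of that class's document vectors.
    centroids = np.vstack([X_train[y_train == c].mean(axis=0) for c in classes])

    def cos_project(X, C):
        # Row-normalize documents and centroids, then take dot products,
        # so each document becomes a vector of cosine similarities to
        # the class centroids -- shape (n_docs, n_classes).
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
        return Xn @ Cn.T

    Z_train = cos_project(X_train, centroids)
    Z_test = cos_project(X_test, centroids)

    # k-d tree on the low-dimensional projected training data.
    tree = cKDTree(Z_train)
    _, idx = tree.query(Z_test, k=K)
    idx = np.atleast_2d(idx)

    # Majority vote over the K neighbors' labels.
    preds = []
    for row in idx:
        labels, counts = np.unique(y_train[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)


# Toy corpus: two classes occupying different halves of a 6-term vocabulary.
rng = np.random.default_rng(0)
X0 = rng.random((20, 6)) + np.array([2, 2, 2, 0, 0, 0])
X1 = rng.random((20, 6)) + np.array([0, 0, 0, 2, 2, 2])
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.array([[3, 3, 3, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 3, 3, 3]])
preds = cenknn_fit_predict(X_train, y_train, X_test, K=3)
```

Note that the k-d tree becomes worthwhile precisely because the projected space has only as many dimensions as there are classes; in the original term space, tree-based neighbor search degrades to brute force.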