Clasificación de noticias criminales basada en procesamiento del lenguaje natural y algoritmos de aprendizaje automático

Camilo Ernesto Sarmiento Torres ^[1] ; Néstor Diaz ^[1] ; Rubiel Vargas Cañas ^[1]
1. [1] Universidad del Cauca, Colombia
Source: RISTI: Revista Ibérica de Sistemas e Tecnologias de Informação, e-ISSN 1646-9895, Extra No. 38, 2020, pp. 117-129
Language: Spanish
Abstract
- English
  In this work, a system was developed to classify criminal news from different digital press media, supported by natural language processing techniques and machine learning algorithms. Initially, a criminal news data set was constructed in which eight types of crime were identified. Subsequently, the documents were pre-processed: stop words were removed, lemmatization was applied, and the documents were represented with the bag-of-words model, with each term weighted by its term frequency-inverse document frequency (tf-idf) coefficient.
  
  In addition, eight word dictionaries were built, one for each type of crime, and used to estimate the performance of five supervised classification algorithms. The random forest algorithm obtained the best performance in the test performed, with 97.22% accuracy, 98.36% precision, 98.35% sensitivity, an F1 score of 98.32%, and a Matthews correlation coefficient (MCC) of 0.97.
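
The weighting and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The stop-word list, the toy documents, and the function names `tokenize`, `tf_idf`, and `binary_metrics` are hypothetical; lemmatization is omitted; the tf-idf variant shown (relative term frequency times `log(N / df)`) is one common definition and may differ from the one used in the paper; and the metrics are written for the two-class case, whereas the paper distinguishes eight types of crime.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "in", "of", "was"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no lemmatization)."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {term: (count / total) * math.log(n_docs / df[term])
             for term, count in counts.items()}
        )
    return weights


def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, sensitivity, F1, and MCC from confusion counts."""
    def safe_div(num, den):
        # Return 0.0 when a denominator is zero instead of raising.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
        "mcc": safe_div(tp * tn - fp * fn, mcc_den),
    }


# Toy documents, invented for this example.
docs = ["the robbery in the bank", "robbery of a car", "homicide in the city"]
weights = tf_idf(docs)
metrics = binary_metrics(tp=45, fp=5, fn=5, tn=45)
```

In the toy corpus, "bank" occurs in one document and "robbery" in two, so "bank" receives the higher weight in the first document. The MCC ranges from -1 to 1 and is therefore reported as a plain coefficient rather than as a percentage.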