Sampling Techniques to Overcome Class Imbalance in a Cyberbullying Context

David Colton; Markus Hofmann

Colton, David [1]; Hofmann, Markus [2]
1. [1] IBM Research – Thomas J. Watson Research Center, Town of Yorktown, United States
2. [2] Technological University Dublin
Source: Journal of Computer-Assisted Linguistic Research, e-ISSN 2530-9455, No. 3, 2019, pp. 21-40
Language: English
Abstract
- The majority of datasets suffer from class imbalance, where samples of a dominant class significantly outnumber the samples available for the minority class that is to be detected. Prediction and classification machine learning models work best when there are roughly equal numbers of samples for each class. This paper explores sampling techniques that can be used to overcome this class imbalance problem in a cyberbullying context. A newly classified cyberbullying dataset, accompanied by detailed descriptions of the criteria used in its classification, was used to examine the feasibility of applying text mining techniques to automate the detection of cyberbullying text when the dataset shows a significant class imbalance between the positive (cyberbullying) samples and the negative (not cyberbullying) samples. We investigate whether oversampling the minority positive class or undersampling the majority negative class affects the performance of a prediction model. A compromise solution, in which the positive class is partially oversampled and the negative class is partially undersampled, is also examined. Although not strictly a class imbalance solution, sampling using the most frequently observed features was also explored.
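
The three sampling strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not say how the authors drew their samples, so this assumes plain random sampling: undersampling draws without replacement, and oversampling keeps every original item and tops up with random duplicates. The names `rebalance` and `_fit_to_size`, the toy data, and the target sizes are hypothetical and do not come from the paper.

```python
import random


def _fit_to_size(items, target, rng):
    """Shrink or grow a list of samples to exactly `target` items.

    Assumes `items` is non-empty whenever it has to grow.
    """
    if target <= len(items):
        # Undersample: draw without replacement.
        return rng.sample(items, target)
    # Oversample: keep every original, then add random duplicates.
    return items + rng.choices(items, k=target - len(items))


def rebalance(texts, labels, n_pos, n_neg, seed=0):
    """Return shuffled (text, label) pairs with the requested class sizes.

    Labels are assumed to be 1 (positive, cyberbullying) or 0 (negative).
    """
    rng = random.Random(seed)
    pos = [t for t, y in zip(texts, labels) if y == 1]
    neg = [t for t, y in zip(texts, labels) if y == 0]
    data = [(t, 1) for t in _fit_to_size(pos, n_pos, rng)]
    data += [(t, 0) for t in _fit_to_size(neg, n_neg, rng)]
    rng.shuffle(data)
    return data


# Toy imbalanced data (illustrative only): 10 positive, 90 negative posts.
texts = [f"post {i}" for i in range(100)]
labels = [1] * 10 + [0] * 90

oversampled = rebalance(texts, labels, n_pos=90, n_neg=90)   # grow minority
undersampled = rebalance(texts, labels, n_pos=10, n_neg=10)  # shrink majority
hybrid = rebalance(texts, labels, n_pos=30, n_neg=30)        # compromise
```

In practice, resampling of this kind is normally applied only to the training portion of the data, so that evaluation still reflects the original class distribution.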