Madrid, Spain
Leioa, Spain
This paper introduces innovations in both data augmentation and deep neural network architecture for speech emotion recognition (SER). The novel architecture combines a series of convolutional layers with a final layer of long short-term memory (LSTM) cells to determine emotions in audio signals. The audio signals are preprocessed to generate mel spectrograms, which serve as inputs to the deep neural network. The paper also proposes a selected set of data augmentation techniques that reduce network overfitting. We achieve an average recognition accuracy of 86.44% on publicly available databases, outperforming state-of-the-art methods.
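As an illustration of the preprocessing step, the following minimal sketch shows how a mel spectrogram can be computed from a raw signal using only the Python standard library. It is not the authors' pipeline: the HTK-style mel formula, the Hann window, the naive DFT, and the parameter values (`n_fft`, `hop`, `n_mels`, the 8 kHz test tone) are all assumptions chosen for readability, since the abstract does not specify them.

```python
import cmath
import math


def hz_to_mel(hz):
    # HTK-style mel scale (an assumed convention; the abstract does not name one).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters, evenly spaced on the mel scale, over n_fft // 2 + 1 bins."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 edge frequencies: each filter spans three consecutive edges.
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def power_spectrum(frame):
    """Naive DFT power spectrum of one Hann-windowed frame (O(n^2), for clarity only)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_spectrogram(signal, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Return a list of frames, each a list of n_mels log-mel energies."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        power = power_spectrum(signal[start:start + n_fft])
        mel = [sum(w * p for w, p in zip(row, power)) for row in bank]
        # Small offset avoids log(0) on silent bands.
        frames.append([math.log(e + 1e-10) for e in mel])
    return frames


# Hypothetical input: a 1 kHz sine tone sampled at 8 kHz.
sr = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / sr) for t in range(256)]
spec = mel_spectrogram(tone, sr)
```

The result is a time-by-frequency grid of log energies, which is the kind of two-dimensional input a convolutional front end can consume. In practice an FFT-based routine from a signal processing library would replace the naive DFT used here.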