Most existing Speech Emotion Recognition (SER)
systems rely on turn-wise processing, which recognizes
emotions only from complete utterances, and on an overly
complicated pipeline marred by many preprocessing steps and
hand-engineered features. To overcome both drawbacks, we propose a
real-time SER system based on end-to-end deep learning.
Namely, a Deep Neural Network (DNN) that recognizes emotions
from a one-second frame of raw speech spectrograms is presented
and investigated. This is achievable due to a deep hierarchical
architecture, data augmentation, and sensible regularization.
Promising results are reported on two databases: the
eNTERFACE database and the Surrey Audio-Visual Expressed
Emotion (SAVEE) database.
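To illustrate the kind of input the abstract describes, the following is a minimal sketch of turning a one-second frame of raw speech into a log-magnitude spectrogram. The 16 kHz sampling rate and the STFT parameters (`nperseg`, `noverlap`) are assumptions for illustration, not values taken from the paper.

```python
import numpy as np
from scipy.signal import spectrogram

SAMPLE_RATE = 16000   # assumed sampling rate; not specified in the abstract
FRAME_SECONDS = 1.0   # the one-second frame the abstract describes

def frame_to_spectrogram(audio):
    """Convert a one-second raw-audio frame into a log-magnitude spectrogram.

    Assumed STFT settings: 25 ms windows (400 samples at 16 kHz)
    with 15 ms hop (160 samples), i.e. 240 samples of overlap.
    """
    f, t, sxx = spectrogram(audio, fs=SAMPLE_RATE, nperseg=400, noverlap=240)
    # Log compression stabilizes the large dynamic range of speech energy.
    return np.log(sxx + 1e-10)

# Toy usage: a one-second frame of white noise standing in for speech.
frame = np.random.randn(int(SAMPLE_RATE * FRAME_SECONDS))
spec = frame_to_spectrogram(frame)  # 2-D time-frequency image fed to the DNN
```

Such a fixed-size time-frequency image can then be fed directly to a deep hierarchical network, avoiding hand-engineered features.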