arXiv 2506.15754
Explainable speech emotion recognition through attentive pooling: insights from attention-based temporal localization
By Tahitoa Leygue, Astrid Sabourin, et al.
Published 2025-06-18
State-of-the-art transformer models for Speech Emotion Recognition (SER) rely on temporal feature aggregation, yet advanced pooling methods remain underexplored. We systematically benchmark pooling strategies, including Multi-Query Multi-Head Attentive Statistics Pooling, which achieves a 3.5 percentage point macro F1 gain over average pooling. Attention analysis shows 15 percent of frames capture 80 percent of emot…
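To make the contrast between average pooling and attention-based pooling concrete, here is a minimal sketch of a single-head attentive statistics pooling layer. It is an illustration under stated assumptions, not the paper's Multi-Query Multi-Head variant: the function names, the dot-product scoring against one fixed `query` vector, and the toy inputs are all hypothetical and chosen for readability.

```python
import math


def average_pool(frames):
    """Unweighted mean over time of a list of equal-length feature vectors."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pool(frames, query):
    """Single-head attentive statistics pooling (illustrative assumption).

    Each frame is scored by a dot product with `query`; the softmax of the
    scores gives per-frame weights. The output concatenates the weighted
    mean and the weighted standard deviation of the frames.
    """
    scores = [sum(q * x for q, x in zip(query, f)) for f in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames))
        for d in range(dim)
    ]
    std = [math.sqrt(max(v, 0.0)) for v in var]  # clamp tiny negative round-off
    return mean + std, weights
```

With an all-zero `query` the weights are uniform, so the mean half of the output matches `average_pool`. A non-zero query shifts weight toward the frames it scores highly, and the returned weights are the quantity an attention analysis would inspect.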