arXiv 2506.15754

Explainable speech emotion recognition through attentive pooling: insights from attention-based temporal localization

By Tahitoa Leygue, Astrid Sabourin, et al.

Published 2025-06-18

State-of-the-art transformer models for Speech Emotion Recognition (SER) rely on temporal feature aggregation, yet advanced pooling methods remain underexplored. We systematically benchmark pooling strategies, including Multi-Query Multi-Head Attentive Statistics Pooling, which achieves a 3.5 percentage point macro F1 gain over average pooling. Attention analysis shows 15 percent of frames capture 80 percent of emot…
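To make the contrast between average pooling and attention-based pooling concrete, here is a minimal sketch in plain Python. It is a simplified single-query, single-head form of attentive statistics pooling, not the paper's Multi-Query Multi-Head variant. The function names, the dot-product scoring, and the toy `frames` and `query` values are all assumptions chosen for illustration.

```python
import math


def average_pool(frames):
    """Unweighted mean over time of T frame vectors, each of length D."""
    num_frames = len(frames)
    return [sum(col) / num_frames for col in zip(*frames)]


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pool(frames, query):
    """Single-query attentive statistics pooling (illustrative assumption).

    Each frame is scored by its dot product with `query`; the softmax of
    those scores gives one weight per frame. The pooled vector is the
    attention-weighted mean concatenated with the attention-weighted
    standard deviation, so its length is 2 * D.
    """
    scores = [sum(q * x for q, x in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    dims = list(zip(*frames))  # one tuple of T values per feature dimension
    mean = [sum(w * x for w, x in zip(weights, col)) for col in dims]
    var = [
        sum(w * (x - mu) ** 2 for w, x in zip(weights, col))
        for col, mu in zip(dims, mean)
    ]
    std = [math.sqrt(max(v, 0.0)) for v in var]
    return weights, mean + std


# Toy input: 3 frames with 2 features each (made-up values).
frames = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
query = [1.0, 0.0]  # hypothetical stand-in for a learned query vector

weights, pooled = attentive_stats_pool(frames, query)
baseline = average_pool(frames)
```

With an all-zero query the attention weights become uniform, so the weighted mean reduces to the average-pooling baseline. The per-frame `weights` are the quantity an attention analysis would inspect to see which frames dominate the pooled representation.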
