Joint spatial and scale attention network for multi-view facial expression recognition

Abstract

Multi-view facial expression recognition (FER) is a challenging task because the appearance of an expression varies greatly with head pose. To alleviate the influence of pose, recently developed methods perform pose normalization, learn pose-invariant features, or learn pose-specific FER classifiers. However, these methods usually rely on a prerequisite pose estimator or expressive-region detector that is trained independently of the subsequent expression analysis. Different from existing methods, we propose a joint spatial and scale attention network (SSA-Net) that localizes the proper regions for simultaneous head pose estimation (HPE) and FER. Specifically, SSA-Net uses a spatial attention mechanism to discover the regions most relevant to the facial expression at hierarchical scales, and a scale attention mechanism to select the most informative scales, thereby learning representations that are jointly pose-invariant and expression-discriminative. We then employ a dynamically constrained multi-task learning mechanism with a carefully designed constraint regularization to train the network properly and adaptively, optimizing the representations for accurate multi-view FER. The effectiveness of the proposed SSA-Net is validated on three multi-view datasets (BU-3DFE, Multi-PIE, and KDEF) and three in-the-wild FER datasets (AffectNet, SFEW, and FER2013). Extensive experiments demonstrate that the proposed framework outperforms existing state-of-the-art methods under both within-dataset and cross-dataset settings, with relative accuracy gains of 2.36%, 1.33%, 3.11%, 2.84%, 15.7%, and 7.57% on the six datasets, respectively.
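To make the described architecture concrete, below is a minimal PyTorch sketch of the spatial-then-scale attention idea with joint FER and HPE heads. All module names, feature dimensions, the attention parameterizations, and the stand-in multi-scale features are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of joint spatial/scale attention with two task heads.
# Everything here (names, dims, attention forms) is a hypothetical reading
# of the abstract, not the paper's actual SSA-Net.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Weights each spatial location of a feature map, then pools it."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                       # x: (B, C, H, W)
        attn = torch.sigmoid(self.score(x))     # (B, 1, H, W) location weights
        pooled = (x * attn).flatten(2).sum(-1)  # attention-weighted sum: (B, C)
        return pooled / attn.flatten(2).sum(-1).clamp_min(1e-6)

class SSANetSketch(nn.Module):
    """Spatial attention at several scales, scale attention to fuse them,
    and two heads for the joint FER / head-pose multi-task objective."""
    def __init__(self, channels=(64, 128, 256), dim=128,
                 n_expressions=7, n_poses=5):
        super().__init__()
        self.spatial = nn.ModuleList(SpatialAttention(c) for c in channels)
        self.proj = nn.ModuleList(nn.Linear(c, dim) for c in channels)
        self.scale_score = nn.Linear(dim, 1)    # scores each scale's descriptor
        self.fer_head = nn.Linear(dim, n_expressions)
        self.hpe_head = nn.Linear(dim, n_poses)

    def forward(self, feats):                   # feats: list of (B, C_i, H_i, W_i)
        descs = torch.stack([p(sa(f)) for sa, p, f
                             in zip(self.spatial, self.proj, feats)], dim=1)
        w = F.softmax(self.scale_score(descs), dim=1)  # (B, S, 1) scale attention
        fused = (w * descs).sum(dim=1)                 # (B, dim) fused descriptor
        return self.fer_head(fused), self.hpe_head(fused)

# Toy forward pass; random tensors stand in for a backbone's multi-scale features.
feats = [torch.randn(2, 64, 28, 28),
         torch.randn(2, 128, 14, 14),
         torch.randn(2, 256, 7, 7)]
fer_logits, pose_logits = SSANetSketch()(feats)
print(fer_logits.shape, pose_logits.shape)  # torch.Size([2, 7]) torch.Size([2, 5])
```

In a sketch like this, the multi-task training the abstract mentions would amount to combining cross-entropy losses from both heads; the dynamically constrained weighting of those losses is specific to the paper and is not reproduced here.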

Publication
Pattern Recognition 139 (2023): 109496
Jiabei Zeng
Associate Professor
Shiguang Shan
Professor