M3F: Multi-Modal Continuous Valence-Arousal Estimation in the Wild

Abstract

In this paper, we propose a multi-modal multi-feature (M3F) approach for in-the-wild valence-arousal estimation. In the proposed M3F framework, we fuse visual features from videos and acoustic features from the audio tracks to estimate valence and arousal. We follow a CNN-RNN paradigm, in which spatio-temporal visual features are extracted with a 3D convolutional network and/or a pretrained 2D convolutional network, and temporal dynamics are modeled with a bidirectional recurrent neural network. We evaluate the M3F framework on the validation set provided by the Affective Behavior Analysis in-the-wild (ABAW) Challenge, held in conjunction with the IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2020, and it significantly outperforms the baseline method.
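To make the CNN-RNN paradigm concrete, below is a minimal PyTorch sketch of a multi-modal valence-arousal model in the spirit of the abstract, not the authors' implementation. The choice of torchvision's r3d_18 as the 3D convolutional backbone, the acoustic feature dimension, the GRU hidden size, and the feature-level concatenation fusion are all illustrative assumptions.

```python
# A minimal sketch (not the authors' code): a 3D CNN extracts per-clip
# visual features, which are concatenated with acoustic features and fed
# to a bidirectional GRU that regresses valence and arousal per step.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18


class ValenceArousalModel(nn.Module):
    def __init__(self, acoustic_dim=40, hidden_dim=256):  # dims are assumptions
        super().__init__()
        # 3D CNN backbone for spatio-temporal visual features.
        backbone = r3d_18(pretrained=True)
        self.visual_cnn = nn.Sequential(*list(backbone.children())[:-1])
        visual_dim = backbone.fc.in_features  # 512 for r3d_18
        # Bidirectional RNN over the fused per-clip feature sequence.
        self.rnn = nn.GRU(
            input_size=visual_dim + acoustic_dim,
            hidden_size=hidden_dim,
            bidirectional=True,
            batch_first=True,
        )
        # Two continuous outputs per step: valence and arousal.
        self.head = nn.Linear(2 * hidden_dim, 2)

    def forward(self, clips, acoustic):
        # clips: (batch, seq, 3, T, H, W) video clips;
        # acoustic: (batch, seq, acoustic_dim) audio features.
        b, s = clips.shape[:2]
        v = self.visual_cnn(clips.flatten(0, 1))  # (b*s, 512, 1, 1, 1)
        v = v.flatten(1).view(b, s, -1)           # (b, s, 512)
        fused = torch.cat([v, acoustic], dim=-1)  # feature-level fusion
        out, _ = self.rnn(fused)
        return torch.tanh(self.head(out))         # values in [-1, 1]
```

A pretrained 2D CNN variant would replace the 3D backbone with per-frame features pooled over each clip; the tanh keeps predictions in the [-1, 1] range used for continuous valence-arousal annotation.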

Publication
Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2020
Jiabei Zeng