In this paper, we propose a multi-modal multi-feature (M3F) approach for in-the-wild valence-arousal estimation. In the proposed M3F framework, we fuse both visual features from videos and acoustic features from the audio tracks to estimate valence and arousal. We follow a CNN-RNN paradigm: spatio-temporal visual features are extracted with a 3D convolutional network and/or a pretrained 2D convolutional network, and the temporal dynamics of the fused features are modeled with a bidirectional recurrent neural network. We evaluate the M3F framework on the validation set provided by the Affective Behavior Analysis in-the-wild (ABAW) Challenge, held in conjunction with the IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2020, and it significantly outperforms the baseline method.
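To make the fusion paradigm concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes visual and acoustic features are already extracted per time step, fuses them by concatenation, and regresses per-frame valence and arousal through a bidirectional GRU. The class name `M3FSketch`, the feature dimensions, and the choice of GRU are illustrative assumptions; the paper's actual backbones and fusion details may differ.

```python
# A hedged sketch of the CNN-RNN fusion paradigm described above.
# Dimensions and layer choices are assumptions, not the paper's values.
import torch
import torch.nn as nn

class M3FSketch(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=128, hidden_dim=256):
        super().__init__()
        # Visual and acoustic features are assumed pre-extracted per frame
        # (e.g., by a 3D CNN and/or pretrained 2D CNN, and an audio encoder)
        # and fused here by simple concatenation.
        self.rnn = nn.GRU(
            input_size=visual_dim + audio_dim,
            hidden_size=hidden_dim,
            num_layers=1,
            batch_first=True,
            bidirectional=True,  # bidirectional temporal modeling
        )
        # Regress valence and arousal at every time step.
        self.head = nn.Linear(2 * hidden_dim, 2)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (batch, time, visual_dim)
        # audio_feats:  (batch, time, audio_dim)
        fused = torch.cat([visual_feats, audio_feats], dim=-1)
        out, _ = self.rnn(fused)
        # tanh keeps predictions in the valence-arousal range [-1, 1]
        return torch.tanh(self.head(out))

# Example: 4 clips, 16 time steps each
model = M3FSketch()
v = torch.randn(4, 16, 512)
a = torch.randn(4, 16, 128)
va = model(v, a)  # shape (4, 16, 2): per-frame (valence, arousal)
```

Concatenation is only one plausible fusion strategy; the framework's "multi-feature" aspect could equally be realized by stacking features from several visual backbones before the recurrent layer.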