The anatomically based Facial Action Coding System (FACS) defines a unique set of atomic, nonoverlapping facial muscle actions called action units (AUs), which can accurately characterize facial expressions. AUs correspond to muscular activities that produce momentary changes in facial appearance, and combinations of AUs can represent any facial expression. As a multilabel classification problem, AU detection suffers from insufficient AU annotations, varied head poses, individual differences, and imbalance among different AUs.

To facilitate further development of the field, this article systematically summarizes representative AU detection methods proposed since 2016. According to their input data, these methods are categorized into image-based, video-based, and other-modality-based approaches. We also discuss how AU detection methods can deal with partial supervision, given the large scale of unlabeled data.

Image-based methods include approaches that learn local facial representations, exploit AU relations, or adopt multitask and weakly supervised learning. Handcrafted or automatically learned local facial representations can capture the local deformations caused by active AUs; however, the former cannot represent different AUs with adaptive local regions, while the latter suffers from insufficient training data. Approaches that exploit AU relations utilize the prior knowledge that certain AUs tend to co-occur while others are mutually exclusive. Such methods adopt either Bayesian networks or graph neural networks to model AU relations manually inferred from the annotations of specific datasets; this inflexibility prevents them from generalizing in cross-dataset evaluation. Multitask AU detection methods are inspired by two observations: facial shape, as represented by facial landmarks, is helpful for AU detection, and facial deformations caused by active AUs in turn affect the distribution of landmark locations. In addition to detecting facial AUs, such methods typically estimate facial landmarks or recognize facial expressions in a multitask manner; other facial emotion analysis tasks, such as emotional dimension estimation, can also be incorporated into the multitask setting.

Video-based methods are categorized into temporal representation learning and self-supervised learning strategies. Temporal representation learning methods commonly adopt long short-term memory (LSTM) networks or 3D convolutional neural networks (3D-CNNs) to model temporal information; others utilize optical flow between frames to detect facial AUs. Several self-supervised approaches have recently exploited the prior knowledge that facial actions, i.e., movements of facial muscles between video frames, can serve as a self-supervisory signal. Such video-based weakly supervised AU detection methods are reasonable and explainable and can effectively alleviate the problem of insufficient AU annotations; however, they rely on massive amounts of unlabeled video data during training and cannot perform AU detection in an end-to-end manner.

We also review methods that exploit point clouds or thermal images for AU detection, which can alleviate the influence of head pose or illumination. Finally, we compare representative methods, analyze their advantages and drawbacks, and summarize the challenges and potential directions of AU detection.
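To make the relation-aware, multilabel formulation above concrete, the following minimal PyTorch sketch refines per-AU features with one graph-convolution step over a prior AU co-occurrence adjacency matrix and trains with a weighted binary cross-entropy loss to counter AU imbalance. It is illustrative rather than any specific surveyed method; the ResNet-18 backbone, the placeholder adjacency, the 64-dimensional node features, and the positive-class weights are all assumptions.

```python
# Minimal sketch: relation-aware multilabel AU detection (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models

NUM_AUS = 12
FEAT_DIM = 512

class RelationAwareAUDetector(nn.Module):
    def __init__(self, adjacency: torch.Tensor):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()              # global 512-d face feature
        self.backbone = backbone
        # One projection per AU yields per-AU node features.
        self.au_proj = nn.Linear(FEAT_DIM, NUM_AUS * 64)
        # Row-normalized prior adjacency (e.g., from co-occurrence statistics).
        self.register_buffer("adj", adjacency / adjacency.sum(1, keepdim=True))
        self.gcn = nn.Linear(64, 64)             # one graph-convolution step
        self.cls = nn.Linear(64, 1)              # per-node sigmoid classifier

    def forward(self, images):                   # images: (B, 3, 224, 224)
        feat = self.backbone(images)             # (B, 512)
        nodes = self.au_proj(feat).view(-1, NUM_AUS, 64)
        # Propagate evidence between related AUs: H' = relu(A H W).
        nodes = torch.relu(self.gcn(self.adj @ nodes))
        return self.cls(nodes).squeeze(-1)       # logits, (B, NUM_AUS)

# Weighted BCE counters AU imbalance: rare AUs get larger positive weights.
adj = torch.eye(NUM_AUS) + 0.1                   # placeholder prior adjacency
model = RelationAwareAUDetector(adj)
pos_weight = torch.full((NUM_AUS,), 4.0)         # e.g., neg/pos ratio per AU
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = model(torch.randn(2, 3, 224, 224))
loss = criterion(logits, torch.randint(0, 2, (2, NUM_AUS)).float())
```

In practice, the prior adjacency would be estimated from AU co-occurrence statistics of the training annotations, which is exactly why such fixed priors transfer poorly across datasets.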
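Similarly, for the video-based temporal representation strategy, a common generic design pairs a frame-level CNN with a recurrent network so that per-frame AU predictions can draw on temporal context. The sketch below assumes a ResNet-18 feature extractor and a bidirectional LSTM; it is a schematic example, not a specific architecture from the survey.

```python
# Minimal sketch: CNN + LSTM temporal modeling for per-frame AU detection.
import torch
import torch.nn as nn
import torchvision.models as models

class TemporalAUDetector(nn.Module):
    def __init__(self, num_aus: int = 12, hidden: int = 256):
        super().__init__()
        cnn = models.resnet18(weights=None)
        cnn.fc = nn.Identity()                         # 512-d per-frame feature
        self.cnn = cnn
        self.lstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_aus)     # per-frame AU logits

    def forward(self, clips):                          # (B, T, 3, 224, 224)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        feats, _ = self.lstm(feats)                    # add temporal context
        return self.head(feats)                        # (B, T, num_aus)

logits = TemporalAUDetector()(torch.randn(2, 8, 3, 224, 224))  # (2, 8, 12)
```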
We conclude that methods capable of utilizing weakly annotated or unlabeled data are an important direction for future research. Such methods should be carefully designed according to prior knowledge of AUs to reduce the demand for large amounts of labeled data.
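As one simplified illustration of this direction, temporal structure in unlabeled face videos already provides a free training signal. The sketch below pretrains an encoder by classifying whether two frames of the same video appear in their original order, so the cues it must learn are precisely the facial movements between frames; the pretext task, network, and sizes are generic assumptions rather than a specific method reviewed here. The pretrained encoder can then be fine-tuned for AU detection with far fewer labels.

```python
# Minimal sketch: self-supervised pretraining from unlabeled face videos.
# Pretext task (generic, illustrative): given two frames of the same video,
# predict whether they appear in their original temporal order. Movements of
# facial muscles between the frames are the only cue, so the encoder is
# pushed to represent facial actions without any AU labels.
import torch
import torch.nn as nn
import torchvision.models as models

encoder = models.resnet18(weights=None)
encoder.fc = nn.Identity()                         # 512-d frame embedding
order_head = nn.Linear(2 * 512, 2)                 # in-order vs. swapped
criterion = nn.CrossEntropyLoss()

def pretext_step(frame_a, frame_b):
    """frame_a precedes frame_b in the source video; randomly swap pairs."""
    b = frame_a.size(0)
    swap = torch.rand(b) < 0.5                     # which pairs to reverse
    first = torch.where(swap[:, None, None, None], frame_b, frame_a)
    second = torch.where(swap[:, None, None, None], frame_a, frame_b)
    emb = torch.cat([encoder(first), encoder(second)], dim=1)
    return criterion(order_head(emb), swap.long())  # label 1 = swapped pair

loss = pretext_step(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224))
loss.backward()
# After pretraining, `encoder` is fine-tuned on a small labeled AU set.
```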