A Simple Pyramid Vision Transformer for Human Pose Estimation in Crowds
Multi-person pose estimation is essential for several computer vision tasks related to motion analysis and anomaly detection. Impressive and continual progress in this field has led to applications in uncooperative real-world scenarios, such as detecting anomalous and dangerous behavior by individuals or groups within dense crowds in public places. However, reliably detecting poses within crowds in surveillance footage remains very challenging due to diverse occlusions, illumination changes, and long processing times. In this work, we present a simple Pyramid Vision Transformer for human pose estimation that achieves competitive results on the COCO Keypoints 2017 dataset while requiring significantly fewer parameters and thus less computation time. We also report a significant improvement over the baselines on the more crowded OCHuman, PoseTrack 2018, and CrowdPose datasets.