Action recognition in still images by learning spatial interest regions from videos
A common approach to human action recognition from still images is to compute local descriptors for classification. Typically, these descriptors are computed in the vicinity of key points that result either from running a key point detector or from dense sampling of pixel coordinates. Such key points are not a priori related to human activities and thus might not be very informative with regard to action recognition. Several recent approaches, on the other hand, are based on learning person-object interactions and saliency maps in images. In this article, we investigate the possibility and applicability of identifying action-specific points or regions of interest in still images based on information extracted from video data. In particular, we propose a novel method for extracting spatial interest regions in which we apply non-negative matrix factorization to optical flow fields extracted from videos. The resulting basis flows are found to indicate image regions that are specific to certain actions and therefore allow for an informed sampling of key points for feature extraction. We thus present a generative model for action recognition in still images that allows for characterizing joint distributions of regions of interest, local image features (visual words), and human actions. Experimental evaluation shows that (a) our approach is able to extract interest regions that are highly correlated with those body parts most relevant for different actions and (b) our generative model achieves high accuracy in action classification.
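To make the central idea concrete, the following is a minimal sketch of factorizing flow data with non-negative matrix factorization and then sampling key points from a resulting basis flow. It is an illustration under stated assumptions, not the implementation evaluated in this article: the use of per-pixel flow magnitudes, the multiplicative update rule, the synthetic 8x8 data, and the names `nmf` and `sample_keypoints` are all hypothetical choices made for this example.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix V (pixels x frames) as V ~ W @ H.

    Uses standard multiplicative updates for the Frobenius objective
    (an assumption; the abstract does not specify the solver).
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def sample_keypoints(basis_flow, shape, n_points, seed=0):
    """Draw pixel coordinates with probability proportional to a basis flow."""
    rng = np.random.default_rng(seed)
    p = basis_flow / basis_flow.sum()
    idx = rng.choice(basis_flow.size, size=n_points, p=p)
    rows, cols = np.unravel_index(idx, shape)
    return np.stack([rows, cols], axis=1)

# Synthetic stand-in for flow magnitudes (assumption): motion occurs either
# in the upper rows or in the lower rows of an 8x8 image.
shape = (8, 8)
rng = np.random.default_rng(1)
upper = np.zeros(shape)
upper[:3, :] = 1.0
lower = np.zeros(shape)
lower[5:, :] = 1.0
frames = []
for _ in range(20):
    frames.append((rng.random() * upper).ravel())
    frames.append((rng.random() * lower).ravel())
V = np.array(frames).T  # one column per frame, one row per pixel

W, H = nmf(V, rank=2)   # columns of W play the role of basis flows
points = sample_keypoints(W[:, 0], shape, n_points=10)
```

In this sketch each column of `W` can be reshaped to the image size and read as a map of where motion tends to occur, so sampling from it concentrates key points in regions that moved in the training videos rather than spreading them uniformly over the image.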