Human action can be recognized from a single still image by modeling human–object interactions (HOIs), that is, by inferring the mutual spatial structure between the human and the manipulated object together with their appearance. Existing approaches rely heavily on accurate detection of the human and the object and on accurate estimation of human pose; they are therefore sensitive to large pose variations, occlusion, and unreliable detection of small objects. To overcome these limitations, a novel exemplar-based approach is proposed in this paper. Our approach learns a set of spatial pose–object interaction exemplars, which are probability density functions describing how a person spatially interacts with a manipulated object in different activities. Specifically, a new framework consisting of an exemplar-based HOI descriptor and an associated matching model is formulated for robust human action recognition in still images. In addition, the framework is extended to HOI recognition in videos, where the proposed exemplar representation is used for implicit frame selection, discounting irrelevant or noisy frames through temporally structured HOI modeling. Extensive experiments are carried out on two image action datasets and two video action datasets. The results demonstrate the effectiveness of the proposed methods and show that our approach achieves state-of-the-art performance compared with several recently proposed competitors.