Viewpoint variation is a major challenge in video-based human action recognition. We exploit the simultaneous RGB and Depth sensing of RGB-D cameras to address this problem. Our technique capitalizes on the complementary spatio-temporal information in the RGB and Depth frames of RGB-D videos to achieve viewpoint-invariant action recognition. We extract view-invariant features from the dense trajectories of the RGB stream using a non-linear knowledge transfer model. Simultaneously, view-invariant human pose features are extracted from the Depth stream using a CNN model, and a Fourier Temporal Pyramid is computed over them. The resulting heterogeneous features are combined and used to train an L1L2 classifier. To establish the effectiveness of the proposed approach, we benchmark our technique on two standard datasets and compare its performance with twelve existing methods. Our approach achieves up to 7.2% higher accuracy than the nearest competitor.
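The fusion-and-classification step named above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the synthetic RGB and Depth feature matrices, the dimensions, and the interpretation of the L1L2 classifier as an elastic-net (L1 + L2) regularized one-vs-rest least-squares model trained by proximal gradient descent (ISTA) are stand-ins, not the paper's actual implementation.

```python
import numpy as np

def soft_threshold(W, t):
    """Proximal operator of the L1 penalty (entrywise shrinkage)."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)

def fit_l1l2(X, y, l1=1e-3, l2=1e-3, iters=500):
    """One-vs-rest least-squares classifier with an elastic-net
    (L1 + L2) penalty, trained with ISTA. Illustrative only."""
    n, d = X.shape
    classes = np.unique(y)
    Y = (y[:, None] == classes[None, :]).astype(float)  # one-hot targets
    W = np.zeros((d, classes.size))
    # Step size 1/L, where L bounds the curvature of the smooth part.
    L = np.linalg.norm(X, 2) ** 2 / n + l2
    lr = 1.0 / L
    for _ in range(iters):
        grad = X.T @ (X @ W - Y) / n + l2 * W       # gradient of smooth part
        W = soft_threshold(W - lr * grad, lr * l1)  # L1 proximal step
    return W, classes

def predict(W, classes, X):
    return classes[np.argmax(X @ W, axis=1)]

# Synthetic stand-ins for the two feature streams (hypothetical data):
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)          # 3 action classes, 60 clips
rgb_feats = rng.normal(size=(60, 8))       # stand-in trajectory features
rgb_feats[np.arange(60), labels] += 3.0    # give each class its own signal
depth_feats = rng.normal(size=(60, 5))     # stand-in pose/FTP features
X = np.hstack([rgb_feats, depth_feats])    # heterogeneous feature fusion

W, classes = fit_l1l2(X, labels)
pred = predict(W, classes, X)
```

Concatenating the two streams before classification is only one plausible fusion choice; the point of the sketch is that the combined feature vector feeds a single jointly regularized linear classifier.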