Feature learning plays a crucial role in successful human action recognition. A number of approaches extract action features from depth information and 3D skeletal data. However, neither the skeleton information nor the depth map is accurate enough for feature learning unless complex descriptors are carefully designed and embedded. In this paper, we first propose a data sparsification technique to sparsify the cuboids in a depth video clip. We then propose a novel formulation of the cuboid descriptor based on 3D Sparse Quantization (3DSQ). Furthermore, we build a Spatial-Temporal Pyramid (STP) structure with max pooling to hierarchically represent the action sample in the depth domain. We demonstrate our feature learning technique on action recognition tasks using the public MSRAction3D and MARDaily-Action3D datasets. Experimental results show that the proposed approach outperforms state-of-the-art feature learning approaches and significantly improves action recognition accuracy.
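
To make the pyramid representation concrete, the following is a minimal sketch of spatial-temporal pyramid max pooling, not the implementation used in this work. It assumes each cuboid has already been encoded as a non-negative code vector and carries normalized (x, y, t) coordinates in [0, 1); the function name `stp_max_pool`, the choice of 2^l cells per axis at level l, and the zero vector for empty cells are all illustrative assumptions.

```python
from itertools import product


def stp_max_pool(descriptors, dim, levels=2):
    """Pool cuboid codes over a spatial-temporal pyramid (illustrative sketch).

    descriptors: sequence of ((x, y, t), code), with coordinates normalized
                 to [0, 1) and code a list of `dim` non-negative floats
                 (assumed format, not taken from the paper).
    levels:      number of pyramid levels; level l splits each of the three
                 axes into 2**l cells (assumed partitioning).
    Returns the concatenation of the per-cell max-pooled vectors.
    """
    feature = []
    for level in range(levels):
        n = 2 ** level
        # One pooled vector per (x, y, t) cell; empty cells stay all-zero.
        cells = {idx: [0.0] * dim for idx in product(range(n), repeat=3)}
        for (x, y, t), code in descriptors:
            # Map normalized coordinates to a cell index, clamping the edge.
            idx = tuple(min(int(c * n), n - 1) for c in (x, y, t))
            pooled = cells[idx]
            for k, value in enumerate(code):
                if value > pooled[k]:
                    pooled[k] = value  # element-wise max pooling
        # Fixed cell order so features are comparable across samples.
        for idx in sorted(cells):
            feature.extend(cells[idx])
    return feature
```

With two levels, the resulting vector has dim × (1 + 8) entries: one pooled vector for the whole clip, followed by one for each of the eight spatial-temporal octants.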