In real-world applications, factors such as head pose variation, occlusion, and poor image quality make facial expression recognition (FER) an open challenge. In this paper, a novel conditional convolutional neural network enhanced random forest (CoNERF) is proposed for FER in unconstrained environments. Our method extracts robust deep salient features from saliency-guided facial patches to reduce the influence of distortions such as illumination changes, occlusion, and low image resolution. The conditional CoNERF enhances decision trees with the representation learning capability of transferred convolutional neural networks, and models facial expressions under different head poses through conditional probabilistic learning. In the learning process, we introduce a neurally connected split function (NCSF) as the node splitting strategy in the CoNERF. Experiments were conducted on the public CK+, JAFFE, multi-view BU-3DFE, and LFW datasets. Compared with state-of-the-art methods, the proposed method achieved higher accuracy and greater robustness, with an average accuracy of 94.09% on the multi-view BU-3DFE dataset, 99.02% on the frontal CK+ and JAFFE datasets, and 60.9% on the LFW dataset. In addition, in contrast to deep neural networks, which require large-scale training data, the conditional CoNERF performs well even when only a small amount of training data is available.
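The two mechanisms named above, a neurally connected split function at each tree node and conditional probabilistic learning over head pose, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the implementation described in this paper: the split is taken to be a sigmoid over a linear unit applied to deep features, the conditional model is taken to be a pose-weighted mixture of per-pose predictors, and the names `ncsf_split`, `route`, and `conditional_predict` are hypothetical.

```python
import math


def ncsf_split(features, weights, bias):
    """Hypothetical split score: sigmoid of a linear unit over deep features.

    Returns a value in (0, 1), read here as the probability of sending
    the sample to the left child. The sigmoid form is an assumption.
    """
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def route(features, weights, bias, threshold=0.5):
    """Hard routing decision derived from the soft split score."""
    return "left" if ncsf_split(features, weights, bias) >= threshold else "right"


def conditional_predict(features, pose_probs, pose_models):
    """Assumed mixture: p(y | x) = sum over poses of p(y | x, pose) * p(pose | x).

    pose_probs maps a pose name to p(pose | x).
    pose_models maps a pose name to a callable that returns a dict of
    expression probabilities for the given features.
    """
    mixed = {}
    for pose, p_pose in pose_probs.items():
        for label, p_label in pose_models[pose](features).items():
            mixed[label] = mixed.get(label, 0.0) + p_pose * p_label
    return mixed


if __name__ == "__main__":
    # Toy stand-ins for per-pose forests; the probabilities are made up.
    models = {
        "frontal": lambda x: {"happy": 0.8, "sad": 0.2},
        "profile": lambda x: {"happy": 0.4, "sad": 0.6},
    }
    print(route([1.0, 2.0], [1.0, 1.0], 0.0))
    print(conditional_predict([1.0, 2.0], {"frontal": 0.75, "profile": 0.25}, models))
```

In this sketch the per-pose predictors are plain callables, so a trained forest for each head pose could be substituted without changing the mixing step.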