Micro-expression (ME) is one of the key psychological stress reactions. It is a subtle, spontaneous facial movement. ME has significant applications in a variety of psychology-related fields because of its precision and unpredictability as a manifestation of psychological state. Nevertheless, current micro-expression recognition (MER) algorithms suffer from low accuracy, the available ME data are limited, and this research problem has not been thoroughly investigated. Therefore, we present a deep learning approach based on a spatio-temporal capsule network (STCP-Net). STCP-Net has four components: a jitter reduction module, a differential feature extraction module, an STCP module, and a fully connected layer. The first two modules aim to limit the influence of head jitter and to extract diverse differential features more precisely. The STCP module extracts spatio-temporal features layer by layer, taking the temporal and spatial relationships between features into account. We conduct extensive experiments on the CASME II dataset using Leave-One-Subject-Out (LOSO) cross-validation. The results and analysis demonstrate that the algorithm is innovative and efficient.
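The LOSO protocol mentioned above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `loso_splits` and the toy subject labels are hypothetical, chosen only to show that each fold holds out every sample of one subject for testing and trains on all remaining subjects.

```python
from collections import defaultdict


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject.

    subject_ids: a sequence giving the subject label of each sample.
    """
    # Group sample indices by the subject they belong to.
    by_subject = defaultdict(list)
    for idx, subject in enumerate(subject_ids):
        by_subject[subject].append(idx)

    # Each subject is held out exactly once; all other subjects form the training set.
    for held_out in sorted(by_subject):
        test_idx = by_subject[held_out]
        train_idx = sorted(
            i
            for subject, indices in by_subject.items()
            if subject != held_out
            for i in indices
        )
        yield held_out, train_idx, test_idx


# Hypothetical toy labels: six samples from three subjects (not real CASME II data).
subjects = ["s01", "s01", "s02", "s03", "s03", "s03"]
folds = list(loso_splits(subjects))

for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Because no subject appears in both the training and test sets of a fold, the reported accuracy reflects generalization to unseen individuals, which matters when the dataset contains few subjects.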