Sign Language Recognition (SLR) has achieved significant progress through various deep learning-based models that deliver good performance within an acceptable computational budget [1,2,3,4,21,22,26,39,40,41,42,43]. Existing SLR systems nevertheless still struggle to reach high performance because they do not fully exploit the available information, the considerable variety of gestures, and the potentially discriminative features. One common challenge is to capture the global body-motion skeleton and the local arm, hand, and facial expressions simultaneously. Neverova et al. employed the ModDrop framework, which initializes modalities individually and fuses them gradually, to capture spatial information [39]. They achieved good performance on spatial and temporal information across multiple modalities, but one drawback of their approach is its reliance on audio augmentation, which is not always available. Pu et al. employed connectionist temporal classification (CTC) for sequence modelling and a 3D convolutional residual network (3D-ResNet) for feature learning [26].
The LSTM and CTC decoders were jointly trained with a soft Dynamic Time Warping (soft-DTW) alignment constraint, and the 3D-ResNet was then trained with the resulting labels; the approach was validated on the RWTH-PHOENIX-Weather and CSL datasets with word error rates (WER) of 36.7% and 32.7%, respectively.
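As a concrete illustration of CTC-style sequence supervision (a minimal sketch, not the cited implementation; the backbone features, vocabulary size, and sequence lengths below are assumptions), a frame-wise gloss classifier can be trained against unsegmented gloss sequences as follows:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: T temporal steps, batch N, C gloss classes (index 0 = CTC blank).
T, N, C = 40, 2, 100
features = torch.randn(T, N, 512)                    # assumed per-step backbone features
classifier = nn.Linear(512, C)

log_probs = classifier(features).log_softmax(-1)     # (T, N, C), as required by CTCLoss
targets = torch.randint(1, C, (N, 8))                # unaligned gloss label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 8, dtype=torch.long)

loss = nn.CTCLoss(blank=0, zero_infinity=True)(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

The CTC objective marginalizes over all monotonic alignments between the frame-wise predictions and the gloss sequence, which is why no temporal segmentation of the video is required.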
Koller et al. employed a hybrid CNN-HMM model to combine the discriminative features of a CNN with the sequence modelling of Hidden Markov Models (HMMs) [21]. They reported good recognition accuracy on three benchmark sign language datasets, reducing the WER by 20%. Huang et al. proposed attention-based 3D convolutional neural networks (3D-CNNs) for SLR, aiming to extract spatio-temporal features and select salient information with an attention mechanism [27].
They evaluated their model on the CSL and ChaLearn 2014 benchmark datasets, achieving 95.30% accuracy on ChaLearn. Pigou et al. proposed a simple temporal feature-pooling method and showed that temporal information provides important discriminative features for video classification tasks [44]. They also focused on recurrence and temporal convolution, which can further improve video classification performance. Sincan et al. proposed a hybrid method combining a CNN, feature pooling, and an LSTM to recognize isolated sign language [24].
They used a pre-trained VGG-16 model in the CNN part and two parallel branches to learn RGB and depth information, achieving 93.15% accuracy on the Montalbano Italian sign language dataset.
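A minimal sketch of such a hybrid, combining per-frame CNN features with a temporal-pooling branch and an LSTM branch, is given below; the backbone choice, feature sizes, and fusion scheme are assumptions for illustration rather than the cited architecture:

```python
import torch
import torch.nn as nn
from torchvision import models

class HybridSLR(nn.Module):
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        backbone = models.vgg16(weights=None)                 # per-frame feature extractor (assumed)
        self.cnn = nn.Sequential(*list(backbone.features), nn.AdaptiveAvgPool2d(1))
        self.lstm = nn.LSTM(512, hidden, batch_first=True)    # recurrent temporal branch
        self.fc = nn.Linear(hidden + 512, num_classes)        # fuse LSTM state with pooled features

    def forward(self, clip):                  # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).flatten(1).view(B, T, 512)
        pooled = feats.mean(dim=1)            # temporal feature-pooling branch
        _, (h, _) = self.lstm(feats)          # LSTM branch; h: (1, B, hidden)
        return self.fc(torch.cat([h[-1], pooled], dim=-1))
```

In practice the RGB and depth modalities would each pass through such a branch before fusion, but a single-modality sketch suffices to show how pooling and recurrence complement each other.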
Huang et al. proposed a continuous sign language recognition approach, the Hierarchical Attention Network with Latent Space (LS-HAN), which eliminates temporal segmentation as a preprocessing step [28]. It mainly consists of a two-stream CNN for video feature extraction, a latent space (LS) for bridging the semantic gap, and a hierarchical attention network (HAN) for latent-space-based recognition. The main drawback of this work is that it extracts purely visual features, which are not well suited to capturing hand gestures and body movements. Zhou et al. proposed a holistic visual-appearance-based approach and a 2D human-pose-based method to improve performance on large-scale sign language recognition [23].
They also applied pose-based temporal graph convolutional networks (Pose-TGCN) to capture the temporal dependencies of pose trajectories, achieving 66% accuracy on 2000 gloss words. Liu et al. applied a feature-extraction approach based on a deep CNN with stacked temporal fusion layers together with a bidirectional RNN as the sequence-learning model [45]. Guo et al. employed a hierarchical LSTM approach with word embeddings, including visual content, for SLR [46]. Spatio-temporal information is first extracted by a 3D CNN and then compacted into visemes with the help of an online key selection based on adaptive variable lengths; however, their approach is not very efficient at capturing motion information. The main drawback of image- and video-pixel-based methods is their high computational complexity. To overcome this, researchers have turned to joint points instead of full image pixels for hand gesture and action recognition [47,48,49].
Various models have been used for skeleton-based gesture recognition, among them LSTMs [33] and RNNs [50]. Yan et al. applied a graph-based method, ST-GCN, which builds a dynamic pattern for skeleton-based action recognition with graph convolutional networks (GCNs) [33].
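The core spatial operation in ST-GCN-style models can be sketched as a graph convolution over the skeleton joints followed by a temporal convolution. The simplified single-partition layer below (the adjacency handling, channel sizes, and normalization are assumptions) illustrates the idea rather than reproducing the published architecture:

```python
import torch
import torch.nn as nn

class SpatialTemporalGraphConv(nn.Module):
    """Simplified ST-GCN-style block: joint-wise graph conv + temporal conv."""
    def __init__(self, in_ch, out_ch, A, t_kernel=9):
        super().__init__()
        # Row-normalized adjacency (with self-loops) of the skeleton graph, shape (V, V).
        A = A + torch.eye(A.size(0))
        self.register_buffer("A_hat", A / A.sum(1, keepdim=True))
        self.gcn = nn.Conv2d(in_ch, out_ch, kernel_size=1)            # per-joint feature transform
        self.tcn = nn.Conv2d(out_ch, out_ch, kernel_size=(t_kernel, 1),
                             padding=((t_kernel - 1) // 2, 0))        # temporal convolution
        self.relu = nn.ReLU()

    def forward(self, x):            # x: (N, C, T, V) joint features over time
        x = self.gcn(x)              # (N, out_ch, T, V)
        x = torch.einsum("nctv,vw->nctw", x, self.A_hat)   # aggregate neighbouring joints
        return self.relu(self.tcn(x))
```

For a skeleton with V joints, each frame carries one feature vector per joint, and the adjacency matrix encodes the physical bone connections along which information is propagated.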
Following this work, many researchers have employed modified versions of ST-GCN to improve accuracy on hand gesture and human activity recognition tasks. Li et al. employed an encoder and a decoder to extract action-specific latent information [49]. They introduced two types of links for this purpose and employed an actional-structural GCN to learn temporal and spatial information. Shi et al. employed a two-stream GCN [51] and a multi-stream GCN [30] for action recognition.
In the multi-stream GCN, they integrated the GCN with a spatio-temporal network to extract the most important joints and features from the full feature set.
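Multi-stream skeleton models are typically realized by running the same GCN backbone on complementary inputs (for example joint coordinates, bone vectors, and their motions) and fusing the class scores. The score-level fusion sketch below is illustrative only; the streams and weights are assumptions, not the cited configuration:

```python
import torch

def fuse_streams(stream_scores, weights=None):
    """Late fusion of per-stream class scores, each of shape (N, num_classes)."""
    weights = weights or [1.0] * len(stream_scores)
    fused = sum(w * s.softmax(dim=-1) for w, s in zip(weights, stream_scores))
    return fused.argmax(dim=-1)

# Hypothetical stream outputs for a batch of 4 samples and 60 classes:
joint_scores = torch.randn(4, 60)     # e.g. from a joint-coordinate stream
bone_scores = torch.randn(4, 60)      # e.g. from a bone-vector stream
motion_scores = torch.randn(4, 60)    # e.g. from a joint-motion stream
pred = fuse_streams([joint_scores, bone_scores, motion_scores], weights=[0.6, 0.6, 0.4])
```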
Zhang et al. proposed a decoupling GCN for skeleton-based action recognition [29]. Song et al. proposed ResGCN integrated with part-wise attention (PartAtt) to improve both the performance and the computational cost of skeleton-based action recognition [31].
Their main drawback, however, is that the performance is not much higher than that of existing ResNet-based models. Amorim et al. proposed human-skeleton-movement-based sign language recognition using ST-GCN, selecting the most informative key points from the whole-body key points, and achieved 85.0% accuracy on their ASLLVD dataset [52]. The disadvantage of this work is that only one hand is considered together with the body key points. Perez et al. extracted 67 key points covering the face, body, and hands using a special camera and achieved good performance with an LSTM [38]. In the same way, many researchers have considered 133 whole-body key points to recognize sign language [8]. Jiang et al. applied a different approach on a multimodal dataset, including full-body skeleton points, and achieved good recognition accuracy [8].
They also considered reducing the number of skeleton points to increase the model's efficiency. The main problem is that their method does not achieve both good performance and good generalization for SLR compared with existing systems. Researchers have moved towards skeleton-based SLR because of the high complexity of pixel-based systems, but with the full-body skeleton we face almost the same computational complexity problems.
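One common way to mitigate this cost is to keep only the joints that are informative for signing (for example the arms, the hands, and a few facial points) before feeding the sequence to a graph model. The index set below is purely illustrative and is not tied to any particular whole-body keypoint layout:

```python
import numpy as np

# Hypothetical subset of a 133-point whole-body layout: a few upper-body joints
# plus both hands (all index ranges here are illustrative assumptions).
SELECTED = np.concatenate([
    np.arange(0, 11),        # assumed upper-body / arm joints
    np.arange(91, 112),      # assumed left-hand joints
    np.arange(112, 133),     # assumed right-hand joints
])

def reduce_keypoints(sequence):
    """sequence: (frames, 133, coords) -> (frames, len(SELECTED), coords)."""
    return sequence[:, SELECTED, :]

clip = np.random.rand(64, 133, 3)     # dummy whole-body keypoint sequence
reduced = reduce_keypoints(clip)      # (64, 53, 3): only the retained joints
```

Because the cost of a graph convolution grows with the number of joints, such a reduction shrinks both the adjacency matrix and the per-frame feature tensor, which is precisely the complexity concern raised above.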