Computer Vision based on Deep Learning has served numerous human activity recognition applications, particularly those related to safety and security. However, although autistic children are frequently exposed to danger as a result of their own activities, the Computer Vision community has paid little attention to their safety. Children with severe autism frequently experience the Meltdown Crisis condition, which is characterized by hostile behaviors and loss of control. This study introduces a monitoring system that predicts the Meltdown Crisis condition early and alerts the children's parents or caregivers before the situation escalates. The proposed system combines a pre-trained Vision Transformer (ViT) model (Swin-3D-b) with a Residual Network (ResNet) architecture to extract robust features from video sequences and to learn the spatial and temporal features of the Stereotyped Motor Movements that autistic children make at the onset of a Meltdown Crisis, a phase referred to as the Pre-Meltdown Crisis state. The final choices for data preparation, model construction, and training parameters were tuned and fixed experimentally, yielding a recall and F1 score of 92%. The best loss value obtained was 0.08. Evaluation was carried out on the MeltdownCrisis dataset, which contains realistic scenarios of autistic children's behaviors in the Pre-Meltdown Crisis state and in the Normal state, with the Normal state data used as the negative class.
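For readers less familiar with the reported metrics, the following is a minimal sketch of how recall and F1 score could be computed from per-clip predictions, treating the Pre-Meltdown Crisis state as the positive class and the Normal state as the negative class. It is not the evaluation code used in this study: the function name `recall_and_f1` and the label encoding (1 for Pre-Meltdown Crisis, 0 for Normal) are assumptions made for illustration, and the labels in the usage example are invented.

```python
def recall_and_f1(y_true, y_pred, positive=1):
    """Return (recall, F1) for a binary task.

    Assumed encoding (hypothetical): 1 = Pre-Meltdown Crisis, 0 = Normal.
    """
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Illustrative labels only; these are not results from the MeltdownCrisis dataset.
example_true = [1, 1, 1, 0, 0]
example_pred = [1, 1, 0, 1, 0]
example_recall, example_f1 = recall_and_f1(example_true, example_pred)
```

Recall is the more safety-relevant of the two here, since a missed Pre-Meltdown Crisis clip (a false negative) means no alert is raised, whereas F1 also penalizes false alarms through precision.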