Computer Science and Mathematics


Article
Computer Science and Mathematics
Computer Vision and Graphics

Valli Nayagam, Anukarthika S, Muhesh Krishnaa S, Sri Sathya K B

Abstract: The rapid expansion of sports broadcasting and digital media platforms has increased the demand for intelligent systems capable of automatically identifying important sports events for real-time analytics and highlight generation. Manual annotation of sports videos requires significant time and effort and may introduce human errors during analysis. This paper presents a real-time sports action recognition framework using a hybrid CNN–Transformer architecture for detecting critical events in football and cricket videos. The proposed system processes live or recorded video streams through frame extraction, normalization, and spatial feature learning using the MobileNetV2 network. Temporal relationships between consecutive frames are modeled using a Transformer encoder to improve action understanding. The framework classifies events such as pass and goal in football, and four, six, and wicket in cricket. Motion-based filtering and confidence thresholding reduce non-action frames and improve prediction reliability. Detected events are recorded with timestamps and displayed using broadcast-style overlays to support automated highlight generation. Experimental evaluation demonstrates high recognition accuracy and efficient real-time performance on low-cost hardware platforms. The framework provides an effective solution for sports analytics, media automation, and intelligent decision-support systems.
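The motion-based filtering and confidence thresholding described in the abstract can be sketched as follows. This is an illustrative minimal version, not the authors' implementation; the function name and the `motion_thresh`/`conf_thresh` values are assumptions:

```python
import numpy as np

def is_action_frame(prev_gray, curr_gray, probs,
                    motion_thresh=8.0, conf_thresh=0.7):
    """Return (keep, label_idx): keep a frame only when it shows
    enough motion AND the classifier is confident enough."""
    # Mean absolute frame difference as a cheap motion score.
    motion = np.abs(curr_gray.astype(np.float32)
                    - prev_gray.astype(np.float32)).mean()
    if motion < motion_thresh:
        return False, None   # static scene: skip classification entirely
    label = int(np.argmax(probs))
    if probs[label] < conf_thresh:
        return False, None   # low-confidence prediction: drop it
    return True, label
```

Frames that pass both gates would then be timestamped for the highlight reel; static or uncertain frames never reach the overlay stage.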

Article
Computer Science and Mathematics
Computer Vision and Graphics

Jianhua Zhu, Changjiang Liu, Danling Liang

Abstract: Multi-modal remote sensing image registration is a challenging task due to differences in resolution, viewpoint, and intensity, which often leads to inaccurate and time-consuming results with existing algorithms. To address these issues, we propose an algorithm based on Curvature Scale Space Contour Point Features (CSSCPF). Our approach combines multi-scale Sobel edge detection, dominant direction determination, an improved curvature scale space corner detector, a new gradient definition, and enhanced SIFT descriptors. Test results on publicly available datasets show that our algorithm outperforms existing methods in overall performance. Our code will be released at https://github.com/JianhuaZhu-IR.

Review
Computer Science and Mathematics
Computer Vision and Graphics

Ge Gao, Chen Feng, Yuxuan Jiang, Tianhao Peng, Ho Man Kwan, Siyue Teng, Chengxi Zeng, Yixuan Li, Changqi Wang, Robbie Hamilton, +3 authors

Abstract: While conventional video coding standards remain predominant in real-world applications, neural video compression has emerged over the past decade as an active research area, offering alternative solutions with potentially significant coding gains through end-to-end optimization. Owing to the rapid pace of recent progress, existing reviews of neural video coding quickly become outdated and often lack a systematic taxonomy and meaningful benchmarking. To address this gap, we provide a comprehensive review of two major classes of neural video codecs, scene-agnostic and scene-adaptive, with a focus on their design characteristics and limitations. More importantly, we benchmark representative state-of-the-art methods from each category under common test conditions recommended by video coding standardization bodies. This provides, to the best of our knowledge, the first large-scale unified comparison between conventional and neural video codecs under controlled settings. Our results show that neural codecs can already achieve competitive, and in some cases superior, performance relative to VTM and AVM, although they still fall short of ECM in overall coding efficiency under both Low Delay and Random Access configurations. To facilitate future algorithm benchmarking, we will release the full implementations and results at https://nvc-review-2025.github.io, thereby providing a useful resource for the video compression research community.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Marc Tornero-Soria, Antonio-José Sánchez-Salmerón, Eduardo Vendrell-Vidal

Abstract: Public YOLO model releases typically provide high-level architectural descriptions and headline benchmark results, but offer limited empirical attribution of performance to individual blocks under controlled training conditions. This paper presents a modular, block-level analysis of YOLO26’s object detection architecture, detailing the design, function, and contribution of each component. We systematically examine YOLO26’s convolutional modules, bottleneck-based refinement blocks, spatial pyramid pooling, and position-sensitive attention mechanisms. Each block is analyzed in terms of objective and internal flow. In parallel, we conduct targeted ablation studies to quantify the effect of key design choices on accuracy (mAP50–95) and inference latency under a fixed, fully specified training and benchmarking protocol. Experiments use the MS COCO [1] dataset with the standard train2017 split (≈118k images) for training and the full val2017 split (5k images) for evaluation. The result is a self-contained technical reference that supports interpretability, reproducibility, and evidence-based architectural decision-making for real-time detection models.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Longcheng Huang, Mengguang Liao, Shaoning Li, Chuanguang Zhu, Sichun Long

Abstract: Maritime search and rescue is an important component of emergency response frameworks and primarily relies on UAVs for maritime object detection. However, maritime accidents frequently occur in low-visibility environments, such as foggy or low-light conditions, which lead to low contrast, blurred object boundaries, and degraded texture representations. Most existing maritime object detection algorithms are developed for natural light scenes, and their performance deteriorates markedly when deployed directly in low-visibility environments, primarily due to reduced image quality that hinders feature extraction and semantic information aggregation. Although several studies incorporate image enhancement techniques prior to detection to improve image quality, these approaches often introduce significant additional computational overhead, limiting their practical deployment on UAV platforms. To tackle these challenges, this paper proposes a lightweight model built upon a recent YOLO framework, termed Multi-Scale Adaptive YOLO (MSA-YOLO), for maritime detection using UAVs in low-visibility environments. The proposed model systematically optimizes the backbone, neck, and detection head networks. Specifically, an improved StarNet backbone is designed by integrating ECA mechanisms and multi-scale convolutional kernels, which strengthen feature extraction capability while maintaining low computational overhead. In the neck network, a high-frequency enhanced residual block branch is inserted into the C3k2 module to capture richer detailed information, while depthwise separable convolution is utilized to further reduce computational cost. Moreover, a non-parametric attention module is incorporated into the detection head to adaptively optimize features in the classification and regression branches. 
Finally, a joint loss function that combines bounding box regression, classification, and distribution focal losses is utilized to improve detection accuracy and training stability. Experimental results on the constructed AFO, Zhoushan Island, and Shandong Province datasets demonstrate that, relative to YOLOv11-s, MSA-YOLO reduces model parameters and FLOPs by 52.07% and 41.36%, respectively, while achieving improvements of 1.11% and 1.33% in mAP@0.5:0.95 and mAP@0.5. These results indicate that the proposed method effectively balances computational efficiency and detection accuracy, rendering it suitable for practical maritime search and rescue applications in low-visibility environments.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Xiaoming Chen, Xiaoyu Jiang, Yingqing Huang, Xi Wang, Chaoqun Ma

Abstract: We propose a field-transformation-based framework for generating phase-only light-field holograms from a single RGB image. The method establishes an explicit pipeline from monocular scene inference to holographic wavefront synthesis, without requiring multi-view capture or task-specific hologram-network training. First, we construct a layered occlusion RGB-D model from the input image using monocular depth estimation, connectivity-based layer decomposition, and occlusion-aware inpainting, which provides a lightweight 3D prior for sparse-view rendering in the small-parallax regime. Second, we transform the rendered sparse RGB-D light field into a target complex wavefront on the recording plane through local frequency mapping, thereby bridging explicit scene geometry and wave-optical field construction. Third, we optimize the phase-only hologram under multi-plane amplitude constraints using a geometrically consistent initial phase and an error-driven adaptive depth-sampling strategy, which improves convergence stability and reconstruction quality under a limited computational budget. Numerical experiments show that the proposed method achieves better depth continuity, occlusion fidelity, and lower speckle noise than representative layer-based and point-based methods, and improves the average PSNR and SSIM by approximately 3 dB and 0.15, respectively, over Hogel-Free Holography. Optical experiments further confirm the physical feasibility and robustness of the proposed framework.
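For readers unfamiliar with phase-only hologram optimization, a classical single-plane Gerchberg-Saxton (error-reduction) loop conveys the core idea. This is a simplified textbook sketch, not the paper's multi-plane, adaptively depth-sampled optimizer; all names and parameters below are illustrative:

```python
import numpy as np

def phase_only_hologram(target_amp, n_iter=40, seed=0):
    """Gerchberg-Saxton loop: find a phase-only hologram whose
    FFT (far-field) amplitude approximates target_amp."""
    n = target_amp.size
    # Scale the target so its norm matches what a unit-amplitude
    # hologram can deliver (Parseval), keeping the constraint feasible.
    t = target_amp / np.linalg.norm(target_amp) * n
    rng = np.random.default_rng(seed)
    phase = rng.uniform(0.0, 2.0 * np.pi, target_amp.shape)
    for _ in range(n_iter):
        far = np.fft.fft2(np.exp(1j * phase))    # propagate to image plane
        far = t * np.exp(1j * np.angle(far))     # impose target amplitude
        near = np.fft.ifft2(far)                 # back-propagate
        phase = np.angle(near)                   # keep phase only
    return phase
```

The paper's contribution lies in replacing this flat target with multi-plane amplitude constraints and a geometrically consistent initial phase, which this sketch does not attempt.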

Article
Computer Science and Mathematics
Computer Vision and Graphics

Kosmas Katsioulas, Ilias Maglogiannis

Abstract: Handball performance analysis is still often conducted through manual review of match videos, while automation on broadcast footage remains challenging due to camera motion, strong perspective effects, and frequent occlusions during dense interactions. This study presents a practical and reproducible monocular pipeline for extracting handball analytics from a single broadcast viewpoint. Players are detected per frame, tracked over time, and projected onto a standardized handball court via homography-based camera calibration. The resulting court-referenced trajectories in metric units enable motion indicators such as distance covered and speed, along with coaching-oriented visual summaries including trajectory overlays and heatmaps. In addition, clip-level action recognition is performed using interpretable kinematic and scene-derived features and lightweight classifiers, with a comparative evaluation across multiple classical models. The modular design keeps intermediate steps explicit, supports reproducibility, and facilitates interpretation of both intermediate outputs and final analytics. Experiments on the UNIRI handball dataset demonstrate that meaningful performance analytics and action understanding can be obtained from single-camera broadcast video using transparent intermediate representations. This work highlights the practical potential of interpretable trajectory-based modeling for under-instrumented sports and provides a reproducible baseline for future extensions incorporating richer contextual cues.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Yutang Wang

Abstract: Bearing fault diagnosis in industrial Internet of Things (IIoT) systems faces critical challenges: the need for lightweight models deployable on edge devices, privacy constraints preventing centralized data aggregation, and complex inter-sensor correlations in multi-sensor monitoring systems. This paper proposes FedGNN-SFD, a Federated Graph Neural Network that addresses these challenges through a lightweight graph attention architecture. On the CWRU bearing fault dataset with 1,658 samples across 10 fault categories, FedGNN-SFD achieves 87.95% accuracy with only 69,706 parameters—62 times smaller than CNN-1D (4.34M). Comprehensive experiments demonstrate: (1) the graph attention module contributes +12.5% accuracy improvement compared to simple pooling (91.77% vs 79.32%); (2) federated learning with 5 clients and Non-IID data achieves 87.35% accuracy within 15 communication rounds with only 0.6% gap from centralized training; (3) noise robustness analysis shows stable performance under moderate noise conditions; (4) cross-domain validation across simulated load variations demonstrates consistent generalization. The results validate the effectiveness of the proposed architecture for edge deployment scenarios where model efficiency and privacy preservation are prioritized.
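The federated-learning side of FedGNN-SFD relies on aggregating client models without sharing raw vibration data. A generic FedAvg aggregation step (not the authors' code; names are illustrative) looks like this:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: sample-size-weighted average of client model parameters.
    client_weights: one list of np.ndarray parameter tensors per client."""
    total = float(sum(client_sizes))
    avg = [np.zeros_like(w) for w in client_weights[0]]
    for weights, n in zip(client_weights, client_sizes):
        for a, w in zip(avg, weights):
            a += (n / total) * w   # clients with more samples weigh more
    return avg
```

Repeating this for 15 communication rounds, as in the paper's setting, is what closes the reported 0.6% gap to centralized training under Non-IID client data.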

Article
Computer Science and Mathematics
Computer Vision and Graphics

Mustafa Yurdakul, Süleyman Burçin Şüyun, Şakir Taşdemir

Abstract: Hypertensive retinopathy (HR) is a retinal vascular disorder caused by long-term hypertension and can lead to severe visual impairment. If not detected early, it could progress to irreversible visual impairment and even blindness. Recent advances in deep learning have enabled automated analysis of retinal images to support clinical diagnosis. In this study, we propose a Clinical Attention Module–enhanced convolutional neural network framework (CAM-HR) for automatic classification of HR stages from Optical Coherence Tomography (OCT) images. In the initial scenario, various state-of-the-art architectures of convolutional neural networks (CNN) have been utilized as baseline models. In the next scenario, the Clinical Attention Module (CAM) is utilized with these architectures to focus on clinically significant regions of the retina, such as vascular structures and lesion locations. The models are evaluated using accuracy, precision, recall, F1-score, and Cohen’s kappa metrics. Experimental results demonstrate that the proposed CAM module consistently improves classification performance across different backbone architectures, achieving the best performance with the ConvNeXt + CAM model. These findings indicate that clinically guided attention mechanisms can significantly enhance automated HR diagnosis from OCT images.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Young Kim, Sunggyu Choi, Chulmin Park, Woojin Park, Doohee Lee

Abstract: Virtual Bronchoscopic Navigation (VBN) is a critical tool for guiding bronchoscopes toward peripheral pulmonary lesions (PPLs), yet its widespread clinical adoption has been limited by the high cost of proprietary software and the fragility of segmentation-dependent path planning in distal airways. In this study, we present Virtual Bronchoscopic Pathfinder (VBP), a complete, open-source, web-based VBN system that addresses both barriers. VBP integrates five components: (i) a connectivity-aware deep learning model for pulmonary airway segmentation incorporating Connectivity-Aware Surrogate (CAS) and Local-Sensitive Distance (LSD) modules; (ii) TotalSegmentator for automated tumor localization; (iii) a topology-preserving 3D thinning algorithm implemented in C++ for centerline extraction; (iv) a bidirectional Dijkstra algorithm operating on a three-tier anatomical cost field (centerline cost 1.0, airway lumen cost 10.0, parenchyma cost 100.0) to guarantee continuous path generation even under partial skeleton disconnection; and (v) a zero-footprint browser-based visualization interface built on the vtk.js engine, providing synchronized 2D axial viewing and interactive 3D volume rendering. VBP was validated on 306 thin-section CT series (154 subjects) from the public Lung-PET-CT-Dx dataset, achieving a path-generation success rate of 100% across the anatomically valid cohort. The system is publicly accessible at https://vbn.ziovision.ai and was confirmed to operate without client-side installation across desktop, laptop, and mobile device configurations. These results demonstrate that a reliable, accessible, and scalable VBN system can be constructed entirely from open-source components, offering a practical foundation for imaging informatics research and future intra-procedural bronchoscopic guidance.
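The tiered cost field in component (iv) is what guarantees a path even when the extracted centerline is broken. A plain single-direction Dijkstra sketch (VBP uses a bidirectional variant; the grid and function name here are illustrative) shows the mechanism: cheap centerline cells (1.0) are preferred, but the search can fall back to lumen (10.0) or parenchyma (100.0) cells to bridge gaps:

```python
import heapq

def cheapest_path(grid, start, goal):
    """Dijkstra on a 2D cost grid; cost is paid on entering a cell."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), float("inf")):
            continue                      # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + grid[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(pq, (nd, (nr, nc)))
    return float("inf")
```

Because every cell has a finite cost, path generation cannot fail outright; a disconnected skeleton only makes the path locally more expensive, matching the reported 100% path-generation success rate.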

Article
Computer Science and Mathematics
Computer Vision and Graphics

Sarthak Kumar Maharana

,

Shambhavi Mishra

,

Yunbei Zhang

,

Shuaicheng Niu

,

Taki Hasan Rafi

,

Jihun Hamm

,

Marco Pedersoli

,

Jose Dolz

,

Yunhui Guo

Abstract: Deep neural networks achieve remarkable performance when training and test data share the same distribution, but this assumption often fails in real-world scenarios where data experiences continual distributional shifts. Continual Test-Time Adaptation (CTTA) tackles this challenge by adapting pretrained models to non-stationary target distributions on-the-fly, without access to source data or labeled targets, while addressing two critical failure modes: catastrophic forgetting of source knowledge and error accumulation from noisy pseudo-labels over time. In this comprehensive survey, we formally define the CTTA problem, analyze the diverse continual domain shift patterns that arise under different evaluation protocols, and propose a hierarchical taxonomy categorizing existing methods into three families: optimization-based strategies (entropy minimization, pseudo-labeling, parameter restoration), parameter-efficient methods (normalization layer adaptation, adaptive parameter selection), and architecture-based approaches (teacher-student frameworks, adapters, visual prompting, masked modeling). We systematically review representative methods in each category and provide comparative benchmarks and experimental results across standard evaluation settings. Finally, we discuss the limitations of current approaches and highlight emerging research directions, including adaptation of foundation models and black-box systems, offering a roadmap for future work in robust continual test-time adaptation, with further resources available at https://github.com/sarthaxxxxx/Awesome-Continual-Test-Time-Adaptation.
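Of the optimization-based strategies in the taxonomy, entropy minimization is the simplest to state: adapt the model so its test-time predictions become more confident. A minimal NumPy sketch of the TENT-style objective (just the loss, not the gradient updates to normalization parameters) is:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, float)
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_loss(logits):
    """Mean Shannon entropy of predictions: the test-time objective
    minimized (w.r.t. a small parameter subset) in entropy-based CTTA."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())
```

Confident batches give a loss near zero while uncertain ones give up to log(num_classes), which is also why noisy pseudo-labels can accumulate error if this objective is minimized without the restoration or teacher-student safeguards the survey covers.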

Review
Computer Science and Mathematics
Computer Vision and Graphics

Mathias J.P.M. Lemmens

Abstract: Digital photogrammetry emerged around 1980 and decisively accelerated the automation of workflows for converting images into georeferenced datasets for a wide range of applications. The development of innovative technologies, including metric digital cameras, the miniaturization of powerful computers, and positioning and orientation systems, has accelerated since the turn of the century. Advanced photogrammetric and computer vision algorithms have been developed and implemented in software, allowing many workflows to run on computers from beginning to end. Today, final products can be generated largely automatically, minimizing the timespan, even down to real time, between image capture and delivery of the datasets needed for the task at hand. Thanks to the wide availability of commercial and open-source software, the scope of applications has expanded rapidly, leading to significant growth in the number of new users of photogrammetry. This article aims to serve this new group by providing an overview of the technologies underlying current photogrammetric workflows, starting with the geometric fundamentals of camera modeling and georeferencing. Next, we examine the algorithms that have revolutionized these workflows and are known by various names, particularly image matching, computer stereo vision, and structure from motion (SfM). The basic characteristics of final photogrammetric products are then briefly discussed, followed by methods for assessing the accuracy of the final product, a key component of extracting geometric information from imagery. The discussion section provides tips for selecting suitable textbooks to deepen your knowledge.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Rahul Prabhu, Rishikesh Madhuvairy

Abstract: Automated defect classification in wafer maps is critical for semiconductor yield management and quality control, but pure deep learning models often underperform on rare or spatially subtle defect types and lack interpretability. Handcrafted spatial features can capture physical defect characteristics, yet their integration with modern CNNs is underexplored. We systematically evaluate eight physically motivated spatial descriptors (radial mean and standard deviation, directional entropy, aspect ratio, fail fraction, and zone-wise failure densities for the core, mid, and edge regions) by training a lightweight CNN (101k parameters) augmented with each descriptor, both individually and in full combination. To the best of our knowledge, this is the first systematic ablation study to quantify the synergistic effect of fusing physically informed spatial descriptors with a modern, edge-optimized CNN for this task. On the WM-811K benchmark (eight defect classes, 25,519 labeled wafers), the vision-only baseline achieves 60.0% test accuracy and 0.615 weighted F1. Nearly every descriptor individually underperforms the baseline, with the best descriptor (fail fraction) reaching only 60.1%. However, the full fusion of all eight descriptors significantly outperforms the baseline, reaching 72.7% accuracy (+12.7 points) and 0.728 weighted F1. This synergy demonstrates that the spatial descriptors provide complementary information that is only realizable in combination. Per-class analysis reveals that the combination model substantially improves challenging classes: Donut F1 rises from 0.207 to 0.493, Edge-Loc from 0.384 to 0.672, and Center from 0.579 to 0.760. However, the Loc class remains challenging for all models, likely due to its diffuse spatial patterns.
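Several of the named descriptors can be computed directly from a binary wafer map. The sketch below (function name, zone boundaries at radii 1/3 and 2/3, and the 1 = fail convention are assumptions for illustration) covers fail fraction, radial mean, and the three zone-wise failure densities:

```python
import numpy as np

def wafer_descriptors(wmap):
    """Fail fraction, radial mean of fails, and core/mid/edge failure
    densities for a 2D wafer map where 1 marks a failing die."""
    h, w = wmap.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = np.hypot(yy - cy, xx - cx)
    r = r / r.max()                          # normalized radius in [0, 1]
    fails = wmap == 1
    fail_fraction = float(fails.mean())
    radial_mean = float(r[fails].mean()) if fails.any() else 0.0
    zones = {}
    for name, lo, hi in [("core", 0.0, 1 / 3), ("mid", 1 / 3, 2 / 3),
                         ("edge", 2 / 3, 1.0 + 1e-9)]:
        m = (r >= lo) & (r < hi)
        zones[name] = float(fails[m].mean()) if m.any() else 0.0
    return fail_fraction, radial_mean, zones
```

Feeding such scalars alongside the CNN's learned features is the fusion the ablation quantifies.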

Article
Computer Science and Mathematics
Computer Vision and Graphics

Mingxuan Du, Yutian Zeng

Abstract: The proliferation of 4D point cloud videos highlights their potential, but the high cost of obtaining large-scale annotated data severely limits supervised methods. Consequently, self-supervised learning (SSL) is vital for learning generalizable representations from unlabeled 4D data. While existing SSL frameworks, such as Uni4D, have made progress, they often struggle with fine-grained motion understanding in extremely dynamic scenes, maintaining robustness under severe occlusion, and developing explicit predictive capabilities. To address these, we propose Dynamic4D, a novel and robust self-supervised framework tailored for dynamic 4D point cloud understanding. Dynamic4D introduces an Adaptive Causal Temporal Attention (ACTA) mechanism in the encoder for explicit causal temporal modeling and dynamic region-focused learning. Its decoder employs Motion Prediction Tokens (MPT) to directly infer motion vectors for masked regions. A novel adaptive motion-sensitive masking strategy further enhances robustness by intelligently prioritizing high-dynamic zones. Our multi-objective pre-training strategy integrates a new Dynamic Perception Loss alongside geometric reconstruction and latent-space alignment. Extensive experiments on diverse challenging benchmarks demonstrate that Dynamic4D consistently achieves state-of-the-art performance. It substantially outperforms prior methods, validating its superior capacity to learn highly robust, generalizable, and motion-aware representations for complex dynamic 4D point cloud scenes.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Bowen Nian, Mingyu Tan

Abstract: Autonomous driving in urban environments demands deep contextual understanding, anticipation, and transparent explanations, which current purely data-driven systems often lack due to their limited causal reasoning abilities. We introduce CausalDrive, a novel unified framework integrating advanced multimodal perception with explicit causal reasoning within a Large Language Model architecture. Leveraging Mistral-7B, CausalDrive employs a Multimodal Perception Encoder for comprehensive scene understanding, a Causal Graph Induction Module to dynamically infer causal relationships between entities, and a Perceptual-Causal Alignment Module to unify these diverse inputs for the LLM. It is fine-tuned for Causal-aware Multimodal Future Prediction, Explainable Decision Making and Planning, and Causal Scene Question Answering. Extensive experiments on augmented nuScenes and Waymo Open Datasets demonstrate that CausalDrive consistently outperforms state-of-the-art baselines across tasks, achieving superior predictive accuracy, robust planning, and enhanced robustness to noise. Ablation studies confirm the Causal Graph Induction Module's critical contribution. Human evaluations validate its exceptional explainability and helpfulness. Despite higher computational cost, CausalDrive significantly advances intelligent, trustworthy, and human-understandable autonomous driving by explicitly addressing the causal "why" behind events.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Zichong Gu, Shiyi Mu, Hanqi Lyu, Shugong Xu

Abstract: Open-vocabulary 3D object detection (OV-3DOD) is crucial for real-world perception, yet existing monocular methods are often limited by predefined categories or heavy reliance on external 2D detectors. In this paper, we propose CLIP-Mono3D, an end-to-end one-stage transformer framework that directly integrates vision-language semantics into monocular 3D detection. By leveraging CLIP-derived semantic priors and grounding object queries in semantically salient regions, our model achieves robust zero-shot generalization to novel categories without requiring auxiliary 2D detectors. Furthermore, we introduce OV-KITTI, a large-scale benchmark extending KITTI with 40 new categories and over 7,000 annotated 3D bounding boxes. Extensive experiments on OV-KITTI, KITTI, and Argoverse demonstrate that CLIP-Mono3D achieves competitive performance in open-vocabulary scenarios.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Dana El-Rushaidat, Nour Almohammad, Raine Yeh, Kinda Fayyad

Abstract: This paper addresses the critical communication barrier experienced by deaf and hearing-impaired individuals in the Arab world through the development of an affordable, video-based Arabic Sign Language (ArSL) recognition system. Designed for broad accessibility, the system eliminates specialized hardware by leveraging standard mobile or laptop cameras. Our methodology employs Mediapipe for real-time extraction of hand, face, and pose landmarks from video streams. These anatomical features are then processed by a hybrid deep learning model integrating Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), specifically Bidirectional Long Short-Term Memory (BiLSTM) layers. The CNN component captures spatial features, such as intricate hand shapes and body movements, within individual frames. Concurrently, BiLSTMs model long-term temporal dependencies and motion trajectories across consecutive frames. This integrated CNN-BiLSTM architecture is critical for generating a comprehensive spatiotemporal representation, enabling accurate differentiation of complex signs where meaning relies on both static gestures and dynamic transitions, thus preventing misclassification that CNN-only or RNN-only models would incur. Rigorously evaluated on the author-created JUST-SL dataset and the publicly available KArSL dataset, the system achieved 96% overall accuracy for JUST-SL and an impressive 99% for KArSL. These results demonstrate the system’s superior accuracy compared to previous research, particularly for recognizing full Arabic words, thereby significantly enhancing communication accessibility for the deaf and hearing-impaired community.

Data Descriptor
Computer Science and Mathematics
Computer Vision and Graphics

Basit Raza, Sadaf Bibi, Sadia Bibi, Ali Nawaz

Abstract: SadaColorDataset (SCD) is a publicly available image dataset designed to support research on robust color recognition and illumination-related color variation in real mobile captures. The dataset contains 10,843 photographs of nine physical color papers (Black, Blue, Gray, Orange, Pink, Purple, Sky Blue, White, and Yellow) recorded under four everyday lighting conditions: Fluorescent, Indoor, Indoor Night, and Sunlight. All images were captured using an Infinix NOTE 40 smartphone camera (108 MP) with a simple, repeatable setup intended to reflect practical conditions rather than laboratory calibration. For each color–illumination setting, multiple images were collected to cover natural variability due to exposure, white balance, shadows, and reflections. During acquisition, the paper was placed on a ground surface and the phone was mounted on a tripod; the viewpoint was varied by moving the tripod to different positions and orientations. However, because explicit angle labels were not recorded or reliably recoverable from the released file structure/metadata, SCD does not provide calibrated or discrete “angle IDs,” and it should be treated as a dataset with unlabeled viewpoint variation. Along with the images, we release machine-readable metadata and summary files that describe image counts across colors and illuminations and provide basic color statistics (e.g., RGB/CIELAB-derived measures) to facilitate reproducible analysis. SCD is distributed under a public license and is intended for benchmarking illumination robustness, dataset shift, and color stability in mobile vision pipelines.
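The CIELAB-derived measures mentioned in the metadata follow the standard sRGB-to-Lab conversion. A self-contained sketch under a D65 white point (the dataset's exact statistics pipeline is not specified, so this is illustrative):

```python
import numpy as np

def srgb_to_lab(rgb):
    """Convert one sRGB triple (0-255) to CIELAB (D65 white point)."""
    c = np.asarray(rgb, float) / 255.0
    # Undo sRGB gamma to get linear RGB.
    c = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    M = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    x, y, z = M @ c / np.array([0.95047, 1.0, 1.08883])  # D65 reference
    f = lambda t: t ** (1 / 3) if t > (6 / 29) ** 3 \
        else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(x), f(y), f(z)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)
```

Because L* separates lightness from the a*/b* chroma axes, per-image Lab statistics make the Fluorescent vs. Sunlight shifts in the dataset far easier to quantify than raw RGB means.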

Article
Computer Science and Mathematics
Computer Vision and Graphics

Chima Okwuokei, Desmond Moru, Clifford Uroh, Samuel Oyefusi

Abstract: Virtual Reality (VR) is increasingly recognized as a valuable tool for sports training, providing immersive environments that support skill acquisition and performance improvement. Comparative studies across hand-intensive sports such as basketball, volleyball, and table tennis show substantial research on VR’s effectiveness in basketball and table tennis, yet volleyball remains relatively underexplored, particularly in terms of skill transfer to real-world play. Research in basketball and table tennis indicates that VR can improve motor coordination, tactical awareness, and user motivation. However, volleyball-specific literature is limited. Existing studies generally focus on areas such as eye–hand coordination and tactical decision-making but provide little evidence on whether VR-acquired skills translate effectively to the court. This paper addresses the gap in volleyball-focused VR research and emphasises the need for further investigation to maximise VR’s potential for volleyball training. Ten beginner-level volleyball players (mean age = 20.4 years) participated in this study, which examined the effectiveness of VR-based serving training. Participants completed an initial physical pre-test to determine their baseline serving performance, followed by a three-week VR training program consisting of structured serving drills. After the program, a post-test assessment was conducted to measure improvement. A paired t-test comparing pre- and post-training results showed a statistically significant improvement in serving performance (p = 0.0147), meeting the 0.05 significance threshold. This indicates that the observed performance gains were unlikely due to chance and demonstrates the positive impact of VR training on serving skills in beginner volleyball players.
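The paired pre/post comparison reported above (n = 10, p = 0.0147) uses a standard paired t-test. A stdlib-only sketch of the test statistic (the players' actual scores are not published, so the usage data below is invented for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(pre, post):
    """Paired t statistic for pre/post scores (post - pre differences)."""
    d = [b - a for a, b in zip(pre, post)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n))
```

With 10 participants there are 9 degrees of freedom, so the two-tailed critical value at alpha = 0.05 is about 2.262; any |t| above that yields p < 0.05, consistent with the reported result.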

Article
Computer Science and Mathematics
Computer Vision and Graphics

Siyuan Wu, Pengfei Zhao, Huafu Xu, Ziming Wang

Abstract: The global incidence of skin cancer is rising, making it an increasingly critical public health issue. Malignant skin tumors such as melanoma originate from pathological alterations of skin cells, and their accurate early-stage segmentation is crucial for quantitative analysis, early diagnosis, and successful treatment. However, achieving precise and efficient segmentation remains a major challenge, as existing methods often struggle to balance computational efficiency with the ability to capture complex lesion characteristics. To address this challenge, we propose a novel deep learning framework that integrates the PVT v2 backbone with two key modules: Spatial-Aware Feature Enhancement (SAFE) and Multiscale Dual Cross-attention Fusion (MDCF). The SAFE module refines multi-scale encoder features through a dual-branch architecture that bridges the feature discrepancy across network depths by combining fine-grained shallow-layer details with deep semantic information via adaptive offset prediction. The MDCF module establishes bidirectional cross-attention between decoder and encoder features, followed by multi-scale deformable convolutions that capture lesion boundaries and small fragments at heterogeneous receptive fields, thereby enriching semantic details while suppressing background responses. The proposed model was evaluated on two public benchmark datasets (ISIC 2016 and ISIC 2018), achieving Intersection over Union (IoU) scores of 87.33% and 83.67%, respectively, demonstrating superior performance compared to current state-of-the-art methods. These results indicate that our framework significantly enhances skin lesion image analysis and offers a promising tool for improving early detection of skin cancer.
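The IoU scores used to evaluate the segmentation model are computed per mask pair. A minimal reference implementation (the treatment of the empty-mask edge case is a common convention, not something the paper specifies):

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union for binary segmentation masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0   # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)
```

Averaging this over a test set gives the dataset-level IoU figures such as the 87.33% reported on ISIC 2016.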

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
