Computer Science and Mathematics


Article
Computer Science and Mathematics
Computer Vision and Graphics

Valli Nayagam, Anukarthika S, Muhesh Krishnaa S, Sri Sathya K B

Abstract: The rapid expansion of sports broadcasting and digital media platforms has increased the demand for intelligent systems capable of automatically identifying important sports events for real-time analytics and highlight generation. Manual annotation of sports videos requires significant time and effort and may introduce human errors during analysis. This paper presents a real-time sports action recognition framework using a hybrid CNN–Transformer architecture for detecting critical events in football and cricket videos. The proposed system processes live or recorded video streams through frame extraction, normalization, and spatial feature learning using the MobileNetV2 network. Temporal relationships between consecutive frames are modeled using a Transformer encoder to improve action understanding. The framework classifies events such as pass and goal in football, and four, six, and wicket in cricket. Motion-based filtering and confidence thresholding reduce non-action frames and improve prediction reliability. Detected events are recorded with timestamps and displayed using broadcast-style overlays to support automated highlight generation. Experimental evaluation demonstrates high recognition accuracy and efficient real-time performance on low-cost hardware platforms. The framework provides an effective solution for sports analytics, media automation, and intelligent decision-support systems.
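The motion-based filtering and confidence thresholding described in the abstract can be sketched as follows. This is an illustrative minimal version, not the authors' implementation; the function name and the `motion_thresh`/`conf_thresh` values are assumptions:

```python
import numpy as np

def is_action_frame(prev_gray, curr_gray, probs,
                    motion_thresh=8.0, conf_thresh=0.7):
    """Return (keep, label_idx): keep a frame only when it shows
    enough motion AND the classifier is confident enough."""
    # Mean absolute frame difference as a cheap motion score.
    motion = np.abs(curr_gray.astype(np.float32)
                    - prev_gray.astype(np.float32)).mean()
    if motion < motion_thresh:
        return False, None   # static scene: skip classification entirely
    label = int(np.argmax(probs))
    if probs[label] < conf_thresh:
        return False, None   # low-confidence prediction: drop it
    return True, label
```

Frames that pass both gates would then be timestamped for the highlight reel; static or uncertain frames never reach the overlay stage.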

Article
Computer Science and Mathematics
Computer Vision and Graphics

Jianhua Zhu, Changjiang Liu, Danling Liang

Abstract: Multi-modal remote sensing image registration is a challenging task due to differences in resolution, viewpoint, and intensity, which often leads to inaccurate and time-consuming results with existing algorithms. To address these issues, we propose an algorithm based on Curvature Scale Space Contour Point Features (CSSCPF). Our approach combines multi-scale Sobel edge detection, dominant direction determination, an improved curvature scale space corner detector, a new gradient definition, and enhanced SIFT descriptors. Test results on publicly available datasets show that our algorithm outperforms existing methods in overall performance. Our code will be released at https://github.com/JianhuaZhu-IR.

Review
Computer Science and Mathematics
Computer Vision and Graphics

Ge Gao, Chen Feng, Yuxuan Jiang, Tianhao Peng, Ho Man Kwan, Siyue Teng, Chengxi Zeng, Yixuan Li, Changqi Wang, Robbie Hamilton, +3 authors

Abstract: While conventional video coding standards remain predominant in real-world applications, neural video compression has emerged over the past decade as an active research area, offering alternative solutions with potentially significant coding gains through end-to-end optimization. Owing to the rapid pace of recent progress, existing reviews of neural video coding quickly become outdated and often lack a systematic taxonomy and meaningful benchmarking. To address this gap, we provide a comprehensive review of two major classes of neural video codecs, scene-agnostic and scene-adaptive, with a focus on their design characteristics and limitations. More importantly, we benchmark representative state-of-the-art methods from each category under common test conditions recommended by video coding standardization bodies. This provides, to the best of our knowledge, the first large-scale unified comparison between conventional and neural video codecs under controlled settings. Our results show that neural codecs can already achieve competitive, and in some cases superior, performance relative to VTM and AVM, although they still fall short of ECM in overall coding efficiency under both Low Delay and Random Access configurations. To facilitate future algorithm benchmarking, we will release the full implementations and results at https://nvc-review-2025.github.io, thereby providing a useful resource for the video compression research community.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Marc Tornero-Soria, Antonio-José Sánchez-Salmerón, Eduardo Vendrell-Vidal

Abstract: Public YOLO model releases typically provide high-level architectural descriptions and headline benchmark results, but offer limited empirical attribution of performance to individual blocks under controlled training conditions. This paper presents a modular, block-level analysis of YOLO26’s object detection architecture, detailing the design, function, and contribution of each component. We systematically examine YOLO26’s convolutional modules, bottleneck-based refinement blocks, spatial pyramid pooling, and position-sensitive attention mechanisms. Each block is analyzed in terms of objective and internal flow. In parallel, we conduct targeted ablation studies to quantify the effect of key design choices on accuracy (mAP50–95) and inference latency under a fixed, fully specified training and benchmarking protocol. Experiments use the MS COCO [1] dataset with the standard train2017 split (≈118k images) for training and the full val2017 split (5k images) for evaluation. The result is a self-contained technical reference that supports interpretability, reproducibility, and evidence-based architectural decision-making for real-time detection models.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Longcheng Huang, Mengguang Liao, Shaoning Li, Chuanguang Zhu, Sichun Long

Abstract: Maritime search and rescue is an important component of emergency response frameworks and primarily relies on UAVs for maritime object detection. However, maritime accidents frequently occur in low-visibility environments, such as foggy or low-light conditions, which lead to low contrast, blurred object boundaries, and degraded texture representations. Most existing maritime object detection algorithms are developed for natural light scenes, and their performance deteriorates markedly when deployed directly in low-visibility environments, primarily due to reduced image quality that hinders feature extraction and semantic information aggregation. Although several studies incorporate image enhancement techniques prior to detection to improve image quality, these approaches often introduce significant additional computational overhead, limiting their practical deployment on UAV platforms. To tackle these challenges, this paper proposes a lightweight model built upon a recent YOLO framework, termed Multi-Scale Adaptive YOLO (MSA-YOLO), for maritime detection using UAVs in low-visibility environments. The proposed model systematically optimizes the backbone, neck, and detection head networks. Specifically, an improved StarNet backbone is designed by integrating ECA mechanisms and multi-scale convolutional kernels, which strengthen feature extraction capability while maintaining low computational overhead. In the neck network, a high-frequency enhanced residual block branch is inserted into the C3k2 module to capture richer detailed information, while depthwise separable convolution is utilized to further reduce computational cost. Moreover, a non-parametric attention module is incorporated into the detection head to adaptively optimize features in the classification and regression branches. 
Finally, a joint loss function that combines bounding box regression, classification, and distribution focal losses is utilized to improve detection accuracy and training stability. Experimental results on the constructed AFO, Zhoushan Island, and Shandong Province datasets demonstrate that, relative to YOLOv11-s, MSA-YOLO reduces model parameters and FLOPs by 52.07% and 41.36%, respectively, while achieving improvements of 1.11% and 1.33% in mAP@0.5:0.95 and mAP@0.5. These results indicate that the proposed method effectively balances computational efficiency and detection accuracy, rendering it suitable for practical maritime search and rescue applications in low-visibility environments.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Xiaoming Chen, Xiaoyu Jiang, Yingqing Huang, Xi Wang, Chaoqun Ma

Abstract: We propose a field-transformation-based framework for generating phase-only light-field holograms from a single RGB image. The method establishes an explicit pipeline from monocular scene inference to holographic wavefront synthesis, without requiring multi-view capture or task-specific hologram-network training. First, we construct a layered occlusion RGB-D model from the input image using monocular depth estimation, connectivity-based layer decomposition, and occlusion-aware inpainting, which provides a lightweight 3D prior for sparse-view rendering in the small-parallax regime. Second, we transform the rendered sparse RGB-D light field into a target complex wavefront on the recording plane through local frequency mapping, thereby bridging explicit scene geometry and wave-optical field construction. Third, we optimize the phase-only hologram under multi-plane amplitude constraints using a geometrically consistent initial phase and an error-driven adaptive depth-sampling strategy, which improves convergence stability and reconstruction quality under a limited computational budget. Numerical experiments show that the proposed method achieves better depth continuity, occlusion fidelity, and lower speckle noise than representative layer-based and point-based methods, and improves the average PSNR and SSIM by approximately 3 dB and 0.15, respectively, over Hogel-Free Holography. Optical experiments further confirm the physical feasibility and robustness of the proposed framework.
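For readers unfamiliar with phase-only hologram optimization, a classical single-plane Gerchberg-Saxton (error-reduction) loop conveys the core idea. This is a simplified textbook sketch, not the paper's multi-plane, adaptively depth-sampled optimizer; all names and parameters below are illustrative:

```python
import numpy as np

def phase_only_hologram(target_amp, n_iter=40, seed=0):
    """Gerchberg-Saxton loop: find a phase-only hologram whose
    FFT (far-field) amplitude approximates target_amp."""
    n = target_amp.size
    # Scale the target so its norm matches what a unit-amplitude
    # hologram can deliver (Parseval), keeping the constraint feasible.
    t = target_amp / np.linalg.norm(target_amp) * n
    rng = np.random.default_rng(seed)
    phase = rng.uniform(0.0, 2.0 * np.pi, target_amp.shape)
    for _ in range(n_iter):
        far = np.fft.fft2(np.exp(1j * phase))    # propagate to image plane
        far = t * np.exp(1j * np.angle(far))     # impose target amplitude
        near = np.fft.ifft2(far)                 # back-propagate
        phase = np.angle(near)                   # keep phase only
    return phase
```

The paper's contribution lies in replacing this flat target with multi-plane amplitude constraints and a geometrically consistent initial phase, which this sketch does not attempt.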

Article
Computer Science and Mathematics
Computer Vision and Graphics

Kosmas Katsioulas, Ilias Maglogiannis

Abstract: Handball performance analysis is still often conducted through manual review of match videos, while automation on broadcast footage remains challenging due to camera motion, strong perspective effects, and frequent occlusions during dense interactions. This study presents a practical and reproducible monocular pipeline for extracting handball analytics from a single broadcast viewpoint. Players are detected per frame, tracked over time, and projected onto a standardized handball court via homography-based camera calibration. The resulting court-referenced trajectories in metric units enable motion indicators such as distance covered and speed, along with coaching-oriented visual summaries including trajectory overlays and heatmaps. In addition, clip-level action recognition is performed using interpretable kinematic and scene-derived features and lightweight classifiers, with a comparative evaluation across multiple classical models. The modular design keeps intermediate steps explicit, supports reproducibility, and facilitates interpretation of both intermediate outputs and final analytics. Experiments on the UNIRI handball dataset demonstrate that meaningful performance analytics and action understanding can be obtained from single-camera broadcast video using transparent intermediate representations. This work highlights the practical potential of interpretable trajectory-based modeling for under-instrumented sports and provides a reproducible baseline for future extensions incorporating richer contextual cues.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Yutang Wang

Abstract: Bearing fault diagnosis in industrial Internet of Things (IIoT) systems faces critical challenges: the need for lightweight models deployable on edge devices, privacy constraints preventing centralized data aggregation, and complex inter-sensor correlations in multi-sensor monitoring systems. This paper proposes FedGNN-SFD, a Federated Graph Neural Network that addresses these challenges through a lightweight graph attention architecture. On the CWRU bearing fault dataset with 1,658 samples across 10 fault categories, FedGNN-SFD achieves 87.95% accuracy with only 69,706 parameters—62 times smaller than CNN-1D (4.34M). Comprehensive experiments demonstrate: (1) the graph attention module contributes +12.5% accuracy improvement compared to simple pooling (91.77% vs 79.32%); (2) federated learning with 5 clients and Non-IID data achieves 87.35% accuracy within 15 communication rounds with only 0.6% gap from centralized training; (3) noise robustness analysis shows stable performance under moderate noise conditions; (4) cross-domain validation across simulated load variations demonstrates consistent generalization. The results validate the effectiveness of the proposed architecture for edge deployment scenarios where model efficiency and privacy preservation are prioritized.
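The federated-learning side of FedGNN-SFD relies on aggregating client models without sharing raw vibration data. A generic FedAvg aggregation step (not the authors' code; names are illustrative) looks like this:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: sample-size-weighted average of client model parameters.
    client_weights: one list of np.ndarray parameter tensors per client."""
    total = float(sum(client_sizes))
    avg = [np.zeros_like(w) for w in client_weights[0]]
    for weights, n in zip(client_weights, client_sizes):
        for a, w in zip(avg, weights):
            a += (n / total) * w   # clients with more samples weigh more
    return avg
```

Repeating this for 15 communication rounds, as in the paper's setting, is what closes the reported 0.6% gap to centralized training under Non-IID client data.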

Article
Computer Science and Mathematics
Computer Vision and Graphics

Mustafa Yurdakul, Süleyman Burçin Şüyun, Şakir Taşdemir

Abstract: Hypertensive retinopathy (HR) is a retinal vascular disorder caused by long-term hypertension and can lead to severe visual impairment. If not detected early, it could progress to irreversible visual impairment and even blindness. Recent advances in deep learning have enabled automated analysis of retinal images to support clinical diagnosis. In this study, we propose a Clinical Attention Module–enhanced convolutional neural network framework (CAM-HR) for automatic classification of HR stages from Optical Coherence Tomography (OCT) images. In the initial scenario, various state-of-the-art architectures of convolutional neural networks (CNN) have been utilized as baseline models. In the next scenario, the Clinical Attention Module (CAM) is utilized with these architectures to focus on clinically significant regions of the retina, such as vascular structures and lesion locations. The models are evaluated using accuracy, precision, recall, F1-score, and Cohen’s kappa metrics. Experimental results demonstrate that the proposed CAM module consistently improves classification performance across different backbone architectures, achieving the best performance with the ConvNeXt + CAM model. These findings indicate that clinically guided attention mechanisms can significantly enhance automated HR diagnosis from OCT images.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Young Kim, Sunggyu Choi, Chulmin Park, Woojin Park, Doohee Lee

Abstract: Virtual Bronchoscopic Navigation (VBN) is a critical tool for guiding bronchoscopes toward peripheral pulmonary lesions (PPLs), yet its widespread clinical adoption has been limited by the high cost of proprietary software and the fragility of segmentation-dependent path planning in distal airways. In this study, we present Virtual Bronchoscopic Pathfinder (VBP), a complete, open-source, web-based VBN system that addresses both barriers. VBP integrates five components: (i) a connectivity-aware deep learning model for pulmonary airway segmentation incorporating Connectivity-Aware Surrogate (CAS) and Local-Sensitive Distance (LSD) modules; (ii) TotalSegmentator for automated tumor localization; (iii) a topology-preserving 3D thinning algorithm implemented in C++ for centerline extraction; (iv) a bidirectional Dijkstra algorithm operating on a three-tier anatomical cost field (centerline cost 1.0, airway lumen cost 10.0, parenchyma cost 100.0) to guarantee continuous path generation even under partial skeleton disconnection; and (v) a zero-footprint browser-based visualization interface built on the vtk.js engine, providing synchronized 2D axial viewing and interactive 3D volume rendering. VBP was validated on 306 thin-section CT series (154 subjects) from the public Lung-PET-CT-Dx dataset, achieving a path-generation success rate of 100% across the anatomically valid cohort. The system is publicly accessible at https://vbn.ziovision.ai and was confirmed to operate without client-side installation across desktop, laptop, and mobile device configurations. These results demonstrate that a reliable, accessible, and scalable VBN system can be constructed entirely from open-source components, offering a practical foundation for imaging informatics research and future intra-procedural bronchoscopic guidance.
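The tiered cost field in component (iv) is what guarantees a path even when the extracted centerline is broken. A plain single-direction Dijkstra sketch (VBP uses a bidirectional variant; the grid and function name here are illustrative) shows the mechanism: cheap centerline cells (1.0) are preferred, but the search can fall back to lumen (10.0) or parenchyma (100.0) cells to bridge gaps:

```python
import heapq

def cheapest_path(grid, start, goal):
    """Dijkstra on a 2D cost grid; cost is paid on entering a cell."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), float("inf")):
            continue                      # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + grid[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(pq, (nd, (nr, nc)))
    return float("inf")
```

Because every cell has a finite cost, path generation cannot fail outright; a disconnected skeleton only makes the path locally more expensive, matching the reported 100% path-generation success rate.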

Article
Computer Science and Mathematics
Computer Vision and Graphics

Sarthak Kumar Maharana

,

Shambhavi Mishra

,

Yunbei Zhang

,

Shuaicheng Niu

,

Taki Hasan Rafi

,

Jihun Hamm

,

Marco Pedersoli

,

Jose Dolz

,

Yunhui Guo

Abstract: Deep neural networks achieve remarkable performance when training and test data share the same distribution, but this assumption often fails in real-world scenarios where data experiences continual distributional shifts. Continual Test-Time Adaptation (CTTA) tackles this challenge by adapting pretrained models to non-stationary target distributions on-the-fly, without access to source data or labeled targets, while addressing two critical failure modes: catastrophic forgetting of source knowledge and error accumulation from noisy pseudo-labels over time. In this comprehensive survey, we formally define the CTTA problem, analyze the diverse continual domain shift patterns that arise under different evaluation protocols, and propose a hierarchical taxonomy categorizing existing methods into three families: optimization-based strategies (entropy minimization, pseudo-labeling, parameter restoration), parameter-efficient methods (normalization layer adaptation, adaptive parameter selection), and architecture-based approaches (teacher-student frameworks, adapters, visual prompting, masked modeling). We systematically review representative methods in each category and provide comparative benchmarks and experimental results across standard evaluation settings. Finally, we discuss the limitations of current approaches and highlight emerging research directions, including adaptation of foundation models and black-box systems, offering a roadmap for future work in robust continual test-time adaptation, with further resources available at https://github.com/sarthaxxxxx/Awesome-Continual-Test-Time-Adaptation.
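Of the optimization-based strategies in the taxonomy, entropy minimization is the simplest to state: adapt the model so its test-time predictions become more confident. A minimal NumPy sketch of the TENT-style objective (just the loss, not the gradient updates to normalization parameters) is:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, float)
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_loss(logits):
    """Mean Shannon entropy of predictions: the test-time objective
    minimized (w.r.t. a small parameter subset) in entropy-based CTTA."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())
```

Confident batches give a loss near zero while uncertain ones give up to log(num_classes), which is also why noisy pseudo-labels can accumulate error if this objective is minimized without the restoration or teacher-student safeguards the survey covers.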

Review
Computer Science and Mathematics
Computer Vision and Graphics

Mathias J.P.M. Lemmens

Abstract: Digital photogrammetry emerged around 1980 and decisively accelerated the automation of workflows for converting images into georeferenced datasets for a wide range of applications. The development of innovative technologies, including metric digital cameras, the miniaturization of powerful computers, and positioning and orientation systems, has accelerated since the turn of the century. Advanced photogrammetric and computer vision algorithms have been developed and implemented in software, allowing many workflows to run on computers from beginning to end. Today, final products can be generated largely automatically, minimizing the timespan, even down to real time, between image capture and delivery of the datasets needed for the task at hand. Thanks to the wide availability of commercial and open-source software, the scope of applications has expanded rapidly, leading to significant growth in the number of new users of photogrammetry. This article aims to serve this new group by providing an overview of the technologies underlying current photogrammetric workflows, starting with the geometric fundamentals of camera modeling and georeferencing. Next, we examine the algorithms that have revolutionized these workflows and are known by various names, particularly image matching, computer stereo vision, and structure from motion (SfM). The basic characteristics of final photogrammetric products are then briefly discussed, followed by methods for assessing the accuracy of the final product, a key component of extracting geometric information from imagery. The discussion section provides tips for selecting suitable textbooks to deepen your knowledge.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Rahul Prabhu, Rishikesh Madhuvairy

Abstract: Automated defect classification in wafer maps is critical for semiconductor yield management and quality control, but pure deep learning models often underperform on rare or spatially subtle defect types and lack interpretability. Handcrafted spatial features can capture physical defect characteristics, yet their integration with modern CNNs is underexplored. We systematically evaluate eight physically motivated spatial descriptors (radial mean and standard deviation, directional entropy, aspect ratio, fail fraction, and zone-wise failure densities for the core, mid, and edge regions) by training a lightweight CNN (101k parameters) augmented with each descriptor, both individually and in full combination. To the best of our knowledge, this is the first systematic ablation study to quantify the synergistic effect of fusing physically informed spatial descriptors with a modern, edge-optimized CNN for this task. On the WM-811K benchmark (eight defect classes, 25,519 labeled wafers), the vision-only baseline achieves 60.0% test accuracy and 0.615 weighted F1. Nearly every descriptor individually underperforms the baseline, with the best descriptor (fail fraction) reaching only 60.1%. However, the full fusion of all eight descriptors significantly outperforms the baseline, reaching 72.7% accuracy (+12.7 points) and 0.728 weighted F1. This synergy demonstrates that the spatial descriptors provide complementary information that is only realizable in combination. Per-class analysis reveals that the combination model substantially improves challenging classes: Donut F1 rises from 0.207 to 0.493, Edge-Loc from 0.384 to 0.672, and Center from 0.579 to 0.760. However, the Loc class remains challenging for all models, likely due to its diffuse spatial patterns.
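Several of the named descriptors can be computed directly from a binary wafer map. The sketch below (function name, zone boundaries at radii 1/3 and 2/3, and the 1 = fail convention are assumptions for illustration) covers fail fraction, radial mean, and the three zone-wise failure densities:

```python
import numpy as np

def wafer_descriptors(wmap):
    """Fail fraction, radial mean of fails, and core/mid/edge failure
    densities for a 2D wafer map where 1 marks a failing die."""
    h, w = wmap.shape
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r = np.hypot(yy - cy, xx - cx)
    r = r / r.max()                          # normalized radius in [0, 1]
    fails = wmap == 1
    fail_fraction = float(fails.mean())
    radial_mean = float(r[fails].mean()) if fails.any() else 0.0
    zones = {}
    for name, lo, hi in [("core", 0.0, 1 / 3), ("mid", 1 / 3, 2 / 3),
                         ("edge", 2 / 3, 1.0 + 1e-9)]:
        m = (r >= lo) & (r < hi)
        zones[name] = float(fails[m].mean()) if m.any() else 0.0
    return fail_fraction, radial_mean, zones
```

Feeding such scalars alongside the CNN's learned features is the fusion the ablation quantifies.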

Article
Computer Science and Mathematics
Computer Vision and Graphics

Mingxuan Du, Yutian Zeng

Abstract: The proliferation of 4D point cloud videos highlights their potential, but the high cost of obtaining large-scale annotated data severely limits supervised methods. Consequently, self-supervised learning (SSL) is vital for learning generalizable representations from unlabeled 4D data. While existing SSL frameworks, such as Uni4D, have made progress, they often struggle with fine-grained motion understanding in extremely dynamic scenes, maintaining robustness under severe occlusion, and developing explicit predictive capabilities. To address these, we propose Dynamic4D, a novel and robust self-supervised framework tailored for dynamic 4D point cloud understanding. Dynamic4D introduces an Adaptive Causal Temporal Attention (ACTA) mechanism in the encoder for explicit causal temporal modeling and dynamic region-focused learning. Its decoder employs Motion Prediction Tokens (MPT) to directly infer motion vectors for masked regions. A novel adaptive motion-sensitive masking strategy further enhances robustness by intelligently prioritizing high-dynamic zones. Our multi-objective pre-training strategy integrates a new Dynamic Perception Loss alongside geometric reconstruction and latent-space alignment. Extensive experiments on diverse challenging benchmarks demonstrate that Dynamic4D consistently achieves state-of-the-art performance. It substantially outperforms prior methods, validating its superior capacity to learn highly robust, generalizable, and motion-aware representations for complex dynamic 4D point cloud scenes.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Bowen Nian, Mingyu Tan

Abstract: Autonomous driving in urban environments demands deep contextual understanding, anticipation, and transparent explanations, which current purely data-driven systems often lack due to their limited causal reasoning abilities. We introduce CausalDrive, a novel unified framework integrating advanced multimodal perception with explicit causal reasoning within a Large Language Model architecture. Leveraging Mistral-7B, CausalDrive employs a Multimodal Perception Encoder for comprehensive scene understanding, a Causal Graph Induction Module to dynamically infer causal relationships between entities, and a Perceptual-Causal Alignment Module to unify these diverse inputs for the LLM. It is fine-tuned for Causal-aware Multimodal Future Prediction, Explainable Decision Making and Planning, and Causal Scene Question Answering. Extensive experiments on augmented nuScenes and Waymo Open Datasets demonstrate that CausalDrive consistently outperforms state-of-the-art baselines across tasks, achieving superior predictive accuracy, robust planning, and enhanced robustness to noise. Ablation studies confirm the Causal Graph Induction Module's critical contribution. Human evaluations validate its exceptional explainability and helpfulness. Despite higher computational cost, CausalDrive significantly advances intelligent, trustworthy, and human-understandable autonomous driving by explicitly addressing the causal "why" behind events.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Zichong Gu, Shiyi Mu, Hanqi Lyu, Shugong Xu

Abstract: Open-vocabulary 3D object detection (OV-3DOD) is crucial for real-world perception, yet existing monocular methods are often limited by predefined categories or heavy reliance on external 2D detectors. In this paper, we propose CLIP-Mono3D, an end-to-end one-stage transformer framework that directly integrates vision-language semantics into monocular 3D detection. By leveraging CLIP-derived semantic priors and grounding object queries in semantically salient regions, our model achieves robust zero-shot generalization to novel categories without requiring auxiliary 2D detectors. Furthermore, we introduce OV-KITTI, a large-scale benchmark extending KITTI with 40 new categories and over 7,000 annotated 3D bounding boxes. Extensive experiments on OV-KITTI, KITTI, and Argoverse demonstrate that CLIP-Mono3D achieves competitive performance in open-vocabulary scenarios.

Article
Computer Science and Mathematics
Computer Vision and Graphics

Dana El-Rushaidat, Nour Almohammad, Raine Yeh, Kinda Fayyad

Abstract: This paper addresses the critical communication barrier experienced by deaf and hearing-impaired individuals in the Arab world through the development of an affordable, video-based Arabic Sign Language (ArSL) recognition system. Designed for broad accessibility, the system eliminates specialized hardware by leveraging standard mobile or laptop cameras. Our methodology employs Mediapipe for real-time extraction of hand, face, and pose landmarks from video streams. These anatomical features are then processed by a hybrid deep learning model integrating Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), specifically Bidirectional Long Short-Term Memory (BiLSTM) layers. The CNN component captures spatial features, such as intricate hand shapes and body movements, within individual frames. Concurrently, BiLSTMs model long-term temporal dependencies and motion trajectories across consecutive frames. This integrated CNN-BiLSTM architecture is critical for generating a comprehensive spatiotemporal representation, enabling accurate differentiation of complex signs where meaning relies on both static gestures and dynamic transitions, thus preventing misclassification that CNN-only or RNN-only models would incur. Rigorously evaluated on the author-created JUST-SL dataset and the publicly available KArSL dataset, the system achieved 96% overall accuracy for JUST-SL and an impressive 99% for KArSL. These results demonstrate the system’s superior accuracy compared to previous research, particularly for recognizing full Arabic words, thereby significantly enhancing communication accessibility for the deaf and hearing-impaired community.

Data Descriptor
Computer Science and Mathematics
Computer Vision and Graphics

Basit Raza, Sadaf Bibi, Sadia Bibi, Ali Nawaz

Abstract: SadaColorDataset (SCD) is a publicly available image dataset designed to support research on robust color recognition and illumination-related color variation in real mobile captures. The dataset contains 10,843 photographs of nine physical color papers (Black, Blue, Gray, Orange, Pink, Purple, Sky Blue, White, and Yellow) recorded under four everyday lighting conditions: Fluorescent, Indoor, Indoor Night, and Sunlight. All images were captured using an Infinix NOTE 40 smartphone camera (108 MP) with a simple, repeatable setup intended to reflect practical conditions rather than laboratory calibration. For each color–illumination setting, multiple images were collected to cover natural variability due to exposure, white balance, shadows, and reflections. During acquisition, the paper was placed on a ground surface and the phone was mounted on a tripod; the viewpoint was varied by moving the tripod to different positions and orientations. However, because explicit angle labels were not recorded or reliably recoverable from the released file structure/metadata, SCD does not provide calibrated or discrete “angle IDs,” and it should be treated as a dataset with unlabeled viewpoint variation. Along with the images, we release machine-readable metadata and summary files that describe image counts across colors and illuminations and provide basic color statistics (e.g., RGB/CIELAB-derived measures) to facilitate reproducible analysis. SCD is distributed under a public license and is intended for benchmarking illumination robustness, dataset shift, and color stability in mobile vision pipelines.
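The CIELAB-derived measures mentioned in the metadata follow the standard sRGB-to-Lab conversion. A self-contained sketch under a D65 white point (the dataset's exact statistics pipeline is not specified, so this is illustrative):

```python
import numpy as np

def srgb_to_lab(rgb):
    """Convert one sRGB triple (0-255) to CIELAB (D65 white point)."""
    c = np.asarray(rgb, float) / 255.0
    # Undo sRGB gamma to get linear RGB.
    c = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    M = np.array([[0.4124564, 0.3575761, 0.1804375],
                  [0.2126729, 0.7151522, 0.0721750],
                  [0.0193339, 0.1191920, 0.9503041]])
    x, y, z = M @ c / np.array([0.95047, 1.0, 1.08883])  # D65 reference
    f = lambda t: t ** (1 / 3) if t > (6 / 29) ** 3 \
        else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(x), f(y), f(z)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)
```

Because L* separates lightness from the a*/b* chroma axes, per-image Lab statistics make the Fluorescent vs. Sunlight shifts in the dataset far easier to quantify than raw RGB means.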

Article
Computer Science and Mathematics
Computer Vision and Graphics

Chima Okwuokei, Desmond Moru, Clifford Uroh, Samuel Oyefusi

Abstract: Virtual Reality (VR) is increasingly recognized as a valuable tool for sports training, providing immersive environments that support skill acquisition and performance improvement. Comparative studies across hand-intensive sports such as basketball, volleyball, and table tennis show substantial research on VR’s effectiveness in basketball and table tennis, yet volleyball remains relatively underexplored, particularly in terms of skill transfer to real-world play. Research in basketball and table tennis indicates that VR can improve motor coordination, tactical awareness, and user motivation. However, volleyball-specific literature is limited. Existing studies generally focus on areas such as eye–hand coordination and tactical decision-making but provide little evidence on whether VR-acquired skills translate effectively to the court. This paper addresses the gap in volleyball-focused VR research and emphasises the need for further investigation to maximise VR’s potential for volleyball training. Ten beginner-level volleyball players (mean age = 20.4 years) participated in this study, which examined the effectiveness of VR-based serving training. Participants completed an initial physical pre-test to determine their baseline serving performance, followed by a three-week VR training program consisting of structured serving drills. After the program, a post-test assessment was conducted to measure improvement. A paired t-test comparing pre- and post-training results showed a statistically significant improvement in serving performance (p = 0.0147), meeting the 0.05 significance threshold. This indicates that the observed performance gains were unlikely due to chance and demonstrates the positive impact of VR training on serving skills in beginner volleyball players.
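The paired pre/post comparison reported above (n = 10, p = 0.0147) uses a standard paired t-test. A stdlib-only sketch of the test statistic (the players' actual scores are not published, so the usage data below is invented for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(pre, post):
    """Paired t statistic for pre/post scores (post - pre differences)."""
    d = [b - a for a, b in zip(pre, post)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n))
```

With 10 participants there are 9 degrees of freedom, so the two-tailed critical value at alpha = 0.05 is about 2.262; any |t| above that yields p < 0.05, consistent with the reported result.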

Article
Computer Science and Mathematics
Computer Vision and Graphics

Siyuan Wu, Pengfei Zhao, Huafu Xu, Ziming Wang

Abstract: The global incidence of skin cancer is rising, making it an increasingly critical public health issue. Malignant skin tumors such as melanoma originate from pathological alterations of skin cells, and their accurate early-stage segmentation is crucial for quantitative analysis, early diagnosis, and successful treatment. However, achieving precise and efficient segmentation remains a major challenge, as existing methods often struggle to balance computational efficiency with the ability to capture complex lesion characteristics. To address this challenge, we propose a novel deep learning framework that integrates the PVT v2 backbone with two key modules: Spatial-Aware Feature Enhancement (SAFE) and Multiscale Dual Cross-attention Fusion (MDCF). The SAFE module refines multi-scale encoder features through a dual-branch architecture that bridges the feature discrepancy across network depths by combining fine-grained shallow-layer details with deep semantic information via adaptive offset prediction. The MDCF module establishes bidirectional cross-attention between decoder and encoder features, followed by multi-scale deformable convolutions that capture lesion boundaries and small fragments at heterogeneous receptive fields, thereby enriching semantic details while suppressing background responses. The proposed model was evaluated on two public benchmark datasets (ISIC 2016 and ISIC 2018), achieving Intersection over Union (IoU) scores of 87.33% and 83.67%, respectively, demonstrating superior performance compared to current state-of-the-art methods. These results indicate that our framework significantly enhances skin lesion image analysis and offers a promising tool for improving early detection of skin cancer.
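The IoU scores used to evaluate the segmentation model are computed per mask pair. A minimal reference implementation (the treatment of the empty-mask edge case is a common convention, not something the paper specifies):

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union for binary segmentation masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0   # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)
```

Averaging this over a test set gives the dataset-level IoU figures such as the 87.33% reported on ISIC 2016.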

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
