In recent years, the demand for precise diagnosis of Diabetic Retinopathy (DR) has received considerable attention, prompting the development of numerous Computer-Aided Diagnosis (CAD) methods designed to aid clinicians in interpreting fundus images. Deep learning algorithms have particularly stood out due to their exceptional ability to automatically extract and classify features. For example, Sheikh and Qidwai [
4] applied the MobileNetV2 architecture to a different dataset, utilizing transfer learning to achieve 90.8% accuracy in diagnosing DR and 92.3% accuracy in identifying referable diabetic retinopathy (RDR) cases. In [
5], the researchers tackled the problem as a binary classification task, attaining 91.1% accuracy on the Messidor dataset and 90.5% on the EyePACS dataset. These results underscore the method's strong potential for application in clinical environments. Moreover, the study in [
6] proposed a multi-channel Generative Adversarial Network (GAN) with semi-supervised learning for assessing diabetic retinopathy (DR). The model tackles the issue of mismatched labeled data in DR classification through three primary mechanisms: a multi-channel generative approach that produces sub-field images, semi-supervised training that effectively utilizes both labeled and unlabeled data, and a DR feature extractor designed to capture representative features from high-resolution fundus images. In their study [
4], Touati et al. began the retinopathy workflow by converting images into a hierarchical data format, with steps including pre-processing, data augmentation, and training. The Otsu method was employed for image cropping, specifically to isolate the circular colored retinal region. Normalization was then applied: the minimum pixel intensity was subtracted and the result was divided by the average pixel intensity, bringing the pixel values into the 0 to 1 range. Contrast enhancement was performed with contrast-limited adaptive histogram equalization (CLAHE). In [
7], M. Touati et al. presented an approach that combines image processing with transfer learning techniques. The advanced image processing steps are designed to extract richer features, improving the quality of subsequent analysis. Transfer learning, using the Xception model, speeds up the training process by utilizing pre-existing knowledge. These combined techniques resulted in high training accuracy (92%) and test accuracy (88%), demonstrating the effectiveness of the proposed method. In a separate study, Yaakoob et al. [
8] developed a method for detecting and grading diabetic retinopathy by merging ResNet-50 features with a Random Forest classifier. This approach leverages features from ResNet-50’s average pooling layer and highlights the role of specific layers in improving performance. ResNet helps overcome issues like vanishing gradients, enabling effective training of deeper networks. In article [
9], researchers used feature extraction to identify anomalies in retinal images, enabling rapid diabetic retinopathy (DR) detection on a severity scale of 0 to 4. Various classification algorithms were tested, with the Naïve Bayes classifier achieving 83% accuracy. In [
10], Toledo-Cortés et al. presented DLGP-DR, an advanced deep learning model that improved classification and ranking of diabetic retinopathy (DR) using a Gaussian process. DLGP-DR outperformed previous models in accuracy and AUC scores, providing enhanced insights into misclassifications [
11]. Experiments on the Messidor dataset demonstrated that the proposed model outperforms other notable models [
11,
12], in terms of accuracy, AUC, sensitivity, and overall performance, even with only 100 labeled samples. The approach combines a CNN with an attention network, achieving Kappa scores of 0.857 and 0.849 and sensitivity rates of 0.978 and 0.960. In [
13], Touati et al. introduced a ResNet50 model integrated with attention mechanisms, marking a significant advancement in diabetic retinopathy (DR) detection. The model achieved a training accuracy of 98.24% and an F1-score of 95%, demonstrating superior performance compared to existing methods. The approach described in [
14], named TaNet, leverages transfer learning for classification and has shown excellent results on datasets such as Messidor-2, EYEPACS-1, and APTOS 2019. The model achieved impressive metrics, including 98.75% precision, 98.89% F1-score, and 97.89% recall, outperforming current methods in terms of accuracy and prediction performance. In [
15], four preprocessing scenarios on the APTOS dataset were tested using HIST, CLAHE, and ESRGAN. The CLAHE and ESRGAN combination achieved the highest accuracy of 97.83% with a CNN, matching the performance of experienced ophthalmologists. This underscores the value of advanced preprocessing in improving DR detection and suggests that further research on larger datasets could be beneficial. In a manner similar to [
17], which introduced a novel ViT model for predicting diabetic retinopathy severity using the FGADR dataset, [
16] underscores the potential of Vision Transformers in advancing diagnostic accuracy and performance in medical imaging tasks. The study in [
18] presents DR-CCTNet, a modified transformer model designed to improve automated DR diagnosis. Tested on diverse fundus images from five datasets with varying resolutions and qualities, the model utilized advanced image processing and augmentation techniques on a large dataset of 154,882 images. The compact convolutional transformer was found to be the most effective, achieving 90.17% accuracy even with low-resolution images. Key contributions include a robust dataset, innovative augmentation methods, improved image quality through pre-processing, and model optimization for better performance with smaller images. In [
19], a new deep learning model, Residual-Dense System (RDS-DR), was developed for early diabetic retinopathy (DR) diagnosis. This model combines residual and dense blocks to effectively extract and integrate features from retinal images. Trained on 5,000 images, RDS-DR achieved a high accuracy of 97% in classifying DR severity. It outperformed leading models like VGG16, VGG19, Xception, and InceptionV3 in both accuracy and computational efficiency. Beraber [
20] presents a novel approach for detecting and classifying diabetic retinopathy using fundus images. The method employs a feature extraction technique known as "Uniform Local Binary Pattern Encoded Zeroes" (ULBPEZ), which reduces the feature size to 3.5% of the original for a more compact representation. Preprocessing includes histogram matching for brightness standardization, median filtering for noise reduction, adaptive histogram equalization for contrast enhancement, and unsharp masking for detail sharpening. Nafseh Ghafar et al. [
22] emphasize that deep learning (DL) algorithms excel in medical image analysis, especially for fusion, segmentation, registration, and classification tasks. Among machine learning (ML) and deep learning (DL) techniques, support vector machines (SVM) and convolutional neural networks (CNN) are particularly noted for their effectiveness. Yasashvini R et al. [
21] investigated the use of convolutional neural networks (CNN) and hybrid CNNs for diabetic retinopathy classification. They developed several models, including a standard CNN, a hybrid CNN with ResNet, and a hybrid CNN with DenseNet. The models achieved accuracy rates of 96.22%, 93.18%, and 75.61%, respectively. The study found that the hybrid CNN with DenseNet was the most effective for automated diabetic retinopathy classification. Nafseh Ghafar et al. [
22] highlight that the vast amounts of data generated in healthcare are well suited to Deep Learning (DL) and Machine Learning (ML) advancements, with medical images from various sources being key to improving analysis. To enhance image quality for CAD systems in diabetes detection, techniques such as denoising, normalization, bias field correction, and data balancing are used. These methods respectively reduce noise, standardize intensity, correct intensity variations, and address class imbalance. Yaoming Yang et al. [
23] examined the advancement of Transformers in NLP and computer vision, highlighting the 2017 introduction of the Transformer, which improved NLP by capturing long-range text dependencies. Their pipeline involves resizing retinal images to 448 x 448 pixels, normalizing them, and dividing them into 16 x 16-pixel patches with random masks. These patches are processed by a pre-trained Vision Transformer (ViT) to extract features, which are then decoded, reconstructed, and used by a classifier to detect diabetic retinopathy (DR). The study found that pre-training a ViT with Masked Autoencoders (MAE) on over 100,000 retinal images yielded better DR detection than pre-training on ImageNet, achieving 93.42% accuracy, 0.9853 AUC, 0.973 sensitivity, and 0.9539 specificity. More recently, in 2021, Nikhil Sathya et al. [
24] introduced an innovative approach by combining Vision Transformers (ViT) with convolutional neural networks (CNNs) for medical image analysis. Jianfang Wu et al. [
25] highlighted the importance of attention mechanisms in natural language processing, noting that transformers, which eschew traditional convolutional layers for multi-head attention, offer advanced capabilities. [
28] Although CNNs have proven effective in grading diabetic retinopathy by efficiently extracting pixel-level features, the emergence of transformers offers potential benefits in this field. Integrating CNNs with Vision Transformers (ViTs) has proven more effective than relying on pure ViTs alone: CNNs are limited in modeling relationships between distant pixels, whereas ViTs perform exceptionally well in complex tasks such as dense prediction and tiny-object detection, making the two architectures complementary. However, ViTs are still considered black boxes due to their opaque internal processes, highlighting the need for further research into explainable ViT models or hybrid CNN-ViT models for diabetic retinopathy classification and similar applications.
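Several of the transformer-based pipelines surveyed above rest on the same two primitives: splitting a fundus image into fixed-size patches and applying multi-head self-attention over the resulting tokens. The following NumPy sketch illustrates both steps for intuition only; the 448 x 448 image size and 16 x 16 patch size mirror the configuration reported in [23], but the embedding dimension, head count, and all random projection weights are hypothetical placeholders, not trained parameters from any of the cited models.

```python
import numpy as np

def to_patches(img, patch=16):
    """Split an (H, W, C) image into flattened (patch*patch*C) tokens."""
    h, w, c = img.shape
    t = img.reshape(h // patch, patch, w // patch, patch, c)
    t = t.transpose(0, 2, 1, 3, 4)          # group the patch grid together
    return t.reshape(-1, patch * patch * c)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, heads):
    """Scaled dot-product self-attention with `heads` parallel heads.

    x: (n_tokens, d_model); wq, wk, wv, wo: (d_model, d_model).
    """
    n, d = x.shape
    dh = d // heads
    # Project, then reshape to (heads, n_tokens, d_head).
    q = (x @ wq).reshape(n, heads, dh).transpose(1, 0, 2)
    k = (x @ wk).reshape(n, heads, dh).transpose(1, 0, 2)
    v = (x @ wv).reshape(n, heads, dh).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)  # (heads, n, n)
    out = softmax(scores) @ v                        # (heads, n, dh)
    out = out.transpose(1, 0, 2).reshape(n, d)       # concatenate heads
    return out @ wo

rng = np.random.default_rng(0)
img = rng.random((448, 448, 3))             # stand-in fundus image
tokens = to_patches(img)                    # (784, 768): a 28 x 28 patch grid
w_embed = rng.normal(0, 0.02, (768, 64))    # placeholder patch embedding
x = tokens @ w_embed                        # (784, 64) token embeddings
ws = [rng.normal(0, 0.02, (64, 64)) for _ in range(4)]
attended = multi_head_attention(x, *ws, heads=4)     # (784, 64)
```

Every output token here is a weighted mixture of all 784 patch tokens, which is precisely the long-range pixel interaction that the CNN-versus-ViT discussion above turns on.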