1. Introduction
The Eurasian grapevine (Vitis vinifera L.) is the most extensively cultivated and economically significant horticultural crop in the world, having been cultivated since ancient times [1]. Owing to its substantial production, this crop plays a crucial role in the economies of many countries [2]. The fruit is important because it can be used both for consumption and for the production of wine. The number of grape varieties in the world is unknown, but specialists estimate it at 5000 to 8000, registered under 14,000 to 24,000 different names [3,4,5]. Despite this huge number, only 300 to 400 varieties account for most of the grape plantings in the world [4]. The most common grape varieties in the world are Kyoho, Cabernet Sauvignon, Sultanina, Merlot, Tempranillo, Airen, Chardonnay, Syrah, Red Globe, Grenache Noir, Pinot Noir and Trebbiano Toscano [6].
The grape variety plays an important role in the wine production chain and in leaf consumption, since in some cases the leaves can be more costly than the fruit [7,8]. Wine is one of the most popular agri-foods around the world [9]. In 2019, the European Union accounted for 48% of world consumption and 63% of world production [10]. In terms of value, the wine market totalled almost 29.6 billion euros in 2020, despite the Covid-19 pandemic crisis [10]. The varieties used in the production of the drink directly influence its authenticity and classification, and due to their socioeconomic importance, identifying grape varieties has become an important part of production regulation. Furthermore, recent results by Jones and Alves [11] highlighted that some varieties can be sensitive to warmer environments in the context of climate change, accentuating the need for tools for grapevine variety identification.
Nowadays, the identification of grapevine varieties is carried out mostly using ampelography or molecular analysis. Ampelography, defined by Chitwood et al. [12] as "the science of phenotypic distinction of vines", is one of the most accurate ways of identifying grape varieties through visual analysis. Its authoritative reference is the Précis d'Ampélographie Pratique [13], and it relies on well-defined official descriptors of the plant material for grape identification [14,15]. Despite its wide utilisation, ampelography depends on the person carrying it out, as with any visual analysis task, making the process subjective. It can also be exposed to interference from environmental, cultural and genetic conditions, introducing uncertainty into the identification process [14,16]. It can be time-consuming and error-prone, just like any other human-based task, and ampelographers are becoming scarce [17].
Molecular marker analysis is another technique that has been used to identify grape varieties [17]. Among the available markers, random amplified polymorphic DNA, amplified fragment length polymorphism and microsatellite markers have been applied to grape variety identification [17]. This technique avoids the problems of subjectivity and environmental influence. However, it must be complemented by ampelography, since some leaf characteristics can only be assessed in the field [3,18,19]. In addition, identifying grape varieties for production control and regulation would involve several molecular analyses, increasing the costs and time required.
With the advance of computer vision techniques and easier access to data, several studies have emerged with the aim of automatically identifying grapevine varieties. Initially, they were based on classic machine learning classifiers, e.g. Support Vector Machines and Artificial Neural Networks [20], the Nearest Neighbour algorithm [21] and Partial Least Squares regression [22], fed with manually or statistically extracted features, e.g. indices, or with the data directly. However, in 2012, with the advent of Deep Learning (DL), more specifically the study by Krizhevsky et al. [23], computer vision classifiers became capable of reaching or, in some cases, surpassing human capacity. Lately, transfer learning and fine-tuning approaches have allowed these models to be applied to many general computer vision tasks, such as object detection, semantic segmentation and instance segmentation, and to other research domains, for example precision agriculture and medical image analysis. The automatic identification of grapevine varieties has followed this lead, and most studies now use DL-based classifiers in their approaches.
In this study, recent literature on the identification of grapevine varieties using DL-based classification approaches was reviewed. The steps of the DL-based classification process (data preparation, choice of architecture, training and model evaluation) were described for the 18 most relevant studies found in the literature, highlighting their pros and cons. Possible directions for improving this field of research are also presented. To the best of our knowledge, there are no studies in the literature with the same objective. However, this study may have some intersection with Chen et al. [24], which reviewed studies that used deep learning for plant image identification. In addition, Mohimont et al. [25] reviewed studies that used computer vision and DL for yield-related precision viticulture tasks, e.g. flower counting, grape detection, berry counting and yield estimation, while Ferro and Catania [26] surveyed the technologies employed in precision viticulture, covering topics ranging from sensors to computer vision algorithms for data processing. It is important to emphasise that the explanation of computer vision algorithms is already widespread in the literature and will not be covered in this study. One can refer to Chai et al. [27] and Khan et al. [28] for advances in the field of natural scenes, or Dhanya et al. [29] for developments in the field of agriculture.
The remainder of this article is organised as follows. In Section 2, the research questions, inclusion criteria, search strategy and extraction of the characteristics of the selected studies are described. Then, in Section 3, the results are presented, highlighting the approaches used in each stage of creating a DL-based classifier. In Section 4, a discussion of the selected studies is presented, focussing on the pros and cons of the approaches used and introducing techniques that can still be explored in the context of identifying grapevine varieties with DL-based methods. Finally, in Section 5, the main conclusions are presented.
3. Results
As shown in Table 1, 18 studies were identified from the selected sources. Figure 1 shows a graph relating the countries of origin of the datasets, the years of publication and the focus of the selected studies. Most of the studies were published in 2021 and 2023, and most of the datasets originate from Portugal. This field of research has therefore been active in recent years, especially in countries where grape cultivation is economically relevant. Furthermore, most studies focused on the leaves to identify grape varieties.
All the selected studies followed the classic process for training DL models, shown in Figure 2. First, the data is acquired and prepared for training. Next, pre-processing steps are applied to the data in order to improve classification quality. Then, architectures are selected or created, and subsequently trained on the data. In the final step, the resulting models are evaluated. To better understand the different approaches adopted at each step, this pipeline will guide the discussion.
Firstly, the Datasets and Benchmarks used in the studies will be presented, and then the approaches used in the pre-processing stage will be detailed. Next, the architecture and training process adopted will be discussed. Finally, the metrics and explanation techniques used to evaluate the studies will be discussed.
3.1. Datasets and Benchmarks
Details of the datasets used by the studies included in this review are presented in Table 2.
Images are the main data source for identifying grapevine varieties in DL-based classification studies, with some studies focusing on leaves and others on fruit. Most of the studies used datasets composed of images acquired in the field with a camera. This is justified because classifiers trained with images acquired in a controlled environment can have limited usability. In addition, most controlled-environment techniques are invasive, requiring the leaf to be removed from the plant.
As in other fields of research, only a few studies made their datasets available. Peng et al. [42] and Franczyk et al. [43] used the Embrapa Wine Grape Instance Segmentation Dataset (Embrapa WGISD) [47], which is composed of 300 images belonging to 6 grape varieties. Koklu et al. [7] provided the dataset they used and, more recently, Doğan et al. [30] and Gupta and Gill [34] explored the same dataset, which comprises 5 classes and 500 images acquired in a controlled environment. In addition, other datasets have been proposed in the literature. Al-khazraji et al. [48] proposed a dataset with 8 different grape varieties acquired in the field. Vlah [49] organised a dataset composed of 1009 images distributed over 11 varieties. Sozzi et al. [50] proposed a dataset for bunch detection that can also be used to identify 3 different varieties. Along the same lines, Seng et al. [51] presented a dataset with images of fruit at different stages of development, comprising 15 different varieties.
Table 3 summarises all the publicly available datasets that, as far as we know, can be used to train and evaluate DL models with the aim of classifying different grape varieties.
Figure 3 shows examples of images for each publicly available dataset.
Among the studies that reported the acquisition period, most used data obtained over a short period of time (less than a month). Carneiro et al. [32] and Carneiro et al. [33] used the datasets that were most representative in time. It should be noted that, since grapevines are seasonal plants, it is very important that the dataset covers different periods of the season in order to capture the different phenotypic characteristics of the leaves over time. Seasonal representation in the dataset directly affects the classifier's ability to generalise.
To the best of our knowledge, Fernandes et al. [44] is the only study that did not use images, relying instead on spectral data acquired in the field to identify grape varieties. In total, 35,933 spectra belonging to 64 different varieties were used. However, the aim of the study was only to separate two varieties from the others, by creating a binary classifier for each of them.
Furthermore, Magalhães et al. [31] was the only study concerned with the position of the leaves used in the classification. The authors argue that leaves from the nodes between the 7th and 8th positions should be used, as they are the most representative in terms of phenotypic characteristics [52].
Given that DL-based classification models are prone to overfitting, most studies used data augmentation techniques to improve the quality of the data used during training. Rotations, reflections and translations are the main modifications applied to the images, and they are generally applied only to the training subset. Carneiro et al. [32] was the only study that compared different data augmentation techniques, concluding that offline geometric augmentations (zoom, rotation and flips) led the models to the best classification results.
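For illustration, a minimal sketch of such offline geometric augmentation with torchvision follows; the specific operations, parameters and file handling are assumptions for illustration, not the pipeline of any selected study.

```python
# Hypothetical offline geometric augmentation (zoom, rotation, flips),
# applied only to the training subset, as reported in the reviewed studies.
import random
from PIL import Image
import torchvision.transforms as T

def augment_offline(img: Image.Image) -> Image.Image:
    """Apply one randomly chosen geometric transformation."""
    ops = [
        T.RandomRotation(degrees=30),                 # rotation
        T.RandomHorizontalFlip(p=1.0),                # horizontal flip
        T.RandomVerticalFlip(p=1.0),                  # vertical flip
        T.RandomResizedCrop(224, scale=(0.8, 1.0)),   # zoom via crop-and-resize
    ]
    return random.choice(ops)(img)

# Offline use: each training image yields extra samples saved to disk.
# for path in train_paths:
#     img = Image.open(path).convert("RGB")
#     for i in range(3):
#         augment_offline(img).save(f"{path}.aug{i}.png")
```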
Furthermore, it is evident that the datasets consisted primarily of a limited number of varieties, in contrast to the specialists' estimate of at least 5000 existing varieties.
3.2. Pre-Processing
Few studies presented pre-processing techniques to improve classification results. Fernandes et al. [44], after calculating the reflectance from the acquired spectra, applied the Savitzky-Golay (SG) filter, the logarithm, multiplicative scatter correction (MSC), the standard normal variate (SNV), and the first and second derivatives to the data, comparing the results for each approach.
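A sketch of how these spectral pre-processing variants might be computed with NumPy and SciPy follows; the array shape, window parameters and variable names are assumptions, not the authors' implementation.

```python
# Hedged sketch of the compared spectral pre-processing variants; the
# spectra matrix, window length and polynomial order are assumptions.
import numpy as np
from scipy.signal import savgol_filter

spectra = np.random.rand(100, 256)  # rows = spectra, cols = wavelengths (assumed)

sg = savgol_filter(spectra, window_length=11, polyorder=2, axis=1)  # SG smoothing
log_spectra = np.log10(spectra)                                     # logarithm

# Standard normal variate (SNV): centre and scale each spectrum individually.
snv = (spectra - spectra.mean(axis=1, keepdims=True)) \
      / spectra.std(axis=1, keepdims=True)

# Multiplicative scatter correction (MSC): regress each spectrum on the
# mean spectrum and remove the fitted slope and offset.
ref = spectra.mean(axis=0)
msc = np.empty_like(spectra)
for i, s in enumerate(spectra):
    slope, offset = np.polyfit(ref, s, deg=1)
    msc[i] = (s - offset) / slope

d1 = savgol_filter(spectra, 11, 2, deriv=1, axis=1)  # first derivative
d2 = savgol_filter(spectra, 11, 2, deriv=2, axis=1)  # second derivative
```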
In the image context, Liu et al. [39] used complemented images in training, so that each colour channel in the resulting image was the complement of the corresponding channel in the original image. Pereira et al. [46] tested several types of pre-processing: the fixed-point FastICA algorithm [53], the Canny edge detector [54], greyscale morphological processing [55], background removal with the segmentation method proposed by Pereira et al. [56], and the proposed four-corners-in-one method. The FastICA algorithm is an independent component analysis method based on kurtosis maximisation, originally applied to blind source separation. The idea behind applying independent component analysis (ICA) to images is that each image can be understood as a linear superposition of weighted features; ICA decomposes the images into a statistically independent source base with minimal loss of information content in order to achieve detection and classification [57,58]. Unlike ICA, greyscale morphological processing is a method for extracting vine leaf veins based on classical image processing. Firstly, the image is transformed into greyscale, based on its tonality and intensity information. Next, morphological greyscale processing is applied to remove the colour overlap between the leaf veins and the background. Linear intensity adjustment is then used to increase the difference in grey values between the leaf veins and the background. Finally, the Otsu threshold [59] is calculated to separate the veins from the background, and detailed processing is carried out to connect lines and remove isolated points [55].
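As a small, hedged illustration of two of these classical operations (the per-channel colour complement used by Liu et al. [39] and Otsu thresholding), an OpenCV sketch follows; the file name is a placeholder.

```python
# Per-channel colour complement and Otsu thresholding with OpenCV;
# the file name is a placeholder.
import cv2

img = cv2.imread("leaf.jpg")   # BGR image
complement = 255 - img         # each channel becomes its complement

grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Otsu's method picks the global threshold that minimises the
# intra-class variance between foreground and background.
_, mask = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```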
The next method used by Pereira et al. [46] was proposed in Pereira et al. [56] and is also based on classical image processing. It segments grape leaves from fruit and background in images acquired in the field. The approach is based on region growing using a colour model and thresholding techniques, and can be separated into three stages: pre-processing, segmentation and post-processing. In pre-processing, the image is resized, the histogram is adjusted to increase contrast, and then the resulting image and the raw image are converted to the hue, saturation and intensity (HSI) colour model, while the original image is also converted to the CIELAB (L*a*b*) colour model. In the segmentation phase, the luminance component of the raw image (L*) is used to detect the shadow regions; the shadow and non-shadow regions are then processed separately, removing the marks and the background with a different approach for each. Finally, in the post-processing step, the method fills in small holes using morphological operations. These holes are usually due to the presence of diseases, pests, insects, sunburn and dust on the leaves. The method achieved 94.80% average accuracy. Finally, the same authors also proposed a new pre-processing method called four-corners-in-one. The idea is, after segmenting the vine leaves in an image, to concentrate all the non-removed pixels in the north-west corner of the image: a sequence of left-shift operations is performed on the coloured pixels, followed by a sequence of up-shift operations. This procedure is replicated for the other three corners. According to the authors, this method obtained the best classification accuracy in the set of experiments carried out.
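A possible NumPy reading of this shifting procedure, for the north-west corner, is sketched below; this is an interpretation of the description above rather than the authors' code, and it assumes that background pixels are zero after segmentation.

```python
# Interpretation of the four-corners-in-one packing (north-west corner):
# non-background pixels are shifted left within each row, then up within
# each column. Background pixels are assumed to be zero after segmentation.
import numpy as np

def pack_north_west(img: np.ndarray) -> np.ndarray:
    """img: (H, W, 3) segmented image -> pixels packed towards the NW corner."""
    out = np.zeros_like(img)
    fg = img.any(axis=-1)                      # foreground mask (H, W)
    for r in range(img.shape[0]):              # left-shift each row
        cols = np.flatnonzero(fg[r])
        out[r, :len(cols)] = img[r, cols]
    packed = np.zeros_like(out)
    fg = out.any(axis=-1)
    for c in range(out.shape[1]):              # up-shift each column
        rows = np.flatnonzero(fg[:, c])
        packed[:len(rows), c] = out[rows, c]
    return packed

# Mirrored versions of the same procedure yield the NE, SW and SE variants.
```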
Carneiro et al. [35] and Carneiro et al. [33] evaluated the use of segmentation models to remove the background from images acquired in the field before classification. Both studies applied a U-Net [60], and Carneiro et al. [35] also tested SegNet [61] to segment the data before classification. The results show that performance can be reduced when secondary leaves are removed, and that the models trained with segmented leaves paid more attention to the central leaves.
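The sketch below illustrates this segmentation-before-classification step in PyTorch; `unet` and `classifier` stand for pre-trained models, and the 0.5 mask threshold is an assumption.

```python
# Illustrative background removal with a trained segmentation network
# before classification; `unet` and `classifier` are assumed pre-trained
# PyTorch models and the 0.5 threshold is an assumption.
import torch

@torch.no_grad()
def classify_segmented(image: torch.Tensor, unet, classifier) -> torch.Tensor:
    """image: (1, 3, H, W) tensor normalised as the models expect."""
    mask = (torch.sigmoid(unet(image)) > 0.5).float()  # (1, 1, H, W) leaf mask
    segmented = image * mask                           # zero out the background
    return classifier(segmented).softmax(dim=1)        # variety probabilities
```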
Doğan et al. [30] used the Enhanced Super-Resolution Generative Adversarial Network (ESRGAN) [62] as a data augmentation technique. The underlying idea is to apply a Generative Adversarial Network [63] to recover a high-resolution image from a low-resolution one. The authors decreased the resolution of the images in the dataset and then increased it again using ESRGAN, so that the resulting images could be treated as new samples.
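A hedged sketch of this augmentation scheme follows; `sr_model` stands in for a pre-trained super-resolution generator (e.g. an ESRGAN), and its call interface is a placeholder rather than a specific library API.

```python
# Sketch of the super-resolution-based augmentation described above: each
# image is degraded by downscaling and then super-resolved back to full
# size; the recovered image is treated as a new training sample.
import cv2

def super_resolution_augment(img, sr_model, scale: int = 4):
    """img: (H, W, 3) array -> a new sample recovered by super-resolution."""
    h, w = img.shape[:2]
    low_res = cv2.resize(img, (w // scale, h // scale),
                         interpolation=cv2.INTER_CUBIC)  # degrade
    return sr_model(low_res)  # assumed pre-trained generator, not a real API
```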
3.3. Architecture and Training
The main approach to the architecture was a combination of transfer learning and fine-tuning. However, a few different techniques were also employed: hand-crafted architectures, fused deep features, and feature extraction with DL-based models followed by Support Vector Machine (SVM) classifiers. It is important to note that Fernandes et al. [44] were the only authors who used a hand-crafted architecture, since their dataset was composed of spectral samples.
AlexNet [23], VGG-16 [64], ResNet [65], DenseNet [66], Xception [67] and MobileNetV2 [68] were the Convolutional Neural Network (CNN) architectures employed in the image-based studies included in this review. These networks were first trained on ImageNet, and then transfer learning and fine-tuning were applied in two stages. In the first stage, the original classifier is replaced by a new one and the convolutional weights are frozen, so that only the new classifier is trained and has its weights updated (transfer learning). In the second stage, all the weights are unfrozen and the entire architecture is retrained (fine-tuning). Detailed information on each architecture can be found in Alzubaidi et al. [69]. Differently, Carneiro et al. [35] used a Vision Transformer (ViT) [70], but followed the same learning strategy.
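A minimal PyTorch sketch of this two-stage strategy is shown below; the choice of ResNet-50, the learning rates and the number of classes are illustrative assumptions.

```python
# Minimal two-stage transfer learning / fine-tuning sketch in PyTorch.
# ResNet-50, the learning rates and the class count are assumptions.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 12  # placeholder number of grape varieties
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new classifier head

# Stage 1 (transfer learning): freeze the convolutional backbone so that
# only the new classifier head is trained.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# ... train the head for a few epochs ...

# Stage 2 (fine-tuning): unfreeze everything and retrain at a lower rate.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
# ... retrain the whole network ...
```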
Unlike the other image-based studies, Peng et al. [42] and Doğan et al. [30] used fused deep features to identify grapevine varieties. This approach consists of extracting features from images using more than one network, concatenating all the extracted features and then classifying them. Peng et al. [42] extracted features from AlexNet, ResNet and GoogLeNet, fused them using the Canonical Correlation Analysis algorithm [71] and then classified the vine varieties using an SVM classifier. In addition, the authors trained the aforementioned architectures with fully connected classifiers, which resulted in worse performance than the proposed method; they argued that the small size of the dataset is the main reason why it is difficult to obtain better results using a CNN directly. Doğan et al. [30] merged features from VGG-19 and MobileNetV2, with the difference that they used a genetic-algorithm-based Support Vector Machine to select the best features for classification, improving the results by 3 percentage points; the final classification of the selected features was done with an SVM. Koklu et al. [7] also combined features extracted from a pre-trained CNN with an SVM classifier. The idea was to extract features from the logits of the first fully connected layer of MobileNetV2 and use them to test the performance of four different SVM kernels: linear, quadratic, cubic and Gaussian. In addition, the authors carried out experiments using the Chi-Square test to select the 250 most representative features in the logits.
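The sketch below illustrates the general fused-deep-features pipeline with two torchvision backbones and a scikit-learn SVM; plain concatenation is used here, whereas Peng et al. [42] fused features with Canonical Correlation Analysis and Doğan et al. [30] added a genetic feature selector.

```python
# Hedged fused-deep-features sketch: embeddings from two ImageNet-pretrained
# CNNs are concatenated and classified with an SVM. Plain concatenation is
# used here; the fusion and selection steps of [42] and [30] are not shown.
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

resnet = models.resnet18(weights="IMAGENET1K_V1").eval()
resnet.fc = nn.Identity()                     # expose the 512-d embedding
mobilenet = models.mobilenet_v2(weights="IMAGENET1K_V1").eval()
mobilenet.classifier = nn.Identity()          # expose the 1280-d embedding

@torch.no_grad()
def fused_features(batch: torch.Tensor) -> np.ndarray:
    """batch: (N, 3, 224, 224) normalised images -> (N, 512 + 1280) array."""
    return torch.cat([resnet(batch), mobilenet(batch)], dim=1).numpy()

# X = fused_features(train_batch); y = variety labels for the batch.
# svm = SVC(kernel="rbf").fit(X, y)
```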
Various optimisers, global pooling techniques and losses were used for training. Among the optimisers available in the literature for training machine learning models, Stochastic Gradient Descent (SGD) and Adam [72] were used in the selected studies. In some cases, SGD was combined with an adaptive learning rate schedule or a momentum term to improve the training process.
All the image-based studies that used a global pooling method opted for Global Average Pooling, with the aim of reducing the CNN activation maps before classification. The losses used were the Cross-Entropy (CE) loss and the Focal Loss (FL) [73]. The Focal Loss is a modification of the CE loss that down-weights easy examples and thus concentrates training on difficult cases. It was first used in object detection studies, due to the huge imbalance between "object" and "non-object" candidate regions; however, Mukhoti et al. [74] concluded that it can also be used to deal with calibration errors of multi-class classification models, i.e. the tendency of the probabilities these models associate with the predicted labels to overestimate the probability of those labels being correct in the real world. Carneiro et al. [38] used the Focal Loss to mitigate the class imbalance in the dataset used.
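For reference, the Focal Loss for the true-class probability p_t is FL(p_t) = -(1 - p_t)^γ log(p_t), which reduces to the CE loss when γ = 0. A minimal multi-class sketch in PyTorch follows; γ = 2 is the value commonly used in [73], and whether the selected studies adopted it is an assumption.

```python
# Minimal multi-class Focal Loss sketch in PyTorch; gamma = 2 follows [73],
# and whether the selected studies used this value is an assumption.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_pt = F.log_softmax(logits, dim=1) \
              .gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()

# With gamma = 0 this reduces exactly to the standard cross-entropy loss.
```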
3.4. Evaluation
To quantitatively evaluate the trained models, accuracy is the most used metric, followed by the F1 score. Some studies also use precision, recall, the Area Under the Curve (AUC), specificity, or the Matthews correlation coefficient (MCC).
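These metrics can be computed, for example, with scikit-learn; the label arrays below are placeholders.

```python
# Computing the commonly reported metrics with scikit-learn;
# the label arrays are placeholders.
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

y_true = [0, 1, 2, 2, 1]  # ground-truth variety labels
y_pred = [0, 1, 2, 1, 1]  # model predictions

print("accuracy  :", accuracy_score(y_true, y_pred))
print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))
print("precision :", precision_score(y_true, y_pred, average="macro"))
print("recall    :", recall_score(y_true, y_pred, average="macro"))
print("MCC       :", matthews_corrcoef(y_true, y_pred))
```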
On the other hand, as in other areas of research, some studies use Explainable Artificial Intelligence (XAI) to qualitatively evaluate their models. XAI is a set of processes and methods aimed at enabling humans to understand, adequately trust and effectively manage the emerging generation of artificially intelligent models [75]. The techniques employed by the selected studies are model-agnostic, post-hoc explainability methods, which means that no modification to the architecture was necessary in order to apply them.
Nasiri et al. [41] and Pereira et al. [46] extracted the filters learnt by their models. In addition, Nasiri et al. [41] also produced saliency maps. Carneiro et al. [37] and Liu et al. [39] used Grad-CAM [76] to obtain heatmaps showing each pixel's contribution to a specific class. Carneiro et al. [32] also used Grad-CAM to evaluate models, but instead of analysing the generated heatmaps directly, they used them to measure the classification similarity between pairs of trained models: the heatmaps were computed over the test subset for each model, and the cosine similarity between the heatmaps of each pair of models was then calculated. The authors concluded that, among the data augmentation approaches used, static geometric transformations generate representations more similar to RandAugment than to CutMix.
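The heatmap-comparison step can be sketched as follows; the heatmap arrays are assumed to be precomputed Grad-CAM outputs of two models over the same test images, and the flattening and averaging details are assumptions.

```python
# Sketch of the heatmap-similarity comparison described above; maps_a and
# maps_b are assumed precomputed Grad-CAM heatmaps of two models over the
# same test images.
import numpy as np

def mean_heatmap_similarity(maps_a: np.ndarray, maps_b: np.ndarray) -> float:
    """maps_*: (N, H, W) heatmaps -> mean cosine similarity over N images."""
    a = maps_a.reshape(len(maps_a), -1)
    b = maps_b.reshape(len(maps_b), -1)
    cos = (a * b).sum(axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(cos.mean())
```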
Carneiro et al. [38] used Local Interpretable Model-Agnostic Explanations (LIME) [77] for the same purpose. Furthermore, Carneiro et al. [35] extracted attention maps from the ViT and used them to check the impact of sample rotation.
To generate saliency maps, Nasiri et al. [41] calculated the derivative of a class score function, approximated by a first-order Taylor expansion, and then rearranged the elements of the calculated derivative. Grad-CAM, proposed by Selvaraju et al. [76], aims to explain how a model concludes that an image belongs to a certain class. The idea is to use the gradient of the predicted class score with respect to the activation maps of a selected convolutional layer; the choice of layer is arbitrary. As a result, heatmaps are obtained containing the regions that contribute positively to the image classification. According to the authors, obtaining explanations of the predictions with Grad-CAM makes it possible to increase human confidence in the model and, at the same time, to understand classification errors. Like Grad-CAM, LIME [77] is an explainability approach used to explain individual machine learning model predictions for a specific class. Unlike Grad-CAM, it is not restricted to CNNs and is applicable to any machine learning classifier. The idea behind LIME is to train an explainable surrogate model on a new dataset composed of perturbed samples (e.g. with parts of the image hidden) derived from the target data, so that it becomes a good local approximation of the original model (in the neighbourhood of the target data). From the surrogate interpretable model, it is then possible to obtain the regions that contributed to the classification, both positively and negatively.
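As an illustration of this workflow with the `lime` package, a hedged example follows; the test image and the classifier wrapper below are placeholders, not code from any of the selected studies.

```python
# Hedged LIME example with the `lime` package; the test image and the
# classifier wrapper are placeholders.
import numpy as np
from lime import lime_image

image = np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)  # placeholder

def predict_fn(batch: np.ndarray) -> np.ndarray:
    """Placeholder wrapper: real code would run the trained classifier."""
    return np.tile([0.2, 0.3, 0.5], (len(batch), 1))

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image, predict_fn,
    top_labels=3,        # explain the 3 most probable varieties
    hide_color=0,        # perturb by blacking out superpixels
    num_samples=1000)    # size of the perturbed neighbourhood dataset

# Superpixels that contributed positively to the top predicted class:
img_out, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5)
```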
Author Contributions
Conceptualization, G.A.C., A.C., and J.S.; methodology, G.A.C.; validation, J.S. and A.C.; formal analysis, G.A.C., A.C., and J.S.; investigation, G.A.C.; resources, G.A.C., A.C., and J.S.; data curation, G.A.C.; writing—original draft preparation, G.A.C. and J.S.; writing—review and editing, G.A.C., A.C., and J.S.; visualization, G.A.C.; supervision, J.S. and A.C.; project administration, J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.