1. Introduction
Plants are one of the most essential life forms on Earth and play a significant role in maintaining healthy ecosystems. To obtain information about the uses of any plant, users must first identify it by matching its physical characteristics to a specific name (either the scientific Latin name or a common name). Knowing one or more discriminating features of an unknown plant (e.g., shape, color, petal or sepal length) helps narrow down the candidate species. Identification from key features is difficult for people without specific botanical knowledge, and even specialists such as botanists, agroforestry managers, and scientists can struggle to identify plants correctly at different hierarchical levels [1]. Variation in key characteristics among species, and even within a species, is one of the challenges of identifying plants manually. Hence, automated species identification can be used instead [2]. Automated plant classification is an important research area in computer vision. It is a fine-grained classification task concerned with identifying plants at various hierarchical levels, such as the family, genus, or species level [3]. A user can take a picture of a plant with a camera or mobile device and then analyze it with a plant identification model to obtain the plant's identity, or a list of candidate plants, at various hierarchical levels. The identification problem faces several challenges due to inter-class similarities among plant families, as well as large intra-class variation in color, background, occlusion, shape, and illumination within the same class (family, genus, or species). Several studies have addressed the plant classification problem using deep learning-based algorithms and have achieved significant success [2,4,5]. Compared to traditional machine learning algorithms, where features are manually selected and extracted, deep learning-based algorithms automatically detect increasingly higher-level features from the data [6]. Several works have shed light on plant identification using deep neural networks, which have significantly improved the accuracy of large-scale plant classification. Various Convolutional Neural Network (CNN) models have been proposed for plant identification and have achieved better performance than the artificial neural network (ANN) and CNN-based models suggested in prior research [7,8].
Transformer-based architectures have become the de facto standard in natural language processing (NLP), and their application to computer vision, while still limited, has been rising recently and attains performance on par with or better than state-of-the-art convolutional networks; Vision Transformer models have outperformed CNN-based models on fine-grained image classification tasks [9]. Both CNN and Vision Transformer models contain millions of parameters, so training them requires large amounts of data to properly constrain the optimization. The extensive computational resources needed to train these models from scratch motivate the use of transfer learning with pre-trained networks [10,11,12].
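As a brief illustration of this transfer-learning recipe, the following sketch loads an ImageNet pre-trained ResNet50 from torchvision, freezes the feature extractor, and replaces the classification head; the backbone choice, class count, and learning rate are illustrative placeholders rather than the exact configuration used in this study.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet50 pre-trained on ImageNet (illustrative backbone choice).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the pre-trained feature extractor so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; 113 matches the genus-level classes here.
model.fc = nn.Linear(model.fc.in_features, 113)

# Optimize only the parameters that still require gradients (the new head).
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```

Freezing the backbone in this way lets a large pre-trained network be adapted with comparatively little data and compute; in practice some or all backbone layers may later be unfrozen for fine-tuning at a lower learning rate.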
13]. ViT is a model used in the field of computer vision that employs a transformer-like architecture over patches of the image. It works like the transformers used in the field of natural language processing (NLP). Over the years, deep CNNs have been the state-of-the-art networks for image classification but ViT has shown great potential in achieving competitive performance for complex image classification tasks [
13]. Internally, transformers learn by calculating the relationship between pairs of input tokens (words in the case of a string), termed attention in NLP tasks. In computer vision, an Image is split into various fixed-size patches. These image patches are used the same way as tokens and ViT calculates the relationship among pixels between various patches. Each of the image patches is then linearly embedded and patch embeddings are finally augmented with one-dimensional position embeddings. Positional information is introduced into the input using position embeddings, which is learned during training. An extra learnable “classification token” is added at the start of the sequence to the patch embedding. The resulting sequence of the embedding vector is fed to the encoder part of the Transformer architecture. A classification head attached to the encoder output gets the value of learnable class embedding to perform the classification based on its state.
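The input pipeline described above can be sketched in a few lines of PyTorch; this minimal example covers only the patch embedding, classification token, and position embeddings (not the full encoder), and all sizes are standard ViT-Base values used purely for illustration.

```python
import torch
import torch.nn as nn

class ViTEmbedding(nn.Module):
    """Minimal ViT front end: patchify, linearly embed, add CLS token and positions."""
    def __init__(self, image_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution performs "split into patches and linearly
        # embed each patch" in a single step.
        self.patch_embed = nn.Conv2d(channels, dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned 1-D position embeddings: one per patch, plus one for CLS.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, images):                  # images: (B, 3, 224, 224)
        x = self.patch_embed(images)            # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend classification token
        return x + self.pos_embed               # add positional information

# The resulting sequence feeds a standard transformer encoder; the encoder
# output at the CLS position feeds the classification head.
tokens = ViTEmbedding()(torch.randn(2, 3, 224, 224))   # shape (2, 197, 768)
```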
This paper makes three main contributions. First, a ViT with a custom balanced loss function is used to handle class imbalance and improve model performance. Second, the proposed combination of augmentation techniques enhances data quality and improves model performance. Third, a CNN-based classification model is implemented to analyze the distribution, within each class, of images captured from a near or far distance. This near/far image distribution helped visualize the data imbalance issue and enhance the quality of the training dataset; it also allowed the performance of the proposed models to be analyzed on both types of images, and it guided the balancing of the data when the distribution of near and far images differed significantly, by adding more images and using augmentation to include more diversity within classes. Together, these components significantly improve plant classification performance at the genus and species levels, and the approach can be extended to classification at the cultivar level.
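As an illustrative example of the general idea behind a balanced loss (not the exact formulation proposed here), a softmax cross-entropy can be balanced by weighting each class by its inverse training frequency, so that rare classes contribute as much to the gradient as common ones:

```python
import torch
import torch.nn.functional as F

def balanced_softmax_ce(logits, targets, class_counts):
    """Cross-entropy weighted by inverse class frequency: one common way to
    balance a softmax loss, shown as an illustrative stand-in."""
    weights = 1.0 / class_counts.float()
    weights = weights / weights.sum() * len(class_counts)  # mean weight ~ 1
    return F.cross_entropy(logits, targets, weight=weights)

# Example: 3 classes with a heavy imbalance in training-image counts.
counts = torch.tensor([5000, 500, 50])
loss = balanced_softmax_ce(torch.randn(8, 3), torch.randint(0, 3, (8,)), counts)
```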
Several augmentation techniques were used in combination in this study to enhance model performance. The research by Hiary et al. [23] shows the importance of image augmentation for improving model performance: the authors used a fine-tuned VGG-16 model to classify flower species on the Oxford-17 and Oxford-102 benchmarks, as well as on a dataset consisting of 612 flower images from 102 categories [24]. In general, a large amount of diverse training data is required, because a small dataset can easily cause the model to overfit; data augmentation addresses this problem by improving both the size and the quality of the training data. Zhong et al. [25] introduced Random Erasing, an augmentation technique that randomly selects a rectangular region in an image and erases it with random pixel values. This improves the robustness of trained models and gives them better generalization capabilities.
CNN-based (ResNet50 and ResNet-RS-420) and transformer-based (ViT) models were used in this study. Images were acquired by the Royal Horticultural Society (RHS), UK, using digital cameras, mobile phones, and other equipment. The images were then pre-processed, and augmentation techniques [14] were applied to enhance the size and quality of the training data. Next, the area of interest was segmented and features were extracted; based on the extracted features, plants were classified at the genus or species level.
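To make such a pre-processing and augmentation stage concrete, the sketch below composes several standard torchvision transforms, including the Random Erasing operation discussed earlier; the specific transforms and parameter values are representative assumptions, not the exact recipe used in this study.

```python
from torchvision import transforms

# Representative training-time pipeline: crop/resize, geometric and color
# augmentation, tensor conversion, normalization, and Random Erasing.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.5),  # operates on tensors, so it comes last
])
```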
Automated flower classification is a difficult task owing to the considerable similarity among flower species and the large intra-class variation; differences in background, viewpoint, occlusion, image scale, indoor and outdoor lighting conditions, climate, and season make the task even more challenging [15]. In this research, 113 plant genera and 53 species were considered. Many of these plants are very difficult to differentiate from a distance, particularly for the human eye. Automated image classification using a deep learning-based approach can exceed the capability of the human eye and produce accurate results [16,17,18]. Many techniques have been proposed for plant classification. Şekeroğlu et al. [4] proposed a neural-network-based leaf classification system that identifies 27 types of leaves with a recognition rate of 97.2%. Deep learning-based convolutional neural networks (CNNs) are popular and have achieved significant success in image classification tasks in recent years [19,20,21]. CNN models are widely used for plant classification and have achieved significantly better performance than other machine learning and ANN-based networks [7,22]. Deep learning (DL) methods are widely used for plant recognition with large image datasets. Heredia et al. [8] used the PlantNet database, consisting of 250K images belonging to more than 1,500 plant species; using a ResNet50 model, the authors achieved a significant improvement over widespread classification models on test data composed of thousands of different species [8]. CNN-based methods have also been used in health care, for example in medical image classification and tumor detection [16,17]. The recently proposed transformer-based approach appears to be a major step forward for plant identification. Using the self-attention paradigm, ViT models can outperform CNN-based models such as AlexNet, EfficientNet, and ResNet on image classification tasks without applying any convolution. Conde et al. [9] used a multi-stage ViT framework on four popular fine-grained benchmarks (CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology) and achieved better performance than CNN-based models. Given sufficient training data and computational resources, ViT has shown better performance than CNN models on image classification tasks [13]. Based on this literature review, we used CNN- and ViT-based models for the proposed research, because these models have shown the best performance for classifying plants and flowers in the past.
5. Conclusions
In this work, we conclude that ViT models perform better than CNN-based models and extract more feature information when classifying plants. We also highlight that augmentation plays an important role in enhancing data quality and making the network more robust and generalizable. A custom CNN-based model was implemented to classify images as near- or far-captured, which revealed that far-captured images are misclassified more often than near-captured ones. Experiments with the custom SoftMax balanced loss function suggest that the proposed loss performs better than the widely adopted cross-entropy loss for plant classification tasks. In future work, we would like to include more far-captured images, so that near- and far-captured images are equally distributed, and to improve the near/far classification model with more manually annotated images. Model performance can be further improved by addressing the misclassification of far-captured plant images and making the model more generalizable. The proposed models and techniques for handling data imbalance can be extended to classification at the cultivar level, fruit grading, plant and crop disease classification, quality assessment, flower classification, and more. The proposed research can also be scaled with location attributes, such as device location and country; by considering more features and narrowing down the candidate plants based on location, a more robust model with improved performance can be built. Based on user input for different plant categories, such as gardening, indoor, and outdoor plants, category-specific models can be implemented to improve performance. Tree-based approaches, such as top-down or bottom-up, can be implemented for different hierarchical levels, such as family, genus, and species, depending on the use-case requirements. Like the near/far distribution classification model used in this study, analysis of leaf, stem, or flower distribution could also complement the proposed research and improve model performance.
Author Contributions
Virender Singh: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Data curation, Visualization, Writing – Original Draft. Mathew Rees: Validation, Writing – Review & Editing. Simon Hampton: Conceptualization, Funding acquisition, Project administration. Sivaram Annadurai: Conceptualization, Supervision, Project administration, Data curation, Formal analysis, Writing – Review & Editing.